Workshop in Computational Bioskills

Workshop in Computational Bioskills - Spring 2011

Lesson 4 - Perl III

Part 1 - Safe code
Part 2 - Regular Expressions (Testing file properties)
Part 3 - References
Part 4 - Modules

Safe Code

my() : Private Variables:
my() takes a list of variable names and creates private (lexically scoped) versions of them, visible only inside the enclosing block.

o an improvement to the 2nd example from Lesson 3 :
sub add {
my ($sum); # make $sum a local variable
$sum = 0; # initialize the sum
foreach $_ (@_) {
$sum += $_; # add each element
}
return $sum;
}

o 1st example :
sub list_abs {
my (@l) = @_;
foreach $_ (@l) {
$_ = abs($_);
}
return @l;
}

local() : Semi-Private Variables:
local() is similar to my(), but it does not create a new variable: it saves the current value of a global variable and gives it a temporary value until the enclosing block ends. Here, though, it is not so local... The temporary value is also visible to functions called from within the block in which the variable is localized.

o 2nd example

#!/usr/bin/perl -w
$a = 5;
{
    local $a = 3;
    f();
}
f();


sub f {
    if (defined $a) {
	print "$a\n";
    } else {
	print "\$a not defined\n";
    }
}

!?! What will be the output of this program?

strict: Forcing Variables Declaration:
It is convenient to use the strict pragma:
use strict;
It forces you to declare every variable, typically with my(), before it can be used.

- This is highly recommended!!
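
A minimal sketch of what strict buys you, assuming a misspelled variable name (the names $total and $totl are made up for this illustration). Under strict the typo is a compile-time error; here the offending statement is compiled inside a string eval only so that the script can print the error message instead of dying.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $total = 0;                      # declared with my(), so strict is satisfied
$total += $_ foreach (1, 2, 3);
print "total = $total\n";

# $totl (a typo for $total) was never declared. Under strict this fails to
# compile; the string eval lets us catch the failure and show the message.
my $ok = eval q{ use strict; $totl = 10; 1 };
print "strict caught: $@" unless $ok;
```

Without strict, the misspelled $totl would silently become a new global variable and the bug would go unnoticed.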

Regular Expressions:

Patterns are very useful. We would often like to know if they can be found in files,
how many times, where, and sometimes even replace them with another.

In UNIX, we might use grep, sed & tr to find and manipulate patterns.
Perl offers closely corresponding operators: m//, s/// and tr///.

Pattern Matching - m//:
(See m// in perlop manpage)

m/pattern/options searches a string for a pattern match,
and in scalar context returns true (1) or false ('').

By default the search is done upon $_
Other variables can be searched using =~

if (m/hello/) { ... }; # search hello in $_
if ($_ =~ m/hello/) { ... }; # same search in $_, written explicitly
if (/hello/) { ... }; # same search
print if (m/hello/); # print $_ if matches m/hello/
if ($string =~ m/hello/) { ... }; # search hello in $string

Options are:
g Match globally, i.e., find all occurrences.
i Do case-insensitive pattern matching.
c Do not reset search position on a failed match when /g is in effect.
m Treat string as multiple lines.
o Compile pattern only once.
s Treat string as single line.
x Use extended regular expressions.
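
As a short illustration of the i and g options, here is a sketch that assumes a made-up DNA string and searches it for the site GATC. It shows the three common ways of calling m//: a plain true/false test, list context with g (all matches at once), and scalar context with g (one match per iteration, with pos() giving the offset just past the match).

```perl
use strict;
use warnings;

my $seq = "atgGATCcgatcGATC";       # made-up sequence for this sketch

# /i : case-insensitive match; in scalar context we only learn true/false
my $has_site = ($seq =~ m/gatc/i) ? 1 : 0;

# /g in list context returns all (non-overlapping) matches
my @sites = ($seq =~ m/gatc/gi);

# /g in scalar context walks through the string, one match per iteration
my @ends;
while ($seq =~ m/gatc/gi) {
    push @ends, pos($seq);          # offset just past the current match
}

print "found: $has_site, count: ", scalar(@sites), ", ends: @ends\n";
```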

Regular Expression Syntax:
(See man perlre)

In particular the following meta-characters have their standard egrep-ish
meanings:
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class

The following standard quantifiers are recognized:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times

Must patterns be so greedy?
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match.

If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meaning doesn't change, just the "greediness":
*? Match 0 or more times
+? Match 1 or more times
?? Match 0 or 1 time
{n}? Match exactly n times
{n,}? Match at least n times
{n,m}? Match at least n but not more than m times
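
The difference is easiest to see on a string with more than one possible end point. A small sketch, using a made-up line of HTML-like text:

```perl
use strict;
use warnings;

my $line = "<b>bold</b> and <i>italic</i>";

my ($greedy) = $line =~ m/<(.+)>/;    # .+ grabs as much as it can
my ($lazy)   = $line =~ m/<(.+?)>/;   # .+? stops at the first possible ">"

print "greedy: $greedy\n";   # b>bold</b> and <i>italic</i
print "lazy:   $lazy\n";     # b
```

Both patterns start matching at the first "<"; only the amount consumed by the quantifier differs.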

Special Characters:
\w Match a "word" character (alpha-numeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character

Perl also defines the following zero-width assertions:
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string

Storing Patterns in Memory:
When the bracketing construct ( ... ) is used, \<digit> refers to the substring matched by the digit'th pair of parentheses.

Outside of the pattern, use "$" instead of "\" in front of the digit.

- Using \1,\2,\3 inside the pattern
if (m/Time: (..):\1:\1/) { # Will match "Time: 12:12:12"
    $hours = $minutes = $seconds = $1;
}

- Using $1,$2,$3 outside the pattern
if (m/Time: (..):(..):(..)/) { # Will match any hour.
    $hours = $1;
    $minutes = $2;
    $seconds = $3;
}

Pattern Replacing - s///:
(See s// in perlop manpage)

s/pattern/replacement/options searches a string for a pattern, and if found, replaces that pattern with the replacement text and returns the number of substitutions made. Otherwise it returns false (specifically, the empty string).

Options are:
g Replace globally, i.e., all occurrences.
i Do case-insensitive pattern matching.
e Evaluate the right side as an expression.
m Treat string as multiple lines.
o Compile pattern only once.
s Treat string as single line.
x Use extended regular expressions.
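
A brief sketch of the g and e options and of the return value of s///, using a made-up string. The (my $copy = $text) =~ s/.../.../ idiom copies the string first, so the original stays untouched.

```perl
use strict;
use warnings;

my $text = "exon1 exon2 exon3";     # made-up input for this sketch

# Without /g only the first occurrence is replaced; with /g, all of them.
(my $first = $text) =~ s/exon/intron/;
(my $all   = $text) =~ s/exon/intron/g;

# s/// returns the number of substitutions made.
my $copy  = $text;
my $count = ($copy =~ s/exon/E/g);

# /e evaluates the replacement as Perl code: here, add 1 to every number.
(my $shifted = $text) =~ s/(\d+)/$1 + 1/ge;

print "$first\n$all\n$count\n$shifted\n";
```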

!?!

How can we replace "green" with "white", without changing "greens" ?
And if "green" appears more than once in the string ?
What will this do: $in = "Who are you?"; $out = ($in =~ s/w.*?\b/what/i);
Can we use a variable in the statement?
How can we replace the pattern "aabb" with "ab" ?

!?!

Pattern Transliterating - tr///:
(See tr// in perlop manpage)

tr/searchlist/replacementlist/options transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.

tr returns the number of translations/deletions done.

A character range may be specified with a hyphen, so tr/A-J/0-9/ does the same replacement as tr/ACEGIBDFHJ/0246813579/.

Options:
c Complement the SEARCHLIST.
d Delete found but unreplaced characters.
s Squash duplicate replaced characters.
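
A typical use of tr in sequence analysis is computing the reverse complement of a DNA string. The sketch below assumes a made-up sequence and also shows the return value of tr and the s option.

```perl
use strict;
use warnings;

my $dna = "ATGCCGTTAG";             # made-up sequence for this sketch

# Reverse the string, then complement each base: the reverse complement.
(my $revcomp = scalar reverse $dna) =~ tr/ACGT/TGCA/;

# tr returns the number of characters it translated.
my $lower     = $dna;
my $n_changed = ($lower =~ tr/ACGT/acgt/);

# /s squashes runs of the same replaced character into one.
# With an empty replacement list, the search list is used as the replacement.
(my $squashed = "AAACCCGGG") =~ tr/A-Z//s;

print "$revcomp\n$lower ($n_changed changed)\n$squashed\n";
```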

!?!

Can we use tr to count the number of upper-case characters without translating?
How about counting the number of NON-numeric characters ?
How can we delete a character ?

!?!

Example - Finding the binding sites of a TF:

In Ex1 you searched for TF binding sites in promoters, based on the consensus sequence.
For example, the consensus binding sequence of the Yeast TF Reb1 is CGGGTAA.
This is nice and easy; however, the binding preferences of DNA binding proteins are often less specific.
For example, look at the following sequence-logo, representing the binding site of the TF Gcn4:

How can we represent this binding site using a regular expression ?
Let's search the promoter file YeastPromoters.tfa for Gcn4 binding sites: scanTF.pl

And how about the binding site of the Yeast TF Fhl1:

Not so simple...
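
To show the general technique, here is a rough sketch that scans sequences with a degenerate motif written as a character class. Everything in it is an assumption made for illustration: the motif TGA[CG]TCA is hypothetical and was not read from the logos above, the sequences are made up, and this is not the content of scanTF.pl.

```perl
use strict;
use warnings;

# Hypothetical degenerate motif: position 4 may be C or G.
my $motif = qr/TGA[CG]TCA/;

# Made-up promoter sequences, standing in for a real FASTA file.
my %promoters = (
    geneA => "AATTGACTCAGGCATGAGTCATT",
    geneB => "CCCCGGGGTTTTAAAA",
);

my %hits;   # gene name => list of 0-based match start positions
foreach my $gene (sort keys %promoters) {
    my $seq = $promoters{$gene};
    while ($seq =~ /$motif/g) {
        push @{ $hits{$gene} }, $-[0];   # $-[0] is the start of the match
    }
    my $n = $hits{$gene} ? scalar @{ $hits{$gene} } : 0;
    print "$gene: $n site(s)\n";
}
```

A position that tolerates any base can be written as ".", and a position with two or three allowed bases as a character class such as [AG]. A regular expression cannot express that one base is merely preferred over another, which is part of why some binding sites are "not so simple".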

Part III: References:
perldoc perlref

A reference in Perl is a typed pointer. It stores the type & address of some variable.

You can also think of Perl references as resembling C++ "smart pointers": Perl keeps a count of the references pointing to each value, and releases the value when it is no longer referenced.

Creating references:
To create a reference, use the backslash (\) operator.

o Referencing variables
$scalarref = \$scalar;
$arrayref = \@array;
$hashref = \%hash;
$code_ref = \&function;

These are references to explicit variables.

o Referencing anonymous objects
We've seen that \ creates a reference to an existing variable.

But how can we reference anonymous objects which don't have names ?
We need to create them from scratch.

To create an anonymous array reference, use [ ].
$arrayref = [1, 2, ['a', 'b', 'c']];

This is a reference to an array with 3 elements: 2 scalars and one reference to another anonymous array.

To create an anonymous hash reference, use { }.
$hashref = {
'Adam' => 'Eve',
'Clyde' => 'Bonnie',
};

You can also create a reference to a piece of code (an anonymous subroutine).
$coderef = sub { print "Hello, World!\n" };

Using references:
After referencing so many different objects, we might want to use them ...

Dereferencing
The principle is very simple: add the relevant symbol for each type of reference:

$bar = $$scalarref;
@array = @$arrayref;
%hash = %$hashref;

Remember that the context is what matters
$bar = $$scalarref;
push(@$arrayref, $filename);
$$arrayref[0] = "January";
$$hashref{"KEY"} = "VALUE";
&$code_ref(1,2,3);

Using the Arrow operator:
$arrayref->[0] = "January";
$hashref->{KEY} = "VALUE";
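
The following sketch puts these forms together in a runnable script (the variable names and data are made up). Note that changes made through a reference affect the original variable.

```perl
use strict;
use warnings;

my @months   = ("Jan", "Feb");
my %capital  = (France => "Paris");
my $arrayref = \@months;
my $hashref  = \%capital;
my $coderef  = sub { my ($x, $y) = @_; return $x + $y };

push @$arrayref, "Mar";            # modifies @months through the reference
$arrayref->[0] = "January";        # same element as $$arrayref[0] and $months[0]
$hashref->{Italy} = "Rome";        # adds a key to %capital
my $sum = $coderef->(2, 3);        # same as &$coderef(2, 3)

print "@months\n";                          # January Feb Mar
print join(",", sort keys %capital), "\n";  # France,Italy
print "$sum\n";                             # 5
```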

!?! What is the difference between an array and an array reference:

As the return value of a subroutine?

As an argument to a subroutine?

See an example: refExample.pl

Let's try writing a function factory : function_generator.pl
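
One possible shape for a function factory is sketched below. It is not the content of function_generator.pl, and make_adder is a hypothetical name. The factory returns a code reference that keeps access to the private variable $n of the call that created it (a closure).

```perl
use strict;
use warnings;

# make_adder() is a hypothetical factory: it builds and returns a new function.
sub make_adder {
    my ($n) = @_;
    return sub { return $_[0] + $n };   # closure: remembers this call's $n
}

my $add2  = make_adder(2);
my $add10 = make_adder(10);

print $add2->(5), "\n";    # 7
print $add10->(5), "\n";   # 15
```

Each call to make_adder creates a fresh $n, so the two generated functions do not interfere with each other.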

List of Lists:
perldoc perldsc (Data Structures)
perldoc perllol (Lists of Lists)

An LoL is merely a list of list references.
@LoL = (
   [ "fred", "barney" ],
   [ "george", "jane", "elroy" ],
   [ "homer", "marge", "bart" ],
);
print $LoL[2][2]; # bart

What happens if we create a list of lists, with no references? (try this at home...)
@list1 = ( "fred", "barney" );
@list2 = ( "george", "jane", "elroy" );
@list3 = ( "homer", "marge", "bart" );
$LoL[0] = @list1;
$LoL[1] = @list2;
$LoL[2] = @list3;

Ref. to List of Lists:
# a reference to a list of list references
$ref_to_LoL = [
   [ "fred", "barney", "pebbles", "bambam", "dino", ],
   [ "homer", "bart", "marge", "maggie", ],
   [ "george", "jane", "elroy", "judy", ],
];
print $ref_to_LoL->[2][2]; # "elroy"

!?! Who is Elroy?

A Hash of Lists:
%HoL = (
    "flintstones" => [ "fred", "barney" ],
    "jetsons" => [ "george", "jane", "elroy" ],
    "simpsons" => [ "homer", "marge", "bart" ],
);
print $HoL{jetsons}[2]; # "elroy"

Once you get the idea, the rest is quite simple.
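
For instance, here is a small sketch of adding to and looping over a hash of lists (same kind of data as above; the added member is made up for the example):

```perl
use strict;
use warnings;

my %HoL = (
    flintstones => [ "fred", "barney" ],
    jetsons     => [ "george", "jane", "elroy" ],
);

push @{ $HoL{flintstones} }, "wilma";   # append to one of the inner lists

my %family_size;
foreach my $family (sort keys %HoL) {
    my @members = @{ $HoL{$family} };   # dereference to get a copy of the list
    $family_size{$family} = scalar @members;
    print "$family: @members\n";
}
```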

Example: Finding differentially expressed genes: parseGE.pl
Here we parse a tab-separated file which maps between probes on an array, genes, and the measured log-ratio.
Each gene is represented by more than one probe on the array.
We search for genes with an average log-ratio > 2.
Example of an input file: logRatioByProbe.txt
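
The general approach can be sketched with a hash of lists, as below. This is not the content of parseGE.pl: the column order (probe, gene, log-ratio) is an assumption, and the rows are made up and kept in an array instead of being read from logRatioByProbe.txt.

```perl
use strict;
use warnings;

# Made-up rows standing in for the input file.
# Assumed column order: probe <TAB> gene <TAB> log-ratio
my @lines = (
    "p1\tGENE1\t2.5",
    "p2\tGENE1\t3.1",
    "p3\tGENE2\t0.4",
    "p4\tGENE2\t1.0",
);

my %ratios;   # gene => reference to a list of log-ratios (a hash of lists)
foreach my $line (@lines) {
    my ($probe, $gene, $log_ratio) = split /\t/, $line;
    push @{ $ratios{$gene} }, $log_ratio;
}

my @selected;
foreach my $gene (sort keys %ratios) {
    my $sum = 0;
    $sum += $_ foreach @{ $ratios{$gene} };
    my $mean = $sum / @{ $ratios{$gene} };   # array in numeric context = its length
    push @selected, $gene if $mean > 2;
}
print "genes with average log-ratio > 2: @selected\n";
```

With a real file, the foreach over @lines would become a while loop reading from a filehandle, with chomp applied to each line.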

The ref function

ref(\$a) will output "SCALAR".
ref(\@a) will output "ARRAY".
ref(\%a) will output "HASH".
ref(\$reference) will output "REF" (when $reference itself holds a reference).
ref(sub {return 1}) will output "CODE".
ref(3) will output "" (the empty string).
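
ref is handy when a subroutine should accept different kinds of arguments. A minimal sketch, in which describe() is a hypothetical helper written for this illustration:

```perl
use strict;
use warnings;

# describe() is a hypothetical helper that dispatches on the reference type.
sub describe {
    my ($thing) = @_;
    my $type = ref($thing);
    return "array of " . scalar(@$thing) . " elements"  if $type eq "ARRAY";
    return "hash with " . scalar(keys %$thing) . " keys" if $type eq "HASH";
    return "code"                                        if $type eq "CODE";
    return "plain scalar: $thing"                        if $type eq "";
    return "other reference ($type)";
}

print describe([1, 2, 3]), "\n";
print describe({ a => 1 }), "\n";
print describe(sub { 1 }), "\n";
print describe(42), "\n";
print describe(\"text"), "\n";
```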

!?! How do we use a hash with array values ? What's wrong with the following code:

my ($i,@a,%hash);
while(<>){
    chomp;
    ($i,@a)=split "\t";
    $hash{$i}=\@a;
}
How can you solve the problem ?
And how would you handle a hash of hashes with array values ?

Part IV: Perl Modules
Packages, Libraries, Modules & Programs are all different types of namespaces and classes, which help us maintain simple and organized code. Perl provides mechanisms to protect packages from stomping on each other's variables.
perldoc perlmod

Package:
Quite similar to C++ namespace. The default package of your code is 'main'.

package Alpha;
$name = "first";

package Omega;
$name = "last";

package main;
print "> Alpha is $Alpha::name, Omega is $Omega::name.\n";
> Alpha is first, Omega is last.
The package includes all the variables and the subroutines that are defined in its scope.

Module:
A bunch of subroutines that conform to specific conventions. Stored in a *.pm file. Modules can be loaded with use (at compile time) or require (at run time).

A Pragma:
is a module that affects the compilation behavior (like strict).

Writing a Module:
You must follow some basic rules. Here is an example: the Coffee module. In the package, we'll include some variables ($with_milk and $sugar) and a function: drink()

It must sit in a file called "Coffee.pm", in one of the directories listed in @INC (the environment variable PERL5LIB might be handy here; see perlmod and perlrun).

# ------------ Coffee.pm ------------
package Coffee;
use Exporter;
@ISA = ('Exporter');
@EXPORT = qw(&drink $with_milk $sugar);
$with_milk = "no";
$sugar = 1;
sub drink {
print "The coffee was great ($with_milk milk, $sugar sugar)\n";
}
1;
# ------------ Coffee.pm ------------

In your program, do:

use Coffee;
$sugar++;
drink();
> The coffee was great (no milk, 2 sugar)

or

require Coffee;
$Coffee::with_milk = "a little";
Coffee::drink();
> The coffee was great (a little milk, 1 sugar)

You can read more about lines 2-4 and the last line of Coffee.pm in the perlmod man page.

Installing new modules:
How to get those cool modules from CPAN, and install them on my machine ?

- search CPAN, and download package, say GD-2.39.tar.gz
> tar zxvf GD-2.39.tar.gz
> cd GD-2.39
> perl Makefile.PL LIB=~/perllib PREFIX=~/perllib
> make
> make test
> make install
> cd ..
> rm -Rf GD-2.39.tar.gz
> setenv PERL5LIB ${HOME}/perllib
> setenv MANPATH ${MANPATH}:${HOME}/perllib/lib/perl5/man/

What do you know about environment variables ? the setenv command ? and the .cshrc file under your home directory ?