Workshop in Computational Bioskills - Lesson 1

Workshop in Computational Bioskills - Spring 2010

Lesson 1 - Unix Shell Programming

Part I - The basics of text manipulation in UNIX
Part II - Primitive motif search - E.Coli sigma70 sites
Part III - Visualization of E.Coli genes' length
Part IV - Whole genomes nucleotide composition
Part V - foreach & friends
Part VI - Writing a shell script
Part VII - Regular Expressions


Part I - The basics of text manipulation in UNIX

UNIX commands:

Text:    cat, sort, uniq, grep, echo
         head, tail, wc
         tr, cut, fold
         awk, sed
         paste, join
Web:     lynx, wget (We'll get back to this in advanced lesson) 
Control: if, foreach, while, 
         jobs, kill, ctrl-D, ctrl-S, ctrl-Q, bg, fg
         [or in general: tcsh manual, (ba)sh manual]
Files:   File redirections (<, >, >>, >&, >!, >&!), tee
Other useful commands will be given throughout the course
<201|0>bioskill:~> cal
     March 2010     
Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31
<202|0>bioskill:~> cal 2005

                            2005

      January               February               March
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
                   1         1  2  3  4  5         1  2  3  4  5
 2  3  4  5  6  7  8   6  7  8  9 10 11 12   6  7  8  9 10 11 12
 9 10 11 12 13 14 15  13 14 15 16 17 18 19  13 14 15 16 17 18 19
16 17 18 19 20 21 22  20 21 22 23 24 25 26  20 21 22 23 24 25 26
23 24 25 26 27 28 29  27 28                 27 28 29 30 31
30 31                                       
       April                  May                   June
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
                1  2   1  2  3  4  5  6  7            1  2  3  4
 3  4  5  6  7  8  9   8  9 10 11 12 13 14   5  6  7  8  9 10 11
10 11 12 13 14 15 16  15 16 17 18 19 20 21  12 13 14 15 16 17 18
17 18 19 20 21 22 23  22 23 24 25 26 27 28  19 20 21 22 23 24 25
24 25 26 27 28 29 30  29 30 31              26 27 28 29 30
                                            
        July                 August              September
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
                1  2      1  2  3  4  5  6               1  2  3
 3  4  5  6  7  8  9   7  8  9 10 11 12 13   4  5  6  7  8  9 10
10 11 12 13 14 15 16  14 15 16 17 18 19 20  11 12 13 14 15 16 17
17 18 19 20 21 22 23  21 22 23 24 25 26 27  18 19 20 21 22 23 24
24 25 26 27 28 29 30  28 29 30 31           25 26 27 28 29 30
31                                          
      October               November              December
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
                   1         1  2  3  4  5               1  2  3
 2  3  4  5  6  7  8   6  7  8  9 10 11 12   4  5  6  7  8  9 10
 9 10 11 12 13 14 15  13 14 15 16 17 18 19  11 12 13 14 15 16 17
16 17 18 19 20 21 22  20 21 22 23 24 25 26  18 19 20 21 22 23 24
23 24 25 26 27 28 29  27 28 29 30           25 26 27 28 29 30 31
30 31                                       

<204|0>bioskill:~> cal 2005 | cat -n
     1	                             2005
     2	
     3	      January               February               March
     4	Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
     5	                   1         1  2  3  4  5         1  2  3  4  5
     6	 2  3  4  5  6  7  8   6  7  8  9 10 11 12   6  7  8  9 10 11 12
     7	 9 10 11 12 13 14 15  13 14 15 16 17 18 19  13 14 15 16 17 18 19
     8	16 17 18 19 20 21 22  20 21 22 23 24 25 26  20 21 22 23 24 25 26
     9	23 24 25 26 27 28 29  27 28                 27 28 29 30 31
    10	30 31                                       
    11	       April                  May                   June
    12	Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    13	                1  2   1  2  3  4  5  6  7            1  2  3  4
    14	 3  4  5  6  7  8  9   8  9 10 11 12 13 14   5  6  7  8  9 10 11
    15	10 11 12 13 14 15 16  15 16 17 18 19 20 21  12 13 14 15 16 17 18
    16	17 18 19 20 21 22 23  22 23 24 25 26 27 28  19 20 21 22 23 24 25
    17	24 25 26 27 28 29 30  29 30 31              26 27 28 29 30
    18	                                            
    19	        July                 August              September
    20	Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    21	                1  2      1  2  3  4  5  6               1  2  3
    22	 3  4  5  6  7  8  9   7  8  9 10 11 12 13   4  5  6  7  8  9 10
    23	10 11 12 13 14 15 16  14 15 16 17 18 19 20  11 12 13 14 15 16 17
    24	17 18 19 20 21 22 23  21 22 23 24 25 26 27  18 19 20 21 22 23 24
    25	24 25 26 27 28 29 30  28 29 30 31           25 26 27 28 29 30
    26	31                                          
    27	      October               November              December
    28	Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    29	                   1         1  2  3  4  5               1  2  3
    30	 2  3  4  5  6  7  8   6  7  8  9 10 11 12   4  5  6  7  8  9 10
    31	 9 10 11 12 13 14 15  13 14 15 16 17 18 19  11 12 13 14 15 16 17
    32	16 17 18 19 20 21 22  20 21 22 23 24 25 26  18 19 20 21 22 23 24
    33	23 24 25 26 27 28 29  27 28 29 30           25 26 27 28 29 30 31
    34	30 31                                       
<206|0>bioskill:~> cal 2005 | cat -n | head -10
    1	                             2005
     2	
     3	      January               February               March
     4	Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
     5	                   1         1  2  3  4  5         1  2  3  4  5
     6	 2  3  4  5  6  7  8   6  7  8  9 10 11 12   6  7  8  9 10 11 12
     7	 9 10 11 12 13 14 15  13 14 15 16 17 18 19  13 14 15 16 17 18 19
     8	16 17 18 19 20 21 22  20 21 22 23 24 25 26  20 21 22 23 24 25 26
     9	23 24 25 26 27 28 29  27 28                 27 28 29 30 31
    10	30 31         
<207|0>bioskill:~> cal 2005 | cat -n | tail -10
    25	24 25 26 27 28 29 30  28 29 30 31           25 26 27 28 29 30
    26	31                                          
    27	      October               November              December
    28	Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    29	                   1         1  2  3  4  5               1  2  3
    30	 2  3  4  5  6  7  8   6  7  8  9 10 11 12   4  5  6  7  8  9 10
    31	 9 10 11 12 13 14 15  13 14 15 16 17 18 19  11 12 13 14 15 16 17
    32	16 17 18 19 20 21 22  20 21 22 23 24 25 26  18 19 20 21 22 23 24
    33	23 24 25 26 27 28 29  27 28 29 30           25 26 27 28 29 30 31
    34	30 31                        
<208|0>bioskill:~> cal 2005 | cat -n | tail -n +20
    20	Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    21	                1  2      1  2  3  4  5  6               1  2  3
    22	 3  4  5  6  7  8  9   7  8  9 10 11 12 13   4  5  6  7  8  9 10
    23	10 11 12 13 14 15 16  14 15 16 17 18 19 20  11 12 13 14 15 16 17
    24	17 18 19 20 21 22 23  21 22 23 24 25 26 27  18 19 20 21 22 23 24
    25	24 25 26 27 28 29 30  28 29 30 31           25 26 27 28 29 30
    26	31                                          
    27	      October               November              December
    28	Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    29	                   1         1  2  3  4  5               1  2  3
    30	 2  3  4  5  6  7  8   6  7  8  9 10 11 12   4  5  6  7  8  9 10
    31	 9 10 11 12 13 14 15  13 14 15 16 17 18 19  11 12 13 14 15 16 17
    32	16 17 18 19 20 21 22  20 21 22 23 24 25 26  18 19 20 21 22 23 24
    33	23 24 25 26 27 28 29  27 28 29 30           25 26 27 28 29 30 31
    34	30 31                 
<203|0>bioskill:~> man grep 

<203|0>bioskill:~> cal 2005 | grep 30

23 24 25 26 27 28 29  27 28                 27 28 29 30 31
30 31                                       
24 25 26 27 28 29 30  29 30 31              26 27 28 29 30
24 25 26 27 28 29 30  28 29 30 31           25 26 27 28 29 30
23 24 25 26 27 28 29  27 28 29 30           25 26 27 28 29 30 31
30 31     
!?! What is the meaning of the flags -i,-v,-A,-B 
        in the grep command?
<209|0>bioskill:~> cal 2005 | tr 'A-Z' 'a-z'
                             2005

      january               february               march
su mo tu we th fr sa  su mo tu we th fr sa  su mo tu we th fr sa
                   1         1  2  3  4  5         1  2  3  4  5
 2  3  4  5  6  7  8   6  7  8  9 10 11 12   6  7  8  9 10 11 12
 9 10 11 12 13 14 15  13 14 15 16 17 18 19  13 14 15 16 17 18 19
16 17 18 19 20 21 22  20 21 22 23 24 25 26  20 21 22 23 24 25 26
23 24 25 26 27 28 29  27 28                 27 28 29 30 31
30 31                                       
       april                  may                   june
su mo tu we th fr sa  su mo tu we th fr sa  su mo tu we th fr sa
                1  2   1  2  3  4  5  6  7            1  2  3  4
 3  4  5  6  7  8  9   8  9 10 11 12 13 14   5  6  7  8  9 10 11
10 11 12 13 14 15 16  15 16 17 18 19 20 21  12 13 14 15 16 17 18
17 18 19 20 21 22 23  22 23 24 25 26 27 28  19 20 21 22 23 24 25
24 25 26 27 28 29 30  29 30 31              26 27 28 29 30
                                            
        july                 august              september
su mo tu we th fr sa  su mo tu we th fr sa  su mo tu we th fr sa
                1  2      1  2  3  4  5  6               1  2  3
 3  4  5  6  7  8  9   7  8  9 10 11 12 13   4  5  6  7  8  9 10
10 11 12 13 14 15 16  14 15 16 17 18 19 20  11 12 13 14 15 16 17
17 18 19 20 21 22 23  21 22 23 24 25 26 27  18 19 20 21 22 23 24
24 25 26 27 28 29 30  28 29 30 31           25 26 27 28 29 30
31                                          
      october               november              december
su mo tu we th fr sa  su mo tu we th fr sa  su mo tu we th fr sa
                   1         1  2  3  4  5               1  2  3
 2  3  4  5  6  7  8   6  7  8  9 10 11 12   4  5  6  7  8  9 10
 9 10 11 12 13 14 15  13 14 15 16 17 18 19  11 12 13 14 15 16 17
16 17 18 19 20 21 22  20 21 22 23 24 25 26  18 19 20 21 22 23 24
23 24 25 26 27 28 29  27 28 29 30           25 26 27 28 29 30 31
30 31                                       
<210|0>bioskill:~> cal | sed 's/February/bioskill/'
   bioskill 2010
Su Mo Tu We Th Fr Sa
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28
<211|0>bioskill:~> cal | sed 's/february/bioskill/i'
   bioskill 2010
Su Mo Tu We Th Fr Sa
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28
!?! What does the flag 'g' in sed command? Any ther important flags?
<212|0>bioskill:~> cal | fold -w 7
   Febr
uary 20
05
Su Mo T
u We Th
 Fr Sa
       
1  2  3
  4  5
 6  7  
8  9 10
 11 12
13 14 1
5 16 17
 18 19
20 21 2
2 23 24
 25 26
27 28
!?! Instead of using cat for viewing a file, try more or less (Look at the MAN pages)

Part II - Primitive motif search - E.Coli sigma70 sites

<201|0>bioskill:~> cd Data/EColi/Promoters/

<202|0>bioskill:Promoters> cat 50-TSS.tfa | head -10

>ORF83.1
GCATAAAAAACTCTGCTGGCATTCACAAATGCGCAGGGGTAAAACGTTTC
>ORF83.2
AATGCGCAGGGGTAAAACGTTTCCTGTAGCACCGTGAGTTATACTTTGTA
>accA
TATGCTCGCGGGCTTGCTATCTCGCTGACGGACAGGCAAATTGATGACCA
>accB
TTACATGTTAGCTGTTGATTATCTTCCCTGATAAGACCAGTATTTAGCTG
>accD
TCGACCACTTTTTTATCCAAAGTTTCGGGCTGTTATGTTTTAATGTGCAA

!?! Do you know what is the Fasta format ?

<203|141>bioskill:Promoters> cat 50-TSS.tfa | 
FASTA2line.pl | head -10

>ORF83.1        GCATAAAAAACTCTGCTGGCATTCACAAATGCGCAGGGGTAAAACGTTTC
>ORF83.2        AATGCGCAGGGGTAAAACGTTTCCTGTAGCACCGTGAGTTATACTTTGTA
>accA   TATGCTCGCGGGCTTGCTATCTCGCTGACGGACAGGCAAATTGATGACCA
>accB   TTACATGTTAGCTGTTGATTATCTTCCCTGATAAGACCAGTATTTAGCTG
>accD   TCGACCACTTTTTTATCCAAAGTTTCGGGCTGTTATGTTTTAATGTGCAA
>aceBAK TAAAATGGAAATTGTTTTTGATTTTGCATTTTAAATGAGTAGTCTTAGTT
>aceE   ACTAAACGTAGAACCTGTCTTATTGAGCTTTCCGGCGAGAGTTCAATGGG
>acnAP1 CGTTATTCCAGACGACTGGCAACTAACATCGCAGCAGCAAGCCTTTATAG
>acnAP2 ATTTGGGTTGTTATCAAATCGTTACGCGATGTTTGTGTTATCTTTAATAT
>acnB   TTTTTTGTAAACAGATTAACACCTCGTCAAAATCCTGCTATTCTGCCCGT

<204|141>bioskill:Promoters> cat50-TSS.tfa | FASTA2line.pl | grep -v ORF | head -10 >accA TATGCTCGCGGGCTTGCTATCTCGCTGACGGACAGGCAAATTGATGACCA >accB TTACATGTTAGCTGTTGATTATCTTCCCTGATAAGACCAGTATTTAGCTG >accD TCGACCACTTTTTTATCCAAAGTTTCGGGCTGTTATGTTTTAATGTGCAA >aceBAK TAAAATGGAAATTGTTTTTGATTTTGCATTTTAAATGAGTAGTCTTAGTT >aceE ACTAAACGTAGAACCTGTCTTATTGAGCTTTCCGGCGAGAGTTCAATGGG >acnAP1 CGTTATTCCAGACGACTGGCAACTAACATCGCAGCAGCAAGCCTTTATAG >acnAP2 ATTTGGGTTGTTATCAAATCGTTACGCGATGTTTGTGTTATCTTTAATAT >acnB TTTTTTGTAAACAGATTAACACCTCGTCAAAATCCTGCTATTCTGCCCGT >ada GCGCAAGATTGTTGGTTTTTGCGTGATGGTGACCGGGCAGCCTAAAGGCT >adhE GTTGTGCAAAACATGCTAATGTAGCCACCAAATCATACTACAATTTATTA

<205|141>bioskill:Promoters> cat 50-TSS.tfa | FASTA2line.pl | grep -v ORF | FASTAfromline.pl > 50-TSS.no-ORF.tfa

<206|141>bioskill:Promoters> head -7 50-TSS.no-ORF.tfa

>accA
TATGCTCGCGGGCTTGCTATCTCGCTGACGGACAGGCAAATTGATGACCA
>accB
TTACATGTTAGCTGTTGATTATCTTCCCTGATAAGACCAGTATTTAGCTG
>accD
TCGACCACTTTTTTATCCAAAGTTTCGGGCTGTTATGTTTTAATGTGCAA
>aceBAK

<207|0>bioskill:Promoters> FASTA2line.pl 50-TSS.tfa | wc -l


    447

<208|0>bioskill:Promoters> FASTA2line.pl 50-TSS.tfa | grep TATAAT | wc -l


    25

# Searching for the TF motif TTGACA :
<209|1>bioskill:Promoters> FASTA2line.pl 50-TSS.tfa | grep TTGACA | wc -l

    18

# And both :
<210|0>bioskill:Promoters> FASTA2line.pl 50-TSS.tfa | grep TATAAT | grep TTGACA | wc -l

    0
!?! How can we find the names of TATA-box containg genes? 

Part III - Statistics on E.Coli gene length

<201|0>bioskill:~> cd Data/EColi/

<202|0>bioskill:EColi> head -10 Colibri_Gene.list

Colibri name      Gene length     SWISS-PROT      Location (kb)   Description
aarF    1641    P27854  4017.80 Regulator of 2'-N-acetyltransferase; involved in respiratory cofactor ubiquinone production
aas     2160    P31119  2974.00 2-Acyl-glycerophosphoethanolamine acyltransferase; acyl-ACP synthetase; salvage pathway for reacylation; inner membrane; bifunctional for turnover/incorporation
aat     705     P23885   926.70 Aminoacyl-tRNA-protein-transferase
abc     1032    P30750   222.60 ABC protein family homolog
abgA    1311    P77357  1402.60 Gene in putative abgABT operon; function unknown
abgB    1446    P76052  1401.30 Gene in putative abgABT operon; function unknown
abgR    909     P77744  1402.80 Putative regulator of abgABT operon
abgT    1533    P46133  1399.80 Para-aminobenzoyl-glutamate utilization; cryptic gene
abrB    1047    P75747   747.00 Possible regulator of aidB expression
!?! Are you familiar with the awk pattern scanning and processing language?
AWK scans each input file for lines that match a set of patterns. With each pattern there can be an associated action. A statement has the form: (pattern){ action }

<203|141>bioskill:EColi> cat Colibri_Gene.list | awk -F '\t' '{print $2}' | head -10

Gene length
1641
2160
705
1032
1311
1446
909
1533
1047

<204|141>bioskill:EColi> cat Colibri_Gene.list | awk -F'\t' '(FNR>1){print $2}' | head -10

1641
2160
705
1032
1311
1446
909
1533
1047
960
!?! What will be the result of: 
awk -F '\t' '(FNR>1){print $0}' and 
awk -F '\t' '(FNR>1)'

<205|141>bioskill:EColi> cat Colibri_Gene.list | 
awk -F'\t' '(FNR>1){print NF}' | sort | uniq 

   5

!?! What will be the result of the same command when omitting " -F '\t' "?

<205|141>bioskill:EColi> cat Colibri_Gene.list | awk -F '\t' '(FNR>1){sum+=$2;n++}END{print sum/n}'

948.826

<206|141>bioskill:EColi> tail -n +2 Colibri_Gene.list | cut -f 2 | stats.pl

Average : 948.826452064381
Sum     : 4067619
Sum Sqrs: 5631951129
Variance: 413552.642295925
Std dev.: 643.080587715043
N       : 4287

<207|0>bioskill:EColi> tail -n +2 Colibri_Gene.list | awk '{a[int($2/100)]++}END{for (i in a){print 100*(i+1),a[i]}}' | sort -g > tmp

<207|0>bioskill:EColi> head tmp

100 104
200 96
300 269
400 295
500 313
600 306
700 319
800 342
900 299
1000 334

Part IV - Whole genomes nucleotide composition
Can we find the nucleotide composition by using simple shell commands?

<201|0>bioskill:~> cd Data/Bacteria

<202|0>bioskill:Bacteria> ls -l

total 1954
-rw-r--r--   1 bioskill users      923807 Mar 10 19:48 Borrelia_Burgdorferi.tfa
-rw-r--r--   1 bioskill users      828126 Mar 10 19:44 Mycoplasma_Pneumonia.tfa

<203|0>bioskill:Bacteria> grep -v '^>' Borrelia_Burgdorferi.tfa | head -10

TAAATATAATTTAATAGTATAAAAAAAATTAAATCAAATTAATAATAGTTTAAAAAACTGTTTGTATAAT
ATAATATTATTATATATAATATTAAGCAACTACTATGATACTAATGAAGTATAGTGCTATTTTATTAATA
TGTAGCGTTAATTTATTTTGTTTTCAAAATAAATTAACTACTTCTCGATGGGAATTCCCTAAAGAAGATT
TAATTAAAAAAAAAATAAAAATAGGCATAATTTACCATAATTACATAAATTCTATCTTTTACAATGAAAA
TTATAAATACATTGCCTTTATCGGAATATTGACATCTTATAATGAATGGATTGAAATACAATTTAGCCCC
ATAAATTTTTTTACTATCCCAACAAATAAAGATTTTATTTCAAATACTTATTTCAATTTAGCTTTCACTA
TTTACATTACCAAGTATTCAATTTTAACTGATACACTTGCTATAAAATTTTTTATTGGAACCCAAATCGA
TTTAACTCTGAGAACTACTATATTTACAGGAAAAACAACTCATGCATTTCTCTATCCAATTCTTCCCATA
ATTACCTTCAAATTTGAAATTGATTTCATACCTAATAACTATAGTATTTACTATAAATTATCGACTTCTT
TTAAAGAATTTATCCTTTTAGATCTAGGAATTTCTATATTTATATAATCCTTTTTTTATTATAGAACTTT

<204|0>bioskill:Bacteria> grep -v '^>' Borrelia_Burgdorferi.tfa | tr 'acgt' 'ACGT' | tr -cd 'ACGT' | fold -w 1 | sort | uniq -c

323079 A
130760 C
129646 G
327196 T

<205|0>bioskill:Bacteria> grep -v '^>' Mycoplasma_Pneumonia.tfa | tr 'acgt' 'ACGT' | tr -cd 'ACGT' | fold -w 1 | sort | uniq -c

249211 A
162920 C
163703 G
240560 T

<205|0>bioskill:Bacteria> grep -v '^>' Mycoplasma_Pneumonia.tfa | fold -w 1 | awk '{a[$1]++}END{for (i in a){print i,a[i]}}'

A 249211
C 162920
G 163703
T 240560

<206|0>bioskill:Bacteria> cd ../Virus

<207|0>bioskill:Virus> ls -l

total 242
-rw-r--r--   1 bioskill users      235788 Mar 10 19:59 Amsacta_Moorei_Entomopoxvirus.tfa
-rw-r--r--   1 bioskill users       10589 Mar 10 19:48 Human_Immunodeficiency_Virus.tfa

<208|0>bioskill:Virus> grep -v '^>' Amsacta_Moorei_Entomopoxvirus.tfa | head -10

ATTTTTTTAAAATGAAAAAAAAAAATATCATAACTACTAACTATGGATTTACCTATAGAAATTTTAGAAA
TTATATTTAATTATACAGATACATACATAAAATTATAATTTATATATTTAAAATATTTAGAATTTATTGA
AAATTAGTAAAATTAGATTGTTCTAAAACATATATTGATTCTCTAAAAGGAATACATTATCTTACTAATT
TACAAAAATTAATTCTTTAAAAGAAATATGTTGCCTTAATAATATTAAAAAAATAAATTGTTCATATACA
ATCATTGATTCTCTAAAAGGAATAAGTCTTAATAATTTAGAAGAATTATATTGTTATAATATAAAAATTT
ATTCTTTAAATATAATAATAAAAAATCTGCTTATTAAAAATATTAAATGGTTATAAATACATAAATTAAT
TATTTTATATAAATTATTGTTAAACATTTATATTAATATTCTAATATTAAAAATTGAAAAAAAAAATAAT
TATGTTAAAATGGAGTTACCTGTAGAAATGTTAGAAATTATATTTAATTATTTAGATAATGATACTAAAT
TACAATTTATAGATTCAAAATGTATTATATCAAAACTTATATATAAATTAAAATATAATTCTTGTTTAAA
AGAAATAAAGAATTTTATTAATTTAAAAGAATTAATATATAATAATTATTATATAAAATCTTTAGAAGGT

<209|0>bioskill:Virus> grep -v '^>' Amsacta_Moorei_Entomopoxvirus.tfa | tr 'acgt' 'ACGT' | tr -cd 'ACGT' | fold -w 1 | awk '{a[$1]++}END{for (i in a){print i,a[i]}}'

A  94121
C  20868
G  20454
T  96949

<210|0>bioskill:Virus> grep -v '^>' Human_Immunodeficiency_Virus.tfa | tr 'acgt' 'ACGT' | tr -cd 'ACGT' | fold -w 1 | awk '{a[$1]++}END{for (i in a){print i,a[i]}}'

A  3506
C  2132
G  2598
T  2123

Part V - foreach & friends

# Important note:
# Read the excellent
manual of tcsh to learn more about foreach, while, if
# IO, shell variables, and other tcsh options like :r,:h,:t,:a,:s,:e, etc.

<201|0>bioskill:~> cd Data/EColi

<202|0>bioskill:~> foreach f (1 2 3 4 5) # Store each item from the list (1 2 3 4 5) in $f variable,
# and run some command on it.

foreach? echo "hello: $f"
foreach? end
hello: 1
hello: 2
hello: 3
hello: 4
hello: 5

<203|0>bioskill:~> foreach f (*.tfa)
# The :r flag ommits the last part
foreach? echo -n "$f -> "
foreach? echo $f:r
foreach? end
Colibri_Prot.tfa -> Colibri_Prot
Colibri_Seq.tfa -> Colibri_Seq

<204|0>bioskill:~> foreach f (*.tfa)
# Let's rename all *.tfa files to *.fasta

foreach? mv $f $f:r.fasta
foreach? end
<205|0>bioskill:~> ls *.fasta *.tfa
Colibri_Prot.fasta
Colibri_Seq.fasta
<206|0>bioskill:~> foreach f (*.fasta)
# and back
foreach? mv $f $f:r.tfa
foreach? end
<207|0>bioskill:~> ls *.fasta *.tfa
Colibri_Prot.tfa
Colibri_Seq.tfa

Part VI - Writing a shell script

Basic scripting rules:

First line of script must begin with a shebang (#!), and then contain the path to the program that interprets your script (in our case, /bin/csh):
#!/path/to/script/interpreter

On unix platforms, remember to make the file executable by changing its permissions.
chmod +x <script-name>

Your First script:


#! /bin/csh
set name = $1
echo "Hellow $name. Have a nice day!"

<207|0>bioskill~> firstScript.csh Naomi
Hellow Naomi. Have a nice day!

<207|0>bioskill~> secondScript.csh TATAAT Promoters/50-TSS.tfa

There are 25 promoters that contain the binding site TATAAT
Their Nucleotide Composition is:
A=353
C=234
G=242
T=421

Finding the regulator of a set of genes

<207|0>bioskill~> ls TFs/

gene-set.list           Gcn4.targets.txt         
Ste12.targets.txt       Rpn4.targets.txt        
Hap4.targets.txt        Sko1.targets.txt

<207|0>bioskill:> head -4 TFs/Gcn4.targets.txt

Gene ID:    Description:
aarF    Regulator of 2'-N-acetyltransferase; involved in respiratory cofactor ubiquinone production
aas     2-Acyl-glycerophosphoethanolamine acyltransferase; acyl-ACP synthetase; salvage pathway for reacylation; inner membrane; bifunctional for turnover/incorporation
aat     Aminoacyl-tRNA-protein-transferase
abc     ABC protein family homolog

<207|0>bioskill:TFs> head -4 TFs/gene-set.list

Gene ID:
ORF10.1
aaG
aarF
aaA1

<207|0>bioskill:~> findOverlap.csh gene-set.list TFs

The TF Hap4  Has  36 Target genes (out of a total of 44 )
The TF Ste12  Has  4 Target genes (out of a total of 44 )
The TF Sko1  Has  3 Target genes (out of a total of 44 )
The TF Rpn4  Has  3 Target genes (out of a total of 44 )
The TF Gcn4  Has  3 Target genes (out of a total of 44 )
!?! In the script we used the command sort.
How can we sort by more than one column? 

Part VII - Regular Expressions:

A regular expression is a pattern that describes 
a set of strings.
Regular expressions are constructed analogously 
to arithmetic expressions, by using various 
operators to combine smaller expressions 
(eg., *,+,?,{},[]).
Patterns are very useful. We would often like 
to know if they can be found in files, how many 
times, where, and sometimes even replace them 
with another.

In UNIX, we can use grep (egrep),sed ,tr & awk 
to find and manipulate patterns.
In Perl, we can find almost similar commands. 
We will learn more about this in Lesson 4.

Lets' see some examples:

<207|0>bioskill:~> ls TFs/*

TFs/gene.set.list         TFs/Gcn4.targets.txt
TFs/Ste12.targets.txt     TFs/Rpn4.targets.txt         
TFs/Hap4.targets.txt      TFs/Sko1.targets.txt         

<207|0>bioskill:~> ls TFs/* | sed 's/.*\///'

gene-set.list           Gcn4_targets.txt         
Ste12_targets.txt       Rpn4_targets.txt        
Hap4_targets.txt        Sko1_targets.txt         

<207|0>bioskill:~> ls TFs/* | sed 's/[^:space:]\///'

gene-set.list           Gcn4_targets.txt         
Ste12_targets.txt       Rpn4_targets.txt        
Hap4_targets.txt        Sko1_targets.txt         
!?! What is the difference between '\s', '\S', '\W','\w' and '.'?
Read all about regular expressions in the manual of grep (use grep, grep -E, or egrep)!!!

!?! How can we find all lines 
that contain the pattern xxxxyyy 
(for any x and y)? 
Only at the end of a line? 
Only where x and y are digits?

<207|0>bioskill:~> echo "aaaabbb" | sed 's/\(ab\)/c\1c/'

aaacabcbb

<207|0>bioskill:~> echo "aaaabbb" | sed 's/\(.\)\1*/(\1)/'

(a)bbb

<207|0>bioskill:~> echo "aaaabbb" | sed 's/\(.\)\1*/(\1)/g'

(a)(b)
!?! Do we always need to use \ before ( ? 

<202|0>bioskill:EColi> head -10 Colibri_Gene.list

Colibri name      Gene length     SWISS-PROT      Location (kb)   Description
aarF    1641    P27854  4017.80 Regulator of 2'-N-acetyltransferase; involved in respiratory cofactor ubiquinone production
aas     2160    P31119  2974.00 2-Acyl-glycerophosphoethanolamine acyltransferase; acyl-ACP synthetase; salvage pathway for reacylation; inner membrane; bifunctional for turnover/incorporation
aat     705     P23885   926.70 Aminoacyl-tRNA-protein-transferase
abc     1032    P30750   222.60 ABC protein family homolog
abgA    1311    P77357  1402.60 Gene in putative abgABT operon; function unknown
abgB    1446    P76052  1401.30 Gene in putative abgABT operon; function unknown
abgR    909     P77744  1402.80 Putative regulator of abgABT operon
abgT    1533    P46133  1399.80 Para-aminobenzoyl-glutamate utilization; cryptic gene
abrB    1047    P75747   747.00 Possible regulator of aidB expression


<202|0>bioskill:EColi> cat Colibri_Gene.list  | 
awk -F '\t' '($5 ~/putative/) 
{OFS="\t"; print "Putative gene:",$3,$5}' | head -3 
Putative gene:    P77357  Gene in putative abgABT operon; function unknown
Putative gene:    P76052  Gene in putative abgABT operon; function unknown
Putative gene:    P77744  putative regulator of abgABT operon


Useful links:

Look at the course' website:
Examples:
Entrez - The Life Sciences Search Engine, at NCBI (including PubMed)
UCSC Genome Browser - the gold standard of genome browsers
National Center for Biotechnology Information (NCBI) - clearinghouse for all genomic information
SwissProt - at ExPASy.org
TRANSFAC - All the data on Transcription Factors