Workshop in Computational Bioskills - Spring 2010
Lesson 1 - Unix Shell Programming
Part I - The basics of text manipulation in
UNIX
Part II - Primitive motif search - E.Coli
sigma70 sites
Part III - Visualization of E.Coli genes'
length
Part IV - Whole genomes nucleotide
composition
Part V - foreach & friends
Part VI - Writing a shell script
Part VII - Regular Expressions
Text: cat, sort, uniq, grep, echo head, tail, wc tr, cut, fold awk, sed paste, join Web: lynx, wget (We'll get back to this in advanced lesson) Control: if, foreach, while, jobs, kill, ctrl-D, ctrl-S, ctrl-Q, bg, fg [or in general: tcsh manual, (ba)sh manual] Files: File redirections (<, >, >>, >&, >!, >&!), tee Other useful commands will be given throughout the course
<201|0>bioskill:~> cal March 2010 Su Mo Tu We Th Fr Sa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
<202|0>bioskill:~> cal 2005 2005 January February March Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 1 1 2 3 4 5 1 2 3 4 5 2 3 4 5 6 7 8 6 7 8 9 10 11 12 6 7 8 9 10 11 12 9 10 11 12 13 14 15 13 14 15 16 17 18 19 13 14 15 16 17 18 19 16 17 18 19 20 21 22 20 21 22 23 24 25 26 20 21 22 23 24 25 26 23 24 25 26 27 28 29 27 28 27 28 29 30 31 30 31 April May June Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 1 2 1 2 3 4 5 6 7 1 2 3 4 3 4 5 6 7 8 9 8 9 10 11 12 13 14 5 6 7 8 9 10 11 10 11 12 13 14 15 16 15 16 17 18 19 20 21 12 13 14 15 16 17 18 17 18 19 20 21 22 23 22 23 24 25 26 27 28 19 20 21 22 23 24 25 24 25 26 27 28 29 30 29 30 31 26 27 28 29 30 July August September Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 1 2 1 2 3 4 5 6 1 2 3 3 4 5 6 7 8 9 7 8 9 10 11 12 13 4 5 6 7 8 9 10 10 11 12 13 14 15 16 14 15 16 17 18 19 20 11 12 13 14 15 16 17 17 18 19 20 21 22 23 21 22 23 24 25 26 27 18 19 20 21 22 23 24 24 25 26 27 28 29 30 28 29 30 31 25 26 27 28 29 30 31 October November December Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 1 1 2 3 4 5 1 2 3 2 3 4 5 6 7 8 6 7 8 9 10 11 12 4 5 6 7 8 9 10 9 10 11 12 13 14 15 13 14 15 16 17 18 19 11 12 13 14 15 16 17 16 17 18 19 20 21 22 20 21 22 23 24 25 26 18 19 20 21 22 23 24 23 24 25 26 27 28 29 27 28 29 30 25 26 27 28 29 30 31 30 31
<204|0>bioskill:~> cal 2005 | cat -n 1 2005 2 3 January February March 4 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 5 1 1 2 3 4 5 1 2 3 4 5 6 2 3 4 5 6 7 8 6 7 8 9 10 11 12 6 7 8 9 10 11 12 7 9 10 11 12 13 14 15 13 14 15 16 17 18 19 13 14 15 16 17 18 19 8 16 17 18 19 20 21 22 20 21 22 23 24 25 26 20 21 22 23 24 25 26 9 23 24 25 26 27 28 29 27 28 27 28 29 30 31 10 30 31 11 April May June 12 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 13 1 2 1 2 3 4 5 6 7 1 2 3 4 14 3 4 5 6 7 8 9 8 9 10 11 12 13 14 5 6 7 8 9 10 11 15 10 11 12 13 14 15 16 15 16 17 18 19 20 21 12 13 14 15 16 17 18 16 17 18 19 20 21 22 23 22 23 24 25 26 27 28 19 20 21 22 23 24 25 17 24 25 26 27 28 29 30 29 30 31 26 27 28 29 30 18 19 July August September 20 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 21 1 2 1 2 3 4 5 6 1 2 3 22 3 4 5 6 7 8 9 7 8 9 10 11 12 13 4 5 6 7 8 9 10 23 10 11 12 13 14 15 16 14 15 16 17 18 19 20 11 12 13 14 15 16 17 24 17 18 19 20 21 22 23 21 22 23 24 25 26 27 18 19 20 21 22 23 24 25 24 25 26 27 28 29 30 28 29 30 31 25 26 27 28 29 30 26 31 27 October November December 28 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 29 1 1 2 3 4 5 1 2 3 30 2 3 4 5 6 7 8 6 7 8 9 10 11 12 4 5 6 7 8 9 10 31 9 10 11 12 13 14 15 13 14 15 16 17 18 19 11 12 13 14 15 16 17 32 16 17 18 19 20 21 22 20 21 22 23 24 25 26 18 19 20 21 22 23 24 33 23 24 25 26 27 28 29 27 28 29 30 25 26 27 28 29 30 31 34 30 31
<206|0>bioskill:~> cal 2005 | cat -n | head -10 1 2005 2 3 January February March 4 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 5 1 1 2 3 4 5 1 2 3 4 5 6 2 3 4 5 6 7 8 6 7 8 9 10 11 12 6 7 8 9 10 11 12 7 9 10 11 12 13 14 15 13 14 15 16 17 18 19 13 14 15 16 17 18 19 8 16 17 18 19 20 21 22 20 21 22 23 24 25 26 20 21 22 23 24 25 26 9 23 24 25 26 27 28 29 27 28 27 28 29 30 31 10 30 31
<207|0>bioskill:~> cal 2005 | cat -n | tail -10 25 24 25 26 27 28 29 30 28 29 30 31 25 26 27 28 29 30 26 31 27 October November December 28 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 29 1 1 2 3 4 5 1 2 3 30 2 3 4 5 6 7 8 6 7 8 9 10 11 12 4 5 6 7 8 9 10 31 9 10 11 12 13 14 15 13 14 15 16 17 18 19 11 12 13 14 15 16 17 32 16 17 18 19 20 21 22 20 21 22 23 24 25 26 18 19 20 21 22 23 24 33 23 24 25 26 27 28 29 27 28 29 30 25 26 27 28 29 30 31 34 30 31
<208|0>bioskill:~> cal 2005 | cat -n | tail -n +20 20 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 21 1 2 1 2 3 4 5 6 1 2 3 22 3 4 5 6 7 8 9 7 8 9 10 11 12 13 4 5 6 7 8 9 10 23 10 11 12 13 14 15 16 14 15 16 17 18 19 20 11 12 13 14 15 16 17 24 17 18 19 20 21 22 23 21 22 23 24 25 26 27 18 19 20 21 22 23 24 25 24 25 26 27 28 29 30 28 29 30 31 25 26 27 28 29 30 26 31 27 October November December 28 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa 29 1 1 2 3 4 5 1 2 3 30 2 3 4 5 6 7 8 6 7 8 9 10 11 12 4 5 6 7 8 9 10 31 9 10 11 12 13 14 15 13 14 15 16 17 18 19 11 12 13 14 15 16 17 32 16 17 18 19 20 21 22 20 21 22 23 24 25 26 18 19 20 21 22 23 24 33 23 24 25 26 27 28 29 27 28 29 30 25 26 27 28 29 30 31 34 30 31
<203|0>bioskill:~> man grep<203|0>bioskill:~> cal 2005 | grep 30 23 24 25 26 27 28 29 27 28 27 28 29 30 31 30 31 24 25 26 27 28 29 30 29 30 31 26 27 28 29 30 24 25 26 27 28 29 30 28 29 30 31 25 26 27 28 29 30 23 24 25 26 27 28 29 27 28 29 30 25 26 27 28 29 30 31 30 31!?! What is the meaning of the flags -i,-v,-A,-B in the grep command?<209|0>bioskill:~> cal 2005 | tr 'A-Z' 'a-z' 2005 january february march su mo tu we th fr sa su mo tu we th fr sa su mo tu we th fr sa 1 1 2 3 4 5 1 2 3 4 5 2 3 4 5 6 7 8 6 7 8 9 10 11 12 6 7 8 9 10 11 12 9 10 11 12 13 14 15 13 14 15 16 17 18 19 13 14 15 16 17 18 19 16 17 18 19 20 21 22 20 21 22 23 24 25 26 20 21 22 23 24 25 26 23 24 25 26 27 28 29 27 28 27 28 29 30 31 30 31 april may june su mo tu we th fr sa su mo tu we th fr sa su mo tu we th fr sa 1 2 1 2 3 4 5 6 7 1 2 3 4 3 4 5 6 7 8 9 8 9 10 11 12 13 14 5 6 7 8 9 10 11 10 11 12 13 14 15 16 15 16 17 18 19 20 21 12 13 14 15 16 17 18 17 18 19 20 21 22 23 22 23 24 25 26 27 28 19 20 21 22 23 24 25 24 25 26 27 28 29 30 29 30 31 26 27 28 29 30 july august september su mo tu we th fr sa su mo tu we th fr sa su mo tu we th fr sa 1 2 1 2 3 4 5 6 1 2 3 3 4 5 6 7 8 9 7 8 9 10 11 12 13 4 5 6 7 8 9 10 10 11 12 13 14 15 16 14 15 16 17 18 19 20 11 12 13 14 15 16 17 17 18 19 20 21 22 23 21 22 23 24 25 26 27 18 19 20 21 22 23 24 24 25 26 27 28 29 30 28 29 30 31 25 26 27 28 29 30 31 october november december su mo tu we th fr sa su mo tu we th fr sa su mo tu we th fr sa 1 1 2 3 4 5 1 2 3 2 3 4 5 6 7 8 6 7 8 9 10 11 12 4 5 6 7 8 9 10 9 10 11 12 13 14 15 13 14 15 16 17 18 19 11 12 13 14 15 16 17 16 17 18 19 20 21 22 20 21 22 23 24 25 26 18 19 20 21 22 23 24 23 24 25 26 27 28 29 27 28 29 30 25 26 27 28 29 30 31 30 31<210|0>bioskill:~> cal | sed 's/February/bioskill/' bioskill 2010 Su Mo Tu We Th Fr Sa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28<211|0>bioskill:~> cal | sed 's/february/bioskill/i' bioskill 2010 Su Mo Tu We Th Fr Sa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28!?! What does the flag 'g' in sed command? Any ther important flags?<212|0>bioskill:~> cal | fold -w 7 Febr uary 20 05 Su Mo T u We Th Fr Sa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 5 16 17 18 19 20 21 2 2 23 24 25 26 27 28!?! Instead of using cat for viewing a file, try more or less (Look at the MAN pages)Part II - Primitive motif search - E.Coli sigma70 sites
<201|0>bioskill:~> cd Data/EColi/Promoters/
<202|0>bioskill:Promoters> cat 50-TSS.tfa | head -10
>ORF83.1 GCATAAAAAACTCTGCTGGCATTCACAAATGCGCAGGGGTAAAACGTTTC >ORF83.2 AATGCGCAGGGGTAAAACGTTTCCTGTAGCACCGTGAGTTATACTTTGTA >accA TATGCTCGCGGGCTTGCTATCTCGCTGACGGACAGGCAAATTGATGACCA >accB TTACATGTTAGCTGTTGATTATCTTCCCTGATAAGACCAGTATTTAGCTG >accD TCGACCACTTTTTTATCCAAAGTTTCGGGCTGTTATGTTTTAATGTGCAA!?! Do you know what is the Fasta format ?<203|141>bioskill:Promoters> cat 50-TSS.tfa | FASTA2line.pl | head -10 >ORF83.1 GCATAAAAAACTCTGCTGGCATTCACAAATGCGCAGGGGTAAAACGTTTC >ORF83.2 AATGCGCAGGGGTAAAACGTTTCCTGTAGCACCGTGAGTTATACTTTGTA >accA TATGCTCGCGGGCTTGCTATCTCGCTGACGGACAGGCAAATTGATGACCA >accB TTACATGTTAGCTGTTGATTATCTTCCCTGATAAGACCAGTATTTAGCTG >accD TCGACCACTTTTTTATCCAAAGTTTCGGGCTGTTATGTTTTAATGTGCAA >aceBAK TAAAATGGAAATTGTTTTTGATTTTGCATTTTAAATGAGTAGTCTTAGTT >aceE ACTAAACGTAGAACCTGTCTTATTGAGCTTTCCGGCGAGAGTTCAATGGG >acnAP1 CGTTATTCCAGACGACTGGCAACTAACATCGCAGCAGCAAGCCTTTATAG >acnAP2 ATTTGGGTTGTTATCAAATCGTTACGCGATGTTTGTGTTATCTTTAATAT >acnB TTTTTTGTAAACAGATTAACACCTCGTCAAAATCCTGCTATTCTGCCCGT<204|141>bioskill:Promoters> cat50-TSS.tfa | FASTA2line.pl | grep -v ORF | head -10 >accA TATGCTCGCGGGCTTGCTATCTCGCTGACGGACAGGCAAATTGATGACCA >accB TTACATGTTAGCTGTTGATTATCTTCCCTGATAAGACCAGTATTTAGCTG >accD TCGACCACTTTTTTATCCAAAGTTTCGGGCTGTTATGTTTTAATGTGCAA >aceBAK TAAAATGGAAATTGTTTTTGATTTTGCATTTTAAATGAGTAGTCTTAGTT >aceE ACTAAACGTAGAACCTGTCTTATTGAGCTTTCCGGCGAGAGTTCAATGGG >acnAP1 CGTTATTCCAGACGACTGGCAACTAACATCGCAGCAGCAAGCCTTTATAG >acnAP2 ATTTGGGTTGTTATCAAATCGTTACGCGATGTTTGTGTTATCTTTAATAT >acnB TTTTTTGTAAACAGATTAACACCTCGTCAAAATCCTGCTATTCTGCCCGT >ada GCGCAAGATTGTTGGTTTTTGCGTGATGGTGACCGGGCAGCCTAAAGGCT >adhE GTTGTGCAAAACATGCTAATGTAGCCACCAAATCATACTACAATTTATTA
<205|141>bioskill:Promoters> cat 50-TSS.tfa | FASTA2line.pl | grep -v ORF | FASTAfromline.pl > 50-TSS.no-ORF.tfa
<206|141>bioskill:Promoters> head -7 50-TSS.no-ORF.tfa
>accA TATGCTCGCGGGCTTGCTATCTCGCTGACGGACAGGCAAATTGATGACCA >accB TTACATGTTAGCTGTTGATTATCTTCCCTGATAAGACCAGTATTTAGCTG >accD TCGACCACTTTTTTATCCAAAGTTTCGGGCTGTTATGTTTTAATGTGCAA >aceBAK
<207|0>bioskill:Promoters> FASTA2line.pl 50-TSS.tfa | wc -l
447
<208|0>bioskill:Promoters> FASTA2line.pl 50-TSS.tfa | grep TATAAT | wc -l
25
# Searching for the TF motif TTGACA :
<209|1>bioskill:Promoters> FASTA2line.pl 50-TSS.tfa | grep TTGACA |
wc -l
18
# And both :
<210|0>bioskill:Promoters> FASTA2line.pl 50-TSS.tfa | grep TATAAT |
grep TTGACA | wc -l
0
!?! How can we find the names of TATA-box containg genes?
<201|0>bioskill:~> cd Data/EColi/
<202|0>bioskill:EColi> head -10 Colibri_Gene.list
Colibri name Gene length SWISS-PROT Location (kb) Description aarF 1641 P27854 4017.80 Regulator of 2'-N-acetyltransferase; involved in respiratory cofactor ubiquinone production aas 2160 P31119 2974.00 2-Acyl-glycerophosphoethanolamine acyltransferase; acyl-ACP synthetase; salvage pathway for reacylation; inner membrane; bifunctional for turnover/incorporation aat 705 P23885 926.70 Aminoacyl-tRNA-protein-transferase abc 1032 P30750 222.60 ABC protein family homolog abgA 1311 P77357 1402.60 Gene in putative abgABT operon; function unknown abgB 1446 P76052 1401.30 Gene in putative abgABT operon; function unknown abgR 909 P77744 1402.80 Putative regulator of abgABT operon abgT 1533 P46133 1399.80 Para-aminobenzoyl-glutamate utilization; cryptic gene abrB 1047 P75747 747.00 Possible regulator of aidB expression!?! Are you familiar with the awk pattern scanning and processing language?
<203|141>bioskill:EColi> cat Colibri_Gene.list | awk -F '\t' '{print $2}' | head -10
Gene length 1641 2160 705 1032 1311 1446 909 1533 1047
<204|141>bioskill:EColi> cat Colibri_Gene.list | awk -F'\t' '(FNR>1){print $2}' | head -10
1641 2160 705 1032 1311 1446 909 1533 1047 960
!?! What will be the result of: awk -F '\t' '(FNR>1){print $0}' and awk -F '\t' '(FNR>1)'
<205|141>bioskill:EColi> cat Colibri_Gene.list | awk -F'\t' '(FNR>1){print NF}' | sort | uniq 5!?! What will be the result of the same command when omitting " -F '\t' "?
<205|141>bioskill:EColi> cat Colibri_Gene.list | awk -F '\t' '(FNR>1){sum+=$2;n++}END{print sum/n}'
948.826
<206|141>bioskill:EColi> tail -n +2 Colibri_Gene.list | cut -f 2 | stats.pl
Average : 948.826452064381 Sum : 4067619 Sum Sqrs: 5631951129 Variance: 413552.642295925 Std dev.: 643.080587715043 N : 4287
<207|0>bioskill:EColi> tail -n +2 Colibri_Gene.list | awk '{a[int($2/100)]++}END{for (i in a){print 100*(i+1),a[i]}}' | sort -g > tmp
<207|0>bioskill:EColi> head tmp
<201|0>bioskill:~> cd Data/Bacteria <202|0>bioskill:Bacteria> ls -l <203|0>bioskill:Bacteria> grep -v '^>' Borrelia_Burgdorferi.tfa | head -10 <204|0>bioskill:Bacteria> grep -v '^>' Borrelia_Burgdorferi.tfa | tr 'acgt' 'ACGT' | tr -cd 'ACGT' |
fold -w 1 | sort | uniq -c <205|0>bioskill:Bacteria> grep -v '^>' Mycoplasma_Pneumonia.tfa | tr 'acgt'
'ACGT' | tr -cd 'ACGT' | fold -w 1 | sort | uniq -c <205|0>bioskill:Bacteria> grep -v '^>' Mycoplasma_Pneumonia.tfa | fold -w 1 | awk '{a[$1]++}END{for (i in a){print i,a[i]}}' <206|0>bioskill:Bacteria> cd ../Virus <207|0>bioskill:Virus> ls -l <208|0>bioskill:Virus> grep -v '^>' Amsacta_Moorei_Entomopoxvirus.tfa | head -10 <209|0>bioskill:Virus> grep
-v '^>' Amsacta_Moorei_Entomopoxvirus.tfa | tr 'acgt'
'ACGT' | tr -cd 'ACGT' | fold -w 1 | awk '{a[$1]++}END{for (i in
a){print i,a[i]}}' <210|0>bioskill:Virus> grep -v
'^>' Human_Immunodeficiency_Virus.tfa | tr 'acgt' 'ACGT' |
tr -cd 'ACGT' | fold -w 1 | awk '{a[$1]++}END{for (i in a){print i,a[i]}}' # Important note: <201|0>bioskill:~> cd Data/EColi <202|0>bioskill:~> foreach f (1 2 3 4 5)
# Store each item from the list (1 2 3 4 5) in $f variable, <204|0>bioskill:~> foreach f (*.tfa) First line of script must
begin with a shebang (#!),
and then contain the path to the program
that interprets your script
(in our case, /bin/csh): On unix platforms, remember to make the
file executable by changing its permissions. Your First script:100 104
200 96
300 269
400 295
500 313
600 306
700 319
800 342
900 299
1000 334
Part IV - Whole genomes nucleotide composition
Can we find the nucleotide composition by
using simple shell commands?total 1954
-rw-r--r-- 1 bioskill users 923807 Mar 10 19:48 Borrelia_Burgdorferi.tfa
-rw-r--r-- 1 bioskill users 828126 Mar 10 19:44 Mycoplasma_Pneumonia.tfa
TAAATATAATTTAATAGTATAAAAAAAATTAAATCAAATTAATAATAGTTTAAAAAACTGTTTGTATAAT
ATAATATTATTATATATAATATTAAGCAACTACTATGATACTAATGAAGTATAGTGCTATTTTATTAATA
TGTAGCGTTAATTTATTTTGTTTTCAAAATAAATTAACTACTTCTCGATGGGAATTCCCTAAAGAAGATT
TAATTAAAAAAAAAATAAAAATAGGCATAATTTACCATAATTACATAAATTCTATCTTTTACAATGAAAA
TTATAAATACATTGCCTTTATCGGAATATTGACATCTTATAATGAATGGATTGAAATACAATTTAGCCCC
ATAAATTTTTTTACTATCCCAACAAATAAAGATTTTATTTCAAATACTTATTTCAATTTAGCTTTCACTA
TTTACATTACCAAGTATTCAATTTTAACTGATACACTTGCTATAAAATTTTTTATTGGAACCCAAATCGA
TTTAACTCTGAGAACTACTATATTTACAGGAAAAACAACTCATGCATTTCTCTATCCAATTCTTCCCATA
ATTACCTTCAAATTTGAAATTGATTTCATACCTAATAACTATAGTATTTACTATAAATTATCGACTTCTT
TTAAAGAATTTATCCTTTTAGATCTAGGAATTTCTATATTTATATAATCCTTTTTTTATTATAGAACTTT
323079 A
130760 C
129646 G
327196 T
249211 A
162920 C
163703 G
240560 T
A 249211
C 162920
G 163703
T 240560
total 242
-rw-r--r-- 1 bioskill users 235788 Mar 10 19:59 Amsacta_Moorei_Entomopoxvirus.tfa
-rw-r--r-- 1 bioskill users 10589 Mar 10 19:48 Human_Immunodeficiency_Virus.tfa
ATTTTTTTAAAATGAAAAAAAAAAATATCATAACTACTAACTATGGATTTACCTATAGAAATTTTAGAAA
TTATATTTAATTATACAGATACATACATAAAATTATAATTTATATATTTAAAATATTTAGAATTTATTGA
AAATTAGTAAAATTAGATTGTTCTAAAACATATATTGATTCTCTAAAAGGAATACATTATCTTACTAATT
TACAAAAATTAATTCTTTAAAAGAAATATGTTGCCTTAATAATATTAAAAAAATAAATTGTTCATATACA
ATCATTGATTCTCTAAAAGGAATAAGTCTTAATAATTTAGAAGAATTATATTGTTATAATATAAAAATTT
ATTCTTTAAATATAATAATAAAAAATCTGCTTATTAAAAATATTAAATGGTTATAAATACATAAATTAAT
TATTTTATATAAATTATTGTTAAACATTTATATTAATATTCTAATATTAAAAATTGAAAAAAAAAATAAT
TATGTTAAAATGGAGTTACCTGTAGAAATGTTAGAAATTATATTTAATTATTTAGATAATGATACTAAAT
TACAATTTATAGATTCAAAATGTATTATATCAAAACTTATATATAAATTAAAATATAATTCTTGTTTAAA
AGAAATAAAGAATTTTATTAATTTAAAAGAATTAATATATAATAATTATTATATAAAATCTTTAGAAGGT
A 94121
C 20868
G 20454
T 96949
A 3506
C 2132
G 2598
T 2123
Part V - foreach & friends
# Read the excellent manual of tcsh to learn more
about foreach, while, if
# IO, shell variables, and other tcsh
options like :r,:h,:t,:a,:s,:e, etc.
# and run some command on it.
foreach? echo "hello: $f"
foreach? end
hello: 1
hello: 2
hello: 3
hello: 4
hello: 5
<203|0>bioskill:~> foreach f (*.tfa)
# The :r flag ommits the last part
foreach? echo -n "$f -> "
foreach? echo $f:r
foreach? end
Colibri_Prot.tfa -> Colibri_Prot
Colibri_Seq.tfa -> Colibri_Seq
# Let's rename all *.tfa files to *.fasta
foreach? mv $f $f:r.fasta
foreach? end
<205|0>bioskill:~> ls *.fasta *.tfa
Colibri_Prot.fasta
Colibri_Seq.fasta
<206|0>bioskill:~> foreach f (*.fasta)
# and back
foreach? mv $f $f:r.tfa
foreach? end
<207|0>bioskill:~> ls *.fasta *.tfa
Colibri_Prot.tfa
Colibri_Seq.tfa
Part VI - Writing a shell script
Basic scripting rules:
#!/path/to/script/interpreter
chmod +x <script-name>
#! /bin/csh
set name = $1
echo "Hellow $name. Have a nice day!"
<207|0>bioskill~> firstScript.csh Naomi
Hellow Naomi. Have a nice day!
<207|0>bioskill~> secondScript.csh TATAAT Promoters/50-TSS.tfa
There are 25 promoters that contain the binding site TATAAT Their Nucleotide Composition is: A=353 C=234 G=242 T=421
<207|0>bioskill~> ls TFs/
gene-set.list Gcn4.targets.txt Ste12.targets.txt Rpn4.targets.txt Hap4.targets.txt Sko1.targets.txt
<207|0>bioskill:> head -4 TFs/Gcn4.targets.txt
Gene ID: Description: aarF Regulator of 2'-N-acetyltransferase; involved in respiratory cofactor ubiquinone production aas 2-Acyl-glycerophosphoethanolamine acyltransferase; acyl-ACP synthetase; salvage pathway for reacylation; inner membrane; bifunctional for turnover/incorporation aat Aminoacyl-tRNA-protein-transferase abc ABC protein family homolog
<207|0>bioskill:TFs> head -4 TFs/gene-set.list
Gene ID: ORF10.1 aaG aarF aaA1
<207|0>bioskill:~> findOverlap.csh gene-set.list TFs
The TF Hap4 Has 36 Target genes (out of a total of 44 ) The TF Ste12 Has 4 Target genes (out of a total of 44 ) The TF Sko1 Has 3 Target genes (out of a total of 44 ) The TF Rpn4 Has 3 Target genes (out of a total of 44 ) The TF Gcn4 Has 3 Target genes (out of a total of 44 )
!?! In the script we used the command sort. How can we sort by more than one column?
A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions (eg., *,+,?,{},[]). Patterns are very useful. We would often like to know if they can be found in files, how many times, where, and sometimes even replace them with another. In UNIX, we can use grep (egrep),sed ,tr & awk to find and manipulate patterns. In Perl, we can find almost similar commands. We will learn more about this in Lesson 4.Lets' see some examples:
<207|0>bioskill:~> ls TFs/*
TFs/gene.set.list TFs/Gcn4.targets.txt TFs/Ste12.targets.txt TFs/Rpn4.targets.txt TFs/Hap4.targets.txt TFs/Sko1.targets.txt
<207|0>bioskill:~> ls TFs/* | sed 's/.*\///'
gene-set.list Gcn4_targets.txt Ste12_targets.txt Rpn4_targets.txt Hap4_targets.txt Sko1_targets.txt
<207|0>bioskill:~> ls TFs/* | sed 's/[^:space:]\///'
gene-set.list Gcn4_targets.txt Ste12_targets.txt Rpn4_targets.txt Hap4_targets.txt Sko1_targets.txt!?! What is the difference between '\s', '\S', '\W','\w' and '.'?
!?! How can we find all lines that contain the pattern xxxxyyy (for any x and y)? Only at the end of a line? Only where x and y are digits?
<207|0>bioskill:~> echo "aaaabbb" | sed 's/\(ab\)/c\1c/'
aaacabcbb<207|0>bioskill:~> echo "aaaabbb" | sed 's/\(.\)\1*/(\1)/'
(a)bbb<207|0>bioskill:~> echo "aaaabbb" | sed 's/\(.\)\1*/(\1)/g'
(a)(b)!?! Do we always need to use \ before ( ?<202|0>bioskill:EColi> head -10 Colibri_Gene.list
Colibri name Gene length SWISS-PROT Location (kb) Description aarF 1641 P27854 4017.80 Regulator of 2'-N-acetyltransferase; involved in respiratory cofactor ubiquinone production aas 2160 P31119 2974.00 2-Acyl-glycerophosphoethanolamine acyltransferase; acyl-ACP synthetase; salvage pathway for reacylation; inner membrane; bifunctional for turnover/incorporation aat 705 P23885 926.70 Aminoacyl-tRNA-protein-transferase abc 1032 P30750 222.60 ABC protein family homolog abgA 1311 P77357 1402.60 Gene in putative abgABT operon; function unknown abgB 1446 P76052 1401.30 Gene in putative abgABT operon; function unknown abgR 909 P77744 1402.80 Putative regulator of abgABT operon abgT 1533 P46133 1399.80 Para-aminobenzoyl-glutamate utilization; cryptic gene abrB 1047 P75747 747.00 Possible regulator of aidB expression <202|0>bioskill:EColi> cat Colibri_Gene.list | awk -F '\t' '($5 ~/putative/) {OFS="\t"; print "Putative gene:",$3,$5}' | head -3 Putative gene: P77357 Gene in putative abgABT operon; function unknown Putative gene: P76052 Gene in putative abgABT operon; function unknown Putative gene: P77744 putative regulator of abgABT operon
Useful links:
Look at the course' website:
Examples:
Entrez - The Life Sciences Search Engine, at NCBI (including PubMed)
UCSC Genome Browser - the gold standard of genome browsers
National Center for Biotechnology Information (NCBI) - clearinghouse for all genomic information
SwissProt - at ExPASy.org
TRANSFAC - All the data on Transcription Factors