Workshop in Computational Bioskills - Exercise 1

Please read each question thoroughly, and think before answering.

Submission deadline: Tuesday 23/3, Ross closing time. This exercise is NOT intended to be done in pairs!
Submit your solutions through Moodle, under "Submit exercises & see grades".

Q1: Mastering Regular Expressions:

In this question, you'll practice some simple regular expressions on the file general_file.txt, using grep, egrep, sed and awk.
You may use all unix commands that were shown in class.

- What to submit?

A Unix shell script file called 'Ex1-1.csh' that will run on a given file and answer all the queries below.
Note: You're supposed to submit one UNIX command (line) per item. Use | to chain commands into a pipeline.
The output of the program should be ONLY the number of each query followed by the answer to that query.
The format should be EXACTLY "#Qx:", where x is the number of the query.
e.g. query number 7: find all words that contain AAA in the file input.txt. In the script you should write:
echo "#Q7:"
grep AAA input.txt
Make sure the file is executable (use the chmod command) and that its first line is a legal shebang (e.g. #!/bin/csh -f).

The script should take the name of the input file as an argument.
Usage: Ex1-1.csh <file>

For example, running the script on the file general_file.txt:
<206|0>bioskill:~> Ex1-1.csh general_file.txt

Queries

Last 3 lines containing the string 'high'.

Last 3 lines containing 'high' as a whole word (e.g. not as part of 'higher').

How many lines end with the sequence 'AA' OR 'AAG'?

All lines excluding the first 2 that end with '----' (i.e. out of the relevant lines, exclude the first 2)

How many words contain the string 'translation'?

First line with a word containing both 'late' and 'reg' (in either order)

How many lines contain a word that ends with the string 'ption' (e.g. transcription), where this word is NOT at the end of the LINE?

How many lines contain the characters c,d,m,s in reverse alphabetic order

How many lines contain the character 'i' exactly two times?

How many lines contain the character 'i' or 'I' (case insensitive) exactly two times?

Print the lines in rows number 10-15

Out of these lines (rows 10-15), print only the lines where the word in the second column begins with a letter, but NOT a capital letter

In how many lines does the word in the second column have fewer than 12 characters?

In how many lines does the word in the second column have exactly 10 characters?

What is the length of the shortest word in the file? (Note that you should check the length of the words, not the lines, and do not include the empty character)
What is the longest word in the file? (Again: words, not lines.)

How many lines contain the same character twice, but NOT next to each other ('xyx' or 'xyzux' etc., but NOT 'xyyx' or 'xx')?

And how many lines contain exactly the repeating 'xyyx' pattern (x and y could be any characters but not the same)?
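
As a warm-up before tackling the queries, here is a minimal sketch of a few of the building blocks involved: counting matches, whole-word matching, per-column tests in awk, and back-references. The file toy_lines.txt and its contents are made up for illustration (it is not general_file.txt), and the commands are plain utility invocations that should behave the same under csh and sh.

```shell
# Build a small made-up sample file (hypothetical; not the course file).
printf '%s\n' 'the high road' 'higher ground' 'fly high' 'abba gold' 'no match here' > toy_lines.txt

# Count lines containing the string 'high' anywhere (also matches 'higher'): 3
grep -c 'high' toy_lines.txt

# Count lines containing 'high' as a whole word only (-w): 2
grep -cw 'high' toy_lines.txt

# awk splits each line into whitespace-separated columns; length() measures one.
# Prints the second column when it has exactly 4 characters: high, high, gold
awk 'length($2) == 4 { print $2 }' toy_lines.txt

# Back-references: \1 and \2 must repeat what groups 1 and 2 matched.
# This pattern matches 'abba' (and would also match 'aaaa'): 1 line
grep -c '\(.\)\(.\)\2\1' toy_lines.txt
```

Note that the last pattern does not require the two characters to differ, so it is only a starting point for the queries above.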

Q2: Whole-Genome Nucleotide Composition:

- Background: Recall lesson 1, where we calculated the whole-genome nucleotide composition of our favorite organisms.

- Thermophile: In this part, you're asked to pick a thermophilic organism, download its genome from the web (you can find useful links here), and do an analysis similar to the one we did in class: calculate the GC content.
The GC content is the percentage of G and C nucleotides in the genome.
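
To make the definition concrete, the sketch below computes the GC content of a short made-up sequence. The file name toy_seq.txt and the sequence itself are hypothetical, and the file deliberately has no FASTA header and only one sequence line; a real genome file needs its header lines skipped and its sequence lines combined.

```shell
# A made-up 10-nucleotide sequence, stored on one line without a FASTA header.
printf 'ACGTGGCCAT\n' > toy_seq.txt

# Count the G and C characters: tr -cd deletes everything else, wc -c counts what is left (6).
tr -cd 'GCgc' < toy_seq.txt | wc -c

# Divide by the total length and print as a percentage: 6 of 10 gives "60% GC".
awk '{ n = length($0); gc = gsub(/[GCgc]/, ""); printf "%d%% GC\n", 100 * gc / n }' toy_seq.txt
```

In the awk line, the length is stored before gsub runs because gsub modifies the line while counting its replacements.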

- What to submit: A script called Ex1-2.csh:

The program gets a genome in FASTA format as an input parameter (you can assume the file is a fasta file with the right format)

The program outputs the answers to the two questions below. This should be the ONLY output of the program.

Question 1: What is the GC content of the given input genome? (Answer with a one-line UNIX command!)

The format of the output should be like this: 33% GC

Question 2: Count the occurrences of each pair of adjacent nucleotides in the genome (pairs overlap). For example, given the sequence "actctct", we'll get:
1 ac
3 ct
2 tc
What is the simplest way to do it (it may take more than one command)?

Bonus: Give the answer in one UNIX command. (If you are not sure of your answer, add it as a one-line comment in the script.)

As a comment: write the URL of the genome you've picked.

As a comment: write a short documentation: What is this organism? Where does it live? (Try finding the information in PubMed.)

As a comment: after running your program on the chosen genome, write (briefly): is the result surprising? Why?
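
The counts in the "actctct" example of Question 2 can be checked with a sketch like the one below. It is one possible approach, not necessarily the simplest one, and it assumes the whole sequence sits on a single line (the file toy_pairs.txt is made up); pairs that span a line break in a multi-line FASTA file would be missed.

```shell
# The example sequence from Question 2, on a single line.
printf 'actctct\n' > toy_pairs.txt

# Print every overlapping 2-character window, then tally identical windows.
# Gives 1 ac, 3 ct, 2 tc (uniq -c pads the counts with leading spaces).
awk '{ for (i = 1; i < length($0); i++) print substr($0, i, 2) }' toy_pairs.txt | sort | uniq -c
```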

Q3: Yeast Transcription Factor Binding Sites:

In this question we will do a toy analysis of transcription factor targets in yeast.

YeastTransfac.txt includes a list of all known yeast binding sites of transcription factors in the following format:
Gene.AC | Gene.DE | Factor.OS | Site.AC | Factor.AC | Factor.FA | Factor.CL | Class.CL | Site.SQ
where AC = accession number, DE = description, OS = organism, FA = factor name, CL = class, SQ = sequence.
This information was extracted from TRANSFAC. The file contains information about each binding site on the DNA (fields: Site.AC and Site.SQ), the target gene itself (fields: Gene.AC and Gene.DE) and the transcription factor that binds it (fields: Factor.AC, Factor.FA, Factor.CL and Class.CL).
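
A '|'-separated record like this can be split into fields with awk. The sketch below uses two made-up records (all values are placeholders, not real TRANSFAC entries) and assumes that the separator may be surrounded by spaces; check the actual file before relying on that.

```shell
# Two made-up records in the nine-field layout described above (placeholder values).
printf '%s\n' 'G0001 | example gene one | yeast | R0001 | T0001 | FactorX | classA | C0001 | ACGTAC' > toy_transfac.txt
printf '%s\n' 'G0002 | example gene two | yeast | R0002 | T0002 | FactorY | classB | C0002 | TTGACA' >> toy_transfac.txt

# Split on '|' with optional surrounding spaces, then print
# Gene.AC (field 1), Factor.FA (field 6) and Site.SQ (field 9).
awk -F ' *[|] *' '{ print $1, $6, $9 }' toy_transfac.txt
```

With the sample records this prints "G0001 FactorX ACGTAC" and "G0002 FactorY TTGACA".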

In YeastPromoters.tfa you will find a list of 5647 yeast promoters in FASTA format. Each entry consists of a short header with information on the gene, followed by a sequence of approximately 500 bp upstream of the transcription start site (TSS). Now answer the questions below using the UNIX commands we learned in class.

- What to submit? Submit a script Ex1-3.csh that contains:
The commands that calculate the answers to the questions

How many genes in YeastPromoters.tfa have the Reb1 binding sequence CGGGTAA in their promoter sequence? You can use the scripts FASTA2line.pl and FASTAfromline.pl to convert FASTA-formatted files to one-liners.
(Note: Remember to include sequences that contain the reverse complement of the motif.)

You now have a list of genes that contain the Reb1 binding-site sequence element. Do the putative target genes appear more often on certain chromosomes and less often on others? Print the 2 chromosomes with the highest number of targets found, and the 2 with the lowest (print ONLY the chromosome numbers: the two highest first, then the two lowest).

How many genes contain a known binding site of Reb1 based on YeastTransfac.txt? Create a file Reb1_targets.out with the Gene.AC information of the Reb1 targets. (Note that some genes contain more than one binding site; don't count them twice.)

Extract from YeastTransfac.txt all the sites that are known Rox1 targets, and print their sequence element (Site.SQ field).

Find the main transcription factors that regulate "histidine metabolism", "ribosomal RNA" and "galactose". For each of the three topics, find the factor with the highest number of targets related to that topic (these can be three different factors).
For this exercise we will define a gene as related to galactose, for example, if the description of the gene (field Gene.DE) contains the word galactose.
Print out the topic and the name of the main regulator, in the following format: "histidine metabolism => Reb1" (You can use any kind and number of spaces you prefer). Obviously Reb1 is not the right answer, that's just an example...
Use the command foreach in your answer.
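
Regarding the reverse complement mentioned in the Reb1 question: a binding site on the opposite strand appears in the promoter sequence as the reverse complement of the motif. A minimal sketch for deriving it, assuming the rev and tr utilities are available:

```shell
# Reverse the motif, then swap each base for its complement (A<->T, C<->G).
# For CGGGTAA this prints TTACCCG.
echo CGGGTAA | rev | tr 'ACGT' 'TGCA'
```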

Good Luck!
Now what?

Pack all your files in a tar file called Ex1.tar:
tar -cvf Ex1.tar Ex1-1.csh Ex1-2.csh Ex1-3.csh