Please read each question thoroughly, and think before answering.
Submission deadline: Tuesday 23/3, Ross closing time. This exercise is NOT intended to be done in pairs!
Submit the solutions automatically through Moodle from:
Submit exercises & see grades
Q1: Mastering Regular Expressions:
In this question, you'll rehearse some simple regular expressions. In order to do that, we'll practice on the file general_file.txt, using grep, egrep, sed and awk.
You may use all Unix commands that were shown in class.
- What to submit?
Queries
- Last 3 lines containing the string 'high'.
- Last 3 lines containing ONLY the word 'high' EXACTLY (e.g. not including 'higher' ).
- How many lines end with the sequence 'AA' OR 'AAG'?
- All lines excluding the first 2 that end with '----' (i.e. out of the relevant lines, exclude the first 2)
- How many words contain the string 'translation'?
- First line with a word containing both 'late' and 'reg' (in some order)
- How many lines contain a word that ends with the string 'ption' (e.g. transcription), where this word is NOT at the end of the LINE?
- How many lines contain the characters c, d, m, s in reverse alphabetical order?
- How many lines contain the character 'i' exactly two times?
- How many lines contain the character 'i' or 'I' (case insensitive) exactly two times?
- Print the lines in rows number 10-15
- Out of these lines (rows 10-15), print only the lines where the word in the second column begins with a letter, but a NON-capital letter
- In how many lines does the word in the second column have fewer than 12 characters?
- In how many lines does the word in the second column have exactly 10 characters?
- What is the length of the shortest word in the file? (Note that you should check the length of the words, not the lines, and do not count empty words)
- What is the longest word in the file? (again: words, not lines)
- How many lines contain the same character twice, but NOT next to each other ('xyx' or 'xyzux' etc. but NOT 'xyyx' or 'xx')?
- And how many lines contain exactly the repeating 'xyyx' pattern (x and y could be any characters but not the same)?
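A few of the techniques these queries call for can be sketched on a tiny made-up sample. This is only an illustration written for a POSIX shell: the sample lines are invented and are not taken from general_file.txt, and the thresholds and patterns are chosen for the toy data rather than for the actual queries.

```shell
# Made-up sample data; this is NOT general_file.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1 high value
2 higher value
3 Another high
4 riverside bank
5 xyyx pattern
EOF

# Whole-word match: -w accepts 'high' but rejects 'higher'.
whole_word=$(grep -cw 'high' "$sample")

# Last 2 matching lines: filter first, then keep the tail.
last_two=$(grep -w 'high' "$sample" | tail -n 2)

# Exactly two 'i' characters: anchor both ends of the line and allow
# only non-'i' characters around the two occurrences.
two_i=$(grep -c '^[^i]*i[^i]*i[^i]*$' "$sample")

# Backreferences: \1 and \2 repeat whatever the groups captured.
# Note that this pattern alone does not force the two characters to differ.
mirror=$(grep -c '\(.\)\(.\)\2\1' "$sample")

# Column test with awk: second field shorter than 5 characters.
short_col=$(awk 'length($2) < 5 { n++ } END { print n + 0 }' "$sample")

echo "$whole_word $two_i $mirror $short_col"
rm -f "$sample"
```

The same building blocks (anchors, negated character classes, backreferences, and awk field tests) combine in different ways across the queries; each pattern should be checked on lines that are expected NOT to match as well as on lines that are.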
Q2: Whole-Genome Nucleotide Composition:
- Background: Recall lesson 1, where we calculated the whole-genome nucleotide composition of our favorite organisms.
- Thermophile: In this part, you're asked to pick a thermophilic organism, download its genome from the web (you can find useful links here), and do an analysis similar to the one we did in class: calculate the GC content.
The GC content is the percentage of G and C nucleotides in the genome.
- What to submit: A script called Ex1-2.csh:
- The program gets a genome in FASTA format as an input parameter (you can assume the file is a fasta file with the right format)
- The program outputs the answers to the two questions below - this should be the ONLY output of the program.
- Question 1: What is the GC content of the given input genome? (In a one-line UNIX command!)
The format of the output should be like this:
33% GC
- Question 2: Count the occurrences of each (overlapping) pair of adjacent nucleotides in the genome. For example, given the sequence "actctct", we'll get:
1 ac
3 ct
2 tc
How would you do it in the simplest way (could be more than one command)?
- Bonus: if the answer is given in one UNIX command (if you are not sure of your answer, add it as a one-line comment in the script)
- As a comment: write the URL of the genome you've picked.
- As a comment: write a short documentation: What is this organism? Where does it live? (Try finding the information in PubMed)
- As a comment: after running your program on the chosen genome, write (briefly): is the result surprising? Why?
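The two computations described above can be sketched on a toy input. This is a hedged illustration for a POSIX shell, not the expected Ex1-2.csh (it would need adapting to csh, and it is not written as a single command): the toy FASTA file is invented, and truncating the percentage to an integer is an assumption based on the "33% GC" sample format.

```shell
# Invented toy FASTA file holding the sequence "actctct" split over two lines.
toy=$(mktemp)
cat > "$toy" <<'EOF'
>toy_sequence (made-up header)
actc
tct
EOF

# GC content: drop header lines, join the sequence lines, then compare
# the number of G/C characters with the total length.
# gsub() returns how many characters it replaced, i.e. the G+C count.
gc=$(grep -v '^>' "$toy" | tr -d '\n' |
     awk '{ total = length($0); n = gsub(/[GCgc]/, ""); printf "%d%% GC\n", 100 * n / total }')

# Overlapping adjacent pairs: emit every 2-character window, then tally.
# Joining the lines first means pairs that span a line break are counted too.
pairs=$(grep -v '^>' "$toy" | tr -d '\n' |
        awk '{ for (i = 1; i < length($0); i++) print substr($0, i, 2) }' |
        sort | uniq -c | awk '{ print $1, $2 }')

echo "$gc"
echo "$pairs"
rm -f "$toy"
```

On the toy sequence, 3 of the 7 bases are C and none are G, so the percentage printed is 3/7 truncated to an integer. The final awk only normalizes the leading spaces that uniq -c adds.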
Q3: Yeast Transcription Factor Binding Sites:
In this question we will do a toy analysis of transcription factor targets in yeast.
YeastTransfac.txt includes a list of all known yeast binding sites of transcription factors in the following format:
Gene.AC | Gene.DE | Factor.OS | Site.AC | Factor.AC | Factor.FA | Factor.CL | Class.CL | Site.SQ
where AC = accession number, DE = description, OS = organism, FA = factor name, CL = class, SQ = sequence.
This information was extracted from TRANSFAC. In the file you have information about each binding site on the DNA (fields: Site.AC and Site.SQ), the target gene itself (fields: Gene.AC and Gene.DE) and the transcription factor that binds it (fields: Factor.AC, Factor.FA, Factor.CL and Class.CL).
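As an illustration of working with this nine-field layout, the sketch below splits records on the '|' delimiter with awk. The two records and all their values (gene IDs, FactorA, FactorB, sequences) are invented placeholders, and the exact spacing around '|' in the real YeastTransfac.txt is an assumption, which is why the field separator tolerates optional spaces.

```shell
# Two invented records in the nine-field layout; all values are placeholders.
rec=$(mktemp)
cat > "$rec" <<'EOF'
G000001 | example gene one | yeast | R00001 | T00001 | FactorA | C0001 | class one | ACGTACGT
G000002 | example gene two | yeast | R00002 | T00002 | FactorB | C0002 | class two | TTGACA
EOF

# Field separator: a '|' with optional spaces on either side.
# Field 6 is Factor.FA and field 9 is Site.SQ in the layout above.
sites=$(awk -F ' *[|] *' '$6 == "FactorB" { print $9 }' "$rec")

# Distinct Gene.AC values (field 1); sort -u removes repeated genes.
genes=$(awk -F ' *[|] *' '{ print $1 }' "$rec" | sort -u | wc -l | tr -d ' ')

echo "$sites"
echo "$genes"
rm -f "$rec"
```

Comparing a whole field with == avoids accidental matches elsewhere in the record, for example a factor name that also appears inside a gene description.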
In YeastPromoters.tfa you will find a list of 5647 yeast promoters in FASTA format. This format includes a short header with information on the gene, and a sequence of approximately 500 bp upstream of the transcription start site (TSS). Now answer the questions below by using the Unix commands we learned in class.
- What to submit?
Submit a script Ex1-3.csh that contains:
The commands that calculate the answers to the questions
- How many genes in YeastPromoters.tfa have the Reb1 binding sequence CGGGTAA in their promoter sequence?
You can use the scripts FASTA2line.pl and FASTAfromline.pl to convert FASTA-formatted files to one-liners and back.
(Note: Have you remembered to include sequences with reverse complement motifs? You should.)
- You have found a list of genes which have the sequence element of the Reb1 binding site. Do the putative target genes appear more in certain chromosomes, and less in others?
Print out the 2 chromosomes with the highest number of targets found, and the 2 with the lowest (print ONLY the chromosome number, and print the two highest first, and then the two lowest).
- How many genes contain a known binding site of Reb1 based on YeastTransfac.txt?
Create a file Reb1_targets.out with the Gene.AC information of the Reb1 targets (Note that some genes contain more than one binding site - don't count them twice)
- Extract from YeastTransfac.txt all the sites that are known Rox1 targets, and print their sequence element (Site.SQ field).
- Find the main transcription factors which regulate "histidine metabolism", "ribosomal RNA" and "galactose".
For each of the three topics, search for the factor which has the highest number of
targets related to this topic (these can be three different factors).
For this exercise we will define a gene as related to galactose, for example, if the description of the gene (field Gene.DE)
contains the word galactose.
Print out the topic and the name of the main regulator, in the following format: "histidine metabolism => Reb1"
(You can use any kind and number of spaces you prefer). Obviously Reb1 is not the right answer, that's just an example...
Use the command foreach in your answer.
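The reverse-complement note in the first question can be made concrete with a small sketch. It is written for a POSIX shell (so it uses a shell function rather than csh syntax), the helper name revcomp is hypothetical, and the three one-line records are invented rather than taken from YeastPromoters.tfa.

```shell
# Hypothetical helper: complement each base with tr, then reverse the
# string with awk by walking it from the last character to the first.
revcomp() {
    printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }'
}

motif=CGGGTAA
rc=$(revcomp "$motif")

# Invented one-line records: name, tab, sequence.
seqs=$(mktemp)
printf 'p1\tAAACGGGTAATT\np2\tGGTTACCCGAA\np3\tACGTACGT\n' > "$seqs"

# Count records containing the motif on either strand. grep -c counts
# lines, so a record holding both forms is still counted once.
hits=$(grep -c -E "$motif|$rc" "$seqs")

echo "$rc $hits"
rm -f "$seqs"
```

Here p1 carries the motif itself, p2 carries only its reverse complement, and p3 carries neither, so searching for the forward motif alone would miss p2.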
Good Luck!
Now what?
Pack all your files in a tar file called Ex1.tar:
tar -cvf Ex1.tar Ex1-1.csh Ex1-2.csh Ex1-3.csh