Please read each question thoroughly, and think before answering.
Submission deadline: Wednesday 9/3, Ross closing time. This exercise is NOT intended to be done in pairs!
Submit the solutions through Moodle, via the "Submit exercises & see grades" link.
Q1: Mastering Regular Expressions:
In this question, you'll rehearse some simple regular expressions. To do that, we'll practice on the file general_file.txt, using grep, egrep, sed and awk.
You may use all Unix commands that were shown in class.
- What to submit?
- A Unix shell script file called 'Ex1-1.csh' that will run on a given file and answer all the queries below.
- Note: You're supposed to submit one UNIX command (line) per item. Use | (a pipe) to chain commands.
- The output of the program should be ONLY the number of each query followed by the answer to that query.
- The format should be EXACTLY "#Qx:", where x is the number of the query.
E.g. query number 7: find all lines that contain AAA in the file input.txt. In the script you should write:
echo "#Q7:"
grep AAA input.txt
- Make sure the files can be executed (chmod command), and that the first line is a valid shebang line (e.g. #! /bin/csh -f )
- The script should get the name of the input file as an argument.
Usage: Ex1-1.csh <file>
For example, running the script on the file general_file.txt:
<206|0>bioskill:~> Ex1-1.csh general_file.txt
Queries
- Last 5 lines containing the string 'domain'.
- Last 3 lines containing ONLY the word 'domain' EXACTLY (e.g. not including 'homeodomain').
- How many lines end with the sequence 'TA' OR 'AAG'?
- All lines that end with '----', excluding the first 3 (i.e. out of the relevant lines, exclude the first 3).
- How many words contain the string 'high'?
- First line with a word containing both 'late' and 'reg' (in either order).
- How many lines contain a word that ends with the string 'ption' (e.g. transcription), where this word is NOT at the end of the LINE?
- How many lines contain the characters c, d, m, s in reverse alphabetical order?
- How many lines contain the character 'i' exactly two times?
- How many lines contain the character 'i' or 'I' (case insensitive) exactly two times?
- Print the lines in rows number 12-17.
- Out of these lines (rows 12-17), print only the lines where the word in the second column begins with a letter, but NOT a capital letter.
- In how many lines does the word in the second column have fewer than 12 characters?
- In how many lines does the word in the second column have exactly 10 characters?
- What is the length of the shortest word in the file? (Note that you should check the length of the words, not the lines, and do not include the empty string.)
- What is the longest word in the file? (Again: words, not lines.)
- How many lines contain exactly the repeating 'xyyx' pattern (x and y can be any characters, but not the same character)?
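Several of the queries above ask about lines and others about words, and the two counts generally differ. As a minimal sketch of the distinction (the file toy_words.txt and the string 'cat' are made up for illustration and are not part of the exercise; a "word" is assumed here to be a run of characters separated by spaces), the first grep counts matching lines and the second counts matching words:

```shell
# Build a small hypothetical input file (not part of the exercise data).
printf 'cat catalog dog\nbird\nconcat cat\n' > toy_words.txt

# Count LINES that contain 'cat'.
grep -c cat toy_words.txt

# Count WORDS that contain 'cat': put one word per line, then count matches.
tr -s ' ' '\n' < toy_words.txt | grep -c cat
```

On this toy input the first command prints 2 and the second prints 4. These are ordinary Unix tools, so the same lines should work unchanged inside a csh script.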
Q2: Whole-Genome Nucleotide Composition:
- Background: Recall lesson 1, where we calculated the whole-genome nucleotide composition of our favorite organisms.
- Thermophile:
In this part, you're asked to pick a thermophilic organism, download its genome from the web (You can find useful links here), and do an analysis similar to the one we did in class: calculate the GC-content.
The GC-content is the percentage of G and C nucleotides in the genome.
- What to submit: A script called Ex1-2.csh:
- The program gets a genome in FASTA format as an input parameter (you can assume the file is a FASTA file with the right format).
- The program outputs the answers to the two questions below - this should be the ONLY output of the program.
- Question 1: What is the GC-content of the given input genome? (In a one-line UNIX command!)
The format of the output should be like this: 33% GC
- Question 2: Count the number of occurrences of each (overlapping) pair of adjacent nucleotides in the genome. For example, given the sequence "actctct", we'll get:
1 ac
3 ct
2 tc
How would you do it in the simplest way (it could take more than one command)?
- Bonus: if you answer in one UNIX command. (If you are not sure of your answer, add it in one line as a comment in the script.)
- As a comment: write the URL of the genome you've picked.
- As a comment: write a short documentation: What is this organism? Where does it live? (Try finding the information in PubMed.)
- As a comment: after running your program on the chosen genome, write (briefly): is the result surprising? Why?
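To see where the counts in the Question 2 example come from, it can help to list every overlapping window of length two in "actctct". The awk one-liner below only illustrates what "overlapping adjacent pairs" means. It is not the required solution: tallying the pairs and handling a real FASTA file (header lines, line breaks inside the sequence) are left to your script.

```shell
# Print each overlapping pair of adjacent characters, one per line.
# substr(s, i, 2) is the two-character window starting at position i
# (awk counts positions from 1).
echo actctct | awk '{ for (i = 1; i < length($0); i++) print substr($0, i, 2) }'
```

A sequence of 7 nucleotides yields 6 windows: ac, ct, tc, ct, tc, ct. That gives 1 ac, 3 ct and 2 tc, as in the example.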
Q3: Yeast Transcription Factor Binding Sites:
In this question we will do a toy analysis of transcription factor targets in yeast.
YeastTransfac.txt includes a list of all known yeast binding sites of transcription factors in the following format:
Gene.AC | Gene.DE | Factor.OS | Site.AC | Factor.AC | Factor.FA | Factor.CL | Class.CL | Site.SQ
where AC = accession number, DE = description, OS = organism, FA = factor name, CL = class, SQ = sequence.
This information was extracted from TRANSFAC.
In the file you have information about each binding site on the DNA (fields: Site.AC and Site.SQ), the target gene itself (fields: Gene.AC and Gene.DE) and the transcription factor that binds it (fields: Factor.AC, Factor.FA, Factor.CL and Class.CL).
In YeastPromoters.tfa you will find a list of 5647 yeast promoters in FASTA format. This format includes a short header with information on the gene, and a sequence of approximately 500 bp upstream of the transcription start site (TSS). Now answer the questions below by using the Unix commands we learned in class.
- What to submit? Submit a script Ex1-3.csh that contains the commands that calculate the answers to the questions:
- How many genes in YeastPromoters.tfa have the Reb1 binding sequence CGGGTAA in their promoter sequence? You can use the scripts FASTA2line.pl and FASTAfromline.pl to convert FASTA formatted files to one-liners and back.
(Note: Have you remembered to include sequences with reverse complement motifs? You should.)
- You have found a list of genes which have the sequence element of the Reb1 binding site. Do the putative target genes appear more in certain chromosomes, and less in others? Print out the 2 chromosomes with the highest number of targets found, and the 2 with the lowest (print ONLY the chromosome number, and print the two highest first, and then the two lowest).
- How many genes contain a known binding site of Reb1 based on YeastTransfac.txt? Create the file Reb1_targets.out with the Gene.AC information of the Reb1 targets. (Note that some genes contain more than one binding site - don't count them twice.)
- Extract from YeastTransfac.txt all the sites that are known Gcn4 targets, and print their sequence element (Site.SQ field).
- Find the main transcription factors which regulate "histidine metabolism", "ribosomal RNA" and "galactose". For each of the three topics, search for the factor which has the highest number of targets related to this topic (these can be three different factors). For this exercise we will define a gene as related to galactose, for example, if the description of the gene (field Gene.DE) contains the word galactose.
Print out the topic and the name of the main regulator in the following format: "histidine metabolism => Reb1" (you can use any kind and number of spaces you prefer). Obviously Reb1 is not the right answer, that's just an example...
Use the command foreach in your answer.
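Regarding the reverse-complement note in the first question of this list: a binding site on the opposite strand shows up in the promoter file as the reverse complement of the motif. A small sketch of how that string could be derived with standard tools (this assumes the rev utility is installed and that the motif is written with uppercase A, C, G, T only):

```shell
# Reverse the motif, then swap each base for its complement (A<->T, C<->G).
echo CGGGTAA | rev | tr ACGT TGCA
```

This prints TTACCCG, so a promoter that contains either CGGGTAA or TTACCCG carries the motif on one of its two strands.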
Good Luck!
Now what?
Pack all your files in a tar file called Ex1.tar:
tar -cvf Ex1.tar Ex1-1.csh Ex1-2.csh Ex1-3.csh