Workshop in Computational Bioskills - Exercise 3: Writing a Web-Crawler & Visualizing results

Deadline: 11/5
To be done in pairs.
Submit your solutions through:
here


This exercise has two parts. In the first part you will visualize the results of a study of transcription factor binding sites. In the second part you will write a web-crawler designed to mirror a website for offline browsing.

Part 1 : Visualization

In this part you will visualize the results of a large-scale experimental assay by two different methods. The first uses the UCSC Genome Browser, and the second summarizes some statistics on the data in a graph.
We will use results taken from a paper by Harbison et al. on transcription factor binding sites (TFBS) of different TFs in S. cerevisiae (baker's yeast). The TFBS are based on ChIP-on-chip experiments and evolutionary conservation.

Write a program for visualizing and analyzing TFBS called TFBSvis.pl:
General description:
The purpose of the program is to give information on known TFBS and their location on the genome, including visualization of this data.
Your program should visualize the distribution of the number of binding sites per TF using a graph (you can choose any type of graph, such as a bar graph, a line graph, etc.).
In addition you should present the TFBS information as a 'BED' format file, which can be browsed in the Yeast Genome Browser.


Now let's do it step by step:
Step 1: The transcription factors binding site (TFBS) data

Copy this file, containing the TFBS data, yeast_TF_MAP.IGR_v24.2.p001b.txt 
(You can use the command wget to download the file). 
If you have a space problem, you can copy the file to a tmp directory and create a soft link in your working directory.
When I test your program, I will assume that the file is located in the current working directory.

Step 2: TFBSvis.pl program

Write a program TFBSvis.pl : 
  1. Print the following usage message:
     <204|0>bioskill:~>TFBSvis.pl
     Usage: TFBSvis.pl -chr <chr_id> [-bed]
  2. chr flag : Given a chromosome id as an input (e.g. chr1), the program will create a graph presenting the distribution of number of TFBS per TF in the given chromosome, and print it to a file in PNG format, called chr_id.png (e.g. chr1.png).
    • Choose the appropriate graph type (there are several good options), using the GD::Graph modules.
    • In the graph you should display the distribution of the number of binding sites per TF, i.e. how many TFs have fewer than 5 targets, exactly 100 targets, etc. (this is only an example; you should choose the numbers, but you do NOT need to display how many targets each TF has).
    • If the given chr-id is ALL, you should display the TFBS distribution in all the chromosomes combined together.
    • The chromosome name should be in the format (chr[0-9]* or ALL). You can assume correct input.
    • Don't forget to write the name of the chromosome (or all) in the title of the graph.
  3. Given the flag '-bed', the program will print the information on TFBS in a 'BED' format file named TFBS.BED. This file can be uploaded as a 'custom annotation track' to the Genome Browser - check it out!
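One possible way to approach item 2 is to count the binding sites of each TF, sort those counts into bins, and plot the number of TFs per bin. The sketch below assumes the per-TF counts are already stored in a hash. The bin edges, the 600x400 image size, and the helper names bin_counts and plot_distribution are illustrative assumptions, not part of the exercise. It uses GD::Graph::bars, but another GD::Graph chart type could be substituted.

```perl
use strict;
use warnings;

# Hypothetical bin edges: each bin is [low, high]; undef means open-ended.
my @BINS = ([1, 4], [5, 9], [10, 19], [20, 49], [50, 99], [100, undef]);

# Turn a hash ref (TF name => number of binding sites) into two parallel
# arrays: bin labels and the number of TFs that fall into each bin.
sub bin_counts {
    my ($sites_per_tf) = @_;
    my @labels = map { defined $_->[1] ? "$_->[0]-$_->[1]" : "$_->[0]+" } @BINS;
    my @totals = (0) x @BINS;
    for my $n (values %$sites_per_tf) {
        for my $i (0 .. $#BINS) {
            my ($low, $high) = @{ $BINS[$i] };
            next if $n < $low || (defined $high && $n > $high);
            $totals[$i]++;
            last;
        }
    }
    return (\@labels, \@totals);
}

# Draw the binned distribution as a bar chart and save it as <chr_id>.png.
sub plot_distribution {
    my ($chr_id, $labels, $totals) = @_;
    require GD::Graph::bars;    # loaded here so bin_counts works without GD
    my $graph = GD::Graph::bars->new(600, 400);
    $graph->set(
        title   => "TFBS per TF: $chr_id",
        x_label => 'Number of binding sites',
        y_label => 'Number of TFs',
    ) or die $graph->error;
    my $image = $graph->plot([$labels, $totals]) or die $graph->error;
    open my $out, '>', "$chr_id.png" or die "Cannot write $chr_id.png: $!";
    binmode $out;
    print {$out} $image->png;
    close $out;
}
```

Given a hash %count of per-TF totals, calling bin_counts(\%count) and passing the two returned array references to plot_distribution('chr1', ...) would write chr1.png with the chromosome name in the title.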

Step 3: Graphs

To draw the graphs, use the Perl module GD::Graph and then save the graph as a PNG file. More specifically, use the GD::Graph module you need for the type of graph you want, e.g. GD::Graph::bars.

Step 4: The BED format file

Given the flag '-bed', the program will print the TFBS information into a file named TFBS.BED. This file has a special format that allows it to be browsed in the graphical Genome Browser. This is a very useful graphical display, where any researcher can display results (as a .BED file). You can upload the BED file as a 'custom annotation track' to the Genome Browser.
The file should contain for each TFBS, its location on the genome (chromosome,start position,end position and strand) and the name of the binding factor. The description of the format can be found here. For example TFBS.BED file:
browser position chr1:14638-15638
track name=example description="BioEx4 TFBS" color=255,0,0
chr1 14638 14650 STE1
chr1 14646928 14646938 RAP12
chr1 14648136 14648150 RAP12
chr1 14649932 14649943 STE1
chr2 14655206 14655215 MSN3
...

The first line defines the location of the browser - make sure it starts from the first TFBS in chromosome 1. The second line defines the name and color of the presentation. You are welcome to try and add other information to this basic presentation.
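As a rough illustration of producing such a file, the sketch below builds the BED text from a list of site records. The record keys (chr, start, end, name), the helper names, the tab separator, and the 1000-base browser window (copied from the example above) are assumptions of this sketch. It writes only the four columns shown in the example; in BED, strand is the sixth column and needs a score column before it. Whether the coordinates in the input file need adjusting for BED is not addressed here.

```perl
use strict;
use warnings;

# Extract the numeric part of a chromosome id such as "chr12" (0 if none).
sub chr_number {
    my ($chr) = @_;
    return $chr =~ /(\d+)/ ? $1 : 0;
}

# Build the text of a BED custom track from an array ref of site records.
# Each record is a hash ref with the (hypothetical) keys chr, start, end, name.
sub bed_text {
    my ($sites) = @_;
    my @sorted = sort {
        chr_number($a->{chr}) <=> chr_number($b->{chr})
            || $a->{start} <=> $b->{start}
    } @$sites;

    my @lines;
    # Open the browser on a 1000-base window starting at the first chr1 site.
    my ($first) = grep { $_->{chr} eq 'chr1' } @sorted;
    if ($first) {
        my $window_end = $first->{start} + 1000;
        push @lines, "browser position chr1:$first->{start}-$window_end";
    }
    push @lines, 'track name=example description="BioEx4 TFBS" color=255,0,0';
    push @lines, join("\t", @{$_}{qw(chr start end name)}) for @sorted;
    return join("\n", @lines) . "\n";
}
```

The returned string can then be printed to a filehandle opened on TFBS.BED.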

Some comments:
  • Parse the command line using the Getopt::Long package; it's much easier (more details below).
  • Print a clear usage message (i.e. how to use the program) if the user calls it with too few or too many parameters.
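A minimal sketch of the command-line handling with Getopt::Long might look like this. The function name parse_args and the decision to treat any leftover argument as an error are assumptions. GetOptionsFromArray is used instead of GetOptions only so that the parser can be called on any argument list.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

my $USAGE = "Usage: TFBSvis.pl -chr <chr_id> [-bed]\n";

# Parse an argument list. Returns a hash ref of options, or undef when the
# arguments do not match the usage line.
sub parse_args {
    my @args = @_;
    my %opt  = (bed => 0);
    my $ok   = GetOptionsFromArray(\@args, \%opt, 'chr=s', 'bed');
    return unless $ok;                 # unknown flag or missing value
    return if @args;                   # leftover words: too many parameters
    return unless defined $opt{chr};   # -chr is mandatory
    return \%opt;
}

# In the script itself one could then write:
#   my $opt = parse_args(@ARGV) or die $USAGE;
```

Getopt::Long accepts the single-dash form (-chr, -bed) by default, so the flags can be given exactly as in the usage message.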



Part 2 : Writing a Web-Crawler (Robot)

In this part of the exercise you are asked to write a program called robot.pl, which mirrors a website for offline browsing, i.e. creates a local copy of a web site while keeping its hierarchical directory structure.

First of all, make sure you understand the examples in Lesson 6. This should give you a pretty good start. Read some of the LWP manuals and the guidelines for robot writers (from the lesson).

When you design your program think of the following points:

  • How to manipulate lots of links/files simultaneously :
    What data structure do you need ?
  • Make sure you're browsing smartly - Visit URLs once and only once.
  • You'll need to handle local files.
    Think how to create directories and files - Use functions !
  • Remember to make sure that the mirrored copy of the site works locally (i.e. change absolute links to relative).
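One common data structure for the first two points is a queue of URLs still to fetch plus a hash of URLs already seen. The sketch below is only an illustration: crawl_order is a hypothetical name, and the link lookup is passed in as a code reference so the traversal can be tried without any network access. In a real robot that callback would fetch the page and extract its links.

```perl
use strict;
use warnings;

# Breadth-first traversal that visits every reachable URL exactly once.
# $links_of is a code ref that returns the URLs linked from a given page.
sub crawl_order {
    my ($start, $links_of) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;
    while (defined(my $url = shift @queue)) {
        push @visited, $url;
        for my $next ($links_of->($url)) {
            next if $seen{$next}++;    # already queued or visited
            push @queue, $next;
        }
    }
    return @visited;
}
```

Marking a URL as seen when it is queued, rather than when it is fetched, keeps the same link from entering the queue twice. URLs that differ only in a #fragment point to the same page, so they would need to be normalized before the lookup.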

Program specifications

robot.pl [options] URL

where the options include:

  • -dir <name>
    The name of the directory where you should store the local copy of the mirrored site.
    Create the directory if needed.
    Default value: __$$, where $$ is the process id of the current run of the program
    example: robot.pl -dir=tmp http://www.cs.huji.ac.il/~bioskill/syllabus.html
  • -base <URL>
    Ignore all links outside this given base domain/URL.
    Build the local directory tree starting from it.
    Default value: use the input URL as the base
    example 1: robot.pl http://www.cs.huji.ac.il/~bioskill/Lesson7b/index.html
    will stay under
    http://www.cs.huji.ac.il/~bioskill/Lesson7b, and will name all the local files relatively. e.g. :
    • The given URL will be saved as __$$/index.html
    • The link http://www.cs.huji.ac.il/~bioskill/Lesson7b/MAN/LWP::Simple.html should be saved as: __$$/MAN/LWP::Simple.html
    • The link: http://www.cs.huji.ac.il/~bioskill should not be saved since it's not under the base URL.
    example 2: robot.pl -base=http://www.cs.huji.ac.il/~bioskill http://www.cs.huji.ac.il/~bioskill/Lesson7b/index.html
    will save the input URL file as __$$/Lesson7b/index.html
    Since Lesson7b links to the main page of the course, this run will eventually mirror all the bioskill site.
    Note that if the starting URL is not under the base URL, the program should do nothing.
  • -img
    When given this flag, mirror all links, including images.
    If omitted, follow only 'a' links (e.g. "a href=URL"). Following only 'a' links is the default!
    example: robot.pl -img -dir=tmp http://www.cs.huji.ac.il/~bioskill/syllabus.html
  • -proxy <proxy>:
    Specifies the URL of the proxy server to use.
    example: robot.pl -proxy=http://wwwproxy.huji.ac.il:8080 http://www.cs.huji.ac.il/~bioskill/syllabus.html
    If omitted, no proxy server is used.
  • -time <secs>:
    Specifies the maximum amount of time (in seconds) you're willing to wait when sending a request.
    After that period of time, the program should continue to the next request. Use the 'timeout' option for this (search in the manuals). The default is 15.
    example: robot.pl -time 10 http://www.cs.huji.ac.il/~bioskill/index.html
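The -dir and -base rules can be sketched as a mapping from a URL to a local file name. Everything in this sketch beyond the examples above is an assumption: the helper names, the heuristic that a final path segment containing a dot is a file name, and the choice to store a directory URL as index.html. It uses only core modules and does no fetching.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);

# Reduce a base URL to its directory: a trailing segment that contains a dot
# (e.g. index.html) is treated as a file name and dropped. Heuristic only.
sub base_dir {
    my ($base) = @_;
    $base =~ s{^(\w+://[^/]+(?:/.*)?)/[^/]*\.[^/]*$}{$1};
    $base =~ s{/+$}{};
    return $base;
}

# Map a URL to a path under $dir, relative to the base.
# Returns undef when the URL is not under the base.
sub local_path {
    my ($url, $base, $dir) = @_;
    my $root = base_dir($base);
    $url =~ s/[#?].*$//;               # ignore fragment and query string
    $url .= '/' if $url eq $root;      # the base directory itself
    return unless index($url, "$root/") == 0;
    my $rel = substr($url, length($root) + 1);
    $rel .= 'index.html' if $rel eq '' || $rel =~ m{/$};
    return "$dir/$rel";
}

# Write content to a local path, creating parent directories as needed.
sub save_page {
    my ($path, $content) = @_;
    make_path(dirname($path));
    open my $out, '>', $path or die "Cannot write $path: $!";
    binmode $out;
    print {$out} $content;
    close $out or die "Cannot close $path: $!";
}
```

Comparing against the base followed by a slash keeps a sibling such as ~bioskill2 from being treated as part of ~bioskill.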

Some relevant issues

Important robots related issues:

  • Your agent should be called "bioskill", and should follow the site's "robots.txt" rules.
  • Use a delay of at least one second when mirroring a site.
  • Debug your program inside CS and with no proxy.
    When debugging you can use no delays, but only for short periods of time (up to one minute runs).
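One way to meet these requirements is LWP::RobotUA, which consults a site's robots.txt and spaces out requests to the same server. The sketch below is an assumption about how the pieces could fit together, not the required design: make_agent is a hypothetical helper, the two-second delay is an arbitrary value above the one-second minimum, and the contact address must be supplied by the caller because LWP::RobotUA requires one.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Build a robots.txt-aware user agent named "bioskill".
# Options (names chosen for this sketch):
#   from  - contact e-mail address (required by LWP::RobotUA)
#   time  - request timeout in seconds (default 15)
#   proxy - proxy URL (optional)
sub make_agent {
    my (%opt) = @_;
    my $ua = LWP::RobotUA->new('bioskill', $opt{from});
    $ua->delay(2 / 60);                # delay is in minutes: about two seconds
    $ua->timeout($opt{time} // 15);    # seconds to wait for each request
    $ua->proxy('http', $opt{proxy}) if defined $opt{proxy};
    return $ua;
}
```

Note that delay is given in minutes, not seconds, and that its default is a full minute between requests to the same server.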



What to submit?
A tar file named ex3.tar that includes robot.pl, TFBSvis.pl and a README file.
README should include the login and id of both authors. If you think the graphical presentation you are using requires explanations, write them in the README.


Good luck !