This exercise has two parts: in the first part you need to visualize the results of a study regarding transcription factor binding sites. In the second part you will write a web-crawler designed to mirror a website for offline browsing.
Part 1 : Visualization
In this exercise you will work on visualizing the results of a large-scale experimental assay, in two different ways: the first using the UCSC Genome Browser, and the second by summarizing some statistics on the data in a graph.
We will use the results taken from a paper by
Harbison et al.,
of transcription factor binding sites (TFBS) of different TFs in S. cerevisiae (baker's yeast).
The TFBS are based on ChIP-on-chip experiments and evolutionary conservation.
Write a program for visualizing and analyzing TFBS called TFBSvis.pl:
General description:
The purpose of the program is to give information on known TFBS and their location on the genome,
including visualization of this data.
Your program should visualize the distribution of
the number of binding sites per TF using a graph
(you can choose any type of graph, such as a bar graph, line graph, etc.).
In addition, you should present the TFBS information as a 'BED' format file,
which can be browsed in the Yeast Genome Browser.
Now let's do it step by step:
Step 1: The transcription factor binding site (TFBS) data
Copy this file containing the TFBS data: yeast_TF_MAP.IGR_v24.2.p001b.txt
(you can use the wget command to download the file).
If you have space problems, you can copy the file to a tmp directory and create a soft link in your working directory.
When I test your program, I will assume that this file is located in the current working directory.
Step 2: TFBSvis.pl program
Write a program TFBSvis.pl:
- When run with no arguments, print the following usage message:
<204|0>bioskill:~>TFBSvis.pl
Usage: TFBSvis.pl -chr <chr_id> [-bed]
- -chr flag: given a chromosome id as input (e.g. chr1),
the program will create a graph presenting the distribution of the number of TFBS per TF in the given chromosome,
and print it to a file in PNG format, called chr_id.png (e.g. chr1.png).
- Choose the appropriate graph type (there are several good options), using the GD::Graph modules.
- In the graph you should display the distribution of the number of binding sites per TF, i.e.
how many TFs have fewer than 5 targets, exactly 100 targets, etc.
(this is only an example; you should choose the numbers,
but you do NOT need to display how many targets each TF has). A short binning sketch follows this list.
- If the given chr_id is ALL, you should display the TFBS distribution in all the chromosomes combined.
- The chromosome name will be in the format chr[0-9]* or ALL. You can assume correct input.
- Don't forget to write the name of the chromosome (or ALL) in the title of the graph.
- Given the flag '-bed', the program will print the information on TFBS in a 'BED' format file named TFBS.BED.
This file can be uploaded as a 'custom annotation track'
to the Genome Browser - check it out!
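For the distribution itself, here is a minimal binning sketch; %sites_per_tf is a hypothetical hash (built while parsing the data file) mapping each TF name to its number of binding sites on the requested chromosome, and the bin edges are only an example:

my @bins   = (5, 10, 20, 50, 100);                       # example bin edges
my @labels = ('<5', '5-9', '10-19', '20-49', '50-99', '>=100');
my @counts = (0) x @labels;
for my $n (values %sites_per_tf) {
    my $i = 0;
    $i++ while $i < @bins && $n >= $bins[$i];            # find the bin for $n
    $counts[$i]++;                                       # one more TF in this bin
}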
Step 3: Graphs
To draw the graphs, use the Perl module GD::Graph and then save the graph as a PNG file.
More specifically, use the GD::Graph submodule for the type of graph you want, e.g. GD::Graph::bars.
If you forgot how to use the GD module, go back to Lesson 5.
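For example, a minimal plotting sketch (assuming @labels and @counts were built as in the binning sketch above, and that $chr_id holds the value of the -chr flag):

use GD::Graph::bars;

my $graph = GD::Graph::bars->new(600, 400);
$graph->set(
    x_label => 'Binding sites per TF',
    y_label => 'Number of TFs',
    title   => "TFBS distribution: $chr_id",
) or die $graph->error;
my $gd = $graph->plot([ \@labels, \@counts ]) or die $graph->error;
open(my $png, '>', "$chr_id.png") or die "Cannot write $chr_id.png: $!";
binmode $png;                      # PNG output is binary
print $png $gd->png;
close($png);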
Step 4: The BED format file
Given the flag '-bed', the program will print the TFBS information into a file named TFBS.BED.
This file has a special format that makes it browsable in the graphical Genome Browser.
This is a very useful graphical display, where any researcher can present results (as a .BED file).
You can upload the BED file as a 'custom annotation track' to the Genome Browser.
The file should contain, for each TFBS, its location on the genome (chromosome, start position, end position and strand)
and the name of the binding factor.
The description of the format can be found here. An example TFBS.BED file:
browser position chr1:14638-15638
track name=example description="BioEx4 TFBS" color=255,0,0
chr1 14638 14650 STE1
chr1 14646928 14646938 RAP12
chr1 14648136 14648150 RAP12
chr1 14649932 14649943 STE1
chr2 14655206 14655215 MSN3
...
The first line defines the initial browser position
- make sure it starts at the first TFBS in chromosome 1.
The second line defines the name and color of the presentation.
You are welcome to try and add other information to this basic presentation.
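For instance, a minimal sketch of writing the file, assuming a hypothetical @sites array of hash references (with chr/start/end/strand/tf keys, chromosome 1 sites first) collected while parsing the data file:

open(my $out, '>', 'TFBS.BED') or die "Cannot write TFBS.BED: $!";
# Start the browser at the first TFBS in chromosome 1 (the window size is up to you)
print $out "browser position chr1:$sites[0]{start}-", $sites[0]{start} + 1000, "\n";
print $out qq{track name=TFBS description="BioEx4 TFBS" color=255,0,0\n};
for my $s (@sites) {
    # The 4-column form shown above is the minimum; if you add strand,
    # standard BED expects it in column 6, after a score placeholder in column 5
    print $out join("\t", $s->{chr}, $s->{start}, $s->{end}, $s->{tf}), "\n";
}
close($out);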
Some comments:
o Parse the command line using the Getopt::Long package; it's much easier (more details below).
o Print a clear usage message (i.e. how to use the program) if the user calls it with too few or too many parameters.
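For instance, a minimal sketch of the option handling (the variable names are my own):

use strict;
use warnings;
use Getopt::Long;

my $usage = "Usage: TFBSvis.pl -chr <chr_id> [-bed]\n";
my ($chr_id, $bed);
GetOptions('chr=s' => \$chr_id, 'bed' => \$bed) or die $usage;
die $usage unless defined $chr_id;
# Optional sanity check (the exercise lets you assume correct input)
die $usage unless $chr_id =~ /^(chr[0-9]*|ALL)$/;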
Part 2 : Writing a Web-Crawler (Robot)
In this part of the exercise you are asked to write a program called robot.pl, which mirrors a website for offline browsing, i.e. creates a local copy of a web site while keeping its hierarchical directory structure.
First of all, make sure you understand the examples in Lesson 6. This should give you a pretty good start. Read some of the LWP manuals and the guidelines for robot writers (from the lesson).
When you design your program, think of the following points:
- How to manipulate lots of links/files simultaneously: what data structure do you need?
- Make sure you're browsing smartly - visit URLs once and only once.
- You'll need to handle local files. Think how to create directories and files - use functions!
- Remember to make sure that the mirrored copy of the site works locally (i.e. change absolute links to relative).
Program specifications
robot.pl [options] URL
where the options include:
- -dir <name>
The name of the directory where the local copy of the mirrored site should be stored.
Create the directory if needed.
Default value: __$$, where $$ is the process id of the current run of the program.
example:
robot.pl -dir=tmp http://www.cs.huji.ac.il/~bioskill/syllabus.html
- -base <URL>
Ignore all links outside this given base domain/URL.
Build the local directory tree starting from it.
Default value: use the input URL as the base.
example 1:
robot.pl http://www.cs.huji.ac.il/~bioskill/Lesson7b/index.html
will stay under http://www.cs.huji.ac.il/~bioskill/Lesson7b, and will name all the local files relatively, e.g.:
- The given URL will be saved as __$$/index.html
- The link http://www.cs.huji.ac.il/~bioskill/Lesson7b/MAN/LWP::Simple.html should be saved as __$$/MAN/LWP::Simple.html
- The link http://www.cs.huji.ac.il/~bioskill should not be saved, since it's not under the base URL.
example 2:
robot.pl -base=http://www.cs.huji.ac.il/~bioskill http://www.cs.huji.ac.il/~bioskill/Lesson7b/index.html
will save the input URL file as __$$/Lesson7b/index.html.
Since Lesson7b links to the main page of the course, this run will eventually mirror the whole bioskill site.
Note that if the starting URL is not under the base URL, the program should do nothing.
- -img
When given this flag, mirror all links, including images.
If omitted, follow only 'a' links (e.g. "a href=URL"); following only 'a' type links is the default!
example:
robot.pl -img -dir=tmp http://www.cs.huji.ac.il/~bioskill/syllabus.html
- -proxy <proxy>
Specifies the URL of the proxy server to use.
If omitted, no proxy server is used.
example:
robot.pl -proxy=http://wwwproxy.huji.ac.il:8080 http://www.cs.huji.ac.il/~bioskill/syllabus.html
- -time <secs>
Specifies the maximum amount of time (in seconds) you're willing to wait when sending a request.
After that period of time, the program should continue to the next request. Use the 'timeout' option for this (search in the manuals). The default is 15.
example:
robot.pl -time 10 http://www.cs.huji.ac.il/~bioskill/index.html
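Both options map directly onto LWP's user-agent methods. A sketch (shown with a plain LWP::UserAgent; the same calls work on the LWP::RobotUA suggested at the end of this exercise, and %opt follows the Getopt::Long sketch further down):

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->proxy('http', $opt{proxy}) if defined $opt{proxy};   # -proxy, if given
$ua->timeout($opt{time});                                 # -time value, default 15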
Some relevant issues
- What to mirror and where?
You need to create the same directory structure as in the original site, and maintain the same file names.
Never follow links outside of the base URL.
- Algorithm used for mirroring a site
You can retrieve the links in any order you want. You can think of it as a simple BFS (first mirror all the links in the current page and then dig deeper), a DFS, etc.
Note that different pages can contain the same links, so keep track of the URLs you have seen using the appropriate data structures - this helps you avoid looping.
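A minimal BFS sketch (fetch_and_save() and extract_links() are hypothetical helpers you would write yourself; $start_url and $base come from the command line):

my %visited;                            # each URL is fetched once and only once
my @queue = ($start_url);
while (@queue) {
    my $url = shift @queue;             # FIFO queue => breadth-first order
    next if $visited{$url}++;           # already handled => avoid loops
    next unless $url =~ /^\Q$base\E/;   # never leave the base URL
    my $html = fetch_and_save($url);    # download and store the local copy
    push @queue, extract_links($html, $url);
}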
- Parsing the command-line arguments
Use the Getopt::Long module! It's much easier. Take a look at this example.
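A sketch of the parsing, with the defaults from the specification above:

use strict;
use warnings;
use Getopt::Long;

my $usage = "Usage: robot.pl [-dir <name>] [-base <URL>] [-img] [-proxy <URL>] [-time <secs>] URL\n";
my %opt = (dir => "__$$", time => 15);                    # spec defaults
GetOptions(\%opt, 'dir=s', 'base=s', 'img', 'proxy=s', 'time=i') or die $usage;
my $url = shift @ARGV or die $usage;
$opt{base} = $url unless defined $opt{base};              # default base: the input URL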
- Absolute vs. local links
If the mirrored site uses local links (e.g. "Ex2/index.html"), you don't need to change the content of the html file you're mirroring, and the local copy will work perfectly.
But if you see absolute links (like "http://www.cs.huji.ac.il/~bioskill/Ex2/index.html"), you need to replace them with local ones (i.e. "index.html", "Ex2/index.html" or "../Ex2/index.html"), depending on the relative location of the linked URL and the current directory.
Some help with this:
To make your life easier you can look at this example: the program linkFix.pl gets as parameters a URL of an html file, the base URL (a directory), and the name of the home directory you are working in.
It replaces all the absolute links with local ones and saves the file.
To make the task easier, I used the full path to the file and not a relative one.
You can incorporate this code into your program.
Usage: linkFix.pl url base dir
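If you prefer not to use that code, the URI module can compute the relative form for you. A small sketch:

use URI;

my $abs  = URI->new('http://www.cs.huji.ac.il/~bioskill/Ex2/index.html');
my $base = URI->new('http://www.cs.huji.ac.il/~bioskill/syllabus.html');
print $abs->rel($base), "\n";           # prints "Ex2/index.html"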
- Working with files
You might find the Perl functions mkdir, chdir & -X useful.
- How to determine if a URL points to a directory or a file?
Tough one. (Note that some URLs pointing to a directory omit the last "/".)
In this exercise, we'll use the naive convention that file names always contain a dot ("."); otherwise they are directories.
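That convention translates to a few lines of Perl; is_file_url() below is a hypothetical helper name:

use URI;

# A URL is treated as a file iff its last path component contains a dot
sub is_file_url {
    my ($url) = @_;
    my ($last) = URI->new($url)->path =~ m{([^/]+)/?$};
    return defined $last && $last =~ /\./;
}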
- How to mirror URLs that point to a directory?
If mirroring a directory with no file name, save the content in "index.html".
This assumption will usually work, although some links, like http://www.cs.huji.ac.il/~bioskill/Data/dict/words, might cause problems...
Note that this convention might also create problems with Windows-based sites, which use the index.htm (instead of index.html) extension.
- Correctness of the input
Use the Heuristic module we saw in class to avoid small mistakes in the URL given by the user (both in the input URL and in the base URL). If the given URL does not exist, print an informative error message to the user.
- Broken Links
If you encounter any broken links, you should print an informative error message and CONTINUE.
Important robots-related issues:
- Your agent should be called "bioskill", and should follow the site's "robots.txt" rules.
- Use a delay of at least one second when mirroring a site.
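LWP::RobotUA handles both requirements for you; a sketch (the contact address is a placeholder, use your own):

use LWP::RobotUA;

my $ua = LWP::RobotUA->new('bioskill', 'you@example.com'); # agent name, contact
$ua->delay(1/60);   # delay() takes MINUTES; 1/60 minute = 1 second between requests
# robots.txt is fetched and obeyed automatically on every request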
Debug your program inside CS and with no proxy.
When debugging you can use no delays, but only for short periods of time (up to one-minute runs).