Workshop in Computational Bioskills - Lesson 5

Workshop in Computational Bioskills - Spring 2011

Lesson 5 - Perl & the Web

Part 1 - briefly on HTTP and HTML
Part 2 - Getting data from the Web
Part 3 - Web crawlers & Robots


Today's subject: The Web:

Part 1 - Briefly on HTTP and HTML


Main actors

Server: A machine that runs a special daemon program that waits for input on the communication ports.
Client (User Agent): Any program that communicates with the server. The most common (and friendly) clients are the browsers.
Proxy: A machine that sits between the client and the server, and acts as a server for the client (and as a client for the server...).
Advantages: caches popular pages; can act as a firewall.


URL - Uniform Resource Locator

General structure:
protocol://host:port/path?parameters#anchor
Example:
http://www.cs.huji.ac.il/~bioskill/Lesson6/index.html#PC 



HTTP - HyperText Transfer Protocol

Standardizes the communication between servers and clients.

HTTP Session

Client opens a connection
Client sends a request
Server returns a response
Server closes the connection



HTTP Request

[Method url version]
[header...]
[header...]
...
[message body]

Methods: GET, POST, HEAD, PUT, DELETE, TRACE
Version: HTTP/1.0, HTTP/1.1
Headers: User-Agent, Host, Connection... (HTTP/1.1 defines 64 possible headers)
POST method: sends information to the server (e.g. when filling a form); the parameters travel in the message body
GET method: sends information to the server (e.g. when filling a form); the parameters travel in the URL

HTTP Response

[version status-code reason-phrase]
[header...]
[header...]
...
[message body]

Status code: some you know for sure...
Headers: Date, Content-Type, Content-Length
Message body: what is displayed in the browser
The headers are a good way to get information about a document without fetching all of its content.
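
To make the request and response formats concrete, here is a minimal sketch of an HTTP/1.0 session over a raw socket (host and path are only examples; normally LWP does all of this for you):

use IO::Socket::INET;

# Client opens a connection
my $sock = IO::Socket::INET->new(
    PeerAddr => 'www.cs.huji.ac.il',
    PeerPort => 80,
    Proto    => 'tcp',
) or die "connect failed: $!";

# Client sends a request: request line, headers, then a blank line (CRLF-terminated)
print $sock "GET /~bioskill/ HTTP/1.0\r\n",
            "Host: www.cs.huji.ac.il\r\n",
            "User-Agent: BioskillsExample/0.1\r\n",
            "\r\n";

# Server returns a response: status line, headers, blank line, message body
while (my $line = <$sock>) {
    print $line;
}

# Server closes the connection; we close our end too
close $sock;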

HTML Links

Link syntax:

<a href="url">Link text</a>

The start tag contains attributes about the link.

The element content (Link text) defines the part to be displayed.


Fetching a URL in UNIX:

lynx [URL]: Simple textual browser

lynx -dump [-width=NUM] [URL]: Simple HTML2text

lynx -source [URL]: Retrieve HTML code of a web page

wget [URL]: Retrieve content of a URL

(You might need to set http_proxy to http://wwwproxy.huji.ac.il:8080)

Fetching a URL using Perl:

How to get the content of a URL?
use LWP::Simple;
$content = get($url);

Or a nicer version:
use LWP::Simple;
unless (defined ($content = get $URL)) {
    die "could not get $URL\n";
}


More professional modules:

o LWP::UserAgent
This module creates a virtual browser. The object returned from the new constructor is used to make the actual request.

o HTTP::Request
This module creates a request but doesn't send it yet.

o HTTP::Response
This is the object type returned when the user agent actually runs the request. We check it for errors and contents.

o URI::Heuristic
This curious little module uses guessing algorithms to expand partial URLs, for example:
perl => http://www.perl.com
www.oreilly.com => http://www.oreilly.com
ftp.funet.fi => ftp://ftp.funet.fi
/etc/passwd => file:/etc/passwd


o Let's start with a simple program: TitleBytes.pl - finding the title and size of documents.

Example run:
> TitleBytes.pl http://www.tpj.com/
http://www.tpj.com/ =>
The Perl Journal (109 lines, 4530 bytes)
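
A minimal sketch of what such a program might look like, built from the modules above (TitleBytes.pl itself is not reproduced here, so treat this as an assumption about its approach):

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use URI::Heuristic qw(uf_urlstr);

my $raw_url = shift or die "usage: $0 url\n";
my $url     = uf_urlstr($raw_url);          # expand partial URLs, e.g. "perl" -> http://www.perl.com

my $ua = LWP::UserAgent->new();
$ua->agent("TitleBytesSketch/0.1");         # identify the program (see the robot guidelines below)

my $response = $ua->request(HTTP::Request->new(GET => $url));
die "$url => " . $response->status_line . "\n" if $response->is_error();

my $content = $response->content();
my $bytes   = length $content;
my $lines   = ($content =~ tr/\n/\n/);      # count newlines
printf "%s =>\n%s (%d lines, %d bytes)\n",
       $url, $response->title() || "(no title)", $lines, $bytes;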


Perl Cookbook. 20.3 - Extracting URLs:
How to extract all URLs from an HTML file.

o a simple program: xurl.pl

Example run:
> xurl.pl http://www.cs.huji.ac.il/~bioskill/Lesson1/
http://www.cs.huji.ac.il/~bioskill/Data
http://www.cs.huji.ac.il/~bioskill/Lesson1/1.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/2.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/3.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/4.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/FASTA2line.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/FASTAfromline.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/igul.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/plot.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/plotps.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/stats.pl.html
http://www.cs.huji.ac.il/~bioskill/MAN/awk.html
http://www.cs.huji.ac.il/~bioskill/MAN/bc.html
...
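
A minimal sketch of such an extractor, using HTML::LinkExtor to collect and absolutize the links (an assumption about the approach; the actual xurl.pl may differ):

use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;

my $base_url = shift or die "usage: $0 url\n";
my $html     = get($base_url) or die "could not get $base_url\n";

# With no callback and a base URL, HTML::LinkExtor stores the links,
# resolved relative to $base_url
my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($html);
$parser->eof;

my %seen;
for my $link ($parser->links) {
    my ($tag, %attrs) = @$link;             # e.g. ("a", href => "http://...")
    $seen{$_}++ for values %attrs;
}

print "$_\n" for sort keys %seen;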


Perl Cookbook. 20.5 - Converting HTML to ASCII:
How to convert an HTML file into formatted plain ASCII.

$ascii = `lynx -dump $filename`;

If you want to do it within your program and don't care about the things that the HTML::TreeBuilder formatter cannot handle yet (tables and frames):

2txt.pl
2ps.pl
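
A minimal sketch of the HTML-to-text conversion along these lines (an assumption; the actual 2txt.pl may differ):

use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;
use HTML::FormatText;

my $url  = shift or die "usage: $0 url\n";
my $html = get($url) or die "could not get $url\n";

my $tree = HTML::TreeBuilder->new();        # build a parse tree of the document
$tree->parse($html);
$tree->eof;

my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
print $formatter->format($tree);

$tree->delete;                              # free the parse tree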


Perl Cookbook. 20.6 - Extracting or Removing HTML Tags:
How to remove HTML tags from a string, leaving just plain text.

use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));


Perl Cookbook. 20.10 - Mirroring Web Pages:
You want to keep a local copy of *one* web page up-to-date.

use LWP::Simple;
mirror($URL, $local_filename);
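
mirror() sends an If-Modified-Since header when the local copy already exists, and returns an HTTP status code. A small sketch that also checks that code (URL and filename are only examples):

use LWP::Simple;
use HTTP::Status;                           # is_success() and the RC_* status constants

my $URL            = 'http://www.cs.huji.ac.il/~bioskill/index.html';
my $local_filename = 'index.html';

my $status = mirror($URL, $local_filename);
if ($status == RC_NOT_MODIFIED) {
    print "$local_filename is already up to date\n";
} elsif (!is_success($status)) {
    warn "could not mirror $URL: $status\n";
}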


More advanced programs:

surl.pl: sorts URLs by their last modification date

Example run:
> xurl.pl http://www.cs.huji.ac.il/~bioskill/Lesson1/ | surl.pl

  22/2/2010 http://www.cs.huji.ac.il/~bioskill/Lesson1/lesson1.html
  10/2/2010 http://www.cs.huji.ac.il/~bioskill/index.html
  4/2/2010 http://www.cs.huji.ac.il/~bioskill/Lesson1/findOverlap.csh.html
  2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/wc.html
  2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/uniq.html
  2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/tee.html
  2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/awk.html
  2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/paste.html
  ...
  <NONE SPECIFIED> http://www.cs.huji.ac.il/~bioskill/Data/Virus
  <NONE SPECIFIED> http://us.expasy.org/sprot/sprot-top.html
  ...
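
surl.pl itself is not reproduced here; a minimal sketch of the idea, issuing HEAD requests via LWP::Simple and sorting by the Last-Modified date (an assumption about the approach):

use strict;
use warnings;
use LWP::Simple qw(head);
use POSIX qw(strftime);

# read URLs, one per line, from standard input (e.g. piped from xurl.pl)
my %date;
while (my $url = <>) {
    chomp $url;
    my ($type, $length, $mod) = head($url);   # HEAD request only, no body transferred
    $date{$url} = $mod;                       # undef if the server sent no Last-Modified
}

for my $url (sort { ($date{$b} || 0) <=> ($date{$a} || 0) } keys %date) {
    if (defined $date{$url}) {
        print strftime("  %d/%m/%Y ", localtime $date{$url}), "$url\n";
    } else {
        print "  <NONE SPECIFIED> $url\n";
    }
}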

Robots:

Now that we know how to get documents from the web, we can write fast and powerful robots
that will do nothing but put even more pressure on the slow university proxies.

But before we do that (or at least, before we do that to other people's sites),
we should know some basic facts about being nice on the web or "net etiquette".


Some guidelines for Robot Writers
[by Martijn Koster, 1993]

- Reconsider
o Are you sure you really need a robot?

- Be Accountable
o Identify your Web Wanderer
HTTP supports a User-agent field to identify a WWW browser.
As your robot is a kind of WWW browser, use this field to name your robot e.g. "NottinghamRobot/1.0". This will allow server maintainers to set your robot apart from human users using interactive browsers. It is also recommended to run it from a machine registered in the DNS, which will make it easier to recognise, and will indicate to people where you are.

o Identify yourself
HTTP supports a From field to identify the user who runs the WWW browser. Use this to advertise your email address, e.g. "j.smith@somewhere.edu". This will allow server maintainers to contact you in case of problems, so that you can start a dialogue on better terms than if you were hard to track down. (See the sketch at the end of this section.)

o Announce it to the target
If you are only targeting a single site, or a few, contact its administrator and inform him/her.

o Be informative
Server maintainers often wonder why their server is hit. If you use the HTTP Referer field you can tell them. This costs no effort on your part, and may be informative.

o Be there
Don't set your Web Wanderer going and then go on holiday for a couple of days. If in your absence it does things that upset people you are the only one who can fix it. It is best to remain logged in to the machine that is running your robot.

o Notify your authorities
It is advisable to tell your system administrator / network provider what you are planning to do. You will be asking a lot of the services they offer, and if something goes wrong they like to hear it from you first, not from external people.
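
A minimal sketch of the "Be Accountable" advice in LWP (the robot name and contact address are only illustrations):

use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
$ua->agent("NottinghamRobot/1.0");          # identify your Web Wanderer (User-Agent header)
$ua->from('j.smith@somewhere.edu');         # identify yourself (From header)

my $response = $ua->get('http://www.cs.huji.ac.il/~bioskill/');
print $response->status_line, "\n";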

- Test Locally
Don't run repeated tests on remote servers; instead, run a number of servers locally and use them to test your robot first.

- Don't hog resources
Robots consume a lot of resources. To minimise the impact, keep the following in mind:

o Walk, don't run
Retrieving 1 document per minute is a lot better than one per second. One per 5 minutes is better still. Yes, your robot will take longer, but what's the rush, it's only a program.

o Use If-modified-since or HEAD where possible
If your application can use the HTTP If-Modified-Since header, or the HEAD method, it incurs less overhead than full GETs (see the sketch at the end of this section).

o Ask for what you want
HTTP has an Accept field in which a browser (or your robot) can specify the kinds of data it can handle. Use it: if you only analyse text, say so.

o Ask only for what you want
You can build in some logic yourself: if a link refers to a ".ps", ".zip", ".Z", ".gif" etc, and you only handle text, then don't ask for it.

o Check URLs
Don't assume the HTML documents you are going to get back are sensible. When scanning for URLs, be wary of things like <A HREF="http://host.dom/doc">. A lot of sites don't put the trailing / on URLs for directories; a naive strategy of concatenating the names of sub-URLs can result in bad names.

o Check the results
Check what comes back. If a server refuses a number of documents in a row, check what it is saying. It may be that the server refuses to let you retrieve these things because you're a robot.

o Don't Loop or Repeat
Remember all the places you have visited, so you can check that you're not looping.

o Run at opportune times
On some systems there are preferred times of access, when the machine is only lightly loaded.

o Don't run it often
How often people find acceptable differs, but I'd say once every two months is probably too often.

o Don't try queries
Some WWW documents are searchable (ISINDEX) or contain forms. Don't follow these.
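
A hedged sketch of a polite request that combines the "If-modified-since", "Accept" and identification advice above (URL, timestamp and names are only illustrations):

use LWP::UserAgent;
use HTTP::Date qw(time2str);

my $ua  = LWP::UserAgent->new(agent => "NottinghamRobot/1.0");
my $url = 'http://www.cs.huji.ac.il/~bioskill/index.html';

my $last_visit = time() - 24*60*60;         # e.g. the time of our previous visit

# Only ask for what you want, and only if it changed since the last visit
my $response = $ua->get($url,
    'If-Modified-Since' => time2str($last_visit),
    'Accept'            => 'text/html, text/plain',
);

if ($response->code == 304) {               # 304 Not Modified
    print "unchanged since last visit\n";
} else {
    print $response->status_line, "\n";
}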

- Stay with it
It is vital you know what your robot is doing, and that it remains under control.

o Log
Make sure it provides ample logging, and it wouldn't hurt to keep certain statistics, such as the number of successes/failures, the hosts accessed recently, the average size of recent files, and keep an eye on it.

o Be interactive
Arrange for you to be able to guide your robot. Commands that suspend or cancel the robot, or make it skip the current host can be very useful.

o Be prepared
Your robot will visit hundreds of sites. It will probably upset a number of people. Be prepared to respond quickly to their enquiries, and tell them what you're doing.

o Be understanding
If your robot upsets someone, instruct it not to visit his/her site, or only the home page.

- Share results
OK, so you are using the resources of a lot of people to do this. Do something back:

o Keep results
This may sound obvious, but think about what you are going to do with the retrieved documents.

o Raw Result
Make your raw results available, from FTP, or the Web or whatever. This means other people can use it, and don't need to run their own servers.

o Polished Result
You are running a robot for a reason, probably to create a database, or gather statistics. If you make these results available on the Web people are more likely to think it worth it. And you might get in touch with people with similar interests.

o Report Errors
Your robot might come across dangling links. You might as well publish them on the Web somewhere (after checking that they really are dangling). If you are convinced they are in error (as opposed to restricted), notify the administrator of the server.


robots.txt - Where can robots go?

o Example: http://www.ncbi.nlm.nih.gov/robots.txt (local copy)
o As you see, robots are not welcome here...


RobotTitleBytes.pl

Example run:
> RobotTitleBytes.pl http://www.cs.huji.ac.il/~bioskill/
http://www.cs.huji.ac.il/~bioskill/ =>
Workshop in Computational Bioskills - Guest page (12 lines, 224 bytes)


RobotTitleBytes.pl

Example run:
> RobotTitleBytes.pl http://www.ncbi.nlm.nih.gov/sites/entrez
http://www.ncbi.nlm.nih.gov/sites/entrez =>
403 Forbidden by robots.txt
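
RobotTitleBytes.pl is not reproduced here, but the "403 Forbidden by robots.txt" above is exactly what LWP::RobotUA returns when robots.txt disallows a path. A minimal sketch of a robots.txt-aware fetch (robot name and contact address are only illustrations):

use LWP::RobotUA;

# a robot name and a contact address are required; they fill the User-Agent and From headers
my $ua = LWP::RobotUA->new('BioskillsRobot/0.1', 'j.smith@somewhere.edu');
$ua->delay(1);                              # wait at least 1 minute between requests to the same host

my $response = $ua->get('http://www.ncbi.nlm.nih.gov/sites/entrez');
print $response->status_line, "\n";         # "403 Forbidden by robots.txt" when disallowed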