Workshop in Computational Bioskills - Spring 2011
Part 1 - Briefly on HTTP and HTML
Part 2 - Getting data from the Web
Part 3 - Web crawlers & Robots

Today's subject: The Web

Part 1 - Briefly on HTTP and HTML
Main actors
Server: A machine that runs a special daemon program that waits for input on the communication ports.
Client (User Agent): Any program that communicates with the server. The most common (and friendly) clients are browsers.
Proxy: A machine that sits between the client and the server, acting as a server for the client (and as a client for the server...). Advantages: caches popular pages; can act as a firewall.
URL - Uniform Resource Locator
General structure: protocol://host:port/path?parameters#anchor
Example: http://www.cs.huji.ac.il/~bioskill/Lesson6/index.html#PC
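In Perl (which we will use below), the URI module that ships with LWP can take such a string apart. A small illustration (the port is written explicitly here just for the example):
use URI;

my $uri = URI->new("http://www.cs.huji.ac.il:80/~bioskill/Lesson6/index.html#PC");

print $uri->scheme,   "\n";   # http
print $uri->host,     "\n";   # www.cs.huji.ac.il
print $uri->port,     "\n";   # 80
print $uri->path,     "\n";   # /~bioskill/Lesson6/index.html
print $uri->fragment, "\n";   # PC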
HTTP - HyperText Transfer Protocol
Standardizes the communication between servers and clients.
HTTP session:
Client opens a connection
Client sends a request
Server returns a response
Server closes the connection
HTTP Request
[method URL version]
[header...]
[header...]
...
[message body]
Methods: GET, POST, HEAD, PUT, DELETE, TRACE
Version: HTTP/1.0, HTTP/1.1
Headers: User-Agent, Host, Connection... (HTTP/1.1 defines 64 possible headers)
POST method: sends information to the server (e.g. when filling a form); the parameters are sent in the message body.
GET method: sends information to the server (e.g. when filling a form); the parameters are sent in the URL.
HTTP Response
[version status-code reason-phrase]
[header...]
[header...]
...
[message body]
Status codes: some you know for sure... (200 OK, 404 Not Found, ...)
Headers: Date, Content-Type, Content-Length
Message body: what is displayed in the browser.
The headers are a good way to get information about a document without retrieving all of its content (e.g. via the HEAD method).
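For illustration only (the header values here are invented, and real exchanges carry more headers), a minimal GET request and its response might look like:

Request:
GET /~bioskill/Lesson6/index.html HTTP/1.1
Host: www.cs.huji.ac.il
User-Agent: Mozilla/5.0
Connection: close

Response:
HTTP/1.1 200 OK
Date: Sun, 20 Mar 2011 10:00:00 GMT
Content-Type: text/html
Content-Length: 4321

<html> ... </html>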
Link syntax:
<a href="url">Link text</a>
The start tag contains attributes about the link.
The element content (Link text) defines the part to be displayed.
Learn more:
lynx [URL]: Simple textual browser
lynx -dump [-width=NUM] [URL]: Simple HTML2text
lynx -source [URL]: Retrieve HTML code of a web page
wget [URL]: Retrieve content of a URL
(You might need to set http_proxy to http://wwwproxy.huji.ac.il:8080)
How to get the content of a URL?
use LWP::Simple;
$content = get($url);
Or a nicer version:
use LWP::Simple;
unless (defined ($content = get $URL)) {
    die "could not get $URL\n";
}
More professional modules:
o LWP::UserAgent
This module creates a virtual browser. The object returned from the new constructor is used to make the actual request.
o HTTP::Request
This module creates a request but doesn't send it yet.
o HTTP::Response
This is the object type returned when the user agent actually runs the request. We check it for errors and contents.
o URI::Heuristic
This curious little module uses guessing algorithms to expand partial URLs. For example:
perl => http://www.perl.com
www.oreilly.com => http://www.oreilly.com
ftp.funet.fi => ftp://ftp.funet.fi
/etc/passwd => file:/etc/passwd
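As a rough illustration of how these modules fit together (a sketch based on the descriptions above, not the course's exact code; the agent name is an arbitrary choice):
use LWP::UserAgent;
use HTTP::Request;
use URI::Heuristic qw(uf_urlstr);

my $raw_url = shift or die "usage: $0 url\n";
my $url = uf_urlstr($raw_url);               # expand partial URLs, e.g. "perl" => "http://www.perl.com"

my $ua = LWP::UserAgent->new();              # the "virtual browser"
$ua->agent("BioskillsExample/1.0");          # identify ourselves (see the robot guidelines below)

my $request  = HTTP::Request->new(GET => $url);   # build the request; not sent yet
my $response = $ua->request($request);            # run it; returns an HTTP::Response object

if ($response->is_success) {
    print $response->content;
} else {
    die "$url: ", $response->status_line, "\n";
}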
o Let's start with a simple program:
TitleBytes.pl: finding the title and size of documents.
Example run:
> TitleBytes.pl http://www.tpj.com/
http://www.tpj.com/ =>
The Perl Journal (109 lines, 4530 bytes)
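TitleBytes.pl itself is not reproduced here; its core could look roughly like the following, continuing from the $url and $response of the sketch above (the title regex and the line/byte counting are illustrative assumptions):
my $content = $response->content;
my ($title) = $content =~ m{<title>\s*(.*?)\s*</title>}is;   # naive title extraction
my $lines   = ($content =~ tr/\n//);                         # count newlines
printf "%s =>\n\t%s (%d lines, %d bytes)\n", $url, $title, $lines, length($content);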
Perl Cookbook 20.3 - Extracting URLs:
How to extract all URLs from an HTML file.
o A simple program: xurl.pl
Example run:
> xurl.pl http://www.cs.huji.ac.il/~bioskill/Lesson1/
http://www.cs.huji.ac.il/~bioskill/Data
http://www.cs.huji.ac.il/~bioskill/Lesson1/1.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/2.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/3.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/4.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/FASTA2line.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/FASTAfromline.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/igul.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/plot.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/plotps.pl.html
http://www.cs.huji.ac.il/~bioskill/Lesson1/stats.pl.html
http://www.cs.huji.ac.il/~bioskill/MAN/awk.html
http://www.cs.huji.ac.il/~bioskill/MAN/bc.html
...
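The Cookbook's approach to this task is HTML::LinkExtor; a minimal sketch along those lines (not necessarily identical to xurl.pl):
use LWP::Simple;
use HTML::LinkExtor;

my $base_url = shift or die "usage: $0 url\n";
my $html = get($base_url) or die "could not get $base_url\n";

# The second argument makes LinkExtor absolutize links against the base URL.
my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($html);
$parser->eof;

my %seen;
foreach my $link ($parser->links) {
    my ($tag, %attrs) = @$link;          # e.g. ("a", href => "http://...")
    $seen{$_}++ for values %attrs;
}
print "$_\n" for sort keys %seen;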
Perl Cookbook 20.5 - Converting HTML to ASCII:
How to convert an HTML file into formatted plain ASCII.
The quick way, using an external program:
$ascii = `lynx -dump $filename`;
If you want to do it within your program, and don't care about the things that the HTML::TreeBuilder formatter cannot handle yet (tables and frames), something along these lines should work:
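use HTML::TreeBuilder;
use HTML::FormatText;

# a sketch based on the Cookbook recipe; the margins are arbitrary choices
my $tree = HTML::TreeBuilder->new();
$tree->parse_file($filename);            # or $tree->parse($html_text)
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
$ascii = $formatter->format($tree);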
Perl Cookbook 20.6 - Extracting or Removing HTML Tags:
How to remove HTML tags from a string, leaving just plain text.
use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));
Perl Cookbook 20.10 - Mirroring Web Pages:
You want to keep a local copy of *one* web page up-to-date.
use LWP::Simple;
mirror($URL, $local_filename);
mirror() only re-downloads the page if it has changed since the local copy was written (it uses the If-Modified-Since header).
More advanced programs:
surl.pl: sort URLs by their last modification date.
Example run:
> xurl.pl http://www.cs.huji.ac.il/~bioskill/Lesson1/ | surl.pl
22/2/2010 http://www.cs.huji.ac.il/~bioskill/Lesson1/lesson1.html
10/2/2010 http://www.cs.huji.ac.il/~bioskill/index.html
4/2/2010 http://www.cs.huji.ac.il/~bioskill/Lesson1/findOverlap.csh.html
2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/wc.html
2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/uniq.html
2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/tee.html
2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/awk.html
2/2/2010 http://www.cs.huji.ac.il/~bioskill/MAN/paste.html
...
<NONE SPECIFIED> http://www.cs.huji.ac.il/~bioskill/Data/Virus
<NONE SPECIFIED> http://us.expasy.org/sprot/sprot-top.html
...
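surl.pl itself is not reproduced here; a rough sketch of the idea, using LWP::Simple's head() to fetch the Last-Modified time and sorting newest-first (the date format imitates the output above):
use LWP::Simple;

# Read URLs (one per line, e.g. piped from xurl.pl) and record their
# Last-Modified times via HEAD requests.
my %mtime;
while (my $url = <>) {
    chomp $url;
    next unless $url;
    # head() returns (content_type, document_length, modified_time, expires, server)
    my (undef, undef, $mod) = head($url);
    $mtime{$url} = $mod || 0;
}

# Newest first; URLs with no Last-Modified header go last.
foreach my $url (sort { $mtime{$b} <=> $mtime{$a} } keys %mtime) {
    if ($mtime{$url}) {
        my ($d, $m, $y) = (localtime $mtime{$url})[3, 4, 5];
        printf "%d/%d/%d %s\n", $d, $m + 1, $y + 1900, $url;
    } else {
        print "<NONE SPECIFIED> $url\n";
    }
}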
Robots:
Now that we know how to get documents from the web, we can write fast and powerful robots that will do nothing but put even more pressure on the slow university proxies.
But before we do that (or at least, before we do that to other people's sites), we should know some basic facts about being nice on the web, or "net etiquette".
Some guidelines for Robot Writers [by Martijn Koster, 1993]
- Reconsider
o Are you sure you really need a robot?
- Be Accountable
o Identify your Web Wanderer
HTTP supports a User-Agent field to identify a WWW browser. As your robot is a kind of WWW browser, use this field to name your robot, e.g. "NottinghamRobot/1.0". This will allow server maintainers to set your robot apart from human users using interactive browsers (see the LWP sketch after these guidelines). It is also recommended to run it from a machine registered in the DNS, which will make it easier to recognise, and will indicate to people where you are.
o Identify yourself
HTTP supports a From field to identify the user who runs the WWW browser. Use this to advertise your email address, e.g. "j.smith@somewhere.edu". This will allow server maintainers to contact you in case of problems, so that you can start a dialogue on better terms than if you were hard to track down.
o Announce it to the target
If you are only targeting a single site, or a few, contact its administrator and inform him/her.
o Be informative
Server maintainers often wonder why their server is hit. If you use the HTTP Referer field you can tell them. This costs no effort on your part, and may be informative.
o Be there
Don't set your Web Wanderer going and then go on holiday for a couple of days. If in your absence it does things that upset people, you are the only one who can fix it. It is best to remain logged in to the machine that is running your robot.
o Notify your authorities
It is advisable to tell your system administrator / network provider what you are planning to do. You will be asking a lot of the services they offer, and if something goes wrong they like to hear it from you first, not from external people.
- Test Locally
Don't run repeated tests on remote servers; instead run a number of servers locally and use them to test your robot first.
- Don't hog resources
Robots consume a lot of resources. To minimise the impact, keep the following in mind:
o Walk, don't run
Retrieving 1 document per minute is a lot better than one per second. One per 5 minutes is better still. Yes, your robot will take longer, but what's the rush? It's only a program.
o Use If-Modified-Since or HEAD where possible
If your application can use the HTTP If-Modified-Since header, or the HEAD method, for its purposes, that gives less overhead than full GETs.
o Ask for what you want
HTTP has an Accept field in which a browser (or your robot) can specify the kinds of data it can handle. Use it: if you only analyse text, say so.
o Ask only for what you want
You can build in some logic yourself: if a link refers to a ".ps", ".zip", ".Z", ".gif" etc., and you only handle text, then don't ask for it.
o Check URLs
Don't assume the HTML documents you are going to get back are sensible. When scanning for URLs, be wary of things like <A HREF="http://host.dom/doc">. A lot of sites don't put the trailing / on URLs for directories, so a naive strategy of concatenating the names of sub-URLs can result in bad names.
o Check the results
Check what comes back. If a server refuses a number of documents in a row, check what it is saying. It may be that the server refuses to let you retrieve these things because you're a robot.
o Don't Loop or Repeat
Remember all the places you have visited, so you can check that you're not looping.
o Run at opportune times
On some systems there are preferred times of access, when the machine is only lightly loaded.
o Don't run it often
How often people find acceptable differs, but I'd say once every two months is probably too often.
o Don't try queries
Some WWW documents are searchable (ISINDEX) or contain forms. Don't follow these.
- Stay with it
It is vital you know what your robot is doing, and that it remains under control.
o Log
Make sure it provides ample logging, and it wouldn't hurt to keep certain statistics, such as the number of successes/failures, the hosts accessed recently, the average size of recent files, and keep an eye on it.
o Be interactive
Arrange for you to be able to guide your robot. Commands that suspend or cancel the robot, or make it skip the current host, can be very useful.
o Be prepared
Your robot will visit hundreds of sites. It will probably upset a number of people. Be prepared to respond quickly to their enquiries, and tell them what you're doing.
o Be understanding
If your robot upsets someone, instruct it not to visit his/her site, or only the home page.
- Share results
OK, so you are using the resources of a lot of people to do this. Do something back:
o Keep results
This may sound obvious, but think about what you are going to do with the retrieved documents.
o Raw Results
Make your raw results available, from FTP, or the Web, or whatever. This means other people can use them, and don't need to run their own servers.
o Polished Results
You are running a robot for a reason, probably to create a database or gather statistics. If you make these results available on the Web, people are more likely to think it worth it. And you might get in touch with people with similar interests.
o Report Errors
Your robot might come across dangling links. You might as well publish them on the Web somewhere (after checking they really are dangling). If you are convinced they are in error (as opposed to restricted), notify the administrator of the server.
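To make a few of these guidelines concrete, here is a rough LWP sketch (the robot name, address, file naming and delay are illustrative, not the course's code): it identifies the robot and its operator, lets mirror() skip unchanged pages via If-Modified-Since, and pauses between requests.
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
$ua->agent("BioskillsRobot/0.1");       # "Identify your Web Wanderer" (illustrative name)
$ua->from('j.smith@somewhere.edu');     # "Identify yourself" (illustrative address)

foreach my $url (@ARGV) {
    # Derive a local file name from the URL (crude, for illustration only).
    (my $file = $url) =~ s{[^\w.]+}{_}g;

    # mirror() sends If-Modified-Since based on the local file's timestamp,
    # so unchanged pages cost only a cheap 304 response
    # ("Use If-Modified-Since or HEAD where possible").
    my $response = $ua->mirror($url, $file);
    print "$url => ", $response->status_line, "\n";

    sleep 60;                           # "Walk, don't run"
}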
robots.txt - where robots may go.
o Example: http://www.ncbi.nlm.nih.gov/robots.txt (local copy)
o As you see, robots are not welcome here...
Example run:
> RobotTitleBytes.pl http://www.cs.huji.ac.il/~bioskill/
http://www.cs.huji.ac.il/~bioskill/ =>
Workshop in Computational Bioskills - Guest page (12 lines, 224 bytes)
Example run:
> RobotTitleBytes.pl http://www.ncbi.nlm.nih.gov/sites/entrez/
http://www.ncbi.nlm.nih.gov/sites/entrez =>
403 Forbidden by robots.txt
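RobotTitleBytes.pl itself is not shown here; a program with this behaviour could be built on LWP::RobotUA, which fetches and obeys robots.txt automatically and answers disallowed requests with "403 Forbidden by robots.txt". A rough sketch (robot name, address and delay are illustrative):
use LWP::RobotUA;
use HTTP::Request;

# LWP::RobotUA is an LWP::UserAgent that fetches and obeys robots.txt.
my $ua = LWP::RobotUA->new('BioskillsRobot/0.1', 'j.smith@somewhere.edu');
$ua->delay(1);                                    # minutes to wait between requests

foreach my $url (@ARGV) {
    my $response = $ua->request(HTTP::Request->new(GET => $url));
    print "$url =>\n";
    if ($response->is_success) {
        # title / line / byte reporting as in the TitleBytes sketch above
        my $content = $response->content;
        my ($title) = $content =~ m{<title>\s*(.*?)\s*</title>}is;
        printf "\t%s (%d lines, %d bytes)\n",
               $title, ($content =~ tr/\n//), length($content);
    } else {
        print "\t", $response->status_line, "\n"; # e.g. "403 Forbidden by robots.txt"
    }
}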