Using Sequence Classification for Filtering Web Pages

Speaker: Ronen Feldman
Date: Thursday, 6 March 2008
Time: 2pm
Place: Ross 201

Abstract:
Web pages often contain text that is irrelevant to their main content, such as advertisements, generic format elements, and references to other pages on the same site. When used by automatic content-processing systems, e.g., for Web indexing, text classification, or information extraction, this irrelevant text often produces a substantial amount of noise. This paper describes a trainable filtering system based on a feature-rich sequence classifier that removes irrelevant parts from pages while keeping the content intact. Most of the features the system uses are purely form-related: HTML tags and their positions, sizes of elements, etc. This keeps the system general and domain-independent. We also experiment with content words and show that while they perform very poorly alone, they can slightly improve the performance of pure-form features without jeopardizing domain-independence. Our system achieves very high accuracy (95% and above) on several collections of Web pages. We also run a series of tests with different features and different classifiers, comparing the contribution of each component to overall system performance, and comparing two known sequence classifiers, Robust Risk Minimization (RRM) and Conditional Random Fields (CRF), in a novel setting.

Joint work with Benjamin Rosenfeld and Lyle Ungar.
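
As a rough illustration of the general idea (not the system described in the talk), the sketch below shows how purely form-related features of page blocks could be fed to a CRF sequence classifier. The block segmentation, feature names, and the CONTENT/NOISE label scheme are assumptions made for the example; it uses BeautifulSoup and sklearn-crfsuite rather than the authors' implementation.

```python
# Illustrative sketch only: segment a page into block-level elements,
# extract form-related features (tag name, position, size, link density),
# and label each block CONTENT / NOISE with a CRF.
# Feature set and labels are assumptions, not the paper's exact system.
from bs4 import BeautifulSoup          # pip install beautifulsoup4
import sklearn_crfsuite                # pip install sklearn-crfsuite

BLOCK_TAGS = ["p", "div", "td", "li", "h1", "h2", "h3", "table", "ul"]

def page_to_blocks(html):
    """Return the block-level elements of a page in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all(BLOCK_TAGS)

def block_features(blocks, i):
    """Form-related features for block i: tag, position, size, link density."""
    el = blocks[i]
    text = el.get_text(" ", strip=True)
    n_words = len(text.split())
    n_link_words = sum(len(a.get_text(" ", strip=True).split())
                       for a in el.find_all("a"))
    feats = {
        "tag": el.name,
        "rel_position": round(i / max(len(blocks) - 1, 1), 2),
        "n_words": min(n_words, 50),                      # capped size feature
        "link_density": round(n_link_words / max(n_words, 1), 1),
    }
    # Neighboring tags give the sequence classifier local context.
    if i > 0:
        feats["prev_tag"] = blocks[i - 1].name
    if i < len(blocks) - 1:
        feats["next_tag"] = blocks[i + 1].name
    return feats

def featurize(html):
    blocks = page_to_blocks(html)
    return [block_features(blocks, i) for i in range(len(blocks))]

# Training: one feature sequence and one label sequence per page, with
# hypothetical labels such as "CONTENT" / "NOISE" per block.
# X_train = [featurize(html) for html in training_pages]
# y_train = [labels_for(html) for html in training_pages]
# crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
# crf.fit(X_train, y_train)
# predicted = crf.predict([featurize(new_page_html)])
```

Because every feature above is derived from markup structure rather than vocabulary, a classifier trained this way stays domain-independent in the sense the abstract describes; content-word features would be added as extra dictionary entries per block.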