Using Sequence Classification for Filtering Web Pages

Speaker: Ronen Feldman
Date: Thursday, 6 March 2008
Time: 2pm
Place: Ross 201

Abstract:
Web pages often contain text that is irrelevant to their main content, such as advertisements, generic format elements, and references to other pages on the same site. When used by automatic content-processing systems, e.g., for Web indexing, text classification, or information extraction, this irrelevant text often produces a substantial amount of noise. This paper describes a trainable filtering system based on a feature-rich sequence classifier that removes irrelevant parts from pages while keeping the content intact. Most of the features the system uses are purely form-related: HTML tags and their positions, sizes of elements, etc. This keeps the system general and domain-independent. We also experiment with content words and show that while they perform very poorly alone, they can slightly improve the performance of pure-form features without jeopardizing domain-independence. Our system achieves very high accuracy (95% and above) on several collections of Web pages. We also run a series of tests with different features and different classifiers, comparing the contribution of each component to overall system performance, and comparing two known sequence classifiers, Robust Risk Minimization (RRM) and Conditional Random Fields (CRF), in a novel setting.

Joint work with Benjamin Rosenfeld and Lyle Ungar.
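
As a rough illustration of the general idea (not the system described in the talk), the sketch below shows how purely form-related features of page blocks could be fed to a CRF sequence classifier. The block segmentation, feature names, and the CONTENT/NOISE label scheme are assumptions made for the example; it uses BeautifulSoup and sklearn-crfsuite rather than the authors' implementation.

```python
# Illustrative sketch only: segment a page into block-level elements,
# extract form-related features (tag name, position, size, link density),
# and label each block CONTENT / NOISE with a CRF.
# Feature set and labels are assumptions, not the paper's exact system.
from bs4 import BeautifulSoup          # pip install beautifulsoup4
import sklearn_crfsuite                # pip install sklearn-crfsuite

BLOCK_TAGS = ["p", "div", "td", "li", "h1", "h2", "h3", "table", "ul"]

def page_to_blocks(html):
    """Return the block-level elements of a page in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all(BLOCK_TAGS)

def block_features(blocks, i):
    """Form-related features for block i: tag, position, size, link density."""
    el = blocks[i]
    text = el.get_text(" ", strip=True)
    n_words = len(text.split())
    n_link_words = sum(len(a.get_text(" ", strip=True).split())
                       for a in el.find_all("a"))
    feats = {
        "tag": el.name,
        "rel_position": round(i / max(len(blocks) - 1, 1), 2),
        "n_words": min(n_words, 50),                      # capped size feature
        "link_density": round(n_link_words / max(n_words, 1), 1),
    }
    # Neighboring tags give the sequence classifier local context.
    if i > 0:
        feats["prev_tag"] = blocks[i - 1].name
    if i < len(blocks) - 1:
        feats["next_tag"] = blocks[i + 1].name
    return feats

def featurize(html):
    blocks = page_to_blocks(html)
    return [block_features(blocks, i) for i in range(len(blocks))]

# Training: one feature sequence and one label sequence per page, with
# hypothetical labels such as "CONTENT" / "NOISE" per block.
# X_train = [featurize(html) for html in training_pages]
# y_train = [labels_for(html) for html in training_pages]
# crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
# crf.fit(X_train, y_train)
# predicted = crf.predict([featurize(new_page_html)])
```

Because every feature above is derived from markup structure rather than vocabulary, a classifier trained this way stays domain-independent in the sense the abstract describes; content-word features would be added as extra dictionary entries per block.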