HTML Extraction Language (HEL)
Tree-based data model
- an HTML page is seen as a labeled tree (DOM, the Document Object Model)
Tree navigation via path-expressions (with conditions)
- extraction rules are described as paths along the tree
- path expressions always return text values
Regular expressions
- regular expressions (à la Perl) can be applied to text values to extract finer-grained pieces
<TABLE>
  <TBODY>
    <TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR>
    <TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR>
  </TBODY>
</TABLE>
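The two steps described above (a path expression that returns text values, then an optional regular expression applied to those values) can be sketched in Python on the sample table. This is only an illustration, not HEL syntax: it uses the limited XPath subset of the standard library's xml.etree.ElementTree as a stand-in for HEL path expressions, and it assumes the fragment is well-formed XML, which real HTML pages often are not. The helper name extract and its parameters are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The sample table from the text, as a well-formed fragment.
PAGE = (
    "<TABLE><TBODY>"
    "<TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR>"
    "<TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR>"
    "</TBODY></TABLE>"
)


def extract(page, path, pattern=None):
    """Hypothetical helper: walk `path` in the tree and return text values.

    If `pattern` is given, it is applied to each text value and the first
    capture group of each match is returned instead (non-matches are skipped).
    """
    root = ET.fromstring(page)  # root is the TABLE element
    texts = [(node.text or "") for node in root.findall(path)]
    if pattern is None:
        return texts
    results = []
    for text in texts:
        match = re.search(pattern, text)
        if match:
            results.append(match.group(1))
    return results


# Path with a positional condition: first cell of every row.
titles = extract(PAGE, "./TBODY/TR/TD[1]")
# Second cell of every row.
modes = extract(PAGE, "./TBODY/TR/TD[2]")
# Finer granularity: only the first word of each title.
first_words = extract(PAGE, "./TBODY/TR/TD[1]", r"^(\w+)")

print(titles)       # ['Shady Grove', 'Over the River, Charlie']
print(modes)        # ['Aeolian', 'Dorian']
print(first_words)  # ['Shady', 'Over']
```

The path selects nodes and the result is always a list of strings, mirroring the rule that path expressions return text values. The regular expression never sees the tree, only those strings.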