Content Extraction based on Hierarchical Relations in DOM Structures
Main authors: | López Romero, S.; Silva Galiana, JF.; Insa Cabrera, D. |
---|---|
Format: | Article |
Language: | English |
Summary: | This article introduces a new approach for content
extraction that exploits the hierarchical inter-relations of the
elements in a webpage. Content extraction is a technique used
to extract the main textual content from a webpage. This is
useful for filtering out advertisements and any other
additional information that is not part of the main content. The
main idea behind our approach is to use the DOM tree as an
explicit representation of the inter-relations of the elements in a
webpage. Using the information contained in the DOM tree, we
can identify blocks of content and easily determine which
of the blocks contains the most text. Thanks to this information, the
technique achieves considerable recall and precision. Using the
DOM structure for content extraction gives us the benefits of
other approaches based on the syntax of the webpage (such as
characters, words and tags), but it also gives us very precise
information about the related components in a block, thus
producing very cohesive blocks.
López Romero, S.; Silva Galiana, JF.; Insa Cabrera, D. (2012). Content Extraction based on Hierarchical Relations in DOM Structures. Research and Development in Computer Science and Engineering. 45:5-12. http://hdl.handle.net/10251/47738 |
---|
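The idea described in the summary, using the DOM tree to find the block that holds most of the text, can be illustrated with a small sketch. This is not the authors' algorithm, which the record does not spell out. It is a hypothetical heuristic that walks down from the root into whichever child holds a dominant share of the text and stops when no single child dominates. The names `Node`, `TreeBuilder`, `main_block` and the `share` threshold of 0.6 are assumptions made for illustration; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is not visible page content.
SKIP = {"script", "style"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_text = 0  # characters of text directly inside this element

    def text_len(self):
        """Characters of text in this element and all its descendants."""
        return self.own_text + sum(c.text_len() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML; tolerant of unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP:
            self.current.own_text += len(data.strip())


def main_block(root, share=0.6):
    """Descend while one child holds at least `share` of the node's text.

    The threshold is an assumed value, not one taken from the article.
    """
    node = root
    while node.children:
        total = node.text_len()
        best = max(node.children, key=Node.text_len)
        if total == 0 or best.text_len() < share * total:
            break
        node = best
    return node


PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
  <div id="article">
    <p>The DOM tree makes the relations between page elements explicit.</p>
    <p>Blocks of content can be compared by how much text they hold.</p>
  </div>
  <div id="footer">Contact us</div>
</body></html>
"""

builder = TreeBuilder()
builder.feed(PAGE)
block = main_block(builder.root)
print(block.tag, block.attrs.get("id"))
```

Because the selection works on subtrees rather than on isolated lines of text, the two paragraphs in the sample page are returned together as one block, which is the kind of cohesion the summary attributes to DOM-based extraction.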