A Heuristic Approach for Converting HTML Documents to XML Documents
XML is rapidly emerging, and yet there still exist numerous HTML documents on the Web. In this paper, we present a heuristic approach for converting HTML documents to XML documents. During the conversion process, we eliminate all the HTML elements in an HTML document from the resulting XML document...
Gespeichert in:
Hauptverfasser: | , |
---|---|
Format: | Buchkapitel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | XML is rapidly emerging, and yet there still exist numerous HTML documents on the Web. In this paper, we present a heuristic approach for converting HTML documents to XML documents. During the conversion process, we eliminate all the HTML elements in an HTML document from the resulting XML document since these elements are designed for the display of data exclusively, but retain the character data of each element along with the implicit hierarchy among the data. The proposed conversion approach extracts the data hierarchy of HTML documents as closely as possible with no human intervention. The approach can be adopted to construct the data hierarchy of an HTML document and to collect data in HTML documents into an XML repository. |
---|---|
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/3-540-44957-4_79 |