Web Content Information Extraction Approach Based on Removing Noise and Content-Features

This paper presents an improved approach to extract the main content from web pages. There are a good many financial news pages which have so many links that the algorithms mainly based on link density have poor performance in extracting main content. To solve this problem, we put forward an extract...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Dingkui Yang, Jihua Song
Format: Tagungsbericht
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:This paper presents an improved approach to extract the main content from web pages. There are a good many financial news pages which have so many links that the algorithms mainly based on link density have poor performance in extracting main content. To solve this problem, we put forward an extracting main content method which firstly removes the usual noise and the candidate nodes without any main content information from web pages, and makes use of the relation of content text length, the length of anchor text and the number of punctuation marks to extract the main content. In this paper, we focus on removing noise and utilization of all kinds of content-characteristics, experiments show that this approach can enhance the universality and accuracy in extracting the body text of web pages.
DOI:10.1109/WISM.2010.82