Distributed online real-time processing method and system for multi-source and heterogeneous flow-state big data

Bibliographic details
Main authors: YANG ZIJIANG, LI CHEN, GUO JIANPING, WEI MOJI, ZHU SHIWEI, YU JUNFENG, LI SISI, LIU CUIQIN, YANG AIQIN
Format: Patent
Language: chi; eng
Description
Summary: The invention provides a distributed online real-time processing method and system for multi-source and heterogeneous flow-state big data. The method comprises: crawling webpage data from each source using a distributed crawler de-duplication algorithm; pre-processing the crawled pages, constructing a corresponding tree using a visual page segmentation algorithm, pruning noise nodes according to visual rules, classifying multi-layer pages, determining predicates for the different page types according to their characteristics, and inferring data record block nodes and data attribute nodes through the rules; distributing the preprocessed data sources using a distributed message system to provide a data stream, and describing the state of the data nodes in the data stream to form state information; performing selective storage operations on the data streams using the Hadoop distributed file system; and detecting the processed data using a K-means tex...
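
The summary names a crawler duplicate removal step but this record does not say how it works. As an illustration only, one common approach is to normalize each URL, hash it to a fixed-size fingerprint, and skip any URL whose fingerprint has already been recorded. The sketch below assumes that approach; `normalize_url`, `fingerprint`, `SeenFilter`, and `dedupe` are hypothetical names, and the in-memory set stands in for whatever store the crawler workers would share in a distributed deployment.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    # Drop the fragment: it is handled client-side and is not sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def fingerprint(url):
    """Fixed-size key for a normalized URL (SHA-1 hex digest)."""
    return hashlib.sha1(normalize_url(url).encode("utf-8")).hexdigest()


class SeenFilter:
    """Hypothetical in-memory stand-in for a store shared by crawler workers."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Record the URL and return True only the first time it is seen."""
        key = fingerprint(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def dedupe(urls, seen):
    """Keep only URLs not yet recorded in `seen`, preserving input order."""
    return [u for u in urls if seen.add_if_new(u)]
```

For example, `dedupe(["http://Example.com/a#top", "http://example.com/a"], SeenFilter())` keeps only the first URL, because both normalize to the same string. A set of fingerprints is exact but grows with the crawl; a Bloom filter is a frequent substitute when memory matters, at the cost of occasional false positives.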