Distributed online real-time processing method and system for multi-source and heterogeneous flow-state big data
Main authors: | , , , , , , , , |
---|---|
Format: | Patent |
Language: | chi ; eng |
Subjects: | |
Online access: | Order full text |
Summary: | The invention provides a distributed online real-time processing method and system for multi-source and heterogeneous flow-state big data. The method comprises the following steps: crawling webpage data from each source using a distributed crawler duplicate-removal algorithm; pre-processing the crawled pages, constructing a corresponding tree using a visual page segmentation algorithm, pruning noise nodes according to a visual rule, classifying multi-layer pages, determining predicates for the different page types according to their characteristics, and inferring data record block nodes and data attribute nodes through the rule; distributing the preprocessed data sources using a distributed message system, providing a data stream, and describing the state of each data node in the data stream to form state information; performing a selective storage operation on the data streams using a Hadoop distributed file system based on K- And detecting the processed data by using a means tex |
---|
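
The summary names a distributed crawler duplicate-removal step but does not say how it works. As a rough illustration only, the sketch below shows one common way such a step can be built: URLs are normalized, fingerprinted with SHA-1, and assigned to a shard by fingerprint so that each crawler node only has to check its own set. The names `normalize_url`, `fingerprint`, and `ShardedSeenSet`, the use of SHA-1, and the in-memory sets are all assumptions made for this example and are not taken from the patent.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def fingerprint(url: str) -> str:
    """Hex SHA-1 digest of the normalized URL (hypothetical choice of hash)."""
    return hashlib.sha1(normalize_url(url).encode("utf-8")).hexdigest()


class ShardedSeenSet:
    """Hypothetical stand-in for per-node 'already crawled' sets.

    Each shard models the memory of one crawler node. In a real
    deployment the shards would live on separate machines.
    """

    def __init__(self, num_shards: int = 4) -> None:
        self.shards = [set() for _ in range(num_shards)]

    def shard_of(self, fp: str) -> int:
        # The same fingerprint always maps to the same shard.
        return int(fp, 16) % len(self.shards)

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True only the first time it is seen."""
        fp = fingerprint(url)
        shard = self.shards[self.shard_of(fp)]
        if fp in shard:
            return False
        shard.add(fp)
        return True


if __name__ == "__main__":
    seen = ShardedSeenSet(num_shards=4)
    for candidate in [
        "http://Example.com/a#top",
        "http://example.com/a",
        "http://example.com/b",
    ]:
        print(candidate, seen.add_if_new(candidate))
```

Routing by fingerprint means a duplicate URL always reaches the shard that saw it first, so no cross-node lookup is needed. Whether the patented method uses this or another scheme cannot be determined from the summary.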