PDF table structure identification method based on graph attention mechanism


Bibliographic details
Main authors: XU HENGDA, CHI ZEWEN, MAO XIANLING
Format: Patent
Language: Chinese; English
Description
Summary: The invention relates to a PDF table structure recognition method based on a graph attention mechanism, and belongs to the technical field of document analysis within data mining. The method comprises the following steps: 1, preprocessing: all cells in a table and their position coordinates are obtained; 2, graph construction: an undirected graph is built over the obtained cells; and 3, relationship prediction: the edges of the constructed undirected graph are classified, using a neural network model to predict the adjacency relationships between cells. Compared with the prior art, this is the first method proposed for recognizing complex table structures in PDFs; it achieves the best results on two table structure recognition datasets, with a particularly marked improvement on complex table structures.

本发明涉及一种基于图注意力机制的PDF表格结构识别方法,属于数据挖掘技术中的文档分析技术领域;包括以下步骤:一、预处理:获取表格中的所有单元格以及它们的位置坐标;二、图构建:对得到的单元格建立无向图;三、关系预测:通过对构建的无向图上的边进行分类,使用神经网络模型
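
The first two steps in the summary can be illustrated with a minimal sketch. The record does not say how the undirected graph is built or how the network scores edges, so everything here is an assumption made for illustration: the `Cell` type, the k-nearest-neighbour rule in `build_undirected_graph`, and the distance-based scores in `attention_weights`, which stand in for the learned scores of a real graph attention layer. The sketch shows only the softmax normalisation over a cell's neighbours; the trained edge classifier of step 3 is not reproduced.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell record: text plus bounding-box coordinates (step 1)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def build_undirected_graph(cells, k=2):
    """Step 2 (assumed heuristic): link each cell to its k nearest cells.

    Edges are stored as (low_index, high_index) pairs so that each
    undirected edge appears exactly once.
    """
    edges = set()
    for i, a in enumerate(cells):
        nearest = sorted(
            (math.dist(a.center, b.center), j)
            for j, b in enumerate(cells)
            if j != i
        )
        for _, j in nearest[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def attention_weights(cells, edges, node):
    """Softmax over the neighbours of `node`.

    The score is the negative centre distance, a hand-set stand-in for
    the learned attention score of a trained model.
    """
    neighbours = [j if i == node else i for i, j in edges if node in (i, j)]
    scores = [-math.dist(cells[node].center, cells[n].center) for n in neighbours]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {n: e / total for n, e in zip(neighbours, exps)}


# A 2 x 2 table: A and B in the top row, C and D in the bottom row.
cells = [
    Cell("A", 0, 0, 10, 10),
    Cell("B", 10, 0, 20, 10),
    Cell("C", 0, 10, 10, 20),
    Cell("D", 10, 10, 20, 20),
]
edges = build_undirected_graph(cells, k=2)
weights = attention_weights(cells, edges, node=0)
```

In this toy grid each cell is linked to its horizontal and vertical neighbour but not to the diagonal one. An edge classifier as described in step 3 would then label each of these edges, for example as same-row, same-column, or unrelated; those label names are also an assumption, since the record does not list the classes.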