Unstructured document format recognition method based on image data processing

Bibliographic details
Main authors: ZHANG DAPING, ZHOU CHUANG, JIN ZHENGLEI
Format: Patent
Language: chi; eng
Description
Summary: The invention discloses an unstructured document format recognition method based on image data processing, comprising the following steps: S1, open and parse the file, converting the unstructured document format into a picture format; S2, perform angle correction on the picture obtained in S1, as follows: a) apply the Hough transform to the picture and detect the straight-line angle of each text line in the image. By correcting the converted picture so that it is horizontally and vertically aligned, the method greatly improves the recognition rate of the OCR text detection and recognition unit. The text recognized by that unit is then typeset, so that the recognized content stays consistent with the specifications and style of the original file.
Chinese original: 本发明公开了一种基于图像数据处理的非结构化文档格式识别方法,包括以下步骤:S1、打开文件并解析,将非结构化的文档格式转换为图片格式;S2、将S1获取到的图片进行角度校正,具体流程如下:a)对图片使用霍夫变换,检测出图像中各文本行直线角度。本发明通过将转换的图片进行矫正,使图片处于横平竖直状态,大大提高了OCR文
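
Step S2 a) names the Hough transform as the way to detect the angle of each text line. The following is a minimal sketch of that idea, not the patented implementation: it assumes the picture has already been binarized into a list of foreground pixel coordinates, and it searches only a small range of candidate skew angles. The function names `estimate_skew_angle` and `synthetic_text_lines`, and the parameters `max_angle` and `step`, are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def estimate_skew_angle(points, max_angle=15.0, step=0.5):
    """Estimate the skew (degrees) of text lines with a small Hough-style vote.

    points: iterable of (x, y) foreground pixel coordinates (assumed input).
    For each candidate angle, every point votes for the offset `rho` of the
    line through it at that angle. The angle whose accumulator has the
    strongest single peak is returned.
    """
    points = list(points)
    if not points:
        return 0.0  # nothing to measure; assume no skew

    best_angle, best_votes = 0.0, -1
    n_steps = int(round(2 * max_angle / step))
    for i in range(n_steps + 1):
        angle = -max_angle + i * step
        a = math.radians(angle)
        sin_a, cos_a = math.sin(a), math.cos(a)
        # Points on one line at this angle share (almost) the same rho,
        # so they pile up in a single accumulator bin.
        votes = Counter(round(-x * sin_a + y * cos_a) for x, y in points)
        peak = max(votes.values())
        if peak > best_votes:
            best_angle, best_votes = angle, peak
    return best_angle


def synthetic_text_lines(angle_deg, width=200, baselines=(20, 60, 100)):
    """Build test data: pixel coordinates of straight baselines at a given angle."""
    slope = math.tan(math.radians(angle_deg))
    return [(x, round(x * slope + c)) for c in baselines for x in range(width)]


if __name__ == "__main__":
    skewed = synthetic_text_lines(5.0)
    print("estimated skew (degrees):", estimate_skew_angle(skewed))
```

The angle is measured in pixel coordinates, with y increasing downward. Rotating the picture by the negative of the estimated angle would bring the text lines back to horizontal. A production system would more likely rely on an image library's Hough line detector than on a hand-written accumulator like this one.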