Paragraph information restoration method after paper document electronization

Bibliographic details
Main authors: JI DAQI, FENG JIANI, JI CHUANJUN, WU JUNJIE, CHEN YUNWEN, SHANG YAMENG, LIU YOUMIN
Format: Patent
Language: Chinese; English
Description
Summary: The invention discloses a method for restoring paragraph information after a paper document has been digitized. Multi-column text in a double-layer PDF (Portable Document Format) document is first split in a pre-processing step based on the position coordinates of its characters. The boundaries of each text region are then corrected and located using the least squares method and projection segmentation, based on the existing coordinate information and related characteristics of the characters, so that the paragraph information of the double-layer PDF document can be restored. The method comprises the following steps: determining the positions where multi-column text needs to be segmented; calculating the paragraph boundaries within each region from the relationship between lines and text information; generating new paragraphs from the text region boundaries and the paragraph boundaries; assembling the paragraphs into regions and the regions into a new article; and finally reali
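
The two techniques named in the summary, projection segmentation and least squares fitting, can be illustrated with a minimal Python sketch. It is not the patented procedure: the record does not give the details, so the function names (find_column_gaps, fit_boundary), the (x0, y0, x1, y1) box layout, the integer page width, and the min_gap threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical character bounding box: (x0, y0, x1, y1) in page units.
Box = Tuple[float, float, float, float]


def find_column_gaps(boxes: List[Box], page_width: int,
                     min_gap: int = 10) -> List[Tuple[int, int]]:
    """Projection segmentation along the x axis (illustrative only).

    Marks every integer x position covered by at least one character box,
    then returns the uncovered runs, as (start, end) with end exclusive,
    that lie between the leftmost and rightmost covered positions and are
    at least min_gap wide. Such runs are candidate column separators.
    """
    covered = [False] * page_width
    for x0, _, x1, _ in boxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            covered[x] = True
    if not any(covered):
        return []

    first = covered.index(True)
    last = page_width - 1 - covered[::-1].index(True)

    gaps = []
    run_start = None
    for x in range(first, last + 1):
        if not covered[x]:
            if run_start is None:
                run_start = x  # an empty run begins here
        elif run_start is not None:
            if x - run_start >= min_gap:
                gaps.append((run_start, x))
            run_start = None
    return gaps


def fit_boundary(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least squares fit of x = a * y + b (illustrative only).

    points holds (x, y) positions, for example the left edge of the first
    character on each line. The fitted line estimates a region boundary
    even when the scan is slightly skewed.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if syy == 0:
        # All points share one y value: fall back to a vertical boundary.
        return 0.0, mean_x
    sxy = sum((y - mean_y) * (x - mean_x) for x, y in points)
    a = sxy / syy
    return a, mean_x - a * mean_y
```

In this sketch a two-column page would be split at the widest empty run returned by find_column_gaps, and fit_boundary would then be applied separately to the line-start positions of each column.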