Paragraph information restoration method after paper document electronization
Main authors: | , , , , , , |
---|---|
Format: | Patent |
Language: | Chinese ; English |
Subjects: | |
Summary: | The invention discloses a method for restoring paragraph information after paper document electronization. Multi-column text is first split in a pre-processing step according to the position coordinates of the characters in a double-layer PDF (Portable Document Format) document; the boundaries of each character region are then corrected and located with the least squares method and the projection segmentation method, based on the existing coordinate information and related characteristics of the characters, so as to restore the paragraph information of the double-layer PDF document. The method comprises the following steps: determining the positions where the multi-column text needs to be segmented; calculating the paragraph boundaries within each region from the relationship between rows and text information; generating new paragraphs from the text region boundaries and the paragraph boundaries; forming regions from the paragraphs and a new article from the regions; and finally reali |
---|
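
The summary names three ingredients: projection segmentation to split columns, a least squares fit to correct region boundaries, and a row-based rule for paragraph boundaries. The record does not give the patent's actual formulas, so the following is only a minimal sketch of how such steps could look. All function names, the `(x0, y0, x1, y1)` character box layout, the `min_gap` and `indent` thresholds, and the indentation rule for paragraph starts are assumptions made for illustration, not details taken from the patent.

```python
from typing import List, Tuple

# Hypothetical layout: (x0, y0, x1, y1) bounding box of one character.
Box = Tuple[float, float, float, float]


def find_column_gaps(boxes: List[Box], page_width: int, min_gap: int = 10) -> List[int]:
    """Projection segmentation: project boxes onto the x axis and return
    the centre of every empty run at least `min_gap` pixels wide."""
    profile = [0] * page_width
    for x0, _, x1, _ in boxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1) + 1)):
            profile[x] += 1
    occupied = [x for x, count in enumerate(profile) if count > 0]
    if not occupied:
        return []
    gaps = []
    start = None  # first x of the current empty run, if any
    for x in range(occupied[0], occupied[-1] + 1):
        if profile[x] == 0:
            if start is None:
                start = x
        else:
            if start is not None and x - start >= min_gap:
                gaps.append((start + x) // 2)
            start = None
    return gaps


def fit_left_boundary(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least squares fit of x = a * y + b through (y, x_left) samples,
    so a slightly skewed scan still gets a straight region boundary."""
    n = len(points)
    sy = sum(y for y, _ in points)
    sx = sum(x for _, x in points)
    syy = sum(y * y for y, _ in points)
    syx = sum(y * x for y, x in points)
    denom = n * syy - sy * sy
    if denom == 0:  # all samples share one y: fall back to the mean x
        return 0.0, sx / n
    a = (n * syx - sy * sx) / denom
    b = (sx - a * sy) / n
    return a, b


def paragraph_starts(line_lefts: List[Tuple[float, float]], a: float, b: float,
                     indent: float = 8.0) -> List[bool]:
    """Assumed heuristic: a row starts a paragraph when its left edge lies
    more than `indent` to the right of the fitted boundary."""
    return [x - (a * y + b) > indent for y, x in line_lefts]
```

A possible use, under the same assumptions: call `find_column_gaps` once per page to decide where to cut, fit a boundary per resulting region from the left edges of its flush rows, then pass every row of that region to `paragraph_starts` to mark where new paragraphs begin.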