Method and device for extracting text content from damaged doc document and medium
Main authors: | , , , |
---|---|
Format: | Patent |
Language: | chi ; eng |
Abstract: | The invention discloses a method, a device, and a medium for extracting text content from a damaged doc document. The method comprises the steps of: obtaining the binary data of the damaged doc document and writing it into a first memory buffer; acquiring an initial offset position of the binary data; scanning the binary data from the initial offset position to obtain a plurality of readable Unicode text blocks; writing the readable Unicode text blocks sequentially into a second memory buffer; scanning the readable Unicode text blocks in the second memory buffer in sequence for non-display data; and, after the non-display data is cleared, splicing the readable Unicode text blocks in sequence to obtain the text data. According to the method, a COM interface of the system does not need to be called, the file format of the damaged doc document does not need to be parsed, and a storage stream in the damaged doc document does not need to be read, scanning and an |
---|
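The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the UTF-16LE encoding, the 0x200 starting offset, the minimum run length, and the character ranges counted as "readable" are all assumptions chosen for illustration. The sketch also only checks 2-byte-aligned positions, which a real scanner might not be able to rely on.

```python
import unicodedata
from typing import List

# Assumed heuristic: control codes that may sit inside a text run.
_TEXT_CONTROLS = {0x07, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}


def is_readable(cp: int) -> bool:
    """Decide whether a UTF-16 code unit looks like document text (assumed ranges)."""
    if cp in _TEXT_CONTROLS or 0x20 <= cp <= 0x7E:
        return True
    # CJK punctuation, CJK unified ideographs, fullwidth forms.
    return 0x3000 <= cp <= 0x303F or 0x4E00 <= cp <= 0x9FFF or 0xFF00 <= cp <= 0xFFEF


def scan_text_blocks(data: bytes, start: int = 0x200, min_chars: int = 4) -> List[str]:
    """Scan from `start` and collect runs of readable UTF-16LE code units."""
    blocks: List[str] = []   # plays the role of the second memory buffer
    current: List[str] = []
    for pos in range(start, len(data) - 1, 2):
        cp = int.from_bytes(data[pos:pos + 2], "little")
        if is_readable(cp):
            current.append(chr(cp))
            continue
        if len(current) >= min_chars:
            blocks.append("".join(current))
        current = []
    if len(current) >= min_chars:
        blocks.append("".join(current))
    return blocks


def strip_non_display(block: str) -> str:
    """Map carriage returns to newlines and drop other control characters."""
    block = block.replace("\r", "\n")
    return "".join(
        ch for ch in block
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


def extract_text(raw: bytes, start: int = 0x200) -> str:
    """Run the whole pipeline: buffer, scan, clean, splice."""
    first_buffer = bytes(raw)  # first memory buffer holding the binary data
    blocks = scan_text_blocks(first_buffer, start)
    return "".join(strip_non_display(b) for b in blocks)
```

A caller would read the damaged file with `open(path, "rb").read()` and pass the bytes to `extract_text`. Because the sketch never parses the container structure, it keeps working when that structure is corrupt, at the cost of possibly picking up stray byte runs that happen to look like text.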