Method and device for extracting text content from damaged doc document and medium


Bibliographic details
Main authors: HUANG YINGXIN, YANG CHUNBAI, LIU HAIFENG, LIU YUN
Format: Patent
Language: chi; eng
Description
Summary: The invention discloses a method, a device, and a medium for extracting text content from a damaged doc document. The method comprises the following steps: obtaining the binary data of the damaged doc document and writing it into a first memory buffer; acquiring an initial offset position in the binary data; scanning the binary data from the initial offset position to obtain a plurality of readable Unicode text blocks; writing the readable Unicode text blocks sequentially into a second memory buffer; scanning the readable Unicode text blocks in the second memory buffer in sequence for non-display data; and, after the non-display data is cleared, splicing the readable Unicode text blocks in sequence to obtain the text data. With this method, no system COM interface needs to be called, the file format of the damaged doc document does not need to be parsed, and no storage stream in the damaged doc document needs to be read, scanning and an
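
The steps listed in the summary can be illustrated with a minimal Python sketch. It is not the patented implementation: the record does not say how the initial offset is chosen, which code points count as readable, or which characters are treated as non-display data. The sketch therefore assumes UTF-16LE text aligned to the start offset, a readable range limited to printable ASCII, common whitespace, and CJK Unified Ideographs, a minimum block length of four characters, and a small hypothetical set of control codes (NON_DISPLAY) to strip. All function names and constants are illustrative.

```python
# Illustrative sketch only; every name and threshold here is an assumption,
# not something specified in the catalog record.

# Assumed set of control codes that may sit inside text but are not displayed.
NON_DISPLAY = {0x07, 0x0B, 0x0C, 0x13, 0x14, 0x15}
MIN_RUN = 4  # assumed minimum number of characters for a block to be kept


def is_readable(cp: int) -> bool:
    """Assumed test: tab/LF/CR, printable ASCII, CJK ideographs, or NON_DISPLAY."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or cp in NON_DISPLAY
        or 0x20 <= cp <= 0x7E
        or 0x4E00 <= cp <= 0x9FFF
    )


def scan_text_blocks(data: bytes, start_offset: int = 0,
                     min_run: int = MIN_RUN) -> list[str]:
    """Scan 16-bit little-endian code units and collect runs of readable ones."""
    blocks: list[str] = []
    run: list[str] = []
    pos = start_offset
    while pos + 1 < len(data):
        cp = data[pos] | (data[pos + 1] << 8)
        if is_readable(cp):
            run.append(chr(cp))
        else:
            if len(run) >= min_run:
                blocks.append("".join(run))
            run = []
        pos += 2
    if len(run) >= min_run:  # flush a run that reaches the end of the data
        blocks.append("".join(run))
    return blocks


def strip_non_display(block: str) -> str:
    """Remove the assumed non-display control codes from one text block."""
    return "".join(ch for ch in block if ord(ch) not in NON_DISPLAY)


def extract_text(data: bytes, start_offset: int = 0) -> str:
    """Run the whole pipeline: scan, buffer, clean, and splice in order."""
    first_buffer = bytes(data)  # first memory buffer: the raw binary data
    second_buffer = scan_text_blocks(first_buffer, start_offset)  # text blocks
    cleaned = [strip_non_display(block) for block in second_buffer]
    return "".join(cleaned)
```

A caller would read the damaged file in binary mode and pass the bytes to extract_text, optionally with a start offset. Because the scan treats the file as raw bytes, it relies on no COM interface and no parsing of the container format, which matches the property the summary claims; the quality of the result depends heavily on the assumed readable range and minimum block length.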