A visual analysis approach for data transformation via domain knowledge and intelligent models

Bibliographic details
Published in: Multimedia Systems, 2024-06, Vol. 30 (3), Article 126
Main authors: Zhu, Haiyang; Yin, Jun; Chu, Chengcan; Zhu, Minfeng; Wei, Yating; Pan, Jiacheng; Han, Dongming; Tan, Xuwei; Chen, Wei
Format: Article
Language: English
Description
Abstract: Industry benchmarking involves comparing and analyzing a company’s performance with that of other top-performing enterprises. PDF documents contain valuable corporate information, but their non-editable nature makes data extraction complex. This study focuses on converting unstructured data from PDF documents, including tables, images, and text, to a structured format that is suitable for analysis and decision-making. The methods that are currently used for PDF document conversion primarily involve manual extraction, PDF converters, and artificial intelligence algorithms. However, they are often restricted to processing a single modality, have limitations in dealing with complex structured tables, or cannot achieve the required accuracy in practice. Specifically, this study converts the periodic reports of listed companies from PDF format to structured data. We propose a unified framework for extracting tables, images, and text by parsing PDF documents into constituent objects. We introduce three bespoke algorithms to process complex structured tables and develop a prototype visual analysis system that combines AI for automated data extraction with the domain knowledge of human experts for auditing. Quantitative and qualitative experiments are conducted to validate the methodology’s superiority, including its efficiency, quality, and user-friendliness.
ISSN: 0942-4962, 1432-1882
DOI: 10.1007/s00530-024-01331-x