Document Information Extraction: An Analysis of Invoice Anatomy

Bibliographic details
Published in: Applied Computational Intelligence and Soft Computing, 2024-06, Vol. 2024 (1)
Main authors: Hamri, Mouad; Devanne, Maxime; Weber, Jonathan; Hassenforder, Michel
Format: Article
Language: English
Description
Abstract: In this paper, we present a new approach to document information extraction based on studying document anatomy: for each document component, we investigate the variants and forms it can take. To overcome the lack of publicly available document datasets, we used a generated invoice database for which we designed 9 different templates and generated 100 samples per template, with the documents annotated automatically during the generation process. We analysed the following invoice components: the dates block (invoice date and invoice due date), the address block, the amounts block (tax-free amount, tax amount, and total amount), and the lines block (lines table), investigating the impact of training our model on various block variants. We conducted several experiments comparing the results obtained when testing on templates that included variants not encountered during training with the results obtained after adding those variants to the training dataset. This allowed us to analyse the improvement in results after adding these previously unseen variants. The results show that the model generalises better when trained on a large variety of cases and achieves remarkable performance. We also conducted experiments on various models to highlight the model-agnostic character of our proposed approach. This methodology achieves strong performance even with models that have significantly fewer parameters than recently published models with millions of parameters.
ISSN: 1687-9724, 1687-9732
DOI: 10.1155/2024/7599415