EXTRACTING SEARCHABLE INFORMATION FROM DIGITIZED DOCUMENT

Bibliographic Details
Main Authors: SAMPAT NIRAV, JAIN ASHISH, MAHAPATRA SUVENDU KUMAR, KRISHNAN ARAVIND, MANI REKHA, VISWANATHAN KUMAR, KOTNALA RAHUL, GHATAGE PRAKASH, NARAYANAN SRIKANTH, LAKSHMINARAYANAN KAMESHKUMAR
Format: Patent
Language: Chinese; English
Description
Summary: Embodiments of the present disclosure relate to extracting searchable information from digitized documents. Data extraction and automatic validation from digitized documents in non-editable formats is disclosed. Paper documents are digitized or converted into formats suitable for storage on computers or other digital devices. The digitized documents are classified into one of a plurality of document types and, based on the document type, document processing rules are selected for analyzing the digitized documents to enable data extraction and automatic validation. The positions and values of the data fields in the digitized documents are obtained using machine learning techniques. The data field values are automatically validated and assigned confidence scores. Data fields with low confidence scores are flagged for manual review.
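
The flow described in the summary (classify the document, select rules for its type, locate and read the data fields, validate them, score them, and flag low scores) can be illustrated with a minimal sketch. This is not the patented method: the record does not specify any model, rule format, or threshold, so everything named here is hypothetical. A keyword check stands in for the document classifier, regular expressions stand in for the machine learning field extractor, and `RULES`, `REVIEW_THRESHOLD`, and the two fixed confidence values are assumptions chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical processing rules per document type:
# field name -> (extraction pattern, validation pattern).
RULES = {
    "invoice": {
        "invoice_number": (r"Invoice\s*No[.:]?\s*(\S+)", r"[A-Z]{3}-\d{4}"),
        "total": (r"Total:?\s*(\S+)", r"\d+\.\d{2}"),
    },
    "receipt": {
        "amount": (r"Amount:?\s*(\S+)", r"\d+\.\d{2}"),
    },
}

# Assumed cut-off below which a field goes to manual review.
REVIEW_THRESHOLD = 0.8


@dataclass
class Field:
    name: str
    value: str
    position: int        # character offset in the digitized text, -1 if not found
    confidence: float
    needs_review: bool


def classify(text: str) -> str:
    """Keyword stand-in for a trained document-type classifier."""
    return "invoice" if "invoice" in text.lower() else "receipt"


def extract_fields(text: str) -> tuple[str, list[Field]]:
    """Classify the text, apply the rules for its type, validate and score each field."""
    doc_type = classify(text)
    fields = []
    for name, (extract_re, valid_re) in RULES[doc_type].items():
        match = re.search(extract_re, text, flags=re.IGNORECASE)
        if match is None:
            # A missing field gets zero confidence and is always flagged.
            fields.append(Field(name, "", -1, 0.0, True))
            continue
        value = match.group(1)
        # Assumed scoring: high if the value passes validation, low otherwise.
        confidence = 0.95 if re.fullmatch(valid_re, value) else 0.40
        fields.append(
            Field(name, value, match.start(1), confidence,
                  confidence < REVIEW_THRESHOLD)
        )
    return doc_type, fields


if __name__ == "__main__":
    # The total contains the letter "O" in place of a zero, a typical OCR error.
    doc_type, fields = extract_fields("Invoice No: ABC-1234\nTotal: 12O.50")
    print(doc_type)
    for field in fields:
        print(field)
```

In this sketch the misread total fails validation, receives the low score, and is flagged for manual review, while the well-formed invoice number passes. A real system would replace the fixed scores with model-derived probabilities.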