Handling tree-structured text: parsing directory pages

The determination of the reading sequence of text is fundamental to document understanding. This problem is easily solved in pages where the text is organized into a sequence of lines and vertical alignment runs the height of the page (producing multiple columns which can be read from left to right)...

Bibliographic details
Published in:	arXiv.org 2021-11
Main authors:	Shrivastava, Sarang; Shaikh, Afreen; Shrivastava, Shivani; Chung, Ming Ho; Reddy, Pradeep; Saraswat, Vijay
Format:	Article
Language:	eng
Subjects:	Columns (structural); Orientations; Reading; Segments; Structural hierarchy
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Shrivastava, Sarang ; Shaikh, Afreen ; Shrivastava, Shivani ; Chung, Ming Ho ; Reddy, Pradeep ; Saraswat, Vijay
description	The determination of the reading sequence of text is fundamental to document understanding. This problem is easily solved in pages where the text is organized into a sequence of lines and vertical alignment runs the height of the page (producing multiple columns which can be read from left to right). We present a situation -- the directory page parsing problem -- where information is presented on the page in an irregular, visually-organized, two-dimensional format. Directory pages are fairly common in financial prospectuses and carry information about organizations, their addresses and relationships that is key to business tasks in client onboarding. Interestingly, directory pages sometimes have hierarchical structure, motivating the need to generalize the reading sequence to a reading tree. We present solutions to the problem of identifying directory pages and constructing the reading tree, using (learnt) classifiers for text segments and a bottom-up (right to left, bottom-to-top) traversal of segments. The solution is a key part of a production service supporting automatic extraction of organization, address and relationship information from client onboarding documents.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2602336617</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2602336617</sourcerecordid><originalsourceid>FETCH-proquest_journals_26023366173</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mQw80jMS8nJzEtXKClKTdUtLikqTS4pLUpNUShJrSixUihILCoGyaZkFqUml-QXVQJF0lOLeRhY0xJzilN5oTQ3g7Kba4izh25BUX5haWpxSXxWfmlRHlAq3sjMAGi_mZmhuTFxqgAJ9zXS</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2602336617</pqid></control><display><type>article</type><title>Handling tree-structured text: parsing directory pages</title><source>Free E- Journals</source><creator>Shrivastava, Sarang ; Shaikh, Afreen ; Shrivastava, Shivani ; Chung, Ming Ho ; Reddy, Pradeep ; Saraswat, Vijay</creator><creatorcontrib>Shrivastava, Sarang ; Shaikh, Afreen ; Shrivastava, Shivani ; Chung, Ming Ho ; Reddy, Pradeep ; Saraswat, Vijay</creatorcontrib><description>The determination of the reading sequence of text is fundamental to document understanding. This problem is easily solved in pages where the text is organized into a sequence of lines and vertical alignment runs the height of the page (producing multiple columns which can be read from left to right). We present a situation -- the directory page parsing problem -- where information is presented on the page in an irregular, visually-organized, two-dimensional format. Directory pages are fairly common in financial prospectuses and carry information about organizations, their addresses and relationships that is key to business tasks in client onboarding. Interestingly, directory pages sometimes have hierarchical structure, motivating the need to generalize the reading sequence to a reading tree. 
We present solutions to the problem of identifying directory pages and constructing the reading tree, using (learnt) classifiers for text segments and a bottom-up (right to left, bottom-to-top) traversal of segments. The solution is a key part of a production service supporting automatic extraction of organization, address and relationship information from client onboarding documents.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Columns (structural) ; Orientations ; Reading ; Segments ; Structural hierarchy</subject><ispartof>arXiv.org, 2021-11</ispartof><rights>2021. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>778,782</link.rule.ids></links><search><creatorcontrib>Shrivastava, Sarang</creatorcontrib><creatorcontrib>Shaikh, Afreen</creatorcontrib><creatorcontrib>Shrivastava, Shivani</creatorcontrib><creatorcontrib>Chung, Ming Ho</creatorcontrib><creatorcontrib>Reddy, Pradeep</creatorcontrib><creatorcontrib>Saraswat, Vijay</creatorcontrib><title>Handling tree-structured text: parsing directory pages</title><title>arXiv.org</title><description>The determination of the reading sequence of text is fundamental to document understanding. This problem is easily solved in pages where the text is organized into a sequence of lines and vertical alignment runs the height of the page (producing multiple columns which can be read from left to right). 
We present a situation -- the directory page parsing problem -- where information is presented on the page in an irregular, visually-organized, two-dimensional format. Directory pages are fairly common in financial prospectuses and carry information about organizations, their addresses and relationships that is key to business tasks in client onboarding. Interestingly, directory pages sometimes have hierarchical structure, motivating the need to generalize the reading sequence to a reading tree. We present solutions to the problem of identifying directory pages and constructing the reading tree, using (learnt) classifiers for text segments and a bottom-up (right to left, bottom-to-top) traversal of segments. The solution is a key part of a production service supporting automatic extraction of organization, address and relationship information from client onboarding documents.</description><subject>Columns (structural)</subject><subject>Orientations</subject><subject>Reading</subject><subject>Segments</subject><subject>Structural hierarchy</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mQw80jMS8nJzEtXKClKTdUtLikqTS4pLUpNUShJrSixUihILCoGyaZkFqUml-QXVQJF0lOLeRhY0xJzilN5oTQ3g7Kba4izh25BUX5haWpxSXxWfmlRHlAq3sjMAGi_mZmhuTFxqgAJ9zXS</recordid><startdate>20211124</startdate><enddate>20211124</enddate><creator>Shrivastava, Sarang</creator><creator>Shaikh, Afreen</creator><creator>Shrivastava, Shivani</creator><creator>Chung, Ming Ho</creator><creator>Reddy, Pradeep</creator><creator>Saraswat, Vijay</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20211124</creationdate><title>Handling tree-structured text: parsing directory pages</title><author>Shrivastava, Sarang ; Shaikh, Afreen ; Shrivastava, Shivani ; Chung, Ming Ho ; Reddy, Pradeep ; Saraswat, Vijay</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_26023366173</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Columns (structural)</topic><topic>Orientations</topic><topic>Reading</topic><topic>Segments</topic><topic>Structural hierarchy</topic><toplevel>online_resources</toplevel><creatorcontrib>Shrivastava, Sarang</creatorcontrib><creatorcontrib>Shaikh, Afreen</creatorcontrib><creatorcontrib>Shrivastava, Shivani</creatorcontrib><creatorcontrib>Chung, Ming Ho</creatorcontrib><creatorcontrib>Reddy, Pradeep</creatorcontrib><creatorcontrib>Saraswat, Vijay</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering 
Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Shrivastava, Sarang</au><au>Shaikh, Afreen</au><au>Shrivastava, Shivani</au><au>Chung, Ming Ho</au><au>Reddy, Pradeep</au><au>Saraswat, Vijay</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Handling tree-structured text: parsing directory pages</atitle><jtitle>arXiv.org</jtitle><date>2021-11-24</date><risdate>2021</risdate><eissn>2331-8422</eissn><abstract>The determination of the reading sequence of text is fundamental to document understanding. This problem is easily solved in pages where the text is organized into a sequence of lines and vertical alignment runs the height of the page (producing multiple columns which can be read from left to right). We present a situation -- the directory page parsing problem -- where information is presented on the page in an irregular, visually-organized, two-dimensional format. Directory pages are fairly common in financial prospectuses and carry information about organizations, their addresses and relationships that is key to business tasks in client onboarding. Interestingly, directory pages sometimes have hierarchical structure, motivating the need to generalize the reading sequence to a reading tree. We present solutions to the problem of identifying directory pages and constructing the reading tree, using (learnt) classifiers for text segments and a bottom-up (right to left, bottom-to-top) traversal of segments. 
The solution is a key part of a production service supporting automatic extraction of organization, address and relationship information from client onboarding documents.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2021-11
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2602336617
source	Free E- Journals
subjects	Columns (structural) ; Orientations ; Reading ; Segments ; Structural hierarchy
title	Handling tree-structured text: parsing directory pages
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T21%3A25%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Handling%20tree-structured%20text:%20parsing%20directory%20pages&rft.jtitle=arXiv.org&rft.au=Shrivastava,%20Sarang&rft.date=2021-11-24&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2602336617%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2602336617&rft_id=info:pmid/&rfr_iscdi=true