Heuristic based script identification from multilingual text documents

A multilingual document may contain text in more than one language. In a multilingual country like India, a document often has to include text in different languages in order to reach a larger cross section of people. This, however, causes practical d...

Detailed description

Bibliographic details
Main authors:	Das, M. S. ; Rani, D. S. ; Reddy, C. R. K.
Format:	Conference proceeding
Language:	eng
Subjects:	Feature extraction ; Gabor filters ; Information technology ; Internet ; OCR ; Optical character recognition software ; pipe density ; profiles ; Shape ; tick components ; Visual features ; Visualization
Online access:	Order full text

container_end_page	492
container_issue
container_start_page	487
container_title
container_volume
creator	Das, M. S. ; Rani, D. S. ; Reddy, C. R. K.
description	A multilingual document may contain text in more than one language. In a multilingual country like India, a document often has to include text in different languages in order to reach a larger cross section of people. This, however, makes such a document difficult to OCR, because the language of the text must be determined before a particular OCR (Optical Character Recognition) system is applied. It is perhaps impossible to design a single recognizer that can identify a large number of scripts/languages, so the language region of the document must be identified before the document is fed to the corresponding OCR system. Script identification aims to extract information presented in digital documents such as articles, newspapers, magazines and e-books, and this has given rise to many language identification systems. The objective of this paper is to propose a model that identifies the script type of different text portions using visual clues. Seven features, namely bottom max row, top horizontal lines, vertical lines, bottom components, tick components, top holes and bottom holes, are used to identify the script type. The experiments use multilingual documents containing Telugu, English and Hindi scripts and achieve an identification accuracy above 93%.
doi_str_mv	10.1109/RAIT.2012.6194627
format	Conference Proceeding
fullrecord	<record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_6194627</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6194627</ieee_id><sourcerecordid>6194627</sourcerecordid><originalsourceid>FETCH-LOGICAL-i90t-e652d5c64edacea4e14c3f8ec3be588ecc2715a9c42122ae571314697ec74e3b3</originalsourceid><addsrcrecordid>eNo1j9tKw0AURUekoLb5APFlfiDp3CfzWIq1hYIgeS-TyYkcyaVkJqB_b8C6XxYbNhsWIc-cFZwzt_3YnapCMC4Kw50ywt6RzNmSK20tM86qe_L0X5R5IFmMX2yJZbLU-pEcjjBPGBMGWvsIDY1hwmui2MCQsMXgE44Dbaexp_3cJexw-Jx9RxN8J9qMYe6XYdyQVeu7CNmNa1IdXqv9MT-_v532u3OOjqUcjBaNDkZB4wN4BVwF2ZYQZA26XBiE5dq7oAQXwoO2XHK1WECwCmQt1-Tl7xYB4HKdsPfTz-VmLn8BjfhOGw</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Heuristic based script identification from multilingual text documents</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Das, M. S. ; Rani, D. S. ; Reddy, C. R. K.</creator><creatorcontrib>Das, M. S. ; Rani, D. S. ; Reddy, C. R. K.</creatorcontrib><description>A multilingual document may contain text words in more than one language. In a multilingual country like India it is necessary that a document should be composed of text contents in different languages in order to reach a larger cross section of people, But on the other hand, this causes practical difficulty in OCRing such a document, because the language type of the text should be pre-determined, before employing a particular OCR (Optical Character Recognition). It is perhaps impossible to design a single recognizer which can identify a large number of scripts/languages. So, it is necessary to identify the language region of the document before feeding the document to the corresponding OCR system. Script identification aims to extract information presented in digital documents namely articles, newspapers, magazines and e-books. 
This has given rise to many language identification systems. The objective of this paper is to propose a model to identify script type of different text portions using visual clues. In this work seven feature namely bottom max row, top horizontal lines, vertical lines, bottom components, tick components, top holes and bottom holes have been used to identify the script type. In this work, multilingual documents with Telugu, English and Hindi scripts have been used. From the experimentation it is understood that the identification accuracy of above 93% is achieved.</description><identifier>ISBN: 1457706946</identifier><identifier>ISBN: 9781457706943</identifier><identifier>EISBN: 9781457706974</identifier><identifier>EISBN: 1457706970</identifier><identifier>EISBN: 9781457706967</identifier><identifier>EISBN: 1457706962</identifier><identifier>DOI: 10.1109/RAIT.2012.6194627</identifier><language>eng</language><publisher>IEEE</publisher><subject>Feature extraction ; Gabor filters ; Information technology ; Internet ; OCR ; Optical character recognition software ; pipe density ; profiles ; Shape ; tick components ; Visual features ; Visualization</subject><ispartof>2012 1st International Conference on Recent Advances in Information Technology (RAIT), 2012, p.487-492</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6194627$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,2058,27925,54920</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6194627$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Das, M. S.</creatorcontrib><creatorcontrib>Rani, D. S.</creatorcontrib><creatorcontrib>Reddy, C. R. 
K.</creatorcontrib><title>Heuristic based script identification from multilingual text documents</title><title>2012 1st International Conference on Recent Advances in Information Technology (RAIT)</title><addtitle>RAIT</addtitle><description>A multilingual document may contain text words in more than one language. In a multilingual country like India it is necessary that a document should be composed of text contents in different languages in order to reach a larger cross section of people, But on the other hand, this causes practical difficulty in OCRing such a document, because the language type of the text should be pre-determined, before employing a particular OCR (Optical Character Recognition). It is perhaps impossible to design a single recognizer which can identify a large number of scripts/languages. So, it is necessary to identify the language region of the document before feeding the document to the corresponding OCR system. Script identification aims to extract information presented in digital documents namely articles, newspapers, magazines and e-books. This has given rise to many language identification systems. The objective of this paper is to propose a model to identify script type of different text portions using visual clues. In this work seven feature namely bottom max row, top horizontal lines, vertical lines, bottom components, tick components, top holes and bottom holes have been used to identify the script type. In this work, multilingual documents with Telugu, English and Hindi scripts have been used. 
From the experimentation it is understood that the identification accuracy of above 93% is achieved.</description><subject>Feature extraction</subject><subject>Gabor filters</subject><subject>Information technology</subject><subject>Internet</subject><subject>OCR</subject><subject>Optical character recognition software</subject><subject>pipe density</subject><subject>profiles</subject><subject>Shape</subject><subject>tick components</subject><subject>Visual features</subject><subject>Visualization</subject><isbn>1457706946</isbn><isbn>9781457706943</isbn><isbn>9781457706974</isbn><isbn>1457706970</isbn><isbn>9781457706967</isbn><isbn>1457706962</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2012</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNo1j9tKw0AURUekoLb5APFlfiDp3CfzWIq1hYIgeS-TyYkcyaVkJqB_b8C6XxYbNhsWIc-cFZwzt_3YnapCMC4Kw50ywt6RzNmSK20tM86qe_L0X5R5IFmMX2yJZbLU-pEcjjBPGBMGWvsIDY1hwmui2MCQsMXgE44Dbaexp_3cJexw-Jx9RxN8J9qMYe6XYdyQVeu7CNmNa1IdXqv9MT-_v532u3OOjqUcjBaNDkZB4wN4BVwF2ZYQZA26XBiE5dq7oAQXwoO2XHK1WECwCmQt1-Tl7xYB4HKdsPfTz-VmLn8BjfhOGw</recordid><startdate>201203</startdate><enddate>201203</enddate><creator>Das, M. S.</creator><creator>Rani, D. S.</creator><creator>Reddy, C. R. K.</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>201203</creationdate><title>Heuristic based script identification from multilingual text documents</title><author>Das, M. S. ; Rani, D. S. ; Reddy, C. R. 
K.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i90t-e652d5c64edacea4e14c3f8ec3be588ecc2715a9c42122ae571314697ec74e3b3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2012</creationdate><topic>Feature extraction</topic><topic>Gabor filters</topic><topic>Information technology</topic><topic>Internet</topic><topic>OCR</topic><topic>Optical character recognition software</topic><topic>pipe density</topic><topic>profiles</topic><topic>Shape</topic><topic>tick components</topic><topic>Visual features</topic><topic>Visualization</topic><toplevel>online_resources</toplevel><creatorcontrib>Das, M. S.</creatorcontrib><creatorcontrib>Rani, D. S.</creatorcontrib><creatorcontrib>Reddy, C. R. K.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Das, M. S.</au><au>Rani, D. S.</au><au>Reddy, C. R. 
K.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Heuristic based script identification from multilingual text documents</atitle><btitle>2012 1st International Conference on Recent Advances in Information Technology (RAIT)</btitle><stitle>RAIT</stitle><date>2012-03</date><risdate>2012</risdate><spage>487</spage><epage>492</epage><pages>487-492</pages><isbn>1457706946</isbn><isbn>9781457706943</isbn><eisbn>9781457706974</eisbn><eisbn>1457706970</eisbn><eisbn>9781457706967</eisbn><eisbn>1457706962</eisbn><abstract>A multilingual document may contain text words in more than one language. In a multilingual country like India it is necessary that a document should be composed of text contents in different languages in order to reach a larger cross section of people, But on the other hand, this causes practical difficulty in OCRing such a document, because the language type of the text should be pre-determined, before employing a particular OCR (Optical Character Recognition). It is perhaps impossible to design a single recognizer which can identify a large number of scripts/languages. So, it is necessary to identify the language region of the document before feeding the document to the corresponding OCR system. Script identification aims to extract information presented in digital documents namely articles, newspapers, magazines and e-books. This has given rise to many language identification systems. The objective of this paper is to propose a model to identify script type of different text portions using visual clues. In this work seven feature namely bottom max row, top horizontal lines, vertical lines, bottom components, tick components, top holes and bottom holes have been used to identify the script type. In this work, multilingual documents with Telugu, English and Hindi scripts have been used. 
From the experimentation it is understood that the identification accuracy of above 93% is achieved.</abstract><pub>IEEE</pub><doi>10.1109/RAIT.2012.6194627</doi><tpages>6</tpages></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISBN: 1457706946
ispartof	2012 1st International Conference on Recent Advances in Information Technology (RAIT), 2012, p.487-492
issn
language	eng
recordid	cdi_ieee_primary_6194627
source	IEEE Electronic Library (IEL) Conference Proceedings
subjects	Feature extraction ; Gabor filters ; Information technology ; Internet ; OCR ; Optical character recognition software ; pipe density ; profiles ; Shape ; tick components ; Visual features ; Visualization
title	Heuristic based script identification from multilingual text documents
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T05%3A09%3A01IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Heuristic%20based%20script%20identification%20from%20multilingual%20text%20documents&rft.btitle=2012%201st%20International%20Conference%20on%20Recent%20Advances%20in%20Information%20Technology%20(RAIT)&rft.au=Das,%20M.%20S.&rft.date=2012-03&rft.spage=487&rft.epage=492&rft.pages=487-492&rft.isbn=1457706946&rft.isbn_list=9781457706943&rft_id=info:doi/10.1109/RAIT.2012.6194627&rft_dat=%3Cieee_6IE%3E6194627%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=9781457706974&rft.eisbn_list=1457706970&rft.eisbn_list=9781457706967&rft.eisbn_list=1457706962&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6194627&rfr_iscdi=true