Domain-aware Evaluation of Named Entity Recognition Systems for Croatian

We provide an evaluation of the currently available named entity recognition systems for Croatian. The evaluation puts special emphasis on domain dependence. To this end, we manually annotated a dataset of approximately 1 million tokens of Croatian text from various domains within the newspaper tex...

Bibliographic details
Published in: Journal of computing and information technology, 2013-09, Vol. 21 (3), p. 195
Main authors: Agic, Zeljko; Bekavac, Bozo
Format: Article
Language: eng
Online access: Full text
container_end_page
container_issue 3
container_start_page 195
container_title Journal of computing and information technology
container_volume 21
creator Agic, Zeljko
Bekavac, Bozo
description We provide an evaluation of the currently available named entity recognition systems for Croatian. The evaluation puts special emphasis on domain dependence. To this end, we manually annotated a dataset of approximately 1 million tokens of Croatian text from various domains within the newspaper text genre. The dataset was annotated using a three-class named entity tag set, denoting personal names, locations and organizations. We give insight into feature selection, domain sensitivity and the effects of increasing training set size for statistical named entity recognition using the state-of-the-art Stanford NER system. We also sketch a comparison of publicly available named entity recognition systems for Croatian with respect to domain dependence, regardless of their underlying paradigms. Our top-performing system achieved an F1-score of 0.884 in a mixed-domain testing scenario, scoring 0.925 and 0.843 in the two domains separated for the experiment. The system shows consistent state-of-the-art scores for detecting names of persons, locations and organizations.
doi_str_mv 10.2498/cit.1002190
format Article
fullrecord <record>
<control><sourceid>gale_hrcak</sourceid><recordid>TN_cdi_hrcak_primary_oai_hrcak_srce_hr_110027</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A361943259</galeid><sourcerecordid>A361943259</sourcerecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control>
<display><type>article</type><title>Domain-aware Evaluation of Named Entity Recognition Systems for Croatian</title><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Alma/SFX Local Collection</source><creator>Agic, Zeljko ; Bekavac, Bozo</creator><identifier>ISSN: 1330-1136</identifier><identifier>EISSN: 1846-3908</identifier><identifier>DOI: 10.2498/cit.1002190</identifier><identifier>CODEN: CJCTEM</identifier><language>eng</language><publisher>Sveuciliste U Zagrebu</publisher><subject>Computational linguistics ; Croatian language ; domain dependence ; evaluation ; Language processing ; Methods ; named entity recognition ; Natural language interfaces ; Serbo-Croatian language ; text domain ; Text processing</subject><ispartof>Journal of computing and information technology, 2013-09, Vol.21 (3), p.195</ispartof><rights>COPYRIGHT 2013 Sveuciliste U Zagrebu</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa></display>
<search><general>Sveuciliste U Zagrebu</general><general>Fakultet elektrotehnike i računarstva Sveučilišta u Zagrebu</general></search>
<facets><collection>CrossRef</collection><collection>Gale In Context: Science</collection><collection>Hrcak: Portal of scientific journals of Croatia</collection></facets>
<addata><format>journal</format><genre>article</genre><ristype>JOUR</ristype><date>2013-09-01</date><volume>21</volume><issue>3</issue><spage>195</spage><pages>195-</pages><tpages>15</tpages></addata>
</record>
fulltext fulltext
identifier ISSN: 1330-1136
ispartof Journal of computing and information technology, 2013-09, Vol.21 (3), p.195
issn 1330-1136
1846-3908
language eng
recordid cdi_hrcak_primary_oai_hrcak_srce_hr_110027
source Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Alma/SFX Local Collection
subjects Computational linguistics
Croatian language
domain dependence
evaluation
Language processing
Methods
named entity recognition
Natural language interfaces
Serbo-Croatian language
text domain
Text processing
title Domain-aware Evaluation of Named Entity Recognition Systems for Croatian
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T07%3A26%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_hrcak&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Domain-aware%20Evaluation%20of%20Named%20Entity%20Recognition%20Systems%20for%20Croatian&rft.jtitle=Journal%20of%20computing%20and%20information%20technology&rft.au=Agic,%20Zeljko&rft.date=2013-09-01&rft.volume=21&rft.issue=3&rft.spage=195&rft.pages=195-&rft.issn=1330-1136&rft.eissn=1846-3908&rft.coden=CJCTEM&rft_id=info:doi/10.2498/cit.1002190&rft_dat=%3Cgale_hrcak%3EA361943259%3C/gale_hrcak%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_galeid=A361943259&rfr_iscdi=true
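
Note on the F1-scores quoted in the description: F1 is the harmonic mean of precision and recall. The sketch below shows one common way to compute it for named entities, assuming exact-match scoring of (start, end, label) spans. The record does not state how the authors matched entities, so the matching rule, the label strings PER, LOC and ORG, and the toy spans are all illustrative assumptions and are not data from the article.

```python
from typing import Set, Tuple

# An entity is (start_token, end_token, label). The labels mirror the
# three-class tag set named in the abstract; the strings are assumptions.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity],
                        predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match entity scoring: a prediction counts only if its span
    boundaries and its label both equal those of a gold entity."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (hypothetical, not taken from the article's dataset).
gold = {(0, 1, "PER"), (5, 5, "LOC"), (9, 11, "ORG"), (14, 14, "LOC")}
predicted = {(0, 1, "PER"), (5, 5, "ORG"), (9, 11, "ORG"), (14, 14, "LOC")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In this toy case three of the four predicted spans match a gold span exactly, so precision, recall and F1 all equal 0.75. The span at token 5 has the right boundaries but the wrong label, and under exact-match scoring it counts as both a false positive and a false negative.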