Information Theoretic Based Segments for Language Identification

In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The m...
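As an illustration of the segmentation step the abstract describes (splitting a phoneme or codebook sequence into larger units and scoring the best segmentation), the following is a minimal sketch and not the authors' implementation. It assumes a multigram model can be represented as a table of log-probabilities for variable-length units and that units are treated as independent. The names `best_segmentation`, `identify`, `TOY_MODELS`, and all probabilities are hypothetical; choosing the language whose model gives the highest best-segmentation score is likewise an assumption made for this example.

```python
import math

def best_segmentation(symbols, unit_logprob, max_len=3):
    """Viterbi search for the most probable split of `symbols` into units.

    `unit_logprob` maps tuples of symbols to log-probabilities (assumed given).
    Returns (list_of_units, log_probability_of_that_segmentation).
    """
    n = len(symbols)
    best = [-math.inf] * (n + 1)   # best[i]: best log-prob of symbols[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last unit
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            unit = tuple(symbols[start:end])
            if unit not in unit_logprob:
                continue
            score = best[start] + unit_logprob[unit]
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("no segmentation covers the sequence")
    # Walk the back-pointers to recover the units in order.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(symbols[start:end]))
        end = start
    segments.reverse()
    return segments, best[n]

def identify(symbols, models, max_len=3):
    """Pick the model whose best segmentation scores highest (an assumption)."""
    scores = {}
    for name, table in models.items():
        try:
            scores[name] = best_segmentation(symbols, table, max_len)[1]
        except ValueError:
            scores[name] = -math.inf
    return max(scores, key=scores.get)

# Hypothetical toy models; the units and probabilities are invented.
TOY_MODELS = {
    "lang_a": {("a",): math.log(0.2), ("b",): math.log(0.2),
               ("c",): math.log(0.1), ("a", "b"): math.log(0.4),
               ("b", "c"): math.log(0.1)},
    "lang_b": {("a",): math.log(0.3), ("b",): math.log(0.3),
               ("c",): math.log(0.3), ("b", "c"): math.log(0.1)},
}

segments, logp = best_segmentation(list("abc"), TOY_MODELS["lang_a"])
print(segments, math.exp(logp))
print(identify(list("abc"), TOY_MODELS))
```

With the toy table for `lang_a`, the split `ab | c` (0.4 × 0.1) beats `a | b | c` and `a | bc`, so the recovered units are longer than single symbols, which is the word-like behaviour the abstract refers to.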

Detailed description

Bibliographic details
Main authors:	Harbeck, Stefan; Ohler, Uwe; Nöth, Elmar; Niemann, Heinrich
Format:	Book chapter
Language:	English
Subjects:	Applied sciences; Artificial intelligence; Computer science, control theory, systems; Exact sciences and technology; Good Segmentation; Language Model; Minimum Description Length Principle; Recognition Rate; Speech and sound recognition and synthesis, Linguistics; Training Material
Online access:	Full text

container_end_page	192
container_issue
container_start_page	187
container_title
container_volume	1692
creator	Harbeck, Stefan; Ohler, Uwe; Nöth, Elmar; Niemann, Heinrich
description	In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.
doi_str_mv	10.1007/3-540-48239-3_34
format	Book Chapter
fullrecord	<record><control><sourceid>proquest_pasca</sourceid><recordid>TN_cdi_pascalfrancis_primary_1827545</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>EBC6486039_227_199</sourcerecordid><originalsourceid>FETCH-LOGICAL-p312t-ad1b1e4ec7aecbd4e00f6d46961467c81b970d0c8c49be5a482bf18e23cc54fd3</originalsourceid><addsrcrecordid>eNqNkElPwzAQhc0qCvTOMQeuKR6PY8c3FrFUqsQBOFuOMymBNil2euDf4y4_gLmM9OZ9M5rH2BXwCXCubzAvJM9lKdDkaFEesHNMylbAQzYCBZAjSnO0GygljdTHbMSRi9xoiadsZIqyEKABztg4xi-eCoVO6IjdTrumD0s3tH2XvX9SH2hofXbvItXZG82X1A0xS5Zs5rr52s0pm9ZJa5vWb6FLdtK4RaTxvl-wj6fH94eXfPb6PH24m-UrBDHkroYKSJLXjnxVS-K8UbVURoFU2pdQGc1r7ksvTUWFSw9WDZQk0PtCNjVesOvd3pWL3i2a4DrfRrsK7dKFXwul0IUskm2ys8U06eYUbNX339ECt5s8LdoUkt3GZzd5JkDu94b-Z01xsLQhfPoxuIX_dKuBQrRKloonRghtwZiE4X8w5Fpong6m86bEPyZehZI</addsrcrecordid><sourcetype>Index Database</sourcetype><iscdi>true</iscdi><recordtype>book_chapter</recordtype><pqid>EBC3072700_40_198</pqid></control><display><type>book_chapter</type><title>Information Theoretic Based Segments for Language Identification</title><source>Springer Books</source><creator>Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich</creator><contributor>Goos, G ; Sojka, Petr ; Siekmann, Jörg ; Carbonell, Jaime G ; Mautner, Pavel ; Matousek, Vaclav ; Ocelikova, Jana ; Sojka, Petr ; Mautner, Pavel ; Ocelíková, Jana ; Matousek, Václav</contributor><creatorcontrib>Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich ; Goos, G ; Sojka, Petr ; Siekmann, Jörg ; Carbonell, Jaime G ; Mautner, Pavel ; Matousek, Vaclav ; Ocelikova, Jana ; Sojka, Petr ; Mautner, Pavel ; Ocelíková, Jana ; Matousek, Václav</creatorcontrib><description>In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. 
The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.</description><identifier>ISSN: 0302-9743</identifier><identifier>ISBN: 3540664947</identifier><identifier>ISBN: 9783540664949</identifier><identifier>EISSN: 1611-3349</identifier><identifier>EISBN: 3540482393</identifier><identifier>EISBN: 9783540482390</identifier><identifier>DOI: 10.1007/3-540-48239-3_34</identifier><identifier>OCLC: 958521711</identifier><identifier>OCLC: 1245674162</identifier><identifier>LCCallNum: TK5102.9TA1637-1638</identifier><language>eng</language><publisher>Germany: Springer Berlin / Heidelberg</publisher><subject>Applied sciences ; Artificial intelligence ; Computer science; control theory; systems ; Exact sciences and technology ; Good Segmentation ; Language Model ; Minimum Description Length Principle ; Recognition Rate ; Speech and sound recognition and synthesis. 
Linguistics ; Training Material</subject><ispartof>Lecture notes in computer science, 1999, Vol.1692, p.187-192</ispartof><rights>Springer-Verlag Berlin Heidelberg 1999</rights><rights>1999 INIST-CNRS</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><relation>Lecture Notes in Computer Science</relation></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Uhttps://ebookcentral.proquest.com/covers/3072700-l.jpg</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/3-540-48239-3_34$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/3-540-48239-3_34$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>309,310,775,776,780,785,786,789,4036,4037,27902,38232,41418,42487</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=1827545$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><contributor>Goos, G</contributor><contributor>Sojka, Petr</contributor><contributor>Siekmann, Jörg</contributor><contributor>Carbonell, Jaime G</contributor><contributor>Mautner, Pavel</contributor><contributor>Matousek, Vaclav</contributor><contributor>Ocelikova, Jana</contributor><contributor>Sojka, Petr</contributor><contributor>Mautner, Pavel</contributor><contributor>Ocelíková, Jana</contributor><contributor>Matousek, Václav</contributor><creatorcontrib>Harbeck, Stefan</creatorcontrib><creatorcontrib>Ohler, Uwe</creatorcontrib><creatorcontrib>Nöth, Elmar</creatorcontrib><creatorcontrib>Niemann, Heinrich</creatorcontrib><title>Information Theoretic Based Segments for Language Identification</title><title>Lecture notes in computer science</title><description>In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. 
In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.</description><subject>Applied sciences</subject><subject>Artificial intelligence</subject><subject>Computer science; control theory; systems</subject><subject>Exact sciences and technology</subject><subject>Good Segmentation</subject><subject>Language Model</subject><subject>Minimum Description Length Principle</subject><subject>Recognition Rate</subject><subject>Speech and sound recognition and synthesis. 
Linguistics</subject><subject>Training Material</subject><issn>0302-9743</issn><issn>1611-3349</issn><isbn>3540664947</isbn><isbn>9783540664949</isbn><isbn>3540482393</isbn><isbn>9783540482390</isbn><fulltext>true</fulltext><rsrctype>book_chapter</rsrctype><creationdate>1999</creationdate><recordtype>book_chapter</recordtype><recordid>eNqNkElPwzAQhc0qCvTOMQeuKR6PY8c3FrFUqsQBOFuOMymBNil2euDf4y4_gLmM9OZ9M5rH2BXwCXCubzAvJM9lKdDkaFEesHNMylbAQzYCBZAjSnO0GygljdTHbMSRi9xoiadsZIqyEKABztg4xi-eCoVO6IjdTrumD0s3tH2XvX9SH2hofXbvItXZG82X1A0xS5Zs5rr52s0pm9ZJa5vWb6FLdtK4RaTxvl-wj6fH94eXfPb6PH24m-UrBDHkroYKSJLXjnxVS-K8UbVURoFU2pdQGc1r7ksvTUWFSw9WDZQk0PtCNjVesOvd3pWL3i2a4DrfRrsK7dKFXwul0IUskm2ys8U06eYUbNX339ECt5s8LdoUkt3GZzd5JkDu94b-Z01xsLQhfPoxuIX_dKuBQrRKloonRghtwZiE4X8w5Fpong6m86bEPyZehZI</recordid><startdate>1999</startdate><enddate>1999</enddate><creator>Harbeck, Stefan</creator><creator>Ohler, Uwe</creator><creator>Nöth, Elmar</creator><creator>Niemann, Heinrich</creator><general>Springer Berlin / Heidelberg</general><general>Springer Berlin Heidelberg</general><general>Springer</general><scope>FFUUA</scope><scope>IQODW</scope></search><sort><creationdate>1999</creationdate><title>Information Theoretic Based Segments for Language Identification</title><author>Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-p312t-ad1b1e4ec7aecbd4e00f6d46961467c81b970d0c8c49be5a482bf18e23cc54fd3</frbrgroupid><rsrctype>book_chapters</rsrctype><prefilter>book_chapters</prefilter><language>eng</language><creationdate>1999</creationdate><topic>Applied sciences</topic><topic>Artificial intelligence</topic><topic>Computer science; control theory; systems</topic><topic>Exact sciences and technology</topic><topic>Good Segmentation</topic><topic>Language Model</topic><topic>Minimum Description Length Principle</topic><topic>Recognition Rate</topic><topic>Speech and sound recognition and synthesis. 
Linguistics</topic><topic>Training Material</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Harbeck, Stefan</creatorcontrib><creatorcontrib>Ohler, Uwe</creatorcontrib><creatorcontrib>Nöth, Elmar</creatorcontrib><creatorcontrib>Niemann, Heinrich</creatorcontrib><collection>ProQuest Ebook Central - Book Chapters - Demo use only</collection><collection>Pascal-Francis</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Harbeck, Stefan</au><au>Ohler, Uwe</au><au>Nöth, Elmar</au><au>Niemann, Heinrich</au><au>Goos, G</au><au>Sojka, Petr</au><au>Siekmann, Jörg</au><au>Carbonell, Jaime G</au><au>Mautner, Pavel</au><au>Matousek, Vaclav</au><au>Ocelikova, Jana</au><au>Sojka, Petr</au><au>Mautner, Pavel</au><au>Ocelíková, Jana</au><au>Matousek, Václav</au><format>book</format><genre>bookitem</genre><ristype>CHAP</ristype><atitle>Information Theoretic Based Segments for Language Identification</atitle><btitle>Lecture notes in computer science</btitle><seriestitle>Lecture Notes in Computer Science</seriestitle><date>1999</date><risdate>1999</risdate><volume>1692</volume><spage>187</spage><epage>192</epage><pages>187-192</pages><issn>0302-9743</issn><eissn>1611-3349</eissn><isbn>3540664947</isbn><isbn>9783540664949</isbn><eisbn>3540482393</eisbn><eisbn>9783540482390</eisbn><abstract>In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. 
In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.</abstract><cop>Germany</cop><pub>Springer Berlin / Heidelberg</pub><doi>10.1007/3-540-48239-3_34</doi><oclcid>958521711</oclcid><oclcid>1245674162</oclcid><tpages>6</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0302-9743
ispartof	Lecture notes in computer science, 1999, Vol.1692, p.187-192
issn	0302-9743 1611-3349
language	eng
recordid	cdi_pascalfrancis_primary_1827545
source	Springer Books
subjects	Applied sciences; Artificial intelligence; Computer science, control theory, systems; Exact sciences and technology; Good Segmentation; Language Model; Minimum Description Length Principle; Recognition Rate; Speech and sound recognition and synthesis, Linguistics; Training Material
title	Information Theoretic Based Segments for Language Identification
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T19%3A58%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pasca&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=bookitem&rft.atitle=Information%20Theoretic%20Based%20Segments%20for%20Language%20Identification&rft.btitle=Lecture%20notes%20in%20computer%20science&rft.au=Harbeck,%20Stefan&rft.date=1999&rft.volume=1692&rft.spage=187&rft.epage=192&rft.pages=187-192&rft.issn=0302-9743&rft.eissn=1611-3349&rft.isbn=3540664947&rft.isbn_list=9783540664949&rft_id=info:doi/10.1007/3-540-48239-3_34&rft_dat=%3Cproquest_pasca%3EEBC6486039_227_199%3C/proquest_pasca%3E%3Curl%3E%3C/url%3E&rft.eisbn=3540482393&rft.eisbn_list=9783540482390&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=EBC3072700_40_198&rft_id=info:pmid/&rfr_iscdi=true