Information Theoretic Based Segments for Language Identification

In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The m...

Bibliographic details
Main authors: Harbeck, Stefan; Ohler, Uwe; Nöth, Elmar; Niemann, Heinrich
Format: Book chapter
Language: eng
Subjects:
Online access: Full text
container_end_page 192
container_issue
container_start_page 187
container_title
container_volume 1692
creator Harbeck, Stefan
Ohler, Uwe
Nöth, Elmar
Niemann, Heinrich
description In our paper we present two new approaches for language identification. Both are based on multigrams, an information-theoretic representation of the observations. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment a new observation into larger units (e.g. something like words) and to calculate a probability for the best segmentation. In the second approach we build a fenon recognizer that uses the segments of the best segmentation of the training material as “words” in the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we obtained significant improvements with this second approach compared with the unsupervised codebook approach when discriminating between English and German utterances.
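The segmentation step mentioned in the description can be illustrated with a minimal sketch. It assumes a multigram model is simply a table of log-probabilities for variable-length units and that the best segmentation is found by dynamic programming (a Viterbi-style search). The function name `best_segmentation`, the `max_len` parameter, and the toy probabilities are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
import math


def best_segmentation(symbols, log_probs, max_len=3):
    """Return (segments, log_prob) for the most likely segmentation.

    symbols   : sequence of phoneme or codebook symbols
    log_probs : dict mapping a tuple of symbols (one unit) to its log-probability
    max_len   : longest unit considered (an assumption of this sketch)
    """
    n = len(symbols)
    best = [-math.inf] * (n + 1)  # best[i] = best log-prob of symbols[:i]
    back = [0] * (n + 1)          # back[i] = start index of the last unit
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            unit = tuple(symbols[start:end])
            if unit not in log_probs or best[start] == -math.inf:
                continue
            score = best[start] + log_probs[unit]
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        return None, -math.inf  # the model cannot cover the sequence
    # Follow the back-pointers to recover the units in order.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(symbols[start:end]))
        end = start
    segments.reverse()
    return segments, best[n]


# Toy model with made-up probabilities (illustrative only).
toy_model = {
    ("a",): math.log(0.2),
    ("b",): math.log(0.2),
    ("a", "b"): math.log(0.6),
}
segments, score = best_segmentation(["a", "b", "a"], toy_model)
```

Under these assumptions, language identification would amount to scoring an utterance with one such model per language and choosing the language whose best segmentation has the highest log-probability.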
doi_str_mv 10.1007/3-540-48239-3_34
format Book Chapter
fullrecord <record><control><sourceid>proquest_pasca</sourceid><recordid>TN_cdi_pascalfrancis_primary_1827545</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>EBC6486039_227_199</sourcerecordid><originalsourceid>FETCH-LOGICAL-p312t-ad1b1e4ec7aecbd4e00f6d46961467c81b970d0c8c49be5a482bf18e23cc54fd3</originalsourceid><addsrcrecordid>eNqNkElPwzAQhc0qCvTOMQeuKR6PY8c3FrFUqsQBOFuOMymBNil2euDf4y4_gLmM9OZ9M5rH2BXwCXCubzAvJM9lKdDkaFEesHNMylbAQzYCBZAjSnO0GygljdTHbMSRi9xoiadsZIqyEKABztg4xi-eCoVO6IjdTrumD0s3tH2XvX9SH2hofXbvItXZG82X1A0xS5Zs5rr52s0pm9ZJa5vWb6FLdtK4RaTxvl-wj6fH94eXfPb6PH24m-UrBDHkroYKSJLXjnxVS-K8UbVURoFU2pdQGc1r7ksvTUWFSw9WDZQk0PtCNjVesOvd3pWL3i2a4DrfRrsK7dKFXwul0IUskm2ys8U06eYUbNX339ECt5s8LdoUkt3GZzd5JkDu94b-Z01xsLQhfPoxuIX_dKuBQrRKloonRghtwZiE4X8w5Fpong6m86bEPyZehZI</addsrcrecordid><sourcetype>Index Database</sourcetype><iscdi>true</iscdi><recordtype>book_chapter</recordtype><pqid>EBC3072700_40_198</pqid></control><display><type>book_chapter</type><title>Information Theoretic Based Segments for Language Identification</title><source>Springer Books</source><creator>Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich</creator><contributor>Goos, G ; Sojka, Petr ; Siekmann, Jörg ; Carbonell, Jaime G ; Mautner, Pavel ; Matousek, Vaclav ; Ocelikova, Jana ; Sojka, Petr ; Mautner, Pavel ; Ocelíková, Jana ; Matousek, Václav</contributor><creatorcontrib>Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich ; Goos, G ; Sojka, Petr ; Siekmann, Jörg ; Carbonell, Jaime G ; Mautner, Pavel ; Matousek, Vaclav ; Ocelikova, Jana ; Sojka, Petr ; Mautner, Pavel ; Ocelíková, Jana ; Matousek, Václav</creatorcontrib><description>In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. 
The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.</description><identifier>ISSN: 0302-9743</identifier><identifier>ISBN: 3540664947</identifier><identifier>ISBN: 9783540664949</identifier><identifier>EISSN: 1611-3349</identifier><identifier>EISBN: 3540482393</identifier><identifier>EISBN: 9783540482390</identifier><identifier>DOI: 10.1007/3-540-48239-3_34</identifier><identifier>OCLC: 958521711</identifier><identifier>OCLC: 1245674162</identifier><identifier>LCCallNum: TK5102.9TA1637-1638</identifier><language>eng</language><publisher>Germany: Springer Berlin / Heidelberg</publisher><subject>Applied sciences ; Artificial intelligence ; Computer science; control theory; systems ; Exact sciences and technology ; Good Segmentation ; Language Model ; Minimum Description Length Principle ; Recognition Rate ; Speech and sound recognition and synthesis. 
Linguistics ; Training Material</subject><ispartof>Lecture notes in computer science, 1999, Vol.1692, p.187-192</ispartof><rights>Springer-Verlag Berlin Heidelberg 1999</rights><rights>1999 INIST-CNRS</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><relation>Lecture Notes in Computer Science</relation></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Uhttps://ebookcentral.proquest.com/covers/3072700-l.jpg</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/3-540-48239-3_34$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/3-540-48239-3_34$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>309,310,775,776,780,785,786,789,4036,4037,27902,38232,41418,42487</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&amp;idt=1827545$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><contributor>Goos, G</contributor><contributor>Sojka, Petr</contributor><contributor>Siekmann, Jörg</contributor><contributor>Carbonell, Jaime G</contributor><contributor>Mautner, Pavel</contributor><contributor>Matousek, Vaclav</contributor><contributor>Ocelikova, Jana</contributor><contributor>Sojka, Petr</contributor><contributor>Mautner, Pavel</contributor><contributor>Ocelíková, Jana</contributor><contributor>Matousek, Václav</contributor><creatorcontrib>Harbeck, Stefan</creatorcontrib><creatorcontrib>Ohler, Uwe</creatorcontrib><creatorcontrib>Nöth, Elmar</creatorcontrib><creatorcontrib>Niemann, Heinrich</creatorcontrib><title>Information Theoretic Based Segments for Language Identification</title><title>Lecture notes in computer science</title><description>In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. 
In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.</description><subject>Applied sciences</subject><subject>Artificial intelligence</subject><subject>Computer science; control theory; systems</subject><subject>Exact sciences and technology</subject><subject>Good Segmentation</subject><subject>Language Model</subject><subject>Minimum Description Length Principle</subject><subject>Recognition Rate</subject><subject>Speech and sound recognition and synthesis. 
Linguistics</subject><subject>Training Material</subject><issn>0302-9743</issn><issn>1611-3349</issn><isbn>3540664947</isbn><isbn>9783540664949</isbn><isbn>3540482393</isbn><isbn>9783540482390</isbn><fulltext>true</fulltext><rsrctype>book_chapter</rsrctype><creationdate>1999</creationdate><recordtype>book_chapter</recordtype><recordid>eNqNkElPwzAQhc0qCvTOMQeuKR6PY8c3FrFUqsQBOFuOMymBNil2euDf4y4_gLmM9OZ9M5rH2BXwCXCubzAvJM9lKdDkaFEesHNMylbAQzYCBZAjSnO0GygljdTHbMSRi9xoiadsZIqyEKABztg4xi-eCoVO6IjdTrumD0s3tH2XvX9SH2hofXbvItXZG82X1A0xS5Zs5rr52s0pm9ZJa5vWb6FLdtK4RaTxvl-wj6fH94eXfPb6PH24m-UrBDHkroYKSJLXjnxVS-K8UbVURoFU2pdQGc1r7ksvTUWFSw9WDZQk0PtCNjVesOvd3pWL3i2a4DrfRrsK7dKFXwul0IUskm2ys8U06eYUbNX339ECt5s8LdoUkt3GZzd5JkDu94b-Z01xsLQhfPoxuIX_dKuBQrRKloonRghtwZiE4X8w5Fpong6m86bEPyZehZI</recordid><startdate>1999</startdate><enddate>1999</enddate><creator>Harbeck, Stefan</creator><creator>Ohler, Uwe</creator><creator>Nöth, Elmar</creator><creator>Niemann, Heinrich</creator><general>Springer Berlin / Heidelberg</general><general>Springer Berlin Heidelberg</general><general>Springer</general><scope>FFUUA</scope><scope>IQODW</scope></search><sort><creationdate>1999</creationdate><title>Information Theoretic Based Segments for Language Identification</title><author>Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-p312t-ad1b1e4ec7aecbd4e00f6d46961467c81b970d0c8c49be5a482bf18e23cc54fd3</frbrgroupid><rsrctype>book_chapters</rsrctype><prefilter>book_chapters</prefilter><language>eng</language><creationdate>1999</creationdate><topic>Applied sciences</topic><topic>Artificial intelligence</topic><topic>Computer science; control theory; systems</topic><topic>Exact sciences and technology</topic><topic>Good Segmentation</topic><topic>Language Model</topic><topic>Minimum Description Length Principle</topic><topic>Recognition Rate</topic><topic>Speech and sound recognition and synthesis. 
Linguistics</topic><topic>Training Material</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Harbeck, Stefan</creatorcontrib><creatorcontrib>Ohler, Uwe</creatorcontrib><creatorcontrib>Nöth, Elmar</creatorcontrib><creatorcontrib>Niemann, Heinrich</creatorcontrib><collection>ProQuest Ebook Central - Book Chapters - Demo use only</collection><collection>Pascal-Francis</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Harbeck, Stefan</au><au>Ohler, Uwe</au><au>Nöth, Elmar</au><au>Niemann, Heinrich</au><au>Goos, G</au><au>Sojka, Petr</au><au>Siekmann, Jörg</au><au>Carbonell, Jaime G</au><au>Mautner, Pavel</au><au>Matousek, Vaclav</au><au>Ocelikova, Jana</au><au>Sojka, Petr</au><au>Mautner, Pavel</au><au>Ocelíková, Jana</au><au>Matousek, Václav</au><format>book</format><genre>bookitem</genre><ristype>CHAP</ristype><atitle>Information Theoretic Based Segments for Language Identification</atitle><btitle>Lecture notes in computer science</btitle><seriestitle>Lecture Notes in Computer Science</seriestitle><date>1999</date><risdate>1999</risdate><volume>1692</volume><spage>187</spage><epage>192</epage><pages>187-192</pages><issn>0302-9743</issn><eissn>1611-3349</eissn><isbn>3540664947</isbn><isbn>9783540664949</isbn><eisbn>3540482393</eisbn><eisbn>9783540482390</eisbn><abstract>In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. 
In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.</abstract><cop>Germany</cop><pub>Springer Berlin / Heidelberg</pub><doi>10.1007/3-540-48239-3_34</doi><oclcid>958521711</oclcid><oclcid>1245674162</oclcid><tpages>6</tpages></addata></record>
fulltext fulltext
identifier ISSN: 0302-9743
ispartof Lecture notes in computer science, 1999, Vol.1692, p.187-192
issn 0302-9743
1611-3349
language eng
recordid cdi_pascalfrancis_primary_1827545
source Springer Books
subjects Applied sciences
Artificial intelligence
Computer science
control theory
systems
Exact sciences and technology
Good Segmentation
Language Model
Minimum Description Length Principle
Recognition Rate
Speech and sound recognition and synthesis. Linguistics
Training Material
title Information Theoretic Based Segments for Language Identification
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T19%3A58%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pasca&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=bookitem&rft.atitle=Information%20Theoretic%20Based%20Segments%20for%20Language%20Identification&rft.btitle=Lecture%20notes%20in%20computer%20science&rft.au=Harbeck,%20Stefan&rft.date=1999&rft.volume=1692&rft.spage=187&rft.epage=192&rft.pages=187-192&rft.issn=0302-9743&rft.eissn=1611-3349&rft.isbn=3540664947&rft.isbn_list=9783540664949&rft_id=info:doi/10.1007/3-540-48239-3_34&rft_dat=%3Cproquest_pasca%3EEBC6486039_227_199%3C/proquest_pasca%3E%3Curl%3E%3C/url%3E&rft.eisbn=3540482393&rft.eisbn_list=9783540482390&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=EBC3072700_40_198&rft_id=info:pmid/&rfr_iscdi=true