Information Theoretic Based Segments for Language Identification
In this paper we present two new approaches to language identification. Both are based on so-called multigrams, an information-theoretic representation of observations. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The m...
Main authors: | Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich |
---|---|
Format: | Book chapter |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 192 |
---|---|
container_issue | |
container_start_page | 187 |
container_title | |
container_volume | 1692 |
creator | Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich |
description | In this paper we present two new approaches to language identification. Both are based on so-called multigrams, an information-theoretic representation of observations. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can segment a new observation into larger units (e.g. something like words) and calculate a probability for the best segmentation. In the second approach we build a fenon recognizer that uses the segments of the best segmentation of the training material as “words” in the recognition vocabulary. On the OGI test corpus and the NIST’95 evaluation corpus, this second approach gave significant improvements over the unsupervised codebook approach when discriminating between English and German utterances. |
doi_str_mv | 10.1007/3-540-48239-3_34 |
format | Book Chapter |
fullrecord | <record><control><sourceid>proquest_pasca</sourceid><recordid>TN_cdi_pascalfrancis_primary_1827545</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>EBC6486039_227_199</sourcerecordid><originalsourceid>FETCH-LOGICAL-p312t-ad1b1e4ec7aecbd4e00f6d46961467c81b970d0c8c49be5a482bf18e23cc54fd3</originalsourceid><addsrcrecordid>eNqNkElPwzAQhc0qCvTOMQeuKR6PY8c3FrFUqsQBOFuOMymBNil2euDf4y4_gLmM9OZ9M5rH2BXwCXCubzAvJM9lKdDkaFEesHNMylbAQzYCBZAjSnO0GygljdTHbMSRi9xoiadsZIqyEKABztg4xi-eCoVO6IjdTrumD0s3tH2XvX9SH2hofXbvItXZG82X1A0xS5Zs5rr52s0pm9ZJa5vWb6FLdtK4RaTxvl-wj6fH94eXfPb6PH24m-UrBDHkroYKSJLXjnxVS-K8UbVURoFU2pdQGc1r7ksvTUWFSw9WDZQk0PtCNjVesOvd3pWL3i2a4DrfRrsK7dKFXwul0IUskm2ys8U06eYUbNX339ECt5s8LdoUkt3GZzd5JkDu94b-Z01xsLQhfPoxuIX_dKuBQrRKloonRghtwZiE4X8w5Fpong6m86bEPyZehZI</addsrcrecordid><sourcetype>Index Database</sourcetype><iscdi>true</iscdi><recordtype>book_chapter</recordtype><pqid>EBC3072700_40_198</pqid></control><display><type>book_chapter</type><title>Information Theoretic Based Segments for Language Identification</title><source>Springer Books</source><creator>Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich</creator><contributor>Goos, G ; Sojka, Petr ; Siekmann, Jörg ; Carbonell, Jaime G ; Mautner, Pavel ; Matousek, Vaclav ; Ocelikova, Jana ; Sojka, Petr ; Mautner, Pavel ; Ocelíková, Jana ; Matousek, Václav</contributor><creatorcontrib>Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich ; Goos, G ; Sojka, Petr ; Siekmann, Jörg ; Carbonell, Jaime G ; Mautner, Pavel ; Matousek, Vaclav ; Ocelikova, Jana ; Sojka, Petr ; Mautner, Pavel ; Ocelíková, Jana ; Matousek, Václav</creatorcontrib><description>In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. 
The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.</description><identifier>ISSN: 0302-9743</identifier><identifier>ISBN: 3540664947</identifier><identifier>ISBN: 9783540664949</identifier><identifier>EISSN: 1611-3349</identifier><identifier>EISBN: 3540482393</identifier><identifier>EISBN: 9783540482390</identifier><identifier>DOI: 10.1007/3-540-48239-3_34</identifier><identifier>OCLC: 958521711</identifier><identifier>OCLC: 1245674162</identifier><identifier>LCCallNum: TK5102.9TA1637-1638</identifier><language>eng</language><publisher>Germany: Springer Berlin / Heidelberg</publisher><subject>Applied sciences ; Artificial intelligence ; Computer science; control theory; systems ; Exact sciences and technology ; Good Segmentation ; Language Model ; Minimum Description Length Principle ; Recognition Rate ; Speech and sound recognition and synthesis. 
Linguistics ; Training Material</subject><ispartof>Lecture notes in computer science, 1999, Vol.1692, p.187-192</ispartof><rights>Springer-Verlag Berlin Heidelberg 1999</rights><rights>1999 INIST-CNRS</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><relation>Lecture Notes in Computer Science</relation></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Uhttps://ebookcentral.proquest.com/covers/3072700-l.jpg</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/3-540-48239-3_34$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/3-540-48239-3_34$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>309,310,775,776,780,785,786,789,4036,4037,27902,38232,41418,42487</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=1827545$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><contributor>Goos, G</contributor><contributor>Sojka, Petr</contributor><contributor>Siekmann, Jörg</contributor><contributor>Carbonell, Jaime G</contributor><contributor>Mautner, Pavel</contributor><contributor>Matousek, Vaclav</contributor><contributor>Ocelikova, Jana</contributor><contributor>Sojka, Petr</contributor><contributor>Mautner, Pavel</contributor><contributor>Ocelíková, Jana</contributor><contributor>Matousek, Václav</contributor><creatorcontrib>Harbeck, Stefan</creatorcontrib><creatorcontrib>Ohler, Uwe</creatorcontrib><creatorcontrib>Nöth, Elmar</creatorcontrib><creatorcontrib>Niemann, Heinrich</creatorcontrib><title>Information Theoretic Based Segments for Language Identification</title><title>Lecture notes in computer science</title><description>In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. 
In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.</description><subject>Applied sciences</subject><subject>Artificial intelligence</subject><subject>Computer science; control theory; systems</subject><subject>Exact sciences and technology</subject><subject>Good Segmentation</subject><subject>Language Model</subject><subject>Minimum Description Length Principle</subject><subject>Recognition Rate</subject><subject>Speech and sound recognition and synthesis. 
Linguistics</subject><subject>Training Material</subject><issn>0302-9743</issn><issn>1611-3349</issn><isbn>3540664947</isbn><isbn>9783540664949</isbn><isbn>3540482393</isbn><isbn>9783540482390</isbn><fulltext>true</fulltext><rsrctype>book_chapter</rsrctype><creationdate>1999</creationdate><recordtype>book_chapter</recordtype><recordid>eNqNkElPwzAQhc0qCvTOMQeuKR6PY8c3FrFUqsQBOFuOMymBNil2euDf4y4_gLmM9OZ9M5rH2BXwCXCubzAvJM9lKdDkaFEesHNMylbAQzYCBZAjSnO0GygljdTHbMSRi9xoiadsZIqyEKABztg4xi-eCoVO6IjdTrumD0s3tH2XvX9SH2hofXbvItXZG82X1A0xS5Zs5rr52s0pm9ZJa5vWb6FLdtK4RaTxvl-wj6fH94eXfPb6PH24m-UrBDHkroYKSJLXjnxVS-K8UbVURoFU2pdQGc1r7ksvTUWFSw9WDZQk0PtCNjVesOvd3pWL3i2a4DrfRrsK7dKFXwul0IUskm2ys8U06eYUbNX339ECt5s8LdoUkt3GZzd5JkDu94b-Z01xsLQhfPoxuIX_dKuBQrRKloonRghtwZiE4X8w5Fpong6m86bEPyZehZI</recordid><startdate>1999</startdate><enddate>1999</enddate><creator>Harbeck, Stefan</creator><creator>Ohler, Uwe</creator><creator>Nöth, Elmar</creator><creator>Niemann, Heinrich</creator><general>Springer Berlin / Heidelberg</general><general>Springer Berlin Heidelberg</general><general>Springer</general><scope>FFUUA</scope><scope>IQODW</scope></search><sort><creationdate>1999</creationdate><title>Information Theoretic Based Segments for Language Identification</title><author>Harbeck, Stefan ; Ohler, Uwe ; Nöth, Elmar ; Niemann, Heinrich</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-p312t-ad1b1e4ec7aecbd4e00f6d46961467c81b970d0c8c49be5a482bf18e23cc54fd3</frbrgroupid><rsrctype>book_chapters</rsrctype><prefilter>book_chapters</prefilter><language>eng</language><creationdate>1999</creationdate><topic>Applied sciences</topic><topic>Artificial intelligence</topic><topic>Computer science; control theory; systems</topic><topic>Exact sciences and technology</topic><topic>Good Segmentation</topic><topic>Language Model</topic><topic>Minimum Description Length Principle</topic><topic>Recognition Rate</topic><topic>Speech and sound recognition and synthesis. 
Linguistics</topic><topic>Training Material</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Harbeck, Stefan</creatorcontrib><creatorcontrib>Ohler, Uwe</creatorcontrib><creatorcontrib>Nöth, Elmar</creatorcontrib><creatorcontrib>Niemann, Heinrich</creatorcontrib><collection>ProQuest Ebook Central - Book Chapters - Demo use only</collection><collection>Pascal-Francis</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Harbeck, Stefan</au><au>Ohler, Uwe</au><au>Nöth, Elmar</au><au>Niemann, Heinrich</au><au>Goos, G</au><au>Sojka, Petr</au><au>Siekmann, Jörg</au><au>Carbonell, Jaime G</au><au>Mautner, Pavel</au><au>Matousek, Vaclav</au><au>Ocelikova, Jana</au><au>Sojka, Petr</au><au>Mautner, Pavel</au><au>Ocelíková, Jana</au><au>Matousek, Václav</au><format>book</format><genre>bookitem</genre><ristype>CHAP</ristype><atitle>Information Theoretic Based Segments for Language Identification</atitle><btitle>Lecture notes in computer science</btitle><seriestitle>Lecture Notes in Computer Science</seriestitle><date>1999</date><risdate>1999</risdate><volume>1692</volume><spage>187</spage><epage>192</epage><pages>187-192</pages><issn>0302-9743</issn><eissn>1611-3349</eissn><isbn>3540664947</isbn><isbn>9783540664949</isbn><eisbn>3540482393</eisbn><eisbn>9783540482390</eisbn><abstract>In our paper we present two new approaches for language identification. Both of them are based on the use of so-called multigrams, an information theoretic based observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment the new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. 
In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as “words” inside the recognition vocabulary. On the OGI test corpus and on the NIST’95 evaluation corpus we got significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.</abstract><cop>Germany</cop><pub>Springer Berlin / Heidelberg</pub><doi>10.1007/3-540-48239-3_34</doi><oclcid>958521711</oclcid><oclcid>1245674162</oclcid><tpages>6</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0302-9743 |
ispartof | Lecture notes in computer science, 1999, Vol.1692, p.187-192 |
issn | 0302-9743 1611-3349 |
language | eng |
recordid | cdi_pascalfrancis_primary_1827545 |
source | Springer Books |
subjects | Applied sciences ; Artificial intelligence ; Computer science; control theory; systems ; Exact sciences and technology ; Good Segmentation ; Language Model ; Minimum Description Length Principle ; Recognition Rate ; Speech and sound recognition and synthesis. Linguistics ; Training Material |
title | Information Theoretic Based Segments for Language Identification |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T19%3A58%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pasca&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=bookitem&rft.atitle=Information%20Theoretic%20Based%20Segments%20for%20Language%20Identification&rft.btitle=Lecture%20notes%20in%20computer%20science&rft.au=Harbeck,%20Stefan&rft.date=1999&rft.volume=1692&rft.spage=187&rft.epage=192&rft.pages=187-192&rft.issn=0302-9743&rft.eissn=1611-3349&rft.isbn=3540664947&rft.isbn_list=9783540664949&rft_id=info:doi/10.1007/3-540-48239-3_34&rft_dat=%3Cproquest_pasca%3EEBC6486039_227_199%3C/proquest_pasca%3E%3Curl%3E%3C/url%3E&rft.eisbn=3540482393&rft.eisbn_list=9783540482390&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=EBC3072700_40_198&rft_id=info:pmid/&rfr_iscdi=true |
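
The abstract above describes using a multigram model to segment an observation sequence into larger units and to compute a probability for the best segmentation. The record gives no implementation details, so the following is only a minimal sketch under stated assumptions: units are scored independently (a unigram model over variable-length units), the best segmentation is found with a Viterbi-style dynamic program, and the names `best_segmentation`, `toy_model`, and `max_len` are hypothetical and not taken from the paper.

```python
import math


def best_segmentation(symbols, log_probs, max_len=3):
    """Return (segments, log_probability) of the most probable segmentation.

    Assumption: `log_probs` maps tuples of symbols (candidate units) to
    log-probabilities, and units are treated as independent of each other.
    Returns (None, -inf) if the sequence cannot be covered by known units.
    """
    n = len(symbols)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of symbols[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last unit
    best[0] = 0.0

    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            unit = tuple(symbols[start:end])
            if unit not in log_probs or best[start] == -math.inf:
                continue
            score = best[start] + log_probs[unit]
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        return None, -math.inf

    # Follow the back-pointers from the end to recover the units.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(symbols[start:end]))
        end = start
    segments.reverse()
    return segments, best[n]


# Hypothetical toy model: three single symbols plus one two-symbol unit.
toy_model = {
    ("a",): math.log(0.2),
    ("b",): math.log(0.2),
    ("c",): math.log(0.2),
    ("a", "b"): math.log(0.4),
}

segments, score = best_segmentation(list("abc"), toy_model)
print(segments, math.exp(score))
```

In this toy model the segmentation `ab | c` has probability 0.4 × 0.2, which beats `a | b | c` at 0.2 × 0.2 × 0.2, so the two-symbol unit is kept together. How the paper trains its multigram inventory and how it turns segmentation scores into a language decision are not stated in this record and are not modeled here.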