Lexical speaker identification in TV shows

It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or to improve the performance of standard cepstral-based SID systems upon fusion. This was established before typically using isolated speech from single speakers (NIST SRE...

Full description

Bibliographic details
Published in: Multimedia tools and applications 2015-02, Vol.74 (4), p.1377-1396
Main authors: Roy, Anindya, Bredin, Hervé, Hartmann, William, Le, Viet Bac, Barras, Claude, Gauvain, Jean-Luc
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 1396
container_issue 4
container_start_page 1377
container_title Multimedia tools and applications
container_volume 74
creator Roy, Anindya
Bredin, Hervé
Hartmann, William
Le, Viet Bac
Barras, Claude
Gauvain, Jean-Luc
description It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or to improve the performance of standard cepstral-based SID systems upon fusion. This has previously been established, typically using isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). In contrast, this work applies lexical approaches for SID to a different type of data. It uses the REPERE corpus, which consists of unsegmented multiparty conversations, mostly debates, discussions and Q&A sessions from TV shows. It is hypothesized that people give out clues to their identity when speaking in such settings, which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied in this work, including TFIDF, BM25 and LDA-based topic modeling. Results are analysed in terms of TV shows and speaker roles. Lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector - Support Vector Machine (GSV-SVM) system. Also, in certain cases, the lexical system yields a modest improvement over the performance of the cepstral-based system when the two are combined using score-level sum fusion. To highlight the potential of using lexical information not just to improve upon cepstral-based SID systems but as an independent approach in its own right, initial studies on crossmedia SID are briefly reported. Instead of using speech data as all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models, which are then tested on speech transcripts to identify speakers.
doi_str_mv 10.1007/s11042-014-1940-3
format Article
fullrecord <record><control><sourceid>proquest_hal_p</sourceid><recordid>TN_cdi_hal_primary_oai_HAL_hal_01690342v1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>1669905613</sourcerecordid><originalsourceid>FETCH-LOGICAL-c459t-5181a8d557d8983aa8a5ed7df611c004476e5ac74d6e2bce33a9ce23442422193</originalsourceid><addsrcrecordid>eNp1kEtLxDAUhYMoOI7-AHcFNypEc_NsloOoIxTcjG5DTFOnY6cdk46Pf29KRURwdS-X7xzOPQgdA7kAQtRlBCCcYgIcg-YEsx00AaEYVorCbtpZTrASBPbRQYwrQkAKyifovPAftbNNFjfevviQ1aVv-7pKt77u2qxus8VjFpfdezxEe5Vtoj_6nlP0cHO9uJrj4v727mpWYMeF7rGAHGxeCqHKXOfM2twKX6qykgCOEM6V9MI6xUvp6ZPzjFntPGWcU04paDZFZ6Pv0jZmE-q1DZ-ms7WZzwoz3FJ2TRinb5DY05HdhO5162Nv1nV0vmls67ttNCCl1kRIYAk9-YOuum1o0ycGlIScgeJ5omCkXOhiDL76SQDEDE2bsekUgpuhaTM401ETE9s--_DL-V_RFyK3fMw</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1761831748</pqid></control><display><type>article</type><title>Lexical speaker identification in TV shows</title><source>Springer Nature - Complete Springer Journals</source><creator>Roy, Anindya ; Bredin, Hervé ; Hartmann, William ; Le, Viet Bac ; Barras, Claude ; Gauvain, Jean-Luc</creator><creatorcontrib>Roy, Anindya ; Bredin, Hervé ; Hartmann, William ; Le, Viet Bac ; Barras, Claude ; Gauvain, Jean-Luc</creatorcontrib><description>It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or to improve the performance of standard cepstral-based SID systems upon fusion. This was established before typically using isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). On the contrary, this work applies lexical approaches for SID on a different type of data. It uses the REPERE corpus consisting of unsegmented multiparty conversations, mostly debates, discussions and Q&amp;A sessions from TV shows. 
It is hypothesized that people give out clues to their identity when speaking in such settings which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied in this work, including TFIDF, BM25 and LDA-based topic modeling. Results are analysed in terms of TV shows and speaker roles. Lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector - Support Vector Machine (GSV-SVM) system. Also, in certain cases, the lexical system shows modest improvement over the cepstral-based system performance using score-level sum fusion. To highlight the potential of using lexical information not just to improve upon cepstral-based SID systems but as an independent approach in its own right, initial studies on crossmedia SID is briefly reported. Instead of using speech data as all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models which are then tested on speech transcripts to identify speakers.</description><identifier>ISSN: 1380-7501</identifier><identifier>EISSN: 1573-7721</identifier><identifier>DOI: 10.1007/s11042-014-1940-3</identifier><language>eng</language><publisher>Boston: Springer US</publisher><subject>Acoustics ; Analysis ; Classification ; Computer Communication Networks ; Computer Science ; Conversation ; Conversational language ; Data Structures and Information Theory ; Experiments ; Gaussian ; Multimedia ; Multimedia computer applications ; Multimedia Information Systems ; Pattern recognition ; Phonetics ; Special Purpose and Application-Based Systems ; Speech ; Studies ; Support vector machines ; Television ; Television programs ; Texts ; Trains ; Voice recognition</subject><ispartof>Multimedia tools and applications, 2015-02, Vol.74 (4), p.1377-1396</ispartof><rights>Springer 
Science+Business Media New York 2014</rights><rights>Springer Science+Business Media New York 2015</rights><rights>Distributed under a Creative Commons Attribution 4.0 International License</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c459t-5181a8d557d8983aa8a5ed7df611c004476e5ac74d6e2bce33a9ce23442422193</citedby><cites>FETCH-LOGICAL-c459t-5181a8d557d8983aa8a5ed7df611c004476e5ac74d6e2bce33a9ce23442422193</cites><orcidid>0000-0002-3739-925X</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s11042-014-1940-3$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s11042-014-1940-3$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>230,314,776,780,881,27901,27902,41464,42533,51294</link.rule.ids><backlink>$$Uhttps://hal.science/hal-01690342$$DView record in HAL$$Hfree_for_read</backlink></links><search><creatorcontrib>Roy, Anindya</creatorcontrib><creatorcontrib>Bredin, Hervé</creatorcontrib><creatorcontrib>Hartmann, William</creatorcontrib><creatorcontrib>Le, Viet Bac</creatorcontrib><creatorcontrib>Barras, Claude</creatorcontrib><creatorcontrib>Gauvain, Jean-Luc</creatorcontrib><title>Lexical speaker identification in TV shows</title><title>Multimedia tools and applications</title><addtitle>Multimed Tools Appl</addtitle><description>It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or to improve the performance of standard cepstral-based SID systems upon fusion. This was established before typically using isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). On the contrary, this work applies lexical approaches for SID on a different type of data. 
It uses the REPERE corpus consisting of unsegmented multiparty conversations, mostly debates, discussions and Q&amp;A sessions from TV shows. It is hypothesized that people give out clues to their identity when speaking in such settings which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied in this work, including TFIDF, BM25 and LDA-based topic modeling. Results are analysed in terms of TV shows and speaker roles. Lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector - Support Vector Machine (GSV-SVM) system. Also, in certain cases, the lexical system shows modest improvement over the cepstral-based system performance using score-level sum fusion. To highlight the potential of using lexical information not just to improve upon cepstral-based SID systems but as an independent approach in its own right, initial studies on crossmedia SID is briefly reported. 
Instead of using speech data as all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models which are then tested on speech transcripts to identify speakers.</description><subject>Acoustics</subject><subject>Analysis</subject><subject>Classification</subject><subject>Computer Communication Networks</subject><subject>Computer Science</subject><subject>Conversation</subject><subject>Conversational language</subject><subject>Data Structures and Information Theory</subject><subject>Experiments</subject><subject>Gaussian</subject><subject>Multimedia</subject><subject>Multimedia computer applications</subject><subject>Multimedia Information Systems</subject><subject>Pattern recognition</subject><subject>Phonetics</subject><subject>Special Purpose and Application-Based Systems</subject><subject>Speech</subject><subject>Studies</subject><subject>Support vector machines</subject><subject>Television</subject><subject>Television programs</subject><subject>Texts</subject><subject>Trains</subject><subject>Voice recognition</subject><issn>1380-7501</issn><issn>1573-7721</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><sourceid>8G5</sourceid><sourceid>BENPR</sourceid><sourceid>GUQSH</sourceid><sourceid>M2O</sourceid><recordid>eNp1kEtLxDAUhYMoOI7-AHcFNypEc_NsloOoIxTcjG5DTFOnY6cdk46Pf29KRURwdS-X7xzOPQgdA7kAQtRlBCCcYgIcg-YEsx00AaEYVorCbtpZTrASBPbRQYwrQkAKyifovPAftbNNFjfevviQ1aVv-7pKt77u2qxus8VjFpfdezxEe5Vtoj_6nlP0cHO9uJrj4v727mpWYMeF7rGAHGxeCqHKXOfM2twKX6qykgCOEM6V9MI6xUvp6ZPzjFntPGWcU04paDZFZ6Pv0jZmE-q1DZ-ms7WZzwoz3FJ2TRinb5DY05HdhO5162Nv1nV0vmls67ttNCCl1kRIYAk9-YOuum1o0ycGlIScgeJ5omCkXOhiDL76SQDEDE2bsekUgpuhaTM401ETE9s--_DL-V_RFyK3fMw</recordid><startdate>20150201</startdate><enddate>20150201</enddate><creator>Roy, Anindya</creator><creator>Bredin, Hervé</creator><creator>Hartmann, William</creator><creator>Le, Viet Bac</creator><creator>Barras, 
Claude</creator><creator>Gauvain, Jean-Luc</creator><general>Springer US</general><general>Springer Nature B.V</general><general>Springer Verlag</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7SC</scope><scope>7WY</scope><scope>7WZ</scope><scope>7XB</scope><scope>87Z</scope><scope>8AL</scope><scope>8AO</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>8FL</scope><scope>8G5</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BEZIV</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FRNLG</scope><scope>F~G</scope><scope>GNUQQ</scope><scope>GUQSH</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K60</scope><scope>K6~</scope><scope>K7-</scope><scope>L.-</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M0C</scope><scope>M0N</scope><scope>M2O</scope><scope>MBDVC</scope><scope>P5Z</scope><scope>P62</scope><scope>PQBIZ</scope><scope>PQBZA</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope><scope>1XC</scope><scope>VOOES</scope><orcidid>https://orcid.org/0000-0002-3739-925X</orcidid></search><sort><creationdate>20150201</creationdate><title>Lexical speaker identification in TV shows</title><author>Roy, Anindya ; Bredin, Hervé ; Hartmann, William ; Le, Viet Bac ; Barras, Claude ; Gauvain, Jean-Luc</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c459t-5181a8d557d8983aa8a5ed7df611c004476e5ac74d6e2bce33a9ce23442422193</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Acoustics</topic><topic>Analysis</topic><topic>Classification</topic><topic>Computer Communication Networks</topic><topic>Computer Science</topic><topic>Conversation</topic><topic>Conversational language</topic><topic>Data Structures and Information 
Theory</topic><topic>Experiments</topic><topic>Gaussian</topic><topic>Multimedia</topic><topic>Multimedia computer applications</topic><topic>Multimedia Information Systems</topic><topic>Pattern recognition</topic><topic>Phonetics</topic><topic>Special Purpose and Application-Based Systems</topic><topic>Speech</topic><topic>Studies</topic><topic>Support vector machines</topic><topic>Television</topic><topic>Television programs</topic><topic>Texts</topic><topic>Trains</topic><topic>Voice recognition</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Roy, Anindya</creatorcontrib><creatorcontrib>Bredin, Hervé</creatorcontrib><creatorcontrib>Hartmann, William</creatorcontrib><creatorcontrib>Le, Viet Bac</creatorcontrib><creatorcontrib>Barras, Claude</creatorcontrib><creatorcontrib>Gauvain, Jean-Luc</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Computer and Information Systems Abstracts</collection><collection>ABI/INFORM Collection</collection><collection>ABI/INFORM Global (PDF only)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>ABI/INFORM Global (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ABI/INFORM Collection (Alumni Edition)</collection><collection>Research Library (Alumni Edition)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest 
Central</collection><collection>Business Premium Collection</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Business Premium Collection (Alumni)</collection><collection>ABI/INFORM Global (Corporate)</collection><collection>ProQuest Central Student</collection><collection>Research Library Prep</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Business Collection (Alumni Edition)</collection><collection>ProQuest Business Collection</collection><collection>Computer Science Database</collection><collection>ABI/INFORM Professional Advanced</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>ABI/INFORM Global</collection><collection>Computing Database</collection><collection>Research Library</collection><collection>Research Library (Corporate)</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest One Business</collection><collection>ProQuest One Business (Alumni)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><collection>Hyper Article en Ligne (HAL)</collection><collection>Hyper Article en Ligne (HAL) (Open Access)</collection><jtitle>Multimedia tools and applications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Roy, Anindya</au><au>Bredin, 
Hervé</au><au>Hartmann, William</au><au>Le, Viet Bac</au><au>Barras, Claude</au><au>Gauvain, Jean-Luc</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Lexical speaker identification in TV shows</atitle><jtitle>Multimedia tools and applications</jtitle><stitle>Multimed Tools Appl</stitle><date>2015-02-01</date><risdate>2015</risdate><volume>74</volume><issue>4</issue><spage>1377</spage><epage>1396</epage><pages>1377-1396</pages><issn>1380-7501</issn><eissn>1573-7721</eissn><abstract>It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or to improve the performance of standard cepstral-based SID systems upon fusion. This was established before typically using isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). On the contrary, this work applies lexical approaches for SID on a different type of data. It uses the REPERE corpus consisting of unsegmented multiparty conversations, mostly debates, discussions and Q&amp;A sessions from TV shows. It is hypothesized that people give out clues to their identity when speaking in such settings which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied in this work, including TFIDF, BM25 and LDA-based topic modeling. Results are analysed in terms of TV shows and speaker roles. Lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector - Support Vector Machine (GSV-SVM) system. Also, in certain cases, the lexical system shows modest improvement over the cepstral-based system performance using score-level sum fusion. 
To highlight the potential of using lexical information not just to improve upon cepstral-based SID systems but as an independent approach in its own right, initial studies on crossmedia SID is briefly reported. Instead of using speech data as all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models which are then tested on speech transcripts to identify speakers.</abstract><cop>Boston</cop><pub>Springer US</pub><doi>10.1007/s11042-014-1940-3</doi><tpages>20</tpages><orcidid>https://orcid.org/0000-0002-3739-925X</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1380-7501
ispartof Multimedia tools and applications, 2015-02, Vol.74 (4), p.1377-1396
issn 1380-7501
1573-7721
language eng
recordid cdi_hal_primary_oai_HAL_hal_01690342v1
source Springer Nature - Complete Springer Journals
subjects Acoustics
Analysis
Classification
Computer Communication Networks
Computer Science
Conversation
Conversational language
Data Structures and Information Theory
Experiments
Gaussian
Multimedia
Multimedia computer applications
Multimedia Information Systems
Pattern recognition
Phonetics
Special Purpose and Application-Based Systems
Speech
Studies
Support vector machines
Television
Television programs
Texts
Trains
Voice recognition
title Lexical speaker identification in TV shows
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-13T15%3A58%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_hal_p&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Lexical%20speaker%20identification%20in%20TV%20shows&rft.jtitle=Multimedia%20tools%20and%20applications&rft.au=Roy,%20Anindya&rft.date=2015-02-01&rft.volume=74&rft.issue=4&rft.spage=1377&rft.epage=1396&rft.pages=1377-1396&rft.issn=1380-7501&rft.eissn=1573-7721&rft_id=info:doi/10.1007/s11042-014-1940-3&rft_dat=%3Cproquest_hal_p%3E1669905613%3C/proquest_hal_p%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1761831748&rft_id=info:pmid/&rfr_iscdi=true
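
The description field above names TFIDF as one of the lexical approaches studied and mentions score-level sum fusion with a cepstral system. As a rough illustration only, the Python sketch below builds one TF-IDF vector per speaker from training transcripts, scores a test transcript against each speaker by cosine similarity, and adds a second set of scores to the lexical ones. The function names, the smoothed IDF formula, the cosine scoring, the toy transcripts and the cepstral score values are all assumptions made for this example; the record does not say how the authors implemented either step.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


def build_idf(token_lists):
    """Smoothed inverse document frequency, one 'document' per speaker."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen at training time are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_speaker_models(train_texts):
    """train_texts maps a speaker label to all training text for that speaker."""
    tokens = {spk: tokenize(txt) for spk, txt in train_texts.items()}
    idf = build_idf(list(tokens.values()))
    models = {spk: tfidf_vector(toks, idf) for spk, toks in tokens.items()}
    return idf, models


def lexical_scores(transcript, idf, models):
    """One similarity score per known speaker for a test transcript."""
    vec = tfidf_vector(tokenize(transcript), idf)
    return {spk: cosine(vec, model) for spk, model in models.items()}


def sum_fusion(scores_a, scores_b):
    """Score-level sum fusion: add the two systems' scores per speaker."""
    return {spk: scores_a[spk] + scores_b[spk] for spk in scores_a}


def identify(scores):
    """Return the speaker label with the highest score."""
    return max(scores, key=scores.get)


# Toy data, invented for this example.
train_texts = {
    "anchor": "good evening welcome to the show tonight our guests debate the budget",
    "journalist": "my question for the minister concerns the report published this morning",
}
idf, models = train_speaker_models(train_texts)
lexical = lexical_scores("welcome back to the show good evening", idf, models)

# Hypothetical cepstral scores; a real system would supply these.
cepstral = {"anchor": 0.4, "journalist": 0.5}
fused = sum_fusion(lexical, cepstral)
print(identify(lexical), identify(fused))
```

In a real setup the two systems would usually produce scores on different scales, so some form of score normalization before summing is likely needed; whether and how the authors normalized is not stated in this record.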