Multimodal Segmentation for Vocal Tract Modeling

Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at \url{rishiraij.github.io/multimodal-mri-avatar/}.
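The abstract describes fusing audio with RT-MRI video frames to improve articulator segmentation. As a minimal, hypothetical sketch of that idea (not the paper's actual architecture), a per-frame audio embedding can be broadcast across spatial positions and concatenated with visual features before a per-pixel segmentation head; the shapes, feature sizes, and the linear head below are all illustrative assumptions:

```python
import numpy as np

def fuse_audio_visual(frame_feats, audio_feats):
    """Broadcast a per-frame audio embedding over all spatial positions
    and concatenate it with the visual feature map (a common early-fusion
    pattern; the paper's exact method may differ)."""
    h, w, _ = frame_feats.shape
    ca = audio_feats.shape[0]
    tiled = np.broadcast_to(audio_feats, (h, w, ca))
    return np.concatenate([frame_feats, tiled], axis=-1)

def segment(fused, weights, bias):
    """Toy per-pixel linear classifier over the fused features,
    standing in for a real segmentation head."""
    logits = fused @ weights + bias   # shape (H, W, n_classes)
    return logits.argmax(axis=-1)     # per-pixel articulator label

rng = np.random.default_rng(0)
frame = rng.normal(size=(8, 8, 4))    # visual features for one RT-MRI frame
audio = rng.normal(size=(3,))         # synchronized audio embedding
fused = fuse_audio_visual(frame, audio)         # shape (8, 8, 7)
labels = segment(fused, rng.normal(size=(7, 5)), np.zeros(5))  # shape (8, 8)
```

The point of the fusion step is that acoustic evidence (e.g. which phone is being produced) can disambiguate articulator boundaries that are visually unclear in low-contrast MRI frames.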

Bibliographic details
Published in: arXiv.org, 2024-06
Main authors: Jain, Rishi; Bohan, Yu; Wu, Peter; Prabhune, Tejas; Anumanchipalli, Gopala
Format: Article
Language: English
Subjects: Algorithms; Avatars; Datasets; Image segmentation; Labeling; Labels; Linguistics; Magnetic resonance imaging; Modelling; Motion capture; Speech processing; Vocal tract
Online access: Full text
container_title arXiv.org
creator Jain, Rishi; Bohan, Yu; Wu, Peter; Prabhune, Tejas; Anumanchipalli, Gopala
description Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at \url{rishiraij.github.io/multimodal-mri-avatar/}.
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_3072055503
source Freely Accessible Journals
subjects Algorithms
Avatars
Datasets
Image segmentation
Labeling
Labels
Linguistics
Magnetic resonance imaging
Modelling
Motion capture
Speech processing
Vocal tract
title Multimodal Segmentation for Vocal Tract Modeling