Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech

Alzheimer's disease (AD) is a complex neurocognitive disease and the main cause of dementia. Although many studies have targeted the diagnosis of dementia from spontaneous speech, limitations remain. Existing state-of-the-art approaches, which propose multimodal methods, train language and acoustic models separately, employ majority-vote approaches, and concatenate the representations of the different modalities either at the input level (early fusion) or during training.

Detailed description

Saved in:
Bibliographic details
Published in: arXiv.org 2023-07
Main authors: Loukas Ilias, Askounis, Dimitris
Format: Article
Language: eng
Subjects:
Online access: Full text
container_title arXiv.org
creator Loukas Ilias
Askounis, Dimitris
description Alzheimer's disease (AD) is a complex neurocognitive disease and the main cause of dementia. Although many studies have targeted the diagnosis of dementia from spontaneous speech, limitations remain. Existing state-of-the-art approaches, which propose multimodal methods, train language and acoustic models separately, employ majority-vote approaches, and concatenate the representations of the different modalities either at the input level (early fusion) or during training. Also, some of them employ self-attention layers, which compute dependencies between representations without considering contextual information. In addition, no prior work has taken model calibration into consideration. To address these limitations, we propose new methods for detecting AD patients that capture both intra- and cross-modal interactions. First, we convert each audio file into a log-Mel spectrogram along with its delta and delta-delta, thereby creating a three-channel image per audio file. Next, we pass each transcript and image through BERT and DeiT models respectively. After that, context-based self-attention layers, self-attention layers with a gate model, and optimal transport domain adaptation methods are employed to capture the intra- and inter-modal interactions. Finally, we exploit two methods for fusing the self- and cross-attention features. To account for model calibration, we apply label smoothing. We report both performance and calibration metrics. Experiments conducted on the ADReSS and ADReSSo Challenge datasets demonstrate the efficacy of our approaches over existing research initiatives, with our best-performing model reaching an Accuracy of 91.25% and an F1-score of 91.06%.
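The three-channel audio representation described in the abstract (a log-Mel spectrogram stacked with its delta and delta-delta) can be sketched as follows. This is a minimal illustration, not the authors' code: `np.gradient` stands in for the regression-based delta computation of typical audio toolkits, and the array sizes are arbitrary.

```python
import numpy as np

def three_channel_image(log_mel: np.ndarray) -> np.ndarray:
    """Stack a log-Mel spectrogram with its delta and delta-delta
    (approximated here by successive time-axis gradients) into a
    3-channel image of shape (3, n_mels, n_frames)."""
    delta = np.gradient(log_mel, axis=1)    # first temporal derivative
    delta2 = np.gradient(delta, axis=1)     # second temporal derivative
    return np.stack([log_mel, delta, delta2], axis=0)

# toy example: 64 mel bands x 100 frames of random values
spec = np.random.default_rng(0).normal(size=(64, 100))
img = three_channel_image(spec)             # shape (3, 64, 100)
```

The resulting tensor matches the (channels, height, width) layout expected by image models such as DeiT.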
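The optimal transport component can be illustrated with a generic entropy-regularised Sinkhorn solver that aligns two small feature sets. This is a sketch of the general technique under uniform marginals, not the paper's implementation; the feature arrays, sizes, and the regularisation value are all illustrative.

```python
import numpy as np

def sinkhorn(cost: np.ndarray, reg: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropy-regularised optimal transport between two uniform
    marginals via Sinkhorn iterations on the kernel K = exp(-cost/reg)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)              # uniform source marginal
    b = np.full(m, 1.0 / m)              # uniform target marginal
    K = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # scale columns toward b
        u = a / (K @ v)                  # scale rows toward a
    return u[:, None] * K * v[None, :]   # transport plan, entries sum to 1

# toy cross-modal alignment: 4 "text" vectors vs 5 "audio" vectors
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 8))
audio_feats = rng.normal(size=(5, 8))
cost = ((text_feats[:, None, :] - audio_feats[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()                 # normalise so exp() stays well scaled
plan = sinkhorn(cost)
```

The transport plan gives a soft matching between the two modalities' representations; in a domain-adaptation setting the expected cost under this plan would typically be minimised as an alignment loss.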
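Label smoothing, used above for calibration, replaces a one-hot target with a mixture of the one-hot vector and the uniform distribution. A minimal sketch follows; the smoothing weight `alpha = 0.1` is a common default, not a value taken from the paper.

```python
import numpy as np

def smooth_labels(labels: np.ndarray, num_classes: int, alpha: float = 0.1) -> np.ndarray:
    """Put (1 - alpha) on the true class and spread alpha uniformly
    over all classes, so each target row still sums to 1."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - alpha) * one_hot + alpha / num_classes

# binary AD vs. control targets for two samples
targets = smooth_labels(np.array([0, 1]), num_classes=2, alpha=0.1)
```

Training against these softened targets discourages over-confident predictions, which tends to improve calibration metrics without changing the architecture.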
doi_str_mv 10.48550/arxiv.2305.16406
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2023-07
issn 2331-8422
language eng
recordid cdi_arxiv_primary_2305_16406
source arXiv.org; Free E-Journals
subjects Adaptation
Alzheimer's disease
Audio data
Calibration
Computer Science - Computation and Language
Context
Dementia
Domains
Medical imaging
Model accuracy
R&D
Representations
Research & development
Spectrograms
Speech
title Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech