Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech
Alzheimer's disease (AD) is a complex neurocognitive disease and the main cause of dementia. Although many studies have targeted diagnosing dementia through spontaneous speech, limitations remain. Existing state-of-the-art approaches, which propose multimodal...
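The full record below notes that label smoothing is applied so that model calibration is taken into account. As an illustration only (not the authors' exact loss), one common formulation of a label-smoothed cross-entropy can be sketched as:

```python
import numpy as np

def label_smoothed_nll(logits: np.ndarray, target: int, eps: float = 0.1) -> float:
    """Cross-entropy against a smoothed target distribution: the true class
    gets probability 1 - eps and the remaining eps is spread uniformly over
    the other classes."""
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    n = logits.shape[0]
    smooth = np.full(n, eps / (n - 1))                   # mass on wrong classes
    smooth[target] = 1.0 - eps                           # mass on the true class
    return float(-(smooth * log_probs).sum())
```

With `eps = 0` this reduces to ordinary cross-entropy; a small positive `eps` penalizes over-confident predictions, which tends to improve calibration metrics.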
Saved in:
Published in: | arXiv.org 2023-07 |
---|---|
Main authors: | Loukas Ilias, Dimitris Askounis |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Loukas Ilias ; Askounis, Dimitris |
description | Alzheimer's disease (AD) is a complex neurocognitive disease and the main cause of dementia. Although many studies have targeted diagnosing dementia through spontaneous speech, limitations remain. Existing state-of-the-art approaches, which propose multimodal methods, train language and acoustic models separately, employ majority-vote approaches, and concatenate the representations of the different modalities either at the input level, i.e., early fusion, or during training. Also, some of them employ self-attention layers, which calculate the dependencies between representations without considering the contextual information. In addition, no prior work has taken model calibration into consideration. To address these limitations, we propose new methods for detecting AD patients that capture the intra- and cross-modal interactions. First, we convert the audio files into log-Mel spectrograms, their delta, and delta-delta, creating in this way an image per audio file consisting of three channels. Next, we pass each transcript and image through BERT and DeiT models respectively. After that, context-based self-attention layers, self-attention layers with a gate model, and optimal transport domain adaptation methods are employed for capturing the intra- and inter-modal interactions. Finally, we exploit two methods for fusing the self- and cross-attention features. To account for model calibration, we apply label smoothing. We use both performance and calibration metrics. Experiments conducted on the ADReSS and ADReSSo Challenge datasets indicate the efficacy of our introduced approaches over existing research initiatives, with our best-performing model reaching an Accuracy of 91.25% and an F1-score of 91.06%. |
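The audio-to-image step described above (log-Mel spectrogram plus its delta and delta-delta stacked as three channels) can be sketched as follows. This is a simplified illustration, not the authors' implementation: a plain Hann-windowed log-magnitude STFT stands in for the log-Mel spectrogram (the record does not state their exact parameters), and deltas are simple first differences rather than the regression-based deltas common in speech toolkits such as librosa.

```python
import numpy as np

def stft_log_mag(y: np.ndarray, n_fft: int = 512, hop: int = 256) -> np.ndarray:
    """Log-magnitude spectrogram via a simple Hann-windowed STFT
    (a stand-in for a proper log-Mel spectrogram)."""
    window = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * window
              for i in range(0, len(y) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq_bins, n_frames)
    return np.log(mag + 1e-8)

def delta(feat: np.ndarray) -> np.ndarray:
    """First-order temporal difference, padded to keep the frame count."""
    d = np.diff(feat, axis=1)
    return np.concatenate([d, d[:, -1:]], axis=1)

def audio_to_three_channels(y: np.ndarray) -> np.ndarray:
    """Stack the log spectrogram, its delta, and its delta-delta into a
    3-channel 'image', analogous to the RGB input a DeiT model expects."""
    s = stft_log_mag(y)
    d1 = delta(s)         # first-order derivative channel
    d2 = delta(d1)        # second-order (delta-delta) channel
    return np.stack([s, d1, d2])  # shape (3, freq_bins, n_frames)
```

In the same spirit as the paper's pipeline, the resulting array can then be resized and fed to an image backbone alongside the transcript's text encoder.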
doi_str_mv | 10.48550/arxiv.2305.16406 |
format | Article |
fullrecord | <record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2305_16406</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2820201882</sourcerecordid><originalsourceid>FETCH-LOGICAL-a956-e6cbb1cf829bfa6b4d42b19f3f8216e047f78f49430c159dfd6392b7ecc668913</originalsourceid><addsrcrecordid>eNotkMtqwzAQRUWh0JDmA7qqoGunelmWlyX0BYFusjdjS0ocbMmV5Cbpx_Rb66RdzXA5c2fmInRHyVKoPCePEI7t15Jxki-pFEReoRnjnGZKMHaDFjHuCSFMFizP-Qz9rLxL5pgyOEAwGFIyLrXe4Q5OJkTc-HHojMaHNu2wH1LbQ4dTABcHHxLWvofWYdAwJLjMgdO4H7sJ9HpC7RjPam_SzuuIrQ84mMZvXfvdui3Wpj_vA2yD7_Hk6RI448c49cY0u1t0baGLZvFf52jz8rxZvWXrj9f31dM6gzKXmZFNXdPGKlbWFmQttGA1LS2fFCoNEYUtlBWl4KSheamtlrxkdWGaRkpVUj5H93-2l_CqIUxvhlN1DrG6hDgRD3_EEPznaGKq9n4MbrqpYooRRqhSjP8CCPp6Qg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2820201882</pqid></control><display><type>article</type><title>Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech</title><source>arXiv.org</source><source>Free E- Journals</source><creator>Loukas Ilias ; Askounis, Dimitris</creator><creatorcontrib>Loukas Ilias ; Askounis, Dimitris</creatorcontrib><description>Alzheimer's disease (AD) constitutes a complex neurocognitive disease and is the main cause of dementia. Although many studies have been proposed targeting at diagnosing dementia through spontaneous speech, there are still limitations. Existing state-of-the-art approaches, which propose multimodal methods, train separately language and acoustic models, employ majority-vote approaches, and concatenate the representations of the different modalities either at the input level, i.e., early fusion, or during training. Also, some of them employ self-attention layers, which calculate the dependencies between representations without considering the contextual information. 
In addition, no prior work has taken into consideration the model calibration. To address these limitations, we propose some new methods for detecting AD patients, which capture the intra- and cross-modal interactions. First, we convert the audio files into log-Mel spectrograms, their delta, and delta-delta and create in this way an image per audio file consisting of three channels. Next, we pass each transcript and image through BERT and DeiT models respectively. After that, context-based self-attention layers, self-attention layers with a gate model, and optimal transport domain adaptation methods are employed for capturing the intra- and inter-modal interactions. Finally, we exploit two methods for fusing the self and cross-attention features. For taking into account the model calibration, we apply label smoothing. We use both performance and calibration metrics. Experiments conducted on the ADReSS and ADReSSo Challenge datasets indicate the efficacy of our introduced approaches over existing research initiatives with our best performing model reaching Accuracy and F1-score up to 91.25% and 91.06% respectively.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2305.16406</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Adaptation ; Alzheimer's disease ; Audio data ; Calibration ; Computer Science - Computation and Language ; Context ; Dementia ; Domains ; Medical imaging ; Model accuracy ; R&D ; Representations ; Research & development ; Spectrograms ; Speech</subject><ispartof>arXiv.org, 2023-07</ispartof><rights>2023. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://creativecommons.org/licenses/by-nc-nd/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,780,881,27904</link.rule.ids><backlink>$$Uhttps://doi.org/10.48550/arXiv.2305.16406$$DView paper in arXiv$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.1016/j.knosys.2023.110834$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink></links><search><creatorcontrib>Loukas Ilias</creatorcontrib><creatorcontrib>Askounis, Dimitris</creatorcontrib><title>Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech</title><title>arXiv.org</title><description>Alzheimer's disease (AD) constitutes a complex neurocognitive disease and is the main cause of dementia. Although many studies have been proposed targeting at diagnosing dementia through spontaneous speech, there are still limitations. Existing state-of-the-art approaches, which propose multimodal methods, train separately language and acoustic models, employ majority-vote approaches, and concatenate the representations of the different modalities either at the input level, i.e., early fusion, or during training. Also, some of them employ self-attention layers, which calculate the dependencies between representations without considering the contextual information. In addition, no prior work has taken into consideration the model calibration. To address these limitations, we propose some new methods for detecting AD patients, which capture the intra- and cross-modal interactions. 
First, we convert the audio files into log-Mel spectrograms, their delta, and delta-delta and create in this way an image per audio file consisting of three channels. Next, we pass each transcript and image through BERT and DeiT models respectively. After that, context-based self-attention layers, self-attention layers with a gate model, and optimal transport domain adaptation methods are employed for capturing the intra- and inter-modal interactions. Finally, we exploit two methods for fusing the self and cross-attention features. For taking into account the model calibration, we apply label smoothing. We use both performance and calibration metrics. Experiments conducted on the ADReSS and ADReSSo Challenge datasets indicate the efficacy of our introduced approaches over existing research initiatives with our best performing model reaching Accuracy and F1-score up to 91.25% and 91.06% respectively.</description><subject>Adaptation</subject><subject>Alzheimer's disease</subject><subject>Audio data</subject><subject>Calibration</subject><subject>Computer Science - Computation and Language</subject><subject>Context</subject><subject>Dementia</subject><subject>Domains</subject><subject>Medical imaging</subject><subject>Model accuracy</subject><subject>R&D</subject><subject>Representations</subject><subject>Research & 
development</subject><subject>Spectrograms</subject><subject>Speech</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><sourceid>GOX</sourceid><recordid>eNotkMtqwzAQRUWh0JDmA7qqoGunelmWlyX0BYFusjdjS0ocbMmV5Cbpx_Rb66RdzXA5c2fmInRHyVKoPCePEI7t15Jxki-pFEReoRnjnGZKMHaDFjHuCSFMFizP-Qz9rLxL5pgyOEAwGFIyLrXe4Q5OJkTc-HHojMaHNu2wH1LbQ4dTABcHHxLWvofWYdAwJLjMgdO4H7sJ9HpC7RjPam_SzuuIrQ84mMZvXfvdui3Wpj_vA2yD7_Hk6RI448c49cY0u1t0baGLZvFf52jz8rxZvWXrj9f31dM6gzKXmZFNXdPGKlbWFmQttGA1LS2fFCoNEYUtlBWl4KSheamtlrxkdWGaRkpVUj5H93-2l_CqIUxvhlN1DrG6hDgRD3_EEPznaGKq9n4MbrqpYooRRqhSjP8CCPp6Qg</recordid><startdate>20230726</startdate><enddate>20230726</enddate><creator>Loukas Ilias</creator><creator>Askounis, Dimitris</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20230726</creationdate><title>Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech</title><author>Loukas Ilias ; Askounis, Dimitris</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a956-e6cbb1cf829bfa6b4d42b19f3f8216e047f78f49430c159dfd6392b7ecc668913</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Adaptation</topic><topic>Alzheimer's disease</topic><topic>Audio data</topic><topic>Calibration</topic><topic>Computer Science - Computation and 
Language</topic><topic>Context</topic><topic>Dementia</topic><topic>Domains</topic><topic>Medical imaging</topic><topic>Model accuracy</topic><topic>R&D</topic><topic>Representations</topic><topic>Research & development</topic><topic>Spectrograms</topic><topic>Speech</topic><toplevel>online_resources</toplevel><creatorcontrib>Loukas Ilias</creatorcontrib><creatorcontrib>Askounis, Dimitris</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Loukas Ilias</au><au>Askounis, Dimitris</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous 
speech</atitle><jtitle>arXiv.org</jtitle><date>2023-07-26</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>Alzheimer's disease (AD) constitutes a complex neurocognitive disease and is the main cause of dementia. Although many studies have been proposed targeting at diagnosing dementia through spontaneous speech, there are still limitations. Existing state-of-the-art approaches, which propose multimodal methods, train separately language and acoustic models, employ majority-vote approaches, and concatenate the representations of the different modalities either at the input level, i.e., early fusion, or during training. Also, some of them employ self-attention layers, which calculate the dependencies between representations without considering the contextual information. In addition, no prior work has taken into consideration the model calibration. To address these limitations, we propose some new methods for detecting AD patients, which capture the intra- and cross-modal interactions. First, we convert the audio files into log-Mel spectrograms, their delta, and delta-delta and create in this way an image per audio file consisting of three channels. Next, we pass each transcript and image through BERT and DeiT models respectively. After that, context-based self-attention layers, self-attention layers with a gate model, and optimal transport domain adaptation methods are employed for capturing the intra- and inter-modal interactions. Finally, we exploit two methods for fusing the self and cross-attention features. For taking into account the model calibration, we apply label smoothing. We use both performance and calibration metrics. 
Experiments conducted on the ADReSS and ADReSSo Challenge datasets indicate the efficacy of our introduced approaches over existing research initiatives with our best performing model reaching Accuracy and F1-score up to 91.25% and 91.06% respectively.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2305.16406</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-07 |
issn | 2331-8422 |
language | eng |
recordid | cdi_arxiv_primary_2305_16406 |
source | arXiv.org; Free E-Journals |
subjects | Adaptation; Alzheimer's disease; Audio data; Calibration; Computer Science - Computation and Language; Context; Dementia; Domains; Medical imaging; Model accuracy; R&D; Representations; Research & development; Spectrograms; Speech |
title | Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T00%3A57%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Context-aware%20attention%20layers%20coupled%20with%20optimal%20transport%20domain%20adaptation%20and%20multimodal%20fusion%20methods%20for%20recognizing%20dementia%20from%20spontaneous%20speech&rft.jtitle=arXiv.org&rft.au=Loukas%20Ilias&rft.date=2023-07-26&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2305.16406&rft_dat=%3Cproquest_arxiv%3E2820201882%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2820201882&rft_id=info:pmid/&rfr_iscdi=true |