Medical Image Description Based on Multimodal Auxiliary Signals and Transformer

Bibliographic details
Published in:	International journal of intelligent systems 2024-02, Vol.2024, p.1-12
Main authors:	Tan, Yun; Li, Chunzhi; Qin, Jiaohua; Xue, Youyuan; Xiang, Xuyu
Format:	Article
Language:	English
Subjects:	Audio data; Bias; Business metrics; Datasets; Deep learning; Medical imaging; Medical research; Neural networks; Radiology; Researchers; Transformers
Online access:	Full text

Description
Abstract:	Medical image description can support clinical diagnosis, but the field still faces serious challenges. Medical datasets suffer from pronounced visual and textual data bias, namely an imbalanced distribution of healthy and diseased samples. This bias can greatly degrade the learning performance of data-driven neural networks and ultimately lead to errors in the generated medical image descriptions. To address this problem, we propose a new medical image description network architecture, the multimodal data-assisted knowledge fusion network (MDAKF), which introduces multimodal auxiliary signals to guide the Transformer network toward generating more accurate medical reports. Specifically, audio auxiliary signals indicate clear abnormal visual regions to alleviate the visual data bias problem. However, audio signals with similar pronunciation are hard to distinguish, which may lead to incorrect mapping of audio labels to medical image regions. Therefore, we further fuse the audio features with text features as the auxiliary signal to improve the overall performance of the model. Experiments on two medical image description datasets, IU-X-ray and COV-CTR, show that the proposed model outperforms previous models on language generation evaluation metrics.
ISSN:	0884-8173 1098-111X
DOI:	10.1155/2024/6680546