CE-DCVSI: Multimodal relational extraction based on collaborative enhancement of dual-channel visual semantic information

Published in: Expert Systems with Applications, 2025-03, Vol. 262, Article 125608
Authors: Gong, Yunchao; Lv, Xueqiang; Yuan, Zhu; Hu, Feng; Cai, Zangtai; Chen, Yuzhong; Wang, Zhaojun; You, Xindong
Format: Article
Language: English
Online access: Full text
Description
Abstract: Visual information implied by the images in multimodal relation extraction (MRE) usually contains details that are difficult to describe in text sentences. Integrating textual and visual information is the mainstream approach to enhancing the understanding and extraction of relations between entities. However, existing MRE methods neglect the semantic gap caused by data heterogeneity. Besides, some approaches map the relations between target objects in image scene graphs to text, but massive invalid visual relations introduce noise. To alleviate these problems, we propose a novel multimodal relation extraction method based on collaborative enhancement of dual-channel visual semantic information (CE-DCVSI). Specifically, to mitigate the semantic gap between modalities, we realize fine-grained semantic alignment between entities and target objects through multimodal heterogeneous graphs, aligning the feature representations of the different modalities into the same semantic space with a heterogeneous graph Transformer, thus promoting the consistency and accuracy of the feature representations. To eliminate the effect of useless visual relations, we perform multi-scale feature fusion between different levels of visual information and textual representations to increase the complementarity between features, improving the comprehensiveness and robustness of the multimodal representation. Finally, we apply the information bottleneck principle to filter invalid information out of the multimodal representation, mitigating the negative impact of irrelevant noise. Experiments show that the method achieves an F1 score of 86.08% on the publicly available MRE dataset, outperforming the baseline methods.

Highlights:
• Capturing fine-grained and multi-scale multimodal representations.
• Multimodal heterogeneous graphs illuminate fine-grained alignment relations between modalities.
• The heterogeneous graph Transformer adjusts the interaction strength of cross-modal features.
• Eliminating redundant and useless information through information bottlenecks.
• Experimental results demonstrate the effectiveness and robustness of our method.
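The abstract describes the alignment step only at a high level. As a rough illustration (not the authors' implementation), the following is a minimal sketch of fine-grained cross-modal alignment with a heterogeneous graph Transformer layer, here using PyTorch Geometric's HGTConv. The node types ('token' for textual entities, 'object' for visual target objects), the edge types, and all feature dimensions are assumptions made for this example.

# Hypothetical sketch: projecting textual entities and visual objects into a
# shared semantic space with one heterogeneous graph Transformer layer.
# Node/edge types and dimensions are illustrative, not taken from the paper.
import torch
from torch_geometric.data import HeteroData
from torch_geometric.nn import HGTConv

data = HeteroData()
data['token'].x = torch.randn(12, 768)   # e.g., textual entity features
data['object'].x = torch.randn(5, 768)   # e.g., detected object features
# Fine-grained alignment edges between entities and objects (plus reverses,
# so both node types receive messages during propagation).
data['token', 'aligned_with', 'object'].edge_index = torch.tensor(
    [[0, 3, 7], [1, 0, 4]])
data['object', 'rev_aligned_with', 'token'].edge_index = torch.tensor(
    [[1, 0, 4], [0, 3, 7]])

conv = HGTConv(in_channels=768, out_channels=256,
               metadata=data.metadata(), heads=4)
# One message-passing step maps both modalities into the same 256-d space.
out = conv(data.x_dict, data.edge_index_dict)
print(out['token'].shape, out['object'].shape)  # [12, 256] and [5, 256]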
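Likewise, "multi-scale feature fusion between different levels of visual information and textual representations" could take several forms; one plausible reading, sketched below with standard PyTorch cross-attention, lets the text sequence attend to visual features drawn from several network depths and merges the per-scale results. The module name MultiScaleFusion and all shapes are hypothetical.

# Hypothetical sketch of multi-scale text-visual fusion: text queries each
# visual feature level, and per-scale outputs are concatenated and merged.
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    def __init__(self, dim=256, num_scales=3, heads=4):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_scales))
        self.merge = nn.Linear(num_scales * dim, dim)

    def forward(self, text, visual_levels):
        # text: [B, T, dim]; visual_levels: list of [B, N_i, dim] tensors.
        fused = [attn(text, v, v)[0]          # text attends to each scale
                 for attn, v in zip(self.attns, visual_levels)]
        return self.merge(torch.cat(fused, dim=-1))  # [B, T, dim]

text = torch.randn(2, 16, 256)
levels = [torch.randn(2, n, 256) for n in (49, 196, 784)]  # coarse -> fine
out = MultiScaleFusion()(text, levels)
print(out.shape)  # torch.Size([2, 16, 256])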
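Finally, the information bottleneck filtering is stated only at the level of principle. A standard formulation (the paper may well use a variational approximation in practice) learns an encoding Z of the multimodal representation X that stays predictive of the relation label Y while discarding irrelevant detail:

% Information bottleneck objective: compress X into Z while preserving
% the information relevant to predicting the relation label Y.
\max_{p(z \mid x)} \; I(Z; Y) \;-\; \beta \, I(Z; X)

Here I(\cdot;\cdot) is mutual information and \beta > 0 trades prediction against compression: a larger \beta filters the multimodal representation more aggressively.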
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2024.125608