Multi-granularity semantic relational mapping for image caption
Published in: | Expert systems with applications 2025-03, Vol.264, p.125847, Article 125847 |
---|---|
Main authors: | , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Summary: | When constructing object-relationship descriptions for images, existing image captioning methods incorporate regional semantic features into visual features to enhance the visual representation. However, they neglect the construction of grid semantic features, so the generated captions lack accurate, detailed relationships. We propose a Multi-granularity Semantic Relational Mapping (MSRM) framework that dynamically extracts image semantic cue features in place of traditional region labeling, which removes the semantic limitations of fixed classification labels and makes it possible to construct grid semantic features. MSRM uses the Internal Semantic Mapping mechanism to refine the semantic features by filtering out irrelevant ones and mapping the refined features onto region and grid features. Simultaneously, the Semantic Mapping mechanism integrates the composite features derived from regions and grids, thereby addressing the problem of describing semantic relationships among objects across different granularities. Experiments on the MSCOCO and Flickr30k datasets show that the proposed MSRM significantly outperforms the state-of-the-art baselines by more than 4% on 7 different metrics, including the BLEU scores, METEOR, ROUGE and CIDEr.
• We introduce a novel framework called Multi-granularity Semantic Relational Mapping (MSRM), which constructs multi-granularity semantic relational interactions between regions and grids. This framework enhances visual semantic relational representations, enabling the generation of captions that are rich in scene details and that accurately depict relationships within the scene. |
---|---|
ISSN: | 0957-4174 |
DOI: | 10.1016/j.eswa.2024.125847 |