MuJo-SF: Multimodal Joint Slot Filling for Attribute Value Prediction of E-Commerce Commodities
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, pp. 10354–10366
Format: Article
Language: English
Abstract: Supplementing product attribute information is a critical step for e-commerce platforms, which further benefits various downstream tasks, including product recommendation, product search, and product knowledge graph construction. Intuitively, the visual information available on e-commerce platforms can effectively function as a primary source for certain product attributes. However, existing works either extract attribute values solely from textual product descriptions or leverage limited visual information (e.g., image features or optical character recognition tokens) to assist extraction, without effectively mining the fine-grained visual cues linked with the products. In this paper, we propose a novel task, Multimodal Joint Slot Filling (MuJo-SF), which aims to combine multimodal information from both product descriptions and their corresponding product images to jointly fill values into the pre-defined product attribute set. To this end, we develop MAVP, a new dataset with 79k instances of product description-image pairs. Specifically, we present a strategy to fulfill visualized saliency ascription, which aims to distinguish between text-dependent and image-dependent attributes. For those image-dependent attributes, we annotate the corresponding values from images using distant supervision. Then, we design a model for MuJo-SF, which combines multimodal representations and fills image-dependent and text-dependent attributes separately. Finally, we conduct extensive experiments on MAVP and provide rich results for MuJo-SF, which can be used as baselines to facilitate future research.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2024.3407667
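
The abstract describes a model that fuses representations of the product description and the product image and then fills text-dependent and image-dependent attributes separately. The sketch below is one minimal way such a joint slot-filling forward pass could be wired up in PyTorch; all module names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the architecture reported in the paper.

```python
# Hypothetical sketch of a multimodal joint slot-filling model.
# Module names, dimensions, and the fusion strategy are illustrative
# assumptions; they are not the paper's actual architecture.
import torch
import torch.nn as nn

class JointSlotFiller(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512,
                 num_text_attrs=10, num_image_attrs=5, num_values=1000):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Separate heads: text-dependent attributes read from the text
        # representation, image-dependent ones from the fused representation.
        self.text_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_values) for _ in range(num_text_attrs)])
        self.image_heads = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, num_values) for _ in range(num_image_attrs)])

    def forward(self, text_feat, image_feat):
        t = torch.relu(self.text_proj(text_feat))    # (batch, hidden_dim)
        v = torch.relu(self.image_proj(image_feat))  # (batch, hidden_dim)
        fused = torch.cat([t, v], dim=-1)            # (batch, 2 * hidden_dim)
        # One set of value logits per pre-defined attribute slot.
        text_logits = [head(t) for head in self.text_heads]
        image_logits = [head(fused) for head in self.image_heads]
        return text_logits, image_logits
```

For example, given a batch of text features of shape (batch, 768) from a text encoder and image features of shape (batch, 2048) from an image encoder, the module returns per-attribute logits over a candidate value vocabulary, one list for text-dependent slots and one for image-dependent slots.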