MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Saved in:

Main Authors: | |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online Access: | Order full text |
Abstract: | Massive multi-modality datasets play a significant role in the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames and treat audio as weakly related information, overlooking the inherent audio-visual correlation. This leads to monotonous annotation within each modality rather than comprehensive and precise descriptions, and it hinders a range of cross-modality studies. To fill this gap, we present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions and 2M high-quality clips with multimodal captions. Trailers preview full-length video works and integrate context, visual frames, and background music. In particular, trailers offer two main advantages: (1) their topics are diverse and their content spans various types, e.g., film, news, and gaming; (2) the background music is custom-designed, making it more coherent with the visual context. Based on these insights, we propose a systematic captioning framework that produces annotations across modalities for more than 27.1k hours of trailer videos. To ensure the caption retains the music perspective while preserving the authority of the visual context, we leverage an advanced LLM to merge all annotations adaptively. In this fashion, our MMTrail dataset potentially paves the path for fine-grained large multimodal-language model training. In experiments, we provide evaluation metrics and benchmark results on our dataset, demonstrating the high quality of our annotations and their effectiveness for model training. |
DOI: | 10.48550/arxiv.2407.20962 |
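The abstract describes merging per-modality captions (visual and music) into one multimodal caption with an LLM. The sketch below is only an illustration of that general idea, not the authors' actual pipeline or prompts; `call_llm` is a hypothetical stand-in for whatever chat-completion API is used.

```python
# Minimal sketch (assumed workflow, not MMTrail's published code): fuse
# per-modality captions for one trailer clip into a single caption via an LLM.

from typing import Callable, Dict


def build_merge_prompt(captions: Dict[str, str]) -> str:
    """Compose a prompt asking the LLM to merge modality-specific captions,
    keeping the visual description authoritative and adding music context."""
    lines = [
        "Merge the following per-modality descriptions of one trailer clip",
        "into a single fluent caption. Treat the visual caption as the",
        "primary source of facts; weave in the music description only where",
        "it is consistent with the visuals.",
        "",
    ]
    for modality, text in captions.items():
        lines.append(f"{modality} caption: {text}")
    return "\n".join(lines)


def merge_captions(captions: Dict[str, str], call_llm: Callable[[str], str]) -> str:
    """Send the merge prompt to an LLM and return its fused caption."""
    return call_llm(build_merge_prompt(captions))


if __name__ == "__main__":
    clip_captions = {
        "visual": "A lone astronaut walks across a red desert at dusk.",
        "music": "Slow ambient synth pads with a rising string motif.",
    }
    # Echo the prompt instead of calling a real model so the sketch runs as-is.
    print(merge_captions(clip_captions, call_llm=lambda prompt: prompt))
```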