Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition
Format: Article
Language: English
Abstract: We are concerned with a challenging scenario in unpaired multiview video learning, in which the model aims to learn comprehensive multiview representations while the cross-view semantic information exhibits variations. We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem. The key idea is to build cross-view pseudo-pairs and perform view-invariant alignment by leveraging the semantic information of videos. To improve the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos, fully leveraging semantic knowledge to improve video representations. Extensive experiments on multiple benchmark datasets verify the effectiveness of our framework. Our method also outperforms several existing view-alignment methods in a more challenging scenario than typical paired or unpaired multimodal or multiview learning. Our code is available at https://github.com/wqtwjt1996/SUM-L.
DOI: 10.48550/arxiv.2308.11489
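
As a rough illustration of the key idea in the abstract, the sketch below shows how cross-view pseudo-pairs might be built from semantic similarity between first-person (ego) and third-person (exo) clip embeddings, and then used in a contrastive view-invariant alignment loss. The function names, the InfoNCE-style loss, and the temperature value are illustrative assumptions, not the paper's actual method; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn.functional as F

def build_pseudo_pairs(ego_feats: torch.Tensor, exo_feats: torch.Tensor) -> torch.Tensor:
    # For each first-person (ego) clip, select the most semantically similar
    # third-person (exo) clip as its cross-view pseudo-pair. Inputs are assumed
    # to be L2-normalized embeddings, so the dot product is cosine similarity.
    sim = ego_feats @ exo_feats.t()   # (N_ego, N_exo) similarity matrix
    return sim.argmax(dim=1)          # pseudo-pair index for each ego clip

def view_invariant_loss(ego_feats, exo_feats, pair_idx, temperature=0.07):
    # InfoNCE-style contrastive loss (an assumption): pull each ego clip toward
    # its pseudo-paired exo clip, push it away from the other exo clips.
    logits = ego_feats @ exo_feats.t() / temperature
    return F.cross_entropy(logits, pair_idx)

# Hypothetical usage with random embeddings standing in for encoder outputs.
ego = F.normalize(torch.randn(8, 256), dim=1)    # first-person clip embeddings
exo = F.normalize(torch.randn(16, 256), dim=1)   # third-person clip embeddings
pairs = build_pseudo_pairs(ego, exo)
loss = view_invariant_loss(ego, exo, pairs)
```

The video-text alignment step described in the abstract would plausibly follow the same contrastive pattern, pairing video embeddings with text embeddings of the videos' semantic descriptions instead of cross-view pairs.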