M&M Mix: A Multimodal Multiview Transformer Ensemble
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | This report describes the approach behind our winning solution to the 2022
Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent
work, Multiview Transformer for Video Recognition (MTV), and adapts it to
multimodal inputs. Our final submission consists of an ensemble of Multimodal
MTV (M&M) models with varying backbone sizes and input modalities. Our approach
achieved 52.8% Top-1 accuracy on the test set in action classes, which is 4.1%
higher than last year's winning entry. |
---|---|
DOI: | 10.48550/arxiv.2206.09852 |