ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Abstract: | Human action or activity recognition in videos is a fundamental task in
computer vision with applications in surveillance and monitoring, self-driving
cars, sports analytics, human-robot interaction, and many more. Traditional
supervised methods require large annotated datasets for training, which are
expensive and time-consuming to acquire. This work proposes a novel approach
using Cross-Architecture Pseudo-Labeling with contrastive learning for
semi-supervised action recognition. Our framework leverages both labeled and
unlabeled data to robustly learn action representations in videos, combining
pseudo-labeling with contrastive learning for effective learning from both
types of samples. We introduce a novel cross-architecture approach in which 3D
Convolutional Neural Networks (3D CNNs) and video transformers (VIT) are
utilized to capture different aspects of action representations; hence we call
it ActNetFormer. The 3D CNNs excel at capturing spatial features and local
dependencies in the temporal domain, while VITs excel at capturing long-range
dependencies across frames. By integrating these complementary architectures
within the ActNetFormer framework, our approach can effectively capture both
local and global contextual information of an action. This comprehensive
representation learning enables the model to achieve better performance in
semi-supervised action recognition tasks by leveraging the strengths of each of
these architectures. Experimental results on standard action recognition
datasets demonstrate that our approach outperforms existing methods, achieving
state-of-the-art performance with only a fraction of labeled data. The official
website of this work is available at:
https://github.com/rana2149/ActNetFormer. |
DOI: | 10.48550/arxiv.2404.06243 |
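
The abstract describes cross-architecture pseudo-labeling, in which a 3D CNN and a video transformer each label unlabeled clips for the other. Below is a minimal sketch of that idea, assuming PyTorch; the tiny models, the `conf_threshold` parameter, and all dimensions are illustrative stand-ins, not the paper's actual backbones or hyperparameters.

```python
# Minimal cross-architecture pseudo-labeling sketch (assumes PyTorch).
# Each branch pseudo-labels confident unlabeled clips for the other;
# augmentation pipelines are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tiny3DCNN(nn.Module):
    """Stand-in for the 3D CNN branch (local spatio-temporal features)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv3d(3, 16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):                    # x: (B, 3, T, H, W)
        h = F.relu(self.conv(x))
        return self.fc(self.pool(h).flatten(1))

class TinyVideoTransformer(nn.Module):
    """Stand-in for the transformer branch (long-range dependencies)."""
    def __init__(self, num_classes=10, dim=64):
        super().__init__()
        self.embed = nn.Linear(3 * 16 * 16, dim)   # assumes 16x16 frames, one token per frame
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):                    # x: (B, 3, T, H, W)
        B, C, T, H, W = x.shape
        tokens = x.permute(0, 2, 1, 3, 4).reshape(B, T, -1)
        return self.fc(self.encoder(self.embed(tokens)).mean(dim=1))

def cross_pseudo_label_loss(cnn, vit, unlabeled_clips, conf_threshold=0.9):
    """Each branch trains on the other branch's confident predictions."""
    with torch.no_grad():
        p_cnn = F.softmax(cnn(unlabeled_clips), dim=1)
        p_vit = F.softmax(vit(unlabeled_clips), dim=1)
    loss = unlabeled_clips.new_zeros(())
    for teacher_probs, student in ((p_cnn, vit), (p_vit, cnn)):
        conf, pseudo = teacher_probs.max(dim=1)
        mask = conf >= conf_threshold        # keep only confident pseudo-labels
        if mask.any():
            logits = student(unlabeled_clips[mask])
            loss = loss + F.cross_entropy(logits, pseudo[mask])
    return loss

# Usage: combine with the supervised cross-entropy loss on labeled clips.
cnn, vit = Tiny3DCNN(), TinyVideoTransformer()
clips = torch.randn(4, 3, 8, 16, 16)         # (batch, channels, frames, H, W)
print(cross_pseudo_label_loss(cnn, vit, clips))
```

With random initial weights the confidence threshold will rarely be met, so the loss starts near zero; in training, the pseudo-label term kicks in as the branches become confident.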
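
The framework also pairs pseudo-labeling with contrastive learning. The sketch below is a minimal NT-Xent-style contrastive loss, assuming that embeddings of the same clip produced by the two branches form a positive pair while other clips in the batch act as negatives; the `temperature` value and the 128-dimensional projections are illustrative choices, not values from the paper.

```python
# Minimal NT-Xent-style contrastive loss sketch (assumes PyTorch).
import torch
import torch.nn.functional as F

def contrastive_loss(z_cnn, z_vit, temperature=0.1):
    """z_cnn, z_vit: (B, D) projected embeddings of the same B clips."""
    z_cnn = F.normalize(z_cnn, dim=1)
    z_vit = F.normalize(z_vit, dim=1)
    z = torch.cat([z_cnn, z_vit], dim=0)      # (2B, D)
    sim = z @ z.t() / temperature             # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))         # exclude self-similarity
    B = z_cnn.size(0)
    # The positive for each row is the same clip seen by the other branch.
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# Usage with random embeddings standing in for the two branch outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(contrastive_loss(z1, z2))
```

Pulling the two branches' embeddings of the same clip together while pushing other clips apart is one way the complementary local (3D CNN) and global (VIT) representations described in the abstract can be aligned.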