Empowering lightweight video transformer via the kernel learning


Bibliographic details
Published in: Electronics Letters 2024-05, Vol. 60 (9), p. n/a
Main authors: Liu, Xiaoxi; Liu, Ju; Gu, Lingchen
Format: Article
Language: English
Online access: Full text
Description
Summary: Video transformers achieve superior performance in video recognition. Despite recent advances, they still require substantial computation and memory resources. To improve computational efficiency, a kernel-based video transformer is proposed, comprising: (1) a new formulation of the video transformer via kernel learning, presented to better understand its individual components; (2) a lightweight kernel-based spatial–temporal multi-head self-attention block that learns compact joint spatial–temporal video features; (3) an adaptive-score position embedding method that improves the flexibility of the video transformer. Experimental results on several action recognition datasets demonstrate the effectiveness of the proposed method. Pretrained only on ImageNet-1K, the method achieves a favorable balance between computation and accuracy, while requiring 7× fewer parameters and 13× fewer floating-point operations than other comparable methods.
ISSN: 0013-5194, 1350-911X
DOI: 10.1049/ell2.13215