EV-TIFNet: lightweight binocular fusion network assisted by event camera time information for 3D human pose estimation

Bibliographic details
Published in:	Journal of Real-Time Image Processing, 2024-08, Vol. 21 (4), p. 150, Article 150
Main authors:	Zhao, Xin; Yang, Lianping; Huang, Wencong; Wang, Qi; Wang, Xin; Lou, Yantao
Format:	Article
Language:	English
Subjects:	Accuracy; Algorithms; Blurring; Cameras; Computer Graphics; Computer Science; Datasets; Deep learning; Energy consumption; Feature maps; Human motion; Human performance; Image Processing and Computer Vision; Information retrieval; Localization; Modules; Multimedia Information Systems; Neural networks; Pattern Recognition; Performance degradation; Pose estimation; R&D; Research & development; Research methodology; Semantics; Signal, Image and Speech Processing; Temporal resolution; Three dimensional motion
Online access:	Full text

Description
Abstract:	Human pose estimation using RGB cameras often suffers performance degradation in challenging scenarios such as motion blur or suboptimal lighting. In comparison, event cameras, endowed with a wide dynamic range, microsecond-scale temporal resolution, minimal latency, and low power consumption, demonstrate remarkable adaptability in extreme visual environments. Nevertheless, current research on event-camera pose estimation has not yet fully harnessed the potential of event-driven data, and improving model efficiency remains an ongoing pursuit. This work focuses on devising an efficient, compact pose estimation algorithm, with special attention to optimizing the fusion of multi-view event streams for improved pose prediction accuracy. We propose EV-TIFNet, a compact dual-view interactive network, which incorporates event frames along with our custom-designed Global Spatio-Temporal Feature Maps (GTF Maps). To enhance the network’s ability to understand motion characteristics and localize keypoints, we have tailored a dedicated Auxiliary Information Extraction Module (AIE Module) for the GTF Maps. Experimental results demonstrate that our model, with a compact parameter count of 0.55 million, achieves notable advancements on the DHP19 dataset, reducing the 3D MPJPE to 61.45 mm. Building upon the sparsity of event data, sparse convolution operators replace a significant portion of the traditional convolutional layers, reducing computational demand by 28.3% to a total of 8.71 GFLOPs. These design choices highlight the model’s suitability and efficiency in scenarios where computational resources are limited.
ISSN:	1861-8200 1861-8219
DOI:	10.1007/s11554-024-01528-3
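
Note on the reported metric: the abstract quotes accuracy as a 3D MPJPE in millimetres. The record does not define the metric, so the sketch below assumes the usual definition of mean per-joint position error, the Euclidean distance between each predicted and ground-truth joint averaged over all joints. The function name `mpjpe` and the toy joint coordinates are illustrative only and are not taken from the article.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error between two poses.

    pred, target: sequences of (x, y, z) joint coordinates given in the
    same units (e.g. millimetres) and in the same joint order.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Euclidean distance for each joint, averaged over all joints.
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

# Toy example with three hypothetical joints (made-up values, in mm).
predicted = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
ground_truth = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0), (0.0, 200.0, 0.0)]
error_mm = mpjpe(predicted, ground_truth)
```

In the toy example the per-joint distances are 5 mm, 12 mm, and 0 mm, so the result is their mean. A dataset-level figure such as the one in the abstract would typically average this quantity over all evaluated frames as well.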