Multiscale Modality-Similar Learning Guided Weakly Supervised RGB-T Crowd Counting

Bibliographic details
Published in:	IEEE Sensors Journal, 2024-09, Vol. 24 (18), p. 29121-29134
Main authors:	Kong, Weihang; Li, He; Zhao, Fengda
Format:	Article
Language:	English
Subjects:	Ablation; Annotations; Benchmarks; Convolution; Cross-modal feature fusion; Crowd monitoring; Feature extraction; Information processing from different-type sensors; Intelligent sensors; Modules; Multiscale feature fusion; Parameters; Redundancy; RGB-thermal (RGB-T) crowd counting; Sensor technology application; Sensors; Surveillance systems; Task analysis; Thermal sensors; Weakly supervised framework

Description
Abstract:	With the development of sensor technology and its numerous applications in intelligent surveillance systems, RGB-thermal (RGB-T) cross-modal crowd counting, which uses data from different sensors as source data, has received extensive attention from academia and industry. From the feature extraction aspect, existing cross-modal methods mainly adopt multiple parallel large convolution kernels to handle the notable variation in crowd scale, which results in a large number of parameters. From the supervision aspect, existing cross-modal crowd-counting methods adopt a fully supervised framework, which requires time-consuming and laborious pixel-level annotation. To address both issues, this article proposes a multiscale modality-similar guided weakly supervised cross-modal crowd-counting method, consisting of a multiscale context-level feature fusion (MCFF) module and a modality-similar weakly supervised framework. The proposed multiscale module equivalently decouples the square convolution into different directions to mitigate feature redundancy and parameter growth. The proposed weakly supervised framework exploits the similarity of cross-modal crowd semantic features to bootstrap the model using only image-level supervision. Experimental results on two public RGB-T benchmarks, one RGB-D benchmark, and collected real-world data show that the proposed weakly supervised method achieves counting accuracy competitive with existing representative fully supervised methods. Extensive ablation studies validate the positive contribution of the core modules to the final counting performance.
ISSN:	1530-437X, 1558-1748
DOI:	10.1109/JSEN.2024.3436859
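
Illustrative note: the abstract attributes the parameter savings to decoupling a square convolution into different directions. The record does not give the exact factorization, so the sketch below assumes one common reading, in which a k x k kernel is replaced by a 1 x k kernel followed by a k x 1 kernel with the same channel width. The function names and the channel count of 64 are hypothetical and are not taken from the article. The code only counts learnable parameters; it does not reproduce the MCFF module and makes no claim about numerical equivalence of the two layers.

```python
def conv_params(in_ch, out_ch, kh, kw, bias=True):
    """Learnable parameters of one 2-D convolution layer."""
    return in_ch * out_ch * kh * kw + (out_ch if bias else 0)


def square_params(ch, k):
    """Parameters of a single k x k convolution (ch -> ch channels)."""
    return conv_params(ch, ch, k, k)


def decoupled_params(ch, k):
    """Parameters of a 1 x k convolution followed by a k x 1 convolution.

    Assumption: this directional pair stands in for the 'decoupled'
    square convolution mentioned in the abstract.
    """
    return conv_params(ch, ch, 1, k) + conv_params(ch, ch, k, 1)


if __name__ == "__main__":
    channels = 64  # hypothetical channel width, for illustration only
    for k in (3, 5, 7):
        sq = square_params(channels, k)
        dc = decoupled_params(channels, k)
        print(f"k={k}: square={sq}, decoupled={dc}, ratio={dc / sq:.2f}")
```

Under this assumption the weight count grows linearly in k for the decoupled pair rather than quadratically, so the saving widens as the kernel gets larger, which is consistent with the abstract's concern about parallel large kernels.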