Audio Visual Speaker Localization from EgoCentric Views
Format: Article
Language: English
Online access: Order full text
Abstract: The use of audio and visual modalities for speaker localization has been
well studied in the literature by exploiting their complementary characteristics.
However, most previous works employ a setting of static sensors mounted at fixed
positions. Unlike them, in this work we explore the egocentric setting, where
heterogeneous sensors are embodied and may move with a human to facilitate speaker
localization. Compared to the static scenario, the egocentric setting is more
realistic for smart-home applications, e.g., a service robot. However, it also
brings new challenges such as blurred images, frequent disappearance of the
speaker from the wearer's field of view, and occlusions. In this paper, we study
egocentric audio-visual speaker direction-of-arrival (DOA) estimation and address
the challenges mentioned above. Specifically, we propose a transformer-based
audio-visual fusion method to estimate the DOA of the speaker relative to the
wearer, and design a training strategy to mitigate the problem of the speaker
disappearing from the camera's view. We also develop a new dataset for simulating
out-of-view scenarios, by creating scenes in which the camera wearer walks around
while the speaker moves at the same time. The experimental results show that the
proposed method offers promising tracking accuracy on this new dataset. Finally,
we adapt the proposed method to the multi-speaker scenario. Experiments on
EasyCom demonstrate the effectiveness of the proposed model for multiple speakers
in real scenarios, achieving state-of-the-art results on the sphere active
speaker detection task and the wearer activity prediction task. The simulated
dataset and related code are available at
https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization.
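The abstract only names the fusion approach; as a rough illustration of the idea, the following is a minimal PyTorch sketch of transformer-based audio-visual fusion for DOA estimation. Everything here (the feature dimensions, the single shared encoder over concatenated audio and visual tokens, and the azimuth-bin classification head) is assumed for illustration and is not taken from the paper or its repository.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: the paper proposes a transformer-based
# audio-visual fusion model, but the exact architecture below
# (dimensions, layer counts, pooling, and the classification head)
# is assumed, not taken from the paper.
class AudioVisualDOAFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_doa_bins=360):
        super().__init__()
        # Project per-frame audio features (e.g., multichannel spectrogram
        # embeddings) and visual features (e.g., CNN frame embeddings)
        # into a shared d_model-dimensional token space.
        self.audio_proj = nn.Linear(512, d_model)    # 512: assumed audio feature size
        self.visual_proj = nn.Linear(1024, d_model)  # 1024: assumed visual feature size
        # A standard transformer encoder over the concatenated token
        # sequence lets audio and visual tokens attend to each other.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Classify the wearer-relative DOA into discrete azimuth bins.
        self.head = nn.Linear(d_model, n_doa_bins)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_audio, 512); visual_feats: (batch, T_video, 1024)
        tokens = torch.cat(
            [self.audio_proj(audio_feats), self.visual_proj(visual_feats)], dim=1
        )
        fused = self.encoder(tokens)
        # Mean-pool over the fused sequence and predict per-bin DOA logits.
        return self.head(fused.mean(dim=1))

model = AudioVisualDOAFusion()
logits = model(torch.randn(2, 50, 512), torch.randn(2, 10, 1024))
print(logits.shape)  # torch.Size([2, 360])
```

Framing DOA as classification over azimuth bins is one common choice in the localization literature; a regression head over angles would be an equally plausible reading of the abstract.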
DOI: 10.48550/arxiv.2309.16308