Mono-Camera-Only Target Chasing for a Drone in a Dense Environment by Cross-Modal Learning

Bibliographic details
Published in:	IEEE Robotics and Automation Letters, 2024-08, Vol. 9 (8), pp. 7254-7261
Main authors:	Yoo, Seungyeon; Jung, Seungwoo; Lee, Yunwoo; Shim, Dongseok; Kim, H. Jin
Format:	Article
Language:	English
Subjects:	Cameras; Data acquisition; deep learning for visual perception; deep learning methods; Drones; Image reconstruction; Learning; Modal data; Modules; Perception; Planning; Representations; Target tracking; Task analysis; Tracking; Training; Vehicle dynamics; Vision-based navigation; visual learning

Description
Abstract:	Chasing a dynamic target in a dense environment is one of the challenging applications of autonomous drones. The task requires multi-modal data, such as RGB and depth, to accomplish safe and robust maneuvers. However, using several modalities can be difficult because drones are limited in hardware complexity and sensor cost. Our framework resolves these restrictions in the target chasing task by using only a monocular camera instead of multiple sensor inputs. From an RGB input, the perception module extracts a cross-modal representation containing information from multiple data modalities. To learn cross-modal representations at training time, we employ variational autoencoder (VAE) structures and a joint objective function across heterogeneous data. Subsequently, using latent vectors acquired from the pre-trained perception module, the planning module generates a proper next-time-step waypoint by imitation learning of an expert, which performs a numerical optimization using privileged RGB-D data. Furthermore, the planning module uses consecutive cross-modal representations to account for temporal information about the target, which improves tracking performance. Finally, we demonstrate the effectiveness of our framework through the reconstruction results of the perception module, the target chasing performance of the planning module, and the zero-shot sim-to-real deployment of a drone.
ISSN:	2377-3766
DOI:	10.1109/LRA.2024.3407412
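
Note on the abstract: the idea of a joint VAE objective across heterogeneous data can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes a toy setup in which one encoder maps a flattened RGB vector to a Gaussian latent, and two decoders reconstruct RGB and depth from the same latent sample. The linear layers, the parameter names (`W_mu`, `W_logvar`, `W_rgb`, `W_depth`), the mean-squared-error reconstruction terms, and the `beta` weight are all hypothetical choices made for illustration; the record does not specify the networks or loss terms used in the paper.

```python
import numpy as np


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dims, averaged over the batch."""
    per_sample = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)
    return float(np.mean(per_sample))


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (the usual VAE reparameterization)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def joint_objective(rgb, depth, params, rng, beta=1.0):
    """Toy cross-modal VAE loss: encode RGB only, decode both RGB and depth.

    rgb:   array of shape (batch, d_rgb)
    depth: array of shape (batch, d_depth)
    params: dict of hypothetical linear weights (assumed names, not from the paper)
    """
    # Encoder sees RGB only; depth is used purely as a training target.
    mu = rgb @ params["W_mu"]
    logvar = rgb @ params["W_logvar"]
    z = reparameterize(mu, logvar, rng)

    # Two decoders share the same latent vector z.
    rgb_hat = z @ params["W_rgb"]
    depth_hat = z @ params["W_depth"]

    rec_rgb = float(np.mean((rgb_hat - rgb) ** 2))
    rec_depth = float(np.mean((depth_hat - depth) ** 2))
    kl = kl_to_standard_normal(mu, logvar)

    total = rec_rgb + rec_depth + beta * kl
    return total, {"rec_rgb": rec_rgb, "rec_depth": rec_depth, "kl": kl}


def zero_params(d_rgb, d_depth, d_z):
    """All-zero weights, handy for checking the loss terms by hand."""
    return {
        "W_mu": np.zeros((d_rgb, d_z)),
        "W_logvar": np.zeros((d_rgb, d_z)),
        "W_rgb": np.zeros((d_z, d_rgb)),
        "W_depth": np.zeros((d_z, d_depth)),
    }
```

Because depth appears only in the loss and never in the encoder input, a model trained this way needs just the RGB image at inference time, which is the property the abstract relies on for mono-camera-only operation.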