MOSAIC: Learning Unified Multi-Sensory Object Property Representations for Robot Learning via Interactive Perception
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: A holistic understanding of object properties across diverse sensory
modalities (e.g., visual, audio, and haptic) is essential for tasks ranging
from object categorization to complex manipulation. Drawing inspiration from
cognitive science studies that emphasize the significance of multi-sensory
integration in human perception, we introduce MOSAIC (Multimodal Object
property learning with Self-Attention and Interactive Comprehension), a novel
framework designed to facilitate the learning of unified multi-sensory object
property representations. While it is undeniable that visual information plays
a prominent role, we acknowledge that many fundamental object properties extend
beyond the visual domain to encompass attributes like texture, mass
distribution, or sounds, which significantly influence how we interact with
objects. In MOSAIC, we leverage this insight by distilling knowledge from
multimodal foundation models and aligning these representations not only with
the visual but also the haptic and auditory sensory modalities. Through
extensive experiments on a dataset where a humanoid robot interacts with 100
objects across 10 exploratory behaviors, we demonstrate the versatility of
MOSAIC in two task families: object categorization and object-fetching tasks.
Our results underscore the efficacy of MOSAIC's unified representations,
showing competitive performance in category recognition through a simple linear
probe setup and excelling in the object-fetching task under zero-shot transfer
conditions. This work pioneers the application of sensory grounding in
foundation models for robotics, promising a significant leap in multi-sensory
perception capabilities for autonomous systems. We have released the code,
datasets, and additional results: https://github.com/gtatiya/MOSAIC.
DOI: 10.48550/arxiv.2309.08508
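
The abstract describes MOSAIC's core idea at a high level: distill knowledge from a multimodal foundation model into encoders for non-visual modalities so that all senses share one embedding space, then evaluate the unified representation with a simple linear probe. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' released implementation: the projector architecture, feature dimensions, cosine-distillation loss, and category count are all assumptions made for illustration.

```python
# Minimal sketch (assumptions throughout; see lead-in above) of aligning a
# haptic encoder to a frozen multimodal foundation model's embedding space,
# plus a linear probe on the resulting unified representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityProjector(nn.Module):
    """Projects raw modality features (e.g., haptic or audio) into the shared space."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def distillation_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Pull student embeddings toward the frozen teacher's embeddings (cosine loss)."""
    return (1.0 - F.cosine_similarity(student, teacher, dim=-1)).mean()

# Toy batch: 8 interactions, hypothetical 256-d haptic features, distilled
# toward 512-d embeddings from a frozen foundation model (placeholder tensors).
haptic_proj = ModalityProjector(in_dim=256)
haptic_feats = torch.randn(8, 256)
teacher_emb = F.normalize(torch.randn(8, 512), dim=-1)
align_loss = distillation_loss(haptic_proj(haptic_feats), teacher_emb)
align_loss.backward()  # in practice, step an optimizer over the projector's parameters

# Linear probe for category recognition on frozen unified embeddings
# (num_categories is a hypothetical value, not taken from the paper).
num_categories = 20
probe = nn.Linear(512, num_categories)
logits = probe(haptic_proj(haptic_feats).detach())  # embeddings kept frozen
probe_loss = F.cross_entropy(logits, torch.randint(0, num_categories, (8,)))
```

Normalizing both student and teacher embeddings before the cosine loss keeps them on the unit sphere, the standard setup when aligning new modalities to a CLIP-style shared embedding space.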