What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions
Main Authors:
Format: Article
Language: English
Online Access: Order full text
Summary: Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in their daily lives. Our experiments show that our "muscly-supervised" representation, which encodes interaction and attention cues, outperforms a visual-only state-of-the-art method, MoCo (He et al., 2020), on a variety of target tasks: scene classification (semantic), action recognition (temporal), depth estimation (geometric), dynamics prediction (physics), and walkable surface estimation (affordance). Our code and dataset are available at: https://github.com/ehsanik/muscleTorch.
DOI: 10.48550/arxiv.2010.08539
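The summary evaluates the learned representation by transferring it to downstream target tasks such as scene classification. As a rough illustration of how such a transfer is commonly set up (this is not the paper's actual evaluation code or the muscleTorch API), the sketch below freezes a pretrained visual encoder and trains a linear probe on a target task; the ResNet-18 backbone, checkpoint path, and class count are assumptions made only for the example.

```python
# Illustrative only: generic linear-probe evaluation of a pretrained visual
# backbone on a downstream classification task. The backbone choice, the
# checkpoint path, and the number of classes are assumptions for this sketch
# and are not taken from the paper or the muscleTorch repository.
import torch
import torch.nn as nn
import torchvision

# Pretrained visual encoder (stand-in for a representation learned with
# interaction/attention supervision).
backbone = torchvision.models.resnet18(weights=None)
backbone.fc = nn.Identity()  # keep only the 512-d feature extractor
# state = torch.load("muscly_supervised_encoder.pth")  # hypothetical path
# backbone.load_state_dict(state)

for p in backbone.parameters():  # freeze features; train only the probe
    p.requires_grad = False

num_classes = 10  # e.g. number of scene categories in the target task
probe = nn.Linear(512, num_classes)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One linear-probe update on a batch of target-task images."""
    with torch.no_grad():
        feats = backbone(images)       # frozen features
    logits = probe(feats)              # trainable linear head
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this kind of protocol, the quality of the frozen features determines probe accuracy, which is why it is a common way to compare representations learned with different supervision signals (e.g. interaction-supervised vs. visual-only).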