ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Learning to infer labels in an open world, i.e., in an environment where the
target "labels" are unknown, is an important characteristic for achieving
autonomy. Foundation models pre-trained on enormous amounts of data have shown
remarkable generalization skills through prompting, particularly in zero-shot
inference. However, their performance is limited by the correctness of the
target label's search space. In an open world, this target search space can be
unknown or exceptionally large, which severely restricts the performance of
such models. To tackle this challenging problem, we propose a neuro-symbolic
framework called ALGO - Action Learning with Grounded Object recognition that
uses symbolic knowledge stored in large-scale knowledge bases to infer
activities in egocentric videos with limited supervision using two steps.
First, we propose a neuro-symbolic prompting approach that uses object-centric
vision-language models as a noisy oracle to ground objects in the video through
evidence-based reasoning. Second, driven by prior commonsense knowledge, we
discover plausible activities through an energy-based symbolic pattern theory
framework and learn to ground knowledge-based action (verb) concepts in the
video. Extensive experiments on four publicly available datasets
(EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus) demonstrate its performance on
open-world activity inference. |
---|---|
DOI: | 10.48550/arxiv.2406.05722 |