Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024, pages 18816-18826
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | English |
Summary: | In this paper, we explore the capability of an agent to construct a logical
sequence of action steps, thereby assembling a strategic procedural plan. This
plan is crucial for navigating from an initial visual observation to a target
visual outcome, as depicted in real-life instructional videos. Existing works
have attained partial success by extensively leveraging various sources of
information available in the datasets, such as heavy intermediate visual
observations, procedural names, or natural language step-by-step instructions,
for features or supervision signals. However, the task remains formidable due
to the implicit causal constraints in the sequencing of steps and the
variability inherent in multiple feasible plans. To tackle these intricacies
that previous efforts have overlooked, we propose to enhance the capabilities
of the agent by infusing it with procedural knowledge. This knowledge, sourced
from training procedure plans and structured as a directed weighted graph,
equips the agent to better navigate the complexities of step sequencing and its
potential variations. We coin our approach KEPP, a novel Knowledge-Enhanced
Procedure Planning system, which harnesses a probabilistic procedural knowledge
graph extracted from training data, effectively acting as a comprehensive
textbook for the training domain. Experimental evaluations across three
widely-used datasets under settings of varying complexity reveal that KEPP
attains superior, state-of-the-art results while requiring only minimal
supervision. |
---|---|
DOI: | 10.48550/arxiv.2403.02782 |
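
The summary describes procedural knowledge "sourced from training procedure plans and structured as a directed weighted graph". The sketch below illustrates one way such a graph could be built and queried. It is not KEPP's implementation: the step names, the per-step normalization of transition counts, and the fixed-horizon path search are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_graph(plans):
    """Count step-to-step transitions and normalize them per source step.

    `plans` is a list of step sequences (hypothetical training plans).
    Returns {src: {dst: probability}}, a directed weighted graph.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for plan in plans:
        for src, dst in zip(plan, plan[1:]):
            counts[src][dst] += 1
    graph = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        graph[src] = {dst: n / total for dst, n in outgoing.items()}
    return graph


def most_probable_path(graph, start, goal, horizon):
    """Return (probability, path) for the best path of exactly `horizon`
    steps from `start` to `goal`, or None if no such path exists.

    Assumed query for illustration: a Viterbi-style dynamic program that
    keeps, for every reachable step, only its most probable prefix.
    """
    best = {start: (1.0, [start])}
    for _ in range(horizon - 1):
        nxt = {}
        for node, (prob, path) in best.items():
            for dst, weight in graph.get(node, {}).items():
                cand = prob * weight
                if dst not in nxt or cand > nxt[dst][0]:
                    nxt[dst] = (cand, path + [dst])
        best = nxt
    return best.get(goal)


# Hypothetical training plans (made-up step names).
plans = [
    ["pour water", "add coffee", "stir"],
    ["pour water", "add coffee", "stir"],
    ["pour water", "add sugar", "stir"],
    ["pour water", "add coffee", "add sugar", "stir"],
]
graph = build_graph(plans)
result = most_probable_path(graph, "pour water", "stir", horizon=3)
```

Querying the same graph with a different `horizon` returns a different feasible plan, which mirrors the summary's point that several step sequences can connect the same start and goal.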