Human Motion Instruction Tuning
Abstract: This paper presents LLaMo (Large Language and Human Motion Assistant), a
multimodal framework for human motion instruction tuning. In contrast to
conventional instruction-tuning approaches that convert non-linguistic inputs,
such as video or motion sequences, into language tokens, LLaMo retains motion
in its native form for instruction tuning. This method preserves
motion-specific details that are often diminished in tokenization, thereby
improving the model's ability to interpret complex human behaviors. By
processing both video and motion data alongside textual inputs, LLaMo enables a
flexible, human-centric analysis. Experimental evaluations across
high-complexity domains, including human behaviors and professional activities,
indicate that LLaMo effectively captures domain-specific knowledge, enhancing
comprehension and prediction in motion-intensive scenarios. We hope LLaMo
offers a foundation for future multimodal AI systems with broad applications,
from sports analytics to behavioral prediction. Our code and models are
available on the project website: https://github.com/ILGLJ/LLaMo.
DOI: 10.48550/arxiv.2411.16805
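
The abstract contrasts LLaMo's native-form motion conditioning with pipelines that first quantize motion into discrete language-style tokens. As a rough illustration of that idea only (the paper's actual architecture, feature dimensions, and fusion mechanism are not given in the abstract), the sketch below projects continuous per-frame motion features into a language model's embedding space and prefixes them to the text embeddings. The class name `MotionProjector`, the 263-dimensional motion feature size, and the 4096-dimensional LLM width are all assumptions made for this sketch, not details from the paper.

```python
import torch
import torch.nn as nn

class MotionProjector(nn.Module):
    """Maps continuous motion features into an LLM's embedding space.

    Hypothetical sketch: LLaMo's real projector is not specified in the
    abstract; the layer choices and dimensions here are illustrative.
    """
    def __init__(self, motion_dim: int = 263, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(motion_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, frames, motion_dim) continuous pose features,
        # kept in native form rather than quantized into discrete tokens.
        return self.proj(motion)  # (batch, frames, llm_dim)

# Usage: prepend projected motion embeddings to the embedded instruction
# text before feeding a decoder-only LLM (prefix-style conditioning).
projector = MotionProjector()
motion = torch.randn(2, 120, 263)        # 2 clips, 120 frames each
text_embeds = torch.randn(2, 32, 4096)   # embedded instruction tokens
inputs = torch.cat([projector(motion), text_embeds], dim=1)
print(inputs.shape)  # torch.Size([2, 152, 4096])
```

The contrast with a tokenizing pipeline is the point of the sketch: a VQ-style approach would first snap each frame to a discrete codebook index, discarding fine-grained kinematic detail, whereas this path keeps the continuous signal intact all the way into the language model.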