Rethinking vision transformer through human–object interaction detection

Bibliographic Details
Published in: Engineering Applications of Artificial Intelligence, 2023-06, Vol. 122, p. 106123, Article 106123
Authors: Cheng, Yamin; Zhao, Zitian; Wang, Zhi; Duan, Hancong
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Abstract: Recent works have shown that Vision Transformer (ViT) models can achieve comparable or even superior performance on image- and region-level recognition tasks, i.e., image recognition and object detection. However, can a Vision Transformer perform region-level relationship reasoning with minimal information about the spatial geometry of the input images? To answer this question, we propose the Region-level Relationship Reasoning Vision Transformer (R3ViT), a family of human–object interaction detection models based on the vanilla Vision Transformer with the fewest possible revisions, region priors, and task-specific inductive biases. Specifically, we first divide the input images into local patches, replace the specialized [CLS] token of the vanilla ViT with extra relationship-semantics carrier tokens in an entanglement-, pair-, or triplet-wise manner, and compute both the representations and their relevance. We assign each extra token its own supervision signal and compute the training loss in a dense manner. We find that a Vision Transformer adjusted only by this paradigm can already reason about region-level visual relationships; e.g., R3ViT achieves strong performance on the challenging human–object interaction detection benchmark. We also use R3ViT to discuss the impact of adjustment schemes and model scaling strategies for the Vision Transformer. Numerically, extensive experiments on several benchmarks demonstrate that our framework outperforms most existing methods, achieving 28.91 mAP on HICO-DET and 56.8 mAP on V-COCO, respectively.
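
The abstract describes the overall recipe: patch embedding, a vanilla Transformer encoder, learnable relationship-carrier tokens in place of the single [CLS] token, and individual supervision of every carrier token. The following is a minimal PyTorch sketch of that idea, not the authors' released code; the class name R3ViTSketch, the number of carrier tokens, the per-token head layout, and the 117-verb/80-object output sizes are illustrative assumptions.

import torch
import torch.nn as nn


class R3ViTSketch(nn.Module):
    """Sketch of a ViT whose [CLS] token is replaced by several
    relationship-carrier tokens, each with its own prediction heads."""

    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, num_rel_tokens=16, num_verbs=117, num_objects=80):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2

        # Patch embedding: split the image into local patches and project them.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

        # Extra relationship-carrier tokens instead of a single [CLS] token.
        self.rel_tokens = nn.Parameter(torch.zeros(1, num_rel_tokens, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + num_rel_tokens, dim))

        # Vanilla Transformer encoder (fewest possible revisions to ViT).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Hypothetical per-token heads so each carrier token can be
        # supervised individually (dense loss over all tokens).
        self.verb_head = nn.Linear(dim, num_verbs)
        self.obj_head = nn.Linear(dim, num_objects)
        self.box_head = nn.Linear(dim, 8)  # human box + object box (4 coords each)

    def forward(self, images):
        b = images.size(0)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        rel = self.rel_tokens.expand(b, -1, -1)                        # (B, K, D)
        tokens = torch.cat([rel, patches], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)

        rel_out = tokens[:, : self.rel_tokens.size(1)]  # carrier tokens only
        return {
            "verb_logits": self.verb_head(rel_out),     # (B, K, num_verbs)
            "obj_logits": self.obj_head(rel_out),       # (B, K, num_objects)
            "boxes": self.box_head(rel_out).sigmoid(),  # (B, K, 8), normalized
        }


if __name__ == "__main__":
    model = R3ViTSketch()
    out = model(torch.randn(2, 3, 224, 224))
    print({k: v.shape for k, v in out.items()})

During training, each carrier token's predictions would be matched to a ground-truth human–object pair (for example via bipartite matching, as in DETR-style detectors) so that the loss is computed densely over all tokens; that matching step is omitted from this sketch and is an assumption about how the dense supervision could be realized.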
ISSN: 0952-1976, 1873-6769
DOI: 10.1016/j.engappai.2023.106123