A Transformer-Based Architecture for High-Resolution Stereo Matching
Published in: IEEE Transactions on Computational Imaging, 2024, Vol. 10, pp. 83-92
Main authors: , , ,
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: The Transformer architecture is now widely used due to its superior parallel computing and global modelling capabilities. In this paper, we build a dense Feature Extraction Transformer (FET) for stereo matching, incorporating Transformer and convolution blocks. FET has three advantages for stereo matching: 1) for high-resolution stereo image pairs, Transformer blocks combined with spatial pyramid pooling windows can capture a wide range of contextual representations while maintaining linear computational complexity; 2) convolution and transposed convolution blocks implement overlapping patch embedding, which allows features to capture enough proximity information to facilitate fine-grained matching; 3) FET creatively uses a jump-query strategy to apply the Transformer encoder and decoder structures to feature extraction simultaneously. Furthermore, to obtain an architecture more thoroughly based on the Transformer, we adopt STTR's (Li et al., 2021) attention-based pixel-matching strategy. Our model achieves a 0.32 end-point error and a 0.89% 3-px error on the Scene Flow benchmark (30.95% and 29.36% absolute improvements over STTR), and 1.80 D1-bg on estimated pixels on the KITTI 2015 benchmark (a 1.57-point error reduction compared to STTR).
ISSN: 2573-0436, 2333-9403
DOI: 10.1109/TCI.2024.3350884
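
Since this record carries only the abstract, the sketches below are illustrative rather than taken from the paper. First, a minimal PyTorch sketch of what overlapping patch embedding via convolution and transposed convolution typically looks like; the class names, channel widths, and kernel/stride choices here are assumptions, not FET's actual configuration. The key idea is that a kernel larger than the stride makes neighbouring patches overlap, so each token carries proximity information from adjacent patches.

```python
# Illustrative sketch only (not the authors' code): overlapping patch
# embedding with a strided convolution, and its transposed-convolution
# counterpart for upsampling. Sizes are hypothetical.
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Downsample with kernel_size > stride so neighbouring patches overlap."""
    def __init__(self, in_ch=3, embed_dim=64, kernel_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H/stride, W/stride)
        b, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, h*w, D) token sequence
        return self.norm(tokens), (h, w)

class OverlapPatchExpand(nn.Module):
    """Map tokens back to a dense feature map with a transposed convolution."""
    def __init__(self, embed_dim=64, out_ch=32, kernel_size=4, stride=2):
        super().__init__()
        self.proj = nn.ConvTranspose2d(embed_dim, out_ch,
                                       kernel_size=kernel_size, stride=stride,
                                       padding=kernel_size // 2 - 1)

    def forward(self, tokens, hw):
        h, w = hw
        x = tokens.transpose(1, 2).reshape(tokens.shape[0], -1, h, w)
        return self.proj(x)                     # (B, out_ch, 2h, 2w)

# Usage example with hypothetical sizes:
x = torch.randn(1, 3, 64, 128)
tokens, (h, w) = OverlapPatchEmbed()(x)         # tokens: (1, 512, 64)
dense = OverlapPatchExpand()(tokens, (h, w))    # dense:  (1, 32, 32, 64)
```

Similarly, the reported metrics (end-point error, 3-px error, KITTI D1) follow standard definitions in the stereo literature; the sketch below shows how they are usually computed, again as an assumption rather than the authors' evaluation code.

```python
# Standard stereo metrics over a boolean mask of valid ground-truth pixels.
import torch

def end_point_error(pred, gt, valid):
    """Mean absolute disparity error over valid pixels (Scene Flow EPE)."""
    return (pred - gt).abs()[valid].mean()

def px3_error(pred, gt, valid):
    """Fraction of valid pixels whose disparity error exceeds 3 px."""
    err = (pred - gt).abs()[valid]
    return (err > 3.0).float().mean()

def d1_error(pred, gt, valid):
    """KITTI D1: outliers have error > 3 px AND > 5% of the true disparity."""
    err = (pred - gt).abs()[valid]
    outlier = (err > 3.0) & (err > 0.05 * gt[valid])
    return outlier.float().mean()
```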