ContextMatcher: Detector-Free Feature Matching With Cross-Modality Context

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-09, Vol. 34 (9), pp. 7922-7934
Authors: Li, Dongyue; Du, Songlin
Format: Article
Language: English
Abstract: Existing feature matching methods tend to extract feature descriptors based solely on visual appearance, which leads to matches that are obviously wrong from a geometric perspective. This paper proposes ContextMatcher, which goes beyond visual appearance representation by introducing geometric context to guide feature matching. Specifically, ContextMatcher consists of visual descriptor generation, a neighborhood consensus module, and a geometric context encoder. To learn visual descriptors, Transformers in two parallel branches are leveraged. In one branch, convolutions are integrated into the self-attention layers to compensate for the lack of local structure information. In the other branch, a cross-scale Transformer injects heterogeneous receptive field sizes into tokens. To leverage and aggregate geometric contextual information, a neighborhood consensus mechanism re-ranks initial pixel-level matches, enforcing a geometric consistency constraint on neighboring feature descriptors. Moreover, local feature descriptors are combined with the geometric properties of keypoints to refine matches to the sub-pixel level. Extensive experiments on relative pose estimation and image matching show that the proposed method outperforms existing state-of-the-art methods by a large margin.
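
The neighborhood consensus re-ranking described in the abstract can be illustrated with a short sketch. The code below is not the authors' implementation; the function name neighborhood_consensus_rerank, the radius, and the score weighting are illustrative assumptions. It shows one generic way to re-rank appearance-only matches by rewarding candidates whose spatial neighbours in one image map to nearby locations in the other.

```python
import torch


def neighborhood_consensus_rerank(desc_a, desc_b, coords_a, coords_b, radius=2.0, weight=0.1):
    """Re-rank dense matches by rewarding candidates whose spatial neighbours
    agree on a geometrically consistent correspondence (a generic
    neighbourhood-consensus heuristic; the paper's exact formulation may differ).

    desc_a:   (N, D) L2-normalised descriptors from image A
    desc_b:   (M, D) L2-normalised descriptors from image B
    coords_a: (N, 2) grid coordinates of the descriptors in A
    coords_b: (M, 2) grid coordinates of the descriptors in B
    """
    # Appearance-only similarity (cosine, since descriptors are normalised).
    sim = desc_a @ desc_b.t()                                   # (N, M)
    best_b = sim.argmax(dim=1)                                  # initial match per point in A

    # Spatial neighbours of each point in A within `radius`.
    neigh = torch.cdist(coords_a.float(), coords_a.float()) <= radius   # (N, N) bool

    # Consensus: nearby points in A should match to nearby locations in B.
    matched_b = coords_b[best_b].float()                        # (N, 2)
    close_in_b = torch.cdist(matched_b, matched_b) <= radius    # (N, N) bool
    consensus = (neigh & close_in_b).float().sum(dim=1)         # neighbours that agree

    # Blend appearance similarity and geometric consensus into the final score.
    score = sim.max(dim=1).values + weight * consensus
    return best_b, score


if __name__ == "__main__":
    # Toy usage on random data: 64 descriptors per image on an 8x8 grid.
    torch.manual_seed(0)
    grid = torch.stack(
        torch.meshgrid(torch.arange(8), torch.arange(8), indexing="ij"), dim=-1
    ).reshape(-1, 2)
    da = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
    db = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
    matches, confidence = neighborhood_consensus_rerank(da, db, grid, grid)
    print(matches.shape, confidence.shape)  # torch.Size([64]) torch.Size([64])
```

In this sketch the consensus term simply counts agreeing neighbours; a learned or soft-weighted aggregation would be a natural refinement, but the core idea of constraining matches by their spatial neighbourhood is the same.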
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3383334