Window-to-Window BEV Representation Learning for Limited FoV Cross-View Geo-localization

Cross-view geo-localization confronts significant challenges due to large perspective changes, especially when the ground-view query image has a limited field of view (FoV) with unknown orientation. To bridge the cross-view domain gap, we explore, for the first time, learning a BEV representation directly from the ground query image.

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Cheng, Lei; Wang, Teng; Meng, Lingquan; Sun, Changyin
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Cheng, Lei; Wang, Teng; Meng, Lingquan; Sun, Changyin
description Cross-view geo-localization confronts significant challenges due to large perspective changes, especially when the ground-view query image has a limited field of view (FoV) with unknown orientation. To bridge the cross-view domain gap, we explore, for the first time, learning a BEV representation directly from the ground query image. However, the unknown orientation between ground and aerial images, combined with the absence of camera parameters, leads to ambiguity between BEV queries and ground references. To tackle this challenge, we propose a novel Window-to-Window BEV representation learning method, termed W2W-BEV, which adaptively matches BEV queries to ground references at the window scale. Specifically, predefined BEV embeddings and extracted ground features are segmented into a fixed number of windows, and the most similar ground window is then chosen for each BEV window based on a context-aware window matching strategy. Cross-attention is subsequently performed between the matched BEV and ground windows to learn a robust BEV representation. Additionally, we use ground features along with predicted depth information to initialize the BEV embeddings, helping to learn more powerful BEV representations. Extensive experimental results on benchmark datasets demonstrate the significant superiority of our W2W-BEV over previous state-of-the-art methods under the challenging conditions of unknown orientation and limited FoV. Specifically, on the CVUSA dataset with a limited FoV of 90 degrees and unknown orientation, W2W-BEV achieves a significant improvement in R@1 accuracy from 47.24% to 64.73% (+17.49%).
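To make the window-to-window mechanism in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the function name, tensor shapes, and the mean-pooled window descriptors are all assumptions, and the paper's actual context-aware matching strategy may differ.

# Hypothetical sketch of window-to-window BEV/ground matching with
# cross-attention; names and shapes are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def w2w_cross_attention(bev, ground, num_windows, attn):
    """bev:    (B, Nb, C) predefined BEV query embeddings
    ground: (B, Ng, C) extracted ground-view features
    attn:   torch.nn.MultiheadAttention(C, num_heads, batch_first=True)
    Nb and Ng must be divisible by num_windows."""
    B, Nb, C = bev.shape
    Ng = ground.shape[1]
    bev_w = bev.view(B, num_windows, Nb // num_windows, C)
    gnd_w = ground.view(B, num_windows, Ng // num_windows, C)

    # Window descriptors; mean pooling stands in for the paper's
    # context-aware descriptor.
    bev_desc = F.normalize(bev_w.mean(dim=2), dim=-1)  # (B, W, C)
    gnd_desc = F.normalize(gnd_w.mean(dim=2), dim=-1)  # (B, W, C)

    # Cosine similarity between every BEV window and every ground window;
    # each BEV window picks its single most similar ground window.
    sim = torch.einsum("bwc,bvc->bwv", bev_desc, gnd_desc)
    match = sim.argmax(dim=-1)  # (B, W)

    out = torch.empty_like(bev_w)
    for b in range(B):
        for w in range(num_windows):
            q = bev_w[b, w].unsqueeze(0)             # (1, Nb/W, C)
            kv = gnd_w[b, match[b, w]].unsqueeze(0)  # (1, Ng/W, C)
            o, _ = attn(q, kv, kv)                   # cross-attention
            out[b, w] = o.squeeze(0)
    return out.view(B, Nb, C)

# Example usage with toy sizes:
attn = torch.nn.MultiheadAttention(64, num_heads=4, batch_first=True)
bev = torch.randn(2, 256, 64)     # e.g. a flattened 16x16 BEV grid
ground = torch.randn(2, 128, 64)  # ground-view feature tokens
refined = w2w_cross_attention(bev, ground, num_windows=8, attn=attn)
print(refined.shape)  # torch.Size([2, 256, 64])

The point the sketch illustrates is that each BEV window attends only to its one matched ground window rather than to the full ground feature map, which is what lets the matching adapt when the relative orientation between the views is unknown.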
format Article
identifier DOI: 10.48550/arxiv.2407.06861
language eng
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Window-to-Window BEV Representation Learning for Limited FoV Cross-View Geo-localization