PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion

Visual place recognition is a challenging task in the field of computer vision, and autonomous robotics and vehicles, which aims to identify a location or a place from visual inputs. Contemporary methods in visual place recognition employ convolutional neural networks and utilize every region within the image for the place recognition task. However, the presence of dynamic and distracting elements in the image may impact the effectiveness of the place recognition process. Therefore, it is meaningful to focus on task-relevant regions of the image for improved recognition. In this paper, we present PlaceFormer, a novel transformer-based approach for visual place recognition. PlaceFormer employs patch tokens from the transformer to create global image descriptors, which are then used for image retrieval. To re-rank the retrieved images, PlaceFormer merges the patch tokens from the transformer to form multi-scale patches. Utilizing the transformer's self-attention mechanism, it selects patches that correspond to task-relevant areas in an image. These selected patches undergo geometric verification, generating similarity scores across different patch sizes. Subsequently, spatial scores from each patch size are fused to produce a final similarity score. This score is then used to re-rank the images initially retrieved using global image descriptors. Extensive experiments on benchmark datasets demonstrate that PlaceFormer outperforms several state-of-the-art methods in terms of accuracy and computational efficiency, requiring less time and memory.
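The multi-scale patch step described in the abstract (merging transformer patch tokens into larger patches, then keeping only the most attended ones) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names, the average-pooling merge, and the top-k attention selection are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def merge_patch_tokens(tokens, grid, scale):
    """Average-pool a (grid*grid, d) token array into coarser patches.

    tokens: per-patch embeddings from a ViT, row-major over a grid x grid layout.
    scale:  side length (in original patches) of each merged patch.
    """
    d = tokens.shape[1]
    g = grid // scale
    t = tokens.reshape(grid, grid, d)[:g * scale, :g * scale]
    # group each scale x scale neighborhood and average it into one token
    t = t.reshape(g, scale, g, scale, d)
    return t.mean(axis=(1, 3)).reshape(g * g, d)

def select_patches(merged, attention, k):
    """Keep the k merged patches carrying the highest attention mass."""
    idx = np.argsort(attention)[::-1][:k]
    return idx, merged[idx]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16 * 16, 64))            # 16x16 grid of 64-d tokens
coarse = merge_patch_tokens(tokens, grid=16, scale=2)  # -> 8x8 = 64 merged patches
attn = rng.random(coarse.shape[0])                     # stand-in for pooled self-attention
idx, kept = select_patches(coarse, attn, k=10)
```

In the actual method the attention weights would come from the transformer's self-attention maps rather than random numbers; the sketch only shows the merge-then-select control flow.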

Bibliographic Details

Main Authors: Kannan, Shyam Sundar; Min, Byung-Cheol
Format: Article
Language: English
Online Access: Order full text
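The re-ranking stage in the abstract fuses the spatial similarity scores obtained at each patch size into one final score per candidate image. A minimal sketch of that fusion, assuming a simple weighted sum over scales (the weighting scheme is an assumption, not the paper's stated formula):

```python
import numpy as np

def fuse_and_rerank(candidate_ids, scores_per_scale, weights=None):
    """Fuse per-scale spatial scores and re-rank retrieved candidates.

    candidate_ids:    list of the images initially retrieved by global descriptors.
    scores_per_scale: (num_scales, num_candidates) array of spatial scores
                      from geometric verification at each patch size.
    """
    s = np.asarray(scores_per_scale, dtype=float)
    w = np.ones(s.shape[0]) if weights is None else np.asarray(weights, dtype=float)
    fused = (w[:, None] * s).sum(axis=0)       # one fused score per candidate
    order = np.argsort(fused)[::-1]            # best candidate first
    return [candidate_ids[i] for i in order], fused[order]

ids = ["imgA", "imgB", "imgC"]
scores = [[0.2, 0.9, 0.5],   # patch size 1
          [0.1, 0.7, 0.8]]   # patch size 2
ranked, fused = fuse_and_rerank(ids, scores)
# ranked -> ["imgB", "imgC", "imgA"]; fused -> [1.6, 1.3, 0.3]
```

The final ordering replaces the initial global-descriptor ranking for the top retrieved images, which is the re-ranking role the abstract describes.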
DOI: 10.48550/arxiv.2401.13082
Published: 2024-01-23 (arXiv preprint)
Rights: http://creativecommons.org/licenses/by/4.0
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Robotics