Unified Object Detector for Different Modalities based on Vision Transformers

Traditional systems typically require different models for processing different modalities, such as one model for RGB images and another for depth images. Recent research has demonstrated that a single model for one modality can be adapted for another using cross-modality transfer learning. In this...

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Shen, Xiaoke, Stamos, Ioannis
Format: Article
Language: eng
Subjects:
Online Access: Request full text
creator Shen, Xiaoke ; Stamos, Ioannis
description Traditional systems typically require different models for processing different modalities, such as one model for RGB images and another for depth images. Recent research has demonstrated that a single model for one modality can be adapted for another using cross-modality transfer learning. In this paper, we extend this approach by combining cross/inter-modality transfer learning with a vision transformer to develop a unified detector that achieves superior performance across diverse modalities. Our research envisions an application scenario for robotics, where the unified system seamlessly switches between RGB cameras and depth sensors in varying lighting conditions. Importantly, the system requires no model architecture or weight updates to enable this smooth transition. Specifically, the system uses the depth sensor during low-light conditions (night time) and either both the RGB camera and depth sensor or the RGB camera alone in well-lit environments. We evaluate our unified model on the SUN RGB-D dataset and demonstrate that it achieves similar or better performance in terms of mAP50 compared to state-of-the-art methods in the SUNRGBD16 category, and comparable performance in point-cloud-only mode. We also introduce a novel inter-modality mixing method that enables our model to achieve significantly better results than previous methods. We provide our code, including training/inference logs and model checkpoints, to facilitate reproducibility and further research. \url{https://github.com/liketheflower/UODDM}
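As an illustration of the deployment scenario described in the abstract (a single set of weights serving depth-only, RGB-only, and RGB+depth inputs depending on lighting), a minimal Python sketch follows. The class and function names are hypothetical placeholders and do not reflect the actual UODDM code or API.

    # Minimal sketch of the modality-switching scenario from the abstract.
    # All names (UnifiedDetector, detect, run_frame) are illustrative
    # placeholders, not the real UODDM interface.

    class UnifiedDetector:
        """One ViT-based detector with a single set of weights for all modalities."""

        def __init__(self, checkpoint_path):
            # The same checkpoint is used regardless of which sensors are active;
            # no architecture or weight changes are needed when modalities switch.
            self.checkpoint_path = checkpoint_path

        def detect(self, rgb=None, depth=None):
            # In the real system the transformer backbone would process whichever
            # inputs are present; here we only stub out the interface.
            assert rgb is not None or depth is not None, "need at least one modality"
            return []  # placeholder for detected boxes

    def run_frame(detector, rgb, depth, is_low_light):
        """Route sensor data to the same detector based on lighting conditions:
        depth sensor only at night, RGB (optionally with depth) when well lit."""
        if is_low_light:
            return detector.detect(depth=depth)
        return detector.detect(rgb=rgb, depth=depth)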
doi_str_mv 10.48550/arxiv.2207.01071
format Article
identifier DOI: 10.48550/arxiv.2207.01071
language eng
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Unified Object Detector for Different Modalities based on Vision Transformers