Unified Object Detector for Different Modalities based on Vision Transformers
Traditional systems typically require different models for processing different modalities, such as one model for RGB images and another for depth images. Recent research has demonstrated that a single model for one modality can be adapted for another using cross-modality transfer learning. In this...
Saved in:
Main Authors: | Shen, Xiaoke ; Stamos, Ioannis |
---|---|
Format: | Article |
Language: | English |
Subjects: | Computer Science - Computer Vision and Pattern Recognition |
Online Access: | Order full text |
creator | Shen, Xiaoke ; Stamos, Ioannis |
---|---|
description | Traditional systems typically require different models for processing
different modalities, such as one model for RGB images and another for depth
images. Recent research has demonstrated that a single model for one modality
can be adapted for another using cross-modality transfer learning. In this
paper, we extend this approach by combining cross/inter-modality transfer
learning with a vision transformer to develop a unified detector that achieves
superior performance across diverse modalities. Our research envisions an
application scenario for robotics, where the unified system seamlessly switches
between RGB cameras and depth sensors in varying lighting conditions.
Importantly, the system requires no model architecture or weight updates to
enable this smooth transition. Specifically, the system uses the depth sensor
during low-lighting conditions (night time) and both the RGB camera and depth
sensor or RGB camera only in well-lit environments. We evaluate our unified
model on the SUN RGB-D dataset, and demonstrate that it achieves similar or
better performance in terms of mAP50 compared to state-of-the-art methods in
the SUNRGBD16 category, and comparable performance in point cloud only mode. We
also introduce a novel inter-modality mixing method that enables our model to
achieve significantly better results than previous methods. We provide our
code, including training/inference logs and model checkpoints, to facilitate
reproducibility and further research.
\url{https://github.com/liketheflower/UODDM} |
doi_str_mv | 10.48550/arxiv.2207.01071 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2207.01071 |
language | eng |
recordid | cdi_arxiv_primary_2207_01071 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | Unified Object Detector for Different Modalities based on Vision Transformers |
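The abstract above describes a deployment scenario in which a single set of detector weights serves RGB, depth, or both, with the active sensors chosen by lighting conditions and no architecture or weight updates on switch. The snippet below is a minimal, purely illustrative sketch of that switching logic; the names `UnifiedDetector`, `estimate_brightness`, and `run_frame`, as well as the brightness threshold, are assumptions of this sketch and are not taken from the UODDM code or the paper.

```python
# Illustrative sketch only: mirrors the sensor-switching scenario from the abstract.
# The classes/functions here are hypothetical stand-ins, not the UODDM API.
import numpy as np


class UnifiedDetector:
    """Stand-in for one detector whose weights never change across modalities."""

    def detect(self, rgb=None, depth=None):
        # A real model would run a vision transformer here; this stub only
        # reports which modalities were supplied, to show the routing.
        used = [name for name, x in (("rgb", rgb), ("depth", depth)) if x is not None]
        return {"modalities": used, "boxes": []}


def estimate_brightness(rgb_frame):
    """Very rough proxy for scene illumination: mean pixel intensity in [0, 1]."""
    return float(np.mean(rgb_frame) / 255.0)


def run_frame(detector, rgb_frame, depth_frame, dark_threshold=0.15):
    """Route sensors to the same detector based on lighting, with no weight updates."""
    if estimate_brightness(rgb_frame) < dark_threshold:
        # Low light (e.g. night time): rely on the depth sensor only.
        return detector.detect(depth=depth_frame)
    # Well lit: use RGB alone or RGB plus depth, as the abstract describes.
    return detector.detect(rgb=rgb_frame, depth=depth_frame)


if __name__ == "__main__":
    detector = UnifiedDetector()
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)   # dark frame
    depth = np.ones((480, 640), dtype=np.float32)
    print(run_frame(detector, rgb, depth))           # depth only
    print(run_frame(detector, rgb + 200, depth))     # rgb + depth
```

The point of the sketch is that the routing decision lives entirely outside the model, so the same weights are reused whichever sensor mix is available, which is the smooth transition the abstract claims.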