Depth-Relative Self Attention for Monocular Depth Estimation

Monocular depth estimation is very challenging because clues to the exact depth are incomplete in a single RGB image. To overcome this limitation, deep neural networks rely on various visual hints such as size, shade, and texture extracted from RGB information. However, we observe that if such hints are overly exploited, the network can become biased toward RGB information without considering the comprehensive view.

Detailed description

Bibliographic details
Main authors: Shim, Kyuhong, Kim, Jiyoung, Lee, Gusang, Shim, Byonghyo
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
creator Shim, Kyuhong; Kim, Jiyoung; Lee, Gusang; Shim, Byonghyo
description Monocular depth estimation is very challenging because clues to the exact depth are incomplete in a single RGB image. To overcome this limitation, deep neural networks rely on various visual hints such as size, shade, and texture extracted from RGB information. However, we observe that if such hints are overly exploited, the network can become biased toward RGB information without considering the comprehensive view. We propose a novel depth estimation model named RElative Depth Transformer (RED-T) that uses relative depth as guidance in self-attention. Specifically, the model assigns high attention weights to pixels of close depth and low attention weights to pixels of distant depth. As a result, the features of pixels at similar depth become more similar to each other and are thus less prone to misused visual hints. We show that the proposed model achieves competitive results on monocular depth estimation benchmarks and is less biased toward RGB information. In addition, we propose a novel monocular depth estimation benchmark that limits the observable depth range during training in order to evaluate the robustness of the model to unseen depths.
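The description above states how RED-T guides self-attention with relative depth: pixels at close depth receive high attention weights and pixels at distant depth receive low weights. The following is a minimal, hypothetical PyTorch sketch of one way to realize that idea, written under the assumption that the guidance is an additive bias on the attention logits proportional to the negative pairwise depth difference. The class name DepthRelativeSelfAttention, the bias form, and the learnable per-head scale depth_scale are illustrative assumptions, not the authors' RED-T implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthRelativeSelfAttention(nn.Module):
    """Illustrative sketch: self-attention biased by pairwise relative-depth difference."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head strength of the depth bias (an assumption, not from the paper).
        self.depth_scale = nn.Parameter(torch.ones(num_heads))

    def forward(self, x, rel_depth):
        # x: (B, N, C) pixel/patch tokens; rel_depth: (B, N) relative depth per token.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                        # each (B, heads, N, head_dim)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, heads, N, N)
        # Pairwise |depth_i - depth_j|: small difference -> small penalty -> high attention weight.
        ddiff = (rel_depth[:, :, None] - rel_depth[:, None, :]).abs()   # (B, N, N)
        bias = -self.depth_scale.view(1, -1, 1, 1) * ddiff[:, None]     # (B, heads, N, N)
        attn = F.softmax(logits + bias, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

Example usage, assuming a 14x14 patch grid flattened to 196 tokens:

x = torch.randn(2, 196, 256)
rel_depth = torch.rand(2, 196)                           # e.g. a coarse relative-depth prediction
y = DepthRelativeSelfAttention(dim=256)(x, rel_depth)    # (2, 196, 256)

A pair of tokens with nearly equal relative depth incurs almost no penalty and so keeps a comparatively high attention weight, while a pair with a large depth gap is suppressed, mirroring the behavior described in the abstract.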
doi_str_mv 10.48550/arxiv.2304.12849
format Article
creationdate 2023-04-25
rights http://creativecommons.org/licenses/by-nc-nd/4.0
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2304.12849
language eng
recordid cdi_arxiv_primary_2304_12849
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Depth-Relative Self Attention for Monocular Depth Estimation
url https://arxiv.org/abs/2304.12849