Self-Supervised Monocular Depth Estimation Using Hybrid Transformer Encoder

Depth estimation using monocular camera sensors is an important technique in computer vision. Supervised monocular depth estimation requires a lot of data acquired from depth sensors. However, acquiring depth data is an expensive task. We sometimes cannot acquire data due to the limitations of the sensor.

Detailed Description

Bibliographic Details
Published in: IEEE sensors journal, 2022-10, Vol. 22 (19), p. 18762-18770
Main Authors: Hwang, Seung-Jun, Park, Sung-Jun, Baek, Joong-Hwan, Kim, Byungkyu
Format: Article
Language: eng
Subjects:
Online Access: Full text
container_end_page 18770
container_issue 19
container_start_page 18762
container_title IEEE sensors journal
container_volume 22
creator Hwang, Seung-Jun
Park, Sung-Jun
Baek, Joong-Hwan
Kim, Byungkyu
description Depth estimation using monocular camera sensors is an important technique in computer vision. Supervised monocular depth estimation requires a lot of data acquired from depth sensors. However, acquiring depth data is an expensive task. We sometimes cannot acquire data due to the limitations of the sensor. View synthesis-based depth estimation research is a self-supervised learning method that does not require depth data supervision. Previous studies mainly use the convolutional neural network (CNN)-based networks in encoders. The CNN is suitable for extracting local features through convolution operation. Recent vision transformers (ViTs) are suitable for global feature extraction based on multiself-attention modules. In this article, we propose a hybrid network combining the CNN and ViT networks in self-supervised learning-based monocular depth estimation. We design an encoder-decoder structure that uses CNNs in the earlier stage of extracting local features and a ViT in the later stages of extracting global features. We evaluate the proposed network through various experiments based on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) and Cityscapes datasets. The results showed higher performance than previous studies and reduced parameters and computations. Codes and trained models are available at https://github.com/fogfog2/manydepthformer .
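The description outlines the architecture but not its implementation. The sketch below is a minimal PyTorch illustration of the stated idea (CNN stages first for local features, a transformer encoder afterwards for global context, then a decoder that predicts disparity); the layer sizes, module names, token handling, and single-scale decoder are my own placeholder assumptions, not the authors' design, whose code is in the linked manydepthformer repository.

```python
# Minimal sketch of a hybrid CNN + ViT encoder-decoder for monocular depth.
# All channel sizes and the simple decoder are illustrative assumptions.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch, stride):
    """3x3 conv + BN + ReLU: local feature extraction in the early stages."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class HybridDepthNet(nn.Module):
    def __init__(self, embed_dim=192, depth=4, num_heads=3):
        super().__init__()
        # Early stages: CNN extracts local features and downsamples by 8x.
        self.cnn = nn.Sequential(
            conv_block(3, 32, stride=2),
            conv_block(32, 64, stride=2),
            conv_block(64, embed_dim, stride=2),
        )
        # Later stage: transformer encoder applies multi-head self-attention
        # over the flattened CNN feature map to aggregate global context.
        # (Positional encoding is omitted here for brevity.)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True,
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Simple decoder: upsample back to input resolution and predict disparity.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv_block(embed_dim, 64, stride=1),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv_block(64, 32, stride=1),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),  # disparity in (0, 1); converted to depth downstream
        )

    def forward(self, x):
        feat = self.cnn(x)                        # (B, C, H/8, W/8)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W/64, C)
        tokens = self.transformer(tokens)         # global self-attention
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(feat)                 # (B, 1, H, W) disparity map


if __name__ == "__main__":
    model = HybridDepthNet()
    disp = model(torch.randn(1, 3, 192, 640))     # KITTI-style input resolution
    print(disp.shape)                             # torch.Size([1, 1, 192, 640])
```

In the self-supervised setting described above, such a disparity output would be trained without depth labels, typically by pairing it with a pose network and minimizing a view-synthesis (photometric reprojection) loss between the target frame and warped neighboring frames.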
doi_str_mv 10.1109/JSEN.2022.3199265
format Article
fulltext fulltext
identifier ISSN: 1530-437X
ispartof IEEE sensors journal, 2022-10, Vol.22 (19), p.18762-18770
issn 1530-437X
1558-1748
language eng
recordid cdi_ieee_primary_9864127
source IEEE Electronic Library (IEL)
subjects Artificial neural networks
Cameras
Coders
Computational modeling
Computer vision
Costs
Data acquisition
Depth estimation
Estimation
Feature extraction
Image reconstruction
monocular sensor estimation
self-attention
self-supervised
Sensors
Supervised learning
transformer
Transformers
title Self-Supervised Monocular Depth Estimation Using Hybrid Transformer Encoder
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T19%3A01%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Self-Supervised%20Monocular%20Depth%20Estimation%20Using%20Hybrid%20Transformer%20Encoder&rft.jtitle=IEEE%20sensors%20journal&rft.au=Hwang,%20Seung-Jun&rft.date=2022-10-01&rft.volume=22&rft.issue=19&rft.spage=18762&rft.epage=18770&rft.pages=18762-18770&rft.issn=1530-437X&rft.eissn=1558-1748&rft.coden=ISJEAZ&rft_id=info:doi/10.1109/JSEN.2022.3199265&rft_dat=%3Cproquest_ieee_%3E2719554196%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2719554196&rft_id=info:pmid/&rft_ieee_id=9864127&rfr_iscdi=true