3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose from Monocular Video

Depth and ego-motion estimation are essential for the localization and navigation of autonomous robots and for autonomous driving. Recent studies have made it possible to learn per-pixel depth and ego-motion from unlabeled monocular video. In this paper, a novel unsupervised training framework is proposed with 3D hierarchical refinement and augmentation using explicit 3D geometry. In this framework, the depth and pose estimates are hierarchically and mutually coupled to refine the estimated pose layer by layer. An intermediate view image is synthesized by warping the pixels of an image with the estimated depth and a coarse pose. The residual pose transformation is then estimated from the synthesized view and the image of the adjacent frame to refine the coarse pose. The iterative refinement is implemented in a differentiable manner, so the whole framework can be optimized uniformly. In addition, a new image augmentation method is proposed for pose estimation: by synthesizing a new view image, the pose is augmented in 3D space while yielding a new augmented 2D image. Experiments on KITTI demonstrate that the depth estimation achieves state-of-the-art performance and even surpasses recent approaches that utilize other auxiliary tasks. The visual odometry outperforms all recent unsupervised monocular learning-based methods and achieves performance competitive with the geometry-based method, ORB-SLAM2 with back-end optimization. The source code will be released soon at: https://github.com/IRMVLab/HRANet.
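Below is a minimal, self-contained sketch (in PyTorch-style Python) of the two differentiable operations the abstract describes: synthesizing an intermediate view by warping an image with an estimated depth map and a coarse pose, and iteratively refining that pose with residual estimates. It is an illustration under stated assumptions, not the authors' HRANet implementation: the pinhole projection model, the 4x4 homogeneous pose representation, and the `pose_net` callable are all hypothetical.

import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    # Lift every pixel (u, v) to a 3D point in camera coordinates:
    # X = D(u, v) * K^-1 * [u, v, 1]^T.  depth: (B, 1, H, W), K_inv: (B, 3, 3).
    b, _, h, w = depth.shape
    v, u = torch.meshgrid(
        torch.arange(h, device=depth.device, dtype=depth.dtype),
        torch.arange(w, device=depth.device, dtype=depth.dtype),
        indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(1, 3, -1)
    return depth.reshape(b, 1, -1) * (K_inv @ pix)        # (B, 3, H*W)

def synthesize_view(src_img, depth_tgt, T, K, K_inv):
    # Differentiable inverse warping: move target pixels through the rigid
    # motion T (B, 4, 4), project them, and bilinearly sample the source image.
    b, _, h, w = src_img.shape
    cam = backproject(depth_tgt, K_inv)                   # (B, 3, H*W)
    cam = T[:, :3, :3] @ cam + T[:, :3, 3:4]              # rotation + translation
    pix = K @ cam
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)        # perspective divide
    gx = 2.0 * pix[:, 0] / (w - 1) - 1.0                  # normalize to [-1, 1]
    gy = 2.0 * pix[:, 1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(b, h, w, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)

def refine_pose(img_tgt, img_src, depth_tgt, T_coarse, K, K_inv, pose_net, n_iters=3):
    # Iterative residual refinement: synthesize the intermediate view implied
    # by the current pose, let a (hypothetical) pose network estimate the
    # residual motion against the real target frame, and compose it onto the
    # running estimate. All steps are differentiable, so gradients flow
    # through every iteration and the iterations can be trained jointly.
    T = T_coarse
    for _ in range(n_iters):
        img_inter = synthesize_view(img_src, depth_tgt, T, K, K_inv)
        T_res = pose_net(img_inter, img_tgt)              # assumed to return (B, 4, 4)
        T = T_res @ T
    return T

Because the warp is differentiable, this is consistent with the abstract's claim that the whole framework is "optimized uniformly". The same synthesize_view function also illustrates the proposed 3D augmentation: composing a random rigid perturbation into T renders a new 2D image corresponding to an augmented pose.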

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2023-04, Vol. 33 (4), pp. 1-1
Authors: Wang, Guangming; Zhong, Jiquan; Zhao, Shijie; Wu, Wenhua; Liu, Zhe; Wang, Hesheng
Format: Article
Language: English
Subjects: 3D augmentation; Adaptive optics; Autonomous navigation; Image reconstruction; Monocular depth estimation; Optical imaging; Optical variables control; Optimization; Pixels; Pose estimation; pose refinement; Synthesis; Three-dimensional displays; Training; Unsupervised learning; visual odometry; Visual tasks
Online access: Order full text
DOI: 10.1109/TCSVT.2022.3215587
ISSN: 1051-8215
EISSN: 1558-2205
CODEN: ITCTEM
Publisher: New York: IEEE
Source: IEEE Electronic Library (IEL)