DepthART: Monocular Depth Estimation as Autoregressive Refinement Task
Despite the recent success of discriminative approaches in monocular depth estimation, their quality remains limited by training datasets. Generative approaches mitigate this issue by leveraging strong priors derived from training on internet-scale datasets. Recent studies have demonstrated that large text-to-image diffusion models achieve state-of-the-art results in depth estimation when fine-tuned on small depth datasets. Concurrently, autoregressive generative approaches, such as Visual AutoRegressive modeling (VAR), have shown promising results in conditioned image synthesis. Following the visual autoregressive modeling paradigm, we introduce the first autoregressive depth estimation model based on the visual autoregressive transformer. Our primary contribution is DepthART -- a novel training method formulated as a Depth Autoregressive Refinement Task. Unlike the original VAR training procedure, which employs static targets, our method uses a dynamic target formulation that enables model self-refinement and incorporates multi-modal guidance during training. Specifically, we use model predictions as inputs instead of ground-truth token maps during training, framing the objective as residual minimization. Our experiments demonstrate that the proposed training approach significantly outperforms visual autoregressive modeling via next-scale prediction on the depth estimation task. The visual autoregressive transformer trained with our approach on Hypersim achieves superior results on a set of unseen benchmarks compared to other generative and discriminative baselines.
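The abstract above outlines the core training idea: at each scale the model is fed its own predicted token maps rather than the ground-truth ones, and the target is the quantized residual between the ground-truth depth latent and the model's accumulated prediction. Below is a minimal PyTorch-style sketch of one such training step, written only to illustrate that idea; `var_model`, `quantizer`, `quantizer.lookup`, and `depthart_training_step` are hypothetical interfaces assumed for this sketch, not the authors' released code, and details such as the multi-modal guidance mentioned in the abstract are omitted.

```python
import torch
import torch.nn.functional as F


def depthart_training_step(var_model, quantizer, image_cond, depth_latent, scales):
    """One DepthART-style training step (illustrative sketch).

    Hypothetical interfaces assumed here:
      * quantizer(residual, size) -> (token_ids, _): token ids of the residual
        feature map quantized at spatial resolution size x size.
      * quantizer.lookup(token_ids, size) -> codebook feature map (B, C, size, size).
      * var_model(prev_maps, image_cond, k) -> logits (B, size*size, V) for the
        k-th scale, conditioned on the input image features.
    depth_latent is the (B, C, H, W) latent of the ground-truth depth map.
    """
    full_size = depth_latent.shape[-2:]
    f_hat = torch.zeros_like(depth_latent)  # model's accumulated reconstruction
    prev_maps = []                          # predicted token maps fed back as input
    loss = 0.0

    for k, size in enumerate(scales):
        # Dynamic target: quantize the residual between the ground-truth latent
        # and the model's own prediction so far (not a fixed, precomputed target).
        residual = depth_latent - f_hat
        target_ids, _ = quantizer(residual, size)             # (B, size*size)

        # Predict the next-scale token map from previously *predicted* maps.
        logits = var_model(prev_maps, image_cond, k)           # (B, size*size, V)
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.shape[-1]), target_ids.reshape(-1)
        )

        # Feed the model's own prediction back, instead of the ground-truth tokens.
        pred_ids = logits.argmax(dim=-1)
        pred_feats = quantizer.lookup(pred_ids, size)          # (B, C, size, size)
        f_hat = f_hat + F.interpolate(pred_feats, size=full_size, mode="bicubic")
        prev_maps.append(pred_ids)

    return loss / len(scales)
```

In this reading, training mirrors inference: the model keeps refining `f_hat` from its own predictions, which is what distinguishes the dynamic-target formulation from standard VAR teacher forcing on static ground-truth token maps.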
Saved in:
Published in: | arXiv.org, 2024-10 |
---|---|
Main authors: | Gabdullin, Bulat; Konovalova, Nina; Patakin, Nikolay; Senushkin, Dmitry; Konushin, Anton |
Format: | Article |
Language: | English |
Keywords: | Datasets; Image quality; Modelling; State-of-the-art reviews; Transformers; Visual discrimination; Visual tasks |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Gabdullin, Bulat; Konovalova, Nina; Patakin, Nikolay; Senushkin, Dmitry; Konushin, Anton |
description | Despite the recent success of discriminative approaches in monocular depth estimation, their quality remains limited by training datasets. Generative approaches mitigate this issue by leveraging strong priors derived from training on internet-scale datasets. Recent studies have demonstrated that large text-to-image diffusion models achieve state-of-the-art results in depth estimation when fine-tuned on small depth datasets. Concurrently, autoregressive generative approaches, such as Visual AutoRegressive modeling (VAR), have shown promising results in conditioned image synthesis. Following the visual autoregressive modeling paradigm, we introduce the first autoregressive depth estimation model based on the visual autoregressive transformer. Our primary contribution is DepthART -- a novel training method formulated as a Depth Autoregressive Refinement Task. Unlike the original VAR training procedure, which employs static targets, our method uses a dynamic target formulation that enables model self-refinement and incorporates multi-modal guidance during training. Specifically, we use model predictions as inputs instead of ground-truth token maps during training, framing the objective as residual minimization. Our experiments demonstrate that the proposed training approach significantly outperforms visual autoregressive modeling via next-scale prediction on the depth estimation task. The visual autoregressive transformer trained with our approach on Hypersim achieves superior results on a set of unseen benchmarks compared to other generative and discriminative baselines. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-10 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3108867451 |
source | Freely Accessible Journals |
subjects | Datasets; Image quality; Modelling; State-of-the-art reviews; Transformers; Visual discrimination; Visual tasks |
title | DepthART: Monocular Depth Estimation as Autoregressive Refinement Task |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-02T17%3A31%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=DepthART:%20Monocular%20Depth%20Estimation%20as%20Autoregressive%20Refinement%20Task&rft.jtitle=arXiv.org&rft.au=Bulat%20Gabdullin&rft.date=2024-10-25&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3108867451%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3108867451&rft_id=info:pmid/&rfr_iscdi=true |