DepthART: Monocular Depth Estimation as Autoregressive Refinement Task
Format: Article
Language: English
Abstract: Despite the recent success of discriminative approaches to monocular depth
estimation, their quality remains limited by the training datasets. Generative
approaches mitigate this issue by leveraging strong priors derived from
training on internet-scale datasets. Recent studies have demonstrated that
large text-to-image diffusion models achieve state-of-the-art results in depth
estimation when fine-tuned on small depth datasets. Concurrently,
autoregressive generative approaches, such as Visual AutoRegressive
modeling (VAR), have shown promising results in conditional image synthesis.
Following the visual autoregressive modeling paradigm, we introduce the first
autoregressive depth estimation model based on the visual autoregressive
transformer. Our primary contribution is DepthART -- a novel training method
formulated as a Depth Autoregressive Refinement Task. Unlike the original VAR
training procedure, which employs static targets, our method uses a dynamic
target formulation that enables model self-refinement and incorporates
multi-modal guidance during training. Specifically, we use model predictions as
inputs instead of ground-truth token maps during training, framing the
objective as residual minimization. Our experiments demonstrate that the
proposed training approach significantly outperforms visual autoregressive
modeling via next-scale prediction on the depth estimation task. The visual
autoregressive transformer trained with our approach on Hypersim achieves
superior results on a set of unseen benchmarks compared to other generative and
discriminative baselines.
DOI: 10.48550/arxiv.2409.15010
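
The abstract describes a training loop in which each scale is conditioned on the model's own accumulated prediction rather than on teacher-forced ground-truth token maps, with the target at each scale being the quantized residual that remains unexplained. Below is a minimal toy sketch of that idea under stated assumptions; `ToyQuantizer`, `ToyVAR`, and `depthart_step` are illustrative placeholders, not the authors' implementation or API.

```python
# Toy sketch of a DepthART-style training step (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyQuantizer(nn.Module):
    """Stand-in for a VQ codebook: latents <-> nearest-code token indices."""
    def __init__(self, codebook_size=64, dim=4):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))

    def quantize(self, z):                        # (B, C, s, s) -> (B, s, s)
        flat = z.permute(0, 2, 3, 1)
        dist = torch.cdist(flat.reshape(-1, flat.shape[-1]), self.codebook)
        return dist.argmin(-1).view(flat.shape[:-1])

    def dequantize(self, idx):                    # (B, s, s) -> (B, C, s, s)
        return self.codebook[idx].permute(0, 3, 1, 2)

class ToyVAR(nn.Module):
    """Stand-in for the visual autoregressive transformer."""
    def __init__(self, dim=4, codebook_size=64):
        super().__init__()
        self.head = nn.Conv2d(2 * dim, codebook_size, kernel_size=1)

    def forward(self, cond, accumulated, scale):
        x = torch.cat([cond, accumulated], dim=1)        # image + self-prediction
        x = F.interpolate(x, size=(scale, scale), mode="area")
        return self.head(x).permute(0, 2, 3, 1)          # (B, s, s, K) logits

def depthart_step(model, quantizer, image_cond, depth_latent,
                  scales=(1, 2, 4, 8)):
    """One refinement-style training step: residual minimization across scales."""
    B, C, H, W = depth_latent.shape
    accumulated = torch.zeros_like(depth_latent)   # model's running prediction
    loss = 0.0
    for s in scales:
        # Dynamic target: quantize what the model has NOT yet explained.
        residual = F.interpolate(depth_latent - accumulated, size=(s, s),
                                 mode="area")
        target = quantizer.quantize(residual)
        # Condition on the model's own predictions (no teacher forcing).
        logits = model(image_cond, accumulated, scale=s)
        loss = loss + F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                                      target.reshape(-1))
        # Self-refinement: decode predicted tokens, upsample, accumulate.
        pred = quantizer.dequantize(logits.argmax(-1)).detach()
        accumulated = accumulated + F.interpolate(
            pred, size=(H, W), mode="bilinear", align_corners=False)
    return loss / len(scales)

# Toy usage on random tensors standing in for VQ-VAE latents.
model, quant = ToyVAR(), ToyQuantizer()
depth_latent = torch.randn(2, 4, 16, 16)   # ground-truth depth latent (toy)
image_cond = torch.randn(2, 4, 16, 16)     # image conditioning features (toy)
depthart_step(model, quant, image_cond, depth_latent).backward()
```

The key contrast with standard VAR training is inside the loop: the target token map is recomputed each step from the current residual, so the supervision adapts to whatever the model has already predicted at coarser scales.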