Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses
Main Authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Stereo depth estimation is a fundamental component in augmented reality (AR)
applications. Although AR applications require very low latency to run in
real time, traditional depth estimation models often rely on
time-consuming preprocessing steps such as rectification to achieve high
accuracy. In addition, algorithms built on non-standard ML operators, such as
cost volume, incur significant latency, which is aggravated on compute
resource-constrained mobile platforms. Therefore, we develop hardware-friendly
alternatives to the costly cost volume and preprocessing and design two new
models based on them, MultiHeadDepth and HomoDepth. Our approach to cost
volume replaces it with a new group-pointwise convolution-based operator
and an approximation of cosine similarity based on layernorm and dot product. For
online stereo rectification (preprocessing), we introduce a homography matrix
prediction network with a rectification positional encoding (RPE), which
delivers both low latency and robustness to unrectified images and
eliminates the need for preprocessing. Our MultiHeadDepth, which includes the
optimized cost volume, provides 11.8-30.3% improvements in accuracy and
22.9-25.2% reduction in latency compared to a state-of-the-art depth estimation
model for AR glasses from industry. Our HomoDepth, which adds optimized
preprocessing (Homography + RPE) on top of MultiHeadDepth, can process unrectified
images and reduce the end-to-end latency by 44.5%. We adopt a multi-task
learning framework to handle misaligned stereo inputs on HomoDepth, which
reduces the AbsRel error by 10.0-24.3%. The results demonstrate the efficacy of
our approaches in achieving both high model performance and low latency, which
is a step toward practical depth estimation on future AR devices. |
---|---|
DOI: | 10.48550/arxiv.2411.10013 |
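
The summary mentions approximating cosine similarity with layernorm and a dot product. The sketch below illustrates only the general identity behind that idea and is not the authors' operator: the dot product of two layer-normalized vectors, divided by their length, equals the cosine similarity of the mean-centered vectors, so it tracks plain cosine similarity when features are roughly zero-mean. The function names, the random test vectors, and the omission of learned scale and shift parameters are all assumptions made for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def cosine_similarity(a, b):
    """Reference cosine similarity, using explicit vector norms."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def layernorm_dot_similarity(a, b):
    """Hypothetical helper: dot product of layer-normalized vectors over the length."""
    na, nb = layer_norm(a), layer_norm(b)
    return sum(p * q for p, q in zip(na, nb)) / len(a)


# Assumed toy data: two correlated, roughly zero-mean feature vectors.
random.seed(0)
left = [random.gauss(0.0, 1.0) for _ in range(256)]
right = [0.8 * v + 0.2 * random.gauss(0.0, 1.0) for v in left]

exact = cosine_similarity(left, right)
approx = layernorm_dot_similarity(left, right)
print(f"cosine={exact:.4f}  layernorm+dot={approx:.4f}")
```

The practical appeal of this form is that it needs only a normalization and a dot product, both standard operators on mobile accelerators, rather than per-pair norm computations and divisions.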