360-MLC: Multi-view Layout Consistency for Self-training and Hyper-parameter Tuning
Abstract: We present 360-MLC, a self-training method based on multi-view layout consistency for fine-tuning monocular room-layout models using only unlabeled 360° images. This is valuable in practical scenarios where a pre-trained model must be adapted to a new data domain without any ground-truth annotations. Our simple yet effective assumption is that multiple layout estimates of the same scene must define a consistent geometry regardless of camera position. Based on this idea, we leverage a pre-trained model to project estimated layout boundaries from several camera views into 3D world coordinates. We then re-project them back into spherical coordinates and build a probability function, from which we sample pseudo-labels for self-training. To handle unreliable pseudo-labels, we use the variance of the re-projected boundaries as an uncertainty value that weights each pseudo-label in the training loss. In addition, since ground-truth annotations are available neither during training nor at test time, we use the entropy of the multiple layout estimates as a quantitative metric of the scene's geometric consistency, which lets us evaluate any layout estimator for hyper-parameter tuning, including model selection, without ground-truth annotations. Experimental results show that our solution performs favorably against state-of-the-art methods when self-training from three publicly available source datasets to a unique, newly labeled dataset consisting of multiple views of the same scenes.
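The core geometric step, lifting each view's estimated boundary into 3D world coordinates and re-projecting it into another view's spherical coordinates, can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: it assumes a HorizonNet-style predictor that outputs one floor-boundary elevation per panorama column, known camera positions on a shared floor plane, and a fixed camera height; all function and parameter names are hypothetical.

```python
import numpy as np

def boundary_to_world(theta, cam_pos, cam_height=1.6, width=1024):
    """Lift a per-column floor-boundary elevation (radians, negative =
    below the horizon) into 3D points on the floor plane z = 0.
    `cam_pos` is the camera's (x, y) world position, `cam_height` its
    height above the floor.  All names here are hypothetical."""
    phi = np.linspace(-np.pi, np.pi, width, endpoint=False)  # column azimuths
    dist = cam_height / np.tan(-theta)       # horizontal range of each ray
    x = cam_pos[0] + dist * np.cos(phi)
    y = cam_pos[1] + dist * np.sin(phi)
    return np.stack([x, y, np.zeros(width)], axis=1)         # (W, 3)

def world_to_boundary(points, cam_pos, cam_height=1.6, width=1024):
    """Re-project 3D floor points into another camera's spherical view,
    returning one boundary elevation per column (NaN where unobserved)."""
    dx, dy = points[:, 0] - cam_pos[0], points[:, 1] - cam_pos[1]
    phi = np.arctan2(dy, dx)
    dist = np.hypot(dx, dy)
    theta = -np.arctan2(cam_height, dist)    # elevation below the horizon
    cols = ((phi + np.pi) / (2 * np.pi) * width).astype(int) % width
    boundary = np.full(width, np.nan)
    near_last = np.argsort(-dist)            # far-to-near: near overwrites far
    boundary[cols[near_last]] = theta[near_last]
    return boundary
```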
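From the boundaries that several views re-project into the same panorama, the method builds a probability function, samples pseudo-labels from it, and weights each pseudo-label in the loss by the variance of the re-projected boundaries. The sketch below simplifies the sampling step to a per-column mean while keeping the variance-based weighting; it reuses the NumPy conventions above and its names remain hypothetical.

```python
def pseudo_label(boundaries):
    """Aggregate boundaries re-projected from N views into one panorama.
    The paper samples labels from a per-column probability function; for
    brevity we take the per-column mean as the pseudo-label and use the
    per-column variance as the uncertainty that weights the loss."""
    stack = np.asarray(boundaries)              # (N_views, W)
    seen = ~np.all(np.isnan(stack), axis=0)     # columns hit by >= 1 view
    label = np.full(stack.shape[1], np.nan)
    var = np.full(stack.shape[1], np.nan)
    label[seen] = np.nanmean(stack[:, seen], axis=0)
    var[seen] = np.nanvar(stack[:, seen], axis=0)
    weight = 1.0 / (1.0 + var)                  # down-weight disagreement
    return label, weight

def weighted_l1(pred, label, weight):
    """Uncertainty-weighted L1 self-training loss over observed columns."""
    m = ~np.isnan(label)
    return float(np.mean(weight[m] * np.abs(pred[m] - label[m])))
```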
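The entropy of the re-projected boundary distribution serves as a ground-truth-free consistency metric: the more the views agree, the more peaked the per-column distribution and the lower its entropy. A rough sketch of that idea (the paper's exact formulation may differ):

```python
def consistency_entropy(boundaries, n_bins=64):
    """Ground-truth-free consistency score: mean Shannon entropy of the
    per-column distribution of re-projected boundaries.  Lower means the
    views agree more on the scene geometry.  A sketch of the idea only."""
    stack = np.asarray(boundaries)              # (N_views, W)
    lo, hi = np.nanmin(stack), np.nanmax(stack)
    scores = []
    for col in stack.T:
        col = col[~np.isnan(col)]
        if col.size < 2:
            continue
        counts, _ = np.histogram(col, bins=n_bins, range=(lo, hi))
        p = counts[counts > 0] / col.size
        scores.append(-np.sum(p * np.log(p)))
    return float(np.mean(scores))
```

Computed per scene and averaged over a dataset, such a score would let one compare checkpoints or hyper-parameter settings and keep the lowest-entropy configuration, with no annotations involved.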
DOI: 10.48550/arxiv.2210.12935