Unsupervised Adversarial Visual Level Domain Adaptation for Learning Video Object Detectors from Images
Format: Article
Language: English
Abstract: Deep learning based object detectors require thousands of diverse bounding-box and class annotated examples. Although image object detectors have progressed rapidly in recent years with the release of multiple large-scale static image datasets, object detection on videos remains an open problem due to the scarcity of annotated video frames. A robust video object detector is an essential component for video understanding and for curating large-scale automated annotations in videos. The domain difference between images and videos makes the transfer of image object detectors to videos sub-optimal. The most common solution is to use weakly supervised annotations, where each video frame is tagged for the presence or absence of object categories; this still requires manual effort. In this paper we take a step forward by adapting the concept of unsupervised adversarial image-to-image translation to perturb static high-quality images so that they are visually indistinguishable from a set of video frames. We assume a fully annotated static image dataset and an unannotated video dataset. The object detector is trained on the adversarially transformed image dataset using the annotations of the original dataset. Experiments on the Youtube-Objects and Youtube-Objects-Subset datasets with two contemporary baseline object detectors show that such unsupervised pixel-level domain adaptation improves generalization to video frames compared to directly applying the original image object detector. We also achieve competitive performance compared to recent weakly supervised baselines. This paper can be seen as an application of image translation to cross-domain object detection.
DOI: 10.48550/arxiv.1810.02074
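The two-stage idea in the abstract can be sketched in a few lines: translate an annotated static image toward the video-frame domain at the pixel level, then train the detector on the translated image while reusing the original bounding boxes, which remain valid because the perturbation does not move any pixels geometrically. The sketch below is a minimal illustration under strong simplifying assumptions: the adversarial generator is replaced by a per-channel mean/std style shift toward the video frames, and all names (`translate_to_video_style`, the statistics values) are hypothetical, not from the paper.

```python
# Minimal sketch of the adaptation pipeline described in the abstract.
# A real implementation would learn the translation adversarially
# (e.g. a GAN generator); here a simple per-channel statistics match
# stands in for it so the example stays self-contained.
import numpy as np

rng = np.random.default_rng(0)

def translate_to_video_style(img, frame_stats):
    """Hypothetical stand-in for the adversarial generator: shift the
    image's per-channel mean/std to those of the video frames."""
    mu_v, sd_v = frame_stats
    mu_i = img.mean(axis=(0, 1))
    sd_i = img.std(axis=(0, 1)) + 1e-8
    return (img - mu_i) / sd_i * sd_v + mu_v

# Annotated static image with its bounding-box label (x1, y1, x2, y2).
image = rng.random((32, 32, 3))
box = (4, 4, 20, 20)

# Per-channel statistics of the unannotated video frames
# (hypothetical numbers for illustration).
video_stats = (np.array([0.30, 0.30, 0.35]),
               np.array([0.15, 0.15, 0.20]))

translated = translate_to_video_style(image, video_stats)

# The translation is purely pixel-level, so geometry is untouched and
# the ORIGINAL box annotation transfers directly to `translated`,
# which is what the detector would then be trained on.
assert translated.shape == image.shape
```

The key design point the abstract relies on is exactly this last comment: because the transformation changes appearance but not layout, the annotations of the original image dataset can supervise training on the translated images with no extra labeling effort.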