Learning to Draw Sight Lines

Bibliographic Details
Published in: International Journal of Computer Vision, May 2020, Vol. 128 (5), pp. 1076-1100
Authors: Zhao, Hao; Lu, Ming; Yao, Anbang; Chen, Yurong; Zhang, Li
Format: Article
Language: English
Description
Abstract: In this paper, we are concerned with the task of gaze following. Given a scene (e.g. a girl playing soccer on a field) and a human subject's head position, this task aims to infer where she is looking (e.g. at the soccer ball). An existing method adopts a saliency model conditioned on the head position. However, this methodology is intrinsically troubled by dataset bias issues, which we will reveal in detail. In order to resolve these issues, we argue that the right methodology is to simulate how human beings follow gazes. Specifically, we propose the hypothesis that a human follows gazes by searching for salient objects along the subject's sight line direction. To algorithmically embody this hypothesis, a two-stage method is proposed, dubbed learning to draw sight lines. In the first stage, a fully convolutional network is trained to directly regress the existence strength of sight lines. This may seem counterintuitive at first glance, as these so-called sight lines do not really exist in the form of learnable image gradients. However, with the large-scale dataset GazeFollow, we demonstrate that this highly abstract concept can be grounded in neural network activations. An extensive study is conducted on the design of this sight line grounding network. We show that the best model we examined already outperforms the state of the art by a large margin, using a naive greedy inference strategy. We attribute these improvements to modern architecture design philosophies. However, no matter how strong the sight line grounding network is, the greedy inference strategy cannot handle a family of failure cases caused by dataset bias issues. We identify these issues and demonstrate that the grounded sight lines, a unique ingredient of our method, are the key to overcoming them. Specifically, an algorithm termed RASP is introduced as a second stage. RASP has five intriguing features: (1) it explicitly embodies the aforementioned hypothesis; (2) it involves no hyper-parameters, which guarantees its robustness; (3) if needed, it can be implemented as an integrated layer for end-to-end inference; (4) it improves the performance of all sight line grounding networks we inspected; (5) further analyses confirm that RASP works by alleviating the spotted dataset biases. Strong results are achieved on the GazeFollow benchmark. Combining RASP and the best sight line grounding network can bring mean distance, minimum distance and mean angle difference…
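
The abstract stops short of spelling out how the sight-line hypothesis becomes an inference procedure. The Python sketch below is a rough illustration of that hypothesis only, not the paper's actual RASP algorithm: it greedily takes the strongest activation of a predicted sight-line map, then walks the ray from the head through that point and keeps the most salient pixel on it. All names here (follow_gaze, sight_line_map, saliency_map, head_xy) are hypothetical.

import numpy as np

# Illustrative sketch only; this procedure is an assumption,
# not the paper's actual RASP implementation.
def follow_gaze(sight_line_map, saliency_map, head_xy):
    """Pick a gaze point by searching for the most salient pixel
    along the subject's estimated sight line.

    sight_line_map : (H, W) array of sight-line existence strength
    saliency_map   : (H, W) array of generic saliency scores
    head_xy        : (x, y) head position in pixels
    """
    h, w = sight_line_map.shape
    # Greedy stage: the strongest sight-line activation.
    gy, gx = np.unravel_index(np.argmax(sight_line_map), (h, w))
    # Sight-line direction: from the head toward the greedy point.
    direction = np.array([gx, gy], dtype=float) - np.array(head_xy, dtype=float)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return gx, gy  # degenerate case: head coincides with the maximum
    direction /= norm
    # Walk the ray one pixel at a time, keeping the most salient spot.
    best_score, best_xy = -np.inf, (gx, gy)
    point = np.array(head_xy, dtype=float)
    while 0 <= point[0] < w and 0 <= point[1] < h:
        x, y = int(point[0]), int(point[1])
        if saliency_map[y, x] > best_score:
            best_score, best_xy = saliency_map[y, x], (x, y)
        point += direction
    return best_xy

For example, follow_gaze(np.random.rand(224, 224), np.random.rand(224, 224), (100, 120)) returns an (x, y) pixel on the ray from the head position. Note that the ray walk introduces no tunable thresholds, which loosely echoes the abstract's claim that RASP involves no hyper-parameters.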
ISSN: 0920-5691, 1573-1405
DOI: 10.1007/s11263-019-01263-4