CFN: A coarse‐to‐fine network for eye fixation prediction

Bibliographic details
Published in: IET Image Processing, 2022-07, Vol. 16 (9), pp. 2373-2383
Authors: Xu, Binwei; Liang, Haoran; Liang, Ronghua; Chen, Peng
Format: Article
Language: English
Online access: Full text
Description
Abstract: Many image-to-image computer vision approaches have made great progress with end-to-end encoder-decoder frameworks. However, eye fixation prediction, although also an image-to-image task, differs from those tasks in that it focuses on salient regions rather than precise predictions for every pixel, so directly applying an end-to-end encoder-decoder is not appropriate. In addition, although high-level features are important, the contribution of low-level features should also be preserved and balanced in a computational model; low-level features that attract attention are easily lost as they pass through a deep network. Effectively integrating low-level and high-level features to improve eye fixation prediction therefore remains a challenging task. In this paper, a coarse-to-fine network (CFN) comprising two pathways with different training strategies is proposed: the coarse perceiving network (CFN-Coarse), which can be a simple encoder or any existing pretrained network, captures the distribution of salient regions and generates high-quality feature maps; the fine integrating network (CFN-Fine) keeps the parameters of CFN-Coarse fixed and combines features from deep to shallow in the deconvolution process, adding skip connections between the down-sampling and up-sampling paths to efficiently integrate deep and shallow features. The saliency maps obtained by the method are evaluated on six standard benchmark datasets, namely SALICON, MIT1003, MIT300, Toronto, OSIE, and SUN500. The results demonstrate that the method surpasses state-of-the-art accuracy in eye fixation prediction and achieves competitive performance under most evaluation metrics on the SALICON Saliency Prediction Challenge (LSUN 2017).
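
To make the two-pathway design concrete, the following is a minimal PyTorch-style sketch of the coarse-to-fine idea described in the abstract: an encoder (here called CFNCoarse) produces multi-scale features and a coarse prediction, and a decoder (CFNFine) freezes the encoder and upsamples deep features back to input resolution, fusing shallower encoder features through skip connections. The class names, layer depths, and channel counts are illustrative assumptions, not the authors' published implementation.

# Minimal PyTorch-style sketch of the two-pathway, coarse-to-fine idea.
# Class names, depths, and channel counts are illustrative assumptions only.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU; used on both down- and up-sampling paths.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )


class CFNCoarse(nn.Module):
    # Coarse pathway: a plain encoder that captures the distribution of salient
    # regions and exposes its intermediate feature maps for later fusion.
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 64)
        self.enc2 = conv_block(64, 128)
        self.enc3 = conv_block(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.head = nn.Conv2d(256, 1, 1)  # coarse saliency prediction

    def forward(self, x):
        f1 = self.enc1(x)               # shallow features, full resolution
        f2 = self.enc2(self.pool(f1))   # mid-level features, 1/2 resolution
        f3 = self.enc3(self.pool(f2))   # deep features, 1/4 resolution
        return torch.sigmoid(self.head(f3)), (f1, f2, f3)


class CFNFine(nn.Module):
    # Fine pathway: keeps the coarse encoder fixed and decodes deep features back
    # to input resolution, fusing encoder features via skip connections.
    def __init__(self, coarse):
        super().__init__()
        self.coarse = coarse
        for p in self.coarse.parameters():
            p.requires_grad = False      # CFN-Coarse parameters stay fixed
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(128 + 128, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(64 + 64, 64)
        self.head = nn.Conv2d(64, 1, 1)  # fine saliency prediction

    def forward(self, x):
        with torch.no_grad():
            _, (f1, f2, f3) = self.coarse(x)
        d2 = self.dec2(torch.cat([self.up2(f3), f2], dim=1))  # deep -> mid fusion
        d1 = self.dec1(torch.cat([self.up1(d2), f1], dim=1))  # mid -> shallow fusion
        return torch.sigmoid(self.head(d1))


if __name__ == "__main__":
    model = CFNFine(CFNCoarse())
    print(model(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 1, 256, 256])

In such a setup, the coarse pathway would first be trained (or taken from a pretrained backbone) to predict coarse saliency, after which only the decoder layers of the fine pathway are optimised, mirroring the two training strategies and the fixed CFN-Coarse parameters mentioned in the abstract.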
ISSN: 1751-9659, 1751-9667
DOI: 10.1049/ipr2.12494