360Spred: Saliency Prediction for 360-Degree Videos Based on 3D Separable Graph Convolutional Networks
Published in: | IEEE Transactions on Circuits and Systems for Video Technology 2024-10, Vol.34 (10), p.9979-9996 |
---|---|
Main authors: | , , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Predicting the saliency map of a 360-degree video is key to various downstream tasks, such as saliency-based compression and tile-based adaptive streaming. Besides static salient objects, moving targets also contribute to the saliency map. Therefore, the joint exploitation of spherical spatio-temporal information is necessary for accurate saliency prediction. Spherical spatial feature extraction, however, is hindered by the non-Euclidean geometry of spherical data, which makes it difficult to extract spatial features directly with traditional convolutional neural networks (CNNs). Efficiently exploiting the temporal correlation between these spherical spatial features is another challenge, as it requires extracting spherical optical flows for explicit motion information. To address these challenges, we first propose a spherical graph-based Farneback algorithm that extracts the spherical optical flows directly in the sphere domain by leveraging the GICOPix uniform sampling scheme. We then design a 3D separable graph convolutional network-based saliency prediction framework, named 360Spred, that takes both the spherical frames and the spherical optical flows as input. The proposed 360Spred framework is based on the U-Net structure, with a 3D separable graph convolution (3DSGC) operator that directly extracts the visual and motion features in the sphere domain and exploits the temporal correlation of both the high-level and low-level spatial features. Experimental results on two public datasets show that 360Spred achieves higher saliency prediction accuracy than the baseline models for 360-degree videos. |
---|---|
ISSN: | 1051-8215, 1558-2205 |
DOI: | 10.1109/TCSVT.2024.3407685 |
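
The summary describes a separable spatio-temporal graph convolution applied to frames sampled on the sphere. The sketch below illustrates that general idea only and is not the paper's 3DSGC operator, whose definition is not given in this record. It assumes a generic factorization into a per-frame graph aggregation followed by a 1D temporal filter. All names (`normalized_adjacency`, `separable_graph_conv`, `w_spatial`, `w_temporal`) and the toy ring graph are hypothetical.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric adjacency matrix A."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def separable_graph_conv(x, adj_norm, w_spatial, w_temporal):
    """Illustrative separable spatio-temporal graph convolution (assumed form).

    x:          (T, N, C_in) features for T frames on N graph nodes
    adj_norm:   (N, N) normalized adjacency of the sampling graph
    w_spatial:  (C_in, C_out) channel-mixing weights of the spatial step
    w_temporal: (K,) temporal filter taps, K odd
    returns:    (T, N, C_out)
    """
    # Spatial step: aggregate neighbours within each frame, then mix channels.
    spatial = np.einsum("ij,tjc->tic", adj_norm, x) @ w_spatial

    # Temporal step: filter each node/channel along time with zero padding.
    k = w_temporal.shape[0]
    pad = k // 2
    padded = np.pad(spatial, ((pad, pad), (0, 0), (0, 0)))
    out = np.zeros_like(spatial)
    for i in range(k):
        out += w_temporal[i] * padded[i:i + spatial.shape[0]]
    return out


# Toy usage: a 6-node ring graph stands in for a spherical sampling graph.
n_nodes, n_frames = 6, 4
ring = np.zeros((n_nodes, n_nodes))
for v in range(n_nodes):
    ring[v, (v + 1) % n_nodes] = ring[(v + 1) % n_nodes, v] = 1.0

rng = np.random.default_rng(0)
features = rng.normal(size=(n_frames, n_nodes, 3))
result = separable_graph_conv(
    features,
    normalized_adjacency(ring),
    rng.normal(size=(3, 2)),
    np.array([0.25, 0.5, 0.25]),
)
print(result.shape)
```

Splitting the operator this way keeps the parameter count at `C_in * C_out + K` instead of one weight per spatial-temporal neighbour pair, which is the usual motivation for separable designs. Whether the paper uses this exact factorization is an assumption here.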