Learning Multi-Granularity Temporal Characteristics for Face Anti-Spoofing


Bibliographic Details
Published in: IEEE Transactions on Information Forensics and Security, 2022, Vol. 17, p. 1254-1269
Main Authors: Wang, Zhuo; Wang, Qiangchang; Deng, Weihong; Guo, Guodong
Format: Article
Language: English
Description
Summary: Face anti-spoofing (FAS) is essential for securing face recognition systems. Although existing methods perform reasonably well, few of them fully leverage temporal information. This limits performance, because real and fake faces tend to share highly similar spatial appearances while important temporal features between consecutive frames are neglected. In this work, we propose a temporal transformer network (TTN) to learn multi-granularity temporal characteristics for FAS. It consists mainly of temporal difference attentions (TDA), a pyramid temporal aggregation (PTA), and a temporal depth difference loss (TDL). First, a vision transformer (ViT) is used as the backbone, whose comprehensive local patches capture subtle differences between live and spoof faces. Instead of learning temporal features on global faces, which may miss important local cues, the TDA extracts motion-sensitive cues on each local patch. The TDA is inserted into different layers of the ViT, so that multi-scale motion-sensitive local cues are learned to improve FAS performance. Second, different subjects may show different visual tempos in some actions, making it necessary to model different temporal speeds. The PTA aggregates temporal features at various tempos, building short-range and long-range relations among multiple frames. Third, depth maps for real parts may change continuously, while they remain zero for spoof regions. To locate motion features on facial parts, the TDL guides the network to locate spoof facial parts, with motion patterns between neighboring frames set as the ground truth. To the best of our knowledge, this work is the first attempt to learn temporal characteristics via transformers. Both qualitative and quantitative results on several challenging tasks demonstrate the effectiveness of the proposed methods.
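Note: the summary mentions two ideas that a small example may make more concrete: motion cues taken per local patch between frames, and aggregation at several tempos. The sketch below is only an illustration under stated assumptions and is not the TTN from the article. It replaces attention with plain feature differences, treats a "tempo" as a frame stride, and uses hypothetical names (`temporal_differences`, `multi_tempo_motion`) that do not come from the paper.

```python
from typing import Dict, List, Sequence

# Assumption: one frame is a list of patch feature vectors (plain floats).
Frame = List[List[float]]


def temporal_differences(frames: List[Frame], stride: int = 1) -> List[Frame]:
    """Per-patch feature differences between frames `stride` steps apart."""
    diffs = []
    for t in range(len(frames) - stride):
        earlier, later = frames[t], frames[t + stride]
        diffs.append([
            [b - a for a, b in zip(patch_earlier, patch_later)]
            for patch_earlier, patch_later in zip(earlier, later)
        ])
    return diffs


def multi_tempo_motion(
    frames: List[Frame], strides: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[float]]:
    """Mean absolute per-patch difference at each stride (illustrative tempo)."""
    result = {}
    for stride in strides:
        diffs = temporal_differences(frames, stride)
        if not diffs:
            continue  # clip too short for this stride
        scores = []
        for p in range(len(frames[0])):
            values = [abs(v) for d in diffs for v in d[p]]
            scores.append(sum(values) / len(values))
        result[stride] = scores
    return result


if __name__ == "__main__":
    # Toy clip: patch 0 is static, patch 1 changes steadily over 5 frames.
    clip = [[[0.0], [float(t)]] for t in range(5)]
    print(multi_tempo_motion(clip))
```

In this toy clip the static patch scores zero at every stride, while the changing patch scores higher as the stride grows, which is the kind of short-range versus long-range contrast the summary attributes to modelling several tempos.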
ISSN: 1556-6013, 1556-6021
DOI: 10.1109/TIFS.2022.3158062