Driver Gaze Zone Estimation based on 3-Channel Convolution-optimized Vision Transformer with Transfer Learning

Bibliographic Details
Published in: IEEE Sensors Journal, 2024-10, p. 1-1
Main authors: Li, Zhao; Jiang, Siyang; Fu, Rui; Guo, Yingshi; Wang, Chang
Format: Article
Language: English
Description
Summary: Driver gaze zone estimation (DGZE) is essential for detecting the driver's state and for formulating takeover rules in intelligent driving systems. However, CNN-based multichannel models lack global feature extraction capability and have a large number of parameters and high computational complexity. Therefore, this paper proposes a novel method that uses a 3-channel convolution-optimized ViT (3C-CoViT) to estimate the driver's gaze zone. The method replaces the linear projection in the pure ViT structure with a convolutional projection, converts the input images of the different channels into image sequences, and then adds a convolutional feed-forward network to extract the local features of the tokens, enhance the correlation of adjacent tokens in the spatial dimensions, and improve the performance and efficiency of the model. Following a transfer-learning approach, we pre-trained the model on the GazeCapture dataset and then fine-tuned it on a dataset built in an on-road driving experiment. To enhance the interpretability of the model, we present a novel visualization method. Experimental results show that the proposed method accurately identifies driver gaze zones (98.04% average accuracy) and outperforms state-of-the-art methods in terms of accuracy and reliability. Ablation studies confirm the advantage of the proposed method over the pure ViT and the beneficial effects of transfer learning and 3-channel information input.
ISSN: 1530-437X, 1558-1748
DOI: 10.1109/JSEN.2024.3486373
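
The convolutional projection mentioned in the summary can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `conv_token_embedding`, the kernel size, stride, padding, embedding width, the 8 x 8 input size, and the choice to share one set of kernels across the three input images are all assumptions made for illustration. The sketch only shows the general idea that a strided, overlapping convolution can turn each input image into a sequence of tokens, where a pure ViT would apply a linear projection to non-overlapping patches.

```python
import random


def conv_token_embedding(image, kernels, stride, padding=0):
    """Turn one 2-D image into a token sequence with a strided convolution.

    image   : list of H rows, each a list of W floats
    kernels : list of D square k x k kernels (one per embedding dimension)
    Returns (tokens, (out_h, out_w)); tokens holds out_h * out_w vectors of length D.
    """
    k = len(kernels[0])
    h, w = len(image), len(image[0])
    width = w + 2 * padding

    # Zero-pad the image so windows near the border are still covered.
    padded = [[0.0] * width for _ in range(padding)]
    padded += [[0.0] * padding + list(row) + [0.0] * padding for row in image]
    padded += [[0.0] * width for _ in range(padding)]

    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1

    tokens = []
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            # One token per window: the response of every kernel at this position.
            token = [
                sum(kern[a][b] * padded[top + a][left + b]
                    for a in range(k) for b in range(k))
                for kern in kernels
            ]
            tokens.append(token)
    return tokens, (out_h, out_w)


# Hypothetical setup: three 8 x 8 single-channel inputs, 3 x 3 kernels,
# stride 2 (windows overlap because stride < kernel size), embedding width 4.
random.seed(0)
embed_dim, k, stride, padding = 4, 3, 2, 1
kernels = [[[random.uniform(-1, 1) for _ in range(k)] for _ in range(k)]
           for _ in range(embed_dim)]
channels = [[[random.random() for _ in range(8)] for _ in range(8)]
            for _ in range(3)]

# Assumption: the per-image token sequences are simply concatenated.
sequence = []
for img in channels:
    tokens, grid = conv_token_embedding(img, kernels, stride, padding)
    sequence.extend(tokens)

print(grid, len(sequence))  # (4, 4) 48
```

With these assumed settings each image yields a 4 x 4 grid of tokens, so the concatenated sequence has 48 tokens of width 4. Because neighboring windows overlap, adjacent tokens share input pixels, which is one way a convolutional projection can tie adjacent tokens together spatially.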