Continuous sign language recognition based on hierarchical memory sequence network


Bibliographic details
Published in: IET Computer Vision 2024-03, Vol.18 (2), p.247-259
Main authors: Xue, Cuihong, Jia, Jingli, Yu, Ming, Yan, Gang, Guo, Yingchun, Liu, Yuehao
Format: Article
Language: English
Description
Summary: To address feature extractors that lack strongly supervised training, and the insufficient temporal information learned by a single sequence model, a hierarchical sequence memory network with a multi‐level iterative optimisation strategy is proposed for continuous sign language recognition. This method uses the spatial‐temporal fusion convolution network (STFC‐Net) to extract the spatial‐temporal information of RGB and optical flow video frames to obtain the multi‐modal visual features of a sign language video. Then, to enhance the temporal relationships of the visual feature maps, the hierarchical memory sequence network captures local utterance features and global context dependencies across the time dimension to obtain sequence features. Finally, the decoder decodes the final sentence sequence. To enhance the feature extractor, the authors adopt a multi‐level iterative optimisation strategy to fine‐tune STFC‐Net and the utterance feature extractor. Experimental results on the RWTH‐Phoenix‐Weather multi‐signer 2014 dataset and the Chinese sign language dataset show the effectiveness and superiority of this method. The authors propose a continuous sign language recognition method based on a hierarchical memory sequence network. This method uses the spatial‐temporal fusion convolution network (STFC‐Net) to extract the spatiotemporal information of video frames to obtain the visual features of a sign language video. Then, the authors design a hierarchical memory sequence network to extract high‐level contextual semantic dependency information in a hierarchical structure and to explore deeper connections between sequences. Furthermore, a multi‐level iterative optimisation strategy is proposed to enhance the representation capability of the feature extractor and ultimately improve the performance of the overall model.
ISSN: 1751-9632, 1751-9640
DOI: 10.1049/cvi2.12240
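
The data flow outlined in the summary (fuse RGB and optical flow features per frame, derive local utterance-level features, then attach global context) can be illustrated with a toy sketch. This is not the authors' architecture: simple concatenation stands in for STFC-Net, mean pooling stands in for the learned memory sequence layers, and the decoder is omitted. All function names, the window size, and the stride are hypothetical choices made only for illustration.

```python
from typing import List

Vector = List[float]


def fuse_modalities(rgb: List[Vector], flow: List[Vector]) -> List[Vector]:
    """Concatenate per-frame RGB and optical flow features.

    Assumption: plain concatenation replaces the learned fusion network.
    """
    if len(rgb) != len(flow):
        raise ValueError("RGB and flow streams must have the same number of frames")
    return [r + f for r, f in zip(rgb, flow)]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def local_utterance_features(frames: List[Vector], window: int, stride: int) -> List[Vector]:
    """Lower level: pool each sliding window of frames into one local feature."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(frames) < window:
        raise ValueError("sequence is shorter than the window")
    starts = range(0, len(frames) - window + 1, stride)
    return [mean_vector(frames[i:i + window]) for i in starts]


def add_global_context(local_feats: List[Vector]) -> List[Vector]:
    """Upper level: append the sequence-wide mean to every local feature."""
    context = mean_vector(local_feats)
    return [feat + context for feat in local_feats]


def encode(rgb: List[Vector], flow: List[Vector], window: int = 4, stride: int = 2) -> List[Vector]:
    """Run the toy two-level encoder on a pair of per-frame feature streams."""
    fused = fuse_modalities(rgb, flow)
    local = local_utterance_features(fused, window, stride)
    return add_global_context(local)


if __name__ == "__main__":
    # Eight frames with one-dimensional features per modality (made-up values).
    rgb_stream = [[float(t)] for t in range(8)]
    flow_stream = [[1.0] for _ in range(8)]
    for feature in encode(rgb_stream, flow_stream):
        print(feature)
```

Each output vector holds a local window feature followed by the shared global context, which mirrors, in a very reduced form, the idea of combining local utterance features with sequence-wide dependencies.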