Toward Human-Friendly ASR Systems: Recovering Capitalization and Punctuation for Vietnamese Text

Bibliographic details
Published in:	IEICE Transactions on Information and Systems 2021/08/01, Vol.E104.D(8), pp.1195-1203
Main authors:	NGUYEN, Thi Thu HIEN; NGUYEN, Thai BINH; PHAM, Ngoc PHUONG; DO, Quoc TRUONG; LE, Tu LUC; LUONG, Chi MAI
Format:	Article
Language:	eng
Subjects:	Automatic speech recognition; Capitalization; Conditional random fields; Deep learning; Homonyms; Human performance; Machine learning; Natural language processing; Punctuation; Sentences; Speech recognition; Vietnamese; Voice recognition; Words (language)
Online access:	Full text

Description
Abstract:	Speech recognition converts spoken words and sentences in audio into text. With recent advances in deep learning, speech recognition now achieves results close to human performance. However, the recognized output still has limitations, such as missing punctuation and capitalization and unstandardized numerical data. Vietnamese also contains local words, homonyms, etc., which make the output harder for users to read and understand and harder to use in downstream Natural Language Processing (NLP) tasks. In this paper, we propose combining a transformer decoder with a conditional random field (CRF) to restore punctuation and capitalization in the output of Vietnamese automatic speech recognition (ASR). Chunking input sentences and merging the output sequences makes it possible to handle longer strings with greater accuracy. Experiments on a Vietnamese ASR output dataset show that the proposed method delivers the best results.
ISSN:	0916-8532, 1745-1361
DOI:	10.1587/transinf.2020BDP0005
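
The chunk-and-merge idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the window size, the overlap, the merge rule (for a token covered by several chunks, keep the label from the chunk in which the token lies farthest from a chunk boundary), and all function names are assumptions made for illustration. `toy_tagger` is a hypothetical stand-in for the transformer and CRF model described in the paper.

```python
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def split_into_chunks(tokens: List[str], size: int, overlap: int) -> List[Tuple[int, List[str]]]:
    """Cut tokens into overlapping windows; return (start_index, chunk) pairs."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return chunks


def merge_labels(length: int, labelled: List[Tuple[int, List[str]]]) -> List[str]:
    """Merge per-chunk labels into one sequence of the given length.

    Assumed merge rule: prefer the label predicted where the token has the
    most context on both sides, i.e. the largest distance to a chunk edge.
    Ties keep the label from the earlier chunk.
    """
    best_margin = [-1] * length
    merged = [""] * length
    for start, labels in labelled:
        n = len(labels)
        for i, label in enumerate(labels):
            margin = min(i, n - 1 - i)  # distance to the nearest chunk edge
            if margin > best_margin[start + i]:
                best_margin[start + i] = margin
                merged[start + i] = label
    return merged


def tag_long_sequence(tokens: List[str], tagger: Tagger, size: int = 4, overlap: int = 2) -> List[str]:
    """Tag a long token sequence by tagging overlapping chunks and merging."""
    chunks = split_into_chunks(tokens, size, overlap)
    labelled = [(start, tagger(chunk)) for start, chunk in chunks]
    return merge_labels(len(tokens), labelled)


def toy_tagger(chunk: List[str]) -> List[str]:
    """Hypothetical stand-in for a real model: mark tokens of a known place name.

    "U" means "capitalize the first letter", "O" means "leave unchanged".
    """
    place_name_tokens = {"hà", "nội"}
    return ["U" if tok in place_name_tokens else "O" for tok in chunk]
```

In a real system the tagger's prediction for a token depends on its context, which is why the assumed merge rule favors labels predicted away from chunk boundaries; with the context-free `toy_tagger`, the merged output simply equals what tagging the whole sequence at once would give.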