A Semi-Supervised Paraphrase Identification Model Based on Multi-Granularity Interaction Reasoning


Bibliographic details
Published in: IEEE Access, 2020, Vol. 8, p. 60790-60800
Main authors: Li, Xu; Zeng, Fanxu; Yao, Chunlong
Format: Article
Language: English
Description
Summary: Conventional paraphrase identification (PI) models based on deep learning usually focus on text representation and ignore the mining and matching of multi-granular interaction features. In addition, supervised learning relies on large amounts of labeled data, yet the labeled training sets available for PI are small in comparison with the high complexity of the task. To address these problems, we propose a semi-supervised deep learning framework for PI. We use a neural encoder with a word-by-word attention mechanism to reason about equivalence or contradiction over pairs of words, phrases, and sentences. We employ a two-stage training procedure. First, we use a language modeling objective to learn the initial parameters on unlabeled corpora of more than one million sentence pairs. This is followed by supervised training, in which we adapt these parameters to a specific classification task using labeled data. Experimental results on the MRPC (Microsoft Research Paraphrase Corpus) and SICK (Sentences Involving Compositional Knowledge) datasets demonstrate the effectiveness of our approach. Compared with previous neural network models, we achieve absolute improvements of 7.6% in accuracy and 5.4% in F1 on MRPC, and of 4.5% in Pearson's r and 5.1% in Spearman's ρ on SICK.
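The summary mentions a word-by-word attention mechanism without giving its details. As a rough illustration only, the sketch below shows a generic dot-product soft alignment between the word vectors of two sentences. It is not the authors' architecture: the function names (`softmax`, `dot`, `word_by_word_attention`), the dot-product scoring, and the toy 2-dimensional embeddings are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def word_by_word_attention(sentence_a, sentence_b):
    """For each word vector in sentence_b, attend over all words of sentence_a.

    Returns one (weights, context) pair per word of sentence_b, where
    weights is a distribution over the words of sentence_a and context
    is the weighted average of sentence_a's word vectors.
    Hypothetical helper: dot-product scoring is an assumption here.
    """
    dim = len(sentence_a[0])
    results = []
    for b_vec in sentence_b:
        weights = softmax([dot(b_vec, a_vec) for a_vec in sentence_a])
        context = [
            sum(w * a_vec[d] for w, a_vec in zip(weights, sentence_a))
            for d in range(dim)
        ]
        results.append((weights, context))
    return results


# Toy embeddings, invented for illustration (not from the paper).
sentence_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_b = [[2.0, 0.0], [0.0, 2.0]]

alignments = word_by_word_attention(sentence_a, sentence_b)
for weights, context in alignments:
    print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

In a full model, the per-word context vectors would typically feed a later comparison or classification layer; that part is omitted here because the summary does not describe it.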
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.2984009