TBNF:A Transformer-based Noise Filtering Method for Chinese Long-form Text Matching

In the field of deep matching, a large amount of noisy data in Chinese long texts affects the matching effect. Most long-form text matching models use all text data indiscriminately, which results in a large amount of noisy data, and thus the PageRank algorithm is combined with Transformer to filter...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Applied intelligence (Dordrecht, Netherlands) Netherlands), 2023-10, Vol.53 (19), p.22313-22327
Hauptverfasser: Gan, Ling, Hu, Liuhui, Tan, Xiaodong, Du, Xinrui
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:In the field of deep matching, a large amount of noisy data in Chinese long texts affects the matching effect. Most long-form text matching models use all text data indiscriminately, which results in a large amount of noisy data, and thus the PageRank algorithm is combined with Transformer to filter noise. For sentence-level noise detection, after calculating the overlap rate of words to evaluate the similarity, a sentence-level relationship graph is constructed and filtered by using the PageRank algorithm; for word-level noise detection, based on the attention score in Transformer, a word graph is established, then the PageRank algorithm is executed on graph, combined with self-attention weights, to select keywords to highlight topic relevance, the noisy words are filtered sequentially at different layers in the module, layer by layer. In addition, during the model training, PolyLoss is applied to replace the traditional binary Cross-Entropy loss function, thus reducing the difficulty of hyperparameter tuning. Finally, a better filtering strategy is proposed and experiments are conducted to verify it on two Chinese long-form text matching datasets. The result shows that the matching model based on the noise filtering strategy of this paper can better filter the noise and capture the matching signal more accurately.
ISSN:0924-669X
1573-7497
DOI:10.1007/s10489-023-04607-3