Is cross-linguistic advert flaw detection in Wikipedia feasible? A multilingual-BERT-based transfer learning approach


Bibliographic details
Published in: Knowledge-Based Systems, 2022-09, Vol. 252, p. 109330, Article 109330
Authors: Li, Muyan; Zhou, Heshen; Hou, Jingrui; Wang, Ping; Gao, Erpei
Format: Article
Language: English
Abstract: Wikipedia is one of the most prominent online platforms from which people acquire knowledge; thus, its article quality should be of great concern. Currently, many scholars focus on quality assessment and quality-flaw detection in Wikipedia articles. However, most of them consider only one language version, typically English. One major obstacle to conducting such research in non-English or multilanguage scenarios is insufficient labeled data. To address this, we introduce transfer learning based on a pretrained multilanguage model to verify whether cross-language flaw detection is feasible. Specifically, we chose the Advert flaw (content written like an advertisement) as our research objective; French, Spanish, and Chinese as the target language scenarios; and English articles as the source scenario. Multilingual BERT combined with a sequential model was used to extract semantic features and build classifiers. Moreover, we compared three strategies (direct transfer, fine-tuning transfer, and nontransfer) to determine the best strategy for cross-language Advert-flaw detection at different training-sample scales. The experimental results demonstrate that the proposed model trained on the English dataset can identify the Advert flaw in other languages, and that fine-tuning transfer yields the best performance as the corpus grows.

Highlights:
- Introduces transfer learning for cross-linguistic Wikipedia Advert-flaw detection.
- A model trained on English Wikipedia samples can detect the Advert flaw in non-English Wikipedia.
- Multilingual BERT is a suitable encoder for cross-linguistic transfer learning.
- The proposed fine-tuning transfer performs best across dataset scales.
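The pipeline the abstract describes (multilingual-BERT encoder, then a sequential model, then a binary classifier) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the GRU as the sequential model, the layer widths, and the mocked encoder `mock_mbert_encode` are all assumptions; a real run would obtain contextual embeddings from `bert-base-multilingual-cased` via the `transformers` library.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 768    # mBERT hidden size
GRU_DIM = 64    # hypothetical sequential-model width

def mock_mbert_encode(num_tokens: int) -> np.ndarray:
    """Stand-in for mBERT: one 768-d contextual embedding per token.

    In the real system these vectors would come from
    bert-base-multilingual-cased; here they are random placeholders.
    """
    return rng.standard_normal((num_tokens, HIDDEN))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Untrained GRU parameters (update/reset/candidate gates) and a logistic head.
Wz, Wr, Wh = (rng.standard_normal((HIDDEN, GRU_DIM)) * 0.01 for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((GRU_DIM, GRU_DIM)) * 0.01 for _ in range(3))
w_out = rng.standard_normal(GRU_DIM) * 0.01

def gru_last_state(tokens: np.ndarray) -> np.ndarray:
    """Run a single-layer GRU over the token embeddings; return final state."""
    h = np.zeros(GRU_DIM)
    for x in tokens:
        z = sigmoid(x @ Wz + h @ Uz)             # update gate
        r = sigmoid(x @ Wr + h @ Ur)             # reset gate
        h_cand = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
        h = (1.0 - z) * h + z * h_cand
    return h

def advert_probability(tokens: np.ndarray) -> float:
    """Probability that the encoded article reads like an advertisement."""
    return float(sigmoid(gru_last_state(tokens) @ w_out))

p = advert_probability(mock_mbert_encode(32))
print(f"Advert-flaw probability (untrained): {p:.3f}")
```

Under this reading, the three strategies compared in the paper differ only in how the parameters above are trained: direct transfer trains on English samples alone and applies the model to the target language unchanged, fine-tuning transfer continues training on target-language samples, and nontransfer trains on the target language from scratch.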
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2022.109330