SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | https://aclanthology.org/2024.lrec-main.1320/ The proliferation of news media outlets has increased the demand for
intelligent systems capable of detecting redundant information in news articles
in order to enhance user experience. However, the heterogeneous nature of news
can lead to spurious findings in these systems: simple heuristics, such as
whether both articles in a pair are about politics, can provide strong but
deceptive downstream performance. Segmenting news similarity datasets into topics
improves the training of these models by forcing them to learn how to
distinguish salient characteristics within narrower domains. However, this
requires the existence of topic-specific datasets, which are currently lacking.
In this article, we propose a novel dataset of similar news, SPICED, which
includes seven topics: Crime & Law, Culture & Entertainment, Disasters &
Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and
Sports. Furthermore, we present four different levels of complexity,
specifically designed for the news similarity detection task. We benchmarked the
created datasets using MinHash, BERT, SBERT, and SimCSE models. |
---|---|
DOI: | 10.48550/arxiv.2309.13080 |
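
The summary names MinHash as one of the benchmarked baselines. As a rough illustration of how MinHash can score the similarity of two news texts, the following is a minimal pure-Python sketch. It is not the authors' benchmark configuration: the word-shingle size `k`, the number of hash functions `num_perm`, the use of seeded BLAKE2b hashes, and all function names are assumptions made for this example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts: fall back to a single shingle holding all words.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(token, seed):
    """Hypothetical helper: a 64-bit hash of a token, varied by an integer seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=128):
    """For each of num_perm seeded hash functions, keep the minimum hash over the set."""
    if not tokens:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, which estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    # Invented example sentences, not taken from the SPICED dataset.
    a = "The central bank raised interest rates by half a point on Tuesday."
    b = "On Tuesday the central bank raised interest rates by half a point."
    score = estimated_jaccard(minhash_signature(shingles(a)),
                              minhash_signature(shingles(b)))
    print(f"estimated Jaccard similarity: {score:.2f}")
```

A score near 1 suggests near-duplicate wording, while a score near 0 suggests little lexical overlap. Because MinHash only measures surface overlap, two articles that report the same event in different words can still score low, which is one reason the embedding-based models named in the summary are useful points of comparison.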