SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels

The proliferation of news media outlets has increased the demand for intelligent systems capable of detecting redundant information in news articles in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: Simple heuristics such a...

Full description

Bibliographic details
Published in: arXiv.org 2024-08
Main authors: Shushkevich, Elena; Long, Mai; Loureiro, Manuel V; Derby, Steven; Tri Kurniawan Wijaya
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Shushkevich, Elena
Long, Mai
Loureiro, Manuel V
Derby, Steven
Tri Kurniawan Wijaya
description The proliferation of news media outlets has increased the demand for intelligent systems capable of detecting redundant information in news articles in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: Simple heuristics such as whether a pair of news are both about politics can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics under more narrow domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a novel dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Furthermore, we present four different levels of complexity, specifically designed for the news similarity detection task. We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3096404802</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3096404802</sourcerecordid><originalsourceid>FETCH-proquest_journals_30964048023</originalsourceid><addsrcrecordid>eNqNi9EKgjAYRkcQJOU7_NC1sDY161aNgopA72XYH02mMzez3j6DHqCrD84534Q4jPOVF_mMzYhrTEUpZeGaBQF3SJ5dDnGabOGMg4FM1lKJTto3JGixtFI3kAgrDFoYpL3DqVdWtgoh160sDYjmCrGuR_L6vo74RGUWZHoTyqD72zlZ7tI83nttpx89GltUuu-aURWcbkKf-hFl_L_qAwAAP8I</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3096404802</pqid></control><display><type>article</type><title>SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels</title><source>Free E- Journals</source><creator>Shushkevich, Elena ; Long, Mai ; Loureiro, Manuel V ; Derby, Steven ; Tri Kurniawan Wijaya</creator><creatorcontrib>Shushkevich, Elena ; Long, Mai ; Loureiro, Manuel V ; Derby, Steven ; Tri Kurniawan Wijaya</creatorcontrib><description><![CDATA[The proliferation of news media outlets has increased the demand for intelligent systems capable of detecting redundant information in news articles in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: Simple heuristics such as whether a pair of news are both about politics can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics under more narrow domains. However, this requires the existence of topic-specific datasets, which are currently lacking. 
In this article, we propose a novel dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Furthermore, we present four different levels of complexity, specifically designed for the news similarity detection task. We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.]]></description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Complexity ; Datasets ; News media ; Politics ; Similarity ; User experience</subject><ispartof>arXiv.org, 2024-08</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by-nc-sa/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Shushkevich, Elena</creatorcontrib><creatorcontrib>Long, Mai</creatorcontrib><creatorcontrib>Loureiro, Manuel V</creatorcontrib><creatorcontrib>Derby, Steven</creatorcontrib><creatorcontrib>Tri Kurniawan Wijaya</creatorcontrib><title>SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels</title><title>arXiv.org</title><description><![CDATA[The proliferation of news media outlets has increased the demand for intelligent systems capable of detecting redundant information in news articles in order to enhance user experience. 
However, the heterogeneous nature of news can lead to spurious findings in these systems: Simple heuristics such as whether a pair of news are both about politics can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics under more narrow domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a novel dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Furthermore, we present four different levels of complexity, specifically designed for the news similarity detection task. We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.]]></description><subject>Complexity</subject><subject>Datasets</subject><subject>News media</subject><subject>Politics</subject><subject>Similarity</subject><subject>User experience</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNi9EKgjAYRkcQJOU7_NC1sDY161aNgopA72XYH02mMzez3j6DHqCrD84534Q4jPOVF_mMzYhrTEUpZeGaBQF3SJ5dDnGabOGMg4FM1lKJTto3JGixtFI3kAgrDFoYpL3DqVdWtgoh160sDYjmCrGuR_L6vo74RGUWZHoTyqD72zlZ7tI83nttpx89GltUuu-aURWcbkKf-hFl_L_qAwAAP8I</recordid><startdate>20240823</startdate><enddate>20240823</enddate><creator>Shushkevich, Elena</creator><creator>Long, Mai</creator><creator>Loureiro, Manuel V</creator><creator>Derby, Steven</creator><creator>Tri Kurniawan Wijaya</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240823</creationdate><title>SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels</title><author>Shushkevich, Elena ; Long, Mai ; Loureiro, Manuel V ; Derby, Steven ; Tri Kurniawan Wijaya</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_30964048023</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Complexity</topic><topic>Datasets</topic><topic>News media</topic><topic>Politics</topic><topic>Similarity</topic><topic>User experience</topic><toplevel>online_resources</toplevel><creatorcontrib>Shushkevich, Elena</creatorcontrib><creatorcontrib>Long, Mai</creatorcontrib><creatorcontrib>Loureiro, Manuel V</creatorcontrib><creatorcontrib>Derby, Steven</creatorcontrib><creatorcontrib>Tri Kurniawan Wijaya</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Access 
via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Shushkevich, Elena</au><au>Long, Mai</au><au>Loureiro, Manuel V</au><au>Derby, Steven</au><au>Tri Kurniawan Wijaya</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels</atitle><jtitle>arXiv.org</jtitle><date>2024-08-23</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract><![CDATA[The proliferation of news media outlets has increased the demand for intelligent systems capable of detecting redundant information in news articles in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: Simple heuristics such as whether a pair of news are both about politics can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics under more narrow domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a novel dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Furthermore, we present four different levels of complexity, specifically designed for the news similarity detection task. 
We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.]]></abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-08
issn 2331-8422
language eng
recordid cdi_proquest_journals_3096404802
source Free E- Journals
subjects Complexity
Datasets
News media
Politics
Similarity
User experience
title SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-22T11%3A29%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=SPICED:%20News%20Similarity%20Detection%20Dataset%20with%20Multiple%20Topics%20and%20Complexity%20Levels&rft.jtitle=arXiv.org&rft.au=Shushkevich,%20Elena&rft.date=2024-08-23&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3096404802%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3096404802&rft_id=info:pmid/&rfr_iscdi=true
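
Illustrative note: the abstract in this record names MinHash as one of the benchmarked baselines. As a rough illustration of what a MinHash similarity check between two news texts can look like, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the 3-word shingle size, the 128 hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or an empty set).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """Hash one shingle under one 'hash function', selected by an integer seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=128):
    """Build a MinHash signature: the minimum hash value per seeded hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical example texts, not taken from the SPICED dataset.
    a = "The central bank raised interest rates by a quarter point on Tuesday."
    b = "On Tuesday the central bank raised interest rates by a quarter point."
    sim = estimated_jaccard(minhash_signature(shingles(a)),
                            minhash_signature(shingles(b)))
    print(f"estimated Jaccard similarity: {sim:.2f}")
```

The estimate converges to the true Jaccard similarity of the two shingle sets as the number of hash functions grows. A lexical baseline of this kind only measures surface overlap, which is one reason to compare it against embedding models such as the BERT, SBERT, and SimCSE models named in the abstract.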