Fake News Data Collection and Classification: Iterative Query Selection for Opaque Search Engines with Pseudo Relevance Feedback

Retrieving information from an online search engine is the first and most important step in many data mining tasks. Most of the search engines currently available on the web, including all social media platforms, are black boxes (a.k.a. opaque) supporting short keyword queries. In these settings, re...

Bibliographic details
Published in:	arXiv.org 2021-02
Main authors:	Elyashar, Aviad; Maor Reuben; Puzis, Rami
Format:	Article
Language:	eng
Subjects:	Algorithms; Data collection; Data mining; Datasets; Digital media; Information retrieval; News; Queries; Search engines; Social networks
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Elyashar, Aviad; Maor Reuben; Puzis, Rami
description	Retrieving information from an online search engine is the first and most important step in many data mining tasks. Most of the search engines currently available on the web, including all social media platforms, are black boxes (a.k.a. opaque) supporting short keyword queries. In these settings, automatically retrieving all posts and comments discussing a particular news item at large scale is a challenging task. In this paper, we propose a method for generating short keyword queries given a prototype document. The proposed iterative query selection algorithm (IQS) interacts with the opaque search engine to iteratively improve the query. It is evaluated on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets, showing superior performance compared to state-of-the-art methods. IQS is applied to automatically collect a large-scale fake news dataset of about 70K true and fake news items. The dataset, publicly available for research, includes more than 22M accounts and 61M tweets in a Twitter-approved format. We demonstrate the usefulness of the dataset for the fake news detection task, achieving state-of-the-art performance.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2492478615</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2492478615</sourcerecordid><originalsourceid>FETCH-proquest_journals_24924786153</originalsourceid><addsrcrecordid>eNqNjEtrwkAUhQehYGjzHy64FuLk4WObGtpNq2334XZyo6NhJp07UbrrT-8Iund1Dt_5OCMRyTSdTReZlGMRMx-SJJHFXOZ5Gom_Co8Eb3RmeEaPUNquI-W1NYCmgbJDZt1qhRe0gldPLtQTwXYg9wufdLNb6-C9x5-BAkSn9rA2O22I4az9HjZMQ2PhI_gnNIqgImq-UR2fxEOLHVN8zUcxqdZf5cu0dzacsa8PdnAmTLXMljKbL4pZnt5n_QOlalAY</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2492478615</pqid></control><display><type>article</type><title>Fake News Data Collection and Classification: Iterative Query Selection for Opaque Search Engines with Pseudo Relevance Feedback</title><source>Free E- Journals</source><creator>Elyashar, Aviad ; Maor Reuben ; Puzis, Rami</creator><creatorcontrib>Elyashar, Aviad ; Maor Reuben ; Puzis, Rami</creatorcontrib><description>Retrieving information from an online search engine, is the first and most important step in many data mining tasks. Most of the search engines currently available on the web, including all social media platforms, are black-boxes (a.k.a opaque) supporting short keyword queries. In these settings, retrieving all posts and comments discussing a particular news item automatically and at large scales is a challenging task. In this paper, we propose a method for generating short keyword queries given a prototype document. The proposed iterative query selection algorithm (IQS) interacts with the opaque search engine to iteratively improve the query. It is evaluated on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets showing superior performance compared to state-of-the-art. IQS is applied to automatically collect a large-scale fake news dataset of about 70K true and fake news items. 
The dataset, publicly available for research, includes more than 22M accounts and 61M tweets in Twitter approved format. We demonstrate the usefulness of the dataset for fake news detection task achieving state-of-the-art performance.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Data collection ; Data mining ; Datasets ; Digital media ; Information retrieval ; News ; Queries ; Search engines ; Social networks</subject><ispartof>arXiv.org, 2021-02</ispartof><rights>2021. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Elyashar, Aviad</creatorcontrib><creatorcontrib>Maor Reuben</creatorcontrib><creatorcontrib>Puzis, Rami</creatorcontrib><title>Fake News Data Collection and Classification: Iterative Query Selection for Opaque Search Engines with Pseudo Relevance Feedback</title><title>arXiv.org</title><description>Retrieving information from an online search engine, is the first and most important step in many data mining tasks. Most of the search engines currently available on the web, including all social media platforms, are black-boxes (a.k.a opaque) supporting short keyword queries. In these settings, retrieving all posts and comments discussing a particular news item automatically and at large scales is a challenging task. In this paper, we propose a method for generating short keyword queries given a prototype document. 
The proposed iterative query selection algorithm (IQS) interacts with the opaque search engine to iteratively improve the query. It is evaluated on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets showing superior performance compared to state-of-the-art. IQS is applied to automatically collect a large-scale fake news dataset of about 70K true and fake news items. The dataset, publicly available for research, includes more than 22M accounts and 61M tweets in Twitter approved format. We demonstrate the usefulness of the dataset for fake news detection task achieving state-of-the-art performance.</description><subject>Algorithms</subject><subject>Data collection</subject><subject>Data mining</subject><subject>Datasets</subject><subject>Digital media</subject><subject>Information retrieval</subject><subject>News</subject><subject>Queries</subject><subject>Search engines</subject><subject>Social networks</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNqNjEtrwkAUhQehYGjzHy64FuLk4WObGtpNq2334XZyo6NhJp07UbrrT-8Iund1Dt_5OCMRyTSdTReZlGMRMx-SJJHFXOZ5Gom_Co8Eb3RmeEaPUNquI-W1NYCmgbJDZt1qhRe0gldPLtQTwXYg9wufdLNb6-C9x5-BAkSn9rA2O22I4az9HjZMQ2PhI_gnNIqgImq-UR2fxEOLHVN8zUcxqdZf5cu0dzacsa8PdnAmTLXMljKbL4pZnt5n_QOlalAY</recordid><startdate>20210221</startdate><enddate>20210221</enddate><creator>Elyashar, Aviad</creator><creator>Maor Reuben</creator><creator>Puzis, Rami</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20210221</creationdate><title>Fake News Data Collection and Classification: Iterative Query Selection for Opaque Search Engines with Pseudo Relevance Feedback</title><author>Elyashar, Aviad ; Maor Reuben ; Puzis, Rami</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_24924786153</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Algorithms</topic><topic>Data collection</topic><topic>Data mining</topic><topic>Datasets</topic><topic>Digital media</topic><topic>Information retrieval</topic><topic>News</topic><topic>Queries</topic><topic>Search engines</topic><topic>Social networks</topic><toplevel>online_resources</toplevel><creatorcontrib>Elyashar, Aviad</creatorcontrib><creatorcontrib>Maor Reuben</creatorcontrib><creatorcontrib>Puzis, Rami</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering 
Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Elyashar, Aviad</au><au>Maor Reuben</au><au>Puzis, Rami</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Fake News Data Collection and Classification: Iterative Query Selection for Opaque Search Engines with Pseudo Relevance Feedback</atitle><jtitle>arXiv.org</jtitle><date>2021-02-21</date><risdate>2021</risdate><eissn>2331-8422</eissn><abstract>Retrieving information from an online search engine, is the first and most important step in many data mining tasks. Most of the search engines currently available on the web, including all social media platforms, are black-boxes (a.k.a opaque) supporting short keyword queries. In these settings, retrieving all posts and comments discussing a particular news item automatically and at large scales is a challenging task. In this paper, we propose a method for generating short keyword queries given a prototype document. The proposed iterative query selection algorithm (IQS) interacts with the opaque search engine to iteratively improve the query. It is evaluated on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets showing superior performance compared to state-of-the-art. IQS is applied to automatically collect a large-scale fake news dataset of about 70K true and fake news items. The dataset, publicly available for research, includes more than 22M accounts and 61M tweets in Twitter approved format. 
We demonstrate the usefulness of the dataset for fake news detection task achieving state-of-the-art performance.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2021-02
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2492478615
source	Free E- Journals
subjects	Algorithms; Data collection; Data mining; Datasets; Digital media; Information retrieval; News; Queries; Search engines; Social networks
title	Fake News Data Collection and Classification: Iterative Query Selection for Opaque Search Engines with Pseudo Relevance Feedback
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T03%3A53%3A02IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Fake%20News%20Data%20Collection%20and%20Classification:%20Iterative%20Query%20Selection%20for%20Opaque%20Search%20Engines%20with%20Pseudo%20Relevance%20Feedback&rft.jtitle=arXiv.org&rft.au=Elyashar,%20Aviad&rft.date=2021-02-21&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2492478615%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2492478615&rft_id=info:pmid/&rfr_iscdi=true