A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites

Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stag...



Bibliographic details
Published in:	arXiv.org 2018-04
Main authors:	Xu, Keyang; Gao, Kyle Yingkai; Callan, Jamie
Format:	Article
Language:	eng
Subjects:	Digital media; Harvesting; Navigation; Social networks
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Xu, Keyang ; Gao, Kyle Yingkai ; Callan, Jamie
description	Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created contents than baseline methods.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2071978103</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2071978103</sourcerecordid><originalsourceid>FETCH-proquest_journals_20719781033</originalsourceid><addsrcrecordid>eNqNitEKgjAUQEcQJOU_DHoW5pbNHkOKCKIH61mGXmUim91tRX-fQR_Q0-FwzoxEXIg0yTecL0jsXM8Y41vJs0xE5LynpcdQ-4CQXFGD8dDQu3FhBHxqN0mB6jVo031H5aF709YiLW2t1UAv0GhFS-3Brci8VYOD-MclWR8Pt-KUjGgfAZyvehvQTKniTKY7madMiP-uD8-1PLU</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2071978103</pqid></control><display><type>article</type><title>A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites</title><source>Free E- Journals</source><creator>Xu, Keyang ; Gao, Kyle Yingkai ; Callan, Jamie</creator><creatorcontrib>Xu, Keyang ; Gao, Kyle Yingkai ; Callan, Jamie</creatorcontrib><description>Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. 
Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created contents than baseline methods.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Digital media ; Harvesting ; Navigation ; Social networks</subject><ispartof>arXiv.org, 2018-04</ispartof><rights>2018. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Xu, Keyang</creatorcontrib><creatorcontrib>Gao, Kyle Yingkai</creatorcontrib><creatorcontrib>Callan, Jamie</creatorcontrib><title>A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites</title><title>arXiv.org</title><description>Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. 
Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created contents than baseline methods.</description><subject>Digital media</subject><subject>Harvesting</subject><subject>Navigation</subject><subject>Social networks</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNitEKgjAUQEcQJOU_DHoW5pbNHkOKCKIH61mGXmUim91tRX-fQR_Q0-FwzoxEXIg0yTecL0jsXM8Y41vJs0xE5LynpcdQ-4CQXFGD8dDQu3FhBHxqN0mB6jVo031H5aF709YiLW2t1UAv0GhFS-3Brci8VYOD-MclWR8Pt-KUjGgfAZyvehvQTKniTKY7madMiP-uD8-1PLU</recordid><startdate>20180408</startdate><enddate>20180408</enddate><creator>Xu, Keyang</creator><creator>Gao, Kyle Yingkai</creator><creator>Callan, Jamie</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20180408</creationdate><title>A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites</title><author>Xu, Keyang ; Gao, Kyle Yingkai ; Callan, Jamie</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_20719781033</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Digital media</topic><topic>Harvesting</topic><topic>Navigation</topic><topic>Social 
networks</topic><toplevel>online_resources</toplevel><creatorcontrib>Xu, Keyang</creatorcontrib><creatorcontrib>Gao, Kyle Yingkai</creatorcontrib><creatorcontrib>Callan, Jamie</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Xu, Keyang</au><au>Gao, Kyle Yingkai</au><au>Callan, Jamie</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites</atitle><jtitle>arXiv.org</jtitle><date>2018-04-08</date><risdate>2018</risdate><eissn>2331-8422</eissn><abstract>Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. 
During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created contents than baseline methods.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2018-04
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2071978103
source	Free E- Journals
subjects	Digital media ; Harvesting ; Navigation ; Social networks
title	A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T01%3A09%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20Structure-Oriented%20Unsupervised%20Crawling%20Strategy%20for%20Social%20Media%20Sites&rft.jtitle=arXiv.org&rft.au=Xu,%20Keyang&rft.date=2018-04-08&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2071978103%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2071978103&rft_id=info:pmid/&rfr_iscdi=true