A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites

Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stag...

Bibliographic details
Published in: arXiv.org 2018-04
Main authors: Xu, Keyang; Gao, Kyle Yingkai; Callan, Jamie
Format: Article
Language: eng
Subjects: Digital media; Harvesting; Navigation; Social networks
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Xu, Keyang
Gao, Kyle Yingkai
Callan, Jamie
description Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created contents than baseline methods.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2071978103</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2071978103</sourcerecordid><originalsourceid>FETCH-proquest_journals_20719781033</originalsourceid><addsrcrecordid>eNqNitEKgjAUQEcQJOU_DHoW5pbNHkOKCKIH61mGXmUim91tRX-fQR_Q0-FwzoxEXIg0yTecL0jsXM8Y41vJs0xE5LynpcdQ-4CQXFGD8dDQu3FhBHxqN0mB6jVo031H5aF709YiLW2t1UAv0GhFS-3Brci8VYOD-MclWR8Pt-KUjGgfAZyvehvQTKniTKY7madMiP-uD8-1PLU</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2071978103</pqid></control><display><type>article</type><title>A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites</title><source>Free E- Journals</source><creator>Xu, Keyang ; Gao, Kyle Yingkai ; Callan, Jamie</creator><creatorcontrib>Xu, Keyang ; Gao, Kyle Yingkai ; Callan, Jamie</creatorcontrib><description>Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. 
Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created contents than baseline methods.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Digital media ; Harvesting ; Navigation ; Social networks</subject><ispartof>arXiv.org, 2018-04</ispartof><rights>2018. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Xu, Keyang</creatorcontrib><creatorcontrib>Gao, Kyle Yingkai</creatorcontrib><creatorcontrib>Callan, Jamie</creatorcontrib><title>A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites</title><title>arXiv.org</title><description>Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. 
Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created contents than baseline methods.</description><subject>Digital media</subject><subject>Harvesting</subject><subject>Navigation</subject><subject>Social networks</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNitEKgjAUQEcQJOU_DHoW5pbNHkOKCKIH61mGXmUim91tRX-fQR_Q0-FwzoxEXIg0yTecL0jsXM8Y41vJs0xE5LynpcdQ-4CQXFGD8dDQu3FhBHxqN0mB6jVo031H5aF709YiLW2t1UAv0GhFS-3Brci8VYOD-MclWR8Pt-KUjGgfAZyvehvQTKniTKY7madMiP-uD8-1PLU</recordid><startdate>20180408</startdate><enddate>20180408</enddate><creator>Xu, Keyang</creator><creator>Gao, Kyle Yingkai</creator><creator>Callan, Jamie</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20180408</creationdate><title>A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites</title><author>Xu, Keyang ; Gao, Kyle Yingkai ; Callan, Jamie</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_20719781033</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Digital media</topic><topic>Harvesting</topic><topic>Navigation</topic><topic>Social 
networks</topic><toplevel>online_resources</toplevel><creatorcontrib>Xu, Keyang</creatorcontrib><creatorcontrib>Gao, Kyle Yingkai</creatorcontrib><creatorcontrib>Callan, Jamie</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Xu, Keyang</au><au>Gao, Kyle Yingkai</au><au>Callan, Jamie</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites</atitle><jtitle>arXiv.org</jtitle><date>2018-04-08</date><risdate>2018</risdate><eissn>2331-8422</eissn><abstract>Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. 
During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created contents than baseline methods.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2018-04
issn 2331-8422
language eng
recordid cdi_proquest_journals_2071978103
source Free E- Journals
subjects Digital media
Harvesting
Navigation
Social networks
title A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T01%3A09%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20Structure-Oriented%20Unsupervised%20Crawling%20Strategy%20for%20Social%20Media%20Sites&rft.jtitle=arXiv.org&rft.au=Xu,%20Keyang&rft.date=2018-04-08&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2071978103%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2071978103&rft_id=info:pmid/&rfr_iscdi=true
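
The two stages named in the description field (cluster pages by structural similarity, then build a navigation table of how page types link to each other) can be illustrated with a small sketch. This is not the authors' implementation: the record does not say how SOUrCe measures similarity or forms clusters, so the Jaccard similarity over tag paths, the greedy threshold clustering, the threshold value, and every name below (`TagPathCollector`, `cluster_pages`, `navigation_table`, `PAGES`) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collect root-to-element tag paths and link targets for one page."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no close

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.paths = set()  # structural signature of the page (assumed feature)
        self.links = []     # href values of <a> elements

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.7):
    """Greedily assign each page to the first cluster whose representative
    is similar enough. `pages` maps URL -> HTML. The threshold is arbitrary."""
    representatives = []  # one tag-path set per cluster
    assignment, links = {}, {}
    for url, html in pages.items():
        collector = TagPathCollector()
        collector.feed(html)
        links[url] = collector.links
        for cluster_id, rep in enumerate(representatives):
            if jaccard(collector.paths, rep) >= threshold:
                assignment[url] = cluster_id
                break
        else:
            representatives.append(collector.paths)
            assignment[url] = len(representatives) - 1
    return assignment, links


def navigation_table(assignment, links):
    """Count links from each source cluster to each destination cluster."""
    table = defaultdict(Counter)
    for src, hrefs in links.items():
        for href in hrefs:
            if href in assignment:  # ignore links to pages not yet seen
                table[assignment[src]][assignment[href]] += 1
    return table


# Toy site: one index page and two thread pages (made-up data).
PAGES = {
    "/index": "<html><body><ul><li><a href='/t/1'>a</a></li>"
              "<li><a href='/t/2'>b</a></li></ul></body></html>",
    "/t/1": "<html><body><div><p>post</p><p>reply</p></div>"
            "<a href='/index'>home</a></body></html>",
    "/t/2": "<html><body><div><p>post</p></div>"
            "<a href='/index'>home</a></body></html>",
}

assignment, links = cluster_pages(PAGES)
table = navigation_table(assignment, links)
```

In a harvesting phase, a crawling policy could consult such a table to prefer links whose destination cluster is likely to hold user-created content; how SOUrCe actually scores links is not stated in this record.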