ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph

Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide extensive content and structural metadata about the web g...

Bibliographic Details
Published in:	arXiv.org 2013-06
Main authors:	AlSum, Ahmed; Nelson, Michael L
Format:	Article
Language:	eng
Subjects:	Applications programs; Archiving; Digital archives; Metadata; Optimization; Optimization techniques; Search engines; User interfaces; Web archiving; Websites
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	AlSum, Ahmed; Nelson, Michael L
description	Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide extensive content and structural metadata about the web graph. While the live web has developed a rich ecosystem of APIs to facilitate web applications (e.g., APIs from Google and Twitter), the web archiving community has not yet broadly implemented this level of access. We present ArcLink, a proof-of-concept system that complements open source Wayback Machine installations by optimizing the construction, storage, and access to the temporal web graph. We divide the web graph construction into four stages (filtering, extraction, storage, and access) and explore optimization for each stage. ArcLink extends the current Web archive interfaces to return content and structural metadata for each URI. We show how this API can be applied to such applications as retrieving inlinks, outlinks, anchortext, and PageRank.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2085224043</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2085224043</sourcerecordid><originalsourceid>FETCH-proquest_journals_20852240433</originalsourceid><addsrcrecordid>eNqNjL0KwjAYAIMgWLTv8IFzIX5ptbip-DMoghQcS2wjTW2TmKQOPr0VfACnG-64AQmQsVmUxogjEjpXU0pxvsAkYQE5rWxxlOqxhLPxspVv7qVWkImiUvLZCQdew7qTTQlclXAR3krxEuAr0Uet0ZY3cBU32FtuqgkZ3nnjRPjjmEx322xziIzV35nPa91Z1ascaZogxjRm7L_qA04sPS0</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2085224043</pqid></control><display><type>article</type><title>ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph</title><source>Freely Accessible Journals</source><creator>AlSum, Ahmed ; Nelson, Michael L</creator><creatorcontrib>AlSum, Ahmed ; Nelson, Michael L</creatorcontrib><description>Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide extensive content and structural metadata about the web graph. While the live web has developed a rich ecosystem of APIs to facilitate web applications (e.g., APIs from Google and Twitter), the web archiving community has not yet broadly implemented this level of access. We present ArcLink, a proof-of-concept system that complements open source Wayback Machine installations by optimizing the construction, storage, and access to the temporal web graph. We divide the web graph construction into four stages (filtering, extraction, storage, and access) and explore optimization for each stage. ArcLink extends the current Web archive interfaces to return content and structural metadata for each URI. 
We show how this API can be applied to such applications as retrieving inlinks, outlinks, anchortext, and PageRank.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Applications programs ; Archiving ; Digital archives ; Metadata ; Optimization ; Optimization techniques ; Search engines ; User interfaces ; Web archiving ; Websites</subject><ispartof>arXiv.org, 2013-06</ispartof><rights>2013. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>AlSum, Ahmed</creatorcontrib><creatorcontrib>Nelson, Michael L</creatorcontrib><title>ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph</title><title>arXiv.org</title><description>Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide extensive content and structural metadata about the web graph. While the live web has developed a rich ecosystem of APIs to facilitate web applications (e.g., APIs from Google and Twitter), the web archiving community has not yet broadly implemented this level of access. We present ArcLink, a proof-of-concept system that complements open source Wayback Machine installations by optimizing the construction, storage, and access to the temporal web graph. 
We divide the web graph construction into four stages (filtering, extraction, storage, and access) and explore optimization for each stage. ArcLink extends the current Web archive interfaces to return content and structural metadata for each URI. We show how this API can be applied to such applications as retrieving inlinks, outlinks, anchortext, and PageRank.</description><subject>Applications programs</subject><subject>Archiving</subject><subject>Digital archives</subject><subject>Metadata</subject><subject>Optimization</subject><subject>Optimization techniques</subject><subject>Search engines</subject><subject>User interfaces</subject><subject>Web archiving</subject><subject>Websites</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2013</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNqNjL0KwjAYAIMgWLTv8IFzIX5ptbip-DMoghQcS2wjTW2TmKQOPr0VfACnG-64AQmQsVmUxogjEjpXU0pxvsAkYQE5rWxxlOqxhLPxspVv7qVWkImiUvLZCQdew7qTTQlclXAR3krxEuAr0Uet0ZY3cBU32FtuqgkZ3nnjRPjjmEx322xziIzV35nPa91Z1ascaZogxjRm7L_qA04sPS0</recordid><startdate>20130611</startdate><enddate>20130611</enddate><creator>AlSum, Ahmed</creator><creator>Nelson, Michael L</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20130611</creationdate><title>ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph</title><author>AlSum, Ahmed ; Nelson, Michael 
L</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_20852240433</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2013</creationdate><topic>Applications programs</topic><topic>Archiving</topic><topic>Digital archives</topic><topic>Metadata</topic><topic>Optimization</topic><topic>Optimization techniques</topic><topic>Search engines</topic><topic>User interfaces</topic><topic>Web archiving</topic><topic>Websites</topic><toplevel>online_resources</toplevel><creatorcontrib>AlSum, Ahmed</creatorcontrib><creatorcontrib>Nelson, Michael L</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>AlSum, Ahmed</au><au>Nelson, Michael L</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web 
Graph</atitle><jtitle>arXiv.org</jtitle><date>2013-06-11</date><risdate>2013</risdate><eissn>2331-8422</eissn><abstract>Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide extensive content and structural metadata about the web graph. While the live web has developed a rich ecosystem of APIs to facilitate web applications (e.g., APIs from Google and Twitter), the web archiving community has not yet broadly implemented this level of access. We present ArcLink, a proof-of-concept system that complements open source Wayback Machine installations by optimizing the construction, storage, and access to the temporal web graph. We divide the web graph construction into four stages (filtering, extraction, storage, and access) and explore optimization for each stage. ArcLink extends the current Web archive interfaces to return content and structural metadata for each URI. We show how this API can be applied to such applications as retrieving inlinks, outlinks, anchortext, and PageRank.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2013-06
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2085224043
source	Freely Accessible Journals
subjects	Applications programs; Archiving; Digital archives; Metadata; Optimization; Optimization techniques; Search engines; User interfaces; Web archiving; Websites
title	ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T07%3A26%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=ArcLink:%20Optimization%20Techniques%20to%20Build%20and%20Retrieve%20the%20Temporal%20Web%20Graph&rft.jtitle=arXiv.org&rft.au=AlSum,%20Ahmed&rft.date=2013-06-11&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2085224043%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2085224043&rft_id=info:pmid/&rfr_iscdi=true