The Link Database: fast access to graphs of the Web

The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URL) connected by directed edges (hyperlinks). The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms we have found it i...

Bibliographic details
Main authors: Randall, K.H., Stata, R., Wickremesinghe, R.G., Wiener, J.L.
Format: Conference proceeding
Language: eng
container_end_page 131
container_issue
container_start_page 122
container_title
container_volume
creator Randall, K.H.
Stata, R.
Wickremesinghe, R.G.
Wiener, J.L.
description The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URL) connected by directed edges (hyperlinks). The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms we have found it important to fit the Link Database into RAM. In the first version of the Link Database, we achieved this fit by using machines with lots of memory (8 GB), and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the Web graph. The first compression technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second technique takes advantage of the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6 billion-edge graph in 22 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database, and provides performance numbers for compression ratios and decompression speed.
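The description says that most hyperlinks point to pages on the same host, which is what makes a fixed 32 bits per link wasteful. The sketch below illustrates that idea only; it is not the encoding used in the paper, which this record does not spell out. It assumes (our assumption, not stated in the record) that page IDs are assigned so that pages on the same host receive nearby integers, for example by numbering URLs in sorted order. The function names `compress_links` and `decompress_links` are hypothetical, and the byte-aligned variable-length integers are a simplification that will not reach the reported figure of under 6 bits per link.

```python
from typing import Iterable, List, Tuple


def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one; small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag."""
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)


def encode_varint(value: int, out: bytearray) -> None:
    """Append a non-negative integer using 7 payload bits per byte."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Read one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def compress_links(source: int, targets: Iterable[int]) -> bytes:
    """Gap-encode the outgoing links of one page (hypothetical helper).

    The first target is stored relative to the source page ID, the rest as
    gaps between consecutive sorted targets. Duplicate targets are dropped.
    """
    ordered = sorted(set(targets))
    out = bytearray()
    encode_varint(len(ordered), out)
    prev = source
    for i, target in enumerate(ordered):
        if i == 0:
            encode_varint(zigzag(target - source), out)  # may be negative
        else:
            encode_varint(target - prev - 1, out)  # strictly increasing list
        prev = target
    return bytes(out)


def decompress_links(source: int, data: bytes) -> List[int]:
    """Invert compress_links for the same source page ID."""
    count, pos = decode_varint(data, 0)
    targets: List[int] = []
    prev = source
    for i in range(count):
        value, pos = decode_varint(data, pos)
        prev = source + unzigzag(value) if i == 0 else prev + value + 1
        targets.append(prev)
    return targets


if __name__ == "__main__":
    page = 1000
    links = [1001, 1003, 1004, 998, 5_000_000]  # four nearby, one remote
    blob = compress_links(page, links)
    assert decompress_links(page, blob) == sorted(links)
    print(len(blob), "bytes instead of", 4 * len(links))
```

Under the ID-locality assumption, intra-host links become small gaps that fit in one byte each, while the occasional remote link costs a few bytes. A bit-level code and the second technique from the description (exploiting pages on the same host that share link sets) would be needed to approach the reported compression.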
doi_str_mv 10.1109/DCC.2002.999950
format Conference Proceeding
fullrecord <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_999950</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>999950</ieee_id><sourcerecordid>999950</sourcerecordid><originalsourceid>FETCH-LOGICAL-i174t-75801a1fe90cc89dd46ca07e8ca3c76befbff26bde372ecce7cce18d4a8af0513</originalsourceid><addsrcrecordid>eNotT8tqwzAQFH1AnbTnQk_6Abu7lqxHb8VJH2DoJaXHsJZXjftKsHzp31eQDgxzmNkdRohrhAoR_O2qbasaoK58RgMnoqiVbUpQjT8VC7DGN6it1WeiQDAuG6gvxCKlj3wEYLAQarNj2Y0_n3JFM_WU-E5GSrOkEDglOe_l-0SHXZL7KOecfeP-UpxH-kp89a9L8fqw3rRPZffy-Nzed-WIVs-lbRwgYWQPITg_DNoEAssukArW9Bz7GGvTD6xszbnOZqIbNDmK0KBaipvj35GZt4dp_Kbpd3vcqv4AgadFaA</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>The Link Database: fast access to graphs of the Web</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Randall, K.H. ; Stata, R. ; Wickremesinghe, R.G. ; Wiener, J.L.</creator><creatorcontrib>Randall, K.H. ; Stata, R. ; Wickremesinghe, R.G. ; Wiener, J.L.</creatorcontrib><description>The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URL) connected by directed edges (hyperlinks). The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms we have found it important to fit the Link Database into RAM. In the first version of the Link Database, we achieved this fit by using machines with lots of memory (8 GB), and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the Web graph. 
The first compression technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second technique takes advantage of the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6 billion-edge graph in 22 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database, and provides performance numbers for compression ratios and decompression speed.</description><identifier>ISSN: 1068-0314</identifier><identifier>ISBN: 0769514774</identifier><identifier>ISBN: 9780769514772</identifier><identifier>EISSN: 2375-0359</identifier><identifier>DOI: 10.1109/DCC.2002.999950</identifier><language>eng</language><publisher>IEEE</publisher><subject>Computer science ; Delay ; Dictionaries ; Mirrors ; Random access memory ; Read-write memory ; Uniform resource locators ; Web pages ; Writing</subject><ispartof>Proceedings DCC 2002. 
Data Compression Conference, 2002, p.122-131</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/999950$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2051,4035,4036,27904,54898</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/999950$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Randall, K.H.</creatorcontrib><creatorcontrib>Stata, R.</creatorcontrib><creatorcontrib>Wickremesinghe, R.G.</creatorcontrib><creatorcontrib>Wiener, J.L.</creatorcontrib><title>The Link Database: fast access to graphs of the Web</title><title>Proceedings DCC 2002. Data Compression Conference</title><addtitle>DCC</addtitle><description>The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URL) connected by directed edges (hyperlinks). The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms we have found it important to fit the Link Database into RAM. In the first version of the Link Database, we achieved this fit by using machines with lots of memory (8 GB), and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the Web graph. The first compression technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. 
The second technique takes advantage of the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6 billion-edge graph in 22 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database, and provides performance numbers for compression ratios and decompression speed.</description><subject>Computer science</subject><subject>Delay</subject><subject>Dictionaries</subject><subject>Mirrors</subject><subject>Random access memory</subject><subject>Read-write memory</subject><subject>Uniform resource locators</subject><subject>Web pages</subject><subject>Writing</subject><issn>1068-0314</issn><issn>2375-0359</issn><isbn>0769514774</isbn><isbn>9780769514772</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2002</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNotT8tqwzAQFH1AnbTnQk_6Abu7lqxHb8VJH2DoJaXHsJZXjftKsHzp31eQDgxzmNkdRohrhAoR_O2qbasaoK58RgMnoqiVbUpQjT8VC7DGN6it1WeiQDAuG6gvxCKlj3wEYLAQarNj2Y0_n3JFM_WU-E5GSrOkEDglOe_l-0SHXZL7KOecfeP-UpxH-kp89a9L8fqw3rRPZffy-Nzed-WIVs-lbRwgYWQPITg_DNoEAssukArW9Bz7GGvTD6xszbnOZqIbNDmK0KBaipvj35GZt4dp_Kbpd3vcqv4AgadFaA</recordid><startdate>2002</startdate><enddate>2002</enddate><creator>Randall, K.H.</creator><creator>Stata, R.</creator><creator>Wickremesinghe, R.G.</creator><creator>Wiener, J.L.</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>2002</creationdate><title>The Link Database: fast access to graphs of the Web</title><author>Randall, K.H. ; Stata, R. ; Wickremesinghe, R.G. 
; Wiener, J.L.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i174t-75801a1fe90cc89dd46ca07e8ca3c76befbff26bde372ecce7cce18d4a8af0513</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2002</creationdate><topic>Computer science</topic><topic>Delay</topic><topic>Dictionaries</topic><topic>Mirrors</topic><topic>Random access memory</topic><topic>Read-write memory</topic><topic>Uniform resource locators</topic><topic>Web pages</topic><topic>Writing</topic><toplevel>online_resources</toplevel><creatorcontrib>Randall, K.H.</creatorcontrib><creatorcontrib>Stata, R.</creatorcontrib><creatorcontrib>Wickremesinghe, R.G.</creatorcontrib><creatorcontrib>Wiener, J.L.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Randall, K.H.</au><au>Stata, R.</au><au>Wickremesinghe, R.G.</au><au>Wiener, J.L.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>The Link Database: fast access to graphs of the Web</atitle><btitle>Proceedings DCC 2002. Data Compression Conference</btitle><stitle>DCC</stitle><date>2002</date><risdate>2002</risdate><spage>122</spage><epage>131</epage><pages>122-131</pages><issn>1068-0314</issn><eissn>2375-0359</eissn><isbn>0769514774</isbn><isbn>9780769514772</isbn><abstract>The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URL) connected by directed edges (hyperlinks). 
The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms we have found it important to fit the Link Database into RAM. In the first version of the Link Database, we achieved this fit by using machines with lots of memory (8 GB), and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the Web graph. The first compression technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second technique takes advantage of the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6 billion-edge graph in 22 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database, and provides performance numbers for compression ratios and decompression speed.</abstract><pub>IEEE</pub><doi>10.1109/DCC.2002.999950</doi><tpages>10</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1068-0314
ispartof Proceedings DCC 2002. Data Compression Conference, 2002, p.122-131
issn 1068-0314
2375-0359
language eng
recordid cdi_ieee_primary_999950
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Computer science
Delay
Dictionaries
Mirrors
Random access memory
Read-write memory
Uniform resource locators
Web pages
Writing
title The Link Database: fast access to graphs of the Web
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T08%3A18%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=The%20Link%20Database:%20fast%20access%20to%20graphs%20of%20the%20Web&rft.btitle=Proceedings%20DCC%202002.%20Data%20Compression%20Conference&rft.au=Randall,%20K.H.&rft.date=2002&rft.spage=122&rft.epage=131&rft.pages=122-131&rft.issn=1068-0314&rft.eissn=2375-0359&rft.isbn=0769514774&rft.isbn_list=9780769514772&rft_id=info:doi/10.1109/DCC.2002.999950&rft_dat=%3Cieee_6IE%3E999950%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=999950&rfr_iscdi=true