Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests

Upon replay, JavaScript on archived web pages can generate recurring HTTP requests that lead to unnecessary traffic to the web archive. In one example, an archived page averaged more than 1000 requests per minute. These requests are not visible to the user, so if a user leaves such an archived page...

Bibliographic Details
Main Authors: Garg, Kritika; Jayanetti, Himarsha R; Alam, Sawood; Weigle, Michele C; Nelson, Michael L
Format: Article
Language: eng
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Garg, Kritika
Jayanetti, Himarsha R
Alam, Sawood
Weigle, Michele C
Nelson, Michael L
description Upon replay, JavaScript on archived web pages can generate recurring HTTP requests that lead to unnecessary traffic to the web archive. In one example, an archived page averaged more than 1000 requests per minute. These requests are not visible to the user, so if a user leaves such an archived page open in a browser tab, they would be unaware that their browser is continuing to generate traffic to the web archive. We found that web pages that require regular updates (e.g., radio playlists, updates for sports scores, image carousels) are more likely to make such recurring requests. If the resources requested by the web page are not archived, some web archives may attempt to patch the archive by requesting the resources from the live web. If the requested resources are unavailable on the live web, the resources cannot be archived, and the responses remain HTTP 404. Some archived pages continue to poll the server as frequently as they did on the live web, while some pages poll the server even more frequently if their requests return HTTP 404 responses, creating a high amount of unnecessary traffic. On a large scale, such web pages are effectively a denial of service attack on the web archive. Significant computational, network and storage resources are required for web archives to archive and then successfully replay pages as they were on the live web, and these resources should not be spent on unnecessary HTTP traffic. Our proposed solution is to optimize archival replay using Cache-Control HTTP response headers. We implemented this approach in a test environment and cached HTTP 404 responses that prevented the browser's requests from reaching the web archive server.
doi_str_mv 10.48550/arxiv.2212.00760
format Article
creationdate 2022-12-01
rights http://creativecommons.org/licenses/by-sa/4.0
oa free_for_read
sourcetype Open Access Repository
linktorsrc https://arxiv.org/abs/2212.00760
backlink https://doi.org/10.48550/arXiv.2212.00760
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2212.00760
ispartof
issn
language eng
recordid cdi_arxiv_primary_2212_00760
source arXiv.org
subjects Computer Science - Digital Libraries
Computer Science - Networking and Internet Architecture
title Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T05%3A03%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Caching%20HTTP%20404%20Responses%20Eliminates%20Unnecessary%20Archival%20Replay%20Requests&rft.au=Garg,%20Kritika&rft.date=2022-12-01&rft_id=info:doi/10.48550/arxiv.2212.00760&rft_dat=%3Carxiv_GOX%3E2212_00760%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
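
The description above proposes attaching Cache-Control headers to HTTP 404 replay responses so that repeated requests from an open page are answered from the browser cache instead of reaching the archive. The following is a minimal sketch of that idea, not the authors' implementation: the names `cache_headers`, `ReplayHandler`, `serve`, and the one-hour lifetime `NOT_FOUND_MAX_AGE` are assumptions made for illustration, and the toy server treats every requested resource as unarchived.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical cache lifetime in seconds; the record does not state a value.
NOT_FOUND_MAX_AGE = 3600


def cache_headers(status: int, max_age: int = NOT_FOUND_MAX_AGE) -> dict:
    """Return the extra response headers for a replay response.

    Only 404 responses receive a Cache-Control header in this sketch.
    """
    if status == 404:
        return {"Cache-Control": f"public, max-age={max_age}"}
    return {}


class ReplayHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a replay endpoint in which nothing is archived."""

    def do_GET(self):
        body = b"Resource not archived\n"
        self.send_response(404)
        # Let the browser reuse this 404 instead of asking the archive again.
        for name, value in cache_headers(404).items():
            self.send_header(name, value)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the toy server on localhost until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), ReplayHandler).serve_forever()
```

Calling `serve()` and requesting any path should return a 404 carrying `Cache-Control: public, max-age=3600`. A browser that honors the header can then answer the page's repeated polls for that URL from its cache until the lifetime expires. The lifetime is a trade-off: a longer value suppresses more traffic but delays discovery of a resource that is archived later.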