Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Upon replay, JavaScript on archived web pages can generate recurring HTTP
requests that lead to unnecessary traffic to the web archive. In one example,
an archived page averaged more than 1000 requests per minute. These requests
are not visible to the user, so if a user leaves such an archived page open in
a browser tab, they would be unaware that their browser is continuing to
generate traffic to the web archive. We found that web pages that require
regular updates (e.g., radio playlists, updates for sports scores, image
carousels) are more likely to make such recurring requests. If the resources
requested by the web page are not archived, some web archives may attempt to
patch the archive by requesting the resources from the live web. If the
requested resources are unavailable on the live web, the resources cannot be
archived, and the responses remain HTTP 404. Some archived pages continue to
poll the server as frequently as they did on the live web, while some pages
poll the server even more frequently if their requests return HTTP 404
responses, creating a large volume of unnecessary traffic. At scale,
such web pages effectively mount a denial-of-service attack on the web archive.
Significant computational, network and storage resources are required for web
archives to archive and then successfully replay pages as they were on the live
web, and these resources should not be spent on unnecessary HTTP traffic. Our
proposed solution is to optimize archival replay using Cache-Control HTTP
response headers. We implemented this approach in a test environment and cached
HTTP 404 responses, which prevented the browser's repeated requests from reaching
the web archive server. |
---|---|
DOI: | 10.48550/arxiv.2212.00760 |
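
The summary proposes attaching Cache-Control headers to HTTP 404 responses so that repeated polling is answered from the browser cache rather than by the archive. The sketch below illustrates the general idea with a small WSGI middleware. It is not the authors' implementation: the names `cache_404_middleware`, `replay_app`, and `response_headers` are hypothetical, and the `max-age` of 3600 seconds is an assumed value, since the summary does not state one.

```python
from wsgiref.util import setup_testing_defaults

# Assumed cache lifetime in seconds; the summary gives no specific value.
NOT_FOUND_MAX_AGE = 3600


def cache_404_middleware(app, max_age=NOT_FOUND_MAX_AGE):
    """Wrap a WSGI app so that its 404 responses carry a Cache-Control header."""

    def wrapped(environ, start_response):
        def caching_start_response(status, headers, exc_info=None):
            if status.startswith("404"):
                # Replace any existing Cache-Control header with a cacheable one.
                headers = [(k, v) for k, v in headers
                           if k.lower() != "cache-control"]
                headers.append(("Cache-Control", f"public, max-age={max_age}"))
            return start_response(status, headers, exc_info)

        return app(environ, caching_start_response)

    return wrapped


def replay_app(environ, start_response):
    """Hypothetical stand-in for a replay endpoint whose resource is not archived."""
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"resource not archived\n"]


def response_headers(app, path="/"):
    """Call a WSGI app once and return (status, headers as a dict)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    list(app(environ, start_response))  # consume the response body
    return captured["status"], captured["headers"]


if __name__ == "__main__":
    status, headers = response_headers(cache_404_middleware(replay_app))
    print(status, headers.get("Cache-Control"))
```

With a header like this, a browser that honors it can reuse the stored 404 for the stated lifetime instead of contacting the server again for the same URL. Choosing the lifetime is a trade-off: a longer value suppresses more polling but delays the moment a later-archived resource becomes visible.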