Dynamically constrained, forward scheduling over uncertain workloads
Main authors: | , , |
---|---|
Format: | Patent |
Language: | eng |
Summary: | Scheduling searchable items, such as web pages, for crawling involves dynamically scheduling the items for download according to the capacity available over time. The workload is distributed over time, in advance, by anticipating and accounting for the discovery of new links on a particular host. The respective times at which to download items can be determined from the current size of the host's crawl corpus relative to its maximum size. These times may additionally be determined from respective freshness targets for the searchable items, which characterize how often an item's content should be refreshed by re-downloading it, and from respective politeness factors for the host, which characterize the delay between consecutive download requests to that host. As a result, one can know precisely how the system is performing at any point in time and can predict its future performance. |
---|
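
The summary names three inputs to the schedule (corpus size relative to its maximum, freshness targets, and politeness) but gives no formula. The Python sketch below is one possible reading, not the patent's method. It assumes fixed-length time slots, a per-slot capacity derived from the politeness delay, and a hypothetical rule that reserves a share of each slot for links not yet discovered, in proportion to how far the corpus is from its maximum size. The names `Item`, `slot_capacity`, `reserved_fraction`, and `schedule` are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Item:
    url: str
    freshness_target_s: int  # desired refresh interval in seconds (assumed unit)


def slot_capacity(slot_s: int, politeness_delay_s: int) -> int:
    """Requests one host can serve per slot, given the delay between requests."""
    return slot_s // politeness_delay_s


def reserved_fraction(current_size: int, max_size: int) -> float:
    """Assumed rule: hold back capacity for undiscovered links in proportion
    to the gap between the current and maximum corpus size (max_size > 0)."""
    return max(0.0, 1.0 - current_size / max_size)


def schedule(items, slot_s, politeness_delay_s, max_size, horizon_slots):
    """Place each item in the earliest slot that still has usable capacity.

    Returns (plan, unscheduled): plan maps url -> slot start time in seconds;
    unscheduled lists urls that did not fit within the planning horizon.
    """
    capacity = slot_capacity(slot_s, politeness_delay_s)
    usable = int(capacity * (1.0 - reserved_fraction(len(items), max_size)))
    load = [0] * horizon_slots
    plan, unscheduled = {}, []
    # Items with the tightest freshness target are placed first.
    for item in sorted(items, key=lambda i: i.freshness_target_s):
        slot = next((s for s in range(horizon_slots) if load[s] < usable), None)
        if slot is None:
            unscheduled.append(item.url)
            continue
        load[slot] += 1
        plan[item.url] = slot * slot_s
    return plan, unscheduled
```

For example, with four known items on a host whose corpus may grow to eight, 60-second slots, and a 10-second politeness delay, half of each slot's six requests are held back, so three items land in the first slot and the fourth in the second. Because every slot's load is known in advance, the planned utilization at any future time can be read directly from the plan.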