Web crawler scheduler that utilizes sitemaps from websites

Bibliographic details
Main authors: KELLER RALPH MICHAEL, BRAWER SASCHA B, IBEL MAXIMILIAN, SHIVAKUMAR NARAYANAN
Format: Patent
Language: English
Description
Abstract: Systems and methods for scheduling documents for crawling are disclosed. In some implementations, a method includes obtaining sitemap information for a plurality of websites and analyzing the sitemap information to identify, among the plurality of websites, a website whose sitemap information is at least potentially out of date. The method also includes updating the sitemap information for the identified website by downloading updated sitemap information for that website, and scheduling documents for crawling in accordance with the updated sitemap information.
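
The four steps named in the abstract (obtain, identify, update, schedule) can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not say how a sitemap is judged "potentially out of date", so the age threshold `max_age` is an assumption made here for illustration, as are the names `SitemapInfo`, `find_stale_sites`, `refresh_and_schedule`, and the injected `fetch` callable that stands in for the real download step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SitemapInfo:
    """Stored sitemap data for one website (hypothetical structure)."""
    site: str
    urls: List[str]
    fetched_at: float  # seconds since epoch when the sitemap was downloaded


def find_stale_sites(sitemaps: Dict[str, SitemapInfo],
                     now: float, max_age: float) -> List[str]:
    """Identify sites whose sitemap is potentially out of date.

    Assumption: "potentially out of date" means older than max_age seconds.
    """
    return [site for site, info in sitemaps.items()
            if now - info.fetched_at > max_age]


def refresh_and_schedule(sitemaps: Dict[str, SitemapInfo],
                         fetch: Callable[[str], List[str]],
                         now: float, max_age: float) -> List[str]:
    """Re-download stale sitemaps and return the URLs scheduled for crawling."""
    schedule: List[str] = []
    for site in find_stale_sites(sitemaps, now, max_age):
        urls = fetch(site)                            # download updated sitemap
        sitemaps[site] = SitemapInfo(site, urls, now)  # update stored info
        schedule.extend(urls)                         # schedule per updated sitemap
    return schedule


# Demo with a fake fetcher, so no network access is needed.
store = {
    "a.example": SitemapInfo("a.example", ["https://a.example/"], fetched_at=900.0),
    "b.example": SitemapInfo("b.example", ["https://b.example/old"], fetched_at=100.0),
}
fake_fetch = lambda site: [f"https://{site}/new"]
print(refresh_and_schedule(store, fake_fetch, now=1000.0, max_age=500.0))
```

In this sketch only `b.example` exceeds the age threshold, so only its sitemap is re-downloaded and only its URLs are scheduled. Passing the downloader in as a parameter keeps the staleness check and the scheduling logic testable without touching the network.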