Network repository service for efficient web crawling

The present invention relates generally to searching and gathering information on computer networks. More specifically, it relates to improved techniques for gathering large amounts of information from a large number of resources on a network, e.g., web crawling. A network repository service supplem...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Kraft, Reiner, Emens, Michael Lawrence
Format: Patent
Sprache:eng
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The present invention relates generally to searching and gathering information on computer networks. More specifically, it relates to improved techniques for gathering large amounts of information from a large number of resources on a network, e.g., web crawling. A network repository service supplements the functions of a web server to enable an increase in the efficiency of web crawling. The repository service: (a) automatically maintains a file modification list that contains the names of files on the server that have been modified (i.e., added, deleted, or otherwise modified), together with the date and time of the file modification; and (b) provides a requesting crawler with the file modification list (or a portion of the list corresponding to a time period specified by the crawler). The repository service may also (c) limit or restrict access privileges of crawlers that do not request the file modification list prior to crawling, thereby protecting the server from overcrawling. The repository service enables a crawler to request the file modification list, and avoid unnecessarily recrawling files that have not been modified since its last visit, thereby preventing considerable waste of time, network bandwidth, server processing resources, and crawler processing resources. Using the file modification list, the crawler can remove all prior references to deleted files, and efficiently recrawl only those files that have been added or changed since the crawler last visited the web server.