DEVELOPMENT OF A PROGRAM FOR COLLECTION OF WEBSITE STRUCTURE DATA


Bibliographic details
Published in: Trudy Karelʹskogo nauchnogo t͡sentra Rossiĭskoĭ akademii nauk 2016-09 (8), p. 81-90
Authors: Печников, Андрей Анатольевич, Ланкин, Александр Валерьевич, Pechnikov, Andrey, Lankin, Alexandr
Format: Article
Language: eng
Online access: Full text
Summary: The web graph is the most common mathematical model of a website. Constructing the web graph of a real site requires data about that site's structure: the HTML pages and/or documents it contains (in particular, the URLs of its web resources) and the hyperlinks connecting them. Web servers often use aliases and redirections, and they may generate the same page dynamically in response to different URL requests. This creates a problem: distinct URLs can point to identical content, so the resulting web graph may contain several vertices that correspond to pages with the same content. The paper describes a crawler called RCCrawler that collects information about websites in order to build their web graphs. This crawler largely addresses the above problem, as confirmed by a series of experiments.
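The duplicate-URL problem described in the abstract can be illustrated with a minimal sketch (this is an illustrative approach, not the RCCrawler implementation itself, which the paper details): URLs whose content hashes to the same digest are collapsed into a single web-graph vertex, so aliases and dynamically generated duplicates do not produce extra vertices. The example URLs and page contents are hypothetical.

```python
import hashlib

def build_web_graph(pages, links):
    """Collapse URLs with identical content into one web-graph vertex.

    pages: dict mapping URL -> page content (static strings here,
           standing in for HTML a real crawler would fetch)
    links: iterable of (source_url, target_url) hyperlinks
    Returns (vertices, edges): canonical URLs and deduplicated edges.
    """
    canonical = {}       # content digest -> first URL seen with that content
    url_to_vertex = {}   # every URL -> its canonical vertex
    for url, content in pages.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        canonical.setdefault(digest, url)
        url_to_vertex[url] = canonical[digest]
    vertices = set(canonical.values())
    # Rewrite each hyperlink onto canonical vertices; drop self-loops
    # created when an alias links to its own duplicate.
    edges = {(url_to_vertex[s], url_to_vertex[t])
             for s, t in links
             if s in url_to_vertex and t in url_to_vertex
             and url_to_vertex[s] != url_to_vertex[t]}
    return vertices, edges

# Two URLs ("/index.html" and "/?page=home") serve identical content,
# as a server might via an alias or dynamic page generation.
pages = {
    "http://example.org/index.html": "<html>home</html>",
    "http://example.org/?page=home": "<html>home</html>",
    "http://example.org/about":      "<html>about</html>",
}
links = [
    ("http://example.org/index.html", "http://example.org/about"),
    ("http://example.org/?page=home", "http://example.org/about"),
]
vertices, edges = build_web_graph(pages, links)
print(len(vertices), len(edges))  # 2 vertices, 1 edge after deduplication
```

Without the content-hash step, the same input would yield three vertices and two parallel edges; deduplication shrinks it to the two genuinely distinct pages joined by a single hyperlink.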
ISSN:1997-3217
2312-4504
DOI:10.17076/mat381