Constructing a website graph using the crawling procedure

Authors

  • Ivan O. Dolotov, Oles Honchar Dnipro National University, 72, Science Ave. Dnipro, 49010, Ukraine
  • Natalia A. Guk, Oles Honchar Dnipro National University, 72, Science Ave. Dnipro, 49010, Ukraine

DOI:

https://doi.org/10.15276/hait.07.2024.27

Keywords:

Graph, website, web graph, crawling, breadth-first search, clustering, modularity, transitivity, metric

Abstract

The paper presents an approach to analyzing website structures. The objective is to develop an automated data collection procedure (a crawling process) that systematically traverses a website and constructs a web graph, represented either as lists of vertices and edges or as an adjacency matrix, enabling subsequent analysis of the structural connections between its elements. An unclear website structure can hinder user navigation and slow down indexing by search engines. Consequently, the development of automatic structure analysis methods is a relevant task. Existing procedures for collecting information from websites do not provide comprehensive datasets and lack options for configuring data collection parameters. Because modern websites often have dynamic structures, which leads to variations in URL composition, this work enhances the approach to automating website structure data collection by accounting for dynamic pages and the specific features of their URL structure. The research method involves analyzing both internal and external links on webpages to understand the interconnections between different parts of a site. The quality of the structure is evaluated by calculating metric characteristics of the generated web graph, including diameter, density, clustering coefficient, and others. In this work, a crawling procedure and algorithm were developed based on a breadth-first traversal of the graph. Software was developed to implement the crawling procedure and analyze the collected data, using Python libraries such as requests, BeautifulSoup4, and networkx. Web graphs of several websites of various types and topics were constructed. The web graph representation made it possible to explore the structural properties of the websites. Plots were constructed to show the dependence of the average density of web graphs on the number of vertices, of the average graph formation time on the number of vertices, and of the average modularity coefficient on the average clustering coefficient. It was found that websites with well-defined thematic structures exhibit higher modularity and clustering coefficients. The practical significance of this work lies in its potential applications for optimizing website structures and developing new tools for data analysis.
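
As an illustration only, the breadth-first crawling idea described in the abstract can be sketched as follows. This is not the authors' software: it uses only the Python standard library instead of requests, BeautifulSoup4, and networkx, it follows internal links only, and the names `crawl`, `normalize`, `density`, `LinkParser`, the `fetch` callable, and the in-memory `SITE` at example.org are all hypothetical. Density is computed here with the usual directed-graph formula m / (n(n - 1)), which is an assumption about the definition used in the paper.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def normalize(base, href):
    """Resolve href against the page URL and drop any #fragment."""
    url, _fragment = urldefrag(urljoin(base, href))
    return url


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first traversal of a site starting at start_url.

    fetch is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns (vertices, edges): a list of page URLs in discovery order and a
    set of directed (source, target) pairs between them.
    """
    host = urlparse(start_url).netloc
    vertices, seen = [start_url], {start_url}
    edges = set()
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        html = fetch(page)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            target = normalize(page, href)
            if urlparse(target).netloc != host:
                continue  # external link: ignored in this simplified sketch
            if target not in seen:
                if len(seen) >= max_pages:
                    continue  # page limit reached: do not add new vertices
                seen.add(target)
                vertices.append(target)
                queue.append(target)
            if target != page:  # skip self-loops
                edges.add((page, target))
    return vertices, edges


def density(vertices, edges):
    """Directed graph density: m / (n * (n - 1))."""
    n = len(vertices)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


# Tiny in-memory "website" so the sketch runs without network access.
SITE = {
    "https://example.org/": '<a href="/a">A</a> <a href="/b#top">B</a> '
                            '<a href="https://other.example/">ext</a>',
    "https://example.org/a": '<a href="/">home</a> <a href="/b">B</a>',
    "https://example.org/b": '<a href="/">home</a>',
}

vertices, edges = crawl("https://example.org/", SITE.get)
print(len(vertices), len(edges), round(density(vertices, edges), 3))
```

In a real crawl, `fetch` would wrap an HTTP client, and the resulting edge list could be loaded into a graph library to compute further metrics such as diameter, clustering coefficient, or modularity.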

Author Biographies

Ivan O. Dolotov, Oles Honchar Dnipro National University, 72, Science Ave. Dnipro, 49010, Ukraine

Postgraduate student, Faculty of Applied Mathematics

Natalia A. Guk, Oles Honchar Dnipro National University, 72, Science Ave. Dnipro, 49010, Ukraine

Doctor of Physical and Mathematical Sciences, Professor, Faculty of Applied Mathematics

Scopus Author ID: 54791066900

Published

2024-11-14

How to Cite

Dolotov, I. O., & Guk, N. A. (2024). Constructing a website graph using the crawling procedure. Herald of Advanced Information Technology, 7(4), 384–392. https://doi.org/10.15276/hait.07.2024.27