WEB CRAWLING AND DATA STRUCTURING FROM HETEROGENEOUS SOURCES AT SCALE. A PARALLEL WEB INTELLIGENCE INFRASTRUCTURE FOR COLLECTING DATA FROM TENS OF THOUSANDS OF SOURCES: ARCHITECTURAL APPROACHES PRIOR TO THE ERA OF LLMS AND VECTOR DATABASES

Redko D.
Cite as:
Redko D. WEB CRAWLING AND DATA STRUCTURING FROM HETEROGENEOUS SOURCES AT SCALE. A PARALLEL WEB INTELLIGENCE INFRASTRUCTURE FOR COLLECTING DATA FROM TENS OF THOUSANDS OF SOURCES: ARCHITECTURAL APPROACHES PRIOR TO THE ERA OF LLMS AND VECTOR DATABASES // Universum: технические науки : electronic scientific journal. 2026. 5(146). URL: https://7universum.com/ru/tech/archive/item/22672 (accessed: 28.05.2026).
DOI: 10.32743/UniTech.2026.146.5.22672
Received by the editors: 21.04.2026
Accepted for publication: 23.04.2026
Published: 28.05.2026

 

UDC 004.7

ABSTRACT

The article examines architectural approaches to web crawling and the subsequent structuring of data when working with large arrays of heterogeneous Internet sources. It is shown that the foundation of large-scale information collection was formed long before the emergence of large language models and vector databases. The main role in such systems was played by a distributed crawler, a URL frontier, crawl prioritization rules, filtering, extraction of meaningful page fragments, field normalization, and transformation of results into a unified schema. Special attention is paid to focused crawling as a means of reducing the share of irrelevant downloads and increasing the usefulness of the data flow. A comparison is made between classical solutions and the capabilities of LLMs, leading to the conclusion that modern language models deliver the greatest effect at the stages of semantic matching, feature extraction from weakly structured text, and transformation of records into a unified representation, while the core engineering infrastructure of large-scale crawling is still determined by classical architectural principles.


 

Keywords: web crawling, large-scale crawler, focused crawling, data structuring, heterogeneous sources, OSINT, market intelligence, data extraction, LLM, deep web.


 

Introduction. The growth in the number of digital sources and the accelerating pace of web content updates have made large-scale data collection a central task in applied analytics. In particular, this concerns systems that must regularly crawl thousands and tens of thousands of websites, extract relevant information, eliminate noise, track changes, and convert the results into a form suitable for comparison. Such solutions are widely used in OSINT, market intelligence, competitor monitoring, news analytics, job market analysis, and the monitoring of industry platforms, among other domains. In this context, a web crawler should be viewed as a programmable, scalable, and distributed component, which makes it possible to examine its structure and functions in relation to the evolution of web intelligence systems [8].

In real-world scenarios, value arises from maintaining a continuous stream of verifiable and comparable data rather than from the mere retrieval of pages. Accordingly, large-scale crawling has always addressed two interrelated tasks: the first involves organizing a stable and resource-efficient crawl of sources, while the second concerns transforming raw web material into a structured dataset in which records can be compared on the basis of common attributes. It is unsurprising that the rise of large language models has intensified attention to the second task; nevertheless, the core engineering framework of such systems emerged much earlier and was shaped by classical research on distributed and focused crawling.

The aim of this study is to conduct a comparative analysis and to develop an algorithm for a parallel web intelligence infrastructure designed to collect data from tens of thousands of sources.

Research methodology. The methodology is based on an analysis of the scientific literature on large-scale crawling, focused crawling, data extraction from the deep web, processing of heterogeneous sources, and the application of LLMs in web analytics tasks. The study employs a comparative analysis of architectural approaches, a structural and functional analysis of the core components of a web crawler, and a synthesis of engineering solutions in the field of large-scale data collection and structuring.

The analysis of this literature shows that a classical large-scale crawler was built around several essential components:

  • a URL frontier;
  • a crawl scheduler;
  • page fetchers;
  • a parser;
  • a page repository;
  • auxiliary monitoring mechanisms.

The idea of separating the web crawling process itself from the application-level filtering of results was of fundamental importance. For example, in the study by J. M. Hsieh, S. D. Gribble, and H. M. Levy, an extensible crawler is described as a service that crawls the web on behalf of multiple client applications, while clients define filtering rules and receive only those pages that meet their criteria. The authors explicitly note that timely large-scale crawling is complex, operationally demanding, and costly; therefore, isolating the crawling function into a separate service reduces the overhead for application systems and allows them to focus on the substantive processing of data [6]. Within such an architecture, the URL frontier and the scheduler play a particularly important role: the former is responsible for accumulating URLs, while the latter determines the order of crawling, revisits, and request rate limiting (Fig. 1):

 

Figure 1. Architectural foundation of a classical large-scale crawler prior to the era of LLMs, compiled by the author

 

As Fig. 1 shows, the URL frontier and the scheduler occupy the central position in a classical large-scale crawler prior to the era of LLMs. The figure also makes clear why large-scale crawling was from the outset a problem of distributed computation: parallelism was a prerequisite for the timely expansion of the data corpus. Even at the architectural level, it was necessary to balance crawl coverage, content freshness, the cost of network activity, and the speed of subsequent processing.
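The division of labor between the frontier and the scheduler can be illustrated with a minimal single-process sketch. It is not taken from any of the cited systems: the class name `Frontier`, the `delay` parameter, and the decision to pass the current time explicitly (so that the behavior is deterministic) are assumptions made for illustration only.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """Accumulates URLs per host and releases at most one URL per host every `delay` seconds."""

    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self.queues = defaultdict(deque)  # host -> pending URLs (frontier role)
        self.next_allowed = {}            # host -> earliest time of the next request
        self.heap = []                    # (time, host) for hosts that have pending URLs
        self.seen = set()                 # URL-level deduplication

    def add(self, url: str, now: float = 0.0) -> bool:
        """Store a newly discovered URL; returns False if it was already known."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        if not self.queues[host]:
            # The host had no pending work, so it must be (re)scheduled.
            ready_at = max(now, self.next_allowed.get(host, 0.0))
            heapq.heappush(self.heap, (ready_at, host))
        self.queues[host].append(url)
        return True

    def next_url(self, now: float):
        """Scheduler role: return a URL whose host may be contacted at `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.heap, (self.next_allowed[host], host))
        return url
```

In a distributed deployment the same idea is usually applied per partition of hosts, so that each worker enforces rate limits only for the hosts it owns; the sketch leaves revisit scheduling out entirely.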

With tens of thousands of sources, it is impossible to crawl all available pages with equal priority. For this reason, focused crawling has gained particular importance, as it prioritizes materials that are thematically relevant to the domain of the system. For example, in the study by F. Ahmadi-Abkenari and A. Selamat, the architecture of a focused trend parallel crawler is associated with the task of identifying relevant documents and prioritizing them for subsequent retrieval; in other words, the challenge lies in determining a traversal order of URLs that increases the proportion of useful pages in the resulting data stream [1].
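The notion of a traversal order that favors relevant pages can be sketched as a best-first ordering of discovered links. The scoring function below is a deliberately simple keyword share and is not the clickstream-based method of [1]; the vocabulary `TOPIC_TERMS` and the function names are hypothetical.

```python
import heapq
import re

# Hypothetical domain vocabulary; a real system would derive it from the target domain.
TOPIC_TERMS = {"vacancy", "salary", "engineer"}


def relevance(text: str) -> float:
    """Share of tokens that belong to the topic vocabulary (0.0 to 1.0)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in TOPIC_TERMS for t in tokens) / len(tokens)


def order_links(links):
    """links: list of (url, anchor_text). Returns URLs with the most relevant first."""
    # Negative score turns the min-heap into a max-heap; the index keeps ties stable.
    heap = [(-relevance(anchor), i, url) for i, (url, anchor) in enumerate(links)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

Scoring the anchor text rather than the page body is a design choice made here because the anchor is available before the page is downloaded, which is the point at which the crawl order has to be decided.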

This line of research was further developed in studies on semantic focused crawling. J. Hernandez, H. M. Marin-Castro, and M. Morales-Sandoval note that a traditional crawler relies on URL indexing, a frontier, a page downloader, and a repository, whereas a focused crawler extends this structure with a topical classification module. The authors propose a semantic focused web crawler in which the domain is described through a knowledge representation schema, which in turn reduces dependence on a rigidly defined ontology and enables the evaluation of page relevance based on content within a specified domain. For large-scale crawling, this result is significant for two reasons:

  1. the proportion of irrelevant downloads is reduced;
  2. early-stage topical filtering facilitates subsequent structuring, as the document stream becomes more semantically homogeneous from the outset [5].

However, after crawling heterogeneous sources, a more labor-intensive stage begins, as the web pages retrieved by the crawler rarely constitute a uniform data type. Differences arise in HTML templates, the depth and structure of the DOM, the formats used to present dates and numerical values, the presence of tables, cards, links, and graphical elements, as well as the behavior of dynamic interfaces. Therefore, data structuring has always taken the form of a sequence of transformations (Fig. 2).

 

Figure 2. Data structuring from heterogeneous sources, compiled by the author

 

Research on deep web extraction has made a significant contribution to addressing this problem. In their study on OXPath, T. Furche, G. Gottlob, G. Grasso, and co-authors describe a declarative extension of XPath for scalable data extraction, automation of interactions, and traversal of the deep web. As the volume of information increased and web application interfaces became more complex, the need for automated processing naturally emerged, since earlier data extraction tools struggled to handle interaction-intensive scenarios. Within a web intelligence infrastructure, structuring therefore cannot rely solely on static HTML: a substantial portion of relevant data is revealed only through interactive elements such as expandable blocks, search results, and other interfaces, where extraction is tied to user actions and navigation across pages [3].

The problem is further exacerbated when information about a single entity is distributed across multiple sources and expressed in different representation formats. As noted by M. I. Ali, R. Pichler, H. L. Truong, and S. Dustdar, information about an entity in the web environment is often dispersed across heterogeneous sources using formats such as XML, RDF, and OWL, while applications are required to query autonomous and distributed data [2]. This observation is particularly important for large-scale source comparison systems, because publications, product listings, company profiles, job postings, and news items rarely conform to a common schema. For this reason, a dedicated stage is required to transform data into a unified model, in which different representations of the same attributes are mapped into a comparable format; without such a stage, further analysis remains inherently fragmented.
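The transformation into a unified model can be illustrated with a rule-based sketch of the kind used before LLMs. The source names, field names, and accepted date formats below are hypothetical, and the price parser assumes that a comma is a decimal separator, so values with thousands commas are outside its scope.

```python
from datetime import datetime

# Hypothetical per-source field names mapped onto one target schema.
FIELD_MAP = {
    "site_a": {"headline": "title", "posted": "date", "cost": "price"},
    "site_b": {"name": "title", "published_at": "date", "price_eur": "price"},
}
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def parse_date(value: str) -> str:
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def parse_price(value: str) -> float:
    """Drop spaces and currency signs; a comma is treated as the decimal separator."""
    cleaned = value.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    return float("".join(ch for ch in cleaned if ch.isdigit() or ch == "."))


def to_unified(source: str, raw: dict) -> dict:
    """Rename source-specific fields and normalize their values."""
    mapping = FIELD_MAP[source]
    record = {mapping[k]: v for k, v in raw.items() if k in mapping}
    record["date"] = parse_date(record["date"])
    record["price"] = parse_price(record["price"])
    record["source"] = source
    return record
```

Once both sources pass through `to_unified`, their records can be compared field by field, which is precisely what the raw pages do not allow.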

A separate task consists in extracting the main content of a web page. In real-world web documents, useful material is accompanied by navigation elements, banners, sidebars, service links, recommendation blocks, and other recurring template components. M. Radilova, P. Kamencay, R. Hudec, M. Benco, and R. Radil examine a tool for extracting the main text and images from a web document. The authors rely on the Document Object Model, NLP methods, and classification techniques to identify the page segment that contains the article text and the relevant images [7]. The quality of main content extraction in turn determines the accuracy of subsequent extraction of dates, names, titles, numerical values, and other attributes.
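A much simpler, purely tag-based variant of main content extraction can be sketched with the standard library. It is not the tool described in [7], which combines the DOM with NLP and classification; the set `BOILERPLATE_TAGS` and the rule of keeping only paragraph text outside those containers are assumptions chosen for brevity.

```python
from html.parser import HTMLParser

# Containers assumed to hold template material rather than the main content.
BOILERPLATE_TAGS = {"nav", "aside", "footer", "header", "script", "style", "form"}


class MainTextExtractor(HTMLParser):
    """Collects the text of <p> elements that are not nested in boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a boilerplate container
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.paragraphs[-1] += data


def extract_main_text(html: str) -> str:
    """Return the kept paragraphs, one per line, with whitespace collapsed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(" ".join(p.split()) for p in parser.paragraphs if p.strip())
```

A heuristic of this kind fails on pages that place article text in `div` elements or use non-semantic markup, which is one reason classification-based approaches were developed.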

Results and Discussion. Translating the above into the practical task of comparing large sets of sources, several key levels of comparison can be identified:

  1. the sources themselves are first compared as stable observation nodes;
  2. individual documents or entries within these sources are then compared;
  3. finally, the extracted records are compared based on their fields (for which classical systems relied on URL patterns, timestamps, entity identification rules, lexical features, and predefined normalization rules).

When the page structure remained stable and the domain allowed for formalization, this classical approach proved sufficiently effective.
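The field-level comparison in classical systems can be sketched as URL canonicalization combined with a lexical similarity measure. The list of tracking parameters, the Jaccard threshold of 0.6, and the record fields `url`, `date`, and `title` are illustrative assumptions rather than values taken from the cited works.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that do not change the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}


def canonical_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking parameters, sort the rest, drop the fragment."""
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def same_record(r1: dict, r2: dict, threshold: float = 0.6) -> bool:
    """Records match if their canonical URLs coincide, or if dates are equal and titles overlap."""
    if canonical_url(r1["url"]) == canonical_url(r2["url"]):
        return True
    return r1["date"] == r2["date"] and jaccard(r1["title"], r2["title"]) >= threshold
```

Rules of this kind are cheap and predictable, but they treat differently worded descriptions of the same entity as distinct records.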

However, the emergence of LLMs has primarily transformed the semantic level of processing. In the work by D. Gauhl, K. Kakkanattu, M. Mukkattu, and T. Hanne, a hybrid system is described that combines near real-time web crawling with LLMs for semantic analysis of web pages. The authors note that the proposed system extracts and ranks keywords from user data and dynamically retrieves job listings from online platforms. At the same time, they highlight an important issue: the quality of recommendations depends on the performance of the web crawler; therefore, while the language model enhances analytics as an independent layer, it does not eliminate the need for robust data collection and updating mechanisms [4]. Thus, it can be concluded that LLMs deliver the greatest impact in recognizing semantic similarity between records, interpreting weakly structured text, identifying closely related descriptions, and transforming fields into a unified representation. Core tasks such as URL deduplication, crawl rate control, page snapshot storage, and initial filtering remain the responsibility of the classical infrastructure (Table 1).

Table 1.

Comparative analysis of the impact of LLMs on the comparison of large-scale source datasets, compiled by the author

| Criterion | Classical approaches | LLM-based approaches | Optimal usage strategy |
|---|---|---|---|
| Selection of relevant pages | Reliance on URLs, keywords, patterns, and domain-specific rules | Semantic selection based on page content | Primary filtering is best handled by classical methods |
| Recognition of similarity between records | Matching of fields, dictionaries, patterns, and labels | Identification of semantic similarity despite differences in wording | LLMs are effective for secondary comparison of records |
| Data extraction from weakly structured text | XPath, DOM, extraction rules | Interpretation based on semantic features | LLMs are justified for processing non-standard pages |
| Field normalization | Rigid mapping rules | Interpretation of ambiguous representations | Semantic refinement improves alignment to a unified schema |
| Handling changing markup | Depends on regular updates of rules | More robust to variations in phrasing | LLMs reduce sensitivity to some changes but do not eliminate the need for pipeline maintenance |
| Processing cost | Lower for large-scale processing | Higher due to inference | LLMs are best applied selectively |

 

Based on the data presented in Table 1, the most justified approach appears to be a two-stage processing system, in which the classical pipeline performs large-scale collection, parsing, and coarse normalization, while the LLM is engaged at the stages of semantic refinement and handling complex cases (Fig. 3):

 

Figure 3. Algorithm for the combined use of LLMs and classical approaches in web crawling and large-scale structuring of data from heterogeneous sources, developed by the author

 

The proposed algorithm is based on the premise that rule-based and DOM-based methods retain an advantage at the stages of large-scale initial processing. LLMs, in turn, provide gains in situations where it is necessary to identify semantically similar records despite differences in wording, interpret non-standard representations of attributes, or resolve ambiguity between fields.
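The routing logic implied by this premise can be sketched as a rule-first extractor with a fallback. The sketch covers one field only and is not the algorithm of Fig. 3 itself: the regular expression `SALARY_RE` is an illustrative rule, and `llm_extract` is a hypothetical callable that stands in for whatever model interface a concrete system would use.

```python
import re
from typing import Callable, Optional, Tuple

# Illustrative rule: a number followed by a currency marker.
SALARY_RE = re.compile(r"(\d[\d ,.]*)\s*(USD|EUR|\$|€)", re.IGNORECASE)


def rule_based_salary(text: str) -> Optional[str]:
    """Classical stage: cheap pattern matching, returns None when the rule does not fire."""
    match = SALARY_RE.search(text)
    return match.group(0).strip() if match else None


def extract_salary(
    text: str, llm_extract: Callable[[str], Optional[str]]
) -> Tuple[Optional[str], str]:
    """Return (value, stage); the LLM stage is invoked only when the rules fail."""
    value = rule_based_salary(text)
    if value is not None:
        return value, "rules"
    # `llm_extract` is a placeholder for a model call and is an assumption of this sketch.
    return llm_extract(text), "llm"
```

Recording which stage produced each value makes it possible to measure how often the expensive path is taken, which bears directly on the processing cost criterion in Table 1.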

Conclusion. Thus, large-scale web crawling represents a mature domain whose core architectural principles were established prior to the emergence of LLMs and vector databases. Such systems are built upon distributed crawling, a URL frontier, a revisit scheduler, stream filtering, extraction of meaningful page fragments, and the transformation of results into a unified data schema. The mere retrieval of web pages does not, in itself, carry analytical value, as practical outcomes arise at the structuring stage, when heterogeneous material is converted into a comparable set of records.

Focused crawling occupies a special place within this system, as it reduces the share of irrelevant documents, lowers the overall load on storage systems, and improves the quality of subsequent data extraction. At the next stage, the system’s ability to isolate the main content block of a page, interact with deep web interfaces, and consolidate disparate attributes into a unified representation of the observed entity becomes critical.

Modern LLMs have expanded the capabilities of semantic data processing, with their greatest value realized in comparing records with differing formulations, interpreting weakly structured text, and refining field normalization. At the same time, the role of classical engineering solutions remains essential, as large-scale web crawling, source management, deduplication, crawl rate control, and data freshness are still governed by the crawler architecture. Accordingly, an effective web intelligence system should be built on the integration of a core distributed data collection infrastructure with a specialized semantic layer, which is engaged in cases where formal methods reach their limits of accuracy.

 

References:

  1. Ahmadi-Abkenari F., Selamat A. An architecture for a focused trend parallel Web crawler with the application of clickstream analysis // Information Sciences. 2012. Vol. 184, iss. 1. P. 266–281. DOI: 10.1016/j.ins.2011.08.022.
  2. Ali M. I., Pichler R., Truong H. L., Dustdar S. DeXIN: An extensible framework for distributed XQuery over heterogeneous data sources // Enterprise Information Systems : proceedings of ICEIS 2009 / ed. by J. Filipe, J. Cordeiro. Berlin ; Heidelberg : Springer, 2009. Vol. 24. P. 211–224. DOI: 10.1007/978-3-642-01347-8_15.
  3. Furche T., Gottlob G., Grasso G. et al. OXPath: A language for scalable data extraction, automation, and crawling on the deep web // The VLDB Journal. 2013. Vol. 22. P. 47–72. DOI: 10.1007/s00778-012-0286-6.
  4. Gauhl D., Kakkanattu K., Mukkattu M., Hanne T. Integrating large language models with near real-time web crawling for enhanced job recommendation systems // Computers. 2025. Vol. 14. Art. 387. DOI: 10.3390/computers14090387.
  5. Hernandez J., Marin-Castro H. M., Morales-Sandoval M. A semantic focused web crawler based on a knowledge representation schema // Applied Sciences. 2020. Vol. 10. Art. 3837. DOI: 10.3390/app10113837.
  6. Hsieh J. M., Gribble S. D., Levy H. M. The architecture and implementation of an extensible web crawler // 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 10). San Jose, CA : USENIX Association, 2010. URL: https://www.usenix.org/conference/nsdi10-0/architecture-and-implementation-extensible-web-crawler
  7. Radilova M., Kamencay P., Hudec R., Benco M., Radil R. Tool for parsing important data from web pages // Applied Sciences. 2022. Vol. 12. Art. 12031. DOI: 10.3390/app122312031.
  8. Zeinalipour-Yazti D., Dikaiakos M. Design and implementation of a distributed crawler and filtering processor // Next Generation Information Technologies and Systems : proceedings of NGITS 2002 / ed. by A. Halevy, A. Gal. Berlin ; Heidelberg : Springer, 2002. Vol. 2382. P. 58–74. DOI: 10.1007/3-540-45431-4_6.
Information about the author

Senior Software Engineer, Ada, Toronto, Canada

The journal is registered by the Federal Service for Supervision of Communications, Information Technology and Mass Media (Roskomnadzor), registration number ЭЛ №ФС77-54434 of 17.06.2013
Founder of the journal: ООО «МЦНО»
Editor-in-Chief: Звездина Марина Юрьевна.