OVERVIEW OF INFORMATION RETRIEVAL MODELS

Cite as:
Maharramov Z.T., Guliyev I.N. OVERVIEW OF INFORMATION RETRIEVAL MODELS // Universum: технические науки : electronic scientific journal. 2025. 3(132). URL: https://7universum.com/ru/tech/archive/item/19567 (accessed: 19.04.2025).
DOI: 10.32743/UniTech.2025.132.3.19567

 

ABSTRACT

Search engines have evolved significantly since early systems such as Archie, which provided file search services on FTP servers before the emergence of the World Wide Web, and Wandex and Aliweb, which appeared in 1993 as the first search systems for the Web. The first full-text search engine, WebCrawler, launched in 1994, introduced the ability to search for any keyword across all indexed web pages, laying the foundation for modern search technologies.

This study explores classical information retrieval models and their evaluation methods, providing a structured understanding of search engine functionality. The research highlights the role of theoretical models in predicting search results and justifying document relevance. The findings contribute to enhancing search efficiency and accuracy by improving indexing structures, query interpretation techniques, and relevance scoring methodologies.

 

Keywords: Information retrieval, Search systems, Semantic search, Information needs, Keyword-based search, Boolean model

 

Introduction

The problem of searching for information resources is the problem of finding information that meets the needs of users. Research on this problem began in the middle of the last century, shortly after the invention of electronic computers [1].

Information retrieval was once a small scientific and applied field with a limited number of researchers. At the end of the last century and the beginning of this one, however, the rapid development of the global Internet, and especially of Web technologies, gave a strong impetus to its progress. Today, information retrieval involves millions of users, huge databases, powerful computing systems, and complex algorithms. Machine learning methods, multimedia data analysis, computational linguistics, and geographic information services are used to solve retrieval problems, and the psychology and social relationships of users are studied as well [2].

Most existing information retrieval systems are based on the same general mechanism. To perform a search, users enter a query that describes their information need and consists of certain terms. After processing the query, the search system should return documents (or links to documents) that contain the terms specified in the query. These terms can be either specific keywords or any words (text strings) that occur in the content of the document collection. This method is considered the classic method of information retrieval, or keyword search.
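The keyword mechanism described above can be illustrated with a minimal Python sketch. It is not taken from any particular search system: the function name `keyword_search`, the whitespace tokenization, and the rule that a document must contain every query term are assumptions made for the example.

```python
def keyword_search(query, documents):
    """Return the ids of documents that contain every term of the query.

    Assumption: terms are lower-cased, whitespace-separated words.
    """
    terms = query.lower().split()
    results = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(term in words for term in terms):
            results.append(doc_id)
    return results


# Hypothetical three-document collection.
docs = {
    "d1": "the melting point of iron",
    "d2": "buy iron goods online",
    "d3": "melting point of copper",
}

print(keyword_search("iron melting", docs))  # ['d1']
```

Only `d1` contains both "iron" and "melting", so it is the only document returned. A real system would replace the linear scan with an index.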

Research Methodology

The research methodology for this study is designed to systematically analyze classical and modern information retrieval (IR) models. The study employs a combination of qualitative and quantitative research approaches to evaluate search engine mechanisms, indexing structures, and query interpretation techniques. The methodology consists of the following key components:

  1. Literature Review: A thorough review of existing literature on IR models, search engine architectures, and information retrieval techniques was conducted. The review focused on foundational principles, such as the Boolean model, vector space model, and probabilistic models, to understand their relevance and applications.
  2. Comparative Analysis: Classical and modern IR models were compared based on their theoretical frameworks, efficiency in retrieving relevant documents, and scalability. This comparison involved analyzing different retrieval algorithms, including keyword-based search, semantic search, and machine learning-based techniques.
  3. Experimental Evaluation: A set of test queries was executed on various search engine models to measure precision, recall, and relevance ranking performance. The datasets used for testing included text-based corpora and multimedia resources to assess retrieval accuracy across different content types.
  4. System Architecture Analysis: The architecture of modern search engines was examined to understand how indexing, query formulation, and ranking functions contribute to retrieval effectiveness. This involved studying indexing techniques, such as inverted index structures, as well as ranking mechanisms like TF-IDF and PageRank.
  5. User Behavior Analysis: A study of user interaction patterns with search engines was conducted to assess how query modifications, personalization, and feedback mechanisms improve search performance. This analysis helped in understanding how modern systems adapt to user needs over time.
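Precision and recall, the measures named in the experimental evaluation step, can be computed as in the following sketch. The function name and the sample document ids are hypothetical; the definitions are the standard set-based ones (precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that were retrieved).

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments for a single test query.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)  # precision 0.5, recall 2/3
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).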

Results and Discussion

The results of the study highlight significant findings in the field of information retrieval. The key insights are as follows:

  1. Effectiveness of Classical Models: The Boolean model provides precise results for structured queries but lacks flexibility for handling natural language searches. The vector space model, with its ranking capabilities, offers improved relevance scoring. However, probabilistic ranking functions such as BM25 outperform both of these approaches in ranking accuracy.
  2. Advancements in Search Technologies: Modern search engines no longer rely solely on keyword-based search. The integration of semantic search, machine learning, and artificial intelligence has enhanced relevance determination. Ontological models and contextual analysis have been found to improve search precision significantly.
  3. Role of Indexing in Retrieval Performance: The study confirms that efficient indexing structures, such as inverted indexes and distributed databases, contribute significantly to search speed and accuracy. Modern indexing methods allow search engines to scale and process large datasets efficiently.
  4. Impact of Query Processing Techniques: Query expansion, relevance feedback, and machine learning-based query interpretation significantly improve retrieval accuracy. By understanding user intent, modern search engines provide more relevant results, reducing the number of irrelevant documents retrieved.
  5. User Behavior and Personalization: The study finds that modern search engines increasingly rely on user interaction data to refine search results. Personalized search mechanisms, which consider past searches, location, and user preferences, improve result relevance but raise concerns regarding data privacy.
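As an illustration of the probabilistic ranking mentioned in the first finding, the sketch below scores documents with a common formulation of BM25. It is a simplified example, not the configuration used in this study: the function name, the whitespace tokenization, the smoothed IDF variant, and the parameter values `k1=1.5` and `b=0.75` are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (assumes docs is non-empty)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Smoothed IDF that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "information retrieval models",
    "boolean model of retrieval",
    "cooking recipes for dinner",
]
scores = bm25_scores("retrieval models", corpus)
```

In this toy corpus the first document matches both query terms and is the shortest, so it ranks first; the third document matches nothing and scores zero.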

Main part

A search engine is a combination of software and hardware for searching for information. By scale, search systems are usually divided into three main groups: search systems for personal computers, search systems for corporate networks, and search systems for the Internet.

Depending on the type of information need expressed in the search query, there are three types of search: informational (for example, finding the melting point of iron), navigational (for example, finding a link to a site), and transactional (for example, finding a site where goods can be bought online).

One of the first search systems on a computer network (before the emergence of the World Wide Web) was Archie, which provided file search services on FTP servers. Later, in 1993, the first search systems for the Web appeared: the now defunct Wandex and the still operating Aliweb.

However, the first full-text search system was the WebCrawler system, launched in 1994. Unlike previous search systems, it allowed users to search all indexed Web pages by any keyword. Since then, this type of search has become the standard for all modern popular search engines.

The development of any search engine is based on a specific model. This section examines classical models of information retrieval and the methods used to evaluate them, in order to give a general idea of the problem under study.

The initial task of searching for information resources is formulated as follows:

Given a document set $D = \{d_1, d_2, \dots, d_n\}$, where $n$ is the number of documents in the collection, and a query $q$ that describes an information need, find the subset $R \subseteq D$ of documents that are relevant to $q$ [3].

In addition to text documents, multimedia resources (images, audio recordings, videos, etc.) can also be searched. However, this requires creating text descriptions of these resources, which are included in the document set together with the means of accessing the resources themselves.

The modern list of information retrieval tasks is supplemented by tasks such as document classification and grouping, user interface design, query languages, etc. A description of the information retrieval process is presented in Figure 1.1.

 

Figure 1.1. Information retrieval problem

 

On the one hand, a user has an information need, which is expressed as a specific request and then transformed into a search phrase (query). On the other hand, search engines hold a collection of electronic resources that are indexed for automatic processing. After processing the query, search servers return a set of documents relevant to it. It is important to note that relevance is subjective: different users may judge the relevance of the same result differently.

Any information retrieval system performs the following three main functions [4]:

1) Indexing - collecting electronic resources and creating their logical descriptions, as well as storing logical descriptions using indexes (data structures optimized for fast searches).

2) Query formulation - describing the user's information needs in a language supported by the search system.

3) Matching - calculating similarity (relevance) scores between queries and documents. Based on these scores, a result set is determined and then returned to users.
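The indexing and matching functions can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, documents are tokenized on whitespace, and the similarity score is simply the number of distinct query terms a document contains.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Indexing: map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def match(query, index):
    """Matching: rank documents by the number of distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


documents = {
    "d1": "boolean retrieval model",
    "d2": "vector space retrieval model",
    "d3": "probabilistic ranking",
}
index = build_inverted_index(documents)
print(match("vector retrieval", index))  # [('d2', 2), ('d1', 1)]
```

The index is built once and reused for every query, which is why lookups touch only the postings of the query terms rather than the whole collection.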

The feedback link from the system's output to the query formulation subsystem means that search results can be used to refine the query.

Many search engines are available today, but almost all of them share a common architecture, which is shown in Figure 1.2.

Any information retrieval system is usually built on a theoretical model that describes its main features: the logical description of documents and information needs, and the algorithms for computing how well the logical descriptions of queries and documents correspond. Analyzing the model makes it possible to predict the result set for a given query and to justify the relevance of the retrieved documents. In general, an information retrieval model consists of the following four components [5]:

$\text{Model} = [D, Q, F, R(q_i, d_j)]$,

where $D$ is the set of logical representations of the documents in the collection; $Q$ is the set of logical representations of the users' information needs (queries); $F$ is a framework (platform) for modeling document representations, queries, and the relationships between them; and $R(q_i, d_j)$ is a ranking function that assigns a real number to a query $q_i \in Q$ and a document representation $d_j \in D$. This number measures the proximity of $d_j$ to $q_i$ and thereby determines the ordering of the documents with respect to the query $q_i$.

 

Figure 1.2. General architecture of information retrieval systems

 

For example, in the classical Boolean model the framework F consists of sets of documents and the standard operations on sets. In the classical vector space model it consists of a t-dimensional vector space and the standard operations of linear algebra. In the classical probabilistic model it consists of sets, standard probability operations, and Bayes' theorem.
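The first two frameworks can be illustrated with a short Python sketch. The postings, document ids, and term vectors below are made up for the example, and the probabilistic framework is left out for brevity. Boolean queries reduce to set operations, while the vector space model compares a query vector and a document vector, here with cosine similarity as one common choice of proximity measure.

```python
import math

# Boolean model: each term maps to the set of documents that contain it.
postings = {
    "boolean": {"d1", "d2"},
    "vector": {"d2", "d3"},
    "probabilistic": {"d3"},
}
all_docs = {"d1", "d2", "d3"}

and_result = postings["boolean"] & postings["vector"]         # boolean AND vector
or_result = postings["boolean"] | postings["probabilistic"]   # boolean OR probabilistic
not_result = all_docs - postings["vector"]                    # NOT vector


# Vector space model: documents and queries are t-dimensional term vectors.
def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query_vec = [1, 1, 0]  # hypothetical term weights over a 3-term vocabulary
doc_vec = [1, 0, 0]
print(and_result, not_result, cosine(query_vec, doc_vec))
```

The Boolean operations return unranked sets, whereas the cosine value gives a real number that can be used to order documents, which is the practical difference between the two frameworks.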

Conclusion

This study examined the development of information retrieval systems and classical and modern search models, and analyzed ways to increase their effectiveness. It was determined that modern search systems do not rely on keywords alone but also use methods such as semantic search, machine learning, and user behavior analysis. The results obtained show that ontological models and semantic technologies can significantly improve the quality of information retrieval. Future research can be directed towards multimedia indexing and more accurate modeling of user intent.

 

References:

  1. Bogatyrev, M. Yu. Application of conceptual graphs in digital library support systems / M. Yu. Bogatyrev, V. E. Latov, I. A. Stolbovskaya // Proceedings of the 9th All-Russian Scientific Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections". - Pereslavl, 2007. - Vol. 2. - P. 104-110. (In Russian)
  2. Gubin, M. Yu. Methods for creating semantic meta-descriptions of documents using semantic networks, frame models, and frequency characteristics / M. Yu. Gubin, V. V. Razin, A. F. Tuzovsky // Proceedings of Tomsk State University of Control Systems and Radioelectronics. - 2010. - Vol. 2, No. 2. - P. 227-229. (In Russian)
  3. Garey, M. Computers and Intractability / M. Garey, D. Johnson. - Moscow: Mir, 1982. - 192 p. (Russian edition)
  4. Karpenko, A. P. Assessing the relevance of documents in an ontological knowledge base // Electronic scientific and technical journal "Science and Education". - URL: http://technomag.edu.ru/doc/157379.html (accessed: 23.07.2012). (In Russian)
  5. Knuth, D. The Art of Computer Programming / D. Knuth. - Moscow: Williams, 2000. - Vol. 3. - 703 p. (Russian edition)

Information about the authors

Associate Professor, Odlar Yurdu University, Azerbaijan, Baku

Master of the Odlar Yurdu University, Azerbaijan, Baku

The journal is registered by the Federal Service for Supervision of Communications, Information Technology and Mass Media (Roskomnadzor), registration number ЭЛ №ФС77-54434 of 17.06.2013.
Founder of the journal: ООО «МЦНО».
Editor-in-chief: Zvezdina Marina Yuryevna.