VECTOR EMBEDDINGS FOR ENHANCED EFFICIENCY IN FULL-TEXT SEARCH WITH ARTIFICIAL INTELLIGENCE TECHNOLOGIES

Ivanouski A.
To cite:
Ivanouski A. VECTOR EMBEDDINGS FOR ENHANCED EFFICIENCY IN FULL-TEXT SEARCH WITH ARTIFICIAL INTELLIGENCE TECHNOLOGIES // Universum: Technical Sciences: electronic scientific journal. 2024. 12(129). URL: https://7universum.com/ru/tech/archive/item/18965 (accessed: 14.03.2025).
DOI - 10.32743/UniTech.2024.129.12.18965

 

ABSTRACT

This work analyzes the use of vector embeddings in full-text search engines to improve search accuracy and deepen the understanding of the meaning of texts. It examines modern methods for converting text data into vector representations that capture both grammatical structures and semantic relationships.

Attention is also paid to the adoption of transformer-based models, such as BERT and GPT, which can improve search quality. Ways of speeding up computations through nearest-neighbor search algorithms, including FAISS and HNSW, which provide high data-processing speed, are also considered.

The methodological basis of the work was systems analysis, supplemented by a comparison of practical cases and modeling of the potential effects of introducing such analytics. Scientific papers, industry reports, and specific examples of technology implementation in companies, with data published on the Internet, were studied.

The results of the work are useful for specialists involved in data processing, artificial intelligence, and information retrieval. Developers of search engines and corporate platforms, including solutions for semantic indexing, will also benefit from the article.

The study not only highlights the importance of vector embeddings for modern technologies but also describes specific strategies to eliminate problems such as scaling complexity and insufficient interpretability of models. Thus, the work offers effective recommendations for solving urgent problems.


 

Keywords: vector embeddings, full-text search, artificial intelligence, semantic search, transformer models, nearest neighbor search, scalability.


 

Introduction 

With the advancement of artificial intelligence technologies and natural language processing (hereafter referred to as NLP), existing search approaches have undergone significant changes. Traditional systems based on keyword matching are becoming obsolete in modern contexts due to the increasing complexity of data and the demand for meaningful interpretation of information. These methods have been replaced by vector embeddings, which are mathematical models that transform text into multidimensional vectors, capturing not only grammatical structure but also semantic relationships. 

The technological breakthrough in the development of vector representations was made possible by the adoption of deep learning algorithms. Solutions such as Word2Vec, GloVe, and transformers like BERT and GPT have revolutionized not only the quality of single-word processing but also the analysis of context in entire texts, including their hidden semantic connections. These models are particularly useful for processing complex queries, especially those involving ambiguous or polysemantic terms. 

One of the key features of these systems is their suitability for working with large volumes of data: they can address tasks related to the optimization and scalability of search operations. However, their integration comes with challenges, such as high computational resource requirements and the need to refine algorithms for efficient information processing. 

The relevance of this study lies in examining the role of vector embeddings in the evolution of full-text search. The purpose of the work is to analyze the principles behind the formation of vector representations, their impact on search result relevance, and the advancements that have enabled their application in modern systems.

Materials and Methods 

This study employed scientific methods of data analysis and synthesis, including a review and comparative analysis of existing approaches. Methods of content analysis of scientific publications were used to systematize theoretical and practical aspects, which allowed for the identification of key trends and advancements in semantic search. The empirical part relied on experimental modeling methods, including performance testing of algorithms on real datasets.

One of the key directions in this field is the use of deep generative models to improve search performance. In the study by Zhang et al. [1], a model-enhanced vector index was proposed, combining generative modeling methods with indexing technologies. This approach improves search accuracy while maintaining query service performance.

In the context of semantic search, sentence comparison plays a significant role, as examined in the study by Zoupanos S. et al. [2]. Their work compared embedding algorithms and evaluated systems such as FAISS and Elasticsearch for semantic search tasks. The results demonstrated that while the outcomes were similar, the choice of an appropriate approach depends on the specific task and processing speed requirements.

An important aspect is improving code embeddings for search-related tasks. This topic was explored in the study by Neelakantan A. et al. [3], which proposed a method of contrastive pretraining. This approach enhanced results in semantic search tasks, confirming its effectiveness in improving quality.

The application of hashing to accelerate computations in search systems was discussed in the study by Wang M. et al. [4]. This research presented a framework for working with incomplete knowledge in graphs using binarized embeddings. The proposed approach reduces operation time, which is essential when handling large datasets.

The use of embedding algorithms for binary code retrieval was the subject of the study by Yang J. et al. [5]. The authors proposed an embedding scheme for binary code search based on tensor decomposition using singular values, which improved search efficiency.

In the study by Yoon J. et al. [6], a method was proposed to adapt embeddings from large language models to improve search performance. This demonstrated advantages in accuracy with reduced computational costs.

In terms of enhancing search efficiency and reducing data processing costs, the work by Xiao S. et al. [7] provided significant insights. The authors introduced the Distill-VQ methodology, which combines vector quantization learning with knowledge distillation methods. This approach improved search speed and embedding accuracy across various applications.

Gan Y. et al. [8] developed a binary embedding engine for search systems, which enhanced both accuracy and resource efficiency in large-scale applications.

The practical part of the study relied on information available on the website medium.com [9], which describes the experiences of companies and their outcomes from using vector embeddings to enhance full-text search through artificial intelligence technologies.

Thus, the analysis of the information reflected in these sources confirms that current advancements in information retrieval continue to evolve, enabling improvements in accuracy, search speed, and reductions in processing costs across various fields.

Results and Discussion 

Vector embeddings map words, sentences, and entire documents to points in a multidimensional space. The defining characteristic of such vectors is their ability to reflect not only syntactic but also semantic relationships between words. They group similar objects, so that words with related meanings, like "king" and "tsar" or "dog" and "hound," are located close to each other in vector space. This approach fundamentally differs from traditional methods, where "king" and "tsar" are treated as separate elements, regardless of their semantic similarity [1].
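This geometric intuition can be illustrated with a minimal sketch. The vectors below are made-up toy values, not the output of a real embedding model; the point is only that cosine similarity places semantically related words closer together:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative 3-dimensional "embeddings" (hypothetical values):
king = [0.90, 0.80, 0.10]
tsar = [0.85, 0.75, 0.20]
banana = [0.10, 0.05, 0.90]

# Related words end up closer in vector space than unrelated ones.
assert cosine_similarity(king, tsar) > cosine_similarity(king, banana)
```

In a real system the vectors would come from a trained model such as Word2Vec or BERT, but the distance computation over them is exactly this.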

The emergence of deep learning models, such as Word2Vec, GloVe, FastText, BERT, and GPT, has expanded the capabilities of representations. These models can generate vectors not only for individual words but also for entire documents. 

The use of artificial intelligence-based algorithms facilitates the selection process. Tools like BERT and GPT provide a more effective understanding of text compared to traditional methods, making them highly relevant for extracting information from heterogeneous data systems. This approach not only improves search quality but also interprets user intent, even when the query deviates from explicit terms found in databases. For example, a query like "the impact of climate change on agriculture" may lack specific keywords, but the system can identify related terms, such as "global warming" or "effects on agronomy," thereby broadening the scope of search results [3]. 

Another notable feature of these systems is the enhancement of document ranking processes, enabling documents to be sorted based on their semantic relevance to the user query, thereby improving efficiency. For faster retrieval, approximate search indexing methods are employed, reducing the complexity of locating information within documents. Algorithms such as FAISS and HNSW accelerate data retrieval, delivering the required speed and accuracy. 
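The step these indexes accelerate can be shown with an exact nearest-neighbor search in NumPy. FAISS and HNSW approximate precisely this top-k retrieval, trading a small amount of accuracy for sub-linear query time; the sketch below is an illustration of the task, not the libraries' actual API:

```python
import numpy as np

def build_index(doc_vectors):
    """Normalize document vectors so the inner product equals cosine similarity."""
    m = np.asarray(doc_vectors, dtype=np.float32)
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def search(index, query, k=3):
    """Exact brute-force top-k retrieval over all documents.
    ANN indexes such as FAISS or HNSW approximate this step
    without scanning every vector."""
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q                    # cosine similarity to every document
    top = np.argsort(-scores)[:k]         # indices of the k best matches
    return top, scores[top]
```

For example, with three toy document vectors, a query pointing along the first axis retrieves the first and third documents ahead of the second.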

Moreover, the use of distributed systems allows for the processing and storage of embeddings while ensuring minimal latency in search queries. Cloud-based infrastructures enable scalable searches, which is critical for large-scale applications with high traffic loads [6]. 

Despite the advantages of these models, several challenges must be addressed. One of the primary issues is the cost associated with both the creation and application of vector representations. Training transformers and similar systems require substantial resources, which may not be justified for smaller projects. Additionally, storing and processing embeddings demand significant memory and computational power, limiting their use in certain applications. 

Another challenge is interpretability. Deep neural network systems like BERT and GPT deliver high accuracy, yet their functioning is often perceived as a "black box." In fields such as medicine, law, or finance, transparency in search results is crucial. Understanding why and how a specific document was selected is important for informed decision-making [4]. 

Several practical examples from companies are considered [9]. Spotify combines keyword-based and semantic search to provide users with relevant podcast episode results. For instance, the team highlighted the limitations of a keyword search with the query "electric vehicles climate impact," which returned zero results, even though relevant podcast episodes exist in Spotify's library. To improve retrieval, Spotify implemented Approximate Nearest Neighbor (ANN) for faster and more relevant podcast searches (Fig. 1). 

 

Figure 1. Current search for the query "electric vehicles: climate impact" on Spotify [9]. 

 

Spotify utilizes the Universal Sentence Encoder CMLM model to create vector representations. Its ability to process texts in multiple languages enables efficient handling of Spotify’s international podcast library. This model is particularly suited for sentence-level tasks. Among the alternatives considered was the BERT model; however, it is focused on word-level processing and supports only English-language data. 

The process of creating vector embeddings involves transforming query texts and podcast episode metadata, such as titles and descriptions. To measure the similarity between texts, the cosine distance metric is employed, which quantifies their semantic closeness. 

Training the Universal Sentence Encoder CMLM model is based on data pairs illustrating successful search examples. This process incorporated methodologies described in studies, including Dense Passage Retrieval for Open-Domain Question Answering (DPR) and Que2Search. Both manually crafted and synthetic negative examples were used, as well as actual queries. This diversity of data contributed to improving the model’s accuracy. 
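One common formulation of such pair-based training is a triplet-style loss, which pushes a query embedding closer to a relevant (positive) item than to an irrelevant (negative) one. The sketch below is a generic illustration under that assumption, not Spotify's published training code:

```python
import numpy as np

def triplet_loss(query, positive, negative, margin=0.2):
    """Loss is zero once the query is at least `margin` more similar
    (by cosine) to the positive example than to the negative one."""
    def cos(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(query, positive) + cos(query, negative))
```

Manually crafted and synthetic negatives, as described in the article, would supply the `negative` argument; gradient descent on this quantity pulls matching query-episode pairs together in the embedding space.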

The integration of vector search into the recommendation system consisted of several stages: 

1. Indexing Episodes: Vector representations are created offline and indexed using the Vespa platform. This system combines ANN-based search with additional filtering based on metadata, such as popularity metrics. 

2. Query Processing: In real time, user queries are transformed via Google Cloud Vertex AI. This platform accelerates transformer-model computations using GPUs and caches query results to reduce costs. After the query vector is created, relevant episodes are retrieved through Vespa. 

Semantic search does not completely replace keyword search. In cases requiring exact matches of episode or podcast titles, it proves insufficient. To address this limitation, Spotify employs a hybrid approach. Search through Vespa is complemented by the traditional Elasticsearch system, and the final results are re-ranked (Fig. 2). 
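A hybrid scheme of this kind can be sketched as a simple score-blending re-ranker. The `alpha` weight and the max-based normalization below are illustrative assumptions, not Spotify's actual ranking logic:

```python
def hybrid_rerank(keyword_hits, vector_hits, alpha=0.5):
    """Blend keyword scores (e.g. BM25 from Elasticsearch) with vector
    similarity scores; a document found by only one retriever gets 0.0
    for the missing component. Returns doc ids, best first."""
    def normalize(hits):
        if not hits:
            return {}
        top = max(hits.values())
        return {doc: s / top for doc, s in hits.items()} if top else hits

    kw, vec = normalize(keyword_hits), normalize(vector_hits)
    docs = set(kw) | set(vec)
    scored = {d: alpha * kw.get(d, 0.0) + (1 - alpha) * vec.get(d, 0.0)
              for d in docs}
    return sorted(scored, key=scored.get, reverse=True)
```

A document that both retrievers agree on rises above one that only a single retriever found, which is the intended effect of re-ranking the merged result lists.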

 

Figure 2. Workflow of vector search in Spotify [9]. 

 

eBay has introduced a search technology based on image analysis, enabling the system to find products that match queries based on appearance or characteristics. This functionality is achieved through a model capable of processing textual, visual, audio, and video information. The system’s architecture is designed to perform tasks such as data prediction or classification.

The algorithm’s operation is based on combining textual and visual data. A convolutional neural network, ResNet-50, converts images into vector representations. For analyzing textual descriptions, the deep learning model BERT is employed, offering precise interpretation of natural language information. The resulting data is transformed into multidimensional vectors that integrate product attributes.
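The source does not specify how eBay combines the two modalities into one vector; concatenation followed by L2 normalization is one minimal scheme that yields a joint embedding usable with cosine similarity. The function below is purely illustrative:

```python
import numpy as np

def fuse_embeddings(image_vec, text_vec):
    """Concatenate an image embedding (e.g. from ResNet-50) with a text
    embedding (e.g. from BERT), then L2-normalize the joint vector so
    it can be indexed and compared with cosine similarity."""
    joint = np.concatenate([np.asarray(image_vec, dtype=np.float32),
                            np.asarray(text_vec, dtype=np.float32)])
    return joint / np.linalg.norm(joint)
```

Production systems often use a learned fusion layer instead of plain concatenation, but either way the output is a single multidimensional vector integrating the product's attributes.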

This approach creates a comprehensive representation of each item by analyzing its parameters and adapting to user queries. Ultimately, this simplifies the search process and enhances user interaction (Fig. 3). 

 

Figure 3. Representation of the multimodal embedding model used at eBay [9]. 

 

Once the multimodal model is trained using a dataset of paired images and titles, it is deployed in the site’s search functionality. Due to the vast number of listings on eBay, data is uploaded in batches to HDFS. This platform uses Apache Spark to extract and store images, along with relevant fields needed for further listing processing, including the creation of listing embeddings. These embeddings are subsequently published in a columnar storage system, such as HBase, which is suited for large-scale data aggregation. They are then indexed and served through Cassini, eBay’s search engine (Fig. 4). 

 

Figure 4. Workflow of vector-based search on eBay [9]. 

 

Apache Airflow is used to manage data processing workflows and handle large-scale tasks. Its capabilities include integration with technologies like Spark, Hadoop, and Python, facilitating the development of machine-learning solutions tailored to various scenarios. 

Visual search allows users to select furniture and interior items based on stylistic features. eBay plans to expand this approach to other product categories, enabling users to create cohesive interior designs aligned with specific themes.

Airbnb has developed algorithms to enhance search functionality and listing personalization. Vector representations of data are utilized to analyze approximately 4.5 million active listings and 800 million search sessions. Relevant objects are subsequently clustered into a single segment of vector space, while others are positioned at a distance. For these purposes, a fixed embedding dimensionality of 32 was chosen.

Testing of the new model, including comparisons with previously used approaches, demonstrated a 21% increase in click-through rate and a 4.9% growth in booking share. Additionally, a personalization system was implemented, which integrates data on recent user activity into current queries via Kafka. 

Doordash, in turn, employs vector embeddings to improve store recommendations. The store2vec model analyzes user interactions with stores over a specific period. This method identifies connections between stores that were not previously evident. For example, if a customer has ordered from 4505 Burgers and New Nagano Sushi, the system identifies similar establishments, such as Kezar Pub and Wooden Charcoal Korean Village BBQ (Fig. 5). 

 

Figure 5. Example of vector search on Doordash, adapted from the blog Personalized Store Feed with Vector Embeddings [9]. 

 

Doordash integrated the store2vec distance function as one of the features in its comprehensive recommendation and personalization model. By employing vector-based search, the company achieved a 5% increase in click-through rates. The team is also experimenting with new models, optimizing existing ones, and incorporating real-time user activity data into the system [9]. 

Thus, vector embeddings, which form the foundation of modern search methods, are transforming approaches to organizing full-text search. They not only improve the accuracy and relevance of search results but also address scalability challenges when handling large datasets. Combined with artificial intelligence, vector representations enable searches based on meaning rather than just keywords, making search processes more efficient. 

Conclusion 

Vector embeddings present promising opportunities for organizing full-text search, introducing innovative methods for handling large datasets. This study analyzed advanced technologies, including transformer-based architectures like BERT and GPT, as well as algorithms supporting approximate nearest neighbor search and distributed computing systems capable of processing queries efficiently.

The findings demonstrated that implementing vector representations enhances search accuracy by accounting for the semantic structure of language and contextual information in queries. Furthermore, the developed solutions focus on achieving scalability and optimizing the allocation of computational resources, which is essential for corporate platforms and applied systems. Thus, vector embeddings have become a key tool for improving the performance of search systems.

 

References:

  1. Zhang H. et al. Model-enhanced vector index // Advances in Neural Information Processing Systems. – 2024. – Vol. 36.
  2. Zoupanos S. et al. Efficient comparison of sentence embeddings // Proceedings of the 12th Hellenic Conference on Artificial Intelligence. – 2022. – pp. 1-6.
  3. Neelakantan A. et al. Text and code embeddings by contrastive pre-training // arXiv preprint arXiv:2201.10005. – 2022.
  4. Wang M. et al. Efficient search over incomplete knowledge graphs in binarized embedding space // Future Generation Computer Systems. – 2021. – Vol. 123. – pp. 24-34.
  5. Yang J. et al. Codee: A tensor embedding scheme for binary code search // IEEE Transactions on Software Engineering. – 2021. – Vol. 48. – No. 7. – pp. 2224-2244.
  6. Yoon J. et al. Search-Adaptor: Text Embedding Customization for Information Retrieval // arXiv preprint arXiv:2310.08750. – 2023.
  7. Xiao S. et al. Distill-VQ: Learning retrieval-oriented vector quantization by distilling knowledge from dense embeddings // Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. – 2022. – pp. 1513-1523.
  8. Gan Y. et al. Binary Embedding-based Retrieval at Tencent // Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. – 2023. – pp. 4056-4067.
  9. 5 Use Cases for Vector Search. Available at: https://medium.com/rocksetcloud/5-use-cases-for-vector-search-f9316b158361 (accessed: 12.11.2024).

 


Information about the author

Lead Software Engineer, Contexxt Ltd, Brest, Belarus

The journal is registered with the Federal Service for Supervision of Communications, Information Technology and Mass Media (Roskomnadzor), registration number ЭЛ №ФС77-54434 of 17.06.2013.
Founder of the journal: MTsNO LLC.
Editor-in-chief: Marina Yuryevna Zvezdina.