Master’s student, Kazakh-British Technical University (KBTU), Almaty, Kazakhstan
COMPARATIVE EVALUATION OF SPARSE, DENSE, AND HYBRID INFORMATION RETRIEVAL METHODS WITH LLM-AUTHORED QUERIES ON A DOMAIN-SPECIFIC NLP CORPUS
ABSTRACT
This paper presents a comparative evaluation of eight information retrieval methods, spanning sparse, dense, and hybrid paradigms, on a domain-specific Wikipedia corpus of 1,856 NLP-related articles. Departing from conventional query construction practices, we employ a large language model acting as a domain expert to author 424 naturalistic research queries of four types: factoid, conceptual, comparative, and application. Relevance judgments are assigned through a topic-cluster schema with graded scores (0/1/2), constructed without relying on external search APIs. Statistical analysis uses bootstrap confidence intervals on mean nDCG and Bonferroni-corrected pairwise significance tests. Reciprocal Rank Fusion of BM25 and E5-base achieves the top nDCG@10 of 0.416, significantly outperforming all sparse baselines. Cross-encoder reranking underperforms its expected position because of first-stage recall constraints. Every neural method except the BM25-based cross-encoder cascade significantly outperforms both sparse baselines, confirming the benefit of semantic representations for conceptual domain-specific retrieval.
ANNOTATION
This article presents a comparative evaluation of eight information retrieval methods (sparse, dense, and hybrid) on a domain-specific corpus of 1,856 Wikipedia articles on natural language processing. A language model acting as a domain expert formulated 424 queries of four types (factoid, conceptual, comparative, and application). Relevance labeling is based on topic clusters with graded scores (0/1/2). Reciprocal Rank Fusion (RRF) of BM25 and E5-base achieves the highest nDCG@10 of 0.416, significantly outperforming all sparse baselines. Neural methods significantly outperform the sparse ones, confirming the advantage of semantic representations for conceptual queries.
Keywords: information retrieval, BM25, dense retrieval, cross-encoder reranking, reciprocal rank fusion, nDCG, LLM-authored queries, graded relevance.
Keywords (in Russian): information retrieval, BM25, dense retrieval, cross-encoder, reciprocal rank fusion, nDCG, language-model-authored queries, graded relevance.
Introduction
The shift from lexical to semantic retrieval is one of the defining developments in modern information retrieval. Early systems relied on exact term matching, progressively refined through probabilistic weighting and learning-to-rank techniques. The emergence of transformer-based language models introduced a fundamentally different retrieval paradigm, where queries and documents are mapped into continuous vector spaces and relevance is measured by geometric proximity rather than vocabulary overlap. This transition is well documented but remains incompletely understood in domain-specific settings, where the interplay between lexical precision and semantic understanding takes on particular importance.
A. Sparse Retrieval Foundations
The vector space model [17] established the foundational vocabulary-weighted representation of documents and queries, with TF-IDF scoring providing a practical mechanism for discriminating between informative and common terms. Despite its conceptual simplicity, TF-IDF treats each document as an unordered bag of words, unable to capture synonymy, polysemy, or any contextual meaning [21]. The probabilistic BM25 model [16] addressed some of these limitations through term frequency saturation and document length normalization, producing a ranking function that balances term importance with document structure. BM25 has retained its status as the default sparse baseline across decades of IR research, a durability explained by its combination of computational efficiency and consistent empirical performance [21]. Learning-to-rank methods [2, 7] introduced supervised optimization of ranking functions but require labeled training data and do not resolve the fundamental vocabulary mismatch between queries and semantically related documents.
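For reference, the two mechanisms can be made explicit. A common textbook form of the Okapi BM25 ranking function (shown here for completeness; exact parameterizations vary across implementations) is:

```latex
% Common textbook form of the Okapi BM25 ranking function
\mathrm{BM25}(q, d) = \sum_{t \in q}
  \log\frac{N - n_t + 0.5}{n_t + 0.5} \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
```

Here f(t, d) is the frequency of term t in document d, |d| the document length, avgdl the average document length, N the collection size, and n_t the number of documents containing t; k₁ governs term-frequency saturation and b governs length normalization.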
B. Dense and Neural Retrieval
Dense retrieval systems encode queries and documents into fixed-dimensional vectors, with similarity measured through dot product or cosine distance. Karpukhin et al. [10] demonstrated that dual BERT encoders trained on question-answer pairs could support fast similarity search when document embeddings are pre-indexed with FAISS [9]. This architecture achieves strong recall on semantically related content that shares no surface-level vocabulary with the query. More compact SentenceTransformer models [15], such as all-MiniLM-L6-v2 and all-mpnet-base-v2 [18], deliver efficient dense retrieval optimized for semantic textual similarity tasks, while E5-base [20] employs weakly supervised contrastive pre-training on large-scale text pairs to produce embeddings that generalize well across retrieval benchmarks [24].
C. Hybrid Retrieval and Re-ranking
Recognizing that sparse and dense methods capture complementary aspects of relevance, hybrid retrieval systems combine both signals. Reciprocal Rank Fusion (RRF) [3] merges rankings from multiple retrievers by assigning each document a score inversely proportional to its rank in each component list, without requiring score normalization or additional model training. An alternative architecture chains retrieval with neural re-ranking: a first-stage retriever fetches a candidate pool, which a cross-encoder then re-scores by jointly encoding query and document [14]. Cross-encoders achieve substantially higher precision than bi-encoders, but their effectiveness is bounded by the recall of the first stage — a constraint known as the cascade recall ceiling [14]. Learned sparse models such as SPLADE [5, 6] preserve index compatibility while injecting transformer semantics into sparse term weights. Mandikal and Mooney [13] confirm the advantage of hybrid pipelines for scientific document retrieval, while Zhang et al. [23] extend hybrid retrieval to graph-based approximate nearest neighbor search. Recent work by Mackie et al. [12] shows that generative feedback signals improve retrieval effectiveness across both sparse and dense systems. Within LLM-based architectures, Zeng et al. [22] find that sparse retrieval consistently outperforms dense methods across in-domain and out-of-domain benchmarks as model scale increases.
D. Evaluation Methodology in IR
Normalized Discounted Cumulative Gain (nDCG) [8] is the standard primary metric for systems with graded relevance, weighting gains logarithmically by rank. Binary ground truth collapses nDCG into a form of precision and eliminates its ability to discriminate between highly and partially relevant documents. External search APIs, previously used to generate ground truth by treating returned rankings as relevant sets, conflate the evaluation signal with the very ranking logic under study, introducing circular bias [19]. Bootstrap-based confidence intervals [4] on mean metric scores provide rigorous uncertainty estimates without parametric assumptions, while Bonferroni correction [1] guards against inflation of Type I error across multiple pairwise comparisons. The use of large language models as relevance assessors correlates well with human expert judgments on structured tasks [25], motivating our approach of using an LLM both for query authoring and relevance assignment.
E. Motivation and Contributions
Despite the growing body of work on individual retrieval paradigms, systematic comparative studies that simultaneously address naturalistic query construction, graded relevance judgment, and robust statistical evaluation remain limited in domain-specific settings. This study addresses that gap with the following contributions: (1) evaluation of eight retrieval methods spanning all three paradigms under a unified experimental framework; (2) 424 LLM-authored research queries of four types, constructed without reference to article titles or term overlap; (3) a paradigm-neutral graded relevance scheme based on expert topic clustering; (4) bootstrap confidence intervals and Bonferroni-corrected significance tests for all pairwise comparisons; and (5) an empirical demonstration of the first-stage recall bottleneck explaining why RRF outperforms cross-encoder cascades.
Materials and Methods
A. Corpus
The retrieval corpus consists of 1,856 Wikipedia articles covering a broad range of topics, stored in JSONL format with stable document identifiers, titles, and full text. The entire corpus serves as the retrieval pool for all queries — not merely the NLP-relevant subset — ensuring that retrieval methods must discriminate among a diverse collection rather than selecting from a narrowly filtered domain. Among the 1,856 articles, 106 cover NLP, machine learning, and information retrieval topics; these form the source articles for query construction. Article text was preserved without stemming or stopword removal at the indexing stage to allow retrieval models to operate on original content.
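As a concrete illustration, the corpus can be loaded directly from the JSONL file. This is a minimal sketch; the field names ("id", "title", "text") and the file path are assumptions rather than the authors' actual schema, chosen only to match the description of stable identifiers, titles, and full text.

```python
import json

def load_corpus(path="corpus.jsonl"):
    """Read the JSONL corpus line by line (field names and path are assumed)."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            docs.append({"id": rec["id"], "title": rec["title"], "text": rec["text"]})
    return docs

docs = load_corpus()
```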
B. Query Generation
Query generation was performed by a large language model (Claude, Anthropic) acting in the role of an NLP and IR domain expert. For each of the 106 source articles, four queries were composed — one per type: (1) factoid: a specific, answerable question about a defined concept or measure (e.g., “What matrix factorization technique does LSA use to capture latent term-document associations?”); (2) conceptual: a question targeting the motivation or theoretical underpinning of the topic (e.g., “Why does grounding generation in retrieved documents reduce hallucination in language models?”); (3) comparative: a question asking how two methods or representations differ (e.g., “How do Word2Vec embeddings differ from GloVe embeddings in how they capture word co-occurrence?”); (4) application: a question about practical use cases (e.g., “How is NER used as a preprocessing step in information extraction pipelines?”).
Queries were written without referencing article titles or paraphrasing first sentences, ensuring that term-overlap retrievers receive no artificial advantage. The resulting query set of 424 items is balanced across types (106 per type) and covers all 106 source articles. This contrasts with prior approaches that used template-based keyword extraction or prompted LLMs to paraphrase short article excerpts, both of which tend to produce queries with high lexical overlap with the source document, artificially favoring sparse retrievers.
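For illustration, a single query record might be represented as follows; the field names and identifiers are hypothetical, chosen only to mirror the description above (query text, query type, and the source article the query was written about).

```python
# Illustrative structure of one of the 424 query records (field names are
# hypothetical, not the authors' actual schema); the set is balanced at 106
# records per type.
example_query = {
    "qid": "q_lsa_factoid",
    "type": "factoid",        # factoid | conceptual | comparative | application
    "text": ("What matrix factorization technique does LSA use to capture "
             "latent term-document associations?"),
    "source_doc": "latent_semantic_analysis",  # the article the query targets (rel = 2)
}
```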
C. Ground Truth Construction
Relevance judgments were assigned using a topic-cluster schema designed by a domain expert, without reference to any retrieval model’s output. The 106 source articles were grouped into 20 thematic clusters representing coherent areas of NLP or IR research. Example clusters include transformer architectures, GPT-family language models, information retrieval core methods, and dense retrieval approaches. Relevance scores follow a three-level graded scale: rel = 2 denotes the source article the query was written about (one per query); rel = 1 denotes articles belonging to the same or an explicitly adjacent cluster, as defined by domain knowledge (for instance, a query about transformer architectures receives rel = 1 for GPT-family models, recurrent networks, and dense retrieval articles); rel = 0 denotes all other articles, implicitly non-relevant.
Across 424 queries this schema yields 424 rel = 2 assignments and 11,344 rel = 1 assignments, with an average of 26.75 partially relevant documents per query. All queries have at least one partially relevant document. The key advantage over TF-IDF cosine similarity as ground truth is paradigm neutrality: TF-IDF-based relevance inadvertently favors sparse retrievers because the relevance signal and the retrieval signal are computed by the same mechanism. Expert topic clustering evaluates whether a retrieval system finds conceptually related articles, independent of the vocabulary it uses.
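A minimal sketch of how the graded judgments follow mechanically from the cluster schema is shown below, assuming a mapping from article identifiers to cluster labels and an adjacency relation between clusters; all variable and field names are illustrative.

```python
def build_qrels(queries, article_cluster, adjacent_clusters):
    """Derive graded relevance judgments from the topic-cluster schema.
    queries:            list of dicts with "qid" and "source_doc" fields
    article_cluster:    doc id -> cluster label (for the 106 source articles)
    adjacent_clusters:  cluster label -> set of explicitly adjacent cluster labels
    All identifiers here are illustrative, not the authors' actual ones."""
    qrels = {}
    for q in queries:
        src = q["source_doc"]
        related = {article_cluster[src]} | adjacent_clusters.get(article_cluster[src], set())
        rels = {src: 2}                           # rel = 2: the source article itself
        for doc_id, cluster in article_cluster.items():
            if doc_id != src and cluster in related:
                rels[doc_id] = 1                  # rel = 1: same or adjacent cluster
        qrels[q["qid"]] = rels                    # everything else is implicitly rel = 0
    return qrels
```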
D. Retrieval Methods
Eight retrieval methods are evaluated across three paradigms. Sparse methods: (1) TF-IDF — documents and queries are represented as sparse TF-IDF vectors (w(t,d) = tf(t,d) × log(N/df(t))); retrieval scores are cosine similarities, implemented via scikit-learn’s TfidfVectorizer. (2) BM25 [16] — Okapi BM25 with k₁ = 1.5 and b = 0.75, incorporating term frequency saturation and document length normalization; implemented via the rank-bm25 library.
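The two sparse baselines can be reproduced in a few lines with the cited libraries. The sketch below assumes the corpus has been loaded into `docs` as above and uses lowercased whitespace tokenization for BM25, which the paper does not specify.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi

texts = [d["text"] for d in docs]                 # corpus loaded as above

# (1) TF-IDF vectors with cosine-similarity ranking
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(texts)

def tfidf_search(query, k=10):
    sims = cosine_similarity(vectorizer.transform([query]), doc_matrix).ravel()
    return np.argsort(-sims)[:k]

# (2) Okapi BM25 with k1 = 1.5 and b = 0.75 (lowercased whitespace tokenization assumed)
bm25 = BM25Okapi([t.lower().split() for t in texts], k1=1.5, b=0.75)

def bm25_search(query, k=10):
    scores = bm25.get_scores(query.lower().split())
    return np.argsort(-scores)[:k]
```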
Dense methods: (3) MiniLM — the all-MiniLM-L6-v2 model [15] encodes queries and documents into 384-dimensional embeddings trained via knowledge distillation; retrieval by dot product with FAISS indexing [9]. (4) E5-base [20] — the intfloat/e5-base model produces 768-dimensional embeddings via weakly supervised contrastive pre-training on over 1.3 billion text pairs; queries are prefixed with “query:” and documents with “passage:”. (5) MPNet [18] — the all-mpnet-base-v2 model employs masked and permuted pre-training and SentenceTransformer fine-tuning, producing 768-dimensional embeddings that consistently rank among the top-performing bi-encoders on semantic similarity benchmarks.
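A minimal sketch of the E5-base dense retrieval pipeline is shown below, reusing the `texts` list from the previous sketch. It combines sentence-transformers encoding with the "query:"/"passage:" prefixes described above and a FAISS inner-product index over normalized embeddings; the batch size and other encoding settings are assumptions, and the MiniLM and MPNet retrievers follow the same pattern with their respective model names and no prefixes.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

e5 = SentenceTransformer("intfloat/e5-base")      # 768-dimensional embeddings

# Encode documents with the "passage:" prefix required by E5 and build a FAISS
# inner-product index; with L2-normalized vectors this is cosine similarity.
doc_emb = e5.encode(["passage: " + t for t in texts],
                    batch_size=64, normalize_embeddings=True,
                    convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)

def e5_search(query, k=10):
    q_emb = e5.encode(["query: " + query], normalize_embeddings=True,
                      convert_to_numpy=True).astype("float32")
    scores, ids = index.search(q_emb, k)
    return ids[0], scores[0]
```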
Hybrid methods: (6) BM25 + CrossEncoder — a two-stage cascade: BM25 retrieves the top 100 candidates, which are re-ranked by the cross-encoder/ms-marco-MiniLM-L-12-v2 model [14] performing full self-attention over the concatenated query-document pair. (7) E5 + CrossEncoder — the same cross-encoder reranker applied to the top-100 candidates from E5-base, testing whether a semantically stronger first stage raises the recall ceiling. (8) RRF(BM25 + E5-base) [3] — Reciprocal Rank Fusion combines the ranked lists of BM25 and E5-base: RRF(d) = Σᵣ 1/(k + rankᵣ(d)), with k = 60. Documents absent from a ranker’s list receive a rank equal to corpus size.
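The two hybrid mechanisms reduce to short routines over the ranked lists produced above. The sketch below implements RRF with k = 60 and the missing-document convention described in the text, plus cross-encoder rescoring of a first-stage candidate pool; it builds on the `texts` list and search functions from the previous sketches, and the helper names are our own.

```python
import numpy as np
from sentence_transformers import CrossEncoder

def rrf_fuse(ranked_lists, n_docs, k=60):
    """Reciprocal Rank Fusion: RRF(d) = sum_r 1 / (k + rank_r(d)).
    Each ranked list holds document indices ordered best-first; documents absent
    from a list are assigned a rank equal to the corpus size, as in the paper."""
    scores = {}
    for ranking in ranked_lists:
        rank_of = {doc: r + 1 for r, doc in enumerate(ranking)}
        for doc in range(n_docs):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank_of.get(doc, n_docs))
    return sorted(scores, key=scores.get, reverse=True)

# Cross-encoder rescoring of a first-stage candidate pool (top 100 from BM25 or E5)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def cross_encoder_rerank(query, candidate_ids, k=10):
    pairs = [(query, texts[i]) for i in candidate_ids]
    ce_scores = reranker.predict(pairs)           # joint query-document scoring
    order = np.argsort(-np.asarray(ce_scores))[:k]
    return [candidate_ids[i] for i in order]

# Example fusion of the two component rankers over the full corpus:
# fused_top10 = rrf_fuse([bm25_ranking, e5_ranking], n_docs=len(texts))[:10]
```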
E. Evaluation Metrics and Statistical Analysis
Five metrics are computed at cutoffs K ∈ {1, 5, 10, 20}. Normalized Discounted Cumulative Gain (nDCG@K) [8] is the primary metric: DCG@K = Σᵏᵢ₌₁ relᵢ/log₂(i+1); nDCG@K = DCG@K/IDCG@K. Precision@K (P@K) measures the fraction of top-K results assigned rel ≥ 1. Recall@K measures the fraction of all judged relevant documents appearing in the top K. Hit@K is a binary indicator of at least one relevant document in the top K, averaged across queries. Mean Reciprocal Rank (MRR) is the reciprocal of the rank of the first relevant document, averaged across queries.
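A per-query nDCG@K implementation following the formula above is sketched below; `rels` is the graded judgment dictionary produced by the ground-truth schema (doc id mapped to 0/1/2), and documents absent from it are treated as rel = 0.

```python
import numpy as np

def ndcg_at_k(ranked_doc_ids, rels, k=10):
    """nDCG@K with graded relevance: DCG@K = sum_i rel_i / log2(i + 1),
    normalized by the DCG of the ideal ordering of all judged documents."""
    gains = np.array([rels.get(d, 0) for d in ranked_doc_ids[:k]], dtype=float)
    dcg = float(np.sum(gains / np.log2(np.arange(2, gains.size + 2))))
    ideal = np.sort(np.array(list(rels.values()), dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal / np.log2(np.arange(2, ideal.size + 2))))
    return dcg / idcg if idcg > 0 else 0.0
```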
For statistical analysis, a 95% bootstrap confidence interval on the mean nDCG@10 is estimated from 2,000 bootstrap resamples of per-query scores [4]. Pairwise significance is assessed via a paired bootstrap test (10,000 iterations). All 28 pairwise p-values are multiplied by 28 (Bonferroni correction [1]) before applying the α = 0.05 threshold, controlling the family-wise error rate. All experiments were executed on Google Colab with a T4 GPU. None of the retrieval models were fine-tuned on the evaluation corpus.
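The statistical procedure can be sketched as follows: a percentile bootstrap CI on the mean per-query score, a two-sided paired bootstrap test on per-query differences, and Bonferroni scaling of the resulting p-values. Resample counts match those reported; the specific resampling and centering choices are assumptions, since the paper does not spell out the test construction in full.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_ci(per_query_scores, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI on the mean of per-query nDCG@10 scores."""
    scores = np.asarray(per_query_scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def paired_bootstrap_p(scores_a, scores_b, n_iter=10_000):
    """Two-sided paired bootstrap test: resample per-query score differences and
    count how often the resampled mean deviates from the observed mean by at
    least the observed mean itself (equivalent to a null centred at zero)."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diffs.mean()
    extreme = 0
    for _ in range(n_iter):
        resampled = rng.choice(diffs, size=diffs.size, replace=True).mean()
        if abs(resampled - observed) >= abs(observed):
            extreme += 1
    return extreme / n_iter

def bonferroni(p, n_comparisons=28):
    """Family-wise correction over the 28 pairwise method comparisons."""
    return min(1.0, p * n_comparisons)
```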
Results and Discussion
A. Main Retrieval Performance
Table 1 reports retrieval performance across all eight methods at K = 10 with bootstrap confidence intervals on nDCG. Table 2 extends nDCG results to all evaluated cutoffs. The ranking of methods is largely consistent across K values: RRF leads at every cutoff from 5 to 20 and trails the leader (E5-base) by only 0.005 at K = 1, while the two worst methods (BM25 and TF-IDF) remain at the bottom throughout.
Table 1.
Retrieval performance at K = 10 with 95% bootstrap CI on nDCG
| Method | nDCG@10 | 95% CI | P@10 | Recall@10 | Hit@10 | MRR |
|---|---|---|---|---|---|---|
| RRF(BM25+E5-base) | 0.416 | [0.399, 0.433] | 0.306 | 0.113 | 0.967 | 0.762 |
| MPNet | 0.402 | [0.384, 0.420] | 0.288 | 0.106 | 0.955 | 0.752 |
| E5+CrossEncoder | 0.392 | [0.376, 0.409] | 0.276 | 0.101 | 0.960 | 0.740 |
| E5-base | 0.389 | [0.372, 0.405] | 0.268 | 0.098 | 0.948 | 0.745 |
| MiniLM | 0.389 | [0.371, 0.405] | 0.274 | 0.101 | 0.941 | 0.737 |
| BM25+CrossEncoder | 0.383 | [0.366, 0.399] | 0.263 | 0.096 | 0.955 | 0.739 |
| BM25 | 0.357 | [0.339, 0.374] | 0.261 | 0.097 | 0.958 | 0.711 |
| TF-IDF | 0.350 | [0.334, 0.366] | 0.251 | 0.092 | 0.953 | 0.669 |
Table 2.
nDCG@K across all cutoffs
| Method | nDCG@1 | nDCG@5 | nDCG@10 | nDCG@20 | Type | Rank |
|---|---|---|---|---|---|---|
| RRF(BM25+E5-base) | 0.563 | 0.492 | 0.416 | 0.345 | Hybrid | 1 |
| MPNet | 0.559 | 0.477 | 0.402 | 0.327 | Dense | 2 |
| E5+CrossEncoder | 0.554 | 0.464 | 0.392 | 0.319 | Hybrid | 3 |
| E5-base | 0.568 | 0.469 | 0.389 | 0.315 | Dense | 4 |
| MiniLM | 0.550 | 0.464 | 0.389 | 0.314 | Dense | 5 |
| BM25+CrossEncoder | 0.537 | 0.462 | 0.383 | 0.310 | Hybrid | 6 |
| BM25 | 0.455 | 0.422 | 0.357 | 0.286 | Sparse | 7 |
| TF-IDF | 0.449 | 0.409 | 0.350 | 0.290 | Sparse | 8 |
B. Statistical Significance
Table 3 presents Bonferroni-corrected pairwise comparisons for all pairs where the outcome is noteworthy. Pairs not shown are not significantly different after correction.
Table 3.
Significant pairwise differences (Bonferroni-corrected, α = 0.05, nDCG@10)
| Winner | Loser | Δ nDCG@10 | Cohen's d | p-Bonf |
|---|---|---|---|---|
| RRF | TF-IDF | 0.066 | 0.370 | <0.001 |
| RRF | BM25 | 0.060 | 0.330 | <0.001 |
| MPNet | TF-IDF | 0.052 | 0.287 | <0.001 |
| MPNet | BM25 | 0.046 | 0.248 | <0.001 |
| E5+CrossEncoder | TF-IDF | 0.042 | 0.238 | <0.001 |
| E5+CrossEncoder | BM25 | 0.036 | 0.199 | <0.001 |
| E5-base | TF-IDF | 0.039 | 0.218 | <0.001 |
| E5-base | BM25 | 0.032 | 0.179 | 0.022 |
| MiniLM | TF-IDF | 0.039 | 0.216 | <0.001 |
| MiniLM | BM25 | 0.032 | 0.178 | 0.022 |
| BM25+CrossEncoder | TF-IDF | 0.033 | 0.184 | <0.001 |
| RRF | BM25+CrossEncoder | 0.034 | 0.182 | <0.001 |
| RRF | E5-base | 0.028 | 0.151 | <0.001 |
| RRF | MiniLM | 0.028 | 0.151 | 0.011 |
| RRF | E5+CrossEncoder | 0.024 | 0.132 | 0.028 |
Non-significant pairs include TF-IDF vs. BM25 (p = 1.00), MiniLM vs. E5-base (p = 1.00), MiniLM vs. E5+CrossEncoder (p = 1.00), E5-base vs. E5+CrossEncoder (p = 1.00), and MPNet vs. RRF (p = 1.00).
C. Finding 1: RRF Is the Best Overall Method
RRF(BM25+E5-base) achieves nDCG@10 = 0.416, the highest score among all eight methods; it also leads at every other evaluated cutoff except K = 1, where E5-base is marginally ahead (0.568 vs. 0.563). Its lead over BM25 (Δ = 0.060, d = 0.330) and TF-IDF (Δ = 0.066, d = 0.370) is strongly significant (p < 0.001). Although the confidence intervals of RRF and MPNet overlap and their difference is not Bonferroni-significant (p = 1.00), RRF achieves a numerically higher score than MPNet at every cutoff and holds advantages in MRR (0.762 vs. 0.752) and Hit@10 (0.967 vs. 0.955).
The mechanism behind RRF’s success aligns with the theoretical properties established by Cormack et al. [3]: BM25 is strong when query terms overlap with document vocabulary, while E5-base excels on queries requiring semantic inference beyond surface form. When one ranker fails on a query, the other compensates, and the fusion absorbs both signals without requiring score normalization, trained combination weights, or GPU inference beyond each component retriever.
D. Finding 2: Cross-Encoder Reranking Is Constrained by First-Stage Recall
BM25+CrossEncoder ranks sixth out of eight methods with nDCG@10 = 0.383 — barely above BM25 alone (0.357), and the difference is not Bonferroni significant (p = 0.062). A cross-encoder can only rescore documents that the first-stage retriever placed in its candidate pool. BM25 retrieves 100 candidates based on term overlap. For the naturalistic research questions in this study, many relevant documents share no surface vocabulary with the query: BM25 never places them in the candidate pool, and the cross-encoder cannot recover what was never retrieved. This is the cascade recall ceiling described by Nogueira and Cho [14].
E5+CrossEncoder partially mitigates this problem by substituting E5-base as the first stage: its nDCG@10 of 0.392 is marginally above E5-base alone (0.389), and the difference is not statistically significant (p = 1.00). Replacing BM25 with E5-base does raise first-stage recall, but not enough to produce a detectable improvement after reranking at this evaluation scale. These findings suggest that cascade systems should be designed with careful attention to first-stage recall, and that RRF is preferable to cross-encoder cascades when queries exhibit vocabulary mismatch with relevant documents.
E. Finding 3: Dense Methods Form a Statistically Indistinguishable Cluster
MiniLM (0.389), E5-base (0.389), E5+CrossEncoder (0.392), and MPNet (0.402) form a cluster in which all pairwise differences are non-significant after Bonferroni correction (minimum p = 0.69 among these comparisons). The raw score differences range from 0.001 to 0.013 nDCG@10, smaller than the bootstrap standard errors. This finding does not mean the methods perform identically; it means that a study with 424 queries lacks the statistical power to distinguish them. Evaluations using larger query sets, such as the BEIR benchmark [19] with thousands of queries, would likely achieve separation. At 424 queries, the study discriminates between paradigm-level differences (dense vs. sparse) but not within-paradigm model differences.
F. Finding 4: Sparse Methods Are Statistically Equivalent
TF-IDF (0.350) and BM25 (0.357) show a difference of 0.006 nDCG@10, with p-Bonf = 1.00. Under the natural-language research questions in this study, BM25’s term saturation and length normalization offer no measurable advantage over TF-IDF’s simpler weighting. Both methods are limited by vocabulary mismatch, and neither can find relevant documents that use different terminology to express the same concept. This result aligns with observations from BEIR [19] that sparse methods show minimal relative differences on semantically rich queries.
G. Finding 5: Dense Paradigm Significantly Outperforms Sparse
Despite the inability to distinguish among individual dense models, MiniLM, E5-base, MPNet, and E5+CrossEncoder, together with RRF, each significantly outperform both sparse baselines after Bonferroni correction; BM25+CrossEncoder significantly beats only TF-IDF. Effect sizes range from d = 0.178 (MiniLM vs. BM25) to d = 0.370 (RRF vs. TF-IDF). These are small-to-medium effects under standard guidelines, but they are consistent and robust across all 424 queries. The underlying mechanism is semantic bridging: neural encoders capture that a query about reducing hallucination by grounding generation is relevant to an article on Retrieval-Augmented Generation, even though the article title contains none of the query's key terms. Sparse methods retrieve by term matching and consistently miss such cases.
H. Limitations
Three limitations warrant acknowledgment. First, statistical power at 424 queries is insufficient to distinguish individual dense models; future work should extend to at least 1,000 queries. Second, relevance judgments are produced by a single LLM judge using a topic-cluster heuristic rather than independent human assessors per query-document pair; while LLM-as-judge approaches correlate well with human judgments in structured settings [25], cluster-level assignment introduces coarser granularity than per-pair annotation. Third, all retrieval models are used zero-shot without domain-specific fine-tuning; fine-tuning dense encoders on NLP domain data would likely improve performance across the board.
Conclusion
This study evaluated eight information retrieval methods across sparse, dense, and hybrid paradigms on a domain-specific NLP corpus of 1,856 Wikipedia articles, using 424 LLM-authored queries with graded relevance judgments derived from expert topic clustering. Bootstrap confidence intervals and Bonferroni-corrected significance testing provided statistically rigorous comparisons beyond simple point estimates.
Five principal findings emerged: (1) Reciprocal Rank Fusion of BM25 and E5-base achieves the highest nDCG@10 of 0.416, significantly outperforming all other methods except MPNet; (2) cross-encoder reranking underperforms relative to RRF and dense bi-encoders, with BM25+CrossEncoder failing to significantly improve upon BM25 alone, a result explained by the first-stage recall bottleneck; (3) the four strongest non-fused neural methods (MiniLM, E5-base, E5+CrossEncoder, MPNet) are statistically indistinguishable at this evaluation scale; (4) TF-IDF and BM25 are statistically equivalent under naturalistic semantic queries; and (5) every neural method except BM25+CrossEncoder significantly outperforms both sparse baselines, confirming the semantic advantage of dense representations for conceptual domain-specific retrieval.
For practitioners, these findings support the following recommendations: prefer dense or hybrid methods over sparse baselines for domain-specific retrieval with conceptual queries; choose RRF when simplicity and robustness are priorities, as it requires no score calibration or additional training; and design cascade re-ranking systems with careful attention to first-stage recall when vocabulary mismatch between queries and documents is likely. Future work should pursue larger query sets, domain-adaptive fine-tuning of bi-encoders, per-pair relevance annotation, and extension to multilingual corpora.
References:
1. Arabzadeh N., Yan X., Clarke C.L.A. Predicting efficiency/effectiveness trade-offs for dense vs. sparse retrieval strategy selection // Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR). — 2021.
2. Burges C., Shaked T., Renshaw E., Hamilton N., Hullender G. Learning to rank using gradient descent // Proceedings of the 22nd International Conference on Machine Learning (ICML). — 2005. — P. 89–96.
3. Cormack G.V., Clarke C.L.A., Buettcher S. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods // Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. — 2009. — P. 758–759.
4. Efron B., Tibshirani R. An Introduction to the Bootstrap. — New York: Chapman & Hall, 1993. — 456 p.
5. Formal T., Lassance C., Piwowarski B., Clinchant S. SPLADE: Sparse lexical and expansion model for first stage ranking // Proceedings of the 44th International ACM SIGIR Conference. — 2021.
6. Formal T., Piwowarski B., Lassance C., Clinchant S. SPLADE v2: Sparse lexical and expansion model for information retrieval // ACM Transactions on Information Systems. — 2024.
7. Guo J., Fan Y., Ai Q., Croft W.B. A deep relevance matching model for ad-hoc retrieval // Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM). — 2016. — P. 55–64.
8. Järvelin K., Kekäläinen J. Cumulated gain-based evaluation of IR techniques // ACM Transactions on Information Systems. — 2002. — Vol. 20, No. 4. — P. 422–446.
9. Johnson J., Douze M., Jégou H. Billion-scale similarity search with GPUs // IEEE Transactions on Big Data. — 2019. — Vol. 7, No. 3. — P. 535–547.
10. Karpukhin V., Oguz B., Min S., Lewis P., Wu L., Edunov S., Chen D., Yih W. Dense passage retrieval for open-domain question answering // Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). — 2020. — P. 6769–6781.
11. Ma X., Zhang X., Pradeep R., Lin J. Zero-shot listwise document reranking with a large language model // arXiv preprint arXiv:2305.02156. — 2023.
12. Mackie I., Chatterjee S., Dalton J. Generative and pseudo-relevant feedback for sparse, dense and learned sparse retrieval // Workshop on Large Language Models’ Interpretation and Trustworthiness, CIKM. — 2023.
13. Mandikal P., Mooney R. Sparse meets dense: A hybrid approach to enhance scientific document retrieval // CEUR Workshop Proceedings (Scientific Document Understanding Workshop). — 2023.
14. Nogueira R., Cho K. Passage re-ranking with BERT // arXiv preprint arXiv:1901.04085. — 2019.
15. Reimers N., Gurevych I. Sentence-BERT: Sentence embeddings using siamese BERT-networks // Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). — 2019. — P. 3982–3992.
16. Robertson S.E., Sparck Jones K. Relevance weighting of search terms // Journal of the American Society for Information Science. — 1976. — Vol. 27, No. 3. — P. 129–146.
17. Salton G., Wong A., Yang C.S. A vector space model for automatic indexing // Communications of the ACM. — 1975. — Vol. 18, No. 11. — P. 613–620.
18. Song K., Tan X., Qin T., Lu J., Liu T. MPNet: Masked and permuted pre-training for language understanding // Advances in Neural Information Processing Systems (NeurIPS). — 2020. — Vol. 33. — P. 16145–16155.
19. Thakur N., Reimers N., Rücklé A., Srivastava A., Gurevych I. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models // Proceedings of the 35th NeurIPS, Datasets and Benchmarks Track. — 2021.
20. Wang L., Yang N., Huang X., Jiao B., Yang L., Jiang D., Majumder R., Wei F. Text embeddings by weakly-supervised contrastive pre-training // arXiv preprint arXiv:2212.03533. — 2022.
21. Xu Z., Mo F., Huang Z., Zhang C., Yu P., Wang B., Lin J., Srikumar V. A survey of model architectures in information retrieval // arXiv preprint arXiv:2502.14822. — 2025.
22. Zeng H. et al. Scaling sparse and dense retrieval in decoder-only LLMs // arXiv preprint. — 2025.
23. Zhang H. et al. Efficient and effective retrieval of dense-sparse hybrid vectors using graph-based approximate nearest neighbor search // arXiv preprint. — 2024.
24. Zhao W.X., Liu J., Ren R., Wen J.-R. Dense text retrieval based on pretrained language models: A survey // ACM Transactions on Information Systems. — 2022. — Vol. 42, No. 4.
25. Zheng L., Chiang W.-L., Sheng Y. et al. Judging LLM-as-a-judge with MT-Bench and chatbot arena // Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS). — 2023.