Master's student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
A LIGHTWEIGHT QUERY-BASED FRAMEWORK FOR ABSTRACTIVE MULTI-DOCUMENT SUMMARIZATION IN RUSSIAN
ABSTRACT
As digital information keeps growing, it becomes harder for users to understand content spread across many documents. This is especially important in areas such as encyclopedias, news, and education, where key facts are scattered across different sources. Large language models (LLMs) are often used in systems that combine search and generation, but they are expensive and resource-intensive. To address this, we propose a lightweight, query-based multi-document summarization system for the Russian language. Our method builds summaries by combining relevant parts from multiple documents, offering a simpler alternative that enables effective summarization in scenarios with limited computational resources.
АННОТАЦИЯ
По мере непрерывного роста объемов цифровой информации пользователям становится все сложнее воспринимать контент, распределенный по множеству документов. Это особенно актуально для таких сфер, как энциклопедии, новости и образование, где ключевые факты находятся в различных источниках. Большие языковые модели (LLM) часто используются в системах, объединяющих поиск и генерацию, однако они дорогостоящи и требуют больших ресурсов. Для решения этой проблемы мы предлагаем легковесную систему многодокументной суммаризации на основе запросов для русского языка. Наш метод формирует резюме путем объединения релевантных частей из нескольких документов, что упрощает процесс и обеспечивает эффективную суммаризацию в условиях ограниченных вычислительных ресурсов.
Keywords: multi-document summarization, abstractive summarization, transformer models, natural language processing.
Ключевые слова: многодокументная суммаризация, абстрактивная суммаризация, модели-трансформеры, обработка естественного языка.
Introduction
Natural language is the standard form of human communication, with text appearing in various formats such as emails, social media posts, and online articles. People today have little time to dig through the huge amount of available information, and extracting relevant insights from this overload presents a significant challenge. This creates a tension in modern society: on one hand, there is not enough time; on the other hand, people need to stay informed for work and social reasons. They therefore need ways to get information quickly and easily. Text summarization is essential for compressing large texts while retaining key information. Multi-document summarization (MDS) plays a critical role in scenarios where information is spread across multiple sources, such as news aggregation, scientific literature reviews, or encyclopedic content like Wikipedia. Unlike single-document summarization, MDS must resolve redundancy, contradictions, and coherence across diverse texts. This research focuses on abstractive multi-document summarization, leveraging deep learning techniques to generate coherent, concise, and informative summaries that integrate content from multiple documents.
Over the past decade, natural language processing has made enormous progress in areas such as summarization, text generation, and text classification. Since the transformer architecture was introduced in 2017 [1], NLP methods have evolved rapidly. Some of the most advanced transformer-based models are large language models (LLMs) such as ChatGPT, Grok, and Gemini [2]. These models show very strong results on NLP tasks, with high accuracy, coherence, and fluency, but they are expensive to train and run, which makes them hard to use in small projects or when resources are limited [3].
Summarization methods are usually split into two types: extractive and abstractive. Extractive methods take important words or sentences directly from the original text, while abstractive methods create new text that may include words not found in the source [4], [5]. Several studies focus on graph-based algorithms such as TextRank and LexRank, which rank sentences by their importance within the document. Tangade et al. (2023) explore an optimized TextRank approach combined with K-means clustering and neural network classification, showing improved performance over traditional ranking methods [4]. Another statistical approach uses Latent Semantic Analysis (LSA), which identifies important sentences by analyzing word co-occurrence; this method has been applied to news summarization in the work of Rajalakshmi et al. (2023) [5]. Transformer-based models have demonstrated significant improvements in text summarization [6], [7].
Adding a named entity recognition (NER) model to BERT can further improve evaluation metrics by recognizing entities and structuring the extracted content [6]. Meanwhile, a hybrid approach that combines BART with extractive summarization methods has shown strong performance, particularly in biomedical question-answering tasks, achieving high ROUGE-2 and F1 scores [7]. Query-focused summarization tailors summaries to specific user queries. Du and Gao (2021) developed a Span-based Question-Answering driven model for Abstractive Summarization (SQAS), using a question-answering approach to ensure high relevance and accuracy [8]. PEGASUS is a summarization model pre-trained to predict important missing sentences, which helps it focus on key content; it performs well on English abstractive multi-document summarization tasks [9]. Multilingual models such as mBART and mT5 support many languages and are effective for multi-document summarization. However, they may not perform as well on Russian texts because their training data contains less Russian, so they often require fine-tuning to improve results [10], [11]. Although transformer models work well in many NLP tasks, older methods are still used in some settings, even though they are often outdated. Most available datasets are designed for extractive summarization of single documents; there are very few datasets for abstractive multi-document summarization, especially for low-resource languages. For Russian, there is no large public dataset for this task.
Materials and methods
The goal of this research is to build an effective multi-document summarization model that creates clear and concise summaries using information from several related documents. To do this, we use transformer-based models, prepare a Russian-language dataset, and apply evaluation metrics suitable for multi-document summarization.
A. Dataset Preparation
Because there are few good public datasets for abstractive multi-document summarization, we created our own based on the WikiSum approach [12]. Each example includes a query (topic), several related paragraphs, and a reference summary taken from the beginning of the article. The dataset was built using the following steps:
1) Dump Extraction and Cleaning: We downloaded the latest Russian Wikipedia dump in XML format and extracted clean plain-text articles using the WikiExtractor tool [13]. During this stage, we removed all MediaWiki-specific markup, templates, categories, and citations (e.g., “[1]”).
2) Topic Selection: We select N different Wikipedia article titles from topics like science, history, and technology. Each title is used as a query for summarization.
3) Document Retrieval: For each query, we use FAISS [14] to find the top-K most relevant paragraphs. FAISS is a fast library for similarity search over dense vector representations. We use multilingual embeddings from the E5 model [15], which is trained to match queries with documents and shows strong results on many retrieval benchmarks, including multilingual ones [16]. The retrieved paragraphs come from other Wikipedia articles related to the query. We set K = 20 to ensure both relevance and enough context (a code sketch of this step is given after this list).
4) Filtering and Cleaning: Retrieved paragraphs are filtered to remove noise. We remove summaries with fewer than 30 tokens and articles with fewer than 100 tokens, as well as text containing non-informative markup.
5) Reference Summaries: For each topic, we extract the lead section (typically the first 2–3 sentences) of the main Wikipedia article and use it as the gold-standard abstractive summary. This follows the WikiSum approach of matching summaries to the main content of the documents.
6) Data Split: The resulting dataset is randomly divided into training, development, and test sets in an 80/10/10 ratio, ensuring that no topic appears in more than one split to prevent information leakage.
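To make the retrieval step concrete, the sketch below shows one way the dense retrieval stage can be implemented with FAISS and multilingual E5 embeddings. It is a minimal illustration under stated assumptions, not the exact pipeline code: the checkpoint name intfloat/multilingual-e5-base, the helper names build_index and retrieve, and the use of the sentence-transformers wrapper are all assumptions on our part.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed E5 checkpoint; E5 models expect "query: " / "passage: " prefixes.
model = SentenceTransformer("intfloat/multilingual-e5-base")

def build_index(paragraphs: list[str]) -> faiss.Index:
    """Embed candidate paragraphs and index them for similarity search."""
    emb = model.encode(["passage: " + p for p in paragraphs],
                       normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on unit vectors
    index.add(np.asarray(emb, dtype="float32"))
    return index

def retrieve(query: str, index: faiss.Index, paragraphs: list[str], k: int = 20):
    """Return the top-K paragraphs most similar to the topic query."""
    q = model.encode(["query: " + query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [(paragraphs[i], float(s)) for i, s in zip(ids[0], scores[0])]
```

An exact (flat) index is sufficient at moderate scale; for the full Wikipedia paragraph collection, an approximate FAISS index (e.g., IVF or HNSW) would trade a little recall for much faster search.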
Figure 1. Workflow of dataset preparation for multi-document summarization
Figure 2. Word cloud for Russian-language texts from WikiSum
Overall, the dataset contains approximately 1 million Russian-language articles, each associated with a topic query, a set of retrieved paragraphs, and a reference summary. The average input length per example is around 2,000 tokens, while the average reference summary is 50–60 tokens long. Although the pipeline supports multilingual embeddings for generalization, the entire dataset is constructed from Russian Wikipedia, making it one of the few resources focused on Russian-language multi-document abstractive summarization. This setup reflects a realistic use case for multi-document summarization in underrepresented languages, addressing the challenge of generating coherent and informative summaries from diverse Russian-language sources.
Table 1.
Example of a query–documents–summary triplet
Figure 3. General architecture of a multi-document summarization (MDS) model
B. Model Architecture
In this work, we evaluate several transformer-based encoder–decoder architectures to identify the most effective models for fine-tuning in Russian-language multi-document abstractive summarization. We consider the following pre-trained models:
• mT5 [17]: a massively multilingual version of T5 trained on over 100 languages, including Russian. It has demonstrated strong performance across multilingual NLP tasks and is suitable for zero-shot and cross-lingual generalization.
• mBART50 [10]: a multilingual BART-style sequence-to-sequence model trained on 50 languages. Unlike the original BART, mBART50 includes Russian and supports both generation and translation tasks, making it a strong candidate for summarization in Russian.
• RuT5 [18]: a monolingual Russian version of T5 trained on large-scale Russian corpora. It has shown high performance on various Russian-language NLP benchmarks and is especially effective for abstractive summarization, classification, and QA in the Russian domain.
All three models are applied to multi-document summarization by concatenating the top-K documents into a single input, with special tokens (such as </s> or [SEP]) between them. To keep the summary focused on the query, we prepend the query to the input, which helps the model identify what information is most important to include. The final input format looks like this: <query> [SEP] doc_1 [SEP] doc_2 ... doc_K
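The following minimal sketch shows how such an input can be assembled and tokenized. The helper name build_input is hypothetical; the ruT5 checkpoint name is taken from [18], and the fallback separator reflects that T5-style vocabularies have no [SEP] token.

```python
from transformers import AutoTokenizer

def build_input(query: str, docs: list[str], tokenizer, max_len: int = 1024):
    """Concatenate the query and top-K documents with separator tokens."""
    sep = tokenizer.sep_token or "</s>"  # T5-style vocabularies lack [SEP]
    text = f"{query} {sep} " + f" {sep} ".join(docs)
    return tokenizer(text, truncation=True, max_length=max_len,
                     return_tensors="pt")

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/ruT5-base")
batch = build_input("Солнечная система",
                    ["абзац 1 ...", "абзац 2 ..."], tokenizer)
```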
No architectural modifications are introduced. Instead, we fine-tune each model on our custom Russian-language dataset using supervised learning. All models are implemented using the HuggingFace Transformers library.
C. Training Setup
Training is conducted on a single NVIDIA RTX 4090 GPU with 24GB VRAM. All experiments are implemented using the HuggingFace Transformers and Datasets libraries in PyTorch. We fine-tune each model using cross-entropy loss with label smoothing (ϵ = 0.1). This helps the model generalize better and avoid being too confident in its predictions. The loss is calculated as:
$$\mathcal{L} = -\sum_{t=1}^{T} \Big[ (1-\epsilon)\,\log p_\theta(y_t \mid y_{<t}, x) + \frac{\epsilon}{|V|} \sum_{v \in V} \log p_\theta(v \mid y_{<t}, x) \Big], \quad (1)$$
where $x$ is the input sequence, $y_t$ is the reference token at time step $t$, and $V$ is the vocabulary. Gradient clipping with a maximum norm of 1.0 is used to improve training stability. We stop training early if the ROUGE-L score on the validation set stops improving, which helps avoid overfitting. We use a custom Russian-language dataset with about 1 million examples; each example contains a query, 20 relevant paragraphs retrieved using FAISS, and a reference summary. Because the input sequences can be up to 1024 tokens long, we set the batch size to 2 and use gradient accumulation over 8 steps, which gives an effective batch size of 16. Each model is fine-tuned for one epoch, which takes about 15 hours. We apply the AdamW optimizer with a linear learning rate schedule and 1,000 warmup steps. The update rule for AdamW is defined as:
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_t \right), \quad (2)$$
In this equation, $\hat{m}_t$ and $\hat{v}_t$ represent the bias-corrected first and second moment estimates, $\lambda$ denotes the weight decay coefficient, and $\eta$ is the learning rate. We save a checkpoint every 10,000 steps and choose the best version using the ROUGE-L score on the validation set.
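This configuration maps almost directly onto HuggingFace Seq2SeqTrainingArguments. The sketch below is illustrative rather than the exact training script: the output directory and dataset wiring are placeholders, and the learning rate is left at the library default since the exact value used is not restated here.

```python
from transformers import (AutoModelForSeq2SeqLM, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model = AutoModelForSeq2SeqLM.from_pretrained("sberbank-ai/ruT5-base")

args = Seq2SeqTrainingArguments(
    output_dir="rut5-mds",           # placeholder path
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size 2 * 8 = 16
    label_smoothing_factor=0.1,      # epsilon in Eq. (1)
    max_grad_norm=1.0,               # gradient clipping
    lr_scheduler_type="linear",
    warmup_steps=1000,
    save_steps=10_000,
    predict_with_generate=True,      # needed for ROUGE-based model selection
)
# The Trainer uses AdamW (Eq. 2) by default:
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=train_ds, eval_dataset=dev_ds)
```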
D. Evaluation Metrics
To check how good the generated summaries are, we use two types of metrics: one for word overlap and one for meaning similarity. Together, they give a full picture of the summary quality.
• ROUGE-1, ROUGE-2, ROUGE-L: ROUGE (Recall- Oriented Understudy for Gisting Evaluation) [19] is a popular set of metrics used to evaluate summaries.
– ROUGE-1 evaluates the overlap of individual words (unigrams) between the generated summary and the reference.
– ROUGE-2 looks at how many two-word combinations (bigrams) match between the summaries.
– ROUGE-L looks at the longest sequence of words that appears in both summaries to evaluate how similar their structure is. The general ROUGE-N recall is computed as:
$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{g_n \in S} \text{Count}_{\text{match}}(g_n)}{\sum_{S \in \text{References}} \sum_{g_n \in S} \text{Count}(g_n)}, \quad (3)$$
where $g_n$ are $n$-grams, $\text{Count}_{\text{match}}(g_n)$ is the number of $n$-grams co-occurring in both the generated and reference summaries, and $\text{Count}(g_n)$ refers to the number of occurrences in the reference.
For ROUGE-L, the LCS-based recall is given by:
$$R_{\text{LCS}} = \frac{\text{LCS}(X, Y)}{|Y|}, \quad (4)$$
where $X$ and $Y$ denote the generated summary and the reference summary, respectively, and $\text{LCS}(X, Y)$ is the length of their longest common subsequence.
• BERTScore: BERTScore [20] compares contextual embeddings of tokens using a pre-trained BERT model. It captures semantic similarity, which is especially useful for abstractive summaries that may paraphrase content. Given predicted tokens $\hat{y}$ and reference tokens $y$:
$$P_{\text{BERT}} = \frac{1}{|\hat{y}|} \sum_{\hat{y}_i \in \hat{y}} \max_{y_j \in y} \cos(\hat{\mathbf{y}}_i, \mathbf{y}_j), \quad (5)$$
$$R_{\text{BERT}} = \frac{1}{|y|} \sum_{y_j \in y} \max_{\hat{y}_i \in \hat{y}} \cos(\hat{\mathbf{y}}_i, \mathbf{y}_j), \quad (6)$$
Here, $\hat{\mathbf{y}}_i$ and $\mathbf{y}_j$ are token vectors from BERT, and $\cos(\cdot, \cdot)$ measures how similar they are. The final score is the F1 value that combines precision and recall.
Together, these metrics provide complementary views: ROUGE emphasizes surface-level overlap, while BERTScore captures deeper semantic alignment between the generated and reference summaries.
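In practice, both metric families are available off the shelf. The sketch below, assuming the HuggingFace evaluate and bert-score packages, shows a minimal evaluation snippet; note that the default ROUGE tokenizer is Latin-alphabet oriented and drops Cyrillic characters, so a simple whitespace tokenizer is passed for Russian text.

```python
import evaluate               # HuggingFace evaluate package
from bert_score import score  # bert-score package

preds = ["сгенерированное резюме ..."]  # model outputs
refs = ["эталонное резюме ..."]         # gold summaries

# ROUGE-1/2/L; tokenize on whitespace so Cyrillic tokens are kept.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=preds, references=refs,
                    tokenizer=lambda text: text.split()))

# BERTScore selects a multilingual backbone when lang="ru" is given.
P, R, F1 = score(preds, refs, lang="ru")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```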
Results and discussion
We evaluate several transformer-based models on the test set of our Russian-language multi-document summarization dataset. Table 2 reports the ROUGE and BERTScore metrics for three architectures: mT5-base, mBART50, and RuT5-base. All models are fine-tuned for a limited number of epochs using standard hyperparameters, primarily to explore feasibility rather than to achieve optimal results.
Table 2.
Evaluation results of fine-tuned models on the Russian multi-document summarization dataset
The results show that all models are capable of generating coherent summaries, although there is room for improvement. Among them, RuT5-base slightly outperforms the others, likely due to its monolingual pretraining on Russian corpora. mBART50 also performs reasonably well, showing that multilingual models can generalize to Russian, albeit not as effectively. mT5 lags slightly behind in both lexical and semantic evaluations.
A. Qualitative Observations
Manual inspection of selected outputs reveals typical patterns:
• mT5 sometimes includes generic or repetitive phrases.
• mBART50 tends to produce safe but extractive outputs.
• RuT5 produces shorter and slightly more abstractive summaries, though not always accurate.
B. Inference Time
We also measure inference time on a single RTX 4090 GPU for input lengths of 1024 tokens and output lengths of 64 tokens:
• mT5-base: 120 ms
• mBART50: 110 ms
• RuT5-base: 105 ms
The inference times are roughly comparable, with RuT5 being slightly faster. These preliminary results confirm that fine-tuned summarization models can operate efficiently even on moderately long multi-document inputs, making them suitable for further development.
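As a reference point, the sketch below shows one way such latencies can be measured, assuming the model and a tokenized batch already reside on the GPU; the warm-up runs and CUDA synchronization are needed for meaningful wall-clock numbers.

```python
import time
import torch

@torch.no_grad()
def generation_latency_ms(model, batch, max_new_tokens=64, runs=20):
    """Average generation latency in milliseconds over several runs."""
    model.eval()
    for _ in range(3):  # warm-up to exclude one-off CUDA allocation costs
        model.generate(**batch, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model.generate(**batch, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0
```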
Conclusion
In this paper, we explored the challenge of summarizing large amounts of text coming from multiple sources. This is especially important for applications like news websites, encyclopedias, and educational tools, where key information is often spread across different documents. To address this, we built a simple and efficient query-based summarization system for the Russian language. We used transformer models and created our own dataset based on Russian Wikipedia. Unlike large language models that require substantial resources, our approach is faster, cheaper, and better suited for low-resource environments. This work highlights the value of combining dense retrieval techniques with fine-tuned encoder-decoder architectures to generate clear and meaningful summaries. Future research may further explore domain-specific pretraining, multilingual extensions, and real-world integration in retrieval-augmented generation (RAG) pipelines.
References:
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017, pp. 5998–6008. [Online]. Available: https://arxiv.org/abs/1706.03762
- S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, and J. Gao, "Large language models: A survey," 2024. [Online]. Available: https://arxiv.org/abs/2402.06196
- G. Bai, Z. Chai, C. Ling, S. Wang, J. Lu, N. Zhang, T. Shi, Z. Yu, M. Zhu, Y. Zhang, X. Song, C. Yang, Y. Cheng, and L. Zhao, "Beyond efficiency: A systematic survey of resource-efficient large language models," 2024. [Online]. Available: https://arxiv.org/abs/2401.00625
- A. Tangade, A. Kumar Verma, N. Darapaneni, Y. Harika, Prasanna, A. Reddy Paduri, S. Ram Shankar, and R. Sadalagi, "The power of pre-trained transformers for extractive text summarization: An innovative approach," in 2023 11th International Symposium on Electronic Systems Devices and Computing (ESDC), vol. 1, 2023, pp. 1–6.
- R. Rajalakshmi, S. Vidhya, D. Harina, R. Karna, and A. Sowmya, "Text summarization for news articles using latent semantic analysis technique," in 2023 4th International Conference on Electronics and Sustainable Communication Systems, ICESC 2023 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2023, pp. 1421–1425.
- I. P. Tummala, "Text summarization based named entity recognition for certain application using BERT," in 2nd International Conference on Intelligent Cyber Physical Systems and Internet of Things, ICoICI 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024, pp. 1136–1141.
- Q. A. Nguyen, Q. H. Duong, M. Q. Nguyen, H. S. Nguyen, H. Q. Le, D. C. Can, T. D. Thanh, and M. V. Tran, "A hybrid multi-answer summarization model for the biomedical question-answering system," in Proceedings - International Conference on Knowledge and Systems Engineering, KSE, vol. 2021-November. Institute of Electrical and Electronics Engineers Inc., 2021.
- J. Du and Y. Gao, "Query-focused abstractive summarization via question-answering model," in Proceedings - 12th IEEE International Conference on Big Knowledge, ICBK 2021. Institute of Electrical and Electronics Engineers Inc., 2021, pp. 440–447.
- J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu, "PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization," in Proceedings of the 37th International Conference on Machine Learning (ICML), 2020. [Online]. Available: https://arxiv.org/abs/1912.08777
- Y. Tang, C. Tran, and et al., "Multilingual translation with extensible multilingual pretraining and finetuning," arXiv preprint arXiv:2001.08210, 2020.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
- P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer, "WikiSum: Coarse-grained generative summarization of Wikipedia articles," in International Conference on Learning Representations (ICLR), 2018. [Online]. Available: https://openreview.net/forum?id=H1key85xx
- G. Attardi, "WikiExtractor: A tool for extracting plain text from Wikipedia dumps," https://github.com/attardi/wikiextractor, 2023, online; accessed 25-April-2025.
- J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, 2019.
- S. Wang, J. Li, X. Zhang, M. Tan, X. Hu, and T.-S. Chua, "Text embeddings by weakly-supervised contrastive pre-training," arXiv preprint arXiv:2212.03533, 2022.
- Hugging Face, "MTEB: Massive Text Embedding Benchmark – Hugging Face leaderboard (legacy)," https://huggingface.co/spaces/mteb/leaderboard legacy, 2023, accessed: 2025-05-18.
- L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, "mT5: A massively multilingual pre-trained text-to-text transformer," arXiv preprint arXiv:2010.11934, 2021.
- SberDevices AI Lab, "RuT5: Russian text-to-text transfer transformer," https://huggingface.co/sberbank-ai/ruT5-base, 2022.
- C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.
- T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text generation with BERT," in International Conference on Learning Representations (ICLR), 2020.