Master’s student, School of Information Technology and Engineering, Kazakh-British Technical University, Republic of Kazakhstan, Almaty
A NATURAL LANGUAGE PROCESSING PIPELINE FOR TEXT SUMMARIZATION AND MIND MAP VISUALIZATION
UDC 004.42
ABSTRACT
The exponential growth of digital textual information necessitates efficient automated summarization and structuring methods. While mind maps serve as effective tools for visual knowledge representation, their manual creation is highly time-consuming. Existing automated approaches often rely on simplistic sentence-to-sentence relationship matrices, leading to computational bottlenecks and shallow semantic understanding. To address these limitations, this study proposes a comprehensive Natural Language Processing (NLP) pipeline that transforms plain text into a structured, hierarchical graph. The methodology integrates advanced semantic embeddings (Snowflake Arctic 2) for sentence representation, hybrid similarity metrics for paragraph segmentation, K-Means clustering for grouping related concepts, and a Large Language Model (Gemma 3) for abstractive summarization. Furthermore, TF-IDF is employed for topic and concept extraction. Evaluated on the BBC News Summaries dataset, the proposed pipeline demonstrates competitive abstractive summarization performance, achieving an overall ROUGE-1 score of 0.4545 and providing coherent, automatically generated mind maps. The results indicate that combining embedding-based clustering with lightweight LLMs offers a scalable and effective solution for structured text visualization.
Keywords: large language model, mind map, text summarization, natural language processing, semantic clustering, hierarchical graph.
Introduction
In the modern digital landscape, individuals and organizations are overwhelmed by vast amounts of unstructured textual data. Extracting core concepts and understanding the hierarchical relationships within lengthy documents remain significant challenges in information retrieval and knowledge management. Visual formats, particularly mind maps, have proven highly effective in mitigating cognitive overload by transforming linear texts into interconnected, structured visual representations [2; 13; 15].
Despite their cognitive benefits, the manual construction of mind maps requires deep reading comprehension, analytical structuring, and substantial time investment. Consequently, automating the text-to-graph transformation has emerged as a crucial research domain. Early automated systems primarily utilized rule-based syntax parsing and Part-of-Speech (POS) tagging to extract nouns as nodes and verbs as connecting edges [1; 3]. While these methods established foundational principles for multilevel graph representations [4], their reliance on rigid linguistic rules severely limited their scalability and adaptability to diverse text structures.
Subsequent advancements introduced statistical machine learning algorithms, such as Support Vector Machines (SVM) for node classification [17] and cosine similarity matrices for edge definition [18]. However, these approaches often failed to capture deep semantic nuances. Recently, the advent of deep learning and Transformer-based architectures has revolutionized the field. Models capable of generating sentence-to-graph structures have significantly improved the detection of long-range semantic dependencies [8; 10; 18]. Nonetheless, purely generative approaches using large pre-trained transformers can be computationally prohibitive and frequently struggle with maintaining strict hierarchical graph topologies.
This study aims to bridge the gap between deep semantic analysis and structured visual representation by proposing a unified NLP pipeline. Rather than solely relying on an LLM for end-to-end generation, the proposed methodology synergizes dense sentence embeddings, mathematical clustering, and targeted LLM summarization. This hybrid approach ensures both computational efficiency and high-quality hierarchical structuring, ultimately outputting an accurate mind map representation of the input text.
Materials and methods
The proposed system architecture, illustrated in Figure 1, consists of five sequential stages: sentence embedding generation, semantic paragraph segmentation, recursive clustering, abstractive summarization via LLM, and concept extraction for final graph visualization.
Figure 1. Proposed approach
To capture the semantic meaning of the text, the input document is first split into individual sentences. Each sentence is then encoded into a high-dimensional vector space using the Snowflake Arctic 2 embedding model. Based on the XLM-RoBERTa Large architecture and comprising 568 million parameters, this bi-encoder model provides state-of-the-art dense retrieval performance while maintaining computational efficiency [11; 16]. As shown by the benchmark results in Figure 2, Snowflake Arctic 2 captures semantic features robustly across various contexts.
Figure 2. Efficiency of search using multilingual embedding models
To identify logical thematic shifts within the text, the pipeline evaluates the semantic variation between adjacent sentences. For any two consecutive sentence embeddings, $e_i$ and $e_{i+1}$, the directional similarity is measured using the cosine distance as in Equation 1:

$$d_{\cos}(i) = 1 - \frac{e_i \cdot e_{i+1}}{\lVert e_i \rVert \, \lVert e_{i+1} \rVert} \tag{1}$$

Simultaneously, the absolute magnitude of semantic shift is captured via the Euclidean distance (semantic delta) as in Equation 2:

$$d_{\mathrm{euc}}(i) = \lVert e_i - e_{i+1} \rVert_2 \tag{2}$$

Both metrics are subsequently normalized using Z-score standardization to ensure scale invariance as in Equation 3:

$$\hat{x}(i) = \frac{x(i) - \mu_x}{\sigma_x} \tag{3}$$

where $\mu_x$ and $\sigma_x$ represent the mean and standard deviation, respectively, and $x \in \{d_{\cos}, d_{\mathrm{euc}}\}$. The final segmentation boundary is determined by a hybrid score $H(i)$, combining both normalized metrics with adjustable weights $\alpha$ and $\beta$ as in Equation 4:

$$H(i) = \alpha \, \hat{d}_{\cos}(i) + \beta \, \hat{d}_{\mathrm{euc}}(i) \tag{4}$$

Through empirical tuning on the validation subset, the weights were set so that $\alpha > \beta$, prioritizing directional semantic shifts while still accounting for absolute Euclidean magnitude. Local maxima (peaks) in the hybrid score series indicate significant thematic transitions, dynamically dividing the document into semantically cohesive paragraphs.
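The segmentation step can be illustrated with a minimal pure-Python sketch. It assumes non-zero embedding vectors, the population standard deviation for the Z-score, and strict interior local maxima as peaks. The function names and the weights `alpha=0.7` and `beta=0.3` are placeholders chosen for illustration only and are not the values tuned in this study.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def zscore(values):
    """Standardize a series using its mean and population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

def hybrid_scores(embeddings, alpha=0.7, beta=0.3):
    """Hybrid score for each adjacent pair (i, i + 1); alpha and beta are placeholders."""
    pairs = list(zip(embeddings, embeddings[1:]))
    d_cos = zscore([cosine_distance(a, b) for a, b in pairs])
    d_euc = zscore([euclidean_distance(a, b) for a, b in pairs])
    return [alpha * c + beta * e for c, e in zip(d_cos, d_euc)]

def boundaries(scores):
    """Indices i such that a new paragraph starts at sentence i + 1."""
    return [
        i for i in range(1, len(scores) - 1)
        if scores[i] > scores[i - 1] and scores[i] > scores[i + 1]
    ]

# Toy 2-D "embeddings": three sentences on one topic, then three on another.
toy = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]
print(boundaries(hybrid_scores(toy)))
```

In this toy input the only peak lies between the third and fourth sentences, so the text would be split into two paragraphs of three sentences each.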
Following segmentation, sentences within each semantic paragraph are grouped into hierarchical nodes. The optimal number of clusters k is determined automatically by maximizing the Silhouette Score, which evaluates intra-cluster cohesion and inter-cluster separation [14]. The sentences are then partitioned using the K-Means clustering algorithm. This process is applied recursively until the cluster size is reduced to a predefined granular threshold of 1 to 2 sentences per leaf node, forming the hierarchical tree structure of the mind map.
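The recursive clustering step can be sketched in pure Python as follows. This is an illustrative simplification, not the implementation used in the study: the K-Means routine uses a fixed random seed and a fixed number of iterations, candidate values of k range from 2 to n - 1, and the names `kmeans`, `silhouette`, and `build_tree` as well as the `leaf_size=2` default are assumptions.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm with seeded random initial centers."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, l in enumerate(labels):
        own = [j for j in clusters[l] if j != i]
        if not own:
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items() if other != l
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def build_tree(indices, points, leaf_size=2):
    """Split sentence indices recursively until each leaf has at most leaf_size items."""
    if len(indices) <= leaf_size:
        return indices
    subset = [points[i] for i in indices]
    best_labels, best_score = None, -2.0
    for k in range(2, len(subset)):
        labels = kmeans(subset, k)
        score = silhouette(subset, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    if best_labels is None:
        return indices
    groups = {}
    for i, l in zip(indices, best_labels):
        groups.setdefault(l, []).append(i)
    if len(groups) < 2:
        return indices  # no further split was found
    return [build_tree(g, points, leaf_size) for g in groups.values()]
```

Calling `build_tree(list(range(len(vectors))), vectors)` returns a nested list whose innermost lists hold the sentence indices of each leaf node.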
To generate concise labels for the graph’s leaf nodes, the clustered sentences are processed by the Gemma 3 Large Language Model (1B parameters). This model was explicitly selected for its localized attention mechanisms and hardware efficiency, making it highly suitable for generating precise, abstractive summaries of short text clusters without the high computational overhead associated with massive LLMs [5].
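To populate the final nodes of the mind map, Term Frequency-Inverse Document Frequency (TF-IDF) is applied to the summaries as in Equation 5:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)} \tag{5}$$

where $\mathrm{tf}(t, d)$ is the frequency of term $t$ in summary $d$, $\mathrm{df}(t)$ is the number of summaries containing $t$, and $N$ is the total number of summaries.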
Noun phrases (n-grams of size 1 to 3) are extracted and ranked based on their TF-IDF scores. The highest-ranked phrases serve as the overarching cluster topics (parent nodes), while individual high-scoring lemmas represent specific key concepts (child nodes).
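A minimal sketch of the n-gram ranking is given below. It uses the common unsmoothed form tf · log(N / df) and, as a simplification, replaces noun-phrase detection with a small stop-word filter, since proper noun-phrase extraction would require a part-of-speech tagger. The stop-word list and the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the one used in the study).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

def ngrams(text, n_max=3):
    """All 1..n_max word n-grams that contain no stop word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(t in STOP_WORDS for t in gram):
                grams.append(" ".join(gram))
    return grams

def tfidf(docs, n_max=3):
    """One {n-gram: score} dictionary per document."""
    counts = [Counter(ngrams(d, n_max)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    scores = []
    for c in counts:
        total = sum(c.values())
        scores.append(
            {t: (f / total) * math.log(n_docs / doc_freq[t]) for t, f in c.items()}
        )
    return scores

def top_terms(score, k=3):
    """Highest-scoring n-grams, ties broken alphabetically."""
    return sorted(score, key=lambda t: (-score[t], t))[:k]

summaries = ["stock markets fell", "stock prices rose"]
print(top_terms(tfidf(summaries)[0]))
```

Because "stock" occurs in both toy summaries, its inverse document frequency is zero and it does not appear among the top-ranked terms of either summary.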
Results and discussions
The methodology was evaluated using the BBC News Summaries dataset [6], comprising 2225 English news articles and their corresponding reference summaries. As shown in Figure 3 and Figure 4, the dataset is distributed across five thematic categories: Business, Tech, Entertainment, Politics, and Sport, ensuring a diverse evaluation ground. The lexical analysis in Figure 5 confirms a broad vocabulary distribution suitable for NLP testing.
Figure 3. Distribution of news articles by categories
Figure 4. Pie chart of categories in percentages
Figure 5. Word cloud in a text column
The quality of the generated summaries representing the mind map nodes was evaluated against the dataset’s ground-truth references using standard NLP metrics: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [9] and BLEU (Bilingual Evaluation Understudy) [12].
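For readers unfamiliar with these metrics, the sketch below computes ROUGE-N and ROUGE-L as F1 scores over whitespace-separated, lowercased tokens. It is a simplified illustration rather than the evaluation code used in this study: standard ROUGE packages additionally apply their own tokenization and optional stemming, and whether the reported values are F1, precision, or recall is not specified here.

```python
from collections import Counter

def _ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def _f1(matches, cand_total, ref_total):
    if matches == 0:
        return 0.0
    precision = matches / cand_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(candidate, reference, n=1):
    """F1 over clipped n-gram overlap between candidate and reference."""
    cand = _ngram_counts(candidate.lower().split(), n)
    ref = _ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return _f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference):
    """F1 based on the longest common subsequence of tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _f1(lcs_length(cand, ref), len(cand), len(ref))

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(candidate, reference, 1))  # 5 of 6 unigrams match
print(rouge_n(candidate, reference, 2))  # 3 of 5 bigrams match
print(rouge_l(candidate, reference))     # LCS has 5 tokens
```

The example shows why ROUGE-2 is typically lower than ROUGE-1: a single changed word breaks two bigrams but only one unigram.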
Table 1.
Overall Summarization Performance

| Metrics | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
|---|---|---|---|---|
| All topics | 0.4545±0.0536 | 0.1687±0.0547 | 0.2147±0.0466 | 0.0952±0.0468 |
Table 2.
Summarization Performance by Category

| Topic | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
|---|---|---|---|---|
| Business | 0.4508±0.0506 | 0.1641±0.0469 | 0.2203±0.0413 | 0.0970±0.0444 |
| Entertainment | 0.4778±0.0572 | 0.2057±0.0611 | 0.2382±0.0543 | 0.1116±0.0522 |
| Politics | 0.4554±0.0443 | 0.1674±0.0432 | 0.2058±0.0340 | 0.0892±0.0378 |
| Sport | 0.4379±0.0598 | 0.1591±0.0595 | 0.2122±0.0529 | 0.0946±0.0533 |
| Tech | 0.4575±0.0443 | 0.1526±0.0451 | 0.1974±0.0359 | 0.0880±0.0386 |
As detailed in Table 1, the pipeline achieved an overall ROUGE-1 score of 0.4545, indicating substantial unigram overlap with the reference summaries. The domain-specific analysis in Table 2 shows the highest performance in the Entertainment category (ROUGE-1 of 0.4778 and BLEU of 0.1116). Sport has the lowest ROUGE-1 (0.4379), while Tech has the lowest ROUGE-2, ROUGE-L, and BLEU (0.1526, 0.1974, and 0.0880, respectively). The moderate ROUGE-1 combined with lower ROUGE-2 and BLEU scores is characteristic of abstractive summarization, where the model synthesizes new phrasing rather than copying the original wording. These results are competitive with established baselines for the BBC News dataset. For instance, a recent study reports that a larger fine-tuned model, Gemma-7B, achieves a ROUGE-1 score of 0.51 [7]. Reaching an overall ROUGE-1 of 0.4545 with a much smaller 1B-parameter model (Gemma 3) applied zero-shot to clustered nodes supports the computational efficiency of the proposed pipeline.
Figure 6. The resulting mind map.
The final output is rendered using the Graphviz library (Neato engine), which lays out the hierarchical relationships as a graph. As illustrated by the generated sample in Figure 6, the core topic serves as the root and expands into subtopics (green and blue ellipses), with generated abstractive summaries (orange boxes) and isolated key concepts (purple diamonds) acting as terminal leaves. This illustrates the pipeline's ability to convert unstructured text into a readable, relational structure.
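The rendering step can be sketched by emitting Graphviz DOT text from a nested dictionary. This is an illustrative assumption about the data structure, not the code used in the study: the `kind` field, the `to_dot` function, the root style, and the specific colour names are hypothetical, chosen to mirror the shapes described above. Only standard DOT attributes (`layout`, `overlap`, `shape`, `style`, `fillcolor`) are used.

```python
# Shape and fill colour per node kind (colour names are illustrative choices).
STYLE = {
    "root": ("ellipse", "lightgrey"),
    "topic": ("ellipse", "lightgreen"),
    "summary": ("box", "orange"),
    "concept": ("diamond", "plum"),
}

def to_dot(tree):
    """Serialize a nested {"label", "kind", "children"} dict to DOT text."""
    lines = ["graph mindmap {", "  layout=neato;", "  overlap=false;"]
    counter = 0

    def visit(node, parent_id):
        nonlocal counter
        node_id = f"n{counter}"
        counter += 1
        shape, color = STYLE[node["kind"]]
        # Escape backslashes and double quotes so the label stays valid DOT.
        label = node["label"].replace("\\", "\\\\").replace('"', '\\"')
        lines.append(
            f'  {node_id} [label="{label}", shape={shape}, '
            f"style=filled, fillcolor={color}];"
        )
        if parent_id is not None:
            lines.append(f"  {parent_id} -- {node_id};")
        for child in node.get("children", []):
            visit(child, node_id)

    visit(tree, None)
    lines.append("}")
    return "\n".join(lines)

example = {
    "label": "Economy", "kind": "root", "children": [
        {"label": "Markets", "kind": "topic", "children": [
            {"label": "Stocks fell sharply.", "kind": "summary"},
            {"label": "stocks", "kind": "concept"},
        ]},
    ],
}
print(to_dot(example))
```

The resulting text can be saved to a `.dot` file and rendered with the Graphviz command-line tools, for example `neato -Tpng mindmap.dot -o mindmap.png`.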
Conclusion
This study presented an NLP pipeline designed to automate the extraction and structuring of information from extensive texts. By moving beyond simple rule-based parsers and computationally heavy end-to-end LLMs, the proposed method integrates dense embeddings, mathematical segmentation, and targeted LLM summarization to generate hierarchical mind maps. Experimental validation on the BBC News dataset yielded competitive abstractive summarization metrics (ROUGE-1 of 0.4545), supporting the methodology's ability to capture and condense core textual themes. The structured visual outputs are intended to reduce cognitive load and indicate practical applicability in educational and corporate environments. Future work will explore dynamic parameter tuning for the K-Means algorithm and evaluate the pipeline on multilingual datasets.
References:
1. Abdeen M., El-Sahan R., Ismaeil A., El-Harouny S., Shalaby M., Yagoub M. C. Direct automatic generation of mind maps from text with m²gen. // 2009 IEEE Toronto International Conference Science and Technology for Humanity (TIC-STH). – 2009. – P. 95-99. DOI: 10.1109/TIC-STH.2009.5444360.
2. Buzan T., Buzan B. How to mind map. // London: Thorsons. – 2002. ISBN: 978-0-00-714684-0.
3. Chen Y.-S., Argueta C., Hsu P.-L., Hsieh H.-S., Lee L.-C. Homme: Hierarchical-ontological mind map explorer. // The 26th Annual Conference of the Japanese Society for Artificial Intelligence. – 2012. DOI: 10.11517/pjsai.jsai2012.0_3m1ios3a1.
4. Elhoseiny M., Elgammal A. English2mindmap: An automated system for mindmap generation from english text. // Proceedings - 2012 IEEE International Symposium on Multimedia, ISM 2012. – 2012. – P. 326-331. DOI: 10.1109/ISM.2012.103.
5. Gemma Team, Kamath A. [et al.]. Gemma 3 Technical Report. // arXiv e-prints. – 2025. – P. arXiv:2503.19786. DOI: 10.48550/arXiv.2503.19786.
6. Greene D., Cunningham P. Practical solutions to the problem of diagonal dominance in kernel document clustering. // Proceedings of the 23rd International Conference on Machine Learning (ICML '06). – New York: Association for Computing Machinery. – 2006. – P. 377-384. DOI: 10.1145/1143844.1143892.
7. Jiao Y., Yin Y., Wang Y. Evaluating LLMs and Pre-trained Models for Text Summarization. // arXiv e-prints. – 2025. – P. arXiv:2502.19339. DOI: 10.48550/arXiv.2502.19339.
8. Kulkarni A., Shah H., D’Mello L., Shah K. Flowchart generation and mind map creation using extracted summarized text. // 2023 International Conference on Recent Advances in Science and Engineering Technology (ICRASET). – 2023. – P. 1-6. DOI: 10.1109/ICRASET59632.2023.10420315.
9. Lin C.-Y. ROUGE: A package for automatic evaluation of summaries. // Text Summarization Branches Out. – Barcelona: Association for Computational Linguistics. – 2004. – P. 74-81. URL: https://aclanthology.org/W04-1013/
10. Mhatre M., Pandey A., Rane H., Sahu S. A novel approach for creating flowcharts using generative ai. // 2024 Asia Pacific Conference on Innovation in Technology, APCIT 2024. – 2024. DOI: 10.1109/APCIT62007.2024.10673464.
11. Muennighoff N., Tazi N., Magne L., Reimers N. MTEB: Massive text embedding benchmark. // Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. – Dubrovnik: Association for Computational Linguistics. – 2023. – P. 2014-2037. DOI: 10.18653/v1/2023.eacl-main.148.
12. Papineni K., Roukos S., Ward T., Zhu W.-J. Bleu: a method for automatic evaluation of machine translation. // Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02). – USA: Association for Computational Linguistics. – 2002. – P. 311-318. DOI: 10.3115/1073083.1073135.
13. Rezapour-Nasrabad R. Mind map learning technique: An educational interactive approach. // International Journal of Pharmaceutical Research. – 2019. – Vol. 11, No. 1. – P. 1593-1597. ISSN: 0975-2366.
14. Rousseeuw P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. // Journal of Computational and Applied Mathematics. – 1987. – Vol. 20. – P. 53-65. DOI: 10.1016/0377-0427(87)90125-7.
15. Winn W. Encoding and retrieval of information in maps and diagrams. // IEEE Transactions on Professional Communication. – 1990. – Vol. 33. – P. 103-107. DOI: 10.1109/47.59083.
16. Yu P., Merrick L., Nuti G., Campos D. Arctic-Embed 2.0: Multilingual Retrieval Without Compromise. // arXiv e-prints. – 2024. – P. arXiv:2412.04506. DOI: 10.48550/arXiv.2412.04506.
17. Yulianto R., Mariyah S. Building automatic mind map generator for natural disaster news in bahasa indonesia. // 2017 International Conference on Information Technology Systems and Innovation (ICITSI). – 2017. – P. 177-182. DOI: 10.1109/ICITSI.2017.8267939.
18. Zhang Z., Hu M., Bai Y., Zhang Z. Coreference graph guidance for mind-map generation. // Proceedings of the AAAI Conference on Artificial Intelligence. – 2024. – Vol. 38. – P. 19623-19631. DOI: 10.1609/aaai.v38i17.29935.