SEMANTIC CLASSIFICATION OF CITIZEN APPEALS ON GOVERNMENT WEBSITES IN KAZAKH

Cite as:
Issayev Zh.N., Suleimenov Y.R. SEMANTIC CLASSIFICATION OF CITIZEN APPEALS ON GOVERNMENT WEBSITES IN KAZAKH // Universum: Technical Sciences: electronic scientific journal. 2026. 4(145). URL: https://7universum.com/ru/tech/archive/item/22403 (accessed: 07.05.2026).
Received: 27.03.2026
Accepted for publication: 14.04.2026
Published: 28.04.2026

 

ABSTRACT

This study addresses the problem of automatic classification of citizen appeals submitted via government portals in Kazakhstan. Manual processing of such requests is inefficient and does not scale. We propose a semantic classification approach for Kazakh-language appeals using both classical machine learning methods and a transformer-based model (BERT). The models were trained and evaluated on a dataset of real-world appeals categorized into domains such as health, finance, education, law, and social services. The results demonstrate that transformer-based models outperform traditional approaches, highlighting their effectiveness for processing morphologically rich and low-resource languages. The proposed approach can support government agencies in improving the efficiency of public service management.


 

Keywords: natural language processing, text classification, citizen appeals, Kazakh language, BERT, machine learning, e-government.


 

1. Introduction

The number of citizen appeals submitted via government portals in Kazakhstan has significantly increased. Manual processing of such data is inefficient and does not scale.

Recent studies demonstrate the effectiveness of machine learning and deep learning approaches for complaint classification in various languages [1], [2]. However, the Kazakh language remains underrepresented in NLP research. This study aims to address this gap by comparing classical machine learning methods with transformer-based models for semantic classification.

2. Materials and Methods

To design a semantic classifier for Kazakh citizen appeals, we used a standard machine learning pipeline as well as a transformer-based deep learning method.


Figure 1. Methodology of the proposed approach

 

2.1. Data Collection

The dataset consists of approximately 2,500 appeals collected from five government ministries (Health, Finance, Education, Internal Affairs, and Sports).

2.2. Data Preprocessing

Preprocessing included text cleaning, tokenization, stopword removal, and normalization. The dataset was split into training (80%) and testing (20%) sets.
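A minimal sketch of this preprocessing and splitting step is shown below. The appeal texts, labels, and the Kazakh stopword list are illustrative placeholders, not the study's actual data or stopword package.

```python
# Sketch of the preprocessing pipeline: cleaning, tokenization,
# stopword removal, then an 80/20 train/test split.
import re
from sklearn.model_selection import train_test_split

KAZAKH_STOPWORDS = {"және", "бұл", "үшін"}  # illustrative subset only

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, and remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation
    tokens = [t for t in text.split() if t not in KAZAKH_STOPWORDS]
    return " ".join(tokens)

texts = ["Мектепте оқулықтар жетіспейді.", "Аурухана кезегі тым ұзақ."]
labels = ["education", "health"]

cleaned = [preprocess(t) for t in texts]

# 80/20 split, as described in the paper
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.2, random_state=42
)
```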

2.3. Model Training

We compared two semantic classification strategies for citizen appeals: traditional machine learning and transformer-based deep learning.

Traditional machine learning:

For the baseline, we employed traditional classifiers, Bernoulli Naive Bayes and a linear model trained with Stochastic Gradient Descent (SGD), over features derived from a binary Bag-of-Words (BoW) representation. Text preprocessing included tokenization and stopword removal using Kazakh-specific packages. This follows prior text classification research on low-resource languages, where BoW features combined with linear classifiers have been shown to yield strong baselines. For instance, a study classifying Kazakh scientific documents with neural models and multimodal fusion demonstrated the potential of ML-based solutions for Kazakh-language tasks [6].
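The baseline setup can be sketched as two scikit-learn pipelines over binary BoW features. The tiny corpus below is purely illustrative:

```python
# Baseline sketch: binary Bag-of-Words features feeding BernoulliNB
# and an SGD-trained linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

texts = [
    "дәрігер қабылдауына жазылу қиын",  # health
    "салық төлемі қате есептелді",      # finance
    "мектепке орын жетіспейді",         # education
    "аурухана жұмысына шағым",          # health
]
labels = ["health", "finance", "education", "health"]

pipelines = {
    "BernoulliNB": Pipeline([
        ("bow", CountVectorizer(binary=True)),  # word presence/absence
        ("clf", BernoulliNB()),
    ]),
    "SGD": Pipeline([
        ("bow", CountVectorizer(binary=True)),
        ("clf", SGDClassifier(random_state=42)),
    ]),
}

for name, pipe in pipelines.items():
    pipe.fit(texts, labels)
    print(name, pipe.predict(["аурухана кезегі ұзақ"]))
```

`CountVectorizer(binary=True)` records only word presence, matching the binary BoW model named above; the same vectorizer feeds both classifiers so their comparison isolates the learning algorithm.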

Transformer-Based Model:

The second strategy was fine-tuning Multilingual BERT (mBERT), which predicts the category of an appeal from the contextual representation of the [CLS] token. Texts were tokenized with BERT's WordPiece tokenizer, and a classification head was placed on top of the transformer encoder. Recent research identifies transformer architectures as a potential game changer for Kazakh NLP: a comparison of LSTM and BERT models for named entity recognition found that BERT handles language-specific forms better [7], and an XLM-RoBERTa model fine-tuned for Kazakh NER demonstrates real-world practicality [8]. Both strategies were trained and evaluated on exactly the same dataset splits to allow a valid comparison.
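The classification-head idea can be illustrated with plain numpy: the contextual vector of the first ([CLS]) token is passed through a linear layer and a softmax over the five ministry classes. All shapes and weights below are random stand-ins, not a fine-tuned mBERT.

```python
# Conceptual sketch of a classification head over a transformer
# encoder's [CLS] token (random stand-in values, numpy only).
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_classes, seq_len = 768, 5, 16

# Stand-in for the encoder output: one contextual vector per token
encoder_output = rng.normal(size=(seq_len, hidden_size))

# [CLS] is the first token; its vector summarizes the whole sequence
cls_vector = encoder_output[0]

# Linear classification head (its weights are learned during fine-tuning)
W = rng.normal(scale=0.02, size=(hidden_size, num_classes))
b = np.zeros(num_classes)

logits = cls_vector @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over the ministry classes

predicted_class = int(np.argmax(probs))
```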

2.4. Model Evaluation

Model performance was evaluated using accuracy, precision, recall, and F1-score.
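These metrics can be computed with scikit-learn; the toy labels below are illustrative and not the paper's actual results.

```python
# Sketch of the evaluation step: accuracy plus macro-averaged
# precision, recall, and F1 on toy predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["health", "finance", "education", "health", "law"]
y_pred = ["health", "finance", "health",    "health", "law"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```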

3. Results and Discussion

To assess how different modeling methods classify Kazakh-language citizen appeals, we compared three models: Bernoulli Naive Bayes, SGDClassifier, and mBERT. Their performance was compared in terms of accuracy, precision, recall, and F1-score on the test set.

3.1. Classification Metrics

Table 1.

Evaluation metrics for different classification models

 

Bernoulli Naive Bayes established a solid baseline, but its results showed that word-presence representations are limited in capturing semantic information for morphologically rich languages like Kazakh.

SGDClassifier with sparse binary Bag-of-Words features performed very well, demonstrating the strength of linear classifiers in structured text classification, as already shown in multilingual government-related classification problems [9].

Fine-tuned BERT outperformed both baselines by a wide margin, achieving the best overall performance. Thanks to its ability to represent both context and semantics, it generalized better across varied citizen appeal structures. This aligns with existing research applying deep transformers to similar tasks in other morphologically complex or low-resource languages [10], [11].

3.2. Model Comparison Based on Confusion Matrices

In this subsection, the following abbreviations are used to denote government ministries: Minzdrav – Ministry of Health, MINFIN – Ministry of Finance, MKS – Ministry of Culture and Sport, MVD – Ministry of Internal Affairs, and MON – Ministry of Education and Science:


Figure 2. Confusion Matrix for BernoulliNB

 

Figure 3. Confusion Matrix for SGDClassifier

 

Figure 4. Confusion Matrix for BERT

 

These results indicate that BernoulliNB is sensitive to sparsity and assumes binary feature independence, which is not ideal for complex or overlapping text data. This aligns with previous findings on the limitations of Naive Bayes for text classification tasks [12].
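Confusion matrices like those in Figures 2–4 can be produced with scikit-learn; the labels below are illustrative, not the study's data.

```python
# Sketch of computing a per-ministry confusion matrix.
from sklearn.metrics import confusion_matrix

ministries = ["Minzdrav", "MINFIN", "MKS", "MVD", "MON"]
y_true = ["Minzdrav", "MINFIN", "MON", "MVD", "Minzdrav", "MKS"]
y_pred = ["Minzdrav", "MINFIN", "MON", "MON", "Minzdrav", "MKS"]

# Rows: true ministry, columns: predicted ministry; passing
# labels= fixes the row/column order for plotting.
cm = confusion_matrix(y_true, y_pred, labels=ministries)
print(cm)
```

Off-diagonal cells (here, one MVD appeal predicted as MON) are what reveal the class confusions discussed above.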

These results highlight BERT's strength in capturing semantic meaning and contextual relationships within the text, which gives it a significant advantage over traditional models [13].

 In summary, while traditional models like SGDClassifier and BernoulliNB offer speed and simplicity, BERT provides superior accuracy and robustness in multi-class text classification tasks due to its deep understanding of language [14].

4. Conclusion

This study demonstrated the effectiveness of transformer-based models for semantic classification of Kazakh-language citizen appeals. BERT significantly outperformed traditional machine learning approaches. Future work may include larger datasets and multilingual models such as XLM-R. These results highlight the importance of transformer-based models for low-resource languages such as Kazakh.

 

References:

  1. Begen N., Chugunov A. Smart government: Automated classification of Russian citizens' complaints using machine learning // Proceedings of the International Conference on Electronic Governance and Open Society: Challenges in Eurasia. – 2019. – P. 59–66.
  2. Sun L., Zhao Y., Zhang M. Automatic classification of public complaints using deep learning in Chinese e-government systems // Proceedings of the AAAI Conference on Artificial Intelligence. – 2022.
  3. Amanzholova A., Yessenbayev Z., Nurpeiissov A., Mazzara M., Distefano S., Yahyaoui H., Basso A. KazNERD: A Kazakh named entity recognition dataset // Proceedings of the 13th Language Resources and Evaluation Conference (LREC). – 2022. – P. 4466–4474.
  4. Astana N. L. QazNLTK: Natural language toolkit for Kazakh [Electronic resource]. – 2021. – URL: https://github.com/IS2AI/QazNLTK (accessed: 18.05.2025).
  5. Conneau A., Khandelwal K., Goyal N., Chaudhary V., Wenzek G., Guzmán F., Grave E., Ott M., Zettlemoyer L., Stoyanov V. Unsupervised cross-lingual representation learning at scale // Proceedings of ACL. – 2020.
  6. Bogdanchikov A., Ayazbayev D., Varlamis I. Classification of scientific documents in the Kazakh language using deep neural networks and a fusion of images and text // Big Data and Cognitive Computing. – 2022. – Vol. 6, № 4. – P. 123.
  7. Oralbekova D., Yessengaliyev Y., Zhakipbek Y., Mirzakhmetov B., Bakytkyzy A. A comparative analysis of LSTM and BERT models for named entity recognition in Kazakh language // Modeling and Simulation of Social-
  8. Yeshpanov R. XLM-RoBERTa large NER Kazakh [Electronic resource]. – 2025. – URL: https://huggingface.co/yeshpanovrustem/xlm-roberta-large-ner-kazakh (accessed: 18.05.2025).
  9. Alharbi F., Alhassan R. Arabic text classification using machine learning approaches: A comparative study // Proceedings of the International Conference on Computer and Information Sciences (ICCIS). – IEEE, 2021. – P. 1–6.
  10. Nguyen T., Pham L., Tran D. A BERT-based model for Vietnamese complaint classification in e-government systems // Journal of Information and Telecommunication. – 2022. – Vol. 6, № 2. – P. 225–238.
  11. Meyer D., Goyal P., Choudhury M. Deep learning for complaint classification in low-resource civic contexts // Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). – 2021. – P. 3794–3805.
  12. Manning C. D., Raghavan P., Schütze H. Introduction to Information Retrieval. – Cambridge: Cambridge University Press, 2008.
  13. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv:1810.04805. – 2018.
  14. Rogers A., Kovaleva O., Rumshisky A. A primer in BERTology: What we know about how BERT works // Transactions of the Association for Computational Linguistics. – 2020. – Vol. 8. – P. 842–866.
Information about the authors

Student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty


Candidate of Physical and Mathematical Sciences, Kazakhstan Association of Software Companies, Kazakhstan, Astana


The journal is registered by the Federal Service for Supervision of Communications, Information Technology and Mass Media (Roskomnadzor), registration number ЭЛ №ФС77-54434 dated 17.06.2013.
Founder of the journal: MTsNO LLC.
Editor-in-chief: Marina Yuryevna Zvezdina.