Master Student of School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
MACHINE TRANSLATION SYSTEM FOR KAZAKH–RUSSIAN USING STATE-OF-THE-ART GENERATIVE AI MODELS
АННОТАЦИЯ
Языки с ограниченными ресурсами, такие как казахский, сталкиваются со значительными препятствиями в машинном переводе из-за дефицита и зашумленности параллельных корпусов. В данной статье представлена двунаправленная система нейронного машинного перевода (НМТ) казахско-русского языка, использующая передовые генеративные модели. Мы исследуем существующие технологии, анализируем зашумленный набор данных OPUS и разрабатываем конвейер семантической фильтрации с использованием эмбеддингов LaBSE для создания высококачественного корпуса KazRuClean. Мы обучаем модели с нуля и дорабатываем предварительно обученные модели как на зашумленных, так и на чистых наборах данных, сравнивая их производительность. Наш фильтрованный подход улучшает показатели BLEU на 10% и оценки людей по сравнению с базовыми моделями, предлагая многоразовую методологию, новый корпус и понимание устойчивости к переключению кодов для «шала-казахского».
ABSTRACT
Low-resource languages like Kazakh face significant barriers in machine translation due to scarce, noisy parallel corpora. This paper presents a bidirectional Kazakh–Russian neural machine translation (NMT) system leveraging advanced generative models. We explore existing technologies, analyze the noisy OPUS dataset, and develop a semantic filtering pipeline using LaBSE embeddings to create a high-quality KazRuClean corpus. We train models from scratch and fine-tune pretrained models on both noisy and clean datasets, comparing their performance. Our filtered approach improves BLEU scores by 10% and human ratings over baselines, offering a reusable methodology, a new corpus, and insights into code-switching robustness for “Shala Kazakh.”
Ключевые слова: нейронный машинный перевод, казахский язык, русский язык, языковые модели, языки с ограниченными ресурсами, параллельные корпуса, эмбеддинги LaBSE, качество данных, предобученные модели, переключение кодов, шала-казахский, обработка естественного языка.
Keywords: Neural Machine Translation, Kazakh language, Russian language, Language Models, low-resource languages, parallel corpora, LaBSE embeddings, data quality, pre-trained models, code-switching, Shala-Kazakh, NLP (Natural Language Processing).
1. Introduction
Neural machine translation (NMT) has revolutionized language processing with architectures like the Transformer (21) and pretrained multilingual models such as mT5 (22), NLLB (4), and M2M-100 (6). Yet, low-resource language pairs like Kazakh–Russian face persistent challenges:
- Linguistic disparities: Kazakh’s agglutinative, Turkic structure contrasts with Russian’s fusional, Slavic morphology, complicating alignment.
- Script evolution: Kazakh’s transitions between Cyrillic, Latin, and Arabic scripts create inconsistencies in historical data.
- Code-switching: Urban “Shala Kazakh” blends Russian words with Kazakh grammar, challenging model robustness.
- Data scarcity and noise: Limited parallel corpora, such as OPUS (20), are riddled with misaligned or irrelevant pairs.
Our goal is to build a robust Kazakh–Russian NMT system by:
- Reviewing cutting-edge NMT technologies for low-resource settings.
- Analyzing the OPUS dataset’s quality and noise issues, and their effect on translation performance.
- Filtering noisy data using semantic embeddings to create a clean corpus.
- Training and fine-tuning models on both noisy and clean datasets.
- Comparing performance across models and datasets, including robustness to code-switching.
The contributions of this work include a scalable filtering methodology, a high-quality KazRuClean version of the OPUS corpus, comprehensive evaluations against commercial systems, and an open-source dataset release to support future research.
2. Related Work
Developing effective MT for low-resource pairs like Kazakh–Russian requires addressing data limitations, linguistic complexity, and evaluation challenges. This section surveys existing technologies across several key areas.
2.1 Low-Resource Machine Translation
Low-resource MT adapts NMT to languages with minimal parallel data, where morphological richness and domain mismatches exacerbate challenges. Transfer learning fine-tunes high-resource pretrained models for low-resource pairs (23), but struggles with typologically distant languages like Kazakh and Russian. Multilingual joint training trains a single model on multiple languages, enabling zero-shot translation (9). However, data imbalance often biases performance toward high-resource pairs.
Data augmentation techniques, such as back-translation (19) and synthetic data generation (5), mitigate sparsity but risk introducing noise. The NLLB project (4) scales multilingual MT to 200+ languages using curated datasets and massive pre-training, providing a blueprint for low-resource systems. Our work builds on these approaches, tailoring them to Kazakh–Russian.
2.2 Kazakh–Russian NLP Resources
Kazakh–Russian NLP lags despite the pair’s geopolitical significance. Assylbekov et al. (1) applied early Transformer models to this pair, highlighting issues with Kazakh’s morphology and sparse data. The KazParC corpus (11), with 1.6M semi-synthetic pairs across four languages, offers scale but lacks Kazakh–Russian specificity.
2.3 Research Gaps
Despite advances, Kazakh–Russian MT suffers from noisy datasets, limited context-aware models, and underexplored code-switching. We address these by curating a clean corpus, comparing diverse training strategies, and releasing our dataset publicly.
3. Datasets
We evaluated three primary datasets for Kazakh–Russian MT:
- KazParC (11): 1.6M semi-synthetic pairs across four languages. Its broad scope dilutes Kazakh–Russian focus, and synthetic data introduces artifacts.
- OPUS MultiCCAligned (20): 431,953 pairs, publicly available but noisy, with misaligned sentences and domain mismatches.
- OPUS Wikimedia: A subset of OPUS with mixed religious and media content, prone to high error rates due to poor alignment.
The OPUS MultiCCAligned dataset, our primary focus, contains significant noise. For example:
- Kazakh: шабындық демалады ауыр қар астында (meadow rests under heavy snow).
- Russian: И дома своего не узнаешь (you won’t recognize your home).
LaBSE similarity: 0.18.
This pair is semantically unrelated, illustrating how noise in training or test data can skew model performance and metrics. Figure 1 shows another case, where a single source sentence appears in two pairs, neither of which is a correct translation. Such errors, prevalent across OPUS, motivated our filtering approach, since they lead to poor generalization on real-world data. While manually reviewing every pair would be ideal, automated methods can approximate that review at scale.
Figure 1. An example of the same sentence appearing in two pairs; both translations are wrong
4. Methodology
Our methodology cleans the OPUS dataset and trains multiple models to compare the impact of data quality.
4.1 Dataset Analysis and Semantic Filtering
Our quality analysis of OPUS MultiCCAligned revealed that roughly 30–40% of pairs were misaligned or irrelevant. To address this, we developed a filtering pipeline using LaBSE (7), which supports 109 languages, including Kazakh and Russian. The pipeline:
- Generates LaBSE embeddings for Kazakh and Russian sentences.
- Computes cosine similarity between embeddings.
- Applies a 0.7 similarity threshold, optimized on a development set.
- Removes pairs below the threshold.
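The core of this pipeline, once embeddings exist, reduces to a vectorized cosine-similarity cutoff. The sketch below assumes the sentence embeddings have already been produced (e.g., with the LaBSE checkpoint from the sentence-transformers library); the function name is illustrative:

```python
import numpy as np

def cosine_filter(kk_emb, ru_emb, threshold=0.7):
    """Score candidate pairs by cosine similarity of their embeddings
    and flag those that pass the threshold.

    kk_emb, ru_emb: (n, d) arrays of sentence embeddings; row i of each
    array belongs to the same candidate pair.
    """
    # Normalize rows so the row-wise dot product equals cosine similarity.
    kk = kk_emb / np.linalg.norm(kk_emb, axis=1, keepdims=True)
    ru = ru_emb / np.linalg.norm(ru_emb, axis=1, keepdims=True)
    scores = np.sum(kk * ru, axis=1)
    return scores, scores >= threshold
```

In the full pipeline, the embeddings would come from `SentenceTransformer("sentence-transformers/LaBSE").encode(...)`, and pairs whose mask entry is False are dropped.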
Figure 2. For clarity, long-distance (&lt;0.65) and short-distance pairs are plotted separately. All topics show a similar distribution of pairs
Figure 3. The distribution of cosine similarity between pairs in the OPUS dataset
From 431,953 pairs, 320,926 (74%) passed the 0.7 threshold, forming the KazRuClean corpus. Table 1 shows sample pairs:
Table 1.
Example pairs with their Google Translate renderings and LaBSE similarity scores
| Kazakh text | Russian translation | Google Translate | Score |
| Ал өрт, [сиқыр] әңгіме! | И огненный, [волшебный] разговор! | И огонь, [магический] разговор! | 0.85 |
| Тем, кім сүйісу жаңбыр қалдырады, | Тем, кто листьев целует дождь, | Тем, который оставляет дождь, чтобы поцеловать, | 0.73 |
| Оның бұрынғы тірілту үшін бастайды, | Начнет былое воскресать, – | Его прежняя жизнь начинает оживать, | 0.92 |
| Бірақ жасыл ауыр көз, | Но зелены мучительные очи, | Но зеленый тяжелый глаз, | 0.91 |
| Расында, оның (Нұхтың) жақтастарынан (ши’атиһи) Ибраһим болды. (37:83) | Поистине, из его (Ноя) приверженцев (ши,атихи) был Ибрахим (Авраам). (37:83) | Поистине, из его (Ноя) сторонников (шиитов) был Ибрахим. (37:83) | 0.91 |
| Онда қар таблеткаларын отырып қайраткерлері күресіп, | скотина, верблюды, их поводыри, | Там сражались герои снежных табличек, | 0.21 |
| Рамадан кіретін “Id Al-Fitr” … Құдайдың осындай күш-жігерлеріңізге сауап етпеуі мүмкін емес! | В течение этого месяца … Бог не преминет вознаградить эти усилия! | «Ид аль-Фитр» вступая в Рамадан… Невозможно, чтобы Бог не вознаградил такие усилия! | 0.64 |
| Адамды қалай тыныштандыру керек | При потере близкого | Как успокоить человека | 0.39 |
| шабындық демалады ауыр қар астында, | И дома своего не узнаешь, | Трава отдыхает под тяжелым снегом, | 0.18 |
| Әкімге өтініш Қаланы бірге жақсартайық! | Қазақша Русский | Просьба к правителю Давайте вместе улучшать город! | 0.07 |
| Дәл таяқша демалды | В кипяток положь яйцо | Точно палка упиралась | 0.14 |
4.2 Topic Modeling and Sampling
To ensure domain coverage, we applied BERTopic (8) to KazRuClean, identifying domains: government (25%), news (22%), literature (18%), religion (12%), technical (10%), social media (8%), and others (5%). Stratified sampling maintained these proportions across the training (80%), validation (10%), and test (10%) splits.
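Once BERTopic has assigned a topic label to each pair, the stratified 80/10/10 split can be sketched as follows (function name and seed are illustrative):

```python
import random
from collections import defaultdict

def stratified_split(pairs, topics, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split pairs into train/val/test while preserving each topic's
    share in every split."""
    by_topic = defaultdict(list)
    for pair, topic in zip(pairs, topics):
        by_topic[topic].append(pair)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for items in by_topic.values():
        rng.shuffle(items)               # shuffle within each topic
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

Because the split is done per topic, each of the three sets mirrors the overall domain proportions listed above.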
Figure 4. The distribution of 500 random samples in PCA latent space
Both long- and short-distance pairs are present; although topics overlap, the points are more or less clustered by topic. The small cluster on the right consists of dates, e.g., KZ: Қараша 29, 2018 vs. RU: 28 ноября 2018 (score 0.88). Even this date pair contains an error: the days do not match.
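A 2-D projection like the one in Figure 4 can be reproduced with a plain PCA via SVD. The sketch below assumes a matrix of LaBSE sentence embeddings and keeps only the top two components:

```python
import numpy as np

def pca_2d(embeddings):
    """Project (n, d) sentence embeddings onto their top two principal
    components for 2-D plotting."""
    X = embeddings - embeddings.mean(axis=0)   # center the data
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T
```

The returned coordinates can then be scatter-plotted, coloring points by topic or by LaBSE similarity score.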
4.3 Model Selection and Training
We trained four models to compare data quality effects:
- Scratch-Noisy: A Transformer model trained from scratch on unfiltered OPUS data.
- Scratch-Clean: A Transformer model trained from scratch on KazRuClean.
- Pretrained-Noisy: M2M100-418M (6) fine-tuned on unfiltered OPUS data.
- Pretrained-Clean: M2M100-418M fine-tuned on KazRuClean.
M2M100 was chosen for its pretraining on 100 languages, including Kazakh and Russian, and its efficiency in fine-tuning. Training used:
- Learning rate: 5e-5.
- Batch size: 16 with 16 gradient accumulation steps.
- Epochs: 3–10, with the best checkpoint saved.
- Max sequence length: 128 tokens.
- FP16/BF16 mixed precision.
Listing 1: Model initialization
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang = "kk"  # Kazakh
tokenizer.tgt_lang = "ru"  # Russian
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
model.gradient_checkpointing_enable()  # trade compute for memory
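The hyperparameters above map naturally onto Hugging Face Seq2SeqTrainingArguments. This is a sketch mirroring the reported settings; output_dir and metric_for_best_model are illustrative choices, not taken from the paper:

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters as reported in Section 4.3; output_dir and the
# best-model metric are illustrative.
training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100-kk-ru",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,   # effective batch size of 256
    num_train_epochs=10,
    fp16=True,                        # or bf16=True on supported GPUs
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the best checkpoint
    metric_for_best_model="bleu",
    predict_with_generate=True,
    generation_max_length=128,
)
```

Passing these arguments to a Seq2SeqTrainer alongside the model and tokenizer reproduces the training loop described above.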
4.4 Latent Space Analysis
We visualized sentence pair embeddings using UMAP (15) and cosine similarity, revealing:
- Topic clusters (e.g., religion, media, literature).
- Noise clusters with low-similarity pairs (e.g., dates: Қараша 29, 2018 vs. 28 ноября 2018, score 0.88, yet erroneous).
- Long, tangled connections indicating systematic alignment errors.
This analysis guided threshold selection and highlighted systematic dataset flaws.
5. Code-Switching
“Shala Kazakh”—code-switching between Kazakh and Russian—is common in urban areas. We created a 1000-pair test set with:
- Light mixing: Russian loanwords with Kazakh morphology (e.g., вечеринкаға барды).
- Medium mixing: Russian phrases in Kazakh syntax.
- Heavy mixing: Alternating language segments.
Example: Менің достарым кеше вечеринкаға барды, но мен не смог присоединиться (my friends went to a party yesterday, but I couldn’t join). This set tests model robustness to mixed-language input.
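When building such a test set, one rough heuristic, not taken from the paper, exploits the fact that several Cyrillic letters occur in Kazakh but never in Russian: grade mixing intensity by the share of Cyrillic tokens that contain no Kazakh-specific letters. Thresholds are illustrative:

```python
# Cyrillic letters used in Kazakh but absent from Russian.
KAZAKH_ONLY = set("әғқңөұүһі" + "ӘҒҚҢӨҰҮҺІ")

def mixing_level(sentence):
    """Grade code-switching intensity by the share of Cyrillic tokens
    containing no Kazakh-specific letters (i.e., Russian-looking tokens).
    Thresholds are illustrative, not from the paper."""
    def is_cyrillic(tok):
        return any("а" <= c.lower() <= "я" or c.lower() in KAZAKH_ONLY
                   for c in tok)
    tokens = [t for t in sentence.split() if is_cyrillic(t)]
    if not tokens:
        return "none"
    russian_like = sum(1 for t in tokens if not (set(t) & KAZAKH_ONLY))
    share = russian_like / len(tokens)
    if share < 0.25:
        return "light"
    if share < 0.5:
        return "medium"
    return "heavy"
```

Note that Kazakh words spelled entirely with shared Cyrillic letters are counted as Russian-like, so this over-estimates mixing; it is only a first-pass filter before human review.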
6. Experimental Setup
Training ran on a single NVIDIA A100 GPU using PyTorch and Hugging Face Transformers. We implemented:
- Gradient checkpointing for memory efficiency.
- Custom caching with Polars for fast data loading:
import polars as pl

def load_data(csv_path, score_threshold=0.7):
    # Polars reads the TSV quickly; keep only pairs at or above the
    # LaBSE score threshold, then hand off to pandas for training code.
    df = pl.read_csv(csv_path, separator="\t")
    df = df.filter(pl.col("score") >= score_threshold)
    return df.to_pandas()
Evaluation used:
- Automatic metrics: BLEU (16), chrF++ (17), and LaBSE similarity for semantic preservation.
- Human evaluation: Bilingual speakers rated adequacy and fluency (1-5 scale) and ranked translations against Google and Yandex.
- Code-switching tests: Performance across light, medium, and heavy mixing.
Listing 2: BLEU evaluation
from datasets import load_metric

bleu = load_metric("sacrebleu")
for example in test_set:
    inputs = tokenizer(example["kazakh"], return_tensors="pt",
                       max_length=128, truncation=True)
    generated = model.generate(**inputs, max_length=128)
    pred = tokenizer.decode(generated[0], skip_special_tokens=True)
    bleu.add(prediction=pred, reference=[example["russian"]])
score = bleu.compute()
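For intuition about what chrF++ measures, here is a toy re-implementation of its character-level core (mean F2 over character n-grams); real evaluation should use sacrebleu's chrF++ implementation, which additionally averages in word 1- and 2-gram scores:

```python
from collections import Counter

def char_ngram_fscore(hyp, ref, n=6, beta=2.0):
    """Toy chrF core: mean F-beta over character n-grams of order 1..n.
    (Full chrF++ also includes word 1- and 2-gram F-scores.)"""
    f_scores = []
    for order in range(1, n + 1):
        h = Counter(hyp[i:i + order] for i in range(len(hyp) - order + 1))
        r = Counter(ref[i:i + order] for i in range(len(ref) - order + 1))
        if not h or not r:
            continue  # string shorter than this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta ** 2) * prec * rec
                        / (beta ** 2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0
```

Recall is weighted twice as heavily as precision (beta=2), matching the chrF definition; identical strings score 100 and fully disjoint strings score 0.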
7. Results and Discussion
We compared the four models against Google Translate and Yandex.Translate.
7.1 Overall Performance
Performance was evaluated with BLEU, chrF++, and human fluency ratings. The human fluency scale is:
- 5: Perfectly fluent, indistinguishable from native text.
- 4: Mostly fluent, minor unnatural phrasing or grammar issues.
- 3: Understandable but noticeably non-native or awkward.
- 2: Significant fluency issues, difficult to read smoothly.
- 1: Incomprehensible or severely broken.
While subjective, this metric offers a useful complement to the automatic scores.
Table 2 summarizes results:
Table 2.
Performance across models and datasets
| Model | Data | BLEU | chrF++ | Human Fluency |
| Scratch | OPUS | 6.10 | 13.38 | 1 |
| Scratch | KazRuClean | 6.78 | 14.30 | 1 |
| Pretrained | OPUS | 30.74 | 51.86 | 3 |
| Pretrained | KazRuClean | 33.83 | 56.40 | 4 |
| Google | – | 37.71 | 62.12 | 4 |
| Yandex | – | 32.44 | 59.05 | 4 |
Key findings:
- Clean data superiority: Pretrained-Clean outperformed all models, achieving 33.83 BLEU (+3.1 over Pretrained-Noisy).
- Pretraining advantage: Pretrained models consistently beat scratch-trained models, leveraging multilingual knowledge.
- Commercial comparison: Pretrained-Clean surpassed Yandex.Translate in BLEU and remained competitive with both commercial systems overall, supporting our filtering approach.
7.2 Code-Switching Performance
Table 3 shows results on the Shala Kazakh test set:
Table 3.
Code-switching performance (chrF++)
| Model | Light | Medium | Heavy |
| Scratch-Noisy | 7.21 | 10.17 | 15.45 |
| Scratch-Clean | 8.01 | 12.27 | 16.32 |
| Pretrained-Noisy | 47.23 | 58.72 | 60.38 |
| Pretrained-Clean | 48.77 | 60.31 | 60.57 |
Clean-data models handled code-switching better, with Pretrained-Clean excelling across all levels, suggesting that high-quality data enhances robustness. Interestingly, evaluation metrics such as chrF++ improved as the degree of code-switching increased. This counterintuitive trend may be attributed to the presence of Russian lexical items embedded within the Kazakh source text, making the input more predictable for the Russian-targeted decoder and thus easier to translate.
7.3 Error Analysis
Common errors included:
- Morphological issues: Kazakh suffixes were misparsed, especially in Scratch-Noisy.
- Cultural mismatches: Terms like Ид аль-Фитр were mistranslated in noisy models.
- Code-switching failures: Noisy models preserved Russian terms inconsistently.
Example: Айдын Мағжан Жұмабаевтың шығармаларын оқыды was mistranslated as Айдын читал произведения Магжана Жумабайева (incorrect Russian spelling).
8. Conclusion
Our study demonstrates the critical role of data quality in low-resource MT. By analyzing OPUS, filtering noise with LaBSE, and comparing models trained on noisy and clean data, we achieved strong Kazakh–Russian translation results. The Pretrained-Clean model outperformed all our baselines and is competitive with commercial systems. We release KazRuClean publicly to spur further research at https://huggingface.co/datasets/rA9del/KazRuClean.
Future directions include:
- Document-level translation for contextual accuracy.
- Synthetic data for code-switching robustness.
- Parameter-efficient fine-tuning for deployment.
- Dataset extension.
References:
1. Assylbekov, Z., et al. (2018). Russian to Kazakh MT. WAT, 52–60.
2. Chaudhary, V., et al. (2019). Corpus filtering with embeddings. WMT, 261–266.
3. Conneau, A., et al. (2020). Unsupervised cross-lingual representation learning. ACL, 8440–8451.
4. Costa-jussà, M. R., et al. (2022). No language left behind. arXiv:2207.04672.
5. Fadaee, M., et al. (2017). Data augmentation for MT. ACL, 567–573.
6. Fan, A., et al. (2021). Beyond English-centric multilingual MT. JMLR, 22(107), 1–48.
7. Feng, F., et al. (2020). LaBSE embeddings. arXiv:2007.01852.
8. Grootendorst, M. (2022). BERTopic: Neural topic modeling. arXiv:2203.05794.
9. Johnson, M., et al. (2017). Google's multilingual NMT. TACL, 5, 339–351.
10. Junczys-Dowmunt, M. (2018). Dual conditional cross-entropy filtering. WMT, 888–895.
11. Khassanov, Y., et al. (2021). KazParC corpus. arXiv:2106.00836.
12. Koehn, P., et al. (2020). WMT 2020 filtering task. WMT, 726–742.
13. Liu, Y., et al. (2020). Multilingual denoising pre-training for NMT. TACL, 8, 726–742.
14. Makhambetov, O., et al. (2022). Kazakh–Russian MT challenges. Springer, 287–299.
15. McInnes, L., et al. (2018). UMAP: Uniform manifold approximation and projection. arXiv:1802.03426.
16. Papineni, K., et al. (2002). BLEU: A method for automatic evaluation of MT. ACL, 311–318.
17. Popović, M. (2017). chrF++: Words helping character n-grams. WMT, 612–618.
18. Rei, R., et al. (2020). COMET: A neural framework for MT evaluation. EMNLP, 2685–2702.
19. Sennrich, R., et al. (2016). Improving NMT with back-translation. ACL, 117–126.
20. Tiedemann, J. (2020). The OPUS corpus. LREC, 3528–3534.
21. Vaswani, A., et al. (2017). Attention is all you need. NIPS, 5998–6008.
22. Xue, L., et al. (2021). mT5: A massively multilingual pre-trained text-to-text transformer. NAACL, 483–498.
23. Zoph, B., et al. (2016). Transfer learning for low-resource MT. arXiv:1604.02201.