Master’s Student, School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan
KAZAKH HANDWRITTEN TEXT RECOGNITION USING LIGHTWEIGHT CRNN+CTC ARCHITECTURE
ABSTRACT
The aim of this study is to develop an end-to-end pipeline for offline Kazakh handwritten text recognition (HTR) using the KOHTD dataset (140,355 word images). We propose a lightweight CRNN architecture combining a six-block convolutional feature extractor and a two-layer bidirectional LSTM trained with CTC loss. The methodology includes a four-stage preprocessing pipeline (grayscale conversion, adaptive binarization, stroke dilation, normalization) and comprehensive data augmentation (elastic distortions, brightness variation, Gaussian noise). The model achieves 73.42% word accuracy (WA) and 8.0% character error rate (CER) on the held-out test set. The results confirm that data-centric enhancements can deliver strong HTR performance without pretrained backbones in low-resource settings.
АННОТАЦИЯ
Цель исследования — разработка сквозного конвейера распознавания рукописного текста на казахском языке с применением датасета KOHTD (140 355 изображений слов). Предложена лёгкая архитектура CRNN, объединяющая шестиблочный свёрточный экстрактор признаков и двухслойный двунаправленный LSTM, обучаемые с функцией потерь CTC. Методология включает четырёхэтапный конвейер предобработки (перевод в оттенки серого, адаптивная бинаризация, дилатация штрихов, нормализация) и комплексную аугментацию данных (упругие деформации, яркостные искажения, гауссов шум). На тестовой выборке модель достигает 73,42% точности на уровне слова и 8,0% частоты символьных ошибок (CER). Результаты подтверждают, что методы, ориентированные на данные, позволяют добиться высокой точности без предобученных сетей в условиях ограниченных ресурсов.
Keywords: handwritten text recognition, Kazakh language, CRNN, CTC, KOHTD, deep learning, data augmentation.
Ключевые слова: распознавание рукописного текста, казахский язык, CRNN, CTC, KOHTD, глубокое обучение, аугментация данных.
Introduction
Handwritten text recognition (HTR) is a fundamental challenge at the intersection of computer vision and natural language processing. While significant progress has been made for widely spoken languages such as English and Chinese, Kazakh handwritten text remains an underexplored domain. The Kazakh language uses a modified Cyrillic alphabet of 42 letters, including unique characters absent from standard Russian Cyrillic (Ә, Ғ, Қ, Ң, Ө, Ұ, Ү, Һ, І), which makes it impossible to directly apply models trained for other Cyrillic-script languages.
Traditional HTR systems relied on handcrafted feature extraction followed by Hidden Markov Models or Support Vector Machines [15]. These methods struggled with unconstrained handwriting styles. The shift to deep learning introduced Convolutional Recurrent Neural Networks (CRNN) [12], which combine a CNN feature extractor with a bidirectional LSTM and Connectionist Temporal Classification (CTC) loss [3], enabling end-to-end training without explicit character segmentation. Parameshachari et al. [7] confirmed the superiority of CNN-based methods over SVMs for character recognition. Sadaf et al. [10] applied CRNN variants to Bangla script, achieving 12.83% CER. Akter et al. [1] demonstrated that synthetic data generation via BiLSTM-CTC can mitigate data scarcity in low-resource languages. Barrere et al. [2] and Li et al. [5] showed that Transformer-based architectures [14] advance HTR on large datasets, though they require extensive pretraining corpora unavailable for Kazakh. Levkov et al. [4] demonstrated transfer learning effectiveness for Cyrillic scripts, but only on a closed 100-word Russian vocabulary. Pham et al. [8] showed that aggressive augmentation and preprocessing reduce CER by 15% in noisy document settings.
For Kazakh, Narynbayev et al. [6] presented the first systematic CRNN study on KOHTD [13], achieving 78% word accuracy with greedy decoding and 85% with Word Beam Search using a Kazakh word corpus [11]. However, their pipeline applied minimal preprocessing, did not incorporate systematic augmentation, and relied on a large pretrained ResNet-50 backbone. Puigcerver [9] argued that aspect-ratio-preserving resize prevents artificial distortion of character proportions in HTR systems.
The aim of this study is to build a lightweight, reproducible Kazakh HTR system without pretrained backbones, addressing the following tasks: (1) develop a preprocessing pipeline adapted to KOHTD exam-paper scan quality; (2) design and train a compact CRNN+CTC architecture from scratch; (3) evaluate the effect of comprehensive data augmentation on recognition accuracy.
Materials and Methods
All experiments are conducted on KOHTD [13] — the Kazakh Offline Handwritten Text Dataset — the largest publicly available corpus for Kazakh word-level HTR. It was collected during university examinations at Satbayev University and Al-Farabi Kazakh National University, providing realistic handwriting from diverse student writers. Table 1 summarizes the key statistics.
Table 1.
KOHTD Dataset Statistics
| Property | Value |
|---|---|
| Raw document pages | 3,000 scanned exam sheets |
| Total word images | 140,355 |
| Total characters | 922,010 |
| Languages | 99% Kazakh, 1% Russian |
| Mean word length | 6.57 characters |
| Character classes | 102 (Cyrillic + digits + punctuation + blank) |
| Training set | 113,687 images (81%) |
| Validation set | 12,632 images (9%) |
| Test set | 13,036 images (10%) |
Splits are created with random_state=42 for reproducibility.
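The split can be reproduced with, for example, scikit-learn; the two-stage carving below is a sketch of how the roughly 81/9/10 proportions could be obtained (the exact procedure used for KOHTD is an assumption), with `random_state=42` throughout:

```python
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the KOHTD (image_path, label) pairs.
samples = [(f"word_{i:06d}.png", str(i)) for i in range(140_355)]

# Carve out a 10% test split first, then validation from the remainder.
train_val, test = train_test_split(samples, test_size=0.10, random_state=42)
train, val = train_test_split(train_val, test_size=0.10, random_state=42)
```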
Each image passes through a four-stage pipeline before augmentation (Figure 1).
Stage 1 — Grayscale conversion. The image is converted to a single intensity channel using luminance weighting (0.299R + 0.587G + 0.114B) to focus the model on stroke patterns.
Stage 2 — Adaptive binarization. Gaussian adaptive thresholding (block size 25, C = 15) binarizes the grayscale image (stroke pixels = 255, background = 0). Adaptive thresholding handles uneven illumination common in photographed exam papers better than global Otsu thresholding.
Stage 3 — Stroke dilation. A 2×2 elliptical kernel dilates the binarized image (one iteration), removing small noise particles and scanner artifacts while preserving stroke geometry.
Stage 4 — Normalization. The cleaned binary image is resized to 32×160 px, scaled to [0, 1], then standardized to zero mean and unit variance.
Figure 1. Preprocessing pipeline on three KOHTD samples. Left to right: (1) original grayscale, (2) adaptive binarization (white text on black background), (3) stroke dilation (2×2 ellipse, strokes visibly thickened).
Data augmentation (training only): random rotation (±5°), elastic distortions, grid warping, brightness/contrast jitter, and Gaussian noise. Validation and test images receive only normalization.
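The rotation and elastic components can be sketched without specialized augmentation libraries; only the ±5° range comes from the text, all other magnitudes are illustrative assumptions, and grid warping is omitted for brevity:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates, rotate

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Training-time augmentation sketch for a normalized H x W image."""
    # Random rotation within +-5 degrees.
    img = rotate(img, rng.uniform(-5, 5), reshape=False, mode="nearest")
    # Elastic distortion: smooth a random displacement field (Simard-style).
    h, w = img.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma=4) * 8
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma=4) * 8
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    img = map_coordinates(img, [y + dy, x + dx], order=1, mode="reflect")
    # Brightness/contrast jitter and additive Gaussian noise.
    img = img * rng.uniform(0.8, 1.2) + rng.uniform(-0.1, 0.1)
    img = img + rng.normal(0.0, 0.02, img.shape)
    return img.astype(np.float32)
```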
The proposed Lightweight CRNN consists of two modules connected end-to-end (Figure 2).
Figure 2. Lightweight CRNN: six-block CNN feature extractor → 2-layer Bidirectional LSTM → Linear classifier → Greedy CTC decoding.
CNN feature extractor. Six convolutional blocks (Conv2d → ReLU → MaxPool) with channel progression 1→64→128→256→256→512→512 reduce spatial height via pooling while preserving the horizontal width. The output is reshaped into a time-sequence of feature vectors, one per vertical image column.
2-Layer Bidirectional LSTM. The feature sequence is encoded by a 2-layer bidirectional LSTM (256 hidden units per direction), producing 512-dimensional representations. Bidirectional processing incorporates context from both sides of each character position, essential for resolving ambiguous Kazakh cursive characters.
Output layer. A linear layer maps the 512-dimensional vectors to logits over 102 classes (Kazakh Cyrillic + digits + punctuation + CTC blank). Table 2 summarizes the architecture.
Table 2.
Lightweight CRNN Architecture Summary
| Module | Configuration | Output |
|---|---|---|
| CNN | 6 × (Conv2d + ReLU + MaxPool), channels 1→64→128→256→256→512→512 | B × T × 512 |
| BiLSTM | 2 layers, 256 units per direction, bidirectional | B × T × 512 |
| FC | Linear(512 → 102) | B × T × 102 |
| CTC | Greedy collapse at inference | variable-length text |
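The model can be sketched in PyTorch as follows. Kernel sizes and the pooling schedule are assumptions not fixed by the text: here the height is pooled down to 1 over the six blocks while the width is halved only once, giving one timestep per two-pixel column; the exact schedule may differ.

```python
import torch
import torch.nn as nn

class LightweightCRNN(nn.Module):
    """Sketch of the six-block CNN + 2-layer BiLSTM + linear classifier."""
    def __init__(self, num_classes: int = 102):
        super().__init__()
        chans = [1, 64, 128, 256, 256, 512, 512]
        # Assumed pooling: height halved in five blocks, width only in the first.
        pools = [(2, 2), (2, 1), (2, 1), (2, 1), (2, 1), None]
        layers = []
        for (c_in, c_out), pool in zip(zip(chans, chans[1:]), pools):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool2d(pool))
        self.cnn = nn.Sequential(*layers)
        self.rnn = nn.LSTM(512, 256, num_layers=2, bidirectional=True,
                           batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.cnn(x)                    # B x 512 x 1 x T
        f = f.squeeze(2).permute(0, 2, 1)  # B x T x 512
        out, _ = self.rnn(f)               # B x T x 512 (256 per direction)
        return self.fc(out)                # B x T x 102 CTC logits
```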
CTC loss [3] enables alignment-free training by marginalizing over all valid label paths. The Adam optimizer (learning rate 1×10⁻⁴) with gradient clipping (max norm 5) is used. Batch size: 16; epochs: 40. At inference, greedy CTC decoding collapses repeated symbols and removes blank tokens to produce the final transcription.
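Greedy decoding amounts to an argmax per timestep, collapsing consecutive repeats, then dropping blanks. A dependency-free sketch (the blank being index 0 is an assumption):

```python
def greedy_ctc_decode(logits, blank: int = 0) -> list:
    """Collapse a T x C matrix of per-timestep scores into label indices."""
    best = [max(range(len(step)), key=step.__getitem__) for step in logits]
    decoded, prev = [], None
    for idx in best:
        # Emit only on a change of symbol, and never emit the blank.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded
```

Note that a blank between two identical symbols (as in the path 1-1-0-1 below) correctly yields a doubled character, which is why CTC can represent words like "аралас" with repeated letters.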
Results and Discussion
Three metrics are reported, consistent with prior Kazakh HTR work [6, 13]: Word Accuracy (WA), the share of predictions that exactly match the ground truth; Word Error Rate (WER), the complement of WA; and Character Error Rate (CER), the character-level Levenshtein distance divided by the total number of reference characters: CER = (S + D + I) / C, where S, D, and I are the numbers of character substitutions, deletions, and insertions, and C is the total number of reference characters.
Figure 3 shows validation WA and CER over 40 training epochs.
Figure 3. Validation Word Accuracy (solid line) and Character Error Rate (dashed line) over 40 epochs for the Lightweight CRNN.
On the held-out test set (13,036 images) the model achieves: Word Accuracy = 73.42%, CER = 8.0%. Table 3 summarizes the evaluation results of the proposed model.
Table 3.
Evaluation Results on the KOHTD Test Set
| Model | Decoder | WA (%) | WER (%) | CER (%) |
|---|---|---|---|---|
| Lightweight CRNN (proposed) | Greedy | 73.42 | 26.58 | 8.0 |
The proposed Lightweight CRNN achieves 73.42% WA and 8.0% CER on the held-out KOHTD test set. The 8.0% CER confirms that most errors involve partial character confusions rather than complete word failures, consistent with the visual similarity of several Kazakh Cyrillic pairs (Ш/Щ, З/Ж, Ы/Ь). The compact architecture, trained entirely from scratch on 113,687 KOHTD samples, demonstrates that systematic preprocessing and data augmentation are sufficient to achieve competitive recognition without relying on large pretrained backbones. Applying lexicon-constrained beam search decoding is expected to yield further gains and is left as future work.
Conclusion
This paper presented a data-centric, lightweight CRNN+CTC framework for offline Kazakh handwritten word recognition on the KOHTD dataset. The main contributions are: (1) a four-stage preprocessing pipeline (adaptive binarization, stroke dilation, normalization) adapted to exam-paper scan quality; (2) a compact custom CRNN with a six-block CNN and two-layer bidirectional LSTM trained entirely from scratch; (3) a comprehensive augmentation strategy simulating scanning artifacts and writing variation. The model achieves 73.42% word accuracy and 8.0% CER, providing a reproducible benchmark for future Kazakh HTR research. Future work will explore lexicon-constrained beam search decoding and full-line recognition with automatic segmentation.
Список литературы:
- Akter M.S., Shahriar H., Cuzzocrea A., Ahmed N., Leung C. Handwritten Word Recognition using Deep Learning: A Novel Way of Generating Handwritten Words // 2022 IEEE International Conference on Big Data. — IEEE, 2022. — P. 5414–5423. DOI: 10.1109/BIGDATA55660.2022.10021025
- Barrere K., Soullard Y., Lemaitre A., Couasnon B. A Light Transformer-Based Architecture for Handwritten Text Recognition // Document Analysis Systems. — Springer, 2022. — P. 275–290. DOI: 10.1007/978-3-031-06555-2
- Graves A., Fernández S., Gomez F., Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks // Proceedings of the 23rd International Conference on Machine Learning (ICML). — 2006. — P. 369–376. DOI: 10.1145/1143844.1143891
- Levkov A., Kaplun D., Safonova A. Transfer Learning for Russian Handwriting Recognition // 2023 46th International Conference on Telecommunications and Signal Processing (TSP). — IEEE, 2023. — P. 272–275. DOI: 10.1109/TSP59544.2023.10197682
- Li M., Lv T., Cui L., Lu Y., Florencio D., Zhang C., Li Z., Wei F. TrOCR: Transformer-based optical character recognition with pre-trained models // arXiv preprint arXiv:2109.10282. — 2021.
- Narynbayev D., Serikkhan A., Barkhandinova A., Mohammad I. Kazakh Handwritten Text Recognition Using Computer Vision and Neural Network // 2023 17th International Conference on Electronics Computer and Computation (ICECCO). — IEEE, 2023. — P. 1–5. DOI: 10.1109/ICECCO58239.2023.10147136
- Parameshachari B.D., Ashok A., Reddy H. Comparative Analysis of Handwritten Text Recognition using CNN and SVM // 2nd IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE). — IEEE, 2023. DOI: 10.1109/ICDCECE57866.2023.10150890
- Pham H., Setlur A., Dingliwal S., Lin T.H., Poczos B., Huang K., Li Z., Lim J., McCormack C., Vu T. Robust Handwriting Recognition with Limited and Noisy Data // Proceedings of International Conference on Frontiers in Handwriting Recognition (ICFHR). — IEEE, 2020. — P. 301–306. DOI: 10.1109/ICFHR2020.2020.00062
- Puigcerver J. Are multidimensional recurrent layers really necessary for handwritten text recognition? // 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). — IEEE, 2017. — Vol. 1. — P. 67–72.
- Sadaf F., Raju S.M.T.U., Muntakim A. Offline Bangla Handwritten Text Recognition: A Comprehensive Study // 3rd International Conference on Electrical and Electronic Engineering (ICEEE). — IEEE, 2021. — P. 153–156. DOI: 10.1109/ICEEE54059.2021.9718890
- Scheidl H., Fiel S., Sablatnig R. Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm // 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR). — 2018. — P. 253–258. DOI: 10.1109/ICFHR-2018.2018.00052
- Shi B., Bai X., Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition // IEEE Transactions on Pattern Analysis and Machine Intelligence. — 2017. — Vol. 39, No. 11. — P. 2298–2304. DOI: 10.1109/TPAMI.2016.2646371
- Toiganbayeva N., Kasem M., Abdimanap G., Bostanbekov K., Abdallah A., Alimova A., Nurseitov D. KOHTD: Kazakh offline handwritten text dataset // Signal Processing: Image Communication. — 2022. — Vol. 108. — P. 116827. DOI: 10.1016/j.image.2022.116827
- Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A., Kaiser L., Polosukhin I. Attention is all you need // Advances in Neural Information Processing Systems. — 2017. — Vol. 30. — P. 5998–6008.
- Vinjit B.M., Bhojak M.K., Kumar S., Chalak G. A Review on Handwritten Character Recognition Methods and Techniques // 2020 IEEE International Conference on Communication and Signal Processing (ICCSP). — IEEE, 2020. — P. 1224–1228. DOI: 10.1109/ICCSP48568.2020.9182129