LSTM-CTC BASED ACOUSTIC MODEL FOR KAZAKH SPEECH EVALUATION

Cite as:
Mussabekov N.D., Suleimenov Y.R. LSTM-CTC BASED ACOUSTIC MODEL FOR KAZAKH SPEECH EVALUATION // Universum: технические науки: electronic scientific journal. 2026. 4(145). URL: https://7universum.com/ru/tech/archive/item/22503 (accessed: 07.05.2026).
DOI: 10.32743/UniTech.2026.145.4.22503
Received: 04.04.2026
Accepted: 14.04.2026
Published: 28.04.2026

 

ABSTRACT

People learning foreign languages online mostly cannot get feedback on how they sound, so automatic pronunciation evaluation can be an essential component of language learning. This paper addresses the problem of building an acoustic model that can align Kazakh speech with its text at the letter level, forming the foundation for future pronunciation assessment systems. An LSTM model trained with CTC loss on the AlFarabi Speech Dataset to predict one character per audio frame achieved a character error rate of 25.3% and a word error rate of 28.1%, demonstrating the ability to produce frame-level alignments without manual segmentation. These results indicate that deep learning methods can be effectively adapted for Kazakh speech. The model provides a practical starting point for developing real-time, feedback-driven language-learning applications.


 

Keywords: Kazakh language, machine learning, speech recognition, speech-to-text alignment, computer-assisted language learning.


 

Introduction

Learning a new language can be an exciting experience, but one barrier all of us can face is pronunciation. A language learner can struggle to pronounce words, especially without immediate feedback. For widely spoken global languages there are helpful tools to evaluate a learner's pronunciation. Unfortunately, for a low-resource language such as Kazakh there are far fewer tools available, leaving a learner without a way to improve. Automatic Speech Recognition (ASR) technology has advanced rapidly [1] and can now be useful in language learning, but most ASR systems and research focus on widely studied languages: while research on English keeps growing, Kazakh has received little attention in both academia and commercial applications.

There are two primary approaches to sequence modeling in speech recognition. The first is CTC [2], [3], which marginalizes over all possible output alignments, effectively handling the alignment between input speech and target text without requiring pre-segmented data. The second is the encoder-decoder framework [4]-[6], which uses two LSTMs: the first processes the input over time and the second predicts the output probabilities. Of these two approaches, the CTC loss function is the more widely used for handling sequence alignment between speech and text [1], [7]. Recent studies have also compared LSTM [8] and attention-based acoustic models. Transformer models tend to train faster and converge more rapidly than LSTMs, often without requiring CTC loss for alignment, but they tend to overfit more easily when training data is limited. Despite these differences, both architectures can achieve comparable performance when a moderate amount of training data is available [9]. Studies have reported significant improvements in acoustic modeling using the aforementioned methods: LSTM-CTC as well as CNN-RNN-CTC models [7], [10]-[12] have demonstrated superior accuracy in ASR and pronunciation assessment compared to older Hidden Markov Model (HMM) based systems.

Most existing acoustic modeling systems are trained on high-resource languages. These models struggle with the unique phonetic characteristics of Kazakh, such as vowel harmony, complex consonant clusters, and prosodic patterns. In this work, I address this gap by designing and training an LSTM-CTC based acoustic model specifically for Kazakh speech on the AlFarabi Speech Dataset, and I evaluate performance at the letter level along with error rates calculated on whole words. The architecture operates at the letter level, aligning each audio frame with a predicted character. This letter-level prediction capability serves as a foundational component for future pronunciation evaluation systems.

Materials and methods

The Kazakh Speech Dataset was created at KazNU [13]. It is a high-quality, open-source dataset developed specifically for Automatic Speech Recognition (ASR) of the Kazakh language. The dataset includes 554 hours of recorded speech, transcribed and checked for transcription quality by native Kazakh speakers. A summary of the dataset is provided in Table 1.

Table 1.

Dataset summary

Characteristic | Details
Developed by | Al-Farabi Kazakh National University (KazNU)
Total duration of recorded speech | 554 hours
Speaker count | 873
Mean sentences per speaker | 250
Total sentences | 204,250
Audio file type | .wav
Recording specs | 16-bit mono recordings sampled at 16 kHz, 22 kHz, or 44 kHz

 

The dataset is a multi-faceted collection of speech samples from speakers of various backgrounds, ages, and genders. Such speaker diversity is crucial for training effective, generalizable speech systems capable of handling differing dialects and patterns of speech. Furthermore, the recordings were made using mobile devices (iOS and Android), making the dataset appropriate for real-world situations when working on mobile-based ASR systems. Transcription quality overseen by native Kazakh speakers ensures that the ground-truth text is correct and thus suitable for training and evaluating the acoustic model. This dataset is especially useful for training Kazakh acoustic models for speech-to-text, AI-assisted tools, and voice verification.

The aim of this study is to create a deep-learning-based acoustic model for Kazakh speech capable of predicting a letter or letter sequence from a given speech signal. To develop the model, I use an LSTM combined with CTC, as this pairing was created specifically for sequence-to-sequence labelling problems [2]; it provides flexibility for speech recognition tasks where alignments are unavailable and producing them would require additional resources [3]. LSTM-CTC models are considered useful for low-resource languages like Kazakh, where training data is costly to annotate [5]. Although LSTM-based architectures are slower than some newer architectures such as Transformers because they process input sequentially, they retain a number of advantages when training with limited resources, especially in real-time processing and streaming use cases. Transformer architectures, while faster to train and to run, require a large amount of training data and carry a substantial computational cost, especially when running only on CPUs [9].

The CTC loss function is useful when the correspondence between input and output sequences is unknown. To handle sequences of varying length, it introduces a blank token (ϕ), which allows the algorithm to mark gaps between labels and to repeat labels when necessary. This enables a many-to-one correspondence between input frames and predicted labels. Given a target label sequence y, CTC defines a set of possible alignments Ω(y) by allowing insertions of ϕ and repeated labels. Each alignment, or path, π ∈ Ω(y) is a sequence of length T (the length of the input) where each element πt is drawn from the extended label set {1, ..., K} ∪ {ϕ}. CTC sums the probabilities of all valid alignment paths to compute the probability of the output y given the input audio vector X:

$$P_{\mathrm{CTC}}(y \mid X) = \sum_{\pi \in \Omega(y)} \prod_{t=1}^{T} P(\pi_t \mid X)$$

Here, $P_{\mathrm{CTC}}(y \mid X)$ is the total probability of generating the letter sequence y for the input X.

 

Figure 1. Illustration of the CTC output distribution over time steps for the target word "nur"

 

Consider the word "nur", represented as a target sequence y = (n, u, r). Let the input be an acoustic sequence of length T = 10, corresponding to time steps t1 through t10. As illustrated in Figure 1, the model produces a probability distribution over the set of output symbols at each time step: in this case, the characters "n", "u", "r", and the special blank token.
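To make the collapsing rule concrete, here is a minimal Python sketch (the example paths are hypothetical, chosen only to illustrate alignments of length T = 10 that all map to "nur"; "-" stands for the blank token ϕ):

```python
# Minimal sketch of the CTC collapsing rule for the target word "nur".

def collapse(path, blank="-"):
    """Apply the CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)  # keep the first symbol of each run, skip blanks
        prev = symbol
    return "".join(out)

# Three different T = 10 alignment paths that all collapse to "nur":
for path in ["nnn-uu-rrr", "nnuuuurrrr", "-nn-uu-rr-"]:
    print(path, "->", collapse(path))  # each prints "... -> nur"
```

CTC sums the probabilities of exactly such paths when computing $P_{\mathrm{CTC}}(y \mid X)$.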

Recurrent models with long-term memory are intended to learn relationships spanning long intervals in time series. Unlike standard RNNs, which suffer from vanishing or exploding gradients, LSTMs maintain information over long sequences through a gating mechanism that regulates the transmission of information.

 

Figure 2. Structure of an LSTM cell.

 

Figure 2 illustrates the architecture of an LSTM cell:

it - input gate: decides how much new information from the input xt and the previous hidden state ht−1 should influence the cell state.

ft - forget gate: decides what information from the previous cell state ct−1 should be discarded.

ot - output gate: controls how much information from the cell state is passed to the next hidden state ht.

ct - cell state: the memory of the unit, holding significant information while processing the entire input sequence.

An LSTM cell thus has three gates, together with sigmoid and tanh functions that regulate the flow of data, allowing the model to manage and update its internal state effectively. This gating mechanism makes LSTMs suitable for real-time applications such as speech recognition. The standard gate computations are written out below.
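For completeness, the gate computations described above can be stated in their conventional form (a standard formulation, not taken from the paper; $W$ and $b$ denote learned weights and biases, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication):

```latex
\begin{aligned}
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i)         && \text{(input gate)}\\
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f)         && \text{(forget gate)}\\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o)         && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c)  && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
h_t &= o_t \odot \tanh(c_t)                      && \text{(hidden state)}
\end{aligned}
```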

In this work, LSTM layers are used to produce softmax predictions from audio input, and the CTC loss trains the model weights to improve those predictions. The overall architecture of the proposed LSTM-CTC based audio mapping system is illustrated in Fig. 3. The acoustic model is designed to map raw Kazakh speech into a character-level transcription without requiring manual alignment between the audio signal and frame-level labels. The network is fed 40-dimensional Mel feature vectors extracted from the raw audio signal, computed using a sliding-window approach with a 25 ms frame size and a 10 ms stride. The Mel spectrogram effectively captures the time-frequency representation of speech and serves as a suitable input for the model. The inputs are passed to an LSTM layer that interprets the structure of the voice signal sequence. Let the input sequence be denoted X = (x1, ..., xT), where each xt is the feature vector of one audio frame. The LSTM layer accepts these features and generates a hidden state at each time step. The hidden vectors are then passed into a fully connected (FC) linear layer, which maps the hidden states to the output label space of vocabulary size K = 43 (42 Kazakh characters plus a blank symbol for CTC). Finally, a softmax is applied to the logits at every time step to obtain probability scores for the output characters:

$$y_t = \mathrm{softmax}(W h_t + b), \qquad y_t \in \mathbb{R}^{K}$$

The resulting sequence of probability distributions (y1, ..., yT) represents the model's predictions for each frame of audio. To train the model I use the CTC loss function, which gives the model the ability to align temporal audio data with the corresponding textual labels; the CTC loss sums over all possible alignments and enables effective training. The principle of CTC is explained above.
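A minimal PyTorch sketch of this pipeline, under the configuration stated above (the class and variable names are illustrative, not the exact code used in the experiments):

```python
import torch
import torch.nn as nn

class LSTMCTCAcousticModel(nn.Module):
    """LSTM acoustic model: 40-dim Mel frames -> per-frame character log-probs."""

    def __init__(self, input_dim=40, hidden_dim=128, vocab_size=43):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)  # 42 Kazakh characters + blank

    def forward(self, x):                  # x: (batch, T, 40)
        h, _ = self.lstm(x)                # h: (batch, T, hidden_dim)
        logits = self.fc(h)                # (batch, T, vocab_size)
        return logits.log_softmax(dim=-1)  # log-probabilities for CTC loss

# Example: a batch of 2 utterances, 100 frames each
model = LSTMCTCAcousticModel()
log_probs = model(torch.randn(2, 100, 40))  # -> shape (2, 100, 43)
```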

The model was implemented in PyTorch and trained on the AlFarabi dataset with the hyperparameters listed in Table 2 (a sketch of a training step under this configuration is given after the table):

Table 2.

Model configuration

Parameter | Value
Input dimension | 40 (Mel spectrogram coefficients)
Hidden dimension | 128
Vocabulary size | 43 (characters + blank)
Learning rate | 0.001
Epochs | 100
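Under the configuration in Table 2, a single training step with PyTorch's built-in CTC loss might look like the following sketch (the blank index, batching, and label encoding are assumptions, as the paper does not specify them):

```python
import torch
import torch.nn as nn

# Reuses the LSTMCTCAcousticModel sketch defined earlier.
model = LSTMCTCAcousticModel(input_dim=40, hidden_dim=128, vocab_size=43)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index 0 is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train_step(feats, feat_lens, targets, target_lens):
    """One step. feats: (batch, T, 40) padded Mel features;
    targets: 1-D tensor of concatenated label indices per utterance."""
    optimizer.zero_grad()
    log_probs = model(feats).transpose(0, 1)  # nn.CTCLoss expects (T, batch, K)
    loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    loss.backward()
    optimizer.step()
    return loss.item()
```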

 

Figure 3. The pipeline of the LSTM-CTC model for Kazakh acoustic modeling. The model maps input Mel features to character-level predictions through an LSTM network followed by a linear layer and CTC loss

 

The model predicts one character or blank token at each time step. The final transcription is obtained by collapsing repeated predictions and removing blank tokens (standard CTC decoding); a sketch of this decoding step follows. Example output is shown in Figure 4.
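A greedy decoder implementing this collapse-and-remove rule could look like this sketch (BLANK_ID and the id_to_char mapping are assumed, not specified in the paper):

```python
import torch

BLANK_ID = 0  # assumed index of the CTC blank token

def greedy_ctc_decode(log_probs: torch.Tensor, id_to_char: dict) -> str:
    """log_probs: (T, K) frame-level log-probabilities for one utterance."""
    best_ids = log_probs.argmax(dim=-1).tolist()  # most likely symbol per frame
    chars, prev = [], None
    for idx in best_ids:
        if idx != prev and idx != BLANK_ID:  # collapse repeats, drop blanks
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)
```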

 

Figure 4. Example output from the LSTM-CTC model showing predicted characters over time, including [BLANK] tokens for silence

 

Training continued at a learning rate of 0.001 until no further improvement was observed, and the model was evaluated on the AlFarabi Speech Dataset. The goal of my experimental setup for the LSTM-CTC model on Kazakh speech data is to assess its capacity for predicting letter sequences and aligning them with the corresponding transcriptions. During training I used Mel spectrograms as input features, which offer a concise representation of the frequency characteristics of the speech signal. Training was performed on a machine equipped with a GPU to accelerate computation. I monitored performance using the Character Error Rate (CER) and Word Error Rate (WER) to track how well the predicted letters align with the true transcriptions. In the next section I present the model's results and discuss their potential impact.
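For illustration, 40-dimensional Mel features with the stated 25 ms window and 10 ms stride can be computed with torchaudio as follows (a sketch assuming a 16 kHz mono recording; any extraction settings beyond those stated in the text are assumptions):

```python
import torchaudio

wav, sr = torchaudio.load("utterance.wav")  # hypothetical 16 kHz mono file
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,        # 25 ms window at 16 kHz
    hop_length=160,   # 10 ms stride at 16 kHz
    n_mels=40,        # 40 Mel coefficients, matching the model input dimension
)
mel = mel_transform(wav)   # (1, 40, T)
feats = mel.squeeze(0).T   # (T, 40), frame-major layout for the LSTM
```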

 

Figure 5. Example of a Mel spectrogram used as input to the model

 

Results and discussion

The trained LSTM-CTC model was evaluated on Kazakh audio from the AlFarabi Speech Dataset. At inference time the model outputs a character or blank token at each time step. The model demonstrates the ability to accurately align Kazakh speech to letter sequences. Valid characters are predicted during voiced segments, while blank tokens dominate silent intervals. This confirms that the model successfully learns temporal alignment, as expected with CTC-based training.

I evaluated the model using CER and WER. CER measures the normalized Levenshtein distance at the character level between the predictions and the ground-truth transcriptions, whereas WER assesses decoding performance at the word level after the output is post-processed into words. Both CER and WER count insertions, deletions, and substitutions, but CER is the more appropriate measure in the current context, since the model was trained to predict letters directly. After training, the LSTM-CTC model produced promising results, with a 25.3% character error rate and a 28.1% word error rate on the test data. These results show that this deep learning approach to modelling the acoustic properties of spoken Kazakh has good potential. Letter-level accuracy is important for any future use of this model in educational contexts, as accurate pronunciation feedback is important for learner progress. These results constitute a strong baseline for Kazakh acoustic modeling, particularly given the limited training data available; prior work has reported similar error rates in early-stage pronunciation evaluation systems for low-resource languages.
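CER as used here is the character-level Levenshtein distance normalized by the reference length; a compact, dependency-free sketch (the sample strings are hypothetical):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # dp[j]: distance between reference[:i] and hypothesis[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + cost)    # substitution or match
    return dp[n] / max(m, 1)

print(cer("сәлем", "сәлам"))  # one substitution in five characters -> 0.2
```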

The results show that the model can learn useful letter-level alignments from the limited Kazakh speech available. The end-to-end LSTM-CTC approach uses only paired audio and transcripts and does not rely on HMMs or hand-aligned phonetics, as prior work did; for this work it is essential that the need for human annotation is substantially reduced. Although the model captures many segments correctly, I noted some deviations that can be improved: confusions across short or unstressed words, a lack of contextual correction in the absence of a language model, and consistently poor performance on segments annotated as noisy or containing overlapping speech. Such patterns are consistent with alignment-based CTC models. With more resources, a natural improvement would be to follow other work and incorporate attention mechanisms [14].

Conclusion

In summary, this paper described the experimental setup and testing of an acoustic model created for the Kazakh language within an LSTM-CTC framework. The model produces time-aligned character predictions from audio input, forming a foundational component for Kazakh pronunciation evaluation systems. Qualitative inspection of model predictions shows strong potential for real-time alignment and error detection: the model successfully captures vowel-consonant transitions and identifies word boundaries in speech. However, it still faces challenges with letter confusions, insertion errors, and degraded performance on noisy inputs. These limitations point to further improvements, such as adding more data, training the model to handle noisy inputs better, and integrating external language models for contextual correction.

Importantly, this work fills a notable gap in Kazakh speech technology. Because of its grammatical complexity and lack of resources, Kazakh remains underserved in comparison to languages like English, which benefit from sophisticated pronunciation tools. By providing a workable and scalable method for modeling Kazakh acoustics with modern deep learning techniques, this study helps to close that gap. Beyond directly addressing letter alignment for Kazakh speech, the model created in this study paves the way for a variety of applications in speech technology, linguistic research, and language acquisition. It can serve as the foundation for more sophisticated pronunciation assessment systems in my future research. Planned extensions include forced alignment methods, Goodness of Pronunciation (GOP) scoring for phoneme-level assessment, and deployment in mobile applications for language learners. These tools would enable personalized feedback on learner pronunciation, helping to improve fluency and learner confidence. The approach can also be transferred to other Turkic and low-resource languages.

 

References:

  1. W. Wang, X. Yang, and H. Yang, "End-to-end low-resource speech recognition with a deep CNN-LSTM encoder," p. 505, 2020.
  2. A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," 2014.
  3. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ACM International Conference Proceeding Series, vol. 148, 2006, pp. 369–376.
  4. W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," 2016.
  5. R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2017-August. International Speech Communication Association, 2017, pp. 939–943.
  6. J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, "End-to-end continuous speech recognition using attention-based recurrent NN: First results," Dec. 2014. [Online]. Available: http://arxiv.org/abs/1412.1602
  7. Y. Shi, M.-Y. Hwang, and X. Lei, "End-to-end speech recognition using a high rank LSTM-CTC based model," p. 465, 2018.
  8. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  9. A. Zeyer, "A comparison of Transformer and LSTM encoder-decoder models for ASR," 2019.
  10. Z. Yiwen and X. Lu, "A speech recognition acoustic model based on LSTM-CTC," 2018.
  11. W.-K. Leung, X. Liu, and H. Meng, "CNN-RNN-CTC based end-to-end mispronunciation detection," p. 465, 2018.
  12. D. Lee, M. Lim, H. Park, Y. Kang, J.-S. Park, G.-J. Jang, and J.-H. Kim, "Long short-term memory recurrent neural network-based acoustic model using connectionist temporal classification on a large-scale training corpus," 2017.
  13. N. Kadyrbek, M. Mansurova, A. Shomanov, and G. Makharova, "The development of a Kazakh speech recognition model using a convolutional neural network with fixed character level filters," Big Data and Cognitive Computing, vol. 7, no. 3, p. 132, 2023.
  14. S. Ueno, H. Inaguma, M. Mimura, and T. Kawahara, "Acoustic-to-word attention-based model complemented with character-level CTC-based model," 2018.
Information about the authors

MS student, Kazakh-British Technical University, Kazakhstan, Almaty


Candidate of Physical and Mathematical Sciences, Kazakhstan Association of Software Companies, Kazakhstan, Astana

