PhD, Candidate of Computer Science, Azerbaijan State Oil and Industry University, Azerbaijan, Baku
COMPARATIVE ANALYSIS OF WORD EMBEDDING METHODS FOR THE AZERBAIJANI LANGUAGE IN GAME APPLICATIONS
ABSTRACT
Natural Language Processing (NLP) for the Azerbaijani language has been an important topic of study for many years. Understanding NLP matters not only for linguistic research and communication technologies, but also for the game industry, where NLP methods such as word embeddings make it possible to understand and analyze game content. This research provides a comprehensive examination and comparative analysis of word embedding techniques for serious games in the Azerbaijani language and shows the importance of embeddings in game-based applications. Given the constrained resources of Azerbaijani and its complex morphology, we evaluate multiple classical, subword-aware, and contextual embeddings. Each method is assessed on corpora obtained from the Azerbaijani serious game "Yasaq". Experiments focus on word similarity and semantic relationships within a gaming context. Results reveal that classical models perform exceptionally well on smaller datasets, whereas contextual embeddings exhibit superior performance on downstream tasks relevant to gaming environments. The findings support the development of intelligent, linguistically flexible gaming systems: domain-specific embeddings enhance user understanding and engagement in Azerbaijani-language games.
Keywords: serious game, natural language processing, word embedding, text classification, semantic relationship, word similarity, Azerbaijani language.
Introduction
Natural Language Processing (NLP) has become an important part of building smart platforms in recent years [1]. These platforms serve many purposes, such as learning, communication, and entertainment. Among these fields, the gaming industry increasingly uses NLP methods to improve the player experience by generating continuously changing content and analyzing how users interact with each other. Serious games are designed not just for fun but also for learning or training [2], [3], [4]. Game platforms also benefit significantly from NLP because it allows a closer look at how language is used [5], [6]: what players do, how games respond to players' actions, and how game content is updated. Since language is a central part of these games, processing and understanding natural language well is essential for creating user experiences that are engaging and useful. Because of this, NLP methods such as word embeddings are an important part of designing and analyzing games today.
The Azerbaijani language presents several challenges for NLP due to its linguistic complexity and limited computational resources [1], [7]. These issues become even more critical when developing domain-specific applications such as serious games.
- Azerbaijani is a morphologically rich and agglutinative language, resulting in a vast number of word variations and complex structures.
- There is a shortage of annotated corpora and high-quality language datasets for training and evaluating NLP models.
- Pretrained models and language-specific tools (e.g., tokenizers, lemmatizers) are either lacking or underdeveloped.
- Informal, colloquial, and domain-specific language often used in games is harder to process with generic NLP techniques.
- Azerbaijani’s low-resource status limits the effectiveness of transfer learning from models trained on high-resource languages.
- Standard embedding techniques may fail to capture semantic relationships without adaptations for the language's unique morphology.
Word embeddings offer a good way to address these difficulties, as they enable models to capture semantic links despite the limited language resources and complicated morphology typical of low-resource languages. Unlike traditional methods that rely heavily on handcrafted features and large annotated datasets, embeddings can be trained on unlabelled corpora and still learn meaningful representations of words. Subword- and character-level embeddings are especially useful for handling the agglutinative structure of Azerbaijani, as they allow models to generalize across word forms. Contextual embeddings further enhance this capability by interpreting word meaning based on the surrounding context, making them well suited to the dynamic and informal language found in games. By leveraging embeddings tailored to game-specific data, such as the corpus from the "Yasaq" game, it becomes possible to build NLP systems that better understand and respond to user input within Azerbaijani-language gaming environments.
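To make the subword idea concrete, the following is a minimal, library-free sketch of FastText-style character n-grams (the function names and the 3–6 n-gram range are illustrative assumptions, not the setup of our experiments). Inflected Azerbaijani forms share many n-grams with their stem, which is how subword models relate rare word forms to frequent ones:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams, with < and > as word-boundary markers."""
    w = f"<{word}>"
    return {w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

def subword_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

# The plural "kitablar" shares stem n-grams with "kitab" (book),
# while an unrelated word like "dəniz" (sea) shares none.
print(subword_overlap("kitab", "kitablar"))  # noticeably above zero
print(subword_overlap("kitab", "dəniz"))     # 0.0
```

A real FastText model hashes such n-grams into an embedding matrix and represents a word as the sum of its n-gram vectors, so even unseen inflections receive sensible representations.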
Materials and methods
Early static approaches such as Word2Vec and GloVe have repeatedly been shown to degrade when training data are sparse or the language is morphologically rich [8], [9]. Sub-word models like FastText became the baseline because character n-gram features mitigate data sparsity and agglutination. Comparative studies on other low-resource languages confirm that FastText outperforms purely word-level vectors on both intrinsic and downstream tasks, while ensembles of multiple embeddings give additional gains. Recent work has shifted toward contextual and cross-lingual embeddings. Multilingual transformers (mBERT, XLM-R) supply zero-shot support but suffer from mis-aligned sub-spaces; alignment frameworks that add explicit word-level constraints now reduce that gap, boosting bitext retrieval and transfer accuracy for eight low-resource languages. Surveys of Turkic NLP underline the same pattern: transfer learning plus sub-word modelling currently offers the best trade-off, while dedicated monolingual pre-training remains rare.
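The embedding ensembles mentioned above can be sketched very simply: one common recipe is to L2-normalize each model's vector for a word and concatenate them, so that no single source dominates by scale. The helper names and toy vectors below are illustrative assumptions, not our experimental setup:

```python
from math import sqrt

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def concat_embeddings(*vectors):
    """Ensemble embedding: normalize each source vector, then concatenate."""
    combined = []
    for vec in vectors:
        combined.extend(l2_normalize(vec))
    return combined

# Hypothetical per-word vectors from two different models:
fasttext_vec = [0.2, 0.8, 0.1]
bert_vec = [0.5, 0.1, 0.4, 0.3]
combined = concat_embeddings(fasttext_vec, bert_vec)
```

Cosine similarity over such concatenated unit vectors averages the agreement of the individual models, which is one reason ensembles tend to be more robust than any single embedding.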
Games use embeddings to drive dialogue systems, intent detection, adaptive storytelling, and learner analytics. A 55-paper scoping review of GPT-series applications maps five major use cases, from procedural content generation to game-user research, signalling the rapid uptake of large language models in production pipelines [6]. Concretely, the authors of [5] show how a knowledge-graph-augmented GPT-4 creates context-aware NPC chatter in Final Fantasy VII Remake and Pokémon without author-written scripts.
For serious and educational games, classical embeddings are still prevalent. Prior work has combined NLP with different classifiers to mine gameplay logs and refine learning content, demonstrating measurable pedagogical lift [3], [10], [11]. Complementary research couples embeddings with ontologies to produce explainable text classifiers that power taboo-style card games and other language-learning mechanics [4]. Together, these studies indicate that both static and contextual vectors are valuable: static models remain computationally light for embedded or mobile games, whereas transformer-based models drive richer generative experiences [2], [12].
Current best practice therefore mixes (i) multilingual transformers for zero-/few-shot coverage, (ii) FastText-style sub-word vectors for lightweight inference, and (iii) selective domain fine-tuning (e.g., on game dialogue) when task-specific data are available. Multilingual BERT has been widely used as a base model, though its representation of Azerbaijani is limited. It has demonstrated encouraging outcomes when optimized for downstream tasks such as Named Entity Recognition, Part-of-Speech tagging, and sentiment analysis. Turkish BERT is one such fine-tuned model: trained on extensive Turkish corpora, it produces competitive results for tasks such as Named Entity Recognition and text classification [13], [14].
Results and discussion
Experiments were carried out on the “Yasaq” serious game dataset, which consists of cards of 6 semantically related words each. This research focuses on comparing word embeddings in order to find a suitable approach for serious word games in Azerbaijani. Table 1 compares the performance of several embedding techniques: CountVectorizer, TfidfVectorizer, Word2Vec, FastText, and BERT. For each word, the top similar word retrieved by each embedding method and the corresponding similarity score are shown. Based on our observations, we conclude:
- Traditional Methods (CountVectorizer, TfidfVectorizer): These methods generally show lower similarity scores because they rely purely on word frequency and co-occurrence without capturing semantic meaning. They tend to retrieve exact or morphologically similar matches rather than semantically similar ones.
- Word2Vec and FastText: These embeddings capture semantic similarity better, with scores around 0.4–0.9. FastText consistently shows high scores due to its subword-level modeling, which is particularly helpful for morphologically complex Azerbaijani words. However, subword overlap can be misleading: for the word “quş”, FastText retrieves “ingiltərə” with a high similarity of 0.908582, a semantically unrelated match that reflects a limitation in domain-specific context and may need further refinement.
- BERT: Contextual embeddings like BERT provide robust similarity scores, reflecting their ability to incorporate context into word meaning. However, BERT’s higher scores may sometimes correspond to contextually related rather than strictly semantically close words, depending on the corpus.
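The behaviour of the count-based baselines in Table 1 can be reproduced with a toy nearest-neighbour search: cosine similarity over raw count vectors only rewards shared surface tokens. The short Azerbaijani snippets below are invented for illustration and are not taken from the "Yasaq" corpus:

```python
from collections import Counter
from math import sqrt

def count_vector(text):
    """Bag-of-words counts over whitespace tokens (CountVectorizer-style)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[tok] * v[tok] for tok in u)
    norm = lambda c: sqrt(sum(n * n for n in c.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Invented card texts; "quş" (bird) and "tük" (feather) share surface tokens.
docs = {
    "quş": "quş uçur və quş tük ilə örtülüdür",
    "tük": "tük quş bədənini örtür",
    "ev":  "ev insanların yaşadığı yerdir",
}

query = count_vector(docs["quş"])
scores = {w: cosine(query, count_vector(d)) for w, d in docs.items() if w != "quş"}
best = max(scores, key=scores.get)  # the only card sharing tokens with "quş"
```

Because the "ev" card shares no tokens with the "quş" card, its score is exactly zero, which mirrors why frequency-based methods retrieve lexical rather than semantic neighbours.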
Table 1.
Top-1 similar word and its similarity score across different embedding techniques in “Yasaq”
| Word | Top-1 similar word | Embedding | Similarity score |
|------|--------------------|-----------|------------------|
| quş | tük | CountVectorizer | 0.281718 |
| quş | tük | TfidfVectorizer | 0.298161 |
| quş | ayı | Word2Vec | 0.451469 |
| quş | ingiltərə | FastText | 0.908582 |
| quş | tutuquşu | BERT | 0.838724 |
| kiçik | qucaqlamaq | CountVectorizer | 0.202031 |
| kiçik | qucaqlamaq | TfidfVectorizer | 0.216492 |
| kiçik | yapışıq | Word2Vec | 0.451469 |
| kiçik | illustrasiya | FastText | 0.938683 |
| kiçik | böyük | BERT | 0.838724 |
| su | dəniz | CountVectorizer | 0.276648 |
| su | dəniz | TfidfVectorizer | 0.333277 |
| su | luiziana | Word2Vec | 0.373240 |
| su | gecə | FastText | 0.762766 |
| su | isti | BERT | 0.709387 |
| heyvan | it | CountVectorizer | 0.272475 |
| heyvan | it | TfidfVectorizer | 0.268121 |
| heyvan | ilbiz | Word2Vec | 0.381669 |
| heyvan | qara | FastText | 0.941047 |
| heyvan | karvan | BERT | 0.806644 |
| yemək | pəhriz | CountVectorizer | 0.208514 |
| yemək | pəhriz | TfidfVectorizer | 0.197921 |
| yemək | qızıl | Word2Vec | 0.399578 |
| yemək | islandiya | FastText | 0.951257 |
| yemək | əmək | BERT | 0.844028 |
| şirin | meyvə | CountVectorizer | 0.272423 |
| şirin | şirniyyat | TfidfVectorizer | 0.285782 |
| şirin | zebra | Word2Vec | 0.424426 |
| şirin | dil | FastText | 0.922399 |
| şirin | şəkər | BERT | 0.900347 |
| ev | mənzil | CountVectorizer | 0.288675 |
| ev | mənzil | TfidfVectorizer | 0.273212 |
| ev | almaniya | Word2Vec | 0.395754 |
| ev | yaban | FastText | 0.739765 |
| ev | kənd | BERT | 0.888595 |
The following results were obtained from the experiments with BERT embeddings (fig. 1), which shows the words most similar to “kitab” in 3D. For example, “kitab”, “kitablıq”, and “kitabxana” are connected to each other, as they share the same root. Beyond such morphological variants, the BERT-based neighbourhood also contains words that carry new, related meanings.
Figure 1. The 9 words most similar to the word “kitab”
Conclusion
The findings of this study underline the necessity of integrating contextual embedding techniques into games like "Yasaq", as such techniques are essential for providing dynamic and context-sensitive experiences. The results validate that a transformer-based BERT methodology is essential for achieving accurate and functional embeddings in gaming settings. By adopting a customized embedding technique for serious games, game designers can enhance language understanding and provide more engaging user experiences in Azerbaijani-language contexts.
References:
- D. Pathak, S. Nandi, and P. Sarmah, “Evaluating Performance of Pre-trained Word Embeddings on Assamese, a Low-resource Language,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds., Torino, Italia: ELRA and ICCL, May 2024, pp. 6418–6425. [Online]. Available: https://aclanthology.org/2024.lrec-main.568/
- A. Mammadli, “Application of Deep Learning for Procedural Content Integration for Learning Serious Games,” Proceedings of Azerbaijan High Technical Educational Institutions, vol. 48, no. 06, pp. 455–469, Jan. 2025.
- A. Mammadli, “Unlocking Educational Insights: Integrating Word2Vec Embeddings and Naive Bayes Classifier for Serious Game Data Analysis and Enhancement,” Azerbaijan Journal of High Performance Computing, vol. 6, no. 2, pp. 191–198, Dec. 2023, doi: 10.32010/26166127.2023.6.2.191.198.
- A. Mammadli, E. Ismayilov, and C. Zanni-Merk, “Explainability of text Classification through ontology-driven analysis in Serious Games,” Procedia Computer Science, vol. 246, pp. 2128–2137, 2024, doi: 10.1016/j.procs.2024.09.626.
- N. Nananukul and W. Wongkamjan, “What if Red Can Talk? Dynamic Dialogue Generation Using Large Language Models,” Jul. 2024, [Online]. Available: http://arxiv.org/abs/2407.20382.
- D. Yang, E. Kleinman, and C. Harteveld, “GPT for Games: A Scoping Review (2020-2023),” Apr. 2024, doi: 10.1109/CoG60054.2024.10645548.
- Y. Veitsman and M. Hartmann, “Recent Advancements and Challenges of Turkic Central Asian Language Processing,” Jul. 2024, [Online]. Available: http://arxiv.org/abs/2407.05006.
- S. Mammadli, S. Huseynov, H. Alkaramov, U. Jafarli, U. Suleymanov, and S. Rustamov, “Sentiment polarity detection in Azerbaijani social news articles,” in International Conference Recent Advances in Natural Language Processing, RANLP, 2019. doi: 10.26615/978-954-452-056-4_082.
- K. Sarıtaş, C. A. Öz, and T. Güngör, “A comprehensive analysis of static word embeddings for Turkish,” Expert Syst Appl, vol. 252, Oct. 2024, doi: 10.1016/j.eswa.2024.124123.
- D. Picca, D. Jaccard, and G. Eberlé, “Natural Language Processing in Serious Games: A state of the art.,” International Journal of Serious Games, vol. 2, no. 3, 2015, doi: 10.17083/ijsg.v2i3.87.
- T. Ashby, B. K. Webb, G. Knapp, J. Searle, and N. Fulda, “Personalized Quest and Dialogue Generation in Role-Playing Games: A Knowledge Graph- and Language Model-based Approach,” in Conference on Human Factors in Computing Systems - Proceedings, 2023. doi: 10.1145/3544548.3581441.
- A. Mammadli, “Advancing Serious Games With Multilingual Deep Learning,” 6. International Boğaziçi Scientific Research Congress, pp. 1100–1106, Jan. 2025.
- T. Akdeniz, “Turkish BERT based NER (Revision b247a7f),” 2023, Hugging Face. doi: 10.57967/hf/0949.
- D. Küçük, D. Küçük, and N. Arıcı, “A named entity recognition dataset for Turkish,” 2016. doi: 10.1109/siu.2016.7495744.