COMPARATIVE ANALYSIS OF WORD EMBEDDING METHODS FOR THE AZERBAIJANI LANGUAGE IN GAME APPLICATIONS

Mammadli A.
Cite as:
Mammadli A. COMPARATIVE ANALYSIS OF WORD EMBEDDING METHODS FOR THE AZERBAIJANI LANGUAGE IN GAME APPLICATIONS // Universum: технические науки : электрон. научн. журн. 2025. 9(138). URL: https://7universum.com/ru/tech/archive/item/20745 (accessed: 05.12.2025).
DOI: 10.32743/UniTech.2025.138.9.20745

 

ABSTRACT

Natural Language Processing (NLP) for the Azerbaijani language has been an important topic of study for many years. Understanding NLP matters not only for linguistic research and communication technologies, but also for the game industry, where NLP methods such as embeddings make it possible to understand and analyse game content. This research provides a comprehensive examination and comparative analysis of word embedding techniques for serious games in the Azerbaijani language, and shows the importance of embeddings in game-based applications. Given the constrained resources of Azerbaijani and its complex morphology, we evaluate multiple classical, subword-aware, and contextual embeddings, assessing each method on corpora obtained from the Azerbaijani serious game "Yasaq". Experiments focus on word similarity and semantic relationships within a gaming context. Results reveal that classical models perform exceptionally well on smaller datasets, whereas contextual embeddings exhibit superior performance on downstream tasks relevant to gaming environments. The findings support the development of intelligent, linguistically flexible gaming systems: domain-specific embeddings enhance user understanding and engagement in Azerbaijani-language games.


 

Keywords: serious game, natural language processing, word embedding, text classification, semantic relationship, word similarity, Azerbaijani language.


 

Introduction

Natural Language Processing (NLP) has become an important part of building smart platforms in recent years [1]. These platforms serve many purposes, such as learning, communication, and entertainment. Among these fields, the gaming industry increasingly uses NLP methods to improve the player experience by generating continuously changing content and modeling how users interact with each other. Serious games are designed not just for fun but also for learning or training [2], [3], [4]. Game platforms also benefit significantly from NLP because it allows a closer look at how language is used [5], [6]: what players do, how games respond to players' actions, and how game content is updated. Since language is a central part of these games, processing and understanding natural language well is essential for creating user experiences that are engaging and useful. Because of this, NLP methods such as word embeddings are an important part of designing and analyzing games today.

The Azerbaijani language presents several challenges for NLP due to its linguistic complexity and limited computational resources [1], [7]. These issues become even more critical when developing domain-specific applications such as serious games.

  • Azerbaijani is a morphologically rich and agglutinative language, resulting in a vast number of word variations and complex structures.
  • There is a shortage of annotated corpora and high-quality language datasets for training and evaluating NLP models.
  • Pretrained models and language-specific tools (e.g., tokenizers, lemmatizers) are either lacking or underdeveloped.
  • Informal, colloquial, and domain-specific language often used in games is harder to process with generic NLP techniques.
  • Azerbaijani’s low-resource status limits the effectiveness of transfer learning from models trained on high-resource languages.
  • Standard embedding techniques may fail to capture semantic relationships without adaptations for the language's unique morphology.
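The morphology problem in the last bullet is exactly what subword-aware models address. A minimal FastText-style character n-gram decomposition can be sketched as follows; the word pair is illustrative, and real FastText uses 3- to 6-grams plus the whole word:

```python
# FastText-style character n-gram decomposition (a minimal sketch).

def char_ngrams(word, n_min=3, n_max=5):
    """Return the set of character n-grams of a word, with the
    boundary markers < and > used in the FastText paper."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

root_grams = char_ngrams("kitab")           # "book"
derived_grams = char_ngrams("kitabxana")    # "library"

# The two surface forms share subword units, so a subword model can
# relate them even if one form never occurs in the training corpus.
shared = root_grams & derived_grams
print(sorted(shared))
```

Because the shared n-grams carry the common root, a subword model generalizes across the many inflected forms an agglutinative language produces.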

Word embeddings offer a practical way to address these difficulties, since they enable models to capture semantic links even under the limited resources and complicated morphology typical of low-resource languages. Unlike traditional methods that rely heavily on handcrafted features and large annotated datasets, embeddings can be trained on unlabelled corpora and still learn meaningful representations of words. Subword and character-level embeddings are especially useful for handling the agglutinative structure of Azerbaijani, as they allow models to generalize across word forms. Contextual embeddings further enhance this capability by interpreting word meaning from the surrounding context, making them well suited to the dynamic and informal language found in games. By leveraging embeddings tailored to game-specific data, such as the corpus from the "Yasaq" game, it becomes possible to build NLP systems that better understand and respond to user input in Azerbaijani-language gaming environments.
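Every comparison in this study ultimately rests on cosine similarity between word vectors. A minimal sketch, with made-up toy vectors rather than learned embeddings:

```python
# Cosine similarity, the measure behind the similarity scores reported
# later (a minimal sketch; the toy vectors below are illustrative).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical 4-dimensional vectors for three Azerbaijani words.
vec = {
    "quş":   [0.9, 0.1, 0.3, 0.0],   # "bird"
    "tük":   [0.8, 0.2, 0.4, 0.1],   # "feather"
    "dəniz": [0.1, 0.9, 0.0, 0.5],   # "sea"
}

# Semantically related words should score higher than unrelated ones.
assert cosine(vec["quş"], vec["tük"]) > cosine(vec["quş"], vec["dəniz"])
```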

Materials and methods

Early static approaches such as Word2Vec and GloVe have been repeatedly shown to degrade when training data are sparse or morphologically rich [8], [9]. Sub-word models like FastText became the baseline because n-gram character features mitigate data sparsity and agglutination. Comparative studies on other low-resource languages confirm that FastText outperforms purely word-level vectors on both intrinsic and downstream tasks, while ensembles of multiple embeddings give additional gains. Recent work has shifted toward contextual and cross-lingual embeddings. Multilingual transformers (mBERT, XLM-R) supply zero-shot support but suffer from mis-aligned sub-spaces; alignment frameworks that add explicit word-level constraints now reduce that gap, boosting bitext retrieval and transfer accuracy for eight low-resource languages. Surveys of Turkic NLP underline the same pattern: transfer learning plus sub-word modelling currently offers the best trade-off while dedicated monolingual pre-training remains rare.
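The out-of-vocabulary behaviour that makes FastText the low-resource baseline can be sketched as follows, assuming hypothetical trigram vectors; real FastText sums hashed 3- to 6-gram vectors together with the word vector itself:

```python
# How a subword model forms a vector for an out-of-vocabulary word:
# average the vectors of its character n-grams (a minimal sketch with
# made-up trigram vectors).

def trigrams(word):
    marked = f"<{word}>"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def oov_vector(word, ngram_vecs, dim=3):
    hits = [ngram_vecs[g] for g in trigrams(word) if g in ngram_vecs]
    if not hits:
        return [0.0] * dim
    return [sum(v[d] for v in hits) / len(hits) for d in range(dim)]

# Hypothetical trigram vectors learned from an Azerbaijani corpus.
ngram_vecs = {
    "<ki": [0.5, 0.1, 0.0],
    "kit": [0.6, 0.2, 0.1],
    "ita": [0.4, 0.0, 0.2],
    "tab": [0.5, 0.1, 0.1],
}

# "kitabxana" never appeared in training, yet it still gets a non-zero
# vector because it shares trigrams with in-vocabulary forms of "kitab".
v = oov_vector("kitabxana", ngram_vecs)
print(v)
```

A pure word-level model like Word2Vec would have no vector at all for such an unseen inflection, which is why subword features mitigate agglutination.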

Games use embeddings to drive dialogue systems, intent detection, adaptive storytelling, and learner analytics. A 55-paper scoping review of GPT-series applications maps five major use cases, from procedural content generation to game-user research, signalling rapid uptake of large language models in production pipelines [6]. Concretely, the authors of [5] show how knowledge-graph-augmented GPT-4 creates context-aware NPC chatter in Final Fantasy VII Remake and Pokémon without author-written scripts.

For serious and educational games, classical embeddings are still prevalent. Researchers have combined NLP with different classifiers to mine gameplay logs and refine learning content, demonstrating measurable pedagogical gains [3], [10], [11]. Complementary research couples embeddings with ontologies to produce explainable text classifiers that power taboo-style card games and other language-learning mechanics [4]. Together, these studies indicate that both static and contextual vectors are valuable: static models remain computationally light for embedded or mobile games, whereas transformer-based models drive richer generative experiences [2], [12].

Current best practice therefore mixes (i) multilingual transformers for zero-/few-shot coverage, (ii) FastText-style sub-word vectors for lightweight inference, and (iii) selective domain fine-tuning (e.g., on game dialogue) when task-specific data are available. Multilingual BERT has been widely used as a base model, though its representation of low-resource languages is limited; it has demonstrated encouraging outcomes when optimized for downstream tasks such as named entity recognition, part-of-speech tagging, and sentiment analysis. Turkish BERT is one such fine-tuned model: trained on extensive Turkish corpora, it produces competitive results for tasks such as named entity recognition and text categorization [13], [14].

Results and discussions

Experiments were carried out on the "Yasaq" serious game dataset, which consists of cards containing six semantically related words each. This research focused on comparing word embeddings in order to find a suitable approach for serious word games in Azerbaijani. Table 1 compares the performance of several embedding techniques: CountVectorizer, TfidfVectorizer, Word2Vec, FastText, and BERT. For each word, the top similar word retrieved by each embedding method and the corresponding similarity score are shown. Based on our observations, we conclude:

  • Traditional Methods (CountVectorizer, TfidfVectorizer): These methods generally show lower similarity scores because they rely purely on word frequency and co-occurrence without capturing semantic meaning. They tend to retrieve exact or morphologically similar matches rather than semantically similar ones.
  • Word2Vec and FastText: These embeddings capture semantic similarity better, with scores around 0.4–0.9. FastText consistently shows high scores due to its subword-level modeling, which is particularly helpful for morphologically complex Azerbaijani words. However, for the word “quş” (bird), FastText retrieves “ingiltərə” (England) with a high similarity of 0.908582, a semantically unrelated match that reflects a limitation in domain-specific context and may need further refinement.
  • BERT: Contextual embeddings like BERT provide robust similarity scores reflecting their ability to incorporate context into word meaning. However, BERT’s higher similarity scores might sometimes retrieve contextually but not necessarily semantically close words, depending on the corpus.
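The gap between frequency-based and semantic methods noted above can be illustrated with a CountVectorizer-style bag-of-words cosine. The two toy contexts below are hypothetical, not drawn from the "Yasaq" corpus:

```python
# Why frequency-based methods score low: a count-vector similarity is
# driven only by exact token overlap, not by meaning (minimal sketch).
from collections import Counter
import math

def count_cosine(doc_a, doc_b):
    ca, cb = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    na = math.sqrt(sum(c * c for c in ca.values()))
    nb = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical contexts in which "quş" (bird) and "tük" (feather) occur.
ctx_qus = "quş uçur səma qanad tük yuva"
ctx_tuk = "tük yumşaq yastıq balış isti qanad"

# Only the literally shared tokens contribute, so related words that
# never co-occur verbatim stay at a low similarity.
score = count_cosine(ctx_qus, ctx_tuk)
print(round(score, 3))
```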

Table 1.

Top similar word and its similarity score across different embedding techniques in “Yasaq”

Word    | Top similar word | Embedding       | Similarity score
quş     | tük              | CountVectorizer | 0.281718
quş     | tük              | TfidfVectorizer | 0.298161
quş     | ayı              | Word2Vec        | 0.451469
quş     | ingiltərə        | FastText        | 0.908582
quş     | tutuquşu         | BERT            | 0.838724
kiçik   | qucaqlamaq       | CountVectorizer | 0.202031
kiçik   | qucaqlamaq       | TfidfVectorizer | 0.216492
kiçik   | yapışıq          | Word2Vec        | 0.451469
kiçik   | illustrasiya     | FastText        | 0.938683
kiçik   | böyük            | BERT            | 0.838724
su      | dəniz            | CountVectorizer | 0.276648
su      | dəniz            | TfidfVectorizer | 0.333277
su      | luiziana         | Word2Vec        | 0.373240
su      | gecə             | FastText        | 0.762766
su      | isti             | BERT            | 0.709387
heyvan  | it               | CountVectorizer | 0.272475
heyvan  | it               | TfidfVectorizer | 0.268121
heyvan  | ilbiz            | Word2Vec        | 0.381669
heyvan  | qara             | FastText        | 0.941047
heyvan  | karvan           | BERT            | 0.806644
yemək   | pəhriz           | CountVectorizer | 0.208514
yemək   | pəhriz           | TfidfVectorizer | 0.197921
yemək   | qızıl            | Word2Vec        | 0.399578
yemək   | islandiya        | FastText        | 0.951257
yemək   | əmək             | BERT            | 0.844028
şirin   | meyvə            | CountVectorizer | 0.272423
şirin   | şirniyyat        | TfidfVectorizer | 0.285782
şirin   | zebra            | Word2Vec        | 0.424426
şirin   | dil              | FastText        | 0.922399
şirin   | şəkər            | BERT            | 0.900347
ev      | mənzil           | CountVectorizer | 0.288675
ev      | mənzil           | TfidfVectorizer | 0.273212
ev      | almaniya         | Word2Vec        | 0.395754
ev      | yaban            | FastText        | 0.739765
ev      | kənd             | BERT            | 0.888595


Figure 1 shows the results obtained from the experiments with BERT embeddings: the most similar words to “kitab” (“book”) visualized in 3D. For example, “kitab”, “kitablıq”, and “kitabxana” are connected to each other because they share the same root. Beyond such morphological relatives, the BERT neighbourhood also contains words with related but distinct meanings.

 

Figure 1. The 9 most similar words to “kitab”
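The neighbourhood retrieval behind Figure 1 reduces to ranking every vocabulary word by cosine similarity to the query. A minimal sketch with hypothetical 3-dimensional vectors (the study's actual vectors come from BERT):

```python
# Top-k nearest-neighbour retrieval for a query word (minimal sketch;
# the vectors below are made up for illustration, not learned).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def top_k(query, vectors, k=3):
    """Rank all other vocabulary words by similarity to the query."""
    sims = [(w, cosine(vectors[query], v))
            for w, v in vectors.items() if w != query]
    return sorted(sims, key=lambda p: p[1], reverse=True)[:k]

# Hypothetical embeddings: relatives of "kitab" plus one distractor.
vectors = {
    "kitab":     [0.90, 0.20, 0.10],
    "kitabxana": [0.80, 0.30, 0.20],
    "kitablıq":  [0.85, 0.25, 0.10],
    "dəniz":     [0.10, 0.10, 0.90],
}

for word, score in top_k("kitab", vectors):
    print(word, round(score, 3))
```

The same ranking, computed over the full BERT vocabulary and projected to 3D, produces the neighbourhood plotted in Figure 1.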

 

Conclusion

The findings of this study underscore the necessity of integrating contextual embedding techniques into games like "Yasaq". Such techniques are essential for providing dynamic, context-sensitive experiences. The results validate that a transformer-based BERT methodology is key to achieving accurate and functional embeddings in gaming settings. By adopting an embedding technique customized for serious games, game designers can enhance language understanding and provide more engaging user experiences in Azerbaijani-language contexts.

 

References:

  1. D. Pathak, S. Nandi, and P. Sarmah, “Evaluating Performance of Pre-trained Word Embeddings on Assamese, a Low-resource Language,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds., Torino, Italia: ELRA and ICCL, May 2024, pp. 6418–6425. [Online]. Available: https://aclanthology.org/2024.lrec-main.568/
  2. A. Mammadli, “Application of Deep Learning for Procedural Content Integration for Learning Serious Games,” Proceedings of Azerbaijan High Technical Educational Institutions, vol. 48, no. 06, pp. 455–469, Jan. 2025.
  3. A. Mammadli, “Unlocking Educational Insights: Integrating Word2Vec Embeddings and Naive Bayes Classifier for Serious Game Data Analysis and Enhancement,” Azerbaijan Journal of High Performance Computing, vol. 6, no. 2, pp. 191–198, Dec. 2023, doi: 10.32010/26166127.2023.6.2.191.198.
  4. A. Mammadli, E. Ismayilov, and C. Zanni-Merk, “Explainability of text Classification through ontology-driven analysis in Serious Games,” Procedia Comput Sci, vol. 246, pp. 2128–2137, 2024, doi: 10.1016/j.procs.2024.09.626.
  5. N. Nananukul and W. Wongkamjan, “What if Red Can Talk? Dynamic Dialogue Generation Using Large Language Models,” Jul. 2024, [Online]. Available: http://arxiv.org/abs/2407.20382.
  6. D. Yang, E. Kleinman, and C. Harteveld, “GPT for Games: A Scoping Review (2020-2023),” Apr. 2024, doi: 10.1109/CoG60054.2024.10645548.
  7. Y. Veitsman and M. Hartmann, “Recent Advancements and Challenges of Turkic Central Asian Language Processing,” Jul. 2024, [Online]. Available: http://arxiv.org/abs/2407.05006.
  8. S. Mammadli, S. Huseynov, H. Alkaramov, U. Jafarli, U. Suleymanov, and S. Rustamov, “Sentiment polarity detection in Azerbaijani social news articles,” in International Conference Recent Advances in Natural Language Processing, RANLP, 2019. doi: 10.26615/978-954-452-056-4_082.
  9. K. Sarıtaş, C. A. Öz, and T. Güngör, “A comprehensive analysis of static word embeddings for Turkish,” Expert Syst Appl, vol. 252, Oct. 2024, doi: 10.1016/j.eswa.2024.124123.
  10. D. Picca, D. Jaccard, and G. Eberlé, “Natural Language Processing in Serious Games: A state of the art.,” International Journal of Serious Games, vol. 2, no. 3, 2015, doi: 10.17083/ijsg.v2i3.87.
  11. T. Ashby, B. K. Webb, G. Knapp, J. Searle, and N. Fulda, “Personalized Quest and Dialogue Generation in Role-Playing Games: A Knowledge Graph- and Language Model-based Approach,” in Conference on Human Factors in Computing Systems - Proceedings, 2023. doi: 10.1145/3544548.3581441.
  12. A. Mammadli, “Advancing Serious Games With Multilingual Deep Learning,” 6. International Boğaziçi Scientific Research Congress, pp. 1100–1106, Jan. 2025.
  13. T. Akdeniz, “Turkish BERT based NER (Revision b247a7f),” 2023, Hugging Face. doi: 10.57967/hf/0949.
  14. D. Küçük, D. Küçük, and N. Arıcı, “A named entity recognition dataset for Turkish,” 2016. doi: 10.1109/siu.2016.7495744.
Information about the author

PhD, Candidate of Computer Science, Azerbaijan State Oil and Industry University, Azerbaijan, Baku
