TRAINING AGENTS FOR NAVIGATION AND BEHAVIOUR IN DYNAMIC GAME ENVIRONMENTS USING UNITY ML-AGENTS

Seitov T., Bissembayev A.S., Serek A.

28.05.2026

5(146)

10. Informatics, computer engineering and control

Cite as:

Seitov T., Bissembayev A.S., Serek A. TRAINING AGENTS FOR NAVIGATION AND BEHAVIOUR IN DYNAMIC GAME ENVIRONMENTS USING UNITY ML-AGENTS // Universum: technical sciences: electronic scientific journal. 2026. 5(146). URL: https://7universum.com/ru/tech/archive/item/22678 (accessed: 28.07.2026).

DOI - 10.32743/UniTech.2026.146.5.22678

Received: 27.04.2026

Accepted for publication: 01.05.2026

Published: 28.05.2026

UDC 004.942

ABSTRACT

This work investigates the training of adaptive agents in procedurally generated environments using the Unity ML-Agents toolkit. The study compares Proximal Policy Optimization (PPO), imitation learning (Behavior Cloning), and Generative Adversarial Imitation Learning (GAIL) under dynamic conditions.

The experiments showed that an optimized PPO configuration trained incrementally (Incremental Training) demonstrated the best generalization, reaching a success rate of 0.70 in training and 0.50 in testing. The results highlight the importance of fine-tuning hyperparameters and of incremental training strategies for improving agent adaptability in complex environments.

The work presents a fully modular environment consisting of interchangeable level segments, randomized obstacle placement, and dynamic route generation, which allows model performance to be evaluated under constantly changing conditions.

Keywords: Reinforcement Learning, Unity ML-Agents, Procedural Generation, PPO, Behavior Cloning, GAIL, Navigation, Dynamic Environments.

Introduction

Artificial Intelligence (AI) has become increasingly popular in recent years, finding applications in a wide range of domains such as healthcare, finance, education, robotics, face generation, and autonomous systems [1]–[4], and especially in game development. It makes it possible to create intelligent bots and non-playable characters (NPCs) that can act like real players [5]. Most of the time, NPCs in games are predefined and developed using traditional programming, which makes them predictable and repetitive [6]–[8]. However, in modern game engines, NPCs can be implemented using machine learning tools [9]. Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed [10]. Unity ML-Agents, a framework within the Unity game engine, enables the deployment of learning algorithms in an environment for training agents in different scenarios [11], [12]. Reinforcement learning is a machine learning technique in which an agent interacts with the environment and learns an optimal policy by trial and error to maximize its long-term reward [13]. However, many existing approaches focus on training agents in static or semi-dynamic environments with simple maps, conditions, and obstacles. To address this, different techniques must be compared to identify the most effective strategies for training AI agents capable of adapting to dynamic game environments.

Different authors have used different approaches alongside, or in contrast with, reinforcement learning, such as Proximal Policy Optimization [14], [15], [16], Behavior Cloning [16], [17], [18], and Generative Adversarial Imitation Learning (GAIL) [18]. Proximal Policy Optimization (PPO) is a reinforcement learning method that updates the policy by collecting experience through agent interactions with the environment to maximize the agent's reward [19], [20]. PPO is the default and most widely used reinforcement learning algorithm in Unity ML-Agents [21], and [14] used PPO to create an agent able to play a hide-and-seek game; PPO appeared to be more stable than other algorithms. Behavior Cloning is a form of imitation learning in which the agent learns to imitate the behavior of a human [22], [23]. GAIL is a technique that combines imitation learning with a generative adversarial network [24], [25]. Behavior Cloning and GAIL were compared in [18] to train kart agents to navigate a track while avoiding obstacles. As the authors stated, GAIL was unsatisfactory, while the best result was achieved with Behavior Cloning. Both the track layout and all obstacles were fully predetermined. Another study [16] trained kart-racing agents and reported that, after static obstacles were introduced, PPO alone failed to adapt. Behavior Cloning was then used as a pre-training strategy, which significantly improved obstacle avoidance. The authors also emphasized the importance of environment complexity: randomizing starting positions, changing obstacle locations, and training with different tracks to reduce overfitting and enhance adaptability. Many studies have explored different machine learning techniques to train game agents, using reinforcement learning, imitation learning, and hybrid approaches. These methods have achieved impressive results in various tasks; however, most research was conducted in static or semi-dynamic environments.
Real-time adaptation in dynamic settings where layouts and obstacles change remains a challenge. This paper explores the training of adaptive agents using Unity ML-Agents, comparing PPO, GAIL, and Behavior Cloning under unpredictable conditions. By optimizing reward functions and training strategies, this study identifies effective approaches for real-time generalization. The findings provide insights for developing intelligent agents that require high adaptability to dynamic environments.

Methodology

The environment consists of modular level segments, each representing a small section of the full track. Every segment contains:

Checkpoints: Invisible boxes that mark intermediate progress and encourage step-by-step advancement through the segment.
Obstacle generator: Responsible for spawning random obstacles during each episode.
Connection points: Used to connect segments together seamlessly. They define where the next segment begins.

An example of a segment is shown in Figure 1.

Figure 1. Segment of straight corridor with the highlighted checkpoints and start point

To assemble a complete level, a custom Segment Route Builder script was written and attached to the corresponding game object, the RouteBuilder prefab. The script receives (Fig. 2):

A list of available segment prefabs.
The desired number of segments to generate in sequence.
A Goal prefab, representing the target destination of the agent.

Figure 2. Interface of the Segment Route Builder script component, which accepts a list of segment prefabs, the desired number of segments to generate, and a goal prefab

During initialization, RouteBuilder randomly selects and arranges the segments in a continuous path, connecting their start and end points. Once the path is complete, the goal object is placed at the end of the last segment. This approach enables flexible and highly variable level generation.
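The assembly step described above can be sketched in a few lines. This is only an illustration in Python, not the authors' C# script: the `Segment` model (a length and a 90-degree turn on a grid), the `build_route` function, and the segment lengths are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical model of a level piece: how far it extends and how it turns."""
    name: str
    length: int  # grid cells advanced along the current heading
    turn: int    # heading change at the exit, in 90-degree steps (-1 left, +1 right)


# Headings as (dx, dz) unit vectors: north, east, south, west.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def build_route(segments, count, rng):
    """Pick `count` random segments and chain each start point to the previous end point."""
    x, z, heading = 0, 0, 0
    placed = []
    for _ in range(count):
        seg = rng.choice(segments)
        placed.append((seg.name, (x, z), heading))  # where this segment starts
        dx, dz = HEADINGS[heading]
        x, z = x + dx * seg.length, z + dz * seg.length  # move to the exit connection point
        heading = (heading + seg.turn) % 4  # rotate for the next segment
    goal_position = (x, z)  # the goal sits at the end of the last segment
    return placed, goal_position


# Assumed catalog: the segment names come from the text, the lengths are made up.
catalog = [
    Segment("segmentStraight", 2, 0),
    Segment("segmentTurnRight", 2, 1),
    Segment("segmentTurnLeft", 2, -1),
]
route, goal = build_route(catalog, 3, random.Random(0))
```

The sketch does not check whether a route folds back onto itself; a real builder would need to handle that case or restrict the turn sequence.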

For the actual training process, we created a unified RandomLevel prefab (Fig. 3). It contains all essential components of the experiment:

RouteBuilder: Responsible for dynamic level generation.
Agent: The agent prefab.
FallZone: A collider box that triggers the end of an episode when the agent falls off the level.
ChekpointsManager: Stores all checkpoints within a level, tracks which have been reached by the agent, and resets the state at the beginning of each episode.

Figure 3. RandomLevel prefab
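The bookkeeping attributed to the checkpoint manager (store all checkpoints, track which were reached, reset per episode) could look roughly like the following Python sketch. The class name `CheckpointTracker` and its methods are hypothetical and are not taken from the authors' project.

```python
class CheckpointTracker:
    """Hypothetical per-level record of which checkpoints the agent has reached."""

    def __init__(self, checkpoint_ids):
        self._all = list(checkpoint_ids)  # checkpoints in route order
        self._reached = set()

    def reset(self):
        """Forget all progress; intended to be called at the start of each episode."""
        self._reached.clear()

    def visit(self, checkpoint_id):
        """Return True only the first time a known checkpoint is reached in an episode."""
        if checkpoint_id not in self._all or checkpoint_id in self._reached:
            return False
        self._reached.add(checkpoint_id)
        return True

    def next_target(self):
        """Return the first checkpoint not yet reached, or None when all are done."""
        for checkpoint_id in self._all:
            if checkpoint_id not in self._reached:
                return checkpoint_id
        return None
```

Returning `True` only on the first visit is what would keep a checkpoint reward from being collected twice in the same episode.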

The agent's behavior was defined in a C# class. During each step, the agent collects observations about its current state and the environment. Using the CollectObservations method, it gathers its position, velocity, rotation, and the relative position of the next target. The RayPerceptionSensor3D component (Fig. 4) allows the agent to detect surrounding obstacles and checkpoints: the sensor casts rays in different directions and returns information about what types of objects the agent “sees”. The agent's Behavior Parameters component acts as a bridge between the Agent script and the ML-Agents system.

Figure 4. Agent with Ray Perception Sensor component

The agent's reward consists of the following components:

1. Step penalty - a small negative reward (-2.0 / MaxStep) at every step encourages the agent to complete the level faster. MaxStep is the maximum number of agent training steps; it is set to three million.

2. Checkpoint reward - reaching a new checkpoint grants a +0.2 reward to promote exploration and route progression.

3. Progress reward - a continuous reward for reducing the distance to the goal.

4. Goal reward - a base reward of +5.0 plus a speed bonus proportional to how quickly the goal is reached, computed with a quadratic decay formula.

5. Failure penalty - collisions with walls or other obstacles result in small negative rewards, while falling into the FallZone terminates the episode with a -0.5 penalty.
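The five components can be combined per step as in the Python sketch below. The constants -2.0 / MaxStep, +0.2, +5.0 and -0.5 come from the text; the progress scale, the collision penalty, the maximum speed bonus and the exact quadratic form are not given in the article and are placeholder assumptions.

```python
MAX_STEP = 3_000_000              # value stated in the text
STEP_PENALTY = -2.0 / MAX_STEP    # about -6.67e-7 per step
CHECKPOINT_REWARD = 0.2
GOAL_BASE_REWARD = 5.0
FALL_PENALTY = -0.5

# Assumed values: the text does not state these magnitudes.
PROGRESS_SCALE = 0.01
COLLISION_PENALTY = -0.05
MAX_SPEED_BONUS = 1.0


def goal_reward(steps_used, max_step=MAX_STEP):
    """Base reward plus a bonus that decays quadratically with elapsed steps (assumed form)."""
    remaining = max(0.0, 1.0 - steps_used / max_step)
    return GOAL_BASE_REWARD + MAX_SPEED_BONUS * remaining ** 2


def step_reward(prev_dist, dist, new_checkpoint=False, collided=False,
                fell=False, reached_goal=False, steps_used=0):
    """Sum the reward components that apply on a single time step."""
    reward = STEP_PENALTY
    reward += PROGRESS_SCALE * (prev_dist - dist)  # positive when the goal gets closer
    if new_checkpoint:
        reward += CHECKPOINT_REWARD
    if collided:
        reward += COLLISION_PENALTY
    if fell:
        reward += FALL_PENALTY
    if reached_goal:
        reward += goal_reward(steps_used)
    return reward
```

In this sketch the progress term turns negative when the agent moves away from the goal; whether the original implementation penalizes that or clips it at zero is not stated.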

The total reward at each time step is as follows (equation 1):

Equation 1. The formula of the total reward calculated at each time step
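A minimal way to write Equation 1, assuming the total is the unweighted sum of the five components listed above (the symbol names are illustrative and are not the authors' notation):

```latex
R_t = r_{\text{step}} + r_{\text{checkpoint}} + r_{\text{progress}} + r_{\text{goal}} + r_{\text{failure}},
\qquad
r_{\text{step}} = -\frac{2.0}{\text{MaxStep}} = -\frac{2.0}{3 \times 10^{6}} \approx -6.67 \times 10^{-7}
```

Terms that do not apply on a given step (for example, the goal reward before the goal is reached) are taken as zero.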

For the actual training, we evaluate four setups: two variants of Proximal Policy Optimization (the default Unity ML-Agents configuration and a tuned configuration designed to improve policy stability and generalization), Behavior Cloning, and GAIL.

The default PPO configuration provided by Unity ML-Agents serves as our baseline (Table 1). It uses a relatively small network (128 hidden units), moderate batch and buffer sizes, and a discount factor of γ = 0.99. While this configuration can be sufficient for static environments, our experiments show that it performs worse in complex, procedural environments.

Table 1.

Default PPO Configuration

Parameter	Value
trainer_type	ppo
batch_size	1024
buffer_size	10240
learning_rate	3.0e-4
beta	0.001
epsilon	0.2
lambd	0.95
num_epoch	3
learning_rate_schedule	linear
normalize	true
hidden_units	128
num_layers	2
gamma	0.99
reward_strength	1.0
keep_checkpoints	5
max_steps	1000000
time_horizon	128
summary_freq	10000

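batch_size and buffer_size control how many experiences are used per gradient update and how many are collected before each policy update, respectively.
learning_rate defines how fast the model updates.
beta (entropy regularization strength), epsilon (policy clipping range) and lambd (the λ of generalized advantage estimation) are the regularization and clipping parameters that govern PPO stability.
hidden_units and num_layers define the neural network capacity.
gamma is the reward discount factor.
time_horizon sets how many steps of experience are collected per agent before they are added to the buffer, and therefore how far ahead rewards are summed before the value estimate is used.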

To improve the performance under dynamic conditions, we introduced several modifications to the default configuration (Table 2). These changes target three aspects: sample efficiency, policy smoothness, and long-term planning.

Table 2.

Difference between Tuned and Default PPO Configurations

Parameter	Default Config	Tuned Config
batch_size	1024	2048
buffer_size	10240	15360
beta	0.001	0.0005
hidden_units	128	256
gamma	0.99	0.995

Increased batch and buffer sizes. Doubling the batch size (1024 → 2048) and expanding the buffer (10240 → 15360) reduces gradient variance and leads to more stable policy updates.
Larger network capacity. Increasing the number of hidden units (128 → 256) enables the policy to encode a more detailed representation of the procedural layouts.
Lower β coefficient. Reducing the entropy regularization term (β = 0.001 → 0.0005) decreases random exploration in later training stages.
Higher discount factor. A longer planning horizon (γ = 0.995) encourages the agent to optimize long-term navigation paths.
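The relationship between the two configurations can be expressed as a small set of overrides applied to the baseline. The Python sketch below uses the values from Tables 1 and 2; the nesting into `hyperparameters`, `network_settings` and `reward_signals` is assumed to mirror the usual ML-Agents trainer YAML layout, and `apply_overrides` is a helper invented for this example.

```python
import copy

# Baseline values from Table 1, nested the way an ML-Agents trainer config usually is.
DEFAULT_PPO = {
    "trainer_type": "ppo",
    "hyperparameters": {
        "batch_size": 1024, "buffer_size": 10240, "learning_rate": 3.0e-4,
        "beta": 0.001, "epsilon": 0.2, "lambd": 0.95, "num_epoch": 3,
        "learning_rate_schedule": "linear",
    },
    "network_settings": {"normalize": True, "hidden_units": 128, "num_layers": 2},
    "reward_signals": {"extrinsic": {"gamma": 0.99, "strength": 1.0}},
    "keep_checkpoints": 5, "max_steps": 1_000_000,
    "time_horizon": 128, "summary_freq": 10_000,
}

# The five changes listed in Table 2, addressed by their path in the nested config.
TUNED_OVERRIDES = {
    ("hyperparameters", "batch_size"): 2048,
    ("hyperparameters", "buffer_size"): 15360,
    ("hyperparameters", "beta"): 0.0005,
    ("network_settings", "hidden_units"): 256,
    ("reward_signals", "extrinsic", "gamma"): 0.995,
}


def apply_overrides(base, overrides):
    """Return a deep copy of `base` with each (path -> value) override applied."""
    config = copy.deepcopy(base)
    for path, value in overrides.items():
        node = config
        for key in path[:-1]:
            node = node[key]
        node[path[-1]] = value
    return config


TUNED_PPO = apply_overrides(DEFAULT_PPO, TUNED_OVERRIDES)
```

Keeping the tuned setup as a diff against the baseline makes it explicit that every other parameter is left at its default value.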

Behavior Cloning. BC relies on expert demonstration data rather than trial-and-error exploration. We recorded trajectories into the traverseDemo.demo file. The file contains state–action pairs that cover straight segments, turns, and basic obstacle avoidance patterns. The configuration parameters of Behavior Cloning are shown in Table 3.

Table 3.

Behavior Cloning Configuration

Parameter	Value
trainer_type	ppo
batch_size	1024
buffer_size	10240
learning_rate	3.0e-4
beta	0.001
epsilon	0.2
lambd	0.95
num_epoch	3
learning_rate_schedule	linear
normalize	true
hidden_units	128
num_layers	2
gamma	0.99
reward_strength	1.0
demo_path	Demonstrations/traverseDemo.demo
strength	0.35
steps	300000
max_steps	1000000

The imitation loss is scaled using the strength parameter, which determines how strongly the agent follows the demonstrated behavior. Larger values may lead to overfitting to human trajectories, while smaller values reduce the impact of imitation. BC is applied only during the first 300k training steps. After this phase, the algorithm switches to standard PPO updates.
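The effect of strength and steps can be illustrated with a short Python sketch. It assumes that the imitation learning rate equals the PPO learning rate multiplied by strength and is annealed linearly to zero over the BC steps; this schedule is an assumption made for the example, and `bc_learning_rate` is a hypothetical helper, not an ML-Agents function.

```python
def bc_learning_rate(step, policy_lr=3.0e-4, strength=0.35, bc_steps=300_000):
    """Imitation learning rate at a training step, assuming linear annealing to zero."""
    if step >= bc_steps:
        return 0.0  # after the BC phase only the PPO updates remain
    return policy_lr * strength * (1.0 - step / bc_steps)


# With the Table 3 values the imitation rate starts near 1.05e-4 and halves by step 150k.
schedule = [(s, bc_learning_rate(s)) for s in (0, 150_000, 300_000, 1_000_000)]
```

Under this assumption, the influence of the demonstrations fades gradually rather than being switched off abruptly at step 300k.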

While Behavior Cloning simply copies the expert's actions, GAIL is more advanced. It uses a “Discriminator” network that acts like a judge. This judge tries to find the difference between the human movements and the agent's movements. The agent then learns to act in a way that the judge cannot tell apart from the human [26]. In our setup, we combine GAIL with standard reinforcement learning (PPO). This means the agent gets two types of rewards: one for reaching the goal and another for moving like an expert. The configuration for GAIL is shown in Table 4.

Table 4.

GAIL Configuration

Parameter	Value
beta	0.001
learning_rate_schedule	linear
hidden_units	128
num_layers	2
extrinsic_strength	0.8
gail_strength	0.1
demo_path	Demonstrations/Demo1.demo
use_actions	true
use_vail	false
max_steps	4000000
time_horizon	256

We set gail_strength to 0.1. This small value gives the agent enough guidance from the expert without making it forget the main goal. By setting use_actions to true, the agent learns not only where to go but also how to use its controls just like a human. We increased the training limit to 4 million steps for GAIL since the agent and the judge need more time to learn together.
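How the two reward types mix can be shown with a simplified Python sketch. It treats the combined signal as a weighted sum using the strengths from Table 4 and uses a common GAIL reward form, -log(1 - D), where D is the discriminator's estimate that a state-action pair came from the expert. Both the weighted-sum view and the reward form are assumptions for illustration; the function names are hypothetical.

```python
import math


def gail_reward(discriminator_prob, eps=1e-7):
    """Reward that grows as the discriminator rates the agent's behavior as more expert-like."""
    p = min(max(discriminator_prob, 0.0), 1.0 - eps)  # keep log() finite
    return -math.log(1.0 - p)


def combined_reward(extrinsic, discriminator_prob,
                    extrinsic_strength=0.8, gail_strength=0.1):
    """Weighted sum of the environment reward and the imitation reward."""
    return extrinsic_strength * extrinsic + gail_strength * gail_reward(discriminator_prob)
```

With these weights, a step on which the judge is undecided (D = 0.5) adds about 0.07 to the reward, so the extrinsic signal stays dominant while still nudging the agent toward expert-like movement.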

To speed up the learning process and collect diverse experiences, we placed 14 instances of the RandomLevel prefab in the scene (Fig. 5). Each instance contained a RouteBuilder prefab that generated a new random sequence of connected segments at the beginning of each episode after the agent had reached the goal on the final segment. Six of the fourteen levels were configured to generate three segments chosen from segmentStraight, segmentTurnRight and segmentTurnLeft, while the other eight instances also included segmentBranch, which introduced branching paths that could extend a level by one to five extra segments. We chose this approach so that agents are trained simultaneously in different procedural configurations, which could improve the generalization of learning.

Figure 5. Unity scene with random levels

During training, the PPO and BC configurations started with a limit of one million steps. After each training cycle, the model was reloaded and training continued with the same hyperparameters but with the total step limit increased by one million, up to four million steps. This incremental training strategy maintains a controlled learning progression and gives the model time to stabilize. However, this did not hold for every model: GAIL, for example, showed its best results with the step limit set to four million from the start.
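The incremental schedule can be written down as a short Python sketch that lists the step limits and the command for each cycle. The run id `PPO_tuned` and the config path `config/Traverse.yaml` are hypothetical; the sketch only builds command strings and assumes that max_steps is edited in the config file before each cycle.

```python
def incremental_schedule(start=1_000_000, increment=1_000_000, final=4_000_000):
    """Step limits for successive training cycles: 1M, 2M, 3M, 4M by default."""
    return list(range(start, final + 1, increment))


def training_commands(run_id, config_path, limits):
    """Pair each step limit with the command for that cycle; later cycles resume the run."""
    commands = []
    for index, max_steps in enumerate(limits):
        resume = "" if index == 0 else " --resume"
        commands.append((max_steps, f"mlagents-learn {config_path} --run-id {run_id}{resume}"))
    return commands


# Hypothetical run id and config path, used only for illustration.
plan = training_commands("PPO_tuned", "config/Traverse.yaml", incremental_schedule())
```

Each later cycle reuses the same run id with --resume, so the model continues from the checkpoint written at the end of the previous cycle.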

After training the agents under different configurations, the next stage was to evaluate their generalization and adaptability through testing. Testing is important to determine whether trained agents can perform well in previously unseen and more complex procedural conditions. It allows validating the stability of the training process and assessing the agent's ability to transfer its learned behavior to new environments.

For the testing phase, six RandomLevel prefabs were placed in the scene. Each contains a RouteBuilder prefab with a list of five segment prefabs, including the newly introduced SegmentNarrow, a narrow corridor that was not present during training. Two levels generated three segments, two generated up to four, and the other two up to five. This setup was created to test the agents' adaptability to increased environmental variability and unseen configurations. An example of a generated level is shown in Figure 6.

Figure 6. Example of generated level during test phase

Since Unity ML-Agents does not provide a dedicated "test mode" with result logging, a pseudo-test mode was used: the learning_rate was set to zero in the training configuration file, and the run was launched with the command mlagents-learn config/TraverseTest.yaml --run-id test_PPO --resume --inference. The --resume flag continues the run right after training. With this setup, the agents run in inference mode while logging stays active.

Results and discussion

We used TensorBoard, TensorFlow's visualization toolkit for machine learning experimentation [27], to track metrics such as cumulative reward, episode length, and entropy across training steps. The primary performance metric was success rate 100, our custom logging metric, which gives a clearer evaluation of efficiency and task completion.
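One plausible reading of success rate 100 is the fraction of successful episodes among the most recent 100. The Python sketch below implements that reading; the interpretation of the metric and the class name `SuccessRateWindow` are assumptions, since the article does not define the metric in detail.

```python
from collections import deque


class SuccessRateWindow:
    """Rolling success rate over the most recent `window` episodes (assumed definition)."""

    def __init__(self, window=100):
        self._results = deque(maxlen=window)  # old episodes drop out automatically

    def record(self, success):
        """Store the outcome of one finished episode."""
        self._results.append(1 if success else 0)

    def value(self):
        """Fraction of successes in the window; 0.0 before any episode has finished."""
        if not self._results:
            return 0.0
        return sum(self._results) / len(self._results)
```

A windowed rate of this kind reacts to recent policy changes, which an average over all episodes since the start of training would smooth away.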

Figure 7 illustrates the smoothed cumulative reward of all the models during the training phase: PPO default (blue), PPO tuned (gray), Behavior Cloning (cyan) and GAIL (red). Figure 8 shows the smoothed success rate during training (before four million steps) and testing (after four million steps).

Figure 7. Cumulative reward during training phase

Figure 8. Success rate during training and testing phase

PPO (Tuned) and GAIL demonstrated the best results, achieving success rates of ≈ 0.50 and ≈ 0.45 respectively in the test scenarios. This suggests that these methods generalize better. By using a larger network (256 hidden units) and better exploration settings (tuned beta), PPO (Tuned) was able to find more robust solutions than default PPO. Similarly, for GAIL, parameter choices such as the beta value and the increased time horizon, together with reward shaping, helped to achieve strong results. In contrast, Behavior Cloning showed the weakest generalization, falling to a success rate of ≈ 0.28, the largest decline relative to its training success rate. BC struggled whenever it encountered a situation even slightly different from the demonstrations. However, this may be caused by the reward weighting: BC's high extrinsic reward (1.0) relative to its cloning strength (0.35) may cause it to strictly copy the expert's behavior, whereas GAIL benefits from a more balanced ratio of 0.8 extrinsic reward to 0.1 gail strength. While the Default PPO performed better than BC, its success rate of ≈ 0.35 is still considerably lower than that of the Tuned version, indicating that hyperparameter optimization is critical.

The comparison between training and testing results is shown in Table 5.

Table 5.

Comparison of success rates between training and testing phases

Model	Training success rate	Testing success rate
PPO (Default)	≈ 0.62	≈ 0.35
PPO (Tuned)	≈ 0.70	≈ 0.50
Behavior Cloning	≈ 0.52	≈ 0.28
GAIL	≈ 0.68	≈ 0.45

The results in the table show a clear difference between training and testing performance. All models experienced a drop in success rates during the test phase. This is expected because the test environment is more difficult, featuring narrow corridors, longer paths, and complex obstacle combinations that the agents did not see during training.
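The size of the drop for each model can be computed directly from Table 5. The Python sketch below reports the absolute drop and the drop relative to the training success rate; the helper name `generalization_gap` is introduced here only for illustration.

```python
# (training success rate, testing success rate) from Table 5.
RESULTS = {
    "PPO (Default)": (0.62, 0.35),
    "PPO (Tuned)": (0.70, 0.50),
    "Behavior Cloning": (0.52, 0.28),
    "GAIL": (0.68, 0.45),
}


def generalization_gap(train, test):
    """Return (absolute drop, drop as a fraction of the training success rate)."""
    drop = train - test
    return round(drop, 2), round(drop / train, 3)


gaps = {name: generalization_gap(train, test) for name, (train, test) in RESULTS.items()}
largest_absolute = max(gaps, key=lambda name: gaps[name][0])
largest_relative = max(gaps, key=lambda name: gaps[name][1])
smallest_drop = min(gaps, key=lambda name: gaps[name][0])
```

On these numbers PPO (Tuned) loses the least (0.20, about 29% of its training rate), PPO (Default) has the largest absolute drop (0.27), and Behavior Cloning has the largest relative drop (about 46%).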

Conclusion

In this study, we compared four models: PPO with the default configuration, PPO with tuned parameters, GAIL and Behavior Cloning. They were trained to navigate procedurally generated levels of varying complexity filled with obstacles. The tuned PPO agent consistently achieved better results, with higher success rates and more stable behavior during both training and testing. The GAIL model also showed high performance and adaptability. The default PPO configuration demonstrated satisfactory results, while the Behavior Cloning model adapted poorly even though it learned faster at the beginning. These findings indicate that careful hyperparameter tuning, balanced reward design, and environment diversity, combined with hybrid approaches that pair reinforcement learning with expert demonstrations, are key factors in developing agents capable of navigation and generalization in dynamic, procedurally generated environments. Several areas may be improved in the future. For example, the agent's action space could be expanded with the ability to jump, allowing vertical movement. Additionally, testing alternative learning algorithms such as Soft Actor-Critic (SAC) would be beneficial. Another direction is the integration of curriculum learning to test how gradually increasing complexity affects the agent's results.

References:

1. Zholshiyeva, L., Zhukabayeba, T., Serek, A., Duisenbek, R., Berdieva, M., & Shapay, N. (2025). Deep Learning-Based Continuous Sign Language Recognition. Journal of Robotics and Control (JRC), 6(3), 1106-1118.
2. Kuanyshbay, D. N., Serek, A. G., Shoiynbek, A. A., Sharipov, K. R., Shoiynbek, T. A., Meraliyev, B. A., & Meraliyev, M. A. (2025). Development of an AI-Based Communication Fraud Detection System. Appl. Math, 19(4), 953-963.
3. Serek, A., Amirgaliyev, B., Li, R. Y. M., Zhumadillayeva, A., & Yedilkhan, D. (2025). Crowd density estimation using enhanced multi-column convolutional neural network and adaptive collation. IEEE Access, 13, 146956-146972.
4. Yegemberdiyeva, G., & Amirgaliyev, B. (2021, April). Study Of AI Generated And Real Face Perception. In 2021 IEEE International Conference on Smart Information Systems and Technologies (SIST) (pp. 1-6). IEEE.
5. Filipović, A. (2023). The role of artificial intelligence in video game development. Kultura polisa, 20(3), 50-67.
6. Meshram, R., Krishna, S., Kulkarni, O., Patil, R. R., Kaur, G., & Maheshwari, S. (2025, March). NPC Behavior in Games Using Unity ML-Agents: A Reinforcement Learning Approach. In 2025 International Conference on Automation and Computation (AUTOCOM) (pp. 1519-1523). IEEE.
7. Servat, A., & Mohamadi, H. S. (2023, December). Immersive Game Worlds: Using Deep Reinforcement Learning for Lifelike Non-Player Characters. In 2023 International Serious Games Symposium (ISGS) (pp. 1-5). IEEE.
8. Ayoub, M. S., Tehseen, R., Omer, U., Awan, M. M., & Javaid, R. (2025). Enhancing Non-Player Characters (NPC) Behaviour in Video Games Using Reinforcement Learning. International Journal of Innovations in Science & Technology, 7(2), 966-985.
9. Hu, C. (2024). Research on the integrated application of machine learning in Unity. Applied and Computational Engineering, 82(1), 161-166.
10. Ling, Q. (2023). Machine learning algorithms review. Applied and Computational Engineering, 4(1), 91-98.
11. Lukas, M., Tomicic, I., & Bernik, A. (2022). Anticheat system based on reinforcement learning agents in unity. Information, 13(4), 173.
12. Amirgaliyev, B., Mukhidenov, D., Yedilkhan, D., & Yermekov, A. (2025, July). A Novel Clustering-Based Anomaly Detection Approach Using a Game Engine for Road Safety. In 2025 International Conference on Advanced Machine Learning and Data Science (AMLDS) (pp. 709-714). IEEE.
13. Pecioski, D., Gavriloski, V., Domazetovska, S., & Ignjatovska, A. (2023, June). An overview of reinforcement learning techniques. In 2023 12th Mediterranean Conference on Embedded Computing (MECO) (pp. 1-4). IEEE.
14. Livada, Č., & Hodak, D. (2022, October). Advanced Mechanisms of Perception in the Digital Hide and Seek Game Based on Deep Learning. In 2022 International Conference on Smart Systems and Technologies (SST) (pp. 135-140). IEEE.
15. Lai, J., Chen, X. L., & Zhang, X. Z. (2019, July). Training an agent for third-person shooter game using unity ml-agents. In International Conference on Artificial Intelligence and Computing Science. Hangzhou (pp. 317-332).
16. Savid, Y., Mahmoudi, R., Maskeliūnas, R., & Damaševičius, R. (2023). Simulated autonomous driving using reinforcement learning: A comparative study on unity’s ML-agents framework. Information, 14(5), 290.
17. Almeida, P., Carvalho, V., & Simões, A. (2024). Reinforcement Learning as an Approach to Train Multiplayer First-Person Shooter Game Agents. Technologies, 12(3), 34.
18. Mahmoudi, R., & Ostreika, A. (2023). Reinforcement Learning for Obstacle Avoidance Application in Unity Ml-Agents. In IVUS (pp. 214-221).
19. Bansal, H., Goyal, V., Joshi, B., Gupta, A., & Kandath, H. (2024, August). RaCIL: Ray Tracing based Multi-UAV Obstacle Avoidance through Composite Imitation Learning. In 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE) (pp. 2188-2193). IEEE.
20. Zhuang, Z., Lei, K., Liu, J., Wang, D., & Guo, Y. (2023). Behavior proximal policy optimization. arXiv preprint arXiv:2302.11312.
21. Unity Technologies. (2023). ML-Agents overview. https://docs.unity3d.com/Packages/com.unity.ml-agents@4.0/manual/ML-Agents-Overview.html
22. Masayuki, U., & Tomoyuki, T. (2024, October). Visualization of Fighting Game Player Skills Using Imitation Learning Agents. In 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE) (pp. 82-83). IEEE.
23. Bain, M., & Sammut, C. (1995, July). A Framework for Behavioural Cloning. In Machine Intelligence 15 (pp. 103-129).
24. Yang, Z., Nai, W., Li, D., Liu, L., & Chen, Z. (2024). A Mixed Generative Adversarial Imitation Learning Based Vehicle Path Planning Algorithm. IEEE Access, 12, 85859-85879.
25. Cavadas, L. V. R., Clua, E., Kohwalter, T. C., & Melo, S. A. (2022, October). Training human-like bots with Imitation Learning based on provenance data. In 2022 21st Brazilian Symposium on Computer Games and Digital Entertainment (SBGames) (pp. 1-6). IEEE.
26. Li, J., Huang, S., Xu, X., & Zuo, G. (2022, August). Generative Adversarial Imitation Learning from Human Behavior with Reward Shaping. In 2022 34th Chinese Control and Decision Conference (CCDC) (pp. 6254-6259). IEEE.
27. TensorFlow Developers. (2025). TensorBoard: Get started. https://www.tensorflow.org/tensorboard/get_started?hl=ru