Until very recently, NPCs (Non-Player Characters) in games have lacked the ability to act intelligently. Reinforcement Learning (RL) allows us to train smarter NPC agents, which enables more interesting gameplay. In becoming proficient at games like Go, Quake III, or StarCraft, RL models have demonstrated that they can surpass human performance and produce unique long-term strategies that had never been discovered before. These capabilities give RL a variety of real-world applications beyond next-gen video game AI; for example, it is used in robotics to train robots to grasp various objects, a growing area of research.
The Unity Machine Learning Agents Toolkit (ML-Agents) allows us to train intelligent agents within games and simulations. We applied this toolkit to our own internal Unity game project to see how smart the game AI can become and, more importantly, to explore the field of RL. Finally, innovation in training intelligent NPCs could benefit other real-world applications.
RL is a field of machine learning (ML) in which an agent takes an action in an environment at each timestep and receives a new state and a reward in return. RL aims to maximize the agent's total reward by learning an optimal policy (that is, the rule the agent uses to decide which action to take at each timestep) through a trial-and-error learning process in a given environment. Essentially, a policy is evaluated by the results of the actions the agent takes within the environment.
Classic RL Diagram
In the context of game AI, the agent refers to the game player or NPC, the environment is the game world surrounding the player within the simulation, and the actions are those taken by the player, such as moving, attacking, or dodging in an action game. The state and reward are generally defined by the game AI designer. For example, in a simple action game, the state might be the distance between the player and the enemy, and the reward might be a positive value if the player defeats an enemy and a negative value if the player is defeated.
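This interaction loop can be sketched in a few lines of Python. Everything below is a hypothetical stand-in for a real game: the environment, state, and reward are toy definitions, and the hand-written policy is exactly the mapping that RL would learn for us instead.

```python
import random

class SimpleCombatEnv:
    """Toy stand-in for a game environment: the state is the distance to the enemy."""
    def __init__(self):
        self.distance = 10.0

    def reset(self):
        self.distance = random.uniform(5.0, 15.0)
        return self.distance  # state: distance between player and enemy

    def step(self, action):
        # action: 0 = move closer, 1 = attack
        if action == 0:
            self.distance = max(0.0, self.distance - 1.0)
            return self.distance, 0.0, False      # no reward for moving
        hit = self.distance < 2.0                 # attacks only land up close
        reward = 1.0 if hit else -0.1             # +1 for defeating the enemy
        return self.distance, reward, hit         # episode ends on a hit

def policy(state):
    """A hand-written policy; RL would learn this mapping from trial and error."""
    return 1 if state < 2.0 else 0

env = SimpleCombatEnv()
state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = policy(state)                    # agent picks an action from the state
    state, reward, done = env.step(action)    # environment returns new state + reward
    total_reward += reward
print("episode return:", total_reward)
```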
Unity’s ML-Agents is an open-source toolkit that enables the training of intelligent agents within gaming and simulation environments. A Python API allows us to train agents using RL, as well as a number of other ML techniques, all implemented in PyTorch.
ML Agents Block Diagram from ML-Agents Toolkit Overview
The toolkit contains four key components: 1) the learning environment; 2) the communicator; 3) the Python API; and 4) the Python Trainer. The learning environment consists of the Unity game scene and all the game characters. The communicator allows interaction between the Python API and the learning environment. The Python API provides control of the learning environment. The Python Trainer contains all of the ML algorithms, and the game AI designer trains agents through this interface.
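As a rough sketch of what controlling the learning environment looks like from Python, here is a minimal loop using the low-level mlagents_envs package. This assumes a recent ML-Agents release (the exact calls have changed between versions), and the build path is only a placeholder.

```python
from mlagents_envs.environment import UnityEnvironment

# Connect to a built Unity game (the path is a placeholder); passing file_name=None
# and pressing Play in the Editor connects to the running scene instead.
env = UnityEnvironment(file_name="builds/BoxGarden")
env.reset()

behavior_name = list(env.behavior_specs)[0]      # e.g. the player agent's behavior
spec = env.behavior_specs[behavior_name]

for _ in range(10):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    # Sample a random action for every agent that is requesting a decision
    actions = spec.action_spec.random_action(len(decision_steps))
    env.set_actions(behavior_name, actions)
    env.step()                                   # advance the simulation one step

env.close()
```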
Our internal Unity game project exists to generate experimental workloads that are valuable to GPU design. Using this existing game, we decided to train intelligent NPC agents that are difficult to defeat, which also helps produce new computation workloads.
Screenshots from our Internal Unity Game Project
In this trial, we limited the number of characters to be trained to two: a player and an NPC. The player attacks with a sword and a fireball. The NPC attacks by swinging its arms down, as shown in the images below.
The Player's Actions
The NPC's Actions
Also, to simplify the training, we limited the game scene to a simple box garden, as shown below. Initially we trained our player agent against a stationary NPC. Then we trained our NPC against the trained player agent. After that, the player agent and NPC repeatedly take turns at training: as the NPC becomes better at eliminating the player, the player becomes tougher, and the NPC must become smarter to defeat it.
Simplified Box Garden Environment
In this environment, we trained our player to eliminate a randomly positioned, stationary NPC. Each training episode is limited to 250 timesteps. At each timestep, we add a reward of -1/250, which means the agent accumulates a reward of -1 if it cannot defeat the NPC within the allotted time. This incentivizes the agent to eliminate the NPC as quickly as possible. If the NPC is successfully eliminated, we add a reward of +1. In Unity, multiple instances of the box garden can be used to parallelize data collection and speed up training.
Training our player agent using multiple instances
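The reward schedule described above is simple. In the project itself this logic would live in the Unity agent script; sketched in Python purely for illustration, it amounts to the following.

```python
MAX_STEPS = 250  # episode length limit

def step_reward(npc_defeated: bool, step: int):
    """Reward schedule used to train the player agent.

    Returns (reward, episode_done).
    """
    reward = -1.0 / MAX_STEPS        # small time penalty every step,
                                     # summing to -1 over a full episode
    done = False
    if npc_defeated:
        reward += 1.0                # bonus for eliminating the NPC
        done = True
    elif step >= MAX_STEPS - 1:
        done = True                  # out of time: total return is roughly -1
    return reward, done
```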
We used the Proximal Policy Optimization (PPO) method to train the agents. PPO uses a neural network to approximate the underlying mapping from states to actions. Using PPO, we were able to train our agent successfully: the player agent learns that throwing a fireball is the fastest way to defeat the NPC.
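The ML-Agents trainer implements PPO for us. Its core idea, a clipped surrogate objective that keeps each policy update close to the policy that collected the data, can be illustrated with a simplified PyTorch snippet (this is not the toolkit's own code).

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (simplified).

    new_log_probs: log pi_theta(a|s) under the current policy
    old_log_probs: log pi_theta_old(a|s) from the policy that collected the data
    advantages:    advantage estimates for each (state, action) pair
    """
    ratio = torch.exp(new_log_probs - old_log_probs)           # probability ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate for gradient descent
    return -torch.min(unclipped, clipped).mean()
```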
Graph showing extrinsic reward over 2.5 million timesteps
Player agent throwing a fireball to defeat the NPC
Now the player agent uses its previously trained model for inference while we train the NPC to attack it. Using the same PPO configuration as before, our NPC agent converges prematurely to a local optimum. Here the NPC exploits a flaw in the player agent: the player becomes stuck in the corner because it never explored that part of the arena during its own training, so it does not know which action to take in that state.
Graph showing extrinsic reward over 1 million timesteps
NPC agent escaping to the corner
To enable more human-like gameplay from the NPC, we can use Generative Adversarial Imitation Learning (GAIL). Unity's ML-Agents lets us play the game and record expert behavior in demonstration files. In GAIL, a second neural network, the discriminator, learns to distinguish between the states/actions in a demonstration file and those produced by the agent, and it generates a reward that quantifies how similar the agent's new states/actions are to the demonstrations. In turn, the agent becomes better at "fooling" the discriminator, while the discriminator becomes more rigorous at detecting it. This gradually leads to our agent imitating our actions more closely.
GAIL Diagram
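The toolkit handles GAIL internally once a demonstration file and the GAIL reward signal are configured, but the discriminator's role can be sketched roughly as follows (a simplified PyTorch illustration, not the toolkit's implementation).

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Classifies (state, action) pairs as 'expert demonstration' vs 'agent'."""
    def __init__(self, obs_size, act_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size + act_size, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # probability of "expert"

def gail_reward(disc, obs, act, eps=1e-8):
    # The more the agent's behavior looks like the demonstrations,
    # the higher the reward it receives from the discriminator.
    with torch.no_grad():
        d = disc(obs, act)
    return -torch.log(1.0 - d + eps)
```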
Gameplay showing GAIL trained agent
In this gameplay, we can observe that the NPC exhibits the correct behavior: it runs behind the player agent to avoid the thrown fireballs. However, it fails to take advantage of its positioning. This is probably because, in our demonstrations, the NPC stops in order to rotate. This, combined with our policy having no memory, means that our NPC agent has learned to imitate the stopping. Using stacked observation vectors can alleviate this problem because it gives our NPC agent important information about the recent past.
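In ML-Agents, observation stacking is a behavior-parameter setting rather than code we write ourselves, but conceptually it just concatenates the last few observation vectors so that a memoryless policy can see short-term motion. A minimal sketch of the idea:

```python
from collections import deque
import numpy as np

class ObservationStacker:
    """Keeps the last `stack_size` observation vectors and concatenates them,
    giving a memoryless policy a short window into the past."""
    def __init__(self, obs_size, stack_size=3):
        self.stack = deque([np.zeros(obs_size, dtype=np.float32)] * stack_size,
                           maxlen=stack_size)

    def add(self, obs):
        self.stack.append(np.asarray(obs, dtype=np.float32))
        return np.concatenate(list(self.stack))  # shape: (obs_size * stack_size,)
```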
After using stacked observations, it was clear that our NPC agent stopped a lot less. However, due to the slow animation speed of the NPC’s attack, the player would often move out of the way of imminent attacks.
With the current game input controller, we have a discrete action space for movement, which means we are limited to eight directions of movement. Although this simplifies the action space, it prevents our agents from facing each other precisely over longer distances. This is why our player agent learned to get close to the NPC before throwing a fireball. Ideally, we would want the player agent to learn to throw fireballs from any distance, so updating our input controller to enable continuous movement would encourage smarter behavior to be learned.
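To make the difference concrete, here is a hypothetical sketch of the two movement schemes: the discrete controller snaps movement to one of eight compass directions, while a continuous controller can point exactly at a distant target.

```python
import numpy as np

def discrete_move(direction_index):
    """Current controller: one of 8 fixed compass directions."""
    angle = direction_index * (2 * np.pi / 8)
    return np.array([np.cos(angle), np.sin(angle)])

def continuous_move(x, y):
    """Proposed controller: any direction, so the agent can aim precisely at a
    distant target instead of approximating it with the nearest of 8 headings."""
    v = np.array([x, y], dtype=np.float32)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```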
Another challenge involves updating the animation speeds of certain actions to enable fairer gameplay. We observed that the NPC’s attack action was sluggish and therefore enabled the player agent to move out of the way of imminent attacks.
Future research would involve adding more agents and actions. This would allow for interesting gameplay, complex behavior, and the generation of various computation workloads.
In addition, this study was carried out on a laptop. We will next look at how mobile devices running Unity ML-Agents on Arm CPUs perform.
From this project, we learned that Unity's ML-Agents Toolkit is easy to use and has a wide range of capabilities. We observed that RL can lead to unintended behaviors in which the agent finds exploits. It is therefore necessary to use imitation learning to enable more human-like behaviors. GAIL is an effective algorithm for this and is provided by Unity's ML-Agents Toolkit.
The project also highlighted challenges and areas for future research. The intelligent agent behavior we had hoped for was stifled by our goal of simplicity. An input controller that enabled continuous movement would most likely have led to a significant increase in the intelligence that was attainable, and changing the animation speeds of certain actions would have helped us train a more effective NPC. Future research will look to add more agents and actions. We are very excited to see how the field of RL in gaming will expand on Arm CPUs. If you are interested in ML with Arm CPUs, please check out this site too. We hope that this blog inspires others and is a catalyst for further research into RL.
[CTAToken URL = "https://developer.arm.com/ip-products/processors/machine-learning" target="_blank" text="Learn more about ML" class ="green"]