Until very recently, NPCs (Non-Player Characters) in games have lacked the ability to act intelligently. Reinforcement Learning (RL) allows us to train smarter NPC agents, which enables more interesting gameplay. In becoming proficient at games like Go, Quake III, or StarCraft, RL models have demonstrated that they can surpass human performance and produce unique long-term strategies that had never been discovered before. These capabilities give RL a variety of real-world applications beyond next-gen video game AI; for example, it is used in robotics to train robots to grasp various objects, a growing area of research.
The Unity Machine Learning Agents Toolkit (ML-Agents) allows us to train intelligent agents within games and simulations. We applied this toolkit to our own internal Unity game project to see how smart the game AI can become and, more importantly, to explore the field of RL. Finally, innovation in training intelligent NPCs could benefit other real-world applications.
RL is a field of machine learning (ML) in which an agent takes an action in an environment at each timestep and receives a new state and a reward in return. RL aims to maximize the agent's total reward by learning an optimal policy (that is, the rule the agent uses to decide which action to take at each timestep) through a trial-and-error learning process in a given environment. Essentially, a policy is evaluated by the results of the actions the agent takes within the environment.
Classic RL Diagram
In the context of game AI, the agent refers to the game player or NPC, the environment is the game world surrounding the player within the simulation, and the actions are those taken by the player, such as moving, attacking, or dodging in an action game. The state and reward are generally defined by the game AI designer. For example, in a simple action game, the state might be the distance between the player and the enemy, and the reward might be a positive value if the player defeats an enemy and a negative value if the player is defeated.
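This interaction loop can be sketched in a few lines of Python. Everything below is a hypothetical stand-in for a real game: the environment, state, and reward are toy definitions, and the hand-written policy is exactly the mapping that RL would learn for us instead.

```python
import random

class SimpleCombatEnv:
    """Toy stand-in for a game environment: the state is the distance to the enemy."""
    def __init__(self):
        self.distance = 10.0

    def reset(self):
        self.distance = random.uniform(5.0, 15.0)
        return self.distance  # state: distance between player and enemy

    def step(self, action):
        # action: 0 = move closer, 1 = attack
        if action == 0:
            self.distance = max(0.0, self.distance - 1.0)
            return self.distance, 0.0, False      # no reward for moving
        hit = self.distance < 2.0                 # attacks only land up close
        reward = 1.0 if hit else -0.1             # +1 for defeating the enemy
        return self.distance, reward, hit         # episode ends on a hit

def policy(state):
    """A hand-written policy; RL would learn this mapping from trial and error."""
    return 1 if state < 2.0 else 0

env = SimpleCombatEnv()
state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = policy(state)                    # agent picks an action from the state
    state, reward, done = env.step(action)    # environment returns new state + reward
    total_reward += reward
print("episode return:", total_reward)
```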
Unity’s ML-Agents is an open-source toolkit that enables the training of intelligent agents within gaming and simulation environments. A Python API allows us to train agents using RL, as well as a number of other ML techniques, all implemented in PyTorch.
ML Agents Block Diagram from ML-Agents Toolkit Overview
The toolkit contains four key components: 1) the learning environment; 2) the communicator; 3) the Python API; and 4) the Python Trainer. The learning environment consists of the Unity game scene and all the game characters. The communicator allows interaction between the Python API and the learning environment. The Python API provides control of the learning environment. The Python Trainer contains all of the ML algorithms, and the game AI designer trains agents through this interface.
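As a rough sketch of what controlling the learning environment looks like from Python, here is a minimal loop using the low-level mlagents_envs package. This assumes a recent ML-Agents release (the exact calls have changed between versions), and the build path is only a placeholder.

```python
from mlagents_envs.environment import UnityEnvironment

# Connect to a built Unity game (the path is a placeholder); passing file_name=None
# and pressing Play in the Editor connects to the running scene instead.
env = UnityEnvironment(file_name="builds/BoxGarden")
env.reset()

behavior_name = list(env.behavior_specs)[0]      # e.g. the player agent's behavior
spec = env.behavior_specs[behavior_name]

for _ in range(10):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    # Sample a random action for every agent that is requesting a decision
    actions = spec.action_spec.random_action(len(decision_steps))
    env.set_actions(behavior_name, actions)
    env.step()                                   # advance the simulation one step

env.close()
```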
Our internal Unity game project exists to generate experimental workloads that are valuable to GPU design. Using this existing game, we decided to train intelligent NPC agents that are difficult to defeat, which also helps produce new computation workloads.
Screenshots from our Internal Unity Game Project
In this trial, we limited the number of characters to be trained to two: a player and an NPC. The player attacks with a sword and a fireball. The NPC attacks by swinging its arms down, as shown in the images below.
The Player's Actions
The NPC's Actions
Also, to simplify the training, we limited the game scene to a simple box garden, as shown below. Initially we trained our player agent against a stationary NPC. Then we trained our NPC against the trained player agent. After that, the player agent and NPC repeatedly take turns at training: as the NPC becomes better at eliminating the player, the player becomes tougher, and the NPC must become smarter to defeat it.
Simplified Box Garden Environment
In this environment, we trained our player to eliminate a randomly positioned, stationary NPC. Each training episode is limited to 250 timesteps. At each timestep, we add a reward of -1/250, which means the agent accumulates a reward of -1 if it cannot defeat the NPC within the allotted time. This incentivizes the agent to eliminate the NPC as quickly as possible. If the NPC is successfully eliminated, we add a reward of +1. In Unity, multiple instances of the box garden can be used to parallelize data collection and speed up training.
Training our player agent using multiple instances
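The reward schedule described above is simple. In the project itself this logic would live in the Unity agent script; sketched in Python purely for illustration, it amounts to the following.

```python
MAX_STEPS = 250  # episode length limit

def step_reward(npc_defeated: bool, step: int):
    """Reward schedule used to train the player agent.

    Returns (reward, episode_done).
    """
    reward = -1.0 / MAX_STEPS        # small time penalty every step,
                                     # summing to -1 over a full episode
    done = False
    if npc_defeated:
        reward += 1.0                # bonus for eliminating the NPC
        done = True
    elif step >= MAX_STEPS - 1:
        done = True                  # out of time: total return is roughly -1
    return reward, done
```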
We used the Proximal Policy Optimization (PPO) method to train the agents. PPO uses a neural network to approximate the underlying mapping from states to actions. Using PPO, we were able to train our agent successfully: the player agent learns that throwing a fireball is the fastest way to defeat the NPC.
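The ML-Agents trainer implements PPO for us. Its core idea, a clipped surrogate objective that keeps each policy update close to the policy that collected the data, can be illustrated with a simplified PyTorch snippet (this is not the toolkit's own code).

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (simplified).

    new_log_probs: log pi_theta(a|s) under the current policy
    old_log_probs: log pi_theta_old(a|s) from the policy that collected the data
    advantages:    advantage estimates for each (state, action) pair
    """
    ratio = torch.exp(new_log_probs - old_log_probs)           # probability ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate for gradient descent
    return -torch.min(unclipped, clipped).mean()
```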
Graph showing extrinsic reward over 2.5 million timesteps
Player agent throwing a fireball to defeat the NPC
Now the player agent uses its previously trained model for inference while we train the NPC to attack it. Using the same PPO configuration as before, our NPC agent converges prematurely to a local optimum. Here the NPC exploits a flaw in the player agent: the player becomes stuck in the corner because it never explored that part of the arena during its own training, so it does not know which action to take in that state.
Graph showing extrinsic reward over 1 million timesteps
NPC agent escaping to the corner
To enable more human-like gameplay from the NPC, we can use Generative Adversarial Imitation Learning (GAIL). Unity's ML-Agents lets us play the game and record expert behavior in demonstration files. In GAIL, a second neural network, the discriminator, learns to distinguish between the states/actions in a demonstration file and those produced by the agent, and it generates a reward that quantifies how similar the agent's new states/actions are to the demonstrations. In turn, the agent becomes better at "fooling" the discriminator, while the discriminator becomes more rigorous at detecting it. This gradually leads to our agent imitating our actions more closely.
GAIL Diagram
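The toolkit handles GAIL internally once a demonstration file and the GAIL reward signal are configured, but the discriminator's role can be sketched roughly as follows (a simplified PyTorch illustration, not the toolkit's implementation).

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Classifies (state, action) pairs as 'expert demonstration' vs 'agent'."""
    def __init__(self, obs_size, act_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size + act_size, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # probability of "expert"

def gail_reward(disc, obs, act, eps=1e-8):
    # The more the agent's behavior looks like the demonstrations,
    # the higher the reward it receives from the discriminator.
    with torch.no_grad():
        d = disc(obs, act)
    return -torch.log(1.0 - d + eps)
```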
Gameplay showing GAIL trained agent
In this gameplay, we can observe that the NPC exhibits the correct behavior: it runs behind the player agent to avoid the thrown fireballs. However, it fails to take advantage of its positioning. This is probably because, in our demonstrations, the NPC stops in order to rotate. This, combined with our policy having no memory, means that our NPC agent has learned to imitate the stopping. Using stacked observation vectors can alleviate this problem because it gives our NPC agent important information about the recent past.
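In ML-Agents, observation stacking is a behavior-parameter setting rather than code we write ourselves, but conceptually it just concatenates the last few observation vectors so that a memoryless policy can see short-term motion. A minimal sketch of the idea:

```python
from collections import deque
import numpy as np

class ObservationStacker:
    """Keeps the last `stack_size` observation vectors and concatenates them,
    giving a memoryless policy a short window into the past."""
    def __init__(self, obs_size, stack_size=3):
        self.stack = deque([np.zeros(obs_size, dtype=np.float32)] * stack_size,
                           maxlen=stack_size)

    def add(self, obs):
        self.stack.append(np.asarray(obs, dtype=np.float32))
        return np.concatenate(list(self.stack))  # shape: (obs_size * stack_size,)
```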
After using stacked observations, it was clear that our NPC agent stopped a lot less. However, due to the slow animation speed of the NPC’s attack, the player would often move out of the way of imminent attacks.
With the current game input controller, we have a discrete action space for movement, which means we are limited to eight directions of movement. Although this simplifies the action space, it prevents our agents from facing each other precisely over longer distances. This is why our player agent learned to get close to the NPC before throwing a fireball. Ideally, we would want the player agent to learn to throw fireballs from any distance, so updating our input controller to enable continuous movement would encourage smarter behavior to be learned.
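To make the difference concrete, here is a hypothetical sketch of the two movement schemes: the discrete controller snaps movement to one of eight compass directions, while a continuous controller can point exactly at a distant target.

```python
import numpy as np

def discrete_move(direction_index):
    """Current controller: one of 8 fixed compass directions."""
    angle = direction_index * (2 * np.pi / 8)
    return np.array([np.cos(angle), np.sin(angle)])

def continuous_move(x, y):
    """Proposed controller: any direction, so the agent can aim precisely at a
    distant target instead of approximating it with the nearest of 8 headings."""
    v = np.array([x, y], dtype=np.float32)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```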
Another challenge involves updating the animation speeds of certain actions to enable fairer gameplay. We observed that the NPC’s attack action was sluggish and therefore enabled the player agent to move out of the way of imminent attacks.
Future research would involve adding more agents and actions. This would allow for interesting gameplay, complex behavior, and the generation of various computation workloads.
In addition, this study was carried out on a laptop. We will next look at how mobile devices running Unity ML-Agents on Arm CPUs perform.
From this project, we learned that Unity's ML-Agents Toolkit is easy to use and has a wide range of capabilities. We observed that RL can lead to unintended behaviors in which the agent finds exploits. It is therefore necessary to use imitation learning to enable more human-like behaviors. GAIL is an effective algorithm for this and is provided by Unity's ML-Agents Toolkit.
The project also highlighted challenges and areas for future research. The intelligent agent behavior we had hoped for was stifled by our goal of simplicity. An input controller that enabled continuous movement would most likely have led to a significant increase in the intelligence that was attainable, and changing the animation speeds of certain actions would have helped us train a more effective NPC. Future research will look to add more agents and actions. We are very excited to see how the field of RL in gaming will expand on Arm CPUs. If you are interested in ML with Arm CPUs, please check out this site too. We hope that this blog inspires others and is a catalyst for further research into RL.
[CTAToken URL = "https://developer.arm.com/ip-products/processors/machine-learning" target="_blank" text="Learn more about ML" class ="green"]