In part 1 of this blog series, we provided a general overview of our Dr Arm’s Boss Battle Demo. Part 2 takes a more in-depth look at how game AI agents are designed and what generated Neural Network (NN) models look like.
Once the strategy for the boss battle has been decided, the next step is to design the agent. Designing an agent mainly requires clarifying four items. More information on agent design can be found here.
We first must think about what information the agent needs for the target task. In our demo, the input includes the statistics, the action events, and the positions of the target and of the agent itself. The statistics are Health, Mana, and Stamina. The action events are Attack, Roll, and Fire. We collect this information in two ways. One way is to give the information to the agent from the agent's C# code. The stats and the action events are mainly passed this way, as follows:
// Customized class to manage a character's state
public PlayerManager _manager;
public PlayerManager _enemyManager;

public override void CollectObservations(VectorSensor sensor)
{
    // Collect my state and add it to the observations.
    // Normalize a value to [0, 1] by dividing by its max value.
    sensor.AddObservation(_manager.Stats.CurrentHealth / _manager.Stats.MaxHealth);
    sensor.AddObservation(_manager.Stats.CurrentStamina / _manager.Stats.MaxStamina);
    sensor.AddObservation(_manager.Stats.CurrentMana / _manager.Stats.MaxMana);
    sensor.AddObservation(_manager.posFire); // Vector3 type

    // Collect the enemy's state and add it to the observations
    sensor.AddObservation(_enemyManager.Stats.CurrentHealth / _enemyManager.Stats.MaxHealth);
    sensor.AddObservation(_enemyManager.Stats.CurrentStamina / _enemyManager.Stats.MaxStamina);
    sensor.AddObservation(_enemyManager.Stats.CurrentMana / _enemyManager.Stats.MaxMana);
    sensor.AddObservation(_enemyManager.posFire); // Vector3 type

    // 1 if the enemy is facing the agent, 0 otherwise
    int isEnemyFacingMe = Vector3.Dot(
        _manager.transform.localPosition - _enemyManager.transform.localPosition,
        _enemyManager.transform.forward) > 0 ? 1 : 0;
    sensor.AddObservation(isEnemyFacingMe);
}
Health, Mana, and Stamina of the target and of the agent itself are vital information. Health is a direct indicator of winning a battle, so it must be provided. Choosing actions based on Mana and Stamina is also likely to be important for success. Action events allow the agent to know if the target is about to attack. We also provide whether the target is looking at the agent, based on their orientations. This information allows the agent to get behind the target for a more effective attack.
The other way to give information to the agent is to use raycasts. You can think of them as lasers that detect line of sight to an object. They are used to detect walls and the position of the target, and are enabled by adding a Ray Perception Sensor 3D component to the agent.
Figure 1. Raycasts in action (Left: top-down view, right: oblique view)
The number of input values fed into the NN model is the sum of the inputs defined by the two methods above. The number of inputs from the C# code is 57. This can be calculated using the formula: (Space Size) * (Stacked Vectors). Here, Space Size is the number of observed values collected by the AddObservation method, and Stacked Vectors is the number of frames of input fed to the NN model at once. Stacked Vectors can be set in Unity's UI, as shown in figure 2. You must match these parameters with the observations you have defined in the code. The number of inputs from the raycasts is 492, calculated using the formula: (Stacked Raycasts) * (1 + 2 * Rays Per Direction) * (Num of Detectable Tags + 2). These can also be set in Unity's UI. Of course, the fewer the rays and tags, the less data is used and the lighter the computational load.
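As a sanity check, the two formulas can be evaluated in a few lines of Python. Only the totals (57 and 492) and the formulas themselves come from the demo; the individual parameter values below (19 observed values per frame, 3 stacked vectors, 2 stacked raycasts, 20 rays per direction, 4 detectable tags) are illustrative assumptions that happen to reproduce those totals.

```python
# Sanity check of the two input-size formulas.
# Parameter values are illustrative assumptions; only the
# totals (57 and 492) come from the demo.

def vector_obs_count(space_size, stacked_vectors):
    """(Space Size) * (Stacked Vectors)"""
    return space_size * stacked_vectors

def raycast_obs_count(stacked_raycasts, rays_per_direction, num_detectable_tags):
    """(Stacked Raycasts) * (1 + 2 * Rays Per Direction) * (Num of Detectable Tags + 2)"""
    return stacked_raycasts * (1 + 2 * rays_per_direction) * (num_detectable_tags + 2)

# 19 observed values per frame, stacked over 3 frames
print(vector_obs_count(19, 3))      # 57
# e.g. 2 stacked raycasts, 20 rays per direction, 4 detectable tags
print(raycast_obs_count(2, 20, 4))  # 492
```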
Figure 2. Agents Behavior Parameters (left) and Ray Perception Sensor 3D Component (right)
Next is to define the possible outputs from the agent. We map the output from the agent one-to-one with the unique actions of the character. In this game demo, the actions that Dr Arm and Knight can take are identical.
Figure 3. Character actions (Left: Dr Arm, right: Knight, bottom: possible actions)
The characters can move along two axes, horizontal and vertical, each taking a continuous value from -1 to 1. The agent also outputs one of four mutually exclusive discrete values as an action. Each value is assigned to an action: ATTACK, which swings a sword; FIRE, which throws a fireball; ROLL, which dodges an attack; and NO ACTION. These can be implemented as shown in the following sample code:
// Called every time the agent receives an action to take from Agent.OnActionReceived()
public void ActAgent(ActionBuffers actionBuffers)
{
    // Joystick movement
    var actionZ = Mathf.Clamp(actionBuffers.ContinuousActions[0], -1f, 1f);
    var actionX = Mathf.Clamp(actionBuffers.ContinuousActions[1], -1f, 1f);
    Vector2 moveVector = new Vector2(actionZ, actionX);

    // Discrete actions (action implementations elided)
    if (actionBuffers.DiscreteActions[0] == 1)
    {
        // ATTACK
    }
    else if (actionBuffers.DiscreteActions[0] == 2)
    {
        // FIRE
    }
    else if (actionBuffers.DiscreteActions[0] == 3)
    {
        // ROLL
    }
}

// Heuristic converts the controller inputs into actions.
// If the agent has a Model file, it will use the NN model to take actions instead.
public override void Heuristic(in ActionBuffers actionsOut)
{
    var continuousActionsOut = actionsOut.ContinuousActions;
    var discreteActionsOut = actionsOut.DiscreteActions;
    continuousActionsOut[0] = Input.GetAxis("Horizontal");
    continuousActionsOut[1] = Input.GetAxis("Vertical");
    if (Input.GetKey(KeyCode.Joystick1Button0)) // first button mapping assumed
    {
        discreteActionsOut[0] = 1;
    }
    else if (Input.GetKey(KeyCode.Joystick1Button2))
    {
        discreteActionsOut[0] = 2;
    }
    else if (Input.GetKey(KeyCode.Joystick1Button1))
    {
        discreteActionsOut[0] = 3;
    }
    else
    {
        // do nothing
        discreteActionsOut[0] = 0;
    }
}
As described in figure 3, there are 2 continuous actions (horizontal and vertical movement) and 1 discrete action branch with 4 possible values. The sum of these values equals the number of nodes in the output layer of the NN model. Again, you must match the parameters in Unity's UI shown below with the number of actions you have defined in the code.
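The output-layer size can be counted the same way: one node per continuous action plus one node per discrete choice. The small function below illustrates that counting under the demo's action layout; it is a sketch, not the exporter's actual model definition.

```python
# Count output-layer nodes: one per continuous action,
# plus one node per choice in each discrete branch.
# Branch layout mirrors the demo: 2 continuous actions,
# one discrete branch with 4 choices.
def output_nodes(num_continuous, discrete_branches):
    return num_continuous + sum(discrete_branches)

print(output_nodes(2, [4]))  # 6
```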
Figure 4. Behavior parameters should match with the output configuration
Next, consider the brain of the agent for decision making. What NN model structure should it use? Is historical information necessary? Is camera input required?
By default, ML-Agents uses a Multi-Layer Perceptron (MLP) structure. The MLP is the most basic neural network structure, with every neuron in one layer connected to every neuron in the next, as shown in figure 5. The input and output layers of the network are determined by the inputs and outputs defined in Design sections 1 and 2 above. In addition, ML-Agents provides several parameters to change the number and size of the intermediate layers, and more.
Figure 5. MLP NN model structure
There are three parameters that game developers can change:
• Stacked Vectors: the number of frames of input data fed to the NN model at once
• Number of Layers: the number of intermediate layers in the NN model
• Hidden Units: the number of neurons per layer
Game developers must set the appropriate parameter values depending on the complexity of the task. In our demo, the NN model has 3 Stacked Vectors, 2 intermediate layers and 128 neurons per layer. The size of the NN model is not that large, but this is a somewhat common size in Reinforcement Learning. As mentioned in the Design section 1 above, the Stacked Vectors can be set in Unity’s UI. The other two parameters should be specified in a YAML script, which is passed to the training command. More information on network settings can be found here.
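As a sketch, the two YAML parameters might look like the following in an ML-Agents trainer configuration file. The behavior name (Knight) and the surrounding trainer settings are placeholders for illustration; only num_layers: 2 and hidden_units: 128 come from the demo.

```yaml
behaviors:
  Knight:               # behavior name is an assumed placeholder
    trainer_type: ppo
    network_settings:
      num_layers: 2     # intermediate layers
      hidden_units: 128 # neurons per layer
```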
ML-Agents generates NN models in ONNX format. Below is the structure of the generated NN model. As you can see, the inputs and outputs we defined are reflected in the model. In the upper right, an input named action_masks has been created. It can be used to disable specific actions at a given point in time. For example, you could explicitly mask the FIRE action when there is not enough Mana left, but we did not use this feature in our demo.
Figure 6. Generated NN model
Finally, consider what rewards should be given. Rewards play a key role in setting the agent's goals. In this demo, three main rewards are given to the agent, depending on the state.
Figure 7. Reward function
In our case, the first is a large positive reward, given when the agent achieves its aim and defeats the target: a +1 reward. Conversely, a large negative reward is given when the agent is defeated by the target. Being defeated is an event that must be avoided, so a -1 reward is given in this case. Finally, a small negative reward is given continuously at every step. This incentivizes the agent to defeat the target as quickly as possible. During training, a timeout is also defined after a certain period, 2500 steps in our case, and the per-step penalty is sized so that the accumulated amount reaches -1 when the timeout occurs. This means that a time-out draw receives the same negative reward as being defeated by the target.
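To make the arithmetic concrete, here is a small Python sketch of the time penalty. The per-step value of -1/2500 is inferred from the statement that the accumulated penalty reaches -1 at the 2500-step timeout; the demo's actual implementation may differ.

```python
# Time-penalty schedule: a small constant penalty each step
# that accumulates to -1 at the 2500-step timeout.
# The per-step value (-1/2500) is an inference from the text.

MAX_STEPS = 2500
STEP_PENALTY = -1.0 / MAX_STEPS

total = 0.0
for _ in range(MAX_STEPS):
    total += STEP_PENALTY

print(round(total, 6))  # -1.0 at timeout: same as being defeated
```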
In part 3, I will explore the training strategy for the game AI agents.