Part 2: Build dynamic and engaging mobile games with multi-agent reinforcement learning

Koki Mitsunami
June 28, 2023
10 minute read time.
Part 2 of 3 blog series


In part 1, we provided a general overview of our Candy Clash demo. In part 2, we look in more depth at how the AI agents are designed.

Three types of rabbit roles

In the demo, all three types of rabbit agents look identical and have the same inputs and outputs. Inputs are given as raycasts and vectors. You can think of raycasts as lasers that sense the rabbit's surroundings; they are used to detect walls, eggs, and the distance and angle to teammates or opponents. Each agent sends out 17 rays, with egg positions also provided as vectors. In total, there are 244 input data points. These inputs are fed into an NN model, which outputs the agent's movement or attack actions. The NN model we use features a simple Multi-Layer Perceptron (MLP) structure with one hidden layer and 64 hidden units. Although this model structure is common to all three types of rabbit agents, each type has a distinct policy, meaning they have different model weights.

Figure 1. Rabbit agent
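
To make this concrete, here is a minimal sketch of how a rabbit agent's observations and actions could be wired up with the ML-Agents Agent class. The raycasts would typically come from a RayPerceptionSensorComponent3D attached to the prefab in the Unity editor, so only the vector observations appear in code. The field names, tuning values, and action mapping are illustrative assumptions, not the demo's actual implementation.

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class AgentRabbit : Agent
{
    // Hypothetical references set in the inspector
    public Transform ownEgg;
    public Transform opponentEgg;

    // Assumed tuning values
    public float moveSpeed = 2f;
    public float turnSpeed = 180f;

    public override void CollectObservations(VectorSensor sensor)
    {
        // Raycast observations are produced by a RayPerceptionSensorComponent3D on the prefab;
        // here we only add the egg positions as vectors, relative to the rabbit.
        sensor.AddObservation(transform.InverseTransformPoint(ownEgg.position));
        sensor.AddObservation(transform.InverseTransformPoint(opponentEgg.position));
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Illustrative discrete branches: 0 = move (0/1), 1 = turn (0 = left, 1 = none, 2 = right), 2 = attack (0/1)
        int move = actions.DiscreteActions[0];
        int turn = actions.DiscreteActions[1];
        int attack = actions.DiscreteActions[2];

        transform.Rotate(0f, (turn - 1) * turnSpeed * Time.deltaTime, 0f);
        transform.position += transform.forward * move * moveSpeed * Time.deltaTime;
        if (attack == 1)
        {
            // Game-specific attack logic would go here
        }
    }
}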

The three types of rabbit agents are trained individually using self-play. Self-play is a training method that allows agents to play against each other to gradually improve their intelligence. We train the attacker and defender agents by having them compete, while wanderers compete against each other. The following videos show training in action, with different positions and different numbers of rabbits spawned per episode.

Figure 2. Training rabbits with self-play (left: attacker vs. defender, right: wanderer vs. wanderer)

We design reward functions for each role to encourage the desired behavior; a code sketch of how these rewards could be applied follows Figure 3.

  • The attacker's goal is to crack the opponent's egg, so a large positive group reward is given when the egg is cracked, and a negative reward is given if it is not cracked within a predetermined time. The shorter the time taken to crack the egg, the larger the reward. Additionally, to further encourage training, small individual rewards are given to an attacker who successfully attacks the egg.
  • For defenders, we want them to protect their own egg, so we give a large positive group reward if they successfully defend the egg for a certain time and a negative reward if their egg is cracked, which is the opposite of the attackers. Additionally, individual rewards are given when the defender stays within a certain distance from their egg, and a small reward is provided when they attack an opponent rabbit.
  • Wanderers have the role of defeating opponent rabbits, so they receive a large positive group reward when they eliminate all enemy rabbits and a negative reward if their team gets defeated. The faster they achieve their goal, the larger the reward. Individual rewards are given when a wanderer attacks an opponent's rabbit.

Figure 3. Reward functions for three types of rabbit agents
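
As an illustration, the attacker's rewards described above could be applied along the following lines with ML-Agents' group and individual reward APIs. The class name, time-scaling factor, reward magnitudes, and helper methods are assumptions for this sketch, not the demo's actual values; Code 1 below shows how the agent group itself is set up.

using Unity.MLAgents;
using UnityEngine;

public class AttackerRewards : MonoBehaviour
{
    // Group of attacker agents, registered as shown in Code 1 below
    SimpleMultiAgentGroup attackerGroup;

    // Assumed episode time limit in seconds
    public float episodeLength = 60f;

    // Called when the opponent's egg is cracked
    public void OnEggCracked(float elapsedTime, AgentRabbit attacker)
    {
        // Larger group reward the faster the egg is cracked
        attackerGroup.AddGroupReward(2f * (1f - elapsedTime / episodeLength));
        // Small individual reward for the rabbit that landed the final attack
        attacker.AddReward(0.1f);
        attackerGroup.EndGroupEpisode();
    }

    // Called when the time limit expires without cracking the egg
    public void OnTimeout()
    {
        attackerGroup.AddGroupReward(-1f);
        attackerGroup.EndGroupEpisode();
    }
}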

Here is the example code for training as a group. To train rabbit characters as a group, we use SimpleMultiAgentGroup. By registering rabbit agents with the group, ML-Agents recognizes them as part of the same group. Since rabbits that are defeated during an episode are removed from the group, the ResetRabbits function adds them back after each episode. To give rewards to the group, we use the AddGroupReward function.

Code 1. Agent group training (in C#)

using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

public class RabbitSpawner : MonoBehaviour
{
    int rabbitCount = 50;
    SimpleMultiAgentGroup agentGroup;
    List<GameObject> rabbits = new List<GameObject>();

    public void Initialise()
    {
        agentGroup = new SimpleMultiAgentGroup();
        int cnt = 0;
        while (cnt < rabbitCount)
        {
            // SpawnRabbit() instantiates a rabbit prefab (project-specific)
            GameObject rabbit = SpawnRabbit();
            rabbits.Add(rabbit);
            agentGroup.RegisterAgent(rabbit.GetComponent<AgentRabbit>());
            cnt++;
        }
    }

    // Will be called at the end of every episode
    public void ResetRabbits()
    {
        foreach (var rabbit in rabbits)
        {
            var agent = rabbit.GetComponent<AgentRabbit>();
            agent.Reset(); // Reset position etc. (project-specific method on AgentRabbit)
            agentGroup.RegisterAgent(agent);
        }
    }

    public void Win()
    {
        agentGroup.AddGroupReward(+2);
    }

    public void Lose()
    {
        agentGroup.AddGroupReward(-1);
    }

    public void Timeout()
    {
        agentGroup.EndGroupEpisode();
    }
}

The following is an excerpt from the training configuration file. When training as a group, set the trainer_type to “poca”.

Code 2. Training configuration example with 5 environments (in YAML)

behaviors:
  Attacker:
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 102400
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 64
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 40
    max_steps: 50000000
    checkpoint_interval: 200000
    time_horizon: 1000
    summary_freq: 10000
    self_play:
      save_steps: 250000
      team_change: 500000
      swap_steps: 100000
      window: 50
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0

  Defender:
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 102400
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 64
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 40
    max_steps: 50000000
    checkpoint_interval: 200000
    time_horizon: 1000
    summary_freq: 10000
    self_play:
      save_steps: 250000
      team_change: 500000
      swap_steps: 100000
      window: 50
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0

When training, you can use an executable build of the environment instead of running it in the Unity editor. There are several benefits to this, but one of the biggest is that training can be sped up by running the executable in parallel and in headless mode without rendering. You can set the level of parallelism by passing the --num-envs parameter to the training command; the ML-Agents documentation on GitHub explains how to use an environment executable in detail. When you run the executable in parallel, training may become ineffective because increasing the number of environments changes how experience fills the buffer used for training. In such cases, you can increase the buffer_size parameter in the training configuration file. buffer_size controls how much experience is collected before the model is updated, and a common practice is to multiply it by num-envs. After building the environment, you can parallelize training with the following command. In our setup, training time was reduced by 51% by running 5 environments in parallel.

Code 3. Training command example with executable (in shell)

mlagents-learn ./<trainConfig>.yaml --no-graphics --run-id=<runId> --env=<builtEnv>.exe --num-envs=5
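
As a worked example of scaling the buffer: if a single-environment run used a buffer_size of 20,480, multiplying it by num-envs=5 gives 20,480 × 5 = 102,400, the value used in the configuration above (20,480 is our assumed single-environment baseline for illustration).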

As a tip for successful training, start small and gradually scale up. Begin with simple settings and objectives, and once you have confirmed that the agents behave as expected, add more complex settings. It is also important to keep track of what works well and what does not, so you can make adjustments and improvements when necessary.

Cooperative behavior

Cooperative behavior is a key aspect of multi-agent systems working toward common goals. When we trained each role as part of a multi-agent system, some interesting cooperative behaviors emerged within the groups. We would like to share a few of these examples here:

Figure 4. Cooperative behaviors (left: attackers, center: defenders, right: wanderers)

  • The attacker's strategy is a "one-side attack". After training, they discovered that focusing their attack on the egg from the right side of the path was the most successful approach.
  • The defender's strategy is "forming a circle". They found that it was important to encircle the egg and fill any gaps as much as possible to defend it from their opponents.
  • The wanderer's strategy is a "wave attack formation". They found it effective to check their teammates' positions and create a wave attack formation, dividing their teammates into several waves and sending them at the opponents one after another. They continuously spin around in place because we used action masking to force them to keep moving rather than stop; we wanted the rabbits to hop around instead of standing still. A sketch of this kind of action masking follows this list.
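
Here is a minimal sketch of how that kind of action masking can be expressed with ML-Agents. It assumes the agent's first discrete branch is movement and that action index 0 means "stay still"; the branch layout is an assumption for illustration, not the demo's actual action space.

using Unity.MLAgents;
using Unity.MLAgents.Actuators;

// Illustrative agent subclass; in the demo the override would live on the rabbit agent class itself.
public class NonStoppingAgent : Agent
{
    const int MovementBranch = 0;   // assumed branch index for movement
    const int StayStillAction = 0;  // assumed action index for "stay still"

    public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
    {
        // Disable the "stay still" action so the agent must keep moving every step
        actionMask.SetActionEnabled(MovementBranch, StayStillAction, false);
    }
}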

As you can see, they have developed fascinating cooperative behaviors as a group.

Planner Agent: AI from a bird’s eye view

Next, let us talk about the planner agent. As explained earlier, the planner controls which role is assigned to each rabbit.

Figure 5. Planner agent (input: grid sensor plus egg HP, to see how many rabbits are in each grid cell, with a 20 x 20 grid and 9 vectors; NN model: CNN with one layer; output: rabbit role)

The planner agent uses a bird's-eye view of the field as input, considering the number of rabbits present in each of the 20 x 20 grid divisions. This information is fed into an NN model, which outputs one of the three roles assigned to each rabbit. Since the planner's inputs are grid-like data, the NN model consists of a Convolutional Neural Network (CNN) structure. This enables dynamic strategy changes, adjusting the roles of the rabbits according to the game situation.
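
As a rough sketch, the planner's role assignment could look like the following, where each rabbit corresponds to one discrete action branch whose value selects one of the three roles. The RabbitRole enum, the AssignRoles hook, and the per-rabbit branch layout are assumptions for illustration; the grid observations themselves come from the custom grid sensor component described below, so they do not appear here.

using Unity.MLAgents;
using Unity.MLAgents.Actuators;

// Illustrative role enum (assumed names)
public enum RabbitRole { Attacker, Defender, Wanderer }

public class PlannerAgent : Agent
{
    public override void OnActionReceived(ActionBuffers actions)
    {
        // Assumed action space: one discrete branch per rabbit, each with three options (the three roles)
        var roles = new RabbitRole[actions.DiscreteActions.Length];
        for (int i = 0; i < actions.DiscreteActions.Length; i++)
        {
            roles[i] = (RabbitRole)actions.DiscreteActions[i];
        }
        AssignRoles(roles);
    }

    void AssignRoles(RabbitRole[] roles)
    {
        // Project-specific hook: switch each rabbit's behavior/model to the selected role
    }
}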

Figure 6. Grid sensor

To collect information on how many rabbits are in each grid cell, we extended the GridSensor component and attached it to the planner. A detailed process for extending the grid sensor is described on GitHub. The following is the example code.

Code 4. Custom grid sensor (in C#)

using Unity.MLAgents.Sensors;
using UnityEngine;

public class CustomGridSensorComponent : GridSensorComponent
{
    protected override GridSensorBase[] GetGridSensors()
    {
        return new GridSensorBase[] { new CustomGridSensorBase(name, CellScale, GridSize, DetectableTags, CompressionType) };
    }

}

public class CustomGridSensorBase : GridSensorBase
{
    const int maxRabbitNum = 100;       // total number of rabbits in the scene
    const int maxRabbitNumInCell = 12;  // used to normalize per-cell counts to [0, 1]

    /// Create a custom grid sensor with the specified configuration.
    public CustomGridSensorBase(
        string name, 
        Vector3 cellScale, 
        Vector3Int gridSize, 
        string[] detectableTags, 
        SensorCompressionType compression
    ) : base(name, cellScale, gridSize, detectableTags, compression)
    {
    }
    
    protected override int GetCellObservationSize()
    {
        return DetectableTags == null ? 0 : DetectableTags.Length;
    }

    protected override bool IsDataNormalized()
    {
        return true;
    }

    protected override ProcessCollidersMethod GetProcessCollidersMethod()
    {
        return ProcessCollidersMethod.ProcessAllColliders;
    }

    /// Get object counts for each detectable tag detected in a cell.
    protected override void GetObjectData(GameObject detectedObject, int tagIndex, float[] dataBuffer)
    {
        dataBuffer[tagIndex] += 1.0f / maxRabbitNumInCell;
    }

}

As mentioned earlier, the information we detect is the number of rabbits on each team in each grid cell. Using the grid sensor generates a CNN structure, which takes longer to train than an MLP, so we recommend keeping the input to a minimum.

Planner in action

Now let us take a look at the planner in action. In the following videos, Team A, in blue, has rabbits controlled by the planner, while Team B, in orange, does not have a planner, and the three roles are assigned only at the beginning of the game.

Figure 7. Planner in action (left: switching rabbits to attackers when the chance arises, right: penetrating with attackers and defeating opponents with other roles)

In the video on the left, the blue team is invading the defender-only orange team. Initially, the blue team's roles are roughly evenly distributed, but as the planner realizes that the opponent is not going to attack, it gradually increases the number of attackers. Eventually, almost all of the rabbits are assigned the attacker role.

In the video on the right, there are more team members, 120 in total, with each team having 60 rabbits. This time, the orange team has all roles, including attackers, defenders, and wanderers. If you look closely, you can see that the planner temporarily increases the number of attackers, and after some time, the number decreases again. This is done to help the attackers penetrate deeper into the opponent's zone, because attackers are trained to go for the egg rather than focus on the opponent's rabbits. Afterward, the attacker rabbits in the enemy zone are reassigned as defenders or wanderers, who then attack the opponent rabbits from behind. This is one of the strategies the planner employs.

In this way, you can see the planner and rabbit roles combining to create interesting strategies that change depending on the situation.

In part 3, we will explore how the game runs on mobile devices.
