In part 1, we provided a general overview of our Candy Clash demo. Part 2 looks at how AI agents are designed in more depth.
In the demo, all three types of rabbit agents look identical and share the same inputs and outputs. Inputs are given as raycasts and vectors. You can think of the raycasts as lasers that sense the rabbit's surroundings; they are used to detect walls, eggs, and the distance and angle to teammates and opponents. Each agent casts 17 rays, and egg positions are also provided as vectors, giving 244 input data points in total. These inputs are fed into a neural network (NN) model, which outputs the agent's movement and attack actions. The model is a simple Multi-Layer Perceptron (MLP) with two hidden layers of 64 units each (the network_settings in the training config shown later). Although this model structure is common to all three types of rabbit agents, each type has a distinct policy, meaning they have different model weights.
Figure 1. Rabbit agent
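To make the input/output setup concrete, here is a minimal, illustrative sketch of how such an agent could be written with ML-Agents. The class name, field names, action layout, and values are assumptions for illustration, not the actual Candy Clash implementation; the 17 raycasts would typically come from a RayPerceptionSensorComponent3D configured on the prefab, so they do not appear in the script.

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;
using UnityEngine;

// Illustrative sketch only. Field names, the action layout, and all values are assumptions,
// not the actual Candy Clash implementation. The 17 raycasts would come from a
// RayPerceptionSensorComponent3D added to the prefab, so they are not collected here.
public class RabbitAgentSketch : Agent
{
    public Transform ownEgg;       // hypothetical reference to the team's egg
    public Transform opponentEgg;  // hypothetical reference to the opponent's egg
    public float moveSpeed = 2f;

    public override void CollectObservations(VectorSensor sensor)
    {
        // Vector observations: e.g., egg positions relative to the rabbit.
        sensor.AddObservation(transform.InverseTransformPoint(ownEgg.position));
        sensor.AddObservation(transform.InverseTransformPoint(opponentEgg.position));
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Example action layout: two continuous values for movement,
        // plus one discrete branch that triggers an attack.
        var move = new Vector3(actions.ContinuousActions[0], 0f, actions.ContinuousActions[1]);
        transform.position += moveSpeed * Time.deltaTime * move;

        if (actions.DiscreteActions[0] == 1)
        {
            // Attack(); // hypothetical attack routine
        }
    }
}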
The three types of rabbit agents are trained individually using self-play, a training method in which agents gradually improve by playing against each other. We train the attacker and defender agents by having them compete against each other, while wanderers compete against other wanderers. The following videos show training in action, with rabbits spawned at different positions and in different numbers each episode.
Figure 2. Training rabbits with self-play (left: attacker vs. defender, right: wanderer vs. wanderer)
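One concrete detail worth noting: ML-Agents pairs opposing agents during self-play by the team ID on their BehaviorParameters component. Below is a minimal sketch of how rabbits could be assigned to teams; the method and the way teams are chosen are assumptions, not code from the demo.

using Unity.MLAgents.Policies;
using UnityEngine;

// Illustrative sketch: the behavior name ("Attacker", "Defender", "Wanderer") is normally
// set on the prefab's BehaviorParameters component; only the team assignment is shown here,
// and this helper itself is an assumption, not code from the demo.
public class TeamAssignmentSketch : MonoBehaviour
{
    public void AssignTeam(GameObject rabbit, bool isTeamA)
    {
        var bp = rabbit.GetComponent<BehaviorParameters>();
        bp.TeamId = isTeamA ? 0 : 1;   // opposing sides use different team IDs for self-play
    }
}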
We design reward functions for each role to encourage the desired behavior.
Figure 3. Reward functions for three types of rabbit agents
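Figure 3 defines the actual rewards. As a rough illustration of how such role-specific rewards can be wired up with Agent.AddReward, here is a sketch; the event hooks and numeric values are assumptions, not the values used in the demo.

using Unity.MLAgents;
using UnityEngine;

// Illustrative sketch: the hooks and reward values below are assumptions used only to show
// how per-role rewards can be assigned with Agent.AddReward(); the real values are in Figure 3.
public class RoleRewardsSketch : MonoBehaviour
{
    // Hypothetical hook: an attacker damaged the opponent's egg.
    public void OnOpponentEggDamaged(Agent attacker) => attacker.AddReward(0.1f);

    // Hypothetical hook: the team's own egg took damage while this defender was alive.
    public void OnOwnEggDamaged(Agent defender) => defender.AddReward(-0.1f);

    // Hypothetical hook: a wanderer defeated an opposing rabbit.
    public void OnOpponentDefeated(Agent wanderer) => wanderer.AddReward(0.05f);

    // A small per-step penalty is a common way to encourage agents to act quickly.
    public void OnStep(Agent agent) => agent.AddReward(-0.0005f);
}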
Here is the pseudocode for training as a group. To train the rabbit characters as a group, we use SimpleMultiAgentGroup; registering the rabbit agents with the group lets ML-Agents recognize them as members of the same team. Since rabbits can be defeated and removed during an episode, the ResetRabbits function re-registers them with the group after each episode. To give rewards to the whole group, we use the AddGroupReward function.
Code 1. Agent group training (in C#)
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

public class RabbitSpawner : MonoBehaviour
{
    int rabbitCount = 50;
    SimpleMultiAgentGroup agentGroup;
    List<GameObject> rabbits = new List<GameObject>();

    public void Initialise()
    {
        agentGroup = new SimpleMultiAgentGroup();
        int cnt = 0;
        while (cnt < rabbitCount)
        {
            GameObject rabbit = SpawnRabbit(); // instantiates a rabbit prefab (defined elsewhere)
            rabbits.Add(rabbit);
            agentGroup.RegisterAgent(rabbit.GetComponent<AgentRabbit>());
            cnt++;
        }
    }

    // Will be called at the end of every episode
    public void ResetRabbits()
    {
        foreach (var rabbit in rabbits)
        {
            rabbit.Reset(); // Pseudo-code: reset position etc.
            agentGroup.RegisterAgent(rabbit.GetComponent<AgentRabbit>());
        }
    }

    public void Win() { agentGroup.AddGroupReward(+2); }
    public void Lose() { agentGroup.AddGroupReward(-1); }
    public void Timeout() { agentGroup.EndGroupEpisode(); }
}
The following is an excerpt from the training configuration file. When training as a group, set trainer_type to "poca".
Code 2. Training config example with 5 environments (in yaml)

behaviors:
  Attacker:
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 102400
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 64
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 40
    max_steps: 50000000
    checkpoint_interval: 200000
    time_horizon: 1000
    summary_freq: 10000
    self_play:
      save_steps: 250000
      team_change: 500000
      swap_steps: 100000
      window: 50
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
  Defender:
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 102400
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 64
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 40
    max_steps: 50000000
    checkpoint_interval: 200000
    time_horizon: 1000
    summary_freq: 10000
    self_play:
      save_steps: 250000
      team_change: 500000
      swap_steps: 100000
      window: 50
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
When training, you can use an executable build of the environment instead of running it in the Unity Editor. This has several benefits; one of the biggest is that training can be sped up by running the executable in parallel and in headless mode, without rendering. You set the level of parallelism by passing the num-envs parameter to the training command, and the ML-Agents GitHub documentation explains how to use an environment executable in detail. Note that running more environments in parallel does not automatically help: increasing the number of environments changes how the buffer used for training is filled, which can make training less effective. In such cases, increase the buffer_size parameter in the training config file. buffer_size controls how much experience is collected before the model is updated, and a common practice is to multiply it by num-envs (for example, a single-environment buffer_size of 20480 scaled by 5 environments gives the 102400 used in Code 2). After building the environment, you can parallelize training with the following command. In our setup, running 5 environments in parallel reduced training time by 51%.
Code 3. Training command example with executable (in shell)
mlagents-learn ./<trainConfig>.yaml --no-graphics --run-id=<runId> --env=<builtEnv>.exe --num-envs=5
As for tips to make training successful: start small and scale up gradually. Begin with simple settings and objectives, and once you have confirmed that the agents behave as expected, add more complex settings. It is also important to keep track of what works well and what does not, so you can make adjustments and improvements when necessary.
Cooperative behavior is a key aspect of multi-agent systems working toward common goals. When we trained each role as part of a multi-agent system, some interesting cooperative behaviors emerged within the groups. We would like to share a few examples here:
Figure 4. Cooperative behaviors (left: attackers, center: defenders, right: wanderers)
As you can see, they have developed fascinating cooperative behaviors as a group.
Next, let us talk about the planner agent. As explained earlier, the planner controls which role is assigned to each rabbit.
Figure 5. Planner agent
The planner agent uses a bird's-eye view of the field as input: the number of rabbits from each team in each cell of a 20 x 20 grid. This information is fed into an NN model, which outputs one of the three roles for each rabbit. Since the planner's input is grid-like data, the NN model uses a Convolutional Neural Network (CNN) structure. This enables dynamic strategy changes, adjusting the rabbits' roles according to the game situation.
Figure 6. Grid sensor
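To make the planner's output concrete, here is a minimal sketch of how the CNN's role decisions could be applied, assuming one discrete action branch (of size 3) per rabbit. The class, the Role enum, and the logging are illustrative assumptions; the grid observation itself is produced by the custom grid sensor component described next, so it does not appear in the script.

using System.Collections.Generic;
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

// Illustrative sketch: the Role enum and the "one discrete branch per rabbit" action layout
// are assumptions; the grid observation comes from the custom grid sensor attached to the planner.
public class PlannerAgentSketch : Agent
{
    public enum Role { Attacker = 0, Defender = 1, Wanderer = 2 }

    public List<GameObject> rabbits;   // the team's rabbits

    public override void OnActionReceived(ActionBuffers actions)
    {
        // One discrete branch of size 3 per rabbit: the chosen value selects that rabbit's role.
        for (int i = 0; i < rabbits.Count; i++)
        {
            var role = (Role)actions.DiscreteActions[i];
            Debug.Log($"Rabbit {i} assigned role {role}");
            // In the demo, the assigned role would switch which policy (model weights) the rabbit uses.
        }
    }
}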
To count how many rabbits are in each grid cell, we extended the GridSensor component and attached it to the planner. A detailed walkthrough of extending the grid sensor is available on GitHub. The following is the example code.
Code 4. Custom grid sensor (in C#)
using Unity.MLAgents.Sensors;
using UnityEngine;

public class CustomGridSensorComponent : GridSensorComponent
{
    protected override GridSensorBase[] GetGridSensors()
    {
        return new GridSensorBase[]
        {
            new CustomGridSensorBase(name, CellScale, GridSize, DetectableTags, CompressionType)
        };
    }
}

public class CustomGridSensorBase : GridSensorBase
{
    const int maxRabbitNum = 100;
    const int maxRabbitNumInCell = 12;

    /// Create a CustomGridSensorBase with the specified configuration.
    public CustomGridSensorBase(
        string name,
        Vector3 cellScale,
        Vector3Int gridSize,
        string[] detectableTags,
        SensorCompressionType compression
    ) : base(name, cellScale, gridSize, detectableTags, compression)
    {
    }

    protected override int GetCellObservationSize()
    {
        return DetectableTags == null ? 0 : DetectableTags.Length;
    }

    protected override bool IsDataNormalized()
    {
        return true;
    }

    protected override ProcessCollidersMethod GetProcessCollidersMethod()
    {
        return ProcessCollidersMethod.ProcessAllColliders;
    }

    /// Get object counts for each detectable tag detected in a cell.
    protected override void GetObjectData(GameObject detectedObject, int tagIndex, float[] dataBuffer)
    {
        // Accumulate a normalized count of detected rabbits per cell.
        dataBuffer[tagIndex] += 1.0f / maxRabbitNumInCell;
    }
}
As mentioned earlier, the information we are detecting is the number of rabbits on each team in each grid cell. Using a grid sensor gives the model a CNN structure, which takes longer to train than an MLP, so we recommend keeping the input size to a minimum.
Now let us take a look at the planner in action. In the following videos, Team A, in blue, has its rabbits controlled by the planner, while Team B, in orange, has no planner; its three roles are assigned only at the start of the game.
Figure 7. Planner in action (left: gradually switching roles to attackers, right: penetrating with attackers)
In the video on the left, the blue team is invading the defender-only orange team. Initially, the blue team's roles are roughly evenly distributed, but as the planner realizes that the opponent is not going to attack, it gradually increases the number of attackers. Eventually, almost all of the rabbits are assigned to attackers.
In the video on the right, there are more rabbits, 120 in total, with 60 per team. This time, the orange team has all three roles: attackers, defenders, and wanderers. If you look closely, you can see that the planner temporarily increases the number of attackers and, after some time, decreases it again. This helps the attackers penetrate deeper into the opponent's zone, because attackers are trained to go for the egg rather than to fight the opponent's rabbits. Afterward, the attacker rabbits in the enemy zone are reassigned as defenders or wanderers, which then attack the opposing rabbits from behind. This is one of the strategies the planner employs.
In this way, you can see the planner and rabbit roles combining to create interesting strategies that change depending on the situation.
In part 3, I will explore how the game runs on mobile devices.