During the Game Developers Conference (GDC) in March 2023, we showcased our multi-agent demo, Candy Clash, a mobile game containing 100 intelligent agents. In the demo, the agents are built with Unity's ML-Agents Toolkit, which allows us to train them using reinforcement learning (RL). To find out more about the demo and its development, see our previous blog series. Previously, the agents used a simple Multi-Layer Perceptron (MLP) Neural Network (NN) model. This blog explores the impact of using other types of neural network models on the gaming experience and performance.
The Game Developers Conference was held near Easter, which inspired the setup for the Candy Clash demo. In Candy Clash, there are two teams of rabbits, each with an egg to protect. The aim is for the rabbits to attack the opposing team's egg while defending their own.
There are three rabbit roles: Attacker, Defender, and Wanderer. There is also a Planner agent, which dynamically assigns roles to the rabbits during play.
Figure 1: Different agent roles in the Candy Clash demo
The rabbits' behaviors are created by training the rabbits with different reward functions. Each behavior is governed by a policy, which is effectively the rabbit's "brain". The policy takes an observation as input and produces an action as output, and it is trained to output the action that maximizes the reward for any given observation. For more details, see the ML-Agents official documentation. A Neural Network (NN) is used to model the policy, and the same network architecture is shared by all of the rabbits. However, the policy modelled by the NN differs after training, because the weights are adjusted towards the optimal policy for each role's reward.
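As a rough illustration of what a policy is, the sketch below shows a minimal MLP that maps an observation vector to action scores, written in PyTorch. The observation and action sizes are placeholders rather than the demo's actual values, and this is not the exact network that ML-Agents builds internally.

# Illustrative only: a minimal MLP policy mapping an observation vector to
# action scores. Sizes are placeholders, not the demo's actual values.
import torch
import torch.nn as nn

class MLPPolicy(nn.Module):
    def __init__(self, obs_size=32, action_size=5, hidden_units=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, action_size),
        )

    def forward(self, obs):
        # Returns unnormalized action preferences; training adjusts the
        # weights so the chosen actions maximize the expected reward.
        return self.net(obs)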
The original demo used a simple MLP network. This is one of many networks that can be used to model the policy. Unity's ML-Agents Toolkit allows the user to make changes to the baseline MLP network by editing the settings in the YAML configuration file; see Unity's official documentation. For example, you can add "memory" to the agent. In NN terms, this means adding a recurrent layer, which in ML-Agents is a single Long Short-Term Memory (LSTM) layer. This network is more complex than the simple MLP and enables the agent to learn which observations to remember and which to forget. To set up a network with an LSTM layer, the following changes were made to the configuration file.
network_settings:
  normalize: false
  hidden_units: 64
  num_layers: 2
Code 1: Network configuration for the MLP Network
network_settings:
  normalize: false
  hidden_units: 64
  num_layers: 1
  memory:
    sequence_length: 32
    memory_size: 32
Code 2: Network configuration for the LSTM Network
Figure 2 shows the NN structures produced by the two configurations and the resulting ONNX models after training. The differences are highlighted by the orange and blue boxes: the blue boxes mark the LSTM layer and its recurrent inputs and outputs. The LSTM model is clearly the more complex of the two.
Figure 2: MLP NN structure and LSTM NN structure
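To make the structural difference concrete, here is a rough PyTorch analogue of the LSTM-based policy, with the recurrent state passed in and returned each step, much like the recurrent inputs and outputs visible in the exported ONNX model. It is a sketch only: the sizes loosely mirror the configuration above, but ML-Agents' own implementation differs in detail.

# Illustrative only: a policy with a single LSTM layer. The recurrent state
# is fed in and returned each step, letting the agent keep or discard
# information over time. Sizes are placeholders, not ML-Agents' exact layout.
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    def __init__(self, obs_size=32, action_size=5, hidden_units=64, memory_size=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_size, hidden_units), nn.ReLU())
        self.lstm = nn.LSTM(hidden_units, memory_size, batch_first=True)
        self.head = nn.Linear(memory_size, action_size)

    def forward(self, obs, memory=None):
        # obs: [batch, seq_len, obs_size]; memory: optional (h, c) recurrent state.
        x = self.encoder(obs)
        x, memory = self.lstm(x, memory)
        # The updated memory is returned so it can be fed back on the next step.
        return self.head(x), memory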
To investigate the effect of changing the model on the mobile gaming experience, we trained new Wanderer agents using the LSTM-based model for 50 million steps, the same number of steps used to train the original MLP model. In this trial, we only retrained the Wanderer rabbits, because the gaming experience improvements are easier to observe in Wanderer vs. Wanderer games, and any performance improvements are similar irrespective of rabbit type. To explore the impact of the different NN models on the gaming experience, we looked at the relative "intelligence" of the agents and the game's frame rate performance.
Both network models were trained using self-play. To learn more about the training setup, see the previous blog series about the Boss Battle demo. A consequence of training the models using self-play is that there is no readily available metric to compare the models' "intelligence". Consequently, we decided to pit the Wanderer rabbits with the two different NN models against each other. See Figure 3:
Figure 3: Wanderers with the LSTM NN (blue) vs. Wanderers with the MLP NN (orange)
There are only Wanderer rabbits on both teams. The aim of the Wanderer rabbits is to defeat all their opponents, so the Wanderer's reward function encourages attacking and defeating opponents. The orange rabbits use the old MLP model and the blue rabbits use the new LSTM model, and there are noticeable differences in strategy. Both teams employ a form of wave attack, but the LSTM team stays in groups and plays more defensively. The rabbits with the LSTM-based model also appear to win more consistently than the MLP-based rabbits; see Table 1 for the result after 11 games. This suggests that they have learnt a better strategy, at least against the simpler MLP-based agents, and are arguably "cleverer".
Table 1: Number of wins for LSTM model-based rabbit team against MLP model-based rabbits.
Note: Games were played with only Wanderer rabbits.
The other aspect we measured is inference performance, which influences the frame rate. The game ran on a Google Pixel 7 Pro with inference on the CPU only; to understand the reasoning behind this, see our previous blog. Comparing inference time between the two models, there is a noticeable increase in computation time for the LSTM-based model: the DecideAction call is approximately 1.6 times slower, see Figure 4.
Figure 4: Inference time comparison between the LSTM-based model and the MLP-based model.
The DecideAction function call is the point in Unity's ML-Agents where the actual model computation runs, so this is a relatively significant increase in computation time. However, the Frames Per Second (FPS) profiling runs indicated that the game can still run close to 60 FPS with 100 Wanderer agents using the LSTM-based NN, see Figure 5.
Figure 5: FPS comparison between MLP and LSTM models for the Wanderer rabbits
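For readers who want a rough off-device comparison of the two exported policies, the sketch below times them with onnxruntime on a desktop CPU. The file names are placeholders, and the figures above were measured inside Unity on the Pixel 7 Pro, so this only gives a sense of the relative cost of the two models rather than reproducing those numbers.

# Hedged sketch: time two exported ML-Agents .onnx policies with onnxruntime.
# Model file names are placeholders; inputs are filled with zeros purely to
# exercise the network, not to reproduce real gameplay observations.
import time
import numpy as np
import onnxruntime as ort

def average_inference_ms(model_path, runs=1000):
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    feeds = {}
    for inp in session.get_inputs():
        # Replace dynamic/unknown dimensions (e.g. batch size) with 1.
        shape = [d if isinstance(d, int) else 1 for d in inp.shape]
        dtype = np.float32 if "float" in inp.type else np.int32
        feeds[inp.name] = np.zeros(shape, dtype=dtype)
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, feeds)
    return (time.perf_counter() - start) * 1000.0 / runs

print("MLP :", average_inference_ms("WandererMLP.onnx"), "ms")
print("LSTM:", average_inference_ms("WandererLSTM.onnx"), "ms")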
The results show that the non-player character (NPC) behavior is improved by changing the NN used, which leads to a more engaging gaming experience. The trade-off is increased inference time, which could reduce game performance; however, no drop in FPS was seen in our results.
Using different NN models for the policy when training multi-agent systems results in different, and perhaps "cleverer", behaviors. This may lead to a more engaging gaming experience. However, the trade-off is that, as models become more complex, the model size and inference time increase, which affects the game's performance. This potential drop in performance might be offset by reducing the number of intelligent NPCs in the game, which also reduces the total inference time. Finding the trade-off that is optimal for a particular gaming scenario is therefore difficult. However, as mobile devices become more efficient at processing data, and new NN architectures become available, there are more opportunities to explore different models to deliver the most engaging gaming experience.