In my last summer internship at Arm, I explored Reinforcement Learning and how Unity's ML-Agents Toolkit can help you develop intelligent Non-Playable Characters (NPCs). Following on from that project, a series of blogs (Parts 1, 2, and 3) explained how we developed the 'Dr Arm' game demo against a game AI implemented using Unity's ML-Agents Toolkit. In this blog, written as part of my internship this summer at Arm, we explore what happens when we increase the number of intelligent agents.
The mobile gaming market is now following the trend towards NPCs with smarter behaviors.
Typically, a game scene involves numerous NPCs, so it is valuable to assess how mobile devices perform as the number of intelligent NPCs increases. To pursue this aim, we first modify an existing environment, Dungeon Escape, which is provided with the Unity ML-Agents Toolkit, to enable the addition of more agents. This lets us create multi-agent workloads that we can deploy onto Android devices. Using several analysis tools, we can then observe how the CPUs of an Android device react to increasing multi-agent workloads.
In multi-agent systems, several agents seek to achieve a goal through the maximization of a group reward. Current MARL (Multi-Agent Reinforcement Learning) algorithms assume that the number of active agents remains constant throughout an episode. However, in many scenarios, an agent can become inactive (i.e., terminate) before its teammates.
Multi-Agent Posthumous Credit Assignment (MA-POCA) is a novel multi-agent trainer created by Unity Technologies that addresses this dilemma by propagating value from rewards earned by the remaining teammates back to terminated agents. It achieves this by training a centralized critic: a Neural Network (NN) that acts as a “coach” for the whole group of agents. The resulting multi-agent gameplay can be observed in Dungeon Escape!
Dungeon Escape presents a scenario in which agents are trapped in a dungeon with a dragon and must cooperate to escape. To escape, one agent must sacrifice itself (terminate) to slay the dragon, thereby forcing the dragon to drop the key. The remaining agents can pick up the key and use it to unlock the dungeon door. If the agents are not able to slay the dragon in a reasonable amount of time, the dragon will escape through the portal.
Figure 1. Original Dungeon Escape Environment.
To create multi-agent workloads of varying sizes (i.e., varying quantities of agents), several modifications were made to Dungeon Escape. The first and most significant modification introduced the ability to instantiate a user specified number of agents.
Figure 2. The ability to instantiate any number of agents.
The second modification accommodated the first by scaling the environment size to allow more agents to be instantiated. The last modification allowed the user to place more dragons, and therefore more keys, in the environment to increase game difficulty; however, this was not essential for our goal.
Figure 3. Modified Dungeon Escape Environment.
After training the environment with a default configuration of MA-POCA, we produce an NN model in ONNX format, the structure of which is shown in Figure 5. Given a single observation vector (i.e., a batch size of 1) as input (green box in figure), the Multi-Layer Perceptron (MLP; blue box) outputs an action vector of size 7. A single observation vector denotes the three Stacked Raycasts of an agent and whether that agent is holding a key. A Raycast casts rays to detect objects that have colliders in the scene.
Figure 4. Ray Sensor casting rays.
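For reference, MA-POCA is selected through the ML-Agents trainer configuration YAML. The snippet below is an illustrative sketch rather than the exact file used in this project; the behavior name and most hyperparameter values are assumptions, while the network settings match the two 256-neuron hidden layers of the exported model.

```yaml
behaviors:
  DungeonEscape:              # behavior name is an assumption
    trainer_type: poca        # use the MA-POCA trainer
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
    network_settings:
      hidden_units: 256       # two hidden layers of 256 neurons each
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 10000000
```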
Some further computation is performed to mask actions inside the action vector and finally output a single scalar value that refers to the discrete action that the agent should take. This MLP has two hidden layers, each with 256 neurons. Most of the operations occur in the GEMMs (General Matrix Multiplications). We will see how inference time through this structure increases with more agents.
Figure 5. Dungeon Escape NN Structure.
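To make this structure concrete, here is a minimal NumPy sketch of the forward pass. The observation size is an assumption (the real size depends on the raycast configuration), the weights are random stand-ins for the trained parameters, and the action-masking step is omitted:

```python
import numpy as np

OBS_SIZE = 105   # assumed observation size: three stacked raycasts plus a key flag
HIDDEN = 256     # two hidden layers of 256 neurons, as in the exported model
ACTIONS = 7      # action vector size output by the MLP

rng = np.random.default_rng(0)
# Randomly initialised weights stand in for the trained parameters.
w1, b1 = rng.standard_normal((OBS_SIZE, HIDDEN)), np.zeros(HIDDEN)
w2, b2 = rng.standard_normal((HIDDEN, HIDDEN)), np.zeros(HIDDEN)
w3, b3 = rng.standard_normal((HIDDEN, ACTIONS)), np.zeros(ACTIONS)

def forward(obs):
    # Each matrix multiply below is one of the GEMMs that dominates inference.
    h = np.maximum(obs @ w1 + b1, 0.0)   # hidden layer 1 + activation
    h = np.maximum(h @ w2 + b2, 0.0)     # hidden layer 2 + activation
    return h @ w3 + b3                   # action vector of size 7

obs = rng.standard_normal((1, OBS_SIZE))  # batch size of 1: a single agent
logits = forward(obs)
action = int(np.argmax(logits))           # masking omitted; pick the greedy action
print(logits.shape, action)
```

With a batch size of 1, each layer is a single vector-matrix GEMM, which is why the GEMMs account for most of the inference work.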
The Dungeon Escape environment was run on various Android devices. Using the Unity Profiler, we can visualize the many stages involved in processing one frame. Focusing on the FixedBehaviourUpdate portion of the frame, which as the name suggests updates the behavior of the agents, we observe several important stages: AgentSendState, DecideAction, and AgentAct. AgentSendState refers to the collection of the observation data; DecideAction refers to NN inference; and AgentAct refers to the agent taking the action. Unity can run NN models on both the GPU and the CPU, but an agent NN model like this one is usually best executed on the CPU, so I chose to use the CPU in this work.
Figure 6. Profiling timeline within a single frame of Dungeon Escape (click to zoom in).
Through the collection and analysis of many frames, we can produce the following bar chart, which shows an almost linear dependence between the total execution time of the three stages and the number of agents. For each agent quantity, ten frames were randomly sampled, and the execution times of the various stages across them were recorded and averaged. The correlation becomes even clearer if we experiment with adding more agents.
Figure 7. Bar Chart showing the correlation between execution times and agent quantities (2-8 Agents).
In the next chart, we experiment with larger agent quantities, ranging from 50 to 500 agents at intervals of 50. Here the correlation, especially between DecideAction (inference) time and the number of agents, is even more obvious.
Figure 8. Bar Chart showing the correlation between execution times and agent quantities (50-500 Agents).
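This near-linear scaling is expected: each agent runs its own batch-size-1 forward pass through the same MLP, so the total GEMM work grows in proportion to the agent count. A back-of-the-envelope sketch, assuming the same layer sizes as before and that the GEMMs dominate:

```python
# Rough FLOP count for one agent's forward pass through the exported MLP.
# OBS_SIZE is an assumption; the hidden and output sizes come from the model.
OBS_SIZE, HIDDEN, ACTIONS = 105, 256, 7

def gemm_flops(rows, cols):
    # One GEMM with a batch of 1: rows*cols multiply-adds = 2*rows*cols FLOPs.
    return 2 * rows * cols

PER_AGENT = (gemm_flops(OBS_SIZE, HIDDEN)    # input -> hidden layer 1
             + gemm_flops(HIDDEN, HIDDEN)    # hidden layer 1 -> hidden layer 2
             + gemm_flops(HIDDEN, ACTIONS))  # hidden layer 2 -> action vector

def total_flops(num_agents):
    # Each agent performs its own batch-size-1 inference, so cost is linear.
    return num_agents * PER_AGENT

print(total_flops(50), total_flops(500))  # 500 agents cost 10x the work of 50
```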
Using Arm’s Streamline performance analyzer, we can capture the activity of the CPUs while running the Dungeon Escape environment. Here we can see that only the medium cores, the Cortex-A710s, are active.
Figure 9. Streamline capture of an Android device.
We measured the mean temperatures of components while varying the number of agents. The plot shows that the middle CPU cores have the highest temperatures. This reflects what we saw previously in the Streamline capture, where the Cortex-A710 (the middle CPU) was the most active core cluster for Dungeon Escape.
The plot also shows that we can maintain an acceptable FPS with up to 50 agents. Past 50 agents, FPS deteriorates; however, this can be alleviated via several strategies. We could adopt a multi-threaded approach where the agents are grouped and different groups have their inference executed on different threads. Alternatively, we could interleave inference for different groups of agents across different frames. Moreover, since matrix multiplications make up most of the NN model's total operations, the Scalable Matrix Extension (SME) could be used to accelerate them.
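As a hypothetical sketch of the interleaving idea (not an ML-Agents API), the scheduler below splits agents into round-robin groups and runs inference for only one group per frame, so each frame pays roughly 1/groups of the total inference cost:

```python
def interleave_schedule(num_agents, groups):
    """Assign each agent to one of `groups` round-robin inference groups."""
    return [agent % groups for agent in range(num_agents)]

def agents_to_run(frame_index, assignment, groups):
    """Return the agents whose NN inference runs on this frame."""
    g = frame_index % groups
    return [a for a, grp in enumerate(assignment) if grp == g]

assignment = interleave_schedule(8, groups=4)
print(agents_to_run(0, assignment, 4))  # [0, 4]
print(agents_to_run(1, assignment, 4))  # [1, 5]
```

Each agent then acts on a decision that is up to groups - 1 frames old, trading a little responsiveness for a smoother frame time; ML-Agents' Decision Requester component exposes a similar trade-off through its decision period setting.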
Figure 10. Temperatures of Components vs Agent Quantity (and FPS).
In this project, we have explored the scalability of multi-agent reinforcement learning on Android devices and observed several interesting outcomes. First and foremost, inference time scales linearly with the number of agents, as seen in our bar charts. Second, ML-Agents appears to constrain inference to the middle CPU cores, as shown by our Streamline capture, in which the only active CPUs were the Cortex-A710s. This was reflected in our temperature chart, which showed the middle cores as the hottest component regardless of the number of agents. Lastly, as the agent quantity increases, the ML workload on the CPU also increases and, as shown by the impact on FPS, performance drops.
So, what are the potential implications of these results for running multi-agent workloads on Android devices? We have shown that the NN models produced by MA-POCA are simple and that the ML-Agents inference workload is predictable, as it scales linearly with the number of agents. Additionally, we have seen that game FPS is severely affected as the agent quantity climbs, and this issue could be exacerbated in more CPU-intensive environments. There are several workarounds, as mentioned before: we could multithread inference or interleave inference between frames. SME appears to be a better solution for larger ML workloads because most of the operations in the NN model come from matrix multiplications, which SME accelerates.
Writing additions and editing for this blog have been provided by Koki Mitsunami, Staff Engineer, Arm.
You can learn more about how to use ML Agents at a workshop presented by Ben Clark, Staff Engineer, Arm, at Arm DevSummit 2022, which takes place on 26th and 27th October.