The developers of mobile games strive to ensure that their content works well across a broad range of devices, from the latest high-end premium smartphones to mass-market or older devices. As the complexity of mobile game content increases, developers rely on good quality tools that can provide them with the insight they need to keep their frame rates stable and their power consumption down.
In this blog, I'm going to talk about how the new Arm Mobile Studio collection of tools can help with Android performance analysis, and how they can work together with game engines to produce an even more compelling performance analysis capability. Arm Mobile Studio’s Starter Edition is available for free, and the source code for the sample project used in this blog is available on GitHub.
Specifically, I'm going to focus on Streamline, the most general-purpose and most detailed performance analysis component of Arm Mobile Studio. I'll describe how it can be integrated with Unity to really show you how different aspects of your game make use of the Arm Cortex-A CPU or Arm Mali GPU resources on a mobile device.
Streamline collects sample-based and event-based performance data from a number of sources on an Android device and displays the aggregated results in several different views, with the Timeline view being the one that we're going to concentrate on in this blog. The top half of the screen shows collected system performance counters, and the lower half can show a variety of different types of information on the same timeline. Here, it shows the Heat Map, which indicates how computation activity is distributed across the threads of the profiled application:
We’re going to analyze some very simple Unity content - a flythrough of a procedurally generated terrain. The camera twists and turns, with new terrain tiles generated on-the-fly as they are needed. Tiles that get too far away from the camera are deleted, so the complexity of the scene remains roughly the same over time, once the scene is filled out. Sometimes the camera moves slowly, so the rate of new terrain generation is very slow, then it speeds up, so the rate of terrain generation needs to increase. The edges of each tile are rendered in a darker colour, so you can see the size of each tile:
Because the generation of a terrain tile is computationally intensive, we use the Unity Job Scheduler, which allows us to dispatch background threads that won't hold up the main Unity thread. This ensures that the user experiences a steady frame rate, rather than a jerky pause whenever new terrain is generated.
The demo is configured to run through four different scenes, which look identical but generate the terrain tiles differently. As the following diagram shows, the terrain is composed of many terrain blocks, each of which is a fixed size. Each block comprises several meshes (one for the green terrain, one for the yellow terrain and one for the water) which have a fixed resolution. The Render Distance controls the number of tiles around the player that will get generated.
The four scenes are configured as follows:
I’ll be performing the profiling activity on a Huawei P10 phone, which was released in 2017. It contains a HiSilicon Kirin 960 chip, which comprises four high-performance Arm Cortex-A73 CPU cores, four high-efficiency Arm Cortex-A53 CPU cores, and the Arm Mali-G71 MP8 GPU.
Unity itself contains a profiler, and it works just great on Android devices:
Unity's profiler does a great job of showing us when jobs are scheduled, but it doesn't show details of the platform's physical resources (CPUs and GPUs for the purposes of this blog) and how they are being used. We might be hitting our 60 FPS, but are we maxing out all of our CPU cores and burning battery to do so? This is where Streamline comes in. In the rest of this blog, we'll show you what data Streamline captures and presents, and how we can use Streamline's annotation features to pass some high-level context from the Unity game down into Streamline, making the data easier to interpret.
Before we talk about how annotations can be inserted into your Unity game, let’s see what the end result looks like for our example content when we’ve modified the game to make use of three of Streamline’s annotation features: Markers, Channels and Custom Activity Maps.
Once we have collected a profile (more on that later), it opens in Streamline and we can start to discover what’s going on.
When examining our content, the first thing that draws our attention is the row of Markers (in green at the top of the timeline), which here indicate where each frame begins:
We can see that the frame rate isn’t as regular as we’d like, and there is considerable bursty activity across all the CPU cores. The frame rate starts slow and then seems to pick up, with occasional pauses. That’s pretty consistent with what we’d expect from Terrain generation, but can we look any deeper?
The Timeline view in Streamline is divided into two parts – the top view shows the metric graphs, and the bottom half can show a variety of different things, including the Heat Map, which shows us how the work was distributed across the system, and allows us to filter the top timeline to show only the work attributed to specific processes or threads. By examining the Heat Map, and selecting first the UnityMain thread and then all of the Worker Thread threads, we can see how the CPU activity was split across the main Unity thread and the threads in the job scheduler:
CPU profile for the UnityMain thread, showing large bursts of activity on the Cortex-A73 CPUs.
CPU profile for all the Worker Thread threads, showing smaller bursts of activity across both the Cortex-A73 and Cortex-A53 CPUs.
Let's take a look at the main thread. If you look closely at the left-hand screenshot, you'll see an “A” marker next to the UnityMain thread that we’re examining. This means that Streamline Annotation Channels are present. We’ll zoom into the timeline a bit, and expand the UnityMain thread to see what’s going on:
The Scene and TerrainController rows are Streamline channels, generated by annotations placed in the game. The Scene channel shows us which scene is currently executing – we can see that this is the 20x20, 32x32 version with a render distance of 3 and 8 threads runnable in parallel.
The TerrainController channel is used to indicate when particularly interesting pieces of code are running on the main Unity thread. The blue blocks mark up the code that runs when a Terrain job completes. The green blocks mark up where new Terrains are scheduled for generation. We can see here that all the main thread activity is essentially due to the work that needs to be done when a job completes and the final mesh needs to be generated and inserted into the scene.
As well as focusing on particular threads, we can also constrain our analysis to particular periods of time. Streamline’s calipers allow us to mark up a particular time region for analysis – here, we have selected the start and end of the intense period of activity associated with Terrain completion (calipers are set at the top of the Timeline view):
If we flip now to the Call Paths view, we can get a fair idea of where time is being spent during the region of time selected by the calipers. Because we used the IL2CPP scripting backend for Unity, we get a lot more information than if we'd used the default Mono runtime. I'm not going to delve into the detail of what's going on here, but there's clearly a lot going on that warrants a deeper dive:
When we filter to show only the Worker Thread threads, there are no surprises here, given that we have asked for a maximum of eight jobs to run in parallel. In this screenshot, we’ve expanded the Cortex-A53 cluster so we can see the utilization of individual cores.
We see some green blocks in the TerrainController channel that indicate new Terrains being scheduled, then some intense activity across all cores, and then some blue activity in the TerrainController to process those Terrains in the main thread once they’ve been generated (we don’t see that main thread activity in the graphs because we don’t have the UnityMain thread selected).
It is interesting to compare this to the activity in the second scene, where the terrain tiles are of the same complexity, but we only allow one to be scheduled at a time:
There are a couple of things to note here:
We can also compare the profile with the third scene, which uses smaller tiles:
As you can see, the CPU activity is much less intensive and the blocks of blue completion work in the main thread are much shorter, resulting in a smoother frame rate relative to the first scene (but of course there are more jobs overall, so we have to make sure that Terrain generation still keeps up with the rate at which the camera flies over the terrain).
The fourth scene, with small tiles and only one Terrain generation running at a time, shows the smoothest frame rate overall. However, we have to be very careful to ensure that Terrain generation happens fast enough to keep up with the camera, and you’ll see in the original video that this isn’t always the case when the camera is moving fast over the fourth scene:
Finally, we can use a Custom Activity Map to get even more insight into how the worker threads are performing Terrain generation. Each Custom Activity Map appears as an option in the bottom-left menu that up until now we’ve been using to display the Heat Map:
When we select the Terrain Generation view, we’ll see a colored box for each Terrain generation activity, showing when it started and stopped, with a mouseover showing the world coordinates of that Terrain tile, when it was initiated and how long it took to complete. Also in this screenshot, we’re graphing the compute work that took place on the Mali GPU – as we’d expect, there is a steady increase in GPU activity as the terrain gets filled out. This screenshot was taken while focusing on the beginning of the first scene, where we are generating large tiles, up to 8 concurrently. The pauses while the main thread prepares all the new geometry are causing the GPU to be idle for long periods of time:
Moving to the fourth scene, where we are generating smaller tiles serially, we see a much smoother ramp in GPU activity, and we can clearly see that only one Terrain job was running at a time (and each job is shorter, due to the smaller tile size):
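This has been a quick walk-through of some of the additional insight that we can get in Streamline if we use annotations from the game itself to provide us with some more high-level context. We used Markers to show where each frame begins, Channels to show which scene is executing and which interesting pieces of code are running on the main thread, and a Custom Activity Map to show when each Terrain generation job started and stopped.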
That’s all very cool, so how does it work?
Let's take a deeper look into how Streamline works. When you analyze an Android application, a separate process (running as the same user as the application) called gator runs on the device, collecting profiling information from various hardware sources (such as Mali GPUs and Arm Cortex-A CPUs) and transmitting the aggregated stream of metrics back to your computer. Streamline annotations are a mechanism by which the application itself can insert its own markers and metrics into that stream.
Streamline annotations use a specific protocol, and an open-source C implementation is provided as part of Arm Mobile Studio. In order to make it easy to generate Streamline annotations from within Unity content, you need some C# wrappers around the C implementation. The wrappers used in this walkthrough, along with the required C implementation, are available as a Unity Asset Package. Download it and import it into your project as a custom Asset package. The package adds new methods in an Arm namespace that allow you to easily use Streamline annotations in your own project. API documentation can be found in the README.md file inside the package.
If you want to get the fastest and easiest-to-analyze Android builds out of Unity, there are some specific Android Player settings that you should configure:
Make sure that you are using IL2CPP as the Scripting Backend and set the C++ Compiler Configuration to Debug. This will not only compile your scripts to native code for better performance, but it also means that Streamline can see the debug information to map performance data back to your functions in the Call Path view.
Set the Target Architecture to ARM64 (ARMv7 is the default). Most mobile devices today are 64-bit, and you’ll get higher quality code-generation as a result.
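If you prefer to apply these Player settings from a script rather than through the Player Settings window, a minimal editor sketch might look like the following. The class name and menu path are hypothetical, the file is assumed to live in an Editor folder of your project, and the PlayerSettings overloads shown take a BuildTargetGroup (recent Unity versions also offer NamedBuildTarget equivalents, which you may prefer).

```csharp
#if UNITY_EDITOR
using UnityEditor;

// Hypothetical helper, not part of the Arm package: applies the Android
// Player settings described above from a menu item in the Unity editor.
public static class StreamlineProfilingSettings
{
    [MenuItem("Tools/Apply Streamline Profiling Settings")]
    public static void Apply()
    {
        // Compile scripts to native code through IL2CPP.
        PlayerSettings.SetScriptingBackend(
            BuildTargetGroup.Android, ScriptingImplementation.IL2CPP);

        // Keep debug information so that performance data can be mapped
        // back to functions in the Call Paths view.
        PlayerSettings.SetIl2CppCompilerConfiguration(
            BuildTargetGroup.Android, Il2CppCompilerConfiguration.Debug);

        // Build for 64-bit Arm only.
        PlayerSettings.Android.targetArchitectures = AndroidArchitecture.ARM64;
    }
}
#endif
```

Running the menu item once changes the project's Player settings, so the next Android build picks them up.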
Markers are the easiest annotation to use. The provided method takes a string and an optional color. For example, to emit the green per-frame markers, the following code was used in one of the GameObjects (if you're not familiar with Unity architecture, the Update() method is called automatically once per frame).
void Update ()
{
    Arm.Annotations.marker("Frame " + Time.frameCount, Color.green);
}
Using channels isn’t much harder. First, you have to create a channel, specifying its name. You can then log annotations into the channel using methods on the Channel object, for example:
channel = new Arm.Annotations.Channel("Scene");
Remember that annotations in channels will span a period of time. If you want to end the annotation before starting your next one, you can use the end() method. For example, the part of TerrainController that performs Terrain completion in the main thread is wrapped as follows:
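// Begin annotation (assumed call: check the package README for the exact
// annotate() signature; the label and color here are illustrative)
channel.annotate("Completing " + obj.name, Color.blue);
Mesh mesh = obj.GetComponent<MeshFilter>().mesh;
mesh.vertices = job.vertices.ToArray();
mesh.uv = job.uv.ToArray();
mesh.uv2 = job.uv2.ToArray();
// End annotation
channel.end();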
Custom Activity Maps (CAMs) can be thought of as just another layer on top of channels. You must first name the CAM, before creating tracks within it. You can then add annotations to those tracks, much as you would add them to channels.
In the example, the Terrain Generation CAM was created as follows:
terrainCAM = new Arm.Annotations.CustomActivityMap("Terrain Generation");
terrainTracks = new Arm.Annotations.CustomActivityMap.Track[16];
for (int i = 0; i < 16; i++)
    terrainTracks[i] = terrainCAM.createTrack("TerrainJob " + i);
However, there is one complication for our use case: when a job is running in the Unity Job System, it can’t interact with the rest of your game’s object model much at all (which helps to keep things thread-safe). All we can do in the job is remember the start and stop time, and then when the main thread cleans up the job, that's when we are able to register the job's activity in the CAM.
The C# wrappers provide a function that can be safely called from within jobs that returns the current time in the format that Streamline annotations need.
UInt64 startTime = Arm.Annotations.getTime();
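To make the pattern concrete, here is a minimal sketch of a job that records its own start and stop times for the main thread to read later. The TimedTerrainJob and TimedJobRunner names are hypothetical, and storing the timings in a NativeArray<ulong> is an assumption; the actual InfiniteTerrain job may hold its data differently. Only Arm.Annotations.getTime() comes from the package described above.

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical job: only the timing pattern matters here.
public struct TimedTerrainJob : IJob
{
    // Two entries: [0] = start time, [1] = stop time.
    public NativeArray<ulong> timings;

    public void Execute()
    {
        timings[0] = Arm.Annotations.getTime();
        // ... generate the terrain data for this tile here ...
        timings[1] = Arm.Annotations.getTime();
    }
}

// Hypothetical component that schedules one job and reads its timings.
public class TimedJobRunner : MonoBehaviour
{
    private NativeArray<ulong> timings;
    private JobHandle handle;
    private bool running;

    void Start()
    {
        // Persistent allocation, because the job may outlive several frames.
        timings = new NativeArray<ulong>(2, Allocator.Persistent);
        handle = new TimedTerrainJob { timings = timings }.Schedule();
        running = true;
    }

    void Update()
    {
        if (!running || !handle.IsCompleted) return;

        // Complete() must be called before the main thread reads the results.
        handle.Complete();
        running = false;
        Debug.Log("Job ran from " + timings[0] + " to " + timings[1]);
    }

    void OnDestroy()
    {
        // Make sure the job has finished before freeing its memory.
        handle.Complete();
        if (timings.IsCreated) timings.Dispose();
    }
}
```

The job itself never touches the Custom Activity Map; it only fills in the two timestamps, which keeps it safe to run on a worker thread.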
Once we’re back in the main thread, we pick a track to use (we manage them in a pool to ensure that there’s no overlap because that is helpful visually) and register the job onto that track. Here, job.timings is a two-entry array filled out by the job containing the start time and stop time of the job.
track.registerJob(obj.name, Color.grey, job.timings[0], job.timings[1]);
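The pool that picks a track is not shown in this blog, so the following is only one way it could be done, under the assumption that a track is free for a job if every job already drawn on it stopped at or before the new job's start time. TrackPool and Acquire are hypothetical names, not part of the Arm package.

```csharp
using System;

// Hypothetical helper: hands out track indices so that jobs drawn on the
// same track do not overlap in time.
public sealed class TrackPool
{
    // Latest stop time registered on each track so far.
    private readonly ulong[] lastStop;

    public TrackPool(int trackCount)
    {
        if (trackCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(trackCount));
        lastStop = new ulong[trackCount];
    }

    // Returns the index of the track on which to draw a job with these timings.
    public int Acquire(ulong startTime, ulong stopTime)
    {
        int chosen = -1;
        int earliest = 0;
        for (int i = 0; i < lastStop.Length; i++)
        {
            // First track whose previous jobs all stopped before this one started.
            if (chosen < 0 && lastStop[i] <= startTime) chosen = i;
            // Also remember the track that frees up first, as a fallback.
            if (lastStop[i] < lastStop[earliest]) earliest = i;
        }

        // If every track is busy, accept some overlap on the least-busy one.
        if (chosen < 0) chosen = earliest;

        lastStop[chosen] = Math.Max(lastStop[chosen], stopTime);
        return chosen;
    }
}
```

With a pool of 16 entries to match the 16 tracks created earlier, the main thread could pick a track with terrainTracks[pool.Acquire(start, stop)], passing the two job timings, before calling registerJob on it.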
And that’s it! I expect that we’ll refine this Unity Package over time to add more functionality; your feedback is always welcome!
There are a few steps that you need to go through in order to collect your first profile in Streamline, but once you are set up things are quite straightforward.
First, you need to download and install the free Windows, Mac or Linux version of Arm Mobile Studio Starter Edition.
Recall from the earlier description of Streamline's architecture that there are a few things you need to put in place:
Streamline provides you with a few ways of achieving this, but we have found that the simplest method that is robust across a range of devices is first to make sure that you know a few key pieces of information:
Once you know these pieces of information, the steps to perform analysis are as follows:
Once gator is running, you can install new versions of your application, start and stop Streamline and perform more analyses without having to restart gator.
To make the process easier, you can download the gatorme script, which configures and runs gator and adb for you. All you need to provide is the path to the gator binary that you want to run, the Package Name of your application and which Mali GPU you have in your device (this helps if gator can't figure out which GPU you have by probing the device). The script also performs several other steps to ensure that this method works well on the broadest range of mobile devices, and ensures that gator is shut down properly once you are finished with your profiling activity. (Yes, we will be folding the gatorme functionality directly into Streamline in the near future!)
The gatorme documentation explains the detail, but as a worked example, here's how the InfiniteTerrain content could be profiled, once the APK is installed on your device.
First, run gatorme from the command line:
$ ./gatorme.sh com.Arm.InfiniteTerrain G71 ./mobilestudio-macosx/streamline/bin/arm64/gatord
You can now launch Streamline and get ready to capture. There are a couple of settings that you want to make sure you get right:
Once you get this set up, repeated deploy/analyse/fix steps are easy - you can leave gatorme running while you shut down your application, Build-and-Run direct from Unity and capture more Streamline information.
I hope you've found this blog interesting and useful. All of the source code for the InfiniteTerrain example can be downloaded from GitHub (Apache 2.0 license). As well as all of the source code and graphics assets, there is also the ArmMobileStudio.unitypackage, which is the Unity custom asset package that you can import into your own projects to add Streamline Annotations to them. There is also a pre-built InfiniteTerrain.apk, which is a 64-bit Android development build, ready to deploy to your device if you just want to get started analysing some content quickly.
And finally, if you've got any questions about Arm Mobile Studio or Arm in graphics and gaming in general, please join us in the Graphics and Multimedia Forum or read more about our tools on the Arm Mobile Studio developer site below!
Arm Mobile Studio resources