World of Tanks (WoT) Blitz is a session-based tank shooter that has been continuously updated by MS-1 (Wargaming's oldest and largest mobile studio) for the seven years since its release. In 2014, the game launched on iOS and Android; a couple of years later it was released on desktop platforms (Windows and macOS), and in 2020, WoT Blitz made its debut on Nintendo Switch. Android and iOS remain our key platforms, and it should be noted that we strive to make Blitz scale performance-wise, from super-low-end mobile devices to the latest flagships.
Performance is critical for every game. Our development organization of almost 200 people, working in independent cross-disciplinary teams, ships around 10 major updates every year. To keep up with this tempo at this scale, we have to automate a lot of processes, and performance testing is no exception. This blog post explores how continuous integration (CI) testing helps us deliver an optimal player experience in WoT Blitz. We will also dig into the following topics:
If we put the game lobby aside and consider only the actual gameplay (the battle arena), performance testing covers the following metrics:
Automated tests are triggered nightly for mainline builds. QA engineers also trigger these tests when they are validating changesets containing one of the following:
We run the tests on a CI server where test devices play the role of build agents. Our test farm consists of more than 30 devices covering all supported platforms and the entire spectrum of performance levels.
We do our best to identify problems before they reach the mainline, but when one slips through, the nightly regression tests reveal it. Since we have plenty of content and lots of device models, we built a special dashboard that displays all the nightly test results. The following shows just a part of that dashboard (the full version has 13 columns and 30 rows) for a map loading time test.
With our development tempo and scale, it is essential to keep the mainline healthy: because we stick to the principle of early integration, a feature branch is always merged with the mainline when a build is assembled for testing, so a problem in the mainline may be mistaken for a feature branch problem. As a result, we have strict rules for fixing problems found in the mainline: the feature that caused a problem must be disabled or reverted within one working day.
Considering how much content we have, and how many platforms and performance levels we support, it is hard to imagine our development without automated performance tests:
As mentioned above, it is critical for us to solve mainline problems within a single working day. But since we have many independent, specialized teams, we first have to find out which one is responsible for a problem. Although the tests run nightly, at our development scale 10-15 non-trivial PRs are merged into the mainline every day, and it is not always possible to tell which one caused a problem simply by looking at the code. That is why we want the automated test reports to contain as much detail as possible about what exactly caused a metric to change.
Besides identifying which team is responsible for fixing a problem in the mainline, the extra data in the reports is very valuable for feature testing. Seeing detailed data in a report (for example, which function now takes longer to execute), a programmer can locate the problem without manually profiling the game. In the same way, content creators with access to this kind of data can see right away where to look for optimizations.
For example, for memory tests we use special builds with the Memory Profiler. It divides the entire memory pool into “categories” (there are almost 30 of them), so instead of a single changing value (the overall memory usage of the process), we see a much more detailed picture:
For a map loading time test, the report contains just a number (loading time):
But if there is a problem, you can find a JSON file with a profiler trace among the test artifacts:
To sort out an increase in map loading time, it is enough to download the JSON files from two test runs and compare them in chrome://tracing.
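For reference, chrome://tracing consumes the Chrome Trace Event Format; a minimal trace looks something like the sketch below (the stage names and timings are illustrative, not taken from a real run). `"ph": "X"` marks a complete event, and `ts`/`dur` are in microseconds, so loading two such files side by side makes regressions in individual loading stages easy to spot.

```json
{
  "traceEvents": [
    {"name": "LoadMap",       "ph": "X", "ts": 0,       "dur": 5200000, "pid": 1, "tid": 1},
    {"name": "LoadTextures",  "ph": "X", "ts": 120000,  "dur": 2100000, "pid": 1, "tid": 1},
    {"name": "BuildGeometry", "ph": "X", "ts": 2300000, "dur": 1400000, "pid": 1, "tid": 1}
  ]
}
```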
For FPS tests, we spent quite some time thinking about a convenient way to present the details in reports. But before showing what came out of that, let us say a bit about our general approach to this type of test.
For a long time, the only result we looked at in reports was the average FPS value. Among the test run artifacts you could also find a detailed chart of FPS over the course of the replay, where each point represented the average FPS over a one-second interval. The latter was a mistake: single long frames went under the radar. When we realized this, we decided not merely to measure the duration of every frame, but to go further and switch on our internal CPU profiler during the tests. We had to limit the number of counters per frame so that the profiler itself would not affect frame times on low-end devices. For that, we added tagging: every counter in the code is tagged, and we can switch tag groups on or off. By default, no more than 70 counters per frame are active when an automated test is launched.
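As a sketch of what tagged counters might look like (all names here are assumptions for illustration, not our actual API):

```cpp
#include <cstdint>

// Tag groups; a test launch enables only the groups it needs,
// keeping the per-frame counter count low on weak devices.
enum ProfilerTag : uint32_t {
    TAG_CORE      = 1u << 0,
    TAG_RENDERING = 1u << 1,
    TAG_UI        = 1u << 2,
};

// Configured when the test is launched.
static uint32_t g_enabledTags = TAG_CORE | TAG_RENDERING;

// RAII counter: records begin/end timestamps only if its tag is enabled.
class ScopedCounter {
public:
    ScopedCounter(const char* name, uint32_t tag)
        : active_((g_enabledTags & tag) != 0) {
        if (active_) { /* record a "begin" event with a timestamp */ }
        (void)name;
    }
    ~ScopedCounter() {
        if (active_) { /* record the matching "end" event */ }
    }
private:
    bool active_;
};

#define PROF_CONCAT2(a, b) a##b
#define PROF_CONCAT(a, b) PROF_CONCAT2(a, b)
#define PROFILE_SCOPE(name, tag) \
    ScopedCounter PROF_CONCAT(profScope, __LINE__)(name, tag)

void UpdateScene() {
    PROFILE_SCOPE("Scene::Update", TAG_CORE);
    // ... actual work ...
}
```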
Turning the profiler on in automated tests opened up a new opportunity: setting budgets for individual functions' execution time. This lets us catch spikes in the performance of particular systems. Budgeting also makes it easier to make decisions about introducing new systems into the game. For example, when we decided several years ago to add automatic layout of UI controls in battle, we simply made sure it did not affect the FPS values in our tests. After we introduced the profiler, we were quite surprised by how much frame time automatic layout sometimes consumed. The right approach would have been to define a budget before adding the new system (for example, “layout must take no longer than 1 ms per frame on a Samsung Galaxy S8”) and then enforce it with automated tests, so that changes in code or content could not push it over budget.
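A budget can be expressed as simply as a per-device table that the test analysis checks against; a hypothetical sketch (function names and values are illustrative):

```cpp
#include <cstring>

// Per-device budgets, in milliseconds per frame.
struct Budget {
    const char* function;
    double maxMs;
};

// Budgets for a Samsung Galaxy S8-class device.
static const Budget kBudgets[] = {
    {"UI::Layout",    1.0},  // the layout budget from the example above
    {"Scene::Update", 6.0},
};

// Returns the budget for a function, or a negative value if none is set.
double BudgetFor(const char* function) {
    for (const Budget& b : kBudgets)
        if (std::strcmp(b.function, function) == 0)
            return b.maxMs;
    return -1.0;
}
```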
At the end of a replay, we analyze the profiler trace and put into the artifacts a JSON file containing only the frames in which some function exceeded its budget. In the following example, several dozen frames are listed; in each of them, the Scene::Update function took longer than it should have.
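The filtering step itself can be sketched as follows, under the same assumptions as the budget table above (types and names are illustrative):

```cpp
#include <string>
#include <vector>

// One profiler counter sample within a frame.
struct Counter {
    std::string function;
    double ms;
};

struct Frame {
    int index;
    std::vector<Counter> counters;
};

double BudgetFor(const char* function);  // see the budget table sketch above

// Keep only frames in which at least one function exceeded its budget;
// the result is what gets serialized into the artifact JSON.
std::vector<Frame> FramesOverBudget(const std::vector<Frame>& frames) {
    std::vector<Frame> violating;
    for (const Frame& frame : frames) {
        for (const Counter& c : frame.counters) {
            double budget = BudgetFor(c.function.c_str());
            if (budget >= 0.0 && c.ms > budget) {
                violating.push_back(frame);
                break;
            }
        }
    }
    return violating;
}
```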
Spikes are relatively simple to handle, but what representation makes it easy to notice small increases in the execution time of a function? Comparing two four-minute traces did not look like an acceptable option, so we decided to start with some statistics. For a small set of the most interesting functions, we count the number of frames in which the function's execution time fell into each range, producing a frequency distribution:
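Such a distribution can be computed with simple bucketing; a minimal sketch with 4 ms wide buckets (matching the 24-28 ms range discussed below):

```cpp
#include <map>
#include <vector>

// Count frames per 4 ms execution-time bucket: [0,4), [4,8), ...
// The key is the bucket's lower bound in milliseconds.
std::map<int, int> FrequencyDistribution(const std::vector<double>& frameTimesMs) {
    std::map<int, int> histogram;
    for (double ms : frameTimesMs)
        ++histogram[static_cast<int>(ms / 4.0) * 4];
    return histogram;
}
```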
Next, we send this data to BigQuery and visualize it using Data Studio:
On the chart above, you can see that the number of frames in which Engine::OnFrame took 24-28 ms to execute grew significantly on September 27. To find the cause, we can follow a simple sequence:
If we reach the lowest level and find the ‘guilty’ function, all is well. But what if execution time has grown more or less evenly for every function? There may be several reasons for that:
Though introducing an instrumenting CPU profiler into the tests did not eliminate the need for a sampling profiler, it reduced the number of cases where the latter is necessary. One such case was described above; another is micro-optimization, when you need to see the execution time of individual instructions in the assembly code.
Let us get back to our “September 27” problem. Applying the sequence of steps listed above, we found that the rhi::DevicePresent function was to blame. This function just swaps buffers, so the bottleneck must be on the GPU side. But how can we learn from the reports what made the GPU take longer to process frames?
An attentive reader may notice that one of the values in the filter on the previous Data Studio screenshot does not look like a function name: “DrawCallCount”. We have more metrics like that: “PrimitiveCount”, “TriangleCount”, “VertexCount”. Alas, this is the only data we can gather on the CPU side to find out what caused a slowdown on the GPU side. But to solve the “September 27” problem, it was sufficient:
It turns out we had started to draw 150k more primitives in many frames. By matching this data against the list of PRs merged between the two test runs (the chart shows they were three days apart, but two of those were weekend days with no tests), we identified the ‘guilty’ changeset promptly.
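For context, counters like DrawCallCount and PrimitiveCount can be accumulated at the RHI level as draw calls are submitted; a hypothetical sketch (names are assumptions, not our actual engine code):

```cpp
#include <cstdint>

// Per-frame rendering statistics, reset at the start of each frame
// and attached to the profiler trace at the end of it.
struct FrameStats {
    uint64_t drawCallCount  = 0;
    uint64_t primitiveCount = 0;
    uint64_t vertexCount    = 0;
};

static FrameStats g_frameStats;

// Every draw call passes through the RHI, so this is a natural place
// to count what the CPU asked the GPU to do.
void DrawIndexedPrimitives(uint32_t primitives, uint32_t vertices) {
    ++g_frameStats.drawCallCount;
    g_frameStats.primitiveCount += primitives;
    g_frameStats.vertexCount    += vertices;
    // ... submit the draw call to the underlying graphics API ...
}
```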
But what if the root of the problem were a shader getting more complicated? Or a change in the draw order and, as a consequence, increased overdraw? Or the depth buffer accidentally being written back to main memory, hurting bandwidth?
These factors are hard (in the case of overdraw) or impossible (in the cases of shader complexity and bandwidth) to track from within the game itself. This is where Arm Mobile Studio Pro proves invaluable.
Arm Mobile Studio Pro allows us to record hardware counters for the Mali family GPUs while performing an autotest on a CI server, helping us solve three types of tasks:
It is very important for us that Arm Mobile Studio features a tool called Performance Advisor that can present a profiler trace recorded during a test as an easily readable HTML report.
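For reference, such a report is generated from a Streamline capture with the pa command-line tool; exact flags depend on the Mobile Studio version, so treat this invocation as illustrative and the capture name as hypothetical:

```
# Generate an HTML report from a capture produced during the test run.
pa blitz_test_run.apc
```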
Let us see how Arm Mobile Studio (and Performance Advisor in particular) speeds up solving a type 3 task.
Recently, we added decals to the engine, and our tech artist faced the task of defining a budget for the overlap between arena geometry and decals. In essence, the task was:
Performance Advisor (PA) greatly simplifies stage 4. Suppose at stage 1 we chose the Samsung Galaxy A50 with a Mali-G72 MP3 GPU as our target device. Here is how the first part of the PA report looks for a test run on this device (high graphics quality settings, but no decals yet):
The main observations:
We launch the test for a build with the decals, find the HTML report generated by PA among the artifacts, and open it:
Well... on this device, and with the number of decals our tech artist added for the first iteration, the share of fragment-bound frames has increased significantly. Onward to the next iteration.
PA will not directly show you the bottleneck in every case. For example, on powerful devices we often see “unknown” boundness prevailing:
In this situation, the complete capture files help; you can download them from the test run artifacts and analyze them in Streamline. Luckily, the performance counters are described in detail in the documentation, and Arm engineers are always ready to offer a consultation.
It should be noted that Arm Mobile Studio Professional Edition fit into our CI workflow easily and promptly. Besides the convenient PA reports and full profiler traces, we also receive a JSON file with average values and percentiles for all the metrics plotted in the PA reports. Soon, we plan to send that data to BigQuery and visualize it using Data Studio, as we already do for our CPU profiler data. The metrics themselves are described here.
While Arm Mobile Studio lets us obtain extra data only for Mali family GPUs, they are the most widespread GPUs among our player base. Moreover, many of the observations we draw from tests on devices with Mali GPUs hold for other devices as well (for example, if bandwidth has grown on Mali, it is likely to have grown on other GPUs, and probably even on other platforms).
So, kudos to Arm: your tools allowed us to solve the problem of missing GPU performance data in our automated test reports.
Here at MS-1 (Wargaming), we are passionate about automating routine work. Performance testing has been automated for a long time, but there was still room for improvement.
The significant effort we recently invested in the verbosity of test reports gave us an automation framework that saves time on both testing and problem investigation. Now we can identify many problems in our code and content while skipping the tedious phase of manual profiling setup. Collecting profiling data in an automated test environment not only makes the data more accessible; it also increases its quality:
And the quality of data dramatically affects the speed of analysis.
We think that automated performance testing is a must for any mid-sized or large game studio. It is not just about saving time for your people, it is also about the quality of testing and risk management: with nightly tests, you do not need to worry about performance issues popping up during the final playtest and affecting your release schedule.