I’m asked quite a lot about how I feel about benchmarks. When I sit down to write these blogs I usually go searching for a suitable quote, and for this one I found one that perfectly sums up my feelings.
This is from business leadership consultant Tom Peters:
"... I hate Benchmarking! Benchmarking is Stupid!”
Yep, I’m with Tom on this one, but we may need to qualify that a bit more… back to Tom:
“Why is it stupid?”
“Because we pick the current industry leader and then we launch a five year program, the goal of which is to be as good as whoever was best five years ago, five years from now.”
While this statement was originally aimed at business leadership and strategy, it is equally true of any type of performance benchmarking.
I’ve spent the last three years directly involved in, and most of my 20-year career indirectly involved in, the mire that is synthetic benchmarking of GPUs. Everything I've seen leads me to the conclusion that GPU benchmarks are a reinforcement of the above statement. They do nothing but focus attention on narrow subsections of performance while purporting to tell you about the holistic performance of a GPU.
It seems logical to say that, in order to provide valuable input to an end consumer’s purchasing decision, it is better for GPU benchmarks to reflect real-world use cases. Understanding how readily a GPU delivers the graphics of a user’s favorite game, and how long it can be played at a suitable FPS, would be useful information for consumers and OEMs alike. However, is this really the data that popular benchmarks deliver at the moment?
Desktop GPU benchmarking went through a similar evolution to the one that mobile GPUs are currently undergoing. In its earliest days it consisted of extremely theoretical and somewhat woolly comparisons of architectural triangles/second and pixels/second rates. This later developed into actual applications that purportedly measured tri/s and pix/s, before arbitrary spinning objects (spinning tori/donuts, Utah Teapots and Venus de Milos) entered the scene, which led to the stage the mobile GPU benchmarking scene is at currently: benchmarks consisting of synthetic game scenes designed specifically to test a GPU’s maximum compute capacity. The next development, and where the PC market currently stands, is the comparison of metrics garnered by running actual content - real games - and assessing each GPU’s merits based on that. Well, there’s a novel concept! Actually using the content that people are running and care about? Shocker!
Before we go any further, I feel an explanation is needed as to why current benchmarks are not the best representation of GPU performance. Current popular benchmarks claim to stress-test GPUs to discover the maximum number of frames they can deliver in a certain time period. In many ways this seems reasonable – all benchmarking really requires in order to be effective is a single figure derived from a test that is the same for all contenders, and the maximum compute performance of a GPU fits into this category.
However, there are a number of issues with the way GPU benchmarks do this at the moment. Informing consumers that a device is capable of delivering 300+ frames of particular content in a fixed time period may be a useful metric in certain circumstances, but it is not when no content the consumer would normally use on their device exercises the GPU in the way current benchmarks do.
To the consumer, the figure delivered by benchmarks is completely arbitrary and does not correspond to any experience they might have of the device. It would be easily possible to deliver exactly the same visual experience that the benchmarks present at much higher frame rates or, more appropriately for embedded devices, at a fraction of the energy cost and computing resources, if the benchmarks were coded in a more balanced way.
Surely, when the quality of graphics is the same between a benchmark and a popular game, it is better for a consumer to know how well the GPU delivers content that uses regular techniques and balanced workloads rather than an irregularly coded benchmark?
Later we'll look at my "Tao of GPU benchmarks" and discuss what guidelines a benchmark should follow, but first let's take a look under the hood of popular content and the benchmarks that are supposed to mirror it.
As an internal project, ARM has been running in excess of 1M frames of real content from top OpenGL® ES-enabled games on the app stores, including titles such as Angry Birds, Asphalt 7 and Temple Run. We analyse multiple performance areas including CPU load, frames per second, uArch data and a tonne of GPU-agnostic API usage and render-flow composition data.
When you look at some examples of the data we gather in this sort of analysis, the results are quite striking. Looking at, say, the imagery in Asphalt 7 and T-Rex HD on the same ARM® Mali™-based 1080p device, you'd see that they appear to show similar levels of graphical user experience. This would lead a user to believe that they are constructed from a broadly similar level of workload. When we compare a selection of popular benchmarks and a selection of popular games, we see the following:
| 1080p | Benchmark A | Benchmark B | Benchmark C | Asphalt 7 | NFS Most Wanted | Spiderman |
|---|---|---|---|---|---|---|
| Avg. Vert./Frame | 11K | 760K | 830K | 200K | 27K | 40K |
| Avg. Tris./Frame | 12.5K | 460K | 780K | 140K | 18K | 26K |
| Avg. Frags./Frame | 3.6M | 6.2M | 10M | 8.0M | 6.8M | 8.1M |
| Avg. Vert. FLOPS/Frame | 1.3M | 53M | 99M | 11.5M | 3.3M | 5.1M |
| Avg. Frag. FLOPS/Frame | 80M | 148M | 490M | 165M | 116M | 258M |
The first and most striking observation is that whilst the fragment counts for the benchmarks are similar to those of popular games, the vertex counts of Benchmarks B and C go through the roof! And in fact, when we look more closely at Benchmark C, the use of vertices is in no way efficient.
The global average for primitive to fragment ratio in this benchmark at 1080p is 1:13.1 which is close to (but just the right side of) our low watermark of 1:10 which we defined in the “Better Living Through (Appropriate) Geometry” blog, compared to a ratio of 1:53 in Asphalt 7. However, examining the content draw call by draw call, 50% of Benchmark C draw calls have a ratio of less than 1:1 primitive to fragment and an additional 24% have a ratio of less than 1:10 - against a recommended guideline of more than 1:10! The same is true for Benchmark B where 66% of the draw calls are producing micropolygons.
Real games are more balanced and consistent, with fewer micro-triangles and the majority of draw calls handling more than ten fragments per triangle.
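To make the draw-call-level analysis concrete, here is a minimal sketch of how draw calls could be bucketed by their triangle-to-fragment ratio. The per-draw-call counters are hypothetical example values rather than data from any of the benchmarks or games above; in practice they would come from an API trace or a GPU profiling tool.

```python
# Sketch: bucket draw calls by their triangle-to-fragment ratio.
# The per-draw-call counters below are hypothetical example values;
# in practice they would come from a GPU profiler or API trace.

draw_calls = [
    {"triangles": 12000, "fragments": 9000},    # micropolygons: < 1 fragment/triangle
    {"triangles": 4000,  "fragments": 30000},   # between 1:1 and 1:10
    {"triangles": 800,   "fragments": 45000},   # healthy: > 10 fragments/triangle
]

buckets = {"below 1:1": 0, "between 1:1 and 1:10": 0, "at least 1:10": 0}

for dc in draw_calls:
    ratio = dc["fragments"] / dc["triangles"]   # fragments shaded per triangle
    if ratio < 1:
        buckets["below 1:1"] += 1
    elif ratio < 10:
        buckets["between 1:1 and 1:10"] += 1
    else:
        buckets["at least 1:10"] += 1

total = len(draw_calls)
for name, count in buckets.items():
    print(f"{name}: {100 * count / total:.0f}% of draw calls")
```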
Benchmark providers admit that they use high vertex counts in order to stress GPUs, the justification being that it provides users with “realistic” feedback on how their GPU will respond to future content. However, as demonstrated, such stress testing is not realistic as it doesn’t accurately reflect the balance of fragment and geometry work used in applications that consumers run on a daily basis. While the fragment and vertex rates of the real games show variation, the ratios stay pretty consistent.
One of the major effects of the geometry imbalance shown above is that it ignores by far the most limiting factor in mobile device performance: bandwidth. It’s extremely easy to break the bandwidth limit in an instant with these high-cost/low-visual-yield micropolygons (as discussed in “PHENOMENAL COSMIC POWERS! Itty-bitty living space!”).
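As a rough illustration of the bandwidth cost of that geometry, the sketch below estimates per-frame vertex fetch traffic from the average vertex counts in the table above. The bytes-per-vertex figure, the target frame rate and the assumption that every vertex is fetched from memory exactly once are simplifying assumptions for illustration only, not measured values.

```python
# Rough, illustrative estimate of geometry fetch bandwidth per frame.
# bytes_per_vertex is an assumed figure (position + normal + one UV set,
# all float32); real content varies widely.
bytes_per_vertex = (3 + 3 + 2) * 4   # 32 bytes

avg_vertices_per_frame = {
    "Benchmark C": 830_000,   # average vertex counts from the table above
    "Asphalt 7":   200_000,
}

fps = 30  # assumed frame rate for the comparison

for name, verts in avg_vertices_per_frame.items():
    mb_per_frame = verts * bytes_per_vertex / 1e6
    gb_per_second = mb_per_frame * fps / 1e3
    print(f"{name}: ~{mb_per_frame:.1f} MB/frame of vertex fetch, "
          f"~{gb_per_second:.2f} GB/s at {fps} fps (before tiler/varying traffic)")
```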
Let’s take a look at the benchmarks and see what the relative bandwidth looks like when compared to the real applications:
[Chart: relative bandwidth per test, broken down into frame buffer, texture and geometry traffic]
As you can see, again, the real-world applications are more consistent in the balance of bandwidth used across the rendering. “Benchmark A” starts off pretty well, but unfortunately it goes off the rails pretty quickly. What we see here is 3-8x more bandwidth being used for geometry (which, as discussed in “Better Living Through (Appropriate) Geometry”, is supposed to be a container for the samples), meaning there is less bandwidth available for fragment generation - which is what the user will actually see.
So, what’s the conclusion? Well, GPU benchmarks generally still have a long way to go, mobile ones more so. I am looking forward to the time when, as for desktop and console games, mobile game developers release their own benchmarks using sections from real application workloads, allowing for a far more well-rounded view of the GPU.
Until then, I have a couple of suggestions that will not only make GPU benchmarking a lot more informative for consumers, but will also leave semiconductor companies with more time to worry about how to improve GPU performance for consumer content rather than how to impress customers in the next important benchmark rankings.
I have produced the following “Tao of GPU benchmarks” as a guide which I hope people will follow:
Thanks Peter,
DoF is something that I hadn't considered, but it makes sense as it would at least require a downsample of the full frame and downsamples of the subsequent results, which would add significantly to the pixel count (1.7x?). I expect that the situation would be similar for screen-space bloom. I likely also underestimate the resolutions of shadow maps!
Very helpful... Thanks!
Cheers,
Sean
Most modern 3D games will have multiple off-screen passes. It's not necessarily due to a deferred lighting pipeline (that's still pretty uncommon in mobile content); shadow mapping and post-processing effects such as HDR or depth of field are pretty standard now, though.
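As a rough, back-of-the-envelope sketch of how those extra passes add up, the example below totals the fragments for one hypothetical 1080p frame. The pass list, resolutions and overdraw factor are assumed values for illustration, not measurements from any of the titles discussed above.

```python
# Back-of-the-envelope fragment budget for one hypothetical 1080p frame.
# Pass resolutions and the overdraw factor are assumed values for illustration.
width, height = 1920, 1080
screen = width * height                       # ~2.07M visible pixels

passes = {
    "main scene (avg. 1.5 fragments/pixel)": screen * 1.5,
    "shadow map (1024 x 1024)":              1024 * 1024,
    "bloom/DoF downsample chain":            screen * (1/4 + 1/16 + 1/64),
    "full-screen post composite":            screen,
}

total = sum(passes.values())
for name, frags in passes.items():
    print(f"{name}: {frags / 1e6:.2f}M fragments")
print(f"total: {total / 1e6:.2f}M fragments per frame")
```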
This has been a tremendous read. Thank you!
Are the fragment counts for the named applications correct? They seem awfully high (per frame) for a 1080p display -- rendering 8M fragments seems wasteful. Are these apps using a deferred pipeline? Do they have incredible amounts of overdraw? Or is the framebuffer re-read (many times) to composite effects on top of? Is this typical among optimized mobile graphics (e.g. do ARM demos often have similar per-frame fragment counts)?
I'm wondering if I have a gap in my understanding of GLES rendering so any insights would be greatly appreciated!
Good blog and comment! Very useful! I hope more Chinese developers can see it!
Thanks Ed - nice blog - many issues close to my own heart. I would add a few more steps to your "Tao of GPU Benchmarking":
On the first point: Most engineering teams code review their code to ensure "best practice" is followed - please do the same with your data. GPUs are fundamentally data-plane engines and good data encoding can make a huge difference to performance even though the output is visually the same.
We have seen a number of cases in both applications and benchmarks with not only very complex models in terms of pure vertex counts, but also where those high-complexity models have been built inefficiently. Like any modern processor, GPUs function efficiently because of their ability to cache data, so make sure the models sent to the GPU follow standard industry best practice in terms of spatial locality (i.e. send triangles which are close together on screen close together in the vertex attribute arrays), and remove attributes you are not actually using in your shaders.
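As a minimal sketch of the second half of that advice, the example below repacks an interleaved vertex buffer so that attributes the shaders never read are dropped before upload. The layout and attribute names are hypothetical, and a production pipeline would normally do this offline in its asset conditioning tools rather than at run time.

```python
import struct

# Hypothetical interleaved layout exported by an art tool:
# position (3 floats), normal (3 floats), tangent (4 floats), uv (2 floats).
SRC_LAYOUT = [("position", 3), ("normal", 3), ("tangent", 4), ("uv", 2)]

def strip_attributes(vertex_blob, used):
    """Repack an interleaved float32 vertex buffer, keeping only the
    attributes the shaders actually read (e.g. drop 'tangent' if no
    normal mapping is used)."""
    src_floats = sum(n for _, n in SRC_LAYOUT)
    vertices = struct.unpack(f"<{len(vertex_blob) // 4}f", vertex_blob)
    out = []
    for base in range(0, len(vertices), src_floats):
        offset = 0
        for name, count in SRC_LAYOUT:
            if name in used:
                out.extend(vertices[base + offset: base + offset + count])
            offset += count
    return struct.pack(f"<{len(out)}f", *out)

# Example: one vertex, keeping only position and uv (48 -> 20 bytes).
one_vertex = struct.pack("<12f", *range(12))
packed = strip_attributes(one_vertex, used={"position", "uv"})
print(len(one_vertex), "->", len(packed), "bytes per vertex")
```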
On the second point: By all means have higher geometry complexity, and more textures, and more complex shaders, but please spend those cycles on something where it actually makes a visual improvement to the scene!
We often see really expensive models which are scaled down to be some minor eye candy in the scene, or multi-pass shaders which generate very little additional visual improvement over the original shader. If you are going to spend an extra 10 million cycles on a frame, then make it count! I think my favourite example of this is one game with an NPC character wearing heavily tinted glasses. Behind those glasses are heavily tessellated eyelids, and behind those are heavily tessellated eyeballs. In most frames featuring that character we have >20K triangles in an area spanning only a couple of pixels. This is about 20% of the vertex complexity of the scene, and it's behind sunglasses so you can't really see it anyway. That 20% would have made far more visual impact if it had been spent improving the rest of the models in the scene.