I’m asked quite a lot about how I feel about benchmarks. When I sit down to write these blogs I usually go searching for a suitable quote, and for this one I found one that perfectly sums up my feelings.
This is from business leadership consultant Tom Peters:
"... I hate Benchmarking! Benchmarking is Stupid!”
Yep, I’m with Tom on this one, but we may need to qualify that a bit more… back to Tom:
“Why is it stupid?”
“Because we pick the current industry leader and then we launch a five year program, the goal of which is to be as good as whoever was best five years ago, five years from now.”
While this statement was originally aimed at business leadership and strategy, it is equally true of any type of performance benchmarking.
I’ve spent the last three years directly involved in, and most of my 20-year career indirectly involved in, the mire that is synthetic benchmarking of GPUs. Everything I've seen leads me to the conclusion that GPU benchmarks are a reinforcement of the above statement. They do nothing but focus attention on narrow subsections of performance while purporting to tell you about the holistic performance of a GPU.
It seems logical to say that, in order to provide valuable input to an end consumer’s purchasing decision, it is better for GPU benchmarks to reflect real-world use cases. Understanding how readily a GPU delivers the graphics of a user’s favorite game, and how long it can be played at a suitable FPS, would be useful information for consumers and OEMs alike. However, is this really the data that popular benchmarks deliver at the moment?
Desktop GPU benchmarking went through a similar evolution to the one that mobile GPUs are currently undergoing. In its earliest days it consisted of extremely theoretical and somewhat woolly comparisons of architectural triangles/second and pixels/second rates. This later developed into actual applications that purportedly measured tri/s and pix/s, before arbitrary spinning objects (spinning tori/donuts, Utah Teapots and Venus de Milos) entered the scene, which led to the stage the mobile GPU benchmarking scene is at currently: benchmarks consisting of synthetic game scenes designed specifically to test a GPU’s maximum compute capacity. The next development, and where the PC market currently stands, is the comparison of metrics garnered by running actual content - real games - and assessing each GPU’s merits based on that. Well, there’s a novel concept! Actually using the content that people are running and care about? Shocker!
Before we go any further, I feel an explanation is needed as to why current benchmarks are not the best representation of GPU performance. Current popular benchmarks claim to stress-test GPUs to discover the maximum number of frames they can deliver in a certain time period. In many ways this seems reasonable – all benchmarking really requires in order to be effective is a single figure derived from a test that is the same for all contenders, and the maximum compute performance of a GPU fits into this category.
However, there are a number of issues with the way GPU benchmarks do this at the moment. Informing consumers that a device is capable of delivering 300+ frames of particular content in a fixed time period may be a useful metric in certain circumstances, but it is not when no content the consumer would normally use on their device exercises the GPU in the way current benchmarks do.
To the consumer, the figure delivered by benchmarks is completely arbitrary and does not correspond to any experience they might have of the device. It would be easily possible to deliver exactly the same visual experience that the benchmarks present at much higher frame rates or, more appropriately for embedded devices, at a fraction of the energy cost and computing resources, if the benchmarks were coded in a more balanced way.
Surely, when the quality of graphics is the same between a benchmark and a popular game, it is better for a consumer to know how well the GPU delivers content that uses regular techniques and balanced workloads rather than an irregularly coded benchmark?
Later we'll look at my "Tao of GPU benchmarks" and discuss what guidelines a benchmark should follow, but first let's take a look under the hood of popular content and the benchmarks that are supposed to mirror it.
As an internal project, ARM has been running in excess of 1M frames of real content from top OpenGL® ES-enabled games on the app stores, including titles such as Angry Birds, Asphalt 7 and Temple Run. We analyse multiple performance areas including CPU load, frames per second, uArch data and a tonne of GPU-agnostic API usage and render-flow composition data.
When you look at some examples of the data we gather in this sort of analysis, the results are quite striking. Looking at, say, the imagery in Asphalt 7 and T-Rex HD on the same ARM® Mali™-based 1080p device, you'd see that they appear to show similar levels of graphical user experience. This would lead a user to believe that they are constructed from a broadly similar level of workload. When we compare a selection of popular benchmarks and a selection of popular games, we see the following:
| 1080p | Benchmark A | Benchmark B | Benchmark C | Asphalt 7 | NFS Most Wanted | Spiderman |
|---|---|---|---|---|---|---|
| Avg. Vert./Frame | 11K | 760K | 830K | 200K | 27K | 40K |
| Avg. Tris./Frame | 12.5K | 460K | 780K | 140K | 18K | 26K |
| Avg. Frags./Frame | 3.6M | 6.2M | 10M | 8.0M | 6.8M | 8.1M |
| Avg. Vert. FLOPS/Frame | 1.3M | 53M | 99M | 11.5M | 3.3M | 5.1M |
| Avg. Frag. FLOPS/Frame | 80M | 148M | 490M | 165M | 116M | 258M |
The first and most striking observation is that whilst the fragment counts for the benchmarks are similar to those of popular games, the vertex counts of Benchmarks B and C go through the roof! And in fact, when we look more closely at Benchmark C, the use of vertices is in no way efficient.
The global average for primitive to fragment ratio in this benchmark at 1080p is 1:13.1 which is close to (but just the right side of) our low watermark of 1:10 which we defined in the “Better Living Through (Appropriate) Geometry” blog, compared to a ratio of 1:53 in Asphalt 7. However, examining the content draw call by draw call, 50% of Benchmark C draw calls have a ratio of less than 1:1 primitive to fragment and an additional 24% have a ratio of less than 1:10 - against a recommended guideline of more than 1:10! The same is true for Benchmark B where 66% of the draw calls are producing micropolygons.
Real games are more balanced and consistent, with fewer micro-triangles and the majority of draw calls handling more than ten fragments per triangle.
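To make the draw-call-level analysis concrete, here is a minimal sketch of how draw calls could be bucketed by their triangle-to-fragment ratio. The per-draw-call counters are hypothetical example values rather than data from any of the benchmarks or games above; in practice they would come from an API trace or a GPU profiling tool.

```python
# Sketch: bucket draw calls by their triangle-to-fragment ratio.
# The per-draw-call counters below are hypothetical example values;
# in practice they would come from a GPU profiler or API trace.

draw_calls = [
    {"triangles": 12000, "fragments": 9000},    # micropolygons: < 1 fragment/triangle
    {"triangles": 4000,  "fragments": 30000},   # between 1:1 and 1:10
    {"triangles": 800,   "fragments": 45000},   # healthy: > 10 fragments/triangle
]

buckets = {"below 1:1": 0, "between 1:1 and 1:10": 0, "at least 1:10": 0}

for dc in draw_calls:
    ratio = dc["fragments"] / dc["triangles"]   # fragments shaded per triangle
    if ratio < 1:
        buckets["below 1:1"] += 1
    elif ratio < 10:
        buckets["between 1:1 and 1:10"] += 1
    else:
        buckets["at least 1:10"] += 1

total = len(draw_calls)
for name, count in buckets.items():
    print(f"{name}: {100 * count / total:.0f}% of draw calls")
```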
Benchmark providers admit that they use high vertex counts in order to stress GPUs, the justification being that it provides users with “realistic” feedback on how their GPU will respond to future content. However, as demonstrated, such stress testing is not realistic as it doesn’t accurately reflect the balance of fragment and geometry work used in applications that consumers run on a daily basis. While the fragment and vertex rates of the real games show variation, the ratios stay pretty consistent.
One of the major effects of the geometry imbalance shown above is that it ignores by far the most limiting factor in mobile device performance: bandwidth. It’s extremely easy to break the bandwidth limit in an instant with these high-cost/low-visual-yield micropolygons (as discussed in “PHENOMENAL COSMIC POWERS! Itty-bitty living space!”).
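As a rough illustration of the bandwidth cost of that geometry, the sketch below estimates per-frame vertex fetch traffic from the average vertex counts in the table above. The bytes-per-vertex figure, the target frame rate and the assumption that every vertex is fetched from memory exactly once are simplifying assumptions for illustration only, not measured values.

```python
# Rough, illustrative estimate of geometry fetch bandwidth per frame.
# bytes_per_vertex is an assumed figure (position + normal + one UV set,
# all float32); real content varies widely.
bytes_per_vertex = (3 + 3 + 2) * 4   # 32 bytes

avg_vertices_per_frame = {
    "Benchmark C": 830_000,   # average vertex counts from the table above
    "Asphalt 7":   200_000,
}

fps = 30  # assumed frame rate for the comparison

for name, verts in avg_vertices_per_frame.items():
    mb_per_frame = verts * bytes_per_vertex / 1e6
    gb_per_second = mb_per_frame * fps / 1e3
    print(f"{name}: ~{mb_per_frame:.1f} MB/frame of vertex fetch, "
          f"~{gb_per_second:.2f} GB/s at {fps} fps (before tiler/varying traffic)")
```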
Let’s take a look at the benchmarks and see what the relative bandwidth looks like when compared to the real applications:
[Chart: relative bandwidth per test, broken down into frame buffer, texture and geometry traffic]
As you can see, again, the real-world applications are more consistent in the balance of bandwidth used across the rendering. “Benchmark A” starts off pretty well, but unfortunately it goes off the rails pretty quickly. What we see here is 3-8x more bandwidth being used for geometry (which, as discussed in “Better Living Through (Appropriate) Geometry”, is supposed to be a container for the samples), meaning there is less bandwidth available for fragment generation - which is what the user will actually see.
So, what’s the conclusion? Well, GPU benchmarks generally still have a long way to go, mobile ones more so. I am looking forward to the time when, as for desktop and console games, mobile game developers release their own benchmarks using sections from real application workloads, allowing for a far more well-rounded view of the GPU.
Until then, I have a couple of suggestions that will not only make GPU benchmarking a lot more informative for consumers, but will also leave semiconductor companies with more time to worry about how to improve GPU performance for consumer content rather than how to impress customers in the next important benchmark rankings.
I have produced the following “Tao of GPU benchmarks” as a guide which I hope people will follow:
Thanks Peter,
DoF is something that I hadn't considered, but it makes sense as it would at least require a downsample of the full frame and downsamples of the subsequent results, which would add significantly to the pixel count (1.7x?). I expect that the situation would be similar for screen-space bloom. I likely also underestimate the resolutions of shadow maps!
Very helpful... Thanks!
Cheers,
Sean
Most modern 3D games will have multiple off-screen passes. It's not necessarily due to a deferred lighting pipeline (that's still pretty uncommon in mobile content); shadow mapping and post-processing effects such as HDR or depth of field are pretty standard now, though.
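As a rough, back-of-the-envelope sketch of how those extra passes add up, the example below totals the fragments for one hypothetical 1080p frame. The pass list, resolutions and overdraw factor are assumed values for illustration, not measurements from any of the titles discussed above.

```python
# Back-of-the-envelope fragment budget for one hypothetical 1080p frame.
# Pass resolutions and the overdraw factor are assumed values for illustration.
width, height = 1920, 1080
screen = width * height                       # ~2.07M visible pixels

passes = {
    "main scene (avg. 1.5 fragments/pixel)": screen * 1.5,
    "shadow map (1024 x 1024)":              1024 * 1024,
    "bloom/DoF downsample chain":            screen * (1/4 + 1/16 + 1/64),
    "full-screen post composite":            screen,
}

total = sum(passes.values())
for name, frags in passes.items():
    print(f"{name}: {frags / 1e6:.2f}M fragments")
print(f"total: {total / 1e6:.2f}M fragments per frame")
```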
This has been a tremendous read. Thank you!
Are the fragment counts for the named applications correct? They seem awfully high (per frame) for a 1080p display -- rendering 8M fragments seems wasteful. Are these apps using a deferred pipeline? Do they have incredible amounts of overdraw? Or is the framebuffer re-read (many times) to composite effects on top of? Is this typical among optimized mobile graphics (e.g. do ARM demos often have similar per-frame fragment counts)?
I'm wondering if I have a gap in my understanding of GLES rendering so any insights would be greatly appreciated!
Good blog and comment! Very useful! I hope more Chinese developers can see it!
Thanks Ed - nice blog - many issues close to my own heart. I would add a few more steps to your "Tao of GPU Benchmarking":
On the first point: Most engineering teams code review their code to ensure "best practice" is followed - please do the same with your data. GPUs are fundamentally data-plane engines and good data encoding can make a huge difference to performance even though the output is visually the same.
We have seen a number of cases in both applications and benchmarks with not only very complex models in terms of pure vertex counts, but also where those high-complexity models have been built inefficiently. Like any modern processor, GPUs function efficiently because of their ability to cache data, so make sure the models sent to the GPU follow standard industry best practice in terms of spatial locality (i.e. send triangles which are close together on screen close together in the vertex attribute arrays), and remove attributes you are not actually using in your shaders.
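As a minimal sketch of the second half of that advice, the example below repacks an interleaved vertex buffer so that attributes the shaders never read are dropped before upload. The layout and attribute names are hypothetical, and a production pipeline would normally do this offline in its asset conditioning tools rather than at run time.

```python
import struct

# Hypothetical interleaved layout exported by an art tool:
# position (3 floats), normal (3 floats), tangent (4 floats), uv (2 floats).
SRC_LAYOUT = [("position", 3), ("normal", 3), ("tangent", 4), ("uv", 2)]

def strip_attributes(vertex_blob, used):
    """Repack an interleaved float32 vertex buffer, keeping only the
    attributes the shaders actually read (e.g. drop 'tangent' if no
    normal mapping is used)."""
    src_floats = sum(n for _, n in SRC_LAYOUT)
    vertices = struct.unpack(f"<{len(vertex_blob) // 4}f", vertex_blob)
    out = []
    for base in range(0, len(vertices), src_floats):
        offset = 0
        for name, count in SRC_LAYOUT:
            if name in used:
                out.extend(vertices[base + offset: base + offset + count])
            offset += count
    return struct.pack(f"<{len(out)}f", *out)

# Example: one vertex, keeping only position and uv (48 -> 20 bytes).
one_vertex = struct.pack("<12f", *range(12))
packed = strip_attributes(one_vertex, used={"position", "uv"})
print(len(one_vertex), "->", len(packed), "bytes per vertex")
```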
On the second point: By all means have higher geometry complexity, and more textures, and more complex shaders, but please spend those cycles on something where it actually makes a visual improvement to the scene!
We often see really expensive models which are scaled down to be some minor eye candy in the scene, or multi-pass shaders which generate very little additional visual improvement over the original shader. If you are going to spend an extra 10 million cycles on a frame, then make it count! I think my favourite example of this is one game with an NPC character wearing heavily tinted glasses. Behind those glasses are heavily tessellated eyelids, and behind those are heavily tessellated eyeballs. In most frames featuring that character we have >20K triangles in an area spanning only a couple of pixels. This is about 20% of the vertex complexity of the scene, and it's behind sunglasses so you can't really see it anyway. That 20% would have made far more visual impact if it had been spent improving the rest of the models in the scene.