Flipping the FLOPS - how ARM measures GPU compute performance

September 11, 2013


It's time we dealt with the measurement of compute performance in GPUs. This is another in a series of ARM blogs intended to enlighten and to reduce confusion in the graphics industry, and in it I'd like to cover the issue of Floating-point Operations Per Second (FLOPS, or GFLOPS or TFLOPS).

In the past, Tom Olson talked about triangles per second, Ed Plowman talked about pixels per second, Sean Ellis addressed floating-point precision and hopefully we managed to amuse people as well as educate. Today let's look at compute performance - it's a useful measure.

Competition is good

...But open and honest competition is better. The market for GPUs is very competitive, with a number of companies supplying IP as well as those who make their own, for inclusion in SoCs. I love competition; how else can you win if you don't have competition? Or, as one of the most competitive people I know said to me: "What is the point in competing if you don't win?" (she was a runner, but suffice to say there are a lot of people round here who want to win at anything they commit to). In this competitive environment, we know that our partners can sometimes struggle to understand performance metrics for GPUs. They need to compare the offerings from multiple suppliers and pick the right product for their needs. This can be a complex subject, but it doesn't have to be as complex as some try to make it. I want to win on honest, open metrics.

Graphics is compute

Graphics is a really computationally intensive problem - you have to do lots of arithmetic in it, which is one reason people have been interested in utilising those capabilities for more than "just" graphics. To draw stuff, we start off by describing some objects in a three-dimensional space by dividing them into a number of triangles and listing the co-ordinates of each vertex of the triangles. We can argue about why we use triangles, and some have, but a triangle is simple, and the three points in it are guaranteed to form a plane. We then define some light sources and give them types and positions; we define the projection model (the camera) and give that a position; we define the colours and surface detail of the objects (made up of those triangles). Sometimes we add lots more detail; sometimes we animate the objects and make them move. After all that, we try to work out what a picture from the camera would look like, if it were projected onto a two-dimensional screen. As you can imagine, there are lots of 3-D equations to solve, and lots of trigonometry. Most of the numbers we use are floating-point numbers, so the rate at which we can perform floating-point arithmetic has a big effect on our graphics performance. It's not the only thing, of course, but it is important. It is certainly good to understand it.
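To give a feel for how quickly the arithmetic adds up, here is a minimal Python sketch of moving a single vertex from 3-D space to a 2-D screen. The matrix, the function names and the 640x480 screen are all hypothetical values chosen for illustration; they are not taken from any real GPU or driver.

```python
# Hypothetical 4x4 projection matrix: x, y, z pass through, w takes the value of z.
PROJECTION = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Each of the 4 output components needs 4 MULs and 3 ADDs.
# (Python's sum() also adds a starting 0, which hardware would not need.)
FLOPS_PER_TRANSFORM = 4 * (4 + 3)


def transform_vertex(matrix, vertex):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vertex."""
    return [sum(matrix[r][c] * vertex[c] for c in range(4)) for r in range(4)]


def project_to_screen(clip, width, height):
    """Perspective divide, then map the [-1, 1] range to pixel co-ordinates."""
    x, y, _z, w = clip
    ndc_x, ndc_y = x / w, y / w
    return ((ndc_x + 1.0) * 0.5 * width, (1.0 - ndc_y) * 0.5 * height)


clip = transform_vertex(PROJECTION, [1.0, 1.0, 2.0, 1.0])
print(clip)                               # [1.0, 1.0, 2.0, 2.0]
print(project_to_screen(clip, 640, 480))  # (480.0, 120.0)
print(FLOPS_PER_TRANSFORM)                # 28
```

That is 28 floating-point operations for one matrix transform of one vertex, before any lighting, animation or surface detail has been considered.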

First describe the problem

In our GPUs (and lots of others) we have floating-point operations performed in all the places I described above. Some are in fixed-function units and some are in programmable units. Some examples may help here: when you load a value from a texture, the texture unit will calculate a memory address, based on the co-ordinates within the texture that you specify, and then possibly interpolate between several adjacent values in memory (bilinear filtering, for example) to produce the texel value you want. And, if the texture was in a compressed format like ASTC, the values will have to be decompressed as part of that process as well. That's a lot of calculation (integer and floating-point). It's very good for graphics, but utilising those units for more general-purpose compute is somewhere between a bit hard and impossible.
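As a rough illustration of the arithmetic hiding inside a texture lookup, here is a software sketch of bilinear filtering in Python. It is only a model of the idea: the function name, the edge clamping and the tiny 2x2 texture are assumptions made for this example, and a real texture unit does this work in fixed-function hardware, not in code like this.

```python
import math


def bilinear_sample(texture, u, v):
    """Sample a 2-D texture (list of rows of floats) at texel co-ordinates (u, v).

    Assumes 0 <= u <= width - 1 and 0 <= v <= height - 1.
    """
    height, width = len(texture), len(texture[0])
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    # Clamp the neighbouring texels to the texture edge.
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # Two horizontal blends followed by one vertical blend.
    top = texture[y0][x0] * (1.0 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1.0 - fx) + texture[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


texture = [[0.0, 1.0],
           [2.0, 3.0]]
print(bilinear_sample(texture, 0.5, 0.5))  # 1.5
```

Each sample here costs six multiplies plus a handful of adds and subtracts, and the shader programmer never issues an instruction for any of them.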

Some GPUs "just" do graphics and do not do general purpose compute.

The ARM Mali-400 family, for example, was designed for OpenGL ES 2.0, which has low precision requirements. Some operations need to be performed at 32-bit precision, some 24-bit and some 16-bit. OpenCL on NEON on the ARM CPU can be used as a compute companion.
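Precision matters when comparing operation counts, because an operation at 16 bits does not deliver the same result as one at 32 bits. The following sketch uses Python's standard struct module to round-trip values through IEEE half (16-bit) and single (32-bit) storage formats. The helper name is hypothetical, the struct module has no 24-bit format so that case is not shown, and this says nothing about how any particular GPU rounds internally.

```python
import struct


def round_trip(value, fmt):
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


HALF, SINGLE = "<e", "<f"  # IEEE 754 binary16 and binary32

for value in (0.1, 2049.0):
    half = round_trip(value, HALF)
    single = round_trip(value, SINGLE)
    print(value, "-> half:", half, "single:", single)
```

With 0.1, both formats introduce an error, but the 16-bit error is far larger. With 2049.0, the 32-bit format holds the value exactly and the 16-bit format cannot.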

Some GPUs do graphics and compute

For example, the Mali-T600 family of GPUs uses the Midgard architecture (described by me in a previous blog). In that architecture, we have arithmetic pipelines that execute instructions like ADD and MUL. We have a balanced mix of scalar and vector (SIMD) units, so we can do multiple operations like that in parallel (e.g. four FP32 or eight FP16). We also have dot product instructions and a bunch of trigonometry instructions (like sin, cos, tan etc.).
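The "four FP32, eight FP16" figures are what you would get from a 128-bit vector, so the sketch below assumes that width; it is inferred from those two numbers and nothing more. The function names are made up for illustration.

```python
VECTOR_WIDTH_BITS = 128  # assumption: inferred from "four FP32, eight FP16"


def lanes(element_bits, vector_bits=VECTOR_WIDTH_BITS):
    """Number of elements of the given size that fit in one vector."""
    return vector_bits // element_bits


def dot_product_flops(n):
    """An n-element dot product is n MULs plus (n - 1) ADDs."""
    return n + (n - 1)


print(lanes(32))             # 4 FP32 operations per vector instruction
print(lanes(16))             # 8 FP16 operations per vector instruction
print(dot_product_flops(4))  # 7 operations in a 4-element dot product
```

A vector ADD or MUL therefore counts as one operation per lane, and a dot product counts every MUL and ADD inside it.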

How should you express the number of floating-point operations in a trigonometric function like sin()?
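To see why that question has no single answer, consider a software approximation of sin() built from a Taylor polynomial. This is purely an illustration and not how any GPU implements the function; the function name and the choice of a Taylor series evaluated in Horner form are assumptions for this sketch.

```python
import math


def sin_poly(x, terms):
    """Approximate sin(x) with `terms` Taylor terms; return (value, flops)."""
    x2 = x * x                      # 1 MUL
    flops = 1
    # Coefficients (-1)^k / (2k + 1)!, treated as precomputed constants.
    coeffs = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(terms)]
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = acc * x2 + c          # 1 MUL + 1 ADD per remaining term
        flops += 2
    return acc * x, flops + 1       # final MUL by x


for terms in (2, 3, 5):
    value, flops = sin_poly(0.5, terms)
    print(terms, "terms:", flops, "flops, error", abs(value - math.sin(0.5)))
```

The same function "costs" 4, 6 or 10 operations depending on how accurate you want it to be, and that is before range reduction for large arguments. Any FLOPS figure that credits a sin instruction with N operations is choosing N fairly arbitrarily.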

The Mali-T600 series was designed for compute as well as graphics, targeting the newest APIs like OpenCL, OpenGL ES 3.0 and Microsoft DirectX 11, so it supports full 32-bit precision floating-point operations conformant with IEEE 754-2008. We also do double-precision (64-bit floating-point) and, as an aside, we can also do a wide variety of integer operations, including 64-bit (traditionally GPUs lack good integer capabilities).

To summarise, we have some GPUs with differing performance levels of integer and floating point arithmetic and differing precisions, with differing levels of usability from code.

Then define your metric

Now comes the thorny problem of how to define a metric that measures how much arithmetic is going on in a GPU: what to measure?

Now here at ARM, we like to be inclusive: partnership is one of our big things, after all. So, I'm prepared to go as far as this: it doesn't matter so much what you do, as long as you show your working (as UK teachers would say to students, i.e. explain the method you are using). However, anyone who doesn't explain their numbers (in small print, even) must be trying to hide something, and that just won't do. So, in the spirit of openness, how do we produce our numbers? Well, the headline is about FLOPS, so for the time being, we're going to ignore integer arithmetic. Here are ARM's rules:

ARM includes only directly-programmable arithmetic operations: classical arithmetic operations exposed to the shader programmer such as ADD, MUL, and vector versions of those.
We count the number of ADDs, MULs etc. (including those in dot product operations) that we can execute in one cycle, from a real piece of code in a compute shader. This is our architectural FLOPS rate (measured in FLOPS per cycle).
Although we can do some functions (like trig) really efficiently we don't add anything into the mix for these - that way lies madness.
From a real, fully laid-out, placed-and-routed synthesis, using real physical IP libraries (e.g. TSMC 28nm HPM, specifying channel lengths etc.), we get a maximum operating frequency. We openly specify in what conditions (e.g. slow-slow silicon corner, Vdd at -10% of Vnom etc.). This is not just a PowerPoint number: our partners should easily be able to achieve this frequency. Most partners, who would use more "typical" parameters, should easily exceed it. If you want to implement on a higher-speed process that burns more power, you can definitely exceed it. This is what we believe is right for an IP supplier. Silicon manufacturers will quote whatever frequency they guarantee their chips at.
We multiply the number of FLOPS per cycle by the number of arithmetic pipelines per core, then the number of cores, then by the frequency. That gives you a number of FLOPS. It's a big number, so usually we specify a number of GFLOPS (gigaflops), but soon we'll be using teraflops - we have teraflop cores being developed for delivery this year.
For the Mali-T600 series, the headline number is single-precision (32-bit floating-point). We quote a second number which is double-precision (64-bit) FLOPS. For most "graphics" GPUs, that 64-bit number is smaller. For a GPU we would target at high-performance computing or supercomputers (and we have been asked), it might be the same, or even bigger.
We'll also show shader code that actually manages to include all those operations. We'll show any difference between real code run on real silicon and the architectural FLOPS rate. Currently we can achieve 97% of the architectural GFLOPS rate on real silicon. We believe that's a very high percentage number compared to others. Perhaps you know better?
We also run benchmarks. If you need to know the execution speed of real code, this is probably more useful information to you than looking at architectural numbers! ARM likes independent, third-party benchmarks and there are a host of them to measure performance achieved (rather than architectural numbers). Common ones used for compute-intensive numerical applications are SAXPY and SGEMM originally from the LINPACK and LAPACK BLAS libraries, although recently companies have been starting to look at GPU computing on consumer devices, e.g. with CLBenchmark from Kishonti. This is a large subject and is really best left to a later blog.
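The multiplication described in the rules above, and the comparison with measured results, can be sketched in a few lines of Python. Every number below is a made-up placeholder and not a Mali specification, and the function names are hypothetical. SAXPY is counted the conventional way, as one MUL and one ADD per element.

```python
def architectural_gflops(flops_per_cycle_per_pipe, pipes_per_core, cores, freq_hz):
    """FLOPS per cycle x pipelines per core x cores x frequency, in GFLOPS."""
    return flops_per_cycle_per_pipe * pipes_per_core * cores * freq_hz / 1e9


def efficiency(measured_gflops, architectural):
    """Fraction of the architectural rate achieved by real code."""
    return measured_gflops / architectural


def saxpy(a, x, y):
    """Single-precision a*x + y, written here as plain Python for clarity."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_flops(n):
    """SAXPY on n elements performs n MULs and n ADDs."""
    return 2 * n


# Hypothetical configuration: 16 FLOPS/cycle/pipe, 2 pipes, 4 cores, 500 MHz.
peak = architectural_gflops(16, 2, 4, 500e6)
print(peak)                     # 64.0 GFLOPS

# Hypothetical measurement: shader code sustaining 62.08 GFLOPS.
print(efficiency(62.08, peak))  # roughly 0.97

print(saxpy(2.0, [1.0, 2.0], [3.0, 4.0]))  # [5.0, 8.0]
print(saxpy_flops(1000))                   # 2000
```

To turn a benchmark run into a rate, you would divide the operation count by the measured wall-clock time. The gap between that figure and the architectural number is exactly what a supplier should be willing to show.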

What we don't do

ARM does not include FLOPS from fixed-function units, or things only available from graphics, e.g. texture units, blending units, varying interpolation, triangle setup, Z-culling etc.

We don't include any relaxed precision operations. We only include full IEEE-compliant ops. The subject of IEEE compliance, precision and rounding modes is complex and there is room for significant confusion here. Explaining and demystifying this is best left to a later blog.
We don't make any assumptions about how many operations were involved in calculating any of the library functions that might be implemented as instructions.
We don't quote a theoretical maximum frequency that we cannot justify from a real layout/synthesis. We can provide the EDA tools report to back up our claims.
We don't quote a maximum frequency for ridiculously hot, leaky processes that cannot be sensibly used by most of our partners.
We don't multiply the number we come up with by the ZIP code of our office in San Jose, or shift left by the telephone number of our HQ.

And finally

I have described how we define and produce our architectural FLOPS numbers. It should give you all the ammunition you need to go and question your supplier about how they calculate theirs. Hopefully that will lead to useful, productive conversations. Maybe we need a standard. Maybe it will lead to us changing the way we define our numbers to match others' methods. That's OK, as long as we're open about it.

I've also indicated the role that benchmarks need to play in describing real-world performance. We need to get industry agreement about which benchmarks matter. Too many benchmarks can lead to confusion.

Like our method? Hate it? Think we're wrong? Want to suggest anything different? Got any amusing tales to tell about how some others do it? Let us know. Feel free to comment to this blog.

Jem Davies over 10 years ago

Thanks for the kind words, Sean.
Sean Lumly over 10 years ago

The Mali seems really well positioned to take care of partners optimizing for different targets (eg. perf vs. cost/mm2). I would love to read a blog post on this subject, as I'm sure it would be very enlightening beyond what has already been shared.
Oh, and congrats on a great interview with Anandtech. I thoroughly enjoyed watching it!
Sean
Jem Davies over 10 years ago

Sean
Thermal limits and silicon area are related but are not the same. Some of our Partners create products that are primarily limited by thermal constraints (limited by the dissipation capabilities of their packaging, and thermal design of the case, for example). Some of these are related to cost (cost of expensive packaging, or metal cases), but some of these are fundamental (you just cannot dissipate more than a certain number of Watts into a mobile phone form factor device). These Partners will often ask what performance can be obtained within a particular number of Watts (frames per second per Watt).
Many Partners are making more cost-constrained (but high volume) devices. For these guys, the cost of silicon area is paramount (it always used to be 10 cents per square millimetre, but with newer silicon processes that is increasing), so their primary focus is performance within a certain silicon area (frames per second per square millimetre).
One size does not fit all, so we produce two roadmaps of GPUs, one focussed on fps/Watt and one focussed on fps/sq. mm.
--
Jem
Sean Lumly over 10 years ago

Thanks jemdavies, this was a great read!
I would love to get some insight into the performance challenges as they are related to chip thermal limits and die area, and how Mali specifically optimizes for these. Thermal limit is one area of graphics hardware that's rarely discussed, but I'm assuming that is of the utmost importance, specifically in a mobile targeted SoC where power is a scarce resource.
I'm also curious when you mention that you have TFLOP capable hardware. If this is with the maximum listed core-count Mali T760 (or T678, which should result in the same number of ALUs), then I would assume that the clocks would have to be quite high to peak at this level. Could it be that a 16-core Mali T760 (for example) may not be an impassable upper limit, and a higher core-count would be reserved for very special implementation cases? Or perhaps the T760 can be configured with more ALUs. Could this be a hint at a GPU core to come?
Jem Davies over 11 years ago

Thanks for that David. We know Tomas, and he makes a lot of sense.