It's time we dealt with the measurement of compute performance in GPUs. In this latest in a series of ARM blogs intended to enlighten readers and reduce confusion in the graphics industry, I'd like to cover Floating-point Operations Per Second (FLOPS, usually quoted as GFLOPS or TFLOPS).
In the past, Tom Olson talked about triangles per second, Ed Plowman talked about pixels per second, Sean Ellis addressed floating-point precision and hopefully we managed to amuse people as well as educate. Today let's look at compute performance - it's a useful measure.
...But open and honest competition is better. The market for GPUs is very competitive, with a number of companies supplying IP as well as those who make their own, for inclusion in SoCs. I love competition; how else can you win if you don't have competition? Or, as one of the most competitive people I know said to me: "What is the point in competing if you don't win?" (she was a runner, but suffice to say there are a lot of people round here who want to win at anything they commit to). In this competitive environment, we know that our partners can sometimes struggle to understand performance metrics for GPUs. They need to compare the offerings from multiple suppliers and pick the right product for their needs. This can be a complex subject, but it doesn't have to be as complex as some try to make it. I want to win on honest, open metrics.
Graphics is a really computationally intensive problem - you have to do lots of arithmetic in it, which is one reason people have been interested in utilising those capabilities for more than "just" graphics. To draw stuff, we start off by describing some objects in a three-dimensional space by dividing them into a number of triangles and listing the co-ordinates of each vertex of the triangles. We can argue about why we use triangles, and some have, but a triangle is simple, and the three points in it are guaranteed to form a plane. We then define some light sources and give them types and positions; we define the projection model (the camera) and give that a position; we define the colours and surface detail of the objects (made up of those triangles). Sometimes we add lots more detail; sometimes we animate the objects and make them move. After all that, we try to work out what a picture from the camera would look like, if it were projected onto a two-dimensional screen. As you can imagine, there are lots of 3-D equations to solve, and lots of trigonometry. Most of the numbers we use are floating-point numbers, so the rate at which we can perform floating-point arithmetic has a big effect on our graphics performance. It's not the only thing, of course, but it is important. It is certainly good to understand it.
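To give a feel for the arithmetic involved in that last step, here is a minimal Python sketch of a pinhole perspective projection of one triangle. The function name `project_vertex`, the `focal_length` parameter and the convention of a camera at the origin looking down +z are assumptions made for illustration; they are not taken from any Mali driver or graphics API.

```python
def project_vertex(vertex, focal_length=1.0):
    """Project a 3-D point (x, y, z) onto a 2-D image plane.

    Assumes a hypothetical pinhole camera at the origin looking along +z.
    """
    x, y, z = vertex
    if z <= 0.0:
        raise ValueError("vertex must be in front of the camera (z > 0)")
    # One multiply and one divide per screen coordinate.
    return (focal_length * x / z, focal_length * y / z)


def project_triangle(triangle, focal_length=1.0):
    """Project the three vertices of a triangle."""
    return [project_vertex(v, focal_length) for v in triangle]


triangle = [(0.0, 0.0, 1.0), (2.0, 4.0, 2.0), (-3.0, 3.0, 3.0)]
print(project_triangle(triangle))
```

Even this toy version costs two multiplies and two divides per vertex, before any lighting, animation or texturing is considered.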
In our GPUs (and lots of others) we have floating-point operations performed in all the places I described above. Some are in fixed-function units and some are in programmable units. An example may help here: when you load a value from a texture, the texture unit will calculate a memory address, based on the co-ordinates within the texture that you specify, and then possibly interpolate between several adjacent values in memory (bi-linear filtering, for instance) to produce the texel value you want. And, if the texture was in a compressed format like ASTC, the values will have to be decompressed as part of that process as well. That's a lot of calculation (integer and floating-point). It's very good for graphics, but utilising those units for more general-purpose compute is somewhere between a bit hard and impossible.
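As an illustration of the filtering step only (not the address calculation or ASTC decompression), a bi-linear lookup can be sketched in Python as below. The function `bilinear_sample`, the clamp-to-edge behaviour and the texel-centre convention are assumptions chosen for the example; a real texture unit does this work in fixed-function hardware.

```python
import math


def bilinear_sample(texture, u, v):
    """Sample a single-channel 2-D texture at normalised (u, v) in [0, 1].

    `texture` is a list of rows of floats. Texel centres are assumed to sit
    at integer + 0.5, and out-of-range texels are clamped to the edge.
    """
    height = len(texture)
    width = len(texture[0])

    # Map normalised co-ordinates into texel space.
    x = u * width - 0.5
    y = v * height - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # horizontal blend weight
    fy = y - y0  # vertical blend weight

    def texel(ix, iy):
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    # Two horizontal blends followed by one vertical blend.
    top = texel(x0, y0) * (1.0 - fx) + texel(x0 + 1, y0) * fx
    bottom = texel(x0, y0 + 1) * (1.0 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1.0 - fy) + bottom * fy


texture = [[0.0, 1.0],
           [2.0, 3.0]]
print(bilinear_sample(texture, 0.5, 0.5))  # 1.5, the average of all four texels
```

Every filtered sample costs a handful of multiplies and adds per channel, none of which a general-purpose compute kernel can get at directly.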
The ARM Mali-400 family, for example, was designed for OpenGL ES 2.0, which has low precision requirements. Some operations need to be performed at 32-bit precision, some at 24-bit and some at 16-bit. OpenCL on NEON on the ARM CPU can be used as a compute companion.
For example, the Mali-T600 family of GPUs uses the Midgard architecture (described by me in a previous blog). In that architecture, we have arithmetic pipelines that execute instructions like ADD and MUL. We have a balanced mix of scalar and vector (SIMD) units, so we can do multiple operations like that in parallel (e.g. four FP32 or eight FP16). We also have dot product instructions and a bunch of trigonometry instructions (like sin, cos, tan etc.).
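One common way to count the floating-point work in instructions like these is one FLOP per scalar add or multiply. That convention is an assumption made here for illustration, not any vendor's official rule. The helper names `vec_add` and `dot` in this Python sketch are likewise hypothetical.

```python
def vec_add(a, b):
    """Element-wise vector ADD; returns (result, flops)."""
    result = [x + y for x, y in zip(a, b)]
    return result, len(result)  # one add per lane


def dot(a, b):
    """Dot product; returns (result, flops)."""
    products = [x * y for x, y in zip(a, b)]  # one multiply per lane
    total = products[0]
    for p in products[1:]:                    # n - 1 adds to reduce n lanes
        total += p
    flops = len(products) + (len(products) - 1)
    return total, flops


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
print(vec_add(a, b))  # ([6.0, 8.0, 10.0, 12.0], 4)
print(dot(a, b))      # (70.0, 7)
```

Under this convention a four-wide FP32 ADD counts as 4 FLOPs and a four-element dot product as 7, which is why the counting rules matter as much as the hardware.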
The Mali-T600 series was designed for compute and for the newest APIs such as OpenCL, OpenGL ES 3.0 and Microsoft DirectX 11, so it supports full 32-bit precision floating-point operations conformant with IEEE 754-2008. We also do double precision (64-bit floating-point) and, as an aside, a wide variety of integer operations, including 64-bit (traditionally GPUs lack good integer capabilities).
To summarise, we have some GPUs with differing performance levels of integer and floating-point arithmetic and differing precisions, with differing levels of usability from code.
Now comes the thorny problem of how to define a metric that measures how much arithmetic is going on in a GPU: what to measure?
Now here at ARM, we like to be inclusive: partnership is one of our big things, after all. So, I'm prepared to go as far as this: it doesn't matter so much what you do, as long as you show your working (as UK teachers would say to students, i.e. explain the method you are using). However, anyone who doesn't explain their numbers (in small print, even) must be trying to hide something, and that just won't do. So, in the spirit of openness, how do we produce our numbers? Well, the headline is about FLOPS, so for the time being, we're going to ignore integer arithmetic. Here are ARM's rules:
ARM does not include FLOPS from fixed-function units, or things only available from graphics, e.g. texture units, blending units, varying interpolation, triangle setup, Z-culling etc.
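The arithmetic behind a peak architectural FLOPS figure of this kind can be sketched in Python as follows. Every number and unit name in the example is hypothetical and does not describe any real Mali core. Only the exclusion of fixed-function units comes from the rule above. The function name `peak_gflops` and the dictionary layout are also assumptions.

```python
def peak_gflops(units, cores, clock_mhz):
    """Peak GFLOPS counting only programmable units.

    units     -- list of dicts with 'flops_per_cycle' and 'programmable' keys
    cores     -- number of shader cores
    clock_mhz -- GPU clock in MHz
    """
    flops_per_cycle = sum(
        u["flops_per_cycle"] for u in units if u["programmable"]
    )
    # FLOPs/cycle/core * cores * MHz gives MFLOPS; divide by 1000 for GFLOPS.
    return flops_per_cycle * cores * clock_mhz / 1000.0


# Hypothetical core description, for illustration only.
HYPOTHETICAL_CORE = [
    {"name": "arithmetic pipeline", "flops_per_cycle": 16, "programmable": True},
    {"name": "texture unit", "flops_per_cycle": 8, "programmable": False},
]

print(peak_gflops(HYPOTHETICAL_CORE, cores=4, clock_mhz=500))  # 32.0
```

Counting the made-up texture unit as well would inflate the same design to 48 GFLOPS, which is exactly the kind of difference that small print should explain.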
I have described how we define and produce our architectural FLOPS numbers. It should give you all the ammunition you need to go and question your supplier about how they calculate theirs. Hopefully that will lead to useful, productive conversations. Maybe we need a standard. Maybe it will lead to us changing the way we define our numbers to match others' methods. That's OK, as long as we're open about it.
I've also indicated the role that benchmarks need to play in describing real-world performance. We need to get industry agreement about which benchmarks matter. Too many benchmarks can lead to confusion.
Like our method? Hate it? Think we're wrong? Want to suggest anything different? Got any amusing tales to tell about how some others do it? Let us know. Feel free to comment on this blog.
Thanks for the kind words, Sean.
The Mali seems really well positioned to take care of partners optimizing for different targets (e.g. perf vs. cost/mm²). I would love to read a blog post on this subject, as I'm sure it would be very enlightening beyond what has already been shared.
Oh, and congrats on a great interview with Anandtech. I thoroughly enjoyed watching it!
Sean
Thermal limits and silicon area are related but are not the same. Some of our Partners create products that are primarily limited by thermal constraints (limited by the dissipation capabilities of their packaging, and thermal design of the case, for example). Some of these are related to cost (cost of expensive packaging, or metal cases), but some of these are fundamental (you just cannot dissipate more than a certain number of Watts into a mobile phone form factor device). These Partners will often ask what performance can be obtained within a particular number of Watts (frames per second per Watt).
Many Partners are making more cost-constrained (but high volume) devices. For these guys, the cost of silicon area is paramount (it always used to be 10 cents per square millimetre, but with newer silicon processes that is increasing), so their primary focus is performance within a certain silicon area (frames per second per square millimetre).
One size does not fit all, so we produce two roadmaps of GPUs, one focussed on fps/Watt and one focussed on fps/sq. mm.
--
Jem
Thanks jemdavies, this was a great read!
I would love to get some insight into the performance challenges as they relate to chip thermal limits and die area, and how Mali specifically optimizes for these. Thermal limits are one area of graphics hardware that's rarely discussed, but I'm assuming they are of the utmost importance, specifically in a mobile-targeted SoC where power is a scarce resource.
I'm also curious when you mention that you have TFLOP-capable hardware. If this is with the maximum listed core-count Mali-T760 (or T678, which should result in the same number of ALUs), then I would assume that the clocks would have to be quite high to peak at this level. Could it be that a 16-core Mali-T760 (for example) may not be an impassable upper limit, and a higher core-count would be reserved for very special implementation cases? Or perhaps the T760 can be configured with more ALUs. Could this be a hint at a GPU core to come?