Benchmarking Floating-Point Precision in Mobile GPUs
When we talk about GPU performance we usually talk about speed. In previous blogs we've talked about how many pixels per second you can put on the screen, how many triangles per second you can pretend to draw (don't ask), and most recently, how many floating-point operations per second you can issue. Talking about speed is great fun, and we love doing it, but it isn't the only interesting GPU performance metric; quality also counts. After all, it doesn't matter how fast you compute something if you get the wrong answer. In my next few blogs, I'll let go (temporarily) of my obsession with speed, and talk about benchmarking the quality of GPU floating-point arithmetic. I have a lot to cover, so this is unavoidably long. Save it up for when you can concentrate.
Using floating point arithmetic is tricky. A lot of programmers, even some very good ones, don't really understand it. Quite innocent-looking pieces of code can turn around and bite you, and it is often very hard to figure out why you're getting bitten. That's a problem even when you're coding for nice, well-behaved, IEEE-754-compliant CPUs. When you're targeting devices with more, shall we say, character (cough GPUs cough), it's tempting to assume that anything strange you see is the result of some flaw in how they are doing arithmetic. But that is not necessarily the case; it could just be that your intuitive notion of what the result should be... is wrong.
If you're going to do anything remotely edgy with floating-point - and certainly, quality benchmarking falls into that category - you'd better get used to thinking about exactly what's happening inside the floating-point unit, which means understanding how floating-point works in a lot more detail than, perhaps, you really wanted to. Get over it, hacker!
If you already know how floating-point works, you can skip this section; if you don't, there are excellent Wikipedia articles on IEEE-754 and single precision floating-point that you really ought to read. But for this blog, all you really need to know is this:
Your basic floating-point number consists of a sign bit n, a few exponent bits, and a few more significand bits. (Some people say mantissa instead of significand; I kind of like the sound of it "mantissa, mantissa" but these days it is considered retro. Oh well.) If I'm using a typical FP32 float, there will be 8 bits of exponent and 23 bits of significand. The exponent runs (logically) from -126 to +127; here I'll write the logical value as E. The significand is a binary fixed point number which I'll write as 1.sss..., and whose value is 1 + s1×2^-1 + s2×2^-2 + s3×2^-3 + ... (each s being a single bit). Finally, the value of the floating point number as a whole is given by
my_value = (-1)^n × 2^E × 1.sssssssssssssssssssssss
Since the significand has a finite number of bits, there is a limit to how precisely a number can be represented.
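If you like poking at these things directly, here's a minimal sketch in standard C (assuming your host float is IEEE-754 binary32, which is what virtually every CPU gives you) that pulls a float apart into the three fields just described. It isn't part of any benchmark; it's just a way to convince yourself that the layout really is sign, exponent, significand. Running it on 11.3125 reproduces the bit pattern used in the addition example below.
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 11.3125f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the float's raw bits */

    uint32_t sign        = bits >> 31;                            /* 1 sign bit      */
    int32_t  exponent    = (int32_t)((bits >> 23) & 0xFF) - 127;  /* 8 bits, bias 127 */
    uint32_t significand = bits & 0x7FFFFF;                       /* 23 fraction bits */

    printf("%.4f = (-1)^%u x 2^%d x 1.", f, (unsigned)sign, (int)exponent);
    for (int i = 22; i >= 0; --i)
        putchar((significand >> i) & 1 ? '1' : '0');
    putchar('\n');
    return 0;
}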
Suppose we want to add two numbers, say sixteen million and 11.3125:
16000000 = (-1)^0 × 2^23 × 1.11101000010010000000000
11.31250 = (-1)^0 × 2^3  × 1.01101010000000000000000
To add them, we first right-shift (aka denormalize) the significand of the smaller number to make the exponents equal. In this case, we have to shift by 20 bits:
11.31250 = (-1)^0 × 2^23 × 0.00000000000000000001011(010100...00)
... and then add the significands to get the result:
16000011 = (-1)^0 × 2^23 × 1.11101000010010000001011
...and finally renormalize if necessary, but in this case it isn't.
Note that some of the bits of the smaller number (the ones shown in parentheses above) got shifted off the end of the significand and fell on the floor, so our result is off by 0.3125; this is a common way to lose precision when you're doing floating-point arithmetic. The bigger the difference in the exponents of the two numbers you're adding, the more bits you lose.
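You can watch exactly this happen on an ordinary CPU. Here's a small sketch in standard C, again assuming the host float is IEEE-754 binary32, that performs the addition above and prints the error; the 0.3125 that fell on the floor shows up directly.
#include <stdio.h>

int main(void)
{
    float big   = 16000000.0f;
    float small = 11.3125f;
    float sum   = big + small;              /* low bits of 'small' are shifted off and lost */

    printf("%.4f + %.4f = %.4f\n", big, small, sum);            /* prints 16000011.0000 */
    printf("error = %.4f\n", (16000000.0 + 11.3125) - (double)sum);  /* prints 0.3125 */
    return 0;
}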
Now we're ready to start talking about floating-point on GPUs. I was originally inspired to tackle this subject by Stuart Russell's post on the Youi Labs site. He compared six mobile GPUs, plus a desktop card, and found some interesting things. I'll start by reviewing his results. I said earlier that floating-point is tricky, and that correct behavior can produce unintuitive results... and so it proved.
Stuart did his comparisons using a cleverly designed OpenGL ES 2.0 fragment (pixel) shader. My version is below; it's slightly different, but the modifications don't affect the results. His blog includes pictures of what the shader produces on each of the devices, and I highly recommend spending some time looking at them. There is a remarkable amount of variation in the results. These are all OpenGL ES 2.0 compliant devices, but OpenGL ES defines floating-point arithmetic quite loosely. That isn't a problem for normal graphics applications, but the test shader is deliberately designed to be sensitive to what happens in the darker corners of the floating-point range.
// Youi Labs GPU precision shader (slightly modified)
precision highp float;
uniform vec2 resolution;
void main( void )
{
    float y = ( gl_FragCoord.y / resolution.y ) * 26.0;
    float x = 1.0 - ( gl_FragCoord.x / resolution.x );
    float b = fract( pow( 2.0, floor(y) ) + x );
    if( fract(y) >= 0.9 )
        b = 0.0;
    gl_FragColor = vec4( b, b, b, 1.0 );
}
Like, what kind of variation? It turns out there are several different (and largely unrelated) things going on in Stuart's results. I'm going to start with the simplest: all of the images divide the screen into a number of horizontal bars, but the number of bars ranges from as few as ten to as many as twenty-three. Why?
In order to answer the question, we need to look at the test shader in some detail.
The shader above is run at every pixel on the image. The built-in input variable gl_FragCoord supplies the x and y pixel coordinates. The first line (variable y) of the function divides the image into 26 horizontal bars, where the integer part of y tells you which bar the current pixel is in (0 through 25), and the fractional part tells you how far up the bar it is. The second line (variable x) computes an intensity value that varies linearly from nearly 1.0 (white) at the left edge of the image, to nearly 0.0 (black) at the right edge. Lines 4 and 5 turn the top 10% of pixels in each bar black, to make it easy to count the bars.
The funny business happens in line 3:
float b = fract( pow( 2.0, floor(y) ) + x );
The built-in pow() function returns an integer: 2^0 in the first bar, 2^1 in the second, 2^2 in the third, and so on, reaching 2^25 in the last bar. That (integer) value is added to the intensity x, and then the integer part of the sum is thrown away by the fract() function.
We've seen what happens when you add floating-point numbers of different sizes: low-order bits of the smaller number get thrown away. So, when the shader throws away the integer part, what we're left with is the original intensity x, except that some of the low order bits have gotten lost; we lose 0 bits in the first bar, 1 bit in the second, and so on. As a result, the intensities get quantized into a smaller and smaller number of grey levels, and the nice smooth ramp becomes increasingly blocky. When the difference in exponents becomes equal to the number of bits in the significand, all of the bits of x are discarded, and we see no bar at all. Now, since x is always less than one, its floating-point exponent is at most -1; so if you crunch through a little third-grade arithmetic (is that when they introduce negative numbers?), you'll convince yourself that the number of non-black bars in the image is exactly the number of bits in the fractional part of the shader engine's floating-point significand. Cool!
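If you'd like to see that bit-chopping away from the GPU first, here's a rough CPU-side sketch in C of what line 3 does to a single pixel's intensity, using binary32 floats throughout. The intensity value 0.7308 is just an arbitrary example, and a real GPU may round slightly differently (more on that in a moment), but the trend is the same: the value survives untouched in the low bars, gets coarser and coarser as the bar number grows, and collapses to zero at bar 23.
#include <stdio.h>
#include <math.h>

int main(void)
{
    float x = 0.7308f;                        /* an arbitrary intensity in (0, 1) */

    for (int bar = 0; bar <= 25; ++bar) {
        float p = powf(2.0f, (float)bar);     /* pow( 2.0, floor(y) ) in the shader */
        float s = p + x;                      /* this is where the low bits of x get lost */
        float b = s - floorf(s);              /* fract() */
        printf("bar %2d: b = %.7f\n", bar, b);
    }
    return 0;
}
Counting the bars in the image is just the visual version of watching b hit zero in that loop.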
So, the first thing the images tell us is that different GPUs have different numbers of bits in their significands. There seem to be two distinct populations: the minimalists, providing only what OpenGL ES 2.0 requires, and the luxury models, providing something close to FP32. Let's consider them separately.
Two of the GPUs in the comparison take a minimalist approach: ARM's Mali-400 has a ten-bit significand, and NVIDIA's Tegra 3 has thirteen, both about half of what the other four GPUs provide. That's a big difference - what's going on here?
What's going on is that OpenGL ES 2.0 (or rather, the GLSL ES 1.0 shading language) defines three different kinds of floating-point numbers: highp, mediump, and lowp. The first kind (highp) has at least a seven-bit exponent and a sixteen-bit significand, while the second (mediump) has at least a five-bit exponent and a ten-bit significand. (The third kind (lowp) isn't actually floating-point at all; the minimal implementation is ten-bit fixed point, with eight bits of fractional precision.) It's important to realize that these are minimum values; an implementation is perfectly free to implement lowp as a 64-bit float, if it wants to.
It's even more important to realize that in OpenGL ES 2.0, support for highp precision in the fragment shader is optional. Mali-400 and Tegra 3 don't support highp; the other four GPUs do. Why the difference? The other four GPUs are unified shader architectures; they use the same compute engine for both vertex and fragment shading. OpenGL ES 2.0 requires highp support in the vertex shader; and since it has to be there for vertices, making it available for fragments as well adds little silicon area cost on those architectures. Mali-400 and Tegra 3 are non-unified shaders, meaning that they use separate compute engines for vertex and fragment shading. This allows them to optimize each engine for the task it has to do. Supporting highp is expensive in silicon area and power, and it isn't required by the standard, so throwing it out is sort of a no-brainer for these architectures. Well-written OpenGL ES 2.0 content doesn't need it and getting rid of it results in cores that are very, very efficient.
There's a lot more to know about writing code for GPUs that don't support highp; for a fuller discussion, see seanellis's blog post on the topic.
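Before we move on, if you want a rough numerical feel for what those precision classes guarantee, here's a hedged little C sketch that rounds a binary32 value to an n-bit significand as a crude stand-in for the mediump (ten-bit) and minimum highp (sixteen-bit) cases. Real hardware isn't obliged to round this way, and most implementations keep more bits than the minimum; this only illustrates how much resolution each class promises.
#include <stdio.h>
#include <math.h>

/* Round v to a significand with 'fraction_bits' bits after the leading 1.
 * This is only an emulation of the GLSL ES minimum precisions; real
 * implementations may keep more bits and round differently. */
static float round_to_bits(float v, int fraction_bits)
{
    int exp;
    float m = frexpf(v, &exp);                       /* v = m * 2^exp, 0.5 <= |m| < 1 */
    float scale = ldexpf(1.0f, fraction_bits + 1);   /* m's leading 1 sits at bit -1 */
    return ldexpf(roundf(m * scale) / scale, exp);
}

int main(void)
{
    float x = 0.7308f;
    printf("original            : %.7f\n", x);
    printf("highp   (>=16 bits) : %.7f\n", round_to_bits(x, 16));
    printf("mediump (>=10 bits) : %.7f\n", round_to_bits(x, 10));
    return 0;
}
With ten bits you get roughly three significant decimal digits, which is plenty for colour values but can bite you in calculations that need a wide dynamic range.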
Now let's look at the luxury models. In Stuart's result images, if you zoom in and count carefully, you'll see that Qualcomm's Adreno 225 has 21 bars, ARM's Mali-T604 has 22, and the Vivante and Imagination cores have 23. Does that mean that GC2000 and SGX544 have higher precision than Mali and Adreno?
I lost sleep over that question when Stuart's blog came out. Eventually, I noticed that the Mali-T604 image has a status bar at the top of the screen, in addition to the standard Android navigation bar at the bottom. The Adreno 225 image has a thicker one, and the GC2000 and SGX544 images have none. Hmm... Off to see Jesse, our resident Android hacker. It turns out that if you aren't careful, Android status bars can be composited over your allegedly full-screen app; maybe they were covering up some of the bars? OK, I'll admit it, that's the real reason I re-implemented Stuart's shader. I just had to know!
Figure 1 shows the result of running the shader on a Mali-T604-powered Nexus 10, and on a Samsung Galaxy SIII (US edition), which uses Qualcomm's Adreno 225. (We used GL coordinates in our implementation instead of DX coordinates, so our images are upside-down relative to Stuart's; if that bothers you, try standing on your head while you look at them.) What the images show, if you don't feel like counting the bars, is that these two GPUs do indeed have 23 fractional bits in their significands, just like the Imagination and Vivante cores.
That is: all of these GPUs offer exactly the same raw precision.
Figure 1: Test shader running on Mali-T604 (Nexus 10, left) and Adreno 225 (Samsung Galaxy SIII, right)
We've settled the question of what the number of bars in Stuart's images tells you: it's the number of fractional bits in the fragment shader significand. Mali-400 has ten, as you'd expect from a device that uses IEEE-754 half precision (binary16) as its floating-point type. Adreno 225, GC4000, Mali-T604, and SGX544 all provide twenty-three, suggesting that they provide something close to IEEE-754 single precision (binary32). The Tegra 3 significand has thirteen fractional bits, which as far as I know is unique to NVIDIA.
But if you look at Stuart's images, the number of bars isn't the first thing you notice. The thing that jumps out and bites you is that the bars are organized into patterns with quite different shapes. Some, like the Mali-T604 in figure 1 above, form a symmetrical bowl or beehive shape; others, like the Adreno 225, hug the left edge of the image and curve away to the right; and the Imagination SGX544 does something completely sui generis. What's going on here? The answers turn out to be pretty interesting, but this blog is too long already, so let's call it a day. In the next installment, we'll explore the differences and see what they tell us about these GPUs.
Part 2 of this series is now available. Read it by clicking on the link below.
[CTAToken URL = "https://community.arm.com/graphics/b/blog/posts/benchmarking-floating-point-precision-in-mobile-gpus---part-ii" target="_blank" text="Read Part 2 here" class ="green"]