Benchmarking floating-point precision in mobile GPUs - Part II

September 11, 2013

10 minute read time.

This is the second in a series of blogs about floating-point quality in GPUs. In part I, I claimed that a lot of programmers don't really understand floating-point numbers, and argued that if you're going to use it for anything remotely tricky, you'd better be prepared to learn how it works in more detail than you probably wanted to. I explained Stuart's test, and showed that it reveals how many bits of floating-point precision are used in the GPU fragment shader. That was good fun, but the test has other interesting things to tell us. In this installment, I'll talk about those.

The test, and the results

Stuart's test program uses a special fragment shader to compute a gray-scale intensity value at every pixel on the screen. My version is shown here, as a reminder.

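precision highp float;
uniform vec2 resolution;
void main( void )
{
  // Split the screen vertically into 26 bars; floor(y) is the bar index.
  float y = ( gl_FragCoord.y / resolution.y ) * 26.0;
  // Ideal grey ramp: 1.0 at the left edge, 0.0 at the right.
  float x = 1.0 - ( gl_FragCoord.x / resolution.x );
  // Add 2^bar, then keep only the fractional part of the sum.
  float b = fract( pow( 2.0, floor(y) ) + x );
  // Leave a black separator in the last tenth of each bar.
  if(fract(y) >= 0.9)
    b = 0.0;
  gl_FragColor = vec4(b, b, b, 1.0 );
}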

Box 1: Youi Labs GPU precision shader (slightly modified)

In my previous post, I went over the code in detail, so here I'll just summarize: the shader draws a series of 26 horizontal bars. The gray value for each bar is, ideally, a linear ramp from 1.0 (white) on the left side to 0.0 (black) on the right. However, the gray value is corrupted by first adding it to 2^B (where B is the index of the bar the pixel is in), and then throwing away the integer part of the sum. This reduces the precision of the gray value by one bit in each successive bar, causing the ramps to become increasingly blocky. Eventually, all the bits are thrown away and the bar becomes completely black.
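If you'd like to check that summary without a GPU, here is a minimal Python sketch of the same arithmetic. It is only an approximation of what a shader does. It assumes a 32-bit float with round-to-nearest-even (which is how Python's `struct` module converts a double to a float32), it assumes `pow(2.0, B)` is exact, and the names `to_float32` and `bar_levels` are invented for this illustration.

```python
import struct

def to_float32(value):
    """Round a Python float (a double) to the nearest float32, ties to even."""
    return struct.unpack("f", struct.pack("f", value))[0]

def bar_levels(bar, samples=4096):
    """Distinct grey levels in one bar of the Box 1 shader, simulated in float32."""
    levels = set()
    for k in range(samples):
        x = k / samples                      # ramp value; exact in binary
        total = to_float32(2.0 ** bar + x)   # the addition that loses low bits
        levels.add(total - int(total))       # fract(): discard the integer part
    return levels

# Count the distinct grey levels that survive in each of the 26 bars.
level_counts = [len(bar_levels(bar)) for bar in range(26)]

# A bar is completely black when 0.0 is the only level left.
non_black_bars = sum(1 for count in level_counts if count > 1)

for bar, count in enumerate(level_counts):
    print(f"bar {bar:2d}: {count} grey level(s)")
```

Under these assumptions the number of levels halves from one bar to the next once the ramp's resolution is exhausted, and the count of non-black bars matches the number of fractional bits in the significand.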

In his blog, Stuart published pictures of the images this shader draws for six mobile GPUs and one high-end desktop graphics card. The images vary in two basic ways. One is just the number of non-black bars; as we saw last time, that number turns out to equal the number of fractional bits in the shader engine's floating-point significand. The other is perhaps more striking: the bars make quite different patterns on the screen. That's the issue I want to talk about here.

When we look at the images, there seem to be two distinct populations: one group, consisting of the Nvidia Tegra 3, Vivante GC4000, and Qualcomm Adreno 225, produces bars that are white all the way to the left edge of the screen, but that trail off to the right. The resulting shape reminds me of a killer whale's dorsal fin, so I'll call this the "orca" pattern (see figure 1). The other group, consisting of the NVIDIA desktop GPU and the two ARM Mali devices, produces a symmetrical pattern which I'll call the "beehive" shape (see figure 2). (The Imagination SGX544 does something slightly different, but seems to be in the beehive camp as well.) What do these shapes tell us? Is one better than the other?

Figure 1: "Orca" pattern (Huawei Ascend D1 / Vivante GC4000)
Figure 2: "Beehive" pattern (Nexus 10 / Mali-T604)

In his blog, Stuart equates good floating-point quality with having a lot of bars that are white all the way to the left edge of the screen. So, he really likes the "orca" GPUs, and isn't impressed with the "beehive" camp. In particular, he says:

"The drift from the left edge indicates error in calculation (areas that should be white are black), which would translate into undesirable visual glitches if not accounted for."

Is he right? To find out, we'll have to look at what's going on inside the GPU's floating-point units when the shader is running; but before we do that, we have to dive a little deeper into how floating-point works.

More detail than you really wanted, part 2

In part I of this series, I gave a quick introduction to a generic single-precision floating-point format with eight bits of exponent and twenty-four bits (including the hidden bit) of significand. I ended with an example of what happens when you add two numbers of different magnitude, say eight million and 11.3125. We start with this:

(-1)⁰ x 2²² x 1.11101000010010000000000 = 8000000.0

(-1)⁰ x 2³ x 1.01101010000000000000000 = 11.3125

and align the binary points by shifting the smaller number nineteen bits to the right. After we do that, the smaller number no longer has the usual '1' bit to the left of the binary point, so we say that it is denormalized. The numbers we want to add now look like this:

(-1)⁰ x 2²² x 1.11101000010010000000000

(-1)⁰ x 2²² x 0.00000000000000000010110(1010...0)

and the sum is obviously

(-1)⁰ x 2²² x 1.11101000010010000010110(1010...0) = 8000011.3125

Notice that the red bits (the ones shown in parentheses here) don't fit into the significand anymore. The question is, what should we do with them? The easiest thing is just to drop them on the floor; in the numerics business, that's called round-toward-zero (RTZ) or truncation. It is equivalent to pretending the red bits are all zero, even if they aren't. Converting ones into zeros introduces error; in this case, rounding toward zero gives us

(-1)⁰ x 2²² x 1.11101000010010000010110 = 8000011.0

and a total error of 0.3125. If you think about it, the worst-case error occurs when all the red bits started out as ones, at which point the error we're introducing into the significand is

2^-24 + 2^-25 + 2^-26 + ...

or about 2^-23.
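To make the truncation step concrete, here is a small Python sketch that redoes the example with exact rational arithmetic. It is an illustration of round-toward-zero on a 24-bit significand, not a model of any particular GPU; `floor_log2` and `round_toward_zero` are hypothetical helper names, and only positive values are handled.

```python
from fractions import Fraction

SIGNIFICAND_BITS = 24  # includes the hidden bit, as in the generic format above

def floor_log2(value):
    """Return e such that 2**e <= value < 2**(e + 1), for a positive Fraction."""
    e = value.numerator.bit_length() - value.denominator.bit_length()
    if Fraction(2) ** e > value:
        e -= 1
    return e

def round_toward_zero(value):
    """Truncate a positive exact value to a 24-bit significand (RTZ)."""
    value = Fraction(value)
    # Spacing between neighbouring representable values at this magnitude.
    ulp = Fraction(2) ** (floor_log2(value) - (SIGNIFICAND_BITS - 1))
    return (value // ulp) * ulp  # drop everything below the last kept bit

exact_sum = Fraction(8000000) + Fraction(11.3125)
rtz_sum = round_toward_zero(exact_sum)
rtz_error = exact_sum - rtz_sum

# The same error measured in units of the significand (divide by 2**22).
significand_error = rtz_error / Fraction(2) ** 22

print(float(rtz_sum), float(rtz_error))
```

The error measured against the significand stays below 2^-23, in line with the worst-case bound above.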

If we're willing to work just a little harder, we can do better. Instead of dropping the red bits, we can round them up or down to whichever 24-bit significand value is closer. That turns out to be easy: if the first red bit is zero, we truncate (round down) as above. If it's one, and at least one other red bit is a one, we round up. In the example above, our ideal sum

(-1)⁰ × 2²² × 1.11101000010010000010110(1010...0) = 8000011.3125

is rounded up to

(-1)⁰ x 2²² x 1.11101000010010000010111 = 8000011.5

for a total error of 0.1875, quite a bit better than the round-toward-zero result. If the first red bit is a one, and no other red bit is, we're exactly halfway between two representable values; what do we do then? Various tie-breaking rules are possible; the preferred one (and the required default for IEEE-754-2008) is to round whichever way will produce a zero in the least significant bit of the significand. This is called round-to-nearest-even (RNE). If we use this rule (or any other round-to-nearest rule), the worst-case error is 2^-24 rather than 2^-23. That may not sound like much improvement, but think about it: using RNE instead of RTZ cuts the worst-case error in half. That's a big deal; it's almost like getting an extra bit of precision for free.
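The rounding rule, including the tie-break, can be sketched in a few lines of Python. This is a simplified illustration: it assumes the value lies between 2²² and 2²³, so that representable values are 0.5 apart, and `round_to_nearest_even` is a name made up for the example rather than a library function.

```python
from fractions import Fraction

ULP = Fraction(1, 2)  # spacing of 24-bit values between 2**22 and 2**23

def round_to_nearest_even(value):
    """Round a positive exact value to a multiple of ULP, ties to even (RNE)."""
    value = Fraction(value)
    steps = value // ULP              # significand as an integer, truncated
    remainder = value - steps * ULP   # the bits that do not fit
    half = ULP / 2
    if remainder > half:
        steps += 1                    # closer to the value above: round up
    elif remainder == half and steps % 2 == 1:
        steps += 1                    # exact tie: make the last bit zero
    return steps * ULP

exact_sum = Fraction(8000011.3125)
rne_sum = round_to_nearest_even(exact_sum)
rne_error = abs(rne_sum - exact_sum)

# Two exact ties: one resolves downward, the other upward.
tie_down = round_to_nearest_even(8000011.25)
tie_up = round_to_nearest_even(8000011.75)

print(float(rne_sum), float(rne_error), float(tie_down), float(tie_up))
```

The two tie cases go in opposite directions, which is what keeps the rule from leaning consistently one way.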

Round-up Time

What does all this have to do with the orcas and the beehives in Stuart Russell's images? His shader (see Box 1 above) does more or less what we did in the examples in the previous section: it adds a series of ever-larger integers to a set of grey values between 1.0 and 0.0, causing an ever-greater loss of precision. Let's consider what happens in the 23rd bar, where we're adding the grey value to 2²². The power of two is represented as

(-1)⁰ x 2²² x 1.00000000000000000000000 = 4194304.0

The next largest value we can represent in our floating-point number system is

(-1)⁰ x 2²² x 1.00000000000000000000001 = 4194304.5

and the next largest one is

(-1)⁰ x 2²² × 1.00000000000000000000010 = 4194305.0

The grey value we're adding to 2²² is between zero and one, so clearly the floating-point unit is going to have to round the sum to one of these three values. After the addition, the shader throws away the integer part of the sum, so we're going to be left with one of only two possible results: 0.0, or 0.5.

A GPU using RTZ always rounds positive numbers down. So, if the gray value is less than 0.5, the sum will be rounded down to 4194304.0, and we'll end up with an output grey value of 0.0. If the gray value is greater than 0.5, the sum will be rounded (down again) to 4194304.5, and we'll end up with an output value of 0.5. Looking at the topmost visible bar in Figure 1, that's exactly what we see; the right half of the bar (initial grey values less than 0.5) becomes black, and the left half (initial values greater than 0.5) becomes 50% grey. The "orca" GPUs are using round-toward-zero!

A GPU using RNE, on the other hand, will round the sum to the nearest value it can represent. When the grey value is less than 0.25, the sum will be rounded down to 4194304.0, producing black. When it is between 0.25 and 0.75, the sum will be rounded to 4194304.5, producing 50% grey. When the grey value is above 0.75, the sum will be rounded up to 4194305.0, which corresponds logically to white; however, when the integer part of the sum is discarded, we'll end up with black again. That's what produces the "drift from the left edge" that Stuart refers to in his blog, and that we see in Figure 2. The "beehive" GPUs are using round-to-nearest.
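Here is a rough Python sketch of the 23rd bar under both rounding rules, using exact fractions so that Python's own rounding doesn't interfere. The helpers `rtz`, `rne` and `shader_output` are illustrative names, and the fixed spacing of 0.5 is only valid for sums between 2²² and 2²³.

```python
from fractions import Fraction

ULP = Fraction(1, 2)       # spacing of representable values in [2**22, 2**23]
BASE = Fraction(2) ** 22   # the power of two added in the 23rd bar

def rtz(value):
    """Round toward zero: keep the multiple of ULP at or below the value."""
    return (value // ULP) * ULP

def rne(value):
    """Round to nearest, ties to an even last bit."""
    steps = value // ULP
    remainder = value - steps * ULP
    if remainder > ULP / 2 or (remainder == ULP / 2 and steps % 2 == 1):
        steps += 1
    return steps * ULP

def shader_output(grey, rounding):
    """Box 1 arithmetic for one pixel: fract(2**22 + grey) with a given rounding."""
    total = rounding(BASE + Fraction(grey))
    return float(total - total // 1)   # fract(): discard the integer part

for grey in (0.1, 0.3, 0.6, 0.8):
    print(grey, shader_output(grey, rtz), shader_output(grey, rne))
```

The grey value 0.8 is the interesting case: round-to-nearest carries the sum up to 4194305.0, and the shader's `fract` then turns that into black.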

To make visualizing this a little easier, we can modify the shader so that it preserves the grey value of 1.0 that results when the sum is rounded up to an integer. Box 2 shows the code, and figure 3 shows the result of running it on another "beehive" GPU, an AMD desktop part (Radeon HD3650). Compared to figure 2, the bars now extend all the way to the left edge of the image, and there's an extra twenty-fourth bar corresponding to that "extra bit of precision" that round-to-nearest (sort of) gives us.

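precision highp float;
uniform vec2 resolution;
void main( void )
{
  // Split the screen vertically into 26 bars; floor(y) is the bar index.
  float y = ( gl_FragCoord.y / resolution.y ) * 26.0;
  // Ideal grey ramp: 1.0 at the left edge, 0.0 at the right.
  float x = 1.0 - ( gl_FragCoord.x / resolution.x );
  float p = pow( 2.0, floor(y) );
  // Subtract 2^bar back out instead of using fract(), so a sum that
  // rounds up to the next integer yields 1.0 rather than 0.0.
  float b = ( p + x ) - p;
  // Leave a black separator in the last tenth of each bar.
  if(fract(y) >= 0.9)
    b = 0.0;
  gl_FragColor = vec4(b, b, b, 1.0 );
}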

Box 2: Precision shader modified to produce output in range [0.0, 1.0]

Figure 3: Shader modified to allow grey levels in range [0.0, 1.0]

Looking at pictures is fun, but in this case the difference is easier to see if we just plot the input and output grey values for the top few bars, for both "orca" and "beehive" GPUs.

Figure 4 shows what you get. (What you're seeing is exactly the same data as in Figures 1 and 3, at least for bars 22-24; we're just viewing it as a graph, rather than as a grey value.) What do we see? The RNE output is a better approximation to the input than the RTZ output; also, its average error is zero, while the RTZ output has a bias (i.e., a non-zero average error).

Still not convinced? In figure 5 I've plotted the error in the RTZ and RNE curves — that is, the absolute value of the difference between output and input. If you study them a bit, and integrate the area under the curves in your head, you'll be pleased (but not surprised!) to discover that on average, the RNE method produces exactly half the error of the RTZ method.
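If integrating in your head isn't appealing, the comparison can be approximated numerically. The Python sketch below samples the 23rd bar at 1,024 evenly spaced grey values and uses the Box 2 arithmetic, `(p + x) - p`, so that a round-up to the next integer counts as 1.0. The sample count and the helper names (`rtz`, `rne`, `error_stats`) are choices made for this example only.

```python
from fractions import Fraction

ULP = Fraction(1, 2)       # spacing of representable values in [2**22, 2**23]
BASE = Fraction(2) ** 22   # the power of two added in the 23rd bar
SAMPLES = 1024             # evenly spaced grey values in [0, 1)

def rtz(value):
    """Round toward zero."""
    return (value // ULP) * ULP

def rne(value):
    """Round to nearest, ties to an even last bit."""
    steps = value // ULP
    remainder = value - steps * ULP
    if remainder > ULP / 2 or (remainder == ULP / 2 and steps % 2 == 1):
        steps += 1
    return steps * ULP

def error_stats(rounding):
    """Return (bias, mean absolute error) of output minus input over the bar."""
    signed = []
    for k in range(SAMPLES):
        grey = Fraction(k, SAMPLES)
        output = rounding(BASE + grey) - BASE   # Box 2 style: (p + x) - p
        signed.append(output - grey)
    bias = sum(signed) / SAMPLES
    mean_abs = sum(abs(e) for e in signed) / SAMPLES
    return bias, mean_abs

rtz_bias, rtz_mean_abs = error_stats(rtz)
rne_bias, rne_mean_abs = error_stats(rne)

print("RTZ bias:", float(rtz_bias), "mean |error|:", float(rtz_mean_abs))
print("RNE bias:", float(rne_bias), "mean |error|:", float(rne_mean_abs))
```

With a finite number of samples the ratio of the two mean errors lands just above one half rather than exactly on it; the exact factor of two is the continuous limit that the figure depicts.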

Figure 4: Graph of RNE output

Whose GPU has the highest quality floating-point unit?

Now we can finally answer the question: What do the shapes in Stuart's images tell us about floating-point quality in the GPUs he tested? In his view, they mean that the RTZ GPUs (specifically, Vivante GC4000 and Qualcomm Adreno 225) produced the highest quality output. But in fact, the opposite is true: GPUs that perform RNE rounding, such as ARM's Mali-T604, produce more accurate results and lower error. That's why round-to-nearest-even is specified as the default rounding method in IEEE-754-2008. Stuart is welcome to prefer the orca shape over the beehive; but it'll have to be on the grounds of personal taste, not quality.

What next?

What I really love about Stuart's shader is the way it converts fairly esoteric details of floating-point behaviour into striking visual images. Can we write shaders that do something similar for other dark corners of IEEE-754? We can! Next time, we'll peek over the edge of the dreaded Zero Hole, and look at a shader that tells you whether your GPU has what it takes to fill it. Until then — want to tell me why directed rounding really is better than round-to-nearest? Want to mount an impassioned defence of round-to-nearest-odd? Get in touch.

David McQuillan over 12 years ago

The big reason for using round to nearest, and in fact go to trouble even for the half way case of doing that in a fair manner each way, is that with a chain of calculations the error tends to grow as the square root of the number of calculations rather than growing linearly because of the bias.

In practice it isn't quite that good because of, for instance, rounded constants being used, but if for instance a shape is rotated by iteratively applying a rotation matrix rather than from scratch each time, then this can mean the difference between artifacts becoming visible in a couple of seconds or only after half an hour or so. Of course one wouldn't do that in graphics, but when using GPUs for computation, having long chains of computations is practically the whole point.
david moloney over 12 years ago

Hi Tom

Movidius takes the same view on FP fidelity as ARM, albeit our focus is on DSP and computer vision rather than graphics rendering.

This has important implications for DSP and CV algorithms.

For instance, the designers of IBM's Cell SPE also made the decision to implement RTZ rather than RNE, with important consequences for FFT numerical performance:

http://www.fftw.org/cell/

“The SPEs are fully IEEE-754 compliant in double precision.

In single precision, they only implement round-towards-zero as opposed to the standard round-to-even mode. (The PPE is fully IEEE-754 compliant like all other PowerPC implementations.)

Because of the rounding mode, FFTW is less accurate when running on the SPEs than on the PPE.

The accuracy loss is hard to quantify in general, but as a rough guideline, the L2 norm of the relative roundoff error for random inputs is 4-8 times larger than the corresponding calculation in round-to-even arithmetic.

In other words, expect to lose 2 to 3 bits of accuracy.

FFTW currently does not use any algorithm that degrades accuracy to gain performance on the SPE.

One implication of this choice is that large 1D transforms run slower than they would if we were willing to sacrifice another bit or so of accuracy.”

Regards,

-David
