Strange arithmetic performance on G76 with/without explicit mad()

Hi!

I've recently encountered a peculiar behaviour regarding explicit mad() in a fragment shader. We compile shaders using DXC.

The line in question is

#define FAST_GAMMA_2_LINEAR( COLOR ) ( COLOR * (COLOR * (COLOR * 0.305306011 + 0.682171111) + 0.012522878) )

when changing the code to

#define FAST_GAMMA_2_LINEAR( COLOR ) ( COLOR * mad(COLOR, mad(COLOR, 0.305306011, 0.682171111), 0.012522878) )

all shaders that use the code become slower.
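For reference, both macro variants evaluate the same cubic polynomial in Horner form; only the way the multiply-adds are spelled differs. The sketch below is a minimal Python illustration, assuming (from the macro name and coefficients) that the polynomial approximates the sRGB decode curve. The function names are hypothetical, and double precision is used here rather than the half precision of the shader.

```python
def fast_gamma_to_linear(c: float) -> float:
    """Horner-form cubic with the coefficients from FAST_GAMMA_2_LINEAR."""
    return c * (c * (c * 0.305306011 + 0.682171111) + 0.012522878)


def srgb_to_linear(c: float) -> float:
    """Reference piecewise sRGB decode (assumed to be the curve being approximated)."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


if __name__ == "__main__":
    # Print the approximation next to the reference curve for a few inputs.
    for c in (0.0, 0.25, 0.5, 0.75, 1.0):
        approx = fast_gamma_to_linear(c)
        exact = srgb_to_linear(c)
        print(f"{c:.2f}  approx={approx:.6f}  exact={exact:.6f}")
```

Each nested `x * y + z` step in the polynomial is a candidate for a single multiply-add instruction, which is what the explicit mad() version asks for directly.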

When running SPIR-V through malioc:

1. I see that

Work registers: 31
Uniform registers: 14
Stack spilling: false
16-bit arithmetic: 32%

                                A      LS       V       T    Bound
Total instruction cycles:    2.80    1.00    4.00    1.00        V
Shortest path cycles:        0.67    0.00    1.25    0.50        V
Longest path cycles:         2.80    1.00    4.00    1.00        V

//----------------//

         %98 = OpFConvert %v4half %97
         %99 = OpVectorShuffle %v3half %98 %98 0 1 2
        %100 = OpVectorTimesScalar %v3half %99 %half_0x1_38cpn2
        %101 = OpFAdd %v3half %100 %54
        %102 = OpFMul %v3half %99 %101
        %103 = OpFAdd %v3half %102 %56
        %104 = OpFMul %v3half %99 %103

becomes

Work registers: 32
Uniform registers: 12
Stack spilling: false
16-bit arithmetic: 27%

                                A      LS       V       T    Bound
Total instruction cycles:    2.97    1.00    4.00    1.00        V
Shortest path cycles:        0.71    0.00    1.25    0.50        V
Longest path cycles:         2.97    1.00    4.00    1.00        V

//----------------//

        %101 = OpFConvert %v4half %100
        %102 = OpVectorShuffle %v3half %101 %101 0 1 2
         %36 = OpExtInst %v3half %1 Fma %102 %59 %61
         %37 = OpExtInst %v3half %1 Fma %102 %36 %63
        %103 = OpFMul %v3half %102 %37

2. There are now two new decorations for related registers:

               OpDecorate %36 NoContraction
               OpDecorate %37 NoContraction

I was under the impression that using mad() explicitly would be beneficial, since it is a direct hint to the compiler/driver about our intentions. But it looks like even the DXC compiler avoids explicit FMAs. The performance loss and the ALU cost increase are observed in shader variants with either explicit half-precision types or RelaxedOps.

Top replies

Peter Harris over 3 years ago +1 verified

With the current compiler, using any NoContraction decoration disables a number of optimizations globally. This is something we're looking to improve, but it is the reason for the slowdown you are seeing...
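As background on what the decoration controls: contraction means combining a multiply and an add into one operation that rounds only once, which can change the result slightly. The sketch below is a minimal Python illustration of that rounding difference, not a model of the Mali compiler. It emulates a fused multiply-add in double precision using exact rationals; the helper names are hypothetical, and the shader itself works in half precision.

```python
from fractions import Fraction


def mul_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c


def fma_exact(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: a*b + c computed exactly, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def horner_unfused(c: float) -> float:
    """The polynomial with separate multiplies and adds."""
    return c * mul_add(c, mul_add(c, 0.305306011, 0.682171111), 0.012522878)


def horner_fused(c: float) -> float:
    """The polynomial with each multiply-add fused, as in the mad() variant."""
    return c * fma_exact(c, fma_exact(c, 0.305306011, 0.682171111), 0.012522878)


if __name__ == "__main__":
    # a * b is exactly 1 - 2**-60, which rounds to 1.0 as a double.
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    print(mul_add(a, b, -1.0))    # unfused: the small term is lost, result is 0.0
    print(fma_exact(a, b, -1.0))  # fused: the small term survives, result is -2**-60
    print(horner_fused(0.5) - horner_unfused(0.5))
```

For the polynomial in this thread the two forms agree to within rounding noise, so the difference that matters in practice is the one described in the reply: how the compiler treats a shader once the decoration is present.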