Hi!
I've recently encountered a peculiar behaviour regarding explicit mad() in a fragment shader. We compile shaders using DXC.
The line in question is
#define FAST_GAMMA_2_LINEAR( COLOR ) ( COLOR * (COLOR * (COLOR * 0.305306011 + 0.682171111) + 0.012522878) )
When I change the code to
#define FAST_GAMMA_2_LINEAR( COLOR ) ( COLOR * mad(COLOR, mad(COLOR, 0.305306011, 0.682171111), 0.012522878) )
all shaders that use the code become slower.
When running SPIR-V through malioc:
1. I see that
Work registers: 31
Uniform registers: 14
Stack spilling: false
16-bit arithmetic: 32%

                              A      LS       V       T   Bound
Total instruction cycles:  2.80    1.00    4.00    1.00       V
Shortest path cycles:      0.67    0.00    1.25    0.50       V
Longest path cycles:       2.80    1.00    4.00    1.00       V

//----------------//

%98 = OpFConvert %v4half %97
%99 = OpVectorShuffle %v3half %98 %98 0 1 2
%100 = OpVectorTimesScalar %v3half %99 %half_0x1_38cpn2
%101 = OpFAdd %v3half %100 %54
%102 = OpFMul %v3half %99 %101
%103 = OpFAdd %v3half %102 %56
%104 = OpFMul %v3half %99 %103
becomes
Work registers: 32
Uniform registers: 12
Stack spilling: false
16-bit arithmetic: 27%

                              A      LS       V       T   Bound
Total instruction cycles:  2.97    1.00    4.00    1.00       V
Shortest path cycles:      0.71    0.00    1.25    0.50       V
Longest path cycles:       2.97    1.00    4.00    1.00       V

//----------------//

%101 = OpFConvert %v4half %100
%102 = OpVectorShuffle %v3half %101 %101 0 1 2
%36 = OpExtInst %v3half %1 Fma %102 %59 %61
%37 = OpExtInst %v3half %1 Fma %102 %36 %63
%103 = OpFMul %v3half %102 %37
2. There are now two new decorations for the related result IDs:
OpDecorate %36 NoContraction
OpDecorate %37 NoContraction
I was under the impression that manually using mad() could be more beneficial, since it would be a direct hint to the compiler/driver about our intentions. But it looks like even the DXC compiler avoids explicit FMAs. The performance loss and ALU cost increase are observed in shader variants with either explicit half precision types or RelaxedOps.
With the current compiler, using any NoContraction decoration disables a number of optimizations globally. This is something we're looking to improve, but it is the reason for the slowdown you are seeing.
We'd suggest preprocessing the SPIR-V to remove the NoContraction decoration, unless you really need the invariant output for multiple FMA operations.
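A minimal sketch of that preprocessing step, assuming the module has first been converted to textual assembly (for example with spirv-dis from SPIRV-Tools) and is reassembled afterwards. The function name strip_no_contraction is hypothetical, and the sketch only handles plain `OpDecorate %id NoContraction` lines; it does not touch OpMemberDecorate or decoration groups.

```python
import re

# Matches a whole line of the form "OpDecorate %<id> NoContraction",
# optionally followed by a ";" comment, as it appears in SPIR-V assembly text.
_NO_CONTRACTION = re.compile(r"^\s*OpDecorate\s+%\S+\s+NoContraction\s*(;.*)?$")


def strip_no_contraction(spirv_asm: str) -> str:
    """Return the assembly text with every NoContraction decoration line removed.

    Hypothetical helper: it works on disassembled text, not on the binary module.
    """
    kept = [line for line in spirv_asm.splitlines()
            if not _NO_CONTRACTION.match(line)]
    # Preserve a trailing newline if the input had one.
    return "\n".join(kept) + ("\n" if spirv_asm.endswith("\n") else "")


if __name__ == "__main__":
    sample = (
        "OpDecorate %36 NoContraction\n"
        "OpDecorate %37 NoContraction\n"
        "%36 = OpExtInst %v3half %1 Fma %102 %59 %61\n"
    )
    print(strip_no_contraction(sample), end="")
```

Removing the decoration allows the compiler to reassociate or contract the affected operations, so results may differ slightly between shader variants; that is the invariance trade-off mentioned above.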
Hi Peter! Thanks, I'll try manually removing NoContraction and see if it benefits us. On a related note, is there a benefit to replacing
float _FromValue;
float _ToValue;
lerpResult = lerp( _FromValue, _ToValue, t );
with
float _FromValue;
float _ToValueMinusFromValue;
lerpResult = mad( t, _ToValueMinusFromValue, _FromValue );
?
Does lerp() internally generate a ( b - a ) evaluation, or is there some kind of HW optimization for lerps?
*Assuming that _FromValue, _ToValue and _ToValueMinusFromValue are external values from interpolators/cbuffers that were calculated in advance.
There is no dedicated lerp instruction; it will just end up as normal maths ops.
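To make "normal maths ops" concrete, here is a small sketch in Python rather than HLSL, so the arithmetic can be run directly. It assumes lerp expands as documented for HLSL, x + s * (y - x); the function names below are stand-ins for illustration and say nothing about what the Mali compiler actually emits.

```python
def lerp(from_value: float, to_value: float, t: float) -> float:
    # Documented HLSL formula: x + s * (y - x).
    # One subtract, then one multiply-add.
    return from_value + t * (to_value - from_value)


def mad(m: float, a: float, b: float) -> float:
    # Stand-in for HLSL mad(m, a, b) = m * a + b (no fused rounding here).
    return m * a + b


def lerp_precomputed(from_value: float, to_minus_from: float, t: float) -> float:
    # The (to - from) difference is assumed to be computed ahead of time,
    # e.g. on the CPU, so only the multiply-add remains.
    return mad(t, to_minus_from, from_value)


if __name__ == "__main__":
    a, b, t = 1.0, 3.0, 0.25
    print(lerp(a, b, t), lerp_precomputed(a, b - a, t))
```

Under that assumption, the precomputed form saves one subtraction per lerp. Whether that is measurable depends on the shader, and writing it with an explicit mad() is the same pattern that produced the Fma and NoContraction output earlier in the thread, so t * _ToValueMinusFromValue + _FromValue may be the safer spelling.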
Sorry to resurrect such an old thread, but due to some technical issues I wasn't able to reply the same day. I tried patching out NoContraction from SPIR-V and that did not help the performance.