I have a Unity shader that uses a multi_compile keyword. I am trying to replace the keyword with uniform flow-control in order to reduce the number of variants.
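For readers following along, the two approaches being compared might look roughly like the sketch below. The actual shader is not shown in this thread, so every name here (`USE_TINT`, `_UseTint`, `_Tint`, and both function names) is hypothetical, and the fragment is meant to sit inside a CGPROGRAM block.

```hlsl
// Hypothetical names throughout: USE_TINT, _UseTint, _Tint.

// Approach A: keyword. Unity compiles one variant with USE_TINT and one without.
#pragma multi_compile _ USE_TINT

sampler2D _MainTex;
half4 _Tint;
half _UseTint;   // Approach B: uniform set from script to 0 or 1

half4 ShadeWithKeyword(float2 uv)
{
    half4 col = tex2D(_MainTex, uv);
#if defined(USE_TINT)
    col *= _Tint;            // only present in the USE_TINT variant
#endif
    return col;
}

half4 ShadeWithUniform(float2 uv)
{
    half4 col = tex2D(_MainTex, uv);
    if (_UseTint > 0.5)      // same outcome for every thread in a draw call
    {
        col *= _Tint;
    }
    return col;
}
```

With approach B there is a single compiled shader, and the choice is made at run time by the value uploaded to `_UseTint`.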
I have 4 questions.
Q1: I cannot understand the output of MALIOC (Mali-G71).
Arithmetic cycles of the fragment shader (in all cases Total Cycles == Shortest Path Cycles == Longest Path Cycles):
- Without the keyword: 7.50
- With the keyword: 7.65
- Uniform flow-control: 7.50
It seems to me that MALIOC reports the cycles of the shader with uniform flow-control by assuming a uniform value, and thus only counts the cycles of one path.
If the instructions of both paths were executed, the cycle count should be much higher.
Q2: Is uniform flow-control as terrible as described here? https://developer.arm.com/documentation/101897/0200/shader-code/uniform-control-flow
Q3: May we assume that the driver optimises the shader on-the-fly based upon the uniform value, so that only one branch will be executed (I guess not)?
Q4: Which GPU counters should I check in Streamline for the potential problems of uniform flow-control? According to my experiment, the "Diverged instructions" count is almost zero in all cases.
Based on what you're showing it looks like the compiler is just managing to statically optimize out all runtime control flow. This is possible in some cases of small conditional blocks being entirely replaced with inlined conditional selects. Mali Offline Compiler won't assume any specific uniform value.
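As a source-level picture of what replacing a small conditional block with a conditional select means, the two functions below compute the same result. This is only a sketch under assumed names (`_Tint`, `_UseTint`, `TintBranch`, `TintSelect` are hypothetical); the real transformation happens inside the compiler, and the instructions it emits are not shown in this thread.

```hlsl
// Hypothetical uniforms, for illustration only.
half4 _Tint;
half _UseTint;   // 0 or 1

// Written as a branch: the multiply runs only when the condition holds.
half4 TintBranch(half4 col)
{
    if (_UseTint > 0.5)
    {
        col *= _Tint;
    }
    return col;
}

// Written as a select: the multiply always runs, and one result is picked.
// No control flow remains, so a static analysis sees a single path.
half4 TintSelect(half4 col)
{
    half4 tinted = col * _Tint;
    return (_UseTint > 0.5) ? tinted : col;
}
```

When the conditional block is as cheap as this one, the select form costs about the same as the taken branch, which would be consistent with the shortest and longest path cycles being identical in the report.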
Q2: Is uniform flow-control as terrible as described here?
On older Mali hardware (Utgard and Midgard architectures), yes. On newer hardware (Bifrost and Valhall architectures) the impact of branchy shader control flow is much lower, as long as you don't get divergent control flow.
Q3: May we assume that the driver optimises the shader on-the-fly based upon the uniform value so that only one branch will be executed?
There is no specialization of compiled binaries based on uniform values, so if you have a uniform branch then the branch will end up in the executed code if the compiler can't completely optimize it out. If all threads branch the same way (no divergence), which should always be the case for uniform-controlled branches, then you shouldn't have a major problem.
Q4: Which GPU counters should I check in Streamline for the potential problems of uniform flow control?
The "Diverged instruction" counter is the one to use. Uniform-based branches, by definition, cannot be divergent.
Kind regards, Pete
I use UNITY_BRANCH to force a branch instruction to be generated. It's still uniform-based flow control.
Unity shows that the generated code uses an if-statement instead of the ternary operator (the default). I cannot tell exactly how either will be translated at the lower level.
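For context, `UNITY_BRANCH` and `UNITY_FLATTEN` are Unity macros (defined in HLSLSupport.cginc) that map to the HLSL `[branch]` and `[flatten]` attributes where the platform supports them. A minimal sketch of their use follows, with hypothetical names `_Tint`, `_UseTint`, `TintForcedBranch` and `TintFlattened`. These are hints at the source level; whether a real branch instruction survives in the final GPU binary depends on the platform and the driver's compiler.

```hlsl
// Hypothetical uniforms, for illustration only.
half4 _Tint;
half _UseTint;   // 0 or 1

half4 TintForcedBranch(half4 col)
{
    // Request real control flow for this condition.
    UNITY_BRANCH
    if (_UseTint > 0.5)
    {
        col *= _Tint;
    }
    return col;
}

half4 TintFlattened(half4 col)
{
    // Request that both sides be evaluated and the result selected.
    UNITY_FLATTEN
    if (_UseTint > 0.5)
    {
        col *= _Tint;
    }
    return col;
}
```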
According to my measurements with Streamline, the fragment cycles and executed instructions are almost the same for both implementations. If I understand it correctly, even if a branch instruction is generated, both the then-path and the else-path must be executed by the shader core anyway. Is this correct?
Isn't there any branch prediction in this case?