Mali Offline Compiler: GLSL clamp performance on Mali-Gxx

Hi,

I've been analysing some GLSL shader programs with the Mali Offline Compiler (great program, btw).

However, it reports that a clamp(vec3, float, float) is somehow slower on Mali-Gxx than doing a separate min/max.

vec3 clamp_minmax(vec3 val, float minimum, float maximum) {
    return clamp(val, minimum, maximum);
}

vs

vec3 clamp_minmax(vec3 val, float minimum, float maximum) {
    vec3 rval = min(val, maximum);
    return max(rval, minimum);
}

Note that the minimum is a constant 0.0f.

Is this expected and correct for a Mali-Gxx?

  • Hi,

    I would expect these two to have the same performance in practice. If you can share a full shader that reproduces the issue I can double check.

    Cheers,
    Pete
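
    For context, the GLSL ES specification defines clamp(x, minVal, maxVal) as min(max(x, minVal), maxVal), with undefined results when minVal > maxVal, so the two variants should produce the same values whenever minimum <= maximum. A minimal sketch of a fragment shader that checks this visually; the uniform names lo and hi are hypothetical and not part of the original question:

    ```glsl
    #version 300 es
    precision highp float;

    in vec2 vTextureCoord;
    uniform sampler2D sTexture;
    uniform float lo; // hypothetical lower bound
    uniform float hi; // hypothetical upper bound, assumed >= lo

    out vec4 fragColor;

    void main() {
        vec3 v = texture(sTexture, vTextureCoord).rgb;

        vec3 a = clamp(v, lo, hi);    // built-in clamp
        vec3 b = min(max(v, lo), hi); // separate max, then min

        // Green where both forms agree on all three components, red otherwise.
        fragColor = (a == b) ? vec4(0.0, 1.0, 0.0, 1.0)
                             : vec4(1.0, 0.0, 0.0, 1.0);
    }
    ```

    This only compares results, not cost; the cycle counts still have to come from the offline compiler or from measurements on a device.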

  • Hi Pete,

    Tried to minimize it to the case where I see it happen. It seems to be somehow dependent on the multiply that happens before. 

    On a G72, the following fragment shader gives 2.75 for the _a (clamp) variant and 1.25 for the _b (max/min) variant:

    #version 300 es

    precision highp float;

    in vec2 vTextureCoord;

    uniform sampler2D sTexture;

    uniform float myparam;
    uniform float myparam2;


    out vec4 fragColor;

    vec3 clamp_minmax_b(vec3 val, float minimum, float maximum) {
        vec3 rval = max(val, minimum);
        return min(rval, maximum);
    }

    vec3 clamp_minmax_a(vec3 val, float minimum, float maximum) {
        return clamp(val, minimum, maximum);
    }


    void main() {
        vec4 raw = texture(sTexture, vTextureCoord);

        vec3 clamped_raw = vec3(raw.r, raw.g, raw.b);
        clamped_raw = clamped_raw * myparam2;
        clamped_raw = clamp_minmax_a(clamped_raw, 0.0f, myparam);

        fragColor = vec4(clamped_raw, 1.0f);
    }

    Regards,

    Danny
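
    For reference, a sketch of how the shader would presumably look for the _b measurement, assuming the only change is which helper main() calls (that swap is an assumption; the shader above only shows the _a call):

    ```glsl
    #version 300 es
    precision highp float;

    in vec2 vTextureCoord;
    uniform sampler2D sTexture;
    uniform float myparam;
    uniform float myparam2;

    out vec4 fragColor;

    // Same helper as in the post: max first, then min.
    vec3 clamp_minmax_b(vec3 val, float minimum, float maximum) {
        vec3 rval = max(val, minimum);
        return min(rval, maximum);
    }

    void main() {
        vec4 raw = texture(sTexture, vTextureCoord);

        vec3 clamped_raw = vec3(raw.r, raw.g, raw.b);
        clamped_raw = clamped_raw * myparam2;
        // Assumed to be the only line that differs from the _a version.
        clamped_raw = clamp_minmax_b(clamped_raw, 0.0f, myparam);

        fragColor = vec4(clamped_raw, 1.0f);
    }
    ```

    Keeping everything else identical means any difference in the reported numbers comes from the clamp versus max/min choice and how it packs with the preceding multiply.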

  • "It seems to be somehow dependent on the multiply that happens before."

    A single arithmetic instruction is a packed pair of operations, so the number of cycles can be sensitive to how well surrounding operations pack into those pairings.

  • Agreed that instruction packing can cause a difference in cycle count in general. However, in this particular case, adding the multiply reduces the cycle count for _b (from 2 to 1.25), whereas for _a it increases it (from 2 to 2.75), according to the tool. That makes me doubt the tool's output for these GPUs, as it does not seem logical that adding a multiply reduces the overall cycle count.

    As I am currently using the tool mostly for estimates, I would like to understand whether its output really reflects actual performance on these GPUs or whether I am just looking at a glitch in the tooling.

  • The absolute numbers for the G series are incorrect (at least for the arithmetic cost); we'll be fixing this in the next offline compiler release later in the year. The trend direction (whether a change makes the shader faster or slower) should be accurate though, to the best of my knowledge.

  • OK, that's enough info for me! Looking forward to that next release. Thank you for the quick support!