
glCompileShader taking a long time

Hello.

I am working on a game, and one of the devices (a Samsung Galaxy S3 (GT-I9300) with a Mali-400MP4 GPU, Android 4.0.4) is having problems compiling one of the shaders. The glCompileShader call takes an exceptionally long time for this shader (20+ seconds). It is not necessarily a big, complicated shader, and I am attaching the source file here. I have experimented with changing the shaders, and the compile time does go down if I start taking out instructions, but even a simple, acceptable shader for the game takes 5-10 seconds to compile depending on the features. Unfortunately I have hit a wall trying to figure out exactly which instruction is causing this issue and am not getting anywhere. Since it doesn't technically fail, I get no information from glGetShaderInfoLog. Any help with this will be greatly appreciated.


PS - I am not seeing this issue on most of the other devices. I also tried using the offline compiler, but I ran into other issues: the compiled shader would not link, complaining "L0101 All attached shaders must be compiled prior to linking".




shaderGlsl.frag.zip
  • I've had a bit of a tinker with the offline Mali-400 compiler, trying to improve the shader program's performance - short programs tend to compile faster, as there is less for the compiler backend to worry about - and I have a few suggestions for places where the HSL compiler is breaking up operations that could be handled more efficiently.

    Optimization 1: Use built-in for vector normalize

    There are quite a few places in the code where HSL has turned the equivalent of the ESSL built-in function ...

    bar.xyz = normalize( foo.xyz );

    ... in to ...

    bar.xyz = (inversesqrt(dot (foo.xyz, foo.xyz)) * foo).xyz;

    The built-in function always works out 1 cycle quicker than rolling your own, so if you can spot and unpick this in your code generation tool it would help - you have three instances of this which are trivial to replace and one where you use the inverse for something else. In this latter case:

    float divisor = 1.0 / length( foo.xyz );

    ... is still better than rolling your own.

    Optimization 2: Use vector built-ins rather than scalar built-ins


    My hunch that using the vector built-ins rather than the scalar versions plays more nicely with register allocation seems to be true. For example:

    foo.x = exp2(foo.xxxx).x;
    foo.y = exp2(foo.yyyy).y;
    foo.z = exp2(foo.zzzz).z;

    ... seems to be a cycle shorter when compiled as ...

    foo.xyz = exp2( foo.xyz );

    ... but there are a number of other similar instances for other built-in functions, e.g. log2(), abs().


    Optimization 3: Avoid widen then narrow


    The code generator has some cases where the code widens a scalar to a vector and then throws away pieces of it afterwards. This does not seem to be handled very efficiently, so it is best avoided.

    r0_5.w = clamp (vec4(dot (r4_1.xyz, r1_4.xyz)), 0.0, 1.0).w;

    ... seems to compile better as ...


    r0_5.w = clamp (dot (r4_1.xyz, r1_4.xyz), 0.0, 1.0);

    Optimization 4: Process more CPU-side

    A slightly higher-level optimization: you have some parts of the code with a lot of operations that are "a uniform modified by a uniform". These can be folded out and processed CPU-side, with the modified uniform uploaded instead. For example, nothing in the code below varies per fragment, so it is a significant amount of redundant processing:

      r0_5.x = pc[3].x;
      r1_4.x = max (pc[4].xxxx, r0_5.xxxx).x;
      r0_5.x = (r1_4.xxxx + -(pc[3].xxxx)).x;
      r0_5.x = (1.0/(r0_5.x));

    End results

    With all of the above, plus the generic removal of redundant swizzles and vectorization of scalar operations, I've almost doubled the performance of the shader (it drops from 45 instructions to ~23), so I would hope it should also compile quite a bit faster.

    Hope that is of some use (and thanks for giving me some fun bits of code to play with over the weekend!),

    Pete
