glCompileShader taking a long time

Hello.

I am working on a game, and one device (a Samsung Galaxy S3 (GT-I9300) with a Mali-400MP4 GPU, running Android 4.0.4) has problems compiling one of the shaders. The glCompileShader call takes an exceptionally long time for this shader (20+ seconds). It is not necessarily a big, complicated shader, and I am attaching the source file here. I have experimented with changing the shaders, and the compile time does go down if I start taking out instructions, but even a simple shader that is acceptable for the game takes 5-10 seconds to compile, depending on the features. Unfortunately, I have hit a wall trying to figure out exactly which instruction is causing this issue. Since it doesn't technically crash, I get no information from glGetShaderInfoLog. Any help on this will be greatly appreciated.


PS - I am not seeing this issue on most other devices. I also tried using the offline compiler, but I ran into other issues: the compiled shader would not link, complaining "L0101 All attached shaders must be compiled prior to linking".




shaderGlsl.frag.zip
  • Thanks for reporting.

    Some ideas which may help. I'm guessing it is auto-generated, but you seem to have a lot of code which does something like:

      r0_5.w = (r0_5.wwww * pc[6].xxxx).w;

    Why not just write it as scalar code in the first place rather than using vector ops and throwing away 75% of it? I would guess that the following is a little easier for the compiler to handle ...

      r0_5.w = r0_5.w * pc[6].x;

    Similarly operations such as ...

      r1_4.x = max(pc[4].xxxx, r0_5.xxxx).x;

    ... would also seem to massively over-complicate it. What's wrong with ...

      r1_4.x = max(pc[4].x, r0_5.x);

    ... and for this ...

      r2_3.xyw = (r0_5.yyyy * pc[11].xyzz).xyw;

    ... why not ...

      r2_3.xyw = r0_5.yyy * pc[11].xyz;

    There are other places where you write scalar code, and force the compiler to re-vectorize it. This ...

      r2_3.x = exp2(r1_4.xxxx).x;
      r2_3.y = exp2(r1_4.yyyy).y;
      r2_3.z = exp2(r1_4.zzzz).z;

    ... could be ...

      r2_3.xyz = exp2(r1_4.xyz);

    I'm not sure how much this adds up to in terms of compiler overhead, but it is all work which takes at least some time for the compiler to unpick. That said, I suspect it doesn't add up to much; mostly it would just make the code more readable. I suspect the main issue is the amount of working data you have hanging about in registers with quite a long lifetime in the program. The register allocator is going to have to work quite hard to pack that into the register file as efficiently as possible to avoid spending a lot of time stacking and unstacking variables.

    Some unrelated observations which may help - you have a relatively bulky block of uniforms in the "pc" array, and for many of them you only use one of the vector channels, so you're losing some efficiency there. You also end up with some things which you forcefully devectorize, such as ...

      r3_2.w = tmpvar_9.w;
      r3_2.xyz = (tmpvar_9 * pc[5]).xyz;

    Given that you never use the pc[5].w uniform for anything, why not set it to 1 when you upload the uniform and just do ...

      r3_2 = tmpvar_9 * pc[5];

    It may not solve your compile time issues, but it should run faster once it's compiled.

    "I also tried using the offline compiler, but I ran into other issues: the compiled shader would not link"

    The best way to get a binary which will work is to compile it once on the target (e.g. at install time), and cache the binary, which can be reloaded later. The reload may fail if the driver is updated (binaries may become incompatible), so you may need to recompile from source and recache an updated binary after a firmware update.

    See this extension:

    http://www.khronos.org/registry/gles/extensions/ARM/ARM_mali_program_binary.txt

    HTH,
    Pete
