Hello.
I am working on a game and one of the devices (Samsung Galaxy S3 (GT-I9300) with a Mali-400MP4 GPU, Android version 4.0.4) is having problems compiling one of the shaders. The GL call glCompileShader takes an exceptionally long time for this shader (20+ seconds). It is not necessarily a big, complicated shader, and I am attaching the source file here. I have experimented with changing the shaders and the compile time does go down if I start taking out instructions, but even a simple, acceptable shader for the game takes 5-10 seconds to compile depending on the features. Unfortunately I have hit a wall trying to figure out exactly which instruction is causing this issue and am not getting anywhere. Since it doesn't technically crash, I get no information from glGetShaderInfoLog. Any help on this will be greatly appreciated.
PS - I am not seeing this issue on most of the other devices. I also tried using the offline compiler, but I ran into other issues: the compiled shader would not link, complaining (L0101 All attached shaders must be compiled prior to linking).
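For reference, the compile step is essentially the standard pattern below (a simplified sketch with made-up names; the time is all spent inside glCompileShader and the log comes back empty):

/* Simplified sketch of the compile step (shader source loading and error
   handling trimmed). The timing code is only for illustration. */
#include <GLES2/gl2.h>
#include <stdio.h>
#include <time.h>

GLuint compile_shader(GLenum type, const char *source)
{
    GLuint shader = glCreateShader(type);
    glShaderSource(shader, 1, &source, NULL);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    glCompileShader(shader);                 /* this call takes 20+ seconds on the S3 */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("glCompileShader took %.1f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1.0e6);

    GLint ok = GL_FALSE;
    glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
    if (ok != GL_TRUE) {
        char log[4096];
        glGetShaderInfoLog(shader, sizeof(log), NULL, log);
        printf("compile log: %s\n", log);    /* empty in this case - no crash, no message */
    }
    return shader;
}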
Thanks for reporting.
Some ideas which may help. I'm guessing it is auto-generated, but you seem to have a lot of code which does something like:
r0_5.w = (r0_5.wwww * pc[6].xxxx).w;
Why not just write it as scalar code in the first place rather than using vector ops and throwing away 75% of it? I would guess that the following is a little easier for the compiler to handle ...
r0_5.w = r0_5.w * pc[6].x;
Similarly operations such as ...
r1_4.x = max (pc[4].xxxx, r0_5.xxxx).x;
... would also seem to massively over-complicate it. What's wrong with ...
r1_4.x = max(pc[4].x, r0_5.x);
... and for this ...
r2_3.xyw = (r0_5.yyyy * pc[11].xyzz).xyw;
... why not ...
r2_3.xyw = r0_5.yyy * pc[11].xyz;
There are other places where you write scalar code, and force the compiler to re-vectorize it. This ...
r2_3.x = exp2(r1_4.xxxx).x; r2_3.y = exp2(r1_4.yyyy).y; r2_3.z = exp2(r1_4.zzzz).z;
... could be ...
r2_3.xyz = exp2(r1_4.xyz);
I'm not sure how much this adds up to in terms of compiler overhead - but it is all work which takes at least some time for the compiler to unpick. That said, I suspect that doesn't add up to much - it would just make the code more readable. I suspect the main issue is the amount of working data you have hanging about in registers which has quite a long lifetime in the program. The register allocator is going to have to work quite hard to pack that into the register file as efficiently as possible to avoid spending a lot of time stacking and unstacking variables.
Some unrelated observations which may help - you have a relatively bulky block of uniforms in the "pc" array, and for many of them you only use one of the vector channels, so you're losing some efficiency there. You also end up with some things which you forcefully de-vectorize, such as ...
r3_2.w = tmpvar_9.w; r3_2.xyz = (tmpvar_9 * pc[5]).xyz;
Given that you never use the pc[5].w uniform for anything, why not set it to 1 when you upload the uniform and just do ...
r3_2 = tmpvar_9 * pc[5];
It may not solve your compile time issues, but it should go faster once it's compiled.
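As a rough sketch of the upload side (the "pc" array name comes from your generated GLSL; the function name, program handle and values are placeholders):

/* Illustrative only: upload pc[5] with .w forced to 1.0 so the shader can use a
   straight vec4 multiply. Assumes the program is currently bound. */
#include <GLES2/gl2.h>

void upload_pc5(GLuint program, float x, float y, float z)
{
    GLint loc = glGetUniformLocation(program, "pc[5]");
    glUniform4f(loc, x, y, z, 1.0f);   /* .w = 1.0 - never otherwise read by the shader */
}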
I also tried using the offline compiler, but I ran into other issues: the compiled shader would not link, complaining (L0101 All attached shaders must be compiled prior to linking).
The best way to get a binary which will work is to compile it once on the target (e.g. at install time), and cache the binary, which can be reloaded later. The reload may fail if the driver is updated (binaries may become incompatible), so you may need to recompile from source and recache an updated binary after a firmware update.
See this extension:
http://www.khronos.org/registry/gles/extensions/ARM/ARM_mali_program_binary.txt
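A rough sketch of the caching flow, via the OES_get_program_binary entry points which the ARM extension plugs into (file path, error handling and extension loading are just placeholders - on a real GLES2 target you would normally fetch these entry points with eglGetProcAddress):

/* Illustrative sketch: save a linked program's binary so it can be reloaded on
   later runs instead of recompiling from source. */
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <stdio.h>
#include <stdlib.h>

void cache_program_binary(GLuint program, const char *path)
{
    GLint length = 0;
    glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH_OES, &length);
    if (length <= 0) return;

    void *binary = malloc(length);
    GLenum format = 0;
    glGetProgramBinaryOES(program, length, NULL, &format, binary);

    FILE *f = fopen(path, "wb");
    if (f) {
        fwrite(&format, sizeof(format), 1, f);   /* keep the format token with the blob */
        fwrite(binary, 1, (size_t)length, f);
        fclose(f);
    }
    free(binary);
}

/* On later runs: load the blob, call glProgramBinaryOES(program, format, binary, length),
   and check GL_LINK_STATUS. If it fails (e.g. after a driver update), fall back to
   compiling from source and re-cache the new binary. */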
HTH, Pete
Thanks for your prompt reply. The shader is compiled as HLSL through D3DXCompileShader (using max optimizations) and then we translate the output to GLSL, which is why the code is the way it is. I will go through the translator code to see if we can improve it based on your suggestions.
I suspect the main issue is the amount of working data you have hanging about in registers which has quite a long lifetime in the program. The register allocator is going to have to work quite hard to pack that into the register file as efficiently as possible to avoid spending a lot of time stacking and unstacking variables.
By this do you mean local registers like the following?
mediump vec4 r4_1;
mediump vec4 r3_2;
mediump vec4 r2_3;
mediump vec4 r1_4;
mediump vec4 r0_5;
Eliminating them would be hard because these get generated by D3DXCompileShader. I am just surprised as to why this problem only occurs on Mali-400MP4 devices. I was hoping it was something to do with a specific instruction. I will also spend some more time digging into the offline compiler.
The shader is compiled as HLSL through D3DXCompileShader (using max optimizations) and then we translate the output to GLSL, which is why the code is the way it is.
Yep - I guessed it would be something like that - it definitely has that "written by a machine" look about it.
I will go through the translator code to see if we can improve it based on your suggestions.
Try hand-fixing one first. It may make no difference, and I'd hate for you to spend a load of time improving the translator only for that effort to be wasted.
By this do you mean local registers like the following? mediump vec4 r4_1; mediump vec4 r3_2; mediump vec4 r2_3; mediump vec4 r1_4; mediump vec4 r0_5;
Indirectly, yes. Any variable which exists in the program needs register storage (spilling to stack storage if we run out of space) from the point it is first assigned a value to the point it is last used. Things like uniforms and constants are handled differently, so they are not counted in this.
This program has quite a lot of "working state". Most of the variables are assigned relatively early and stay "alive" for a long time because they are used in the final few instructions of the program, so the compiler has to work out how to most optimally keep this data in registers, while also packing things efficiently for the vector ALUs.
If you can change the algorithm to need fewer live variables it could help - but it would change the visual output of course.
P.S. I've just checked with the current offline compiler from http://malideveloper.arm.com/develop-for-mali/tools/analysis-debug/mali-gpu-offline-shader-compiler/ and this seems to perform OK on a desktop PC (4 ms to compile), so I suspect you are running into an issue which is only present in older driver releases.
chrisvarns, can you please raise a support ticket?
Cheers,
Pete
Will do on Tuesday, UK bank holiday on Monday.
Thanks,
Chris
I've had a bit of a tinker with the offline Mali-400 compiler to try to improve the shader program performance - shorter programs tend to compile faster as there is less for the compiler backend to worry about - and I have a few suggestions for cases where the HLSL compiler is breaking up operations which could be handled more efficiently.
Optimization 1: Use built-in for vector normalize
There are quite a few places in the code where HLSL has turned the equivalent of the ESSL built-in function ...
bar.xyz = normalize( foo.xyz )
... in to ...
bar.xyz = (inversesqrt(dot (foo.xyz, foo.xyz)) * foo).xyz;
The built-in function always works out 1 cycle quicker than rolling your own, so if you can spot and unpick this in your code generation tool it would help - you have three instances of this which are trivial to replace and one where you use the inverse for something else. In this latter case:
float divisor = 1.0 / length( foo.xyz )
... is still better than rolling your own.
Optimization 2: Use vector built-ins rather than scalar built-ins
My hunch that using vector built-ins rather than the scalar versions plays more nicely with register allocation seems to be true. For example:
foo.x = exp2(foo.xxxx).x; foo.y = exp2(foo.yyyy).y; foo.z = exp2(foo.zzzz).z;
... seems to be a cycle shorter when compiled as ...
foo.xyz = exp2( foo.xyz );
... but there are a number of other similar instances for other built-in functions, e.g. log2(), abs().
Optimization 3: Avoid widen then narrow
The code generator has some cases where it widens a scalar to a vector and then throws away pieces of it afterwards. This does not seem to be handled very efficiently, so it is best avoided.
r0_5.w = clamp (vec4(dot (r4_1.xyz, r1_4.xyz)), 0.0, 1.0).w;
... seems to compile better as ...
r0_5.w = clamp(dot(r4_1.xyz, r1_4.xyz), 0.0, 1.0);
Optimization 4: Process more CPU-side
A slightly higher-level optimization - you have some parts of the code with a lot of operations which are "uniform modified by a uniform". These can be folded out and processed CPU-side, with the modified uniform uploaded instead. For example - nothing in the code below varies per fragment, so it is a significant amount of redundant processing:
r0_5.x = pc[3].x; r1_4.x = max (pc[4].xxxx, r0_5.xxxx).x; r0_5.x = (r1_4.xxxx + -(pc[3].xxxx)).x; r0_5.x = (1.0/(r0_5.x));
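For illustration (the names pc3_x, pc4_x and "pc_folded" are all made up - they just stand for the values you would otherwise upload as pc[3].x and pc[4].x), the CPU-side fold for that snippet would be something like:

/* Illustrative only: do the uniform-on-uniform math once on the CPU and upload
   the result as a single scalar. Assumes the program is currently bound. */
#include <GLES2/gl2.h>
#include <math.h>

void upload_folded(GLuint program, float pc3_x, float pc4_x)
{
    float folded = 1.0f / (fmaxf(pc4_x, pc3_x) - pc3_x);             /* = 1/(max(pc[4].x, pc[3].x) - pc[3].x) */
    glUniform1f(glGetUniformLocation(program, "pc_folded"), folded); /* hypothetical uniform name */
}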
With all of the above, and the generic removal of redundant swizzles and vectorization of scalar operations I've almost doubled the performance of the shader (it drops from 45 instructions to ~23 instructions) so I would hope that this should compile quite a bit faster.
Hope that is of some use (and thanks for giving me some fun bits of code to play with on the weekend),
First of all, thanks for all the help.
Based on your suggestions I made modifications to our HLSL-to-GLSL translator and I am attaching the new, modified file. Unfortunately it has had no impact on the compile time. I wasn't able to apply optimization 2 as it would be a lot of work. Even with the given optimizations the shader compile still takes 30-40+ seconds. It may be that the shaders compile fine on the latest firmware, so this may just be a bug in the 4.0.4 version only. I will upgrade my firmware to do a quick test. If it is a bug in an older firmware, is there a way to fix it?
The weird thing is that I do have shaders that are more complicated than this one, yet only this one is problematic.
Hi s2moudgi,
I've created the ticket MPDDEVREL-1076 for this. We'll try to find a workaround for affected devices, which in this case should just be alternative shader code. Is it possible for your generator to take different driver revisions into account when emitting the GLSL code? If not, I'm not sure how you would incorporate a workaround.
Hi Chris,
Thanks for opening a ticket. If we can track down what is causing the issue, I should be able to find some way to fix it, either by changing the original shader or by changing the code at the HLSL-to-GLSL translation level. I just have no control over the HLSL assembly output by D3DXCompileShader. There are a few flags I can tinker with, but I won't know if they will help unless I know the source of the problem.
I did a quick test with firmware 4.3 and this issue does not happen on it, so this is only a problem with the 4.0.4 firmware. Let me know if there is a way to track the ticket. Thanks.
The ticket is for an internal bug-tracking system that is not publicly accessible; I just provided it for your convenience in case you want to refer to it in future.
I've found that by reducing the length of the shaders you can bring the compilation times down to something more reasonable. I gave up timing your original shader after 12 minutes; Pete's optimized version took ~5 minutes, but by removing chunks of the shader I had versions that compiled in 96 seconds, 8.5 seconds, 2.6 seconds, and 65 ms. Compilation time in this driver version seems to go up exponentially with shader length. This is fixed in later versions of the driver (this driver is 2 years old now), so I would urge your users to upgrade to the latest Android version. For users stuck on this driver version, my advice would be to keep the shaders as short as possible.
Hope this helps,