Hi,
Nice tools you're offering. I really love the offline compiler; it fits well into our pipeline for delivering shaders to ARM Mali mobile devices (unit tests and debugging).
Now we'd like to make progress on the performance side, but I'm having a hard time finding my way through the documentation.
So I'm running
#malics -V -frag myshade.frag
hoping to get useful information about our shader's performance. I got a nice table of results, but I could use some documentation on each column. Here's what I get:
"8 work registers used, 4 uniform registers used, spilling used."
I searched the website but couldn't find anything about these values. I can guess at some of them, but I'd much rather be sure of the exact meaning of each.
I'd really like to be able to get as much information as possible from the command line (thus avoiding the huge 'studio' tool, which requires too much setup per shader when a simple compilation is enough).
Thanks!
Hi Kuranes,
You're running the Mali 600 offline shader compiler, so it's giving you statistics for that shader, compiled with that driver version, for a T6xx core (it defaults to T670 r3p0, but you can select another). The T6xx+ (Midgard) GPUs all implement what we call a "Tripipe" architecture: they have A, LS, and T pipes, standing for Arithmetic, Load/Store and Texture respectively. Each instruction in your shader executes in one of these pipes. The shader compiler shows you the total number of instructions in the shader, as well as the shortest and longest execution paths the shader can take through those instructions, in your case 66 and 74. It also shows a breakdown of which pipes those instructions execute in, and therefore which pipe you are likely to be bound on and should consider optimizing first. I'll have a hunt for some documentation, but I believe any documentation should be shipped with the compiler. More general advice on the tri-pipe architecture is dotted throughout the blogs on this site, and is also in the Mali GPU Application Optimization Guide v3.0, available from the Mali Developer Center.
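As a rough illustration of how to read the per-pipe breakdown: the pipe with the highest cycle count on a given path is the one the shader is most likely bound on. The snippet below is plain Python and not part of the compiler; the function name and the example numbers are made up for illustration.

```python
def likely_bound_pipe(cycles):
    """Return the pipe(s) with the highest cycle count, sorted by name.

    `cycles` maps a pipe name ("A", "LS", "T") to the cycle count
    reported for one execution path. More than one pipe is returned
    when there is a tie.
    """
    worst = max(cycles.values())
    return sorted(pipe for pipe, count in cycles.items() if count == worst)


# Hypothetical numbers, not taken from a real compiler run.
example = {"A": 12, "LS": 20, "T": 4}
print(likely_bound_pipe(example))
```

With numbers like these, the Load/Store pipe would be the first candidate for optimization.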
Hope this helps,
Chris
Thanks a lot, that explains a great deal (and could go into the next version of the help text).
So for the Textures column: is that the texture fetch count, or the GPU cycles needed for texture operations?
While you're on the hunt, a thorough explanation of "spilling used / spilling not used" would be great.
(I'm not sure about its meaning, its implications, or how to avoid it. My guess is that there aren't enough registers, which leads to cache misses or something similar.)
It is the number of instructions executed in the T pipe, so it is a count of GPU cycles. The T pipe does all texture sampling/filtering.
Spilling is an interesting one. Basically, the number of threads that can execute concurrently in the shader core is determined by the number of registers those threads use: 4 or fewer means we can concurrently execute the maximum number of threads, and up to 8 means we can only run half as many. If a shader needs more than 8 registers (complicated shaders with lots of variables with long lifetimes), it's better to "spill" some of those registers to the cache for temporary storage, as this is more performant in practice than expanding the register set any further. The trade-off is that this increases the L/S pipe load, as it has to save and load those variables.
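To make the register rule of thumb concrete, here is a small Python model of it. It only encodes the thresholds described in this post (4 and 8 work registers); the function name and the idea of passing in the register count a shader "would like" before spilling are assumptions made for illustration, not something the compiler exposes.

```python
def thread_occupancy(registers_needed):
    """Model the rule of thumb from this thread.

    Returns (fraction_of_max_threads, spills):
      * up to 4 work registers  -> full thread count, no spilling
      * 5 to 8 work registers   -> half the thread count, no spilling
      * more than 8             -> still half, but the excess is spilled
                                   to cache (extra Load/Store work)
    """
    if registers_needed < 0:
        raise ValueError("register count cannot be negative")
    spills = registers_needed > 8
    fraction = 1.0 if registers_needed <= 4 else 0.5
    return fraction, spills


for needed in (3, 8, 12):
    print(needed, thread_occupancy(needed))
```

In this model, a shader reported as "8 work registers used, spilling used" corresponds to the last case: it wanted more than 8 registers, so it runs at half the thread count and pays for the spill in the L/S pipe.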
Hth,
One slight clarification on this one.
The counter counts the number of texture instructions. As Chris mentions one texture instruction does one texture access, including filtering, decompression, etc. Single-sample or bi-linear filtered texture instructions take a single cycle, trilinear or 3D textures take two cycles. The compiler doesn't know what data assets are used so it will only ever assume single cycle.
Texture instructions can effectively take longer than a single cycle if you get bad cache behaviour (e.g. applying a very large texture to a very small screen area without mipmaps so you thrash the texture cache). Any cycle count overheads due to cache misses are not shown by this counter (again, the compiler doesn't know).
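The difference between what the compiler reports and what the hardware may spend can be sketched in a few lines of Python. The cycle costs are the ones given above; the filter labels and function names are hypothetical, and cache misses are deliberately left out, just as they are in the compiler's numbers.

```python
# Cycles per texture instruction as described above (cache misses ignored).
# The keys are illustrative labels, not compiler or API identifiers.
CYCLES_PER_LOOKUP = {
    "nearest": 1,    # single sample
    "bilinear": 1,
    "trilinear": 2,
    "3d": 2,
}


def texture_cycles(lookups):
    """Cycles for a list of lookups when the filtering mode is known."""
    return sum(CYCLES_PER_LOOKUP[kind] for kind in lookups)


def compiler_estimate(lookups):
    """What the offline compiler can assume: one cycle per instruction."""
    return len(lookups)


lookups = ["bilinear", "trilinear", "3d"]
print(compiler_estimate(lookups), texture_cycles(lookups))
```

For a shader that mostly samples trilinear or 3D textures, the real T pipe cost could therefore be up to twice the reported figure, before any cache effects.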
HTH,
Pete
Thanks Pete,
The shader compiler states this in its output as well:
Note: The cycles counts do not include possible stalls due to cache misses.
Thanks a lot for the explanation and support.
Now, I'm a bit at a loss with the Texture result, which I reproduced with this simple fragment shader:
precision lowp float;
precision lowp int;
uniform sampler2D myRenderTexture;
#define SAMPLES 256
void main(){
    float samples_f = float(SAMPLES);
    float idx_u = 1.0/samples_f;
    vec2 uv = vec2(0.0, 0.0);
    for (int i = 0; i < SAMPLES; ++i){
        uv.x += idx_u;
        uv.y -= idx_u;
        gl_FragColor.rgb += texture2D(myRenderTexture, gl_FragCoord.xy + uv).rgb;
    }
    gl_FragColor.rgb /= float(SAMPLES);
    gl_FragColor.a = 1.0;
}
My understanding was that I ought to get the value of SAMPLES in the T column.
But whatever I change SAMPLES to, I get 1 in T?
By the way, what does it mean when I get a row of "-1" in "longest path"?
This is a known issue in the stats reported by the offline compiler - the static analysis pass which generates the stats doesn't really understand loops, so assumes that the loop body is executed only once.
HTH, Pete
Sorry, the "-1" in the "longest path" row should have made me realize... It makes much more sense now.
Now, let's unroll!
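One way to unroll without editing the shader by hand is to generate the source with a small script, so that every texture fetch appears as its own instruction for the static analysis. The Python below is only a sketch of that idea: the function name is made up, it reuses the body of the test shader above, and whether the reported T count then matches the sample count is something to check against the compiler output rather than take for granted.

```python
def unrolled_fragment_shader(samples):
    """Return the test shader source with the loop body repeated `samples` times."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    header = [
        "precision lowp float;",
        "precision lowp int;",
        "uniform sampler2D myRenderTexture;",
        "void main(){",
        "    float idx_u = 1.0 / %d.0;" % samples,
        "    vec2 uv = vec2(0.0, 0.0);",
    ]
    # Same statements as the original loop body, emitted once per sample.
    body = [
        "    uv.x += idx_u;",
        "    uv.y -= idx_u;",
        "    gl_FragColor.rgb += texture2D(myRenderTexture, gl_FragCoord.xy + uv).rgb;",
    ]
    footer = [
        "    gl_FragColor.rgb /= %d.0;" % samples,
        "    gl_FragColor.a = 1.0;",
        "}",
    ]
    return "\n".join(header + body * samples + footer) + "\n"


source = unrolled_fragment_shader(4)
print(source)
print(source.count("texture2D("))
```

The generated text can be written to a .frag file and passed to the offline compiler in the same way as the original shader.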
And thanks again for the great support, it really helps.