Hi,
I'm trying to understand the performance of one of my test shaders (I'm using the Mali-G78 on a Pixel 6a, which I believe has 20 cores). Using Streamline, I'm seeing around 15 giga-instructions per second, with arithmetic unit utilization of around 99%. According to my calculation, we are processing around 600 giga scalar add/mul operations per second (counting them in the shader, which I don't think can be optimized away, multiplying by the number of pixels, times 4 because I'm using vec4, and times the FPS).
I'm not sure how to reconcile the 15 giga-instructions/s with my calculated 600 GFLOPS. If I assume one instruction can operate on 32 f32 values simultaneously, that gives 15 * 32 = 480 GFLOPS, which is still quite a bit lower than what I estimate from my shader and fill rate.
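The estimate described above can be sketched as follows. The resolution, FPS, and per-pixel op count here are hypothetical placeholders chosen only to illustrate how ~600 GFLOPS could arise, not the actual measured values.

```python
# Sketch of the FLOPs estimate: scalar add/mul per second =
# pixels * scalar ops per vec4 component * vector width * frames per second.
def estimated_flops(width, height, scalar_ops_per_component, fps, vec_width=4):
    return width * height * scalar_ops_per_component * vec_width * fps

# Hypothetical numbers: 1080x2400 screen, 960 scalar ops per component, 60 fps.
print(estimated_flops(1080, 2400, 960, 60) / 1e9)  # ~597 GFLOPS
```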
Thanks,
Lorenzo
Mali counters only count one arithmetic unit, and they increment per warp, not per thread. This allows normalization relative to clock frequency.
For Mali-G78 there are 2 arithmetic units, using 16-wide warps (see https://developer.arm.com/documentation/102849/latest/).
15G * 2 * 16 = 480G instructions/second.
Further, one instruction could be a fused FMA or a scalar op such as MUL or ADD, so 480G instructions could be up to 960 GFLOPS if you count an FMA as 2 ops.
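The normalization above can be written out explicitly. This is a minimal sketch using the Mali-G78 figures already stated (2 arithmetic units, 16-wide warps); the function names are just for illustration.

```python
# Convert the per-warp, single-unit counter value into per-thread
# instruction throughput, then into a FLOPS upper bound.
WARP_WIDTH = 16   # Mali-G78 warp width
ARITH_UNITS = 2   # arithmetic units per core on Mali-G78

def thread_instructions(counter_gips):
    """Counter increments once per warp for one unit -> scale up both."""
    return counter_gips * ARITH_UNITS * WARP_WIDTH

def peak_flops(counter_gips, fma_as_two=True):
    """Upper bound if every instruction is an FMA counted as 2 ops."""
    return thread_instructions(counter_gips) * (2 if fma_as_two else 1)

print(thread_instructions(15))  # 480 (G instructions/s)
print(peak_flops(15))           # 960 (GFLOPS upper bound)
```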
I'm counting FMA as one, but I ended up using only ADDs in my shader to simplify things.
I guess the reason I'm getting more than 480G is either approximation error or something being optimized away.
Thanks for your reply.
I can confirm I'm now matching the 480G as expected. My previous numbers (600G) came from not waiting for the last frame to complete (those frames are very slow).