Hello forum,
I wanted to know whether the MaliCorePUInstructionsFMAInstructions hardware counter counts FMA instructions per processing unit, per execution core, or for the entire GPU.
From the name of the counter it seems to be per processing unit. If so, how can I scale it to infer the total FMA instructions executed on the entire GPU?
Thank you,
rchakena
In Streamline, the Mali instruction counters report the count for a single processing unit, averaged across all shader cores to show single-core performance. This is the most useful measure for performance analysis, because the throughput of the dominant data path in each core is what you need to know to find critical-path performance bottlenecks.
To compute "whole GPU totals", multiply the value in Streamline by your core count (accessible via $MaliConstantsShaderCoreCount) and by the number of processing units per core (not accessible programmatically). I can't find a good public reference for the number of PUs per core, but if you let me know what GPU you are using I can give you the scale factor.
Cheers, Pete
Hello Pete,
Thanks for the detailed explanation. It clarified my doubts to a large extent.
Here are some follow-up points to validate my understanding, plus some additional questions.
1. I am using G77-MP7 and G78-MP14 GPUs, which as I understand it have 7 and 14 cores respectively and 2 PUs per core. So the scale factor will be 7*2=14 for G77-MP7 and 14*2=28 for G78-MP14.
2. My understanding was that not all cores are utilized for small workloads, so the scale factor might vary based on how many cores were actually active during workload execution. (Or should I assume all cores are active irrespective of the workload?)
3. If the number of active cores, and therefore the scale factor, depends on the workload, which counter or Streamline info can tell me how many cores were actually active during workload execution? (That could be used to adjust the scale factor accordingly.)
Cheers,
Yes, your scale factors look correct.
Streamline sums and averages counters on the assumption that all cores are active, so scaling by the total core count gives the correct global total.
You cannot tell how many cores were actually active in Streamline; it's a transparent aspect of power management policy controlled by the platform provider.
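For anyone who wants to script the scaling described in this thread, here is a minimal Python sketch. It only encodes the arithmetic confirmed above (core count times PUs per core); the function names and the sample counter value are hypothetical, and the per-PU value is assumed to have been read from Streamline by hand.

```python
# Processing units per shader core; 2 for both GPUs, as confirmed in the thread.
PUS_PER_CORE = 2

# Shader core counts for the two GPUs discussed in the thread.
SHADER_CORES = {"G77-MP7": 7, "G78-MP14": 14}


def scale_factor(gpu: str) -> int:
    """Return shader cores * PUs per core for the named GPU."""
    return SHADER_CORES[gpu] * PUS_PER_CORE


def whole_gpu_fma(per_pu_value: float, gpu: str) -> float:
    """Scale a per-PU, core-averaged Streamline value to a whole-GPU total."""
    return per_pu_value * scale_factor(gpu)


if __name__ == "__main__":
    sample = 1_000_000  # hypothetical per-PU FMA count read from Streamline
    for gpu in SHADER_CORES:
        print(gpu, scale_factor(gpu), whole_gpu_fma(sample, gpu))
```

Because Streamline averages over all cores, the sketch always uses the full core count and does not try to correct for cores that were powered down.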