Hello community,
In the GPU datasheet, the fp32 operations per cycle figure is 256 for Immortalis-G715. Is this for all 16 cores or for one core only?
Thanks,
Venkatesh.
Thanks Christian for the reply. That's interesting, and it prompted a question about hiding ALU latency. If possible, could you suggest or point to any documentation on how Mali hides ALU latency between dependent instructions, similar to how RDNA hides ALU latency as described in section 7.6.1 of usermanual.wiki/.../view
As on all GPUs, dependency hiding (and memory latency hiding) is handled by having a _lot_ of threads per core. If one thread is blocked, the core picks another one. Immortalis-G715 has up to 2048 threads per core ...
See: www.youtube.com/watch
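To make the "pick another thread" idea concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: one instruction issued per cycle, every instruction depends on the previous one in its own thread, and a fixed result latency. The function name `issue_utilization` and all parameters are hypothetical and do not describe Mali's actual scheduler, which is not publicly documented at this level.

```python
def issue_utilization(num_threads, latency, cycles=10_000):
    """Fraction of cycles in which the toy core issues an instruction.

    Assumptions (illustrative only, not Mali's real design):
    - at most one instruction issues per cycle;
    - each thread's next instruction depends on its previous one,
      so the thread is blocked for `latency` cycles after issuing.
    """
    ready_at = [0] * num_threads  # cycle at which each thread may issue again
    issued = 0
    for cycle in range(cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                # This thread is not blocked: issue, then block it until
                # the result of the instruction is available.
                ready_at[t] = cycle + latency
                issued += 1
                break  # only one issue slot per cycle
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "threads:", issue_utilization(n, latency=4))
```

In this model the issue slot stays fully busy once the number of runnable threads reaches the latency in cycles, which is the reason for keeping many threads resident per core.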
Thanks Peter for the link, it is really helpful.
After watching the series of videos, I have the following assumptions. Could you confirm whether my understanding is correct?
1. The 256 fp32 operations per cycle figure covers both the CVT and FMA units, i.e. CVT has 128 operations per cycle and FMA has 128 operations per cycle.
2. Each PU has 64 ALUs, with one 16-wide FMA pipeline and one 16-wide CVT pipeline. The number of ALUs (64) and the number of instructions issued to the pipelines (32 = 16 FMA + 16 CVT) do not match. Am I missing something here?
Industry convention for GPUs is to only count FMAs, and count them as two operations (mul + add), so the data sheet saying 256 fp32 ops/cy means that each shader core can do 128 fp32 FMA operations per clock cycle. These numbers completely ignore the CVT and SFU units.
Whether you get more than 256 ops/cy because of the CVT and SFU units depends on the operation and GPU generation. Some can issue in parallel, some can't, as we continuously rebalance the shader core design to get the right balance of operations for industry content trends and to optimize for energy efficiency.
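The counting convention described above can be written out as a small calculation. This is a sketch, not an official formula: the 128 FMAs per cycle per core comes from this thread, the 16-core count comes from the original question, and the 1 GHz clock is a purely hypothetical placeholder, since no clock frequency is given here.

```python
def peak_fp32_ops_per_second(fma_per_cycle_per_core, cores, clock_hz):
    """Theoretical peak fp32 ops/s, counting each FMA as two ops (mul + add).

    CVT and SFU work is ignored, matching the datasheet convention.
    """
    ops_per_cycle_per_core = fma_per_cycle_per_core * 2
    return ops_per_cycle_per_core * cores * clock_hz


FMA_PER_CYCLE_PER_CORE = 128   # from this thread: 128 fp32 FMAs per core per cycle
CORES = 16                     # core count mentioned in the original question
CLOCK_HZ = 1_000_000_000       # ASSUMPTION: hypothetical 1 GHz clock for illustration

if __name__ == "__main__":
    per_core = FMA_PER_CYCLE_PER_CORE * 2
    total = peak_fp32_ops_per_second(FMA_PER_CYCLE_PER_CORE, CORES, CLOCK_HZ)
    print("fp32 ops per cycle per core:", per_core)
    print("peak fp32 ops per second at the assumed clock:", total)
```

The real peak scales linearly with the actual clock frequency and core count of a given device, so substitute those values for the placeholders.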
Thanks for the response, Peter.
So, FMA throughput is 128 fp32 FMA operations per cycle (counted as 256 fp32 ops).
Could you tell me the throughput of the CVT unit in integer ops per cycle?
For Mali-G715 it's 64 int32 ops per cycle.
Hi Peter, thanks for the CVT throughput. From the Mali sources, I understand that the CVT unit executes branches, bitwise operations, and integer computations. However, I see that in some cases only one of the CVT or FMA units will be utilized; for example, the FMA unit will be idle while CVT executes branch instructions, because the control flow path is unknown. So I just want to know what benefit the CVT unit really provides for improving performance.
I can't really add much on a public forum, sorry - that level of microarchitecture explanation isn't publicly disclosed.
Cheers, Pete
NP, thank you for the responses.