In the GPU datasheet, the FP32 operations per cycle figure is 256 for Immortalis-G715. Is this for all 16 cores or for one core only?
That's per core, i.e. 256 FP32 ops (e.g. 128 FP32 FMAs counted as 2 FP32 ops) per cycle per core.
Thanks Christian for the reply. That's interesting, and it raises a question about hiding ALU latency. Could you suggest or point to any documentation on how Mali hides ALU latency between dependent instructions, similar to how RDNA hides ALU latency as described in section 7.6.1 of usermanual.wiki/.../view?
Like all GPUs, dependency hiding (and memory latency hiding) is handled by having a _lot_ of threads per core. If one thread is blocked, pick another one. Immortalis-G715 has up to 2048 threads per core ...
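To make the "if one thread is blocked, pick another one" idea concrete, here is a toy sketch. It is not a model of the actual Mali scheduler: the single issue port, the "first ready warp" policy, and the 4-cycle latency used in the usage note are all assumptions chosen purely for illustration. Each warp runs a chain of dependent instructions, so after issuing it cannot issue again until the latency has elapsed.

```python
def issue_utilization(num_warps: int, latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which one issue port finds a ready warp.

    Toy model (assumption, not Mali's real scheduler): every instruction
    in a warp depends on the previous one, so a warp that issues at
    cycle c is not ready again until cycle c + latency.
    """
    ready_at = [0] * num_warps  # earliest cycle each warp may issue again
    issued = 0
    for cycle in range(cycles):
        # Pick the first warp whose previous result is available.
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                ready_at[w] = cycle + latency
                issued += 1
                break
    return issued / cycles
```

With a hypothetical latency of 4 cycles, `issue_utilization(1, 4)` leaves the port idle three cycles out of four, while `issue_utilization(4, 4)` keeps it busy every cycle. Having many more resident threads than this minimum is what gives the core slack to cover much longer stalls, such as memory accesses.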
Thanks Peter for the link, it is really helpful.
After watching the series of videos, I have the following assumptions. Could you confirm whether my understanding is correct?
1. The 256 fp32 operations per cycle covers both the CVT and FMA units, i.e. CVT has 128 operations per cycle and FMA has 128 operations per cycle.
2. Each PU has 64 ALUs, with one 16-wide FMA pipeline and one 16-wide CVT pipeline. The number of ALUs (64) does not match the number of instructions issued to the pipelines (32 = 16 FMA + 16 CVT). Am I missing something here?
Industry convention for GPUs is to only count FMAs, and count them as two operations (mul + add), so the data sheet saying 256 fp32 ops/cy means that each shader core can do 128 fp32 FMA operations per clock cycle. These numbers completely ignore the CVT and SFU units.
Whether you get more than 256 ops/cy because of the CVT and SFU units depends on the operation and GPU generation. Some can issue in parallel, some can't, as we continuously rebalance the shader core design to get the right balance of operations for industry content trends and to optimize for energy efficiency.
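As a worked example of that counting convention, the arithmetic can be sketched as below. The 128 FMAs per cycle per core comes from this thread and the 16-core count from the original question; the 1000 MHz clock is a hypothetical placeholder, not a datasheet value, and the function names are made up for this sketch.

```python
def peak_fp32_ops_per_cycle(fma_per_cycle_per_core: int, cores: int = 1) -> int:
    """Datasheet-style peak: each FMA counts as two ops (mul + add).

    CVT and SFU work is ignored, following the convention described above.
    """
    return fma_per_cycle_per_core * 2 * cores


def peak_fp32_gflops(fma_per_cycle_per_core: int, cores: int, clock_mhz: float) -> float:
    """Theoretical peak in GFLOPS; clock_mhz is supplied by the caller."""
    ops_per_cycle = peak_fp32_ops_per_cycle(fma_per_cycle_per_core, cores)
    return ops_per_cycle * clock_mhz / 1000.0


per_core = peak_fp32_ops_per_cycle(128)           # 256 ops/cy for one core
all_cores = peak_fp32_ops_per_cycle(128, 16)      # 4096 ops/cy for 16 cores
gflops = peak_fp32_gflops(128, 16, 1000.0)        # assumes a 1000 MHz clock
```

These are theoretical peaks only; real shader code will land below them depending on instruction mix and how well latency is hidden.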
Thanks for the response, Peter.
So, FMA throughput is 128 fp32 FMA operations per cycle.
Could you tell me the throughput of the CVT unit in terms of integer ops per cycle?
For Mali-G715 it's 64 int32 ops per cycle.