What increase in throughput can I expect on my device from changing a sequence of
```
fmla v1.4s, v1.4s, v1.4s
```
instructions to a sequence of
```
mla v1.16b, v1.16b, v1.16b
```
instructions?
My device consists of X3, A715 and A510 processors.
When profiling peakflops I got a ~2x increase in throughput. I would have expected a 4x increase.
Is there any matrix-multiplication-related instruction on Arm from which I can expect a 4x increase in throughput by using int8 data types (possibly with a widening accumulator)?
Do you know which processor produced the ~2x result: the X3, A715, or A510?
Did you bind your workload application to a specific CPU, e.g. with taskset?
Thanks a lot for the reply. I am running this on a Samsung S9 tablet. Unfortunately, I am not sure which processor resides on which socket. Lstopo shows the following (cache sizes aren't read correctly):
```
Machine (7221MB total)
  L3 L#0 (0KB)
    NUMANode L#0 (P#0 7221MB)
    Package L#0
      L2 L#0 (0KB) + L1d L#0 (0KB) + L1i L#0 (0KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (0KB)
        L1d L#1 (0KB) + L1i L#1 (0KB) + Core L#1 + PU L#1 (P#1)
        L1d L#2 (0KB) + L1i L#2 (0KB) + Core L#2 + PU L#2 (P#2)
    Package L#1
      L2 L#2 (0KB) + L1d L#3 (0KB) + L1i L#3 (0KB) + Core L#3 + PU L#3 (P#3)
      L2 L#3 (0KB) + L1d L#4 (0KB) + L1i L#4 (0KB) + Core L#4 + PU L#4 (P#4)
      L2 L#4 (0KB) + L1d L#5 (0KB) + L1i L#5 (0KB) + Core L#5 + PU L#5 (P#5)
      L2 L#5 (0KB) + L1d L#6 (0KB) + L1i L#6 (0KB) + Core L#6 + PU L#6 (P#6)
    Package L#2 + L2 L#6 (0KB) + L1d L#7 (0KB) + L1i L#7 (0KB) + Core L#7 + PU L#7 (P#7)
```
The results of likwid-bench are the following (running on all cores of each socket):

| Socket | FMLA (GFLOPS/sec) | MLA (GFLOPS/sec) | Factor MLA/FMLA |
|--------|-------------------|------------------|-----------------|
| 0      | 48                | 188              | ~4              |
| 1      | 133               | 267              | ~2              |
| 2      | 106               | 213              | ~2              |

likwid-bench always requires binding the test to a particular socket, and tests cannot span sockets.
Thanks for sharing your data. It seems that Socket 0 reaches an MLA/FMLA factor of ~4. For the CPUs in Socket 1 and Socket 2, the results may be affected by several things:
1) If you can, break down the likwid-bench workload and examine it further with the help of perf trace.
2) Different CPU microarchitectures can behave differently.
3) Refer to the Arm Software Optimization Guide for the specific CPU type to fine-tune the code.
Thanks a lot for your reply. Based on your suggestion, I read the optimization guides and can confirm that the observed throughput is as expected. The latency of both FMLA and MLA is four cycles on all architectures (X3, A510, A715). The throughput halves for MLA vs FMLA on the X3 and A715. Although the number of elements processed per instruction increases by a factor of 4, only half the resources are available, resulting in a 2x increase. In contrast, the throughput stays constant on the A510, yielding a 4x increase. Below is the summary table of throughput per architecture:
```
|      | X3 | A510 | A715 |
|------|----|------|------|
| FMLA | 4  | 2    | 2    |
| MLA  | 2  | 2    | 1    |
```
To cross-reference the predictions from the hardware guides, I would like to understand which socket corresponds to which architecture. So far, the sockets are "anonymous" to me, as I cannot relate a socket ID to its architecture name. Do you know of any way to do that, Zhifei Yang?
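The reasoning in this post can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the table values are instructions issued per cycle, and that a 128-bit register holds 4 fp32 lanes (`.4s`) or 16 int8 lanes (`.16b`). The names `THROUGHPUT`, `LANES` and `expected_speedup` are made up for this example.

```python
# Instruction throughput per core, copied from the summary table in this post
# (assumed here to mean instructions per cycle).
THROUGHPUT = {
    "X3":   {"FMLA": 4, "MLA": 2},
    "A510": {"FMLA": 2, "MLA": 2},
    "A715": {"FMLA": 2, "MLA": 1},
}

# Elements processed by one instruction on a 128-bit vector register:
# .4s = 4 x 32-bit floats, .16b = 16 x 8-bit integers.
LANES = {"FMLA": 4, "MLA": 16}


def expected_speedup(core):
    """Ratio of elements per cycle for MLA (.16b) versus FMLA (.4s)."""
    t = THROUGHPUT[core]
    return (LANES["MLA"] * t["MLA"]) / (LANES["FMLA"] * t["FMLA"])


for core in THROUGHPUT:
    print(core, expected_speedup(core))
# X3 2.0
# A510 4.0
# A715 2.0
```

Under these assumptions only the A510 gives the full 4x, which would be consistent with the one socket that measured a factor of ~4.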
Great to hear that you found an explanation for the throughput increase.
I don't have an S9 tablet datasheet at hand, but there is a generic method to check the Arm CPU type.
On Android or a Linux-like OS, you can run the command `cat /proc/cpuinfo`. Here is an example.
Please check the CPU part number. Once you know the CPU type of each CPU ID, you can try to map it to the socket ID.
<quote>
# cat /proc/cpuinfo
processor : 0
BogoMIPS : 26.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd46
CPU revision : 2

processor : 1
BogoMIPS : 26.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd46
CPU revision : 2
</quote>
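As a rough sketch of how the part-number lookup could be automated, the Python snippet below parses `/proc/cpuinfo`-style text and maps each `processor` ID to a core name. The helper name `cores_by_cpu` and the table `PART_NAMES` are made up for this example. The part numbers listed (0xd46, 0xd4d, 0xd4e) are the values I believe correspond to these three cores, so please verify them against the MIDR description in each core's Technical Reference Manual.

```python
# Part numbers for the cores discussed in this thread.
# Assumption: verify these against the MIDR_EL1 description in each core's TRM.
PART_NAMES = {
    0xd46: "Cortex-A510",
    0xd4d: "Cortex-A715",
    0xd4e: "Cortex-X3",
}


def cores_by_cpu(cpuinfo_text):
    """Map each 'processor' ID to a core name using its 'CPU part' field."""
    mapping = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a key/value pair
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "CPU part" and cpu is not None:
            part = int(value, 16)  # accepts the "0x" prefix
            mapping[cpu] = PART_NAMES.get(part, "unknown (" + value + ")")
    return mapping


# Shortened sample in the same format as the output quoted above.
SAMPLE = """processor : 0
CPU implementer : 0x41
CPU part : 0xd46

processor : 1
CPU implementer : 0x41
CPU part : 0xd46
"""

print(cores_by_cpu(SAMPLE))
# {0: 'Cortex-A510', 1: 'Cortex-A510'}
```

On the device itself, you could pass the contents of `/proc/cpuinfo` to `cores_by_cpu` and then compare the resulting CPU IDs with the `PU ... (P#n)` entries that lstopo lists under each package to work out which socket holds which core type.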