Expected Increase in Throughput for Int8 vs FP32 Multiplication

FabianSchuetze 1 month ago

What increase in throughput can I expect on my device from changing a sequence of

```
fmla v1.4s, v1.4s, v1.4s
```

to

```
mla v1.16b, v1.16b, v1.16b
```

My device consists of X3, A715 and A510 processors.

When profiling with peakflops, I got a ~2x increase in throughput. I would have expected a 4x increase.
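The 4x expectation comes from lane counts alone: a 128-bit vector register holds four 32-bit lanes (`.4s`) but sixteen 8-bit lanes (`.16b`). A minimal sketch of that arithmetic follows. The helper names are hypothetical, and it assumes both instructions issue at the same rate per cycle, which is an assumption rather than something measured here.

```c
/* Width of one NEON vector register in bits. */
#define NEON_VECTOR_BITS 128

/* Number of lanes of the given element width that fit in one register. */
static int lanes_per_vector(int element_bits) {
    return NEON_VECTOR_BITS / element_bits;
}

/* Ratio of multiply-accumulates per instruction, narrow vs. wide elements.
   Assumption: both instructions have the same issue rate per cycle. */
static int expected_speedup(int wide_bits, int narrow_bits) {
    return lanes_per_vector(narrow_bits) / lanes_per_vector(wide_bits);
}
```

With these helpers, `expected_speedup(32, 8)` evaluates to 4, since 16 lanes divided by 4 lanes is 4. Any difference in per-cycle issue rate between the two instructions would reduce the measured ratio below this figure.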

Is there any matrix-multiplication-related instruction on Arm with which I can expect a 4x increase in throughput by using int8 data types (possibly with a widening accumulator)?

Top replies

Zhifei Yang 1 month ago in reply to FabianSchuetze +1 suggested

Thanks for sharing your data. It seems Socket-0 can reach an FMLA/MLA factor of ~4. For the Socket-1 and Socket-2 CPUs, the results may be affected by several factors. 1) You can break down likwid-bench workload...

FabianSchuetze 1 month ago in reply to Zhifei Yang +1

Thanks a lot for your reply. Based on your suggestion, I read the optimization guides and can confirm that the observed throughput is as expected. The latency of both FMLA and MLA is four cycles on all...