What increase in throughput can I expect on my device from changing a sequence of
```
fmla v1.4s, v1.4s, v1.4s
```
to
```
mla v1.16b, v1.16b, v1.16b
```
?
My device consists of X3, A715 and A510 cores.
Profiling peakflops, I measured a ~2x increase in throughput. I would have expected a 4x increase, since a 128-bit register holds 16 int8 lanes versus 4 fp32 lanes.
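To spell out where my 4x expectation comes from, here is a minimal sketch in C of the lane arithmetic. It assumes both instructions issue at the same rate per cycle, which is only my assumption and may well not hold on these cores. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Number of lanes in one 128-bit NEON register for a given element width. */
int lanes_per_q_register(int element_bits) {
    assert(element_bits > 0 && 128 % element_bits == 0);
    return 128 / element_bits;
}

/* A multiply-accumulate is counted as 2 operations per lane
   (one multiply, one add). */
int ops_per_instruction(int element_bits) {
    return 2 * lanes_per_q_register(element_bits);
}

/* Prints the per-instruction operation counts and their ratio.
   ASSUMPTION: equal issue rate for fmla .4s and mla .16b. */
void print_expected_ratio(void) {
    int fp32_ops = ops_per_instruction(32); /* fmla v1.4s  */
    int int8_ops = ops_per_instruction(8);  /* mla  v1.16b */
    printf("fp32 ops/instr: %d\n", fp32_ops);
    printf("int8 ops/instr: %d\n", int8_ops);
    printf("expected ratio: %dx\n", int8_ops / fp32_ops);
}
```

Calling `print_expected_ratio()` from a `main` gives the naive per-instruction ratio only. The measured ~2x suggests the equal-issue-rate assumption is where my estimate goes wrong.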
Is there any matrix-multiplication-related instruction on Arm with which I can expect a 4x increase in throughput by using int8 data types (possibly with a widening accumulator)?