Expected Increase in Throughput for Int8 vs FP32 Multiplication

What increase in throughput can I expect on my device from changing a sequence of

```
fmla  v1.4s, v1.4s, v1.4s
```

to

```
mla  v1.16b, v1.16b, v1.16b
```

?

My device consists of Cortex-X3, Cortex-A715 and Cortex-A510 cores.

Profiling peak arithmetic throughput, I measured only a ~2x increase. I would have expected a 4x increase, since a 128-bit vector holds 16 int8 lanes versus 4 fp32 lanes.
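For clarity, this is the naive per-instruction lane arithmetic behind my 4x expectation (a sketch, assuming both instructions issue and retire at the same rate):

```python
# Lane counts for 128-bit NEON vectors (assumption: equal issue
# rate for fmla and mla, so throughput scales with lane count).
VECTOR_BITS = 128

fp32_lanes = VECTOR_BITS // 32   # fmla v.4s  -> 4 MACs per instruction
int8_lanes = VECTOR_BITS // 8    # mla  v.16b -> 16 MACs per instruction

expected_speedup = int8_lanes / fp32_lanes
print(expected_speedup)  # 4.0 expected, yet I only measure ~2x
```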

Is there any matrix-multiplication-related instruction on Arm from which I can expect a 4x throughput increase with int8 data types (possibly with a widening accumulator)?
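To illustrate what I mean by a widening accumulator: my understanding (a sketch, not taken from vendor documentation) is that an instruction like `SDOT` sums four int8×int8 products into each int32 accumulator lane, so one 128-bit instruction performs 16 multiplies feeding 4 widened accumulators. A scalar emulation of one such lane:

```python
def sdot_lane(acc, a, b):
    """Emulate one int32 lane of a dot-product instruction:
    acc += dot(four int8 values, four int8 values).
    Products widen to int32, so no int8 overflow on accumulate."""
    assert len(a) == 4 and len(b) == 4
    return acc + sum(x * y for x, y in zip(a, b))

# One 128-bit instruction would do four such lanes at once,
# i.e. 16 int8 multiplies per instruction.
print(sdot_lane(0, [1, 2, 3, 4], [5, 6, 7, 8]))  # 70
```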