Architectures and Processors forum

Expected Increase in Throughput for Int8 vs FP32 Multiplication

FabianSchuetze over 1 year ago

What increase in throughput can I expect on my device from changing a sequence of

```
fmla v1.4s, v1.4s, v1.4s
```

to

```
mla v1.16b, v1.16b, v1.16b
```

My device consists of X3, A715 and A510 processors.

When profiling peakflops, I got a ~2x increase in throughput. I would have expected a 4x increase.
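The 4x expectation follows from lane counts alone. Here is a rough sketch of that arithmetic in Python, assuming 128-bit vector registers and that both instructions issue at the same rate (an assumption, not something measured here; the helper name `lanes` is made up for illustration):

```python
VECTOR_BITS = 128  # assumed width of one NEON Q register


def lanes(element_bits: int) -> int:
    """Number of elements one 128-bit vector instruction operates on."""
    return VECTOR_BITS // element_bits


fp32_lanes = lanes(32)  # fmla on .4s lanes: 4 multiply-accumulates
int8_lanes = lanes(8)   # mla on .16b lanes: 16 multiply-accumulates

# This ratio is only the expected speedup if both instructions
# sustain the same number of instructions per cycle.
expected_ratio = int8_lanes / fp32_lanes
print(fp32_lanes, int8_lanes, expected_ratio)
```

Under that assumption, a measured ratio of ~2x would suggest the int8 form sustains roughly half the instruction rate of the FP32 form in this benchmark, unless something else is the limit.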

Is there any matrix-multiplication-related instruction on Arm with which I can expect a 4x increase in throughput by using int8 data types (possibly with a widening accumulator)?
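For context on the widening accumulator: plain `mla` on `.16b` lanes keeps its accumulator at 8 bits, so products wrap quickly. Instructions such as `sdot` from the Armv8.2-A dot-product extension instead sum four int8 products into each 32-bit lane. The Python below is only a scalar model of those lane semantics, with hypothetical helper names. It makes no claim about throughput on the cores mentioned above.

```python
def wrap_int8(x: int) -> int:
    """Wrap an integer to signed 8-bit two's complement."""
    return (x + 128) % 256 - 128


def mla_16b(acc, a, b):
    """Model of mla on 16 int8 lanes: acc += a * b, wrapped to 8 bits."""
    return [wrap_int8(c + x * y) for c, x, y in zip(acc, a, b)]


def sdot_4s(acc, a, b):
    """Model of sdot on 4 int32 lanes: each lane adds four int8 products.

    32-bit wraparound is ignored in this sketch.
    """
    return [
        acc[i] + sum(a[4 * i + j] * b[4 * i + j] for j in range(4))
        for i in range(4)
    ]


a = [100] * 16
b = [2] * 16
narrow = mla_16b([0] * 16, a, b)  # 100 * 2 does not fit in int8
wide = sdot_4s([0] * 4, a, b)     # four products fit easily in int32
print(narrow[0], wide[0])
```

The narrow result wraps while the wide one holds the full sum, which is why int8 matrix kernels usually accumulate in 32 bits.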

Top replies

Zhifei Yang over 1 year ago in reply to FabianSchuetze +1 suggested

Thanks for sharing your data. It seems Socket-0 can reach an FMLA/MLA factor of ~4. For the Socket-1 and Socket-2 CPUs, the results may be affected by many factors. 1) You can break down the likwid-bench workload...

FabianSchuetze over 1 year ago in reply to Zhifei Yang +1

Thanks a lot for your reply. Based on your suggestion, I read the optimization guides and can confirm that the observed throughput is as expected. The latency of both FMLA and MLA is four cycles on all...