Hi, I've been benchmarking the performance of the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. It looks like the throughput of the int64 multiply (MADD) instruction is about one third of that of the multiply instructions for the int32, float and double C data types on the same hardware.
I've posted the same question on the NetBSD port-arm mailing list. More details can be found here: http://mail-index.netbsd.org/port-arm/2020/04/15/msg006614.html
Is this expected at all? Does anyone know why int64 multiply is so much slower than the other data types?
Hi Sad Clouds,
As per the Cortex-A72 Software Optimization Guide,
the MUL instruction has a throughput of 1 per cycle.
Also, the assembly code of the loop in your post contains many loads:
  40451c: fd000380 str d0, [x28]
  404520: aa1303e0 orr x0, xzr, x19
  404524: 34000274 cbz w20, 404570 <int64_mul+0xa0>
  404528: f94037e2 ldr x2, [sp, #104]
  40452c: 71000694 subs w20, w20, #0x1
  404530: f94037e1 ldr x1, [sp, #104]
  404534: f94037e3 ldr x3, [sp, #104]
  404538: 9b027e73 madd x19, x19, x2, xzr
  40453c: f94037e2 ldr x2, [sp, #104]
  404540: 9b017c00 madd x0, x0, x1, xzr
  404544: f94037e1 ldr x1, [sp, #104]
  404548: 9b037f5a madd x26, x26, x3, xzr
  40454c: f94037e3 ldr x3, [sp, #104]
  404550: 9b027f39 madd x25, x25, x2, xzr
  404554: f94037e2 ldr x2, [sp, #104]
  404558: 9b017f18 madd x24, x24, x1, xzr
  40455c: f94037e1 ldr x1, [sp, #104]
  404560: 9b037ef7 madd x23, x23, x3, xzr
  404564: 9b027ed6 madd x22, x22, x2, xzr
  404568: 9b017eb5 madd x21, x21, x1, xzr
  40456c: 54fffde1 b.ne 404528 <int64_mul+0x58> // b.any
  404570: f90033e0 str x0, [sp, #96]
I think you should tweak your code and compiler settings to obtain a loop more like the following, with no memory access:
.L3:
  subs w0, w0, #1
  mul x15, x15, x1
  mul x14, x14, x1
  mul x13, x13, x1
  mul x12, x12, x1
  mul x11, x11, x1
  mul x10, x10, x1
  mul x9, x9, x1
  mul x8, x8, x1
  bne .L3
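A minimal C sketch of a kernel that a compiler could turn into a loop like the one above. The repeated `ldr` from `[sp, #104]` in the posted listing suggests the multiplier is reloaded from the stack on every use (for example because it is declared `volatile`), but that is an assumption, since the C source is not shown here. The function name `int64_mul_chains` and the starting values are hypothetical, and unsigned arithmetic is used only to avoid signed-overflow undefined behaviour.

```c
#include <stdint.h>

/* Hypothetical benchmark kernel: eight independent 64-bit multiply chains.
   'val' is a plain (non-volatile) parameter, so the compiler is free to
   keep it in a register rather than reloading it inside the loop. */
uint64_t int64_mul_chains(uint64_t val, uint32_t iterations)
{
    uint64_t a0 = 1, a1 = 2, a2 = 3, a3 = 4;
    uint64_t a4 = 5, a5 = 6, a6 = 7, a7 = 8;

    while (iterations--) {
        /* No chain depends on another, so the multiplies can overlap. */
        a0 *= val; a1 *= val; a2 *= val; a3 *= val;
        a4 *= val; a5 *= val; a6 *= val; a7 *= val;
    }

    /* Combine all chains so none of them is optimised away. */
    return a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7;
}
```

To keep the compiler from folding the whole loop into a constant, `val` should come from somewhere it cannot see through, such as a command-line argument. Whether the generated loop is actually free of loads still has to be checked in the disassembly.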
Best regards,
Vincent.
Hi, thanks for the suggestion, but the assembly code for the int32 and double data types is very similar and does not have such throughput issues. So it doesn't seem that LDR instruction is the sole culprit here. Below is the assembly code for the loop multiplying double data types:
  404718: fd0002a0 str d0, [x21]
  40471c: 1e604109 fmov d9, d8
  404720: 1e60410a fmov d10, d8
  404724: 1e60410b fmov d11, d8
  404728: 1e60410c fmov d12, d8
  40472c: 1e60410d fmov d13, d8
  404730: 1e60410e fmov d14, d8
  404734: 1e60410f fmov d15, d8
  404738: 34000293 cbz w19, 404788 <dbl_mul+0xa8>
  40473c: d503201f hint #0x0
  404740: fd403fe3 ldr d3, [sp, #120]
  404744: 71000673 subs w19, w19, #0x1
  404748: fd403fe2 ldr d2, [sp, #120]
  40474c: fd403fe1 ldr d1, [sp, #120]
  404750: fd403fe0 ldr d0, [sp, #120]
  404754: 1e630908 fmul d8, d8, d3
  404758: 1e6209ef fmul d15, d15, d2
  40475c: fd403fe3 ldr d3, [sp, #120]
  404760: 1e6109ce fmul d14, d14, d1
  404764: fd403fe2 ldr d2, [sp, #120]
  404768: 1e6009ad fmul d13, d13, d0
  40476c: fd403fe1 ldr d1, [sp, #120]
  404770: fd403fe0 ldr d0, [sp, #120]
  404774: 1e63098c fmul d12, d12, d3
  404778: 1e62096b fmul d11, d11, d2
  40477c: 1e61094a fmul d10, d10, d1
  404780: 1e600929 fmul d9, d9, d0
  404784: 54fffde1 b.ne 404740 <dbl_mul+0x60> // b.any
It has just as many loads and much better throughput, i.e. 3x that of the int64 multiply.
There seems to be something about int64 multiplication that either stalls the pipeline or adds some overhead that int32, float and double multiplication do not have. It would be interesting to find out what and to understand it.
Sad Clouds said: So it doesn't seem that LDR instruction is the sole culprit here.
Yes, I would say so too. Unless you flush the cache between runs, val is read from the cache. Also, on an A72 with a 64-bit or even 128-bit bus to the cache, there should be no difference between loading 32 bits and loading 64.
vstehle said: As per the Cortex-A72 Software Optimization Guide, the MUL instruction has a throughput of 1 per cycle.
The doc shows that MADD has a latency of 5 for 64-bit compared to 3 for 32-bit, and a throughput of 1/3 (whatever the /3 means) compared to 1. Does 1/3 mean one third? Or: 1 or 3?
From the doc: "5. X-form multiply accumulates stall the multiplier pipeline for two extra cycles."
So this code
  404538: 9b027e73 madd x19, x19, x2, xzr
  40453c: f94037e2 ldr x2, [sp, #104]
  404540: 9b017c00 madd x0, x0, x1, xzr
  404544: f94037e1 ldr x1, [sp, #104]
  404548: 9b037f5a madd x26, x26, x3, xzr
stalls the multiplier pipeline after each "MADD", for the two extra cycles mentioned in the note quoted above.
Yes, I think you're right here. I also didn't notice the "X-form multiply" until you pointed it out. The manual defines throughput as:
Execution Throughput is defined as the maximum throughput (in instructions / cycle) of the specified instruction group
So according to the manual, MADD on a W register has a throughput of 1 instruction per cycle and MADD on an X register has a throughput of 1/3 instruction per cycle. This would explain why my test is also showing the throughput of int64 multiply to be 1/3 of the others. Strange how FMUL on a D register has higher throughput, i.e. 2 instructions per cycle.
This is a bit of a disappointment, since I imagine 64-bit multiplication on aarch64 is quite common. Well, at least this answers my question.
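As a rough worked example of what those throughput figures mean for a block of independent multiplies, here is a small C sketch. The helper name `min_issue_cycles` is hypothetical, and the model only counts issue slots at the quoted throughput; it ignores latency, loads and everything else in the pipeline.

```c
/* Lower bound on the cycles needed to issue n instructions of one kind,
   with throughput given as the fraction num/den instructions per cycle. */
unsigned long min_issue_cycles(unsigned long n, unsigned long num,
                               unsigned long den)
{
    /* ceil(n * den / num) using integer arithmetic */
    return (n * den + num - 1) / num;
}
```

With the figures quoted above, eight X-form MADDs need at least `min_issue_cycles(8, 1, 3)` = 24 cycles, eight W-form MADDs need 8, and eight FMULs on D registers need 4.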
Check the document: it shows there are two FPU/SIMD units, but only one multiply unit. If you have a lot of multiplies, you may be able to use SIMD instead.
Edit: Oh, it seems there is no 64-bit vector multiplication. At least I can't find it. ;(
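Without a 64-bit vector multiply, the usual workaround is to build the low 64 bits of the product from 32-bit pieces. The scalar C sketch below shows only that arithmetic; the function name `mul64_from_32` is hypothetical, and nothing here claims the decomposition is faster than a plain 64-bit multiply on the A72.

```c
#include <stdint.h>

/* Low 64 bits of a * b, using only 32x32 -> 64 bit multiplies.
   a = a_hi * 2^32 + a_lo, b = b_hi * 2^32 + b_lo, so modulo 2^64:
   a * b = a_lo * b_lo + ((a_hi * b_lo + a_lo * b_hi) << 32)        */
uint64_t mul64_from_32(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    /* Only the low 32 bits of the cross terms survive the shift, so
       wrap-around in this sum does not affect the result. */
    uint64_t cross = a_hi * b_lo + a_lo * b_hi;

    return a_lo * b_lo + (cross << 32);
}
```

The a_hi * b_hi term is dropped because it only contributes to bits 64 and above.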
OK, but even with one FP pipeline you can run float/double multiplication at 1 instruction per cycle, which is way better than int64 multiplication at 1/3 instruction per cycle. This is strange, because normally people assume integer arithmetic is more efficient (or at the very least not worse) than floating point arithmetic, and this is not the case on aarch64. I really wonder why these CPUs have such poor int64 multiplication throughput. There must have been some design constraint?
Not AArch64 in general, just the Cortex-A72. Other implementations may be better or worse.