Hi, I've been benchmarking the performance of the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. It looks like the throughput of the int64 multiply (MADD) instruction is about one third that of the multiply instructions for the int32, float and double C data types on the same hardware.
I've posted the same question on the NetBSD port-arm mailing list. More details can be found here: http://mail-index.netbsd.org/port-arm/2020/04/15/msg006614.html
Is this expected at all? Does anyone know why int64 multiply is so much slower than the other data types?
Hi, thanks for the suggestion, but the assembly code for the int32 and double data types is very similar and does not have such throughput issues. So it doesn't seem that the LDR instruction is the sole culprit here. Below is the assembly code for the loop multiplying double data types:
404718: fd0002a0  str d0, [x21]
40471c: 1e604109  fmov d9, d8
404720: 1e60410a  fmov d10, d8
404724: 1e60410b  fmov d11, d8
404728: 1e60410c  fmov d12, d8
40472c: 1e60410d  fmov d13, d8
404730: 1e60410e  fmov d14, d8
404734: 1e60410f  fmov d15, d8
404738: 34000293  cbz w19, 404788 <dbl_mul+0xa8>
40473c: d503201f  hint #0x0
404740: fd403fe3  ldr d3, [sp, #120]
404744: 71000673  subs w19, w19, #0x1
404748: fd403fe2  ldr d2, [sp, #120]
40474c: fd403fe1  ldr d1, [sp, #120]
404750: fd403fe0  ldr d0, [sp, #120]
404754: 1e630908  fmul d8, d8, d3
404758: 1e6209ef  fmul d15, d15, d2
40475c: fd403fe3  ldr d3, [sp, #120]
404760: 1e6109ce  fmul d14, d14, d1
404764: fd403fe2  ldr d2, [sp, #120]
404768: 1e6009ad  fmul d13, d13, d0
40476c: fd403fe1  ldr d1, [sp, #120]
404770: fd403fe0  ldr d0, [sp, #120]
404774: 1e63098c  fmul d12, d12, d3
404778: 1e62096b  fmul d11, d11, d2
40477c: 1e61094a  fmul d10, d10, d1
404780: 1e600929  fmul d9, d9, d0
404784: 54fffde1  b.ne 404740 <dbl_mul+0x60> // b.any
It has just as many loads yet much better throughput, about 3x that of the int64 multiply.
There seems to be something about int64 multiplication that either stalls the pipeline or adds some overhead that int32, float and double multiplication do not have. It would be interesting to find out what it is and understand it.
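For anyone who wants to reproduce the comparison, here is a minimal C sketch of the kind of loop that could compile to code like the above. It is an assumption, not the actual benchmark source: it guesses at eight independent accumulators and a volatile operand, which would account for the repeated ldr from [sp, #120] before each multiply. The name dbl_mul is taken from the disassembly symbol; int64_mul, DEFINE_MUL_BENCH and the *_seconds helpers are hypothetical names made up for illustration.

```c
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical macro: defines a multiply loop for one data type plus a
 * helper that times it. Eight independent accumulators give the core
 * eight multiply chains to overlap; the volatile operand forces the
 * compiler to reload it from memory before every multiply.
 */
#define DEFINE_MUL_BENCH(name, type)                                   \
    type name(type operand, uint32_t iterations)                       \
    {                                                                  \
        volatile type val = operand; /* reloaded on every use */       \
        type a = 1, b = 1, c = 1, d = 1;                               \
        type e = 1, f = 1, g = 1, h = 1;                               \
        while (iterations--) {                                         \
            a *= val; b *= val; c *= val; d *= val;                    \
            e *= val; f *= val; g *= val; h *= val;                    \
        }                                                              \
        /* Return a value that depends on all chains so none of  */    \
        /* them can be optimised away.                            */   \
        return a + b + c + d + e + f + g + h;                          \
    }                                                                  \
                                                                       \
    /* Runs the loop with operand 1 (no overflow, no infinity) and */  \
    /* returns the CPU time in seconds as measured by clock().     */  \
    double name##_seconds(uint32_t iterations, type *sink)             \
    {                                                                  \
        clock_t start = clock();                                       \
        *sink = name((type)1, iterations);                             \
        return (double)(clock() - start) / CLOCKS_PER_SEC;             \
    }

DEFINE_MUL_BENCH(int64_mul, int64_t)
DEFINE_MUL_BENCH(dbl_mul, double)
```

Built with optimisation enabled (for example cc -O2), int64_mul_seconds() and dbl_mul_seconds() can be called with the same large iteration count and the two times compared. Each iteration performs 8 multiplies, so multiplies per second is 8 * iterations / seconds. The operand is fixed at 1 in the timing helpers so the int64 accumulators never overflow.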
Sad Clouds said: So it doesn't seem that LDR instruction is the sole culprit here.
Yes, I would say so too. Unless you flush the cache between runs, val is read from cache. Also, on an A72 with a 64-bit or even 128-bit bus to the cache, there should be no difference between loading 32 bits and loading 64 bits.