Hi, I've been benchmarking the performance of the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. It looks like the throughput of the int64 multiply (MADD) instruction is about one third of the throughput of the multiply instructions for the int32, float and double C data types on the same hardware.
I've posted the same question on the NetBSD port-arm mailing list. More details can be found here: http://mail-index.netbsd.org/port-arm/2020/04/15/msg006614.html
Is this expected at all? Does anyone know why int64 multiply is so much slower than multiply for the other data types?
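For anyone who wants to reproduce this, here is a minimal sketch of the kind of loop that should expose the difference. It is not the benchmark from the mailing list post; the function names (`mul_chains_u64`, `mul_chains_u32`, `time_ratio_u64_over_u32`) and the choice of four independent chains are my own assumptions, picked so that the loop is more likely to be limited by multiplier throughput than by the latency of a single dependency chain.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical kernel: four independent 64-bit multiply-add chains.
   m is a runtime value so the compiler cannot fold the loop away. */
uint64_t mul_chains_u64(uint64_t m, long iters)
{
    uint64_t a = 1, b = 2, c = 3, d = 4;
    for (long i = 0; i < iters; i++) {
        a = a * m + 1;
        b = b * m + 2;
        c = c * m + 3;
        d = d * m + 4;
    }
    return a ^ b ^ c ^ d;   /* combine all chains so none is dead code */
}

/* Same kernel with 32-bit operands, for comparison. */
uint32_t mul_chains_u32(uint32_t m, long iters)
{
    uint32_t a = 1, b = 2, c = 3, d = 4;
    for (long i = 0; i < iters; i++) {
        a = a * m + 1;
        b = b * m + 2;
        c = c * m + 3;
        d = d * m + 4;
    }
    return a ^ b ^ c ^ d;
}

/* Times both kernels with clock() (CPU time) and returns the
   64-bit / 32-bit time ratio, or 0.0 if the 32-bit run was too
   short for the clock resolution. */
double time_ratio_u64_over_u32(uint64_t m, long iters)
{
    clock_t t0 = clock();
    uint64_t r64 = mul_chains_u64(m, iters);
    clock_t t1 = clock();
    uint32_t r32 = mul_chains_u32((uint32_t)m, iters);
    clock_t t2 = clock();

    double s64 = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double s32 = (double)(t2 - t1) / CLOCKS_PER_SEC;

    /* Print the results too, so the calls cannot be optimized out. */
    printf("u64: %.3f s (result %llu)\n", s64, (unsigned long long)r64);
    printf("u32: %.3f s (result %lu)\n", s32, (unsigned long)r32);

    return s32 > 0.0 ? s64 / s32 : 0.0;
}
```

To use it, call `time_ratio_u64_over_u32` from `main` with a multiplier taken from the command line and a large iteration count (say a few hundred million). It is worth checking the disassembly before trusting the numbers: the compiler may or may not emit MADD for these loops, and at higher optimization levels it could vectorize the 32-bit version, which would measure something else.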
OK, but even with one FP pipeline you can run float/double multiplication at 1 instruction per cycle, which is three times the int64 multiplication rate of 1/3 instruction per cycle. This is strange, because people normally assume integer arithmetic is more efficient than floating point arithmetic (or at the very least not worse), and that is not the case on aarch64. I really wonder why these CPUs have such poor int64 multiplication throughput. Was there some design constraint?
Not AArch64, Cortex-A72. Other implementations may be better or worse.