Hi, I've been benchmarking the performance of the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. It looks like the throughput of the int64 multiply (MADD) instruction is about one third that of the multiply instructions for the int32, float and double C data types on the same hardware.
I've posted the same question on the NetBSD arm mailing list. More details can be found here: http://mail-index.netbsd.org/port-arm/2020/04/15/msg006614.html
Is this expected at all? Does anyone know why int64 multiply is so much slower than the other data types?
Sad Clouds said: So it doesn't seem that LDR instruction is the sole culprit here.
Yes, I would say so too. Unless you flush the cache between runs, val is read from cache. Also, on an A72 with a 64-bit or even 128-bit bus to the cache, there should be no difference between loading 32 bits and loading 64 bits.
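
For anyone who wants to reproduce the comparison, here is a minimal sketch of such a benchmark in C. It is not the code from the mailing list post. The helper now_sec, the macro DEFINE_MUL_BENCH and the bench_mul_* function names are made up for illustration. It assumes a POSIX system with clock_gettime, and it uses unsigned integer types so that overflow wraps instead of being undefined. Each loop iteration reloads val through a volatile (so the load stays in the loop, as in the discussion above) and feeds it into four independent multiply chains, so the result leans towards multiply throughput rather than latency.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: monotonic time in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Generates one benchmark function per data type (names are illustrative).
 * - val is volatile, so it is loaded again on every iteration.
 * - a, b, c, d start from different values and form four independent
 *   multiply chains, so the compiler cannot merge them.
 * - The sum is written to *sink so the work is not optimised away.
 * Returns multiplies per second.
 */
#define DEFINE_MUL_BENCH(NAME, TYPE)                              \
    static double NAME(uint64_t iters, TYPE seed, TYPE *sink)     \
    {                                                             \
        assert(iters > 0);                                        \
        volatile TYPE val = seed;                                 \
        TYPE a = 1, b = 2, c = 3, d = 4;                          \
        double t0 = now_sec();                                    \
        for (uint64_t i = 0; i < iters; i++) {                    \
            TYPE v = val;                                         \
            a *= v;                                               \
            b *= v;                                               \
            c *= v;                                               \
            d *= v;                                               \
        }                                                         \
        double t1 = now_sec();                                    \
        *sink = a + b + c + d;                                    \
        return (4.0 * (double)iters) / (t1 - t0);                 \
    }

DEFINE_MUL_BENCH(bench_mul_u32, uint32_t)
DEFINE_MUL_BENCH(bench_mul_u64, uint64_t)
DEFINE_MUL_BENCH(bench_mul_f32, float)
DEFINE_MUL_BENCH(bench_mul_f64, double)
```

To use it, call each function from main with a large iteration count (for example 100000000), a seed of 3 for the integer types and 1.0 for the floating point types, and print the returned rate with printf. Build with optimisation enabled (e.g. -O2) and check the generated assembly to confirm the loop really contains the multiply instructions you expect, since the compiler is free to reshape the loop.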