Hi, I've been benchmarking the performance of the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. It looks like the throughput of the int64 multiply (MADD) instruction is about one third of the throughput of the multiply instructions for the int32, float and double C data types on the same hardware.
I've posted the same question on the NetBSD port-arm mailing list. More details can be found here: http://mail-index.netbsd.org/port-arm/2020/04/15/msg006614.html
Is this expected at all? Does anyone know why int64 multiply is so much slower than multiply for the other data types?
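For anyone who wants to try a comparison like this, here is a minimal sketch. It is not the benchmark from the mailing list post; the function names (`madd_u32`, `madd_u64`, `now_ns`), the multiplier constant and the iteration count are all made up for illustration. Note that each iteration depends on the previous one, so this measures a dependent multiply-add chain (latency bound) rather than peak throughput, and whether the compiler actually emits MADD depends on the compiler and flags. Unsigned types are used so that overflow wraps instead of being undefined.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* 32-bit multiply-add chain: acc = acc * mul + i, wrapping mod 2^32. */
uint32_t madd_u32(uint32_t seed, uint32_t mul, uint64_t n) {
    uint32_t acc = seed;
    for (uint64_t i = 0; i < n; i++)
        acc = acc * mul + (uint32_t)i;
    return acc;
}

/* 64-bit multiply-add chain: acc = acc * mul + i, wrapping mod 2^64. */
uint64_t madd_u64(uint64_t seed, uint64_t mul, uint64_t n) {
    uint64_t acc = seed;
    for (uint64_t i = 0; i < n; i++)
        acc = acc * mul + i;
    return acc;
}

#ifdef BENCH_MAIN
/* Monotonic wall-clock time in nanoseconds. */
static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

int main(void) {
    const uint64_t n = 200000000ULL;      /* arbitrary iteration count */
    volatile uint32_t hidden = 2654435761u; /* volatile: value unknown at compile time */
    uint32_t mul = hidden;

    double t0 = now_ns();
    uint32_t r32 = madd_u32(1, mul, n);
    double t1 = now_ns();
    uint64_t r64 = madd_u64(1, mul, n);
    double t2 = now_ns();

    /* The low 32 bits of the 64-bit chain must match the 32-bit chain. */
    assert((uint32_t)r64 == r32);

    /* Print the results too, so the loops cannot be optimized away. */
    printf("u32: %.3f ns/iter (result %u)\n", (t1 - t0) / (double)n, (unsigned)r32);
    printf("u64: %.3f ns/iter (result %llu)\n", (t2 - t1) / (double)n,
           (unsigned long long)r64);
    return 0;
}
#endif
```

Building with something like `cc -O2 -DBENCH_MAIN bench.c` enables the timing driver. It is worth checking the generated assembly (for example with `-S`) to see which multiply instructions the loops really use before reading anything into the numbers.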
Sad Clouds said: So it doesn't seem that LDR instruction is the sole culprit here.
Yes, I would say so too. Unless you flush the cache between runs, val is read from the cache. Also, on an A72 with a 64-bit or even 128-bit bus to the cache, there should be no difference between loading 32 bits and loading 64.