ARM Cortex-A72 64-bit multiply (MADD) instruction low throughput

Hi, I've been benchmarking the performance of the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. It looks like the throughput of the int64 multiply (MADD) instruction is about one third of the throughput of the multiply instructions for the int32, float and double C data types on the same hardware.

I've posted the same question on the NetBSD port-arm mailing list. More details can be found here: http://mail-index.netbsd.org/port-arm/2020/04/15/msg006614.html

Is this expected? Does anyone know why int64 multiply is so much slower than multiply on the other data types?

  • Yes, I think you're right here. I also didn't notice the "X-form multiply" until you pointed it out. The manual defines throughput as: 

    Execution Throughput is defined as the maximum throughput (in instructions / cycle) of the specified instruction group

    So according to the manual, MADD on W registers has a throughput of 1 instruction per cycle, and MADD on X registers has a throughput of 1/3 instruction per cycle, i.e. one instruction every three cycles. This would explain why my test also shows the throughput of int64 multiply to be 1/3 of the others. Strange that FMUL on D registers has an even higher throughput, i.e. 2 instructions per cycle.

    This is a bit of a disappointment, since I imagine 64-bit multiplication on aarch64 is quite common. Well, at least this answers my question.
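
For anyone who wants to reproduce the W-form versus X-form gap discussed above, here is a minimal sketch of a throughput test in C. It is not the benchmark from the mailing list post; the names `madd32`, `madd64`, `report` and `CHAINS` are made up for illustration. It assumes the compiler turns each `acc * m + i` into a single MADD and does not vectorize the 32-bit loop, so check the disassembly before trusting the numbers.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Several independent accumulators, so that the loop is limited by
 * multiply throughput rather than by the latency of one dependency chain. */
#define CHAINS 8

/* 32-bit multiply-add chains (assumed to compile to MADD on W registers). */
uint32_t madd32(uint64_t iters, uint32_t m)
{
    uint32_t acc[CHAINS];
    for (int i = 0; i < CHAINS; i++)
        acc[i] = (uint32_t)i + 1u;
    for (uint64_t n = 0; n < iters; n++)
        for (int i = 0; i < CHAINS; i++)
            acc[i] = acc[i] * m + (uint32_t)i;
    uint32_t sum = 0;
    for (int i = 0; i < CHAINS; i++)
        sum += acc[i];
    return sum;
}

/* 64-bit multiply-add chains (assumed to compile to MADD on X registers). */
uint64_t madd64(uint64_t iters, uint64_t m)
{
    uint64_t acc[CHAINS];
    for (int i = 0; i < CHAINS; i++)
        acc[i] = (uint64_t)i + 1u;
    for (uint64_t n = 0; n < iters; n++)
        for (int i = 0; i < CHAINS; i++)
            acc[i] = acc[i] * m + (uint64_t)i;
    uint64_t sum = 0;
    for (int i = 0; i < CHAINS; i++)
        sum += acc[i];
    return sum;
}

/* Time both variants with the same iteration count and print CPU seconds. */
void report(uint64_t iters)
{
    clock_t t0 = clock();
    volatile uint32_t r32 = madd32(iters, 0x7F4A7C15u);
    clock_t t1 = clock();
    volatile uint64_t r64 = madd64(iters, 0x9E3779B97F4A7C15ull);
    clock_t t2 = clock();
    (void)r32;  /* volatile stores keep the results from being discarded */
    (void)r64;
    printf("int32: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("int64: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
}
```

Calling `report(100000000)` from `main` and building with something like `gcc -O2 -fno-tree-vectorize` should be enough to see a difference. If the throughput figures quoted from the manual hold and the loops really compile to MADD, the int64 run would be expected to take roughly three times as long as the int32 run on the A72, but other bottlenecks can change the ratio.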
