
ARM Cortex-A72 64-bit multiply (MADD) instruction low throughput

Hi, I've been benchmarking the performance of the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. It looks like the throughput of the int64 multiply (MADD) instruction is about one third that of the multiply instructions for the int32, float and double C data types on the same hardware.

I've posted the same question on NetBSD arm mailing list. More details can be found here: http://mail-index.netbsd.org/port-arm/2020/04/15/msg006614.html

Is this expected at all? Does anyone know why int64 multiply is so much slower compared to the other data types?

Parents
  • As per the Cortex-A72 Software Optimization Guide,

    the MUL instruction has a throughput of 1 per cycle.

    The doc shows that MADD has a latency of 5 for 64-bit compared to 3 for 32-bit, and a throughput of 1/3 compared to 1 (whatever /3 means).
    Does 1/3 mean one third of an instruction per cycle? Or: 1 or 3?

    From the doc, note 5: "X-form multiply accumulates stall the multiplier pipeline for two extra cycles."

    So this code

      404538:       9b027e73        madd    x19, x19, x2, xzr
      40453c:       f94037e2        ldr     x2, [sp, #104]
      404540:       9b017c00        madd    x0, x0, x1, xzr
      404544:       f94037e1        ldr     x1, [sp, #104]
      404548:       9b037f5a        madd    x26, x26, x3, xzr

    stalls the multiplier pipeline for two extra cycles after each MADD, i.e. one X-form MADD can issue every three cycles.

Children
  • Yes, I think you're right here. I also hadn't noticed the "X-form multiply" note until you pointed it out. The manual defines throughput as:

    Execution Throughput is defined as the maximum throughput (in instructions / cycle) of the specified instruction group

    So according to the manual, MADD on W registers has a throughput of 1 instruction per cycle, and MADD on X registers has a throughput of 1/3 instruction per cycle. This would explain why my test also shows the throughput of int64 multiply to be 1/3 of the others. Strangely, FMUL on D registers has an even higher throughput of 2 instructions per cycle.

    This is a bit of a disappointment, since I imagine 64-bit multiplication is quite common on aarch64. Well, at least this answers my question.

  • Check the document: it shows there are two FP/SIMD pipelines but only one integer multiply unit. If you have a lot of multiplies, you may be able to use SIMD instead.

    Edit: Oh, it seems there is no 64-bit vector multiplication. At least I can't find it. ;(

  • OK, but even with one FP pipeline you can run float/double multiplication at 1 instruction per cycle, which is way better than int64 multiplication at 1/3 instruction per cycle. This is strange, because people normally assume integer arithmetic is at least as efficient as floating-point arithmetic, and that is not the case here. I really wonder why these CPUs have such poor int64 multiplication throughput. There must have been some design constraint?

  • Not AArch64 in general, just the Cortex-A72. Other implementations may be better or worse.