ARM Cortex-A72 64-bit multiply (MADD) instruction low throughput

Hi, I've been benchmarking the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. The throughput of the int64 multiply (MADD) instruction appears to be about one third that of the multiply instructions for the int32, float and double C data types on the same hardware.

I've posted the same question on NetBSD arm mailing list. More details can be found here:

Is this expected? Does anyone know why int64 multiply is so much slower than multiply for the other data types?
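For anyone who wants to reproduce the comparison, here is a minimal C sketch of a throughput-style test. It is not the code from the original benchmark: the names `mul64_kernel`, `mul32_kernel` and `time_mul64` are made up for illustration, and the use of four independent accumulators is an assumption intended to keep the multiplier busy rather than measuring the latency of a single dependency chain. Check the generated assembly before trusting any numbers, since the compiler may vectorize or otherwise transform the loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Four independent 64-bit multiply chains per iteration.
 * Unsigned arithmetic wraps, so overflow is well defined. */
uint64_t mul64_kernel(uint64_t m, uint64_t iters)
{
    uint64_t a = 1, b = 2, c = 3, d = 4;
    for (uint64_t i = 0; i < iters; i++) {
        a *= m;
        b *= m;
        c *= m;
        d *= m;
    }
    return a ^ b ^ c ^ d;   /* combine so no chain is dead code */
}

/* Same loop with 32-bit operands, for comparison. */
uint32_t mul32_kernel(uint32_t m, uint64_t iters)
{
    uint32_t a = 1, b = 2, c = 3, d = 4;
    for (uint64_t i = 0; i < iters; i++) {
        a *= m;
        b *= m;
        c *= m;
        d *= m;
    }
    return a ^ b ^ c ^ d;
}

/* CPU seconds spent in one call of the 64-bit kernel, measured with the
 * standard clock() function. The result is stored through `sink` so the
 * work cannot be optimized away. */
double time_mul64(uint64_t m, uint64_t iters, uint64_t *sink)
{
    assert(sink != NULL);
    clock_t t0 = clock();
    *sink = mul64_kernel(m, iters);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

To compare the two widths, call each kernel with a multiplier read at run time (for example from `argv`) and a large iteration count, then divide the elapsed time by `4 * iters` to get an approximate cost per multiply.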

  • As per the Cortex-A72 Software Optimization Guide, the MUL instruction has a throughput of 1 per cycle.

    The doc shows that MADD has a latency of 5 cycles for the 64-bit form compared to 3 for the 32-bit form, and the throughput is listed as 1/3 (whatever the /3 means) compared to 1.
    Does 1/3 mean one third, or does it mean 1 or 3?

    From the doc: "5. X-form multiply accumulates stall the multiplier pipeline for two extra cycles."

    So this code

      404538:       9b027e73        madd    x19, x19, x2, xzr
      40453c:       f94037e2        ldr     x2, [sp, #104]
      404540:       9b017c00        madd    x0, x0, x1, xzr
      404544:       f94037e1        ldr     x1, [sp, #104]
      404548:       9b037f5a        madd    x26, x26, x3, xzr

    incurs that stall after each "MADD".
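One way to read the guide's note is that each X-form multiply occupies the multiplier pipeline for its issue cycle plus two stall cycles, so at most one such instruction can start every three cycles. That would make "1/3" mean one third of an instruction per cycle. The C sketch below only works through that arithmetic; the function name `mul_pipe_cycles` and the simple occupancy model are assumptions for illustration, not something taken from the guide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical occupancy model (an assumption, not from the guide):
 * each multiply takes one issue cycle and then blocks the multiplier
 * pipeline for `stall` extra cycles. Returns the cycles needed to
 * issue `n` independent multiplies back to back. */
uint64_t mul_pipe_cycles(uint64_t n, uint64_t stall)
{
    assert(stall < UINT64_MAX);          /* 1 + stall must not wrap */
    uint64_t per_insn = 1 + stall;       /* issue cycle + stall cycles */
    assert(n <= UINT64_MAX / per_insn);  /* guard the product against overflow */
    return n * per_insn;
}
```

Under this model, 300 independent multiplies with no stall keep the pipeline busy for 300 cycles, while 300 X-form multiplies with a two-cycle stall take 900. That is consistent with the roughly threefold gap reported in the question.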

