ARM Cortex-A72 64-bit multiply (MADD) instruction low throughput

Hi, I've been benchmarking the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. The throughput of the int64 multiply (MADD) instruction appears to be about one third that of the multiply instructions used for the int32, float and double C data types on the same hardware.

I've posted the same question on the NetBSD port-arm mailing list. More details can be found here: http://mail-index.netbsd.org/port-arm/2020/04/15/msg006614.html

Is this expected? Does anyone know why int64 multiply is so much slower than multiply for the other data types?

  • Hi,

    As per the Cortex-A72 Software Optimization Guide, the MUL instruction has a throughput of 1 per cycle.

    Also, the assembly code of the loop in your post contains many loads, all from the same stack slot ([sp, #104]):

      40451c:       fd000380        str     d0, [x28]
      404520:       aa1303e0        orr     x0, xzr, x19
      404524:       34000274        cbz     w20, 404570 <int64_mul+0xa0>
      404528:       f94037e2        ldr     x2, [sp, #104]
      40452c:       71000694        subs    w20, w20, #0x1
      404530:       f94037e1        ldr     x1, [sp, #104]
      404534:       f94037e3        ldr     x3, [sp, #104]
      404538:       9b027e73        madd    x19, x19, x2, xzr
      40453c:       f94037e2        ldr     x2, [sp, #104]
      404540:       9b017c00        madd    x0, x0, x1, xzr
      404544:       f94037e1        ldr     x1, [sp, #104]
      404548:       9b037f5a        madd    x26, x26, x3, xzr
      40454c:       f94037e3        ldr     x3, [sp, #104]
      404550:       9b027f39        madd    x25, x25, x2, xzr
      404554:       f94037e2        ldr     x2, [sp, #104]
      404558:       9b017f18        madd    x24, x24, x1, xzr
      40455c:       f94037e1        ldr     x1, [sp, #104]
      404560:       9b037ef7        madd    x23, x23, x3, xzr
      404564:       9b027ed6        madd    x22, x22, x2, xzr
      404568:       9b017eb5        madd    x21, x21, x1, xzr
      40456c:       54fffde1        b.ne    404528 <int64_mul+0x58>  // b.any
      404570:       f90033e0        str     x0, [sp, #96]

    I think you should tweak your code and compiler settings to obtain a loop more like the following, with no memory access:

    .L3:
            subs    w0, w0, #1
            mul     x15, x15, x1
            mul     x14, x14, x1
            mul     x13, x13, x1
            mul     x12, x12, x1
            mul     x11, x11, x1
            mul     x10, x10, x1
            mul     x9, x9, x1
            mul     x8, x8, x1
            bne     .L3
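
For illustration, here is a minimal C sketch of a benchmark kernel that might compile to a loop of that shape. It is not the code from the original post: the function name int64_mul_chains, the parameter names and the accumulator seeds are all hypothetical. The assumption is that the repeated loads from [sp, #104] come from a multiplier the compiler must re-read from memory (for example a volatile variable), so the sketch passes the multiplier by value instead.

```c
#include <stdint.h>

/* Hypothetical kernel: eight independent 64-bit multiply chains.
 * The multiplier arrives by value, so the compiler is free to keep it
 * in a register for the whole loop instead of reloading it from the
 * stack before each multiply. */
uint64_t int64_mul_chains(uint64_t mult, uint32_t iters)
{
    /* Independent accumulators: no multiply depends on another chain. */
    uint64_t a0 = 1, a1 = 2, a2 = 3, a3 = 4;
    uint64_t a4 = 5, a5 = 6, a6 = 7, a7 = 8;

    while (iters--) {
        a0 *= mult;
        a1 *= mult;
        a2 *= mult;
        a3 *= mult;
        a4 *= mult;
        a5 *= mult;
        a6 *= mult;
        a7 *= mult;
    }

    /* Fold every chain into the result so none is removed as dead code. */
    return a0 ^ a1 ^ a2 ^ a3 ^ a4 ^ a5 ^ a6 ^ a7;
}
```

Whether the compiler really emits a loop without loads depends on the compiler version and flags, so it is worth building with optimization enabled (for example -O2) and checking the disassembly with objdump -d before timing anything.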

    Best regards,

    Vincent.
