
ARM Cortex-A72 64-bit multiply (MADD) instruction low throughput

Hi, I've been benchmarking the performance of the Cortex-A72 CPU on a Raspberry Pi 4 Model B Rev 1.1. The throughput of the int64 multiply (MADD) instruction appears to be about one third that of the multiply instructions for the int32, float, and double C data types on the same hardware.

I've posted the same question on NetBSD arm mailing list. More details can be found here: http://mail-index.netbsd.org/port-arm/2020/04/15/msg006614.html

Is this expected at all? Does anyone know why the int64 multiply is so much slower than the other data types?

  • Hi, thanks for the suggestion, but the assembly code for the int32 and double data types is very similar and does not show this throughput problem, so the LDR instruction doesn't seem to be the sole culprit. Below is the assembly for the loop multiplying doubles:

      404718:       fd0002a0        str     d0, [x21]
      40471c:       1e604109        fmov    d9, d8
      404720:       1e60410a        fmov    d10, d8
      404724:       1e60410b        fmov    d11, d8
      404728:       1e60410c        fmov    d12, d8
      40472c:       1e60410d        fmov    d13, d8
      404730:       1e60410e        fmov    d14, d8
      404734:       1e60410f        fmov    d15, d8
      404738:       34000293        cbz     w19, 404788 <dbl_mul+0xa8>
      40473c:       d503201f        hint    #0x0
      404740:       fd403fe3        ldr     d3, [sp, #120]
      404744:       71000673        subs    w19, w19, #0x1
      404748:       fd403fe2        ldr     d2, [sp, #120]
      40474c:       fd403fe1        ldr     d1, [sp, #120]
      404750:       fd403fe0        ldr     d0, [sp, #120]
      404754:       1e630908        fmul    d8, d8, d3
      404758:       1e6209ef        fmul    d15, d15, d2
      40475c:       fd403fe3        ldr     d3, [sp, #120]
      404760:       1e6109ce        fmul    d14, d14, d1
      404764:       fd403fe2        ldr     d2, [sp, #120]
      404768:       1e6009ad        fmul    d13, d13, d0
      40476c:       fd403fe1        ldr     d1, [sp, #120]
      404770:       fd403fe0        ldr     d0, [sp, #120]
      404774:       1e63098c        fmul    d12, d12, d3
      404778:       1e62096b        fmul    d11, d11, d2
      40477c:       1e61094a        fmul    d10, d10, d1
      404780:       1e600929        fmul    d9, d9, d0
      404784:       54fffde1        b.ne    404740 <dbl_mul+0x60>  // b.any
    

    It has just as many loads, yet roughly 3x the throughput of the int64 multiply loop.

    There seems to be something about int64 multiplication that either stalls the pipeline or carries an overhead that int32, float, and double multiplication do not have. It would be interesting to find out why.
