Floating-point performance benchmark

My benchmark of an Arm processor (arm64-v8a), running a Householder QR decomposition written in assembly language, is yielding a computation rate of about 2 Gflop/s. This seems exceptionally high, and I would like to understand what the compiler is really doing. The matrix is 384x240 and is being solved in 17-18 milliseconds. The processor is an octa-core (2x2.6 GHz Cortex-A76, 2x1.92 GHz Cortex-A76, and 4x1.8 GHz Cortex-A55) on a Kirin 980 (7 nm) chip. The main loop in the code is seven instructions: two loads, a multiply-accumulate (a single instruction), one store, an address increment, a loop-count decrement, and a branch.
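As a sanity check on the reported rate (my arithmetic, using the standard Householder QR operation count of roughly 2mn^2 - (2/3)n^3 flops for an m x n matrix and the midpoint of the 17-18 ms timing):

```python
# Check that ~2 Gflop/s is consistent with a 384x240 Householder QR
# solved in 17-18 ms, using the standard flop count 2*m*n^2 - (2/3)*n^3.
m, n = 384, 240
flops = 2 * m * n**2 - (2 * n**3) // 3   # ~35.0 million flops
t = 17.5e-3                              # midpoint of the reported 17-18 ms
rate = flops / t
print(f"{flops/1e6:.1f} Mflop, {rate/1e9:.2f} Gflop/s")  # -> 35.0 Mflop, 2.00 Gflop/s
```

So the 2 Gflop/s figure follows directly from the matrix size and timing; the question is whether a single core can sustain it.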

Aij:
        ldr   d0, [x0, x11, lsl #3]   // load matrix element at byte offset x11*8
        ldr   d1, [x7], #8            // load vector element, post-increment pointer
        fmsub d0, d4, d1, d0          // d0 = d0 - d4*d1 (fused multiply-subtract)
        str   d0, [x0, x11, lsl #3]   // store the updated element back
        add   x11, x11, x4            // advance the index by the stride in x4
        subs  x13, x13, #1            // decrement the loop counter
        cbnz  x13, Aij                // branch back while the counter is nonzero

I suppose the compiler could be running this in parallel over the processors, but I doubt this is true. If the Arm processor is really a VLIW architecture, then this is effectively a two-instruction loop: the two loads take two issue slots, while the multiply-accumulate, the store, and the address increment could all be done in parallel in one slot, and likewise the decrement and branch. If this were the case, a single core would run this loop at about 2 Gflop/s, which is what was measured. I'm no expert on the capabilities of the compiler or of the Arm architecture, and I would appreciate any comments about this benchmark and perhaps a pointer (!) to what the Arm chip is really doing at the instruction-set level. This was an app developed with Android Studio. Thanks,
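A back-of-envelope check suggests no VLIW (or multi-core) execution is needed to explain the number. The Cortex-A76 is a 4-wide out-of-order superscalar core, and the measured rate implies it only needs to sustain roughly 2.7 instructions per cycle on this 7-instruction loop (my arithmetic, assuming one core at the stated 2.6 GHz and counting the fused multiply-subtract as 2 flops):

```python
# Implied cycles per iteration and IPC for the 7-instruction inner loop,
# assuming a single Cortex-A76 core at 2.6 GHz and 2 flops per fmsub.
freq = 2.6e9          # big-core clock stated in the post
gflops = 2.0e9        # measured rate
flops_per_iter = 2    # one fused multiply-subtract = 2 flops
iters_per_sec = gflops / flops_per_iter    # 1e9 loop iterations per second
cycles_per_iter = freq / iters_per_sec     # ~2.6 cycles per iteration
ipc = 7 / cycles_per_iter                  # ~2.7 instructions per cycle
print(f"{cycles_per_iter:.1f} cycles/iter, IPC ~ {ipc:.1f}")
```

An IPC of about 2.7 is well within what a 4-wide out-of-order core can achieve on a simple loop, so a single big core plausibly accounts for the whole measurement.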

Bob

  • Hi,

    If you consider the theoretical performance of the A76, it has two FPU ports, each able to issue one FMA per cycle. At 2.6 GHz that gives a peak of 2 x 2 x 2.6 = 10.4 GFLOPS (scalar only, not counting SIMD), so you are far from the peak! In fact, your loop has only one FP instruction while loading 16 bytes of data, so you are clearly in a memory-bound scenario. If you want to optimize the code a little more, you could try vectorizing it with NEON and unrolling it, but don't expect a 2x speedup. Another interesting aspect of using NEON, since you seem to be developing for smartphones, is that you may consume less energy for the same computation.

    Sylvain
