floating point performance benchmark

My benchmark of an Arm processor (arm64-v8a) on a Householder QR decomposition, written in assembly language, is yielding a computation rate of about 2 Gflop/s. This seems exceptionally high, and I would like to understand what the compiler is really doing. The matrix is 384x240 and is factored in 17-18 milliseconds. The processor is an octa-core (2x2.6 GHz Cortex-A76, 2x1.92 GHz Cortex-A76, and 4x1.8 GHz Cortex-A55) on a Kirin 980 (7 nm) chip. The main loop in the code is seven instructions: two loads, a fused multiply-accumulate (a single instruction), one store, an address increment, a loop-count decrement, and a branch.

Aij:
        ldr   d0, [x0, x11, lsl #3]     // load matrix element A(i,j) from x0 + x11*8
        ldr   d1, [x7], #8              // load next vector element, post-increment x7 by 8
        fmsub d0, d4, d1, d0            // d0 = d0 - d4*d1 (fused multiply-subtract)
        str   d0, [x0, x11, lsl #3]     // store the updated element back
        add   x11, x11, x4              // advance the index by the stride in x4
        subs  x13, x13, #1              // decrement the loop counter
        cbnz  x13, Aij                  // loop until the counter reaches zero
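
For clarity, here is a rough C equivalent of that loop: a strided "a = a - alpha*v" update of one matrix column, as used inside a Householder reflection. The names (a, v, alpha, stride, n) are just for illustration; in the assembly, alpha lives in d4, the element index in x11, the stride in x4, and the vector pointer in x7.

        #include <stddef.h>

        /* Minimal sketch of the inner loop above; illustrative names only. */
        static void update_column(double *a, const double *v, double alpha,
                                  size_t stride, size_t n)
        {
            size_t k = 0;
            for (size_t i = 0; i < n; i++) {
                a[k] -= alpha * v[i];   /* matches: fmsub d0, d4, d1, d0 */
                k += stride;            /* matches: add x11, x11, x4     */
            }
        }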

I suppose the compiler could be running this in parallel across the cores, but I doubt that is true. If the Arm processor were really a VLIW architecture, this would effectively be a two-instruction loop: the two loads are two instructions, while the multiply-accumulate, the store, and the address increment could all be done in one instruction in parallel, and likewise the decrement and branch. If that were the case, a single core would run this loop at about 2 Gflop/s, which is what was measured. I'm no expert on the capabilities of the compiler or of the Arm architecture, and I would appreciate any comments about this benchmark and perhaps a pointer (!) to what the Arm chip is really doing at the instruction-set level. This was an app developed with Android Studio. Thanks,
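
In case it helps, the rate is obtained by timing the factorization and dividing the flop count by the elapsed time. A minimal sketch of that kind of measurement (the routine name qr_householder and the flop_count argument are placeholders, not the app's actual code):

        #include <stdio.h>
        #include <time.h>

        /* Placeholder prototype for the assembly routine being timed. */
        extern void qr_householder(double *A, int m, int n);

        /* Time one factorization and report the rate for a given flop count. */
        void benchmark(double *A, int m, int n, double flop_count)
        {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            qr_householder(A, m, n);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs = (t1.tv_sec - t0.tv_sec)
                        + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            printf("%.1f ms, %.2f Gflop/s\n",
                   secs * 1e3, flop_count / secs * 1e-9);
        }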

Bob

Parents
  • When you measure GFLOPS, you only count floating-point operations. In your case you perform one multiply-accumulate instruction per iteration, i.e. 2 floating-point operations (a multiply and a subtract). For your matrix, that is a total of 384x240x2 = 184320 FP operations, and you need 0.017 s to perform them, thus roughly 10.8 MFLOPS, which is very low. How did you obtain 2 GFLOPS? Do you measure only this loop?

    If you want more details on the architecture, you can look at the Cortex Software Optimization Guides:

    https://static.docs.arm.com/swog307215/a/Arm_Cortex-A76_Software_Optimization_Guide.pdf

    You will see that it is not a VLIW architecture; instead, micro-operations are dispatched to several issue ports/execution pipelines (section 2.1), so on the Cortex-A76 you can issue up to 8 instructions in parallel (1 branch, 3 integer, 2 FP, 2 load/store).
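
    To illustrate those two FP/ASIMD pipelines (this is only a sketch, not your code, and it assumes the column is contiguous rather than strided): with 128-bit NEON each fused multiply-subtract handles two doubles, so a single A76 core can retire several double-precision flops per cycle, for example using vfmsq_f64 from arm_neon.h:

        #include <arm_neon.h>

        /* Sketch only: contiguous (stride-1) variant of a[i] -= alpha*v[i].
           vfmsq_f64(a, b, c) computes a - b*c in each 64-bit lane. */
        void update_column_neon(double *a, const double *v, double alpha, int n)
        {
            float64x2_t valpha = vdupq_n_f64(alpha);
            int i = 0;
            for (; i + 2 <= n; i += 2) {
                float64x2_t va = vld1q_f64(a + i);
                float64x2_t vv = vld1q_f64(v + i);
                va = vfmsq_f64(va, valpha, vv);   /* va = va - alpha*vv */
                vst1q_f64(a + i, va);
            }
            for (; i < n; i++)                    /* scalar tail */
                a[i] -= alpha * v[i];
        }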

Children
  • Thank you again for your reply and the reference you gave. The flop count for the QR factorization is 2*n*n*(m - n/3), as given in Golub and Van Loan, chapter 5, or in http://www.seas.ucla.edu/~vandenbe/133A/lectures/qr.pdf.
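
    Plugging in m = 384 and n = 240 (just checking the arithmetic behind the numbers above):

        2*n*n*(m - n/3) = 2 * 240*240 * (384 - 80)
                        = 2 * 57600 * 304
                        = 35,020,800 flops

        35,020,800 flops / 0.0175 s  ≈  2.0 Gflop/s

    so the ~2 Gflop/s figure comes from the full factorization flop count divided by the 17-18 ms solve time, not from counting only the posted inner loop.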

    In the code there is a second loop in series, six instructions, essentially identical but without the store, which I left out of the listing for simplicity. This project was originally part of the "stay home, stay safe" paradigm, to help keep my sanity, and I was quite amazed at the performance of this architecture.

    Your comments in the last paragraph show the impressive results the folks at Arm have achieved in turning that instruction stream into parallel execution.

    Thanks again; I appreciate your comments very much!