floating point performance benchmark

My benchmark of an Arm processor (arm64-v8a) on a Householder QR decomposition, written in assembly language, is yielding a computation rate of about 2 Gflop/s. This seems exceptionally high, and I would like to understand what the compiler is really doing. The matrix is 384x240 and the decomposition takes 17-18 milliseconds. The processor is an octa-core (2x2.6 GHz Cortex-A76 & 2x1.92 GHz Cortex-A76 & 4x1.8 GHz Cortex-A55) on a Kirin 980 (7 nm) chip. The main loop in the code is seven instructions: two loads, one multiply-accumulate (a single instruction), one store, an address increment, a loop-count decrement, and a branch.

Aij:
        ldr d0, [x0, x11, lsl #3]
        ldr d1, [x7], #8
        fmsub d0, d4, d1, d0
        str d0, [x0, x11, lsl #3]
        add x11, x11, x4
        subs x13, x13, #1
        cbnz x13, Aij

I suppose the compiler could be running this in parallel across the cores, but I doubt that is the case. If the Arm processor is really a VLIW architecture, then this is really a two-instruction loop: the two loads are two instructions; the multiply-accumulate, the store, and the address increment could all be done in parallel as one instruction; and likewise the decrement and branch. If this were the case, a single core would run this loop at about 2 Gflop/s, which is what was measured. I'm no expert on the capabilities of the compiler or of the Arm architecture, and I would appreciate any comments about this benchmark and perhaps a pointer (!) to what the Arm chip is really doing at the instruction set level. This was an app developed with Android Studio. Thanks,

Bob
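
As a rough sanity check on the figures in the question, the measured rate can be turned into cycles and instructions per loop iteration. This is only a back-of-envelope sketch under assumptions the post does not state: that the benchmark ran on a single 2.6 GHz Cortex-A76 core, that each fmsub counts as 2 flops, and that all of the run time is spent in the seven-instruction loop shown above. The function and constant names are made up for illustration.

```python
# Back-of-envelope estimate of loop throughput implied by the measured rate.
# Assumptions (not stated in the post): one 2.6 GHz Cortex-A76 core did all
# the work, each fmsub counts as 2 flops (one multiply, one subtract), and
# the whole run is spent in the seven-instruction loop.

INSTRUCTIONS_PER_ITERATION = 7   # ldr, ldr, fmsub, str, add, subs, cbnz
FLOPS_PER_ITERATION = 2          # one fmsub per iteration
ASSUMED_CLOCK_HZ = 2.6e9         # assumed: fastest core on the Kirin 980


def loop_throughput(measured_flops_per_s, clock_hz=ASSUMED_CLOCK_HZ):
    """Return (cycles per iteration, instructions per cycle)."""
    iterations_per_s = measured_flops_per_s / FLOPS_PER_ITERATION
    cycles_per_iteration = clock_hz / iterations_per_s
    instructions_per_cycle = INSTRUCTIONS_PER_ITERATION / cycles_per_iteration
    return cycles_per_iteration, instructions_per_cycle


if __name__ == "__main__":
    cycles, ipc = loop_throughput(2.0e9)
    print(f"{cycles:.2f} cycles per iteration, {ipc:.2f} instructions per cycle")
```

With the 2 Gflop/s figure this prints 2.60 cycles per iteration and 2.69 instructions per cycle, so under these assumptions a single core would have to complete more than one instruction per clock on average.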

  • Thank you again for your reply and the reference you gave. The flop count for the QR factorization is 2*n*n*(m-n/3), as given in Golub and Van Loan, chapter 5, or at http://www.seas.ucla.edu/~vandenbe/133A/lectures/qr.pdf.

    In the code there is a second loop in series (6 cycles), essentially identical but without the store; I had left it out of my post for simplicity. This was originally done as part of the "stay home, stay safe" paradigm to help keep my sanity, and I was quite amazed at the performance of this architecture.

    Your comments in the last paragraph show what an impressive job the folks at Arm have done in turning instructions into execution.

    Thanks again; I appreciate your comments very much!
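
For reference, the flop-count formula quoted in the reply can be checked against the numbers in the question. The sketch below assumes m = 384 rows and n = 240 columns (the post only says the matrix is 384x240) and uses the reported 17-18 ms run time; the function names are hypothetical.

```python
# Rate implied by the quoted flop count 2*n*n*(m - n/3) and the measured time.
# Assumption: m = 384 rows and n = 240 columns; the post gives only "384x240".


def householder_qr_flops(m, n):
    """Flop count for Householder QR as quoted in the reply."""
    return 2 * n * n * (m - n / 3)


def gflops_rate(m, n, seconds):
    """Computation rate in Gflop/s for one factorization taking `seconds`."""
    return householder_qr_flops(m, n) / seconds / 1e9


if __name__ == "__main__":
    m, n = 384, 240
    print(f"flops per factorization: {householder_qr_flops(m, n):.0f}")
    for ms in (17, 18):
        print(f"{ms} ms -> {gflops_rate(m, n, ms / 1000):.2f} Gflop/s")
```

This gives 35020800 flops per factorization, or about 2.06 Gflop/s at 17 ms and 1.95 Gflop/s at 18 ms, which matches the "about 2 Gflop/s" figure in the question.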
