Architectures and Processors forum floating point performance benchmark

floating point performance benchmark

bobford over 5 years ago

My benchmark of an Arm processor (arm64-v8a) on a Householder QR decomposition, written in assembly language, is yielding a computation rate of about 2 Gflop/s. This seems exceptionally high, and I would like to understand what the compiler is really doing. The 384x240 matrix is being solved in 17-18 milliseconds. The processor is an octa-core (2x2.6 GHz Cortex-A76 & 2x1.92 GHz Cortex-A76 & 4x1.8 GHz Cortex-A55) on a Kirin 980 (7 nm) chip. The main loop in the code is seven instructions: two loads, a multiply-accumulate (single instruction), one store, an address increment, a loop-count decrement, and a branch.
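As a rough cross-check of the quoted rate, the sketch below estimates the operation count from the textbook leading-order flop count for Householder QR of an m x n matrix, 2n^2(m - n/3). It assumes that Q is not formed explicitly and that 384x240 means 384 rows by 240 columns; the formula and the function names are illustrative assumptions, not taken from the benchmark code.

```python
def householder_qr_flops(m: int, n: int) -> float:
    """Leading-order flop count for Householder QR of an m x n matrix (m >= n).

    Assumes Q is not formed explicitly: 2 * n^2 * (m - n/3).
    """
    return 2.0 * n * n * (m - n / 3.0)


def gflops_rate(flops: float, seconds: float) -> float:
    """Convert an operation count and a wall-clock time into Gflop/s."""
    return flops / seconds / 1e9


if __name__ == "__main__":
    flops = householder_qr_flops(384, 240)   # matrix size from the post
    for ms in (17.0, 18.0):                  # measured time range from the post
        rate = gflops_rate(flops, ms / 1000.0)
        print(f"{ms:.0f} ms -> {rate:.2f} Gflop/s")
```

Under these assumptions the factorization costs about 35 Mflop, which gives roughly 1.95 to 2.06 Gflop/s for 17-18 ms, consistent with the figure quoted above.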

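Aij:
        ldr d0, [x0, x11, lsl #3]      // d0 = double at address x0 + 8*x11
        ldr d1, [x7], #8               // d1 = double at x7, then x7 += 8
        fmsub d0, d4, d1, d0           // d0 = d0 - d4*d1 (fused multiply-subtract)
        str d0, [x0, x11, lsl #3]      // store d0 back to address x0 + 8*x11
        add x11, x11, x4               // advance the element index by x4
        subs x13, x13, #1              // decrement the loop counter
        cbnz x13, Aij                  // repeat while the counter is nonzero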
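For readers less familiar with AArch64 assembly, here is a minimal Python reference model of what the loop above computes. The function and parameter names are hypothetical, and taking the trip count (x13) to be the length of the vector is an assumption; also note that `fmsub` rounds once, whereas plain Python arithmetic rounds the product and the difference separately.

```python
def strided_fmsub_update(a, start, stride, v, scale):
    """Model of the Aij loop: a[idx] -= scale * v[k], stepping idx by stride.

    a      -- flat list of floats (x0 holds its base address)
    start  -- initial element index (x11)
    stride -- element step per iteration (x4)
    v      -- contiguous vector read with post-increment (x7)
    scale  -- scalar held in d4
    The trip count (x13) is assumed to equal len(v).
    """
    idx = start
    for vk in v:
        a[idx] = a[idx] - scale * vk   # fmsub d0, d4, d1, d0 (fused in hardware)
        idx += stride                  # add x11, x11, x4
    return a


if __name__ == "__main__":
    data = [10.0, 0.0, 20.0, 0.0, 30.0]
    print(strided_fmsub_update(data, 0, 2, [1.0, 2.0, 3.0], 2.0))
```

Each iteration performs one multiply and one subtract on a single element, which is the per-iteration work the rest of the discussion counts.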

I suppose the compiler could be running this in parallel across the cores, but I doubt that is the case. If the Arm processor is really a VLIW architecture, then this is really a two-instruction loop: the two loads are two instructions; the multiply-accumulate, the store, and the address increment could all be done in one instruction in parallel, and likewise the decrement and branch. If this were the case, a single core would run this loop at about 2 Gflop/s, which is what was measured. I'm no expert on the capabilities of the compiler or of the Arm architecture, and I would appreciate any comments about this benchmark and perhaps a pointer (!) to what the Arm chip is really doing at the instruction-set level. This app was developed with Android Studio. Thanks,

Bob
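The single-core hypothesis in the question can be sanity-checked with simple arithmetic. The sketch below converts a measured Gflop/s figure into cycles per loop iteration and instructions per cycle. It assumes that all the time is spent in the seven-instruction loop, that each iteration counts as 2 flops (the fused multiply-subtract), and that the code runs on one core at a fixed clock; the function name and defaults are hypothetical, and the post does not say which core actually ran the benchmark.

```python
def loop_budget(gflops, clock_ghz, flops_per_iter=2, instrs_per_iter=7):
    """Return (cycles per iteration, instructions per cycle) implied by a rate.

    Assumes the whole run time is spent in the loop on a single core.
    """
    iters_per_sec = gflops * 1e9 / flops_per_iter
    cycles_per_iter = clock_ghz * 1e9 / iters_per_sec
    ipc = instrs_per_iter / cycles_per_iter
    return cycles_per_iter, ipc


if __name__ == "__main__":
    # Clock speeds of the three core clusters listed in the post.
    for ghz in (2.6, 1.92, 1.8):
        cycles, ipc = loop_budget(2.0, ghz)
        print(f"{ghz} GHz: {cycles:.2f} cycles/iteration, {ipc:.2f} instructions/cycle")
```

At 2.6 GHz, a rate of 2 Gflop/s corresponds to about 2.6 cycles per iteration, or roughly 2.7 instructions completed per cycle on average.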

Top replies

sjub over 5 years ago in reply to bobford +1 verified

When you measure GFLOPS, you count only floating-point operations. In your case you perform one multiply-accumulate instruction per iteration, thus 2 floating-point operations (mul and sub). For...