My benchmark of an Arm processor (arm64-v8a) on a Householder QR decomposition, written in assembly language, is yielding a computation rate of about 2 Gflops/second. This seems exceptionally high and I would like to understand what the compiler is really doing. The matrix is 384x240 and is being factored in 17-18 milliseconds. The processor is an octa-core (2x2.6 GHz Cortex-A76 & 2x1.92 GHz Cortex-A76 & 4x1.8 GHz Cortex-A55) on a Kirin 980 (7 nm) chip. The main loop in the code is seven instructions: two loads, a multiply-accumulate (single instruction), one store, an address increment, a loop-count decrement, and a branch.
Aij:
    ldr   d0, [x0, x11, lsl #3]
    ldr   d1, [x7], #8
    fmsub d0, d4, d1, d0
    str   d0, [x0, x11, lsl #3]
    add   x11, x11, x4
    subs  x13, x13, #1
    cbnz  x13, Aij
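For readers less familiar with AArch64 assembly, here is a rough Python model of what the loop appears to compute. It is only a sketch: the function and parameter names are made up for illustration, and it assumes that x0 is the base of the matrix, x11 an element index, x4 a stride in elements, x7 a pointer into a vector, d4 a scalar, and x13 the trip count, none of which the post states explicitly.

```python
def update_strided(a, start, stride, v, scale, count):
    """Hypothetical model of the Aij loop: a[idx] -= scale * v[k], strided."""
    idx = start                      # x11: element index into the matrix
    for k in range(count):           # x13: loop count, decremented to zero
        d0 = a[idx]                  # ldr d0, [x0, x11, lsl #3]
        d1 = v[k]                    # ldr d1, [x7], #8 (post-increment)
        d0 = d0 - scale * d1         # fmsub d0, d4, d1, d0
        a[idx] = d0                  # str d0, [x0, x11, lsl #3]
        idx += stride                # add x11, x11, x4
    return a


a = update_strided([10.0] * 6, start=1, stride=2, v=[1.0, 2.0], scale=3.0, count=2)
print(a)
```

Each iteration does one multiply and one subtract, which is where the count of 2 flops per iteration comes from.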
I suppose the compiler could be running this in parallel over the processors, but I doubt this is true. If the Arm processor is really a VLIW architecture, then this is effectively a two-instruction loop: the two loads, the multiply-accumulate, the store, and the address increment could all be done in parallel in one instruction, and likewise the decrement and branch in a second. If this were the case, a single core would run this loop at about 2 Gflops/second, which is what was measured.
I'm no expert on the capabilities of the compiler or of the Arm architecture, and I would appreciate any comments about this benchmark and perhaps a pointer (!) to what the Arm chip is really doing at the instruction set level. This was an app developed with Android Studio.
Thanks,
Bob
Thank you for your reply and thoughts on this. This is a seven-instruction loop, which suggests either that this is a VLIW architecture and the seven instructions execute in perhaps only two clock cycles, or that the loop is running in parallel across the multiple A76 cores. If the loop took seven clock cycles per iteration, a 2.6 GHz core would deliver 2 flops x 2.6 GHz / 7 = 0.7 Gflops/second, about a factor of three less than measured. The question of how the seven instructions get mapped onto the execution hardware remains!
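The arithmetic in this post can be turned around to ask how many cycles per iteration the measured rate implies. The following is a minimal sketch, assuming the loop runs on a single 2.6 GHz A76 core and does 2 flops per iteration; the function names are illustrative only.

```python
def gflops_per_core(clock_ghz, flops_per_iter, cycles_per_iter):
    """Rate in Gflop/s if every iteration takes cycles_per_iter clock cycles."""
    return clock_ghz * flops_per_iter / cycles_per_iter


def cycles_per_iter_needed(clock_ghz, flops_per_iter, target_gflops):
    """Cycles per iteration required to reach target_gflops on one core."""
    return clock_ghz * flops_per_iter / target_gflops


print(gflops_per_core(2.6, 2, 7))          # one instruction per cycle: about 0.74
print(cycles_per_iter_needed(2.6, 2, 2.0)) # 2.6
```

Under these assumptions, about 2 Gflops/second on one core corresponds to roughly 2.6 cycles per iteration, that is, between two and three instructions completed per cycle on average.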
Arm is not a VLIW architecture, which you can easily find out by reading the docs.
When you measure GFLOPS, you only count floating-point operations. In your case you perform one multiply-accumulate instruction per iteration, thus 2 floating-point operations (mul and sub). For your matrix, you perform a total of 384x240x2 = 184320 FP operations and you need 0.017 s to perform them, thus 10842352 FLOPS (about 10.8 MFLOPS), which is very low. How did you obtain 2 GFLOPS? Are you measuring only this loop?
If you want more details on the architecture, you can look at the Cortex software optimization guides:
https://static.docs.arm.com/swog307215/a/Arm_Cortex-A76_Software_Optimization_Guide.pdf
You will see that it is not a VLIW architecture; instead, micro-operations are dispatched to several ports/execution pipelines (section 2.1), so on the Cortex-A76 you can issue up to 8 instructions in parallel (1 branch, 3 integer, 2 FP, 2 ld/st).
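To make the port argument concrete, here is a rough resource-bound estimate for the seven-instruction loop using the pipeline counts quoted above. It is only a sketch: the assignment of each instruction to a pipeline class is my assumption, and it ignores front-end width, instruction latencies, the address write-back of the post-indexed load, and cache misses.

```python
# Pipelines per class, as quoted in the reply above for the Cortex-A76.
PORTS = {"branch": 1, "integer": 3, "fp": 2, "load_store": 2}

# Assumed classification of the loop's instructions (illustrative only):
# ldr, ldr, str -> load_store; fmsub -> fp; add, subs -> integer; cbnz -> branch.
LOOP = {"branch": 1, "integer": 2, "fp": 1, "load_store": 3}


def min_cycles_per_iteration(loop, ports):
    """Lower bound set by the most heavily used pipeline class."""
    return max(loop[kind] / ports[kind] for kind in loop)


def peak_gflops(clock_ghz, flops_per_iter, cycles_per_iter):
    return clock_ghz * flops_per_iter / cycles_per_iter


bound = min_cycles_per_iteration(LOOP, PORTS)
print(bound)                       # 1.5, limited by the two load/store pipelines
print(peak_gflops(2.6, 2, bound))  # about 3.47 Gflop/s on one 2.6 GHz core
```

Under these assumptions a single core could in principle sustain well over 2 Gflops/second on this loop, so the measured rate does not require VLIW execution or multiple cores.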
Thank you again for your reply and the reference you gave. The flop count for the Householder QR factorization of an m x n matrix is 2*n*n*(m-n/3), as given in Golub and Van Loan, chapter 5, or http://www.seas.ucla.edu/~vandenbe/133A/lectures/qr.pdf.
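Plugging the numbers from this thread into that formula reproduces the 2 Gflops/second figure. A minimal sketch, assuming m = 384 rows and n = 240 columns and the 17-18 ms timings quoted earlier; the function names are illustrative only.

```python
def householder_qr_flops(m, n):
    """Flop count 2*n*n*(m - n/3) for Householder QR of an m x n matrix."""
    return 2 * n * n * (m - n / 3)


def gflops(flops, seconds):
    return flops / seconds / 1e9


total = householder_qr_flops(384, 240)
print(total)                 # 35020800.0
print(gflops(total, 0.017))  # about 2.06
print(gflops(total, 0.018))  # about 1.95
```

The total of about 35 million flops is roughly 190 times the 184320 obtained by counting one multiply-accumulate per matrix element, because the factorization revisits the trailing part of the matrix once per column.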
In the code, there is a second loop in series, six instructions long, essentially identical but without the store, which I had left out for simplicity. This was originally done as part of the "stay home, stay safe" paradigm to help keep my sanity, and I was quite amazed at the performance of this architecture.
Your comments in the last paragraph show how impressively the folks at Arm have turned an instruction stream into parallel execution.
Thanks again; I appreciate your comments very much!