My benchmark of an Arm processor (arm64-v8a) on a Householder QR decomposition, written in assembly language, is yielding a computation rate of about 2 Gflop/s. This seems exceptionally high, and I would like to understand what the compiler is really doing. The matrix is 384x240 and is being solved in 17-18 milliseconds. The processor is an octa-core Kirin 980 (7 nm) chip: 2x 2.6 GHz Cortex-A76, 2x 1.92 GHz Cortex-A76, and 4x 1.8 GHz Cortex-A55. The main loop in the code is seven instructions: two loads, a multiply-accumulate (single instruction), one store, an address increment, a loop-count decrement, and a branch.
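Aij:
ldr d0, [x0, x11, lsl #3]      // d0 = element x11 of the double array at x0
ldr d1, [x7], #8               // d1 = double at x7, then x7 += 8 (post-index)
fmsub d0, d4, d1, d0           // d0 = d0 - d4 * d1
str d0, [x0, x11, lsl #3]      // store the result back to element x11
add x11, x11, x4               // advance the element index by the stride in x4
subs x13, x13, #1              // decrement the loop count
cbnz x13, Aij                  // repeat while the count is nonzero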
I suppose the compiler could be running this in parallel across the cores, but I doubt that is the case. If the Arm processor is really a VLIW architecture, then this is effectively a two-instruction loop: the two loads are two instructions; the multiply-accumulate, the store, and the address increment could all be done in parallel as one instruction; and likewise the decrement and branch. If this were the case, a single core would run this loop at about 2 Gflop/s, which is what was measured. I'm no expert on the capabilities of the compiler or of the Arm architecture, and I would appreciate any comments about this benchmark and perhaps a pointer (!) to what the Arm chip is really doing at the instruction set level. This was an app developed with Android Studio. Thanks.
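As a sanity check on the quoted rate, here is a minimal sketch that recomputes it from the matrix size and the timing given in the post. It assumes the usual leading-order flop count for Householder QR without forming Q explicitly, 2mn^2 - (2/3)n^3, and that 384 is the row count and 240 the column count; the post states neither, and the function names are made up for illustration.

```python
# Standard leading-order flop count for Householder QR of an m x n matrix
# (R only, Q not formed explicitly): 2*m*n^2 - (2/3)*n^3.
# Assumption: the benchmark counts flops this way; the post does not say.

def householder_qr_flops(m: int, n: int) -> float:
    return 2.0 * m * n**2 - (2.0 / 3.0) * n**3

def gflops(flops: float, seconds: float) -> float:
    return flops / seconds / 1e9

m, n = 384, 240                      # matrix size from the post (rows, cols assumed)
flops = householder_qr_flops(m, n)   # about 35.0 million flops
for ms in (17.0, 18.0):              # measured time range from the post
    print(f"{ms:.0f} ms -> {gflops(flops, ms / 1e3):.2f} Gflop/s")
```

Under these assumptions both timings land close to 2 Gflop/s, so the measured figure is at least internally consistent.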
If you consider the theoretical performance of the A76, it has 2 FPU ports, each able to issue one FMA per cycle. At 2.6 GHz that gives a peak of 2x2x2.6 = 10.4 Gflop/s (scalar mode only, not counting SIMD). So you are far from the peak! In fact, in your case each iteration has only 1 FP instruction while loading 16 bytes of data, so you are clearly in a memory-bound scenario. If you want to optimize your code a little more, you may try to vectorize it using NEON and unroll it, but don't expect a 2x speedup. Another interesting aspect of using NEON, since you seem to be developing for smartphones, is that you may consume less energy for the same computation.
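The arithmetic in this reply can be written out as a short sketch. The port count, clock, and bytes loaded are taken from the thread; the store traffic of 8 bytes per iteration is read off the loop listing, and the variable names are only illustrative.

```python
FP_PORTS = 2            # FP/FMA pipes per A76 core, as stated in the reply
FLOPS_PER_FMA = 2       # one multiply plus one add
CLOCK_GHZ = 2.6         # fastest A76 cores on the Kirin 980

peak_gflops = FP_PORTS * FLOPS_PER_FMA * CLOCK_GHZ   # scalar peak, 10.4

# Per loop iteration: one fmsub (2 flops), two 8-byte loads, one 8-byte store.
flops_per_iter = 2
bytes_loaded = 2 * 8
bytes_stored = 8
intensity = flops_per_iter / (bytes_loaded + bytes_stored)  # flop per byte moved

measured_gflops = 2.0   # approximate figure from the question
print(f"peak {peak_gflops:.1f} Gflop/s, "
      f"measured/peak = {measured_gflops / peak_gflops:.0%}")
print(f"arithmetic intensity = {intensity:.3f} flop/byte")
```

With so few flops per byte moved, the load/store side of the core, not the FP pipes, is the more likely limit, which is the point the reply is making.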
Thank you for your reply and thoughts on this. This is a seven-instruction loop, which suggests either that this is a VLIW architecture and the seven instructions are executed in perhaps only two clock cycles, or that the loop is executing in parallel across the multiple A76 cores. If this loop executes in seven clock cycles, then we have 2x2.6/7 = 0.7 Gflop/s, roughly a third of the measured rate. The question of how the seven instructions are actually issued for execution remains!
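The same estimate can be run backwards to see how many instructions per cycle the measurement implies. This is a rough sketch that assumes all counted flops come from this loop and that the core runs at 2.6 GHz throughout, neither of which the thread confirms.

```python
CLOCK_HZ = 2.6e9          # fastest A76 core clock, assumed sustained
INSTR_PER_ITER = 7        # instructions in the loop body
FLOPS_PER_ITER = 2        # the single fmsub: one multiply, one subtract
measured_flops = 2.0e9    # approximate measured rate from the question

# Rate if exactly one instruction completed per cycle (the 0.7 Gflop/s estimate).
one_ipc_gflops = FLOPS_PER_ITER * CLOCK_HZ / INSTR_PER_ITER / 1e9

# Working backwards from the measurement instead.
iters_per_sec = measured_flops / FLOPS_PER_ITER
cycles_per_iter = CLOCK_HZ / iters_per_sec
implied_ipc = INSTR_PER_ITER / cycles_per_iter

print(f"one instruction per cycle: {one_ipc_gflops:.2f} Gflop/s")
print(f"measured: {cycles_per_iter:.1f} cycles/iteration, "
      f"{implied_ipc:.1f} instructions/cycle")
```

An average of under three instructions per cycle is what one would expect from a superscalar out-of-order core such as the A76, which can issue several independent instructions from an ordinary sequential instruction stream in the same cycle; neither a VLIW encoding nor multi-core execution is needed to explain it.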
Arm is not a VLIW architecture, as you can easily find out by reading the docs.