Hi awesome guy,
I have a question about the ARM Cortex-A53 platform, and I need your help!
There are 8 LDR operations that use independent Qn registers and 8 FMLA operations that also use independent Qn registers. The code is as follows,
and
The addresses held in X1 and X2 are on the stack. Why does the LDR loop take twice as long as the FMLA loop?
I have referred to the document "Cortex_A57_Software_Optimization_Guide_external.pdf", which gives an LDR latency of 5 and an FMLA latency of 10.
Try to force the LDR data address to be 64-bit aligned and retest. Any changes?
We have tested, and the data addresses are all 64-bit aligned!
address a:0x7ff6ea9f00, b:0x7ff6ea9f10, c:0x7ff6ea9f20, d:0x7ff6ea9f30
By the way, the compiler targets AArch64.
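For anyone who wants to double-check the alignment claim, here is a minimal sketch in C. The helper names `is_aligned` and `all_aligned` are made up for illustration and are not from any ARM library; the alignment is given in bytes, so 8 means 64-bit and 16 means 128-bit (the width of a Q register).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when addr is a multiple of align_bytes.
   align_bytes must be a nonzero power of two (8 = 64-bit, 16 = 128-bit). */
bool is_aligned(uint64_t addr, uint64_t align_bytes)
{
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    return (addr & (align_bytes - 1)) == 0;
}

/* Hypothetical helper: true only if every address in the array is aligned. */
bool all_aligned(const uint64_t *addrs, int count, uint64_t align_bytes)
{
    for (int i = 0; i < count; i++) {
        if (!is_aligned(addrs[i], align_bytes)) {
            return false;
        }
    }
    return true;
}
```

Applied to the four addresses posted above, `all_aligned` returns true for both 8 and 16 bytes, so the data is 128-bit aligned as well as 64-bit aligned.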
The FMLA instruction uses the hardware SIMD floating-point unit, so it is possible that load/store operations are slower than SIMD/FP instructions.
The two pictures below are taken from the document "Cortex_A57_Software_Optimization_Guide_external.pdf". How should the latency be interpreted?
Or do these instructions behave differently on the A53?
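On interpreting latency: as far as I understand it, latency is the number of cycles before a dependent instruction can use the result, so it only adds up when each instruction waits for the previous one. With independent registers, the rate at which instructions can be issued matters more. Here is a rough sketch in C of that idea. It is an idealized model, not a description of the A53 or A57 pipeline; the function names and the `issue_interval` parameter are assumptions for illustration only.

```c
/* Idealized pipeline model (an assumption, not real core behaviour).
   latency:        cycles until a dependent instruction can use the result.
   issue_interval: cycles between issuing two independent instructions
                   (hypothetical parameter, not taken from any guide). */

/* Each instruction waits for the previous result, so latencies add up. */
unsigned cycles_dependent_chain(unsigned n, unsigned latency)
{
    return n * latency;
}

/* Independent instructions overlap: one issues every issue_interval
   cycles, and only the last one's latency is fully exposed. */
unsigned cycles_independent(unsigned n, unsigned latency, unsigned issue_interval)
{
    if (n == 0) {
        return 0;
    }
    return (n - 1) * issue_interval + latency;
}
```

With the FMLA latency of 10 quoted earlier and a hypothetical issue interval of 1, eight dependent FMLAs cost 80 cycles in this model, while eight independent ones cost 17. In a long loop of independent instructions the issue interval dominates, so an instruction with a lower latency can still be the slower one if it cannot be issued as often.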
I did not see a similar instruction table in the Cortex-A53 Software Optimization Guide, so the latency numbers may differ between CA53 and CA57.
Can you show me the latency of the Q-form FMLA and the Q-form LDR on CA53? I am confused by my test results.
And you can find the instruction tables in chapter 3, "Instruction Characteristics", of Cortex_A57_Software_Optimization_Guide_external.pdf.