Hi awesome guy,
I have a question about the ARM Cortex-A53 platform, and I need your help!
I have a loop of 8 ldr operations that use independent Qn registers, and a loop of 8 fmla operations that also use independent Qn registers. The code is as follows,
and the addresses in X1 and X2 are on the stack. Why does the ldr loop take twice as long as the fmla loop?
I have referred to the document "Cortex_A57_Software_Optimization_Guide_external.pdf", which gives an ldr latency of 5 cycles and an fmla latency of 10 cycles.
You can find the instruction tables in chapter 3, INSTRUCTION CHARACTERISTICS, of Cortex_A57_Software_Optimization_Guide_external.pdf.