Hi ARM experts,
I have a question about the ARM Cortex-A53 platform, and I need your help!
I have written a small program to measure how well floating-point computation runs in parallel. The main loop is made of several "fmla" instructions whose registers have no dependencies on each other. The dual-issue behaviour was not what I expected. Based on what I know, we inserted other NEON instructions, using registers unrelated to the "fmla" ones, so that the core could dual-issue, such as:
    fmla v0.4s, v0.4s, v20.s[0]   //line 0
    ldr  q30, [x1]
    fmla v1.4s, v1.4s, v20.s[1]   //line 1
However, we found that the running time became longer once the "ldr" instruction was inserted. The only exception is when the first operand of the ldr instruction is a general-purpose register (such as Xn); otherwise inserting it always makes the run longer. We then inserted "add v22.4s, v22.4s, v23.4s" or "str q30, [x1]" between line 0 and line 1 instead, and got the same result.
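For anyone who wants to reproduce this, here is a minimal sketch of how such a comparison could be timed. It is not the exact program from the test above: the helper names (`now_ns`, `kernel`, `ns_per_iter`), the `FMLA_BENCH_MAIN` switch, the choice of four accumulators, the zero-initialised registers and the loop-counter overhead are all assumptions made for illustration. On a non-AArch64 host the kernel falls back to a plain C loop that only keeps the file compilable and does not reproduce the instruction sequence.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: monotonic time in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

#if defined(__aarch64__)
/* 16-byte source for the inserted "ldr q30"; its contents are irrelevant. */
static float load_src[4] __attribute__((aligned(16)));

/* Zero the accumulators so the values stay finite; v20 holds 1.0 in each lane. */
#define INIT \
    "movi v0.4s, #0\n\t" "movi v1.4s, #0\n\t" \
    "movi v2.4s, #0\n\t" "movi v3.4s, #0\n\t" \
    "fmov v20.4s, #1.0\n\t"
#define TAIL "subs %0, %0, #1\n\t" "b.ne 1b\n\t"
#define CLOBBERS "v0", "v1", "v2", "v3", "v20", "v30", "cc", "memory"

/* Loop of independent fmla instructions; with_ldr != 0 inserts the
   ldr between the first two, as in line 0 / line 1 of the post. */
static void kernel(uint64_t iters, int with_ldr) {
    if (iters == 0) return;              /* the subs/b.ne loop needs iters >= 1 */
    if (with_ldr) {
        __asm__ volatile(INIT "1:\n\t"
            "fmla v0.4s, v0.4s, v20.s[0]\n\t"
            "ldr  q30, [%1]\n\t"
            "fmla v1.4s, v1.4s, v20.s[1]\n\t"
            "fmla v2.4s, v2.4s, v20.s[2]\n\t"
            "fmla v3.4s, v3.4s, v20.s[3]\n\t"
            TAIL
            : "+r"(iters) : "r"(load_src) : CLOBBERS);
    } else {
        __asm__ volatile(INIT "1:\n\t"
            "fmla v0.4s, v0.4s, v20.s[0]\n\t"
            "fmla v1.4s, v1.4s, v20.s[1]\n\t"
            "fmla v2.4s, v2.4s, v20.s[2]\n\t"
            "fmla v3.4s, v3.4s, v20.s[3]\n\t"
            TAIL
            : "+r"(iters) : "r"(load_src) : CLOBBERS);
    }
}
#else
/* Stand-in for other hosts: NOT the fmla/ldr sequence, only keeps the file buildable. */
static void kernel(uint64_t iters, int with_ldr) {
    (void)with_ldr;
    volatile float acc = 0.0f;
    for (uint64_t i = 0; i < iters; ++i) acc = acc + acc * 1.0f;
}
#endif

/* Average wall-clock nanoseconds per loop iteration. */
static double ns_per_iter(uint64_t iters, int with_ldr) {
    if (iters == 0) return 0.0;
    uint64_t t0 = now_ns();
    kernel(iters, with_ldr);
    uint64_t t1 = now_ns();
    return (double)(t1 - t0) / (double)iters;
}

#ifdef FMLA_BENCH_MAIN
int main(void) {
    const uint64_t iters = 100000000ull;  /* arbitrary iteration count */
    printf("fmla only : %.3f ns/iter\n", ns_per_iter(iters, 0));
    printf("fmla + ldr: %.3f ns/iter\n", ns_per_iter(iters, 1));
    return 0;
}
#endif
```

Built with something like `gcc -O2 -DFMLA_BENCH_MAIN` on the target board, the two printed figures can be compared directly. If the ldr were dual-issued with the neighbouring fmla, the two numbers would be expected to be close; a clearly larger second number would match the slowdown described above. Pinning the process to one core and fixing the CPU frequency should make the numbers more stable.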
I referred to the document “Cortex_A57_Software_Optimization_Guide_external.pdf”, which says the following:
ldr is issued to the "Load" pipeline,
str is issued to the "Store" pipeline,
fmla is issued to the "FP/ASIMD 0" or "FP/ASIMD 1" pipeline.
As I understand it, ldr and fmla should therefore be able to dual-issue.
Have I misunderstood something?
Also, is there a document for the A53 that corresponds to "Cortex_A57_Software_Optimization_Guide_external.pdf"?
Thanks!
As far as we know, the Cortex-A53 has two 64-bit NEON units, so an instruction that uses a 128-bit register (Q, or V.4s) occupies both of them. That would mean two instructions cannot dual-issue when each of them uses a 128-bit Q register.
Or have I misunderstood?