Dual-issue problem with NEON instructions on Cortex-A53

Hi ARM experts,

I have a question about the ARM Cortex-A53 platform, and I need your help!

I have written a small program to measure parallel floating-point compute performance. The main loop consists of several "fmla" instructions whose registers have no dependencies on each other. The dual-issue behavior was not what I expected. To try to get dual issue, we inserted other NEON instructions whose registers are unrelated to those used by the "fmla" instructions, for example:

    fmla v0.4s, v0.4s, v20.s[0]     //line 0
    ldr  q30, [x1]
    fmla v1.4s, v1.4s, v20.s[1]     //line 1

However, we found that the running time increased once the "ldr" instruction was inserted. The only exception is when the destination operand of the ldr is a general-purpose register (such as Xn); otherwise the running time always increases. We then inserted "add v22.4s, v22.4s, v23.4s" or "str q30, [x1]" between line 0 and line 1 instead, and got the same result.
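One way to put a number on "the running time became long" is to convert the measured loop time into instructions per cycle (IPC): a value near 1 suggests the instructions are issued one at a time, while a value approaching 2 suggests dual issue. Below is a minimal sketch in C, assuming the core clock frequency is known and fixed during the measurement. The helper names are hypothetical and not part of any ARM library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a measured elapsed time (in nanoseconds)
 * into core cycles, assuming a fixed, known core frequency in Hz. */
static double elapsed_cycles(double elapsed_ns, double core_hz)
{
    assert(elapsed_ns >= 0.0 && core_hz > 0.0);
    return elapsed_ns * (core_hz / 1e9);
}

/* Hypothetical helper: instructions retired per cycle for the timed loop.
 * `instructions` is iterations multiplied by instructions per iteration. */
static double instructions_per_cycle(uint64_t instructions, double cycles)
{
    assert(cycles > 0.0);
    return (double)instructions / cycles;
}
```

For example, if the three-instruction sequence above runs for 1,000,000 iterations, pass 3,000,000 as `instructions` and compare the resulting IPC with and without the inserted "ldr".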

I referred to the document "Cortex_A57_Software_Optimization_Guide_external.pdf", which states the following:

ldr is issued to the "Load" pipeline,

str is issued to the "Store" pipeline,

fmla is issued to the "FP/ASIMD 0" or "FP/ASIMD 1" pipeline.

As I understand it, ldr and fmla should therefore be able to dual-issue.

Have I misunderstood something?

Also, is there a document for the A53 that corresponds to "Cortex_A57_Software_Optimization_Guide_external.pdf"?

Thanks!

       

      

 

  
