Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM assembly code and NEON assembly code for the same routine. I expected the NEON version to run about 4x faster than the ARM version, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and the configuration in my Makefile is: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 23rd March 2013 at http://forums.arm.com

    Code snippet:
    In test1: one matrix operand is loaded completely into S registers (s0-s15). The other operand is loaded into s16-s19, only 4 floats at a time, one group after another; the intermediate results are kept in s20-s23 and then stored to memory.
    In test2: operands 1 and 2 are both loaded into Q registers.

    test1:
      "vldmia %2, { s0-s15 } \n\t"
     
      "vldmia %1, { s16-s19 } \n\t"
      "add %1, %1, #16\n\t"
     
      "vmul.f32 s20, s0, s16\n\t"
      "vmul.f32 s21, s1, s16\n\t"
      "vmul.f32 s22, s2, s16\n\t"
      "vmul.f32 s23, s3, s16\n\t"
       .
       .
       . 
      "vstmia %0, { s20-s23 }\n\t"
      "add %0, %0, #16\n\t"
           .
           .
           .
           .

    test2:

      "vldmia %1, { q4-q7 } \n\t"
      "vldmia %2, { q8-q11 } \n\t"
     
      "vmul.f32 q0, q8, d8[0]\n\t"
      "vmul.f32 q1, q8, d10[0]\n\t"
      "vmul.f32 q2, q8, d12[0]\n\t"
      "vmul.f32 q3, q8, d14[0]\n\t"
          
        .
           .
           .
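
To make the comparison concrete, here is a plain C model of only the multiplies shown above. It is a sketch, not the original routine: the function names and the array names `a` (the buffer at %1) and `b` (the buffer at %2) are hypothetical, and the elided parts of both tests are not modelled. It illustrates that the four scalar multiplies in test1 produce the same four values as the first vector-by-scalar multiply in test2. It says nothing about cycle counts.

```c
#include <assert.h>
#include <stddef.h>

/* test1 model: four scalar multiplies, one result lane per instruction.
 * Assumption: s0-s15 hold b[0..15] and s16-s19 hold a[0..3]. */
void test1_model(const float *a, const float *b, float s20_s23[4])
{
    assert(a && b && s20_s23);
    s20_s23[0] = b[0] * a[0];   /* vmul.f32 s20, s0, s16 */
    s20_s23[1] = b[1] * a[0];   /* vmul.f32 s21, s1, s16 */
    s20_s23[2] = b[2] * a[0];   /* vmul.f32 s22, s2, s16 */
    s20_s23[3] = b[3] * a[0];   /* vmul.f32 s23, s3, s16 */
}

/* test2 model: four vector-by-scalar multiplies, four lanes each.
 * Assumption: q4-q7 hold a[0..15] and q8-q11 hold b[0..15], so
 * d8[0], d10[0], d12[0], d14[0] are a[0], a[4], a[8], a[12]. */
void test2_model(const float *a, const float *b, float q0_q3[16])
{
    assert(a && b && q0_q3);
    for (size_t reg = 0; reg < 4; ++reg) {        /* q0, q1, q2, q3 */
        float scalar = a[4 * reg];                /* lane 0 of d8/d10/d12/d14 */
        for (size_t lane = 0; lane < 4; ++lane) { /* all four lanes of q8 */
            q0_q3[4 * reg + lane] = b[lane] * scalar;
        }
    }
}
```

With the same inputs, the first four outputs of `test2_model` equal the outputs of `test1_model`, so each NEON instruction shown does the arithmetic of four VFP instructions.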

    Thanks and Regards,
    KP