
Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I am facing a problem: I have hand-written ARM assembly code and NEON assembly code for the same algorithm. I expected the NEON assembly to be about 4x faster than the ARM assembly code, but I cannot see that improvement in the NEON assembly code.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and my Makefile contains: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know if there is anything I need to change in the makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 27th November 2012 at http://forums.arm.com

    Thanks for your reply.

    Both the ARM assembly code and the NEON assembly code implement the same functionality.

    For this algorithm I am getting a 70% improvement with the NEON assembly compared with pure fixed-point C code, and a 40% improvement with the ARM assembly code.

    In other words, the difference between the NEON assembly and the ARM assembly is only about 30-35%. That is my issue: why is the NEON assembly only ~30% faster than the ARM assembly code?

    I also know Cortex-A9 has an out-of-order execution feature, and that this feature helps only ARM instructions. Perhaps the ARM assembly code performs better on Cortex-A9 because of it, which would explain why I see a smaller performance difference between the NEON and ARM assembly code.

    Can you explain this to me in detail?

    As mentioned, I wrote both the ARM and NEON assembly code by hand to understand the difference between the two units, and I expected the NEON version to be about 4x faster than the ARM one.
  • Note: This was originally posted on 28th November 2012 at http://forums.arm.com

    What about the out-of-order execution feature in Cortex-A9 — does it help only ARM instructions?
  • Note: This was originally posted on 29th November 2012 at http://forums.arm.com

    As you know I cannot share the code here, but I have done some other experiments; please see the details below.

    I created a few test cases to understand the time/cycle behaviour of ARM vs. NEON on the Cortex-A9 processor.

    Project 1 -> has two functions; both perform 1000 million additions.

    Function 1: 1000 million additions using ARM instructions ("loc_add_ARM").

    Function 2: 1000 million additions using NEON instructions ("loc_add_NEON").

    Please see the timings for the two functions below. I used the gettimeofday() function to measure time on our Cortex-A9 target.

    Function Name:   loc_add_ARM    :   (895230 - time)

    Function Name:   loc_add_NEON   :   (380375 - time)

    Project 2 -> In this case only Function 1 is enabled (1000 million additions using ARM instructions).

    Please see the timings for this case below:


    Function Name:   loc_add_ARM    :   (800792 - time)
                                       
    Function Name:   loc_add_NEON    :   (not enabled / not called from the main function (0 - time) )



    Project 3 -> In this case I added one NEON instruction to Function 1 (1000 million additions using ARM instructions).

    Please see the timings for this case below:

    Function Name:   loc_add_ARM + 1 NEON instruction    :   (895235 - time)
                                       
    Function Name:   loc_add_NEON    :   (not enabled / not called from the main function (0 - time) )




    My question now is: why is there such a big time/cycle difference for the function "loc_add_ARM" across these three cases?
    Is it something related to the pipeline?

    Thanks ,

    mj
  • Note: This was originally posted on 30th November 2012 at http://forums.arm.com

    Hi Shervin,
    Please see my question.

    Right now I am not worried about the NEON assembly code versus the ARM assembly code.

    To put my current issue simply:

    I have assembly code that I wrote using ARM instructions. The algorithm just performs 1000 million additions.

    Please see the code below:

    res = loc_add_ARM(1000000000);

            ARM
            REQUIRE8
            PRESERVE8

            AREA ||.text||, CODE, READONLY, ALIGN=2
            EXPORT loc_add_ARM

    loc_add_ARM
            PUSH     {r4,r5,lr}
            MOV      r5,#1              ; increment value
            MOV      r1,#0
            MOV      r2,#0
            MOV      r3,#0
            MOV      r4,#0
            MOV      r0,r0,ASR #2       ; loop counter = n/4; 16 ADDs per pass
    loc_add_ARM_LOOP
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            SUBS     r0,r0,#4
            BGT      loc_add_ARM_LOOP

            ADD      r0,r1,r2
            ADD      r1,r3,r4
            ADD      r0,r0,r1
            ; res -> r0
            POP      {r4,r5,pc}
            END

    =============================================================
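    For reference, the hand-written loop above can be modelled in C as follows (a sketch, not the code that was actually benchmarked: the four accumulators mirror r1-r4, and the shifted count mirrors the MOV r0,r0,ASR #2):

```c
#include <stdint.h>

/* C model of loc_add_ARM: n total additions spread across four
   independent accumulators, 16 additions per loop iteration. */
uint32_t loc_add_c(uint32_t n)
{
    uint32_t r1 = 0, r2 = 0, r3 = 0, r4 = 0;
    int32_t count = (int32_t)n >> 2;   /* MOV r0,r0,ASR #2 */

    while (count > 0) {
        r1 += 1; r2 += 1; r3 += 1; r4 += 1;
        r1 += 1; r2 += 1; r3 += 1; r4 += 1;
        r1 += 1; r2 += 1; r3 += 1; r4 += 1;
        r1 += 1; r2 += 1; r3 += 1; r4 += 1;
        count -= 4;                    /* SUBS r0,r0,#4 */
    }
    return r1 + r2 + r3 + r4;          /* equals n when n is a multiple of 16 */
}
```

    Splitting the sum across four accumulators breaks the ADD-to-ADD dependency chain, which is what lets the unrolled ARM loop keep the integer pipeline busy.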

    Completing this operation takes "800792" time units.

    For my next experiment I used the same ARM assembly code, but added just one extra NEON instruction:

    res = loc_add_ARM(1000000000);

            ARM
            REQUIRE8
            PRESERVE8

            AREA ||.text||, CODE, READONLY, ALIGN=2
            EXPORT loc_add_ARM

    loc_add_ARM
            PUSH     {r4,r5,lr}
            VEOR     q0,q0,q0          ; the one extra NEON instruction
            MOV      r5,#1             ; increment value
            MOV      r1,#0
            MOV      r2,#0
            MOV      r3,#0
            MOV      r4,#0
            MOV      r0,r0,ASR #2      ; loop counter = n/4; 16 ADDs per pass
    loc_add_ARM_LOOP
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            SUBS     r0,r0,#4
            BGT      loc_add_ARM_LOOP

            ADD      r0,r1,r2
            ADD      r1,r3,r4
            ADD      r0,r0,r1
            ; res -> r0
            POP      {r4,r5,pc}
            END

    =============================================================

    But it now takes "895230".

    Why does the time increase because of adding a single NEON instruction?

    Could you please help with this?

    Thanks,

    MJ



  • Note: This was originally posted on 30th November 2012 at http://forums.arm.com

    Could you please share the Cortex-A9 pipeline document with me?
  • Note: This was originally posted on 1st December 2012 at http://forums.arm.com

    Thanks Shervin for your reply.

    I don't understand why NEON behaves like a coprocessor.

    If that is the case, I may still have a few ARM instructions inside a NEON loop with a billion iterations, for example to handle the loop count or index modifications.

    There would then be a lot of delay switching between ARM and NEON.

    So I don't think it is right that NEON behaves as a coprocessor.

    Please see the for loop below:

    =============================================================
    ;r3 - loop count (a big value)
    ;r0 - source address, r2 - destination address
    FORLOOP
    VLD1.16      {d0,d1,d2,d3},[r0]!
    VQDMULL.S16  q4, d0, d1
    VQDMULL.S16  q5, d2, d3
    VST1.32      {q4,q5},[r2]!
    SUBS         r3, r3, #32
    BGT          FORLOOP
    ================================================
    So in this loop there are ARM instructions right after the NEON ones, and it would incur the delay you mentioned about NEON acting as a coprocessor.


    Thanks,
    MJ
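    For reference, the VQDMULL.S16 in the loop above performs a saturating doubling long multiply on each 16-bit lane. A scalar C model of a single lane (my sketch of the semantics, not ARM's reference code) is:

```c
#include <stdint.h>

/* One lane of VQDMULL.S16: result = saturate_s32(2 * a * b).
   The only input that saturates is a == b == INT16_MIN, where
   2*a*b == 2^31 does not fit in an int32_t. */
int32_t sqdmull_lane_s16(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    if (p > INT32_MAX) return INT32_MAX;
    if (p < INT32_MIN) return INT32_MIN;
    return (int32_t)p;
}
```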
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Hi,

    I ran a NEON test on a Linux platform board.
    (1)   Matrix multiplication calculated one element at a time, using only S registers (normal ARM/VFP instructions).


    (2)   Matrix multiplication using 128-bit operations, so the number of instructions becomes 1/4 of (1). Here I used Q and D registers (NEON instructions).


    I am using Linux 3.0.35, and the test code runs on a Cortex-A9 platform.
    But there is no speed difference between (1) and (2).


    The following options are enabled in my Linux kernel configuration:
    CONFIG_VFP=y
    CONFIG_VFPv3=y
    CONFIG_NEON=y

    I used the following gcc command to build the NEON application; the compiler version is gcc 4.6.2:
    gcc  -march=armv7-a -mtune=cortex-a9 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=hard  -o test.out test.c

    Are any other settings needed to enable NEON?
    Why don't I see any performance difference between the normal ARM and NEON code?
    Let me know if I have missed anything.

    Thanks in advance
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Thanks Shervin for your reply.

    Both versions are written in assembly and perform a 4x4 matrix multiplication.
    In (1) I load the float array contents into S registers (32-bit) using "vldmia", then use "vmul.f32" and "vmla.f32" with S registers as the operands and to hold the result.
    In (2) I load the complete float array contents into Q registers (128-bit) using "vldmia", then use "vmul.f32" and "vmla.f32" with Q (128-bit) and D (64-bit) registers, which obviously reduces the number of instructions (load, store, multiply) to 1/4 of (1).

    So I expect a performance improvement in (2), which I am not able to achieve. What could the issue be?

    Thanks and Regards
    KP
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Thanks again.

    One more observation: with Cortex-A8 I was able to see a performance difference with the same code.
    So is it that I cannot observe much performance difference between (1) and (2) on Cortex-A9 because its NEON unit has a 64-bit datapath and takes 2 cycles to complete a 128-bit operation?

    But there should still be some speed difference between (1) and (2), right?

    Are there any solutions to overcome this issue?

    Can I expect any performance improvement if I add PLD instructions?

    Regards,
    KP
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Hi,

      So enabling NEON alone is not helping me.
      As far as I know, NEON does not operate on S registers; it works only on Q and D registers.
      I should get some speed difference between the normal ARM assembly code and the NEON code, right?
      My question is why I am not getting any speed difference. I should gain an advantage from using NEON instructions; why is it not happening?
      I understand the memory bottleneck, but there should still be some percentage of performance gain.

      Correct me if I am wrong.

    Thanks and Regards,
    KP
  • Note: This was originally posted on 25th March 2013 at http://forums.arm.com

    Both test codes do the same thing: a 4x4 matrix multiplication.

    Inputs:
    Two float arrays, each holding 16 elements.

    Test code (1), using only S registers:
    Operand 1 is loaded into S0-S15; operand 2 is loaded into S16-S19 (only 4 floats at a time); S20-S23 hold the result.
    After the multiplication with the 4 loaded floats (in S16-S19) is done, the next 4 floats are loaded into S16-S19.

    Test code (2), using Q and D registers:
    Operand 1 is loaded into Q4-Q7; operand 2 is loaded into Q8-Q11; Q0-Q3 hold the result.

    I measure the timing by calling the gettimeofday API twice (before and after the test code) and taking the difference.
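    The gettimeofday-twice measurement described above can be sketched as follows (a minimal portable harness; elapsed_us, time_call_us, and the function-pointer interface are my own framing, not code from this thread):

```c
#include <stddef.h>
#include <sys/time.h>

/* Difference between two gettimeofday() samples, in microseconds. */
long elapsed_us(struct timeval t0, struct timeval t1)
{
    return (t1.tv_sec - t0.tv_sec) * 1000000L
         + (t1.tv_usec - t0.tv_usec);
}

/* Time a routine the way the posts describe: sample the clock
   before and after the call and subtract. */
long time_call_us(void (*fn)(void))
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    fn();
    gettimeofday(&t1, NULL);
    return elapsed_us(t0, t1);
}
```

    Note that gettimeofday has only microsecond resolution and can be adjusted by the system clock, so a very short kernel such as a single 4x4 multiply should be repeated many times inside the timed region.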
  • Note: This was originally posted on 23rd March 2013 at http://forums.arm.com

    Code patch:
    In test 1, one matrix operand is loaded completely into S registers, the other operand is loaded into S16-S19 (only 4 floats at a time) one chunk after the other, and intermediate results are held in S20-S23 and stored to memory.
    In test 2, operands 1 and 2 are loaded into Q registers.

    test 1:

      "vldmia %2, { s0-s15 }  \n\t"

      "vldmia %1, { s16-s19 } \n\t"
      "add %1, %1, #16        \n\t"

      "vmul.f32 s20, s0, s16  \n\t"
      "vmul.f32 s21, s1, s16  \n\t"
      "vmul.f32 s22, s2, s16  \n\t"
      "vmul.f32 s23, s3, s16  \n\t"
       .
       .
       .
      "vstmia %0, { s20-s23 } \n\t"
      "add %0, %0, #16        \n\t"
       .
       .
       .

    test 2:

      "vldmia %1, { q4-q7 }   \n\t"
      "vldmia %2, { q8-q11 }  \n\t"

      "vmul.f32 q0, q8, d8[0] \n\t"
      "vmul.f32 q1, q8, d10[0]\n\t"
      "vmul.f32 q2, q8, d12[0]\n\t"
      "vmul.f32 q3, q8, d14[0]\n\t"
       .
       .
       .

    Thanks and Regards,
    KP
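    The vmul-by-scalar pattern in test 2 computes each result column as a linear combination of the columns of one operand. A plain-C model of that structure (assuming column-major storage, which is my reading of the register layout, not something stated in the thread) is:

```c
#include <stddef.h>

/* 4x4 matrix multiply, column-major storage: each column of out is a
   linear combination of the columns of a, scaled by the elements of
   the corresponding column of b -- the same shape as the
   vmul/vmla-by-scalar sequence in test 2. */
void matmul_4x4(const float *a, const float *b, float *out)
{
    for (size_t col = 0; col < 4; col++) {
        for (size_t row = 0; row < 4; row++) {
            float acc = 0.0f;
            for (size_t k = 0; k < 4; k++)
                acc += a[k * 4 + row] * b[col * 4 + k];
            out[col * 4 + row] = acc;
        }
    }
}
```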
  • Note: This was originally posted on 28th November 2012 at http://forums.arm.com

    It's really hard to help without seeing any detail - can you share a code sequence for both NEON and ARM which is going slower in the NEON case?
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    I'd also suggest posting a code example. A common issue with "benchmarks" of small code sections is that they often do not test what you think they are testing (either because the code under test is inefficient, or because the timing method doesn't scale down to very short intervals).

    It's much easier to give precise answers if we actually know exactly what your code is trying to do =P
  • Note: This was originally posted on 23rd March 2013 at http://forums.arm.com

    What is this code actually trying to do?

    Your two test cases are not doing the same thing, and both load a large amount of data they never use.


    Can you also explain how you are timing it?