
Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM assembly code and NEON assembly code for the same task. I expected the NEON version to run roughly 4x faster than the ARM version, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and the configuration in my Makefile is: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 25th March 2013 at http://forums.arm.com

    Both test routines do the same thing: a 4x4 matrix multiplication.

    Inputs:
    Two float arrays, each holding 16 elements.

    Test code (1): using only S registers.
    Operand 1 is loaded into S0-S15. Operand 2 is loaded into S16-S19, only 4 floats at a time. S20-S23 hold the result.
    Once the multiplication with the 4 floats currently in S16-S19 is done, the next 4 floats are loaded into S16-S19.

    Test code (2): using Q and D registers.
    Operand 1 is loaded into Q4-Q7 and operand 2 into Q8-Q11. Q0-Q3 hold the result.
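
For reference, the computation that both routines perform can be sketched in plain C. This is only an illustration for checking that the two assembly versions produce the same output. The function name `matmul4x4_ref` and the row-major layout are assumptions, not taken from the test code.

```c
/* Hypothetical scalar reference: c = a * b for 4x4 float matrices.
 * Row-major storage is an assumption; the original post does not say
 * how the 16 floats are laid out. */
static void matmul4x4_ref(const float a[16], const float b[16], float c[16])
{
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            float sum = 0.0f;
            /* Dot product of one row of a with one column of b. */
            for (int k = 0; k < 4; ++k)
                sum += a[row * 4 + k] * b[k * 4 + col];
            c[row * 4 + col] = sum;
        }
    }
}
```

Comparing the output of each assembly routine against such a reference on the same inputs confirms that the two versions do equivalent work before their timings are compared.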

    I measure the time by calling the gettimeofday API twice (once before and once after the call to the test code) and taking the difference.
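
The measurement described above can be sketched as follows. gettimeofday reports time in microseconds, and a single 4x4 multiplication is likely to finish in less than that, so this sketch times many calls in one interval instead of one. The helper names `elapsed_us` and `time_calls_us`, and the placeholder `dummy_kernel`, are hypothetical and stand in for the real test routines.

```c
#include <stddef.h>
#include <sys/time.h>

/* Microseconds between two gettimeofday() samples. */
static long long elapsed_us(const struct timeval *start,
                            const struct timeval *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/* Hypothetical helper: call fn `iterations` times and return the total
 * elapsed microseconds, so the interval is long enough to measure. */
static long long time_calls_us(void (*fn)(void), int iterations)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < iterations; ++i)
        fn();
    gettimeofday(&end, NULL);
    return elapsed_us(&start, &end);
}

/* Placeholder for the real S-register or Q-register routine. */
static int call_count;
static void dummy_kernel(void)
{
    ++call_count;
}
```

Dividing the result of `time_calls_us(dummy_kernel, n)` by `n` gives an average per-call time. With a small `n`, the overhead of the timing calls themselves can dominate the measurement.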