Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I am facing a problem: I have hand-written ARM assembly code and NEON assembly code for the same routine. I expected the NEON version to be about 4x faster than the ARM version, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and the configuration in my Makefile is: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.

Krish ks over 12 years ago

Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

Thanks Shervin for your reply,

Both versions are written in assembly and perform a 4x4 matrix multiplication.
In (1) I load the float array contents into S registers (32-bit) using "vldmia", then use "vmul.f32" and "vmla.f32" to perform the matrix multiplication, with S registers as operands and to hold the result.
In (2) I load the complete float array contents into Q registers (128-bit) using "vldmia", then use "vmul.f32" and "vmla.f32" to perform the matrix multiplication on Q (128-bit) and D (64-bit) registers, which should reduce the number of instructions (load, store, multiply) to a quarter of those in (1).
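To make the comparison between (1) and (2) concrete, here is a minimal portable C sketch of the two formulations. It is not the assembly from this post: the function names, the column-major layout, and the operation counters are all assumptions made for illustration, and the 4-lane helpers only model what a "vmul.f32"/"vmla.f32" on a Q register would do. It shows that the number of multiply operations drops from 64 to 16, which is the 1/4 figure mentioned above. Note that it counts instructions only and says nothing about how many cycles each one takes.

```c
/* Hypothetical model, not the original assembly.
   Matrices are column-major: element (row r, col c) is m[c*4 + r]. */

/* (1) Scalar model: one multiply or multiply-accumulate per float,
   as with vmul.f32 / vmla.f32 on S registers. Returns the op count. */
static unsigned matmul_scalar(const float *a, const float *b, float *out)
{
    unsigned ops = 0;
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++) {
            float acc = a[r] * b[c * 4];            /* models vmul.f32 Sd, Sn, Sm */
            ops++;
            for (int k = 1; k < 4; k++) {
                acc += a[k * 4 + r] * b[c * 4 + k]; /* models vmla.f32 Sd, Sn, Sm */
                ops++;
            }
            out[c * 4 + r] = acc;
        }
    }
    return ops;
}

/* Models a 4-lane multiply of one column by a scalar. */
static void vec4_mul(float acc[4], const float col[4], float s)
{
    for (int i = 0; i < 4; i++)
        acc[i] = col[i] * s;
}

/* Models a 4-lane multiply-accumulate of one column by a scalar. */
static void vec4_mla(float acc[4], const float col[4], float s)
{
    for (int i = 0; i < 4; i++)
        acc[i] += col[i] * s;
}

/* (2) Vector model: each op handles a whole column of A (4 floats),
   as with vmul.f32 / vmla.f32 on Q registers. Returns the op count. */
static unsigned matmul_vec4(const float *a, const float *b, float *out)
{
    unsigned ops = 0;
    for (int c = 0; c < 4; c++) {
        float acc[4];
        vec4_mul(acc, &a[0], b[c * 4]);             /* one vector multiply */
        ops++;
        for (int k = 1; k < 4; k++) {
            vec4_mla(acc, &a[k * 4], b[c * 4 + k]); /* one vector multiply-accumulate */
            ops++;
        }
        for (int r = 0; r < 4; r++)
            out[c * 4 + r] = acc[r];
    }
    return ops;
}
```

Calling both functions on the same inputs should give identical results, with matmul_scalar reporting 64 operations and matmul_vec4 reporting 16.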

So I am expecting a performance improvement in (2), which I am not able to achieve. What could be the issue?

Thanks and Regards
KP