Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM assembly code and a NEON assembly version of the same code. I expected the NEON assembly to run about 4x faster than the ARM assembly, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and the configuration in my Makefile is: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change any Makefile settings to get the NEON performance improvement.

Martin Weidmann over 12 years ago

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

NEON can IN SOME CASES give a 4x improvement, but not in every case. To get this kind of improvement you need an algorithm that lends itself to vectorization, and you need to be able to process four data elements at a time. If your calculations are mostly scalar (not vector), you won't be able to get a 4x improvement.
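
To make "four at a time" concrete, here is a minimal sketch in C rather than assembly. It is not taken from the original poster's code: the function names add_u32_scalar and add_u32_vec4 are made up for illustration, and it assumes a toolchain that provides arm_neon.h and defines __ARM_NEON or __ARM_NEON__ when NEON is enabled. On any other target it falls back to the plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the compiler defines one of these macros when NEON is enabled. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Scalar reference: one 32-bit addition per loop iteration. */
void add_u32_scalar(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Vector version: four 32-bit lanes per iteration where NEON is available. */
void add_u32_vec4(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
#if HAVE_NEON
    for (; i + 4 <= n; i += 4) {
        uint32x4_t va = vld1q_u32(a + i);       /* load a[i..i+3] */
        uint32x4_t vb = vld1q_u32(b + i);       /* load b[i..i+3] */
        vst1q_u32(out + i, vaddq_u32(va, vb));  /* four adds in one instruction */
    }
#endif
    /* Leftover elements (or the whole array without NEON) are handled one by one. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Every iteration here is independent of the others, which is what makes the loop vectorizable. Even so, the loop does two loads and one store for each add, so on real hardware it is likely to be limited by memory rather than arithmetic and may show much less than 4x.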

To get "good" performance you also have to consider several other factors. How is the data laid out in memory? Can it be loaded efficiently into the vector registers? Can you reorganise it to get better cache performance? And so on.
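
As a rough illustration of the data layout point, the sketch below sums the same field from two layouts. The types point_aos and points_soa and both function names are hypothetical, chosen only for this example. It is plain C with no NEON, so it only shows the access pattern a vector load would see.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures: x, y, z, w of one point sit next to each other,
 * so consecutive x values are 16 bytes apart in memory. */
typedef struct { float x, y, z, w; } point_aos;

/* Structure-of-arrays: all x values are contiguous, so four consecutive
 * x values could be fetched with a single 128-bit vector load. */
typedef struct { const float *x, *y, *z, *w; } points_soa;

/* Strided access: each x is separated by the unused y, z, w fields. */
float sum_x_aos(const point_aos *p, size_t n)
{
    assert(p);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p[i].x;
    return sum;
}

/* Contiguous access: every byte brought into the cache is actually used. */
float sum_x_soa(const points_soa *p, size_t n)
{
    assert(p && p->x);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p->x[i];
    return sum;
}
```

Both functions return the same result. The difference is that the second layout maps directly onto vector loads and wastes no cache space on fields that are not used. NEON also has de-interleaving structure loads (VLD2, VLD3, VLD4) for data that has to stay in the first layout.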