Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM assembly code and an equivalent NEON assembly version. I expected the NEON version to run about 4x faster than the ARM version, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and the configuration in my Makefile is: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 1st December 2012 at http://forums.arm.com

    Oh, I see what you are doing now. When you first use a NEON instruction in your code, the NEON "coprocessor" will probably be switched off or idle to save power. When the CPU tries to execute your first NEON instruction, it generates an undefined instruction exception; software in your OS then switches on the NEON coprocessor and re-executes your NEON instruction. That explains the big delay caused by a single NEON instruction. In other words, you shouldn't need to worry about this delay. You only really use NEON in critical loops that are repeated thousands, millions or billions of times, where the initialization time isn't noticeable. So the test you performed isn't useful for any real-world NEON scenario.

    But to be honest, it is a higher delay than I expected. Perhaps your OS was busy processing other threads during your tests.

    -Shervin.
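
To illustrate the point about repeated loops, here is a minimal timing sketch in C that runs one untimed warm-up call before measuring, so that any one-time cost (such as the OS enabling NEON on first use) is kept out of the average. Everything in it is an assumption made for illustration: `kernel_sum` is a hypothetical plain-C stand-in for the hand-written NEON or ARM routine, `time_kernel` and `now_seconds` are made-up helper names, and a POSIX `clock_gettime` is assumed to be available on the target.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the routine under test. In a real benchmark this
   would be the hand-written NEON (or ARM) assembly function. */
static uint32_t kernel_sum(const uint8_t *src, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += src[i];
    return acc;
}

/* Monotonic wall-clock time in seconds (assumes POSIX clock_gettime). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Returns the average time per call in seconds over `iterations` calls.
   One untimed warm-up call is made first so that one-time costs are not
   included in the measurement. The last result is stored in *result. */
static double time_kernel(const uint8_t *src, size_t n, int iterations,
                          uint32_t *result)
{
    assert(iterations > 0);

    /* Warm-up call: not timed. */
    volatile uint32_t sink = kernel_sum(src, n);

    double start = now_seconds();
    for (int i = 0; i < iterations; ++i)
        sink = kernel_sum(src, n);
    double elapsed = now_seconds() - start;

    *result = sink;
    return elapsed / iterations;
}
```

Timing a single call, by contrast, would fold the first-use delay into the result. Note that an optimizing compiler may simplify the plain-C placeholder loop; an external assembly routine called through a function pointer or a separate object file would not be affected in the same way.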