Cortex-A9: NEON assembly code is not giving the expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I am facing a problem: I have hand-written ARM assembly code and NEON assembly code for the same task. I expected the NEON version to be roughly 4x faster than the ARM version, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor with this configuration in my Makefile: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 3rd December 2012 at http://forums.arm.com

    No, it is fine and efficient to mix ARM and NEON instructions in a loop, as long as they don't try to access the same registers or memory. There is a fairly big delay of around 12-20 clock cycles when ARM and NEON (or ARM and VFP) need the same CPU registers; the exact cost depends on the order of the instructions and on which processor you have. There is also a roughly 12-20 clock cycle delay if you mix any NEON and VFP instructions in the same loop, because only one "coprocessor" can be used at a time. This is a tricky problem because NEON and VFP instructions are now "unified" to look the same, so if you use 64-bit registers, you sometimes need to make sure you are using NEON instructions and not VFP instructions.

    When I said that the NEON "coprocessor" will basically power up on your first NEON instruction and therefore cause a delay, remember that it will only happen for the first instruction and not again after that, so you can nearly always ignore the delay.

    So basically, your loop should run efficiently; it doesn't have any real problems. But I'd highly recommend using cache preloading in your loop, because NEON only speeds up the calculations, not the memory access. Without the correct amount of cache preloading (the PLD instruction) in your loop, NEON might not seem any faster than ARM code.

    Cheers,
    Shervin.
    http://www.shervinemami.info/armAssembly.html
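
The cache-preloading advice above can be sketched in C. This is only an illustration under stated assumptions, not code from the thread: it uses the GCC/Clang builtin `__builtin_prefetch` (which normally compiles to PLD on ARM targets) rather than the toolchain in the question, and the function name `sum_bytes_prefetch`, the 32-byte chunk size, and the 256-byte prefetch distance are hypothetical values chosen for the example. The right distance has to be found by measurement.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning value: how many bytes ahead of the current read
   position to prefetch. Measure on the real target; do not trust this. */
#define PREFETCH_AHEAD_BYTES 256

/* Assumed chunk size: one prefetch hint is issued per 32 bytes read. */
#define CHUNK_BYTES 32

/* Sums n bytes from src while hinting the cache about upcoming reads.
   __builtin_prefetch is a GCC/Clang builtin, not a standard C function. */
uint32_t sum_bytes_prefetch(const uint8_t *src, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;

    while (i < n) {
        size_t chunk = n - i;
        if (chunk > CHUNK_BYTES)
            chunk = CHUNK_BYTES;

        /* Hint only: read access (0), low temporal locality (0).
           Guarded so the prefetched address stays inside the buffer. */
        if (n - i > PREFETCH_AHEAD_BYTES)
            __builtin_prefetch(src + i + PREFETCH_AHEAD_BYTES, 0, 0);

        /* The actual work; in the real loop this would be the NEON part. */
        for (size_t j = 0; j < chunk; j++)
            sum += src[i + j];

        i += chunk;
    }
    return sum;
}
```

The prefetch changes nothing about the result; it only asks the memory system to start fetching data the loop will need a few iterations later, so the calculation does not stall waiting for it.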