
Cortex-A9: NEON assembly code is not giving the expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I am facing a problem: I have hand-written ARM assembly code and NEON assembly code. I expected the NEON assembly to be about 4x faster than the ARM assembly code, but I am not seeing that improvement in the NEON version.

Can you please explain what the reason could be?

I am using Cortex-A9 processor and configuration in my Makefile : "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know if there is anything I need to change in the makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 27th November 2012 at http://forums.arm.com

    First off, your flags look fine.

    I think your problem is more fundamental.  NEON is not some magic wand that will immediately make your code run faster.  How much improvement you will see depends almost entirely on what kind of processing you do, and how well you write the code.
  • Note: This was originally posted on 27th November 2012 at http://forums.arm.com

    NEON can IN SOME CASES give a 4x improvement, but not in every case.  To get that kind of improvement you need an algorithm that lends itself to vectorization, so that you can process four elements of data at a time.  If your calculations are mostly scalar (not vector) you won't be able to get a 4x improvement.

    Also, to get "good" performance you have to consider several other factors: how the data is laid out in memory, whether it can be loaded efficiently into the vector registers, whether you can re-organise things to get better cache performance, and so on.
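
    For illustration only (this is a generic sketch, not the original poster's code; the function names and the assumption that n is a multiple of 4 are mine): a scalar loop and a NEON-intrinsics version that processes four floats per iteration.

        #include <arm_neon.h>
        #include <stddef.h>

        /* Scalar version: one float per iteration. */
        void scale_scalar(float *dst, const float *src, float gain, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                dst[i] = src[i] * gain;
        }

        /* NEON version: four floats per iteration (n assumed to be a multiple of 4). */
        void scale_neon(float *dst, const float *src, float gain, size_t n)
        {
            float32x4_t vgain = vdupq_n_f32(gain);     /* broadcast gain into all 4 lanes */
            for (size_t i = 0; i < n; i += 4) {
                float32x4_t v = vld1q_f32(src + i);    /* load 4 contiguous floats        */
                v = vmulq_f32(v, vgain);               /* 4 multiplies in one instruction */
                vst1q_f32(dst + i, v);                 /* store 4 results                 */
            }
        }

    With GCC or Clang you would need something like -mfpu=neon to build this; the armcc flags in the original post should already permit NEON instructions.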
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Actually it's only for 32-bit integer multiplication that NEON has a throughput of 32 bits per cycle. 8-bit and 16-bit integer multiplies, as well as single-precision floats, have a throughput of 64 bits per cycle. This applies to all forms of multiply and multiply-accumulate, and it is the case for both Cortex-A8 and Cortex-A9.

    On Cortex-A9 (and A8) a 32-bit integer multiplication in the ARM core needs two cycles, so it can still be faster to use NEON.
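
    To make that concrete, here is a hedged sketch (my own illustration, not code from this thread) of element-wise multiplies with 16-bit versus 32-bit data. Per the timings above, the 16-bit version handles eight elements per vector multiply and gets twice the per-cycle multiply throughput on Cortex-A8/A9; n is assumed to be a multiple of 8 to keep the sketch short.

        #include <arm_neon.h>
        #include <stdint.h>
        #include <stddef.h>

        /* 16-bit element-wise product: 8 multiplies per vector instruction. */
        void mul_s16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
        {
            for (size_t i = 0; i < n; i += 8) {
                int16x8_t va = vld1q_s16(a + i);
                int16x8_t vb = vld1q_s16(b + i);
                vst1q_s16(dst + i, vmulq_s16(va, vb));
            }
        }

        /* 32-bit element-wise product: only 4 multiplies per vector instruction,
         * and lower per-cycle throughput on Cortex-A8/A9. */
        void mul_s32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
        {
            for (size_t i = 0; i < n; i += 4) {
                int32x4_t va = vld1q_s32(a + i);
                int32x4_t vb = vld1q_s32(b + i);
                vst1q_s32(dst + i, vmulq_s32(va, vb));
            }
        }

    So if the value range of your data allows it, narrowing the multiply operands can roughly double throughput on these cores.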
  • Note: This was originally posted on 30th November 2012 at http://forums.arm.com

    Sorry, but it is still not obvious what your tests are doing unless you show something more concrete, such as actual code. For all we know, most of the delay might be caused by calling a C function on every iteration, or by loading data from RAM, etc. It is also not clear what you find surprising about these results.

    Like several people have already said in your other threads, there is a big difference in the NEON pipeline and cache system between Cortex-A8 and Cortex-A9, so it is expected that you will get very different speeds with NEON. Also, NEON will only give a speedup if you use it efficiently and for suitable algorithms; if you use NEON for things that aren't suited to it, you will get a slow-down instead of a speedup, even in assembly code.

    -Shervin.
  • Note: This was originally posted on 1st December 2012 at http://forums.arm.com

    Oh, I see what you are doing now. When your code uses a NEON instruction for the first time, the NEON "coprocessor" will probably be switched off / idle to save power. When the CPU tries to execute that first NEON instruction, it will generate an undefined instruction exception, and software in your OS will switch on the NEON coprocessor and then execute your NEON instruction. So this explains the big delay caused by a single NEON instruction. In other words, you shouldn't need to worry about this delay: you normally only use NEON in critical loops that are repeated thousands or millions or billions of times, where the one-off initialization time isn't noticeable. So the test you performed isn't representative of any real-world NEON scenario. (A sketch of warming NEON up before timing a loop follows below.)

    But to be honest, it is a higher delay than I expected. Perhaps your OS was busy processing other threads during your tests.

    -Shervin.
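
    For what it's worth, here is a hedged benchmarking sketch of the warm-up idea (my own example, not code from this thread; the buffer sizes, iteration count and kernel are placeholders, and POSIX clock_gettime is assumed to be available). One throwaway NEON instruction is executed so that any lazy-enable trap is taken before the timed region.

        #include <arm_neon.h>
        #include <stdio.h>
        #include <time.h>

        static float buf_a[1024], buf_b[1024];

        /* Placeholder kernel: whatever NEON routine you actually want to time. */
        static void run_neon_kernel(void)
        {
            for (int i = 0; i < 1024; i += 4) {
                float32x4_t v = vaddq_f32(vld1q_f32(buf_a + i), vld1q_f32(buf_b + i));
                vst1q_f32(buf_a + i, v);
            }
        }

        static double now_sec(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec + ts.tv_nsec * 1e-9;
        }

        int main(void)
        {
            /* Warm-up: one throwaway NEON instruction, so any lazy enable /
             * undefined-instruction trap happens before the timed region. */
            volatile float sink = vgetq_lane_f32(vdupq_n_f32(1.0f), 0);
            (void)sink;

            double t0 = now_sec();
            for (int i = 0; i < 100000; i++)
                run_neon_kernel();
            double t1 = now_sec();

            printf("elapsed: %.6f s\n", t1 - t0);
            return 0;
        }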
  • Note: This was originally posted on 3rd December 2012 at http://forums.arm.com

    No, it is fine and efficient to mix ARM and NEON instructions in a loop, as long as they don't try to access the same registers or memory. There is a fairly big delay of around 12-20 clock cycles when ARM and NEON (or ARM and VFP) need the same CPU registers; the exact penalty depends on the order and on which processor you have. There is also a roughly 12-20 clock cycle delay if you mix NEON and VFP instructions in the same loop, because only one of the two "coprocessors" can be active at a time. This is a tricky problem because NEON and VFP instructions now look the same in the "unified" syntax, so if you use 64-bit registers you sometimes need to make sure you are actually getting NEON instructions and not VFP instructions.

    When I said that the NEON "coprocessor" will basically power up on your first NEON instruction and therefore cause a delay, remember that it will only happen for the first instruction and not again after that, so you can nearly always ignore the delay.

    So basically, your loop should run efficiently; it doesn't have any real problems. But I'd highly recommend adding cache preloading to your loop, because NEON only speeds up the calculations, not the memory accesses, so without the right amount of cache preloading (the PLD instruction) in your loop, NEON might not seem any faster than ARM code. (A sketch of a loop with prefetching follows below.)

    Cheers,
    Shervin.
    http://www.shervinemami.info/armAssembly.html
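
    Here is a hedged sketch of what cache preloading can look like from C (my own example, not code from this thread): __builtin_prefetch() maps to PLD with GCC/Clang, and in hand-written assembly you would place a PLD instruction at the equivalent spot. The 256-byte prefetch distance is a guess that needs tuning on real hardware, and n is assumed to be a multiple of 4.

        #include <arm_neon.h>
        #include <stddef.h>

        /* Same scale-by-gain loop as before, but each iteration asks the cache
         * to start fetching data well ahead of where we are currently reading. */
        void scale_neon_pld(float *dst, const float *src, float gain, size_t n)
        {
            float32x4_t vgain = vdupq_n_f32(gain);
            for (size_t i = 0; i < n; i += 4) {
                __builtin_prefetch(src + i + 64);      /* 64 floats = 256 bytes ahead */
                float32x4_t v = vld1q_f32(src + i);
                vst1q_f32(dst + i, vmulq_f32(v, vgain));
            }
        }

    Prefetching is only a hint, so reading a little past the end of the array like this does not fault; the distance just has to be far enough ahead to hide the memory latency without thrashing the cache.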
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    How do you know that code 1 is normal ARM instructions and that code 2 is NEON instructions using a quarter as many instructions? Did you write both versions in assembly or NEON intrinsics to make sure, or are you just using plain C/C++ code?

    Also, it is very common on Cortex-A9 for memory to be your bottleneck rather than CPU arithmetic. So if you are completely "memory bound", then using NEON instead of ARM CPU code will often make no difference, and you are better off looking into other optimization possibilities such as cache preloading, and/or designing your code to make better use of the cache, and/or using the GPU to perform the operation (e.g. GLSL shaders on current hardware, or GPGPU acceleration if you are targeting future systems).
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Oh OK, yes, that makes sense. Normally I would say that it is because memory access is your main bottleneck, so it doesn't matter whether you speed up your calculations because the CPU is nearly always just waiting for data from memory. As I mentioned, this is a common problem on Cortex-A8, even worse on Cortex-A9, and will sometimes still be a problem on Cortex-A15.

    But in your specific case (involving multiplies), if you look at the CPU pipeline or the instruction timings of Cortex-A9, you will see that multiplication is only performed 32 bits at a time. This is true for ARM CPU code and for NEON code, even if your instruction is a VMUL using 128-bit registers. For example, a 128-bit VMUL takes roughly four times as many cycles as a 32-bit MUL, because a 128-bit multiply is essentially a macro operation that performs four 32-bit multiplies in sequence. This is quite different from the behaviour of other instructions such as addition, where a 128-bit VADD is typically the same speed as a 32-bit ADD rather than 4x slower.
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Yes, as I mentioned earlier, you can look into other optimization possibilities such as cache preloading, and/or designing your code to make better use of the cache, and/or using the GPU to perform the operation (e.g. GLSL shaders on current hardware, or GPGPU acceleration if you are targeting future systems). Another option is multi-threading if your Cortex-A9 is a multi-core device, but not all algorithms are suited to multi-threading.
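
    As a hedged illustration of the multi-threading option (my own sketch, not code from this thread; the two-thread split, the function names and the assumption that n is a multiple of 8 are all mine), each thread runs the same NEON loop over half of the array:

        #include <arm_neon.h>
        #include <pthread.h>
        #include <stddef.h>

        struct job { float *dst; const float *src; float gain; size_t n; };

        /* Each worker scales its own contiguous slice, four floats at a time. */
        static void *worker(void *arg)
        {
            struct job *j = arg;
            float32x4_t vgain = vdupq_n_f32(j->gain);
            for (size_t i = 0; i < j->n; i += 4) {
                float32x4_t v = vld1q_f32(j->src + i);
                vst1q_f32(j->dst + i, vmulq_f32(v, vgain));
            }
            return NULL;
        }

        /* Split the array between the calling thread and one extra thread.
         * n is assumed to be a multiple of 8 so both halves stay vector-aligned. */
        void scale_neon_mt(float *dst, const float *src, float gain, size_t n)
        {
            size_t half = n / 2;
            struct job jobs[2] = {
                { dst,        src,        gain, half     },
                { dst + half, src + half, gain, n - half },
            };
            pthread_t tid;
            pthread_create(&tid, NULL, worker, &jobs[1]);  /* second half on a new thread */
            worker(&jobs[0]);                              /* first half on this thread   */
            pthread_join(tid, NULL);
        }

    Whether this helps depends on the algorithm and on whether the two cores end up fighting over the same memory bandwidth, which is exactly the memory-bound caveat discussed above.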
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Like I said, the multiplication hardware is only 32 bits wide, so multiplying Q registers takes roughly the same time as doing four multiplies on S registers, as described in the cycle timings in the Cortex-A9 NEON TRM.