Differences between NEON in Cortex-A8 and A9

Note: This was originally posted on 25th July 2011 at http://forums.arm.com

Currently I am working on a single-core Cortex-A9 chip (an AML8726-M, if you want to know more), and the datasheet says it has NEON. But when I test the code from here (http://hilbert-space.de/?p=22), I cannot find any acceleration on it; sometimes the NEON-assembly-optimized code runs even slower than the plain ARM C code. At the same time, the same code gets a pretty good speedup on my i.MX515, which is a Cortex-A8 chip.
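For context, the linked benchmark is an RGB-to-grayscale conversion. A plain-C version of that kernel looks roughly like the sketch below (the function name is mine, and I'm assuming the usual 77/151/28 fixed-point weights used in that post):

```c
#include <stdint.h>

/* Fixed-point RGB -> grayscale: gray = (r*77 + g*151 + b*28) >> 8.
 * The weights sum to 256, so a uniform gray input maps to itself. */
void rgb_to_gray_c(const uint8_t *rgb, uint8_t *gray, int num_pixels)
{
    for (int i = 0; i < num_pixels; i++) {
        int r = rgb[3 * i + 0];
        int g = rgb[3 * i + 1];
        int b = rgb[3 * i + 2];
        gray[i] = (uint8_t)((r * 77 + g * 151 + b * 28) >> 8);
    }
}
```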


I am using the Android NDK to build a test app that runs on Android; could that be the reason?
Can anyone tell me why this happens?


Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms

Android is a Linux-based OS, so I can call gettimeofday() to get a precise time period at the microsecond level. The results on the A9 are not identical but almost the same, and I am sure I didn't run the same binary 3 times.
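A minimal sketch of that gettimeofday() measurement (the helper name is hypothetical; work() stands in for the benchmarked function):

```c
#include <sys/time.h>

/* Microsecond difference between two gettimeofday() samples.
 * Usage:
 *     struct timeval t0, t1;
 *     gettimeofday(&t0, NULL);
 *     work();
 *     gettimeofday(&t1, NULL);
 *     printf("%ld us\n", elapsed_us(t0, t1));
 */
long elapsed_us(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) * 1000000L
         + (end.tv_usec - start.tv_usec);
}
```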

Thanks and looking forward to any useful suggestions.


  • Note: This was originally posted on 29th November 2012 at http://forums.arm.com

Does anyone have documentation about the Cortex-A9 pipeline?
  • Note: This was originally posted on 8th August 2012 at http://forums.arm.com



            vld3.8      {d0-d2}, [r1]!   @ cycles 0-3, result in N2 of last cycle

            vmull.u8    q3, d0, d5       @ cycle 4 (can't dual issue due to previous result in N2)
            vmlal.u8    q3, d1, d4       @ cycle 5
            vmlal.u8    q3, d2, d3       @ cycle 6, result in N6

            vshrn.u16   d6, q3, #8       @ cycle 12 (value needed in N1, 5 cycle stall), result in N3
            vst1.8      {d6}, [r0]!      @ cycle 15 (value needed in N1, 2 cycle stall)

            subs        r2, r2, #1       @ overlaps w/NEON
            bne             .loop        @ overlaps w/NEON


    So that's 16 cycles, as predicted. Note that you'd get much better performance if you unrolled this loop to fill the latency after the last multiply and shift. Unrolling 4 times should be sufficient.


    I'm coming to this a bit late, so sorry if this doesn't interest you anymore. However, I found some problems in this analysis, or maybe I'm missing something. Please correct me if I'm wrong:
    • For the vshrn.u16 instruction you said there is a 5-cycle stall, which I agree with, yet you counted 6 cycles. The same extra cycle is counted for the vst1.8 instruction, which is supposed to stall for 2 cycles yet stalls for 3. If this is correct, then your analysis should have shown 14 cycles, not 16.
    • Now, don't the vmlal.u8 instructions require q3 as a source in N3, which would stall their execution by 3 cycles each?
    • This is just an observation about the reasoning, but the fact that the vmull.u8 instruction is at cycle 4 has nothing to do with waiting for the result of the load instruction. The load instruction just takes 4 cycles to issue.
    If I'm correct, then this could be scheduled in 18 cycles, not 16 (or 14).
  • Note: This was originally posted on 23rd March 2013 at http://forums.arm.com

    Hi,

    I executed a NEON operation test on a Linux platform board. I am doing 4x4 matrix multiplication using ARM and NEON instructions.
    (1) Matrix multiplication calculating one element at a time. Here I have used only S registers (scalar VFP instructions).
    Here I am loading the float array content into S registers (32-bit) using "vldmia", and then "vmul.f32" and "vmla.f32" to perform the matrix multiplication, using S registers as operands and to hold the result.

    (2) Matrix multiplication using Q and D registers (NEON instructions). Since the calculation is done 128 bits at a time, the number of instructions becomes 1/4 of (1).
    Here I am loading the complete float array content into Q registers (128-bit) using "vldmia", and then "vmul.f32" and "vmla.f32" to perform the matrix multiplication using Q (128-bit) and D (64-bit) registers, which obviously reduces the number of load, store, and multiply instructions to 1/4 of the code in (1).
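    A scalar C sketch of the multiply being described (assuming row-major float arrays; the function name is mine). The NEON variant of (2) performs the same accumulation with 128-bit vmul.f32/vmla.f32 on Q registers:

```c
/* Scalar reference for a 4x4 float matrix multiply, c = a * b,
 * with all matrices stored row-major in float[16]. */
void matmul4x4(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float acc = 0.0f;
            for (int k = 0; k < 4; k++)
                acc += a[4 * i + k] * b[4 * k + j];
            c[4 * i + j] = acc;
        }
    }
}
```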

    I am using Linux 3.0.35 and the test code is executed on a Linux platform (Cortex-A9 architecture).
    But there is no speed difference between (1) and (2).

    In my Linux kernel configuration following options enabled
    CONFIG_VFP=y
    CONFIG_VFPv3=y
    CONFIG_NEON=y

    The following gcc command was used to build the NEON application; the compiler version is gcc 4.6.2:
    gcc  -march=armv7-a -mtune=cortex-a9 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=hard  -o test.out test.c

    Why didn't I find any performance difference between the normal ARM and NEON code?
    I have tested the same code on a Cortex-A8, where I am able to see a performance difference.

    Thanks in advance
  • Note: This was originally posted on 23rd March 2013 at http://forums.arm.com

    Hi Shervin,

    I asked the question in this post because I got a performance difference with a Cortex-A8 CPU, but not with a Cortex-A9 CPU.
    So I was just eager to know why that is happening (maybe because of some NEON difference between Cortex-A8 and A9).

    Regards,
    Krishna
  • Note: This was originally posted on 8th August 2012 at http://forums.arm.com

    R.E. (1) - it's a 5-cycle stall plus 1 cycle to issue - 6 cycles in total.
    R.E. (2) - I'm not sure in this specific case, but this is such a common usage that most MAC instructions have a special forwarding path for the accumulator register, so there is no stall.
    R.E. (3) - Correct.
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com

    With an image that big there is a large chance you are spending all of your time waiting for data from main memory, because it is a lot bigger than your cache.

    Can you try with a smaller image (say, half the size of your L2 cache), loop the benchmark inside the application multiple times, and average the result, so that the timing uses a "warm cache"? That should at least rule out memory system effects and ensure you are timing the algorithm, not the memory system latency.

    If you need to handle large data, consider using "preload data" (PLD) instructions to pull the data into the cache a few hundred cycles ahead of when you need it. This ensures that the CPU doesn't stall waiting for data. Most compilers have an intrinsic for this when you are using C code.
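    In C this can be sketched with GCC's __builtin_prefetch intrinsic, which emits a PLD on ARM targets. The 256-byte prefetch distance and 64-byte stride here are only assumptions to tune per platform:

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a byte buffer, prefetching 256 bytes ahead once per 64-byte
 * cache line so the loads hopefully hit in cache when we get there. */
uint32_t sum_with_prefetch(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        if ((i & 63) == 0 && i + 256 < len)
            __builtin_prefetch(&data[i + 256]);  /* PLD on ARM */
        sum += data[i];
    }
    return sum;
}
```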
  • Note: This was originally posted on 25th July 2011 at http://forums.arm.com

    Yes the two implementations of NEON are different, so I'd expect different performance numbers between the two cores.

    Can you give us an example of an algorithm you are trying, and how you are building it? The fact that you see absolutely no performance difference is "suspicious" - I'd expect some difference, even if only small. Check that you are not running the same binary 3 times - it seems like the obvious explanation for three identical performance numbers =)
  • Note: This was originally posted on 27th July 2011 at http://forums.arm.com

    Yeah, should have said unrolling and/or software pipelining. Although you still left one stall cycle there ;)
  • Note: This was originally posted on 27th July 2011 at http://forums.arm.com

    Okay, so with 16384 pixels and 8 pixels per iteration that's 2048 iterations per loop. That's a little low for trying to remove the function call overhead, but it should still be a small fraction of a percent so I'll just ignore it for now. The bigger error is going to be the round-off on the time measurement. 400 calls makes 819200 iterations, and your NEON asm timings were 17ms on the i.MX51, 13ms on the S5PC110, and 20ms on the AML8726-M.

    That makes:
    20.75ns/loop on the i.MX51
    15.86ns/loop on the S5PC110
    24.41ns/loop on the AML8726-M

    In cycles:

    16.6 cycles/loop on the i.MX51
    15.86 cycles/loop on the S5PC110
    19.52 cycles/loop on the AML8726-M
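    For reference, here is how those per-loop figures fall out of the raw timings, assuming clock speeds of 800MHz for the i.MX51 and AML8726-M and 1GHz for the S5PC110 (inferred from the numbers above, not stated in the thread):

```c
/* total_ms over a run of `iterations` loop iterations gives ns/loop;
 * multiplying ns by the clock frequency in GHz gives cycles/loop. */
double ns_per_loop(double total_ms, double iterations)
{
    return total_ms * 1e6 / iterations;
}

double cycles_per_loop(double ns, double ghz)
{
    return ns * ghz;
}
```

For example, 17ms over 819200 iterations is about 20.75ns/loop, and at 800MHz that is about 16.6 cycles/loop.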

    On the A8 you'd expect:


            vld3.8      {d0-d2}, [r1]!   @ cycles 0-3, result in N2 of last cycle

            vmull.u8    q3, d0, d5    @ cycle 4 (can't dual issue due to previous result in N2)
            vmlal.u8    q3, d1, d4    @ cycle 5
            vmlal.u8    q3, d2, d3    @ cycle 6, result in N6

            vshrn.u16   d6, q3, #8    @ cycle 12 (value needed in N1, 5 cycle stall), result in N3
            vst1.8      {d6}, [r0]!      @ cycle 15 (value needed in N1, 2 cycle stall)

            subs        r2, r2, #1    @ overlaps w/NEON
            bne       .loop        @ overlaps w/NEON


    So that's 16 cycles, as predicted. Note that you'd get much better performance if you unrolled this loop to fill the latency after the last multiply and shift. Unrolling 4 times should be sufficient.

    Your total image size is exhausting the L1 data cache on all platforms, so at least some of the time the loads will come from L2. This is where you might be hit by latency on the Cortex-A9. It wouldn't seem like you're hitting the full latency, although it's possible you're only missing in L1 cache 33% of the time on the AML8726-M (32KB of L1 data cache), and the vld itself would be hiding some of the latency.

    It'd be interesting to try it again with a smaller image that fits entirely in L1 cache, and with far more calls to the function (to get into the thousands of ms instead of tens).

    From the very beginning, I didn't think the AML8726-M was a good platform, given its 128KB L2 and 65nm fab process, but its multimedia performance is pretty good: 1080p, Mali-400.
    What are the differences between the i.MX515 and i.MX535 - frequency?


    i.MX53 is a die shrink to 45nm with some new features. The CPU clock is increased to up to 1.2GHz, the memory clock up to 400MHz (but the CPU bus clock only 200MHz), there is support for LPDDR2 and DDR3, the GPU goes up to 200MHz with its SRAM doubled, and it has 1080p decoding. This document describes it: http://www.freescale...note/AN4271.pdf

    To me the Mali-400 in the AML8726-M doesn't seem like a strong competitor, since it's probably only single core. The SGX 540 in the S5PC110 can surely beat it. If given a choice between the two I'd definitely go for the Samsung part.
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com

    Could you tell us precisely how large the image is (in pixels, an exact count) and how many times you're calling the function to get the numbers you're getting? Then we can put together some rough cycles/iteration counts and analyze the loop to see how the numbers compare with what we expect.

    It's actually interesting that memory performance was holding you back more on the Amlogic board than on the i.MX51. I was actually considering using the AML8726-M for a device over the i.MX535... I guess there would have been a good reason not to.
  • Note: This was originally posted on 8th August 2012 at http://forums.arm.com

    isogen's answers are right. To elaborate a little bit more: if you have an instruction that outputs in N3 and the next one right after it needs its result in N2, then there'll be a cycle in between where the NEON unit is doing nothing. So the second one will start two cycles after the first one.

    You should try setting up a test loop that runs iterations of code like this many times, so you can time how long it takes and see for yourself. Then you can change instructions one at a time and see what happens.
  • Note: This was originally posted on 25th July 2011 at http://forums.arm.com

    I haven't tested NEON on Cortex-A9 directly, but according to available information the following should be true:

    - On Cortex-A8 a NEON instruction can dual issue a load, store, or permute type instruction with any other type of instruction. On Cortex-A9 the NEON unit is described as only accepting one dispatch per cycle, so this probably precludes this sort of dual-issue.
    - On Cortex-A8 the NEON pipeline begins after the main pipeline has completely finished, while on Cortex-A9 it runs in parallel, with dispatch to it (presumably to a queue like in A8) occurring fairly early in the pipeline. However, in the A8 pipeline loads to NEON registers are queued and serviced well before the NEON pipeline itself begins. This allows for hiding latency, not only from L1 cache (the load-use penalty) but even some or all of the L2 cache latency. The queuing also allows for limited out-of-order loading (allowing hit under miss). So on A9, NEON loads will suffer from higher latency.
    - On the other hand, preloads on Cortex-A9 go to L1 cache instead of L2 cache, and there's now an automatic preload engine (at least as an option, don't know if the amlogic SoC implements it). So there'll be a higher L1 hit-rate for streaming data.

    So you can see the interface between the NEON unit and the rest of the core changed, but as far as I'm aware the NEON unit itself didn't. So the dispatch and latencies of the instructions should be the same, and would appear to be from the cycle charts. Note that on A9 NEON instructions still execute in order.

    These differences could cause a major change in performance if you're loading from L2 cache or main memory and there's no automatic prefetch, or it somehow isn't kicking in. But I agree with everyone else that getting the exact same performance looks extremely suspicious. The Amlogic SoC does have NEON (I've seen its datasheet); it also only has 128KB of L2 cache. It's possible NEON is disabled, but the only way you'd get the same performance is if a non-NEON path was compiled and executed. And if the non-NEON path is compiled from intrinsics it's hard to imagine that it'd end up being the same as the non-vectorized version, though for simple code like this it's possible. But that still wouldn't explain the ASM version performing the same. Benchmarking error seems like the most viable explanation...

    I think the best way to get your bearings straight on this is to start with the simplest possible control loops and ensure that you're getting the right timings for some integer code running for some number of cycles. Like, start with a loop with some nops, and grow it by a cycle or so at a time adding independent instructions. Then start adding NEON instructions and see what happens.
  • Note: This was originally posted on 23rd March 2013 at http://forums.arm.com

    KP100, please don't ask the same question on 2 different posts. I already answered on your other post, saying that the float multiply hardware is just 32 bits wide, so it doesn't matter whether you use S registers or Q registers; there won't be a speed difference. Other operations like addition have wider hardware, so they can be faster on Q registers than on S registers.
  • Note: This was originally posted on 9th August 2012 at http://forums.arm.com

    I've also found the same sort of problem in most of my image processing code, where NEON typically gives about a 20x boost on a Cortex-A8 but only about a 3x boost on a Cortex-A9 CPU! As the guys have already mentioned in this post, there are many reasons why Cortex-A9 is faster in some ways and slower in others (I also compare Cortex-A8 with Cortex-A9 on my webpage "http://www.shervinemami.info/armAssembly.html"). But as you've noticed, it's very important to try different amounts and positions of cache preloading using PLD instructions, because as someone else mentioned earlier in the post, your device is mostly just waiting for data from memory rather than doing NEON operations on it!

    So if you are working with megapixel images then you should worry less about counting NEON clock cycles and think more in terms of memory stalls, because that is where most of the time will go!
  • Note: This was originally posted on 30th November 2012 at http://forums.arm.com


    Does anyone have documentation about the Cortex-A9 pipeline?


    ARM does :-) Actually the full specs for Cortex-A9 are in several different documents. Google for "ARM Cortex-A9 TRM" to get the main official document, and "ARM Cortex-A9 NEON TRM" for the one about NEON. I also highly recommend reading the Programmer's Guide (Google for "ARM Cortex-A Series Programmers Guide"), it provides a lot of useful info.

    -Shervin.