
Differences between NEON in Cortex-A8 and A9

Note: This was originally posted on 25th July 2011 at http://forums.arm.com

Currently I am working on a single-core Cortex-A9 chip (the AML8726-M, if you want to know more), and the datasheet says it has a NEON unit. But when I test the code from here (http://hilbert-space.de/?p=22), I cannot see any acceleration; sometimes the NEON-assembly-optimized code even runs slower than the plain ARM C code. At the same time, the same code gets a pretty good speed-up on my i.MX515, which is a Cortex-A8 chip.


I am using the Android NDK to build a test app running on Android; could that be the reason?
Can anyone tell me why this happens?


Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms

Android is a Linux-based OS, so I can call gettimeofday() to get time intervals with microsecond precision. The results on the A9 are not identical, just almost the same, and I am sure I didn't run the same binary three times.

Thanks and looking forward to any useful suggestions.


  • Note: This was originally posted on 28th July 2011 at http://forums.arm.com


    "Although you still left one stall cycle there ;)"


    Yes, that way developers will have the pleasure to optimize a little bit more ;)

  • Note: This was originally posted on 1st August 2011 at http://forums.arm.com


    This time, with a small image (128*128 resolution), the time is shortened from 16.7 ms to 11.3 ms on my i.MX51.


    I don't remember the performance improvement I got when I ran that test!
    I thought it was nearly 2 times faster!



    But on my A9 the improvement is tiny, just 1 ms, from 20 ms to 19 ms.
    So I'm confused again.



    Well, I don't know why, but it is not really a surprise.

    The Cortex-A9 focuses on out-of-order execution and high-frequency SoCs.
    The cycle tables are not detailed, but what is given makes me suppose the Cortex-A9 is slower than the Cortex-A8 (at the same frequency).
    With NEON (and hence the code you tried) there should be no difference at the same processor frequency.
    On the other hand, the Cortex-A9 should be able to run at a higher frequency than the Cortex-A8.

    To finish, the Cortex-A9 seems to be designed to improve the poor code produced by compilers, and should not be as good for Cortex-A8-optimized code.
    For me, this CPU (the A9) is not a good choice for the moment. Under 1.2 or 1.5 GHz it is not a worthwhile choice for an assembly coder.

    Maybe one day ARM will give us the pipeline stages of the A9 instructions, and then we'll be able to know a little bit more about it.
    But that doesn't seem to be happening for now!


    Etienne
  • Note: This was originally posted on 30th November 2012 at http://forums.arm.com


    Does anyone have a document about the Cortex-A9 pipeline?


    ARM does :-) Actually the full specs for Cortex-A9 are in several different documents. Google for "ARM Cortex-A9 TRM" to get the main official document, and "ARM Cortex-A9 NEON TRM" for the one about NEON. I also highly recommend reading the Programmer's Guide (Google for "ARM Cortex-A Series Programmers Guide"), it provides a lot of useful info.

    -Shervin.
  • Note: This was originally posted on 9th August 2012 at http://forums.arm.com

    I've also found the same sort of problem in most of my image processing code, where NEON typically gives about a 20x boost on a Cortex-A8 but only about a 3x boost on a Cortex-A9 CPU! As the guys have mentioned already in this thread, there are many reasons why the Cortex-A9 is faster in some ways and slower in others (I also compare Cortex-A8 with Cortex-A9 on my webpage "http://www.shervinemami.info/armAssembly.html"). But as you've noticed, it's very important to try different amounts & positions for cache preloading using PLD instructions, because, as someone else mentioned earlier in the thread, your device is mostly just waiting on data from memory rather than doing NEON operations on it!

    So if you are working with megapixel images then you should worry less about counting NEON clock cycles and think in terms of memory stalls, because that is where most of the time will go!
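
    A minimal sketch of what tuning the preload distance can look like in C, using GCC's __builtin_prefetch (which normally compiles to a PLD on ARM). The 192-byte distance, the 16-byte block size, and the trivial per-block work are placeholder assumptions to illustrate the idea, not code from this thread:

        #include <arm_neon.h>
        #include <stddef.h>

        /* The distance to prefetch ahead is the knob to tune: too small and the
         * data is not in L1 yet when you need it, too large and it may already
         * have been evicted again. Try several values and positions. */
        #define PREFETCH_DISTANCE 192

        /* n is assumed to be a multiple of 16 to keep the sketch short. */
        void process_image(const uint8_t *src, uint8_t *dst, size_t n)
        {
            for (size_t i = 0; i < n; i += 16) {
                /* Hint the cache to start fetching data needed a few hundred
                 * cycles from now. */
                __builtin_prefetch(src + i + PREFETCH_DISTANCE);

                /* Stand-in for the real NEON work: a 16-byte copy. */
                vst1q_u8(dst + i, vld1q_u8(src + i));
            }
        }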
  • Note: This was originally posted on 23rd March 2013 at http://forums.arm.com

    KP100, please don't ask the same question on 2 different posts. I already answered on your other post, saying that the float multiply hardware is just 32 bits wide, so it doesn't matter whether you use S registers or Q registers: there won't be a speed difference. Other operations like addition have wider hardware, so they can be faster in Q registers than in S registers.
  • Note: This was originally posted on 25th July 2011 at http://forums.arm.com

    I haven't tested NEON on Cortex-A9 directly, but according to available information the following should be true:

    - On Cortex-A8 a NEON instruction can dual issue a load, store, or permute type instruction with any other type of instruction. On Cortex-A9 the NEON unit is described as only accepting one dispatch per cycle, so this probably precludes this sort of dual-issue.
    - On Cortex-A8 the NEON pipeline begins after the main pipeline is completely done, whereas on Cortex-A9 it runs in parallel, with dispatch to it (presumably to a queue like in A8) occurring fairly early in the pipeline. However, in the A8 pipeline loads to NEON registers are queued and serviced well before the NEON pipeline itself begins. This allows for hiding latency, not only from L1 cache (load-use penalty) but even some or all from L2 cache. The queuing also allows for limited out-of-order loading (allowing hit under miss). So on A9 NEON loads will suffer from higher latency.
    - On the other hand, preloads on Cortex-A9 go to L1 cache instead of L2 cache, and there's now an automatic preload engine (at least as an option, don't know if the amlogic SoC implements it). So there'll be a higher L1 hit-rate for streaming data.

    So you can see the interface between the NEON unit and the rest of the core changed, but as far as I'm aware the NEON unit itself didn't. So the dispatch and latencies of the instructions should be the same, and would appear to be from the cycle charts. Note that on A9 NEON instructions still execute in order.

    These differences could cause a major difference in performance if you're loading from L2 cache or main memory, if there's no automatic prefetch or it somehow isn't kicking in. But I agree with everyone else that getting the exact same performance looks extremely suspicious. The amlogic SoC does have NEON (I've seen its datasheet); it also only has 128KB of L2 cache. It's possible NEON is disabled, but the only way you'd get the same performance is if a non-NEON path was compiled and executed. And if the non-NEON path is compiled from intrinsics it's hard to imagine that it'd end up being the same as the non-vectorized version, but for simple code like this it's possible. But that still wouldn't explain the ASM version performing the same. Benchmarking error seems like the most viable explanation...

    I think the best way to get your bearings straight on this is to start with the simplest possible control loops and ensure that you're getting the right timings for some integer code running for some number of cycles. Like, start with a loop with some nops, and grow it by a cycle or so at a time adding independent instructions. Then start adding NEON instructions and see what happens.
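
    As a rough sketch of that kind of control loop in C (timed with gettimeofday() as mentioned earlier in the thread; the iteration count and the NOP body are just placeholders you would grow one instruction at a time):

        #include <stdio.h>
        #include <sys/time.h>

        #define ITERATIONS 100000000UL

        int main(void)
        {
            struct timeval start, end;
            unsigned long i;

            gettimeofday(&start, NULL);
            for (i = 0; i < ITERATIONS; i++) {
                /* Start with a known-cost body, then add independent
                 * instructions one at a time and watch how the time moves. */
                __asm__ volatile("nop\n\tnop\n\tnop\n\tnop");
            }
            gettimeofday(&end, NULL);

            double us = (end.tv_sec - start.tv_sec) * 1e6 +
                        (end.tv_usec - start.tv_usec);
            printf("%.2f ns per iteration\n", us * 1000.0 / ITERATIONS);
            return 0;
        }

    Once the integer timings make sense, you can swap NEON instructions into the asm body and compare against the cycle charts.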
  • Note: This was originally posted on 8th August 2012 at http://forums.arm.com

    isogen's answers are right.. to elaborate a little bit more: if you have an instruction that outputs in N3 and the next one right after it needs its result in N2 then there'll be a cycle in between where the NEON unit is doing nothing. So the second one will start two cycles after the first one.

    You should try setting up a test loop that runs iterations of code like this many times, so you can time how long it takes and see for yourself. Then you can change instructions one at a time and see what happens.
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com

    Could you tell us precisely how large the image is (in pixels, an exact count) and how many times you're calling the function to get the numbers you're getting? Then we can put together some rough cycles/iteration counts and analyze the loop to see how the numbers compare with what we expect.

    It's actually interesting that the memory performance was holding you back more on the amlogic board than on the i.MX51. I was actually considering using the AML8726-M for a device over the i.MX535.. guess there would have been a good reason not to..
  • Note: This was originally posted on 27th July 2011 at http://forums.arm.com

    Okay, so with 16384 pixels and 8 pixels per iteration that's 2048 iterations per loop. That's a little low for trying to remove the function call overhead, but it should still be a small fraction of a percent so I'll just ignore it for now. The bigger error is going to be the round-off on the time measurement. 400 calls makes 819,200 iterations.

    Taking your quoted NEON-ASM-CODE times of 17 ms (i.MX51), 13 ms (S5PC110), and 20 ms (AML8726-M) and dividing by 819,200 iterations, that makes:
    20.75ns/loop on the i.MX51
    15.86ns/loop on the S5PC110
    24.41ns/loop on the AML8726-M

    In cycles (assuming roughly 800 MHz for the i.MX51 and AML8726-M, and 1 GHz for the S5PC110):

    16.6 cycles/loop on the i.MX51
    15.86 cycles/loop on the S5PC110
    19.52 cycles/loop on the AML8726-M

    On the A8 you'd expect:


            vld3.8      {d0-d2}, [r1]!   @ cycles 0-3, result in N2 of last cycle

            vmull.u8    q3, d0, d5    @ cycle 4 (can't dual issue due to previous result in N2)
            vmlal.u8    q3, d1, d4    @ cycle 5
            vmlal.u8    q3, d2, d3    @ cycle 6, result in N6

            vshrn.u16   d6, q3, #8    @ cycle 12 (value needed in N1, 5 cycle stall), result in N3
            vst1.8      {d6}, [r0]!      @ cycle 15 (value needed in N1, 2 cycle stall)

            subs        r2, r2, #1    @ overlaps w/NEON
            bne       .loop        @ overlaps w/NEON


    So 16 cycles like predicted. Note that you'd get a lot better performance if you unrolled this loop to fill up the latency after the last multiply and shift. Doing it 4 times should be sufficient.
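
    For illustration only, a 4x-unrolled version of the same RGB-to-gray kernel written with NEON intrinsics might look roughly like the sketch below (the 77/151/28 weights follow the hilbert-space article the thread is testing; with four independent 8-pixel chains per iteration, the scheduler has material to fill the multiply and shift latencies that the straight-line version stalls on). Hand-written assembly would interleave the chains explicitly instead of relying on the compiler:

        #include <arm_neon.h>
        #include <stddef.h>

        /* One 8-pixel chain: deinterleave RGB, weighted sum, narrow to 8 bits. */
        static inline void gray8(const uint8_t *rgb, uint8_t *gray,
                                 uint8x8_t wr, uint8x8_t wg, uint8x8_t wb)
        {
            uint8x8x3_t px = vld3_u8(rgb);
            uint16x8_t acc = vmull_u8(px.val[0], wr);
            acc = vmlal_u8(acc, px.val[1], wg);
            acc = vmlal_u8(acc, px.val[2], wb);
            vst1_u8(gray, vshrn_n_u16(acc, 8));
        }

        /* n is assumed to be a multiple of 32 to keep the sketch short. */
        void rgb_to_gray_neon(const uint8_t *rgb, uint8_t *gray, size_t n)
        {
            const uint8x8_t wr = vdup_n_u8(77);
            const uint8x8_t wg = vdup_n_u8(151);
            const uint8x8_t wb = vdup_n_u8(28);

            for (size_t i = 0; i < n; i += 32) {
                /* Four independent 8-pixel chains per iteration, so one
                 * chain's multiplies can overlap another chain's shift
                 * and store latency. */
                gray8(rgb + (i + 0)  * 3, gray + i + 0,  wr, wg, wb);
                gray8(rgb + (i + 8)  * 3, gray + i + 8,  wr, wg, wb);
                gray8(rgb + (i + 16) * 3, gray + i + 16, wr, wg, wb);
                gray8(rgb + (i + 24) * 3, gray + i + 24, wr, wg, wb);
            }
        }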

    Your total image size is exhausting the L1 data cache on all platforms, so at least some of the time the loads will come from L2. This is where you might be hit by latency on the Cortex-A9. It wouldn't seem like you're hitting the full latency, although it's possible you're only missing in L1 cache 33% of the time on the AML8726-M (32KB of L1 data cache), and the vld itself would be hiding some of the latency.

    It'd be interesting to try it again with a smaller image that fits entirely in L1 cache, and with far more calls to the function (to get into the thousands of ms instead of tens).

    From the very beginning I didn't think the AML8726-M was a good platform, because of its 128KB L2 and 65nm fab process, but its multimedia performance is pretty good: 1080p, Mali-400.
    What are the differences between the i.MX515 and i.MX535, the frequency?


    i.MX53 is a die shrink to 45nm with some new features. The CPU clock is increased to up to 1.2GHz, memory clock up to 400MHz (but the CPU bus clock only 200MHz), support for LPDDR2 and DDR3, GPU up to 200MHz and with its SRAM doubled, and has 1080p decoding. This document describes it: http://www.freescale...note/AN4271.pdf

    To me the Mali-400 in the AML8726-M doesn't seem like a strong competitor, since it's probably only single core. The SGX 540 in the S5PC110 can surely beat it.. if given a choice between the two I'd definitely go for the Samsung part.
  • Note: This was originally posted on 27th July 2011 at http://forums.arm.com

    Yeah, should have said unrolling and/or software pipelining. Although you still left one stall cycle there ;)
  • Note: This was originally posted on 25th July 2011 at http://forums.arm.com

    Yes the two implementations of NEON are different, so I'd expect different performance numbers between the two cores.

    Can you give us an example of an algorithm you are trying, and how you are building it? The fact that you see absolutely no performance difference is "suspicious" - I'd expect some difference, even if only small. Check you are not running the same binary 3 times - it seems like the obvious conclusion to three identical performance numbers =)
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com

    With an image that big there is a large chance you are spending all of your time waiting for data from main memory, because it is a lot bigger than your cache.

    Can you try with a smaller image (say half the size of your L2 cache), loop the benchmark inside the application multiple times, and average the result, so that the timing is done with a "warm cache"? That should at least rule out memory system effects and ensure you are timing the algorithm, not the memory system latency.

    If you need to handle large data, consider using "preload data" (PLD) instructions to pull the data into the cache a few hundred cycles ahead of when you need it. This ensures that the CPU doesn't stall waiting for data. Most compilers have an intrinsic for this when you are using C code.
  • Note: This was originally posted on 8th August 2012 at http://forums.arm.com

    R.E. (1) - it's a 5 cycle stall and 1 cycle to issue - 6 cycles in total.
    R.E. (2) - I'm not sure in this specific case, but this is a common usage so most MAC instructions tend to have a special forwarding path for the accumulator register, so there is no stall.
    R.E. (3) - Correct.
  • Note: This was originally posted on 23rd March 2013 at http://forums.arm.com

    Hi Shervin,

    I asked the question in this post because I got a performance difference on a Cortex-A8 CPU, but not on a Cortex-A9 CPU.
    So I was just eager to know why that is happening (maybe because of some NEON difference between Cortex-A8 and A9).

    Regards,
    Krishna
  • Note: This was originally posted on 23rd March 2013 at http://forums.arm.com

    Hi,

    I ran a NEON test on a Linux platform board. I am doing a 4*4 matrix multiplication using ARM and NEON instructions.
    (1) Matrix multiplication, calculating one element at a time. Here I have used only S registers (normal ARM/VFP instructions).
    I load the float array contents into S registers (32-bit) using "vldmia", and then use "vmul.f32" and "vmla.f32" to perform the matrix multiplication, with S registers as operands and to hold the result.

    (2) Matrix multiplication using Q and D registers (NEON instructions). Since the calculation is done 128 bits at a time, the number of instructions becomes 1/4 of (1).
    Here I load the complete float array contents into Q registers (128-bit) using "vldmia", and then use "vmul.f32" and "vmla.f32" to perform the matrix multiplication with Q (128-bit) and D (64-bit) registers, which obviously reduces the number of instructions (load, store, multiply) to 1/4 of the (1) code. A rough sketch of this version appears below.
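
    For reference, an intrinsics sketch of what the Q-register version (2) might look like. It assumes column-major 4*4 matrices and is only meant to make the description concrete, not to reproduce the exact test code:

        #include <arm_neon.h>

        /* 4x4 single-precision matrix multiply, column-major storage:
         * result = a * b. Each result column is a linear combination of the
         * columns of a, weighted by the elements of one column of b. */
        void mat4_mul_neon(const float *a, const float *b, float *result)
        {
            float32x4_t a0 = vld1q_f32(a + 0);
            float32x4_t a1 = vld1q_f32(a + 4);
            float32x4_t a2 = vld1q_f32(a + 8);
            float32x4_t a3 = vld1q_f32(a + 12);

            for (int col = 0; col < 4; col++) {
                float32x4_t bc = vld1q_f32(b + 4 * col);
                float32x4_t r;
                r = vmulq_lane_f32(a0, vget_low_f32(bc), 0);
                r = vmlaq_lane_f32(r, a1, vget_low_f32(bc), 1);
                r = vmlaq_lane_f32(r, a2, vget_high_f32(bc), 0);
                r = vmlaq_lane_f32(r, a3, vget_high_f32(bc), 1);
                vst1q_f32(result + 4 * col, r);
            }
        }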

    I am using Linux 3.0.35 and the test code is executed on a Linux platform (Cortex-A9 architecture).
    But there is no speed difference between (1) and (2).

    In my Linux kernel configuration the following options are enabled:
    CONFIG_VFP=y
    CONFIG_VFPv3=y
    CONFIG_NEON=y

    The following gcc command was used to build the NEON application; the gcc compiler version is 4.6.2:
    gcc  -march=armv7-a -mtune=cortex-a9 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=hard  -o test.out test.c

    Why didn't I find any performance difference between the normal ARM and NEON code?
    I have tested the same code on a Cortex-A8, and there I am able to see a performance difference.

    Thanks in advance