
Differences between NEON in Cortex-A8 and A9

Note: This was originally posted on 25th July 2011 at http://forums.arm.com

Currently I am working on a single-core Cortex-A9 chip (the AML8726-M, if you want to know more), and the datasheet says it includes NEON. But when I test the code from http://hilbert-space.de/?p=22, I see no speed-up at all; sometimes the NEON assembly version even runs slower than the plain ARM C code. At the same time, the same code gets a pretty good speed-up on my i.MX515, which is a Cortex-A8 chip.


I am using the Android NDK to build a test app running on Android. Could that be the reason?
Can anyone tell me why this happens?


Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms

Android is a Linux-based OS, so I can call gettimeofday() to measure elapsed time at microsecond resolution. The results on the A9 are not identical, only almost the same, and I am sure I didn't run the same binary 3 times.
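
For readers who want to reproduce this kind of measurement, here is a minimal sketch of microsecond timing with gettimeofday(). The helper names (`now_us`, `work`, `mean_call_us`) are hypothetical and not taken from the original test app; `work` is only a stand-in for the routine being measured.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/time.h>

/* Current wall-clock time in microseconds, read via gettimeofday(). */
static int64_t now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Hypothetical stand-in for the routine under test: sums 0..n-1.
 * The volatile accumulator keeps the compiler from deleting the loop. */
static uint32_t work(uint32_t n)
{
    volatile uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += i;
    return acc;
}

/* Time `reps` back-to-back calls and return the mean time per call (us). */
static double mean_call_us(uint32_t n, int reps)
{
    assert(reps > 0);
    int64_t start = now_us();
    for (int r = 0; r < reps; r++)
        (void)work(n);
    return (double)(now_us() - start) / reps;
}
```

Calling something like `mean_call_us(1000000, 20)` once for each variant (C, NEON intrinsics, NEON assembly) gives numbers that can be compared directly. Averaging over several repetitions matters because a single gettimeofday() interval also picks up scheduler noise.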

Thanks and looking forward to any useful suggestions.


  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com


    Yes the two implementations of NEON are different, so I'd expect different performance numbers between the two cores.

    Can you give us an example of an algorithm you are trying, and how you are building it? The fact you see absolutely no performance difference is "suspicious" - I'd expect some difference, even if only small. Check you are not running the same binary 3 times - it seems like the obvious conclusion to three identical performance numbers =)


    I am completely sure that I didn't run the same binary 3 times, because I built one apk file and installed that same apk on both the Cortex-A8 and A9 platforms. And the times were not exactly identical: they were about 106-107ms, with no significant difference.
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com


    That's strange...

    It's possible for your tests to take the same time if you wrote good code that checks whether NEON is available...
    In that case, maybe you don't have NEON on your Cortex-A9 (Tegra 2, for example).
    I can't find any information about your processor.

    http://www.amlogic.com/product01.htm

    They don't mention NEON... so!

    In that case all your functions call the basic ARM code!
    That could explain the identical timings!

    Etienne



    I got the manual for this chip, and it does have NEON.
    Even if it didn't have NEON, how could it run the NEON assembly code on the ARM core? Would the CPU translate it automatically?
    I mean, if my A9 doesn't have NEON, I think the app should crash and exit and I would not get any results from it, right?
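
The runtime NEON check being discussed can be sketched as follows. This is only an illustration, assuming a 32-bit ARM Linux kernel that lists `neon` on the `Features` line of /proc/cpuinfo; the function names are hypothetical and do not come from the poster's app.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if a "Features" line lists "neon" as a whole word. */
static bool features_line_has_neon(const char *line)
{
    const char *p = line;
    while ((p = strstr(p, "neon")) != NULL) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char end = p[4];
        bool ends = (end == '\0' || end == ' ' || end == '\t' || end == '\n');
        if (starts && ends)
            return true;
        p += 4;  /* skip this match and keep searching */
    }
    return false;
}

/* Scan /proc/cpuinfo for a Features line that mentions NEON.
 * Returns false if the file cannot be opened or no such line exists. */
static bool cpu_has_neon(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return false;

    char line[1024];
    bool found = false;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "Features", 8) == 0 && features_line_has_neon(line)) {
            found = true;
            break;
        }
    }
    fclose(f);
    return found;
}
```

An app would call `cpu_has_neon()` once at start-up and pick the NEON or the plain C routine accordingly. Without such a check, executing a NEON instruction on a core that lacks the unit raises an undefined-instruction fault, which Linux delivers as SIGILL. On Android, the NDK's cpufeatures library (`android_getCpuFeatures()` with `ANDROID_CPU_ARM_FEATURE_NEON`) is the more usual way to do the same test.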


  • Note: This was originally posted on 30th July 2011 at http://forums.arm.com


    or by trying this code
    http://pulsar.websha...lng=fr&sample=3


    I had tried the code you gave me before I posted this question.
    Last time, with the huge image, the second assembly version ran even slower than the first one. I think it must be the memory latency problem again.
    This time, with a small image at 128*128 resolution, the time is shortened from 16.7ms to 11.3ms on my i.MX51.
    But on my A9 the improvement is tiny, just 1ms, from 20ms to 19ms.
    So I'm confused again.
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com


    Sure.
    If you haven't added a specific check, your app can't fall back to default code when NEON is not there.


    What is the size of your pixel array ?
    107 ms is very slow in fact !!!


    I searched Google for an image that I thought was big enough: 1572*2362, a bitmap larger than 10MB!!!
    And I ported the C code to my PC, which has an Intel E6700 CPU at 3.2GHz; there the time is just 8ms!!!

    Now I'm looking for another Cortex-A9 based device to run this test on, such as the Samsung Galaxy S II.
    Have you had any experience with a similar problem?
    Last week I thought my build options were wrong, but now I think the assembly code should have nothing to do with the compiler; it must run as written.
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com


    I haven't tested NEON on Cortex-A9 directly, but according to available information the following should be true:

    - On Cortex-A8 a NEON instruction can dual issue a load, store, or permute type instruction with any other type of instruction. On Cortex-A9 the NEON unit is described as only accepting one dispatch per cycle, so this probably precludes this sort of dual-issue.
    - On Cortex-A8 the NEON pipeline begins after the main pipeline is completely done, while on Cortex-A9 it runs in parallel, with dispatch to it (presumably to a queue like in A8) occurring fairly early in the pipeline. However, in the A8 pipeline loads to NEON registers are queued and serviced well before the NEON pipeline itself begins. This allows for hiding latency, not only from L1 cache (load-use penalty) but even some or all from L2 cache. The queuing also allows for limited out-of-order loading (allowing hit under miss). So on A9 NEON loads will suffer from higher latency.
    - On the other hand, preloads on Cortex-A9 go to L1 cache instead of L2 cache, and there's now an automatic preload engine (at least as an option, don't know if the amlogic SoC implements it). So there'll be a higher L1 hit-rate for streaming data.

    So you can see the interface between the NEON unit and the rest of the core changed, but as far as I'm aware the NEON unit itself didn't. So the dispatch and latencies of the instructions should be the same, and would appear to be from the cycle charts. Note that on A9 NEON instructions still execute in order.

    These differences could have a major change in performance if you're loading from L2 cache or main memory, if there's no automatic prefetch or somehow it isn't kicking in. But I agree with everyone else that getting the exact same performance looks extremely suspicious. The amlogic SoC does have NEON (I've seen its datasheet), it also only has 128KB of L2 cache. It's possible NEON is disabled, but the only way you'd get the same performance is if a non-NEON path was compiled and executed. And if the non-NEON path is compiled from intrinsics it's hard to imagine that it'd end up being the same as the non-vectorized version, but for simple code like this it's possible. But that still wouldn't explain the ASM version performing the same. Benchmarking error seems like the most viable explanation...

    I think the best way to get your bearings straight on this is to start with the simplest possible control loops and ensure that you're getting the right timings for some integer code running for some number of cycles. Like, start with a loop with some nops, and grow it by a cycle or so at a time adding independent instructions. Then start adding NEON instructions and see what happens.



    Thanks for your detailed reply.
    I think I need to learn more about the ARM architecture to go as deep as you.
    I don't understand what you mean by "benchmarking error"; do you mean my time measurement is wrong?
    And about your suggested test, do you mean I should first measure the ARM side by writing some code whose duration I already know, then add NEON code and see the result?
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com


    With an image that big there is a large chance you are spending all of your time waiting for data from main memory, because it is a lot bigger than your cache.

    Can you try with a smaller image (say half the size of your L2 cache) and loop the benchmark inside the application multiple times and average the result, so that the timing is using a "warm cache". That should at least rule out memory system effects and ensure you are timing the algorithm, not the memory system latency.

    If you need to handle large data, consider using "preload data (PLD)" instructions to pull the data into the cache a few hundred cycles ahead of when you need it. This ensures that the CPU doesn't stall waiting for data. Most compilers have an intrinsic for this when you are using C code.
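
As a rough sketch of both suggestions (warm-cache averaging and preloading), the C below uses the GCC/Clang intrinsic `__builtin_prefetch`, which compilers for ARM can turn into a PLD instruction. The grayscale weights 77/151/28, the prefetch distance and the function names are assumptions made for illustration; they are not values confirmed anywhere in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Assumed prefetch distance in bytes; tune it per device. */
#define PREFETCH_AHEAD 256

/* Convert packed RGB to 8-bit grey using assumed weights 77/151/28 (sum 256). */
static void rgb_to_gray(uint8_t *dest, const uint8_t *src, size_t pixels)
{
    const size_t bytes = 3 * pixels;
    for (size_t i = 0; i < pixels; i++) {
        /* Every 16 pixels, hint the cache about data we will need soon.
         * The bounds check keeps the address inside the source buffer. */
        if ((i & 15) == 0 && 3 * i + PREFETCH_AHEAD < bytes)
            __builtin_prefetch(src + 3 * i + PREFETCH_AHEAD);

        unsigned r = src[3 * i];
        unsigned g = src[3 * i + 1];
        unsigned b = src[3 * i + 2];
        dest[i] = (uint8_t)((r * 77 + g * 151 + b * 28) >> 8);
    }
}

/* Mean time per call in microseconds, measured with a warm cache:
 * one untimed pass first, then `reps` timed passes over the same buffers. */
static double warm_mean_us(uint8_t *dest, const uint8_t *src,
                           size_t pixels, int reps)
{
    struct timeval t0, t1;
    assert(reps > 0);

    rgb_to_gray(dest, src, pixels);  /* warm-up pass, not timed */

    gettimeofday(&t0, NULL);
    for (int r = 0; r < reps; r++)
        rgb_to_gray(dest, src, pixels);
    gettimeofday(&t1, NULL);

    double us = (double)(t1.tv_sec - t0.tv_sec) * 1e6
              + (double)(t1.tv_usec - t0.tv_usec);
    return us / reps;
}
```

With a buffer that fits in L2 (for example 128*128 pixels), `warm_mean_us` mostly measures the algorithm. With a 10MB image the same call mostly measures main-memory bandwidth, and that is where the prefetch hint has a chance of helping.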


    Thank you so much!!!
    You are right: when I changed the image size from 10MB to 50KB, I got the time I wanted, about 5-6 times faster.
    I didn't know memory access was so time-consuming.
    I can move forward now, thanks again.
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com


    Hmm. I can't believe that this is the problem.
    It doesn't explain why the times differ on Cortex-A8...

    Unless, in this low-cost SoC, there is no cache.

    Maybe you're right!


    I've fixed it now; memory latency caused this problem.
    And I think I was a little bit "lucky" to get three almost identical times on my A9 platform.
    So my A9 has poor memory performance, which I'll have to watch out for in future.
    Thanks for your help!!!
  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com


    Ok.

    And finally:
    Is the Cortex-A9 faster than the Cortex-A8?

    Can you give your results (C / asm / NEON) for both processors with the small picture?
    Can you give the frequency of your processors too?

    Thanks


    I have two A8s and one A9:
    A8: i.MX515(800MHz, 256KB L2) S5PC110(1GHz, 256KB L2)
    A9: AML8726-M(800MHz, 128KB L2)

    Here are the results on the three platforms:
                     i.MX515    S5PC110    AML8726-M
    ARM-C-CODE       135ms      108ms      117ms
    NEON-C-CODE      76ms       60ms       48ms
    NEON-ASM-CODE    17ms       13ms       20ms

    So, A9 is not the fastest.
  • Note: This was originally posted on 27th July 2011 at http://forums.arm.com


    Could you tell us precisely how large the image is (in pixels, an exact count) and how many times you're calling the function to get the numbers you're getting? Then we can put together some rough cycles/iteration counts and analyze the loop to see how the numbers compare with what we expect.

    It's actually interesting that the memory performance was holding you back more on the amlogic board than the i.MX51. I was actually considering using the AML8726-M for a device over the i.MX535... guess there would have been a good reason not to.


    The resolution is 128*128, and I repeated it 400 times. The frequency is 800MHz, so it's about 20ms * 800MHz / 400 runs / (128*128 pixels) = 2.44 cycles/pixel? I don't actually know how to calculate it.
    From the very beginning I didn't think the AML8726-M was a good platform, given its 128KB L2 and 65nm fab process, but its multimedia performance is pretty good: 1080P, Mali 400.
    What are the differences between the i.MX515 and the i.MX535? Frequency?
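
The cycles-per-pixel estimate above can be checked with a few lines of C. This is just the arithmetic from the post, under the assumption that the 20ms figure is the total for all 400 runs; the function name is made up for this example.

```c
#include <assert.h>

/* Estimated CPU cycles spent per pixel.
 *   total_ms  : measured wall time for all repetitions, in milliseconds
 *   clock_mhz : core clock in MHz
 *   reps      : number of times the routine was run
 *   width, height : image size in pixels */
static double cycles_per_pixel(double total_ms, double clock_mhz,
                               int reps, int width, int height)
{
    assert(reps > 0 && width > 0 && height > 0);
    double total_cycles = total_ms * 1e-3 * clock_mhz * 1e6;
    return total_cycles / reps / ((double)width * height);
}
```

For the figures quoted, `cycles_per_pixel(20.0, 800.0, 400, 128, 128)` is 16,000,000 cycles over 400 runs of 16,384 pixels each, or about 2.44 cycles per pixel.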
  • Note: This was originally posted on 27th July 2011 at http://forums.arm.com


    So 16 cycles like predicted. Note that you'd get a lot better performance if you unrolled this loop to fill up the latency after the last multiply and shift. Doing it 4 times should be sufficient.



    or by trying this code
    http://pulsar.webshaker.net/ccc/result.php?lng=fr&sample=3