
Differences between NEON in Cortex-A8 and A9

Note: This was originally posted on 25th July 2011 at http://forums.arm.com

Currently I am working on a single-core Cortex-A9 chip (AML8726-m, if you want to know more), and the datasheet says it has a NEON unit. But when I test the code here (http://hilbert-space.de/?p=22), I cannot see any acceleration on it; sometimes the NEON-assembly-optimized code even runs slower than the plain ARM C code. At the same time, the same code gets a pretty good speed-up on my i.MX515, which is a Cortex-A8 chip.
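For context, here is a generic sketch of what an intrinsics-based NEON kernel of this kind looks like (this is only an illustration, not the actual code from the link; add_u8_neon is a made-up name):

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative NEON intrinsics kernel: add two byte arrays 16 elements at a
   time. Not the code from the linked article, just an example of the kind of
   routine that gets compared against its plain C equivalent.
   Assumes n is a multiple of 16. */
void add_u8_neon(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vaddq_u8(va, vb));
    }
}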


I am using the Android NDK to build a test app running on Android; could that be the reason?
Can anyone tell me why this happens?


Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms

Android is a Linux-based OS, so I can call gettimeofday() to measure the elapsed time with microsecond precision. The results on the A9 are not identical but almost the same, and I am sure I did not just run the same binary three times.
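To make that concrete, here is a minimal sketch of the kind of timing harness I mean (test_kernel and ITERATIONS are placeholders, not my actual benchmark code):

#include <stdio.h>
#include <sys/time.h>

/* Placeholder for the routine under test (e.g. the plain C or NEON kernel). */
extern void test_kernel(void);

#define ITERATIONS 100

int main(void)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < ITERATIONS; i++)
        test_kernel();
    gettimeofday(&end, NULL);

    long usec = (end.tv_sec - start.tv_sec) * 1000000L
              + (end.tv_usec - start.tv_usec);
    printf("elapsed: %ld us (%.3f ms)\n", usec, usec / 1000.0);
    return 0;
}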

Thanks and looking forward to any useful suggestions.


  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com


    I haven't tested NEON on Cortex-A9 directly, but according to available information the following should be true:

    - On Cortex-A8 the NEON unit can dual-issue a load, store, or permute-type instruction with any other type of instruction. On Cortex-A9 the NEON unit is described as only accepting one dispatch per cycle, which probably precludes this sort of dual-issue.
    - On Cortex-A8 the NEON pipeline begins after the main pipeline has completely finished, whereas on Cortex-A9 it runs in parallel, with dispatch to it (presumably to a queue, as on the A8) occurring fairly early in the pipeline. However, in the A8 pipeline loads to NEON registers are queued and serviced well before the NEON pipeline itself begins. This allows latency to be hidden, not only the load-use penalty from L1 cache but even some or all of the latency from L2 cache. The queuing also allows for limited out-of-order loading (allowing hit under miss). So on A9, NEON loads will suffer from higher latency.
    - On the other hand, preloads on Cortex-A9 go to L1 cache instead of L2 cache, and there's now an automatic preload engine (at least as an option; I don't know if the amlogic SoC implements it). So there'll be a higher L1 hit-rate for streaming data. A rough sketch of an explicit preload follows this list.
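
    To make the preload point concrete, here is a rough sketch of how an explicit software prefetch is typically inserted into a streaming NEON loop (the 256-byte prefetch distance and the copy kernel itself are arbitrary illustrative choices, not something specific to either core):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Streaming 16-byte-at-a-time copy with an explicit software prefetch a
       fixed distance ahead. On GCC/Clang for ARM, __builtin_prefetch() is
       normally emitted as a PLD instruction. Assumes n is a multiple of 16. */
    void copy_neon_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __builtin_prefetch(src + i + 256, 0, 0); /* read, streaming data */
            uint8x16_t v = vld1q_u8(src + i);
            vst1q_u8(dst + i, v);
        }
    }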

    So you can see the interface between the NEON unit and the rest of the core changed, but as far as I'm aware the NEON unit itself didn't. So the dispatch and latencies of the instructions should be the same, and would appear to be from the cycle charts. Note that on A9 NEON instructions still execute in order.

    These differences could make a major difference in performance if you're loading from L2 cache or main memory and there's no automatic prefetch, or it somehow isn't kicking in. But I agree with everyone else that getting the exact same performance looks extremely suspicious. The amlogic SoC does have NEON (I've seen its datasheet), but it also only has 128KB of L2 cache. It's possible NEON is disabled, but the only way you'd get the same performance is if a non-NEON path was compiled and executed. And if the non-NEON path is compiled from intrinsics it's hard to imagine that it'd end up being the same as the non-vectorized version, although for simple code like this it's possible. But that still wouldn't explain the ASM version performing the same. Benchmarking error seems like the most viable explanation...

    I think the best way to get your bearings straight on this is to start with the simplest possible control loops and ensure that you're getting the right timings for some integer code running for some number of cycles. Like, start with a loop with some nops, and grow it by a cycle or so at a time adding independent instructions. Then start adding NEON instructions and see what happens.
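
    As a rough illustration of that calibration idea (GCC-style inline assembly; the iteration count and the number of nops are arbitrary):

    #include <stdio.h>
    #include <sys/time.h>

    /* Calibration loop: a known number of iterations of a body made of nops.
       If the measured time doesn't scale as expected when nops are added,
       the timing method itself is suspect. */
    static void calibration_loop(unsigned long iterations)
    {
        unsigned long i = iterations;
        __asm__ volatile(
            "1:                 \n\t"
            "nop                \n\t"
            "nop                \n\t"
            "nop                \n\t"
            "nop                \n\t"
            "subs %0, %0, #1    \n\t"
            "bne  1b            \n\t"
            : "+r"(i) : : "cc");
    }

    int main(void)
    {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        calibration_loop(100000000UL);
        gettimeofday(&t1, NULL);
        long us = (t1.tv_sec - t0.tv_sec) * 1000000L
                + (t1.tv_usec - t0.tv_usec);
        printf("100M iterations: %ld us\n", us);
        return 0;
    }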



    Thanks for your huge reply.
    I think I need to learn more about the ARM architecture to go as deep as you do.
    I don't quite understand your "benchmarking error" point; do you mean my time measurement is wrong?
    And about your suggested testing approach: do you mean I should first get the timing of some ARM-side code whose duration I already know, and then add NEON code and see the result?