Differences between NEON in Cortex-A8 and A9

Note: This was originally posted on 25th July 2011 at http://forums.arm.com

I am currently working on a single-core Cortex-A9 chip (the AML8726-m, if you want to know more), and the datasheet says it includes NEON. But when I test the code from http://hilbert-space.de/?p=22, I see no speedup on it; sometimes the NEON-assembly-optimized code even runs slower than the plain ARM C code. The same code gets a good speedup on my i.MX515, which is a Cortex-A8 chip.


I am using the Android NDK to build a test app that runs on Android. Could that be the reason?
Can anyone tell me why this happens?


Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms

Android is a Linux-based OS, so I can call gettimeofday() to measure elapsed time with microsecond resolution. The results on the A9 are not identical, only nearly the same, and I am sure I did not run the same binary three times.
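For reference, the kind of gettimeofday() timing described above can be sketched as follows. This is only a minimal illustration, not the code used for the numbers in this post; the helper names elapsed_us and time_call_us are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Difference between two timestamps, in microseconds. */
long long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/* Hypothetical helper: time a single call of fn(arg) and
 * return the elapsed wall-clock time in microseconds. */
long long time_call_us(void (*fn)(void *), void *arg)
{
    struct timeval t0, t1;

    assert(fn != NULL);
    gettimeofday(&t0, NULL);
    fn(arg);
    gettimeofday(&t1, NULL);
    return elapsed_us(&t0, &t1);
}
```

gettimeofday() reports wall-clock time, so a single measurement also includes any time the process spent descheduled; repeating the measurement helps to spot such outliers.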

Thanks and looking forward to any useful suggestions.


  • Note: This was originally posted on 26th July 2011 at http://forums.arm.com


    With an image that big there is a large chance you are spending all of your time waiting for data from main memory, because it is a lot bigger than your cache.

    Can you try with a smaller image (say, half the size of your L2 cache) and loop the benchmark inside the application multiple times, averaging the result, so that the timing uses a "warm cache"? That should at least rule out memory system effects and ensure you are timing the algorithm, not the memory system latency.

    If you need to handle large data, consider using "preload data" (PLD) instructions to pull the data into the cache a few hundred cycles before you need it. This ensures that the CPU doesn't stall waiting for data. Most compilers have an intrinsic for this when you are using C code.


    Hmm. I can't believe that this is the problem.
    It does not explain why the times differ on the Cortex-A8...

    Unless this low-cost SoC has no cache.

    Maybe you're right!
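The warm-cache measurement and the preload hint suggested in the quoted reply can be sketched in C as follows. This is a rough illustration under several assumptions, not the benchmark from the linked article: rgb_to_gray is a stand-in scalar kernel, average_us and PREFETCH_AHEAD_PIXELS are hypothetical names, the prefetch distance is a guess that needs tuning per platform, and __builtin_prefetch is the GCC/Clang builtin (it is only a hint and may be ignored).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical prefetch distance, in pixels; tune per platform. */
#define PREFETCH_AHEAD_PIXELS 64

/* Stand-in kernel: packed RGB to 8-bit gray using integer
 * weights 77/151/28, which sum to 256. */
void rgb_to_gray(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        /* Every 16 pixels, hint that data further ahead will be read
         * soon. The bounds check keeps the address inside the buffer. */
        if ((i % 16) == 0 && i + PREFETCH_AHEAD_PIXELS < pixels)
            __builtin_prefetch(src + 3 * (i + PREFETCH_AHEAD_PIXELS));

        unsigned r = src[3 * i + 0];
        unsigned g = src[3 * i + 1];
        unsigned b = src[3 * i + 2];
        dst[i] = (uint8_t)((r * 77 + g * 151 + b * 28) >> 8);
    }
}

/* Average time per run in microseconds, measured after one untimed
 * warm-up run so that a small image is already in the cache. */
double average_us(const uint8_t *src, uint8_t *dst, size_t pixels, int runs)
{
    struct timeval t0, t1;

    assert(runs > 0);
    rgb_to_gray(src, dst, pixels);              /* warm-up, not timed */

    gettimeofday(&t0, NULL);
    for (int r = 0; r < runs; r++) {
        rgb_to_gray(src, dst, pixels);
        /* Compiler barrier (GCC/Clang) so repeated runs are not merged. */
        __asm__ volatile("" ::: "memory");
    }
    gettimeofday(&t1, NULL);

    double total = (double)(t1.tv_sec - t0.tv_sec) * 1e6
                 + (double)(t1.tv_usec - t0.tv_usec);
    return total / runs;
}
```

Running average_us once with an image that fits in the L2 cache and once with the full-size image should show whether the flat A9 timings come from the algorithm or from waiting on main memory.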