Differences between NEON in Cortex-A8 and A9

Note: This was originally posted on 25th July 2011 at http://forums.arm.com

I am currently working on a single-core Cortex-A9 chip (the AML8726-m, if you want to know more), and the datasheet says it includes NEON. But when I test the code from http://hilbert-space.de/?p=22, I see no acceleration at all; sometimes the NEON-assembly-optimized code even runs slower than the plain ARM C code. The same code gets a pretty good speedup on my i.MX515, which is a Cortex-A8 chip.


I am using the Android NDK to build a test app that runs on Android; could that be the reason?
Can anyone tell me why this happens?


Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms

Android is a Linux-based OS, so I can call gettimeofday() to measure intervals with microsecond resolution. The results on the A9 are not identical but almost the same, and I am sure I did not run the same binary three times.
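For reference, the measurement can be sketched roughly as follows. This is a minimal sketch, not the exact test harness: `elapsed_us`, `best_time_us` and `touch_counter` are hypothetical names introduced here for illustration, and `touch_counter` is only a stand-in for the real ARM C, NEON C, or NEON assembly routine.

```c
#include <stddef.h>
#include <sys/time.h>

/* Microseconds between two gettimeofday() samples. */
long long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/* Hypothetical helper: call fn(arg) `runs` times and return the
 * fastest single run in microseconds (-1 if runs <= 0). */
long long best_time_us(void (*fn)(void *), void *arg, int runs)
{
    long long best = -1;
    for (int i = 0; i < runs; ++i) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        fn(arg);
        gettimeofday(&t1, NULL);
        long long us = elapsed_us(&t0, &t1);
        if (best < 0 || us < best)
            best = us;
    }
    return best;
}

/* Hypothetical stand-in workload: bumps an int counter.
 * Replace with the routine being measured. */
void touch_counter(void *arg)
{
    ++*(int *)arg;
}
```

Taking the best of several runs is one way to reduce noise from the scheduler and cold caches; calling `best_time_us(touch_counter, &counter, 5)` runs the workload five times and returns the shortest time.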

Thanks and looking forward to any useful suggestions.


  • Note: This was originally posted on 8th August 2012 at http://forums.arm.com



            vld3.8      {d0-d2}, [r1]!   @ cycles 0-3, result in N2 of last cycle

            vmull.u8    q3, d0, d5       @ cycle 4 (can't dual issue due to previous result in N2)
            vmlal.u8    q3, d1, d4       @ cycle 5
            vmlal.u8    q3, d2, d3       @ cycle 6, result in N6

            vshrn.u16   d6, q3, #8       @ cycle 12 (value needed in N1, 5 cycle stall), result in N3
            vst1.8      {d6}, [r0]!      @ cycle 15 (value needed in N1, 2 cycle stall)

            subs        r2, r2, #1       @ overlaps w/NEON
            bne         .loop            @ overlaps w/NEON


    So 16 cycles, as predicted. Note that you'd get much better performance if you unrolled this loop to fill the latency after the last multiply and the shift. Unrolling it 4 times should be sufficient.
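When unrolling or rescheduling a loop like this, a scalar reference for the arithmetic is handy for checking that the output has not changed. The sketch below is an assumption-laden reading of the loop above: each pixel is three interleaved bytes, each byte is multiplied by an 8-bit weight, the products are summed in 16 bits, and the top byte is stored. The function name `weighted_sum_u8` and the weight parameters are hypothetical; which weight sits in d3, d4 or d5 is not stated in the listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one loop pass per pixel:
 *   vld3.8  -> read 3 interleaved bytes
 *   vmull/vmlal.u8 -> 16-bit sum of byte * weight products
 *   vshrn.u16 #8   -> keep the high byte
 * w0, w1, w2 are the weights applied to bytes 0, 1, 2 of each pixel
 * (assumed mapping, not taken from the listing). */
void weighted_sum_u8(uint8_t *dst, const uint8_t *src, size_t n_pixels,
                     uint8_t w0, uint8_t w1, uint8_t w2)
{
    for (size_t i = 0; i < n_pixels; ++i) {
        /* The cast models the 16-bit accumulator lanes, which wrap. */
        uint16_t acc = (uint16_t)(src[3 * i + 0] * w0
                                + src[3 * i + 1] * w1
                                + src[3 * i + 2] * w2);
        dst[i] = (uint8_t)(acc >> 8);
    }
}
```

Comparing the output of the unrolled NEON loop against this function on the same input buffer is a quick way to catch register mix-ups introduced by interleaving several iterations.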


    I'm coming to this a bit late, so sorry if it no longer interests you. However, I found some problems in this analysis, or maybe I'm missing something. Please correct me if I'm wrong:
    • For the vshrn.u16 instruction you said there is a 5-cycle stall, which I agree with, but you counted 6 cycles. The same extra cycle is counted for the vst1.8 instruction, which is supposed to stall for 2 cycles yet stalls for 3. If this is correct, then your analysis should have shown 14 cycles, not 16.
    • Don't the vmlal.u8 instructions require q3 as a source in N3, which would stall each of them by 3 cycles?
    • This is just an observation about the reasoning, but the vmull.u8 instruction being at cycle 4 has nothing to do with waiting for the result of the load instruction. The load instruction simply takes 4 cycles to issue.
    If I'm correct, then this could be scheduled in 18 cycles, not 16 (or 14).
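The disagreement over 5 versus 6 cycles comes down to how a stall is counted. The toy model below is only a sketch of one possible reading, not a description of the real pipeline: it assumes a dependent instruction issues one cycle after its producer plus one stall cycle for every stage by which the result (stage Nr) arrives later than it is needed (stage Ns). The function name `earliest_issue` is hypothetical, and the model ignores dual issue and any special forwarding paths.

```c
/* Toy dependency model (an assumption, not taken from ARM documentation):
 * a consumer normally issues at producer_issue + 1; if the producer's
 * result is ready in stage result_stage but the consumer needs it in the
 * earlier stage needed_stage, it stalls for the difference. */
int earliest_issue(int producer_issue, int result_stage, int needed_stage)
{
    int stall = result_stage - needed_stage;
    if (stall < 0)
        stall = 0;
    return producer_issue + 1 + stall;
}
```

Under this reading, `earliest_issue(6, 6, 1)` gives 12 for vshrn.u16 and `earliest_issue(12, 3, 1)` gives 15 for vst1.8, which matches the cycle numbers in the quoted listing: the "extra" cycle is the normal one-cycle step to the next instruction, with the stall added on top of it.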
  • Note: This was originally posted on 29th November 2012 at http://forums.arm.com

    Does anyone have a document about the Cortex-A9 pipeline?