Differences between NEON in Cortex-A8 and A9

Note: This was originally posted on 25th July 2011 at http://forums.arm.com

I am currently working on a single-core Cortex-A9 chip (the AML8726-m, if you want to know more), and the datasheet says it includes NEON. But when I test the code here (http://hilbert-space.de/?p=22), I see no acceleration on it; sometimes the NEON assembly-optimized code even runs slower than the ARM C code. Meanwhile, the same code gets a pretty good speedup on my i.MX515, which is a Cortex-A8 chip.

I am using the Android NDK to build a test app running on Android. Could that be the reason?
Can anyone tell me why this happens?

Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms

Android is a Linux-based OS, so I can call gettimeofday() to measure elapsed time at microsecond resolution. The results on the A9 are not identical but are almost the same, and I am sure I didn't run the same binary three times.
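For readers who want to reproduce this kind of measurement, here is a minimal sketch of a gettimeofday()-based timer. The names elapsed_us and time_kernel_us are made up for illustration and are not taken from the test app described above; the warm-up call and the averaging over several repetitions are assumptions about good practice, not a description of how the numbers above were obtained.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Difference between two gettimeofday() samples, in microseconds.
 * tv_usec may be smaller in `end` than in `start`; the signed
 * arithmetic below handles that borrow correctly. */
int64_t elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000
         + (int64_t)(end->tv_usec - start->tv_usec);
}

/* Hypothetical harness: call `fn(arg)` once as a warm-up, then `reps`
 * more times, and return the mean wall-clock time per call in
 * microseconds. */
double time_kernel_us(void (*fn)(void *), void *arg, int reps)
{
    struct timeval t0, t1;

    fn(arg);                      /* warm-up: caches, page faults */
    gettimeofday(&t0, NULL);
    for (int i = 0; i < reps; ++i)
        fn(arg);
    gettimeofday(&t1, NULL);

    return (double)elapsed_us(&t0, &t1) / reps;
}
```

Timing each variant (C, NEON intrinsics, NEON assembly) through the same harness, with the same buffers, makes it easier to rule out the possibility that one code path is being measured several times.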

Thanks and looking forward to any useful suggestions.

Liad Weinberger over 12 years ago
Note: This was originally posted on 8th August 2012 at http://forums.arm.com

vld3.8 {d0-d2}, [r1]!      @ cycles 0-3, result in N2 of last cycle
vmull.u8 q3, d0, d5        @ cycle 4 (can't dual issue due to previous result in N2)
vmlal.u8 q3, d1, d4        @ cycle 5
vmlal.u8 q3, d2, d3        @ cycle 6, result in N6
vshrn.u16 d6, q3, #8       @ cycle 12 (value needed in N1, 5 cycle stall), result in N3
vst1.8 {d6}, [r0]!         @ cycle 15 (value needed in N1, 2 cycle stall)
subs r2, r2, #1            @ overlaps w/NEON
bne .loop                  @ overlaps w/NEON

So 16 cycles like predicted. Note that you'd get a lot better performance if you unrolled this loop to fill up the latency after the last multiply and shift. Doing it 4 times should be sufficient.
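For reference, here is a scalar C sketch of what one pass of the quoted loop appears to compute, read directly off the instructions: a three-way de-interleaving load, three 8-bit by 8-bit multiplies accumulated in 16 bits, a narrowing shift right by 8, and a store. The function name and the weight parameters w0, w1, w2 are assumptions for illustration (in the quoted code the weights sit in d5, d4 and d3); each NEON iteration handles 8 such elements at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar equivalent of the quoted NEON loop body (hypothetical name).
 * src holds n interleaved 3-byte elements; dst receives n bytes. */
void weighted_sum_scalar(uint8_t *dst, const uint8_t *src, size_t n,
                         uint8_t w0, uint8_t w1, uint8_t w2)
{
    for (size_t i = 0; i < n; ++i) {
        /* vld3.8 splits the three interleaved bytes into d0, d1, d2 */
        uint8_t c0 = src[3 * i + 0];
        uint8_t c1 = src[3 * i + 1];
        uint8_t c2 = src[3 * i + 2];

        /* vmull.u8 / vmlal.u8: 16-bit accumulator that wraps modulo 2^16 */
        uint16_t acc = (uint16_t)(c0 * w0);
        acc = (uint16_t)(acc + c1 * w1);
        acc = (uint16_t)(acc + c2 * w2);

        /* vshrn.u16 #8: keep the high byte of each 16-bit lane */
        dst[i] = (uint8_t)(acc >> 8);
    }
}
```

Comparing the NEON output against a scalar version like this is a cheap way to check that an unrolled or rescheduled loop still produces the same bytes.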

I'm coming to this a bit late, so sorry if it no longer interests you. However, I found some problems in this analysis, or maybe I'm missing something. Please correct me if I'm wrong:
For the vshrn.u16 instruction you said a 5-cycle stall, which I agree with, but you counted 6 cycles. The same extra cycle is counted for the vst1.8 instruction, which is supposed to stall for 2 cycles yet stalls for 3. If this is correct, then your analysis should have shown 14 cycles, not 16.
Now, don't the vmlal.u8 instructions require q3 as a source in N3, which would stall their execution by 3 cycles each?
This is just an observation about the reasoning, but the fact that the vmull.u8 instruction is at cycle 4 has nothing to do with waiting for the result of the load instruction. The load instruction simply takes 4 cycles to issue.
If I'm correct, then this could be scheduled in 18 cycles, not 16 (or 14).
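The 16-versus-14 disagreement comes down to whether a stall is counted on top of the instruction's own issue slot or from the previous instruction's issue cycle. The sketch below only reproduces that arithmetic from the cycle numbers in the quoted annotations; it is not a model of the Cortex-A8 NEON pipeline, it does not say which convention is right, and it does not cover the 18-cycle estimate. The function names are made up for illustration.

```c
/* Issue cycle of the next instruction, given the previous issue cycle
 * and a stall length.
 *   extra_slot = 1: stall is added on top of the instruction's own
 *                   issue slot (prev + 1 + stall), as the quoted
 *                   annotations count it.
 *   extra_slot = 0: stall is counted from the previous issue cycle
 *                   (prev + stall), as the reply counts it. */
int next_issue(int prev_issue, int stall, int extra_slot)
{
    return prev_issue + stall + (extra_slot ? 1 : 0);
}

/* Total cycles for the quoted loop, starting from the last vmlal.u8
 * at cycle 6, followed by vshrn.u16 (5-cycle stall) and vst1.8
 * (2-cycle stall). Cycles are numbered from 0, hence the final +1. */
int loop_cycles(int extra_slot)
{
    int vmlal = 6;
    int vshrn = next_issue(vmlal, 5, extra_slot);
    int vst1  = next_issue(vshrn, 2, extra_slot);
    return vst1 + 1;
}
```

With extra_slot set to 1 this gives vshrn.u16 at cycle 12 and vst1.8 at cycle 15, so 16 cycles; with extra_slot set to 0 it gives cycles 11 and 13, so 14 cycles.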