Differences between NEON in Cortex-A8 and A9

Note: This was originally posted on 25th July 2011 at http://forums.arm.com

Currently I am working on a single-core Cortex-A9 chip (the AML8726-M, if you want to know more), and the datasheet says it includes NEON. But when I test the code from here (http://hilbert-space.de/?p=22), I see no acceleration at all; sometimes the NEON assembly-optimized code even runs slower than the ARM C code. The same code gets a pretty good speedup on my i.MX515, which is a Cortex-A8 chip.


I am using the Android NDK to build a test app that runs on Android. Could that be the reason?
Can anyone tell me why this happens?


Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms

Android is a Linux-based OS, so I can call gettimeofday() to measure time periods with microsecond resolution. The results on the A9 are not identical, but they are almost the same, and I'm sure I didn't just run the same binary three times.
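
A minimal sketch of this kind of gettimeofday() measurement, for anyone who wants to reproduce it. The names `elapsed_us`, `time_calls`, and `kernel_fn` are hypothetical and are not taken from the test app; the idea is simply to time many back-to-back calls so that the microsecond resolution of the clock stays small relative to the total.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Difference between two gettimeofday() samples, in microseconds. */
int64_t elapsed_us(struct timeval start, struct timeval end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000
         + (int64_t)(end.tv_usec - start.tv_usec);
}

/* Hypothetical signature for the routine under test. */
typedef void (*kernel_fn)(void *ctx);

/* Time `calls` back-to-back invocations of `fn`; returns total microseconds. */
int64_t time_calls(kernel_fn fn, void *ctx, int calls)
{
    struct timeval t0, t1;

    assert(fn != NULL && calls > 0);
    gettimeofday(&t0, NULL);
    for (int i = 0; i < calls; ++i)
        fn(ctx);
    gettimeofday(&t1, NULL);
    return elapsed_us(t0, t1);
}
```

Dividing the returned total by the number of calls gives an average per call; gettimeofday() reports wall-clock time, so other activity on the device can inflate it.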

Thanks and looking forward to any useful suggestions.


  • Note: This was originally posted on 27th July 2011 at http://forums.arm.com

    Okay, so with 16384 pixels and 8 pixels per iteration that's 2048 iterations per call. That's a little low for amortizing the function overhead, but the overhead should still be a small fraction of a percent, so I'll just ignore it for now. The bigger error is going to be the roundoff on the time measurement. 400 calls makes 819200 iterations. The NEON-ASM-CODE timings were 17ms, 13ms, and 20ms (i.MX51, S5PC110, and AML8726-M respectively).

    That makes:
    20.75ns/loop on the i.MX51
    15.86ns/loop on the S5PC110
    24.41ns/loop on the AML8726-M

    In cycles:

    16.6 cycles/loop on the i.MX51
    15.86 cycles/loop on the S5PC110
    19.52 cycles/loop on the AML8726-M
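
The arithmetic behind the two lists above can be sketched in C as follows. The function names are hypothetical. The clock speeds are not stated in this thread; 800MHz for the i.MX51 and AML8726-M and 1GHz for the S5PC110 are assumptions inferred from the ratio between the ns and cycle figures.

```c
#include <assert.h>

/* Average time per loop iteration in nanoseconds, from a total in ms. */
double ns_per_iteration(double total_ms, long iterations)
{
    assert(iterations > 0);
    return total_ms * 1.0e6 / (double)iterations;
}

/* Convert a duration in ns to core cycles at the given clock in MHz.
 * The clock value is an assumption supplied by the caller. */
double cycles_per_iteration(double ns, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return ns * clock_mhz / 1000.0;
}
```

For example, `ns_per_iteration(17.0, 2048L * 400L)` gives roughly 20.75, and feeding that into `cycles_per_iteration` with an assumed 800MHz clock gives roughly 16.6. With only two significant digits in the millisecond totals, the last digit of each result is not meaningful.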

    On the A8 you'd expect:


            vld3.8      {d0-d2}, [r1]!   @ cycles 0-3, result in N2 of last cycle

            vmull.u8    q3, d0, d5    @ cycle 4 (can't dual issue due to previous result in N2)
            vmlal.u8    q3, d1, d4    @ cycle 5
            vmlal.u8    q3, d2, d3    @ cycle 6, result in N6

            vshrn.u16   d6, q3, #8    @ cycle 12 (value needed in N1, 5 cycle stall), result in N3
            vst1.8      {d6}, [r0]!      @ cycle 15 (value needed in N1, 2 cycle stall)

            subs        r2, r2, #1    @ overlaps w/NEON
            bne       .loop        @ overlaps w/NEON


    So 16 cycles, as predicted. Note that you'd get a lot better performance if you unrolled this loop to fill up the latency after the last multiply and the shift. Unrolling it 4 times should be sufficient.
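
For checking the output of any unrolled or rescheduled version, a scalar C model of what one pass of the NEON loop computes may be useful. This is a sketch read off the listing above, not code from the thread: the function name is hypothetical, and the weights are passed in as parameters because the values loaded into d3-d5 are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the loop: three interleaved 8-bit channels in,
 * one 8-bit value out per pixel.
 * w0 pairs with channel 0 (d0 * d5), w1 with channel 1 (d1 * d4),
 * w2 with channel 2 (d2 * d3). */
void weighted_sum_scalar(uint8_t *dst, const uint8_t *src, size_t pixels,
                         uint8_t w0, uint8_t w1, uint8_t w2)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < pixels; ++i) {
        /* vld3.8 de-interleaves three bytes per pixel. */
        uint8_t c0 = src[3 * i + 0];
        uint8_t c1 = src[3 * i + 1];
        uint8_t c2 = src[3 * i + 2];
        /* vmull.u8 / vmlal.u8 accumulate in 16-bit lanes, which wrap. */
        uint16_t acc = (uint16_t)(c0 * w0 + c1 * w1 + c2 * w2);
        /* vshrn.u16 #8 keeps the high byte of each 16-bit lane. */
        dst[i] = (uint8_t)(acc >> 8);
    }
}
```

Comparing this against the NEON output byte for byte is a cheap way to confirm that a reordered loop still produces the same result.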

    Your total image size is exhausting L1 data-cache on all platforms, so at least some of the time the loads will come from L2. This is where you might be hit by latency on the Cortex-A9. It wouldn't seem like you're hitting the full latency, although it's possible you're only missing in L1 cache 33% of the time on the AML8726-M (32KB of L1 data cache), and the vld itself would be hiding some of the latency.

    It'd be interesting to try it again with a smaller image that fits entirely in L1 cache, and with far more calls to the function (to get in the thousands of ms instead of tens)
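
A rough way to size such a test, sketched in C. It assumes the only data touched is the 3-byte-per-pixel input (vld3.8) and the 1-byte-per-pixel output (vst1.8); the function names are hypothetical, and the result is an upper bound because the stack and anything else the process touches also occupy cache lines.

```c
#include <assert.h>
#include <stddef.h>

enum { BYTES_IN_PER_PIXEL = 3, BYTES_OUT_PER_PIXEL = 1 };

/* Input plus output bytes touched by one pass over the image. */
size_t working_set_bytes(size_t pixels)
{
    return pixels * (BYTES_IN_PER_PIXEL + BYTES_OUT_PER_PIXEL);
}

/* Largest pixel count whose working set fits in `cache_bytes`,
 * rounded down to a multiple of 8 (the loop handles 8 pixels per iteration). */
size_t max_pixels_for_cache(size_t cache_bytes)
{
    size_t pixels;

    assert(cache_bytes > 0);
    pixels = cache_bytes / (BYTES_IN_PER_PIXEL + BYTES_OUT_PER_PIXEL);
    return pixels - (pixels % 8);
}
```

Under these assumptions the 16384-pixel image touches 65536 bytes, twice the 32KB L1 data cache mentioned above, so an image comfortably below 8192 pixels would be the one to try.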

    From the very beginning, I didn't think the AML8726-M was a good platform because of its 128KB L2 and 65nm fab process, but its multimedia performance is pretty good: 1080P, Mali-400.
    What are the differences between the i.MX515 and i.MX535? Frequency?


    i.MX53 is a die shrink to 45nm with some new features. The CPU clock is increased to up to 1.2GHz, memory clock up to 400MHz (but the CPU bus clock only 200MHz), support for LPDDR2 and DDR3, GPU up to 200MHz and with its SRAM doubled, and has 1080p decoding. This document describes it: http://www.freescale...note/AN4271.pdf

    To me the Mali-400 in the AML8726-M doesn't seem like a strong competitor since it's probably only single core. The SGX 540 in the S5PC110 can surely beat it. If given a choice between the two, I'd definitely go for the Samsung part.