
Cortex-A8 NEON strange performance issue

Note: This was originally posted on 10th July 2013 at http://forums.arm.com

Hi all,

As an experienced assembly programmer, I'm currently working on NEON tutorials.

With the first example done, it beats the original C function by a huge margin:


    pld  [r1, #64*0]
    pld  [r1, #64*1]
    pld  [r1, #64*2]
    pld  [r1, #64*3]

    ldr  r12, [sp]
    vdup.16 d0, r2 //coeff
    vdup.16 d1, r3 //intercept

1:
        pld  [r1, #64*4]
        vld1.16  {d28-d31}, [r1,:256]!

        vmull.s16   q12, d28, d0[0]
        vmull.s16   q13, d29, d0[0]
        vmull.s16   q14, d30, d0[0]
        vmull.s16   q15, d31, d0[0]

        vaddw.s16   q12, q12, d1
        vaddw.s16   q13, q13, d1
        vaddw.s16   q14, q14, d1
        vaddw.s16   q15, q15, d1

        vqrshrun.s32    d24, q12, #8
        vqrshrun.s32    d25, q13, #8
        vqrshrun.s32    d26, q14, #8
        vqrshrun.s32    d27, q15, #8

        vst1.16  {d24-d27}, [r0,:256]!

        subs    r12, r12, #16
    bgt  1b
    bx      lr


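For readers following along, the per-element operation the loop performs can be modeled in scalar form. Below is a minimal Python sketch (the function names are mine, not part of the original code) of what the vmull.s16 / vaddw.s16 / vqrshrun.s32 #8 sequence computes for one input sample:

```python
def vqrshrun_s32_to_u16(x, shift):
    """Model of VQRSHRUN.S32: rounding right shift by `shift`,
    then saturate the signed 32-bit value to the unsigned 16-bit range."""
    y = (x + (1 << (shift - 1))) >> shift   # rounding shift (adds half an ulp first)
    return max(0, min(y, 0xFFFF))           # unsigned saturation

def scale_sample(x, coeff, intercept):
    """One element of the loop body: widening multiply (vmull.s16),
    widening add of the intercept (vaddw.s16), then narrow with
    rounding and saturation (vqrshrun.s32 #8)."""
    acc = x * coeff + intercept             # kept in 32-bit precision on NEON
    return vqrshrun_s32_to_u16(acc, 8)
```

So with coeff = 256 and intercept = 0 the transform is an identity, negative intermediate results clamp to 0, and overflow clamps to 65535.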
However, there are some pipeline stalls here and there, and the dual-issue capability isn't fully exploited, so I decided to optimize it:



    pld  [r1, #64*0]
    pld  [r1, #64*1]
    pld  [r1, #64*2]
    pld  [r1, #64*3]
    pld  [r1, #64*4]
    ldr  r12, [sp]
    vdup.16 d0, r2 //coeff
    vdup.16 d1, r3 //intercept
    vld1.16  {d28-d31}, [r1,:256]!
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vmull.s16   q15, d31, d0[0]
    1:
        vaddw.s16   q12, q12, d1
        vld1.16  {d20-d23}, [r1,:256]!
        vaddw.s16   q13, q13, d1
        vaddw.s16   q14, q14, d1
        vaddw.s16   q15, q15, d1
        vqrshrun.s32    d24, q12, #8
        vqrshrun.s32    d25, q13, #8
        vqrshrun.s32    d26, q14, #8
        vqrshrun.s32    d27, q15, #8
        vmull.s16   q8, d20, d0[0]
        vmull.s16   q9, d21, d0[0]
        vmull.s16   q10, d22, d0[0]
        vmull.s16   q11, d23, d0[0]
        vst1.16  {d24-d27}, [r0,:256]!
        vaddw.s16   q8, q8, d1
        vaddw.s16   q9, q9, d1
        vaddw.s16   q10, q10, d1
        vaddw.s16   q11, q11, d1
        subs    r12, r12, #32
        vqrshrun.s32    d16, q8, #8
        vqrshrun.s32    d17, q9, #8
        vqrshrun.s32    d18, q10, #8
        ble  2f
        pld  [r1, #64*4]
        vld1.16  {d28-d31}, [r1,:256]!
        vqrshrun.s32    d19, q11, #8
        vmull.s16   q12, d28, d0[0]
        vmull.s16   q13, d29, d0[0]
        vmull.s16   q14, d30, d0[0]
        vst1.16  {d16-d19}, [r0,:256]!
        vmull.s16   q15, d31, d0[0]
    b 1b
    2:
    vqrshrun.s32    d19, q11, #8
    vst1.16  {d16-d19}, [r0,:256]!
    bx      lr


There are no pipeline stalls at all in this optimized version, and every memory-access instruction is surrounded by data-processing ones, so each gets dual-issued twice, once at the top and once at the bottom. Looks good~

Then came the bitter surprise while benchmarking on my 4th-gen iPod touch and iPhone 4 (iOS 6.1.3, if it matters): the "optimized" version is about 5% slower than the initial one.


(r12 is set to the pixel count of a full-HD frame (1920*1080), and both functions are called several thousand times in a row.)

I've tried several variations, repositioning subs and pld, but nothing helps: the optimized version is ALWAYS slower.

Then I removed the pld instructions completely. Both functions become much slower, as expected, but voila: the optimized version is now about 5% faster.


It seems I did the right thing in removing the pipeline stalls and unlocking the dual-issue capability, but I must have done something the L2 cache doesn't like in the optimized version.

Can someone give me some insight into this? I'm curious, desperate and frustrated, because without knowing the reason for this strange behavior, all the effort I put into mastering NEON would be in vain.


Thanks in advance.
  • Note: This was originally posted on 18th July 2013 at http://forums.arm.com


    I've found really weird, inexplicable things with NEON too :/ One thing I've suspected for a while, though with little hard data, is that there may be penalties for running too many NEON instructions before hitting a non-NEON instruction. At the very least we know that the 16-entry queue will fill up and you'll no longer be able to decode non-NEON instructions ahead of NEON execution. You can also try aligning the loop entry point to 8 bytes to maximize fetch throughput.

    One kind of obvious question: are your iteration counts actually multiples of 32? If you're rounding them up to 32, you'll be doing half a loop of extra work. I don't think you need to unroll like this to get enough distance to avoid stalls; you can probably do it just by staggering/software-pipelining the loop.
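    The staggering/software-pipelining idea can be sketched structurally. In this toy Python model (process, chunk, and data are stand-ins of my own, not anything from the original code), the load of the next chunk is issued before the current chunk is processed, mirroring how the asm overlaps vld1 with the vmull latency:

```python
def pipelined(process, data, chunk=16):
    """Software-pipelined loop sketch: fetch chunk i+1 before
    finishing the arithmetic for chunk i."""
    out = []
    cur = data[0:chunk]                # prologue: first load
    for i in range(chunk, len(data) + chunk, chunk):
        nxt = data[i:i + chunk]        # load ahead (overlaps with compute on real HW)
        out.extend(process(cur))       # process the chunk loaded last iteration
        cur = nxt                      # rotate: next chunk becomes current
    return out
```

    The prologue before the loop and the rotation at the bottom correspond to the hoisted vld1/vmull block before label 1: and the tail handling after label 2: in the unrolled asm.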

    The only other thing I can think of: I don't know how well the multi-cycle loads and stores dual-issue. You may want to try splitting them into separate loads and stores so both halves can dual-issue. But you can't put them back to back, or the address increment will stall in the integer pipeline. If you really need them back to back, you can do it with separate pointers at interleaved addresses and a register increment, but that increases register pressure a lot.


    Hi, thanks for the reply.
    If I understood the reference manual correctly, while NEON instructions dominate, the ARM side simply waits, up to 16 stages ahead. I don't think there is any penalty for NEON instructions dominating.
    And I can assure you that my iteration count is correct; otherwise iOS would punish me with a memory-access-violation exception. (iOS is very strict about this.)

    Last but not least, I've always been curious myself whether vldm/vstm with write-back interferes with the pipeline, but I guess it doesn't, since that would be very limiting performance-wise. I think the ARM side does the write-back immediately after putting those NEON instructions onto the queue. I'll write a simple test routine to verify this; it wouldn't be that hard.

    Alas, I won't be optimizing NEON code via dual-issuing, since the A9 and later cores don't dual-issue NEON instructions anyway. I don't think it's worth the effort optimizing solely for the A8.

    Or... does NEON on the A15 dual-issue?

    best regards
    - Jake