
Cortex-A8 NEON strange performance issue

Note: This was originally posted on 10th July 2013 at http://forums.arm.com

Hi all,

As an experienced assembly programmer, I'm currently working on NEON tutorials.

The first example is done, and it beats the original C function by a huge margin:


    pld  [r1, #64*0]
    pld  [r1, #64*1]
    pld  [r1, #64*2]
    pld  [r1, #64*3]

    ldr  r12, [sp] //element count
    vdup.16 d0, r2 //coeff
    vdup.16 d1, r3 //intercept

1:
        pld  [r1, #64*4]
        vld1.16  {d28-d31}, [r1,:256]!

        vmull.s16   q12, d28, d0[0]
        vmull.s16   q13, d29, d0[0]
        vmull.s16   q14, d30, d0[0]
        vmull.s16   q15, d31, d0[0]

        vaddw.s16   q12, q12, d1
        vaddw.s16   q13, q13, d1
        vaddw.s16   q14, q14, d1
        vaddw.s16   q15, q15, d1

        vqrshrun.s32    d24, q12, #8
        vqrshrun.s32    d25, q13, #8
        vqrshrun.s32    d26, q14, #8
        vqrshrun.s32    d27, q15, #8

        vst1.16  {d24-d27}, [r0,:256]!

        subs    r12, r12, #16
    bgt  1b
    bx      lr


However, there are some pipeline stalls here and there, and the dual-issue capability isn't fully exploited, so I decided to optimize it:



    pld  [r1, #64*0]
    pld  [r1, #64*1]
    pld  [r1, #64*2]
    pld  [r1, #64*3]
    pld  [r1, #64*4]
    ldr  r12, [sp] //element count
    vdup.16 d0, r2 //coeff
    vdup.16 d1, r3 //intercept
    vld1.16  {d28-d31}, [r1,:256]!
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vmull.s16   q15, d31, d0[0]
1:
        vaddw.s16   q12, q12, d1
        vld1.16  {d20-d23}, [r1,:256]!
        vaddw.s16   q13, q13, d1
        vaddw.s16   q14, q14, d1
        vaddw.s16   q15, q15, d1
        vqrshrun.s32    d24, q12, #8
        vqrshrun.s32    d25, q13, #8
        vqrshrun.s32    d26, q14, #8
        vqrshrun.s32    d27, q15, #8
        vmull.s16   q8, d20, d0[0]
        vmull.s16   q9, d21, d0[0]
        vmull.s16   q10, d22, d0[0]
        vmull.s16   q11, d23, d0[0]
        vst1.16  {d24-d27}, [r0,:256]!
        vaddw.s16   q8, q8, d1
        vaddw.s16   q9, q9, d1
        vaddw.s16   q10, q10, d1
        vaddw.s16   q11, q11, d1
        subs    r12, r12, #32
        vqrshrun.s32    d16, q8, #8
        vqrshrun.s32    d17, q9, #8
        vqrshrun.s32    d18, q10, #8
        ble  2f
        pld  [r1, #64*4]
        vld1.16  {d28-d31}, [r1,:256]!
        vqrshrun.s32    d19, q11, #8
        vmull.s16   q12, d28, d0[0]
        vmull.s16   q13, d29, d0[0]
        vmull.s16   q14, d30, d0[0]
        vst1.16  {d16-d19}, [r0,:256]!
        vmull.s16   q15, d31, d0[0]
    b 1b
    2:
    vqrshrun.s32    d19, q11, #8
    vst1.16  {d16-d19}, [r0,:256]!
    bx      lr


There are no pipeline stalls at all in this optimized version, and every memory-access instruction is surrounded by data-processing ones, so each gets dual-issued twice: at the top and at the bottom. Looks good~
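To make the intended pairing explicit, here is a sketch of two of those spots (the Cortex-A8 NEON unit can issue one load/store/permute instruction alongside one data-processing instruction per cycle, so the actual pairing depends on pipeline state):

        vaddw.s16   q12, q12, d1           // data-processing pipe...
        vld1.16     {d20-d23}, [r1,:256]!  // ...dual-issues with the load
        // ...
        vmull.s16   q11, d23, d0[0]        // data-processing pipe...
        vst1.16     {d24-d27}, [r0,:256]!  // ...dual-issues with the store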

Then came the bitter surprise while benchmarking on my 4th-gen iPod touch and iPhone 4 (iOS 6.1.3, if it matters): the "optimized" version is about 5% slower than the initial one.


(r12 gets the element count of a full-HD frame (1920*1080 = 2,073,600), and both functions are called several thousand times in a row.)
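For clarity, here is the argument mapping both versions assume, together with my reading of the per-element operation (usat16 is just shorthand for saturating to the unsigned 16-bit range; it's notation, not an instruction):

    // Arguments (AAPCS) as used by both versions above:
    //   r0   = dst        (uint16_t *, 32-byte aligned)
    //   r1   = src        (int16_t  *, 32-byte aligned)
    //   r2   = coeff      (int16_t)
    //   r3   = intercept  (int16_t)
    //   [sp] = element count (here 1920*1080)
    //
    // Per element, vmull/vaddw/vqrshrun compute:
    //   dst[i] = usat16(((int32_t)src[i] * coeff + intercept + 128) >> 8)
    //   (the +128 is the rounding constant added by vqrshrun #8)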

I've tried several variations, repositioning subs and pld, but nothing helps: the optimized version is ALWAYS slower.

And when I removed the pld instructions completely, both functions became much slower, as expected, but voilà, the optimized version was now about 5% faster.


It seems I did the right thing in removing the pipeline stalls and unlocking the dual-issue capability, but I must have done something the L2 cache doesn't like in the optimized version.

Can someone give me some insight into this? I'm really curious, desperate, frustrated, or whatever, since without knowing the reason for this strange behavior, all the effort I've put into mastering NEON would be in vain.


Thanks in advance.
  • Note: This was originally posted on 17th July 2013 at http://forums.arm.com


    Two vld's loading 32 bytes each = 64 bytes = 1 cache line. :)
    Adding one more PLD didn't help.

    Thanks anyway


    Doh, I'm an idiot, sorry, I forgot you have a 64-byte cache line! In that case I don't know how else to help you, sorry. But I still recommend that you look at the performance of your code with the calculations commented out versus the memory accesses commented out, to get an idea of where your time is being spent. Maybe all your time is just spent waiting on DRAM loads & stores, and that is something NEON can't help with.
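    For example, a memory-only variant of your first loop (just a sketch, with the arithmetic removed) would look like this; if it's not much faster than the full kernel, the loop is bound by memory bandwidth:

    1:
        pld      [r1, #64*4]
        vld1.16  {d28-d31}, [r1,:256]!
        vst1.16  {d28-d31}, [r0,:256]!
        subs     r12, r12, #16
        bgt      1b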
  • Note: This was originally posted on 18th July 2013 at http://forums.arm.com


    Or... does NEON on the A15 dual-issue?


    Yes, the Cortex-A15 is a beast: it can dual-issue NEON, and it has numerous NEON ALUs. E.g. VADDQ runs at 1.5 instructions per clock per CPU core (i.e. you can add 24 x 8-bit numbers per clock!). So a quad-core A15 running at 1.6 GHz with all 4 cores achieving 85% parallelization can process 4 x 0.85 x 24 x 1.6 x 10^9 ≈ 130 GB/s!
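    (If you want to measure that yourself, a minimal throughput probe is a loop of independent VADDQs; using r0 as the iteration counter is just an illustrative choice:)

    // Sketch: 4 independent 128-bit adds per iteration; time the loop
    // and divide by iterations * 4 to get VADDQ throughput.
    1:
        vadd.i8  q0, q0, q8
        vadd.i8  q1, q1, q9
        vadd.i8  q2, q2, q10
        vadd.i8  q3, q3, q11
        subs     r0, r0, #1
        bgt      1b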
  • Note: This was originally posted on 18th July 2013 at http://forums.arm.com

    Yes, ARM has released surprisingly few details about the Cortex-A15 CPU. Since the chip is already fully available, I assume they won't release any more info than what is currently in the TRM. Anyway, the Cortex-A15 has various NEON improvements, such as out-of-order execution & register renaming of NEON instructions, and dual-issue of some NEON instructions (potentially at the same time as it triple-issues regular CPU instructions and performs both a load & a store; in other words, it can co-issue up to 7 instructions at the same time!).

    I have measured more than 130 GB/s in my NEON code on my quad-core A15, so it certainly is possible!
  • Note: This was originally posted on 13th July 2013 at http://forums.arm.com


    Actually, that's another thing you need to do a LOT of testing with: PLD. I noticed that your optimized loop above is loading 2 cache lines per iteration but only preloading 1 cache line per iteration. With PLD it seems too hard to predict what will work best; you need to do lots of testing with different numbers of PLD instructions, different locations of PLD within your code, and, most importantly, different numbers of cache lines to preload ahead. But in general, you should preload 2 cache lines if you are loading 2 cache lines, and PLD instructions typically work better when they are away from the VLD instructions (but not always) and interleaved among other code. You should also experiment with preloading 1, 2, 3, 4, 5, 6, 8, 10, 20, and 30 cache lines ahead to see which gives the best results. (Obviously, if you notice the performance getting worse as you increase or decrease the PLD distance, you don't need to test every amount, but you get the idea.)

    Each cache line consists of 64 bytes on the Cortex-A8, so in the extended version it's exactly 64 bytes per iteration. No mistake here.
    I already tried everything you suggest, repositioning the PLD to every possible position.
    And reading further ahead only reduced the performance; through my experiments, I found 4 lines ahead to be the most efficient.
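    For the record, that sweep is easy to script by making the distance an assembler constant (PLD_LINES is just an illustrative name, not in my real code):

    .equ PLD_LINES, 4                      // how many 64-byte lines ahead
    1:
        pld      [r1, #(PLD_LINES*64)]
        vld1.16  {d28-d31}, [r1,:256]!
        // ... rest of the loop unchanged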

    BTW, I've known your site for two years or so. Good stuff there.