Cortex-A8 NEON strange performance issue

Note: This was originally posted on 10th July 2013 at http://forums.arm.com

Hi all,

As an experienced assembly programmer, I'm currently working on NEON tutorials.

The first example is done, and it beats the original C function by a huge margin:


    pld  [r1, #64*0]
    pld  [r1, #64*1]
    pld  [r1, #64*2]
    pld  [r1, #64*3]

    ldr  r12, [sp]
    vdup.16 d0, r2 //coeff
    vdup.16 d1, r3 //intercept

    1:
        pld  [r1, #64*4]
        vld1.16  {d28-d31}, [r1,:256]!

        vmull.s16   q12, d28, d0[0]
        vmull.s16   q13, d29, d0[0]
        vmull.s16   q14, d30, d0[0]
        vmull.s16   q15, d31, d0[0]

        vaddw.s16   q12, q12, d1
        vaddw.s16   q13, q13, d1
        vaddw.s16   q14, q14, d1
        vaddw.s16   q15, q15, d1

        vqrshrun.s32    d24, q12, #8
        vqrshrun.s32    d25, q13, #8
        vqrshrun.s32    d26, q14, #8
        vqrshrun.s32    d27, q15, #8

        vst1.16  {d24-d27}, [r0,:256]!

        subs    r12, r12, #16
    bgt  1b
    bx      lr
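
For readers who want a baseline to compare against, the per-element arithmetic of the loop above can be sketched in scalar C. The original C function is not shown in this post, so the function names and the signature below are assumptions; the argument order is only inferred from the register usage (r0 = destination, r1 = source, r2 = coeff, r3 = intercept, count on the stack).

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors vqrshrun.s32 #8: rounding shift right by 8, then saturate
 * the signed result to unsigned 16 bits. */
static uint16_t round_shift_saturate_u16(int32_t acc)
{
    int64_t rounded = (int64_t)acc + 128;   /* rounding constant 1 << (8 - 1) */
    if (rounded < 0)
        return 0;                           /* negative results saturate to 0 */
    rounded >>= 8;
    return rounded > 65535 ? 65535 : (uint16_t)rounded;
}

/* Hypothetical scalar reference; name and signature are assumptions. */
void scale_offset_ref(uint16_t *dst, const int16_t *src,
                      int16_t coeff, int16_t intercept, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        /* vmull.s16: widening signed 16x16 -> 32-bit multiply */
        int32_t acc = (int32_t)src[i] * coeff;
        /* vaddw.s16: add the sign-extended 16-bit intercept */
        acc += intercept;
        dst[i] = round_shift_saturate_u16(acc);
    }
}
```

A reference like this is mainly useful for checking that a rescheduled NEON loop still produces bit-identical output.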

However, there are some pipeline stalls here and there, and dual issue isn't fully exploited, so I decided to optimize it:

    pld  [r1, #64*0]
    pld  [r1, #64*1]
    pld  [r1, #64*2]
    pld  [r1, #64*3]
    pld  [r1, #64*4]
    ldr  r12, [sp]
    vdup.16 d0, r2 //coeff
    vdup.16 d1, r3 //intercept
    vld1.16  {d28-d31}, [r1,:256]!
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vmull.s16   q15, d31, d0[0]
    1:
        vaddw.s16   q12, q12, d1
        vld1.16  {d20-d23}, [r1,:256]!
        vaddw.s16   q13, q13, d1
        vaddw.s16   q14, q14, d1
        vaddw.s16   q15, q15, d1
        vqrshrun.s32    d24, q12, #8
        vqrshrun.s32    d25, q13, #8
        vqrshrun.s32    d26, q14, #8
        vqrshrun.s32    d27, q15, #8
        vmull.s16   q8, d20, d0[0]
        vmull.s16   q9, d21, d0[0]
        vmull.s16   q10, d22, d0[0]
        vmull.s16   q11, d23, d0[0]
        vst1.16  {d24-d27}, [r0,:256]!
        vaddw.s16   q8, q8, d1
        vaddw.s16   q9, q9, d1
        vaddw.s16   q10, q10, d1
        vaddw.s16   q11, q11, d1
        subs    r12, r12, #32
        vqrshrun.s32    d16, q8, #8
        vqrshrun.s32    d17, q9, #8
        vqrshrun.s32    d18, q10, #8
        ble  2f
        pld  [r1, #64*4]
        vld1.16  {d28-d31}, [r1,:256]!
        vqrshrun.s32    d19, q11, #8
        vmull.s16   q12, d28, d0[0]
        vmull.s16   q13, d29, d0[0]
        vmull.s16   q14, d30, d0[0]
        vst1.16  {d16-d19}, [r0,:256]!
        vmull.s16   q15, d31, d0[0]
    b 1b
    2:
    vqrshrun.s32    d19, q11, #8
    vst1.16  {d16-d19}, [r0,:256]!
    bx      lr

There is no pipeline stall at all in this optimized version, and all the memory access instructions are surrounded by data processing ones, so they get dual-issued twice, at the top and at the bottom. Looks good~

Then came the bitter surprise while benchmarking on my 4th-gen iPod touch and iPhone 4 (iOS 6.1.3, if it matters): the "optimized" version is about 5% slower than the initial one.

(r12 is set to the full HD resolution, 1920*1080, and both functions are called several thousand times in a row.)
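
To put numbers on that setup, here is a small C sketch of the loop arithmetic. The helper names are made up for illustration, and the 64-byte cache line is an assumption suggested by the `#64*n` offsets used with pld. It shows the iteration counts, the size of each buffer, and how many pld instructions each version issues per line of source data consumed.

```c
#include <stdint.h>

enum {
    ELEM_BYTES = 2,   /* vld1.16 / vst1.16 move 16-bit elements */
    LOAD_BYTES = 32,  /* four 64-bit d registers per vld1.16 {dN-dN+3} */
    LINE_BYTES = 64   /* assumption: 64-byte cache line */
};

/* Elements moved by one four-register vld1.16. */
uint32_t elems_per_load(void)
{
    return LOAD_BYTES / ELEM_BYTES;
}

/* Loop trips for a given element count and per-iteration step (the subs immediate). */
uint32_t loop_iterations(uint32_t count, uint32_t elems_per_iter)
{
    return count / elems_per_iter;
}

/* Size in bytes of one source (or destination) buffer. */
uint32_t buffer_bytes(uint32_t count)
{
    return count * ELEM_BYTES;
}

/* pld instructions issued per cache line of source data consumed. */
uint32_t plds_per_line(uint32_t elems_per_iter, uint32_t plds_per_iter)
{
    return plds_per_iter * LINE_BYTES / (elems_per_iter * ELEM_BYTES);
}
```

With count = 1920*1080, the first version runs 129600 iterations of 16 elements and the second 64800 iterations of 32 elements, and each buffer is 4147200 bytes. Under the 64-byte line assumption, the first loop issues two pld instructions per line consumed and the second issues one. Whether that difference has anything to do with the measured slowdown is not established here.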

I've tried several variations, moving the subs and pld around, but nothing helps: the optimized version is ALWAYS slower.

When I removed the pld instructions completely, both functions became much slower, as expected, but voila, the optimized version was now about 5% faster.

It seems I did the right thing in removing pipeline stalls and enabling dual issue, but I must have done something in the optimized version that the L2 cache doesn't like.

Can someone give me some insight into this? I'm really curious, desperate, frustrated or whatever, because without knowing the reason for this strange behavior, all the effort I put into mastering NEON would be in vain.

Thanks in advance.

Jake Lee over 12 years ago

Note: This was originally posted on 13th July 2013 at http://forums.arm.com

Hi.

The main NEON optimisation technique is to leave some time between the register load/store instructions and the compute operations that depend on them.

Try something like this:

    @ preload
    @ (loads go to d16-d19: d28-d31 alias q14-q15, which hold products)
    vld1.16 {d16-d19}, [r1,:256]!
    @ first operation, to transfer registers
    vmull.s16 q12, d16, d0[0]
    vmull.s16 q13, d17, d0[0]
    vmull.s16 q14, d18, d0[0]
    vmull.s16 q15, d19, d0[0]
    @ you'll have to do one iteration fewer
    subs r12, r12, #16
    .loop:
        @ load for next iteration
        vld1.16 {d16-d19}, [r1,:256]!
        @ work on the preloaded data
        vaddw.s16 q12, q12, d1
        vaddw.s16 q13, q13, d1
        vaddw.s16 q14, q14, d1
        vaddw.s16 q15, q15, d1
        @ prepare for saving
        vqrshrun.s32 d20, q12, #8
        vqrshrun.s32 d21, q13, #8
        vqrshrun.s32 d22, q14, #8
        vqrshrun.s32 d23, q15, #8
        @ leave some time before saving - transfer registers
        vmull.s16 q12, d16, d0[0]
        vmull.s16 q13, d17, d0[0]
        vmull.s16 q14, d18, d0[0]
        vmull.s16 q15, d19, d0[0]
        @ saving
        vst1.16 {d20-d23}, [r0,:256]!
        subs r12, r12, #16
    bgt .loop
    @ finish the last iteration
    vaddw.s16 q12, q12, d1
    vaddw.s16 q13, q13, d1
    vaddw.s16 q14, q14, d1
    vaddw.s16 q15, q15, d1
    vqrshrun.s32 d20, q12, #8
    vqrshrun.s32 d21, q13, #8
    vqrshrun.s32 d22, q14, #8
    vqrshrun.s32 d23, q15, #8
    vst1.16 {d20-d23}, [r0,:256]!

That should work.
But I have not tested the code!
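
The scheduling idea in this suggestion (start the load for the next block, finish the current block, then begin the multiply for the next one) is a form of software pipelining. A minimal C sketch of that loop shape follows. The function and helper names are hypothetical, count is assumed to be a nonzero multiple of 16, and a compiler is free to reorder this code, so it only illustrates the prologue/loop/epilogue structure rather than any actual NEON timing.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 16 };  /* elements per four-register vld1.16 */

/* Add the intercept, round-shift right by 8, saturate to unsigned 16 bits. */
static uint16_t add_round_narrow(int32_t product, int16_t intercept)
{
    int64_t v = (int64_t)product + intercept + 128;
    if (v < 0)
        return 0;
    v >>= 8;
    return v > 65535 ? 65535 : (uint16_t)v;
}

/* Hypothetical name; count is assumed to be a nonzero multiple of BLOCK. */
void scale_offset_pipelined(uint16_t *dst, const int16_t *src,
                            int16_t coeff, int16_t intercept, size_t count)
{
    int32_t acc[BLOCK];
    int16_t next[BLOCK];
    size_t k;

    if (count < BLOCK)
        return;

    /* Prologue: load and multiply the first block. */
    for (k = 0; k < BLOCK; ++k)
        acc[k] = (int32_t)src[k] * coeff;
    src += BLOCK;

    for (size_t left = count - BLOCK; left >= BLOCK; left -= BLOCK) {
        /* Load for the next iteration. */
        for (k = 0; k < BLOCK; ++k)
            next[k] = src[k];
        src += BLOCK;
        /* Finish and store the current block. */
        for (k = 0; k < BLOCK; ++k)
            dst[k] = add_round_narrow(acc[k], intercept);
        dst += BLOCK;
        /* Start the multiply for the next block. */
        for (k = 0; k < BLOCK; ++k)
            acc[k] = (int32_t)next[k] * coeff;
    }

    /* Epilogue: finish the last block. */
    for (k = 0; k < BLOCK; ++k)
        dst[k] = add_round_narrow(acc[k], intercept);
}
```

Unlike the assembly above, this sketch tests the remaining count before entering the loop, so a single-block input does not read past the end of the source.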

I wish so much I could give you a success report, but unfortunately that's not the case.

Anyway, thank you very much for your effort. I really appreciate that.

I had to omit the optimization part in my tutorial. I just submitted the first part.

http://armneon.blogspot.com

cya