
Cortex-A8 NEON strange performance issue

Note: This was originally posted on 10th July 2013 at http://forums.arm.com

Hi all,

As an experienced assembly programmer, I'm currently working on NEON tutorials.

My first example is done, and it beats the original C function by a huge margin:


    pld     [r1, #64*0]
    pld     [r1, #64*1]
    pld     [r1, #64*2]
    pld     [r1, #64*3]

    ldr     r12, [sp]               @ element count (5th argument)
    vdup.16 d0, r2                  @ coeff
    vdup.16 d1, r3                  @ intercept

1:
    pld     [r1, #64*4]
    vld1.16 {d28-d31}, [r1,:256]!   @ load 16 x s16

    vmull.s16   q12, d28, d0[0]     @ multiply by coeff, widening to s32
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vmull.s16   q15, d31, d0[0]

    vaddw.s16   q12, q12, d1        @ add intercept
    vaddw.s16   q13, q13, d1
    vaddw.s16   q14, q14, d1
    vaddw.s16   q15, q15, d1

    vqrshrun.s32    d24, q12, #8    @ round, shift right by 8, saturate to u16
    vqrshrun.s32    d25, q13, #8
    vqrshrun.s32    d26, q14, #8
    vqrshrun.s32    d27, q15, #8

    vst1.16 {d24-d27}, [r0,:256]!   @ store 16 x u16

    subs    r12, r12, #16
    bgt     1b
    bx      lr
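
For reference, here is a minimal C sketch of the per-element operation the routine above implements (the function name and types are my assumptions; the arithmetic mirrors the vmull.s16 + vaddw.s16 + vqrshrun.s32 #8 sequence):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical scalar reference: dst[i] = saturate_u16((src[i]*coeff + intercept + 128) >> 8) */
    static void scale_offset_ref(uint16_t *dst, const int16_t *src,
                                 int16_t coeff, int16_t intercept, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            int32_t v = (int32_t)src[i] * coeff + intercept;   /* vmull.s16 + vaddw.s16 */
            v = (v + 128) >> 8;       /* rounding shift, as in vqrshrun #8
                                         (assumes arithmetic >> on negative values) */
            if (v < 0)     v = 0;     /* unsigned saturation to 16 bits */
            if (v > 65535) v = 65535;
            dst[i] = (uint16_t)v;
        }
    }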


However, there are some pipeline stalls here and there, and the dual-issue capability isn't fully exploited. So I decided to optimize it:



    pld     [r1, #64*0]
    pld     [r1, #64*1]
    pld     [r1, #64*2]
    pld     [r1, #64*3]
    pld     [r1, #64*4]
    ldr     r12, [sp]               @ element count (5th argument)
    vdup.16 d0, r2                  @ coeff
    vdup.16 d1, r3                  @ intercept

    @ prologue: load and start the first batch (A) outside the loop
    vld1.16 {d28-d31}, [r1,:256]!
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vmull.s16   q15, d31, d0[0]
1:
    @ finish batch A (q12-q15) while loading batch B (d20-d23)
    vaddw.s16   q12, q12, d1
    vld1.16 {d20-d23}, [r1,:256]!
    vaddw.s16   q13, q13, d1
    vaddw.s16   q14, q14, d1
    vaddw.s16   q15, q15, d1
    vqrshrun.s32    d24, q12, #8
    vqrshrun.s32    d25, q13, #8
    vqrshrun.s32    d26, q14, #8
    vqrshrun.s32    d27, q15, #8
    @ start batch B while storing batch A
    vmull.s16   q8, d20, d0[0]
    vmull.s16   q9, d21, d0[0]
    vmull.s16   q10, d22, d0[0]
    vmull.s16   q11, d23, d0[0]
    vst1.16 {d24-d27}, [r0,:256]!
    vaddw.s16   q8, q8, d1
    vaddw.s16   q9, q9, d1
    vaddw.s16   q10, q10, d1
    vaddw.s16   q11, q11, d1
    subs    r12, r12, #32
    vqrshrun.s32    d16, q8, #8
    vqrshrun.s32    d17, q9, #8
    vqrshrun.s32    d18, q10, #8
    ble     2f
    @ load and start the next batch A while finishing and storing batch B
    pld     [r1, #64*4]
    vld1.16 {d28-d31}, [r1,:256]!
    vqrshrun.s32    d19, q11, #8
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vst1.16 {d16-d19}, [r0,:256]!
    vmull.s16   q15, d31, d0[0]
    b       1b
2:
    @ epilogue: finish and store the last batch B
    vqrshrun.s32    d19, q11, #8
    vst1.16 {d16-d19}, [r0,:256]!
    bx      lr


There is no pipeline stall at all in this optimized version, and every memory-access instruction is surrounded by data-processing ones, so each of them can dual-issue twice, once with the instruction above it and once with the instruction below it. Looks good so far.

Then came the bitter surprise while benchmarking on my 4th gen iPod touch and iPhone 4 (iOS 6.1.3, if it matters): the "optimized" version is about 5% slower than the initial one.


(r12 is set to the number of pixels in a full HD frame (1920*1080), and both functions are called several thousand times in a row.)
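
A rough sketch of the kind of timing harness used (the routine name and coefficient values are placeholders; mach_absolute_time is Apple's monotonic tick counter). With this argument order the fifth parameter, the element count, arrives on the stack, which is why the assembly reads r12 from [sp]:

    #include <mach/mach_time.h>
    #include <stdint.h>

    /* Assumed prototype of the assembly routine above:
     * dst -> r0, src -> r1, coeff -> r2, intercept -> r3, count -> [sp]. */
    extern void scale_offset_neon(uint16_t *dst, const int16_t *src,
                                  int16_t coeff, int16_t intercept, int count);

    static double bench_ns_per_call(uint16_t *dst, const int16_t *src,
                                    int count, int reps)
    {
        mach_timebase_info_data_t tb;
        mach_timebase_info(&tb);                    /* conversion factor: ticks -> ns */
        uint64_t t0 = mach_absolute_time();
        for (int i = 0; i < reps; i++)
            scale_offset_neon(dst, src, 55, 128, count);    /* arbitrary test coefficients */
        uint64_t t1 = mach_absolute_time();
        return (double)(t1 - t0) * tb.numer / tb.denom / reps;
    }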

I've tried several variations, repositioning the subs and pld instructions, but nothing matters: the optimized version is ALWAYS slower.

Then I removed the pld instructions completely. Both functions become much slower, as expected, but voila, the optimized version is now about 5% faster.


It seems I did the right thing in removing the pipeline stalls and exploiting the dual-issue capability, but I must have done something the L2 cache doesn't like in the optimized version.

Can someone give me some insight into this? I'm really curious, desperate, frustrated or whatever, since without knowing the reason for this strange behavior, all the effort I've put into mastering NEON would be in vain.


Thanks in advance.
  • Note: This was originally posted on 13th July 2013 at http://forums.arm.com


    Yes I think Webshaker's idea will give you good results, because it is easy to read ARM & NEON docs and think that counting the CPU clock cycles for your instructions will be enough to estimate the total time of your loop, when in fact most code in mobile spends more time loading & saving memory than performing CPU calculations. So your main goal should be to minimize the amount of time wasted while waiting for data to load & store. This usually means accessing memory in good cache-friendly ways, using main memory the least you can inside your loop, AND trying to increase the number of useful operations that happen between loading a value from memory and actually using it in another instruction.

    Also, having longer loops can sometimes reduce the speed of your code, so this might also be influencing your perf.


    I've tried about 20 variations (long loops, short loops, and so on), including what webshaker suggested and what you are talking about.

    However, the unoptimized original is ALWAYS the fastest as soon as PLD is enabled.

    There must still be something we aren't aware of...
  • Note: This was originally posted on 13th July 2013 at http://forums.arm.com


    Hi.

    The main NEON optimisation technique is to leave some time between the load/store instructions and the compute operations that use their registers.

    Try something like this



    @ preload
            vld1.16   {d28-d31}, [r1,:256]!

    @ first operations consuming the loaded registers
            vmull.s16   q12, d28, d0[0]
            vmull.s16   q13, d29, d0[0]
            vmull.s16   q14, d30, d0[0]
            vmull.s16   q15, d31, d0[0]

    @ you'll have to do one less iteration
            subs    r12, r12, #16
    .loop:


    @ load for next iteration
            vld1.16   {d28-d31}, [r1,:256]!

    @ working on the previously loaded data

            vaddw.s16   q12, q12, d1
            vaddw.s16   q13, q13, d1
            vaddw.s16   q14, q14, d1
            vaddw.s16   q15, q15, d1

    @ prepare for saving
            vqrshrun.s32    d20, q12, #8
            vqrshrun.s32    d21, q13, #8
            vqrshrun.s32    d22, q14, #8
            vqrshrun.s32    d23, q15, #8



    @ leave some time before saving - consume the loaded registers
            vmull.s16   q12, d28, d0[0]
            vmull.s16   q13, d29, d0[0]
            vmull.s16   q14, d30, d0[0]
            vmull.s16   q15, d31, d0[0]

    @ saving
            vst1.16   {d20-d23}, [r0,:256]!


            subs    r12, r12, #16
    bgt   .loop

    @ need to finish the last iteration

            vaddw.s16   q12, q12, d1
            vaddw.s16   q13, q13, d1
            vaddw.s16   q14, q14, d1
            vaddw.s16   q15, q15, d1

            vqrshrun.s32    d20, q12, #8
            vqrshrun.s32    d21, q13, #8
            vqrshrun.s32    d22, q14, #8
            vqrshrun.s32    d23, q15, #8

            vst1.16   {d20-d23}, [r0,:256]!


    That should work.
    But I haven't tested the code!


    I wish I could give you a success report, but unfortunately that's not the case :(

    Anyway, thank you very much for your effort. I really appreciate that.

    I had to omit the optimization part in my tutorial. I just submitted the first part.

    http://armneon.blogspot.com

    cya
  • Note: This was originally posted on 17th July 2013 at http://forums.arm.com


    > In the extended version It's exactly 64Bytes per iteration. No mistakes here.

    What I meant is that you have 2 VLD instructions (loading a total of 2 cachelines) per iteration, but you only have 1 PLD instruction. So you should add an extra PLD instruction in your loop, to preload 2 cachelines not just 1, otherwise you are only preloading half of your data!



    Two VLDs loading 32 bytes each = 64 bytes = 1 cache line. :)
    Putting one additional PLD didn't help.

    Thanks anyway
  • Note: This was originally posted on 17th July 2013 at http://forums.arm.com


    Your problem seems very strange.

    - Do r0 and r1 reference the same memory zone? If so, try pointing them at different memory areas.

    - Can you try your code without any NEON instructions (except the memory accesses, of course)? Maybe you have saturated the memory access capacity (what hardware are you using?).

    - Finally, is there any chance your benchmarking method is wrong?


    I will try your code on my BeagleBoard to see!

    Etienne


    - r0 and r1 point to different memory blocks.

    - I'm using a 4th gen iPod touch and an iPhone 4, both the same generation :(

    - I've tried different benchmarking methods, from simple logging to the nanosecond counter offered by Apple. I'm not saying they can't be wrong, but they all show the initial version to be the fastest.

    Thank you again.


    PS: I just finished a high-precision 8x8 LLM iDCT with fixed-point math, more accurate than the float version. It can do about 1.5 million iDCTs/sec on my iPod touch (800 MHz Cortex-A8). Not bad, huh?
  • Note: This was originally posted on 18th July 2013 at http://forums.arm.com


    I've found really weird, inexplicable things with NEON too :/ One thing I've had a suspicion of for a while, but little really hard data on, is that maybe there are penalties for too many NEON instructions before hitting a non-NEON instruction. At the very least we do know that the 16-entry queue will fill up and you'll no longer be able to decode non-NEON instructions ahead of NEON execution. You can also try aligning the loop entry point to 8 bytes to maximize fetch throughput.

    One kind of obvious question: is your iteration count actually a multiple of 32? If you're rounding it up to 32 you'll be doing half a loop of extra work. I don't think you need to unroll like this to get enough distance to avoid stalls; you can probably do it just by staggering/software pipelining the loop.

    The only other thing I can think of, is I don't know how well the multi-cycle loads and stores dual-issue. You may want to try splitting it into separate loads and stores to try to get both sides to dual-issue. But you can't put them back to back or the address increment will stall in the integer pipeline. If you really need them back to back you can do it with separate pointers at interleaved addresses and a register increment but that increases register pressure a lot.


    Hi, thanks for the reply.
    If I understood the reference manual correctly, while NEON instructions dominate, the ARM pipeline simply waits, with the queue running up to 16 entries ahead. I don't think there is any penalty for NEON instructions dominating.
    And I can assure you that the iteration count is handled correctly; otherwise iOS would punish me with a memory access violation exception. (iOS is very strict about this.)

    Last but not least, I have always been curious whether vldm/vstm with writeback interferes with the pipeline, but I guess it doesn't, since that would be very limiting performance-wise. I think the core does the writeback immediately after putting those NEON instructions onto the queue. I'll write a simple test routine to verify this; it wouldn't be that hard.

    Alas, I won't be optimizing NEON code for dual-issuing, since the A9 and later cores don't dual-issue NEON instructions anyway. I don't think it's worth the effort optimizing solely for the A8.

    Or... does NEON on the A15 dual-issue?

    best regards
    - Jake
  • Note: This was originally posted on 18th July 2013 at http://forums.arm.com


    Yes the Cortex-A15 is a beast, it can dual-issue NEON and it has numerous NEON ALUs. eg: VADDQ is 1.5 instructions per clock (ie: you can add 24 x 8-bit numbers per clock!) per CPU core. So a quad-core A15 running at 1.6 GHz with 4 cores getting 85% parallelization can process 4 x 0.85 x 24 x 1.6 x 10^9 = 130 GB/s!


    Wow, I can hardly wait for the A15 based Odin chip next month!

    I'll start reading the A15 reference manual.

    Thanks for the info
  • Note: This was originally posted on 18th July 2013 at http://forums.arm.com


    Hmmmmmm... Now I remember. I already read that reference manual and found it useless for this, since there is no information about dual-issuing or cycle timings. Where did you get the info from?
  • Note: This was originally posted on 18th July 2013 at http://forums.arm.com

    ARM has released some white papers and presentations for the Cortex-A15; they label two NEON issue ports but they don't say what each can do. I have heard from other sources that one can handle 128-bit operations while the other handles 64-bit operations, and this is also consistent with what shervin is saying and with some very limited testing I've done (I think it showed throughput between 1 and 2 per clock for vmul.f32, vmla.f32, and things like vadd.u32). I still don't know exactly which operations are and aren't supported by both units.
  • Note: This was originally posted on 13th July 2013 at http://forums.arm.com

    Actually that's another thing you need to do a LOT of testing with: PLD.

    I noticed that your optimized loop above is loading 2 cachelines per iteration but only preloading 1 cacheline per iteration. With PLD it seems too hard to predict what will work best: you need to do lots of testing with different numbers of PLD instructions, different locations of the PLDs within your code, and most importantly, different numbers of cachelines to preload ahead. In general, you should preload 2 cachelines if you are loading 2 cachelines, and PLD instructions typically work better when they are placed away from the VLD instructions (but not always) and interleaved among other code. You should also experiment with preloading 1, 2, 3, 4, 5, 6, 8, 10, 20, and 30 cachelines ahead, to see which gives the best results. (Obviously, if you notice the perf getting worse as you increase or decrease the PLD distance, you don't need to test every value, but you get the idea.) A sketch of such a sweep follows below.
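
    As an illustration, a minimal sketch of such a sweep, assuming a plain C harness around the loop (the function name and loop body are placeholders); __builtin_prefetch is the GCC/Clang builtin that emits PLD on ARM targets:

        #include <stdint.h>

        enum { CACHELINE = 64, PLD_AHEAD = 4 };   /* sweep PLD_AHEAD over 1..30 and measure */

        void process(uint16_t *dst, const int16_t *src, int count)
        {
            for (int i = 0; i < count; i += 32) {              /* 32 x s16 = one 64-byte line */
                __builtin_prefetch((const char *)(src + i) + PLD_AHEAD * CACHELINE);
                for (int j = 0; j < 32 && i + j < count; j++)
                    dst[i + j] = (uint16_t)src[i + j];         /* stand-in for the real work */
            }
        }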
  • Note: This was originally posted on 14th July 2013 at http://forums.arm.com

    > In the extended version It's exactly 64Bytes per iteration. No mistakes here.

    What I meant is that you have 2 VLD instructions (loading a total of 2 cachelines) per iteration, but you only have 1 PLD instruction. So you should add an extra PLD instruction in your loop, to preload 2 cachelines not just 1, otherwise you are only preloading half of your data!

    > BTW, I know your site since two years or so :) Good stuffs there

    Thanks, it's good to see your website is also trying to help others learn ARM & NEON :-)
  • Note: This was originally posted on 15th July 2013 at http://forums.arm.com

    I recommend you find out whether most of your time goes into memory access or into calculations, because in mobile it is nearly all memory access time! For example, try simply commenting out all your NEON arithmetic, so that your code only loads and stores the required data without modifying it. If you find that the speed is almost the same (this is usually the case!), then it explains why replacing ARM instructions with NEON can't give you a speedup: NEON isn't necessarily any faster at memory access than ARM (in fact it can be a little slower!). But if you find that, say, half of the time was spent in the NEON arithmetic itself rather than in loads, stores or preloads, then your code probably isn't running at high efficiency and you should try to optimize it differently. A sketch of that load/store-only experiment follows below.
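
    A minimal sketch of that load/store-only experiment, assuming the same argument order as the assembly at the top of the thread (the function name is a placeholder):

        #include <stdint.h>

        /* Same loads and stores as the real routine, arithmetic removed.
         * If this runs at nearly the same speed as the full version, the
         * loop is memory-bound and instruction scheduling matters little. */
        void scale_offset_memonly(uint16_t *dst, const int16_t *src,
                                  int16_t coeff, int16_t intercept, int count)
        {
            (void)coeff; (void)intercept;            /* deliberately unused */
            for (int i = 0; i < count; i++)
                dst[i] = (uint16_t)src[i];           /* load + store only */
        }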