
Take full advantage of the SVE vector-length-agnostic approach

Hello,

I have the following piece of code:

template<int bx, int by>
void blockcopy_sp_c(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            a[x] = (pixel)b[x];
        }

        a += stridea;
        b += strideb;
    }
}

So, after reading bx 16-bit elements (and storing bx 8-bit pixels), we need to jump to another location in memory, read/store the next row, and so on.

One possible ASM code for NEON to support the aforementioned function is the following (assuming that bx = by = 8):

function PFX(blockcopy_sp_8x8_neon)
    lsl x3, x3, #1
.rept 4
    ld1 {v0.8h}, [x2], x3
    ld1 {v1.8h}, [x2], x3
    xtn v0.8b, v0.8h
    xtn v1.8b, v1.8h
    st1 {v0.d}[0], [x0], x1
    st1 {v1.d}[0], [x0], x1
.endr
    ret
endfunc

However, the only way to use a post-indexed register offset in SVE seems to be the gather loads and scatter stores. So, a possible ASM code for SVE2 to support the aforementioned function is the following (assuming that bx = by = 8):
function PFX(blockcopy_sp_8x8)
    MOV x8, 8
    MOV x9, #0
    MOV x6, #0
    MOV x7, #0
    MOV z31.d, #64
    MOV z0.d, #0

    WHILELT p1.d, x9, x8
    B.NONE .L_return_blockcopy_sp_8x8

.L_loopStart_blockcopy_sp_8x8:
    INDEX z1.d, x6, x3
    INDEX z2.d, x7, x1
.rept 2
    LD1D z3.d, p1/Z, [x2, z1.d]
    ADD z1.d, z1.d, z31.d
    UZP1 z3.b, z3.b, z0.b
    ST1W z3.d, p1, [x0, z2.d, UXTW #2]
    ADD z2.d, z2.d, z31.d
.endr
    INCD x9
    MUL x6, x9, x3
    MUL x7, x9, x1
    WHILELT p1.d, x9, x8
    B.FIRST .L_loopStart_blockcopy_sp_8x8
.L_return_blockcopy_sp_8x8:
    RET
endfunc
However, I do not believe that this code takes full advantage of the SVE vector-length-agnostic approach.
For example, the LD1D instruction reads only 64 bits from each address before it jumps to the next location in memory,
so it might be the case that the z3 register is not fully loaded with 16 bytes of data.
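
One alternative I have been considering is to drop the gathers entirely, keep the scalar pointer arithmetic from the NEON version, and use contiguous predicated loads together with the truncating st1b store. A rough, untested sketch for bx = by = 8:

function PFX(blockcopy_sp_8x8_sve)
    mov     w9, #8
    whilelt p0.h, wzr, w9       // predicate covering the 8 .h elements of a row
    lsl     x3, x3, #1          // source stride is in int16_t units
.rept 8
    ld1h    {z0.h}, p0/z, [x2]  // load one row of 16-bit coefficients
    add     x2, x2, x3
    st1b    {z0.h}, p0, [x0]    // truncating store: keeps the low byte of each element
    add     x0, x0, x1
.endr
    ret
endfunc
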
Can you please tell me what I am doing wrong?
Thank you in advance.
  • Hi Akis,

    Thanks for sharing the other kernels.

    For optimising for the smaller h values, I was referring to the fact that you
    can usually avoid the additional latency associated with first creating a zero
    accumulator and then accumulating into it. This probably doesn't matter very
    much especially for larger h values, but for smaller cases it could make a
    small but noticeable difference. Taking your example in your latest post, we
    could imagine peeling the first iteration of the rept and doing something like:

    function PFX(pixel_sse_pp_4x4_sve2)
      ptrue p0.s, vl4
    
      // peel first iteration
      ld1b {z0.s}, p0/z, [x0]
      add x0, x0, x1
      ld1b {z17.s}, p0/z, [x2]
      add x2, x2, x3
      sub z0.s, p0/m, z0.s, z17.s
      mul z0.s, p0/m, z0.s, z0.s // only need mul here rather than mla
    
    .rept 4
      ld1b {z16.s}, p0/z, [x0]
      add x0, x0, x1
      ld1b {z17.s}, p0/z, [x2]
      add x2, x2, x3
      sub z16.s, p0/m, z16.s, z17.s
      mla z0.s, p0/m, z16.s, z16.s
    .endr
      uaddv d0, p0, z0.s
      fmov w0, s0
      ret
    endfunc

    Again I don't think that this really makes any significant difference in
    practice, but in general you can sometimes find similar small optimisations in
    the first iterations of such loops.

    For the pixel_sad code I think that your SVE2 implementation is currently doing
    twice as many load instructions as is strictly needed. Although the ld1b->.h
    zero extension is nice here, we are effectively losing half of our load bandwidth
    (since the instruction is loading 8 bytes rather than 16). In SVE2 you can
    actually do better than this by instead taking advantage of the UABAL[BT]
    instructions to mirror the widening behaviour of the existing Neon code. This
    would look something like:

      ...
      ptrue p0.b, vl16 // .b instead of .h, vl16 instead of vl8
    .rept 16
      ld1b {z0.b}, p0/z, [x0] // z0.b instead of z0.h
      add x0, x0, x1
      ld1b {z2.b}, p0/z, [x2] // z2.b instead of z2.h
      add x2, x2, x3
      uabalb z16.h, z0.b, z2.b
      uabalt z16.h, z0.b, z2.b
      // ^^ ^ ^ inputs now .b, instruction widens to .h
    .endr

    Hope that helps!

    Thanks,
    George

  • Hi George,

    Thanks for your comments. I followed your advice for both the small h values case and the pixel_sad function.

    I have the following extra questions:

    1) How about when I need to load 12 bytes of data? For example, I have the following NEON code:

    function PFX(pixel_sad_12x\h\()_neon)
        movrel          x12, sad12_mask
        ld1             {v31.16b}, [x12]
        movi            v16.16b, #0
        movi            v17.16b, #0
        mov             w9, #\h/8
    .loop_12x\h:
        sub             w9, w9, #1
    .rept 4
        ld1             {v0.16b}, [x0], x1
        and             v0.16b, v0.16b, v31.16b
        ld1             {v1.16b}, [x2], x3
        and             v1.16b, v1.16b, v31.16b
        ld1             {v2.16b}, [x0], x1
        and             v2.16b, v2.16b, v31.16b
        ld1             {v3.16b}, [x2], x3
        and             v3.16b, v3.16b, v31.16b
        uabal           v16.8h, v0.8b, v1.8b
        uabal2          v17.8h, v0.16b, v1.16b
        uabal           v16.8h, v2.8b, v3.8b
        uabal2          v17.8h, v2.16b, v3.16b
    .endr
        cbnz            w9, .loop_12x\h
        add             v16.8h, v16.8h, v17.8h
        uaddlv          s0, v16.8h
        fmov            w0, s0
        ret
    endfunc
    
    const sad12_mask, align=8
    .byte 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 0, 0, 0, 0
    endconst

    I have developed the following SVE2 code:

    function PFX(pixel_sad_12x\h\()_sve2)
        mov             z16.d, #0
        ptrue           p0.h, vl8
        ptrue           p1.h, vl4
        mov             x11, #8
    .rept \h
        ld1b            {z0.h}, p0/z, [x0]
        ld1b            {z1.h}, p1/z, [x0, x11]
        add             x0, x0, x1
        ld1b            {z8.h}, p0/z, [x2]
        ld1b            {z9.h}, p1/z, [x2, x11]
        add             x2, x2, x3
        uaba            z16.h, z0.h, z8.h
        uaba            z16.h, z1.h, z9.h
    .endr
        uaddv           d5, p0, z16.h
        fmov            w0, s5
        ret
    endfunc

    Here again I am using ld1 instructions to also widen the loaded values, but since I cannot define the ptrues to match exactly (12 bytes of data are needed), is there a better way?

    2) Should I use two ld1 instructions or one ld2 instruction whenever it is possible? For example, I have the following NEON piece of code:

    function PFX(pixel_sse_pp_32x32_neon)
        mov             w12, #8
        movi            v0.16b, #0
        movi            v1.16b, #0
    .loop_sse_pp_32:
        sub             w12, w12, #1
    .rept 4
        ld1             {v16.16b,v17.16b}, [x0], x1
        ld1             {v18.16b,v19.16b}, [x2], x3
        usubl           v2.8h, v16.8b, v18.8b
        usubl2          v3.8h, v16.16b, v18.16b
        usubl           v4.8h, v17.8b, v19.8b
        usubl2          v5.8h, v17.16b, v19.16b
        smlal           v0.4s, v2.4h, v2.4h
        smlal2          v1.4s, v2.8h, v2.8h
        smlal           v0.4s, v3.4h, v3.4h
        smlal2          v1.4s, v3.8h, v3.8h
        smlal           v0.4s, v4.4h, v4.4h
        smlal2          v1.4s, v4.8h, v4.8h
        smlal           v0.4s, v5.4h, v5.4h
        smlal2          v1.4s, v5.8h, v5.8h
    .endr
        cbnz            w12, .loop_sse_pp_32
        add             v0.4s, v0.4s, v1.4s
        trn2            v1.2d, v0.2d, v0.2d
        add             v0.2s, v0.2s, v1.2s
        addp            v0.2s, v0.2s, v0.2s
        fmov            w0, s0
        ret
    endfunc

    I have developed the following SVE2 code:

    function PFX(pixel_sse_pp_32x32_sve2)
        rdvl            x9, #1
        cmp             x9, #16
        bgt             .vl_gt_16_pixel_sse_pp_32x32
        ptrue           p0.b, vl16
        mov             z20.d, #0
        mov             z21.d, #0
        mov             x11, #16
    .rept 32
        ld1b            {z16.b}, p0/z, [x0]
        ld1b            {z17.b}, p0/z, [x0, x11]
        add             x0, x0, x1
        ld1b            {z18.b}, p0/z, [x2]
        ld1b            {z19.b}, p0/z, [x2, x11]
        add             x2, x2, x3
        usublb          z1.h, z16.b, z18.b
        usublt          z2.h, z16.b, z18.b
        usublb          z3.h, z17.b, z19.b
        usublt          z4.h, z17.b, z19.b
        smlalb          z20.s, z1.h, z1.h
        smlalt          z21.s, z1.h, z1.h
        smlalb          z20.s, z2.h, z2.h
        smlalt          z21.s, z2.h, z2.h
        smlalb          z20.s, z3.h, z3.h
        smlalt          z21.s, z3.h, z3.h
        smlalb          z20.s, z4.h, z4.h
        smlalt          z21.s, z4.h, z4.h
    .endr
        uaddv           d3, p0, z20.s
        fmov            w0, s3
        uaddv           d4, p0, z21.s
        fmov            w1, s4
        add             w0, w0, w1
        ret
    .vl_gt_16_pixel_sse_pp_32x32:
        ptrue           p0.b, vl32
        mov             z20.d, #0
        mov             z21.d, #0
    .rept 32
        ld1b            {z16.b}, p0/z, [x0]
        add             x0, x0, x1
        ld1b            {z18.b}, p0/z, [x2]
        add             x2, x2, x3
        usublb          z1.h, z16.b, z18.b
        usublt          z2.h, z16.b, z18.b
        smlalb          z20.s, z1.h, z1.h
        smlalt          z21.s, z1.h, z1.h
        smlalb          z20.s, z2.h, z2.h
        smlalt          z21.s, z2.h, z2.h
    .endr
        uaddv           d3, p0, z20.s
        fmov            w0, s3
        uaddv           d4, p0, z21.s
        fmov            w1, s4
        add             w0, w0, w1
        ret
    endfunc

    Here, for example, I think I can use one ld2 instead of two ld1 instructions. Should I go for it? Also, by the way, can you identify anything in the code that could be done more efficiently?

    3) I have developed the following piece of code using SVE2:

    function PFX(addAvg_32x\h\()_sve2)
        mov             z30.b, #0x40
        mov             w12, #\h
        rdvl            x9, #1
        cmp             x9, #16
        bgt             .vl_gt_16_addAvg_32x\h
        ptrue           p0.b, vl16
        ptrue           p1.h, vl8
    .loop_eq_16_sve2_addavg_32x\h\():
        sub             w12, w12, #1
        mov             x14, #16
        ld1b            {z0.b}, p0/z, [x0]
        ld1b            {z1.b}, p0/z, [x0, x14]
        add             x14, x14, #16
        ld1b            {z2.b}, p0/z, [x0, x14]
        add             x14, x14, #16
        ld1b            {z3.b}, p0/z, [x0, x14]
        mov             x14, #16
        ld1b            {z4.b}, p0/z, [x1]
        ld1b            {z5.b}, p0/z, [x1, x14]
        add             x14, x14, #16
        ld1b            {z6.b}, p0/z, [x1, x14]
        add             x14, x14, #16
        ld1b            {z7.b}, p0/z, [x1, x14]
        add             x0, x0, x3, lsl #1
        add             x1, x1, x4, lsl #1
        add             z0.h, p1/m, z0.h, z4.h
        add             z1.h, p1/m, z1.h, z5.h
        add             z2.h, p1/m, z2.h, z6.h
        add             z3.h, p1/m, z3.h, z7.h
        saddlb          z4.s, z0.h, z30.h
        saddlt          z5.s, z0.h, z30.h
        saddlb          z6.s, z1.h, z30.h
        saddlt          z7.s, z1.h, z30.h
        saddlb          z8.s, z2.h, z30.h
        saddlt          z9.s, z2.h, z30.h
        saddlb          z10.s, z3.h, z30.h
        saddlt          z11.s, z3.h, z30.h
        shrnb           z0.h, z4.s, #7
        shrnt           z0.h, z5.s, #7
        shrnb           z1.h, z6.s, #7
        shrnt           z1.h, z7.s, #7
        shrnb           z2.h, z8.s, #7
        shrnt           z2.h, z9.s, #7
        shrnb           z3.h, z10.s, #7
        shrnt           z3.h, z11.s, #7
        sqxtunb         z0.b, z0.h
        sqxtunb         z1.b, z1.h
        sqxtunb         z2.b, z2.h
        sqxtunb         z3.b, z3.h
        mov             x14, #8
        st1b            {z0.h}, p1, [x2]
        st1b            {z1.h}, p1, [x2, x14]
        add             x14, x14, #8
        st1b            {z2.h}, p1, [x2, x14]
        add             x14, x14, #8
        st1b            {z3.h}, p1, [x2, x14]
        add             x2, x2, x5
        cbnz            w12, .loop_eq_16_sve2_addavg_32x\h
        ret
    .vl_gt_16_addAvg_32x\h\():
        cmp             x9, #48
        bgt             .vl_gt_48_addAvg_32x\h
        ptrue           p0.b, vl32
        ptrue           p1.h, vl16
        mov             x11, #32
        mov             x10, #16
    .loop_gt_eq_32_sve2_addavg_32x\h\():
        sub             w12, w12, #1
        ld1b            {z0.b}, p0/z, [x0]
        ld1b            {z1.b}, p0/z, [x0, x11]
        ld1b            {z2.b}, p0/z, [x1]
        ld1b            {z3.b}, p0/z, [x1, x11]
        add             x0, x0, x3, lsl #1
        add             x1, x1, x4, lsl #1
        add             z0.h, p1/m, z0.h, z2.h
        add             z1.h, p1/m, z1.h, z3.h
        saddlb          z4.s, z0.h, z30.h
        saddlt          z5.s, z0.h, z30.h
        saddlb          z6.s, z1.h, z30.h
        saddlt          z7.s, z1.h, z30.h
        shrnb           z0.h, z4.s, #7
        shrnt           z0.h, z5.s, #7
        shrnb           z1.h, z6.s, #7
        shrnt           z1.h, z7.s, #7
        sqxtunb         z0.b, z0.h
        sqxtunb         z1.b, z1.h
        st1b            {z0.h}, p1, [x2]
        st1b            {z1.h}, p1, [x2, x10]
        add             x2, x2, x5
        cbnz            w12, .loop_gt_eq_32_sve2_addavg_32x\h
        ret
    .vl_gt_48_addAvg_32x\h\():
        ptrue           p0.b, vl64
        ptrue           p1.h, vl32
    .loop_eq_64_sve2_addavg_32x\h\():
        sub             w12, w12, #1
        ld1b            {z0.b}, p0/z, [x0]
        ld1b            {z1.b}, p0/z, [x1]
        add             x0, x0, x3, lsl #1
        add             x1, x1, x4, lsl #1
        add             z0.h, p1/m, z0.h, z1.h
        saddlb          z1.s, z0.h, z30.h
        saddlt          z2.s, z0.h, z30.h
        shrnb           z0.h, z1.s, #7
        shrnt           z0.h, z2.s, #7
        sqxtunb         z0.b, z0.h
        st1b            {z0.h}, p1, [x2]
        add             x2, x2, x5
        cbnz            w12, .loop_eq_64_sve2_addavg_32x\h
        ret
    endfunc

    As you can see, I am creating the offsets by adding #16 in each iteration. Should I move their computation outside the loop and have something like this instead?

    mov     x14, #16
    mov     x15, #32
    mov     x16, #48

    and then use x14-x16 as offsets inside the loop without any extra computation? I think the second solution should be better, but in my tests it actually degrades performance. That's why I am asking for your confirmation. Again, if you identify anything else in the function that can be done better, please inform me.

    4) I have the following piece of code in NEON:

    function PFX(pixel_add_ps_16x\h\()_neon)
        lsl             x5, x5, #1
        mov             w12, #\h / 8
    .loop_add_ps_16x\h\():
        sub             w12, w12, #1
    .rept 4
        ld1             {v0.16b}, [x2], x4
        ld1             {v1.16b}, [x2], x4
        ld1             {v16.8h-v17.8h}, [x3], x5
        ld1             {v18.8h-v19.8h}, [x3], x5
        uxtl            v4.8h, v0.8b
        uxtl2           v5.8h, v0.16b
        uxtl            v6.8h, v1.8b
        uxtl2           v7.8h, v1.16b
        add             v24.8h, v4.8h, v16.8h
        add             v25.8h, v5.8h, v17.8h
        add             v26.8h, v6.8h, v18.8h
        add             v27.8h, v7.8h, v19.8h
        sqxtun          v4.8b, v24.8h
        sqxtun2         v4.16b, v25.8h
        sqxtun          v5.8b, v26.8h
        sqxtun2         v5.16b, v27.8h
        st1             {v4.16b}, [x0], x1
        st1             {v5.16b}, [x0], x1
    .endr
        cbnz            w12, .loop_add_ps_16x\h
        ret
    endfunc
    .endm

    I have developed the following SVE2 version for it:

    function PFX(pixel_add_ps_16x\h\()_sve2)
        ptrue           p0.b, vl16
        mov             x10, #8
    .rept \h
        ld1b            {z0.b}, p0/z, [x2]
        add             x2, x2, x4
        ld1h            {z2.h}, p0/z, [x3]
        ld1h            {z3.h}, p0/z, [x3, x10, lsl #1]
        add             x3, x3, x5, lsl #1
        uunpklo         z6.h, z0.b
        uunpkhi         z7.h, z0.b
        add             z24.h, z6.h, z2.h
        add             z25.h, z7.h, z3.h
        sqxtunb         z6.b, z24.h
        sqxtunb         z7.b, z25.h
        st1b            {z6.h}, p0, [x0]
        st1b            {z7.h}, p0, [x0, x10]
        add             x0, x0, x1
    .endr
        ret
    endfunc
    .endm

    Although it works, I am using two st1 instructions. No matter how much I tried, I couldn't find a better solution. Do you have a more efficient solution in mind?

    Thank you in advance,

    Akis

  • Hi Akis,

    1)

    To get a predicate for the first 12 elements you should be able to use a
    WHILELT instruction, something like:

    mov w0, #12
    whilelt p0.b, wzr, w0

    Then I think you can use the widening absolute difference instruction as
    in my earlier reply, ending up with something like:

      mov w0, #12
      whilelt p0.b, wzr, w0
    .rept \h
      ld1b {z0.b}, p0/z, [x0]
      add x0, x0, x1
      ld1b {z8.b}, p0/z, [x2]
      add x2, x2, x3
      uabalb z16.h, z0.b, z8.b
      uabalt z16.h, z0.b, z8.b
    .endr

    It could also be worth using separate accumulators for the UABALB and UABALT
    instructions and summing them together at the end, but whether the overhead of
    the addition at the end is worth it probably depends on the value of \h since
    the accumulator latency of these instructions is only a single cycle on both
    Neoverse N2 and Neoverse V2.

    2)

    The details of things like ld2 instructions can be found in the software
    optimization guides linked previously. It is a long way to scroll back through
    this thread now, so here they are again:

    Neoverse V2: developer.arm.com/.../latest
    Neoverse N2: developer.arm.com/.../latest

    In particular, here you can see that the SVE LD2B instruction has both a higher
    latency and less than half the throughput (1 vs 3) compared to LD1B, so in this
    case we can say that a pair of LD1B should be preferred instead.

    The rest of the code looks reasonable to me!

    3)

    Moving the constants outside the loop makes sense to me. If you are already
    guarding that particular code path such that it is specific to a 128-bit vector
    length you could also just use the vl-scaled immediate loads, e.g.

    mov x14, #16
    ld1b {z0.b}, p0/z, [x0]
    ld1b {z1.b}, p0/z, [x0, x14]
    add x14, x14, #16
    ld1b {z2.b}, p0/z, [x0, x14]
    add x14, x14, #16
    ld1b {z3.b}, p0/z, [x0, x14]

    can become

    ld1b {z0.b}, p0/z, [x0]
    ld1b {z1.b}, p0/z, [x0, #1, mul vl]
    ld1b {z2.b}, p0/z, [x0, #2, mul vl]
    ld1b {z3.b}, p0/z, [x0, #3, mul vl]

    Also, I'm not 100% sure but I notice that z30 is initialised to #0x40 and added
    before the narrowing shifts. I am wondering if you can instead use the existing
    rounding-and-narrowing shift instructions to avoid needing this addition at
    all. For the code like:

    saddlb z4.s, z0.h, z30.h
    saddlt z5.s, z0.h, z30.h
    shrnb z0.h, z4.s, #7
    shrnt z0.h, z5.s, #7
    sqxtunb z0.b, z0.h

    Perhaps this can instead simply be:

    sqrshrnb z0.b, z0.h, #7

    For more information, see the instruction documentation here:
    developer.arm.com/.../SQRSHRNB--Signed-saturating-rounding-shift-right-narrow-by-immediate--bottom--

    Let me know if this doesn't work and I can investigate in more detail, but it
    feels possible.

    4)

    As you correctly point out I think there is no nice way of using a single store
    here, since SQXTUNT writes to the top half of each pair of elements (as opposed
    to SQXTUN2 in Neon which writes to the top half of the overall vector).

    I think you can use a pair of zero-extending loads (ld1b {z0.h}) rather than
    UUNPKLO/HI to save one instruction here, but that is the only obvious
    improvement here that I can see.
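
    Untested, but with zero-extending loads the loop body might look something
    like this (assuming a ptrue p1.h, vl8 predicate and x10 = #8, mirroring the
    register assignments in your kernel):

    ld1b {z6.h}, p1/z, [x2]              // zero-extend the first 8 pixels to .h
    ld1b {z7.h}, p1/z, [x2, x10]         // zero-extend the next 8 pixels
    add x2, x2, x4
    ld1h {z2.h}, p1/z, [x3]
    ld1h {z3.h}, p1/z, [x3, x10, lsl #1] // offset scaled to halfwords
    add x3, x3, x5, lsl #1
    add z24.h, z6.h, z2.h
    add z25.h, z7.h, z3.h
    sqxtunb z6.b, z24.h
    sqxtunb z7.b, z25.h
    st1b {z6.h}, p1, [x0]
    st1b {z7.h}, p1, [x0, x10]
    add x0, x0, x1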

    Thanks,
    George

  • Hi George,

    1) Thanks! To be honest, I completely forgot about the tricks that you can do with the whilelt instruction.

    2) OK. So, I will continue using two ld1 instructions instead of one ld2. However, I have some questions. Based on the Neoverse N2 optimization guide you provided, for load instructions the latency represents the maximum latency to load all the vector registers written by the instruction. The latencies for ld1b and ld2b are 6 and 8 respectively. So, two ld1b instructions need 6+6=12 cycles, while one ld2b needs 8. Doesn't this imply that the execution time of two ld1b is greater than one ld2b? To be honest, I didn't understand what execution throughput is.

    3) Thanks for your comments. Unfortunately, the version of the code that uses the 'sqrshrnb' instruction does not pass the tests. Don't we also need the addition with '0x40' that the code without sqrshrnb does?

    4) I have figured out how we can use only one store (but ld2 should be used for loading):

    function PFX(pixel_add_ps_16x\h\()_sve2)
        ptrue           p0.b, vl16
        ptrue           p2.b, vl8
    .rept \h
        ld2b            {z0.b, z1.b}, p2/z, [x2]
        add             x2, x2, x4
        ld2h            {z2.h, z3.h}, p0/z, [x3]
        add             x3, x3, x5, lsl #1
        uunpklo         z5.h, z0.b
        uunpklo         z6.h, z1.b
        add             z24.h, z5.h, z2.h
        add             z25.h, z6.h, z3.h
        sqxtunb         z5.b, z24.h
        sqxtunt         z5.b, z25.h
        st1b            {z5.b}, p0, [x0]
        add             x0, x0, x1
    .endr
        ret
    endfunc

    Do you think that this version of the code is faster than the one which uses two stores? Also, if your answer is no, how about using the NEON code when the vector size is equal to 128 bits and the SVE2 code when the vector size is greater than 128 bits? I can easily do this as I am checking and branching based on that size (if of course the NEON code is faster than the SVE2 with the two stores). What do you think?

    Thanks for everything!

    Akis

  • Hi Akis,

    2)

    The execution throughput here is referring to the number of instances of that
    instruction that can be started on the same cycle. For example, LD2B has a
    throughput of 1, meaning that every cycle one of these instructions can begin
    (but each one still takes 8 cycles until the result is available for use by
    any instruction that depends on it).

    This means that for a series of independent LD1B instructions with a latency of
    6 cycles and a throughput of 3 per cycle:

    ld1b {z0.b}, ... // starts on cycle 0, completed on cycle 6
    ld1b {z1.b}, ... // starts on cycle 0, completed on cycle 6
    ld1b {z2.b}, ... // starts on cycle 0, completed on cycle 6
    ld1b {z3.b}, ... // starts on cycle 1, completed on cycle 7
    ld1b {z4.b}, ... // starts on cycle 1, completed on cycle 7
    ld1b {z5.b}, ... // starts on cycle 1, completed on cycle 7

    And for the same amount of data loaded with LD2B instructions with a latency of
    8 cycles and a throughput of 1 per cycle:

    ld2b {z0.b, z1.b}, ... // starts on cycle 0, completed on cycle 8
    ld2b {z2.b, z3.b}, ... // starts on cycle 1, completed on cycle 9
    ld2b {z4.b, z5.b}, ... // starts on cycle 2, completed on cycle 10

    You can see that even though we required fewer LD2B instructions, the
    combination of the worse throughput and higher latency compared to LD1B means
    that it is preferable to use LD1B in this instance. More generally, which
    sequence is preferable will depend on your micro-architecture of interest, and
    the overall picture may change if there are other instructions being executed
    at the same time which might compete for the same execution resources on the
    CPU, but this is fine for an initial estimate at least.

    3)

    The 0x40 here should in theory be handled by the "rounding" part of sqrshrnb. I
    think the difference is that the original code initialised every .b element to
    0x40 rather than every .h element as I imagined. I'm not really sure why the
    code initialises .b elements given that the rest of the arithmetic is done with
    .h. Assuming that the code you showed is definitely what you want, I think you
    can correct for this by just adding 0x80 at the end:

    sqrshrnb z0.b, z0.h, #7
    add z0.b, z0.b, #0x80

    I had a go at writing a very quick test case to check this, hopefully this
    mostly makes sense:

    .global foo
    foo:
      mov z30.b, #0x40
      saddlb z4.s, z0.h, z30.h
      saddlt z5.s, z0.h, z30.h
      shrnb z0.h, z4.s, #7
      shrnt z0.h, z5.s, #7
      sqxtunb z0.b, z0.h
      ret
    
    .global bar
    bar:
      sqrshrnb z0.b, z0.h, #7
      add z0.b, z0.b, #0x80
      ret

    #include <arm_neon.h>
    #include <stdio.h>
    
    int8x8_t foo(int16x4_t x);
    int8x8_t bar(int16x4_t x);
    
    int main() {
      for (int i=-32768; i<32768; ++i) {
        int y = foo(vdup_n_s16(i))[0];
        int z = bar(vdup_n_s16(i))[0];
        if (y != z) {
          printf("%d -> %d or %d\n", i, y, z);
        }
      }
    }
    

    4)

    Of course there is nothing stopping you from using Neon for the 128-bit cases,
    or even using Neon for just the sqxtun(2) instructions and nothing else since
    at vl=128 the registers are identical and overlap. Having said that, I don't
    think that having two stores here is much of a problem. Did you try a version
    using the zero-extending loads instead, since I think that would still beat the
    Neon version by not needing the uxtl(2)/uunpk(lo/hi) instructions?
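
    Just to illustrate the register-overlap point, here is a hypothetical fragment
    that is only valid when the vector length is exactly 128 bits:

    add     z24.h, z6.h, z2.h  // SVE arithmetic as before
    add     z25.h, z7.h, z3.h
    sqxtun  v4.8b, v24.8h      // v24 aliases the low 128 bits of z24
    sqxtun2 v4.16b, v25.8h
    st1     {v4.16b}, [x0], x1 // a single Neon store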

    Thanks,
    George

  • Hi George,

    2) Now I understand. Thanks. Just one question: does the execution throughput refer to instances of the same instruction? I mean, which one of the following is preferable:

    ld1b    {z0.b}, p0/z, [x0]
    ld1b    {z1.b}, p0/z, [x1]
    add     x0, x0, x5
    add     x1, x1, x6

    or

    ld1b    {z0.b}, p0/z, [x0]
    add     x0, x0, x5
    ld1b    {z1.b}, p0/z, [x1]
    add     x1, x1, x6

    If the execution throughput refers to instances of the same instruction, I guess the first option is the best. Or is it?

    3) Your code works perfectly. Thanks!

    4) I switched back to using the zero-extending loads instead. Regarding the performance, I think it is better, as you said. Thanks!

    I might come back to you if I need anything else. Thanks for everything!

    BR,

    Akis

  • Hi Akis,

    Throughput in this case is referring to the number of instances of the same
    instruction that can begin execution on each cycle. The exact code layout is
    not particularly important for large out-of-order cores like Neoverse N2 or
    Neoverse V2, so I would expect both arrangements to perform more or less the
    same. The bottleneck in such cores is instead usually any dependency chain
    between instructions; for example, in the case of the load instructions here,
    the loads cannot begin execution until the addresses in x0 and x1 have been
    calculated.
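
    As a purely illustrative fragment (hypothetical scratch registers, not taken
    from any of the kernels above), one way to shorten such a chain is to compute
    the next row's addresses into separate registers so that they do not depend
    on a serial pointer update:

    ld1b {z0.b}, p0/z, [x0]  // loads for the current row
    ld1b {z1.b}, p0/z, [x2]
    add  x9, x0, x1          // next-row addresses depend only on the original
    add  x10, x2, x3         // pointers, not on the loads above
    ld1b {z2.b}, p0/z, [x9]  // next-row loads can issue as soon as x9/x10 are ready
    ld1b {z3.b}, p0/z, [x10]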

    Glad to hear the new code worked as expected!

    Thanks,
    George