Improve Performance of specific NEON functions using SVE/SVE2

Hello,

I have the following 3 functions that utilize NEON instruction set:

function pixel_avg2_w8_neon, export=1
1:
    subs        w5,  w5,  #2
    ld1         {v0.8b}, [x2], x3
    ld1         {v2.8b}, [x4], x3
    urhadd      v0.8b,  v0.8b,  v2.8b
    ld1         {v1.8b}, [x2], x3
    ld1         {v3.8b}, [x4], x3
    urhadd      v1.8b,  v1.8b,  v3.8b
    st1         {v0.8b}, [x0], x1
    st1         {v1.8b}, [x0], x1
    b.gt        1b
    ret
endfunc

function pixel_avg2_w16_neon, export=1
1:
    subs        w5,  w5,  #2
    ld1         {v0.16b}, [x2], x3
    ld1         {v2.16b}, [x4], x3
    urhadd      v0.16b, v0.16b, v2.16b
    ld1         {v1.16b}, [x2], x3
    ld1         {v3.16b}, [x4], x3
    urhadd      v1.16b, v1.16b, v3.16b
    st1         {v0.16b}, [x0], x1
    st1         {v1.16b}, [x0], x1
    b.gt        1b
    ret
endfunc

function pixel_sad_\h\()_neon, export=1
    ld1         {v1.16b}, [x2], x3
    ld1         {v0.16b}, [x0], x1
    ld1         {v3.16b}, [x2], x3
    ld1         {v2.16b}, [x0], x1
    uabdl       v16.8h,  v0.8b,  v1.8b
    uabdl2      v17.8h,  v0.16b, v1.16b
    uabal       v16.8h,  v2.8b,  v3.8b
    uabal2      v17.8h,  v2.16b, v3.16b

.rept \h / 2 - 1
    ld1         {v1.16b}, [x2], x3
    ld1         {v0.16b}, [x0], x1
    ld1         {v3.16b}, [x2], x3
    ld1         {v2.16b}, [x0], x1
    uabal       v16.8h,  v0.8b,  v1.8b
    uabal2      v17.8h,  v0.16b, v1.16b
    uabal       v16.8h,  v2.8b,  v3.8b
    uabal2      v17.8h,  v2.16b, v3.16b
.endr
    add         v16.8h,  v16.8h,  v17.8h
    uaddlv      s0,  v16.8h
    fmov        w0,  s0
    ret
endfunc

I want to use SVE/SVE2 instructions set to improve the performance of these functions. My testbed is Alibaba Yitian 710 (vector size=128 bits).

For the first 2, I couldn't find a way to improve the performance. For the latter, I wrote the following function:

function pixel_sad_\h\()_sve, export=1
    ptrue       p0.h, vl8
    ld1b        {z1.h}, p0/z, [x2]
    ld1b        {z4.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z3.h}, p0/z, [x2]
    ld1b        {z6.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z0.h}, p0/z, [x0]
    ld1b        {z5.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    ld1b        {z2.h}, p0/z, [x0]
    ld1b        {z7.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    uabd        v16.8h,  v0.8h,  v1.8h
    uabd        v17.8h,  v4.8h,  v5.8h
    uaba        v16.8h,  v2.8h,  v3.8h
    uaba        v17.8h,  v7.8h,  v6.8h

.rept \h / 2 - 1
    ld1b        {z1.h}, p0/z, [x2]
    ld1b        {z4.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z3.h}, p0/z, [x2]
    ld1b        {z6.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z0.h}, p0/z, [x0]
    ld1b        {z5.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    ld1b        {z2.h}, p0/z, [x0]
    ld1b        {z7.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    uaba        v16.8h,  v0.8h,  v1.8h
    uaba        v17.8h,  v4.8h,  v5.8h
    uaba        v16.8h,  v2.8h,  v3.8h
    uaba        v17.8h,  v7.8h,  v6.8h
.endr
    
    add         v16.8h,  v16.8h,  v17.8h
    uaddlv      s0,  v16.8h
    fmov        w0,  s0
    ret
endfunc

However, this degrades the performance instead of improving it.

Can someone help me?

Thank you in advance,

Akis

Parents
  • Hi Akis,

    Good to hear that my suggestions for quant_4x4x4_neon worked as we expected!

    For sub8x8_dct8_neon I wonder if we can still remove some of the shift instructions by combining with the successor addition and using the SSRA instruction to perform a shift and addition in a single instruction:

    https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/SSRA--Signed-Shift-Right-and-Accumulate--immediate--

    For example instead of:

    sshr v23.8h, v21.8h, #1
    add v23.8h, v23.8h, v21.8h

    We could instead consider something like:

    ssra v21.8h, v21.8h, #1    // v21.8h += v21.8h >> 1

    This has the disadvantage that we must reuse the same register as the non-shifted addend so this does not work if we need v21 elsewhere later: in your snippet I think that v21 is used in an ADD and SUB after the SSHR+ADD pair, however I suspect that some of this can be solved by re-ordering the code so that the ADD/SUB are done first and therefore the register can be reused for the SSRA?

    For the copy functions like mc_copy_w16_neon and memcpy_aligned_neon there is probably no benefit from SVE at the same vector length as Neon. One small optimisation you could consider is maintaining multiple independent source and destination addresses (e.g. x0, x0+x1, x0+2*x1, x0+3*x1) and incrementing them independently (e.g. by x1*4), since currently in mc_copy_w16_neon for instance the x0 and x2 addresses must be updated four times per loop iteration which could be slow. I don't expect that would have a big impact in performance though.

    For mbtree_propagate_list_internal_neon I wonder if we can also use the SSRA instruction here as well? We currently have e.g.

    sshr v6.8h, v4.8h, #5
    add v6.8h, v6.8h, v29.8h

    Which could instead be:

    ssra v29.8h, v4.8h, #5

    I guess that doesn't work so well in this case because v28 and v29 are needed for the next iteration of the loop, but even a MOV to duplicate them into another variable may still be better since the constants will not be on the critical path of the calculation.

    The UZP1/UZP2 and later ZIP1/ZIP2 instructions in the loop feels strange since the ZIP1/ZIP2 will undo the effect of the earlier UZP1/UZP2 instructions? Perhaps the other operands (v25) can be adjusted so that both pairs of permutes can either be removed or at least
    replaced with a single permute to swap pairs of lanes so that they can interact with each other (REV32.8H?).

    Finally, I suspect it doesn't work in this case but just mentioning it in case it could be useful: for the UMULL+RSHRN pairs we could consider trying to replace those with something like the UMULH SVE instruction:

    https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/UMULH--unpredicated---Unsigned-multiply-returning-high-half--unpredicated--

    The problems I suspect that we'll encounter trying to use UMULH here are (a) that the shift is a rounding shift which means we cannot usually just take the top half of the multiplication result, and (b) the shift value is only 10 rather than 16. The shift value might not be a problem if you could instead adjust the operands and multiply by (v25 << 6) instead, but that might not be possible depending on the range of possible values for that multiplicand.

    For pixel_var2_8x\h\()_neon I would assume that we could replace the USUBL+SMULL/SMLAL pairs with UABD+UDOT as we have done previously. Since we have one continguous array it may also be worth loading into full vectors of data here rather than only using half a vector at a time, e.g.

    ld1 {v16.8b}, [x0], #8
    ld1 {v18.8b}, [x1], x3
    ld1 {v17.8b}, [x0], #8
    ld1 {v19.8b}, [x1], x3

    Could be something like:

    ld1 {v16.16b}, [x0], #16 // Merged from v16 and v17.
    ld1 {v18.8b}, [x1], x3
    ld1 {v18.d}[1], [x1], x3 // Load into high half of v18, not v19.

    For pixel_sad_x_h\()_neon_10 I agree with your conclusion. I don't think that there will be much benefit from dot product here since there is never a widening operation, so the UABA instruction is able to operate on full vectors rather than on only half of a vector like in some of our previous examples where we have used UMLAL or UABAL etc.

    Hope that helps!

    Thanks,
    George

Reply
  • Hi Akis,

    Good to hear that my suggestions for quant_4x4x4_neon worked as we expected!

    For sub8x8_dct8_neon I wonder if we can still remove some of the shift instructions by combining with the successor addition and using the SSRA instruction to perform a shift and addition in a single instruction:

    https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/SSRA--Signed-Shift-Right-and-Accumulate--immediate--

    For example instead of:

    sshr v23.8h, v21.8h, #1
    add v23.8h, v23.8h, v21.8h

    We could instead consider something like:

    ssra v21.8h, v21.8h, #1    // v21.8h += v21.8h >> 1

    This has the disadvantage that we must reuse the same register as the non-shifted addend so this does not work if we need v21 elsewhere later: in your snippet I think that v21 is used in an ADD and SUB after the SSHR+ADD pair, however I suspect that some of this can be solved by re-ordering the code so that the ADD/SUB are done first and therefore the register can be reused for the SSRA?

    For the copy functions like mc_copy_w16_neon and memcpy_aligned_neon there is probably no benefit from SVE at the same vector length as Neon. One small optimisation you could consider is maintaining multiple independent source and destination addresses (e.g. x0, x0+x1, x0+2*x1, x0+3*x1) and incrementing them independently (e.g. by x1*4), since currently in mc_copy_w16_neon for instance the x0 and x2 addresses must be updated four times per loop iteration which could be slow. I don't expect that would have a big impact in performance though.

    For mbtree_propagate_list_internal_neon I wonder if we can also use the SSRA instruction here as well? We currently have e.g.

    sshr v6.8h, v4.8h, #5
    add v6.8h, v6.8h, v29.8h

    Which could instead be:

    ssra v29.8h, v4.8h, #5

    I guess that doesn't work so well in this case because v28 and v29 are needed for the next iteration of the loop, but even a MOV to duplicate them into another variable may still be better since the constants will not be on the critical path of the calculation.

    The UZP1/UZP2 and later ZIP1/ZIP2 instructions in the loop feels strange since the ZIP1/ZIP2 will undo the effect of the earlier UZP1/UZP2 instructions? Perhaps the other operands (v25) can be adjusted so that both pairs of permutes can either be removed or at least
    replaced with a single permute to swap pairs of lanes so that they can interact with each other (REV32.8H?).

    Finally, I suspect it doesn't work in this case but just mentioning it in case it could be useful: for the UMULL+RSHRN pairs we could consider trying to replace those with something like the UMULH SVE instruction:

    https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/UMULH--unpredicated---Unsigned-multiply-returning-high-half--unpredicated--

    The problems I suspect that we'll encounter trying to use UMULH here are (a) that the shift is a rounding shift which means we cannot usually just take the top half of the multiplication result, and (b) the shift value is only 10 rather than 16. The shift value might not be a problem if you could instead adjust the operands and multiply by (v25 << 6) instead, but that might not be possible depending on the range of possible values for that multiplicand.

    For pixel_var2_8x\h\()_neon I would assume that we could replace the USUBL+SMULL/SMLAL pairs with UABD+UDOT as we have done previously. Since we have one continguous array it may also be worth loading into full vectors of data here rather than only using half a vector at a time, e.g.

    ld1 {v16.8b}, [x0], #8
    ld1 {v18.8b}, [x1], x3
    ld1 {v17.8b}, [x0], #8
    ld1 {v19.8b}, [x1], x3

    Could be something like:

    ld1 {v16.16b}, [x0], #16 // Merged from v16 and v17.
    ld1 {v18.8b}, [x1], x3
    ld1 {v18.d}[1], [x1], x3 // Load into high half of v18, not v19.

    For pixel_sad_x_h\()_neon_10 I agree with your conclusion. I don't think that there will be much benefit from dot product here since there is never a widening operation, so the UABA instruction is able to operate on full vectors rather than on only half of a vector like in some of our previous examples where we have used UMLAL or UABAL etc.

    Hope that helps!

    Thanks,
    George

Children