Hello,
I have the following three functions that use the NEON instruction set:
function pixel_avg2_w8_neon, export=1
1:
    subs    w5, w5, #2
    ld1     {v0.8b}, [x2], x3
    ld1     {v2.8b}, [x4], x3
    urhadd  v0.8b, v0.8b, v2.8b
    ld1     {v1.8b}, [x2], x3
    ld1     {v3.8b}, [x4], x3
    urhadd  v1.8b, v1.8b, v3.8b
    st1     {v0.8b}, [x0], x1
    st1     {v1.8b}, [x0], x1
    b.gt    1b
    ret
endfunc
function pixel_avg2_w16_neon, export=1
1:
    subs    w5, w5, #2
    ld1     {v0.16b}, [x2], x3
    ld1     {v2.16b}, [x4], x3
    urhadd  v0.16b, v0.16b, v2.16b
    ld1     {v1.16b}, [x2], x3
    ld1     {v3.16b}, [x4], x3
    urhadd  v1.16b, v1.16b, v3.16b
    st1     {v0.16b}, [x0], x1
    st1     {v1.16b}, [x0], x1
    b.gt    1b
    ret
endfunc
function pixel_sad_\h\()_neon, export=1
    ld1     {v1.16b}, [x2], x3
    ld1     {v0.16b}, [x0], x1
    ld1     {v3.16b}, [x2], x3
    ld1     {v2.16b}, [x0], x1
    uabdl   v16.8h, v0.8b, v1.8b
    uabdl2  v17.8h, v0.16b, v1.16b
    uabal   v16.8h, v2.8b, v3.8b
    uabal2  v17.8h, v2.16b, v3.16b
.rept \h / 2 - 1
    ld1     {v1.16b}, [x2], x3
    ld1     {v0.16b}, [x0], x1
    ld1     {v3.16b}, [x2], x3
    ld1     {v2.16b}, [x0], x1
    uabal   v16.8h, v0.8b, v1.8b
    uabal2  v17.8h, v0.16b, v1.16b
    uabal   v16.8h, v2.8b, v3.8b
    uabal2  v17.8h, v2.16b, v3.16b
.endr
    add     v16.8h, v16.8h, v17.8h
    uaddlv  s0, v16.8h
    fmov    w0, s0
    ret
endfunc
I want to use the SVE/SVE2 instruction set to improve the performance of these functions. My testbed is an Alibaba Yitian 710 (vector size = 128 bits).
For the first two, I couldn't find a way to improve the performance. For the last one, I wrote the following function:
function pixel_sad_\h\()_sve, export=1
    ptrue   p0.h, vl8
    ld1b    {z1.h}, p0/z, [x2]
    ld1b    {z4.h}, p0/z, [x2, #1, mul vl]
    add     x2, x2, x3
    ld1b    {z3.h}, p0/z, [x2]
    ld1b    {z6.h}, p0/z, [x2, #1, mul vl]
    add     x2, x2, x3
    ld1b    {z0.h}, p0/z, [x0]
    ld1b    {z5.h}, p0/z, [x0, #1, mul vl]
    add     x0, x0, x1
    ld1b    {z2.h}, p0/z, [x0]
    ld1b    {z7.h}, p0/z, [x0, #1, mul vl]
    add     x0, x0, x1
    uabd    v16.8h, v0.8h, v1.8h
    uabd    v17.8h, v4.8h, v5.8h
    uaba    v16.8h, v2.8h, v3.8h
    uaba    v17.8h, v7.8h, v6.8h
.rept \h / 2 - 1
    ld1b    {z1.h}, p0/z, [x2]
    ld1b    {z4.h}, p0/z, [x2, #1, mul vl]
    add     x2, x2, x3
    ld1b    {z3.h}, p0/z, [x2]
    ld1b    {z6.h}, p0/z, [x2, #1, mul vl]
    add     x2, x2, x3
    ld1b    {z0.h}, p0/z, [x0]
    ld1b    {z5.h}, p0/z, [x0, #1, mul vl]
    add     x0, x0, x1
    ld1b    {z2.h}, p0/z, [x0]
    ld1b    {z7.h}, p0/z, [x0, #1, mul vl]
    add     x0, x0, x1
    uaba    v16.8h, v0.8h, v1.8h
    uaba    v17.8h, v4.8h, v5.8h
    uaba    v16.8h, v2.8h, v3.8h
    uaba    v17.8h, v7.8h, v6.8h
.endr
    add     v16.8h, v16.8h, v17.8h
    uaddlv  s0, v16.8h
    fmov    w0, s0
    ret
endfunc
However, this degrades the performance instead of improving it.
Can someone help me?
Thank you in advance,
Akis
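For reference, the quantity both SAD routines above compute can be written as a scalar C model. This is only a sketch: the function names and the fixed width of 16 pixels are assumptions made for illustration, not code from the thread. It shows that widening after the byte-wise absolute difference (the UABDL/UABAL pattern) and zero-extending first (the `ld1b {z.h}` loads followed by UABD/UABA) produce the same sum. The two versions therefore differ only in the instructions issued: per pair of rows the NEON loop body has 4 loads, while the SVE version has 8 loads plus 4 address updates, which may be contributing to the slowdown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAD_WIDTH 16  /* assumed block width, matching the 16-byte loads */

/* Absolute difference taken on bytes, then widened and accumulated
 * (the UABDL/UABAL pattern). */
static uint32_t sad16_widen_after(const uint8_t *pix1, ptrdiff_t stride1,
                                  const uint8_t *pix2, ptrdiff_t stride2, int h)
{
    assert(h > 0);
    uint32_t sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < SAD_WIDTH; x++) {
            uint8_t a = pix1[y * stride1 + x];
            uint8_t b = pix2[y * stride2 + x];
            uint8_t d = (uint8_t)(a > b ? a - b : b - a);
            sum += d;
        }
    return sum;
}

/* Bytes zero-extended to 16 bits first, then the absolute difference
 * is taken on the 16-bit values (ld1b {z.h} followed by UABD/UABA). */
static uint32_t sad16_extend_first(const uint8_t *pix1, ptrdiff_t stride1,
                                   const uint8_t *pix2, ptrdiff_t stride2, int h)
{
    assert(h > 0);
    uint32_t sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < SAD_WIDTH; x++) {
            uint16_t a = pix1[y * stride1 + x];
            uint16_t b = pix2[y * stride2 + x];
            uint16_t d = (uint16_t)(a > b ? a - b : b - a);
            sum += d;
        }
    return sum;
}
```

Both functions return the same value for any input, so any speed difference between the two assembly versions has to come from how the work is scheduled, not from what is computed.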
Hi George,
thank you very much for your answer.
For the quant_4x4x4_neon function, everything worked like a charm. Thanks!
For the hpel_filter_neon function, unfortunately I do not have any more information right now. I will check the code again, and if I find out more I will contact you again.
For the sub8x8_dct8_neon function, what you proposed is very interesting. However, it seems that we need both the normal subtraction output and the subtraction output right shifted by one. Below, I am listing again the DCT8_1D macro after substituting the internally used macros with the real instructions:
.macro DCT8_1D_SVE type
    add     v18.8h, v3.8h, v4.8h
    sub     v17.8h, v3.8h, v4.8h
    add     v19.8h, v2.8h, v5.8h
    sub     v16.8h, v2.8h, v5.8h
    add     v22.8h, v1.8h, v6.8h
    sub     v21.8h, v1.8h, v6.8h
    add     v23.8h, v0.8h, v7.8h
    sub     v20.8h, v0.8h, v7.8h
    add     v24.8h, v23.8h, v18.8h
    sub     v26.8h, v23.8h, v18.8h
    add     v25.8h, v22.8h, v19.8h
    sub     v27.8h, v22.8h, v19.8h
    add     v30.8h, v20.8h, v17.8h
    sub     v29.8h, v20.8h, v17.8h
    sshr    v23.8h, v21.8h, #1
    sshr    v18.8h, v16.8h, #1
    add     v23.8h, v23.8h, v21.8h
    add     v18.8h, v18.8h, v16.8h
    sub     v30.8h, v30.8h, v23.8h
    sub     v29.8h, v29.8h, v18.8h
    add     v28.8h, v21.8h, v16.8h
    sub     v31.8h, v21.8h, v16.8h
    sshr    v22.8h, v20.8h, #1
    sshr    v19.8h, v17.8h, #1
    add     v22.8h, v22.8h, v20.8h
    add     v19.8h, v19.8h, v17.8h
    add     v22.8h, v28.8h, v22.8h
    add     v31.8h, v31.8h, v19.8h
    add     v0.8h, v24.8h, v25.8h
    sub     v4.8h, v24.8h, v25.8h
    sshr    v16.8h, v31.8h, #2
    sshr    v17.8h, v22.8h, #2
    add     v1.8h, v22.8h, v16.8h
    sub     v7.8h, v17.8h, v31.8h
    sshr    v18.8h, v27.8h, #1
    sshr    v19.8h, v26.8h, #1
    add     v2.8h, v26.8h, v18.8h
    sub     v6.8h, v19.8h, v27.8h
    sshr    v20.8h, v30.8h, #2
    sshr    v21.8h, v29.8h, #2
    add     v3.8h, v20.8h, v29.8h
    sub     v5.8h, v30.8h, v21.8h
.endm
For example, we subtract v6.8h from v1.8h and place the result in v21.8h. Then we right-shift v21.8h by 1 and place the result in v23.8h, because later on we need both v21.8h and v23.8h. So I do not think we can use shsub, as we would lose v21.8h. Am I missing something?
Also, I have a couple more functions to improve. More specifically:
function mc_copy_w16_neon, export=1
    lsl     x1, x1, #1
    lsl     x3, x3, #1
1:
    subs    w4, w4, #4
    ld1     {v0.8h, v1.8h}, [x2], x3
    ld1     {v2.8h, v3.8h}, [x2], x3
    ld1     {v4.8h, v5.8h}, [x2], x3
    ld1     {v6.8h, v7.8h}, [x2], x3
    st1     {v0.8h, v1.8h}, [x0], x1
    st1     {v2.8h, v3.8h}, [x0], x1
    st1     {v4.8h, v5.8h}, [x0], x1
    st1     {v6.8h, v7.8h}, [x0], x1
    b.gt    1b
    ret
endfunc
function memcpy_aligned_neon, export=1
    tst     x2, #16
    b.eq    32f
    sub     x2, x2, #16
    ldr     q0, [x1], #16
    str     q0, [x0], #16
32:
    tst     x2, #32
    b.eq    640f
    sub     x2, x2, #32
    ldp     q0, q1, [x1], #32
    stp     q0, q1, [x0], #32
640:
    cbz     x2, 1f
64:
    subs    x2, x2, #64
    ldp     q0, q1, [x1, #32]
    ldp     q2, q3, [x1], #64
    stp     q0, q1, [x0, #32]
    stp     q2, q3, [x0], #64
    b.gt    64b
1:
    ret
endfunc
const pw_0to15, align=5
    .short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
endconst

function mbtree_propagate_list_internal_neon, export=1
    movrel  x11, pw_0to15
    dup     v31.8h, w4              // bipred_weight
    movi    v30.8h, #0xc0, lsl #8
    ld1     {v29.8h}, [x11]         //h->mb.i_mb_x,h->mb.i_mb_y
    movi    v28.4s, #4
    movi    v27.8h, #31
    movi    v26.8h, #32
    dup     v24.8h, w5              // mb_y
    zip1    v29.8h, v29.8h, v24.8h
8:
    subs    w6, w6, #8
    ld1     {v1.8h}, [x1], #16      // propagate_amount
    ld1     {v2.8h}, [x2], #16      // lowres_cost
    and     v2.16b, v2.16b, v30.16b
    cmeq    v25.8h, v2.8h, v30.8h
    umull   v16.4s, v1.4h, v31.4h
    umull2  v17.4s, v1.8h, v31.8h
    rshrn   v16.4h, v16.4s, #6
    rshrn2  v16.8h, v17.4s, #6
    bsl     v25.16b, v16.16b, v1.16b // if( lists_used == 3 )
    // propagate_amount = (propagate_amount * bipred_weight + 32) >> 6
    ld1     {v4.8h,v5.8h}, [x0], #32
    sshr    v6.8h, v4.8h, #5
    sshr    v7.8h, v5.8h, #5
    add     v6.8h, v6.8h, v29.8h
    add     v29.8h, v29.8h, v28.8h
    add     v7.8h, v7.8h, v29.8h
    add     v29.8h, v29.8h, v28.8h
    st1     {v6.8h,v7.8h}, [x3], #32
    and     v4.16b, v4.16b, v27.16b
    and     v5.16b, v5.16b, v27.16b
    uzp1    v6.8h, v4.8h, v5.8h     // x & 31
    uzp2    v7.8h, v4.8h, v5.8h     // y & 31
    sub     v4.8h, v26.8h, v6.8h    // 32 - (x & 31)
    sub     v5.8h, v26.8h, v7.8h    // 32 - (y & 31)
    mul     v19.8h, v6.8h, v7.8h    // idx3weight = y*x;
    mul     v18.8h, v4.8h, v7.8h    // idx2weight = y*(32-x);
    mul     v17.8h, v6.8h, v5.8h    // idx1weight = (32-y)*x;
    mul     v16.8h, v4.8h, v5.8h    // idx0weight = (32-y)*(32-x) ;
    umull   v6.4s, v19.4h, v25.4h
    umull2  v7.4s, v19.8h, v25.8h
    umull   v4.4s, v18.4h, v25.4h
    umull2  v5.4s, v18.8h, v25.8h
    umull   v2.4s, v17.4h, v25.4h
    umull2  v3.4s, v17.8h, v25.8h
    umull   v0.4s, v16.4h, v25.4h
    umull2  v1.4s, v16.8h, v25.8h
    rshrn   v19.4h, v6.4s, #10
    rshrn2  v19.8h, v7.4s, #10
    rshrn   v18.4h, v4.4s, #10
    rshrn2  v18.8h, v5.4s, #10
    rshrn   v17.4h, v2.4s, #10
    rshrn2  v17.8h, v3.4s, #10
    rshrn   v16.4h, v0.4s, #10
    rshrn2  v16.8h, v1.4s, #10
    zip1    v0.8h, v16.8h, v17.8h
    zip2    v1.8h, v16.8h, v17.8h
    zip1    v2.8h, v18.8h, v19.8h
    zip2    v3.8h, v18.8h, v19.8h
    st1     {v0.8h,v1.8h}, [x3], #32
    st1     {v2.8h,v3.8h}, [x3], #32
    b.ge    8b
    ret
endfunc
function pixel_var2_8x\h\()_neon, export=1
    mov     x3, #16
    ld1     {v16.8b}, [x0], #8
    ld1     {v18.8b}, [x1], x3
    ld1     {v17.8b}, [x0], #8
    ld1     {v19.8b}, [x1], x3
    mov     x5, \h - 2
    usubl   v0.8h, v16.8b, v18.8b
    usubl   v1.8h, v17.8b, v19.8b
    ld1     {v16.8b}, [x0], #8
    ld1     {v18.8b}, [x1], x3
    smull   v2.4s, v0.4h, v0.4h
    smull2  v3.4s, v0.8h, v0.8h
    smull   v4.4s, v1.4h, v1.4h
    smull2  v5.4s, v1.8h, v1.8h
    usubl   v6.8h, v16.8b, v18.8b
1:
    subs    x5, x5, #1
    ld1     {v17.8b}, [x0], #8
    ld1     {v19.8b}, [x1], x3
    smlal   v2.4s, v6.4h, v6.4h
    smlal2  v3.4s, v6.8h, v6.8h
    usubl   v7.8h, v17.8b, v19.8b
    add     v0.8h, v0.8h, v6.8h
    ld1     {v16.8b}, [x0], #8
    ld1     {v18.8b}, [x1], x3
    smlal   v4.4s, v7.4h, v7.4h
    smlal2  v5.4s, v7.8h, v7.8h
    usubl   v6.8h, v16.8b, v18.8b
    add     v1.8h, v1.8h, v7.8h
    b.gt    1b
    ld1     {v17.8b}, [x0], #8
    ld1     {v19.8b}, [x1], x3
    smlal   v2.4s, v6.4h, v6.4h
    smlal2  v3.4s, v6.8h, v6.8h
    usubl   v7.8h, v17.8b, v19.8b
    add     v0.8h, v0.8h, v6.8h
    smlal   v4.4s, v7.4h, v7.4h
    add     v1.8h, v1.8h, v7.8h
    smlal2  v5.4s, v7.8h, v7.8h
    saddlv  s0, v0.8h
    saddlv  s1, v1.8h
    add     v2.4s, v2.4s, v3.4s
    add     v4.4s, v4.4s, v5.4s
    mov     w0, v0.s[0]
    mov     w1, v1.s[0]
    addv    s2, v2.4s
    addv    s4, v4.4s
    mul     w0, w0, w0
    mul     w1, w1, w1
    mov     w3, v2.s[0]
    mov     w4, v4.s[0]
    sub     w0, w3, w0, lsr # 6 + (\h >> 4)
    sub     w1, w4, w1, lsr # 6 + (\h >> 4)
    str     w3, [x2]
    add     w0, w0, w1
    str     w4, [x2, #4]
    ret
endfunc
function pixel_sad_x_h\()_neon_10, export=1
    mov     x7, #16
    lsl     x5, x5, #1
    lsl     x7, x7, #1
    ld1     {v0.8h, v1.8h}, [x0], x7
    ld1     {v2.8h, v3.8h}, [x1], x5
    ld1     {v4.8h, v5.8h}, [x2], x5
    uabd    v16.8h, v2.8h, v0.8h
    uabd    v20.8h, v3.8h, v1.8h
    ld1     {v24.8h, v25.8h}, [x3], x5
    uabd    v17.8h, v4.8h, v0.8h
    uabd    v21.8h, v5.8h, v1.8h
    ld1     {v6.8h, v7.8h}, [x0], x7
    ld1     {v2.8h, v3.8h}, [x1], x5
    uabd    v18.8h, v24.8h, v0.8h
    uabd    v22.8h, v25.8h, v1.8h
    ld1     {v4.8h, v5.8h}, [x2], x5
    uaba    v16.8h, v2.8h, v6.8h
    uaba    v20.8h, v3.8h, v7.8h
    ld1     {v24.8h, v25.8h}, [x3], x5
    uaba    v17.8h, v4.8h, v6.8h
    uaba    v21.8h, v5.8h, v7.8h
    ld1     {v26.8h, v27.8h}, [x4], x5
    ld1     {v28.8h, v29.8h}, [x4], x5
    uaba    v18.8h, v24.8h, v6.8h
    uaba    v22.8h, v25.8h, v7.8h
    uabd    v19.8h, v26.8h, v0.8h
    uabd    v23.8h, v27.8h, v1.8h
    uaba    v19.8h, v28.8h, v6.8h
    uaba    v23.8h, v29.8h, v7.8h
.rept \h / 2 - 1
    ld1     {v0.8h, v1.8h}, [x0], x7
    ld1     {v2.8h, v3.8h}, [x1], x5
    ld1     {v4.8h, v5.8h}, [x2], x5
    uaba    v16.8h, v2.8h, v0.8h
    uaba    v20.8h, v3.8h, v1.8h
    ld1     {v24.8h, v25.8h}, [x3], x5
    uaba    v17.8h, v4.8h, v0.8h
    uaba    v21.8h, v5.8h, v1.8h
    ld1     {v6.8h, v7.8h}, [x0], x7
    ld1     {v2.8h, v3.8h}, [x1], x5
    uaba    v18.8h, v24.8h, v0.8h
    uaba    v22.8h, v25.8h, v1.8h
    ld1     {v4.8h, v5.8h}, [x2], x5
    uaba    v16.8h, v2.8h, v6.8h
    uaba    v20.8h, v3.8h, v7.8h
    ld1     {v24.8h, v25.8h}, [x3], x5
    uaba    v17.8h, v4.8h, v6.8h
    uaba    v21.8h, v5.8h, v7.8h
    ld1     {v26.8h, v27.8h}, [x4], x5
    ld1     {v28.8h, v29.8h}, [x4], x5
    uaba    v18.8h, v24.8h, v6.8h
    uaba    v22.8h, v25.8h, v7.8h
    uaba    v19.8h, v26.8h, v0.8h
    uaba    v23.8h, v27.8h, v1.8h
    uaba    v19.8h, v28.8h, v6.8h
    uaba    v23.8h, v29.8h, v7.8h
.endr
    add     v16.8h, v16.8h, v20.8h
    add     v17.8h, v17.8h, v21.8h
    add     v18.8h, v18.8h, v22.8h
    add     v19.8h, v19.8h, v23.8h
    // add up the sads
    uaddlv  s0, v16.8h
    uaddlv  s1, v17.8h
    uaddlv  s2, v18.8h
    stp     s0, s1, [x6], #8
    uaddlv  s3, v19.8h
    stp     s2, s3, [x6]
    ret
endfunc
For the last one, I tried the udot approach, but the performance degraded. Any thoughts?
You do not have to provide me full functions, just some hints. Sorry for the large amount of help that I am asking. It seems that this forum is the only help I can get.
Hi Akis,
Good to hear that my suggestions for quant_4x4x4_neon worked as we expected!
For sub8x8_dct8_neon I wonder if we can still remove some of the shift instructions by combining with the successor addition and using the SSRA instruction to perform a shift and addition in a single instruction:
https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/SSRA--Signed-Shift-Right-and-Accumulate--immediate--
For example instead of:
    sshr    v23.8h, v21.8h, #1
    add     v23.8h, v23.8h, v21.8h
We could instead consider something like:
ssra v21.8h, v21.8h, #1 // v21.8h += v21.8h >> 1
This has the disadvantage that the destination must be the same register as the non-shifted addend, so it does not work if we need v21 elsewhere later. In your snippet I think v21 is used in an ADD and a SUB after the SSHR+ADD pair. However, I suspect this can be solved by reordering the code so that the ADD/SUB are done first, after which the register can be reused for the SSRA.
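The equivalence between the SSHR+ADD pair and a single SSRA can be checked with a per-lane scalar model. The sketch below is illustrative only: `sshr_then_add`, `ssra`, `asr` and `wrap16` are hypothetical helper names, and each array stands in for one .8h register.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* one .8h vector */

/* Arithmetic shift right without relying on implementation-defined
 * behaviour for negative operands. */
static int asr(int x, int shift)
{
    return x < 0 ? ~(~x >> shift) : x >> shift;
}

/* Wrap a value to 16 bits the way a 16-bit vector lane does. */
static int16_t wrap16(int x)
{
    unsigned u = (unsigned)x & 0xffffu;
    return (int16_t)(u >= 0x8000u ? (int)u - 0x10000 : (int)u);
}

/* Model of:  sshr tmp, a, #shift ; add dst, tmp, a */
static void sshr_then_add(int16_t dst[LANES], const int16_t a[LANES], int shift)
{
    assert(shift >= 1 && shift <= 16);
    for (int i = 0; i < LANES; i++) {
        int16_t tmp = wrap16(asr(a[i], shift));
        dst[i] = wrap16(tmp + a[i]);
    }
}

/* Model of:  ssra acc, a, #shift   (acc += a >> shift).
 * acc may be the same array as a, as in ssra v21.8h, v21.8h, #1. */
static void ssra(int16_t acc[LANES], const int16_t a[LANES], int shift)
{
    assert(shift >= 1 && shift <= 16);
    for (int i = 0; i < LANES; i++)
        acc[i] = wrap16(acc[i] + asr(a[i], shift));
}
```

Calling `ssra(v21, v21, 1)` overwrites `v21`, which is exactly the constraint described above: any instruction that still needs the unshifted value has to be scheduled before it.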
For the copy functions like mc_copy_w16_neon and memcpy_aligned_neon there is probably no benefit from SVE at the same vector length as Neon. One small optimisation you could consider is maintaining multiple independent source and destination addresses (e.g. x0, x0+x1, x0+2*x1, x0+3*x1) and incrementing them independently (e.g. by x1*4), since currently in mc_copy_w16_neon for instance the x0 and x2 addresses must be updated four times per loop iteration which could be slow. I don't expect that would have a big impact in performance though.
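The independent-address idea can be sketched in C as follows. This is an assumed illustration, not a measured optimisation: `copy_w16_independent` is a hypothetical name, strides are counted in 16-bit pixels, and the height is assumed to be a multiple of four as in the NEON loop. Each of the four row offsets is advanced once per iteration by four strides, so no offset depends on the update of another.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROW_PIXELS 16  /* 16 pixels of 16 bits each = 32 bytes per row */

static void copy_w16_independent(uint16_t *dst, ptrdiff_t dst_stride,
                                 const uint16_t *src, ptrdiff_t src_stride,
                                 int height)
{
    assert(height > 0 && height % 4 == 0);

    /* Four independent offsets per side: rows 0, 1, 2 and 3. */
    ptrdiff_t s[4], d[4];
    for (int r = 0; r < 4; r++) {
        s[r] = r * src_stride;
        d[r] = r * dst_stride;
    }

    for (int y = 0; y < height; y += 4) {
        for (int r = 0; r < 4; r++) {
            memcpy(dst + d[r], src + s[r], ROW_PIXELS * sizeof(uint16_t));
            /* Each offset is bumped exactly once per iteration. */
            s[r] += 4 * src_stride;
            d[r] += 4 * dst_stride;
        }
    }
}
```

In assembly this would cost six extra address registers and a precomputed 4 * stride, which is why the expected gain is small.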
For mbtree_propagate_list_internal_neon I wonder if we can also use the SSRA instruction here as well? We currently have e.g.
    sshr    v6.8h, v4.8h, #5
    add     v6.8h, v6.8h, v29.8h
Which could instead be:
ssra v29.8h, v4.8h, #5
I guess that doesn't work so well in this case because v28 and v29 are needed for the next iteration of the loop, but even a MOV to duplicate them into another variable may still be better since the constants will not be on the critical path of the calculation.
The UZP1/UZP2 and later ZIP1/ZIP2 instructions in the loop feel strange, since the ZIP1/ZIP2 will undo the effect of the earlier UZP1/UZP2 instructions. Perhaps the other operands (v25) can be adjusted so that both pairs of permutes can either be removed or at least replaced with a single permute that swaps pairs of lanes so that they can interact with each other (REV32.8H?).
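That ZIP1/ZIP2 restore the lane order produced by UZP1/UZP2 can be seen in a small scalar model. The helpers `uzp` and `zip` below are hypothetical names; `n` is the number of lanes per vector (8 for .8h) and `part` selects the 1 or 2 variant of each instruction. In the actual function the ZIPs are applied to values derived from the UZP outputs, not to the outputs themselves, so this only demonstrates the layout round trip.

```c
#include <assert.h>
#include <stdint.h>

/* uzp1 (part = 0) keeps the even-indexed elements of the concatenation a:b,
 * uzp2 (part = 1) keeps the odd-indexed ones. d must not alias a or b. */
static void uzp(uint16_t *d, const uint16_t *a, const uint16_t *b,
                int n, int part)
{
    assert(n > 0 && n % 2 == 0 && (part == 0 || part == 1));
    for (int i = 0; i < n / 2; i++) {
        d[i]         = a[2 * i + part];
        d[i + n / 2] = b[2 * i + part];
    }
}

/* zip1 (part = 0) interleaves the low halves of a and b,
 * zip2 (part = 1) interleaves the high halves. d must not alias a or b. */
static void zip(uint16_t *d, const uint16_t *a, const uint16_t *b,
                int n, int part)
{
    assert(n > 0 && n % 2 == 0 && (part == 0 || part == 1));
    const int base = part * (n / 2);
    for (int i = 0; i < n / 2; i++) {
        d[2 * i]     = a[base + i];
        d[2 * i + 1] = b[base + i];
    }
}
```

Feeding the two UZP results straight into ZIP1 and ZIP2 returns the original two vectors, which is why a layout that stays interleaved throughout might avoid both permute pairs.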
Finally, I suspect it doesn't work in this case but just mentioning it in case it could be useful: for the UMULL+RSHRN pairs we could consider trying to replace those with something like the UMULH SVE instruction:
https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/UMULH--unpredicated---Unsigned-multiply-returning-high-half--unpredicated--
The problems I suspect that we'll encounter trying to use UMULH here are (a) that the shift is a rounding shift which means we cannot usually just take the top half of the multiplication result, and (b) the shift value is only 10 rather than 16. The shift value might not be a problem if you could instead adjust the operands and multiply by (v25 << 6) instead, but that might not be possible depending on the range of possible values for that multiplicand.
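Both concerns can be made concrete with a one-lane scalar model. The function names below are hypothetical, and the bound on the pre-scaled operand (it must still fit in 16 bits after the shift by 6) is an assumption that would have to be checked against the real value range of v25.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of umull + rshrn #10: rounding shift, low 16 bits kept. */
static uint16_t umull_rshrn10(uint16_t a, uint16_t b)
{
    uint32_t p = (uint32_t)a * b;
    return (uint16_t)((p + (1u << 9)) >> 10);
}

/* One 16-bit lane of umulh: high half of the product, no rounding. */
static uint16_t umulh16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Candidate replacement: multiply by (b << 6) so that the implicit
 * shift by 16 matches a shift by 10. Only valid while b << 6 still
 * fits in 16 bits, i.e. b < 1024. */
static uint16_t umulh_prescaled(uint16_t a, uint16_t b)
{
    assert(b < (1u << 10));
    return umulh16(a, (uint16_t)(b << 6));
}
```

For a = 3 and b = 171 the product is 513: the rounding shift gives 1, while the pre-scaled high-half multiply truncates to 0. The two agree only when the discarded bits are below one half, so a UMULH version would not be bit-exact without an extra rounding step.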
For pixel_var2_8x\h\()_neon I would assume that we could replace the USUBL+SMULL/SMLAL pairs with UABD+UDOT as we have done previously. Since we have one contiguous array it may also be worth loading into full vectors of data here rather than only using half a vector at a time, e.g.
    ld1     {v16.8b}, [x0], #8
    ld1     {v18.8b}, [x1], x3
    ld1     {v17.8b}, [x0], #8
    ld1     {v19.8b}, [x1], x3
Could be something like:
    ld1     {v16.16b}, [x0], #16    // Merged from v16 and v17.
    ld1     {v18.8b}, [x1], x3
    ld1     {v18.d}[1], [x1], x3    // Load into high half of v18, not v19.
For pixel_sad_x_h\()_neon_10 I agree with your conclusion. I don't think that there will be much benefit from dot product here since there is never a widening operation, so the UABA instruction is able to operate on full vectors rather than on only half of a vector like in some of our previous examples where we have used UMLAL or UABAL etc.
Hope that helps!
Thanks,
George
For sub8x8_dct8_neon, I applied your suggestion and everything worked fine. Thanks!
For the copy functions, as you said, there is not much left to do to improve the performance.
For mbtree_propagate_list_internal_neon, I applied your suggestion. Thanks!
For pixel_var2_8x\h\()_neon, I used the udot instruction, but it doesn't work. It seems that some vectors (for example v0.8h, v1.8h, v6.8h and v7.8h) are still needed after widening instructions. I developed the following function:
function pixel_var2_8x\h\()_sve, export=1
    movi    v30.4s, #0
    movi    v31.4s, #0
    mov     x3, #16
    ld1     {v16.8b}, [x0], #8
    ld1     {v18.8b}, [x1], x3
    ld1     {v17.8b}, [x0], #8
    ld1     {v19.8b}, [x1], x3
    mov     x5, \h - 2
    uabd    v28.8b, v16.8b, v18.8b
    usubl   v0.8h, v16.8b, v18.8b
    uabd    v29.8b, v17.8b, v19.8b
    usubl   v1.8h, v17.8b, v19.8b
    ld1     {v16.8b}, [x0], #8
    ld1     {v18.8b}, [x1], x3
    udot    v30.2s, v28.8b, v28.8b
    udot    v31.2s, v29.8b, v29.8b
    uabd    v28.8b, v16.8b, v18.8b
    usubl   v6.8h, v16.8b, v18.8b
1:
    subs    x5, x5, #1
    ld1     {v17.8b}, [x0], #8
    ld1     {v19.8b}, [x1], x3
    udot    v30.2s, v28.8b, v28.8b
    uabd    v29.8b, v17.8b, v19.8b
    usubl   v7.8h, v17.8b, v19.8b
    add     v0.8h, v0.8h, v6.8h
    ld1     {v16.8b}, [x0], #8
    ld1     {v18.8b}, [x1], x3
    udot    v31.2s, v29.8b, v29.8b
    uabd    v28.8b, v16.8b, v18.8b
    usubl   v6.8h, v16.8b, v18.8b
    add     v1.8h, v1.8h, v7.8h
    b.gt    1b
    ld1     {v17.8b}, [x0], #8
    ld1     {v19.8b}, [x1], x3
    udot    v30.2s, v6.8b, v6.8b
    uabd    v29.8b, v17.8b, v19.8b
    usubl   v7.8h, v17.8b, v19.8b
    add     v0.8h, v0.8h, v6.8h
    udot    v31.2s, v29.8b, v29.8b
    add     v1.8h, v1.8h, v7.8h
    saddlv  s0, v0.8h
    saddlv  s1, v1.8h
    mov     w0, v0.s[0]
    mov     w1, v1.s[0]
    addv    s2, v30.4s
    addv    s4, v31.4s
    mul     w0, w0, w0
    mul     w1, w1, w1
    mov     w3, v30.s[0]
    mov     w4, v31.s[0]
    sub     w0, w3, w0, lsr # 6 + (\h >> 4)
    sub     w1, w4, w1, lsr # 6 + (\h >> 4)
    str     w3, [x2]
    add     w0, w0, w1
    str     w4, [x2, #4]
    ret
endfunc
Unit tests fail. Can you please tell me what I am doing wrong? Also, using the three merged load instructions instead of the original four degrades the performance. I do not know why.
For pixel_sad_x_h\()_neon_10, I also agree that we cannot improve it.
BR,
Unit tests fail. Can you please tell me what I am doing wrong? Also, using the three merged load instructions instead of the original four degrades the performance. I do not know why.
It's a bit hard for me to try and debug the whole code snippet. One thing I did notice though is that at the end of the function you reduce v30 and v31 as such:
    addv    s2, v30.4s
    addv    s4, v31.4s
    ...
    mov     w3, v30.s[0]    // Should this be v2.s[0] ?
    mov     w4, v31.s[0]    // Should this be v4.s[0] ?
This seems suspicious since s2 and s4 are otherwise never used after those instructions.
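The effect of reading the wrong register can be shown with a scalar model of the reduction. In this sketch `addv` is a hypothetical helper that returns what ADDV would place in lane 0 of its destination, and the lane values are made-up examples. Because `udot v30.2s` accumulates into two 32-bit lanes, lane 0 of v30 holds only part of the total.

```c
#include <assert.h>
#include <stdint.h>

/* Model of addv sD, vN.<lanes>s: the sum of all lanes of the source.
 * The source register itself is left unchanged. */
static uint32_t addv(const uint32_t *v, int lanes)
{
    assert(lanes == 2 || lanes == 4);
    uint32_t sum = 0;
    for (int i = 0; i < lanes; i++)
        sum += v[i];
    return sum;
}
```

With partial sums of 30 and 174 in lanes 0 and 1, the reduced value is 204, but reading lane 0 of the unreduced accumulator yields 30.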
With regards to still needing the USUBL, do you know if either the absolute difference (UABD) or a non-widening subtract (SUB) would work here instead? If so then we can potentially use only one of those instead since the UABD and USUBL are doing very similar things at the moment? Assuming that a non-widening approach works here you could then sum the results with UADDW or another UDOT instruction with all-1s as the other operand.
For example, instead of:
    uabd    v28.8b, v16.8b, v18.8b
    usubl   v6.8h, v16.8b, v18.8b
    udot    v30.2s, v28.8b, v28.8b
    add     v0.8h, v0.8h, v6.8h
We could see if something like this would work instead:
    uabd    v28.8b, v16.8b, v18.8b  // or SUB?
    udot    v30.2s, v28.8b, v28.8b
    uaddw   v0.8h, v0.8h, v28.8b
Using the dot product would also work here if we need to widen beyond a 16-bit accumulator for v0 since it allows us to accumulate in 32-bits by multiplying by a vector of all-1s:
    movi    v6.16b, #1
    ...
    uabd    v28.8b, v16.8b, v18.8b  // or SUB?
    udot    v30.2s, v28.8b, v28.8b
    udot    v0.2s, v28.8b, v6.8b    // v28.8b * 1
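A scalar model of UDOT makes the all-1s trick explicit. This is a sketch with a hypothetical helper name: each 32-bit accumulator lane receives the sum of four byte products, so multiplying a vector by itself accumulates squares and multiplying it by a vector of ones accumulates a plain sum.

```c
#include <assert.h>
#include <stdint.h>

/* Model of udot acc.<lanes>s, n.<4*lanes>b, m.<4*lanes>b:
 * every 32-bit lane accumulates four unsigned byte products. */
static void udot(uint32_t *acc, const uint8_t *n, const uint8_t *m, int lanes)
{
    assert(lanes == 2 || lanes == 4);  /* .2s or .4s arrangement */
    for (int lane = 0; lane < lanes; lane++)
        for (int k = 0; k < 4; k++)
            acc[lane] += (uint32_t)n[4 * lane + k] * m[4 * lane + k];
}
```

The accumulators are 32 bits wide in both uses, so the sum no longer risks overflowing a 16-bit lane.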
If an approach like that works then at that point it may be beneficial to re-try the three-load approach, since the entire computation can be moved from .8b to .16b, which could be more significant than in your previous attempt.
After using the mov instructions you proposed, everything worked fine! Thanks!
Unfortunately, after some testing, I can use neither sub nor uabd; the unit tests fail again. So I cannot use the three load instructions either. But your proposed solution is very interesting and it may help me optimize other functions. Thanks!
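One plausible explanation, not confirmed in the thread, is that the function needs the signed sum of the differences as well as the sum of their squares. The squares are the same for a signed or an absolute difference, but the sums are not, and a non-widening 8-bit SUB wraps whenever the difference falls outside -128..127. The scalar sketch below uses hypothetical names and small made-up inputs to show both effects.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int32_t  sum;  /* sum of differences */
    uint32_t sqr;  /* sum of squared differences */
} diff_sums;

/* USUBL-style: the signed, widened difference feeds both accumulators. */
static diff_sums sums_signed(const uint8_t *a, const uint8_t *b, int n)
{
    assert(n >= 0);
    diff_sums r = {0, 0};
    for (int i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        r.sum += d;
        r.sqr += (uint32_t)(d * d);
    }
    return r;
}

/* UABD-style: |a - b| feeds both accumulators. */
static diff_sums sums_absdiff(const uint8_t *a, const uint8_t *b, int n)
{
    assert(n >= 0);
    diff_sums r = {0, 0};
    for (int i = 0; i < n; i++) {
        int d = abs((int)a[i] - (int)b[i]);
        r.sum += d;
        r.sqr += (uint32_t)(d * d);
    }
    return r;
}

/* Non-widening 8-bit SUB with the result read back as a signed byte. */
static int sub8_signed(uint8_t a, uint8_t b)
{
    unsigned d = (unsigned)(a - b) & 0xffu;
    return d >= 128 ? (int)d - 256 : (int)d;
}
```

For differences of -2, 2, -5 and 0 the squared sums agree (33) but the plain sums are -5 and 9, and `sub8_signed(200, 10)` gives -66 instead of 190. If this is what happens in the unit tests, UABD+UDOT would remain usable for the squared term only.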
I think we can close this thread. You gave me a lot of help. I couldn't reach up to this point without your help. Once again, thanks!
If I need further help, I will create a new thread (I hope that this is OK).