Improve Performance of specific NEON functions using SVE/SVE2

Hello,

I have the following three functions that use the NEON instruction set:

function pixel_avg2_w8_neon, export=1
1:
    subs        w5,  w5,  #2
    ld1         {v0.8b}, [x2], x3
    ld1         {v2.8b}, [x4], x3
    urhadd      v0.8b,  v0.8b,  v2.8b
    ld1         {v1.8b}, [x2], x3
    ld1         {v3.8b}, [x4], x3
    urhadd      v1.8b,  v1.8b,  v3.8b
    st1         {v0.8b}, [x0], x1
    st1         {v1.8b}, [x0], x1
    b.gt        1b
    ret
endfunc

function pixel_avg2_w16_neon, export=1
1:
    subs        w5,  w5,  #2
    ld1         {v0.16b}, [x2], x3
    ld1         {v2.16b}, [x4], x3
    urhadd      v0.16b, v0.16b, v2.16b
    ld1         {v1.16b}, [x2], x3
    ld1         {v3.16b}, [x4], x3
    urhadd      v1.16b, v1.16b, v3.16b
    st1         {v0.16b}, [x0], x1
    st1         {v1.16b}, [x0], x1
    b.gt        1b
    ret
endfunc
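
For readers less familiar with URHADD, here is a minimal scalar sketch in C of what the two averaging loops above compute. The name `pixel_avg2_ref` and the explicit `width` parameter are assumptions made for illustration; the argument roles are read off the register usage in the assembly (x0/x1 destination and stride, x2/x4 sources sharing stride x3, w5 height).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the pixel_avg2_w8/w16 loops (hypothetical helper name).
 * URHADD computes (a + b + 1) >> 1 per byte without losing the carry.
 * Both sources advance by the same stride, as x3 does in the assembly. */
static void pixel_avg2_ref(uint8_t *dst, ptrdiff_t dst_stride,
                           const uint8_t *src1, ptrdiff_t src_stride,
                           const uint8_t *src2, int width, int height)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = (uint8_t)((src1[x] + src2[x] + 1) >> 1);
        dst  += dst_stride;
        src1 += src_stride;
        src2 += src_stride;
    }
}
```

A model like this can serve as the reference side of a unit test when trying alternative vector implementations.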

function pixel_sad_\h\()_neon, export=1
    ld1         {v1.16b}, [x2], x3
    ld1         {v0.16b}, [x0], x1
    ld1         {v3.16b}, [x2], x3
    ld1         {v2.16b}, [x0], x1
    uabdl       v16.8h,  v0.8b,  v1.8b
    uabdl2      v17.8h,  v0.16b, v1.16b
    uabal       v16.8h,  v2.8b,  v3.8b
    uabal2      v17.8h,  v2.16b, v3.16b

.rept \h / 2 - 1
    ld1         {v1.16b}, [x2], x3
    ld1         {v0.16b}, [x0], x1
    ld1         {v3.16b}, [x2], x3
    ld1         {v2.16b}, [x0], x1
    uabal       v16.8h,  v0.8b,  v1.8b
    uabal2      v17.8h,  v0.16b, v1.16b
    uabal       v16.8h,  v2.8b,  v3.8b
    uabal2      v17.8h,  v2.16b, v3.16b
.endr
    add         v16.8h,  v16.8h,  v17.8h
    uaddlv      s0,  v16.8h
    fmov        w0,  s0
    ret
endfunc
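
As a point of comparison, the SAD kernel above can be sketched as scalar C. The name `pixel_sad_16xh_ref` is hypothetical; the argument roles (x0/x1 first block and stride, x2/x3 second block and stride) are inferred from the loads in the assembly, and the block width of 16 comes from the `.16b` loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of pixel_sad_\h\()_neon for 16-pixel-wide rows
 * (hypothetical helper name). The NEON code keeps eight 16-bit column
 * sums in each of v16 and v17 and only widens to 32 bits in the final
 * UADDLV; this model simply accumulates in an int. */
static int pixel_sad_16xh_ref(const uint8_t *pix1, ptrdiff_t stride1,
                              const uint8_t *pix2, ptrdiff_t stride2,
                              int height)
{
    assert(height > 0);
    int sum = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 16; x++) {
            int d = (int)pix1[x] - (int)pix2[x];
            sum += d < 0 ? -d : d;   /* UABDL/UABAL: absolute difference */
        }
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```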

I want to use the SVE/SVE2 instruction set to improve the performance of these functions. My testbed is an Alibaba Yitian 710 (vector length = 128 bits).

For the first two, I couldn't find a way to improve the performance. For the third, I wrote the following function:

function pixel_sad_\h\()_sve, export=1
    ptrue       p0.h, vl8
    ld1b        {z1.h}, p0/z, [x2]
    ld1b        {z4.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z3.h}, p0/z, [x2]
    ld1b        {z6.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z0.h}, p0/z, [x0]
    ld1b        {z5.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    ld1b        {z2.h}, p0/z, [x0]
    ld1b        {z7.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    uabd        v16.8h,  v0.8h,  v1.8h
    uabd        v17.8h,  v4.8h,  v5.8h
    uaba        v16.8h,  v2.8h,  v3.8h
    uaba        v17.8h,  v7.8h,  v6.8h

.rept \h / 2 - 1
    ld1b        {z1.h}, p0/z, [x2]
    ld1b        {z4.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z3.h}, p0/z, [x2]
    ld1b        {z6.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z0.h}, p0/z, [x0]
    ld1b        {z5.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    ld1b        {z2.h}, p0/z, [x0]
    ld1b        {z7.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    uaba        v16.8h,  v0.8h,  v1.8h
    uaba        v17.8h,  v4.8h,  v5.8h
    uaba        v16.8h,  v2.8h,  v3.8h
    uaba        v17.8h,  v7.8h,  v6.8h
.endr
    
    add         v16.8h,  v16.8h,  v17.8h
    uaddlv      s0,  v16.8h
    fmov        w0,  s0
    ret
endfunc

However, this degrades the performance instead of improving it.

Can someone help me?

Thank you in advance,

Akis

  • Hi Akis,

    It's a bit hard for me to try and debug the whole code snippet. One thing I did notice though is that at the end of the function you reduce v30 and v31 as such:

    addv s2, v30.4s
    addv s4, v31.4s
    ...
    mov w3, v30.s[0] // Should this be v2.s[0] ?
    mov w4, v31.s[0] // Should this be v4.s[0] ?

    This seems suspicious since s2 and s4 are otherwise never used after those instructions.

    With regard to still needing the USUBL: do you know whether either the absolute difference (UABD) or a non-widening subtract (SUB) would work here instead? UABD and USUBL are doing very similar things at the moment, so if one of them is sufficient we could potentially drop the other. Assuming a non-widening approach works here, you could then sum the results with UADDW, or with another UDOT instruction that has all-1s as its other operand.

    For example, instead of:

    uabd v28.8b, v16.8b, v18.8b
    usubl v6.8h, v16.8b, v18.8b
    udot v30.2s, v28.8b, v28.8b
    add v0.8h, v0.8h, v6.8h

    We could see if something like this would work instead:

    uabd v28.8b, v16.8b, v18.8b // or SUB?
    udot v30.2s, v28.8b, v28.8b
    uaddw v0.8h, v0.8h, v28.8b

    Using the dot product would also work here if we need to widen beyond a 16-bit accumulator for v0, since multiplying by a vector of all-1s lets us accumulate in 32 bits:

    movi v6.16b, #1
    ...
    uabd v28.8b, v16.8b, v18.8b // or SUB?
    udot v30.2s, v28.8b, v28.8b
    udot v0.2s, v28.8b, v6.8b // v28.8b * 1
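
To make the all-1s trick concrete, here is a small scalar sketch in C of what UDOT does for the `.2s`/`.8b` arrangement used above. `udot_2s_8b` is a hypothetical helper name, not an intrinsic. With the second operand set to all 1s each product is just the input byte, so each 32-bit lane accumulates a plain sum of four bytes; with both operands equal it accumulates a sum of squares.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of UDOT Vd.2S, Vn.8B, Vm.8B (hypothetical helper name):
 * each 32-bit lane of the accumulator gains the sum of the four
 * byte-by-byte products from the matching group of four bytes. */
static void udot_2s_8b(uint32_t acc[2], const uint8_t n[8], const uint8_t m[8])
{
    assert(acc && n && m);
    for (int lane = 0; lane < 2; lane++)
        for (int k = 0; k < 4; k++)
            acc[lane] += (uint32_t)n[4 * lane + k] * m[4 * lane + k];
}
```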

    If an approach like that works, it may be worth retrying the three-load approach, since the entire computation can then be moved from .8b to .16b, which could be more significant than in your previous attempt.

    Hope that helps!

    Thanks,
    George

  • Hi George,

    After using the mov instructions you proposed, everything worked fine! Thanks!

    Unfortunately, after some testing, I can use neither SUB nor UABD: the unit tests fail again. So I cannot use the three load instructions either. But your proposed solution is very interesting, and it may help me optimize other functions. Thanks!
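
One possible explanation for the failing tests, sketched in scalar C below. It assumes, based only on the USUBL/ADD pair in the quoted snippet (the full function is not shown in this thread), that v0 is meant to hold a signed sum of differences. Under that assumption UABD drops the sign and an 8-bit SUB wraps modulo 256, so neither reproduces the signed sum once any difference is negative. All three function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* USUBL + ADD: widen first, so negative differences keep their sign. */
static int sum_signed_diff(const uint8_t *a, const uint8_t *b, int n)
{
    assert(n >= 0);
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int)a[i] - (int)b[i];
    return sum;
}

/* UABD + UADDW: the sign is discarded before accumulation. */
static int sum_abs_diff(const uint8_t *a, const uint8_t *b, int n)
{
    assert(n >= 0);
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += d < 0 ? -d : d;
    }
    return sum;
}

/* SUB (.8b) + UADDW: the difference wraps modulo 256 before widening. */
static int sum_wrapped_diff(const uint8_t *a, const uint8_t *b, int n)
{
    assert(n >= 0);
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += (uint8_t)(a[i] - b[i]);
    return sum;
}
```

For the inputs {10, 3} and {3, 10} the three functions return 0, 14 and 256 respectively, which is the kind of mismatch a unit test would catch.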

    I think we can close this thread. You gave me a lot of help; I couldn't have reached this point without it. Once again, thanks!

    If I need further help, I will create a new thread (I hope that this is OK).

    BR,

    Akis