
Improve Performance of specific NEON functions using SVE/SVE2

Hello,

I have the following 3 functions that use the NEON instruction set:

function pixel_avg2_w8_neon, export=1
1:
    subs        w5,  w5,  #2
    ld1         {v0.8b}, [x2], x3
    ld1         {v2.8b}, [x4], x3
    urhadd      v0.8b,  v0.8b,  v2.8b
    ld1         {v1.8b}, [x2], x3
    ld1         {v3.8b}, [x4], x3
    urhadd      v1.8b,  v1.8b,  v3.8b
    st1         {v0.8b}, [x0], x1
    st1         {v1.8b}, [x0], x1
    b.gt        1b
    ret
endfunc

function pixel_avg2_w16_neon, export=1
1:
    subs        w5,  w5,  #2
    ld1         {v0.16b}, [x2], x3
    ld1         {v2.16b}, [x4], x3
    urhadd      v0.16b, v0.16b, v2.16b
    ld1         {v1.16b}, [x2], x3
    ld1         {v3.16b}, [x4], x3
    urhadd      v1.16b, v1.16b, v3.16b
    st1         {v0.16b}, [x0], x1
    st1         {v1.16b}, [x0], x1
    b.gt        1b
    ret
endfunc

function pixel_sad_\h\()_neon, export=1
    ld1         {v1.16b}, [x2], x3
    ld1         {v0.16b}, [x0], x1
    ld1         {v3.16b}, [x2], x3
    ld1         {v2.16b}, [x0], x1
    uabdl       v16.8h,  v0.8b,  v1.8b
    uabdl2      v17.8h,  v0.16b, v1.16b
    uabal       v16.8h,  v2.8b,  v3.8b
    uabal2      v17.8h,  v2.16b, v3.16b

.rept \h / 2 - 1
    ld1         {v1.16b}, [x2], x3
    ld1         {v0.16b}, [x0], x1
    ld1         {v3.16b}, [x2], x3
    ld1         {v2.16b}, [x0], x1
    uabal       v16.8h,  v0.8b,  v1.8b
    uabal2      v17.8h,  v0.16b, v1.16b
    uabal       v16.8h,  v2.8b,  v3.8b
    uabal2      v17.8h,  v2.16b, v3.16b
.endr
    add         v16.8h,  v16.8h,  v17.8h
    uaddlv      s0,  v16.8h
    fmov        w0,  s0
    ret
endfunc

I want to use the SVE/SVE2 instruction sets to improve the performance of these functions. My testbed is an Alibaba Yitian 710 (vector size = 128 bits).

For the first two, I couldn't find a way to improve the performance. For the last one, I wrote the following function:

function pixel_sad_\h\()_sve, export=1
    ptrue       p0.h, vl8
    ld1b        {z1.h}, p0/z, [x2]
    ld1b        {z4.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z3.h}, p0/z, [x2]
    ld1b        {z6.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z0.h}, p0/z, [x0]
    ld1b        {z5.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    ld1b        {z2.h}, p0/z, [x0]
    ld1b        {z7.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    uabd        v16.8h,  v0.8h,  v1.8h
    uabd        v17.8h,  v4.8h,  v5.8h
    uaba        v16.8h,  v2.8h,  v3.8h
    uaba        v17.8h,  v7.8h,  v6.8h

.rept \h / 2 - 1
    ld1b        {z1.h}, p0/z, [x2]
    ld1b        {z4.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z3.h}, p0/z, [x2]
    ld1b        {z6.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z0.h}, p0/z, [x0]
    ld1b        {z5.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    ld1b        {z2.h}, p0/z, [x0]
    ld1b        {z7.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    uaba        v16.8h,  v0.8h,  v1.8h
    uaba        v17.8h,  v4.8h,  v5.8h
    uaba        v16.8h,  v2.8h,  v3.8h
    uaba        v17.8h,  v7.8h,  v6.8h
.endr
    
    add         v16.8h,  v16.8h,  v17.8h
    uaddlv      s0,  v16.8h
    fmov        w0,  s0
    ret
endfunc

However, this degrades the performance instead of improving it.

Can someone help me?

Thank you in advance,

Akis

  • Hi Akis,

    Happy new year!

    I don't have a good explanation for why UABD and SABD may differ in performance. As you point out, the Software Optimization Guides identify them as performing identically. Perhaps there are other sources of noise in the benchmarks, or the binary layout has changed slightly as a result of re-linking the program?

    For pixel_sad_\h\()_neon_10 we are dealing with .h elements rather than .b, so we cannot use the Neon dot-product instructions; however, we can use the SVE dot-product instructions instead, since a 16-bit dot product is available:

    https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/UDOT--4-way--vectors---Unsigned-integer-dot-product-

    I am assuming that the "10" in the function name refers to 10-bit input elements rather than full 16-bit wide input. In this instance we can delay accumulating into a wider datatype for longer, since we have some bits to spare. In the snippet below I have only delayed the accumulation by one instruction, but you could consider doing more to further reduce the number of dot-product instructions needed.

    Something like:

    ptrue p0.h
    dup z4.h, #1  // A constant vector of 1s for summing z0.h*1.
    dup z5.d, #0  // An accumulator with 64-bit elements.
    ...
    uabd z0.h, p0/m, z0.h, z1.h  // 10-bits
    uaba z0.h, z2.h, z3.h        // 11-bits
    udot z5.d, z0.h, z4.h
    ...
    uaddlv d0, p0, z5.d
    fmov w0, s0
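
    For instance, with 10-bit inputs each absolute difference is at most 1023, so a 16-bit lane can safely hold around 64 of them before it could overflow. Assuming further rows have already been loaded (z16-z19 below are hypothetical register choices), several rows could be folded per dot product; a rough sketch:

    uabd z0.h, p0/m, z0.h, z1.h  // 10-bit absolute differences
    uaba z0.h, z2.h, z3.h        // + second row
    uaba z0.h, z16.h, z17.h      // + third row (hypothetical loads)
    uaba z0.h, z18.h, z19.h      // + fourth row
    udot z5.d, z0.h, z4.h        // one dot product per four rows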

    For the pixel_ssd_\h\()_neon we can also make use of dot-product instructions, albeit here I think we only need 8-bit rather than 16-bit dot-products.

    We can note that for the subtraction we only actually care about the absolute difference since we always square the result, or we may know that the second operand is always less than the first one. Either way I think each pair of USUBL(2) instructions can be replaced by a single non-widening UABD. We can then make use of the dot product to do the accumulation and widening in a single step.

    For example instead of:

    ld1 {v16.16b}, [x0], x1
    ld1 {v17.16b}, [x2], x3
    usubl v2.8h, v16.8b, v17.8b
    usubl2 v3.8h, v16.16b, v17.16b
    smull v0.4s, v2.4h, v2.4h
    smull2 v1.4s, v2.8h, v2.8h

    We can probably do something like:

    movi v0.4s, #0 // Need to initialise an accumulator.
    ...
    ld1 {v16.16b}, [x0], x1
    ld1 {v17.16b}, [x2], x3
    uabd v2.16b, v16.16b, v17.16b
    udot v0.4s, v2.16b, v2.16b

    Hope that helps!

    Thanks,
    George

  • Hi George,

    Happy new year! I wish all the best.

    Regarding UABD vs SABD, I do not have a good explanation either. I have rebuilt the whole project from scratch, but the same thing still happens. I suspect that the caller follows different code paths (maybe an if statement checks the return value of the callee), and when UABD is used that value is wrong, so different paths are taken which limit the number of CPU cycles spent. I do not have any other explanation.

    Yes, you are right: "10" stands for 10-bit. The code you provided for pixel_sad_\h\()_neon_10 improves the performance, thanks! (Just a typo: it should be "uaddv" and not "uaddlv".)

    Your code for pixel_ssd_\h\()_neon improved the performance as well! Thanks!

    I have a couple of functions more to improve. Here they are:

    function quant_4x4x4_neon, export=1
        ld1         {v16.8h,v17.8h}, [x0]
        abs         v18.8h, v16.8h
        abs         v19.8h, v17.8h
        ld1         {v0.8h,v1.8h}, [x2]
        ld1         {v2.8h,v3.8h}, [x1]
        QUANT_TWO   v0.8h,  v1.8h,  v2,  v3,  v4.16b
        ld1         {v16.8h,v17.8h}, [x0]
        abs         v18.8h, v16.8h
        abs         v19.8h, v17.8h
        QUANT_TWO   v0.8h,  v1.8h,  v2,  v3,  v5.16b
        ld1         {v16.8h,v17.8h}, [x0]
        abs         v18.8h, v16.8h
        abs         v19.8h, v17.8h
        QUANT_TWO   v0.8h,  v1.8h,  v2,  v3,  v6.16b
        ld1         {v16.8h,v17.8h}, [x0]
        abs         v18.8h, v16.8h
        abs         v19.8h, v17.8h
        QUANT_TWO   v0.8h,  v1.8h,  v2,  v3,  v7.16b
        uqxtn       v4.8b,  v4.8h
        uqxtn       v7.8b,  v7.8h
        uqxtn       v6.8b,  v6.8h
        uqxtn       v5.8b,  v5.8h
        fmov        x7,  d7
        fmov        x6,  d6
        fmov        x5,  d5
        fmov        x4,  d4
        mov         w0,  #0
        tst         x7,  x7
        cinc        w0,  w0,  ne
        lsl         w0,  w0,  #1
        tst         x6,  x6
        cinc        w0,  w0,  ne
        lsl         w0,  w0,  #1
        tst         x5,  x5
        cinc        w0,  w0,  ne
        lsl         w0,  w0,  #1
        tst         x4,  x4
        cinc        w0,  w0,  ne
        ret
    endfunc
    
    .macro QUANT_TWO bias0 bias1 mf0_1 mf2_3 mask
        add         v18.8h, v18.8h, \bias0
        add         v19.8h, v19.8h, \bias1
        umull       v20.4s, v18.4h, \mf0_1\().4h
        umull2      v21.4s, v18.8h, \mf0_1\().8h
        umull       v22.4s, v19.4h, \mf2_3\().4h
        umull2      v23.4s, v19.8h, \mf2_3\().8h
        sshr        v16.8h, v16.8h, #15
        sshr        v17.8h, v17.8h, #15
        shrn        v18.4h, v20.4s, #16
        shrn2       v18.8h, v21.4s, #16
        shrn        v19.4h, v22.4s, #16
        shrn2       v19.8h, v23.4s, #16
        eor         v18.16b, v18.16b, v16.16b
        eor         v19.16b, v19.16b, v17.16b
        sub         v18.8h, v18.8h, v16.8h
        sub         v19.8h, v19.8h, v17.8h
        orr         \mask,  v18.16b, v19.16b
        st1         {v18.8h,v19.8h}, [x0], #32
    .endm
    

    function hpel_filter_neon, export=1
        ubfm        x9,  x3,  #0,  #3
        add         w15, w5,  w9
        sub         x13, x3,  x9            // align src
        sub         x10, x0,  x9
        sub         x11, x1,  x9
        sub         x12, x2,  x9
        movi        v30.16b,  #5
        movi        v31.16b,  #20
    1:  // line start
        mov         x3,  x13
        mov         x2,  x12
        mov         x1,  x11
        mov         x0,  x10
        add         x7,  x3,  #16           // src pointer next 16b for horiz filter
        mov         x5,  x15                // restore width
        sub         x3,  x3,  x4,  lsl #1   // src - 2*stride
        ld1         {v28.16b}, [x7], #16    // src[16:31]
    
        add         x9,  x3,  x5            // holds src - 2*stride + width
    
        ld1         {v16.16b}, [x3], x4     // src-2*stride[0:15]
        ld1         {v17.16b}, [x3], x4     // src-1*stride[0:15]
        ld1         {v18.16b}, [x3], x4     // src+0*stride[0:15]
        ld1         {v19.16b}, [x3], x4     // src+1*stride[0:15]
        ld1         {v20.16b}, [x3], x4     // src+2*stride[0:15]
        ld1         {v21.16b}, [x3], x4     // src+3*stride[0:15]
    
        ext         v22.16b, v7.16b,  v18.16b, #14
        uaddl       v1.8h,   v16.8b,  v21.8b
        ext         v26.16b, v18.16b, v28.16b, #3
        umlsl       v1.8h,   v17.8b,  v30.8b
        ext         v23.16b, v7.16b,  v18.16b, #15
        umlal       v1.8h,   v18.8b,  v31.8b
        ext         v24.16b, v18.16b, v28.16b, #1
        umlal       v1.8h,   v19.8b,  v31.8b
        ext         v25.16b, v18.16b, v28.16b, #2
        umlsl       v1.8h,   v20.8b,  v30.8b
    2:  // next 16 pixel of line
        subs        x5,  x5,  #16
        sub         x3,  x9,  x5            // src - 2*stride += 16
    
        uaddl       v4.8h,  v22.8b,  v26.8b
        uaddl2      v5.8h,  v22.16b, v26.16b
        sqrshrun    v6.8b,  v1.8h,   #5
        umlsl       v4.8h,  v23.8b,  v30.8b
        umlsl2      v5.8h,  v23.16b, v30.16b
        umlal       v4.8h,  v18.8b,  v31.8b
        umlal2      v5.8h,  v18.16b, v31.16b
        umlal       v4.8h,  v24.8b,  v31.8b
        umlal2      v5.8h,  v24.16b, v31.16b
        umlsl       v4.8h,  v25.8b,  v30.8b
        umlsl2      v5.8h,  v25.16b, v30.16b
    
        uaddl2      v2.8h,  v16.16b, v21.16b
        sqrshrun    v4.8b,  v4.8h,   #5
        mov         v7.16b, v18.16b
        sqrshrun2   v4.16b, v5.8h,   #5
    
        umlsl2      v2.8h,  v17.16b, v30.16b
        ld1         {v16.16b}, [x3],  x4    // src-2*stride[0:15]
        umlal2      v2.8h,  v18.16b, v31.16b
        ld1         {v17.16b}, [x3],  x4    // src-1*stride[0:15]
        umlal2      v2.8h,  v19.16b, v31.16b
        ld1         {v18.16b}, [x3],  x4    // src+0*stride[0:15]
        umlsl2      v2.8h,  v20.16b, v30.16b
        ld1         {v19.16b}, [x3],  x4    // src+1*stride[0:15]
        st1         {v4.16b},  [x0],  #16
        sqrshrun2   v6.16b, v2.8h,   #5
        ld1         {v20.16b}, [x3],  x4    // src+2*stride[0:15]
        ld1         {v21.16b}, [x3],  x4    // src+3*stride[0:15]
    
        ext         v22.16b, v0.16b, v1.16b, #12
        ext         v26.16b, v1.16b, v2.16b, #6
        ext         v23.16b, v0.16b, v1.16b, #14
        st1         {v6.16b},  [x1],  #16
        uaddl       v3.8h,   v16.8b, v21.8b
        ext         v25.16b, v1.16b, v2.16b, #4
        umlsl       v3.8h,   v17.8b, v30.8b
        ext         v24.16b, v1.16b, v2.16b, #2
    
        umlal       v3.8h,  v18.8b, v31.8b
        add         v4.8h,  v22.8h, v26.8h
        umlal       v3.8h,  v19.8b, v31.8b
        add         v5.8h,  v23.8h, v25.8h
        umlsl       v3.8h,  v20.8b, v30.8b
        add         v6.8h,  v24.8h, v1.8h
    
        ext         v22.16b, v1.16b, v2.16b, #12
        ext         v26.16b, v2.16b, v3.16b, #6
        ext         v23.16b, v1.16b, v2.16b, #14
        ext         v25.16b, v2.16b, v3.16b, #4
        ext         v24.16b, v2.16b, v3.16b, #2
    
        add         v22.8h, v22.8h, v26.8h
        add         v23.8h, v23.8h, v25.8h
        add         v24.8h, v24.8h, v2.8h
    
        sub         v4.8h,  v4.8h,  v5.8h   // a-b
        sub         v5.8h,  v5.8h,  v6.8h   // b-c
    
        sub         v22.8h, v22.8h, v23.8h  // a-b
        sub         v23.8h, v23.8h, v24.8h  // b-c
    
        sshr        v4.8h,  v4.8h,  #2      // (a-b)/4
        sshr        v22.8h, v22.8h, #2      // (a-b)/4
        sub         v4.8h,  v4.8h,  v5.8h   // (a-b)/4-b+c
        sub         v22.8h, v22.8h, v23.8h  // (a-b)/4-b+c
        sshr        v4.8h,  v4.8h,  #2      // ((a-b)/4-b+c)/4
        sshr        v22.8h, v22.8h, #2      // ((a-b)/4-b+c)/4
        add         v4.8h,  v4.8h,  v6.8h   // ((a-b)/4-b+c)/4+c = (a-5*b+20*c)/16
        add         v22.8h, v22.8h, v24.8h  // ((a-b)/4-b+c)/4+c = (a-5*b+20*c)/16
    
        sqrshrun    v4.8b,   v4.8h,   #6
        ld1         {v28.16b}, [x7],   #16  // src[16:31]
        mov         v0.16b,  v2.16b
        ext         v23.16b, v7.16b,  v18.16b, #15
        sqrshrun2   v4.16b,  v22.8h,  #6
        mov         v1.16b,  v3.16b
        ext         v22.16b, v7.16b,  v18.16b, #14
        ext         v24.16b, v18.16b, v28.16b, #1
        ext         v25.16b, v18.16b, v28.16b, #2
        ext         v26.16b, v18.16b, v28.16b, #3
    
        st1         {v4.16b}, [x2], #16
        b.gt        2b
    
        subs        w6,  w6,  #1
        add         x10,  x10,  x4
        add         x11,  x11,  x4
        add         x12,  x12,  x4
        add         x13,  x13,  x4
        b.gt        1b
    
        ret
    endfunc
    

    function sub8x8_dct8_neon, export=1
        mov         x3, #16
        mov         x4, #16
        ld1         {v16.8b}, [x1], x3
        ld1         {v17.8b}, [x2], x4
        ld1         {v18.8b}, [x1], x3
        ld1         {v19.8b}, [x2], x4
        usubl       v0.8h,  v16.8b, v17.8b
        ld1         {v20.8b}, [x1], x3
        ld1         {v21.8b}, [x2], x4
        usubl       v1.8h,  v18.8b, v19.8b
        ld1         {v22.8b}, [x1], x3
        ld1         {v23.8b}, [x2], x4
        usubl       v2.8h,  v20.8b, v21.8b
        ld1         {v24.8b}, [x1], x3
        ld1         {v25.8b}, [x2], x4
        usubl       v3.8h,  v22.8b, v23.8b
        ld1         {v26.8b}, [x1], x3
        ld1         {v27.8b}, [x2], x4
        usubl       v4.8h,  v24.8b, v25.8b
        ld1         {v28.8b}, [x1], x3
        ld1         {v29.8b}, [x2], x4
        usubl       v5.8h,  v26.8b, v27.8b
        ld1         {v30.8b}, [x1], x3
        ld1         {v31.8b}, [x2], x4
        usubl       v6.8h,  v28.8b, v29.8b
        usubl       v7.8h,  v30.8b, v31.8b
    
        DCT8_1D     row
        transpose8x8.h v0, v1, v2, v3, v4, v5, v6, v7, v30, v31
        DCT8_1D     col
    
        st1         {v0.8h,v1.8h,v2.8h,v3.8h}, [x0], #64
        st1         {v4.8h,v5.8h,v6.8h,v7.8h}, [x0], #64
        ret
    endfunc
    
    .macro DCT8_1D type
        SUMSUB_AB   v18.8h, v17.8h, v3.8h,  v4.8h   // s34/d34
        SUMSUB_AB   v19.8h, v16.8h, v2.8h,  v5.8h   // s25/d25
        SUMSUB_AB   v22.8h, v21.8h, v1.8h,  v6.8h   // s16/d16
        SUMSUB_AB   v23.8h, v20.8h, v0.8h,  v7.8h   // s07/d07
    
        SUMSUB_AB   v24.8h, v26.8h,  v23.8h, v18.8h  // a0/a2
        SUMSUB_AB   v25.8h, v27.8h,  v22.8h, v19.8h  // a1/a3
    
        SUMSUB_AB   v30.8h, v29.8h,  v20.8h, v17.8h  // a6/a5
        sshr        v23.8h, v21.8h, #1
        sshr        v18.8h, v16.8h, #1
        add         v23.8h, v23.8h, v21.8h
        add         v18.8h, v18.8h, v16.8h
        sub         v30.8h, v30.8h, v23.8h
        sub         v29.8h, v29.8h, v18.8h
    
        SUMSUB_AB   v28.8h, v31.8h,  v21.8h, v16.8h   // a4/a7
        sshr        v22.8h, v20.8h, #1
        sshr        v19.8h, v17.8h, #1
        add         v22.8h, v22.8h, v20.8h
        add         v19.8h, v19.8h, v17.8h
        add         v22.8h, v28.8h, v22.8h
        add         v31.8h, v31.8h, v19.8h
    
        SUMSUB_AB   v0.8h,  v4.8h,  v24.8h, v25.8h
        SUMSUB_SHR  2, v1.8h,  v7.8h,  v22.8h, v31.8h, v16.8h, v17.8h
        SUMSUB_SHR  1, v2.8h,  v6.8h,  v26.8h, v27.8h, v18.8h, v19.8h
        SUMSUB_SHR2 2, v3.8h,  v5.8h,  v30.8h, v29.8h, v20.8h, v21.8h
    .endm
    
    .macro SUMSUB_AB   sum, sub, a, b
        add         \sum,  \a,  \b
        sub         \sub,  \a,  \b
    .endm
    
    .macro SUMSUB_SHR shift sum sub a b t0 t1
        sshr        \t0,  \b, #\shift
        sshr        \t1,  \a, #\shift
        add         \sum, \a, \t0
        sub         \sub, \t1, \b
    .endm
    
    .macro SUMSUB_SHR2 shift sum sub a b t0 t1
        sshr        \t0,  \a, #\shift
        sshr        \t1,  \b, #\shift
        add         \sum, \t0, \b
        sub         \sub, \a, \t1
    .endm
    
    .macro transpose8x8.h r0, r1, r2, r3, r4, r5, r6, r7, r8, r9
        trn1        \r8\().8h,  \r0\().8h,  \r1\().8h
        trn2        \r9\().8h,  \r0\().8h,  \r1\().8h
        trn1        \r1\().8h,  \r2\().8h,  \r3\().8h
        trn2        \r3\().8h,  \r2\().8h,  \r3\().8h
        trn1        \r0\().8h,  \r4\().8h,  \r5\().8h
        trn2        \r5\().8h,  \r4\().8h,  \r5\().8h
        trn1        \r2\().8h,  \r6\().8h,  \r7\().8h
        trn2        \r7\().8h,  \r6\().8h,  \r7\().8h
    
        trn1        \r4\().4s,  \r0\().4s,  \r2\().4s
        trn2        \r2\().4s,  \r0\().4s,  \r2\().4s
        trn1        \r6\().4s,  \r5\().4s,  \r7\().4s
        trn2        \r7\().4s,  \r5\().4s,  \r7\().4s
        trn1        \r5\().4s,  \r9\().4s,  \r3\().4s
        trn2        \r9\().4s,  \r9\().4s,  \r3\().4s
        trn1        \r3\().4s,  \r8\().4s,  \r1\().4s
        trn2        \r8\().4s,  \r8\().4s,  \r1\().4s
    
        trn1        \r0\().2d,  \r3\().2d,  \r4\().2d
        trn2        \r4\().2d,  \r3\().2d,  \r4\().2d
    
        trn1        \r1\().2d,  \r5\().2d,  \r6\().2d
        trn2        \r5\().2d,  \r5\().2d,  \r6\().2d
    
        trn2        \r6\().2d,  \r8\().2d,  \r2\().2d
        trn1        \r2\().2d,  \r8\().2d,  \r2\().2d
    
        trn1        \r3\().2d,  \r9\().2d,  \r7\().2d
        trn2        \r7\().2d,  \r9\().2d,  \r7\().2d
    .endm
    
    

    Unfortunately, I was not able to find any way of improving these functions. Any thoughts?

    BR,

    Akis

  • Hi Akis,

    For the quant_4x4x4_neon function:

    Instead of an ABS followed by an ADD, we could consider trying to make use of the SABA instruction to perform an absolute difference with zero and accumulate, doing both operations at once. The obvious problem here is that the bias parameter is reused, so we would need an additional MOV instruction to duplicate it. This is less of an issue in SVE where we can make use of MOVPRFX, since in this case the additional instruction can be considered "free" if it is destructively used by the following instruction. See:

    https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/MOVPRFX--unpredicated---Move-prefix--unpredicated--
    https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/SABA--Signed-absolute-difference-and-accumulate-

    So instead of:

    abs v18.8h, v16.8h
    add v18.8h, v18.8h, \bias0

    We could instead consider something like:

    dup z30.h, #0
    ...
    movprfx z18, \bias0
    saba z18.h, z16.h, z30.h

    For the UMULL + SHRN #16 pairs, SVE has a "multiply returning high half" instruction, UMULH, which I think does what you want here in a single instruction:

    https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/UMULH--unpredicated---Unsigned-multiply-returning-high-half--unpredicated--
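
    For example (a sketch; assuming the mf0_1 values have been loaded into an SVE register, here z2, which is a hypothetical choice), the first UMULL/UMULL2 + SHRN/SHRN2 group could become:

    umulh z18.h, z18.h, z2.h  // per-element (v18 * mf) >> 16

    and similarly for v19 with mf2_3.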

    The SSHR #15 + EOR + SUB combination looks as if it is performing something like a conditional negation based on the sign of v16/v17 (i.e. v16.8h < 0 ? -v18.8h : v18.8h). Perhaps we can make use of SVE predicated instructions here to perform a conditional negation instead?

    So instead of:

    sshr v16.8h, v16.8h, #15
    eor v18.16b, v18.16b, v16.16b
    sub v18.8h, v18.8h, v16.8h

    We could instead consider something like:

    ptrue p1.b
    ...
    cmplt p0.h, p1/z, z16.h, #0
    neg z18.h, p0/m, z18.h

    For the hpel_filter_neon function, it's hard to know without understanding the underlying algorithm, but it doesn't appear that there is much we can do here.

    The arithmetic uses widening instructions, so we could consider trying to make use of the dot-product instructions here as well, using TBL instructions rather than the current EXT instructions to reorder the data into a layout that makes dot products more viable. However, I don't understand the permute being performed by the EXT instructions well enough to comment further on this one, I'm afraid.


    For the sub8x8_dct8_neon function we have quite a few SSHR instructions following the SUB in the SUMSUB_AB macro. It seems like we could make use of the halving-subtract instruction (SHSUB) here to do the same thing in a single instruction:

    https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/SHSUB--Signed-Halving-Subtract-?lang=en
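
    For example, taking the s16/d16 SUMSUB_AB in DCT8_1D and the SSHR that follows it, instead of a pair such as:

    sub  v21.8h, v1.8h,  v6.8h
    sshr v23.8h, v21.8h, #1

    we could consider something like (a sketch):

    shsub v23.8h, v1.8h, v6.8h  // (v1 - v6) >> 1 in one instruction

    although this only helps where the un-shifted difference (v21 here) is not needed again afterwards.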

    I think the same is also true for the SSHR #1 in the second use of the SUMSUB_SHR macro?

    Beyond that I don't think there is much we can do here since the code is mainly just adds and subtracts and a transpose which can't really be improved at all.

    Thanks,
    George

  • Hi George,

    thank you very much for your answer.

    For the quant_4x4x4_neon function, everything worked like a charm. Thanks!

    For the hpel_filter_neon function, unfortunately I do not have any more information right now. I will check the code again, and if I have more info I will contact you again.

    For the sub8x8_dct8_neon function, what you proposed is very interesting. However, it seems that we need both the normal subtraction output and the subtraction output right shifted by one. Below, I am listing again the DCT8_1D macro after substituting the internally used macros with the real instructions:

    .macro DCT8_1D_SVE type
        add         v18.8h,  v3.8h,  v4.8h
        sub         v17.8h,  v3.8h,  v4.8h
        add         v19.8h,  v2.8h,  v5.8h
        sub         v16.8h,  v2.8h,  v5.8h
        add         v22.8h,  v1.8h,  v6.8h
        sub         v21.8h,  v1.8h,  v6.8h
        add         v23.8h,  v0.8h,  v7.8h
        sub         v20.8h,  v0.8h,  v7.8h
    
        add         v24.8h,  v23.8h,  v18.8h
        sub         v26.8h,  v23.8h,  v18.8h
        add         v25.8h,  v22.8h,  v19.8h
        sub         v27.8h,  v22.8h,  v19.8h
    
        add         v30.8h,  v20.8h,  v17.8h
        sub         v29.8h,  v20.8h,  v17.8h
    
        sshr        v23.8h, v21.8h, #1
        sshr        v18.8h, v16.8h, #1
        add         v23.8h, v23.8h, v21.8h
        add         v18.8h, v18.8h, v16.8h
        sub         v30.8h, v30.8h, v23.8h
        sub         v29.8h, v29.8h, v18.8h
    
        add         v28.8h,  v21.8h,  v16.8h
        sub         v31.8h,  v21.8h,  v16.8h
        sshr        v22.8h, v20.8h, #1
        sshr        v19.8h, v17.8h, #1
        add         v22.8h, v22.8h, v20.8h
        add         v19.8h, v19.8h, v17.8h
        add         v22.8h, v28.8h, v22.8h
        add         v31.8h, v31.8h, v19.8h
    
        add         v0.8h,  v24.8h,  v25.8h
        sub         v4.8h,  v24.8h,  v25.8h
    
        sshr        v16.8h,  v31.8h, #2
        sshr        v17.8h,  v22.8h, #2
        add         v1.8h, v22.8h, v16.8h
        sub         v7.8h, v17.8h, v31.8h
        sshr        v18.8h,  v27.8h, #1
        sshr        v19.8h,  v26.8h, #1
        add         v2.8h, v26.8h, v18.8h
        sub         v6.8h, v19.8h, v27.8h
        sshr        v20.8h,  v30.8h, #2
        sshr        v21.8h,  v29.8h, #2
        add         v3.8h, v20.8h, v29.8h
        sub         v5.8h, v30.8h, v21.8h
    .endm
    

    For example, we subtract v6.8h from v1.8h and place the result in v21.8h. Then we right-shift v21.8h by 1 and place the result in v23.8h, as later on we need both v21.8h and v23.8h. So I do not think we can use shsub, as we would lose v21.8h. Am I missing something?

    Also, I have a couple more functions to improve. More specifically:

    function mc_copy_w16_neon, export=1
        lsl         x1, x1, #1
        lsl         x3, x3, #1
    1:  subs        w4, w4, #4
        ld1         {v0.8h, v1.8h}, [x2], x3
        ld1         {v2.8h, v3.8h}, [x2], x3
        ld1         {v4.8h, v5.8h}, [x2], x3
        ld1         {v6.8h, v7.8h}, [x2], x3
        st1         {v0.8h, v1.8h}, [x0], x1
        st1         {v2.8h, v3.8h}, [x0], x1
        st1         {v4.8h, v5.8h}, [x0], x1
        st1         {v6.8h, v7.8h}, [x0], x1
        b.gt        1b
        ret
    endfunc

    function memcpy_aligned_neon, export=1
        tst         x2,  #16
        b.eq        32f
        sub         x2,  x2,  #16
        ldr         q0,  [x1], #16
        str         q0,  [x0], #16
    32:
        tst         x2,  #32
        b.eq        640f
        sub         x2,  x2,  #32
        ldp         q0,  q1,  [x1], #32
        stp         q0,  q1,  [x0], #32
    640:
        cbz         x2,  1f
    64:
        subs        x2,  x2,  #64
        ldp         q0,  q1,  [x1, #32]
        ldp         q2,  q3,  [x1], #64
        stp         q0,  q1,  [x0, #32]
        stp         q2,  q3,  [x0], #64
        b.gt        64b
    1:
        ret
    endfunc

    const pw_0to15, align=5
        .short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
    endconst
    
    function mbtree_propagate_list_internal_neon, export=1
        movrel      x11,  pw_0to15
        dup         v31.8h,  w4             // bipred_weight
        movi        v30.8h,  #0xc0, lsl #8
        ld1         {v29.8h},  [x11] //h->mb.i_mb_x,h->mb.i_mb_y
        movi        v28.4s,  #4
        movi        v27.8h,  #31
        movi        v26.8h,  #32
        dup         v24.8h,  w5             // mb_y
        zip1        v29.8h,  v29.8h, v24.8h
    8:
        subs        w6,  w6,  #8
        ld1         {v1.8h},  [x1], #16     // propagate_amount
        ld1         {v2.8h},  [x2], #16     // lowres_cost
        and         v2.16b, v2.16b, v30.16b
        cmeq        v25.8h, v2.8h,  v30.8h
        umull       v16.4s, v1.4h,  v31.4h
        umull2      v17.4s, v1.8h,  v31.8h
        rshrn       v16.4h, v16.4s, #6
        rshrn2      v16.8h, v17.4s, #6
        bsl         v25.16b, v16.16b, v1.16b // if( lists_used == 3 )
        //          propagate_amount = (propagate_amount * bipred_weight + 32) >> 6
        ld1         {v4.8h,v5.8h},  [x0],  #32
        sshr        v6.8h,  v4.8h,  #5
        sshr        v7.8h,  v5.8h,  #5
        add         v6.8h,  v6.8h,  v29.8h
        add         v29.8h, v29.8h, v28.8h
        add         v7.8h,  v7.8h,  v29.8h
        add         v29.8h, v29.8h, v28.8h
        st1         {v6.8h,v7.8h},  [x3],  #32
        and         v4.16b, v4.16b, v27.16b
        and         v5.16b, v5.16b, v27.16b
        uzp1        v6.8h,  v4.8h,  v5.8h   // x & 31
        uzp2        v7.8h,  v4.8h,  v5.8h   // y & 31
        sub         v4.8h,  v26.8h, v6.8h   // 32 - (x & 31)
        sub         v5.8h,  v26.8h, v7.8h   // 32 - (y & 31)
        mul         v19.8h, v6.8h,  v7.8h   // idx3weight = y*x;
        mul         v18.8h, v4.8h,  v7.8h   // idx2weight = y*(32-x);
        mul         v17.8h, v6.8h,  v5.8h   // idx1weight = (32-y)*x;
        mul         v16.8h, v4.8h,  v5.8h   // idx0weight = (32-y)*(32-x) ;
        umull       v6.4s,  v19.4h, v25.4h
        umull2      v7.4s,  v19.8h, v25.8h
        umull       v4.4s,  v18.4h, v25.4h
        umull2      v5.4s,  v18.8h, v25.8h
        umull       v2.4s,  v17.4h, v25.4h
        umull2      v3.4s,  v17.8h, v25.8h
        umull       v0.4s,  v16.4h, v25.4h
        umull2      v1.4s,  v16.8h, v25.8h
        rshrn       v19.4h, v6.4s,  #10
        rshrn2      v19.8h, v7.4s,  #10
        rshrn       v18.4h, v4.4s,  #10
        rshrn2      v18.8h, v5.4s,  #10
        rshrn       v17.4h, v2.4s,  #10
        rshrn2      v17.8h, v3.4s,  #10
        rshrn       v16.4h, v0.4s,  #10
        rshrn2      v16.8h, v1.4s,  #10
        zip1        v0.8h,  v16.8h, v17.8h
        zip2        v1.8h,  v16.8h, v17.8h
        zip1        v2.8h,  v18.8h, v19.8h
        zip2        v3.8h,  v18.8h, v19.8h
        st1         {v0.8h,v1.8h},  [x3], #32
        st1         {v2.8h,v3.8h},  [x3], #32
        b.ge        8b
        ret
    endfunc

    function pixel_var2_8x\h\()_neon, export=1
        mov             x3,  #16
        ld1             {v16.8b}, [x0], #8
        ld1             {v18.8b}, [x1], x3
        ld1             {v17.8b}, [x0], #8
        ld1             {v19.8b}, [x1], x3
        mov             x5,  \h - 2
        usubl           v0.8h,  v16.8b, v18.8b
        usubl           v1.8h,  v17.8b, v19.8b
        ld1             {v16.8b}, [x0], #8
        ld1             {v18.8b}, [x1], x3
        smull           v2.4s,  v0.4h,  v0.4h
        smull2          v3.4s,  v0.8h,  v0.8h
        smull           v4.4s,  v1.4h,  v1.4h
        smull2          v5.4s,  v1.8h,  v1.8h
    
        usubl           v6.8h,  v16.8b, v18.8b
    
    1:  subs            x5,  x5,  #1
        ld1             {v17.8b}, [x0], #8
        ld1             {v19.8b}, [x1], x3
        smlal           v2.4s,  v6.4h,  v6.4h
        smlal2          v3.4s,  v6.8h,  v6.8h
        usubl           v7.8h,  v17.8b, v19.8b
        add             v0.8h,  v0.8h,  v6.8h
        ld1             {v16.8b}, [x0], #8
        ld1             {v18.8b}, [x1], x3
        smlal           v4.4s,  v7.4h,  v7.4h
        smlal2          v5.4s,  v7.8h,  v7.8h
        usubl           v6.8h,  v16.8b, v18.8b
        add             v1.8h,  v1.8h,  v7.8h
        b.gt            1b
    
        ld1             {v17.8b}, [x0], #8
        ld1             {v19.8b}, [x1], x3
        smlal           v2.4s,  v6.4h,  v6.4h
        smlal2          v3.4s,  v6.8h,  v6.8h
        usubl           v7.8h,  v17.8b, v19.8b
        add             v0.8h,  v0.8h,  v6.8h
        smlal           v4.4s,  v7.4h,  v7.4h
        add             v1.8h,  v1.8h,  v7.8h
        smlal2          v5.4s,  v7.8h,  v7.8h
    
        saddlv          s0,  v0.8h
        saddlv          s1,  v1.8h
        add             v2.4s,  v2.4s,  v3.4s
        add             v4.4s,  v4.4s,  v5.4s
        mov             w0,  v0.s[0]
        mov             w1,  v1.s[0]
        addv            s2,  v2.4s
        addv            s4,  v4.4s
        mul             w0,  w0,  w0
        mul             w1,  w1,  w1
        mov             w3,  v2.s[0]
        mov             w4,  v4.s[0]
        sub             w0,  w3,  w0,  lsr # 6 + (\h >> 4)
        sub             w1,  w4,  w1,  lsr # 6 + (\h >> 4)
        str             w3,  [x2]
        add             w0,  w0,  w1
        str             w4,  [x2, #4]
    
        ret
    endfunc

    function pixel_sad_x_h\()_neon_10, export=1
        mov         x7, #16
        lsl         x5, x5, #1
        lsl         x7, x7, #1
    
        ld1         {v0.8h, v1.8h}, [x0], x7
        ld1         {v2.8h, v3.8h}, [x1], x5
    
        ld1         {v4.8h, v5.8h}, [x2], x5
        uabd        v16.8h, v2.8h, v0.8h
        uabd        v20.8h, v3.8h, v1.8h
        ld1         {v24.8h, v25.8h}, [x3], x5
        uabd        v17.8h, v4.8h, v0.8h
        uabd        v21.8h, v5.8h, v1.8h
    
        ld1         {v6.8h, v7.8h}, [x0], x7
        ld1         {v2.8h, v3.8h}, [x1], x5
        uabd        v18.8h, v24.8h, v0.8h
        uabd        v22.8h, v25.8h, v1.8h
        ld1         {v4.8h, v5.8h}, [x2], x5
        uaba        v16.8h, v2.8h, v6.8h
        uaba        v20.8h, v3.8h, v7.8h
    
        ld1         {v24.8h, v25.8h}, [x3], x5
        uaba        v17.8h, v4.8h, v6.8h
        uaba        v21.8h, v5.8h, v7.8h
    
        ld1         {v26.8h, v27.8h}, [x4], x5
        ld1         {v28.8h, v29.8h}, [x4], x5
        uaba        v18.8h, v24.8h, v6.8h
        uaba        v22.8h, v25.8h, v7.8h
        uabd        v19.8h, v26.8h, v0.8h
        uabd        v23.8h, v27.8h, v1.8h
    
        uaba        v19.8h, v28.8h, v6.8h
        uaba        v23.8h, v29.8h, v7.8h
    
    .rept \h / 2 - 1
        ld1         {v0.8h, v1.8h}, [x0], x7
        ld1         {v2.8h, v3.8h}, [x1], x5
    
        ld1         {v4.8h, v5.8h}, [x2], x5
        uaba        v16.8h, v2.8h, v0.8h
        uaba        v20.8h, v3.8h, v1.8h
        ld1         {v24.8h, v25.8h}, [x3], x5
        uaba        v17.8h, v4.8h, v0.8h
        uaba        v21.8h, v5.8h, v1.8h
    
        ld1         {v6.8h, v7.8h}, [x0], x7
        ld1         {v2.8h, v3.8h}, [x1], x5
        uaba        v18.8h, v24.8h, v0.8h
        uaba        v22.8h, v25.8h, v1.8h
        ld1         {v4.8h, v5.8h}, [x2], x5
        uaba        v16.8h, v2.8h, v6.8h
        uaba        v20.8h, v3.8h, v7.8h
    
        ld1         {v24.8h, v25.8h}, [x3], x5
        uaba        v17.8h, v4.8h, v6.8h
        uaba        v21.8h, v5.8h, v7.8h
    
        ld1         {v26.8h, v27.8h}, [x4], x5
        ld1         {v28.8h, v29.8h}, [x4], x5
        uaba        v18.8h, v24.8h, v6.8h
        uaba        v22.8h, v25.8h, v7.8h
        uaba        v19.8h, v26.8h, v0.8h
        uaba        v23.8h, v27.8h, v1.8h
    
        uaba        v19.8h, v28.8h, v6.8h
        uaba        v23.8h, v29.8h, v7.8h
    .endr
    
        add         v16.8h, v16.8h, v20.8h
        add         v17.8h, v17.8h, v21.8h
        add         v18.8h, v18.8h, v22.8h
        add         v19.8h, v19.8h, v23.8h
        
        // add up the sads
        uaddlv      s0, v16.8h
        uaddlv      s1, v17.8h
        uaddlv      s2, v18.8h
    
        stp         s0, s1, [x6], #8
        uaddlv      s3, v19.8h
        stp         s2, s3, [x6]
        ret
    endfunc
    

    For the latter, I tried to use the udot approach, but the performance is degraded. Any thoughts?

    You do not have to provide me with full functions, just some hints. Sorry for the large amount of help that I am asking for; it seems that this forum is the only help I can get.

    Thank you in advance,

    Akis

  • Hi Akis,

    Good to hear that my suggestions for quant_4x4x4_neon worked as we expected!

    For sub8x8_dct8_neon I wonder if we can still remove some of the shift instructions by combining them with the addition that follows, using the SSRA instruction to perform the shift and the addition in a single instruction:

    https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/SSRA--Signed-Shift-Right-and-Accumulate--immediate--

    For example instead of:

    sshr v23.8h, v21.8h, #1
    add v23.8h, v23.8h, v21.8h

    We could instead consider something like:

    ssra v21.8h, v21.8h, #1    // v21.8h += v21.8h >> 1

    This has the disadvantage that we must reuse the same register as the non-shifted addend, so this does not work if we need v21 elsewhere later. In your snippet I think v21 is used in an ADD and a SUB after the SSHR+ADD pair; however, I suspect this can be solved by re-ordering the code so that the ADD/SUB are done first, allowing the register to be reused for the SSRA, as in the sketch below.
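
    Taking the register numbers from your expanded macro as a sketch (and assuming v21 and v16 are not needed again after this block, which appears to be the case there):

    add  v28.8h, v21.8h, v16.8h  // consume the un-shifted values first
    sub  v31.8h, v21.8h, v16.8h
    ssra v21.8h, v21.8h, #1      // v21 += v21 >> 1
    ssra v16.8h, v16.8h, #1      // v16 += v16 >> 1
    sub  v30.8h, v30.8h, v21.8h
    sub  v29.8h, v29.8h, v16.8h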

    For the copy functions like mc_copy_w16_neon and memcpy_aligned_neon there is probably no benefit from SVE at the same vector length as Neon. One small optimisation you could consider is maintaining multiple independent source and destination addresses (e.g. x0, x0+x1, x0+2*x1, x0+3*x1) and incrementing them independently (e.g. by x1*4), since currently in mc_copy_w16_neon, for instance, the x0 and x2 addresses must each be updated four times per loop iteration, which could be slow. I don't expect that to have a big impact on performance, though.
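
    As a sketch of what I mean for the destination side of mc_copy_w16_neon (x8-x11 below are hypothetical register choices; the source side could be handled in the same way):

    add         x8,  x0,  x1            // dst + 1*stride
    add         x9,  x0,  x1,  lsl #1   // dst + 2*stride
    add         x10, x8,  x1,  lsl #1   // dst + 3*stride
    lsl         x11, x1,  #2            // 4*stride
    1:  subs        w4,  w4,  #4
        ...                             // loads as before
        st1         {v0.8h, v1.8h}, [x0],  x11
        st1         {v2.8h, v3.8h}, [x8],  x11
        st1         {v4.8h, v5.8h}, [x9],  x11
        st1         {v6.8h, v7.8h}, [x10], x11
        b.gt        1b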

    For mbtree_propagate_list_internal_neon I wonder if we can use the SSRA instruction here as well? We currently have e.g.

    sshr v6.8h, v4.8h, #5
    add v6.8h, v6.8h, v29.8h

    Which could instead be:

    ssra v29.8h, v4.8h, #5

    I guess that doesn't work so well in this case because v28 and v29 are needed for the next iteration of the loop, but even a MOV to duplicate them into another register may still be better, since the constants will not be on the critical path of the calculation; see the sketch below.
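
    As a sketch of that variant (keeping v29 intact for the next iteration, with v6 as the scratch copy it already uses):

    mov  v6.16b,  v29.16b        // copy the x/y coordinates off the critical path
    ssra v6.8h,   v4.8h,   #5    // v6 = v29 + (v4 >> 5)
    add  v29.8h,  v29.8h,  v28.8h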

    The UZP1/UZP2 and later ZIP1/ZIP2 instructions in the loop feel strange, since the ZIP1/ZIP2 will undo the effect of the earlier UZP1/UZP2 instructions. Perhaps the other operand (v25) can be adjusted so that both pairs of permutes can either be removed or at least replaced with a single permute that swaps pairs of lanes so that they can interact with each other (REV32.8H?).

    Finally, I suspect it doesn't work in this case, but I'm mentioning it in case it is useful: for the UMULL+RSHRN pairs we could consider trying to replace those with something like the SVE UMULH instruction:

    https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/UMULH--unpredicated---Unsigned-multiply-returning-high-half--unpredicated--

    The problems I suspect we'll encounter trying to use UMULH here are (a) that the shift is a rounding shift, which means we cannot usually just take the top half of the multiplication result, and (b) that the shift value is only 10 rather than 16. The shift value might not be a problem if you could adjust the operands and multiply by (v25 << 6) instead, but that might not be possible depending on the range of possible values for that multiplicand.

    For pixel_var2_8x\h\()_neon I would assume that we could replace the USUBL+SMULL/SMLAL pairs with UABD+UDOT as we have done previously. Since we have one contiguous array, it may also be worth loading full vectors of data here rather than only using half a vector at a time, e.g.

    ld1 {v16.8b}, [x0], #8
    ld1 {v18.8b}, [x1], x3
    ld1 {v17.8b}, [x0], #8
    ld1 {v19.8b}, [x1], x3

    Could be something like:

    ld1 {v16.16b}, [x0], #16 // Merged from v16 and v17.
    ld1 {v18.8b}, [x1], x3
    ld1 {v18.d}[1], [x1], x3 // Load into high half of v18, not v19.

    For pixel_sad_x_h\()_neon_10 I agree with your conclusion. I don't think that there will be much benefit from dot product here since there is never a widening operation, so the UABA instruction is able to operate on full vectors rather than on only half of a vector like in some of our previous examples where we have used UMLAL or UABAL etc.

    Hope that helps!

    Thanks,
    George

  • Hi George,

    For sub8x8_dct8_neon, I applied your suggestion and everything worked fine. Thanks!

    For the copy functions, as you said, there is not much left to do to improve the performance.

    For mbtree_propagate_list_internal_neon, I applied your suggestion. Thanks!

    For pixel_var2_8x\h\()_neon, I used the udot instruction, but it doesn't work. It seems that some vectors (for example v0.8h, v1.8h, v6.8h and v7.8h) are still needed after widening instructions. I developed the following function:

    function pixel_var2_8x\h\()_sve, export=1
        movi            v30.4s, #0
        movi            v31.4s, #0
        mov             x3,  #16
        ld1             {v16.8b}, [x0], #8
        ld1             {v18.8b}, [x1], x3
        ld1             {v17.8b}, [x0], #8
        ld1             {v19.8b}, [x1], x3
        mov             x5,  \h - 2
        uabd            v28.8b, v16.8b, v18.8b
        usubl           v0.8h,  v16.8b, v18.8b
        uabd            v29.8b, v17.8b, v19.8b
        usubl           v1.8h,  v17.8b, v19.8b
        ld1             {v16.8b}, [x0], #8
        ld1             {v18.8b}, [x1], x3
    
        udot            v30.2s, v28.8b, v28.8b
        udot            v31.2s, v29.8b, v29.8b
    
        uabd            v28.8b, v16.8b, v18.8b
        usubl           v6.8h,  v16.8b, v18.8b
    
    1:  subs            x5,  x5,  #1
        ld1             {v17.8b}, [x0], #8
        ld1             {v19.8b}, [x1], x3
        udot            v30.2s, v28.8b, v28.8b
        uabd            v29.8b, v17.8b, v19.8b
        usubl           v7.8h,  v17.8b, v19.8b
        add             v0.8h,  v0.8h,  v6.8h
        ld1             {v16.8b}, [x0], #8
        ld1             {v18.8b}, [x1], x3
        udot            v31.2s, v29.8b, v29.8b
        uabd            v28.8b, v16.8b, v18.8b
        usubl           v6.8h,  v16.8b, v18.8b
        add             v1.8h,  v1.8h,  v7.8h
        b.gt            1b
    
        ld1             {v17.8b}, [x0], #8
        ld1             {v19.8b}, [x1], x3
        udot            v30.2s, v6.8b, v6.8b
        uabd            v29.8b, v17.8b, v19.8b
        usubl           v7.8h,  v17.8b, v19.8b
        add             v0.8h,  v0.8h,  v6.8h
        udot            v31.2s, v29.8b, v29.8b
        add             v1.8h,  v1.8h,  v7.8h
    
        saddlv          s0,  v0.8h
        saddlv          s1,  v1.8h
        mov             w0,  v0.s[0]
        mov             w1,  v1.s[0]
        addv            s2,  v30.4s
        addv            s4,  v31.4s
        mul             w0,  w0,  w0
        mul             w1,  w1,  w1
        mov             w3,  v30.s[0]
        mov             w4,  v31.s[0]
        sub             w0,  w3,  w0,  lsr # 6 + (\h >> 4)
        sub             w1,  w4,  w1,  lsr # 6 + (\h >> 4)
        str             w3,  [x2]
        add             w0,  w0,  w1
        str             w4,  [x2, #4]
    
        ret
    endfunc
    

    Unit tests fail. Can you please tell me what I am doing wrong? Also, using the three merged load instructions instead of the initial four degrades the performance; I do not know why.

    For pixel_sad_x_h\()_neon_10, I also agree that we cannot improve it.

    BR,

    Akis

  • Hi Akis,

    It's a bit hard for me to try and debug the whole code snippet. One thing I did notice, though, is that at the end of the function you reduce v30 and v31 like this:

    addv s2, v30.4s
    addv s4, v31.4s
    ...
    mov w3, v30.s[0] // Should this be v2.s[0] ?
    mov w4, v31.s[0] // Should this be v4.s[0] ?

    This seems suspicious since s2 and s4 are otherwise never used after those instructions.

    With regard to still needing the USUBL, do you know if either the absolute difference (UABD) or a non-widening subtract (SUB) would work here instead? If so, then we could potentially use only one of those, since the UABD and USUBL are doing very similar things at the moment. Assuming that a non-widening approach works here, you could then sum the results with UADDW, or with another UDOT instruction with all-1s as the other operand.

    For example, instead of:

    uabd v28.8b, v16.8b, v18.8b
    usubl v6.8h, v16.8b, v18.8b
    udot v30.2s, v28.8b, v28.8b
    add v0.8h, v0.8h, v6.8h

    We could see if something like this would work instead:

    uabd v28.8b, v16.8b, v18.8b // or SUB?
    udot v30.2s, v28.8b, v28.8b
    uaddw v0.8h, v0.8h, v28.8b

    Using the dot product would also work here if we need to widen beyond a 16-bit accumulator for v0 since it allows us to accumulate in 32-bits by multiplying by a vector of all-1s:

    movi v6.16b, #1
    ...
    uabd v28.8b, v16.8b, v18.8b // or SUB?
    udot v30.2s, v28.8b, v28.8b
    udot v0.2s, v28.8b, v6.8b // v28.8b * 1

    If an approach like that works, then it may be beneficial to re-try the three-load approach, since the entire computation can be moved from .8b to .16b, which could be more significant than in your previous attempt.
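
    If an approach like that does work, here is a rough sketch of the full-vector form, combining the earlier three-load suggestion with the UABD and the two UDOTs (v6 being the all-ones vector from above):

    movi  v6.16b, #1
    ...
    ld1   {v16.16b},  [x0], #16     // two 8-byte rows of x0 in one load
    ld1   {v18.8b},   [x1], x3
    ld1   {v18.d}[1], [x1], x3      // second row into the high half of v18
    uabd  v28.16b, v16.16b, v18.16b
    udot  v30.4s,  v28.16b, v28.16b // sum of squared differences
    udot  v0.4s,   v28.16b, v6.16b  // sum of absolute differences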

    Hope that helps!

    Thanks,
    George

  • Hi George,

    After using the MOV instructions you proposed, everything worked fine. Thanks!

    Unfortunately, after some testing, I can use neither SUB nor UABD; unit tests fail again. So I cannot use the 3-load approach either. But your proposed solution is very interesting, and it may help me optimize other functions. Thanks!

    I think we can close this thread. You gave me a lot of help; I couldn't have reached this point without it. Once again, thanks!

    If I need further help, I will create a new thread (I hope that is OK).

    BR,

    Akis