Improve Performance of specific NEON functions using SVE/SVE2

Hello,

I have the following three functions, which use the NEON instruction set:

function pixel_avg2_w8_neon, export=1
1:
    subs        w5,  w5,  #2
    ld1         {v0.8b}, [x2], x3
    ld1         {v2.8b}, [x4], x3
    urhadd      v0.8b,  v0.8b,  v2.8b
    ld1         {v1.8b}, [x2], x3
    ld1         {v3.8b}, [x4], x3
    urhadd      v1.8b,  v1.8b,  v3.8b
    st1         {v0.8b}, [x0], x1
    st1         {v1.8b}, [x0], x1
    b.gt        1b
    ret
endfunc

function pixel_avg2_w16_neon, export=1
1:
    subs        w5,  w5,  #2
    ld1         {v0.16b}, [x2], x3
    ld1         {v2.16b}, [x4], x3
    urhadd      v0.16b, v0.16b, v2.16b
    ld1         {v1.16b}, [x2], x3
    ld1         {v3.16b}, [x4], x3
    urhadd      v1.16b, v1.16b, v3.16b
    st1         {v0.16b}, [x0], x1
    st1         {v1.16b}, [x0], x1
    b.gt        1b
    ret
endfunc
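For readers less familiar with the assembly, here is a scalar C sketch of what the averaging functions above compute (a hypothetical reference: the argument layout is inferred from the register usage, and urhadd is the rounding average (a + b + 1) >> 1):

```c
#include <stdint.h>

/* Scalar reference for pixel_avg2_w8: average two 8-byte-wide blocks
 * row by row. urhadd rounds upward: (a + b + 1) >> 1.
 * The w16 variant is identical but with 16 bytes per row. */
static void pixel_avg2_w8_ref(uint8_t *dst, intptr_t dst_stride,
                              const uint8_t *src1, intptr_t src_stride,
                              const uint8_t *src2, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 8; x++)
            dst[x] = (uint8_t)((src1[x] + src2[x] + 1) >> 1);
        dst  += dst_stride;
        src1 += src_stride;
        src2 += src_stride;  /* both sources share a stride, as in the asm */
    }
}
```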

function pixel_sad_\h\()_neon, export=1
    ld1         {v1.16b}, [x2], x3
    ld1         {v0.16b}, [x0], x1
    ld1         {v3.16b}, [x2], x3
    ld1         {v2.16b}, [x0], x1
    uabdl       v16.8h,  v0.8b,  v1.8b
    uabdl2      v17.8h,  v0.16b, v1.16b
    uabal       v16.8h,  v2.8b,  v3.8b
    uabal2      v17.8h,  v2.16b, v3.16b

.rept \h / 2 - 1
    ld1         {v1.16b}, [x2], x3
    ld1         {v0.16b}, [x0], x1
    ld1         {v3.16b}, [x2], x3
    ld1         {v2.16b}, [x0], x1
    uabal       v16.8h,  v0.8b,  v1.8b
    uabal2      v17.8h,  v0.16b, v1.16b
    uabal       v16.8h,  v2.8b,  v3.8b
    uabal2      v17.8h,  v2.16b, v3.16b
.endr
    add         v16.8h,  v16.8h,  v17.8h
    uaddlv      s0,  v16.8h
    fmov        w0,  s0
    ret
endfunc
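And a scalar sketch of the SAD function, again with the argument layout inferred from the registers (\h is the block height, expanded by a surrounding macro):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference for pixel_sad_h: sum of absolute differences over a
 * 16-byte-wide block of `height` rows. */
static int pixel_sad_ref(const uint8_t *src, intptr_t src_stride,
                         const uint8_t *ref, intptr_t ref_stride,
                         int height)
{
    int sad = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 16; x++)
            sad += abs(src[x] - ref[x]);
        src += src_stride;
        ref += ref_stride;
    }
    return sad;
}
```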

I want to use the SVE/SVE2 instruction set to improve the performance of these functions. My testbed is an Alibaba Yitian 710 (vector length = 128 bits).

For the first two, I couldn't find a way to improve the performance. For the third, I wrote the following function:

function pixel_sad_\h\()_sve, export=1
    ptrue       p0.h, vl8
    ld1b        {z1.h}, p0/z, [x2]
    ld1b        {z4.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z3.h}, p0/z, [x2]
    ld1b        {z6.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z0.h}, p0/z, [x0]
    ld1b        {z5.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    ld1b        {z2.h}, p0/z, [x0]
    ld1b        {z7.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    uabd        v16.8h,  v0.8h,  v1.8h
    uabd        v17.8h,  v4.8h,  v5.8h
    uaba        v16.8h,  v2.8h,  v3.8h
    uaba        v17.8h,  v7.8h,  v6.8h

.rept \h / 2 - 1
    ld1b        {z1.h}, p0/z, [x2]
    ld1b        {z4.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z3.h}, p0/z, [x2]
    ld1b        {z6.h}, p0/z, [x2, #1, mul vl]
    add         x2, x2, x3
    ld1b        {z0.h}, p0/z, [x0]
    ld1b        {z5.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    ld1b        {z2.h}, p0/z, [x0]
    ld1b        {z7.h}, p0/z, [x0, #1, mul vl]
    add         x0, x0, x1
    uaba        v16.8h,  v0.8h,  v1.8h
    uaba        v17.8h,  v4.8h,  v5.8h
    uaba        v16.8h,  v2.8h,  v3.8h
    uaba        v17.8h,  v7.8h,  v6.8h
.endr
    
    add         v16.8h,  v16.8h,  v17.8h
    uaddlv      s0,  v16.8h
    fmov        w0,  s0
    ret
endfunc

However, this degrades the performance instead of improving it.

Can someone help me?

Thank you in advance,

Akis

  • Hi Akis,

    Thanks for the question!

    For an SVE vector length of 128 bits you are probably correct that there is not much performance to be gained for these particular functions. In general, SVE and SVE2 provide a performance uplift either when longer vector lengths are available or when we can take advantage of SVE features that are not already present in Neon. Some examples of where SVE can provide a benefit would be:

    • Code that can take advantage of predication.
    • Code that can make use of gather/scatter instructions.
    • Code that can make use of the widening load instructions or narrowing store instructions.
    • Code that can use some of the new data processing instructions, like the histogram instructions, the bit-manipulation instructions, or the 16-bit dot product instructions to name a few.

    Starting with the pixel_avg2_w16_neon function: we are processing exactly 16 bytes at a time here, so gather/scatter and predication provide no benefit at VL=128. There is no widening or narrowing, so there is probably no benefit from new SVE2 instruction sequences either. I think you are correct that there is not much we can do here.

    For pixel_avg2_w8_neon: again there is no widening or narrowing, so there is probably not much to gain from new SVE2 instructions. Since we are only processing 64 bits at a time, though, there is the potential to use gather/scatter instructions to fill an entire vector and operate on that instead. In particular, something like:

    function pixel_avg2_w8_sve, export=1
      ptrue p0.b
      index z4.d, #0, x3    // create a vector of {0,x3,x3*2,x3*3,...}
      index z5.d, #0, x1    // create a vector of {0,x1,x1*2,x1*3,...}
      cntd x6
      mul x3, x6, x3        // input stride *= vl
      mul x1, x6, x1        // output stride *= vl
      whilelt p1.d, wzr, w5 // create a predicate to deal with odd lengths of w5;
                            // no branch needed here since we assume w5 > 0.
    1:
      sub w5, w5, w6
      ld1d {z0.d}, p1/z, [x2, z4.d] // gather blocks of 64-bits.
      ld1d {z2.d}, p1/z, [x4, z4.d] // gather blocks of 64-bits.
      urhadd z0.b, p0/m, z0.b, z2.b // operate on a full vector of data.
      st1d {z0.d}, p1, [x0, z5.d]   // scatter blocks of 64-bits using the output-stride offsets.
      add x2, x2, x3
      add x4, x4, x3
      add x0, x0, x1
      whilelt p1.d, wzr, w5
      b.any 1b
      ret
    endfunc

    While the above sequence is vector-length agnostic and makes full use of SVE features, it is probably slower than your original Neon implementation, since gather and scatter instructions tend to carry a relatively high overhead. That overhead is fine when it enables a significant amount of vector work elsewhere, but here the data processing is only a single instruction, so the tradeoff is unlikely to be worthwhile.

    It's also worth noting that the above implementation uses WHILELT instructions for loop control. This is unnecessary if you can guarantee that the number of rows being processed is exactly divisible by the number of 64-bit elements per vector (divisible by two if you only care about VL=128). Removing the WHILELT and returning to a normal SUBS or CMP for loop control may be a small improvement, but the cost here is ultimately dominated by the gather/scatter, unfortunately.

    Finally, let us consider the pixel_sad_\h\()_neon function. While we could use widening loads as in your suggested SVE code, we should generally prefer the widening data-processing instructions, as the Neon code does, whenever there is no overhead to doing so: each load instruction then brings in more data, so fewer instructions are needed overall.

    In this case we can probably take advantage of a different instruction sequence, albeit one that also exists in Neon. The widening instructions can only operate on half of a vector of data at a time, so there is usually some room for improvement if a non-widening alternative exists, even when the replacement sequence is also two instructions.

    The dot-product instructions are optional from Armv8.2-A, were made mandatory in Armv8.4-A, and are available in both Neon and SVE. You can tell whether the Neon dot-product instructions are available by the presence of the "asimddp" (dp = dot product) feature in /proc/cpuinfo. Since your micro-architecture includes SVE, this shouldn't be a problem.
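On Linux the same check can be made programmatically via the auxiliary vector rather than by parsing /proc/cpuinfo. A minimal sketch (guarded so it simply reports "not available" on non-AArch64 hosts):

```c
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

/* Report whether the Neon dot-product (UDOT/SDOT) instructions are
 * available on this machine: 1 if yes, 0 otherwise. */
static int have_asimddp(void)
{
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP_ASIMDDP)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
#else
    return 0;  /* not an AArch64 Linux host */
#endif
}
```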

    The dot-product instructions can be used on many micro-architectures as a faster way of performing a widening accumulation, since they tend to have good latency and throughput. This means we can use a non-widening absolute-difference instruction to calculate a full vector of results at once and then accumulate them separately. Something like:

    function pixel_sad_\h\()_neon_dotprod, export=1
      ld1 {v1.16b}, [x2], x3
      ld1 {v0.16b}, [x0], x1
      ld1 {v3.16b}, [x2], x3
      ld1 {v2.16b}, [x0], x1
      movi v19.4s, #0    // accumulator vector
      movi v18.16b, #1   // constant vector of 1s
      uabd v16.16b, v0.16b, v1.16b
      uabd v17.16b, v2.16b, v3.16b
      udot v19.4s, v16.16b, v18.16b
      udot v19.4s, v17.16b, v18.16b
    
    .rept \h / 2 - 1
      ld1 {v1.16b}, [x2], x3
      ld1 {v0.16b}, [x0], x1
      ld1 {v3.16b}, [x2], x3
      ld1 {v2.16b}, [x0], x1
      uabd v16.16b, v0.16b, v1.16b
      uabd v17.16b, v2.16b, v3.16b
      udot v19.4s, v16.16b, v18.16b
      udot v19.4s, v17.16b, v18.16b
    .endr
    
      addv s0, v19.4s
      fmov w0, s0
      ret
    endfunc

    You can see above that we are using a dot product of our vector of absolute differences (v16 and v17) by a vector of all 1s (v18) to accumulate our result into v19. You may also want to consider having multiple accumulators rather than just v19 as we have above.
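The arithmetic behind the trick is easy to check in scalar C: a dot product of the absolute differences against a vector of all 1s is simply their sum, so accumulating UDOT results into 32-bit lanes matches the widening UABAL accumulation (a minimal sketch of one 4-byte lane, not the asm itself):

```c
#include <stdint.h>
#include <stdlib.h>

/* One UDOT lane: dot product of four absolute differences with four 1s
 * accumulates their plain sum into a 32-bit lane, avoiding the 16-bit
 * widening step that UABAL performs. */
static uint32_t udot_lane_with_ones(const uint8_t a[4], const uint8_t b[4],
                                    uint32_t acc)
{
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)abs(a[i] - b[i]) * 1;  /* * 1 = the all-ones vector */
    return acc;
}
```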

    While it is possible to write an SVE version of the above code, for the reasons mentioned before there is probably not much benefit to using SVE at a vector length of 128 bits here.
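The multiple-accumulator suggestion mentioned above breaks the dependency chain on a single accumulator register, so back-to-back accumulations need not wait on each other. In scalar terms the idea looks like this (illustrative only):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum absolute differences using two independent accumulators, so that
 * consecutive additions form two shorter dependency chains instead of
 * one long one; the chains are combined once at the end. */
static int sad_two_accumulators(const uint8_t *a, const uint8_t *b, int n)
{
    int acc0 = 0, acc1 = 0;  /* two independent chains */
    int i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += abs(a[i]     - b[i]);
        acc1 += abs(a[i + 1] - b[i + 1]);
    }
    for (; i < n; i++)       /* tail for odd n */
        acc0 += abs(a[i] - b[i]);
    return acc0 + acc1;      /* combine at the end */
}
```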

    Hope that helps!

    Thanks,
    George
