
Take full advantage of SVE vector length agnostic approach

Hello,

I have the following piece of code:

template<int bx, int by>
void blockcopy_sp_c(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            a[x] = (pixel)b[x];
        }

        a += stridea;
        b += strideb;
    }
}

So, after reading/storing one row of bx 16-bit elements (bx*16 bits), we need to jump to another location in memory and read/store the next row, and so on.

One possible ASM code for NEON to support the aforementioned function is the following (assuming that bx=by=8):

function PFX(blockcopy_sp_8x8_neon)
    lsl x3, x3, #1
.rept 4
    ld1 {v0.8h}, [x2], x3
    ld1 {v1.8h}, [x2], x3
    xtn v0.8b, v0.8h
    xtn v1.8b, v1.8h
    st1 {v0.d}[0], [x0], x1
    st1 {v1.d}[0], [x0], x1
.endr
    ret
endfunc
However, the only way to use a post-indexed register offset in SVE seems to be via the gather loads and scatter stores. So, a possible SVE2 implementation of the aforementioned function is the following (assuming that bx=by=8):
function PFX(blockcopy_sp_8x8)
    MOV x8, #8
    MOV x9, #0
    MOV x6, #0
    MOV x7, #0
    MOV z31.d, #64
    MOV z0.d, #0

    WHILELT p1.d, x9, x8
    B.NONE .L_return_blockcopy_sp_8x8

.L_loopStart_blockcopy_sp_8x8:
    INDEX z1.d, x6, x3
    INDEX z2.d, x7, x1
.rept 2
    LD1D z3.d, p1/Z, [x2, z1.d]
    ADD z1.d, z1.d, z31.d
    UZP1 z3.b, z3.b, z0.b
    ST1W z3.d, p1, [x0, z2.d, UXTW #2]
    ADD z2.d, z2.d, z31.d
.endr
    INCD x9
    MUL x6, x9, x3
    MUL x7, x9, x1
    WHILELT p1.d, x9, x8
    B.FIRST .L_loopStart_blockcopy_sp_8x8
.L_return_blockcopy_sp_8x8:
    RET
endfunc
However, I do not believe that this code takes full advantage of the SVE vector-length-agnostic approach.
For example, the LD1D instruction reads only 64 bits before it jumps to the next location in memory.
So, it might be the case that the z3 register is not fully loaded with 16 bytes of data.
Can you please tell me what I am doing wrong?
Thank you in advance.
  • Hi George,

    one more question.

    What does the "ptrue p0.h, vl8" do? I couldn't find what the "vl8" means in the ARM tutorials. Does it mean that 8 lanes in p0 become true associated with half words?

    So, the final code using contiguous loads and stores is the following, right?

    function PFX(blockcopy_sp_8x8)
        lsl x3, x3, #1
        ptrue p0.h, vl8
    .rept 8
        ld1h {z0.h}, p0, [x2]
        add x2, x2, x3
        st1b {z0.h}, p0, [x0]
        add x0, x0, x1
    .endr
        ret
    endfunc
    

    Thank you in advance,

    Akis

  • Hi Akis,

    The vl8 here is the pattern operand to the ptrue instruction, you can find it
    documented here:

    developer.arm.com/.../PTRUES--Initialise-predicate-from-named-constraint-and-set-the-condition-flags-

    The effect of this operand is to limit the number of elements that are set to
    true to some upper bound; for instance, here we are saying that we want exactly
    eight predicate lanes to be set to true, in order to ensure we are only using
    the low 128 bits of our vector (since 16-bit elements, 16 * 8 = 128). It is
    worth keeping in mind that setting the pattern to something that exceeds the
    vector length will set the whole predicate to false, not all-true as you might
    expect. For example:

    ptrue p0.h, vl16

    The above would set p0.h to all-true on a 256-bit vector-length machine but
    all-false on a 128-bit vector-length machine! That does not matter in our case
    since all SVE machines must have a vector length of at least 128 bits.
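
    As a quick illustration of that edge case (my sketch, not from the original
    reply), you can observe the effect by counting the active lanes with cntp:

    ptrue p0.h, vl16    // all-false when fewer than 16 half-word lanes exist
    ptrue p1.h          // governing predicate: all available half-word lanes
    cntp  x0, p1, p0.h  // x0 = 16 on a 256-bit machine, 0 on a 128-bit machine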

    The code you posted looks reasonable to me, although it is worth noting that
    the lsl can be folded into the first add instruction at no cost. Also, the
    assembler syntax requires a "/z" suffix on the predicated load to indicate that
    lanes with a false predicate are set to zero. So something like:

      ptrue p0.h, vl8
    .rept 8
      ld1h {z0.h}, p0/z, [x2]  // p0/z rather than p0
      add x2, x2, x3, lsl #1   // lsl #1 here rather than done separately
      st1b {z0.h}, p0, [x0]
      add x0, x0, x1
    .endr
      ret

    (Also worth pointing out the obvious: in the last rept iteration the result of
    the pair of add instructions is unused, but that probably doesn't matter much.)

    Thanks,
    George

  • Thanks George! I really appreciate your help.

    BR,

    Akis

  • Thanks George! I really appreciate your help.

    It seems a little bit more complicated when the number of iterations and the amount of data read/stored are increased. For example, consider the 16x16 case. A possible NEON implementation can be the following:

    function PFX(blockcopy_sp_16x16_neon)
        lsl             x3, x3, #1
        movrel          x11, xtn_xtn2_table
        ld1             {v31.16b}, [x11]
    .rept 8
        ld1             {v0.8h-v1.8h}, [x2], x3
        ld1             {v2.8h-v3.8h}, [x2], x3
        tbl             v0.16b, {v0.16b,v1.16b}, v31.16b
        tbl             v1.16b, {v2.16b,v3.16b}, v31.16b
        st1             {v0.16b}, [x0], x1
        st1             {v1.16b}, [x0], x1
    .endr
        ret
    endfunc
    
    const xtn_xtn2_table, align=4
    .byte    0, 2, 4, 6, 8, 10, 12, 14
    .byte    16, 18, 20, 22, 24, 26, 28, 30
    endconst
    

    How about the following equivalent code using SVE2?

    function PFX(blockcopy_sp_16x16)
        mov             x8, 16
        mov             w12, 16
    .L_init_ss_16x16:
        mov             x9, #0
        whilelt         p1.h, x9, x8
        b.none          .L_next_stridea_blockcopy_ss_16x16
    .L_blockcopy_ss_16x16:
        ld1h            {z0.h}, p1/z, [x2, x9, lsl #1]
        st1b            {z0.h}, p1, [x0, x9]
        inch            x9
        whilelt         p1.h, x9, x8
        b.first         .L_blockcopy_ss_16x16
    .L_next_stridea_blockcopy_ss_16x16:
        add             x2, x2, x3, lsl #1
        add             x0, x0, x1
        sub             w12, w12, #1
        cbnz            w12, .L_init_ss_16x16
        ret
    endfunc

    Am I missing something?

    Sorry for sending you new questions. As I have mentioned, I am new to this stuff and I am trying to understand.

    BR,

    Akis

  • Hi Akis,

    The code you have posted looks reasonable to me as generic SVE code. It is
    worth keeping in mind that for such a simple inner loop and low trip count it
    is still probably worth specializing for specific vector lengths, in
    particular:

    • The optimal VL = 128-bits implementation is just the 8x8 code but with a pair
      of loads and stores per iteration rather than one, and with .rept 16 instead
      of .rept 8.
    • The optimal VL ≥ 256-bits implementation is almost exactly the 8x8 code but
      with the ptrue pattern set to vl16 rather than vl8 (to process 16 * 16 = 256
      bits per load), and with .rept 16 rather than 8.

    So with that in mind it is probably worth comparing the result of a rdvl
    instruction at the start of the function and branching to one of the above two
    possible implementations based on that. Obviously such an approach would not
    scale for increasingly large sizes of bx/by, but you can probably determine the
    point where such a trade-off no longer makes sense for your target application
    and micro-architecture.
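
    As a minimal sketch of that dispatch (my illustration; the label name is made
    up):

    rdvl x9, #1             // x9 = vector length in bytes
    cmp  x9, #16
    b.gt .L_vl_ge_256       // more than 128 bits: take the vl16 path
    // fall through to the 128-bit specialisation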

    If you wanted to keep the vector length agnostic version in your reply then you
    could still make some small improvements:

    • The inner loop L_blockcopy_ss_16x16 can only ever be iterated through at most
      twice, so it is worth unrolling this completely and precomputing the
      predicates from the whilelt before the outer loop.
    • Along similar lines to the previous point, the branch on line 7 of your code
      block can never be taken and can be safely removed.
    • Some micro-architectures with relatively immature SVE implementations have
      suboptimal implementations of instructions like inch and whilelt. You can
      find more information about such things by consulting the relevant software
      optimization guides (see my previous messages for links). This matters a lot
      less once the loop is fully unrolled anyway, but you may find it faster to
      use the scalar equivalents instead (e.g. just normal add or cmp instructions)
      in some tight loops; see the sketch after this list. I'd suggest benchmarking
      the effect of such changes rather than diving straight into making too many
      changes of this variety, since it sometimes only serves to complicate the
      code.
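
    As a sketch of that last point (my illustration; it assumes a 128-bit vector
    length, so that inch is equivalent to adding 8, and that the predicate is
    precomputed outside the loop):

    // instead of: inch x9 / whilelt p1.h, x9, x8 / b.first .L_loop
    add  x9, x9, #8         // advance by 8 half-word elements
    cmp  x9, x8
    b.lt .L_loop            // plain scalar loop control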

    Finally as an aside, it is also worth noting that the use of tbl in the Neon
    version of the code you posted is sub-optimal: you can use the uzp1 instruction
    to avoid needing to load the tbl indices at the start of the function.
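
    For example, the inner step could look like the following (my sketch; uzp1
    keeps the even-indexed bytes, which on a little-endian machine are the low
    bytes of each half-word):

    ld1  {v0.8h-v1.8h}, [x2], x3
    uzp1 v0.16b, v0.16b, v1.16b    // truncate 16 half-words to 16 bytes
    st1  {v0.16b}, [x0], x1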

    Hope that helps!

    Thanks,
    George

  • Hi George,

    I would like to thank you again for your answers.

    Regarding the solution for avoiding whilelt and inch instructions, you mean something like this, right?

    function PFX(blockcopy_sp_16x16)
        rdvl            x9, #1
        cmp             x9, #16
        bgt             .vl_ge_16_blockcopy_ss_16_16
        ptrue           p0.h, vl8
        mov             x10, #8
    .rept 16
        ld1h            {z0.h}, p0/z, [x2]
        ld1h            {z1.h}, p0/z, [x2, x10, lsl #1]
        add             x2, x2, x3, lsl #1
        st1b            {z0.h}, p0, [x0]
        st1b            {z1.h}, p0, [x0, x10]
        add             x0, x0, x1
    .endr
        ret
    .vl_ge_16_blockcopy_ss_16_16:
        ptrue           p0.h, vl16
    .rept 16
        ld1h            {z0.h}, p0/z, [x2]
        add             x2, x2, x3, lsl #1
        st1b            {z0.h}, p0, [x0]
        add             x0, x0, x1
    .endr
        ret
    endfunc

    It seems more efficient, but as you said, as bx and by increase, maybe it does not scale well.

    I think I will opt for this, because I am a little bit concerned by your comment regarding suboptimal implementations of whilelt and inch on some platforms. Please recall that I only care about optimizing performance (and not necessarily about using the vector-length-agnostic capability of SVE2).

    Can you please indicate whether you agree with the aforementioned piece of code?

    Thank you once again.

    BR,

    Akis

  • Hi Akis,

    Thanks for providing the updated implementation. That looks reasonable to me, I
    don't really have anything significant to suggest.

    You could unroll and interleave the loop as in your original 16x16 Neon version
    (to end up with a .rept 8 instead of .rept 16), but that is probably a marginal
    benefit, if any. You could use uzp1 as I had previously suggested for the
    Neon version and then have a single full store (using .b elements) rather than
    a pair of unpacked stores (.h), but this isn't actually a reduction in
    instruction count since you are just trading a store instruction for a vector
    op, so is again probably not significant.

    I will be on holiday from tomorrow so apologies in advance for any delay to
    future replies. I wish you a good break if you are also taking any holiday over
    the new year period!

    Thanks,
    George

  • Hi George,

    thanks for everything (and also for your suggestions regarding the NEON code). You have really helped me. I will try not to disturb you during your vacation :). I am also taking some days off.

    I wish you a merry Christmas and a happy new year!

    BR,

    Akis

  • Hi George,

    When you are back from your holidays, I would appreciate it if you could answer the following questions:

    1) I have the following function:

    template<int bx, int by>
    void blockcopy_ps_c(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb)
    {
        for (int y = 0; y < by; y++)
        {
            for (int x = 0; x < bx; x++)
                a[x] = (int16_t)b[x];
    
            a += stridea;
            b += strideb;
        }
    }

    I have created the following piece of code for supporting its functionality using SVE2 (assume bx=by=8):

    function PFX(blockcopy_ps_8x8)
        ptrue           p0.b, vl8
    .rept 8
        ld1b            {z0.b}, p0/z, [x2]
        uxtb            z0.h, p0/M, z0.h
        st1h            {z0.h}, p0, [x0]
        add             x0, x0, x1, lsl #1
        add             x2, x2, x3
    .endr
        ret
    endfunc

    Do you agree? Am I missing something?

    2) I have the following function in NEON:

    function PFX(copy_cnt_4_neon)
        lsl             x2, x2, #1
        movi            v4.8b, #0
    .rept 2
        ld1             {v0.8b}, [x1], x2
        ld1             {v1.8b}, [x1], x2
        stp             d0, d1, [x0], #16
        cmeq            v0.4h, v0.4h, #0
        cmeq            v1.4h, v1.4h, #0
        add             v4.4h, v4.4h, v0.4h
        add             v4.4h, v4.4h, v1.4h
    .endr
        saddlv          s4, v4.4h
        fmov            w12, s4
        add             w0, w12, #16
        ret
    endfunc

    I have created the following piece of code for supporting its functionality using SVE2:

    function PFX(copy_cnt_4)
        ptrue           p0.h, vl4
        mov             x10, #8
        ptrue           p1.h
        mov             z4.h, p1/z, #1
        mov             w0, #0
    .rept 4
        ld1h            {z0.h}, p0/z, [x1]
        st1h            {z0.h}, p0, [x0]
        add             x1, x1, x2, lsl #1
        add             x0, x0, x10
        cmpeq           p2.h, p0/z, z0.h, #0
        uaddv           d6, p2, z4.h
        fmov            w12, s6
        add             w0, w0, w12
    .endr
        add             w0, w0, #16
        ret
    endfunc

    Do you agree? Am I missing something?

    Thank you in advance,

    Akis

  • Hi Akis,

    Happy new year!

    1)

    I think there is a bug with your suggested implementation here around how the
    predicate is initialised and how the data is loaded. The core of the function
    is:

    ptrue p0.b, vl8
    ld1b {z0.b}, p0/z, [...]
    uxtb z0.h, p0/m, z0.h
    st1h {z0.h}, p0, [...]

    Note that when mixing precisions (.b and .h above) it is important to
    understand how the predicate layout affects the operation being performed.

    ptrue p0.b, vl8 [1111111100000000]
    ptrue p0.h, vl8 [1010101010101010]
                     ^              ^
                     least          most
                     significant    significant
                     bit            bit

    This makes a difference because the predicate is used based on the element size
    (e.g. z0.h or z0.b above), so in the context of your code snippet:

    ptrue p0.b, vl8
    ld1b {z0.b}, p0/z, [...] # loads 8 bytes
    uxtb z0.h, p0/m, z0.h # bug: only zero-extends the low 4 half-words
    st1h {z0.h}, p0, [...] # bug: only stores the low 4 half-words

    As with the unpacked stores there are also unpacked loads, which you can find
    documented here (see e.g. "16-bit element"):

    https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/LD1B--scalar-plus-scalar---Contiguous-load-unsigned-bytes-to-vector--scalar-index--

    In particular this means that in a similar way to the store-truncation case we
    can also use the load to perform the sign-extension for free, something like:

    function PFX(blockcopy_ps_8x8)
      ptrue p0.h, vl8 # note .h rather than .b
    .rept 8
      ld1b {z0.h}, p0/z, [x2] # note .h rather than .b
      st1h {z0.h}, p0, [x0]
      add x0, x0, x1, lsl #1
      add x2, x2, x3
    .endr
      ret
    endfunc

    2)

    The general gist of the code looks believable to me. A few comments:

    Thanks,
    George

  • Hi George,

    I wish you health and a happy new year.

    1) Hmm, I wasn't aware of the unpack versions of load instructions. Thanks once again for showing me the way. However, I think there is a problem as the number of bytes to be read from memory increases. For example, consider the 16x16 case. I think it would be best to read as much data as we can from the memory with one instruction, and then manipulate the data. So, I propose the following code:

    function PFX(blockcopy_ps_16x16_sve2)
        ptrue           p0.b, vl16
        ptrue           p1.h, vl8
        mov             x11, #8
    .rept 16
        ld1b            {z0.b}, p0/z, [x2]
        uunpklo         z1.h, z0.b
        uunpkhi         z2.h, z0.b
        st1h            {z1.h}, p1, [x0]
        st1h            {z2.h}, p1, [x0, x11, lsl #1]
        add             x0, x0, x1, lsl #1
        add             x2, x2, x3
    .endr
        ret
    endfunc

    In this case, we do not use the unpack loads in order to load as much data as we can with one command. Then, we use the "uunpklo" and "uunpkhi" instructions for unpacking. Do you agree? Or is there a better way?

    2) After taking your comments into account, I created the following piece of code:

    function PFX(copy_cnt_4_sve2)
        ptrue           p0.h, vl4
        ptrue           p1.h
        dup             z4.h, #1
        dup             z6.h, #0
    .rept 4
        ld1h            {z0.h}, p0/z, [x1]
        st1h            {z0.h}, p0, [x0]
        add             x1, x1, x2, lsl #1
        add             x0, x0, #8
        cmpeq           p2.h, p0/z, z0.h, #0
        add             z6.h, p2/m, z6.h, z4.h
    .endr
        uaddv           d6, p1, z6.h
        fmov            w12, s6
        add             w0, w0, #16
        ret
    endfunc

    Is this what you meant with your comments?

    Thank you in advance,

    Akis

  • Hi Akis,

    1)

    Using the unpack instructions for larger load amounts makes sense to me.
    Exactly which one is faster in this particular case will of course depend on
    the particular micro-architecture of interest, and e.g. the number of vector
    pipelines versus the number of load/store pipelines.

    As a very minor optimisation, and mostly just for illustrative purposes, I
    would point out that the use of p1 here is redundant:

    ptrue p0.b, vl16 [1111111111111111]
    ptrue p1.h, vl8  [1010101010101010]
    st1h {z1.h} ..... ^ ^ ^ ^ ^ ^ ^ ^  // 16-bit instructions consider every other bit of the predicate

    You will notice that for all the bits considered by the 16-bit store
    instructions the values of the two ptrue instructions are equivalent, and as
    such you can simply use p0 everywhere here.

    2)

    Your revised code looks much more reasonable to me; however, I think there are
    a couple of bugs in this code (I could be wrong, I have not run it):

    The original Neon code sums the output of the cmeq instructions. As Neon does
    not have predicate registers, the results of comparisons are either all-ones or
    all-zeros in a register; when thought about in two's-complement integer
    representation this is equivalent to -1 or 0. Since the Neon code produces a
    negative sum, this is likely the reason for the addition of 16 at the end. In
    the case of the SVE code there is no need for the final addition, although the
    cmpeq should probably become cmpne instead?
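
    To make the sign convention concrete (my sketch, not part of the original
    reply): each Neon cmeq lane is 0xFFFF (i.e. -1) when the element is zero, so
    summing the 16 lanes yields -(zero count), and the trailing addition of 16
    converts that into the non-zero count. With SVE the non-zero count falls out
    directly:

    cmpne p2.h, p0/z, z0.h, #0   // one predicate bit per non-zero lane
    // counting the active bits of p2 (e.g. with cntp) needs no bias term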

    Additionally, the fmov on line 15 is writing w12 rather than w0?

    Aside from correctness, similarly to the case in (1), the use of p1 is
    redundant here: all elements beyond the first four can never have a non-zero
    value. So we can avoid constructing p1 and use p0 instead.

    Finally, for a vector length of 128-bits it is unlikely to be significantly
    faster, but since we are only loading and storing 64-bits at a time here we
    could additionally consider using gather instructions to make use of a full
    vector of data for the store/cmp/add?

    Thanks,
    George

  • Hi George,

    thanks again for your answers.

    1) I used p0 everywhere as you proposed. Everything seems to be working fine.

    2) I applied all your modifications. The final code seems to be:

    function PFX(copy_cnt_4_sve2)
        ptrue           p0.h, vl8
        dup             z4.h, #1
        dup             z6.h, #0
        lsl             x2, x2, #1
        lsl             x3, x2, #1
        dup             z7.d, x3
        index           z1.d, #0, x2
    .rept 2
        ld1d            {z0.d}, p0/z, [x1, z1.d]
        add             z1.d, p0/m, z1.d, z7.d
        st1d            {z0.d}, p0, [x0]
        add             x0, x0, #16
        cmpne           p2.h, p0/z, z0.h, #0
        add             z6.h, p2/m, z6.h, z4.h
    .endr
        uaddv           d6, p0, z6.h
        fmov            w0, s6
        ret
    endfunc

    This is what you meant, right?

    Does it make sense to use the same approach for vector sizes larger than 128 bits instead of using contiguous loads and stores? I think we have already discussed this and the answer is no, right?

    BR,

    Akis

  • Hi Akis,

    That code looks reasonable to me. One minor improvement would be to avoid the
    vector addition on line 11 by instead performing a (cheaper) scalar addition on
    the base address; lines 6 and 7 can then be removed:

    lsl x3, x2, #1
    dup z7.d, x3
    ...
    ld1d {z0.d}, p0/z, [x1, z1.d]
    add z1.d, p0/m, z1.d, z7.d
    
    becomes
    
    ld1d {z0.d}, p0/z, [x1, z1.d]
    add x1, x1, x2
    

    Additionally I just realised that instead of doing a predicated add of 1 on
    line 15 you could instead use the incp instruction[1]. This allows us to
    directly increment a scalar register by the number of active predicate
    elements, allowing us to avoid the uaddv/fmov at the end of the loop:

    dup z4.h, #1
    dup z6.h, #0
    ...
    cmpne p2.h, p0/z, z0.h, #0
    add z6.h, p2/m, z6.h, z4.h
    ...
    uaddv d6, p0, z6.h
    fmov w0, s6
    
    becomes
    
    mov x0, #0
    ...
    cmpne p2.h, p0/z, z0.h, #0
    incp x0, p2.h
    ...
    // nothing to do at the end of the loop, result already in x0

    As far as performance on vector lengths above 128 bits goes, I think the only
    change needed is to somehow include the x2 offset into the z1 offset vector.
    The rest of the code other than the predicate setup itself seems vector-length
    agnostic? The benefit seems marginal when the inner block of code is only
    executed twice, though.

    Thanks,
    George

    [1] developer.arm.com/.../INCP--scalar---Increment-scalar-by-count-of-true-predicate-elements-

  • Hi George,

    thanks again for your help.

    "incp" seems to do the trick. Thanks!

    I have two minor comments. I think I have to multiply x2 by two before adding it to x1, and the register accumulating the count should be x15 (as x0 is used as an input to the function). The final code is:

    function PFX(copy_cnt_4_sve2)
        ptrue           p0.h, vl8
        mov             x15, #0
        lsl             x2, x2, #1
        index           z1.d, #0, x2
    .rept 2
        ld1d            {z0.d}, p0/z, [x1, z1.d]
        add             x1, x1, x2, lsl #1
        st1d            {z0.d}, p0, [x0]
        add             x0, x0, #16
        cmpne           p2.h, p0/z, z0.h, #0
        incp            x15, p2.h
    .endr
        mov             x0, x15
        ret
    endfunc

    Regarding the usage of index for vector sizes larger than 128 bits, I think I will stick to contiguous loads and stores, as you proposed for the "blockcopy_sp_8x8" function at the beginning of this thread. Creating a vector of offsets where every second element is adjacent in memory to the previous one and using gather/scatter loads and stores seems to decrease performance. So, I prefer loading a full vector's worth of contiguous data (as much as the vector length allows) and stopping there before jumping to the next row.

    Thanks for your help!

    Akis

  • Hi George,

    I am sorry to bother you again but I need your help.

    I have the following function using NEON:

    function PFX(pixel_sse_pp_16x\h\()_neon)
        ld1             {v16.16b}, [x0], x1
        ld1             {v17.16b}, [x2], x3
        usubl           v1.8h, v16.8b, v17.8b
        usubl2          v2.8h, v16.16b, v17.16b
        ld1             {v16.16b}, [x0], x1
        ld1             {v17.16b}, [x2], x3
        smull           v0.4s, v1.4h, v1.4h
        smlal2          v0.4s, v1.8h, v1.8h
        smlal           v0.4s, v2.4h, v2.4h
        smlal2          v0.4s, v2.8h, v2.8h
    .rept \h - 2
        usubl           v1.8h, v16.8b, v17.8b
        usubl2          v2.8h, v16.16b, v17.16b
        ld1             {v16.16b}, [x0], x1
        smlal           v0.4s, v1.4h, v1.4h
        smlal2          v0.4s, v1.8h, v1.8h
        ld1             {v17.16b}, [x2], x3
        smlal           v0.4s, v2.4h, v2.4h
        smlal2          v0.4s, v2.8h, v2.8h
    .endr
        usubl           v1.8h, v16.8b, v17.8b
        usubl2          v2.8h, v16.16b, v17.16b
        smlal           v0.4s, v1.4h, v1.4h
        smlal2          v0.4s, v1.8h, v1.8h
        smlal           v0.4s, v2.4h, v2.4h
        smlal2          v0.4s, v2.8h, v2.8h
        trn2            v1.2d, v0.2d, v0.2d
        add             v0.2s, v0.2s, v1.2s
        addp            v0.2s, v0.2s, v0.2s
        fmov            w0, s0
        ret
    endfunc

    Honestly, I do not understand why the trn2 instruction is used. Anyway, I tried to implement it using the SVE2 instruction set as follows:

    function PFX(pixel_sse_pp_16x\h\()_sve2)
        ptrue           p0.b, vl16
        mov             z3.d, #0
        mov             z4.d, #0
    .rept \h
        ld1b            {z16.b}, p0/z, [x0]
        add             x0, x0, x1
        ld1b            {z17.b}, p0/z, [x2]
        add             x2, x2, x3
        usublb          z1.h, z16.b, z17.b
        usublt          z2.h, z16.b, z17.b
        smlalb          z3.s, z1.h, z1.h
        smlalt          z4.s, z1.h, z1.h
        smlalb          z3.s, z2.h, z2.h
        smlalt          z4.s, z2.h, z2.h
    .endr
        uaddv           d3, p0, z3.s
        fmov            w0, s3
        uaddv           d4, p0, z4.s
        fmov            w1, s4
        add             w0, w0, w1
        ret
    endfunc
     

    The SVE2 version of the function seems to work fine, but its performance is worse compared to the NEON version.

    Do you have any suggestion?

    Thank you in advance,

    Akis