Take full advantage of SVE vector length agnostic approach

Hello,

I have the following piece of code:

template<int bx, int by>
void blockcopy_sp_c(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            a[x] = (pixel)b[x];
        }

        a += stridea;
        b += strideb;
    }
}

So, after reading bx 16-bit elements (bx*2 bytes) and storing bx pixels, we need to jump to another location in memory and read/store the next bx elements, and so on.
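To make the access pattern concrete, here is a minimal sketch that exercises the function above on buffers whose rows are wider than the block. It assumes `pixel` is `uint8_t` (an 8-bit build, which matches the NEON code narrowing to bytes); the helper `copy_8x8_demo` and the buffer contents are hypothetical and only serve the illustration. Note that both strides are counted in elements, so one row step in the source is `strideb * 2` bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef uint8_t pixel; // assumption: 8-bit build, so each int16_t narrows to one byte

template<int bx, int by>
void blockcopy_sp_c(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            a[x] = (pixel)b[x];
        }

        a += stridea; // strides are in elements, not bytes
        b += strideb;
    }
}

// Hypothetical helper: copies one 8x8 block out of a wider source image
// into a wider destination image and returns the destination buffer.
std::vector<pixel> copy_8x8_demo(intptr_t dstStride, intptr_t srcStride)
{
    std::vector<int16_t> src(static_cast<size_t>(8 * srcStride));
    for (size_t i = 0; i < src.size(); i++)
        src[i] = (int16_t)(i & 0x7F); // values that fit in a pixel

    // 0xFF marks bytes that the copy must leave untouched.
    std::vector<pixel> dst(static_cast<size_t>(8 * dstStride), 0xFF);
    blockcopy_sp_c<8, 8>(dst.data(), dstStride, src.data(), srcStride);
    return dst;
}
```

Calling `copy_8x8_demo(32, 16)` copies eight rows of eight elements each; everything to the right of column 7 in each destination row keeps its marker value, which is why a single contiguous copy cannot replace the per-row jumps.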

One possible NEON assembly implementation of the aforementioned function is the following (assuming that bx=by=8):

function PFX(blockcopy_sp_8x8_neon)
    lsl x3, x3, #1
.rept 4
    ld1 {v0.8h}, [x2], x3
    ld1 {v1.8h}, [x2], x3
    xtn v0.8b, v0.8h
    xtn v1.8b, v1.8h
    st1 {v0.d}[0], [x0], x1
    st1 {v1.d}[0], [x0], x1
.endr
    ret
endfunc

However, the only way to use a post-index, register offset in SVE seems to be through gather loads and scatter stores. So, a possible SVE2 assembly implementation of the aforementioned function is the following (assuming that bx=by=8):

function PFX(blockcopy_sp_8x8)
    MOV x8, 8
    MOV x9, #0
    MOV x6, #0
    MOV x7, #0
    MOV z31.d, #64
    MOV z0.d, #0

    WHILELT p1.d, x9, x8
    B.NONE .L_return_blockcopy_sp_8x8

.L_loopStart_blockcopy_sp_8x8:
    INDEX z1.d, x6, x3
    INDEX z2.d, x7, x1
.rept 2
    LD1D z3.d, p1/Z, [x2, z1.d]
    ADD z1.d, z1.d, z31.d
    UZP1 z3.b, z3.b, z0.b
    ST1W z3.d, p1, [x0, z2.d, UXTW #2]
    ADD z2.d, z2.d, z31.d
.endr
    INCD x9
    MUL x6, x9, x3
    MUL x7, x9, x1
    WHILELT p1.d, x9, x8
    B.FIRST .L_loopStart_blockcopy_sp_8x8
.L_return_blockcopy_sp_8x8:
    RET
endfunc

However, I do not believe that this code takes full advantage of the SVE vector-length-agnostic approach.
For example, the LD1D instruction reads only 64 bits before it jumps to the next location in memory.
So, it might be the case that the z3 register is not fully loaded with 16 bytes of data.
Can you please tell me what I am doing wrong?
Thank you in advance.
  • Hi Akis,

    The vl8 here is the pattern operand to the ptrue instruction, you can find it
    documented here:

    developer.arm.com/.../PTRUES--Initialise-predicate-from-named-constraint-and-set-the-condition-flags-

    The effect of this operand is to set a fixed number of elements to true. Here
    we are saying that we want exactly eight predicate lanes to be set to true, in
    order to ensure we are only using the low 128 bits of our vector (since we have
    16-bit elements, 16 * 8 = 128). It is worth keeping in mind that setting the
    pattern to something that exceeds the vector length will set the whole
    predicate to false, not all-true as you might expect. Take something like:

    ptrue p0.h, vl16

    The above would set p0.h to all-true on a machine with a 256-bit vector length
    but all-false on a machine with a 128-bit vector length! That does not matter
    in our case since all SVE machines must have a vector length of at least 128 bits.

    The code you posted looks reasonable to me, although it is worth noting that
    the lsl can be included into the first add instruction at no cost. Also the
    assembler syntax requires a "/z" suffix on the predicated load to indicate that
    lanes with a false predicate are set to zero. So something like:

      ptrue p0.h, vl8
    .rept 8
      ld1h {z0.h}, p0/z, [x2]  // p0/z rather than p0
      add x2, x2, x3, lsl #1   // lsl #1 here rather than done separately
      st1b {z0.h}, p0, [x0]
      add x0, x0, x1
    .endr
      ret

    (It is also worth pointing out that in the last rept iteration the result of
    the pair of add instructions is unused, but it probably doesn't matter much.)

    Thanks,
    George
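
As a rough illustration of the reply above, the following scalar C++ sketch models two things: how many lanes a `ptrue` with a `VLn` pattern activates for a given vector length, and what the `.rept 8` loop of `ld1h` / `st1b` does with that predicate. This is only a model under stated assumptions, not SVE code: the function names `ptrue_vl_active_lanes` and `blockcopy_sp_8x8_model` are hypothetical, and the vector length is passed in as a plain parameter rather than read from hardware.

```cpp
#include <cassert>
#include <cstdint>

// Model of "ptrue pN.<T>, vl<n>": returns the number of active lanes.
// vlBits is the vector length in bits, elemBits the element size in bits.
// (The real instruction only encodes n in {1..8, 16, 32, 64, 128, 256};
// this model does not check that.)
inline unsigned ptrue_vl_active_lanes(unsigned vlBits, unsigned elemBits, unsigned n)
{
    const unsigned lanes = vlBits / elemBits;
    // If the vector cannot hold n elements, the predicate is all-false.
    return n <= lanes ? n : 0;
}

// Scalar model of the unrolled loop in the reply:
//   ptrue p0.h, vl8
//   ld1h {z0.h}, p0/z, [x2] ; add x2, x2, x3, lsl #1
//   st1b {z0.h}, p0, [x0]   ; add x0, x0, x1
inline void blockcopy_sp_8x8_model(uint8_t* dst, intptr_t dstStride,
                                   const int16_t* src, intptr_t srcStride,
                                   unsigned vlBits)
{
    const unsigned active = ptrue_vl_active_lanes(vlBits, 16, 8);
    for (int y = 0; y < 8; y++)
    {
        // st1b with .h lanes stores the low byte of each active halfword lane.
        for (unsigned lane = 0; lane < active; lane++)
            dst[lane] = (uint8_t)src[lane];

        src += srcStride; // element stride, i.e. srcStride * 2 bytes
        dst += dstStride;
    }
}
```

With this model, `ptrue_vl_active_lanes(128, 16, 16)` yields 0 while `ptrue_vl_active_lanes(256, 16, 16)` yields 16, which mirrors the `vl16` caveat, and the copy produces the same eight bytes per row for any vector length of 128 bits or more.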
