
Take full advantage of SVE vector length agnostic approach

Hello,

I have the following piece of code:

template<int bx, int by>
void blockcopy_sp_c(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            a[x] = (pixel)b[x];
        }

        a += stridea;
        b += strideb;
    }
}

So, after reading bx 16-bit elements (bx*2 bytes) from one row, we need to jump to another location in memory and read/store the next row, and so on.

One possible ASM code for NEON to support the aforementioned function is the following (assuming that bx=by=8):

function PFX(blockcopy_sp_8x8_neon)
    lsl x3, x3, #1
.rept 4
    ld1 {v0.8h}, [x2], x3
    ld1 {v1.8h}, [x2], x3
    xtn v0.8b, v0.8h
    xtn v1.8b, v1.8h
    st1 {v0.d}[0], [x0], x1
    st1 {v1.d}[0], [x0], x1
.endr
    ret
endfunc
However, the only way to use a post-indexed register offset in SVE seems to be through the gather loads and scatter stores. So, a possible ASM code for SVE2 to support the aforementioned function is the following (again assuming that bx=by=8):
function PFX(blockcopy_sp_8x8)
    MOV x8, #8
    MOV x9, #0
    MOV x6, #0
    MOV x7, #0
    MOV z31.d, #64
    MOV z0.d, #0

    WHILELT p1.d, x9, x8
    B.NONE .L_return_blockcopy_sp_8x8

.L_loopStart_blockcopy_sp_8x8:
    INDEX z1.d, x6, x3
    INDEX z2.d, x7, x1
.rept 2
    LD1D z3.d, p1/Z, [x2, z1.d]
    ADD z1.d, z1.d, z31.d
    UZP1 z3.b, z3.b, z0.b
    ST1W z3.d, p1, [x0, z2.d, UXTW #2]
    ADD z2.d, z2.d, z31.d
.endr
    INCD x9
    MUL x6, x9, x3
    MUL x7, x9, x1
    WHILELT p1.d, x9, x8
    B.FIRST .L_loopStart_blockcopy_sp_8x8
.L_return_blockcopy_sp_8x8:
    RET
endfunc
However, I do not believe that this code takes full advantage of SVE's vector-length-agnostic approach.
For example, the LD1D instruction reads only 64 bits per element before it jumps to the next location in memory.
So, it might be the case that the z3 register is not fully loaded with 16 bytes of data.
Can you please tell me what I am doing wrong?
Thank you in advance.
  • Hi Akis,

    If you are able to use SVE2 then, rather than the Neoverse V1 software
    optimization guide I posted previously, you may find either the Neoverse N2 or
    Neoverse V2 guides more appropriate:

    * Neoverse N2: developer.arm.com/.../latest
    * Neoverse V2: developer.arm.com/.../latest

    Both of these particular micro-architectures have an SVE vector length of
    128-bits, so it is probably worth discussing that specifically for a moment. In
    a situation where the Neon and SVE vector lengths are identical there is
    unlikely to be a massive improvement from code where there is a 1-1 mapping
    between Neon and SVE. For reference you can observe from the software
    optimization guides that most SVE instructions with Neon equivalents have
    identical latencies and throughputs.

    That does not mean that there is no benefit from SVE in such a situation,
    however we need to be able to make use of instructions that are not present in
    Neon in order to achieve a measurable speedup. One particular class of
    instructions that you may find useful are the zero/sign-extending load and
    truncating store instructions available in SVE.

    For example: developer.arm.com/.../ST1B--scalar-plus-scalar---Contiguous-store-bytes-from-vector--scalar-index--

    Going back to bx=by=8 in your original example code, instead of trying to make the
    code efficient on long vector lengths, we instead can optimize for shorter
    vector lengths by making use of contiguous loads. This comes at the cost of
    potentially not filling the whole vector on machines with longer vector
    lengths. For example:

    ptrue p0.h, vl8
    ld1h {z0.h}, p0/z, [x0]
    st1b {z0.h}, p0, [x1]
    // ^ note h-suffix (16-bit) on the load vs b-suffix (8-bit) on the store

    In the code above we give up on trying to make full use of the vector and
    instead constrain the predicate to the low 8 .h (16-bit) elements (8 * 16 = 128
    bits). This is guaranteed to be fine in this case since the SVE vector length
    must be a power of two and at least 128 bits. We then copy 128 bits from the
    address at x0 into z0, but for the store we make use of the truncating store
    instructions: specifically the b suffix on the st1b indicates that we are only
    storing the low byte per element despite each element being 16 bits (from the h
    suffix on z0.h). You will note that this successfully avoids the need for a
    separate xtn instruction as we needed in the Neon equivalent, while also
    avoiding the more costly gather and scatter instructions we had previously
    discussed.

    Whether the above solution is any faster than the Neon equivalent you
    originally posted, or faster than the alternative sequence using gather/scatter
    instructions that we previously discussed, will depend on the nature of the
    surrounding code and the values of stridea/strideb, as well as the details of
    the micro-architecture you are optimizing for. If possible I would recommend
    benchmarking all three across the micro-architecture(s) you are interested in
    to get an idea of the tradeoffs involved rather than me trying to make an
    uninformed decision on your behalf, but the SVE version I've posted above
    should be a safe starting point!

    Thinking more generally than your specific code snippet, given that the code we
    have discussed is only doing a relatively simple copy operation it would also
    be worth considering whether the copy can be avoided completely or merged into
    the surrounding code such that the truncation occurs there instead. This would
    allow us to eliminate the overhead of an additional load and store as we have
    currently and potentially allow for further optimizations.

    Hope that helps. I'm happy to try and expand on anything if it is unclear or if
    you have more specific questions about particular areas we've mentioned so far!

    Thanks,
    George
