
Take full advantage of SVE vector length agnostic approach

Hello,

I have the following piece of code:

template<int bx, int by>
void blockcopy_sp_c(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            a[x] = (pixel)b[x];
        }

        a += stridea;
        b += strideb;
    }
}

So, after reading bx 16-bit elements (bx*2 bytes) from one row and storing bx bytes, we need to jump by the stride to another location in memory and repeat for the next row, and so on.

One possible NEON ASM implementation of the aforementioned function (assuming that bx=by=8) is the following:

function PFX(blockcopy_sp_8x8_neon)
    lsl x3, x3, #1
.rept 4
    ld1 {v0.8h}, [x2], x3
    ld1 {v1.8h}, [x2], x3
    xtn v0.8b, v0.8h
    xtn v1.8b, v1.8h
    st1 {v0.d}[0], [x0], x1
    st1 {v1.d}[0], [x0], x1
.endr
    ret
endfunc

However, the only way to use post-indexed, register-offset addressing in SVE seems to be through the gather loads and scatter stores. So, a possible SVE2 ASM implementation of the aforementioned function (again assuming that bx=by=8) is the following:
function PFX(blockcopy_sp_8x8)
    MOV x8, #8
    MOV x9, #0
    MOV x6, #0
    MOV x7, #0
    MOV z31.d, #64
    MOV z0.d, #0

    WHILELT p1.d, x9, x8
    B.NONE .L_return_blockcopy_sp_8x8

.L_loopStart_blockcopy_sp_8x8:
    INDEX z1.d, x6, x3
    INDEX z2.d, x7, x1
.rept 2
    LD1D z3.d, p1/Z, [x2, z1.d]
    ADD z1.d, z1.d, z31.d
    UZP1 z3.b, z3.b, z0.b
    ST1W z3.d, p1, [x0, z2.d, UXTW #2]
    ADD z2.d, z2.d, z31.d
.endr
    INCD x9
    MUL x6, x9, x3
    MUL x7, x9, x1
    WHILELT p1.d, x9, x8
    B.FIRST .L_loopStart_blockcopy_sp_8x8
.L_return_blockcopy_sp_8x8:
    RET
endfunc
However, I do not believe that this code takes full advantage of SVE's vector-length-agnostic approach.
For example, the LD1D instruction reads only 64 bits from each offset before it jumps to the next location in memory, so it might be the case that the z3 register is not fully loaded with the 16 bytes of data from each row.
Can you please tell me what I am doing wrong?
Thank you in advance.
  • Hi,

    First of all just to clarify: post-indexed addressing usually refers to
    instructions where the address is automatically updated at the end of the
    instruction based on a specified offset. Here is an example of a post-indexed
    instruction:

    ldr x0, [sp], #8

    In the above load instruction you can see we are loading 8-bytes from the
    stack, taking the address from the stack pointer sp, and afterwards
    incrementing sp by 8. An equivalent way of writing this would be:

    ldr x0, [sp]
    add sp, sp, #8

    The SVE instruction set does not contain any post-indexed addressing load or
    store instructions, however this does not tend to be a problem in practice
    since there is generally only a very marginal performance difference when using
    post-indexed addressing on a load or store compared to updating the address in
    a separate instruction as in the above example.

    As you allude to in your question, the optimal sequence of code here will
    depend greatly on the value of bx, by, stridea, and strideb. I will assume that
    the size of a pixel is 8-bits, so you are effectively wanting a 16-bit to 8-bit
    truncation?

    For bx=by=8 you will be loading (8 * 16 =) 128 bits of data per iteration of the
    innermost loop and storing (8 * 8 =) 64 bits of data. SVE only very recently
    gained support for gather instructions with 128-bit elements, as part of the
    SVE2.1 extension ( developer.arm.com/.../LD1Q--Gather-load-quadwords- ).
    Since there is not an SVE instruction to cover loading 128-bits of data per
    element your choices here are probably either:

    a) Create a vector of offsets with every second element adjacent in memory to
    the previous one. There are a few different ways of writing this, but the
    obvious way of doing it would be something like:

    lsl x3, x3, #4 // make the stride bytes rather than elements, adjust as needed.
    index z1.d, #0, x3 // create an index
    index z2.d, #8, x3 // create a second index offset by 8 from the first
    zip1 z1.d, z1.d, z2.d // interleave them together

    b) Just continue using the Neon code! It is unlikely you will see much
    performance improvement when you are filling the entire Neon vector perfectly
    as you are currently doing, and the benefit from SVE here is likely marginal
    unless you are targeting a machine with a large vector length.

    With regard to your concern about the code being vector-length agnostic: gather
    and scatter instructions are traditionally slower than the contiguous-access
    counterparts and should be avoided if for instance bx is greater than or equal
    to the size of a single vector, however if your concern is merely whether the
    code is vector-length agnostic or not then I think you are fine. (For the
    performance of gather/scatter and other instructions you can check this in for
    example the Neoverse V1 software optimization guide, available here:
    developer.arm.com/.../ ) SVE code
    makes use of predication (e.g. the p1 register in your code) which allows you
    to conditionally enable part or all of the vector depending on the size of the
    data being operated on. You are initialising the predicate with "WHILELT p1.d,
    x9, x8" where x9 is initialised to 0 and x8 is 8, so on the first iteration of
    the loop this is the equivalent of setting the first 8 predicate elements (at
    64-bit granularity) to true. Unless you are working on a machine with a vector
    length greater than (64 * 8 =) 512 bits, then I think you should be making
    full use of the vector length!

    Hope that helps, let me know if you have any further specific issues and I can
    try and help!

    Thanks,
    George
