
Take full advantage of SVE's vector-length-agnostic approach

Hello,

I have the following piece of code:

// Copy a bx x by block, narrowing each int16_t value to a pixel.
template<int bx, int by>
void blockcopy_sp_c(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            a[x] = (pixel)b[x];
        }

        a += stridea;
        b += strideb;
    }
}

So, after each row of bx 16-bit values, we need to jump to another location in memory (given by the strides) and read/store the next row, and so on.
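For example, assuming 8-bit pixels and bx = by = 8, each row reads 8 × 2 = 16 bytes from b and, after narrowing, writes 8 × 1 = 8 bytes to a before both pointers advance by their strides.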

One possible NEON assembly implementation of the aforementioned function is the following (assuming that bx = by = 8):

function PFX(blockcopy_sp_8x8_neon)
    lsl x3, x3, #1              // convert strideb from elements to bytes
.rept 4
    ld1 {v0.8h}, [x2], x3       // load a row of 8 int16_t values, advance by strideb
    ld1 {v1.8h}, [x2], x3
    xtn v0.8b, v0.8h            // narrow the 16-bit values to 8-bit pixels
    xtn v1.8b, v1.8h
    st1 {v0.d}[0], [x0], x1     // store 8 pixels, advance by stridea
    st1 {v1.d}[0], [x0], x1
.endr
    ret
endfunc
However, the only way to use a post-indexed register offset in SVE seems to be the gather loads and scatter stores. So, a possible SVE2 assembly implementation of the aforementioned function is the following (assuming that bx = by = 8):
function PFX(blockcopy_sp_8x8)
    MOV x8, #8                  // number of rows (by)
    MOV x9, #0                  // row counter
    MOV x6, #0                  // running source offset
    MOV x7, #0                  // running destination offset
    MOV z31.d, #64
    MOV z0.d, #0

    WHILELT p1.d, x9, x8        // one 64-bit predicate lane per row
    B.NONE .L_return_blockcopy_sp_8x8

.L_loopStart_blockcopy_sp_8x8:
    INDEX z1.d, x6, x3          // per-lane offsets: x6 + lane * x3
    INDEX z2.d, x7, x1          // per-lane offsets: x7 + lane * x1
.rept 2
    LD1D z3.d, p1/Z, [x2, z1.d] // gather one 64-bit chunk per lane
    ADD z1.d, z1.d, z31.d
    UZP1 z3.b, z3.b, z0.b       // keep the low byte of each halfword
    ST1W z3.d, p1, [x0, z2.d, UXTW #2]
    ADD z2.d, z2.d, z31.d
.endr
    INCD x9                     // advance by the number of 64-bit lanes
    MUL x6, x9, x3
    MUL x7, x9, x1
    WHILELT p1.d, x9, x8
    B.FIRST .L_loopStart_blockcopy_sp_8x8
.L_return_blockcopy_sp_8x8:
    RET
endfunc
However, I do not believe that this code takes full advantage of SVE's vector-length-agnostic approach.
For example, the LD1D instruction reads only 64 bits per lane before it jumps to the next location in memory.
So, it might be the case that the z3 register is not fully loaded with 16 bytes of data.
Can you please tell me what I am doing wrong?
Thank you in advance.
  • Hi Akis,

    2)

    The execution throughput here is referring to the number of instances of that
    instruction that can be started on the same cycle. For example, if LD2B has a
    throughput of 1, that means that every cycle one of these instructions can
    begin (but each one still takes 8 cycles until its result is available for
    use by any instructions that depend on it).

    This means that for a series of independent LD1B instructions with a latency of
    6 cycles and a throughput of 3 per cycle:

    ld1b {z0.b}, ... // starts on cycle 0, completed on cycle 6
    ld1b {z1.b}, ... // starts on cycle 0, completed on cycle 6
    ld1b {z2.b}, ... // starts on cycle 0, completed on cycle 6
    ld1b {z3.b}, ... // starts on cycle 1, completed on cycle 7
    ld1b {z4.b}, ... // starts on cycle 1, completed on cycle 7
    ld1b {z5.b}, ... // starts on cycle 1, completed on cycle 7

    And for the same amount of data loaded with LD2B instructions with a latency of
    8 cycles and a throughput of 1 per cycle:

    ld2b {z0.b, z1.b}, ... // starts on cycle 0, completed on cycle 8
    ld2b {z2.b, z3.b}, ... // starts on cycle 1, completed on cycle 9
    ld2b {z4.b, z5.b}, ... // starts on cycle 2, completed on cycle 10

    You can see that even though we required fewer LD2B instructions, the
    combination of the worse throughput and higher latency compared to LD1B means
    that it is preferable to use LD1B in this instance. More generally, which
    sequence is preferable will depend on your micro-architecture of interest, and
    the overall picture may change if there are other instructions being executed
    at the same time which compete for the same execution resources on the CPU,
    but this is fine for an initial estimate at least.
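
    Put numerically, using the example figures above: the six LD1B instructions
    issue over two cycles and the last result is ready at cycle 7, while the
    three LD2B instructions issue over three cycles and the last result is ready
    at cycle 10, even though both sequences load the same six vectors of data.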

    3)

    The 0x40 here should in theory be handled by the "rounding" part of sqrshrnb.
    I think the difference is that the original code initialised every .b element
    to 0x40 rather than every .h element as I imagined. I'm not really sure why
    the code initialises .b elements given that the rest of the arithmetic is done
    with .h. Assuming that the code you showed is definitely what you want, I
    think you can correct for this by just adding 0x80 at the end:

    sqrshrnb z0.b, z0.h, #7
    add z0.b, z0.b, #0x80

    I had a go at writing a very quick test case to check this, hopefully this
    mostly makes sense:

    .global foo
    foo:
      mov z30.b, #0x40          // every byte 0x40, so each .h element is 0x4040
      saddlb z4.s, z0.h, z30.h  // widening add of the even halfword elements
      saddlt z5.s, z0.h, z30.h  // widening add of the odd halfword elements
      shrnb z0.h, z4.s, #7      // narrowing shift back into the even halfwords
      shrnt z0.h, z5.s, #7      // narrowing shift back into the odd halfwords
      sqxtunb z0.b, z0.h        // saturating narrow to unsigned bytes
      ret

    .global bar
    bar:
      sqrshrnb z0.b, z0.h, #7   // rounding covers the low 0x40 of the bias
      add z0.b, z0.b, #0x80     // 0x4000 >> 7 = 0x80 covers the rest
      ret

    #include <arm_neon.h>
    #include <stdio.h>

    /* At vl=128 the SVE z registers overlap the Neon registers, so the
       routines above can be declared and called with Neon vector types. */
    int8x8_t foo(int16x4_t x);
    int8x8_t bar(int16x4_t x);

    int main() {
      // exhaustively compare the two routines over every 16-bit input
      for (int i = -32768; i < 32768; ++i) {
        int y = foo(vdup_n_s16(i))[0];
        int z = bar(vdup_n_s16(i))[0];
        if (y != z) {
          printf("%d -> %d or %d\n", i, y, z);
        }
      }
    }
    

    4)

    Of course there is nothing stopping you from using Neon for the 128-bit cases,
    or even using Neon for just the sqxtun(2) instructions and nothing else since
    at vl=128 the registers are identical and overlap. Having said that, I don't
    think that having two stores here is much of a problem. Did you try a version
    using the zero-extending loads instead, since I think that would still beat the
    Neon version by not needing the uxtl(2)/uunpk(lo/hi) instructions?
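
    For illustration, a rough sketch of the difference I have in mind (register
    choices here are hypothetical):

    // Neon: contiguous load followed by explicit widening
    ldr   q0, [x1]
    uxtl  v1.8h, v0.8b        // low 8 bytes -> halfwords
    uxtl2 v2.8h, v0.16b       // high 8 bytes -> halfwords

    // SVE: the load itself zero-extends each byte to a .h element
    ptrue p0.h
    ld1b  {z1.h}, p0/z, [x1]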

    Thanks,
    George

  • Hi George,

    2) Now I understand. Thanks. Just one question: does the execution throughput refer to instances of the same instruction? I mean, which one of the following is preferable:

    ld1b    {z0.b}, p0/z, [x0]
    ld1b    {z1.b}, p0/z, [x1]
    add     x0, x0, x5
    add     x1, x1, x6

    or

    ld1b    {z0.b}, p0/z, [x0]
    add     x0, x0, x5
    ld1b    {z1.b}, p0/z, [x1]
    add     x1, x1, x6

    If the execution throughput refers to instances of the same instruction, I guess the first option is the best. Or?

    3) Your code works perfectly. Thanks!

    4) I switched back to using the zero-extending loads instead. Regarding the performance, I think it is better, as you said. Thanks!

    I might come back to you if I need anything else. Thanks for everything!

    BR,

    Akis

  • Hi Akis,

    Throughput in this case refers to the number of instances of the same
    instruction that can begin execution on each cycle. The exact code layout is
    not particularly important for large out-of-order cores like Neoverse N2 or
    Neoverse V2, so I would expect both arrangements to perform more or less the
    same. The bottleneck in such cores is instead usually any dependency chain
    between instructions: in the case of the load instructions here, the loads
    cannot begin execution until the addresses in x0 and x1 have been calculated.
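
    As a minimal sketch of such a chain (arbitrary registers):

    ld1b {z0.b}, p0/z, [x0]   // must wait for x0 to be ready
    add  x0, x0, x5           // the new x0 depends on the old one
    ld1b {z1.b}, p0/z, [x0]   // cannot issue until the add has executed

    In both of your sequences the two loads use independent base registers (x0
    and x1), so neither ordering creates such a chain.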

    Glad to hear the new code worked as expected!

    Thanks,
    George