This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Take full advantage of SVE vector length agnostic approach

Akis over 1 year ago

Hello,

I have the following piece of code:

template<int bx, int by>

void blockcopy_sp_c(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)

{

for (int y = 0; y < by; y++)

{

for (int x = 0; x < bx; x++)

{

a[x] = (pixel)b[x];

}

a += stridea;

b += strideb;

}

So, after bx*16 bytes, we need to jump to another location in memory and read/store bx*16 bytes again, and so on.

One possible ASM code for NEON to support the aforementioned function is the following (assuming that bx=by=4):

function PFX(blockcopy_sp_8x8_neon)

lsl x3, x3, #1

.rept 4

ld1 {v0.8h}, [x2], x3

ld1 {v1.8h}, [x2], x3

xtn v0.8b, v0.8h

xtn v1.8b, v1.8h

st1 {v0.d}[0], [x0], x1

st1 {v1.d}[0], [x0], x1

.endr

ret

endfunc

However, the only way to use a post-index, register offset in SVE seems to be the gather loads and scatter stores. So, a possible ASM code for SVE2 to support the aforementioned function is the following (assuming that bx=by=8):

function PFX(blockcopy_sp_8x8)

MOV x8, 8

MOV x9, #0

MOV x6, #0

MOV x7, #0

MOV z31.d, #64

MOV z0.d, #0

WHILELT p1.d, x9, x8

B.NONE .L_return_blockcopy_sp_8x8

.L_loopStart_blockcopy_sp_8x8:

INDEX z1.d, x6, x3

INDEX z2.d, x7, x1

.rept 2

LD1D z3.d, p1/Z, [x2, z1.d]

ADD z1.d, z1.d, z31.d

UZP1 z3.b, z3.b, z0.b

ST1W z3.d, p1, [x0, z2.d, UXTW #2]

ADD z2.d, z2.d, z31.d

.endr

INCD x9

MUL x6, x9, x3

MUL x7, x9, x1

WHILELT p1.d, x9, x8

B.FIRST .L_loopStart_blockcopy_sp_8x8

.L_return_blockcopy_sp_8x8:

RET

endfunc

However, I do not believe that this code takes full advantage of SVE vector length agnostic approach.

For example, the LD1D instruction reads only 64 bit before it jumps to the next location in memory.

So, it might be the case that the z3 register is not fully loaded with 16bytes of data.

Can you please tell me what I am doing wrong?

Thank you in advance.

Top replies

Parents

+1 George Steed over 1 year ago in reply to Akis

Hi Akis,

Thanks for providing the updated implementation. That looks reasonable to me, I
don't really have anything significant to suggest.

You could unroll and interleave the loop as in your original 16x16 Neon version
(to end up with a .rept 8 instead of .rept 16), but that is probably a marginal
benefit, if at all. You could use uzp1 as I had previously suggested for the
Neon version and then have a single full store (using .b elements) rather than
a pair of unpacked stores (.h), but this isn't actually a reduction in
instruction count since you are just trading a store instruction for a vector
op, so is again probably not significant.

I will be on holiday from tomorrow so apologies in advance for any delay to
future replies. I wish you a good break if you are also taking any holiday over
the new year period!

Thanks,
George
Cancel
Up +1 Down

Cancel

Reply

+1 George Steed over 1 year ago in reply to Akis

Hi Akis,

Thanks for providing the updated implementation. That looks reasonable to me, I
don't really have anything significant to suggest.

You could unroll and interleave the loop as in your original 16x16 Neon version
(to end up with a .rept 8 instead of .rept 16), but that is probably a marginal
benefit, if at all. You could use uzp1 as I had previously suggested for the
Neon version and then have a single full store (using .b elements) rather than
a pair of unpacked stores (.h), but this isn't actually a reduction in
instruction count since you are just trading a store instruction for a vector
op, so is again probably not significant.

I will be on holiday from tomorrow so apologies in advance for any delay to
future replies. I wish you a good break if you are also taking any holiday over
the new year period!

Thanks,
George
Cancel
Up +1 Down

Cancel

Children

0 Akis over 1 year ago in reply to George Steed

Hi George,

thanks for everything (and also for your suggestions regarding NEON code). You have really helped me. I will try to not disturb you during your vacations :). I am also taking some days off.

I wish you merry Christmas and a happy new year!

BR,

Akis
Cancel
Up +1 Down

Cancel

0 Akis over 1 year ago in reply to Akis

Hi George,

When you will be back from your holidays, I will appreciate if you could answer to the following questions:

1) I have the following function:

template<int bx, int by>
void blockcopy_ps_c(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
            a[x] = (int16_t)b[x];

        a += stridea;
        b += strideb;
    }
}

I have created the following piece of code for supporting its functionality using SVE2 (assume bx=by=8):

function PFX(blockcopy_ps_8x8)
    ptrue           p0.b, vl8
.rept 8
    ld1b            {z0.b}, p0/z, [x2]
    uxtb            z0.h, p0/M, z0.h
    st1h            {z0.h}, p0, [x0]
    add             x0, x0, x1, lsl #1
    add             x2, x2, x3
.endr
    ret
endfunc

Do you agree? Am I missing something?

2) I have the following function in NEON:

function PFX(copy_cnt_4_neon)
    lsl             x2, x2, #1
    movi            v4.8b, #0
.rept 2
    ld1             {v0.8b}, [x1], x2
    ld1             {v1.8b}, [x1], x2
    stp             d0, d1, [x0], #16
    cmeq            v0.4h, v0.4h, #0
    cmeq            v1.4h, v1.4h, #0
    add             v4.4h, v4.4h, v0.4h
    add             v4.4h, v4.4h, v1.4h
.endr
    saddlv          s4, v4.4h
    fmov            w12, s4
    add             w0, w12, #16
    ret
endfunc

I have created the following piece of code for supporting its functionality using SVE2:

function PFX(copy_cnt_4)
    ptrue           p0.h, vl4
    mov             x10, #8
    ptrue           p1.h
    mov             z4.h, p1/z, #1
    mov             w0, #0
.rept 4
    ld1h            {z0.h}, p0/z, [x1]
    st1h            {z0.h}, p0, [x0]
    add             x1, x1, x2, lsl #1
    add             x0, x0, x10
    cmpeq           p2.h, p0/z, z0.h, #0
    uaddv           d6, p2, z4.h
    fmov            w12, s6
    add             w0, w0, w12
.endr
    add             w0, w0, #16
endfunc

Do you agree? Am I missing something?

Thank you in advance,

Akis

+1 George Steed over 1 year ago in reply to Akis
Hi Akis,

Happy new year!

1)

I think there is a bug with your suggested implementation here around how the
predicate is initialised and how the data is loaded. The core of the function
is:

ptrue p0.b, vl8 ld1b {z0.b}, p0/z, [...] uxtb z0.h, p0/m, z0.h st1h {z0.h}, p0, [...]

Note that when mixing precisions (.b and .h above) it is important to
understand how the predicate layout affects the operation being performed.

ptrue p0.b, vl8 [1111111100000000] ^ | ^ most significant bit | ^ least significant bit v ptrue p0.h, vl8 [1010101010101010] | ^ most significant bit ^ least significant bit

This makes a difference because the predicate is used based on the element size
(e.g. z0.h or z0.b above), so in the context of your code snippet:

ptrue p0.b, vl8 ld1b {z0.b}, p0/z, [...] # loads 8 bytes uxtb z0.h, p0/m, z0.h # bug: only zero-extends the low 4 half-words st1h {z0.h}, p0, [...] # bug: only stores the low 4 half-words

As with the unpacked stores there are also unpacked loads, which you can find
documented here (see e.g. "16-bit element"):

https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/LD1B--scalar-plus-scalar---Contiguous-load-unsigned-bytes-to-vector--scalar-index--

In particular this means that in a similar way to the store-truncation case we
can also use the load to perform the sign-extension for free, something like:

function PFX(blockcopy_ps_8x8) ptrue p0.h, vl8 # note .h rather than .b .rept 8 ld1b {z0.h}, p0/z, [x2] # note .h rather than .b st1h {z0.h}, p0, [x0] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc

2)

The general gist of the code looks believable to me. A few comments:

The Neon code does the reduction at the end of the loop rather than inside it.
This makes sense since instructions that reduce across an entire vector like
uaddv tend to be quite expensive (you can consult the optimization guides
earlier for exact figures). You can use a predicated add or subtract
instruction to achieve the same effect as the vector addition in the original:
https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/ADD--vectors--predicated---Add-vectors--predicated--

I think p1 is used only for the broadcast on line 5? In which case you can just use the unpredicated version:
https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/DUP--immediate---Broadcast-signed-immediate-to-vector-elements--unpredicated--

The add on line 11 can just be an add-immediate, no need to initialize x10 separately.

Thanks,
George
Cancel
Up +1 Down

Cancel

0 Akis over 1 year ago in reply to George Steed

Hi George,

I wish you health and a happy new year.

1) Hmm, I wasn't aware of the unpack versions of load instructions. Thanks once again for showing me the way. However, I think there is a problem as the number of bytes to be read from memory increases. For example, consider the 16x16 case. I think it would be best to read as much data as we can from the memory with one instruction, and then manipulate the data. So, I propose the following code:

function PFX(blockcopy_ps_16x16_sve2)
    ptrue           p0.b, vl16
    ptrue           p1.h, vl8
    mov             x11, #8
.rept 16
    ld1b            {z0.b}, p0/z, [x2]
    uunpklo         z1.h, z0.b
    uunpkhi         z2.h, z0.b
    st1h            {z1.h}, p1, [x0]
    st1h            {z2.h}, p1, [x0, x11, lsl #1]
    add             x0, x0, x1, lsl #1
    add             x2, x2, x3
.endr
    ret
endfunc

In this case, we do not use the unpack loads in order to load as much data as we can with one command. Then, we use the "uunpklo" and "uunpkhi" instructions for unpacking. Do you agree? Or is there a better way?

2) After taking your comments into account, I created the following piece of code:

function PFX(copy_cnt_4_sve2)
    ptrue           p0.h, vl4
    ptrue           p1.h
    dup             z4.h, #1
    dup             z6.h, #0
.rept 4
    ld1h            {z0.h}, p0/z, [x1]
    st1h            {z0.h}, p0, [x0]
    add             x1, x1, x2, lsl #1
    add             x0, x0, #8
    cmpeq           p2.h, p0/z, z0.h, #0
    add             z6.h, p2/m, z6.h, z4.h
.endr
    uaddv           d6, p1, z6.h
    fmov            w12, s6
    add             w0, w0, #16
    ret
endfunc

Is this what you meant with your comments?

Thank you in advance,

Akis

+1 George Steed over 1 year ago in reply to Akis
Hi Akis,

1)

Using the unpack instructions for larger load amounts makes sense to me.
Exactly which one is faster in this particular case will of course depend on
the particular micro-architecture of interest, and e.g. the number of vector
pipelines versus the number of load/store pipelines.

As a very minor optimisation, and mostly just for illustrative purposes, I
would point out that the use of p1 here is redundant:

ptrue p0.b, vl16 [1111111111111111] ptrue p0.h, vl8 [1010101010101010] st1h {z1.h} ..... ^ ^ ^ ^ ^ ^ ^ ^ // 16-bit instructions consider every-other bit of the predicate

You will notice that for all the bits considered by the 16-bit store
instrucions the values of the two ptrue instructions are equivalent, and as
such you can simply use p0 everywhere here.

2)

Your revised code looks much more reasonable to me, however I think there are a
couple of bugs in this code (I could be wrong, I have not run it):

The original Neon code sums the output of the cmeq instructions. As Neon does
not have predicate registers the result of comparisons are either all-ones or
all-zeros in a register, however when thought about in twos-complement integer
representation this is equivalent to -1 or 0. Since the Neon code produces a
negative sum, this is likely the reason for the addition of 16 at the end. In
the case of the SVE code there is no need for the final addition, although the
cmpeq should probably become cmpne instead?

Additionally, the fmov on line 15 is writing w12 rather than w0?

Aside from correctness, similarly to the case in (1), the use of p1 is
redundant here: all elements beyond the first four can never have a non-zero
value. So we can avoid constructing p1 and use p0 instead.

Finally, for a vector length of 128-bits it is unlikely to be significantly
faster, but since we are only loading and storing 64-bits at a time here we
could additionally consider using gather instructions to make use of a full
vector of data for the store/cmp/add?

Thanks,
George
Cancel
Up +1 Down

Cancel

0 Akis over 1 year ago in reply to George Steed

Hi George,

thanks again for your answers.

1) I used p0 everywhere as you proposed. Everything seems to be working fine.

2) I applied all your modifications. The final code seems to be:

function PFX(copy_cnt_4_sve2)
    ptrue           p0.h, vl8
    dup             z4.h, #1
    dup             z6.h, #0
    lsl             x2, x2, #1
    lsl             x3, x2, #1
    dup             z7.d, x3
    index           z1.d, #0, x2
.rept 2
    ld1d            {z0.d}, p0/z, [x1, z1.d]
    add             z1.d, p0/m, z1.d, z7.d
    st1d            {z0.d}, p0, [x0]
    add             x0, x0, #16
    cmpne           p2.h, p0/z, z0.h, #0
    add             z6.h, p2/m, z6.h, z4.h
.endr
    uaddv           d6, p0, z6.h
    fmov            w0, s6
    ret
endfunc

This is what you meant, right?

Does it make sense to use the same approach for vector sizes larger then 128 bits instead of using contiguous loads and stores? I think we have already discussed this and the answer is no, right?

BR,

Akis

+1 George Steed over 1 year ago in reply to Akis
Hi Akis,

That code looks reasonable to me. One minor improvement would be to avoid the
vector addition on line 11 by instead performing (cheaper) scalar addition on
the base address, lines 6 and 7 can then be removed:

lsl x3, x2, #1 dup z7.d, x3 ... ld1d {z0.d}, p0/z, [x1, z1.d] add z1.d, p0/m, z1.d, z7.d becomes ld1d {z0.d}, p0/z, [x1, z1.d] add x1, x1, x2

Additionally I just realised that instead of doing a predicated add of 1 on
line 15 you could instead use the incp instruction[1]. This allows us to
directly increment a scalar register by the number of active predicate
elements, allowing us to avoid the uaddv/fmov at the end of the loop:

dup z4.h, #1 dup z6.h, #0 ... cmpne p2.h, p0/z, z0.h, #0 add z6.h, p2/m, z6.h, z4.h ... uaddv d6, p0, z6.h fmov w0, s6 becomes mov x0, #0 ... cmpne p2.h, p0/z, z0.h, #0 incp x0, p2.h ... // nothing to do at the end of the loop, result already in x0

As far as performance on vector lengths above 128 goes, I think the only change
needed is to somehow include the x2 offset into the z1 offset vector. The rest
of the code other than the predicate setup itself seems vector-length agnostic?
The benefit seems marginal though when the inner block of code is only executed
twice though.

Thanks,
George

[1] developer.arm.com/.../INCP--scalar---Increment-scalar-by-count-of-true-predicate-elements-
Cancel
Up +1 Down

Cancel
0 Akis over 1 year ago in reply to George Steed
Hi George,

thanks again for your help.

"incp" seems to do the trick. Thanks!

I have two minor comments. I think I have to multiply by two before I add x2 to x1 and the finar register for store should be x15 (as x0 is used as input to the function). The final code is

function PFX(copy_cnt_4_sve2) ptrue p0.h, vl8 mov x15, #0 lsl x2, x2, #1 index z1.d, #0, x2 .rept 2 ld1d {z0.d}, p0/z, [x1, z1.d] add x1, x1, x2, lsl #1 st1d {z0.d}, p0, [x0] add x0, x0, #16 cmpne p2.h, p0/z, z0.h, #0 incp x15, p2.h .endr mov x0, x15 ret endfunc

Regarding the usage of index in vector sizes larger than 128 bits, I think I will stick to contiguous loads and stores as you proposed for "blockcopy_sp_8x8" function in the beginning of this thread. Creating a vector of offsets with every second element adjacent in memory to
the previous one and using scatter/gather loads and stores seems to decrease the performance. So, I prefer loading the full size of a vector (as much as it is) and stopping there before jumping to the next element.

Thanks for your help!

Akis
Cancel
Up 0 Down

Cancel

0 Akis over 1 year ago in reply to Akis

Hi George,

I am sorry to bother you again but I need your help.

I have the following function using NEON:

function PFX(pixel_sse_pp_16x\h\()_neon)
    ld1             {v16.16b}, [x0], x1
    ld1             {v17.16b}, [x2], x3
    usubl           v1.8h, v16.8b, v17.8b
    usubl2          v2.8h, v16.16b, v17.16b
    ld1             {v16.16b}, [x0], x1
    ld1             {v17.16b}, [x2], x3
    smull           v0.4s, v1.4h, v1.4h
    smlal2          v0.4s, v1.8h, v1.8h
    smlal           v0.4s, v2.4h, v2.4h
    smlal2          v0.4s, v2.8h, v2.8h
.rept \h - 2
    usubl           v1.8h, v16.8b, v17.8b
    usubl2          v2.8h, v16.16b, v17.16b
    ld1             {v16.16b}, [x0], x1
    smlal           v0.4s, v1.4h, v1.4h
    smlal2          v0.4s, v1.8h, v1.8h
    ld1             {v17.16b}, [x2], x3
    smlal           v0.4s, v2.4h, v2.4h
    smlal2          v0.4s, v2.8h, v2.8h
.endr
    usubl           v1.8h, v16.8b, v17.8b
    usubl2          v2.8h, v16.16b, v17.16b
    smlal           v0.4s, v1.4h, v1.4h
    smlal2          v0.4s, v1.8h, v1.8h
    smlal           v0.4s, v2.4h, v2.4h
    smlal2          v0.4s, v2.8h, v2.8h
    trn2            v1.2d, v0.2d, v0.2d
    add             v0.2s, v0.2s, v1.2s
    addp            v0.2s, v0.2s, v0.2s
    fmov            w0, s0
    ret
endfunc

Honestly, I do not understand why the trn2 instruction is used. Anyway, I tried to implement it using the SVE2 instruction set as follows:

function PFX(pixel_sse_pp_16x\h\()_sve2)
    ptrue           p0.b, vl16
    mov             z3.d, #0
    mov             z4.d, #0
.rept \h
    ld1b            {z16.b}, p0/z, [x0]
    add             x0, x0, x1
    ld1b            {z17.b}, p0/z, [x2]
    add             x2, x2, x3
    usublb          z1.h, z16.b, z17.b
    usublt          z2.h, z16.b, z17.b
    smlalb          z3.s, z1.h, z1.h
    smlalt          z4.s, z1.h, z1.h
    smlalb          z3.s, z2.h, z2.h
    smlalt          z4.s, z2.h, z2.h
.endr
    uaddv           d3, p0, z3.s
    fmov            w0, s3
    uaddv           d4, p0, z4.s
    fmov            w1, s4
    add             w0, w0, w1
    ret
endfunc

The SVE2 version of the function seems to work fine but the performance is worse when it is compered to NEON version.

Do you have any suggestion?

Thank you in advance,

Akis

+1 George Steed over 1 year ago in reply to Akis

Hi Akis,

Happy Year of the Rabbit!

The trn2 looks a bit odd to me there as well. I think it is trying to get the
upper half of the vector v0 into the lower half of v1 so that a normal add can
be done instead of needing an addp? On most modern micro-architectures these
two instructions would likely be slower than just doing the addp so this is
probably not helping the Neon code.

Your SVE2 code looks reasonable to me. I can't see anything that would be
obviously problematic to performance, I would expect it to perform at more or
less exactly the same performance as the Neon code. Is it possible to measure
this particular kernel in isolation to confirm that this is the only kernel
causing a performance regression, or perhaps there is another performance issue
elsewhere that is causing the observed difference from the Neon build of the
code?

If you are interested in very small values of h (like h=2) then perhaps there
are interesting tricks to play here to bring it closer to the Neon code like
using smull{b,t} instead of the accumulating versions for the first iteration,
but I wouldn't expect that to meaningfully improve performance.

Sorry I cannot be of more help!

Thanks,
George
Cancel
Up +2 Down

Cancel

0 Akis over 1 year ago in reply to George Steed

Hi George,

Happy Year of the Rabbit!

No worries. You have really helped a lot so far. Without your help, I wouldn't have gone so far.

Regarding the smaller h values, can you be more specific? For example, for the 4x4 case, I am using the following code in order to also take advantage of the load widening instructions:

function PFX(pixel_sse_pp_4x4_sve2)
    ptrue           p0.s, vl4
    mov             z0.d, #0
.rept 4
    ld1b            {z16.s}, p0/z, [x0]
    add             x0, x0, x1
    ld1b            {z17.s}, p0/z, [x2]
    add             x2, x2, x3
    sub             z16.s, p0/m, z16.s, z17.s
    mla             z0.s, p0/m, z16.s, z16.s
.endr
    uaddv           d0, p0, z0.s
    fmov            w0, s0
    ret
endfunc

Is there a way to do it better?

Unfortunately, I cannot isolate this kernel and measure it. In fact, I am using the hyperfine tool (https://github.com/sharkdp/hyperfine) for benchmarking which runs the whole executable code.

I will try to list here some more functions that are eligible for the performance drop, if this OK with you. For example, here is the NEON version of a function:

function PFX(pixel_sad_16x16_neon)
    ld1             {v0.16b}, [x0], x1
    ld1             {v1.16b}, [x2], x3
    ld1             {v2.16b}, [x0], x1
    ld1             {v3.16b}, [x2], x3
    uabdl           v16.8h, v0.8b, v1.8b
    uabdl2          v17.8h, v0.16b, v1.16b
    uabal           v16.8h, v2.8b, v3.8b
    uabal2          v17.8h, v2.16b, v3.16b
.rept 7
    ld1             {v0.16b}, [x0], x1
    ld1             {v1.16b}, [x2], x3
    ld1             {v2.16b}, [x0], x1
    ld1             {v3.16b}, [x2], x3
    uabal           v16.8h, v0.8b, v1.8b
    uabal2          v17.8h, v0.16b, v1.16b
    uabal           v16.8h, v2.8b, v3.8b
    uabal2          v17.8h, v2.16b, v3.16b
.endr
.if \w > 4
    add             v16.8h, v16.8h, v17.8h
.endif
    uaddlv          s0, v16.8h
    fmov            w0, s0
    ret
endfunc

And here is the SVE2 implementation that I developed:

function PFX(pixel_sad_16x16_sve2)
    rdvl            x9, #1
    cmp             x9, #16
    bgt             .vl_gt_16_pixel_sad_16x\h
    mov             x11, #8
    mov             z16.d, #0
    ptrue           p0.h, vl8
.rept 16
    ld1b            {z0.h}, p0/z, [x0]
    ld1b            {z1.h}, p0/z, [x0, x11]
    add             x0, x0, x1
    ld1b            {z2.h}, p0/z, [x2]
    ld1b            {z3.h}, p0/z, [x2, x11]
    add             x2, x2, x3
    uaba            z16.h, z0.h, z2.h
    uaba            z16.h, z1.h, z3.h
.endr
    uaddv           d5, p0, z16.h
    fmov            w0, s5
    ret
.vl_gt_16_pixel_sad_16x\h\():
    mov             z16.d, #0
    ptrue           p0.h, vl16
.rept 16
    ld1b            {z0.h}, p0/z, [x0]
    add             x0, x0, x1
    ld1b            {z2.h}, p0/z, [x2]
    add             x2, x2, x3
    uaba            z16.h, z0.h, z2.h
.endr
    uaddv           d5, p0, z16.h
    fmov            w0, s5
    ret
endfunc

Can you identify any operation that can be done more efficiently in the SVE2 code?

Thank you in advance,

Akis

+1 George Steed over 1 year ago in reply to Akis
Hi Akis,

Thanks for sharing the other kernels.

For optimising for the smaller h values, I was referring to the fact that you
can usually avoid the additional latency associated with first creating a zero
accumulator and then accumulating into it. This probably doesn't matter very
much especially for larger h values, but for smaller cases it could make a
small but noticeable difference. Taking your example in your latest post, we
could imagine peeling the first iteration of the rept and doing something like:

function PFX(pixel_sse_pp_4x4_sve2) ptrue p0.s, vl4 // peel first iteration ld1b {z0.s}, p0/z, [x0] add x0, x0, x1 ld1b {z17.s}, p0/z, [x2] add x2, x2, x3 sub z0.s, p0/m, z0.s, z17.s mul z0.s, p0/m, z0.s, z0.s // only need mul here rather than mla .rept 4 ld1b {z16.s}, p0/z, [x0] add x0, x0, x1 ld1b {z17.s}, p0/z, [x2] add x2, x2, x3 sub z16.s, p0/m, z16.s, z17.s mla z0.s, p0/m, z16.s, z16.s .endr uaddv d0, p0, z0.s fmov w0, s0 ret endfunc

Again I don't think that this really makes any significant difference in
practice, but in general you can sometimes find similar small optimisations in
the first iterations of such loops.

For the pixel_sad code I think that your SVE2 implementation is currently doing
twice as many load instructions as is strictly needed. Although the ld1b->.h
zero extension is nice here, we are effectively losing of our load bandwidth
(since the instruction is loading 8 bytes rather than 16). In SVE2 you can
actually do better than this by instead taking advantage of the UABAL[BT]
instructions to mirror the widening behaviour of the existing Neon code. This
would look something like:

... ptrue p0.b, vl16 // .b instead of .h, vl16 instead of vl8 .rept 16 ld1b {z0.b}, p0/z, [x0] // z0.b instead of z0.h add x0, x0, x1 ld1b {z2.b}, p0/z, [x2] // z2.b instead of z2.h add x2, x2, x3 uabalb z16.h, z0.b, z2.b uabalt z16.h, z1.b, z3.b // ^^ ^ ^ inputs now .b, instruction widens to .h .endr

Hope that helps!

Thanks,
George
Cancel
Up +1 Down

Cancel

0 Akis over 1 year ago in reply to George Steed

Hi George,

thanks for your comments. I followed your advices for both the small h values case and the pixel_sad function.

I have the following extra questions:

1) How about when I need to load 12 bytes of data? For example, I have the following NEON code:

function PFX(pixel_sad_12x\h\()_neon)
    movrel          x12, sad12_mask
    ld1             {v31.16b}, [x12]
    movi            v16.16b, #0
    movi            v17.16b, #0
    mov             w9, #\h/8
.loop_12x\h:
    sub             w9, w9, #1
.rept 4
    ld1             {v0.16b}, [x0], x1
    and             v0.16b, v0.16b, v31.16b
    ld1             {v1.16b}, [x2], x3
    and             v1.16b, v1.16b, v31.16b
    ld1             {v2.16b}, [x0], x1
    and             v2.16b, v2.16b, v31.16b
    ld1             {v3.16b}, [x2], x3
    and             v3.16b, v3.16b, v31.16b
    uabal           v16.8h, v0.8b, v1.8b
    uabal2          v17.8h, v0.16b, v1.16b
    uabal           v16.8h, v2.8b, v3.8b
    uabal2          v17.8h, v2.16b, v3.16b
.endr
    cbnz            w9, .loop_12x\h
    add             v16.8h, v16.8h, v17.8h
    uaddlv          s0, v16.8h
    fmov            w0, s0
    ret
endfunc

const sad12_mask, align=8
.byte 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 0, 0, 0, 0
endconst

I have developed the following SVE2 code:

function PFX(pixel_sad_12x\h\()_sve2)
    mov             z16.d, #0
    ptrue           p0.h, vl8
    ptrue           p1.h, vl4
    mov             x11, #8
.rept \h
    ld1b            {z0.h}, p0/z, [x0]
    ld1b            {z1.h}, p1/z, [x0, x11]
    add             x0, x0, x1
    ld1b            {z8.h}, p0/z, [x2]
    ld1b            {z9.h}, p1/z, [x2, x11]
    add             x2, x2, x3
    uaba            z16.h, z0.h, z8.h
    uaba            z16.h, z1.h, z9.h
.endr
    uaddv           d5, p0, z16.h
    fmov            w0, s5
    ret
endfunc

Here again I am using ld1 instructions for also widen the loaded values, but since I cannot perfectly define ptrues (12 bytes of data are needed), is there a better way?

2) Should I use two ld1 instructions or one ld2 instruction whenever it is possible? For example. I have the following NEON piece of code:

function PFX(pixel_sse_pp_32x32_neon)
    mov             w12, #8
    movi            v0.16b, #0
    movi            v1.16b, #0
.loop_sse_pp_32:
    sub             w12, w12, #1
.rept 4
    ld1             {v16.16b,v17.16b}, [x0], x1
    ld1             {v18.16b,v19.16b}, [x2], x3
    usubl           v2.8h, v16.8b, v18.8b
    usubl2          v3.8h, v16.16b, v18.16b
    usubl           v4.8h, v17.8b, v19.8b
    usubl2          v5.8h, v17.16b, v19.16b
    smlal           v0.4s, v2.4h, v2.4h
    smlal2          v1.4s, v2.8h, v2.8h
    smlal           v0.4s, v3.4h, v3.4h
    smlal2          v1.4s, v3.8h, v3.8h
    smlal           v0.4s, v4.4h, v4.4h
    smlal2          v1.4s, v4.8h, v4.8h
    smlal           v0.4s, v5.4h, v5.4h
    smlal2          v1.4s, v5.8h, v5.8h
.endr
    cbnz            w12, .loop_sse_pp_32
    add             v0.4s, v0.4s, v1.4s
    trn2            v1.2d, v0.2d, v0.2d
    add             v0.2s, v0.2s, v1.2s
    addp            v0.2s, v0.2s, v0.2s
    fmov            w0, s0
    ret
endfunc

I have developed the following SVE2 code:

function PFX(pixel_sse_pp_32x32_sve2)
    rdvl            x9, #1
    cmp             x9, #16
    bgt             .vl_gt_16_pixel_sse_pp_32x32
    ptrue           p0.b, vl16
    mov             z20.d, #0
    mov             z21.d, #0
    mov             x11, #16
.rept 32
    ld1b            {z16.b}, p0/z, [x0]
    ld1b            {z17.b}, p0/z, [x0, x11]
    add             x0, x0, x1
    ld1b            {z18.b}, p0/z, [x2]
    ld1b            {z19.b}, p0/z, [x2, x11]
    add             x2, x2, x3
    usublb          z1.h, z16.b, z18.b
    usublt          z2.h, z16.b, z18.b
    usublb          z3.h, z17.b, z19.b
    usublt          z4.h, z17.b, z19.b
    smlalb          z20.s, z1.h, z1.h
    smlalt          z21.s, z1.h, z1.h
    smlalb          z20.s, z2.h, z2.h
    smlalt          z21.s, z2.h, z2.h
    smlalb          z20.s, z3.h, z3.h
    smlalt          z21.s, z3.h, z3.h
    smlalb          z20.s, z4.h, z4.h
    smlalt          z21.s, z4.h, z4.h
.endr
    uaddv           d3, p0, z20.s
    fmov            w0, s3
    uaddv           d4, p0, z21.s
    fmov            w1, s4
    add             w0, w0, w1
    ret
.vl_gt_16_pixel_sse_pp_32x32:
    ptrue           p0.b, vl32
    mov             z20.d, #0
    mov             z21.d, #0
.rept 32
    ld1b            {z16.b}, p0/z, [x0]
    add             x0, x0, x1
    ld1b            {z18.b}, p0/z, [x2]
    add             x2, x2, x3
    usublb          z1.h, z16.b, z18.b
    usublt          z2.h, z16.b, z18.b
    smlalb          z20.s, z1.h, z1.h
    smlalt          z21.s, z1.h, z1.h
    smlalb          z20.s, z2.h, z2.h
    smlalt          z21.s, z2.h, z2.h
.endr
    uaddv           d3, p0, z20.s
    fmov            w0, s3
    uaddv           d4, p0, z21.s
    fmov            w1, s4
    add             w0, w0, w1
    ret
endfunc

Here for example, I think I can use one ld2 instead of two ld1 instructions. Should I go for it? Also, by the way, can you identify something in the code that can be done more efficiently?

3) I have developed the following piece of code using SVE2:

function PFX(addAvg_32x\h\()_sve2)
    mov             z30.b, #0x40
    mov             w12, #\h
    rdvl            x9, #1
    cmp             x9, #16
    bgt             .vl_gt_16_addAvg_32x\h
    ptrue           p0.b, vl16
    ptrue           p1.h, vl8
.loop_eq_16_sve2_addavg_32x\h\():
    sub             w12, w12, #1
    mov             x14, #16
    ld1b            {z0.b}, p0/z, [x0]
    ld1b            {z1.b}, p0/z, [x0, x14]
    add             x14, x14, #16
    ld1b            {z2.b}, p0/z, [x0, x14]
    add             x14, x14, #16
    ld1b            {z3.b}, p0/z, [x0, x14]
    mov             x14, #16
    ld1b            {z4.b}, p0/z, [x1]
    ld1b            {z5.b}, p0/z, [x1, x14]
    add             x14, x14, #16
    ld1b            {z6.b}, p0/z, [x1, x14]
    add             x14, x14, #16
    ld1b            {z7.b}, p0/z, [x1, x14]
    add             x0, x0, x3, lsl #1
    add             x1, x1, x4, lsl #1
    add             z0.h, p1/m, z0.h, z4.h
    add             z1.h, p1/m, z1.h, z5.h
    add             z2.h, p1/m, z2.h, z6.h
    add             z3.h, p1/m, z3.h, z7.h
    saddlb          z4.s, z0.h, z30.h
    saddlt          z5.s, z0.h, z30.h
    saddlb          z6.s, z1.h, z30.h
    saddlt          z7.s, z1.h, z30.h
    saddlb          z8.s, z2.h, z30.h
    saddlt          z9.s, z2.h, z30.h
    saddlb          z10.s, z3.h, z30.h
    saddlt          z11.s, z3.h, z30.h
    shrnb           z0.h, z4.s, #7
    shrnt           z0.h, z5.s, #7
    shrnb           z1.h, z6.s, #7
    shrnt           z1.h, z7.s, #7
    shrnb           z2.h, z8.s, #7
    shrnt           z2.h, z9.s, #7
    shrnb           z3.h, z10.s, #7
    shrnt           z3.h, z11.s, #7
    sqxtunb         z0.b, z0.h
    sqxtunb         z1.b, z1.h
    sqxtunb         z2.b, z2.h
    sqxtunb         z3.b, z3.h
    mov             x14, #8
    st1b            {z0.h}, p1, [x2]
    st1b            {z1.h}, p1, [x2, x14]
    add             x14, x14, #8
    st1b            {z2.h}, p1, [x2, x14]
    add             x14, x14, #8
    st1b            {z3.h}, p1, [x2, x14]
    add             x2, x2, x5
    cbnz            w12, .loop_eq_16_sve2_addavg_32x\h
    ret
.vl_gt_16_addAvg_32x\h\():
    cmp             x9, #48
    bgt             .vl_gt_48_addAvg_32x\h
    ptrue           p0.b, vl32
    ptrue           p1.h, vl16
    mov             x11, #32
    mov             x10, #16
.loop_gt_eq_32_sve2_addavg_32x\h\():
    sub             w12, w12, #1
    ld1b            {z0.b}, p0/z, [x0]
    ld1b            {z1.b}, p0/z, [x0, x11]
    ld1b            {z2.b}, p0/z, [x1]
    ld1b            {z3.b}, p0/z, [x1, x11]
    add             x0, x0, x3, lsl #1
    add             x1, x1, x4, lsl #1
    add             z0.h, p1/m, z0.h, z2.h
    add             z1.h, p1/m, z1.h, z3.h
    saddlb          z4.s, z0.h, z30.h
    saddlt          z5.s, z0.h, z30.h
    saddlb          z6.s, z1.h, z30.h
    saddlt          z7.s, z1.h, z30.h
    shrnb           z0.h, z4.s, #7
    shrnt           z0.h, z5.s, #7
    shrnb           z1.h, z6.s, #7
    shrnt           z1.h, z7.s, #7
    sqxtunb         z0.b, z0.h
    sqxtunb         z1.b, z1.h
    st1b            {z0.h}, p1, [x2]
    st1b            {z1.h}, p1, [x2, x10]
    add             x2, x2, x5
    cbnz            w12, .loop_gt_eq_32_sve2_addavg_32x\h
    ret
.vl_gt_48_addAvg_32x\h\():
    ptrue           p0.b, vl64
    ptrue           p1.h, vl32
.loop_eq_64_sve2_addavg_32x\h\():
    sub             w12, w12, #1
    ld1b            {z0.b}, p0/z, [x0]
    ld1b            {z1.b}, p0/z, [x1]
    add             x0, x0, x3, lsl #1
    add             x1, x1, x4, lsl #1
    add             z0.h, p1/m, z0.h, z1.h
    saddlb          z1.s, z0.h, z30.h
    saddlt          z2.s, z0.h, z30.h
    shrnb           z0.h, z1.s, #7
    shrnt           z0.h, z2.s, #7
    sqxtunb         z0.b, z0.h
    st1b            {z0.h}, p1, [x2]
    add             x2, x2, x5
    cbnz            w12, .loop_eq_64_sve2_addavg_32x\h
    ret
endfunc

As you can see, I am creating the pointers by adding with #16 in each iteration. Should I move their computation outside the loop and have something like this instead?

mov     x14, #16
mov     x15, #32
mov     x16, #48

and then use x14-x16 as pointers inside the loop without any extra computation? I think the second solution is the best but again it deteriorates the performance. That's why I am asking for your confirmation. Again, if you identify anything else in the function that can be done better, please inform me.

4) I have the following piece of code in NEON:

function PFX(pixel_add_ps_16x\h\()_neon)
    lsl             x5, x5, #1
    mov             w12, #\h / 8
.loop_add_ps_16x\h\():
    sub             w12, w12, #1
.rept 4
    ld1             {v0.16b}, [x2], x4
    ld1             {v1.16b}, [x2], x4
    ld1             {v16.8h-v17.8h}, [x3], x5
    ld1             {v18.8h-v19.8h}, [x3], x5
    uxtl            v4.8h, v0.8b
    uxtl2           v5.8h, v0.16b
    uxtl            v6.8h, v1.8b
    uxtl2           v7.8h, v1.16b
    add             v24.8h, v4.8h, v16.8h
    add             v25.8h, v5.8h, v17.8h
    add             v26.8h, v6.8h, v18.8h
    add             v27.8h, v7.8h, v19.8h
    sqxtun          v4.8b, v24.8h
    sqxtun2         v4.16b, v25.8h
    sqxtun          v5.8b, v26.8h
    sqxtun2         v5.16b, v27.8h
    st1             {v4.16b}, [x0], x1
    st1             {v5.16b}, [x0], x1
.endr
    cbnz            w12, .loop_add_ps_16x\h
    ret
endfunc
.endm

I have developed the following SVE2 version for it:

function PFX(pixel_add_ps_16x\h\()_neon)
    ptrue           p0.b, vl16
    mov             x10, #8
.rept \h
    ld1b            {z0.b}, p0/z, [x2]
    add             x2, x2, x4
    ld1h            {z2.h}, p0/z, [x3]
    ld1h            {z3.h}, p0/z, [x3, x10, lsl #1]
    add             x3, x3, x5, lsl #1
    uunpklo         z6.h, z0.b
    uunpkhi         z7.h, z0.b
    add             z24.h, z6.h, z2.h
    add             z25.h, z7.h, z3.h
    sqxtunb         z6.b, z24.h
    sqxtunb         z7.b, z25.h
    st1b            {z6.h}, p0, [x0]
    st1b            {z7.h}, p0, [x0, x10]
    add             x0, x0, x1
.endr
    ret
endfunc
.endm

Although it works, I am using two st1 instructions. No matter how much I tried, I couldn't find a better solution. Do you have a more efficient solution in mind?

Thank you in advance,

Akis

+1 George Steed over 1 year ago in reply to Akis
Hi Akis,

1)

To get a predicate for the first 12 elements you should be able to use a
WHILELT instruction, something like:

mov w0, #12 whilelt p0.b, wzr, w0

Then I think you can use the widening absolute difference instruction as
in my earlier reply, ending up with something like:

mov w0, #12 whilelt p0.b, wzr, w0 .rept \h ld1b {z0.h}, p0/z, [x0] add x0, x0, x1 ld1b {z8.h}, p0/z, [x2] add x2, x2, x3 uabalb z16.h, z0.b, z8.b uabalt z16.h, z1.b, z9.b .endr

It could also be worth using separate accumulators for the UABALB and UABALT
instructions and summing them together at the end, but whether the overhead of
the addition at the end is worth it probably depends on the value of \h since
the accumulator latency of these instructions is only a single cycle on both
Neoverse N2 and Neoverse V2.

2)

The details of things like ld2 instructions can be found on the software
optimization guides linked previously. It is a long way to scroll back through
this thread now, so here they are again:

Neoverse V2: developer.arm.com/.../latest
Neoveres N2: developer.arm.com/.../latest

In particular here you can see the SVE LD2B instruction is both higher latency
as well as worse than half the throughput (1 vs 3) compared to LD1B, so in this
case we can say that a pair of LD1B should be preferred instead.

The rest of the code looks reasonable to me!

3)

Moving the constants outside the loop makes sense to me. If you are already
guarding that particular code path such that it is specific to a 128-bit vector
length you could also just use the vl-scaled immediate loads, e.g.

mov x14, #16 ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x0, x14] add x14, x14, #16 ld1b {z2.b}, p0/z, [x0, x14] add x14, x14, #16 ld1b {z3.b}, p0/z, [x0, x14]

can become

ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x0, #1, mul vl] ld1b {z2.b}, p0/z, [x0, #2, mul vl] ld1b {z3.b}, p0/z, [x0, #3, mul vl]

Also, I'm not 100% sure but I notice that z30 is initialised to #0x40 and added
before the narrowing shifts. I am wondering if you can instead use the existing
rounding-and-narrowing shift instructions to avoid needing this addition at
all. For the code like:

saddlb z4.s, z0.h, z30.h saddlt z5.s, z0.h, z30.h shrnb z0.h, z4.s, #7 shrnt z0.h, z5.s, #7 sqxtunb z0.b, z0.h

Perhaps this can instead simply be:

sqrshrnb z0.b, z0.h, #7

For more information, see the instruction documentation here:
developer.arm.com/.../SQRSHRNB--Signed-saturating-rounding-shift-right-narrow-by-immediate--bottom--

Let me know if this doesn't work and I can investigate in more detail, but it
feels possible.

4)

As you correctly point out I think there is no nice way of using a single store
here, since SQXTUNT writes to the top half of each pair of elements (as opposed
to SQXTUN2 in Neon which writes to the top half of the overall vector).

I think you can use a pair of zero-extending loads (ld1b {z0.h}) rather than
UUNPKLO/HI to save one instruction here, but that is the only obvious
improvement here that I can see.

Thanks,
George
Cancel
Up +1 Down

Cancel
0 Akis over 1 year ago in reply to George Steed
Hi George,

1) Thanks! To be honest, I completely forgot about the tricks that you can do with whilelt instruction.

2) OK. So, I will continue using 2 ld1 instructions instead of 1 ld2. However, I have some questions. Based on the Neoverse N2 optimization guide you provided, it is stated that for load instructions, the latency represents the maximum latency to load all the vector registers written by the instruction. The latencies for l1b and l2b are 6 and 8 respectively. So, two ld1b instructions need 6+6=12, while one l2b needs 8. Doesn't this imply that the execution time of 2 ld1b is greater than 1 l2b? To be honest, I didn't understand what execution throughput is.

3) Thanks for your comments. Unfortunately, the version of the code that uses 'sqrshrb' instruction does not pass the tests. Don't we also need the addition with '0x40' that the code without sqrshrb does?

4) I have figured out how we can use only one store (but ld2 should be used for loading):

function PFX(pixel_add_ps_16x\h\()_neon) ptrue p0.b, vl16 ptrue p2.b, vl8 .rept \h ld2b {z0.b, z1.b}, p2/z, [x2] add x2, x2, x4 ld2h {z2.h, z3.h}, p0/z, [x3] add x3, x3, x5, lsl #1 uunpklo z5.h, z0.b uunpklo z6.h, z1.b add z24.h, z5.h, z2.h add z25.h, z6.h, z3.h sqxtunb z5.b, z24.h sqxtunt z5.b, z25.h st1b {z5.b}, p0, [x0] add x0, x0, x1 .endr ret endfunc

Do you think that this version of the code is faster than the one which uses 2 stores? Also, if your answer is no, how about using the NEON code when the vector size is equal to 128 bits and the SVE2 code when the vector sizes in greater then 128 bits? I can easily do this as I am checking and branching based on that size (if of course the NEON code is faster than the SVE2 with the two stores). What do you think?

Thanks for everything!

Akis
Cancel
Up 0 Down

Cancel