
Take full advantage of the SVE vector-length-agnostic approach

Hello,

I have the following piece of code:

template<int bx, int by>
void blockcopy_sp_c(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            a[x] = (pixel)b[x];
        }

        a += stridea;
        b += strideb;
    }
}

So, after copying bx 16-bit elements (one row), we need to jump by the stride to another location in memory, copy the next row, and so on.

One possible NEON assembly implementation of the aforementioned function is the following (assuming that bx=by=8):

function PFX(blockcopy_sp_8x8_neon)
    lsl x3, x3, #1               // strideb is in elements; convert to bytes
.rept 4
    ld1 {v0.8h}, [x2], x3        // load one row (8 x int16_t), advance b by strideb
    ld1 {v1.8h}, [x2], x3        // load the next row
    xtn v0.8b, v0.8h             // narrow each halfword to a byte
    xtn v1.8b, v1.8h
    st1 {v0.d}[0], [x0], x1      // store 8 pixels, advance a by stridea
    st1 {v1.d}[0], [x0], x1
.endr
    ret
endfunc

However, the only way to get a post-indexed, register-offset addressing mode in SVE seems to be the gather loads and scatter stores. So, a possible SVE2 assembly implementation of the aforementioned function is the following (assuming that bx=by=8):

function PFX(blockcopy_sp_8x8)
    MOV x8, #8                  // by: total number of rows
    MOV x9, #0                  // row counter
    MOV x6, #0                  // current byte offset into b
    MOV x7, #0                  // current byte offset into a
    MOV z31.d, #64              // increment applied to the offset vectors
    MOV z0.d, #0                // zeroes for UZP1

    WHILELT p1.d, x9, x8        // one 64-bit lane per remaining row
    B.NONE .L_return_blockcopy_sp_8x8

.L_loopStart_blockcopy_sp_8x8:
    INDEX z1.d, x6, x3          // source offsets: x6 + lane * strideb
    INDEX z2.d, x7, x1          // destination offsets: x7 + lane * stridea
.rept 2
    LD1D z3.d, p1/Z, [x2, z1.d] // gather 64 bits (4 x int16_t) per lane
    ADD z1.d, z1.d, z31.d
    UZP1 z3.b, z3.b, z0.b       // keep the low byte of each halfword
    ST1W z3.d, p1, [x0, z2.d, UXTW #2] // scatter 32 bits (4 pixels) per lane
    ADD z2.d, z2.d, z31.d
.endr
    INCD x9                     // advance by the number of 64-bit lanes
    MUL x6, x9, x3              // recompute offsets for the next batch of rows
    MUL x7, x9, x1
    WHILELT p1.d, x9, x8
    B.FIRST .L_loopStart_blockcopy_sp_8x8
.L_return_blockcopy_sp_8x8:
    RET
endfunc
However, I do not believe that this code takes full advantage of the SVE vector-length-agnostic approach. For example, the LD1D instruction reads only 64 bits before it jumps to the next location in memory, so it might be the case that the z3 register is not fully loaded with 16 bytes of data.
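
The only vector-length-agnostic alternative I can think of is to drop the gathers entirely and use contiguous loads with truncating stores, advancing the pointers with plain ADD instructions. A rough sketch (assuming that pixel is uint8_t and bx=by=8; the _sve name is just for illustration) would be:

function PFX(blockcopy_sp_8x8_sve)  // illustrative name, not an existing kernel
    lsl x3, x3, #1              // strideb is in elements; convert to bytes
    mov x8, #8                  // bx: elements per row
    whilelt p0.h, xzr, x8       // predicate covering one row of halfwords
.rept 8
    ld1h {z0.h}, p0/z, [x2]     // contiguous load of one row (8 x int16_t)
    st1b {z0.h}, p0, [x0]       // truncating store: low byte of each halfword
    add x2, x2, x3
    add x0, x0, x1
.endr
    ret
endfunc

But this still processes only one row per iteration no matter how wide the vectors are, so I am not sure it is the intended style either.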
Can you please tell me what I am doing wrong?
Thank you in advance.
  • Hi George,

    2) Now I understand. Thanks. Just one question: does the execution throughput refer to instances of the same instruction? I mean, which of the following is preferable:

    ld1b    {z0.b}, p0/z, [x0]
    ld1b    {z1.b}, p0/z, [x1]
    add     x0, x0, x5
    add     x1, x1, x6

    or

    ld1b    {z0.b}, p0/z, [x0]
    add     x0, x0, x5
    ld1b    {z1.b}, p0/z, [x1]
    add     x1, x1, x6

    If the execution throughput refers to instances of the same instruction, I guess the first option is the best. Or am I wrong?

    3) Your code works perfectly. Thanks!

    4) I switched back to using the zero-extending loads instead. Regarding the performance, I think it is better, as you said. Thanks!

    I might come back to you if I need anything else. Thanks for everything!

    BR,

    Akis

  • Hi Akis,

    Throughput in this case refers to the number of instances of the same
    instruction that can begin execution on each cycle. The exact code layout is
    not particularly important for large out-of-order cores like Neoverse N2 or
    Neoverse V2, so I would expect both arrangements to perform more or less the
    same. The bottleneck in such cores is instead usually a dependency chain
    between instructions; for example, the loads here cannot begin execution
    until the addresses in x0 and x1 have been calculated.
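
    As an illustration only (a hypothetical sketch, not code from your kernel),
    one common way to shorten such a pointer-increment chain is to keep two
    independent address streams, so that consecutive loads never wait on the
    same ADD:

    add  x4, x0, x5              // second stream starts one stride ahead (x5 = stride)
    lsl  x5, x5, #1              // each stream now advances by 2*stride
    ld1b {z0.b}, p0/z, [x0]      // depends only on the x0 chain
    ld1b {z1.b}, p0/z, [x4]      // depends only on the x4 chain
    add  x0, x0, x5              // the two increments are independent,
    add  x4, x4, x5              // so each dependency chain is half as long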

    Glad to hear the new code worked as expected!

    Thanks,
    George