Hello,I have the following piece of code:
So, after bx*16 bytes, we need to jump to another location in memory and read/store bx*16 bytes again, and so on.One possible ASM code for NEON to support the aforementioned function is the following (assuming that bx=by=4):
Thanks George! I really appreciate your help.
It seems a little bit more complicated with the number of iterations and read/store data are increased. For example, consider the 16x16 case. A possible NEON implementation can be the following:
function PFX(blockcopy_sp_16x16_neon) lsl x3, x3, #1 movrel x11, xtn_xtn2_table ld1 {v31.16b}, [x11] .rept 8 ld1 {v0.8h-v1.8h}, [x2], x3 ld1 {v2.8h-v3.8h}, [x2], x3 tbl v0.16b, {v0.16b,v1.16b}, v31.16b tbl v1.16b, {v2.16b,v3.16b}, v31.16b st1 {v0.16b}, [x0], x1 st1 {v1.16b}, [x0], x1 .endr ret endfunc const xtn_xtn2_table, align=4 .byte 0, 2, 4, 6, 8, 10, 12, 14 .byte 16, 18, 20, 22, 24, 26, 28, 30 endconst
How about the following equivalent code using SVE2?
function PFX(blockcopy_sp_16x16) mov x8, 16 mov w12, 16 .L_init_ss_16x16: mov x9, #0 whilelt p1.h, x9, x8 b.none .L_next_stridea_blockcopy_ss_16x16 .L_blockcopy_ss_16x16: ld1h {z0.h}, p0/z, [x2, x9, lsl #1] st1b {z0.h}, p0, [x0, x9] inch x9 whilelt p1.h, x9, x8 b.first .L_blockcopy_ss_16x16 .L_next_stridea_blockcopy_ss_16x16: add x2, x2, x3, lsl #1 add x0, x0, x1 sub w12, w12, #1 cbnz w12, .L_init_ss_16x16 ret endfunc
Am I missing something?
Sorry for sending you new questions. As I have mentioned, I am new to this stuff and I am trying to understand.
BR,
Akis
Hi Akis,
The code you have posted looks reasonable to me as generic SVE code. It isworth keeping in mind that for such a simple inner loop and low trip count itis still probably worth specializing for specific vector lengths, inparticular:
So with that in mind it is probably worth doing something like comparing theresult of a rdvl instruction at the start of the function and branching to oneof the above two possible implementations based on that. Obviously such anapproach would not scale for increasingly large sizes of bx/by, but that issomething you can probably determine where such a trade-off no longer makessense for your target application and micro-architecture.
If you wanted to keep the vector length agnostic version in your reply then youcould still make some small improvements:
Finally as an aside, it is also worth noting that the use of tbl in the Neonversion of the code you posted is sub-optimal: you can use the uzp1 instructionto avoid needing to load the tbl indices at the start of the function.
Hope that helps!
Thanks,George
Hi George,
I would like to thank you again for your answers.
Regarding the solution for avoiding whilelt and inch instructions, you mean something like this, right?
function PFX(blockcopy_sp_16x16) rdvl x9, #1 cmp x9, #16 bgt .vl_ge_16_blockcopy_ss_16_16 ptrue p0.h, vl8 mov x10, #8 .rept 16 ld1h {z0.h}, p0/z, [x2] ld1h {z1.h}, p0/z, [x2, x10, lsl #1] add x2, x2, x3, lsl #1 st1b {z0.h}, p0, [x0] st1b {z1.h}, p0, [x0, x10] add x0, x0, x1 .endr ret .vl_ge_16_blockcopy_ss_16_16: ptrue p0.h, vl16 .rept 16 ld1h {z0.h}, p0/z, [x2] add x2, x2, x3, lsl #1 st1b {z0.h}, p0, [x0] add x0, x0, x1 .endr ret endfunc
It seems more efficient, but as you said, as bx and by increase, maybe it does not scale well.
I think I will opt this, because I am little bit concerned with your comment regarding suboptimal implementations of whilelt and inch in some platforms. Please recall that I care only for optimizing the performance (and not necessarily for using the vector length agnostic capability of SVE2).
Can you please indicate whether you agree with the aforementioned piece of code?
Thank you once again.
Thanks for providing the updated implementation. That looks reasonable to me, Idon't really have anything significant to suggest.
You could unroll and interleave the loop as in your original 16x16 Neon version(to end up with a .rept 8 instead of .rept 16), but that is probably a marginalbenefit, if at all. You could use uzp1 as I had previously suggested for theNeon version and then have a single full store (using .b elements) rather thana pair of unpacked stores (.h), but this isn't actually a reduction ininstruction count since you are just trading a store instruction for a vectorop, so is again probably not significant.
I will be on holiday from tomorrow so apologies in advance for any delay tofuture replies. I wish you a good break if you are also taking any holiday overthe new year period!
thanks for everything (and also for your suggestions regarding NEON code). You have really helped me. I will try to not disturb you during your vacations :). I am also taking some days off.
I wish you merry Christmas and a happy new year!
When you will be back from your holidays, I will appreciate if you could answer to the following questions:
1) I have the following function:
template<int bx, int by> void blockcopy_ps_c(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb) { for (int y = 0; y < by; y++) { for (int x = 0; x < bx; x++) a[x] = (int16_t)b[x]; a += stridea; b += strideb; } }
I have created the following piece of code for supporting its functionality using SVE2 (assume bx=by=8):
function PFX(blockcopy_ps_8x8) ptrue p0.b, vl8 .rept 8 ld1b {z0.b}, p0/z, [x2] uxtb z0.h, p0/M, z0.h st1h {z0.h}, p0, [x0] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc
Do you agree? Am I missing something?
2) I have the following function in NEON:
function PFX(copy_cnt_4_neon) lsl x2, x2, #1 movi v4.8b, #0 .rept 2 ld1 {v0.8b}, [x1], x2 ld1 {v1.8b}, [x1], x2 stp d0, d1, [x0], #16 cmeq v0.4h, v0.4h, #0 cmeq v1.4h, v1.4h, #0 add v4.4h, v4.4h, v0.4h add v4.4h, v4.4h, v1.4h .endr saddlv s4, v4.4h fmov w12, s4 add w0, w12, #16 ret endfunc
I have created the following piece of code for supporting its functionality using SVE2:
function PFX(copy_cnt_4) ptrue p0.h, vl4 mov x10, #8 ptrue p1.h mov z4.h, p1/z, #1 mov w0, #0 .rept 4 ld1h {z0.h}, p0/z, [x1] st1h {z0.h}, p0, [x0] add x1, x1, x2, lsl #1 add x0, x0, x10 cmpeq p2.h, p0/z, z0.h, #0 uaddv d6, p2, z4.h fmov w12, s6 add w0, w0, w12 .endr add w0, w0, #16 endfunc
Thank you in advance,
Happy new year!
1)
I think there is a bug with your suggested implementation here around how thepredicate is initialised and how the data is loaded. The core of the functionis:
ptrue p0.b, vl8 ld1b {z0.b}, p0/z, [...] uxtb z0.h, p0/m, z0.h st1h {z0.h}, p0, [...]
Note that when mixing precisions (.b and .h above) it is important tounderstand how the predicate layout affects the operation being performed.
ptrue p0.b, vl8 [1111111100000000] ^ | ^ most significant bit | ^ least significant bit v ptrue p0.h, vl8 [1010101010101010] | ^ most significant bit ^ least significant bit
This makes a difference because the predicate is used based on the element size(e.g. z0.h or z0.b above), so in the context of your code snippet:
ptrue p0.b, vl8 ld1b {z0.b}, p0/z, [...] # loads 8 bytes uxtb z0.h, p0/m, z0.h # bug: only zero-extends the low 4 half-words st1h {z0.h}, p0, [...] # bug: only stores the low 4 half-words
As with the unpacked stores there are also unpacked loads, which you can finddocumented here (see e.g. "16-bit element"):
https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/LD1B--scalar-plus-scalar---Contiguous-load-unsigned-bytes-to-vector--scalar-index--
In particular this means that in a similar way to the store-truncation case wecan also use the load to perform the sign-extension for free, something like:
function PFX(blockcopy_ps_8x8) ptrue p0.h, vl8 # note .h rather than .b .rept 8 ld1b {z0.h}, p0/z, [x2] # note .h rather than .b st1h {z0.h}, p0, [x0] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc
2)
The general gist of the code looks believable to me. A few comments:
I wish you health and a happy new year.
1) Hmm, I wasn't aware of the unpack versions of load instructions. Thanks once again for showing me the way. However, I think there is a problem as the number of bytes to be read from memory increases. For example, consider the 16x16 case. I think it would be best to read as much data as we can from the memory with one instruction, and then manipulate the data. So, I propose the following code:
function PFX(blockcopy_ps_16x16_sve2) ptrue p0.b, vl16 ptrue p1.h, vl8 mov x11, #8 .rept 16 ld1b {z0.b}, p0/z, [x2] uunpklo z1.h, z0.b uunpkhi z2.h, z0.b st1h {z1.h}, p1, [x0] st1h {z2.h}, p1, [x0, x11, lsl #1] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc
In this case, we do not use the unpack loads in order to load as much data as we can with one command. Then, we use the "uunpklo" and "uunpkhi" instructions for unpacking. Do you agree? Or is there a better way?
2) After taking your comments into account, I created the following piece of code:
function PFX(copy_cnt_4_sve2) ptrue p0.h, vl4 ptrue p1.h dup z4.h, #1 dup z6.h, #0 .rept 4 ld1h {z0.h}, p0/z, [x1] st1h {z0.h}, p0, [x0] add x1, x1, x2, lsl #1 add x0, x0, #8 cmpeq p2.h, p0/z, z0.h, #0 add z6.h, p2/m, z6.h, z4.h .endr uaddv d6, p1, z6.h fmov w12, s6 add w0, w0, #16 ret endfunc
Is this what you meant with your comments?
Using the unpack instructions for larger load amounts makes sense to me.Exactly which one is faster in this particular case will of course depend onthe particular micro-architecture of interest, and e.g. the number of vectorpipelines versus the number of load/store pipelines.
As a very minor optimisation, and mostly just for illustrative purposes, Iwould point out that the use of p1 here is redundant:
ptrue p0.b, vl16 [1111111111111111] ptrue p0.h, vl8 [1010101010101010] st1h {z1.h} ..... ^ ^ ^ ^ ^ ^ ^ ^ // 16-bit instructions consider every-other bit of the predicate
You will notice that for all the bits considered by the 16-bit storeinstrucions the values of the two ptrue instructions are equivalent, and assuch you can simply use p0 everywhere here.
Your revised code looks much more reasonable to me, however I think there are acouple of bugs in this code (I could be wrong, I have not run it):
The original Neon code sums the output of the cmeq instructions. As Neon doesnot have predicate registers the result of comparisons are either all-ones orall-zeros in a register, however when thought about in twos-complement integerrepresentation this is equivalent to -1 or 0. Since the Neon code produces anegative sum, this is likely the reason for the addition of 16 at the end. Inthe case of the SVE code there is no need for the final addition, although thecmpeq should probably become cmpne instead?
Additionally, the fmov on line 15 is writing w12 rather than w0?
Aside from correctness, similarly to the case in (1), the use of p1 isredundant here: all elements beyond the first four can never have a non-zerovalue. So we can avoid constructing p1 and use p0 instead.
Finally, for a vector length of 128-bits it is unlikely to be significantlyfaster, but since we are only loading and storing 64-bits at a time here wecould additionally consider using gather instructions to make use of a fullvector of data for the store/cmp/add?
thanks again for your answers.
1) I used p0 everywhere as you proposed. Everything seems to be working fine.
2) I applied all your modifications. The final code seems to be:
function PFX(copy_cnt_4_sve2) ptrue p0.h, vl8 dup z4.h, #1 dup z6.h, #0 lsl x2, x2, #1 lsl x3, x2, #1 dup z7.d, x3 index z1.d, #0, x2 .rept 2 ld1d {z0.d}, p0/z, [x1, z1.d] add z1.d, p0/m, z1.d, z7.d st1d {z0.d}, p0, [x0] add x0, x0, #16 cmpne p2.h, p0/z, z0.h, #0 add z6.h, p2/m, z6.h, z4.h .endr uaddv d6, p0, z6.h fmov w0, s6 ret endfunc
This is what you meant, right?
Does it make sense to use the same approach for vector sizes larger then 128 bits instead of using contiguous loads and stores? I think we have already discussed this and the answer is no, right?
That code looks reasonable to me. One minor improvement would be to avoid thevector addition on line 11 by instead performing (cheaper) scalar addition onthe base address, lines 6 and 7 can then be removed:
lsl x3, x2, #1 dup z7.d, x3 ... ld1d {z0.d}, p0/z, [x1, z1.d] add z1.d, p0/m, z1.d, z7.d becomes ld1d {z0.d}, p0/z, [x1, z1.d] add x1, x1, x2
Additionally I just realised that instead of doing a predicated add of 1 online 15 you could instead use the incp instruction[1]. This allows us todirectly increment a scalar register by the number of active predicateelements, allowing us to avoid the uaddv/fmov at the end of the loop:
dup z4.h, #1 dup z6.h, #0 ... cmpne p2.h, p0/z, z0.h, #0 add z6.h, p2/m, z6.h, z4.h ... uaddv d6, p0, z6.h fmov w0, s6 becomes mov x0, #0 ... cmpne p2.h, p0/z, z0.h, #0 incp x0, p2.h ... // nothing to do at the end of the loop, result already in x0
As far as performance on vector lengths above 128 goes, I think the only changeneeded is to somehow include the x2 offset into the z1 offset vector. The restof the code other than the predicate setup itself seems vector-length agnostic?The benefit seems marginal though when the inner block of code is only executedtwice though.
[1] developer.arm.com/.../INCP--scalar---Increment-scalar-by-count-of-true-predicate-elements-
thanks again for your help.
"incp" seems to do the trick. Thanks!
I have two minor comments. I think I have to multiply by two before I add x2 to x1 and the finar register for store should be x15 (as x0 is used as input to the function). The final code is
function PFX(copy_cnt_4_sve2) ptrue p0.h, vl8 mov x15, #0 lsl x2, x2, #1 index z1.d, #0, x2 .rept 2 ld1d {z0.d}, p0/z, [x1, z1.d] add x1, x1, x2, lsl #1 st1d {z0.d}, p0, [x0] add x0, x0, #16 cmpne p2.h, p0/z, z0.h, #0 incp x15, p2.h .endr mov x0, x15 ret endfunc
Regarding the usage of index in vector sizes larger than 128 bits, I think I will stick to contiguous loads and stores as you proposed for "blockcopy_sp_8x8" function in the beginning of this thread. Creating a vector of offsets with every second element adjacent in memory tothe previous one and using scatter/gather loads and stores seems to decrease the performance. So, I prefer loading the full size of a vector (as much as it is) and stopping there before jumping to the next element.
Thanks for your help!
I am sorry to bother you again but I need your help.
I have the following function using NEON:
function PFX(pixel_sse_pp_16x\h\()_neon) ld1 {v16.16b}, [x0], x1 ld1 {v17.16b}, [x2], x3 usubl v1.8h, v16.8b, v17.8b usubl2 v2.8h, v16.16b, v17.16b ld1 {v16.16b}, [x0], x1 ld1 {v17.16b}, [x2], x3 smull v0.4s, v1.4h, v1.4h smlal2 v0.4s, v1.8h, v1.8h smlal v0.4s, v2.4h, v2.4h smlal2 v0.4s, v2.8h, v2.8h .rept \h - 2 usubl v1.8h, v16.8b, v17.8b usubl2 v2.8h, v16.16b, v17.16b ld1 {v16.16b}, [x0], x1 smlal v0.4s, v1.4h, v1.4h smlal2 v0.4s, v1.8h, v1.8h ld1 {v17.16b}, [x2], x3 smlal v0.4s, v2.4h, v2.4h smlal2 v0.4s, v2.8h, v2.8h .endr usubl v1.8h, v16.8b, v17.8b usubl2 v2.8h, v16.16b, v17.16b smlal v0.4s, v1.4h, v1.4h smlal2 v0.4s, v1.8h, v1.8h smlal v0.4s, v2.4h, v2.4h smlal2 v0.4s, v2.8h, v2.8h trn2 v1.2d, v0.2d, v0.2d add v0.2s, v0.2s, v1.2s addp v0.2s, v0.2s, v0.2s fmov w0, s0 ret endfunc
Honestly, I do not understand why the trn2 instruction is used. Anyway, I tried to implement it using the SVE2 instruction set as follows:
function PFX(pixel_sse_pp_16x\h\()_sve2) ptrue p0.b, vl16 mov z3.d, #0 mov z4.d, #0 .rept \h ld1b {z16.b}, p0/z, [x0] add x0, x0, x1 ld1b {z17.b}, p0/z, [x2] add x2, x2, x3 usublb z1.h, z16.b, z17.b usublt z2.h, z16.b, z17.b smlalb z3.s, z1.h, z1.h smlalt z4.s, z1.h, z1.h smlalb z3.s, z2.h, z2.h smlalt z4.s, z2.h, z2.h .endr uaddv d3, p0, z3.s fmov w0, s3 uaddv d4, p0, z4.s fmov w1, s4 add w0, w0, w1 ret endfunc
The SVE2 version of the function seems to work fine but the performance is worse when it is compered to NEON version.
Do you have any suggestion?
Happy Year of the Rabbit!
The trn2 looks a bit odd to me there as well. I think it is trying to get theupper half of the vector v0 into the lower half of v1 so that a normal add canbe done instead of needing an addp? On most modern micro-architectures thesetwo instructions would likely be slower than just doing the addp so this isprobably not helping the Neon code.
Your SVE2 code looks reasonable to me. I can't see anything that would beobviously problematic to performance, I would expect it to perform at more orless exactly the same performance as the Neon code. Is it possible to measurethis particular kernel in isolation to confirm that this is the only kernelcausing a performance regression, or perhaps there is another performance issueelsewhere that is causing the observed difference from the Neon build of thecode?
If you are interested in very small values of h (like h=2) then perhaps thereare interesting tricks to play here to bring it closer to the Neon code likeusing smull{b,t} instead of the accumulating versions for the first iteration,but I wouldn't expect that to meaningfully improve performance.
Sorry I cannot be of more help!
No worries. You have really helped a lot so far. Without your help, I wouldn't have gone so far.
Regarding the smaller h values, can you be more specific? For example, for the 4x4 case, I am using the following code in order to also take advantage of the load widening instructions:
function PFX(pixel_sse_pp_4x4_sve2) ptrue p0.s, vl4 mov z0.d, #0 .rept 4 ld1b {z16.s}, p0/z, [x0] add x0, x0, x1 ld1b {z17.s}, p0/z, [x2] add x2, x2, x3 sub z16.s, p0/m, z16.s, z17.s mla z0.s, p0/m, z16.s, z16.s .endr uaddv d0, p0, z0.s fmov w0, s0 ret endfunc
Is there a way to do it better?
Unfortunately, I cannot isolate this kernel and measure it. In fact, I am using the hyperfine tool (https://github.com/sharkdp/hyperfine) for benchmarking which runs the whole executable code.
I will try to list here some more functions that are eligible for the performance drop, if this OK with you. For example, here is the NEON version of a function:
function PFX(pixel_sad_16x16_neon) ld1 {v0.16b}, [x0], x1 ld1 {v1.16b}, [x2], x3 ld1 {v2.16b}, [x0], x1 ld1 {v3.16b}, [x2], x3 uabdl v16.8h, v0.8b, v1.8b uabdl2 v17.8h, v0.16b, v1.16b uabal v16.8h, v2.8b, v3.8b uabal2 v17.8h, v2.16b, v3.16b .rept 7 ld1 {v0.16b}, [x0], x1 ld1 {v1.16b}, [x2], x3 ld1 {v2.16b}, [x0], x1 ld1 {v3.16b}, [x2], x3 uabal v16.8h, v0.8b, v1.8b uabal2 v17.8h, v0.16b, v1.16b uabal v16.8h, v2.8b, v3.8b uabal2 v17.8h, v2.16b, v3.16b .endr .if \w > 4 add v16.8h, v16.8h, v17.8h .endif uaddlv s0, v16.8h fmov w0, s0 ret endfunc
And here is the SVE2 implementation that I developed:
function PFX(pixel_sad_16x16_sve2) rdvl x9, #1 cmp x9, #16 bgt .vl_gt_16_pixel_sad_16x\h mov x11, #8 mov z16.d, #0 ptrue p0.h, vl8 .rept 16 ld1b {z0.h}, p0/z, [x0] ld1b {z1.h}, p0/z, [x0, x11] add x0, x0, x1 ld1b {z2.h}, p0/z, [x2] ld1b {z3.h}, p0/z, [x2, x11] add x2, x2, x3 uaba z16.h, z0.h, z2.h uaba z16.h, z1.h, z3.h .endr uaddv d5, p0, z16.h fmov w0, s5 ret .vl_gt_16_pixel_sad_16x\h\(): mov z16.d, #0 ptrue p0.h, vl16 .rept 16 ld1b {z0.h}, p0/z, [x0] add x0, x0, x1 ld1b {z2.h}, p0/z, [x2] add x2, x2, x3 uaba z16.h, z0.h, z2.h .endr uaddv d5, p0, z16.h fmov w0, s5 ret endfunc
Can you identify any operation that can be done more efficiently in the SVE2 code?
Thanks for sharing the other kernels.
For optimising for the smaller h values, I was referring to the fact that youcan usually avoid the additional latency associated with first creating a zeroaccumulator and then accumulating into it. This probably doesn't matter verymuch especially for larger h values, but for smaller cases it could make asmall but noticeable difference. Taking your example in your latest post, wecould imagine peeling the first iteration of the rept and doing something like:
function PFX(pixel_sse_pp_4x4_sve2) ptrue p0.s, vl4 // peel first iteration ld1b {z0.s}, p0/z, [x0] add x0, x0, x1 ld1b {z17.s}, p0/z, [x2] add x2, x2, x3 sub z0.s, p0/m, z0.s, z17.s mul z0.s, p0/m, z0.s, z0.s // only need mul here rather than mla .rept 4 ld1b {z16.s}, p0/z, [x0] add x0, x0, x1 ld1b {z17.s}, p0/z, [x2] add x2, x2, x3 sub z16.s, p0/m, z16.s, z17.s mla z0.s, p0/m, z16.s, z16.s .endr uaddv d0, p0, z0.s fmov w0, s0 ret endfunc
Again I don't think that this really makes any significant difference inpractice, but in general you can sometimes find similar small optimisations inthe first iterations of such loops.
For the pixel_sad code I think that your SVE2 implementation is currently doingtwice as many load instructions as is strictly needed. Although the ld1b->.hzero extension is nice here, we are effectively losing of our load bandwidth(since the instruction is loading 8 bytes rather than 16). In SVE2 you canactually do better than this by instead taking advantage of the UABAL[BT]instructions to mirror the widening behaviour of the existing Neon code. Thiswould look something like:
... ptrue p0.b, vl16 // .b instead of .h, vl16 instead of vl8 .rept 16 ld1b {z0.b}, p0/z, [x0] // z0.b instead of z0.h add x0, x0, x1 ld1b {z2.b}, p0/z, [x2] // z2.b instead of z2.h add x2, x2, x3 uabalb z16.h, z0.b, z2.b uabalt z16.h, z1.b, z3.b // ^^ ^ ^ inputs now .b, instruction widens to .h .endr