Hello,I have the following piece of code:
So, after bx*16 bytes, we need to jump to another location in memory and read/store bx*16 bytes again, and so on.One possible ASM code for NEON to support the aforementioned function is the following (assuming that bx=by=4):
Hi George,
When you will be back from your holidays, I will appreciate if you could answer to the following questions:
1) I have the following function:
template<int bx, int by> void blockcopy_ps_c(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb) { for (int y = 0; y < by; y++) { for (int x = 0; x < bx; x++) a[x] = (int16_t)b[x]; a += stridea; b += strideb; } }
I have created the following piece of code for supporting its functionality using SVE2 (assume bx=by=8):
function PFX(blockcopy_ps_8x8) ptrue p0.b, vl8 .rept 8 ld1b {z0.b}, p0/z, [x2] uxtb z0.h, p0/M, z0.h st1h {z0.h}, p0, [x0] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc
Do you agree? Am I missing something?
2) I have the following function in NEON:
function PFX(copy_cnt_4_neon) lsl x2, x2, #1 movi v4.8b, #0 .rept 2 ld1 {v0.8b}, [x1], x2 ld1 {v1.8b}, [x1], x2 stp d0, d1, [x0], #16 cmeq v0.4h, v0.4h, #0 cmeq v1.4h, v1.4h, #0 add v4.4h, v4.4h, v0.4h add v4.4h, v4.4h, v1.4h .endr saddlv s4, v4.4h fmov w12, s4 add w0, w12, #16 ret endfunc
I have created the following piece of code for supporting its functionality using SVE2:
function PFX(copy_cnt_4) ptrue p0.h, vl4 mov x10, #8 ptrue p1.h mov z4.h, p1/z, #1 mov w0, #0 .rept 4 ld1h {z0.h}, p0/z, [x1] st1h {z0.h}, p0, [x0] add x1, x1, x2, lsl #1 add x0, x0, x10 cmpeq p2.h, p0/z, z0.h, #0 uaddv d6, p2, z4.h fmov w12, s6 add w0, w0, w12 .endr add w0, w0, #16 endfunc
Thank you in advance,
Akis
Hi Akis,
Happy new year!
1)
I think there is a bug with your suggested implementation here around how thepredicate is initialised and how the data is loaded. The core of the functionis:
ptrue p0.b, vl8 ld1b {z0.b}, p0/z, [...] uxtb z0.h, p0/m, z0.h st1h {z0.h}, p0, [...]
Note that when mixing precisions (.b and .h above) it is important tounderstand how the predicate layout affects the operation being performed.
ptrue p0.b, vl8 [1111111100000000] ^ | ^ most significant bit | ^ least significant bit v ptrue p0.h, vl8 [1010101010101010] | ^ most significant bit ^ least significant bit
This makes a difference because the predicate is used based on the element size(e.g. z0.h or z0.b above), so in the context of your code snippet:
ptrue p0.b, vl8 ld1b {z0.b}, p0/z, [...] # loads 8 bytes uxtb z0.h, p0/m, z0.h # bug: only zero-extends the low 4 half-words st1h {z0.h}, p0, [...] # bug: only stores the low 4 half-words
As with the unpacked stores there are also unpacked loads, which you can finddocumented here (see e.g. "16-bit element"):
https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/LD1B--scalar-plus-scalar---Contiguous-load-unsigned-bytes-to-vector--scalar-index--
In particular this means that in a similar way to the store-truncation case wecan also use the load to perform the sign-extension for free, something like:
function PFX(blockcopy_ps_8x8) ptrue p0.h, vl8 # note .h rather than .b .rept 8 ld1b {z0.h}, p0/z, [x2] # note .h rather than .b st1h {z0.h}, p0, [x0] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc
2)
The general gist of the code looks believable to me. A few comments:
Thanks,George
I wish you health and a happy new year.
1) Hmm, I wasn't aware of the unpack versions of load instructions. Thanks once again for showing me the way. However, I think there is a problem as the number of bytes to be read from memory increases. For example, consider the 16x16 case. I think it would be best to read as much data as we can from the memory with one instruction, and then manipulate the data. So, I propose the following code:
function PFX(blockcopy_ps_16x16_sve2) ptrue p0.b, vl16 ptrue p1.h, vl8 mov x11, #8 .rept 16 ld1b {z0.b}, p0/z, [x2] uunpklo z1.h, z0.b uunpkhi z2.h, z0.b st1h {z1.h}, p1, [x0] st1h {z2.h}, p1, [x0, x11, lsl #1] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc
In this case, we do not use the unpack loads in order to load as much data as we can with one command. Then, we use the "uunpklo" and "uunpkhi" instructions for unpacking. Do you agree? Or is there a better way?
2) After taking your comments into account, I created the following piece of code:
function PFX(copy_cnt_4_sve2) ptrue p0.h, vl4 ptrue p1.h dup z4.h, #1 dup z6.h, #0 .rept 4 ld1h {z0.h}, p0/z, [x1] st1h {z0.h}, p0, [x0] add x1, x1, x2, lsl #1 add x0, x0, #8 cmpeq p2.h, p0/z, z0.h, #0 add z6.h, p2/m, z6.h, z4.h .endr uaddv d6, p1, z6.h fmov w12, s6 add w0, w0, #16 ret endfunc
Is this what you meant with your comments?
Using the unpack instructions for larger load amounts makes sense to me.Exactly which one is faster in this particular case will of course depend onthe particular micro-architecture of interest, and e.g. the number of vectorpipelines versus the number of load/store pipelines.
As a very minor optimisation, and mostly just for illustrative purposes, Iwould point out that the use of p1 here is redundant:
ptrue p0.b, vl16 [1111111111111111] ptrue p0.h, vl8 [1010101010101010] st1h {z1.h} ..... ^ ^ ^ ^ ^ ^ ^ ^ // 16-bit instructions consider every-other bit of the predicate
You will notice that for all the bits considered by the 16-bit storeinstrucions the values of the two ptrue instructions are equivalent, and assuch you can simply use p0 everywhere here.
Your revised code looks much more reasonable to me, however I think there are acouple of bugs in this code (I could be wrong, I have not run it):
The original Neon code sums the output of the cmeq instructions. As Neon doesnot have predicate registers the result of comparisons are either all-ones orall-zeros in a register, however when thought about in twos-complement integerrepresentation this is equivalent to -1 or 0. Since the Neon code produces anegative sum, this is likely the reason for the addition of 16 at the end. Inthe case of the SVE code there is no need for the final addition, although thecmpeq should probably become cmpne instead?
Additionally, the fmov on line 15 is writing w12 rather than w0?
Aside from correctness, similarly to the case in (1), the use of p1 isredundant here: all elements beyond the first four can never have a non-zerovalue. So we can avoid constructing p1 and use p0 instead.
Finally, for a vector length of 128-bits it is unlikely to be significantlyfaster, but since we are only loading and storing 64-bits at a time here wecould additionally consider using gather instructions to make use of a fullvector of data for the store/cmp/add?
thanks again for your answers.
1) I used p0 everywhere as you proposed. Everything seems to be working fine.
2) I applied all your modifications. The final code seems to be:
function PFX(copy_cnt_4_sve2) ptrue p0.h, vl8 dup z4.h, #1 dup z6.h, #0 lsl x2, x2, #1 lsl x3, x2, #1 dup z7.d, x3 index z1.d, #0, x2 .rept 2 ld1d {z0.d}, p0/z, [x1, z1.d] add z1.d, p0/m, z1.d, z7.d st1d {z0.d}, p0, [x0] add x0, x0, #16 cmpne p2.h, p0/z, z0.h, #0 add z6.h, p2/m, z6.h, z4.h .endr uaddv d6, p0, z6.h fmov w0, s6 ret endfunc
This is what you meant, right?
Does it make sense to use the same approach for vector sizes larger then 128 bits instead of using contiguous loads and stores? I think we have already discussed this and the answer is no, right?
BR,
That code looks reasonable to me. One minor improvement would be to avoid thevector addition on line 11 by instead performing (cheaper) scalar addition onthe base address, lines 6 and 7 can then be removed:
lsl x3, x2, #1 dup z7.d, x3 ... ld1d {z0.d}, p0/z, [x1, z1.d] add z1.d, p0/m, z1.d, z7.d becomes ld1d {z0.d}, p0/z, [x1, z1.d] add x1, x1, x2
Additionally I just realised that instead of doing a predicated add of 1 online 15 you could instead use the incp instruction[1]. This allows us todirectly increment a scalar register by the number of active predicateelements, allowing us to avoid the uaddv/fmov at the end of the loop:
dup z4.h, #1 dup z6.h, #0 ... cmpne p2.h, p0/z, z0.h, #0 add z6.h, p2/m, z6.h, z4.h ... uaddv d6, p0, z6.h fmov w0, s6 becomes mov x0, #0 ... cmpne p2.h, p0/z, z0.h, #0 incp x0, p2.h ... // nothing to do at the end of the loop, result already in x0
As far as performance on vector lengths above 128 goes, I think the only changeneeded is to somehow include the x2 offset into the z1 offset vector. The restof the code other than the predicate setup itself seems vector-length agnostic?The benefit seems marginal though when the inner block of code is only executedtwice though.
[1] developer.arm.com/.../INCP--scalar---Increment-scalar-by-count-of-true-predicate-elements-
thanks again for your help.
"incp" seems to do the trick. Thanks!
I have two minor comments. I think I have to multiply by two before I add x2 to x1 and the finar register for store should be x15 (as x0 is used as input to the function). The final code is
function PFX(copy_cnt_4_sve2) ptrue p0.h, vl8 mov x15, #0 lsl x2, x2, #1 index z1.d, #0, x2 .rept 2 ld1d {z0.d}, p0/z, [x1, z1.d] add x1, x1, x2, lsl #1 st1d {z0.d}, p0, [x0] add x0, x0, #16 cmpne p2.h, p0/z, z0.h, #0 incp x15, p2.h .endr mov x0, x15 ret endfunc
Regarding the usage of index in vector sizes larger than 128 bits, I think I will stick to contiguous loads and stores as you proposed for "blockcopy_sp_8x8" function in the beginning of this thread. Creating a vector of offsets with every second element adjacent in memory tothe previous one and using scatter/gather loads and stores seems to decrease the performance. So, I prefer loading the full size of a vector (as much as it is) and stopping there before jumping to the next element.
Thanks for your help!
I am sorry to bother you again but I need your help.
I have the following function using NEON:
function PFX(pixel_sse_pp_16x\h\()_neon) ld1 {v16.16b}, [x0], x1 ld1 {v17.16b}, [x2], x3 usubl v1.8h, v16.8b, v17.8b usubl2 v2.8h, v16.16b, v17.16b ld1 {v16.16b}, [x0], x1 ld1 {v17.16b}, [x2], x3 smull v0.4s, v1.4h, v1.4h smlal2 v0.4s, v1.8h, v1.8h smlal v0.4s, v2.4h, v2.4h smlal2 v0.4s, v2.8h, v2.8h .rept \h - 2 usubl v1.8h, v16.8b, v17.8b usubl2 v2.8h, v16.16b, v17.16b ld1 {v16.16b}, [x0], x1 smlal v0.4s, v1.4h, v1.4h smlal2 v0.4s, v1.8h, v1.8h ld1 {v17.16b}, [x2], x3 smlal v0.4s, v2.4h, v2.4h smlal2 v0.4s, v2.8h, v2.8h .endr usubl v1.8h, v16.8b, v17.8b usubl2 v2.8h, v16.16b, v17.16b smlal v0.4s, v1.4h, v1.4h smlal2 v0.4s, v1.8h, v1.8h smlal v0.4s, v2.4h, v2.4h smlal2 v0.4s, v2.8h, v2.8h trn2 v1.2d, v0.2d, v0.2d add v0.2s, v0.2s, v1.2s addp v0.2s, v0.2s, v0.2s fmov w0, s0 ret endfunc
Honestly, I do not understand why the trn2 instruction is used. Anyway, I tried to implement it using the SVE2 instruction set as follows:
function PFX(pixel_sse_pp_16x\h\()_sve2) ptrue p0.b, vl16 mov z3.d, #0 mov z4.d, #0 .rept \h ld1b {z16.b}, p0/z, [x0] add x0, x0, x1 ld1b {z17.b}, p0/z, [x2] add x2, x2, x3 usublb z1.h, z16.b, z17.b usublt z2.h, z16.b, z17.b smlalb z3.s, z1.h, z1.h smlalt z4.s, z1.h, z1.h smlalb z3.s, z2.h, z2.h smlalt z4.s, z2.h, z2.h .endr uaddv d3, p0, z3.s fmov w0, s3 uaddv d4, p0, z4.s fmov w1, s4 add w0, w0, w1 ret endfunc
The SVE2 version of the function seems to work fine but the performance is worse when it is compered to NEON version.
Do you have any suggestion?
Happy Year of the Rabbit!
The trn2 looks a bit odd to me there as well. I think it is trying to get theupper half of the vector v0 into the lower half of v1 so that a normal add canbe done instead of needing an addp? On most modern micro-architectures thesetwo instructions would likely be slower than just doing the addp so this isprobably not helping the Neon code.
Your SVE2 code looks reasonable to me. I can't see anything that would beobviously problematic to performance, I would expect it to perform at more orless exactly the same performance as the Neon code. Is it possible to measurethis particular kernel in isolation to confirm that this is the only kernelcausing a performance regression, or perhaps there is another performance issueelsewhere that is causing the observed difference from the Neon build of thecode?
If you are interested in very small values of h (like h=2) then perhaps thereare interesting tricks to play here to bring it closer to the Neon code likeusing smull{b,t} instead of the accumulating versions for the first iteration,but I wouldn't expect that to meaningfully improve performance.
Sorry I cannot be of more help!
No worries. You have really helped a lot so far. Without your help, I wouldn't have gone so far.
Regarding the smaller h values, can you be more specific? For example, for the 4x4 case, I am using the following code in order to also take advantage of the load widening instructions:
function PFX(pixel_sse_pp_4x4_sve2) ptrue p0.s, vl4 mov z0.d, #0 .rept 4 ld1b {z16.s}, p0/z, [x0] add x0, x0, x1 ld1b {z17.s}, p0/z, [x2] add x2, x2, x3 sub z16.s, p0/m, z16.s, z17.s mla z0.s, p0/m, z16.s, z16.s .endr uaddv d0, p0, z0.s fmov w0, s0 ret endfunc
Is there a way to do it better?
Unfortunately, I cannot isolate this kernel and measure it. In fact, I am using the hyperfine tool (https://github.com/sharkdp/hyperfine) for benchmarking which runs the whole executable code.
I will try to list here some more functions that are eligible for the performance drop, if this OK with you. For example, here is the NEON version of a function:
function PFX(pixel_sad_16x16_neon) ld1 {v0.16b}, [x0], x1 ld1 {v1.16b}, [x2], x3 ld1 {v2.16b}, [x0], x1 ld1 {v3.16b}, [x2], x3 uabdl v16.8h, v0.8b, v1.8b uabdl2 v17.8h, v0.16b, v1.16b uabal v16.8h, v2.8b, v3.8b uabal2 v17.8h, v2.16b, v3.16b .rept 7 ld1 {v0.16b}, [x0], x1 ld1 {v1.16b}, [x2], x3 ld1 {v2.16b}, [x0], x1 ld1 {v3.16b}, [x2], x3 uabal v16.8h, v0.8b, v1.8b uabal2 v17.8h, v0.16b, v1.16b uabal v16.8h, v2.8b, v3.8b uabal2 v17.8h, v2.16b, v3.16b .endr .if \w > 4 add v16.8h, v16.8h, v17.8h .endif uaddlv s0, v16.8h fmov w0, s0 ret endfunc
And here is the SVE2 implementation that I developed:
function PFX(pixel_sad_16x16_sve2) rdvl x9, #1 cmp x9, #16 bgt .vl_gt_16_pixel_sad_16x\h mov x11, #8 mov z16.d, #0 ptrue p0.h, vl8 .rept 16 ld1b {z0.h}, p0/z, [x0] ld1b {z1.h}, p0/z, [x0, x11] add x0, x0, x1 ld1b {z2.h}, p0/z, [x2] ld1b {z3.h}, p0/z, [x2, x11] add x2, x2, x3 uaba z16.h, z0.h, z2.h uaba z16.h, z1.h, z3.h .endr uaddv d5, p0, z16.h fmov w0, s5 ret .vl_gt_16_pixel_sad_16x\h\(): mov z16.d, #0 ptrue p0.h, vl16 .rept 16 ld1b {z0.h}, p0/z, [x0] add x0, x0, x1 ld1b {z2.h}, p0/z, [x2] add x2, x2, x3 uaba z16.h, z0.h, z2.h .endr uaddv d5, p0, z16.h fmov w0, s5 ret endfunc
Can you identify any operation that can be done more efficiently in the SVE2 code?
Thanks for sharing the other kernels.
For optimising for the smaller h values, I was referring to the fact that youcan usually avoid the additional latency associated with first creating a zeroaccumulator and then accumulating into it. This probably doesn't matter verymuch especially for larger h values, but for smaller cases it could make asmall but noticeable difference. Taking your example in your latest post, wecould imagine peeling the first iteration of the rept and doing something like:
function PFX(pixel_sse_pp_4x4_sve2) ptrue p0.s, vl4 // peel first iteration ld1b {z0.s}, p0/z, [x0] add x0, x0, x1 ld1b {z17.s}, p0/z, [x2] add x2, x2, x3 sub z0.s, p0/m, z0.s, z17.s mul z0.s, p0/m, z0.s, z0.s // only need mul here rather than mla .rept 4 ld1b {z16.s}, p0/z, [x0] add x0, x0, x1 ld1b {z17.s}, p0/z, [x2] add x2, x2, x3 sub z16.s, p0/m, z16.s, z17.s mla z0.s, p0/m, z16.s, z16.s .endr uaddv d0, p0, z0.s fmov w0, s0 ret endfunc
Again I don't think that this really makes any significant difference inpractice, but in general you can sometimes find similar small optimisations inthe first iterations of such loops.
For the pixel_sad code I think that your SVE2 implementation is currently doingtwice as many load instructions as is strictly needed. Although the ld1b->.hzero extension is nice here, we are effectively losing of our load bandwidth(since the instruction is loading 8 bytes rather than 16). In SVE2 you canactually do better than this by instead taking advantage of the UABAL[BT]instructions to mirror the widening behaviour of the existing Neon code. Thiswould look something like:
... ptrue p0.b, vl16 // .b instead of .h, vl16 instead of vl8 .rept 16 ld1b {z0.b}, p0/z, [x0] // z0.b instead of z0.h add x0, x0, x1 ld1b {z2.b}, p0/z, [x2] // z2.b instead of z2.h add x2, x2, x3 uabalb z16.h, z0.b, z2.b uabalt z16.h, z1.b, z3.b // ^^ ^ ^ inputs now .b, instruction widens to .h .endr
Hope that helps!
thanks for your comments. I followed your advices for both the small h values case and the pixel_sad function.
I have the following extra questions:
1) How about when I need to load 12 bytes of data? For example, I have the following NEON code:
function PFX(pixel_sad_12x\h\()_neon) movrel x12, sad12_mask ld1 {v31.16b}, [x12] movi v16.16b, #0 movi v17.16b, #0 mov w9, #\h/8 .loop_12x\h: sub w9, w9, #1 .rept 4 ld1 {v0.16b}, [x0], x1 and v0.16b, v0.16b, v31.16b ld1 {v1.16b}, [x2], x3 and v1.16b, v1.16b, v31.16b ld1 {v2.16b}, [x0], x1 and v2.16b, v2.16b, v31.16b ld1 {v3.16b}, [x2], x3 and v3.16b, v3.16b, v31.16b uabal v16.8h, v0.8b, v1.8b uabal2 v17.8h, v0.16b, v1.16b uabal v16.8h, v2.8b, v3.8b uabal2 v17.8h, v2.16b, v3.16b .endr cbnz w9, .loop_12x\h add v16.8h, v16.8h, v17.8h uaddlv s0, v16.8h fmov w0, s0 ret endfunc const sad12_mask, align=8 .byte 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 0, 0, 0, 0 endconst
I have developed the following SVE2 code:
function PFX(pixel_sad_12x\h\()_sve2) mov z16.d, #0 ptrue p0.h, vl8 ptrue p1.h, vl4 mov x11, #8 .rept \h ld1b {z0.h}, p0/z, [x0] ld1b {z1.h}, p1/z, [x0, x11] add x0, x0, x1 ld1b {z8.h}, p0/z, [x2] ld1b {z9.h}, p1/z, [x2, x11] add x2, x2, x3 uaba z16.h, z0.h, z8.h uaba z16.h, z1.h, z9.h .endr uaddv d5, p0, z16.h fmov w0, s5 ret endfunc
Here again I am using ld1 instructions for also widen the loaded values, but since I cannot perfectly define ptrues (12 bytes of data are needed), is there a better way?
2) Should I use two ld1 instructions or one ld2 instruction whenever it is possible? For example. I have the following NEON piece of code:
function PFX(pixel_sse_pp_32x32_neon) mov w12, #8 movi v0.16b, #0 movi v1.16b, #0 .loop_sse_pp_32: sub w12, w12, #1 .rept 4 ld1 {v16.16b,v17.16b}, [x0], x1 ld1 {v18.16b,v19.16b}, [x2], x3 usubl v2.8h, v16.8b, v18.8b usubl2 v3.8h, v16.16b, v18.16b usubl v4.8h, v17.8b, v19.8b usubl2 v5.8h, v17.16b, v19.16b smlal v0.4s, v2.4h, v2.4h smlal2 v1.4s, v2.8h, v2.8h smlal v0.4s, v3.4h, v3.4h smlal2 v1.4s, v3.8h, v3.8h smlal v0.4s, v4.4h, v4.4h smlal2 v1.4s, v4.8h, v4.8h smlal v0.4s, v5.4h, v5.4h smlal2 v1.4s, v5.8h, v5.8h .endr cbnz w12, .loop_sse_pp_32 add v0.4s, v0.4s, v1.4s trn2 v1.2d, v0.2d, v0.2d add v0.2s, v0.2s, v1.2s addp v0.2s, v0.2s, v0.2s fmov w0, s0 ret endfunc
function PFX(pixel_sse_pp_32x32_sve2) rdvl x9, #1 cmp x9, #16 bgt .vl_gt_16_pixel_sse_pp_32x32 ptrue p0.b, vl16 mov z20.d, #0 mov z21.d, #0 mov x11, #16 .rept 32 ld1b {z16.b}, p0/z, [x0] ld1b {z17.b}, p0/z, [x0, x11] add x0, x0, x1 ld1b {z18.b}, p0/z, [x2] ld1b {z19.b}, p0/z, [x2, x11] add x2, x2, x3 usublb z1.h, z16.b, z18.b usublt z2.h, z16.b, z18.b usublb z3.h, z17.b, z19.b usublt z4.h, z17.b, z19.b smlalb z20.s, z1.h, z1.h smlalt z21.s, z1.h, z1.h smlalb z20.s, z2.h, z2.h smlalt z21.s, z2.h, z2.h smlalb z20.s, z3.h, z3.h smlalt z21.s, z3.h, z3.h smlalb z20.s, z4.h, z4.h smlalt z21.s, z4.h, z4.h .endr uaddv d3, p0, z20.s fmov w0, s3 uaddv d4, p0, z21.s fmov w1, s4 add w0, w0, w1 ret .vl_gt_16_pixel_sse_pp_32x32: ptrue p0.b, vl32 mov z20.d, #0 mov z21.d, #0 .rept 32 ld1b {z16.b}, p0/z, [x0] add x0, x0, x1 ld1b {z18.b}, p0/z, [x2] add x2, x2, x3 usublb z1.h, z16.b, z18.b usublt z2.h, z16.b, z18.b smlalb z20.s, z1.h, z1.h smlalt z21.s, z1.h, z1.h smlalb z20.s, z2.h, z2.h smlalt z21.s, z2.h, z2.h .endr uaddv d3, p0, z20.s fmov w0, s3 uaddv d4, p0, z21.s fmov w1, s4 add w0, w0, w1 ret endfunc
Here for example, I think I can use one ld2 instead of two ld1 instructions. Should I go for it? Also, by the way, can you identify something in the code that can be done more efficiently?
3) I have developed the following piece of code using SVE2:
function PFX(addAvg_32x\h\()_sve2) mov z30.b, #0x40 mov w12, #\h rdvl x9, #1 cmp x9, #16 bgt .vl_gt_16_addAvg_32x\h ptrue p0.b, vl16 ptrue p1.h, vl8 .loop_eq_16_sve2_addavg_32x\h\(): sub w12, w12, #1 mov x14, #16 ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x0, x14] add x14, x14, #16 ld1b {z2.b}, p0/z, [x0, x14] add x14, x14, #16 ld1b {z3.b}, p0/z, [x0, x14] mov x14, #16 ld1b {z4.b}, p0/z, [x1] ld1b {z5.b}, p0/z, [x1, x14] add x14, x14, #16 ld1b {z6.b}, p0/z, [x1, x14] add x14, x14, #16 ld1b {z7.b}, p0/z, [x1, x14] add x0, x0, x3, lsl #1 add x1, x1, x4, lsl #1 add z0.h, p1/m, z0.h, z4.h add z1.h, p1/m, z1.h, z5.h add z2.h, p1/m, z2.h, z6.h add z3.h, p1/m, z3.h, z7.h saddlb z4.s, z0.h, z30.h saddlt z5.s, z0.h, z30.h saddlb z6.s, z1.h, z30.h saddlt z7.s, z1.h, z30.h saddlb z8.s, z2.h, z30.h saddlt z9.s, z2.h, z30.h saddlb z10.s, z3.h, z30.h saddlt z11.s, z3.h, z30.h shrnb z0.h, z4.s, #7 shrnt z0.h, z5.s, #7 shrnb z1.h, z6.s, #7 shrnt z1.h, z7.s, #7 shrnb z2.h, z8.s, #7 shrnt z2.h, z9.s, #7 shrnb z3.h, z10.s, #7 shrnt z3.h, z11.s, #7 sqxtunb z0.b, z0.h sqxtunb z1.b, z1.h sqxtunb z2.b, z2.h sqxtunb z3.b, z3.h mov x14, #8 st1b {z0.h}, p1, [x2] st1b {z1.h}, p1, [x2, x14] add x14, x14, #8 st1b {z2.h}, p1, [x2, x14] add x14, x14, #8 st1b {z3.h}, p1, [x2, x14] add x2, x2, x5 cbnz w12, .loop_eq_16_sve2_addavg_32x\h ret .vl_gt_16_addAvg_32x\h\(): cmp x9, #48 bgt .vl_gt_48_addAvg_32x\h ptrue p0.b, vl32 ptrue p1.h, vl16 mov x11, #32 mov x10, #16 .loop_gt_eq_32_sve2_addavg_32x\h\(): sub w12, w12, #1 ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x0, x11] ld1b {z2.b}, p0/z, [x1] ld1b {z3.b}, p0/z, [x1, x11] add x0, x0, x3, lsl #1 add x1, x1, x4, lsl #1 add z0.h, p1/m, z0.h, z2.h add z1.h, p1/m, z1.h, z3.h saddlb z4.s, z0.h, z30.h saddlt z5.s, z0.h, z30.h saddlb z6.s, z1.h, z30.h saddlt z7.s, z1.h, z30.h shrnb z0.h, z4.s, #7 shrnt z0.h, z5.s, #7 shrnb z1.h, z6.s, #7 shrnt z1.h, z7.s, #7 sqxtunb z0.b, z0.h sqxtunb z1.b, z1.h st1b {z0.h}, p1, [x2] st1b {z1.h}, p1, [x2, x10] add x2, x2, x5 cbnz w12, .loop_gt_eq_32_sve2_addavg_32x\h ret .vl_gt_48_addAvg_32x\h\(): ptrue p0.b, vl64 ptrue p1.h, vl32 .loop_eq_64_sve2_addavg_32x\h\(): sub w12, w12, #1 ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x1] add x0, x0, x3, lsl #1 add x1, x1, x4, lsl #1 add z0.h, p1/m, z0.h, z1.h saddlb z1.s, z0.h, z30.h saddlt z2.s, z0.h, z30.h shrnb z0.h, z1.s, #7 shrnt z0.h, z2.s, #7 sqxtunb z0.b, z0.h st1b {z0.h}, p1, [x2] add x2, x2, x5 cbnz w12, .loop_eq_64_sve2_addavg_32x\h ret endfunc
As you can see, I am creating the pointers by adding with #16 in each iteration. Should I move their computation outside the loop and have something like this instead?
mov x14, #16 mov x15, #32 mov x16, #48
and then use x14-x16 as pointers inside the loop without any extra computation? I think the second solution is the best but again it deteriorates the performance. That's why I am asking for your confirmation. Again, if you identify anything else in the function that can be done better, please inform me.
4) I have the following piece of code in NEON:
function PFX(pixel_add_ps_16x\h\()_neon) lsl x5, x5, #1 mov w12, #\h / 8 .loop_add_ps_16x\h\(): sub w12, w12, #1 .rept 4 ld1 {v0.16b}, [x2], x4 ld1 {v1.16b}, [x2], x4 ld1 {v16.8h-v17.8h}, [x3], x5 ld1 {v18.8h-v19.8h}, [x3], x5 uxtl v4.8h, v0.8b uxtl2 v5.8h, v0.16b uxtl v6.8h, v1.8b uxtl2 v7.8h, v1.16b add v24.8h, v4.8h, v16.8h add v25.8h, v5.8h, v17.8h add v26.8h, v6.8h, v18.8h add v27.8h, v7.8h, v19.8h sqxtun v4.8b, v24.8h sqxtun2 v4.16b, v25.8h sqxtun v5.8b, v26.8h sqxtun2 v5.16b, v27.8h st1 {v4.16b}, [x0], x1 st1 {v5.16b}, [x0], x1 .endr cbnz w12, .loop_add_ps_16x\h ret endfunc .endm
I have developed the following SVE2 version for it:
function PFX(pixel_add_ps_16x\h\()_neon) ptrue p0.b, vl16 mov x10, #8 .rept \h ld1b {z0.b}, p0/z, [x2] add x2, x2, x4 ld1h {z2.h}, p0/z, [x3] ld1h {z3.h}, p0/z, [x3, x10, lsl #1] add x3, x3, x5, lsl #1 uunpklo z6.h, z0.b uunpkhi z7.h, z0.b add z24.h, z6.h, z2.h add z25.h, z7.h, z3.h sqxtunb z6.b, z24.h sqxtunb z7.b, z25.h st1b {z6.h}, p0, [x0] st1b {z7.h}, p0, [x0, x10] add x0, x0, x1 .endr ret endfunc .endm
Although it works, I am using two st1 instructions. No matter how much I tried, I couldn't find a better solution. Do you have a more efficient solution in mind?
To get a predicate for the first 12 elements you should be able to use aWHILELT instruction, something like:
mov w0, #12 whilelt p0.b, wzr, w0
Then I think you can use the widening absolute difference instruction asin my earlier reply, ending up with something like:
mov w0, #12 whilelt p0.b, wzr, w0 .rept \h ld1b {z0.h}, p0/z, [x0] add x0, x0, x1 ld1b {z8.h}, p0/z, [x2] add x2, x2, x3 uabalb z16.h, z0.b, z8.b uabalt z16.h, z1.b, z9.b .endr
It could also be worth using separate accumulators for the UABALB and UABALTinstructions and summing them together at the end, but whether the overhead ofthe addition at the end is worth it probably depends on the value of \h sincethe accumulator latency of these instructions is only a single cycle on bothNeoverse N2 and Neoverse V2.
The details of things like ld2 instructions can be found on the softwareoptimization guides linked previously. It is a long way to scroll back throughthis thread now, so here they are again:
Neoverse V2: developer.arm.com/.../latestNeoveres N2: developer.arm.com/.../latest
In particular here you can see the SVE LD2B instruction is both higher latencyas well as worse than half the throughput (1 vs 3) compared to LD1B, so in thiscase we can say that a pair of LD1B should be preferred instead.
The rest of the code looks reasonable to me!
3)
Moving the constants outside the loop makes sense to me. If you are alreadyguarding that particular code path such that it is specific to a 128-bit vectorlength you could also just use the vl-scaled immediate loads, e.g.
mov x14, #16 ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x0, x14] add x14, x14, #16 ld1b {z2.b}, p0/z, [x0, x14] add x14, x14, #16 ld1b {z3.b}, p0/z, [x0, x14]
can become
ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x0, #1, mul vl] ld1b {z2.b}, p0/z, [x0, #2, mul vl] ld1b {z3.b}, p0/z, [x0, #3, mul vl]
Also, I'm not 100% sure but I notice that z30 is initialised to #0x40 and addedbefore the narrowing shifts. I am wondering if you can instead use the existingrounding-and-narrowing shift instructions to avoid needing this addition atall. For the code like:
saddlb z4.s, z0.h, z30.h saddlt z5.s, z0.h, z30.h shrnb z0.h, z4.s, #7 shrnt z0.h, z5.s, #7 sqxtunb z0.b, z0.h
Perhaps this can instead simply be:
sqrshrnb z0.b, z0.h, #7
For more information, see the instruction documentation here:developer.arm.com/.../SQRSHRNB--Signed-saturating-rounding-shift-right-narrow-by-immediate--bottom--
Let me know if this doesn't work and I can investigate in more detail, but itfeels possible.
4)
As you correctly point out I think there is no nice way of using a single storehere, since SQXTUNT writes to the top half of each pair of elements (as opposedto SQXTUN2 in Neon which writes to the top half of the overall vector).
I think you can use a pair of zero-extending loads (ld1b {z0.h}) rather thanUUNPKLO/HI to save one instruction here, but that is the only obviousimprovement here that I can see.
1) Thanks! To be honest, I completely forgot about the tricks that you can do with whilelt instruction.
2) OK. So, I will continue using 2 ld1 instructions instead of 1 ld2. However, I have some questions. Based on the Neoverse N2 optimization guide you provided, it is stated that for load instructions, the latency represents the maximum latency to load all the vector registers written by the instruction. The latencies for l1b and l2b are 6 and 8 respectively. So, two ld1b instructions need 6+6=12, while one l2b needs 8. Doesn't this imply that the execution time of 2 ld1b is greater than 1 l2b? To be honest, I didn't understand what execution throughput is.
3) Thanks for your comments. Unfortunately, the version of the code that uses 'sqrshrb' instruction does not pass the tests. Don't we also need the addition with '0x40' that the code without sqrshrb does?
4) I have figured out how we can use only one store (but ld2 should be used for loading):
function PFX(pixel_add_ps_16x\h\()_neon) ptrue p0.b, vl16 ptrue p2.b, vl8 .rept \h ld2b {z0.b, z1.b}, p2/z, [x2] add x2, x2, x4 ld2h {z2.h, z3.h}, p0/z, [x3] add x3, x3, x5, lsl #1 uunpklo z5.h, z0.b uunpklo z6.h, z1.b add z24.h, z5.h, z2.h add z25.h, z6.h, z3.h sqxtunb z5.b, z24.h sqxtunt z5.b, z25.h st1b {z5.b}, p0, [x0] add x0, x0, x1 .endr ret endfunc
Do you think that this version of the code is faster than the one which uses 2 stores? Also, if your answer is no, how about using the NEON code when the vector size is equal to 128 bits and the SVE2 code when the vector sizes in greater then 128 bits? I can easily do this as I am checking and branching based on that size (if of course the NEON code is faster than the SVE2 with the two stores). What do you think?
Thanks for everything!
The execution throughput here is referring to the number of instances of thatinstruction that can be started on the same cycle. For example if LD2B has athroughput of 1 meaning that every cycle one of these instructions can begin(but each one still takes 8 cycles until the result is available for use by anyinstructions that depend on the result).
This means that for a series of independent LD1B instructions with a latency of6 cycles and a throughput of 3 per cycle:
ld1b {z0.b}, ... // starts on cycle 0, completed on cycle 6 ld1b {z1.b}, ... // starts on cycle 0, completed on cycle 6 ld1b {z2.b}, ... // starts on cycle 0, completed on cycle 6 ld1b {z3.b}, ... // starts on cycle 1, completed on cycle 7 ld1b {z4.b}, ... // starts on cycle 1, completed on cycle 7 ld1b {z5.b}, ... // starts on cycle 1, completed on cycle 7
And for the same amount of data loaded with LD2B instructions with a latency of8 cycles and a throughput of 3 per cycle:
ld2b {z0.b, z1.b}, ... // starts on cycle 0, completed on cycle 8 ld2b {z2.b, z3.b}, ... // starts on cycle 1, completed on cycle 9 ld2b {z4.b, z5.b}, ... // starts on cycle 2, completed on cycle 10
You can see that even though we required fewer LD2B instructions, thecombination of the worse throughput and higher latency compared to LD1B meansthat it is preferrable to use LD1B in this instance. More generally, whichsequence is preferrable will depend on your micro-architecture of interest, andthe overall picture may change if there are other instructions being executedat the same time which might compete for the same execution resources on theCPU, but this is fine for an initial estimate at least.
The 0x40 here should in theory be handled by the "rounding" part of sqrshrb. Ithink the difference is that the original code initialised every .b element to0x40 rather than every .h element as I imagined. I'm not really sure why thecode initialises .b elements given that the rest of the arithmetic is done with.h. Assuming that the code you showed is definitely what you want, I think youcan correct for this by just adding 0x80 at the end:
sqrshrnb z0.b, z0.h, #7 add z0.b, z0.b, #0x80
I had a go at writing a very quick test case to check this, hopefully thismostly makes sense:
.global foo foo: mov z30.b, #0x40 saddlb z4.s, z0.h, z30.h saddlt z5.s, z0.h, z30.h shrnb z0.h, z4.s, #7 shrnt z0.h, z5.s, #7 sqxtunb z0.b, z0.h ret .global bar bar: sqrshrnb z0.b, z0.h, #7 add z0.b, z0.b, #0x80 ret
#include <arm_neon.h> #include <stdio.h> int8x8_t foo(int16x4_t x); int8x8_t bar(int16x4_t x); int main() { for (int i=-32768; i<32768; ++i) { int y = foo(vdup_n_s16(i))[0]; int z = bar(vdup_n_s16(i))[0]; if (y != z) { printf("%d -> %d or %d\n", i, y, z); } } }
Of course there is nothing stopping you from using Neon for the 128-bit cases,or even using Neon for just the sqxtun(2) instructions and nothing else sinceat vl=128 the registers are identical and overlap. Having said that, I don'tthink that having two stores here is much of a problem. Did you try a versionusing the zero-extending loads instead, since I think that would still beat theNeon version by not needing the uxtl(2)/uunpk(lo/hi) instructions?
2) Now I understand. Thanks. Just one question. Does the execution throughput refer to instances of the same instruction? I mean which one of the following is preferable:
ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x1] add x0, x0, x5 add x1, x1, x6
or
ld1b {z0.b}, p0/z, [x0] add x0, x0, x5 ld1b {z1.b}, p0/z, [x1] add x1, x1, x6
If the execution throughput refers to instances of the same instruction, I guess the firth option is the best. Or?
3) Your code works perfectly. Thanks!
4) I switched back to using the zero-extending loads instead. Regarding the performance, I think it is better, as you said. Thanks!
I might come back to you if I need anything else. Thanks for everything!