Hello,I have the following piece of code:
So, after bx*16 bytes, we need to jump to another location in memory and read/store bx*16 bytes again, and so on.One possible ASM code for NEON to support the aforementioned function is the following (assuming that bx=by=4):
Hi,
First of all just to clarify: post-indexed addressing usually refers toinstructions where the address is automatically updated at the end of theinstruction based on a specified offset. Here is an example of a post-indexedinstruction:
ldr x0, [sp], #8
In the above load instruction you can see we are loading 8-bytes from thestack, taking the address from the stack pointer sp, and afterwardsincrementing sp by 8. An equivalent way of writing this would be:
ldr x0, [sp] add sp, sp, #8
The SVE instruction set does not contain any post-indexed addressing load orstore instructions, however this does not tend to be a problem in practicesince there is generally only a very marginal performance difference when usingpost-indexed addressing on a load or store compared to updating the address ina separate instruction as in the above example.
As you allude to in your question, the optimal sequence of code here willdepend greatly on the value of bx, by, stridea, and strideb. I will assume thatthe size of a pixel is 8-bits, so you are effectively wanting a 16-bit to 8-bittruncation?
For bx=by=8 you will be loading (8 * 16) 128 bits of data per iteration of theinnermost loop and storing (8 * 8) 64 bits of data. SVE only very recentlygained support for gather instructions with 128-bit elements, as part of theSVE2.1 extension ( developer.arm.com/.../LD1Q--Gather-load-quadwords- ).Since there is not an SVE instruction to cover loading 128-bits of data perelement your choices here are probably either:
a) Create a vector of offsets with every second element adjacent in memory tothe previous one. There are a few different ways of writing this, but theobvious way of doing it would be something like:
lsl x3, x3, #4 // make the stride bytes rather than elements, adjust as needed. index z1.d, #0, x3 // create an index index z2.d, #8, x3 // create a second index offset by 8 from the first zip1 z1.d, z1.d, z2.d // interleave them together
b) Just continue using the Neon code! It is unlikely you will see muchperformance improvement when you are filling the entire Neon vector perfectlyas you are currently doing, and the benefit from SVE here is likely marginalunless you are targeting a machine with a large vector length.
With regard to your concern about the code being vector-length agnostic: gatherand scatter instructions are traditionally slower than the contiguous-accesscounterparts and should be avoided if for instance bx is greater than or equalto the size of a single vector, however if your concern is merely whether thecode is vector-length agnostic or not then I think you are fine. (For theperformance of gather/scatter and other instructions you can check this in forexample the Neoverse V1 software optimization guide, available here:developer.arm.com/.../ ) SVE codemakes use of predication (e.g. the p1 register in your code) which allows youto conditionally enable part of all of the vector depending on the size of thedata being operated on. You are initialising the predicate with "WHILELT p1.d,x9, x8" where x9 is initialised to 0 and x8 is 8, so on the first iteration ofthe loop this is the equivalent of setting the first 8 bits of the predicatefor 64-bit elements to true. Unless you are working on a machine with a vector length greater than (64 * 8 =) 512 bits, then I think you should be making full use of the vector length!
Hope that helps, let me know if you have any further specific issues and I cantry and help!
Thanks,George
Hi George,
I would like to thank you very much for your answer. It helped me a lot. As I am new to this stuff, I might post new questions in the future :).
I didn't know that there is small difference in performance when comparing post-indexed addressing on a load or store to updating the address ina separate instruction. Thanks for providing this info.
You are correct regarding your assumption. The size of a pixel is 8-bits, so I need a 16-bit to 8-bit truncation.
It's a pity that gather instructions with 128-bit elements are supported only in SVE2.1 extension. This means that I cannot use them :(, as the platforms that I am using support up to SVE2.
I think that I will go for the first proposal that you mentioned, that is the creation of the offset with interleaved elements. My goal is to improve the time execution of the code or at the worst case be equal to the NEON one. So, if the execution platform supports 128 bit Z registers, the execution time will be approximately the same as with the NEON one. On the contrary, in cases where the size of Z registers are greater than 128 bits, the code will be faster (as you have already mentioned). So, I opt your first proposal, unless you think I will reach a dead end.
However, taking into account your comment that the gather and scatter instructions are traditionally slower than the contiguous-accesscounterparts and should be avoided if for instance bx is greater than or equal to the size of a single vector (I guess here you are considering the SVE2.1 extension with quad word load and store instructions, right?), I do not think that I should eventually use your first proposal, as again I will use gather and scatter instructions even with your interleaved index approach, right? Or I guess this will be compensated in cases where the Z register length is greater than 128 bits?
Thanks for letting me know that my code takes full advantage of vector length. But, as I mentioned, my requirement is to improve the time execution (and not necessarily use the whole vector size). So, I would appreciate your final advice, should I go for your first proposal (interleaved index)?
BR,
Akis
Hi Akis,
If you are able to use SVE2 then, rather than the Neoverse V1 softwareoptimization guide I posted previously, you may find either the Neoverse N2 orNeoverse V2 guides more appropriate:
* Neoverse N2: developer.arm.com/.../latest* Neoverse V2: developer.arm.com/.../latest
Both of these particular micro-architectures have an SVE vector length of128-bits, so it is probably worth discussing that specifically for a moment. Ina situation where the Neon and SVE vector lengths are identical there isunlikely to be a massive improvement from code where there is a 1-1 mappingbetween Neon and SVE. For reference you can observe from the softwareoptimization guides that most SVE instructions with Neon equivalents haveidentical latencies and throughputs.
That does not mean that there is no benefit from SVE in such a situation,however we need to be able to make use of instructions that are not present inNeon in order to achieve a measurable speedup. One particular class ofinstructions that you may find useful are the zero/sign-extending load andtruncating store instructions available in SVE.
For example: developer.arm.com/.../ST1B--scalar-plus-scalar---Contiguous-store-bytes-from-vector--scalar-index--
Going back bx=by=8 in your original example code, instead of trying to make thecode efficient on long vector lengths, we instead can optimize for shortervector lengths by making use of contiguous loads. This comes at the cost ofpotentially not filling the whole vector on machines with longer vectorlengths. For example:
ptrue p0.h, vl8 ld1h {z0.h}, p0, [x0] st1b {z0.h}, p0, [x1] // ^ note h-suffix (16 bit) on the load vs b-suffix (8 bit) on the store
In the code above we give up on trying to make full use of the vector andinstead constrain the predicate to the low 8 .h (16-bit) elements (8 * 16 = 128bits). This is guaranteed to be fine in this case since the SVE vector lengthmust be a power of two and at least 128 bits. We then copy 128 bits from theaddress at x0 into z0, but for the store we make use of the truncating storeinstructions: specifically the b suffix on the st1b indicates that we are onlystoring the low byte per element despite each element being 16 bits (from the hsuffix on z0.h). You will note that this successfully avoids the need for aseparate xtn instruction as we needed in the Neon equivalent, while alsoavoiding the more costly gather and scatter instructions we had previouslydiscussed.
Whether the above solution is any faster than the Neon equivalent youoriginally posted, or faster than the alternative sequence using gather/scatterinstructions that we previously discussed, will depend on the nature of thesurrounding code and the values of stridea/strideb, as well as the details ofthe micro-architecture you are optimizing for. If possible I would recommendingbenchmarking all three across the micro-architecture(s) you are interested into get an idea of the tradeoffs involved rather than me trying to make anuninformed decision on your behalf, but the SVE version I've posted aboveshould be a safe starting point!
Thinking more generally than your specific code snippet, given that the code wehave discussed is only doing a relatively simple copy operation it would alsobe worth considering whether the copy can be avoided completely or merged intothe surrounding code such that the truncation occurs there instead. This wouldallow us to eliminate the overhead of an additional load and store as we havecurrently and potentially allow for further optimizations.
Hope that helps. I'm happy to try and expand on anything if it is unclear or ifyou have more specific questions about particular areas we've mentioned so far!
I would like to thank you again for your new answer.
I will definitely take a look at the Neoverse N2 and V2 documents that you provided.
The last piece of code that you provided(the one that gets rids of the xtn instruction and reads 16 bit while stores 8 bits)seems very interesting. You actually get rid of one instruction (as you mentioned) and this can make the execution faster.
For sure I will benchmark all the three solutions mentioned in this post. I also agree with you that I can optimize the surrounding code so that I can avoid the load/store instructions at this point. However, the current requirements I have (and also based on the time plan) does not allow for this. Maybe I can propose this enhancement as a later step. But you are definitely right.
Thanks a lot! I may come back to you :).
one more question.
What does the "ptrue p0.h, vl8" do? I couldn't find what the "vl8" means in the ARM tutorials. Does it mean that 8 lanes in p0 become true associated with half words?
So, the final code using contiguous loads and stores is the following, right?
function PFX(blockcopy_sp_8x8) lsl x3, x3, #1 ptrue p0.h, vl8 .rept 8 ld1h {z0.h}, p0, [x2] add x2, x2, x3 st1b {z0.h}, p0, [x0] add x0, x0, x1 .endr ret endfunc
Thank you in advance,
The vl8 here is the pattern operand to the ptrue instruction, you can find itdocumented here:
developer.arm.com/.../PTRUES--Initialise-predicate-from-named-constraint-and-set-the-condition-flags-
The effect of this operand is to limit the number of elements that are set totrue to some upper bound, for instance here we are saying that we want onlyexactly eight predicate lanes to be set to true, in order to ensure we are onlyusing the low 128-bits of our vector (since 16-bit elements, 16 * 8 = 128). Itis worth keeping in mind that setting the pattern to something that exceeds thevector length will set the whole predicate to false, not all true as you mightexpect, which means that something like:
ptrue p0.h, vl16
The above would set p0.h to all-true on a 256-bit vector length machine butall-false on a 128-bit vector length machine! That does not matter in our casesince all SVE machines must have a vector length of at least 128-bits.
The code you posted looks reasonable to me, although it is worth noting thatthe lsl can be included into the first add instruction at no cost. Also theassembler syntax requires a "/z" suffix on the predicated load to indicate thatlanes with a false predicate are set to zero. So something like:
ptrue p0.h, vl8 .rept 8 ld1h {z0.h}, p0/z, [x2] // p0/z rather than p0 add x2, x2, x3, lsl #1 // lsl #1 here rather than done separately st1b {z0.h}, p0, [x0] add x0, x0, x1 .endr ret
(Also worth pointing out the obvious that the last rept iteration the result ofthe pair of add instructions is unused, but it probably doesn't matter much).
Thanks George! I really appreciate your help.
It seems a little bit more complicated with the number of iterations and read/store data are increased. For example, consider the 16x16 case. A possible NEON implementation can be the following:
function PFX(blockcopy_sp_16x16_neon) lsl x3, x3, #1 movrel x11, xtn_xtn2_table ld1 {v31.16b}, [x11] .rept 8 ld1 {v0.8h-v1.8h}, [x2], x3 ld1 {v2.8h-v3.8h}, [x2], x3 tbl v0.16b, {v0.16b,v1.16b}, v31.16b tbl v1.16b, {v2.16b,v3.16b}, v31.16b st1 {v0.16b}, [x0], x1 st1 {v1.16b}, [x0], x1 .endr ret endfunc const xtn_xtn2_table, align=4 .byte 0, 2, 4, 6, 8, 10, 12, 14 .byte 16, 18, 20, 22, 24, 26, 28, 30 endconst
How about the following equivalent code using SVE2?
function PFX(blockcopy_sp_16x16) mov x8, 16 mov w12, 16 .L_init_ss_16x16: mov x9, #0 whilelt p1.h, x9, x8 b.none .L_next_stridea_blockcopy_ss_16x16 .L_blockcopy_ss_16x16: ld1h {z0.h}, p0/z, [x2, x9, lsl #1] st1b {z0.h}, p0, [x0, x9] inch x9 whilelt p1.h, x9, x8 b.first .L_blockcopy_ss_16x16 .L_next_stridea_blockcopy_ss_16x16: add x2, x2, x3, lsl #1 add x0, x0, x1 sub w12, w12, #1 cbnz w12, .L_init_ss_16x16 ret endfunc
Am I missing something?
Sorry for sending you new questions. As I have mentioned, I am new to this stuff and I am trying to understand.
The code you have posted looks reasonable to me as generic SVE code. It isworth keeping in mind that for such a simple inner loop and low trip count itis still probably worth specializing for specific vector lengths, inparticular:
So with that in mind it is probably worth doing something like comparing theresult of a rdvl instruction at the start of the function and branching to oneof the above two possible implementations based on that. Obviously such anapproach would not scale for increasingly large sizes of bx/by, but that issomething you can probably determine where such a trade-off no longer makessense for your target application and micro-architecture.
If you wanted to keep the vector length agnostic version in your reply then youcould still make some small improvements:
Finally as an aside, it is also worth noting that the use of tbl in the Neonversion of the code you posted is sub-optimal: you can use the uzp1 instructionto avoid needing to load the tbl indices at the start of the function.
Hope that helps!
I would like to thank you again for your answers.
Regarding the solution for avoiding whilelt and inch instructions, you mean something like this, right?
function PFX(blockcopy_sp_16x16) rdvl x9, #1 cmp x9, #16 bgt .vl_ge_16_blockcopy_ss_16_16 ptrue p0.h, vl8 mov x10, #8 .rept 16 ld1h {z0.h}, p0/z, [x2] ld1h {z1.h}, p0/z, [x2, x10, lsl #1] add x2, x2, x3, lsl #1 st1b {z0.h}, p0, [x0] st1b {z1.h}, p0, [x0, x10] add x0, x0, x1 .endr ret .vl_ge_16_blockcopy_ss_16_16: ptrue p0.h, vl16 .rept 16 ld1h {z0.h}, p0/z, [x2] add x2, x2, x3, lsl #1 st1b {z0.h}, p0, [x0] add x0, x0, x1 .endr ret endfunc
It seems more efficient, but as you said, as bx and by increase, maybe it does not scale well.
I think I will opt this, because I am little bit concerned with your comment regarding suboptimal implementations of whilelt and inch in some platforms. Please recall that I care only for optimizing the performance (and not necessarily for using the vector length agnostic capability of SVE2).
Can you please indicate whether you agree with the aforementioned piece of code?
Thank you once again.
Thanks for providing the updated implementation. That looks reasonable to me, Idon't really have anything significant to suggest.
You could unroll and interleave the loop as in your original 16x16 Neon version(to end up with a .rept 8 instead of .rept 16), but that is probably a marginalbenefit, if at all. You could use uzp1 as I had previously suggested for theNeon version and then have a single full store (using .b elements) rather thana pair of unpacked stores (.h), but this isn't actually a reduction ininstruction count since you are just trading a store instruction for a vectorop, so is again probably not significant.
I will be on holiday from tomorrow so apologies in advance for any delay tofuture replies. I wish you a good break if you are also taking any holiday overthe new year period!
thanks for everything (and also for your suggestions regarding NEON code). You have really helped me. I will try to not disturb you during your vacations :). I am also taking some days off.
I wish you merry Christmas and a happy new year!
When you will be back from your holidays, I will appreciate if you could answer to the following questions:
1) I have the following function:
template<int bx, int by> void blockcopy_ps_c(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb) { for (int y = 0; y < by; y++) { for (int x = 0; x < bx; x++) a[x] = (int16_t)b[x]; a += stridea; b += strideb; } }
I have created the following piece of code for supporting its functionality using SVE2 (assume bx=by=8):
function PFX(blockcopy_ps_8x8) ptrue p0.b, vl8 .rept 8 ld1b {z0.b}, p0/z, [x2] uxtb z0.h, p0/M, z0.h st1h {z0.h}, p0, [x0] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc
Do you agree? Am I missing something?
2) I have the following function in NEON:
function PFX(copy_cnt_4_neon) lsl x2, x2, #1 movi v4.8b, #0 .rept 2 ld1 {v0.8b}, [x1], x2 ld1 {v1.8b}, [x1], x2 stp d0, d1, [x0], #16 cmeq v0.4h, v0.4h, #0 cmeq v1.4h, v1.4h, #0 add v4.4h, v4.4h, v0.4h add v4.4h, v4.4h, v1.4h .endr saddlv s4, v4.4h fmov w12, s4 add w0, w12, #16 ret endfunc
I have created the following piece of code for supporting its functionality using SVE2:
function PFX(copy_cnt_4) ptrue p0.h, vl4 mov x10, #8 ptrue p1.h mov z4.h, p1/z, #1 mov w0, #0 .rept 4 ld1h {z0.h}, p0/z, [x1] st1h {z0.h}, p0, [x0] add x1, x1, x2, lsl #1 add x0, x0, x10 cmpeq p2.h, p0/z, z0.h, #0 uaddv d6, p2, z4.h fmov w12, s6 add w0, w0, w12 .endr add w0, w0, #16 endfunc
Happy new year!
1)
I think there is a bug with your suggested implementation here around how thepredicate is initialised and how the data is loaded. The core of the functionis:
ptrue p0.b, vl8 ld1b {z0.b}, p0/z, [...] uxtb z0.h, p0/m, z0.h st1h {z0.h}, p0, [...]
Note that when mixing precisions (.b and .h above) it is important tounderstand how the predicate layout affects the operation being performed.
ptrue p0.b, vl8 [1111111100000000] ^ | ^ most significant bit | ^ least significant bit v ptrue p0.h, vl8 [1010101010101010] | ^ most significant bit ^ least significant bit
This makes a difference because the predicate is used based on the element size(e.g. z0.h or z0.b above), so in the context of your code snippet:
ptrue p0.b, vl8 ld1b {z0.b}, p0/z, [...] # loads 8 bytes uxtb z0.h, p0/m, z0.h # bug: only zero-extends the low 4 half-words st1h {z0.h}, p0, [...] # bug: only stores the low 4 half-words
As with the unpacked stores there are also unpacked loads, which you can finddocumented here (see e.g. "16-bit element"):
https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/LD1B--scalar-plus-scalar---Contiguous-load-unsigned-bytes-to-vector--scalar-index--
In particular this means that in a similar way to the store-truncation case wecan also use the load to perform the sign-extension for free, something like:
function PFX(blockcopy_ps_8x8) ptrue p0.h, vl8 # note .h rather than .b .rept 8 ld1b {z0.h}, p0/z, [x2] # note .h rather than .b st1h {z0.h}, p0, [x0] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc
2)
The general gist of the code looks believable to me. A few comments:
I wish you health and a happy new year.
1) Hmm, I wasn't aware of the unpack versions of load instructions. Thanks once again for showing me the way. However, I think there is a problem as the number of bytes to be read from memory increases. For example, consider the 16x16 case. I think it would be best to read as much data as we can from the memory with one instruction, and then manipulate the data. So, I propose the following code:
function PFX(blockcopy_ps_16x16_sve2) ptrue p0.b, vl16 ptrue p1.h, vl8 mov x11, #8 .rept 16 ld1b {z0.b}, p0/z, [x2] uunpklo z1.h, z0.b uunpkhi z2.h, z0.b st1h {z1.h}, p1, [x0] st1h {z2.h}, p1, [x0, x11, lsl #1] add x0, x0, x1, lsl #1 add x2, x2, x3 .endr ret endfunc
In this case, we do not use the unpack loads in order to load as much data as we can with one command. Then, we use the "uunpklo" and "uunpkhi" instructions for unpacking. Do you agree? Or is there a better way?
2) After taking your comments into account, I created the following piece of code:
function PFX(copy_cnt_4_sve2) ptrue p0.h, vl4 ptrue p1.h dup z4.h, #1 dup z6.h, #0 .rept 4 ld1h {z0.h}, p0/z, [x1] st1h {z0.h}, p0, [x0] add x1, x1, x2, lsl #1 add x0, x0, #8 cmpeq p2.h, p0/z, z0.h, #0 add z6.h, p2/m, z6.h, z4.h .endr uaddv d6, p1, z6.h fmov w12, s6 add w0, w0, #16 ret endfunc
Is this what you meant with your comments?
Using the unpack instructions for larger load amounts makes sense to me.Exactly which one is faster in this particular case will of course depend onthe particular micro-architecture of interest, and e.g. the number of vectorpipelines versus the number of load/store pipelines.
As a very minor optimisation, and mostly just for illustrative purposes, Iwould point out that the use of p1 here is redundant:
ptrue p0.b, vl16 [1111111111111111] ptrue p0.h, vl8 [1010101010101010] st1h {z1.h} ..... ^ ^ ^ ^ ^ ^ ^ ^ // 16-bit instructions consider every-other bit of the predicate
You will notice that for all the bits considered by the 16-bit storeinstrucions the values of the two ptrue instructions are equivalent, and assuch you can simply use p0 everywhere here.
Your revised code looks much more reasonable to me, however I think there are acouple of bugs in this code (I could be wrong, I have not run it):
The original Neon code sums the output of the cmeq instructions. As Neon doesnot have predicate registers the result of comparisons are either all-ones orall-zeros in a register, however when thought about in twos-complement integerrepresentation this is equivalent to -1 or 0. Since the Neon code produces anegative sum, this is likely the reason for the addition of 16 at the end. Inthe case of the SVE code there is no need for the final addition, although thecmpeq should probably become cmpne instead?
Additionally, the fmov on line 15 is writing w12 rather than w0?
Aside from correctness, similarly to the case in (1), the use of p1 isredundant here: all elements beyond the first four can never have a non-zerovalue. So we can avoid constructing p1 and use p0 instead.
Finally, for a vector length of 128-bits it is unlikely to be significantlyfaster, but since we are only loading and storing 64-bits at a time here wecould additionally consider using gather instructions to make use of a fullvector of data for the store/cmp/add?