Hello,I have the following piece of code:
So, after bx*16 bytes, we need to jump to another location in memory and read/store bx*16 bytes again, and so on.One possible ASM code for NEON to support the aforementioned function is the following (assuming that bx=by=4):
Hi Akis,
1)
To get a predicate for the first 12 elements you should be able to use aWHILELT instruction, something like:
mov w0, #12 whilelt p0.b, wzr, w0
Then I think you can use the widening absolute difference instruction asin my earlier reply, ending up with something like:
mov w0, #12 whilelt p0.b, wzr, w0 .rept \h ld1b {z0.h}, p0/z, [x0] add x0, x0, x1 ld1b {z8.h}, p0/z, [x2] add x2, x2, x3 uabalb z16.h, z0.b, z8.b uabalt z16.h, z1.b, z9.b .endr
It could also be worth using separate accumulators for the UABALB and UABALTinstructions and summing them together at the end, but whether the overhead ofthe addition at the end is worth it probably depends on the value of \h sincethe accumulator latency of these instructions is only a single cycle on bothNeoverse N2 and Neoverse V2.
2)
The details of things like ld2 instructions can be found on the softwareoptimization guides linked previously. It is a long way to scroll back throughthis thread now, so here they are again:
Neoverse V2: developer.arm.com/.../latestNeoveres N2: developer.arm.com/.../latest
In particular here you can see the SVE LD2B instruction is both higher latencyas well as worse than half the throughput (1 vs 3) compared to LD1B, so in thiscase we can say that a pair of LD1B should be preferred instead.
The rest of the code looks reasonable to me!
3)
Moving the constants outside the loop makes sense to me. If you are alreadyguarding that particular code path such that it is specific to a 128-bit vectorlength you could also just use the vl-scaled immediate loads, e.g.
mov x14, #16 ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x0, x14] add x14, x14, #16 ld1b {z2.b}, p0/z, [x0, x14] add x14, x14, #16 ld1b {z3.b}, p0/z, [x0, x14]
can become
ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x0, #1, mul vl] ld1b {z2.b}, p0/z, [x0, #2, mul vl] ld1b {z3.b}, p0/z, [x0, #3, mul vl]
Also, I'm not 100% sure but I notice that z30 is initialised to #0x40 and addedbefore the narrowing shifts. I am wondering if you can instead use the existingrounding-and-narrowing shift instructions to avoid needing this addition atall. For the code like:
saddlb z4.s, z0.h, z30.h saddlt z5.s, z0.h, z30.h shrnb z0.h, z4.s, #7 shrnt z0.h, z5.s, #7 sqxtunb z0.b, z0.h
Perhaps this can instead simply be:
sqrshrnb z0.b, z0.h, #7
For more information, see the instruction documentation here:developer.arm.com/.../SQRSHRNB--Signed-saturating-rounding-shift-right-narrow-by-immediate--bottom--
Let me know if this doesn't work and I can investigate in more detail, but itfeels possible.
4)
As you correctly point out I think there is no nice way of using a single storehere, since SQXTUNT writes to the top half of each pair of elements (as opposedto SQXTUN2 in Neon which writes to the top half of the overall vector).
I think you can use a pair of zero-extending loads (ld1b {z0.h}) rather thanUUNPKLO/HI to save one instruction here, but that is the only obviousimprovement here that I can see.
Thanks,George
Hi George,
1) Thanks! To be honest, I completely forgot about the tricks that you can do with whilelt instruction.
2) OK. So, I will continue using 2 ld1 instructions instead of 1 ld2. However, I have some questions. Based on the Neoverse N2 optimization guide you provided, it is stated that for load instructions, the latency represents the maximum latency to load all the vector registers written by the instruction. The latencies for l1b and l2b are 6 and 8 respectively. So, two ld1b instructions need 6+6=12, while one l2b needs 8. Doesn't this imply that the execution time of 2 ld1b is greater than 1 l2b? To be honest, I didn't understand what execution throughput is.
3) Thanks for your comments. Unfortunately, the version of the code that uses 'sqrshrb' instruction does not pass the tests. Don't we also need the addition with '0x40' that the code without sqrshrb does?
4) I have figured out how we can use only one store (but ld2 should be used for loading):
function PFX(pixel_add_ps_16x\h\()_neon) ptrue p0.b, vl16 ptrue p2.b, vl8 .rept \h ld2b {z0.b, z1.b}, p2/z, [x2] add x2, x2, x4 ld2h {z2.h, z3.h}, p0/z, [x3] add x3, x3, x5, lsl #1 uunpklo z5.h, z0.b uunpklo z6.h, z1.b add z24.h, z5.h, z2.h add z25.h, z6.h, z3.h sqxtunb z5.b, z24.h sqxtunt z5.b, z25.h st1b {z5.b}, p0, [x0] add x0, x0, x1 .endr ret endfunc
Do you think that this version of the code is faster than the one which uses 2 stores? Also, if your answer is no, how about using the NEON code when the vector size is equal to 128 bits and the SVE2 code when the vector sizes in greater then 128 bits? I can easily do this as I am checking and branching based on that size (if of course the NEON code is faster than the SVE2 with the two stores). What do you think?
Thanks for everything!
Akis
The execution throughput here is referring to the number of instances of thatinstruction that can be started on the same cycle. For example if LD2B has athroughput of 1 meaning that every cycle one of these instructions can begin(but each one still takes 8 cycles until the result is available for use by anyinstructions that depend on the result).
This means that for a series of independent LD1B instructions with a latency of6 cycles and a throughput of 3 per cycle:
ld1b {z0.b}, ... // starts on cycle 0, completed on cycle 6 ld1b {z1.b}, ... // starts on cycle 0, completed on cycle 6 ld1b {z2.b}, ... // starts on cycle 0, completed on cycle 6 ld1b {z3.b}, ... // starts on cycle 1, completed on cycle 7 ld1b {z4.b}, ... // starts on cycle 1, completed on cycle 7 ld1b {z5.b}, ... // starts on cycle 1, completed on cycle 7
And for the same amount of data loaded with LD2B instructions with a latency of8 cycles and a throughput of 3 per cycle:
ld2b {z0.b, z1.b}, ... // starts on cycle 0, completed on cycle 8 ld2b {z2.b, z3.b}, ... // starts on cycle 1, completed on cycle 9 ld2b {z4.b, z5.b}, ... // starts on cycle 2, completed on cycle 10
You can see that even though we required fewer LD2B instructions, thecombination of the worse throughput and higher latency compared to LD1B meansthat it is preferrable to use LD1B in this instance. More generally, whichsequence is preferrable will depend on your micro-architecture of interest, andthe overall picture may change if there are other instructions being executedat the same time which might compete for the same execution resources on theCPU, but this is fine for an initial estimate at least.
The 0x40 here should in theory be handled by the "rounding" part of sqrshrb. Ithink the difference is that the original code initialised every .b element to0x40 rather than every .h element as I imagined. I'm not really sure why thecode initialises .b elements given that the rest of the arithmetic is done with.h. Assuming that the code you showed is definitely what you want, I think youcan correct for this by just adding 0x80 at the end:
sqrshrnb z0.b, z0.h, #7 add z0.b, z0.b, #0x80
I had a go at writing a very quick test case to check this, hopefully thismostly makes sense:
.global foo foo: mov z30.b, #0x40 saddlb z4.s, z0.h, z30.h saddlt z5.s, z0.h, z30.h shrnb z0.h, z4.s, #7 shrnt z0.h, z5.s, #7 sqxtunb z0.b, z0.h ret .global bar bar: sqrshrnb z0.b, z0.h, #7 add z0.b, z0.b, #0x80 ret
#include <arm_neon.h> #include <stdio.h> int8x8_t foo(int16x4_t x); int8x8_t bar(int16x4_t x); int main() { for (int i=-32768; i<32768; ++i) { int y = foo(vdup_n_s16(i))[0]; int z = bar(vdup_n_s16(i))[0]; if (y != z) { printf("%d -> %d or %d\n", i, y, z); } } }
Of course there is nothing stopping you from using Neon for the 128-bit cases,or even using Neon for just the sqxtun(2) instructions and nothing else sinceat vl=128 the registers are identical and overlap. Having said that, I don'tthink that having two stores here is much of a problem. Did you try a versionusing the zero-extending loads instead, since I think that would still beat theNeon version by not needing the uxtl(2)/uunpk(lo/hi) instructions?
2) Now I understand. Thanks. Just one question. Does the execution throughput refer to instances of the same instruction? I mean which one of the following is preferable:
ld1b {z0.b}, p0/z, [x0] ld1b {z1.b}, p0/z, [x1] add x0, x0, x5 add x1, x1, x6
or
ld1b {z0.b}, p0/z, [x0] add x0, x0, x5 ld1b {z1.b}, p0/z, [x1] add x1, x1, x6
If the execution throughput refers to instances of the same instruction, I guess the firth option is the best. Or?
3) Your code works perfectly. Thanks!
4) I switched back to using the zero-extending loads instead. Regarding the performance, I think it is better, as you said. Thanks!
I might come back to you if I need anything else. Thanks for everything!
BR,
Throughput in this case is referring to the number of the same instruction thatcan begin execution on each cycle. The exact code layout is not particularlyimportant for large out-of-order cores like Neoverse N2 or Neoverse V2, so Iwould expect both arrangements to perform more or less the same. The bottleneckin such cores is instead usually any dependency chain between instructions, forexample in the case of the load instructions here the loads cannot beginexecution until the addresses x0 and x1 have been calculated.
Glad to hear the new code worked as expected!