Hello,
I have the following 3 functions that use the NEON instruction set:
function pixel_avg2_w8_neon, export=1
1:
    subs    w5, w5, #2
    ld1     {v0.8b}, [x2], x3
    ld1     {v2.8b}, [x4], x3
    urhadd  v0.8b, v0.8b, v2.8b
    ld1     {v1.8b}, [x2], x3
    ld1     {v3.8b}, [x4], x3
    urhadd  v1.8b, v1.8b, v3.8b
    st1     {v0.8b}, [x0], x1
    st1     {v1.8b}, [x0], x1
    b.gt    1b
    ret
endfunc
function pixel_avg2_w16_neon, export=1
1:
    subs    w5, w5, #2
    ld1     {v0.16b}, [x2], x3
    ld1     {v2.16b}, [x4], x3
    urhadd  v0.16b, v0.16b, v2.16b
    ld1     {v1.16b}, [x2], x3
    ld1     {v3.16b}, [x4], x3
    urhadd  v1.16b, v1.16b, v3.16b
    st1     {v0.16b}, [x0], x1
    st1     {v1.16b}, [x0], x1
    b.gt    1b
    ret
endfunc
function pixel_sad_\h\()_neon, export=1
    ld1     {v1.16b}, [x2], x3
    ld1     {v0.16b}, [x0], x1
    ld1     {v3.16b}, [x2], x3
    ld1     {v2.16b}, [x0], x1
    uabdl   v16.8h, v0.8b, v1.8b
    uabdl2  v17.8h, v0.16b, v1.16b
    uabal   v16.8h, v2.8b, v3.8b
    uabal2  v17.8h, v2.16b, v3.16b
.rept \h / 2 - 1
    ld1     {v1.16b}, [x2], x3
    ld1     {v0.16b}, [x0], x1
    ld1     {v3.16b}, [x2], x3
    ld1     {v2.16b}, [x0], x1
    uabal   v16.8h, v0.8b, v1.8b
    uabal2  v17.8h, v0.16b, v1.16b
    uabal   v16.8h, v2.8b, v3.8b
    uabal2  v17.8h, v2.16b, v3.16b
.endr
    add     v16.8h, v16.8h, v17.8h
    uaddlv  s0, v16.8h
    fmov    w0, s0
    ret
endfunc
I want to use the SVE/SVE2 instruction sets to improve the performance of these functions. My testbed is the Alibaba Yitian 710 (vector length = 128 bits).
For the first two, I couldn't find a way to improve the performance. For the third one, I wrote the following function:
function pixel_sad_\h\()_sve, export=1
    ptrue   p0.h, vl8
    ld1b    {z1.h}, p0/z, [x2]
    ld1b    {z4.h}, p0/z, [x2, #1, mul vl]
    add     x2, x2, x3
    ld1b    {z3.h}, p0/z, [x2]
    ld1b    {z6.h}, p0/z, [x2, #1, mul vl]
    add     x2, x2, x3
    ld1b    {z0.h}, p0/z, [x0]
    ld1b    {z5.h}, p0/z, [x0, #1, mul vl]
    add     x0, x0, x1
    ld1b    {z2.h}, p0/z, [x0]
    ld1b    {z7.h}, p0/z, [x0, #1, mul vl]
    add     x0, x0, x1
    uabd    v16.8h, v0.8h, v1.8h
    uabd    v17.8h, v4.8h, v5.8h
    uaba    v16.8h, v2.8h, v3.8h
    uaba    v17.8h, v7.8h, v6.8h
.rept \h / 2 - 1
    ld1b    {z1.h}, p0/z, [x2]
    ld1b    {z4.h}, p0/z, [x2, #1, mul vl]
    add     x2, x2, x3
    ld1b    {z3.h}, p0/z, [x2]
    ld1b    {z6.h}, p0/z, [x2, #1, mul vl]
    add     x2, x2, x3
    ld1b    {z0.h}, p0/z, [x0]
    ld1b    {z5.h}, p0/z, [x0, #1, mul vl]
    add     x0, x0, x1
    ld1b    {z2.h}, p0/z, [x0]
    ld1b    {z7.h}, p0/z, [x0, #1, mul vl]
    add     x0, x0, x1
    uaba    v16.8h, v0.8h, v1.8h
    uaba    v17.8h, v4.8h, v5.8h
    uaba    v16.8h, v2.8h, v3.8h
    uaba    v17.8h, v7.8h, v6.8h
.endr
    add     v16.8h, v16.8h, v17.8h
    uaddlv  s0, v16.8h
    fmov    w0, s0
    ret
endfunc
However, this degrades the performance instead of improving it.
Can someone help me?
Thank you in advance,
Akis
Hi Akis,
Thanks for the question!
For an SVE vector length of 128 bits you are probably correct that there is not much performance available here for these particular functions. In general, SVE and SVE2 provide a performance uplift either when there are longer vector lengths available or when we can take advantage of some features of SVE that are not already present in Neon, for example gather/scatter memory accesses, per-element predication to avoid scalar tail handling, or new SVE2 instructions that enable different instruction sequences (particularly around widening and narrowing arithmetic).
Starting with the pixel_avg2_w16_neon function: we are processing exactly 16 bytes at a time here, so gather/scatter and predication will provide no benefit at VL=128. There is no widening or narrowing, so there is probably no benefit from new SVE2 instruction sequences either. I think you are correct that there is not much we can do here.
For pixel_avg2_w8_neon: again we have no widening or narrowing, so there is probably not much to gain from using new SVE2 instructions. Since we are only processing 64 bits at a time, though, there is the potential for using gather/scatter instructions to fill out an entire vector and operate on that instead. In particular, something like:
function pixel_avg2_w8_sve, export=1
    ptrue   p0.b
    index   z4.d, #0, x3        // create a vector of {0,x3,x3*2,x3*3,...}
    index   z5.d, #0, x1        // create a vector of {0,x1,x1*2,x1*3,...}
    cntd    x6
    mul     x3, x6, x3          // input stride *= vl
    mul     x1, x6, x1          // output stride *= vl
    whilelt p1.d, wzr, w5       // create a predicate to deal with odd lengths of w5;
                                // no need for a branch here since we assume w5 > 0.
1:
    sub     w5, w5, w6
    ld1d    {z0.d}, p1/z, [x2, z4.d]    // gather blocks of 64 bits.
    ld1d    {z2.d}, p1/z, [x4, z4.d]    // gather blocks of 64 bits.
    urhadd  z0.b, p0/m, z0.b, z2.b      // operate on a full vector of data.
    st1d    {z0.d}, p1, [x0, z4.d]      // store blocks of 64 bits.
    add     x2, x2, x3
    add     x4, x4, x3
    add     x0, x0, x1
    whilelt p1.d, wzr, w5
    b.any   1b
    ret
While the above sequence is vector-length agnostic and makes full use of SVE features, it is also probably slower than your original Neon implementation since gather and scatter instructions tend to have a relatively high overhead associated with them. This overhead is fine if it enables a significant amount of vector work elsewhere but in this case the data processing is only a single instruction so the tradeoff is unlikely to be worthwhile.
It's also worth noting that in the above implementation we are using WHILELT instructions for loop control; however, this might not be necessary if you can guarantee that the number of elements being processed is exactly divisible by the vector length (or divisible by two if you only care about VL=128). Removing the WHILELT and going back to using a normal SUBS or CMP for loop control may be a small improvement, but the cost here is ultimately dominated by the gather/scatter, unfortunately.
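For illustration, if the height in w5 is known to be a multiple of the number of 64-bit elements per vector, the loop control could be simplified to something like the following untested sketch (the gather/URHADD/scatter body stays as in the snippet above):

    ptrue   p1.d                // all lanes always active, no tail handling needed
    cntd    x6                  // number of 64-bit elements per vector
1:
    subs    w5, w5, w6
    // ... gathers, urhadd and scatter as before ...
    b.gt    1b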
Finally let us consider the pixel_sad_\h\()_neon function. In this case while we could use widening loads as in your suggested SVE code, in reality we generally should prefer to use the widening data-processing instructions like the Neon code does if there is no overhead to doing so, since it means we can load more data with each load instruction and therefore need fewer instructions overall.
In this case we can probably take advantage of a different instruction sequence, albeit one that also exists in Neon. The potential advantage here is that widening instructions are only able to operate on half of the data at a time, so if we have a better instruction sequence then there is usually some room for improvement even if the replacement sequence is also two instructions.
The dot-product instructions are optional in Armv8.2-A and were made mandatory in Armv8.4-A, and are available both in Neon and in SVE. You will know if the Neon dot product instructions are available by the presence of the "asimddp" (dp=dot product) feature in /proc/cpuinfo. Since your micro-architecture includes SVE this shouldn't be a problem.
The dot-product instructions can be used here as a faster way of performing a widening accumulation on many micro-architectures, since they tend to have good latency and throughput. This means that we can use a non-widening absolute-difference instruction to calculate a full vector of results at once and then accumulate them separately. Something like:
function pixel_sad_\h\()_neon_dotprod, export=1
    ld1     {v1.16b}, [x2], x3
    ld1     {v0.16b}, [x0], x1
    ld1     {v3.16b}, [x2], x3
    ld1     {v2.16b}, [x0], x1
    movi    v19.4s, #0              // accumulator vector
    movi    v18.16b, #1             // constant vector of 1s
    uabd    v16.16b, v0.16b, v1.16b
    uabd    v17.16b, v2.16b, v3.16b
    udot    v19.4s, v16.16b, v18.16b
    udot    v19.4s, v17.16b, v18.16b
.rept \h / 2 - 1
    ld1     {v1.16b}, [x2], x3
    ld1     {v0.16b}, [x0], x1
    ld1     {v3.16b}, [x2], x3
    ld1     {v2.16b}, [x0], x1
    uabd    v16.16b, v0.16b, v1.16b
    uabd    v17.16b, v2.16b, v3.16b
    udot    v19.4s, v16.16b, v18.16b
    udot    v19.4s, v17.16b, v18.16b
.endr
    addv    s0, v19.4s
    fmov    w0, s0
    ret
You can see above that we are using a dot product of our vector of absolute differences (v16 and v17) by a vector of all 1s (v18) to accumulate our result into v19. You may also want to consider having multiple accumulators rather than just v19 as we have above.
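For example, a rough sketch of what splitting the accumulation over two registers might look like (untested; v20 is an extra accumulator that is not in the original code):

    movi    v19.4s, #0                  // first accumulator
    movi    v20.4s, #0                  // second accumulator (hypothetical)
    ...
    udot    v19.4s, v16.16b, v18.16b
    udot    v20.4s, v17.16b, v18.16b    // accumulate the second row separately
    ...
    add     v19.4s, v19.4s, v20.4s      // combine before the final reduction
    addv    s0, v19.4s
    fmov    w0, s0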
While it is then possible to write an SVE version of the above code, for the reasons we mentioned before there is probably not much benefit to using SVE for a vector length of 128 bits here.
Hope that helps!
Thanks,
George
Hi George,
I am very happy to talk to you again in this forum. You helped me a lot in a previous post (about one year ago) and now you are helping me again. I am very grateful for that.
Regarding pixel_avg2_w16_neon, I totally agree with you. I do not think that we can improve this function when the vector length is 128 bits, so I will continue using its NEON version.
Regarding pixel_avg2_w8_neon, I have tried to run the code you provided (I think you have a typo and the store should use z5 instead of z4), but as you said, the performance is degraded. Generally, I realized that the usage of SVE/SVE2 instructions that are agnostic to the vector length (such as whilelt, cntd, and so on) degrades the performance. Maybe the current CPU architectures do not provide optimized implementations of these instructions, as you indicated in the old post. I also realized that the usage of gather/scatter loads and stores degrades the performance as well. So, if one cannot make use of the additional data that can be manipulated (as is the case with pixel_avg2_w8_neon), I do not think that they should be used. So, I will continue using its NEON version.
Regarding the pixel_sad_\h\()_neon_dotprod function, I have run it and yes, the performance is improved! I would like to thank you for that. To be honest, I would not have come up with the dot-product implementation on my own. Once again, thank you!
I have a couple more NEON functions that I would like to improve. If you could find some time, could you please help me improve these functions as well? I apologize if I cause any inconvenience, but this forum provides the only help I can get.
More specifically, I have the following NEON function:
function pixel_sad_x_\h\()_neon, export=1
    mov     x6, x5
    mov     x5, x4
    mov     x7, #16
    ld1     {v0.8b}, [x0], x7
    ld1     {v1.8b}, [x1], x5
    ld1     {v2.8b}, [x2], x5
    uabdl   v16.8h, v1.8b, v0.8b
    ld1     {v3.8b}, [x3], x5
    uabdl   v17.8h, v2.8b, v0.8b
    ld1     {v5.8b}, [x0], x7
    ld1     {v1.8b}, [x1], x5
    uabdl   v18.8h, v3.8b, v0.8b
    ld1     {v2.8b}, [x2], x5
    uabal   v16.8h, v1.8b, v5.8b
    ld1     {v3.8b}, [x3], x5
    uabal   v17.8h, v2.8b, v5.8b
    uabal   v18.8h, v3.8b, v5.8b
.rept \h / 2 - 1
    ld1     {v0.8b}, [x0], x7
    ld1     {v1.8b}, [x1], x5
    ld1     {v2.8b}, [x2], x5
    uabal   v16.8h, v1.8b, v0.8b
    ld1     {v3.8b}, [x3], x5
    uabal   v17.8h, v2.8b, v0.8b
    ld1     {v5.8b}, [x0], x7
    ld1     {v1.8b}, [x1], x5
    uabal   v18.8h, v3.8b, v0.8b
    ld1     {v2.8b}, [x2], x5
    uabal   v16.8h, v1.8b, v5.8b
    ld1     {v3.8b}, [x3], x5
    uabal   v17.8h, v2.8b, v5.8b
    uabal   v18.8h, v3.8b, v5.8b
.endr
    uaddlv  s0, v16.8h
    uaddlv  s1, v17.8h
    uaddlv  s2, v18.8h
    stp     s0, s1, [x6], #8
    str     s2, [x6]
    ret
.endfunc
Can I use the same approach with the dot instructions? Or do you think that this can be improved using another way?
Also, what about this function:
function pixel_satd_4x4_neon, export=1
    ld1     {v1.s}[0], [x2], x3
    ld1     {v0.s}[0], [x0], x1
    ld1     {v3.s}[0], [x2], x3
    ld1     {v2.s}[0], [x0], x1
    ld1     {v1.s}[1], [x2], x3
    ld1     {v0.s}[1], [x0], x1
    ld1     {v3.s}[1], [x2], x3
    ld1     {v2.s}[1], [x0], x1
    usubl   v0.8h, v0.8b, v1.8b
    usubl   v1.8h, v2.8b, v3.8b
    add     v2.8h, v0.8h, v1.8h
    sub     v3.8h, v0.8h, v1.8h
    zip1    v0.2d, v2.2d, v3.2d
    zip2    v1.2d, v2.2d, v3.2d
    add     v2.8h, v0.8h, v1.8h
    sub     v3.8h, v0.8h, v1.8h
    trn1    v0.8h, v2.8h, v3.8h
    trn2    v1.8h, v2.8h, v3.8h
    add     v2.8h, v0.8h, v1.8h
    sub     v3.8h, v0.8h, v1.8h
    trn1    v0.4s, v2.4s, v3.4s
    trn2    v1.4s, v2.4s, v3.4s
    abs     v0.8h, v0.8h
    abs     v1.8h, v1.8h
    umax    v0.8h, v0.8h, v1.8h
    uaddlv  s0, v0.8h
    mov     w0, v0.s[0]
    ret
endfunc
I wrote the following SVE/SVE2 equivalent function:
function pixel_satd_4x4_sve, export=1
    ptrue   p0.h, vl4
    ld1b    {z1.h}, p0/z, [x2]
    add     x2, x2, x3
    ld1b    {z0.h}, p0/z, [x0]
    add     x0, x0, x1
    ld1b    {z3.h}, p0/z, [x2]
    add     x2, x2, x3
    ld1b    {z2.h}, p0/z, [x0]
    add     x0, x0, x1
    ld1b    {z31.h}, p0/z, [x2]
    add     x2, x2, x3
    ld1b    {z30.h}, p0/z, [x0]
    add     x0, x0, x1
    ld1b    {z29.h}, p0/z, [x2]
    ld1b    {z28.h}, p0/z, [x0]
    sub     v0.4h, v0.4h, v1.4h
    sub     v30.4h, v30.4h, v31.4h
    sub     v1.4h, v2.4h, v3.4h
    sub     v31.4h, v28.4h, v29.4h
    add     v2.4h, v0.4h, v1.4h
    sub     v3.4h, v0.4h, v1.4h
    add     v22.4h, v30.4h, v31.4h
    sub     v23.4h, v30.4h, v31.4h
    add     v28.4h, v22.4h, v2.4h
    sub     v29.4h, v22.4h, v2.4h
    add     v30.4h, v23.4h, v3.4h
    sub     v31.4h, v23.4h, v3.4h
    mov     v28.d[1], v30.d[0]
    mov     v29.d[1], v31.d[0]
    trn1    v0.8h, v28.8h, v29.8h
    trn2    v1.8h, v28.8h, v29.8h
    add     v2.4h, v0.4h, v1.4h
    sub     v3.4h, v0.4h, v1.4h
    trn1    v0.4s, v2.4s, v3.4s
    trn2    v1.4s, v2.4s, v3.4s
    abs     v0.8h, v0.8h
    abs     v1.8h, v1.8h
    umax    v0.8h, v0.8h, v1.8h
    uaddlv  s0, v0.8h
    mov     w0, v0.s[0]
    ret
endfunc
The performance is improved, but I would like to ask if you can identify a better way.
No problem, happy to help!
Apologies for the typo in the pixel_avg2_w8_sve code. Your benchmark results make sense to me and seem to be in line with our previous conversation, so your conclusion of continuing to use the Neon implementation makes sense to me. I'm glad to hear that the dot-product implementation improved performance in the pixel_sad_\h\()_neon_dotprod version.
For the new pixel_sad_x_\h\()_neon function I don't think there is likely to be any benefit from using dot-product instructions. The main advantage of those instructions is that they allow us to process a whole vector's worth of data at a time, compared to normal widening instructions, which only operate on half of the input data per instruction. Since we only have eight bytes of input data per vector rather than a full 16 bytes, I think there is no benefit.
It is interesting in this case that the stride between consecutive 64-bit blocks in x0 is 16 bytes rather than eight, so the data is almost contiguous in memory. If this function is called multiple times such that both sets of the alternating 64-bit groups are used then there could be improvements to be had here, but not with the current interface I think, since we do not know how the other eight bytes in each 16-byte block are used.
For the pixel_satd_4x4_neon function I think there is at least one small improvement possible to the Neon code:
The lane-index loads in Neon involve an implicit merge into the rest of the vector which can hurt performance. We can avoid this by instead simply loading the low four bytes of the vector with a normal LDR instruction and then using indexed LD1 instructions for the remainder.
The normal LDR instructions do not have a post-increment addressing form, so we need to update the pointers separately, but that will still probably perform better than the current code. The post-increment on the last two loads is also unnecessary.
Something like:
ldr     s1, [x2]
ldr     s0, [x0]
ldr     s3, [x2, x3]            // load from x2 + x3
ldr     s2, [x0, x1]            // load from x0 + x1
add     x2, x2, x3, lsl #1      // x2 += x3 * 2
add     x0, x0, x1, lsl #1      // x0 += x1 * 2
ld1     {v1.s}[1], [x2], x3
ld1     {v0.s}[1], [x0], x1
ld1     {v3.s}[1], [x2]         // don't need post-increment
ld1     {v2.s}[1], [x0]         // don't need post-increment
...
You can probably also save one instruction by merging the ABS instruction into the previous SUB. So instead of:
add     v2.8h, v0.8h, v1.8h
sub     v3.8h, v0.8h, v1.8h
trn1    v0.4s, v2.4s, v3.4s
trn2    v1.4s, v2.4s, v3.4s
abs     v0.8h, v0.8h
abs     v1.8h, v1.8h
You should instead be able to do something like:
add     v2.8h, v0.8h, v1.8h
abs     v2.8h, v2.8h
uabd    v3.8h, v0.8h, v1.8h     // sub + abs = uabd
trn1    v0.4s, v2.4s, v3.4s
trn2    v1.4s, v2.4s, v3.4s
This works because the TRN{1,2} instructions only permute the data and do not change the values, so it is safe to apply the absolute value before the permute rather than after it.
Thanks for your help!
For the pixel_sad_x_\h\()_neon function, I agree. Unfortunately, currently I cannot modify the code that calls the interface. I am just implementing the interface. But I will take into account your comment in case something changes.
For pixel_satd_4x4_neon, your proposal to use LDR instructions improved the performance. Thanks! Regarding the usage of the UABD instruction: initially I ran the whole binary executable without the unit tests and the performance was greatly improved. In fact, it reached my end goal! Later, I realized that the unit tests failed. In order to fix the issue, I used "sabd" instead of "uabd". It seems that we need signed halfwords. This again improves the performance compared to the initial NEON function, but not to the same level as with "uabd". To be honest, I do not understand why, as according to
https://documentation-service.arm.com/static/60ad18a5982fc7708ac1cde8?token=
https://developer.arm.com/documentation/pjdoc466751330-593177/0002
it seems that "uabd" and "sabd" have the same latency and throughput. Am I doing something wrong? Can we somehow use the uabd? Or maybe the core code follows a different flow when "uabd" is used (which shouldn't be followed, as the output of pixel_satd_4x4_neon is wrong) and this is what actually causes the big performance improvement?
I have also the following NEON functions that I have to improve:
function pixel_sad_\h\()_neon_10, export=1
    lsl     x1, x1, #1
    lsl     x3, x3, #1
    ld1     {v1.8h}, [x2], x3
    ld1     {v0.8h}, [x0], x1
    ld1     {v3.8h}, [x2], x3
    ld1     {v2.8h}, [x0], x1
    uabdl   v16.4s, v0.4h, v1.4h
    uabdl2  v17.4s, v0.8h, v1.8h
    uabdl   v18.4s, v2.4h, v3.4h
    uabdl2  v19.4s, v2.8h, v3.8h
.rept \h / 2 - 1
    ld1     {v1.8h}, [x2], x3
    ld1     {v0.8h}, [x0], x1
    ld1     {v3.8h}, [x2], x3
    ld1     {v2.8h}, [x0], x1
    uabal   v16.4s, v0.4h, v1.4h
    uabal2  v17.4s, v0.8h, v1.8h
    uabal   v18.4s, v2.4h, v3.4h
    uabal2  v19.4s, v2.8h, v3.8h
.endr
    add     v16.4s, v16.4s, v18.4s
    uaddlv  s0, v16.8h
    fmov    w0, s0
    ret
endfunc
and
function pixel_ssd_\h\()_neon, export=1
    ld1     {v16.16b}, [x0], x1
    ld1     {v17.16b}, [x2], x3
    usubl   v2.8h, v16.8b, v17.8b
    usubl2  v3.8h, v16.16b, v17.16b
    ld1     {v16.16b}, [x0], x1
    smull   v0.4s, v2.4h, v2.4h
    smull2  v1.4s, v2.8h, v2.8h
    ld1     {v17.16b}, [x2], x3
    smlal   v0.4s, v3.4h, v3.4h
    smlal2  v1.4s, v3.8h, v3.8h
.rept \h-2
    usubl   v2.8h, v16.8b, v17.8b
    usubl2  v3.8h, v16.16b, v17.16b
    ld1     {v16.16b}, [x0], x1
    smlal   v0.4s, v2.4h, v2.4h
    smlal2  v1.4s, v2.8h, v2.8h
    ld1     {v17.16b}, [x2], x3
    smlal   v0.4s, v3.4h, v3.4h
    smlal2  v1.4s, v3.8h, v3.8h
.endr
    usubl   v2.8h, v16.8b, v17.8b
    usubl2  v3.8h, v16.16b, v17.16b
    smlal   v0.4s, v2.4h, v2.4h
    smlal2  v1.4s, v2.8h, v2.8h
    smlal   v0.4s, v3.4h, v3.4h
    smlal2  v1.4s, v3.8h, v3.8h
    add     v0.4s, v0.4s, v1.4s
    addv    s0, v0.4s
    mov     w0, v0.s[0]
    ret
endfunc
Any thoughts on these? I think we cannot use the udot approach here, or can we?
Happy new year!
I don't have a good explanation about why UABD and SABD may differ in performance. As you point out the Software Optimization Guides identify them as performing identically. Perhaps there are other sources of noise in the benchmarks or the binary layout has changed slightly as a result of re-linking the program?
For pixel_sad_\h\()_neon_10 we are dealing with .h elements rather than .b so we cannot use the Neon dot product instructions, however we can use the SVE dot-product instructions instead since a 16-bit dot product is available:
https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/UDOT--4-way--vectors---Unsigned-integer-dot-product-
I am assuming that the "10" in the function name refers to 10-bit input elements rather than full 16-bit wide input. In this instance we can delay accumulating into a wider data type for longer, since we have some bits to spare. In the snippet below I have only delayed the accumulation by one instruction, but you could consider doing more to further reduce the number of dot-product instructions needed.
ptrue   p0.h
dup     z4.h, #1                // a constant vector of 1s for summing z0.h*1.
dup     z5.d, #0                // an accumulator with 64-bit elements.
...
uabd    z0.h, p0/m, z0.h, z1.h  // 10 bits
uaba    z0.h, z2.h, z3.h        // 11 bits
udot    z5.d, z0.h, z4.h
...
uaddlv  d0, p0, z5.d
fmov    w0, s0
For the pixel_ssd_\h\()_neon we can also make use of dot-product instructions, albeit here I think we only need 8-bit rather than 16-bit dot-products.
We can note that for the subtraction we only actually care about the absolute difference since we always square the result, or we may know that the second operand is always less than the first one. Either way I think each pair of USUBL(2) instructions can be replaced by a single non-widening UABD. We can then make use of the dot product to do the accumulation and widening in a single step.
For example instead of:
ld1     {v16.16b}, [x0], x1
ld1     {v17.16b}, [x2], x3
usubl   v2.8h, v16.8b, v17.8b
usubl2  v3.8h, v16.16b, v17.16b
smull   v0.4s, v2.4h, v2.4h
smull2  v1.4s, v2.8h, v2.8h
We can probably do something like:
movi    v0.4s, #0               // need to initialise an accumulator.
...
ld1     {v16.16b}, [x0], x1
ld1     {v17.16b}, [x2], x3
uabd    v2.16b, v16.16b, v17.16b
udot    v0.4s, v2.16b, v2.16b
Happy new year! I wish you all the best.
Regarding UABD vs SABD, I do not have a good explanation either. I have built the whole project from scratch but the same thing is still happening. I suspect that the caller function follows different code paths (maybe using an if statement which checks the return value of the callee) and, when uabd is used, this value is wrong, so different paths are followed which limit the number of CPU cycles spent. I do not have any other explanation.
Yes, you are right, "10" stands for 10-bit. The code you provided for pixel_sad_\h\()_neon_10 improves the performance. Thanks! (Just a typo: it should be "uaddv" and not "uaddlv".)
Your code for pixel_ssd_\h\()_neon improved the performance as well! Thanks!
I have a couple of functions more to improve. Here they are:
function quant_4x4x4_neon, export=1 ld1 {v16.8h,v17.8h}, [x0] abs v18.8h, v16.8h abs v19.8h, v17.8h ld1 {v0.8h,v1.8h}, [x2] ld1 {v2.8h,v3.8h}, [x1] QUANT_TWO v0.8h, v1.8h, v2, v3, v4.16b ld1 {v16.8h,v17.8h}, [x0] abs v18.8h, v16.8h abs v19.8h, v17.8h QUANT_TWO v0.8h, v1.8h, v2, v3, v5.16b ld1 {v16.8h,v17.8h}, [x0] abs v18.8h, v16.8h abs v19.8h, v17.8h QUANT_TWO v0.8h, v1.8h, v2, v3, v6.16b ld1 {v16.8h,v17.8h}, [x0] abs v18.8h, v16.8h abs v19.8h, v17.8h QUANT_TWO v0.8h, v1.8h, v2, v3, v7.16b uqxtn v4.8b, v4.8h uqxtn v7.8b, v7.8h uqxtn v6.8b, v6.8h uqxtn v5.8b, v5.8h fmov x7, d7 fmov x6, d6 fmov x5, d5 fmov x4, d4 mov w0, #0 tst x7, x7 cinc w0, w0, ne lsl w0, w0, #1 tst x6, x6 cinc w0, w0, ne lsl w0, w0, #1 tst x5, x5 cinc w0, w0, ne lsl w0, w0, #1 tst x4, x4 cinc w0, w0, ne ret endfunc .macro QUANT_TWO bias0 bias1 mf0_1 mf2_3 mask add v18.8h, v18.8h, \bias0 add v19.8h, v19.8h, \bias1 umull v20.4s, v18.4h, \mf0_1\().4h umull2 v21.4s, v18.8h, \mf0_1\().8h umull v22.4s, v19.4h, \mf2_3\().4h umull2 v23.4s, v19.8h, \mf2_3\().8h sshr v16.8h, v16.8h, #15 sshr v17.8h, v17.8h, #15 shrn v18.4h, v20.4s, #16 shrn2 v18.8h, v21.4s, #16 shrn v19.4h, v22.4s, #16 shrn2 v19.8h, v23.4s, #16 eor v18.16b, v18.16b, v16.16b eor v19.16b, v19.16b, v17.16b sub v18.8h, v18.8h, v16.8h sub v19.8h, v19.8h, v17.8h orr \mask, v18.16b, v19.16b st1 {v18.8h,v19.8h}, [x0], #32 .endm
function hpel_filter_neon, export=1 ubfm x9, x3, #0, #3 add w15, w5, w9 sub x13, x3, x9 // align src sub x10, x0, x9 sub x11, x1, x9 sub x12, x2, x9 movi v30.16b, #5 movi v31.16b, #20 1: // line start mov x3, x13 mov x2, x12 mov x1, x11 mov x0, x10 add x7, x3, #16 // src pointer next 16b for horiz filter mov x5, x15 // restore width sub x3, x3, x4, lsl #1 // src - 2*stride ld1 {v28.16b}, [x7], #16 // src[16:31] add x9, x3, x5 // holds src - 2*stride + width ld1 {v16.16b}, [x3], x4 // src-2*stride[0:15] ld1 {v17.16b}, [x3], x4 // src-1*stride[0:15] ld1 {v18.16b}, [x3], x4 // src+0*stride[0:15] ld1 {v19.16b}, [x3], x4 // src+1*stride[0:15] ld1 {v20.16b}, [x3], x4 // src+2*stride[0:15] ld1 {v21.16b}, [x3], x4 // src+3*stride[0:15] ext v22.16b, v7.16b, v18.16b, #14 uaddl v1.8h, v16.8b, v21.8b ext v26.16b, v18.16b, v28.16b, #3 umlsl v1.8h, v17.8b, v30.8b ext v23.16b, v7.16b, v18.16b, #15 umlal v1.8h, v18.8b, v31.8b ext v24.16b, v18.16b, v28.16b, #1 umlal v1.8h, v19.8b, v31.8b ext v25.16b, v18.16b, v28.16b, #2 umlsl v1.8h, v20.8b, v30.8b 2: // next 16 pixel of line subs x5, x5, #16 sub x3, x9, x5 // src - 2*stride += 16 uaddl v4.8h, v22.8b, v26.8b uaddl2 v5.8h, v22.16b, v26.16b sqrshrun v6.8b, v1.8h, #5 umlsl v4.8h, v23.8b, v30.8b umlsl2 v5.8h, v23.16b, v30.16b umlal v4.8h, v18.8b, v31.8b umlal2 v5.8h, v18.16b, v31.16b umlal v4.8h, v24.8b, v31.8b umlal2 v5.8h, v24.16b, v31.16b umlsl v4.8h, v25.8b, v30.8b umlsl2 v5.8h, v25.16b, v30.16b uaddl2 v2.8h, v16.16b, v21.16b sqrshrun v4.8b, v4.8h, #5 mov v7.16b, v18.16b sqrshrun2 v4.16b, v5.8h, #5 umlsl2 v2.8h, v17.16b, v30.16b ld1 {v16.16b}, [x3], x4 // src-2*stride[0:15] umlal2 v2.8h, v18.16b, v31.16b ld1 {v17.16b}, [x3], x4 // src-1*stride[0:15] umlal2 v2.8h, v19.16b, v31.16b ld1 {v18.16b}, [x3], x4 // src+0*stride[0:15] umlsl2 v2.8h, v20.16b, v30.16b ld1 {v19.16b}, [x3], x4 // src+1*stride[0:15] st1 {v4.16b}, [x0], #16 sqrshrun2 v6.16b, v2.8h, #5 ld1 {v20.16b}, [x3], x4 // src+2*stride[0:15] ld1 {v21.16b}, [x3], x4 // src+3*stride[0:15] ext v22.16b, v0.16b, v1.16b, #12 ext v26.16b, v1.16b, v2.16b, #6 ext v23.16b, v0.16b, v1.16b, #14 st1 {v6.16b}, [x1], #16 uaddl v3.8h, v16.8b, v21.8b ext v25.16b, v1.16b, v2.16b, #4 umlsl v3.8h, v17.8b, v30.8b ext v24.16b, v1.16b, v2.16b, #2 umlal v3.8h, v18.8b, v31.8b add v4.8h, v22.8h, v26.8h umlal v3.8h, v19.8b, v31.8b add v5.8h, v23.8h, v25.8h umlsl v3.8h, v20.8b, v30.8b add v6.8h, v24.8h, v1.8h ext v22.16b, v1.16b, v2.16b, #12 ext v26.16b, v2.16b, v3.16b, #6 ext v23.16b, v1.16b, v2.16b, #14 ext v25.16b, v2.16b, v3.16b, #4 ext v24.16b, v2.16b, v3.16b, #2 add v22.8h, v22.8h, v26.8h add v23.8h, v23.8h, v25.8h add v24.8h, v24.8h, v2.8h sub v4.8h, v4.8h, v5.8h // a-b sub v5.8h, v5.8h, v6.8h // b-c sub v22.8h, v22.8h, v23.8h // a-b sub v23.8h, v23.8h, v24.8h // b-c sshr v4.8h, v4.8h, #2 // (a-b)/4 sshr v22.8h, v22.8h, #2 // (a-b)/4 sub v4.8h, v4.8h, v5.8h // (a-b)/4-b+c sub v22.8h, v22.8h, v23.8h // (a-b)/4-b+c sshr v4.8h, v4.8h, #2 // ((a-b)/4-b+c)/4 sshr v22.8h, v22.8h, #2 // ((a-b)/4-b+c)/4 add v4.8h, v4.8h, v6.8h // ((a-b)/4-b+c)/4+c = (a-5*b+20*c)/16 add v22.8h, v22.8h, v24.8h // ((a-b)/4-b+c)/4+c = (a-5*b+20*c)/16 sqrshrun v4.8b, v4.8h, #6 ld1 {v28.16b}, [x7], #16 // src[16:31] mov v0.16b, v2.16b ext v23.16b, v7.16b, v18.16b, #15 sqrshrun2 v4.16b, v22.8h, #6 mov v1.16b, v3.16b ext v22.16b, v7.16b, v18.16b, #14 ext v24.16b, v18.16b, v28.16b, #1 ext v25.16b, v18.16b, v28.16b, #2 ext v26.16b, v18.16b, v28.16b, #3 st1 {v4.16b}, [x2], #16 b.gt 2b subs w6, w6, #1 add x10, x10, x4 add x11, x11, x4 add x12, x12, x4 add 
x13, x13, x4 b.gt 1b ret endfunc
function sub8x8_dct8_neon, export=1 mov x3, #16 mov x4, #16 ld1 {v16.8b}, [x1], x3 ld1 {v17.8b}, [x2], x4 ld1 {v18.8b}, [x1], x3 ld1 {v19.8b}, [x2], x4 usubl v0.8h, v16.8b, v17.8b ld1 {v20.8b}, [x1], x3 ld1 {v21.8b}, [x2], x4 usubl v1.8h, v18.8b, v19.8b ld1 {v22.8b}, [x1], x3 ld1 {v23.8b}, [x2], x4 usubl v2.8h, v20.8b, v21.8b ld1 {v24.8b}, [x1], x3 ld1 {v25.8b}, [x2], x4 usubl v3.8h, v22.8b, v23.8b ld1 {v26.8b}, [x1], x3 ld1 {v27.8b}, [x2], x4 usubl v4.8h, v24.8b, v25.8b ld1 {v28.8b}, [x1], x3 ld1 {v29.8b}, [x2], x4 usubl v5.8h, v26.8b, v27.8b ld1 {v30.8b}, [x1], x3 ld1 {v31.8b}, [x2], x4 usubl v6.8h, v28.8b, v29.8b usubl v7.8h, v30.8b, v31.8b DCT8_1D row transpose8x8.h v0, v1, v2, v3, v4, v5, v6, v7, v30, v31 DCT8_1D col st1 {v0.8h,v1.8h,v2.8h,v3.8h}, [x0], #64 st1 {v4.8h,v5.8h,v6.8h,v7.8h}, [x0], #64 ret endfunc .macro DCT8_1D type SUMSUB_AB v18.8h, v17.8h, v3.8h, v4.8h // s34/d34 SUMSUB_AB v19.8h, v16.8h, v2.8h, v5.8h // s25/d25 SUMSUB_AB v22.8h, v21.8h, v1.8h, v6.8h // s16/d16 SUMSUB_AB v23.8h, v20.8h, v0.8h, v7.8h // s07/d07 SUMSUB_AB v24.8h, v26.8h, v23.8h, v18.8h // a0/a2 SUMSUB_AB v25.8h, v27.8h, v22.8h, v19.8h // a1/a3 SUMSUB_AB v30.8h, v29.8h, v20.8h, v17.8h // a6/a5 sshr v23.8h, v21.8h, #1 sshr v18.8h, v16.8h, #1 add v23.8h, v23.8h, v21.8h add v18.8h, v18.8h, v16.8h sub v30.8h, v30.8h, v23.8h sub v29.8h, v29.8h, v18.8h SUMSUB_AB v28.8h, v31.8h, v21.8h, v16.8h // a4/a7 sshr v22.8h, v20.8h, #1 sshr v19.8h, v17.8h, #1 add v22.8h, v22.8h, v20.8h add v19.8h, v19.8h, v17.8h add v22.8h, v28.8h, v22.8h add v31.8h, v31.8h, v19.8h SUMSUB_AB v0.8h, v4.8h, v24.8h, v25.8h SUMSUB_SHR 2, v1.8h, v7.8h, v22.8h, v31.8h, v16.8h, v17.8h SUMSUB_SHR 1, v2.8h, v6.8h, v26.8h, v27.8h, v18.8h, v19.8h SUMSUB_SHR2 2, v3.8h, v5.8h, v30.8h, v29.8h, v20.8h, v21.8h .endm .macro SUMSUB_AB sum, sub, a, b add \sum, \a, \b sub \sub, \a, \b .endm .macro SUMSUB_SHR shift sum sub a b t0 t1 sshr \t0, \b, #\shift sshr \t1, \a, #\shift add \sum, \a, \t0 sub \sub, \t1, \b .endm .macro SUMSUB_SHR2 shift sum sub a b t0 t1 sshr \t0, \a, #\shift sshr \t1, \b, #\shift add \sum, \t0, \b sub \sub, \a, \t1 .endm .macro transpose8x8.h r0, r1, r2, r3, r4, r5, r6, r7, r8, r9 trn1 \r8\().8h, \r0\().8h, \r1\().8h trn2 \r9\().8h, \r0\().8h, \r1\().8h trn1 \r1\().8h, \r2\().8h, \r3\().8h trn2 \r3\().8h, \r2\().8h, \r3\().8h trn1 \r0\().8h, \r4\().8h, \r5\().8h trn2 \r5\().8h, \r4\().8h, \r5\().8h trn1 \r2\().8h, \r6\().8h, \r7\().8h trn2 \r7\().8h, \r6\().8h, \r7\().8h trn1 \r4\().4s, \r0\().4s, \r2\().4s trn2 \r2\().4s, \r0\().4s, \r2\().4s trn1 \r6\().4s, \r5\().4s, \r7\().4s trn2 \r7\().4s, \r5\().4s, \r7\().4s trn1 \r5\().4s, \r9\().4s, \r3\().4s trn2 \r9\().4s, \r9\().4s, \r3\().4s trn1 \r3\().4s, \r8\().4s, \r1\().4s trn2 \r8\().4s, \r8\().4s, \r1\().4s trn1 \r0\().2d, \r3\().2d, \r4\().2d trn2 \r4\().2d, \r3\().2d, \r4\().2d trn1 \r1\().2d, \r5\().2d, \r6\().2d trn2 \r5\().2d, \r5\().2d, \r6\().2d trn2 \r6\().2d, \r8\().2d, \r2\().2d trn1 \r2\().2d, \r8\().2d, \r2\().2d trn1 \r3\().2d, \r9\().2d, \r7\().2d trn2 \r7\().2d, \r9\().2d, \r7\().2d .endm
Unfortunately, I was not able to find any way of improving these functions. Any thoughts?
BR,
For the quant_4x4x4_neon function:
Instead of an ABS followed by an ADD, we could consider trying to make use of the SABA instruction, which performs an absolute difference (here, with zero) and an accumulation in a single instruction. The obvious problem here is that the bias parameter is reused, so we would need an additional MOV instruction to duplicate it. This is less of an issue in SVE where we can make use of MOVPRFX, since in this case the additional instruction can be considered "free" if it is destructively used by the following instruction. See:
https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/MOVPRFX--unpredicated---Move-prefix--unpredicated--
https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/SABA--Signed-absolute-difference-and-accumulate-
So instead of:
abs     v18.8h, v16.8h
add     v18.8h, v18.8h, \bias0
We could instead consider something like:
dup     z30.h, #0
...
movprfx z18, \bias0             // assumes \bias0 is held in a Z register here
saba    z18.h, z16.h, z30.h
For the UMULL + SHRN #16 pairs, SVE has a "multiply returning high half" instruction, UMULH, which I think does what you want here in a single instruction:
https://developer.arm.com/documentation/ddi0602/2023-12/SVE-Instructions/UMULH--unpredicated---Unsigned-multiply-returning-high-half--unpredicated--
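For illustration only, each UMULL/UMULL2 + SHRN/SHRN2 pair might then collapse into something like the following untested sketch (the unpredicated form shown here is SVE2, and it assumes the mf values and the biased absolute values have been moved into Z registers, here z2/z3 and z18/z19):

umulh   z18.h, z18.h, z2.h      // ((|coef| + bias) * mf) >> 16, a full vector of halfwords at once
umulh   z19.h, z19.h, z3.h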
The SSHR #15 + EOR + SUB combination looks as if it is performing something like a conditional negation based on the sign of v16/v17 (i.e. v16.8h < 0 ? -v18.8h : v18.8h). Perhaps we can make use of SVE predicated instructions here to perform a conditional negation instead? So instead of:
sshr    v16.8h, v16.8h, #15
eor     v18.16b, v18.16b, v16.16b
sub     v18.8h, v18.8h, v16.8h
we could have something like:

ptrue   p1.b
...
cmplt   p0.h, p1/z, z16.h, #0
neg     z18.h, p0/m, z18.h
For the hpel_filter_neon function it's hard to know without understanding the underlying algorithm but it doesn't appear obvious that there is much we can do here.
The arithmetic is using widening instructions so we could consider trying to make use of the dot-product instructions here, making use of TBL instructions rather than the EXT instructions that are there currently to reorder the data into a layout that makes the use of dot-product instructions more viable. I don't think I understand the background behind the permute that is being performed by the EXT instructions enough here to comment further on this one I'm afraid.
For the sub8x8_dct8_neon function we have quite a few SSHR following the SUB instruction in the SUMSUB_AB macro. It seems like we could make use of the halving-subtract instruction here to do the same thing in a single instruction:
https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/SHSUB--Signed-Halving-Subtract-?lang=en
I think the same is also true for the SSHR #1 in the second use of the SUMSUB_SHR macro?
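For illustration, taking the s16/d16 pair from the macro, the subtraction and the following shift could in principle be fused like this (untested, and only valid if the unshifted difference is not needed anywhere else):

shsub   v23.8h, v1.8h, v6.8h    // v23 = (v1 - v6) >> 1 in a single instruction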
Beyond that I don't think there is much we can do here since the code is mainly just adds and subtracts and a transpose which can't really be improved at all.
Thank you very much for your answer.
For the quant_4x4x4_neon function, everything worked like a charm. Thanks!
For the hpel_filter_neon function, unfortunately I do not have any more information right now. I will check the code again and if I have more info I will contact you again.
For the sub8x8_dct8_neon function, what you proposed is very interesting. However, it seems that we need both the normal subtraction output and the subtraction output right shifted by one. Below, I am listing again the DCT8_1D macro after substituting the internally used macros with the real instructions:
.macro DCT8_1D_SVE type add v18.8h, v3.8h, v4.8h sub v17.8h, v3.8h, v4.8h add v19.8h, v2.8h, v5.8h sub v16.8h, v2.8h, v5.8h add v22.8h, v1.8h, v6.8h sub v21.8h, v1.8h, v6.8h add v23.8h, v0.8h, v7.8h sub v20.8h, v0.8h, v7.8h add v24.8h, v23.8h, v18.8h sub v26.8h, v23.8h, v18.8h add v25.8h, v22.8h, v19.8h sub v27.8h, v22.8h, v19.8h add v30.8h, v20.8h, v17.8h sub v29.8h, v20.8h, v17.8h sshr v23.8h, v21.8h, #1 sshr v18.8h, v16.8h, #1 add v23.8h, v23.8h, v21.8h add v18.8h, v18.8h, v16.8h sub v30.8h, v30.8h, v23.8h sub v29.8h, v29.8h, v18.8h add v28.8h, v21.8h, v16.8h sub v31.8h, v21.8h, v16.8h sshr v22.8h, v20.8h, #1 sshr v19.8h, v17.8h, #1 add v22.8h, v22.8h, v20.8h add v19.8h, v19.8h, v17.8h add v22.8h, v28.8h, v22.8h add v31.8h, v31.8h, v19.8h add v0.8h, v24.8h, v25.8h sub v4.8h, v24.8h, v25.8h sshr v16.8h, v31.8h, #2 sshr v17.8h, v22.8h, #2 add v1.8h, v22.8h, v16.8h sub v7.8h, v17.8h, v31.8h sshr v18.8h, v27.8h, #1 sshr v19.8h, v26.8h, #1 add v2.8h, v26.8h, v18.8h sub v6.8h, v19.8h, v27.8h sshr v20.8h, v30.8h, #2 sshr v21.8h, v29.8h, #2 add v3.8h, v20.8h, v29.8h sub v5.8h, v30.8h, v21.8h .endm
For example, we subtract v6.8h from v1.8h and place the result in v21.8h. Then, we right shift v21.8h by 1 and place the result in v23.8h, as later on we need both v21.8h and v23.8h. So, I do not think we can use shsub as we will lose v21.8h. Am I missing something?
Also, I have a couple more functions to improve. More specifically:
function mc_copy_w16_neon, export=1
    lsl     x1, x1, #1
    lsl     x3, x3, #1
1:
    subs    w4, w4, #4
    ld1     {v0.8h, v1.8h}, [x2], x3
    ld1     {v2.8h, v3.8h}, [x2], x3
    ld1     {v4.8h, v5.8h}, [x2], x3
    ld1     {v6.8h, v7.8h}, [x2], x3
    st1     {v0.8h, v1.8h}, [x0], x1
    st1     {v2.8h, v3.8h}, [x0], x1
    st1     {v4.8h, v5.8h}, [x0], x1
    st1     {v6.8h, v7.8h}, [x0], x1
    b.gt    1b
    ret
endfunc
function memcpy_aligned_neon, export=1
    tst     x2, #16
    b.eq    32f
    sub     x2, x2, #16
    ldr     q0, [x1], #16
    str     q0, [x0], #16
32:
    tst     x2, #32
    b.eq    640f
    sub     x2, x2, #32
    ldp     q0, q1, [x1], #32
    stp     q0, q1, [x0], #32
640:
    cbz     x2, 1f
64:
    subs    x2, x2, #64
    ldp     q0, q1, [x1, #32]
    ldp     q2, q3, [x1], #64
    stp     q0, q1, [x0, #32]
    stp     q2, q3, [x0], #64
    b.gt    64b
1:
    ret
endfunc
const pw_0to15, align=5 .short 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 endconst function mbtree_propagate_list_internal_neon, export=1 movrel x11, pw_0to15 dup v31.8h, w4 // bipred_weight movi v30.8h, #0xc0, lsl #8 ld1 {v29.8h}, [x11] //h->mb.i_mb_x,h->mb.i_mb_y movi v28.4s, #4 movi v27.8h, #31 movi v26.8h, #32 dup v24.8h, w5 // mb_y zip1 v29.8h, v29.8h, v24.8h 8: subs w6, w6, #8 ld1 {v1.8h}, [x1], #16 // propagate_amount ld1 {v2.8h}, [x2], #16 // lowres_cost and v2.16b, v2.16b, v30.16b cmeq v25.8h, v2.8h, v30.8h umull v16.4s, v1.4h, v31.4h umull2 v17.4s, v1.8h, v31.8h rshrn v16.4h, v16.4s, #6 rshrn2 v16.8h, v17.4s, #6 bsl v25.16b, v16.16b, v1.16b // if( lists_used == 3 ) // propagate_amount = (propagate_amount * bipred_weight + 32) >> 6 ld1 {v4.8h,v5.8h}, [x0], #32 sshr v6.8h, v4.8h, #5 sshr v7.8h, v5.8h, #5 add v6.8h, v6.8h, v29.8h add v29.8h, v29.8h, v28.8h add v7.8h, v7.8h, v29.8h add v29.8h, v29.8h, v28.8h st1 {v6.8h,v7.8h}, [x3], #32 and v4.16b, v4.16b, v27.16b and v5.16b, v5.16b, v27.16b uzp1 v6.8h, v4.8h, v5.8h // x & 31 uzp2 v7.8h, v4.8h, v5.8h // y & 31 sub v4.8h, v26.8h, v6.8h // 32 - (x & 31) sub v5.8h, v26.8h, v7.8h // 32 - (y & 31) mul v19.8h, v6.8h, v7.8h // idx3weight = y*x; mul v18.8h, v4.8h, v7.8h // idx2weight = y*(32-x); mul v17.8h, v6.8h, v5.8h // idx1weight = (32-y)*x; mul v16.8h, v4.8h, v5.8h // idx0weight = (32-y)*(32-x) ; umull v6.4s, v19.4h, v25.4h umull2 v7.4s, v19.8h, v25.8h umull v4.4s, v18.4h, v25.4h umull2 v5.4s, v18.8h, v25.8h umull v2.4s, v17.4h, v25.4h umull2 v3.4s, v17.8h, v25.8h umull v0.4s, v16.4h, v25.4h umull2 v1.4s, v16.8h, v25.8h rshrn v19.4h, v6.4s, #10 rshrn2 v19.8h, v7.4s, #10 rshrn v18.4h, v4.4s, #10 rshrn2 v18.8h, v5.4s, #10 rshrn v17.4h, v2.4s, #10 rshrn2 v17.8h, v3.4s, #10 rshrn v16.4h, v0.4s, #10 rshrn2 v16.8h, v1.4s, #10 zip1 v0.8h, v16.8h, v17.8h zip2 v1.8h, v16.8h, v17.8h zip1 v2.8h, v18.8h, v19.8h zip2 v3.8h, v18.8h, v19.8h st1 {v0.8h,v1.8h}, [x3], #32 st1 {v2.8h,v3.8h}, [x3], #32 b.ge 8b ret endfunc
function pixel_var2_8x\h\()_neon, export=1 mov x3, #16 ld1 {v16.8b}, [x0], #8 ld1 {v18.8b}, [x1], x3 ld1 {v17.8b}, [x0], #8 ld1 {v19.8b}, [x1], x3 mov x5, \h - 2 usubl v0.8h, v16.8b, v18.8b usubl v1.8h, v17.8b, v19.8b ld1 {v16.8b}, [x0], #8 ld1 {v18.8b}, [x1], x3 smull v2.4s, v0.4h, v0.4h smull2 v3.4s, v0.8h, v0.8h smull v4.4s, v1.4h, v1.4h smull2 v5.4s, v1.8h, v1.8h usubl v6.8h, v16.8b, v18.8b 1: subs x5, x5, #1 ld1 {v17.8b}, [x0], #8 ld1 {v19.8b}, [x1], x3 smlal v2.4s, v6.4h, v6.4h smlal2 v3.4s, v6.8h, v6.8h usubl v7.8h, v17.8b, v19.8b add v0.8h, v0.8h, v6.8h ld1 {v16.8b}, [x0], #8 ld1 {v18.8b}, [x1], x3 smlal v4.4s, v7.4h, v7.4h smlal2 v5.4s, v7.8h, v7.8h usubl v6.8h, v16.8b, v18.8b add v1.8h, v1.8h, v7.8h b.gt 1b ld1 {v17.8b}, [x0], #8 ld1 {v19.8b}, [x1], x3 smlal v2.4s, v6.4h, v6.4h smlal2 v3.4s, v6.8h, v6.8h usubl v7.8h, v17.8b, v19.8b add v0.8h, v0.8h, v6.8h smlal v4.4s, v7.4h, v7.4h add v1.8h, v1.8h, v7.8h smlal2 v5.4s, v7.8h, v7.8h saddlv s0, v0.8h saddlv s1, v1.8h add v2.4s, v2.4s, v3.4s add v4.4s, v4.4s, v5.4s mov w0, v0.s[0] mov w1, v1.s[0] addv s2, v2.4s addv s4, v4.4s mul w0, w0, w0 mul w1, w1, w1 mov w3, v2.s[0] mov w4, v4.s[0] sub w0, w3, w0, lsr # 6 + (\h >> 4) sub w1, w4, w1, lsr # 6 + (\h >> 4) str w3, [x2] add w0, w0, w1 str w4, [x2, #4] ret endfunc
function pixel_sad_x_h\()_neon_10, export=1 mov x7, #16 lsl x5, x5, #1 lsl x7, x7, #1 ld1 {v0.8h, v1.8h}, [x0], x7 ld1 {v2.8h, v3.8h}, [x1], x5 ld1 {v4.8h, v5.8h}, [x2], x5 uabd v16.8h, v2.8h, v0.8h uabd v20.8h, v3.8h, v1.8h ld1 {v24.8h, v25.8h}, [x3], x5 uabd v17.8h, v4.8h, v0.8h uabd v21.8h, v5.8h, v1.8h ld1 {v6.8h, v7.8h}, [x0], x7 ld1 {v2.8h, v3.8h}, [x1], x5 uabd v18.8h, v24.8h, v0.8h uabd v22.8h, v25.8h, v1.8h ld1 {v4.8h, v5.8h}, [x2], x5 uaba v16.8h, v2.8h, v6.8h uaba v20.8h, v3.8h, v7.8h ld1 {v24.8h, v25.8h}, [x3], x5 uaba v17.8h, v4.8h, v6.8h uaba v21.8h, v5.8h, v7.8h ld1 {v26.8h, v27.8h}, [x4], x5 ld1 {v28.8h, v29.8h}, [x4], x5 uaba v18.8h, v24.8h, v6.8h uaba v22.8h, v25.8h, v7.8h uabd v19.8h, v26.8h, v0.8h uabd v23.8h, v27.8h, v1.8h uaba v19.8h, v28.8h, v6.8h uaba v23.8h, v29.8h, v7.8h .rept \h / 2 - 1 ld1 {v0.8h, v1.8h}, [x0], x7 ld1 {v2.8h, v3.8h}, [x1], x5 ld1 {v4.8h, v5.8h}, [x2], x5 uaba v16.8h, v2.8h, v0.8h uaba v20.8h, v3.8h, v1.8h ld1 {v24.8h, v25.8h}, [x3], x5 uaba v17.8h, v4.8h, v0.8h uaba v21.8h, v5.8h, v1.8h ld1 {v6.8h, v7.8h}, [x0], x7 ld1 {v2.8h, v3.8h}, [x1], x5 uaba v18.8h, v24.8h, v0.8h uaba v22.8h, v25.8h, v1.8h ld1 {v4.8h, v5.8h}, [x2], x5 uaba v16.8h, v2.8h, v6.8h uaba v20.8h, v3.8h, v7.8h ld1 {v24.8h, v25.8h}, [x3], x5 uaba v17.8h, v4.8h, v6.8h uaba v21.8h, v5.8h, v7.8h ld1 {v26.8h, v27.8h}, [x4], x5 ld1 {v28.8h, v29.8h}, [x4], x5 uaba v18.8h, v24.8h, v6.8h uaba v22.8h, v25.8h, v7.8h uaba v19.8h, v26.8h, v0.8h uaba v23.8h, v27.8h, v1.8h uaba v19.8h, v28.8h, v6.8h uaba v23.8h, v29.8h, v7.8h .endr add v16.8h, v16.8h, v20.8h add v17.8h, v17.8h, v21.8h add v18.8h, v18.8h, v22.8h add v19.8h, v19.8h, v23.8h // add up the sads uaddlv s0, v16.8h uaddlv s1, v17.8h uaddlv s2, v18.8h stp s0, s1, [x6], #8 uaddlv s3, v19.8h stp s2, s3, [x6] ret endfunc
For the latter, I tried to use the udot approach, but the performance is degraded. Any thoughts?
You do not have to provide me full functions, just some hints. Sorry for the large amount of help that I am asking. It seems that this forum is the only help I can get.
Good to hear that my suggestions for quant_4x4x4_neon worked as we expected!
For sub8x8_dct8_neon I wonder if we can still remove some of the shift instructions by combining them with the subsequent addition, using the SSRA instruction to perform a shift and an addition in a single instruction:
https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/SSRA--Signed-Shift-Right-and-Accumulate--immediate--
sshr    v23.8h, v21.8h, #1
add     v23.8h, v23.8h, v21.8h
could become:

ssra    v21.8h, v21.8h, #1      // v21.8h += v21.8h >> 1
This has the disadvantage that we must reuse the same register as the non-shifted addend, so it does not work if we need v21 elsewhere later: in your snippet I think that v21 is used in an ADD and SUB after the SSHR+ADD pair. However, I suspect that some of this can be solved by re-ordering the code so that the ADD/SUB are done first and the register can then be reused for the SSRA.
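As a rough illustration of that reordering (untested; register numbers taken from the DCT8_1D code above), the a4/a7 sum and difference that still need the original v21/v16 could be computed first, after which those registers are free to be clobbered by SSRA:

add     v28.8h, v21.8h, v16.8h      // a4 (uses the original v21/v16)
sub     v31.8h, v21.8h, v16.8h      // a7
ssra    v21.8h, v21.8h, #1          // v21 += v21 >> 1 (replaces SSHR + ADD)
ssra    v16.8h, v16.8h, #1          // v16 += v16 >> 1
sub     v30.8h, v30.8h, v21.8h
sub     v29.8h, v29.8h, v16.8h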
For the copy functions like mc_copy_w16_neon and memcpy_aligned_neon there is probably no benefit from SVE at the same vector length as Neon. One small optimisation you could consider is maintaining multiple independent source and destination addresses (e.g. x0, x0+x1, x0+2*x1, x0+3*x1) and incrementing them independently (e.g. by x1*4), since currently in mc_copy_w16_neon, for instance, the x0 and x2 addresses must be updated four times per loop iteration, which could be slow; a rough sketch of this idea is shown below. I don't expect it would have a big impact on performance though.
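As a very rough, untested sketch of that idea with two rows' worth of independent pointers (the function name is just for illustration, and x5/x6 are assumed to be free scratch registers here):

function mc_copy_w16_neon_2rows, export=1
    lsl     x1, x1, #1              // strides in bytes for 16-bit pixels, as before
    lsl     x3, x3, #1
    add     x5, x0, x1              // second destination row
    add     x6, x2, x3              // second source row
    lsl     x1, x1, #1              // each pointer now advances two rows at a time
    lsl     x3, x3, #1
1:
    subs    w4, w4, #4
    ld1     {v0.8h, v1.8h}, [x2], x3
    ld1     {v2.8h, v3.8h}, [x6], x3
    ld1     {v4.8h, v5.8h}, [x2], x3
    ld1     {v6.8h, v7.8h}, [x6], x3
    st1     {v0.8h, v1.8h}, [x0], x1
    st1     {v2.8h, v3.8h}, [x5], x1
    st1     {v4.8h, v5.8h}, [x0], x1
    st1     {v6.8h, v7.8h}, [x5], x1
    b.gt    1b
    ret
endfunc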
For mbtree_propagate_list_internal_neon I wonder if we can also use the SSRA instruction here as well? We currently have e.g.
sshr    v6.8h, v4.8h, #5
add     v6.8h, v6.8h, v29.8h
Which could instead be:
ssra    v29.8h, v4.8h, #5
I guess that doesn't work so well in this case because v28 and v29 are needed for the next iteration of the loop, but even a MOV to duplicate them into another register may still be better, since the constants will not be on the critical path of the calculation.
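For example, something like this (untested) keeps v29 intact for the next iteration while still folding the shift and add into one accumulate:

mov     v6.16b, v29.16b         // copy the running coordinate vector
ssra    v6.8h, v4.8h, #5        // v6 = v29 + (v4 >> 5)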
The UZP1/UZP2 and later ZIP1/ZIP2 instructions in the loop feel strange, since the ZIP1/ZIP2 will undo the effect of the earlier UZP1/UZP2 instructions? Perhaps the other operand (v25) can be adjusted so that both pairs of permutes can either be removed or at least replaced with a single permute to swap pairs of lanes so that they can interact with each other (REV32.8H?).
Finally, I suspect it doesn't work in this case but I am mentioning it just in case it could be useful: for the UMULL+RSHRN pairs we could consider trying to replace those with something like the SVE UMULH instruction mentioned earlier.
The problems I suspect that we'll encounter trying to use UMULH here are (a) that the shift is a rounding shift which means we cannot usually just take the top half of the multiplication result, and (b) the shift value is only 10 rather than 16. The shift value might not be a problem if you could instead adjust the operands and multiply by (v25 << 6) instead, but that might not be possible depending on the range of possible values for that multiplicand.
For pixel_var2_8x\h\()_neon I would assume that we could replace the USUBL+SMULL/SMLAL pairs with UABD+UDOT as we have done previously. Since we have one contiguous array it may also be worth loading full vectors of data here rather than only using half a vector at a time, e.g.
ld1     {v16.8b}, [x0], #8
ld1     {v18.8b}, [x1], x3
ld1     {v17.8b}, [x0], #8
ld1     {v19.8b}, [x1], x3
Could be something like:
ld1     {v16.16b}, [x0], #16    // Merged from v16 and v17.
ld1     {v18.8b}, [x1], x3
ld1     {v18.d}[1], [x1], x3    // Load into high half of v18, not v19.
For pixel_sad_x_h\()_neon_10 I agree with your conclusion. I don't think that there will be much benefit from dot product here since there is never a widening operation, so the UABA instruction is able to operate on full vectors rather than on only half of a vector like in some of our previous examples where we have used UMLAL or UABAL etc.
For sub8x8_dct8_neon, I applied your suggestion and everything worked fine. Thanks!
For the copy functions, as you said, there is not much left to do to improve the performance.
For mbtree_propagate_list_internal_neon, I applied your suggestion. Thanks!
For pixel_var2_8x\h\()_neon, I used the udot instruction, but it doesn't work. It seems that some vectors (for example v0.8h, v1.8h, v6.8h and v7.8h) are still needed after widening instructions. I developed the following function:
function pixel_var2_8x\h\()_sve, export=1 movi v30.4s, #0 movi v31.4s, #0 mov x3, #16 ld1 {v16.8b}, [x0], #8 ld1 {v18.8b}, [x1], x3 ld1 {v17.8b}, [x0], #8 ld1 {v19.8b}, [x1], x3 mov x5, \h - 2 uabd v28.8b, v16.8b, v18.8b usubl v0.8h, v16.8b, v18.8b uabd v29.8b, v17.8b, v19.8b usubl v1.8h, v17.8b, v19.8b ld1 {v16.8b}, [x0], #8 ld1 {v18.8b}, [x1], x3 udot v30.2s, v28.8b, v28.8b udot v31.2s, v29.8b, v29.8b uabd v28.8b, v16.8b, v18.8b usubl v6.8h, v16.8b, v18.8b 1: subs x5, x5, #1 ld1 {v17.8b}, [x0], #8 ld1 {v19.8b}, [x1], x3 udot v30.2s, v28.8b, v28.8b uabd v29.8b, v17.8b, v19.8b usubl v7.8h, v17.8b, v19.8b add v0.8h, v0.8h, v6.8h ld1 {v16.8b}, [x0], #8 ld1 {v18.8b}, [x1], x3 udot v31.2s, v29.8b, v29.8b uabd v28.8b, v16.8b, v18.8b usubl v6.8h, v16.8b, v18.8b add v1.8h, v1.8h, v7.8h b.gt 1b ld1 {v17.8b}, [x0], #8 ld1 {v19.8b}, [x1], x3 udot v30.2s, v6.8b, v6.8b uabd v29.8b, v17.8b, v19.8b usubl v7.8h, v17.8b, v19.8b add v0.8h, v0.8h, v6.8h udot v31.2s, v29.8b, v29.8b add v1.8h, v1.8h, v7.8h saddlv s0, v0.8h saddlv s1, v1.8h mov w0, v0.s[0] mov w1, v1.s[0] addv s2, v30.4s addv s4, v31.4s mul w0, w0, w0 mul w1, w1, w1 mov w3, v30.s[0] mov w4, v31.s[0] sub w0, w3, w0, lsr # 6 + (\h >> 4) sub w1, w4, w1, lsr # 6 + (\h >> 4) str w3, [x2] add w0, w0, w1 str w4, [x2, #4] ret endfunc
Unit tests fail. Can you please tell me what I am doing wrong? Also, using the three merged load instructions instead of the initial four degrades the performance. I do not know why.
For pixel_sad_x_h\()_neon_10, I also agree that we cannot improve it.
It's a bit hard for me to try and debug the whole code snippet. One thing I did notice though is that at the end of the function you reduce v30 and v31 as such:
addv    s2, v30.4s
addv    s4, v31.4s
...
mov     w3, v30.s[0]            // Should this be v2.s[0] ?
mov     w4, v31.s[0]            // Should this be v4.s[0] ?
This seems suspicious since s2 and s4 are otherwise never used after those instructions.
With regard to still needing the USUBL: do you know if either the absolute difference (UABD) or a non-widening subtract (SUB) would work here instead? If so, then we can potentially use only one of those, since the UABD and USUBL are doing very similar things at the moment. Assuming that a non-widening approach works here, you could then sum the results with UADDW or with another UDOT instruction with all-1s as the other operand.
For example, instead of:
uabd    v28.8b, v16.8b, v18.8b
usubl   v6.8h, v16.8b, v18.8b
udot    v30.2s, v28.8b, v28.8b
add     v0.8h, v0.8h, v6.8h
We could see if something like this would work instead:
uabd    v28.8b, v16.8b, v18.8b  // or SUB?
udot    v30.2s, v28.8b, v28.8b
uaddw   v0.8h, v0.8h, v28.8b
Using the dot product would also work here if we need to widen beyond a 16-bit accumulator for v0, since it allows us to accumulate in 32 bits by multiplying by a vector of all-1s:
movi    v6.16b, #1
...
uabd    v28.8b, v16.8b, v18.8b  // or SUB?
udot    v30.2s, v28.8b, v28.8b
udot    v0.2s, v28.8b, v6.8b    // v28.8b * 1
If an approach like that works then at that point it may be beneficial to re-try the three-load approach, since the entire computation can be moved from .8b to .16b, which could be more significant than in your previous attempt.
After using the mov instructions you proposed, everything worked fine! Thanks!
Unfortunately, after some testing, I can use neither sub nor uabd; the unit tests fail again. So I cannot use the three-load approach either. But your proposed solution is very interesting and it may help me optimize other functions. Thanks!
I think we can close this thread. You gave me a lot of help. I couldn't reach up to this point without your help. Once again, thanks!
If I need further help, I will create a new thread (I hope that this is OK).