Hello,
in SVE, I can use `svcmpeq_u64` to compare two svuint64_t vectors. The result is a `svbool_t` predicate. Now, I can use `svlastb_` functions to extract the element itself. However, as I am implementing a linear search using SVE, I am not interested in the element itself (I know what I am searching for), but the _index_ of the first true match in the svbool_t returned by the comparison function.
I was not able to find any function that gives me the index of the first true element in a svbool_t. In x86 SSE, for example, I can create a movemask and count trailing zeros to index my comparison results. In NEON, I can do the same by using an emulated movemask function, but I don't know how to do it natively in NEON. How would I do this on SVE?
Thank you so much for your help.
Best,
Maxbit
Hi Maxbit,
(a) For comparing different choices of instruction sequences you can refer to the software optimisation guides available on developer.arm.com, for example for Neoverse V1: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ . On Neoverse V1 at least it appears that BRKA and PFIRST instructions have identical latency and throughput, so which one you use probably doesn't matter, but I would suggest checking the guide for the exact core you are interested in.
(b) For the Neon equivalent as you point out there is no nice sequence here. The fastest sequence will probably depend both on how frequently the INVALID_IDX path is taken compared to an actual index being found, as well as the particular core you are interested in, so I'd recomment benchmarking a couple of different alternatives yourself to see which one works better for your use case.
For the code itself: comparisons in Neon like `vceqq` will set matching elements to all-ones, so you can take advantage of that in a couple of ways:
Since Neon vectors are only a pair of uint64_t elements wide, you might actually find it quicker to extract out the pair of elements and just call e.g. __builtin_ctz twice. In general the scalar instructions have lower overheads for such short sequences.
I appreciate that's a bit of a vague answer. Let me know if you run into any other problems and I can try and throw a more complete example together.
Dear George,
Thank you so much again for your help and please excuse my late answer. I am currently implementing several variants that you suggested in order to benchmark them. I have one question on the UMINV approach, maybe I am overseeing something.
The comparison result vector will be a vector of either all 1-bit or all 0-bit entries. If I element-wise AND this vector with our {1,2,3,..} index vector, the smallest non-zero number that remains is the index we are looking for. However, the AND will turn indices where there was no match into 0, and as 0 will be smaller than any of our other indices, UMINV will always return 0. Or am I getting something wrong here? We would need an instruction for the smallest non-zero value, wouldn't we?
Thank you for clearing that up.
You are correct, the UMINV approach doesn't work as I imagined. You could try instead ANDing it with the bitwise negation of the index (which is a large unsigned number) and use UMAXV instead. Something like:
#include <arm_neon.h> #include <stdio.h> int main() { uint32x4_t x = {0, 0, ~0U, ~0U}; uint32x4_t y = {~0U, ~1U, ~2U, ~3U}; uint32_t idx = ~vmaxvq_u32(vandq_u32(x, y)); printf("idx = %u\n", idx); }
Thanks,George