Hello,
in SVE, I can use `svcmpeq_u64` to compare two svuint64_t vectors. The result is a `svbool_t` predicate. Now, I can use `svlastb_` functions to extract the element itself. However, as I am implementing a linear search using SVE, I am not interested in the element itself (I know what I am searching for), but the _index_ of the first true match in the svbool_t returned by the comparison function.
I was not able to find any function that gives me the index of the first true element in a svbool_t. In x86 SSE, for example, I can create a movemask and count trailing zeros to index my comparison results. In NEON, I can do the same by using an emulated movemask function, but I don't know how to do it natively in NEON. How would I do this on SVE?
Thank you so much for your help.
Best,
Maxbit
Dear George,
thank you so much. I was not aware of the index instructions, that is indeed a very clever way to handle this problem. It works great for my use case. Thank you so, so much!
I have two follow-up questions on this.
a) Do you believe there is any performance difference between svbrka_b_z and svpfirst_b? Probably it is very minor anyway, but I was just wondering whether I should benchmark this thoroughly.
b) Of course, I need to provide a NEON fallback implementation for finding the first match. Currently, my NEON code looks like this:/* The following two functions implement the index of first match calculation */ // adapted from https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h HEDLEY_ALWAYS_INLINE static uint32_t movemask_sse2neon(uint8x16_t input) { uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7)); uint32x4_t paired16 = vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7)); uint64x2_t paired32 = vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14)); uint8x16_t paired64 = vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28)); return vgetq_lane_u8(paired64, 0) | ((int)vgetq_lane_u8(paired64, 8) << 8); } HEDLEY_ALWAYS_INLINE static uint32_t non_sve_vector_first_index(vector_type vector) { uint32_t mask = movemask_sse2neon(to_uint8_vec(vector)); // __builtin_ctz does not work with all zero mask if (mask == 0) { return INVALID_IDX; } return __builtin_ctz(mask) / static_cast<uint32_t>(sizeof(KeyT)); } /* The next function implements the any_found check */ HEDLEY_ALWAYS_INLINE static bool non_sve_vector_any_nonzero(vector_type input) { // Idea: 1. Reinterpret vector_type as uint64x2_t (if necessary) // 2. Our 128-bit register now contains two 64-bit uints. Only if both of them are zero, there is no zero bit uint64x2_t reinterpreted; if constexpr (std::is_same_v<_vector_type, uint64x2_t>) { reinterpreted = input; } else { if constexpr (std::is_same_v<_vector_type, uint8x16_t>) { reinterpreted = vreinterpretq_u64_u8(input); } else if constexpr (std::is_same_v<_vector_type, uint16x8_t>) { reinterpreted = vreinterpretq_u64_u16(input); } else if constexpr (std::is_same_v<_vector_type, uint32x4_t>) { reinterpreted = vreinterpretq_u64_u32(input); } else { FAIL("Unsupported vector type in conversion"); } } return vgetq_lane_u64(reinterpreted, 0) | vgetq_lane_u64(reinterpreted, 1); }
/* The following two functions implement the index of first match calculation */ // adapted from https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h HEDLEY_ALWAYS_INLINE static uint32_t movemask_sse2neon(uint8x16_t input) { uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7)); uint32x4_t paired16 = vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7)); uint64x2_t paired32 = vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14)); uint8x16_t paired64 = vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28)); return vgetq_lane_u8(paired64, 0) | ((int)vgetq_lane_u8(paired64, 8) << 8); } HEDLEY_ALWAYS_INLINE static uint32_t non_sve_vector_first_index(vector_type vector) { uint32_t mask = movemask_sse2neon(to_uint8_vec(vector)); // __builtin_ctz does not work with all zero mask if (mask == 0) { return INVALID_IDX; } return __builtin_ctz(mask) / static_cast<uint32_t>(sizeof(KeyT)); } /* The next function implements the any_found check */ HEDLEY_ALWAYS_INLINE static bool non_sve_vector_any_nonzero(vector_type input) { // Idea: 1. Reinterpret vector_type as uint64x2_t (if necessary) // 2. Our 128-bit register now contains two 64-bit uints. Only if both of them are zero, there is no zero bit uint64x2_t reinterpreted; if constexpr (std::is_same_v<_vector_type, uint64x2_t>) { reinterpreted = input; } else { if constexpr (std::is_same_v<_vector_type, uint8x16_t>) { reinterpreted = vreinterpretq_u64_u8(input); } else if constexpr (std::is_same_v<_vector_type, uint16x8_t>) { reinterpreted = vreinterpretq_u64_u16(input); } else if constexpr (std::is_same_v<_vector_type, uint32x4_t>) { reinterpreted = vreinterpretq_u64_u32(input); } else { FAIL("Unsupported vector type in conversion"); } } return vgetq_lane_u64(reinterpreted, 0) | vgetq_lane_u64(reinterpreted, 1); }
As you can see, it is quite complicated and has overhead, especially because I did not know how to handle the index calculation efficiently in NEON. On x86 SEE, again, I can use the TZCNT and movemask intrinsics, but I could not find equivalents for NEON. Do you also have a clever idea on how to make this NEON code more efficient? It should basically achieve the same result as the SVE code we discussed above, i.e., we have a vector that is result of a comparison, e.g., by `vceqq_u64`, and then we implement a test function to check whether there was any match and a function that gets the first index of the match.
Thank you so, so much for your help. I am quite new to ARM SIMD programming and just getting the hang of the differences compared to SSE/AVX.
Hi Maxbit,
(a) For comparing different choices of instruction sequences you can refer to the software optimisation guides available on developer.arm.com, for example for Neoverse V1: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ . On Neoverse V1 at least it appears that BRKA and PFIRST instructions have identical latency and throughput, so which one you use probably doesn't matter, but I would suggest checking the guide for the exact core you are interested in.
(b) For the Neon equivalent as you point out there is no nice sequence here. The fastest sequence will probably depend both on how frequently the INVALID_IDX path is taken compared to an actual index being found, as well as the particular core you are interested in, so I'd recomment benchmarking a couple of different alternatives yourself to see which one works better for your use case.
For the code itself: comparisons in Neon like `vceqq` will set matching elements to all-ones, so you can take advantage of that in a couple of ways:
Since Neon vectors are only a pair of uint64_t elements wide, you might actually find it quicker to extract out the pair of elements and just call e.g. __builtin_ctz twice. In general the scalar instructions have lower overheads for such short sequences.
I appreciate that's a bit of a vague answer. Let me know if you run into any other problems and I can try and throw a more complete example together.
Thank you so much again for your help and please excuse my late answer. I am currently implementing several variants that you suggested in order to benchmark them. I have one question on the UMINV approach, maybe I am overseeing something.
The comparison result vector will be a vector of either all 1-bit or all 0-bit entries. If I element-wise AND this vector with our {1,2,3,..} index vector, the smallest non-zero number that remains is the index we are looking for. However, the AND will turn indices where there was no match into 0, and as 0 will be smaller than any of our other indices, UMINV will always return 0. Or am I getting something wrong here? We would need an instruction for the smallest non-zero value, wouldn't we?
Thank you for clearing that up.
You are correct, the UMINV approach doesn't work as I imagined. You could try instead ANDing it with the bitwise negation of the index (which is a large unsigned number) and use UMAXV instead. Something like:
#include <arm_neon.h> #include <stdio.h> int main() { uint32x4_t x = {0, 0, ~0U, ~0U}; uint32x4_t y = {~0U, ~1U, ~2U, ~3U}; uint32_t idx = ~vmaxvq_u32(vandq_u32(x, y)); printf("idx = %u\n", idx); }
Thanks,George