This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

SVE Instruction to get index of first true element

Hello,

in SVE, I can use `svcmpeq_u64` to compare two svuint64_t vectors. The result is a `svbool_t` predicate. Now, I can use `svlastb_` functions to extract the element itself. However, as I am implementing a linear search using SVE, I am not interested in the element itself (I know what I am searching for), but the _index_ of the first true match in the svbool_t returned by the comparison function.

I was not able to find any function that gives me the index of the first true element in a svbool_t. In x86 SSE, for example, I can create a movemask and count trailing zeros to index my comparison results. In NEON, I can do the same by using an emulated movemask function, but I don't know how to do it natively in NEON. How would I do this on SVE?

Thank you so much for your help.

Best,

Maxbit

Parents
  • Hi Maxbit,

    (a) For comparing different choices of instruction sequences you can refer to the software optimisation guides available on developer.arm.com, for example for Neoverse V1: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ . On Neoverse V1 at least it appears that BRKA and PFIRST instructions have identical latency and throughput, so which one you use probably doesn't matter, but I would suggest checking the guide for the exact core you are interested in.

    (b) For the Neon equivalent as you point out there is no nice sequence here. The fastest sequence will probably depend both on how frequently the INVALID_IDX path is taken compared to an actual index being found, as well as the particular core you are interested in, so I'd recomment benchmarking a couple of different alternatives yourself to see which one works better for your use case.

    For the code itself: comparisons in Neon like `vceqq` will set matching elements to all-ones, so you can take advantage of that in a couple of ways:

    • You can use the UMAXV instruction and then compare this against zero to see if any elements match.
    • You can use the comparison result and AND that against a hard-coded vector of {1,2,...}, then use UMINV, take one away from that to find the minimum matching index (you need to avoid zero here since the non-matching cases will already be zero). Unlike SVE you don't need an INDEX instruction here, you can just hard-code it since the vector length is known at compile time.
    • You can alternatively use BSL instead of AND to select the passing elements into a {0,1,...} vector and the others to -1, to avoid needing to subtract one at the end.

    Since Neon vectors are only a pair of uint64_t elements wide, you might actually find it quicker to extract out the pair of elements and just call e.g. __builtin_ctz twice. In general the scalar instructions have lower overheads for such short sequences.

    I appreciate that's a bit of a vague answer. Let me know if you run into any other problems and I can try and throw a more complete example together.

Reply
  • Hi Maxbit,

    (a) For comparing different choices of instruction sequences you can refer to the software optimisation guides available on developer.arm.com, for example for Neoverse V1: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ . On Neoverse V1 at least it appears that BRKA and PFIRST instructions have identical latency and throughput, so which one you use probably doesn't matter, but I would suggest checking the guide for the exact core you are interested in.

    (b) For the Neon equivalent as you point out there is no nice sequence here. The fastest sequence will probably depend both on how frequently the INVALID_IDX path is taken compared to an actual index being found, as well as the particular core you are interested in, so I'd recomment benchmarking a couple of different alternatives yourself to see which one works better for your use case.

    For the code itself: comparisons in Neon like `vceqq` will set matching elements to all-ones, so you can take advantage of that in a couple of ways:

    • You can use the UMAXV instruction and then compare this against zero to see if any elements match.
    • You can use the comparison result and AND that against a hard-coded vector of {1,2,...}, then use UMINV, take one away from that to find the minimum matching index (you need to avoid zero here since the non-matching cases will already be zero). Unlike SVE you don't need an INDEX instruction here, you can just hard-code it since the vector length is known at compile time.
    • You can alternatively use BSL instead of AND to select the passing elements into a {0,1,...} vector and the others to -1, to avoid needing to subtract one at the end.

    Since Neon vectors are only a pair of uint64_t elements wide, you might actually find it quicker to extract out the pair of elements and just call e.g. __builtin_ctz twice. In general the scalar instructions have lower overheads for such short sequences.

    I appreciate that's a bit of a vague answer. Let me know if you run into any other problems and I can try and throw a more complete example together.

Children