Hi,
First of all, please confirm that this is indeed a GPU task and not a SIMD one. My guess is that it belongs on the GPU:
I have a vector of 32 elements, each 16 bits wide.
I need to compare each of those elements to EACH of the 32 elements of 200,000 (200k) other vectors like this one.
I understand that SIMD handles one such vector-to-vector comparison perfectly well, and I guess repeating it 200k times might not even be that bad to compute.
But it still seems wasteful to execute 200k comparisons serially in SIMD, and I want to run them in parallel on the GPU.
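To make the problem concrete, here is a minimal scalar sketch in C of the comparison I mean. It assumes that "compare" means an equality test and that the 200k vectors are stored back to back in one array; the names count_equal_pairs and compare_all are made up for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_LEN 32 /* elements per vector, as described above */

/* Counts the (i, j) pairs with a[i] == b[j].
 * Equality is an assumption; any other predicate could be swapped in. */
static unsigned count_equal_pairs(const uint16_t a[VEC_LEN],
                                  const uint16_t b[VEC_LEN])
{
    unsigned hits = 0;
    for (size_t i = 0; i < VEC_LEN; ++i)
        for (size_t j = 0; j < VEC_LEN; ++j)
            hits += (a[i] == b[j]);
    return hits;
}

/* Compares one query vector against `count` candidate vectors stored
 * contiguously (candidates[k * VEC_LEN + j]); writes one result per candidate. */
static void compare_all(const uint16_t query[VEC_LEN],
                        const uint16_t *candidates, size_t count,
                        unsigned *out)
{
    for (size_t k = 0; k < count; ++k)
        out[k] = count_equal_pairs(query, candidates + k * VEC_LEN);
}
```

With count = 200,000 this is 200,000 x 32 x 32 = 204,800,000 element comparisons per query, and every candidate is independent of the others, which is why I think it could be spread across GPU cores.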
Q1: Can each GPU core/entity do this kind of comparison the way SIMD does, i.e. compare all the elements of vector A to a single element of vector B in parallel?
Q2: Which library (one of the Neon ones, I guess) suits me best? I found the Compute Library (https://arm-software.github.io/ComputeLibrary/latest/index.xhtml), but I cannot work out what to use from it, and I have not found a more straightforward tutorial than that page. How would I implement such a comparison? SIMD has vceq_s32, for example, which is straightforward. But how do I do the equivalent with the library at that link? Assuming this is the relevant library for me, of course.
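For reference, this is roughly what the SIMD side looks like for one pair of vectors. It is a sketch under assumptions, not a tuned implementation: it assumes an equality test, unsigned 16-bit elements, and an AArch64 target (vmaxvq_u16 is AArch64-only), with a scalar fallback elsewhere. Since my elements are 16 bits wide, it uses vceqq_u16 (eight 16-bit lanes) rather than vceq_s32. The function name match_mask is made up for illustration.

```c
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

#define VEC_LEN 32 /* elements per vector */

/* Returns a 32-bit mask: bit j is set if b[j] equals ANY element of a. */
static uint32_t match_mask(const uint16_t a[VEC_LEN],
                           const uint16_t b[VEC_LEN])
{
    uint32_t mask = 0;
#if defined(__ARM_NEON) && defined(__aarch64__)
    /* Vector A fits in four 128-bit registers of eight 16-bit lanes each. */
    uint16x8_t a0 = vld1q_u16(a);
    uint16x8_t a1 = vld1q_u16(a + 8);
    uint16x8_t a2 = vld1q_u16(a + 16);
    uint16x8_t a3 = vld1q_u16(a + 24);
    for (int j = 0; j < VEC_LEN; ++j) {
        /* Broadcast one element of B and compare it with all of A. */
        uint16x8_t bj = vdupq_n_u16(b[j]);
        uint16x8_t eq = vorrq_u16(
            vorrq_u16(vceqq_u16(a0, bj), vceqq_u16(a1, bj)),
            vorrq_u16(vceqq_u16(a2, bj), vceqq_u16(a3, bj)));
        /* Matching lanes are all-ones, so a nonzero maximum means a hit. */
        if (vmaxvq_u16(eq) != 0)
            mask |= (uint32_t)1 << j;
    }
#else
    /* Portable fallback with the same result, for non-Arm builds. */
    for (int j = 0; j < VEC_LEN; ++j)
        for (int i = 0; i < VEC_LEN; ++i)
            if (a[i] == b[j]) {
                mask |= (uint32_t)1 << j;
                break;
            }
#endif
    return mask;
}
```

What I cannot figure out is how to express the same thing with the library at the link above so that the 200k candidate vectors are processed in parallel on the GPU.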
Many thanks,
vitali.pom