Best-fitting library for comparing one vector against 200k other vectors


First of all, please confirm that this is indeed a GPU task and not a SIMD one. I guess it is a GPU task:

I have a vector of 32 elements, 16 bits each.

I need to compare each of those elements to EACH of the 32 elements of another 200k (200,000) vectors like this one.
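To make the task concrete, here is a rough scalar sketch of what I mean. It assumes "compare" means testing for equality and that the result I want is a hit count per stored vector; the names `count_equal_pairs` and `compare_all` are just made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 32  /* elements per vector, 16 bits each */

/* Count the (i, j) pairs with query[i] == other[j]: 32 x 32 = 1024 checks.
   Assumption: "compare" means equality and a hit count is the wanted result. */
static unsigned count_equal_pairs(const uint16_t query[LANES],
                                  const uint16_t other[LANES])
{
    unsigned hits = 0;
    for (size_t i = 0; i < LANES; ++i)
        for (size_t j = 0; j < LANES; ++j)
            hits += (query[i] == other[j]);
    return hits;
}

/* Run the comparison against every stored vector (n would be 200,000 here).
   db holds n vectors back to back; out[v] receives the count for vector v. */
static void compare_all(const uint16_t query[LANES], const uint16_t *db,
                        size_t n, unsigned *out)
{
    for (size_t v = 0; v < n; ++v)
        out[v] = count_equal_pairs(query, db + v * LANES);
}
```

With n = 200,000 that comes to 200,000 x 1024, so about 205 million element comparisons per query vector.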

I understand that SIMD handles a single vector-to-vector comparison perfectly well, and I guess that doing it 200k times for every query might not even be that bad.

But it still seems awful to run 200k comparisons serially with SIMD, so I want to run them in parallel on the GPU.

Q1: Can each GPU core/entity do this kind of comparison the way SIMD does (compare each of the elements of vector A to a single element of vector B in parallel)?

Q2: Which library (one of the Neon ones, I guess) suits me best? I found the Compute Library, but I cannot work out what to use from it (I see but there is no more straightforward tutorial than this one that I found). How do I implement such a comparison? SIMD has vceq_s32, for example, which is straightforward. But how do I work with what is at that link? That is, of course, if this is the relevant library for me.
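For reference, this is roughly the SIMD version I have in mind, as a sketch only. It uses the 16-bit intrinsic `vceqq_u16` instead of `vceq_s32` because my elements are 16 bits wide, it assumes the elements are unsigned, and it falls back to a plain loop when NEON is not available. The function name `match_any` and the per-element 0/1 output are my own choices for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define LANES 32  /* elements per vector, 16 bits each */

/* Sets match[i] to 1 if a[i] equals any element of b, otherwise 0. */
static void match_any(const uint16_t a[LANES], const uint16_t b[LANES],
                      uint8_t match[LANES])
{
#if defined(__ARM_NEON)
    uint16x8_t va[4], acc[4];
    for (int k = 0; k < 4; ++k) {
        va[k]  = vld1q_u16(a + 8 * k);  /* 8 lanes of A per register */
        acc[k] = vdupq_n_u16(0);        /* running match mask */
    }
    for (int j = 0; j < LANES; ++j) {
        uint16x8_t vb = vdupq_n_u16(b[j]);  /* broadcast one element of B */
        for (int k = 0; k < 4; ++k)
            /* equal lanes become 0xFFFF; OR them into the running mask */
            acc[k] = vorrq_u16(acc[k], vceqq_u16(va[k], vb));
    }
    for (int k = 0; k < 4; ++k) {
        uint16_t lanes[8];
        vst1q_u16(lanes, acc[k]);
        for (int i = 0; i < 8; ++i)
            match[8 * k + i] = (uint8_t)(lanes[i] != 0);
    }
#else
    /* Portable fallback with the same result when NEON is unavailable. */
    for (int i = 0; i < LANES; ++i) {
        match[i] = 0;
        for (int j = 0; j < LANES; ++j)
            match[i] |= (uint8_t)(a[i] == b[j]);
    }
#endif
}
```

Each `vceqq_u16` call compares 8 elements of A against one broadcast element of B, so one A-versus-B comparison takes 32 x 4 = 128 compare instructions instead of 1024 scalar checks. What I am asking is whether a GPU can run something like this for all 200k vectors at once.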

Many thanks,

