In a project focused on improving performance on ARM, I am using the _mm_shuffle_epi8 implementation from this page: https://github.com/f4exb/cm256cc/blob/master/sse2neon.h#L981
However, that implementation is suboptimal and is costing us performance.
Is there a proper equivalent of _mm_shuffle_epi8 for ARM?
Do you need one particular byte shuffle, or do you need the full generality of _mm_shuffle_epi8?
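If you do need the full generality, here is a minimal sketch of one way to do it, assuming an AArch64 target where the 128-bit table lookup vqtbl1q_u8 is available. The helper names (shuffle_epi8_ref, shuffle_epi8_neon, shuffle_epi8_bytes) are made up for illustration and do not come from sse2neon.h. The scalar function only spells out the x86 semantics and doubles as a fallback on other targets. I have not measured this against the linked implementation, so treat it as a starting point rather than a guaranteed speedup.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference for the _mm_shuffle_epi8 (PSHUFB) semantics:
 * a lane is zeroed when bit 7 of its index is set, otherwise the
 * low 4 bits of the index select a source byte. */
static void shuffle_epi8_ref(const uint8_t src[16], const uint8_t idx[16],
                             uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
}

#if defined(__aarch64__)
/* Hypothetical helper (name is an assumption, not an existing API).
 * TBL returns 0 for any index >= 16, so keeping only bit 7 and the
 * low 4 bits of each index reproduces the PSHUFB behaviour. */
static inline uint8x16_t shuffle_epi8_neon(uint8x16_t src, uint8x16_t idx)
{
    uint8x16_t masked = vandq_u8(idx, vdupq_n_u8(0x8F));
    return vqtbl1q_u8(src, masked);
}
#endif

/* Array-based wrapper so the same call compiles on either target. */
static void shuffle_epi8_bytes(const uint8_t src[16], const uint8_t idx[16],
                               uint8_t out[16])
{
#if defined(__aarch64__)
    vst1q_u8(out, shuffle_epi8_neon(vld1q_u8(src), vld1q_u8(idx)));
#else
    shuffle_epi8_ref(src, idx, out);
#endif
}
```

If your shuffle masks never have bits 4 to 6 set, the vandq_u8 step can be dropped and the whole operation is a single vqtbl1q_u8. That is why it matters whether you need a particular shuffle or the general one. On 32-bit ARMv7 NEON there is no 128-bit table lookup, so the shuffle would have to be built from two vtbl2_u8 calls instead.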