In a project focused on accelerating performance on ARM, I am using the _mm_shuffle_epi8 implementation from this page: https://github.com/f4exb/cm256cc/blob/master/sse2neon.h#L981
However, that implementation is suboptimal and is costing us performance.
Is there a proper ARM equivalent of _mm_shuffle_epi8?
Do you have a particular byte shuffle you need, or do you need the full generality?
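If the full generality is needed, here is a minimal sketch of one way to get it on AArch64, not a drop-in replacement for the linked header. It assumes a 64-bit target, where the `vqtbl1q_u8` table lookup is available; on 32-bit ARMv7 that intrinsic does not exist and the lookup would have to be split into two `vtbl2_u8` calls. The function names `shuffle_bytes` and `shuffle_bytes_ref` are made up for illustration. The trick is that the table lookup already returns 0 for any index of 16 or more, so masking the control bytes with 0x8F (keep bit 7 and the low nibble) should reproduce the "zero the byte when bit 7 is set" behaviour of _mm_shuffle_epi8 with one AND and one lookup. A scalar fallback is included so the snippet also builds on non-ARM hosts.

```c
#include <stdint.h>
#include <string.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference for the _mm_shuffle_epi8 semantics: if bit 7 of a
   control byte is set, the output byte is 0; otherwise the low 4 bits
   of the control byte select a byte of src. */
static void shuffle_bytes_ref(const uint8_t src[16], const uint8_t ctrl[16],
                              uint8_t out[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; ++i)
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
    memcpy(out, tmp, 16); /* copy via tmp so out may alias src */
}

/* Hypothetical helper: NEON path on AArch64, scalar path elsewhere. */
static void shuffle_bytes(const uint8_t src[16], const uint8_t ctrl[16],
                          uint8_t out[16])
{
#if defined(__aarch64__)
    uint8x16_t tbl = vld1q_u8(src);
    uint8x16_t idx = vld1q_u8(ctrl);
    /* Keep bit 7 and the low nibble. Any control byte with bit 7 set
       stays >= 128, which is out of range for the 16-byte table, so
       the lookup yields 0 for that lane. */
    idx = vandq_u8(idx, vdupq_n_u8(0x8F));
    vst1q_u8(out, vqtbl1q_u8(tbl, idx));
#else
    shuffle_bytes_ref(src, ctrl, out);
#endif
}
```

If the control bytes are known to be either valid indices 0..15 or have bit 7 set, the AND should be unnecessary and the lookup alone would do. And if only one fixed permutation is needed, a dedicated instruction (for example a byte reverse or an extract) may be cheaper than a general table lookup, which is why the question above matters.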