In a project focussed on accelerating performance on ARM, I am using the _mm_shuffle_epi8 implementation from this page: https://github.com/f4exb/cm256cc/blob/master/sse2neon.h#L981.
However, that implementation is suboptimal and is costing us performance.
Is there a proper equivalent of _mm_shuffle_epi8 for ARM?
There isn't an exact equivalent, but the vtbl table-lookup instruction is likely the most useful starting point for doing _mm_shuffle_epi8 in Neon.
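As a rough illustration of the table-lookup approach, here is a minimal sketch, assuming an AArch64 target where the 128-bit lookup intrinsic vqtbl1q_u8 is available. The helper name shuffle_epi8_bytes and its byte-array interface are hypothetical and not taken from sse2neon.h. The lookup returns 0 for any index of 16 or more, so masking each control byte with 0x8F keeps the low four index bits and lets bit 7 force a zero, which matches the _mm_shuffle_epi8 behaviour. A scalar fallback is included only so the sketch also builds on other targets.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: applies _mm_shuffle_epi8 semantics to 16 bytes.
 * out[i] = 0 if bit 7 of mask[i] is set, else src[mask[i] & 0x0F]. */
static void shuffle_epi8_bytes(const uint8_t src[16],
                               const uint8_t mask[16],
                               uint8_t out[16])
{
#if defined(__aarch64__)
    uint8x16_t table = vld1q_u8(src);
    /* Keep bit 7 and the low 4 bits: indices >= 16 make the lookup return 0. */
    uint8x16_t index = vandq_u8(vld1q_u8(mask), vdupq_n_u8(0x8F));
    vst1q_u8(out, vqtbl1q_u8(table, index));
#else
    /* Portable reference path with the same semantics. */
    for (int i = 0; i < 16; i++)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
#endif
}
```

On AArch64 this is one AND plus one table lookup per 16 bytes. On 32-bit Arm the lookup works on 64-bit registers, so a different (longer) sequence would be needed there.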
As there isn't a direct equivalent, a completely generic version won't be as efficient, but if you only need particular shuffle patterns you can usually get to something better with instructions dedicated to those patterns.
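To make the point about particular shuffles concrete, here is a small sketch under the assumption that the shuffle in question is a byte swap within each 32-bit word, that is, a control mask of 3,2,1,0, 7,6,5,4, and so on. The pattern is chosen purely for illustration and the helper name bswap32x4_bytes is hypothetical. For this pattern Neon has a dedicated reverse intrinsic, vrev32q_u8, so no table or mask register is needed at all.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: reverses the byte order inside each 32-bit word
 * of a 16-byte block. */
static void bswap32x4_bytes(const uint8_t src[16], uint8_t out[16])
{
#if defined(__ARM_NEON)
    /* One instruction replaces the generic table lookup for this pattern. */
    vst1q_u8(out, vrev32q_u8(vld1q_u8(src)));
#else
    /* Portable reference path: mirror the byte position within each word. */
    for (int i = 0; i < 16; i++)
        out[i] = src[(i & ~3) | (3 - (i & 3))];
#endif
}
```

Other fixed patterns often map onto similar dedicated operations, such as vextq_u8 for byte rotations, so it is worth checking each mask you actually use before falling back to a table lookup.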
I always recommend the searchable list of Neon intrinsics, and this guide will also be useful.