SIMD NEON optimization on Cortex-A7 or Cortex-A57

Hi, we are seeing poor performance in small functions translated to SIMD NEON, most likely because of instruction latency. I found a guide for the Cortex-A57 but could not find one for the Cortex-A7, so we applied the latencies from the A57, and they matched the results we observed on the A7.

Is there any guidance (recommendations, a guide, a book, tools) that would help us make sure SIMD actually makes a difference on these Cortex cores? We may have missed a step. Thanks for any tips.

PS: we used vshl.u32, vbic.32, vbic.32, vshl.u32, vsri.u32 on Q and D registers.