Hi, we are seeing poor performance on small functions translated to SIMD NEON, most likely because of instruction latency. I found a guide for the Cortex-A57 at
http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf but I did not find an equivalent for the Cortex-A7, so we applied the A57 latencies, and they matched the results we observed on the A7.
Are there any guidelines, recommendations, books, or tools that would help us make sure SIMD actually makes a difference on Cortex cores? We may have missed a step. Thanks for any tips.
PS: We used vshl.u32, vbic.32, vbic.32, vshl.u32, vsri.u32 on Q and D registers.
I would also like to add the Cortex-A8 to the scope. Any information or tips on how SIMD on that core compares to the A7 would be valuable. Thanks.