If the elements to be processed are contiguous in memory, I can load them all at once with pld, and in such cases I get a performance win of about 3-4 times. But if the elements are located at random places in memory (for example, when I do bilinear interpolation), there are no NEON intrinsics that load them quickly. I have to place the elements into an array manually, one by one (with C code), and then pld this array with NEON, or use vgetq_lane_, vsetq_lane. I think these actions take most of the time. In these cases I get no performance win at all. How can I speed up such functions? What am I doing wrong?
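For illustration, the scattered-access pattern described above might look roughly like the sketch below. The function name gather_scale, the index array idx, and the scaling step are assumptions made for the example, not the actual code from this question. It also assumes that the load in question is a vector load such as vld1q_f32. On non-NEON targets it falls back to plain C so it still compiles.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the post): dst[i] = src[idx[i]] * scale.
 * The indices are arbitrary, so the source elements are scattered. */
void gather_scale(const float *src, const int *idx, size_t n,
                  float scale, float *dst)
{
    assert(n == 0 || (src && idx && dst));
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        /* Manual gather in C: four separate scalar reads into a
         * small contiguous buffer. */
        float tmp[4];
        tmp[0] = src[idx[i + 0]];
        tmp[1] = src[idx[i + 1]];
        tmp[2] = src[idx[i + 2]];
        tmp[3] = src[idx[i + 3]];

        /* Only from here on is the work vectorized. */
        float32x4_t v = vld1q_f32(tmp);   /* load the gathered values  */
        v = vmulq_n_f32(v, scale);        /* one multiply for 4 lanes  */
        vst1q_f32(dst + i, v);            /* contiguous store          */
    }
#endif

    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i) {
        dst[i] = src[idx[i]] * scale;
    }
}
```

In this shape only the multiply and the store are vectorized. The four scalar reads and the round trip through tmp are still done element by element, which is the part this question is about.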