I am trying to manually improve the performance of some image processing functions.

I am using NEON intrinsics and the C language.

If the processed elements are located one after another, or all in one place, I can load them all at once with pld, and in such cases I get a performance win of about 3-4 times.

But if the processed elements are located at random places in memory (for example, when I do bilinear interpolation), there are no NEON intrinsics that offer a fast way of loading them. I need to place the elements into an array manually, one by one (with C code), and then pld this array with NEON, or use vgetq_lane_, vsetq_lane. I think these actions take most of the time. In these cases I get no performance win at all.
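A minimal sketch of the pattern I mean, not my actual code. The function name `bilinear4`, the float image layout, and the scalar fallback for non-NEON builds are all assumptions made for illustration. It assumes every sample point satisfies 0 <= x < width-1 and 0 <= y < height-1.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: bilinearly sample 4 points from a float image.
 * stride is the row length in elements; xs/ys are source coordinates. */
static void bilinear4(const float *img, int stride,
                      const float xs[4], const float ys[4], float out[4])
{
    float p00[4], p01[4], p10[4], p11[4], fx[4], fy[4];

    /* Scalar gather, one element at a time: the neighbours of each point
     * live at unrelated addresses, so they cannot be fetched with a
     * single vector load. */
    for (int i = 0; i < 4; ++i) {
        int x0 = (int)xs[i];
        int y0 = (int)ys[i];
        const float *p = img + (size_t)y0 * (size_t)stride + (size_t)x0;
        p00[i] = p[0];
        p01[i] = p[1];
        p10[i] = p[stride];
        p11[i] = p[stride + 1];
        fx[i] = xs[i] - (float)x0;   /* horizontal blend weight */
        fy[i] = ys[i] - (float)y0;   /* vertical blend weight */
    }

#if HAVE_NEON
    /* Only the arithmetic is vectorised: 4 interpolations at once. */
    float32x4_t a = vld1q_f32(p00), b = vld1q_f32(p01);
    float32x4_t c = vld1q_f32(p10), d = vld1q_f32(p11);
    float32x4_t vx = vld1q_f32(fx), vy = vld1q_f32(fy);
    float32x4_t top = vmlaq_f32(a, vsubq_f32(b, a), vx); /* a + (b-a)*fx */
    float32x4_t bot = vmlaq_f32(c, vsubq_f32(d, c), vx); /* c + (d-c)*fx */
    vst1q_f32(out, vmlaq_f32(top, vsubq_f32(bot, top), vy));
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    for (int i = 0; i < 4; ++i) {
        float top = p00[i] + (p01[i] - p00[i]) * fx[i];
        float bot = p10[i] + (p11[i] - p10[i]) * fx[i];
        out[i] = top + (bot - top) * fy[i];
    }
#endif
}
```

The gather loop does 16 scalar loads and 16 scalar stores per 4 output pixels before NEON does any work, which is where I suspect the time goes.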

How can I speed up such functions? What am I doing wrong?

Thank you.
    Thank you for your answer.
    Yes, I read your article on interpolation before I posted here and was very surprised by such a great result (12 times). My maximum performance win with NEON intrinsics vs. pure C was only 5 times. Is ASM so much better than C intrinsics?
    Which processor core did you run it on?
    It is an amazing idea to process only the y axis and transpose the image. Unfortunately, I am doing interpolation while converting from a Cartesian to a polar coordinate system, and I do not think this idea can be used in this case.
    So if memory access is random, does using NEON make no sense because of the cost of relocating the data in memory?

