NEON (SIMD) does not give a performance increase

Note: This was originally posted on 15th August 2011 at


I am trying to manually speed up some image processing functions.

I am using NEON intrinsics and the C language.

If the processed elements are located one after another in memory, I can load them all at once with a single vector load (vld1), and in such cases I get a speedup of about 3-4 times.
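
For illustration, the contiguous case might look like the sketch below. It is only an assumed example, not code from the original post: the function name `average_rows_u8` and the choice of operation (a rounding average of two pixel rows) are hypothetical. The scalar loop handles the tail and also serves as a fallback when the file is built without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical example: rounding average of two rows of 8-bit pixels. */
void average_rows_u8(const uint8_t *a, const uint8_t *b,
                     uint8_t *dst, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Contiguous data: one vld1q_u8 brings in 16 pixels at a time. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        /* vrhaddq_u8 computes (va + vb + 1) >> 1 per lane. */
        vst1q_u8(dst + i, vrhaddq_u8(va, vb));
    }
#endif
    /* Scalar tail, and the full loop on non-NEON builds. */
    for (; i < n; ++i)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

Here every load and store moves a full 128-bit register, which is why this kind of loop tends to vectorize well.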

But if the processed elements are located at scattered places in memory (for example, when I do bilinear interpolation), there is no NEON intrinsic that loads them quickly. I need to place the elements into an array manually, one by one (with C code), and then load this array with NEON, or use the vgetq_lane_ and vsetq_lane_ intrinsics. I think these actions take most of the time. In these cases I have no performance win at all.
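
The gather-then-load pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the poster's actual code: the helper name `bilinear_gather4`, the single-channel float image layout, and the requirement that coordinates lie in [0, width-1] x [0, height-1] are all hypothetical. NEON has no gather load, so the four neighbours of each sample are collected with scalar code, and only the blend itself is vectorized.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: bilinearly samples 4 arbitrary positions of a
   single-channel float image. The caller must keep every coordinate
   inside the image: 0 <= x <= width-1 and 0 <= y <= height-1. */
void bilinear_gather4(const float *src, int width, int height,
                      const float x[4], const float y[4], float out[4])
{
    float p00[4], p01[4], p10[4], p11[4], fx[4], fy[4];

    assert(width >= 2 && height >= 2);

    /* Scalar gather: collect the 4 neighbours of each sample one by one. */
    for (int i = 0; i < 4; ++i) {
        int x0 = (int)x[i];
        int y0 = (int)y[i];
        if (x0 > width - 2)  x0 = width - 2;   /* keep x0 + 1 in range */
        if (y0 > height - 2) y0 = height - 2;  /* keep y0 + 1 in range */
        const float *row = src + (size_t)y0 * (size_t)width + (size_t)x0;
        p00[i] = row[0];
        p01[i] = row[1];
        p10[i] = row[width];
        p11[i] = row[width + 1];
        fx[i] = x[i] - (float)x0;
        fy[i] = y[i] - (float)y0;
    }

#if defined(__ARM_NEON)
    /* Vector blend: 4 output samples computed at once. */
    float32x4_t vfx = vld1q_f32(fx), vfy = vld1q_f32(fy);
    float32x4_t a = vld1q_f32(p00), b = vld1q_f32(p01);
    float32x4_t c = vld1q_f32(p10), d = vld1q_f32(p11);
    float32x4_t top = vmlaq_f32(a, vsubq_f32(b, a), vfx); /* a + (b-a)*fx */
    float32x4_t bot = vmlaq_f32(c, vsubq_f32(d, c), vfx); /* c + (d-c)*fx */
    vst1q_f32(out, vmlaq_f32(top, vsubq_f32(bot, top), vfy));
#else
    /* Scalar fallback with the same arithmetic, for non-NEON builds. */
    for (int i = 0; i < 4; ++i) {
        float top = p00[i] + (p01[i] - p00[i]) * fx[i];
        float bot = p10[i] + (p11[i] - p10[i]) * fx[i];
        out[i] = top + (bot - top) * fy[i];
    }
#endif
}
```

In a sketch like this the vector part is only a handful of instructions, while the scalar gather touches 16 separate memory locations, which is consistent with the observation that the gathering step dominates the run time.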

How can I speed up such functions? What am I doing wrong?

Thank you.
  • Note: This was originally posted on 16th August 2011 at

    Thank you for your answer.
    Yes, I read your interpolation article before posting here and was very surprised by such a great result (12 times); my best speedup with NEON intrinsics over pure C was only 5 times. Is assembly that much better than C intrinsics?
    Which processor core did you run it on?
    It is an amazing idea to process only the y axis and transpose the image. Unfortunately, I am doing the interpolation while converting from a Cartesian to a polar coordinate system, and I do not think this idea can be used in that case.
    So with random memory access, does using NEON make no sense because of the cost of rearranging the data in memory?
