NEON (SIMD) does not give a performance increase

Note: This was originally posted on 15th August 2011 at


I am trying to manually improve the performance of some image processing functions.

I am using NEON intrinsics and C.

If the processed elements are contiguous in memory, I can load them all at once with NEON, and in such cases I get a speedup of about 3-4 times.

But if the processed elements are at scattered locations in memory (for example, when I do bilinear interpolation), there are no NEON intrinsics that fetch them quickly. I have to place the elements into an array manually, one by one (in C code), and then load that array with NEON, or use vgetq_lane_ and vsetq_lane. I think these steps take most of the time. In these cases I get no speedup at all.

How can I speed up such functions? What am I doing wrong?

Thank you.
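
For illustration, the gather-then-vectorize pattern described in the question might look roughly like the sketch below. It is not the poster's code: the names resample_row, gather1, blend8 and blend1, the 16.16 fixed-point positions and the 7-bit weights are all assumptions made for the example. The neighbours are fetched one by one in plain C, and only the blend arithmetic runs in NEON (with a scalar fallback so the file also builds on non-ARM targets). Because the blend is only a few operations per pixel, the scalar gather loop is likely to dominate, which would be consistent with seeing little or no speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Blend one pixel pair with a 7-bit weight f in 0..127 (rounded). */
static uint8_t blend1(uint8_t l, uint8_t r, uint8_t f)
{
    return (uint8_t)((l * (128 - f) + r * f + 64) >> 7);
}

/* Blend 8 already-gathered pixel pairs; same arithmetic as blend1. */
static void blend8(const uint8_t *l, const uint8_t *r,
                   const uint8_t *f, uint8_t *out)
{
#if defined(__ARM_NEON)
    uint8x8_t vl = vld1_u8(l);
    uint8x8_t vr = vld1_u8(r);
    uint8x8_t vf = vld1_u8(f);
    uint8x8_t vg = vsub_u8(vdup_n_u8(128), vf);   /* 128 - f */
    uint16x8_t acc = vmull_u8(vl, vg);            /* l * (128 - f) */
    acc = vmlal_u8(acc, vr, vf);                  /* + r * f */
    vst1_u8(out, vrshrn_n_u16(acc, 7));           /* rounded >> 7 */
#else
    for (int i = 0; i < 8; ++i)
        out[i] = blend1(l[i], r[i], f[i]);
#endif
}

/* Hypothetical helper: fetch both neighbours and the weight for one
   16.16 fixed-point source position, clamping at the right edge. */
static void gather1(const uint8_t *src, size_t src_w, uint32_t pos,
                    uint8_t *l, uint8_t *r, uint8_t *f)
{
    size_t ix = pos >> 16;
    if (ix >= src_w)
        ix = src_w - 1;
    size_t ix1 = (ix + 1 < src_w) ? ix + 1 : ix;
    *l = src[ix];
    *r = src[ix1];
    *f = (uint8_t)((pos & 0xFFFFu) >> 9);         /* top 7 fraction bits */
}

/* Horizontal bilinear resample of one 8-bit row.
   step is the source increment per output pixel in 16.16 fixed point. */
void resample_row(const uint8_t *src, size_t src_w,
                  uint8_t *dst, size_t dst_w, uint32_t step)
{
    assert(src_w > 0);
    uint32_t pos = 0;
    size_t x = 0;
    for (; x + 8 <= dst_w; x += 8) {
        uint8_t l[8], r[8], f[8];
        /* Scalar gather: this is the part NEON cannot do in one load. */
        for (int i = 0; i < 8; ++i, pos += step)
            gather1(src, src_w, pos, &l[i], &r[i], &f[i]);
        blend8(l, r, f, dst + x);
    }
    for (; x < dst_w; ++x, pos += step) {         /* scalar tail */
        uint8_t l, r, f;
        gather1(src, src_w, pos, &l, &r, &f);
        dst[x] = blend1(l, r, f);
    }
}
```

Calling resample_row(src, 4, dst, 8, 0x8000) would upsample a 4-pixel row by a factor of two. When the step is constant, as it is here, the source indices advance monotonically, so one possible alternative is to load contiguous source blocks and rearrange them in registers instead of gathering element by element.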
  • Note: This was originally posted on 16th August 2011 at

    Hi Etienne,

    Nice blog post on resampling. One random idea on the implementation:

    You currently run two passes over the image, one for the Y up-sample and one for the X up-sample. Given that data loading is a significant cost in the algorithm (especially in real cases where the image is larger than the cache), is there any way you can split the image into small "tiles" that fit into registers or (more realistically) cache, do both the X and Y up-sample on a single tile, and then move on to the next tile?

    This fits with the "only touch main memory once" design policy for most media processing codecs.
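
As a rough sketch of the tiling idea above, the portable C below fuses the X and Y passes of a 2x up-sample so that the intermediate result lives only in two small buffers. It is not the blog post's implementation: the function names upsample2x_tiled and hup2, the TILE_W value, and the simple average-of-neighbours filter are all assumptions chosen to keep the example short. Each column tile is walked from top to bottom, so apart from the one-pixel neighbour at a tile edge every source pixel is read once and every output pixel is written once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE_W 64   /* source pixels per tile; hypothetical, tune to the cache */

/* Horizontal 2x up-sample of n source pixels starting at column x0.
   Even outputs copy the source, odd outputs average with the right
   neighbour (clamped at the right image edge). */
static void hup2(const uint8_t *row, size_t w, size_t x0, size_t n,
                 uint8_t *out)
{
    for (size_t i = 0; i < n; ++i) {
        size_t x = x0 + i;
        size_t x1 = (x + 1 < w) ? x + 1 : w - 1;
        out[2 * i] = row[x];
        out[2 * i + 1] = (uint8_t)((row[x] + row[x1] + 1) >> 1);
    }
}

/* Fused 2x up-sample: src is w*h bytes, dst is (2*w)*(2*h) bytes. */
void upsample2x_tiled(const uint8_t *src, size_t w, size_t h, uint8_t *dst)
{
    assert(w > 0 && h > 0);
    uint8_t buf_a[2 * TILE_W], buf_b[2 * TILE_W];
    const size_t dw = 2 * w;                      /* output row length */

    for (size_t x0 = 0; x0 < w; x0 += TILE_W) {
        size_t n = (w - x0 < TILE_W) ? w - x0 : TILE_W;
        uint8_t *cur = buf_a, *next = buf_b;

        hup2(src, w, x0, n, cur);                 /* X pass for row 0 */
        for (size_t y = 0; y < h; ++y) {
            const uint8_t *below = cur;           /* clamp at bottom edge */
            if (y + 1 < h) {
                hup2(src + (y + 1) * w, w, x0, n, next);
                below = next;
            }
            /* Y pass straight from the small buffers into the output. */
            uint8_t *o0 = dst + (2 * y) * dw + 2 * x0;
            uint8_t *o1 = o0 + dw;
            for (size_t i = 0; i < 2 * n; ++i) {
                o0[i] = cur[i];
                o1[i] = (uint8_t)((cur[i] + below[i] + 1) >> 1);
            }
            /* The row below becomes the current row of the next step. */
            uint8_t *tmp = cur;
            cur = next;
            next = tmp;
        }
    }
}
```

The inner loops operate on contiguous bytes, so they are the natural place to substitute NEON loads and averaging instructions once the memory access pattern is settled.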
