NEON (SIMD) does not give performance increase

Note: This was originally posted on 15th August 2011 at


I am trying to manually improve the performance of some image processing functions.

I am using NEON intrinsics and C.

If the elements to process are contiguous in memory, I can load them all at once with pld, and in those cases I get a speedup of about 3-4 times.

But if the elements are scattered at random places in memory (for example, when I do bilinear interpolation), there is no NEON intrinsic that fetches them quickly. I have to place the elements into an array manually, one by one (with C code), and then pld this array with NEON, or use vgetq_lane_ and vsetq_lane. I think these steps take most of the time. In these cases I get no speedup at all.
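As a rough sketch of the gather step described above: one option is to load each scattered element straight into a vector lane with vld1_lane_u8 instead of copying it into a temporary array first. The function name gather_sum4 and the parameters src and idx are made up for illustration, and the scalar fallback is only there so the snippet also builds on non-ARM hosts. Lane loads are still issued one at a time, so whether this beats plain C depends on the core and has to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the original post): gather four bytes
 * from scattered offsets idx[0..3] in src and return their sum. */
static uint32_t gather_sum4(const uint8_t *src, const size_t idx[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t v = vdup_n_u8(0);
    /* Load each scattered byte directly into its lane, no temp array. */
    v = vld1_lane_u8(src + idx[0], v, 0);
    v = vld1_lane_u8(src + idx[1], v, 1);
    v = vld1_lane_u8(src + idx[2], v, 2);
    v = vld1_lane_u8(src + idx[3], v, 3);
    uint16x4_t s = vpaddl_u8(v);   /* pairwise widen-add: 8 x u8 -> 4 x u16 */
    uint32x2_t t = vpaddl_u16(s);  /* pairwise widen-add: 4 x u16 -> 2 x u32 */
    return vget_lane_u32(t, 0);    /* lane 0 holds the sum of bytes 0..3 */
#else
    /* Scalar fallback so the sketch compiles without NEON. */
    return (uint32_t)src[idx[0]] + src[idx[1]] + src[idx[2]] + src[idx[3]];
#endif
}
```

In a real bilinear kernel the four gathered values would be multiplied by the interpolation weights rather than summed; the sum only keeps the sketch short.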

How can I speed up such functions? What am I doing wrong?

Thank you.
  • Note: This was originally posted on 14th September 2011 at

    Why do I get such results?
    Why does the author of the article get a 7.5 times speedup when I get only 3? (hand-written NEON asm vs C)

    That could be due to many things.
    Maybe your C compiler is better.
    Maybe your images are bigger (and so do not fit in the cache).

    The gap alone is not useful here.

    It would be more useful if you gave us:
    - the size of your image
    - the frequency of your CPU
    - the number of times you loop over your function in the benchmark
    - the actual time measured

    With this, we will be able to work out how many cycles are needed for one loop of the hilbert test!
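The arithmetic behind that estimate can be sketched as follows, assuming the elapsed time is measured in seconds and the CPU frequency is given in Hz. The function name cycles_per_iteration and its parameters are illustrative only.

```c
/* Hypothetical helper: estimated CPU cycles spent per benchmark iteration.
 * cycles = elapsed seconds * CPU frequency (Hz) / number of iterations. */
static double cycles_per_iteration(double elapsed_s, double cpu_hz,
                                   unsigned long iterations)
{
    if (iterations == 0) {
        return 0.0; /* nothing was run, avoid dividing by zero */
    }
    return elapsed_s * cpu_hz / (double)iterations;
}
```

With made-up numbers, cycles_per_iteration(0.5, 1e9, 1000) gives 500000 cycles per iteration. Dividing that by the number of pixels in the image gives cycles per pixel, which is easier to compare between the C and NEON versions than a raw speedup ratio.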
