NEON (SIMD) does not give performance increase

Note: This was originally posted on 15th August 2011 at


I am trying to manually improve the performance of some image processing functions.

I am using NEON intrinsics and C.

If the elements to be processed are contiguous in memory, I can load them with pld all at once, and in such cases I get a speedup of about 3-4 times.

But if the elements are scattered at arbitrary places in memory (for example, when I do bilinear interpolation), there is no NEON intrinsic that loads them quickly. I have to place the elements into an array manually, one by one (with C code), and then pld this array with NEON, or use vgetq_lane_ and vsetq_lane. I think these steps take most of the time. In this case I get no speedup at all.

How can I speed up such functions? What am I doing wrong?
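To make the pattern described above concrete, here is a minimal sketch of "gather with C code, then compute with NEON". It is not taken from the original code: the names `gather8_u8` and `blend8_u8`, the 8-pixel batch size, and the 7-bit fixed-point weights (0..128) are all assumptions chosen for illustration. A scalar fallback is included so the sketch also builds where `arm_neon.h` is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: scalar gather of 8 pixels from arbitrary offsets
 * into a contiguous buffer. This is the part NEON cannot do in one load. */
static void gather8_u8(const uint8_t *src, const size_t idx[8], uint8_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = src[idx[i]];
}

/* Hypothetical helper: blend 8 pixel pairs with per-pixel weights.
 * dst = (a * (128 - w) + b * w + 64) >> 7, with w assumed to be in 0..128. */
static void blend8_u8(const uint8_t a[8], const uint8_t b[8],
                      const uint8_t w[8], uint8_t dst[8])
{
    for (int i = 0; i < 8; ++i)
        assert(w[i] <= 128);
#if HAVE_NEON
    uint8x8_t va = vld1_u8(a), vb = vld1_u8(b), vw = vld1_u8(w);
    uint8x8_t vinv = vsub_u8(vdup_n_u8(128), vw);  /* 128 - w            */
    uint16x8_t acc = vmull_u8(va, vinv);           /* a * (128 - w)      */
    acc = vmlal_u8(acc, vb, vw);                   /* ... + b * w        */
    vst1_u8(dst, vrshrn_n_u16(acc, 7));            /* rounding shift by 7 */
#else
    for (int i = 0; i < 8; ++i)
        dst[i] = (uint8_t)((a[i] * (128 - w[i]) + b[i] * w[i] + 64) >> 7);
#endif
}
```

In this shape the arithmetic is vectorised, but every pixel still costs one scalar load and one scalar store in `gather8_u8`, so the gather can easily dominate the run time.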

Thank you.
  • Note: This was originally posted on 15th September 2011 at

    That could be due to many things.
    Maybe your C compiler is better.
    Maybe your images are bigger (and so do not fit in the cache).

    The gap on its own is not useful here.

    It would be more useful if you gave us:
    - the size of your image
    - the frequency of your CPU
    - the number of times you loop over your function in the benchmark
    - the actual time measured

    With this, we will be able to work out how many cycles are needed for one loop of the hilbert test!

    I do not think it is the compiler, because I get the same speedup with NEON C intrinsics vs. pure C as the author of the article.

    - the frequency of your CPU: 800 MHz
    - the number of times you loop over your function in the benchmark: 10
    - the actual time measured:

    55551 (pure C)
    39884 (intrinsics)
    18029 (asm)
    for each loop
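The figures above can be turned into the cycles-per-pixel number the reply asks for. A minimal sketch follows, assuming the times are microseconds for one pass over the 1920 * 1080 image. The post does not state the unit, so this is an assumption, and `cycles_per_pixel` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cycles spent per pixel for one pass over the image.
 * time_us is ASSUMED to be in microseconds (the post gives no unit). */
static double cycles_per_pixel(double time_us, double cpu_mhz,
                               uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    /* MHz * microseconds = cycles, because 1e6 * 1e-6 cancels out. */
    double cycles = time_us * cpu_mhz;
    return cycles / ((double)width * (double)height);
}
```

Under that assumption, `cycles_per_pixel(55551, 800, 1920, 1080)` comes out at roughly 21.4 cycles per pixel for pure C, and the same call gives about 15.4 for the intrinsics time and about 7.0 for the asm time.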

    And yes, you were right: when I reduce the image size (I was using 1920 * 1080) to 100 * 100, asm is ~5.3 times faster than pure C, while the C intrinsics still give the same ~1.4 times.

    thank you :)