NEON (SIMD) does not give performance increase

Note: This was originally posted on 15th August 2011 at


I am trying to manually improve the performance of some image processing functions.

I am using NEON intrinsics and C.

If the processed elements are contiguous in memory, I can load them all at once with pld, and in such cases I get a speedup of about 3-4 times.
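For illustration, the contiguous case might look like the sketch below. It assumes the data is fetched with `vld1q_u8` (`pld` itself is only a prefetch hint and does not fill a NEON register). The function name `brighten_u8` and the saturating add are made up for the example, and a scalar path is used for the tail and where NEON is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical example: add `amount` to every pixel, saturating at 255. */
static void brighten_u8(const uint8_t *src, uint8_t *dst, size_t n,
                        uint8_t amount)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x16_t add = vdupq_n_u8(amount);
    /* 16 consecutive pixels per iteration: one load, one add, one store. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);
        vst1q_u8(dst + i, vqaddq_u8(v, add));
    }
#endif
    /* Scalar tail (or the whole buffer when NEON is not available). */
    for (; i < n; i++) {
        unsigned s = (unsigned)src[i] + amount;
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}
```

Because every lane comes from one contiguous load, there is no per-element work outside the vector unit, which is consistent with the speedup reported for this case.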

But if the processed elements are scattered at random places in memory (for example, when I do bilinear interpolation), there are no NEON intrinsics that fetch them quickly. I have to copy the elements into an array one by one (in C code) and then load this array with pld, or use vgetq_lane_ and vsetq_lane. I think these steps take most of the time. In this case I get no speedup at all.

How can I speed up such functions? What am I doing wrong?

Thank you.
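One possible way to handle the scattered case is sketched below, assuming an 8-bit grayscale image and sample coordinates in 16.16 fixed point (both are assumptions, as is the name `bilinear_gather8`). Each lane is loaded straight from memory with `vld1_lane_u8`, so no temporary C array and no vgetq_lane/vsetq_lane round trip is needed, and the weighting then runs on 8 samples at once. The gather itself is still one load per element, so whether this beats plain C depends on the core and has to be measured.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* (a*(256-f) + b*f + 128) >> 8 for 8 lanes, f in 0..255. */
static inline uint8x8_t lerp_u8(uint8x8_t a, uint8x8_t b, uint8x8_t f)
{
    uint16x8_t t = vmull_u8(a, vmvn_u8(f)); /* a * (255 - f)        */
    t = vaddw_u8(t, a);                     /* a * (256 - f)        */
    t = vmlal_u8(t, b, f);                  /* + b * f, fits in u16 */
    return vrshrn_n_u16(t, 8);              /* round, narrow to u8  */
}

/* Top 8 bits of the 16.16 fraction for 8 coordinates. */
static inline uint8x8_t frac_u8(const int32_t *v)
{
    uint16x4_t lo = vshrn_n_u32(vreinterpretq_u32_s32(vld1q_s32(v)), 8);
    uint16x4_t hi = vshrn_n_u32(vreinterpretq_u32_s32(vld1q_s32(v + 4)), 8);
    return vmovn_u16(vcombine_u16(lo, hi));
}
#endif

/* Hypothetical helper: bilinear samples at 8 arbitrary positions.
 * xs/ys are 16.16 fixed point; the caller must guarantee
 * 0 <= x < width-1 and 0 <= y < height-1 for every sample. */
static void bilinear_gather8(const uint8_t *src, int stride,
                             const int32_t *xs, const int32_t *ys,
                             uint8_t *dst)
{
    assert(src && xs && ys && dst && stride > 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t p00 = vdup_n_u8(0), p01 = p00, p10 = p00, p11 = p00;
    /* The lane index must be a compile-time constant, hence the macro. */
#define GATHER(i) do {                                                   \
        const uint8_t *p = src + (ys[i] >> 16) * stride + (xs[i] >> 16); \
        p00 = vld1_lane_u8(p, p00, i);                                   \
        p01 = vld1_lane_u8(p + 1, p01, i);                               \
        p10 = vld1_lane_u8(p + stride, p10, i);                          \
        p11 = vld1_lane_u8(p + stride + 1, p11, i);                      \
    } while (0)
    GATHER(0); GATHER(1); GATHER(2); GATHER(3);
    GATHER(4); GATHER(5); GATHER(6); GATHER(7);
#undef GATHER
    uint8x8_t fx = frac_u8(xs), fy = frac_u8(ys);
    vst1_u8(dst, lerp_u8(lerp_u8(p00, p01, fx), lerp_u8(p10, p11, fx), fy));
#else
    /* Scalar reference with the same rounding as the NEON path. */
    for (int i = 0; i < 8; i++) {
        const uint8_t *p = src + (ys[i] >> 16) * stride + (xs[i] >> 16);
        unsigned fx = (unsigned)(xs[i] >> 8) & 0xFF;
        unsigned fy = (unsigned)(ys[i] >> 8) & 0xFF;
        unsigned top = (p[0] * (256 - fx) + p[1] * fx + 128) >> 8;
        unsigned bot = (p[stride] * (256 - fx) + p[stride + 1] * fx + 128) >> 8;
        dst[i] = (uint8_t)((top * (256 - fy) + bot * fy + 128) >> 8);
    }
#endif
}
```

The two horizontal neighbours of each sample are adjacent in memory, so a further refinement (not shown) would be to fetch each pair with a single 16-bit lane load.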
  • Note: This was originally posted on 17th August 2011 at

    Hmm, but here asm was much better than C intrinsics too.

    C-version:     15.1 cycles per pixel.
    NEON-version:   9.9 cycles per pixel.
    Assembler:   2.0 cycles per pixel.

    Hum. You're right, intrinsics are not very good in fact ;)

    I have thought about it a lot but I cannot solve it. In your article, when processing only the Y axis, you were working with consecutive data. But how can I do this
    for a conversion to Cartesian coordinates from polar coordinates?


    You spoke about a conversion FROM Cartesian TO polar, not the opposite!

    In my transformation there are cos and sin, so there is no consecutive data on the polar coordinate side.

    You're right, that will not be easy to do in one single step.
    But you can certainly do it with NEON anyway!

    How is your data stored in memory?
