NEON (SIMD) does not give a performance increase

Note: This was originally posted on 15th August 2011 at http://forums.arm.com

Hello!

I am trying to manually improve the performance of some image processing functions.

I am using NEON intrinsics and the C language.

If the elements to process are contiguous in memory, I can load them all at once with a NEON vector load, and in such cases I get a speedup of about 3-4 times.
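For illustration, the contiguous case might look like the following minimal sketch. The function name `brighten_u8` and the saturating-add operation are assumptions chosen for the example, not taken from the original code. A scalar fallback is included so the sketch also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = saturate(src[i] + offset) for n pixels. */
void brighten_u8(uint8_t *dst, const uint8_t *src, size_t n, uint8_t offset)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
#if HAVE_NEON
    const uint8x16_t voff = vdupq_n_u8(offset);
    for (; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i); /* one load brings in 16 pixels */
        v = vqaddq_u8(v, voff);           /* saturating add on all 16 lanes */
        vst1q_u8(dst + i, v);             /* one store writes 16 pixels */
    }
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < n; ++i) {
        unsigned s = (unsigned)src[i] + offset;
        dst[i] = (uint8_t)(s > 255u ? 255u : s);
    }
}
```

Because the pixels are adjacent, each vector load and store moves 16 elements with a single instruction, which is where the gain in this case comes from.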

But if the elements are located at scattered places in memory (for example, when I do bilinear interpolation), there are no NEON intrinsics that do this quickly. I have to place the elements into an array manually, one by one (with C code), and then load this array with NEON, or use vgetq_lane_ and vsetq_lane. I think these operations take most of the time. In these cases I get no speedup at all.
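A sketch of the scattered case, to make the pattern concrete. Everything here is an assumption for illustration: the helper name `bilinear4`, the single-channel float image layout, and the choice to gather the four taps of four samples with `vld1q_lane_f32` instead of a temporary array. It is not claimed to be faster than plain C; each lane still needs its own load, which is exactly the cost described above.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: bilinear samples of a float image at 4 arbitrary
 * positions. Requires 0 <= x[k] < width-1 and 0 <= y[k] < height-1. */
void bilinear4(float out[4], const float *img, int width, int height,
               const float x[4], const float y[4])
{
    const float *p[4];  /* address of the top-left tap of each sample */
    float fx[4], fy[4]; /* fractional offsets inside the 2x2 cell */
    for (int k = 0; k < 4; ++k) {
        int ix = (int)x[k], iy = (int)y[k];
        assert(ix >= 0 && ix + 1 < width && iy >= 0 && iy + 1 < height);
        p[k] = img + (size_t)iy * (size_t)width + (size_t)ix;
        fx[k] = x[k] - (float)ix;
        fy[k] = y[k] - (float)iy;
    }
    (void)height; /* only used by the assert above */
#if HAVE_NEON
    float32x4_t tl = vdupq_n_f32(0.0f), tr = tl, bl = tl, br = tl;
    /* Gather: the lane index must be a compile-time constant, so unroll. */
#define GATHER(k)                                 \
    tl = vld1q_lane_f32(p[k], tl, k);             \
    tr = vld1q_lane_f32(p[k] + 1, tr, k);         \
    bl = vld1q_lane_f32(p[k] + width, bl, k);     \
    br = vld1q_lane_f32(p[k] + width + 1, br, k);
    GATHER(0) GATHER(1) GATHER(2) GATHER(3)
#undef GATHER
    float32x4_t vfx = vld1q_f32(fx), vfy = vld1q_f32(fy);
    /* vmlaq_f32(a, b, c) computes a + b * c per lane. */
    float32x4_t top = vmlaq_f32(tl, vsubq_f32(tr, tl), vfx);
    float32x4_t bot = vmlaq_f32(bl, vsubq_f32(br, bl), vfx);
    vst1q_f32(out, vmlaq_f32(top, vsubq_f32(bot, top), vfy));
#else
    /* Scalar fallback with the same arithmetic. */
    for (int k = 0; k < 4; ++k) {
        float top = p[k][0] + (p[k][1] - p[k][0]) * fx[k];
        float bot = p[k][width] + (p[k][width + 1] - p[k][width]) * fx[k];
        out[k] = top + (bot - top) * fy[k];
    }
#endif
}
```

Only the interpolation arithmetic is vectorized here; the 16 tap loads stay one per lane, so the gather tends to dominate the run time.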

How can I speed up such functions? What am I doing wrong?

Thank you.
Reply
  • Note: This was originally posted on 14th September 2011 at http://forums.arm.com

    Hello again!

    I am trying to reproduce these results: http://hilbert-space.de/?p=22

    I tried on Cortex-A8 and Cortex-A9.

    A8:
    C intrinsics speedup compared with C: 1.4
    Hand-written NEON assembly speedup compared with C: 3

    A9:
    C intrinsics speedup compared with C: 1.9
    Hand-written NEON assembly speedup compared with C: 2.5

    But the author of the article reports:

    C intrinsics speedup compared with C: 1.5
    Hand-written NEON assembly speedup compared with C: 7.5 !!!!

    Why do I get such results?
    Why does the author of the article get a 7.5 times speedup while I get only 3 (hand-written NEON assembly vs C)?