NEON (SIMD) does not give performance increase

Note: This was originally posted on 15th August 2011 at http://forums.arm.com

Hello!

I am trying to manually increase the performance of some image processing functions.

I am using NEON intrinsics and the C language.

If the processed elements are located one after another, in one place, I can load them all at once (with pld prefetching), and in such cases I get a performance win of about 3-4 times.

But if the processed elements are located at random places in memory (for example when I do bilinear interpolation), there are no NEON intrinsics that can do this quickly. I have to place the elements into an array manually, one by one (with C code), and then load that array with NEON, or use vgetq_lane_/vsetq_lane. I think these operations take most of the time. In this case I get no performance win at all.
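
For example, this is roughly what I mean (a simplified sketch; the function names, the 16-element gather and the index array are made up for the example, assuming 8-bit pixels):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Consecutive data: a single vld1q_u8 pulls 16 bytes straight into a NEON register. */
uint8x16_t load_consecutive(const uint8_t *src)
{
    return vld1q_u8(src);
}

/* Scattered data (e.g. the source pixels of a bilinear lookup): the bytes have
   to be gathered with scalar C code first and only then loaded as a block.
   This scalar gather is where most of the time goes. */
uint8x16_t load_scattered(const uint8_t *src, const int idx[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; ++i)   /* one element at a time, on the ARM side */
        tmp[i] = src[idx[i]];
    return vld1q_u8(tmp);          /* only now can NEON load it in one go    */
}
```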

How can I speed up such functions? What am I doing wrong?

Thank you.
  • Note: This was originally posted on 16th August 2011 at http://forums.arm.com

    [size="2"]Thank you for your answer.[/size] 
    [size="2"]Yes, i have read your article with interpolation before i post here and was very suprised with such great result (12 times), my maximum performance win with neon intrinsics VS pure C was only 5 times. Is ASM so better than C intrinsics? [/size] 
    On what processor core do you do it?
    [size="2"]It is amazing idea to process only y axes and transpose image. Fortunately, i am doing interpolation while converting from cartesian to polar coordinat system and do not think that this idea can used in this case.[/size] 
    [size="2"]So if random memory access than using neon has no sense because of memory relocation?[/size] 
    Aleksey.

  • Note: This was originally posted on 14th September 2011 at http://forums.arm.com

    Hello again!

    I have tried to reproduce these results: http://hilbert-space.de/?p=22

    I tried it on a Cortex-A8 and a Cortex-A9.

    A8:
    C intrinsics win compared with C: 1.4
    hand-made NEON ASM compared with C: 3

    A9:
    C intrinsics win compared with C: 1.9
    hand-made NEON ASM compared with C: 2.5

    But the author of the article gets:
    C intrinsics win compared with C: 1.5
    hand-made NEON ASM compared with C: 7.5 !!!!

    Why do I get such results?
    Why does the author of the article get a 7.5 times win while I only get 3 (hand-made NEON ASM vs C)?
  • Note: This was originally posted on 15th September 2011 at http://forums.arm.com


    That could be due to many things.
    Maybe your C compiler is better,
    maybe your images are bigger (and then do not fit in the cache).

    The ratio alone is not useful here.

    It would be more useful if you give us:
    - the size of your image
    - the frequency of your CPU
    - the number of times you loop over your function for the benchmark
    - the real time obtained

    With this, we will be able to work out how many cycles are needed for one loop of the hilbert test!




    I do not think it is the compiler, because I get the same win with NEON C intrinsics vs pure C as the author of the article.

    - the frequency of your CPU: 800 MHz
    - the number of times I loop over the function for the benchmark: 10
    - the real time obtained:

    55551 (pure C)
    39884 (intrinsics)
    18029 (asm)
    for each loop
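
    (A rough way to get the cycles figure asked for above, assuming the times are per call and in microseconds, and that the image is the 1920 × 1080 one mentioned in the next paragraph; none of this is stated explicitly, so treat it as an illustrative sketch only.)

    ```c
    #include <stdio.h>

    /* Back-of-the-envelope cycles-per-pixel estimate from the numbers above.
       Assumptions (not stated in the thread): times are per call, in
       microseconds, for a 1920 x 1080 image on an 800 MHz core. */
    int main(void)
    {
        const double freq_mhz   = 800.0;   /* 800 MHz = 800 cycles per microsecond */
        const double pixels     = 1920.0 * 1080.0;
        const double times_us[] = { 55551.0, 39884.0, 18029.0 };
        const char  *label[]    = { "pure C", "intrinsics", "asm" };

        for (int i = 0; i < 3; ++i)
            printf("%-10s ~%.1f cycles per pixel\n",
                   label[i], times_us[i] * freq_mhz / pixels);
        return 0;
    }
    ```

    Under those assumptions this comes out at roughly 21, 15 and 7 cycles per pixel, which would put even the ASM version well above the 2.0 cycles per pixel of the original article - consistent with a 1080p image that does not fit in the cache.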

    And yes, you were right: when I reduce the size of the image (I was using 1920 × 1080) to 100 × 100, the ASM win over pure C is ~5.3 times, while the C intrinsics still give the same ~1.4 times.

    thank you :)
  • Note: This was originally posted on 17th August 2011 at http://forums.arm.com

    Hmm, but here http://hilbert-space.de/?p=22 ASM was much better than C intrinsics too.

    C version:     15.1 cycles per pixel.
    NEON version:   9.9 cycles per pixel.
    Assembler:      2.0 cycles per pixel.



    >> You just need to do the conversion during the second pass (and not during the first one).


    I have thought about it a lot but I cannot solve it. When processing only the Y axis in your article, you were working with consecutive data. But how can I do this here?


    To Cartesian coordinates from polar coordinates:

    x = r·cos(θ), y = r·sin(θ)

    In my transformation there are cos and sin, so there is no consecutive data on the polar coordinate side.

    Thank you!
  • Note: This was originally posted on 17th August 2011 at http://forums.arm.com

    webshaker,
    yeah, from Cartesian to polar, some processing, and then back from polar to Cartesian.

    When I do Cartesian to polar I load the points consecutively but do not store them consecutively; when I do polar to Cartesian I load the points non-consecutively but store them consecutively.
    The data is stored channel by channel.

    isogen74,
    I am using GCC 4.4.1.

    >> it seems to struggle with data movement using a lot of stack ops rather than vmov instructions.
    Could you explain in more detail?

    Thank you for the help.
  • Note: This was originally posted on 22nd August 2011 at http://forums.arm.com

    Thanks, I'll go and learn ASM :)
  • Note: This was originally posted on 15th August 2011 at http://forums.arm.com


    If the processed elements are located one after another, in one place, I can load them all at once (with pld prefetching), and in such cases I get a performance win of about 3-4 times.

    But if the processed elements are located at random places in memory (for example when I do bilinear interpolation), there are no NEON intrinsics that can do this quickly. I have to place the elements into an array manually, one by one (with C code), and then load that array with NEON, or use vgetq_lane_/vsetq_lane. I think these operations take most of the time. In this case I get no performance win at all.

    How can I speed up such functions? What am I doing wrong?


    You're right.
    NEON is not very good at processing non-sequential data.

    For bilinear interpolation, the best approach is to do it in 2 distinct passes (see the sketch after this list):
    - one on the X axis, which will be slow
    - one on the Y axis, which will be very fast
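
    Here is a minimal sketch of why the Y-axis pass maps so well onto NEON: both source rows are contiguous, so whole vectors of pixels can be loaded and blended at once. (This is not the code from the blog post; the 8-bit fixed-point weights and the multiple-of-8 width are just assumptions for the example.)

    ```c
    #include <arm_neon.h>
    #include <stdint.h>

    /* Vertical (Y axis) linear interpolation between two source rows of an
       8-bit image. Both rows are contiguous in memory, so 8 pixels are loaded
       and blended per iteration. "frac" is the 0..255 weight of row1; the
       result is approximately (row0*(255-frac) + row1*frac) / 256. */
    void interp_rows_u8(uint8_t *dst, const uint8_t *row0, const uint8_t *row1,
                        int width, uint8_t frac)
    {
        uint8x8_t w1 = vdup_n_u8(frac);
        uint8x8_t w0 = vdup_n_u8(255 - frac);

        for (int x = 0; x < width; x += 8) {         /* width assumed multiple of 8 */
            uint8x8_t  a   = vld1_u8(row0 + x);      /* consecutive loads: fast     */
            uint8x8_t  b   = vld1_u8(row1 + x);
            uint16x8_t acc = vmull_u8(a, w0);        /* widening multiply           */
            acc = vmlal_u8(acc, b, w1);              /* multiply-accumulate row1    */
            vst1_u8(dst + x, vrshrn_n_u16(acc, 8));  /* round, narrow back to 8 bit */
        }
    }
    ```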

    I've implemented the enlarging bilinear interpolation.
    My first version was 12 times faster than the C version:
    http://pulsar.webshaker.net/2011/05/25/bilinear-enlarge-with-neon/

    After that I managed to optimize it further, up to 16 times faster:
    http://pulsar.webshaker.net/2011/07/15/agrandissement-bilineaire-la-vengeance/
    PS: sorry, this time I did not translate it into English. If you want to make the translation, send it to me.

    If you are doing a reducing (downscaling) bilinear interpolation, sorry, I have not done that one yet.
    But I'll do it soon.

    Etienne
  • Note: This was originally posted on 16th August 2011 at http://forums.arm.com


    You currently run two passes over the image, one for the Y up-sample and one for the X up-sample. Given that data loading is a significant cost in the algorithm (especially in real cases where the image is larger than the cache size), is there any way you can split the image into small "tiles" that fit into registers or (more realistically) cache, and do both the X and Y up-sample on a single tile before moving on to the next tile?

    This fits with the "only touch main memory once" design policy of most media processing codecs.


    I've tried the one-pass algorithm.
    I did not manage to increase performance with it (but maybe somebody else could ;))

    Working on tiles (say 8×8 pixels) requires writing data to different memory locations (different lines), which ends up producing a lot of cache misses (it is the same problem I had with the transposition, in fact).
    The other problem is that you do not have enough ARM registers to compute the 2 axes in one pass.

    But there is another reason why I do not do that!

    The enlarge and reduce algorithms are quite different. Using 2 passes allows you to handle the case where the image is reduced on the X axis and enlarged on the Y axis.
    You only have 4 functions to write. If you try to do the 2 passes in one go, the code is more complex: you have 8 functions (if you include the cases where only 1 axis is changed).

    Etienne.
  • Note: This was originally posted on 16th August 2011 at http://forums.arm.com


    [size="2"]Yes, i have read your article with interpolation before i post here and was very suprised with such great result (12 times), my maximum performance win with neon intrinsics VS pure C was only 5 times. Is ASM so better than C intrinsics? [/size] 
    I do not really know intrinsics programming, but I assume these are more macros than functions, so...
    There should not be a big difference between C intrinsics and ASM.

    The real problem you have here is loading data. Loading non-consecutive data into NEON registers is a nightmare.
    The best solution I found is the one I explain in the second post: you load the data with the ARM core and send it to the NEON registers using VMOV.

    The remaining difference is probably due to the pipeline optimisation I've made.
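
    As a small sketch of that idea in C (assuming 32-bit pixels; the helper name and the use of vcreate_u32 are just one way to express it, not the code from the blog post):

    ```c
    #include <arm_neon.h>
    #include <stdint.h>

    /* The scalar ARM side gathers two non-consecutive 32-bit pixels into
       general-purpose registers; packing them with vcreate_u32 is something a
       good compiler can turn into a core-to-NEON vmov instead of bouncing the
       data through memory. */
    static inline uint32x2_t gather_pair_u32(const uint32_t *base,
                                             int off0, int off1)
    {
        uint32_t a = base[off0];   /* ARM loads from scattered addresses */
        uint32_t b = base[off1];
        return vcreate_u32((uint64_t)a | ((uint64_t)b << 32)); /* lane 0 = a, lane 1 = b */
    }
    ```
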
    On what processor core did you do it?

    I run my tests on a BeagleBoard.
    But the performance increase should be the same on all Cortex-A8 cores (I can't say about the Cortex-A9).


    It is an amazing idea to process only the Y axis and transpose the image. Unfortunately, I am doing the interpolation while converting from a Cartesian to a polar coordinate system, and I do not think this idea can be used in that case.
    So if the memory access is random, does using NEON make no sense because of the cost of rearranging the data in memory?
    Aleksey.


    Sure you can!
    You just need to do the conversion during the second pass (and not during the first one).

    Etienne.
  • Note: This was originally posted on 14th September 2011 at http://forums.arm.com


    Why do I get such results?
    Why does the author of the article get a 7.5 times win while I only get 3 (hand-made NEON ASM vs C)?



    That could be due to many things.
    Maybe your C compiler is better,
    maybe your images are bigger (and then do not fit in the cache).

    The ratio alone is not useful here.

    It would be more useful if you give us:
    - the size of your image
    - the frequency of your CPU
    - the number of times you loop over your function for the benchmark
    - the real time obtained

    With this, we will be able to work out how many cycles are needed for one loop of the hilbert test!

  • Note: This was originally posted on 17th August 2011 at http://forums.arm.com


    Hmm, but here http://hilbert-space.de/?p=22 ASM was much better than C intrinsics too.

    C version:     15.1 cycles per pixel.
    NEON version:   9.9 cycles per pixel.
    Assembler:      2.0 cycles per pixel.


    Hmm. You're right, intrinsics are not very good, in fact ;)



    I have thought about it a lot but I cannot solve it. When processing only the Y axis in your article, you were working with consecutive data. But how can I do this here?
    To Cartesian coordinates from polar coordinates


    You spoke about conversion FROM Cartesian TO polar, not the opposite!

    I[font="arial, verdana, tahoma, sans-serif"][size="3"]n my tran[/size][/font]sfromation there are cos and sin so there is no consecutive data on polar cood. side.


    You're right, it will not be easy to do in one single step.
    But you can certainly do it with NEON anyway!

    How is your data stored in memory?
  • Note: This was originally posted on 16th August 2011 at http://forums.arm.com

    Hi Etienne,

    Nice blog post on resampling - one random idea on the implementation:

    You currently run two passes over the image, one for the Y up-sample and one for the X up-sample. Given that data loading is a significant cost in the algorithm (especially in real cases where the image is larger than the cache size), is there any way you can split the image into small "tiles" that fit into registers or (more realistically) cache, and do both the X and Y up-sample on a single tile before moving on to the next tile?

    This fits with the "only touch main memory once" design policy of most media processing codecs.

    Cheers,
    Iso
  • Note: This was originally posted on 17th August 2011 at http://forums.arm.com

    My experience with GCC intrinsics for NEON hasn't been great - it seems to struggle with data movement, using a lot of stack ops rather than vmov instructions.

    Out of interest, which version of GCC are you using? There has been some effort to improve NEON code generation for the intrinsics in newer versions of the toolchain (the latest version from CodeSourcery is 4.5.2, I think), but I think there is still more to do.

    Iso
  • Note: This was originally posted on 18th August 2011 at http://forums.arm.com

    Could you explain in more detail?

    A lot of NEON instructions require consecutive or alternating registers, so you end up moving lanes of data between the high-level intrinsic data structures. If you were to write this in assembler you would just write a "vmov dM, dN" type instruction and do the move in registers. My experience with intrinsics is that the compiler writes dN to the stack and then reads it back off the stack into dM, which is obviously significantly more expensive than the vmov. [This was with an older version of GCC; it may have improved.]
    I'd highly recommend running objdump on the binary and looking at the disassembly the compiler produces for the intrinsics - it is a useful exercise anyway when writing algorithm code, for NEON or otherwise.
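
    For example, a tiny made-up function like this is worth checking: hand-written assembler would implement it with a couple of d-register vmov moves, so the disassembly (build with something like -O2 -mfpu=neon, then run objdump -d) quickly shows whether the compiler does the same or goes through the stack.

    ```c
    #include <arm_neon.h>

    /* Combine the low half of one Q register with the high half of another.
       Hand-written assembler would do this with d-register vmov moves;
       inspect the generated code to see what the intrinsics produce. */
    uint8x16_t mix_halves(uint8x16_t a, uint8x16_t b)
    {
        return vcombine_u8(vget_low_u8(a), vget_high_u8(b));
    }
    ```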