That could be due to many things. Maybe your C compiler is better, or maybe your images are bigger (and so do not fit in the cache). The gap alone is not useful here. It would be more useful if you gave us:
- the size of your image
- the frequency of your CPU
- the number of times you loop over your function in the benchmark
- the real time obtained
With this, we will be able to work out how many cycles are needed for one loop of the hilbert test!
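For reference, the four numbers requested above combine into a cycles-per-pixel figure roughly as in this small sketch. The function name and parameter names are made up for illustration, and it assumes the timing covers only the benchmarked function and that the core clock stays fixed during the run.

```c
#include <assert.h>
#include <stdint.h>

/* Average CPU cycles spent per pixel (hypothetical helper).
 * elapsed_s : measured wall-clock time in seconds for all iterations
 * cpu_hz    : core clock frequency in Hz (assumed constant during the run)
 * width, height : image size in pixels
 * loops     : number of times the function was called in the benchmark */
double cycles_per_pixel(double elapsed_s, double cpu_hz,
                        uint32_t width, uint32_t height, uint32_t loops)
{
    assert(width > 0 && height > 0 && loops > 0);
    double total_cycles = elapsed_s * cpu_hz;
    double total_pixels = (double)width * (double)height * (double)loops;
    return total_cycles / total_pixels;
}
```

With invented numbers, 1.0 s at 600 MHz for 100 passes over a 640x480 image works out to about 19.5 cycles per pixel, which is the kind of figure that can be compared directly between two setups.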
If the processed elements are located one after another, or in one place, I can load them all at once with pld, and in such cases I get a performance win of about 3-4 times. But if the processed elements are located at random places in memory (for example, when I do bilinear interpolation), there are no NEON intrinsics that give a fast way of doing it. I need to place the elements in an array manually, one by one (with C code), and then pld this array with NEON, or use vgetq_lane_, vsetq_lane. I think these actions take most of the time. In these cases I have no performance win at all. How can I speed up such functions? What am I doing wrong?
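The "collect the elements one by one, then process them together" pattern described above might look roughly like the following in portable C. This is only a sketch under assumptions: the function name, the batch size of 8, and the 24.8 fixed-point coordinate format are all made up for illustration. Stage 1 stays scalar, since NEON has no gather load for arbitrary addresses; stage 2 only touches contiguous arrays, which is the part that could be handed to vector loads or an auto-vectorizing compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BATCH = 8 };  /* hypothetical batch size, e.g. one 8-lane vector */

/* Bilinear interpolation of BATCH scattered sample points.
 * xs, ys are 24.8 fixed-point coordinates (an assumed format) and must
 * satisfy (xs >> 8) < w and (ys >> 8) < h. */
void bilinear_batch(const uint8_t *img, size_t w, size_t h,
                    const uint32_t xs[BATCH], const uint32_t ys[BATCH],
                    uint8_t out[BATCH])
{
    uint8_t p00[BATCH], p01[BATCH], p10[BATCH], p11[BATCH];
    uint16_t fx[BATCH], fy[BATCH];

    /* Stage 1: scalar gather of the four neighbours into contiguous arrays. */
    for (int i = 0; i < BATCH; i++) {
        size_t x = xs[i] >> 8, y = ys[i] >> 8;
        assert(x < w && y < h);
        size_t x1 = (x + 1 < w) ? x + 1 : x;   /* clamp at the right edge  */
        size_t y1 = (y + 1 < h) ? y + 1 : y;   /* clamp at the bottom edge */
        p00[i] = img[y * w + x];   p01[i] = img[y * w + x1];
        p10[i] = img[y1 * w + x];  p11[i] = img[y1 * w + x1];
        fx[i] = (uint16_t)(xs[i] & 0xFF);
        fy[i] = (uint16_t)(ys[i] & 0xFF);
    }

    /* Stage 2: blend; every access here is contiguous. */
    for (int i = 0; i < BATCH; i++) {
        uint32_t top = p00[i] * (256u - fx[i]) + p01[i] * fx[i];
        uint32_t bot = p10[i] * (256u - fx[i]) + p11[i] * fx[i];
        out[i] = (uint8_t)((top * (256u - fy[i]) + bot * fy[i] + 32768u) >> 16);
    }
}
```

Whether this split pays off depends on how much arithmetic stage 2 does per gathered element; with only a few multiplies per pixel, the scalar gather is likely to stay the dominant cost.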
You currently run two passes over the image: one for the Y up-sample and one for the X up-sample. Given that data loading is a significant cost in the algorithm (especially in real cases where the image is larger than the cache), is there any way you can split the image into small "tiles" that fit into registers or (more realistically) cache, do both the X and Y up-sample on a single tile, and then move on to the next tile? This fits the "only touch main memory once" design policy of most media processing codecs.
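A minimal sketch of the tiling idea, in plain C rather than NEON. It assumes a 2x linear up-sample of an 8-bit single-channel image with edge clamping; the function name and the tile size are hypothetical and would need tuning for the real filter and cache. Each source tile is expanded in X into a small temporary buffer, and the Y pass then reads that buffer while it is still hot, so the full image is read once and written once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TILE_W = 32, TILE_H = 16 };  /* hypothetical tile size; tune for the cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* 2x linear up-sample of a w x h 8-bit image into a (2w) x (2h) image. */
void upsample2x_tiled(const uint8_t *src, size_t w, size_t h, uint8_t *dst)
{
    assert(src && dst && w > 0 && h > 0);
    uint8_t tmp[TILE_H + 1][2 * TILE_W];  /* X-expanded tile, plus one extra row */
    const size_t dw = 2 * w;

    for (size_t y0 = 0; y0 < h; y0 += TILE_H) {
        size_t th = min_sz(TILE_H, h - y0);
        for (size_t x0 = 0; x0 < w; x0 += TILE_W) {
            size_t tw = min_sz(TILE_W, w - x0);

            /* X pass: th + 1 source rows; the extra row is the vertical
             * neighbour of the last tile row (clamped at the image edge). */
            for (size_t r = 0; r <= th; r++) {
                const uint8_t *row = src + min_sz(y0 + r, h - 1) * w;
                for (size_t c = 0; c < tw; c++) {
                    unsigned a = row[x0 + c];
                    unsigned b = row[min_sz(x0 + c + 1, w - 1)];
                    tmp[r][2 * c]     = (uint8_t)a;
                    tmp[r][2 * c + 1] = (uint8_t)((a + b + 1) / 2);
                }
            }

            /* Y pass: reads only the small buffer, writes two output rows
             * per tile row. */
            for (size_t r = 0; r < th; r++) {
                uint8_t *even = dst + 2 * (y0 + r) * dw + 2 * x0;
                uint8_t *odd  = even + dw;
                for (size_t c = 0; c < 2 * tw; c++) {
                    unsigned a = tmp[r][c], b = tmp[r + 1][c];
                    even[c] = (uint8_t)a;
                    odd[c]  = (uint8_t)((a + b + 1) / 2);
                }
            }
        }
    }
}
```

The cost of this layout is that rows and columns on tile borders are read twice; for tiles of this size that overlap is small compared with a second full pass over the image.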
Yes, I read your article on interpolation before I posted here and was very surprised by such a great result (12 times); my maximum performance win with NEON intrinsics vs pure C was only 5 times. Is ASM so much better than C intrinsics?
On what processor core do you do it?
It is an amazing idea to process only the Y axis and transpose the image. Unfortunately, I am doing interpolation while converting from a Cartesian to a polar coordinate system, and I do not think this idea can be used in this case. So if memory access is random, then using NEON makes no sense because of memory relocation? Aleksey.
Why do I get such results? Why does the author of the article get a 7.5 times win and I get only 3? (hand-written NEON asm vs C)
Hmm, but here http://hilbert-space.de/?p=22 asm was much better than C intrinsics too. C version: 15.1 cycles per pixel. NEON version: 9.9 cycles per pixel. Assembler: 2.0 cycles per pixel.
I have thought about it a lot, but I cannot solve it. By processing only the Y axis in your article, you were working with consecutive data. But how can I do this when going to Cartesian coordinates from polar coordinates?
In my transformation there are cos and sin, so there is no consecutive data on the polar coordinate side.
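To make the access pattern concrete, here is a rough sketch of the address computation for one ring of a polar image sampled from a Cartesian source. Everything in it is an assumption for illustration: the function name, nearest-neighbour rounding, and the caller-supplied cos/sin tables (one entry per angular step). The point is only that neighbouring angular samples map to source offsets that are not next to each other in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For one radius r, compute the offset into a w x h Cartesian image of each
 * angular sample: x = cx + r*cos(theta), y = cy + r*sin(theta).
 * cos_t and sin_t are hypothetical precomputed tables with n_theta entries. */
void polar_row_offsets(float cx, float cy, float r,
                       const float *cos_t, const float *sin_t, size_t n_theta,
                       size_t w, size_t h, size_t *offsets)
{
    assert(cos_t && sin_t && offsets && w > 0 && h > 0);
    for (size_t t = 0; t < n_theta; t++) {
        float x = cx + r * cos_t[t];
        float y = cy + r * sin_t[t];
        /* Round to nearest for in-range values, then clamp to the image. */
        long xi = (long)(x + 0.5f);
        long yi = (long)(y + 0.5f);
        if (xi < 0) xi = 0;
        if (yi < 0) yi = 0;
        if (xi > (long)w - 1) xi = (long)w - 1;
        if (yi > (long)h - 1) yi = (long)h - 1;
        offsets[t] = (size_t)yi * w + (size_t)xi;
    }
}
```

For a 5x5 image with centre (2, 2), radius 2 and four angles at 0, 90, 180 and 270 degrees, the offsets come out as 14, 22, 10 and 2, so each sample lands on a different row of the source.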
Could you explain in more detail?