We are running a survey to help us improve the experience for all of our members. If you see the survey appear, please take the time to tell us about your experience if you can.
Hi, I am accelerating a image processing algorithm with OpenCL on the cellphone, however I met a case which had a very poor performance.
The case is to calculate the prefix sum of each row on the image (a 2D buffer). For example, a 3x3 image:
50 32 10
48 45 100
30 80 10
the result should be:
50 82 92
48 93 193
30 110 120
That is for each row store it's prefix sum. My opencl kernel is somewhat as follows:
__kernel void scan_h( __global int *dst, __global int *src, int width, int height, int stride) { int x = get_global_id(0); int data; data = 0; int i; for(i=0; i<width; i++) { data += src[i + x*stride]; dst[i+ x*stride] = data; } }
it needed about 100ms to finish a 3000*3000 image.
Another experiment show interesting result. If I calculate the each column prefix sum with the kernel as follows, the run-time is only 4ms.
__kernel void scan_v( __global int *dst, __global int *src, int width, int height, int stride) { int x = get_global_id(0); int data = 0; int i; for(i=0; i<height; i++) { data += src[x+i*stride]; dst[x+i*stride] = data; } }
I do know why is the difference. So a work around for the row prefix sum is to tranpose the image first and then perform column prefix sum followed with another transpose.
But I strongly wonder why the native row prefix sum is so slow?
Hi, all:
Here is the newest effort.
After taking the advice from Anthony, the performance is 4x faster against the origin kernel. Further more, I made a try to transpose the image, than employed a vertical prefix sum, finally another transpose. The state-of-art result is less than10ms.
To summary, for a 3968x2976 image (int datatype for each pixel):
1. the vertical prefix sum costs 2.5ms~4.5ms (differ caused by DVFS).
2. the original horizontal prefix sum (without vectorization) costs 70ms~100ms
after vectorization, the time consumption drops to 20ms
further, replace the horizontal prefix sum with two transpose and a vertical prefix sum, it costs 6ms~10ms.