Hi, I am accelerating an image processing algorithm with OpenCL on a cellphone, but I have hit a case with very poor performance.
The case is to calculate the prefix sum of each row of the image (a 2D buffer). For example, for a 3x3 image:
50 32 10
48 45 100
30 80 10
the result should be:
50 82 92
48 93 193
30 110 120
That is, each row stores its own prefix sum. My OpenCL kernel looks roughly like this:

__kernel void scan_h(__global int *dst, __global int *src,
                     int width, int height, int stride)
{
    int x = get_global_id(0);            /* one work-item per row */
    int data = 0;
    int i;
    for (i = 0; i < width; i++) {
        data += src[i + x * stride];     /* walk along the row */
        dst[i + x * stride] = data;
    }
}
It needs about 100 ms to finish a 3000x3000 image.
Another experiment shows an interesting result: if I instead calculate the prefix sum of each column with the kernel below, the runtime is only 4 ms.

__kernel void scan_v(__global int *dst, __global int *src,
                     int width, int height, int stride)
{
    int x = get_global_id(0);            /* one work-item per column */
    int data = 0;
    int i;
    for (i = 0; i < height; i++) {
        data += src[x + i * stride];     /* walk down the column */
        dst[x + i * stride] = data;
    }
}
I do not know why there is such a difference. A workaround for the row prefix sum is to transpose the image first, perform the column prefix sum, and then transpose the result back.
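For reference, the kind of transpose I have in mind is a tiled one, staged through local memory so that both the reads and the writes stay coalesced. This is only a minimal sketch, not a tuned kernel: the name, the 16x16 tile size, and the +1 padding are just assumptions for illustration.

#define TS 16   /* tile size; just an example value */

__kernel void transpose(__global int *dst,        /* height x width output */
                        __global const int *src,  /* width x height input  */
                        int width, int height)
{
    __local int tile[TS][TS + 1];   /* +1 column avoids local-memory bank conflicts */

    int gx = get_global_id(0);      /* column in src */
    int gy = get_global_id(1);      /* row in src    */
    int lx = get_local_id(0);
    int ly = get_local_id(1);

    /* coalesced read of a TS x TS tile from src */
    if (gx < width && gy < height)
        tile[ly][lx] = src[gy * width + gx];

    barrier(CLK_LOCAL_MEM_FENCE);

    /* coalesced write of the transposed tile into dst */
    int tx = get_group_id(1) * TS + lx;   /* column in dst */
    int ty = get_group_id(0) * TS + ly;   /* row in dst    */
    if (tx < height && ty < width)
        dst[ty * height + tx] = tile[lx][ly];
}

The local work size has to be {TS, TS} for this to be valid.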
But I still really wonder: why is the straightforward row prefix sum so slow?
Hi, all:
Here is my latest effort.
After taking Anthony's advice, the kernel is about 4x faster than the original. Furthermore, I tried transposing the image, then running the vertical prefix sum, and finally transposing back. The best result so far is less than 10 ms.
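To be concrete, the vectorized version is along these lines. This is only a simplified sketch (the kernel name is made up): it assumes the width is a multiple of 4 and that the stride is passed in int4 elements, so each iteration does one vectorized load, a small in-register scan of the four lanes, and one vectorized store.

__kernel void scan_h_vec4(__global int4 *dst, __global const int4 *src,
                          int width,   /* in pixels; assumed to be a multiple of 4 */
                          int height,
                          int stride)  /* row stride in int4 elements */
{
    int y = get_global_id(0);          /* one work-item per row */
    int carry = 0;
    int i;
    for (i = 0; i < width / 4; i++) {
        int4 v = src[i + y * stride];  /* one vectorized load covers 4 pixels */
        int4 s;
        s.x = carry + v.x;             /* in-register scan of the 4 lanes */
        s.y = s.x + v.y;
        s.z = s.y + v.z;
        s.w = s.z + v.w;
        dst[i + y * stride] = s;       /* one vectorized store */
        carry = s.w;                   /* carry the running total into the next int4 */
    }
}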
To summarize, for a 3968x2976 image (int data type for each pixel):
1. The vertical prefix sum costs 2.5ms~4.5ms (the variation is caused by DVFS).
2. The original horizontal prefix sum (without vectorization) costs 70ms~100ms.
3. After vectorization, the time drops to about 20ms.
4. Replacing the horizontal prefix sum with two transposes and a vertical prefix sum costs 6ms~10ms (see the host-side sketch below).
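For completeness, here is a rough sketch of the host-side dispatch for case 4. The helper name and the handles (k_transpose, k_scan_v, d_src, d_tmp, d_dst) are hypothetical; it assumes the context, queue, kernels and buffers are created elsewhere, uses the 16x16 tiled transpose sketched in my first post (so the local size must be {16, 16}; 3968 and 2976 are conveniently both multiples of 16), and only shows the enqueue order.

#include <CL/cl.h>

/* d_src holds the width x height int image, d_tmp is a scratch buffer of the
 * same size in bytes, d_dst receives the row prefix sums. */
static void row_scan_via_transpose(cl_command_queue q,
                                   cl_kernel k_transpose, cl_kernel k_scan_v,
                                   cl_mem d_src, cl_mem d_tmp, cl_mem d_dst,
                                   cl_int width, cl_int height)
{
    size_t tile[2] = { 16, 16 };   /* must match TS in the transpose kernel */

    /* 1) transpose src (width x height) into tmp (height x width) */
    clSetKernelArg(k_transpose, 0, sizeof(cl_mem), &d_tmp);
    clSetKernelArg(k_transpose, 1, sizeof(cl_mem), &d_src);
    clSetKernelArg(k_transpose, 2, sizeof(cl_int), &width);
    clSetKernelArg(k_transpose, 3, sizeof(cl_int), &height);
    size_t g1[2] = { (size_t)width, (size_t)height };
    clEnqueueNDRangeKernel(q, k_transpose, 2, NULL, g1, tile, 0, NULL, NULL);

    /* 2) vertical prefix sum over tmp, in place (safe here: each column belongs
     *    to one work-item and every element is read before it is written) */
    cl_int t_w = height, t_h = width, t_stride = height;
    clSetKernelArg(k_scan_v, 0, sizeof(cl_mem), &d_tmp);
    clSetKernelArg(k_scan_v, 1, sizeof(cl_mem), &d_tmp);
    clSetKernelArg(k_scan_v, 2, sizeof(cl_int), &t_w);
    clSetKernelArg(k_scan_v, 3, sizeof(cl_int), &t_h);
    clSetKernelArg(k_scan_v, 4, sizeof(cl_int), &t_stride);
    size_t g2[1] = { (size_t)t_w };
    clEnqueueNDRangeKernel(q, k_scan_v, 1, NULL, g2, NULL, 0, NULL, NULL);

    /* 3) transpose tmp back into dst (width x height again) */
    clSetKernelArg(k_transpose, 0, sizeof(cl_mem), &d_dst);
    clSetKernelArg(k_transpose, 1, sizeof(cl_mem), &d_tmp);
    clSetKernelArg(k_transpose, 2, sizeof(cl_int), &height);
    clSetKernelArg(k_transpose, 3, sizeof(cl_int), &width);
    size_t g3[2] = { (size_t)height, (size_t)width };
    clEnqueueNDRangeKernel(q, k_transpose, 2, NULL, g3, tile, 0, NULL, NULL);
}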