Hi, I am accelerating an image processing algorithm with OpenCL on a cellphone, but I ran into a case with very poor performance.
The case is calculating the prefix sum of each row of an image (a 2D buffer). For example, given a 3x3 image:
50 32 10
48 45 100
30 80 10
the result should be:
50 82 92
48 93 193
30 110 120
That is, each row stores its own prefix sum. My OpenCL kernel is roughly as follows:
__kernel void scan_h(__global int *dst, __global int *src,
                     int width, int height, int stride)
{
    int x = get_global_id(0);      /* one work-item per row */
    int data = 0;
    int i;
    for (i = 0; i < width; i++) {
        data += src[i + x * stride];
        dst[i + x * stride] = data;
    }
}
It takes about 100 ms to process a 3000x3000 image.
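For context, I launch it with one work-item per row; the host side is essentially the sketch below (the variable names queue, kernel, src_buf and dst_buf are placeholders, and error handling is omitted):

size_t global = (size_t)height;                /* one work-item per row */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dst_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &src_buf);
clSetKernelArg(kernel, 2, sizeof(int), &width);
clSetKernelArg(kernel, 3, sizeof(int), &height);
clSetKernelArg(kernel, 4, sizeof(int), &stride);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
clFinish(queue);                               /* block so the timing is meaningful */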
Another experiment shows an interesting result: if I instead calculate the prefix sum of each column with the kernel below, the runtime is only 4 ms.
__kernel void scan_v(__global int *dst, __global int *src,
                     int width, int height, int stride)
{
    int x = get_global_id(0);      /* one work-item per column */
    int data = 0;
    int i;
    for (i = 0; i < height; i++) {
        data += src[x + i * stride];
        dst[x + i * stride] = data;
    }
}
I don't know why there is such a difference. A workaround for the row prefix sum is to transpose the image first, perform the column prefix sum, and then transpose the result back.
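The transpose itself can be as simple as the naive kernel sketched below (the kernel name and the separate src/dst strides are just illustrative; a tiled version staging through __local memory would coalesce better):

__kernel void transpose(__global int *dst, __global int *src,
                        int width, int height,
                        int src_stride, int dst_stride)
{
    int x = get_global_id(0);      /* column index in src */
    int y = get_global_id(1);      /* row index in src */
    if (x < width && y < height)
        dst[x * dst_stride + y] = src[y * src_stride + x];
}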
But I still strongly wonder: why is the naive row prefix sum so slow?
What is the value of stride in the two examples?
My guess would be that access to contiguous memory locations leads to better use of the GPU's memory cache lines, improving overall performance.
Yes, that's pretty much guaranteed to be the problem. Running against the grain of the memory layout means you're guaranteed to thrash the cache and the MMU's TLB, so transposing one of the inputs is always a good idea.
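To make the gap concrete: at any given loop iteration i, adjacent work-items in scan_v read adjacent ints, while in scan_h they read addresses a full row stride apart, so each one drags in its own cache line (and TLB entry). A small host-side calculation shows the distances, assuming the 128-byte-aligned stride mentioned below (3008 ints for a 3000-int row):

#include <stdio.h>

int main(void)
{
    int stride = 3008;  /* assumed: 3000-int row padded to a 128-byte boundary */
    int i = 0, x = 0;   /* same loop iteration, adjacent work-items x and x+1 */

    /* scan_v: src[x + i*stride] vs src[(x+1) + i*stride] -> consecutive ints */
    int gap_v = (((x + 1) + i * stride) - (x + i * stride)) * (int)sizeof(int);

    /* scan_h: src[i + x*stride] vs src[i + (x+1)*stride] -> a full row apart */
    int gap_h = ((i + (x + 1) * stride) - (i + x * stride)) * (int)sizeof(int);

    printf("scan_v gap: %d bytes\n", gap_v);   /* 4     -> requests coalesce */
    printf("scan_h gap: %d bytes\n", gap_h);   /* 12032 -> one cache line per work-item */
    return 0;
}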
In most cases, the stride is the same as the width (more precisely, it is the width rounded up to a 128-byte alignment).
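Concretely, it is derived like this (a sketch, assuming 4-byte pixels):

/* round the row size in bytes up to a 128-byte boundary, then convert back to ints */
static int row_stride(int width)
{
    return ((width * (int)sizeof(int) + 127) / 128) * (128 / (int)sizeof(int));
}
/* row_stride(3000) == 3008 ints == 12032 bytes */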