OpenCL CL_OUT_OF_RESOURCES issue

Hi,

I'm trying to convert code written in CUDA to OpenCL and have run into some trouble. My final goal is to run the code on an Odroid XU3 board with a Mali T628 GPU.

To simplify the transition and save time debugging OpenCL kernels, I've taken the following steps:

1. Implement the code in CUDA and test it on an Nvidia GeForce 760.
2. Implement the code in OpenCL and test it on an Nvidia GeForce 760.
3. Test the OpenCL code on an Odroid XU3 board with a Mali T628 GPU.

I know that different architectures may need different optimizations, but that isn't my main concern for now. I managed to run the OpenCL code on my Nvidia GPU with no apparent issues, but I keep getting strange errors when trying to run the code on the Odroid board. I know that different architectures handle exceptions differently, but I'm not sure how to solve these issues.

Since the OpenCL code works well on my Nvidia GPU, I assume that I made the transition from threads/blocks to work-items/work-groups correctly. I have already fixed several issues related to CL_DEVICE_MAX_WORK_GROUP_SIZE, so I don't think that is the cause. When running the code I get a "CL_OUT_OF_RESOURCES" error.
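For readers following the thread/block to work-item/work-group mapping mentioned above, here is a minimal host-side sketch in plain C. The helper names (global_size_from_cuda, round_up_global_size, flat_global_id) are hypothetical and do not come from the attached kernel. It assumes a 1-D launch with a zero global offset and relies on the OpenCL 1.x rule that the global size must be a multiple of the local size whenever a local size is given.

```c
#include <stddef.h>

/* CUDA launches grid_dim blocks of block_dim threads each, while
 * clEnqueueNDRangeKernel expects the TOTAL number of work-items. */
size_t global_size_from_cuda(size_t grid_dim, size_t block_dim)
{
    return grid_dim * block_dim;
}

/* Round the problem size n up to the next multiple of local_size.
 * The padded work-items then need a bounds guard inside the kernel. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    size_t remainder = n % local_size;
    return remainder == 0 ? n : n + (local_size - remainder);
}

/* Equivalent of blockIdx.x * blockDim.x + threadIdx.x, which is what
 * get_global_id(0) returns when the global offset is zero. */
size_t flat_global_id(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}
```

If the global size is padded this way, any work-item whose flat id is at or beyond the real problem size should return early instead of touching the output buffers.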

I've narrowed the cause of the error down to 2 lines in the code, but I'm not sure how to fix them.

The error is caused by the following lines in the attached kernel code:

1. lowestDist[pixelNum] = partialDiffSumTemp;
   Both variables are private variables of the kernel, and therefore I don't see any potential issue.
2. d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 0] = bestDisparity[0];
   Here I guess the cause is an out-of-bounds access, but I'm not sure how to debug it since the original code doesn't have any issue.

Is there any tool that can help debug these issues on the Odroid? I saw that using "printf" inside the kernel isn't possible. Is there another command available?
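One common workaround when printf is unavailable in a kernel is to guard the suspicious store and record the offending index in an extra buffer that the host reads back afterwards. The sketch below emulates that idea on the host in plain C. DebugRecord, guarded_store and run_emulated are hypothetical names, and the gid + offset index is only a stand-in for globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS, whose real computation is not shown in this thread.

```c
#include <stddef.h>

/* Hypothetical debug record; on the device this would live in an
 * extra global buffer that the host reads back after the kernel. */
typedef struct {
    int    hit;        /* 1 once an out-of-range index has been seen */
    size_t work_item;  /* first work-item that produced a bad index  */
    size_t index;      /* the index that work-item tried to write    */
} DebugRecord;

/* Bounds-checked replacement for `buf[index] = value;`. */
void guarded_store(float *buf, size_t buf_len, size_t index, float value,
                   size_t work_item, DebugRecord *dbg)
{
    if (index < buf_len) {
        buf[index] = value;
        return;
    }
    if (!dbg->hit) {           /* keep only the first offender */
        dbg->hit = 1;
        dbg->work_item = work_item;
        dbg->index = index;
    }
}

/* Host-side emulation: every work-item writes at gid + offset
 * (an assumed index pattern, not the real kernel's). */
DebugRecord run_emulated(float *buf, size_t buf_len,
                         size_t global_size, size_t offset)
{
    DebugRecord dbg = {0, 0, 0};
    for (size_t gid = 0; gid < global_size; ++gid)
        guarded_store(buf, buf_len, gid + offset, 1.0f, gid, &dbg);
    return dbg;
}
```

On the device several work-items can hit the guard at the same time, so a real kernel version would need one record slot per work-item or an atomic flag instead of the plain check used here.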

Thanks

Yuval

stereoKernel.cl.zip

    if (partialDiffSumTemp < lowestDist[pixelNum]) {
        lowestDist[pixelNum] = partialDiffSumTemp;
        bestDisparity[pixelNum] = dispLevel - 1;
    }

Anthony Barbier over 9 years ago in reply to Robert David +1 verified

Hi lrdxgm, what you say is mostly true. However, if your kernel is ALU bound, then you will benefit from forcing the local workgroup size to 128 because the extra memory accesses caused by the register...

Yuval over 9 years ago

After solving the CL_OUT_OF_RESOURCES error, I'm trying to improve the performance of the kernel. I removed all CUDA-related optimizations and started measuring the computational cost of each part of the kernel. For some reason, 2 lines of code account for about 50% of the execution time, which seems a bit unreasonable.
The lines of code I'm referring to are:

    if (partialDiffSumTemp < lowestDist[pixelNum]) {
        lowestDist[pixelNum] = partialDiffSumTemp;
        bestDisparity[pixelNum] = dispLevel - 1;
    }

At first I thought this was caused by the "if" statement, but removing only the 2 inner statements gives a 50% improvement. I know that this statement is nested within several "for" loops, but other parts of the code that are also nested don't have the same effect. Any idea how to improve this?
Thanks
Yuval

Anthony Barbier over 9 years ago in reply to Yuval

If you don't write back the result of a calculation, the compiler will optimise out all the calculations that feed only that result. That is why commenting out a "write" gives the impression that it costs 50% of your execution time, when in fact the compiler has removed a whole set of other calculations that were no longer needed.
Do you think that could be what is happening here?
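One way to check this is to time a variant that drops the compare-and-store but still consumes every computed cost, so the compiler cannot discard the upstream work. The sketch below shows the idea in plain C rather than OpenCL C. window_cost and cost_checksum are hypothetical stand-ins, and the real kernel's cost computation is not shown in this thread, so treat it as an illustration under those assumptions only.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the expensive part: sum of absolute
 * differences between a window in the left row and the same window
 * shifted by disp in the right row. Requires x >= disp. */
static int window_cost(const unsigned char *left, const unsigned char *right,
                       size_t x, size_t window, size_t disp)
{
    int sum = 0;
    for (size_t i = 0; i < window; ++i)
        sum += abs((int)left[x + i] - (int)right[x + i - disp]);
    return sum;
}

/* Same structure as the snippet in the question: the two stores are
 * the only consumers of window_cost(). */
int best_disparity(const unsigned char *left, const unsigned char *right,
                   size_t x, size_t window, size_t max_disp)
{
    int lowestDist = INT_MAX;
    int bestDisparity = 0;
    for (size_t dispLevel = 1; dispLevel <= max_disp + 1; ++dispLevel) {
        int partialDiffSumTemp =
            window_cost(left, right, x, window, dispLevel - 1);
        if (partialDiffSumTemp < lowestDist) {
            lowestDist = partialDiffSumTemp;
            bestDisparity = (int)dispLevel - 1;
        }
    }
    return bestDisparity;
}

/* Timing variant: no compare-and-store, but every cost still feeds a
 * value that is returned, so the cost loop stays live. */
int cost_checksum(const unsigned char *left, const unsigned char *right,
                  size_t x, size_t window, size_t max_disp)
{
    int sink = 0;
    for (size_t dispLevel = 1; dispLevel <= max_disp + 1; ++dispLevel)
        sink += window_cost(left, right, x, window, dispLevel - 1);
    return sink;
}
```

If the checksum variant runs about as long as the full version, the two stores themselves are cheap and the apparent 50% saving came from the removed cost computation. In a kernel the checksum would have to be written to a global buffer to have the same effect.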