Hi,
I'm Trying to convert a code written in Cuda to openCL and run into some trouble. My final goal is to implement the code on an Odroid XU3 board with a Mali T628 GPU.
In order to simplify the transition and save time trying to debug openCL kernels I've taken the following steps:
I know that different architectures may have different optimizations but that isn't my main concern for now. I manged to run the openCL code on my Nvidia GPU with no apparent issues but keep getting strange errors when trying to run the code on the Odroid board. I know that different architectures have different handling of exceptions etc. but I'm not sure how to solve those issues.
Since the openCL code works great on my Nvidia I assume that I managed to do the correct transition between thread/blocks -> workItems/workGroups etc. I already fixed several issues that relate to the cl_device_max_work_group_size issue so that can't be the cause.When running the code i'm getting a "CL_OUT_OF_RESOURCES" error.
I've narrowed the cause of the error to 2 lines in the code but not sure to fix those issues.
the error is caused by the following lines in the kernel code attached :
Is there any tool that can help debugging those issues on the Odroid ? I saw that using "printf" inside the kernel isn't possible. Is there another available command ?
Thanks
Yuval
Hi goldfracht,
You can use printf inside your kernel but you need to enable this extension : https://www.khronos.org/registry/cl/extensions/arm/cl_arm_printf.txt
/* Define a printf callback function. */ static void printf_callback( const char *buffer, size_t len, size_t complete, void *user_data ) { printf( "%.*s", (int)len, buffer ); } void cl_init() { [...] cl_context_properties props[] = { /* Enable a printf callback function for this context. * */ CL_PRINTF_CALLBACK_ARM, (cl_context_properties) printf_callback, /* Request a minimum printf buffer size of 4MiB for devices in the context that support this extension. */ CL_PRINTF_BUFFERSIZE_ARM, (cl_context_properties) 0x100000, CL_CONTEXT_PLATFORM, (cl_context_properties) platform, 0 }; [...] }
And in your kernel you need to add:
#pragma OPENCL EXTENSION cl_arm_printf : enable
For out of bounds checks and race conditions you can try to use GPUVerify - Download
Note: The online version doesn't currently support out of bounds checking.
Here is how to use it:
test.cl:
__kernel void foo (__global const float* src_a, int d, __global const int* src_b, __global float *c) { uint idx = get_global_id(0); float4 a, b; a = (float4)(src_a[d]); b = convert_float4(vload4(idx, src_b)); vstore4(a + b, idx, c); }
./gpuverify --global_size=[256] --local_size=[256] --kernel-arrays=foo,8,8,8 test.cl --check-array-bounds test.opt.bc: warning: Assuming the arguments 'src_a', 'src_b', 'c' of 'foo' on line 1 of test.cl to be non-aliased; please consider adding a restrict qualifier to these arguments test.cl:9:2: error: possible array out-of-bounds access on array src_a[-1610612736] by work item 2 in work group 0: a = (float4)(src_a[d]); Bitwise values of parameters of 'foo': d = 2684354560 test.cl:9:2: error: possible array out-of-bounds access on array src_a[536870912] by work item 2 in work group 0: a = (float4)(src_a[d]); Bitwise values of parameters of 'foo': d = 536870912 test.cl:10:21: error: possible array out-of-bounds access on array src_b[512] by work item 128 in work group 0: b = convert_float4(vload4(idx, src_b)); Bitwise values of parameters of 'foo': d = 0 test.cl:10:21: error: possible array out-of-bounds access on array src_b[2] by work item 0 in work group 0: b = convert_float4(vload4(idx, src_b)); Bitwise values of parameters of 'foo': d = 0 GPUVerify kernel analyser finished with 0 verified, 4 errors
./gpuverify --global_size=[256] --local_size=[256] --kernel-arrays=foo,8,8,8 test.cl --check-array-bounds
test.opt.bc: warning: Assuming the arguments 'src_a', 'src_b', 'c' of 'foo' on line 1 of test.cl to be non-aliased; please consider adding a restrict qualifier to these arguments
test.cl:9:2: error: possible array out-of-bounds access on array src_a[-1610612736] by work item 2 in work group 0:
a = (float4)(src_a[d]);
Bitwise values of parameters of 'foo':
d = 2684354560
test.cl:9:2: error: possible array out-of-bounds access on array src_a[536870912] by work item 2 in work group 0:
d = 536870912
test.cl:10:21: error: possible array out-of-bounds access on array src_b[512] by work item 128 in work group 0:
b = convert_float4(vload4(idx, src_b));
d = 0
test.cl:10:21: error: possible array out-of-bounds access on array src_b[2] by work item 0 in work group 0:
GPUVerify kernel analyser finished with 0 verified, 4 errors
What calls returns CL_OUT_OF_RESOURCES ? If it's clEnqueueNDRangeKernel: what is the local workgroup size parameter ?
Hope this helps,
Anthony
Thanks for the detailed answer! I'll try and implement this ASAP.
About GPUVerify I saw the linux-64bit version download but the Odroid board uses a Ubuntu OS. Do you think there might be a problem ?
The error is from clEnqueueNDRangeKernel function and I'm using a {144,1088} global work size and {8,32} local work size (MAX_LOCAL_WORK_SIZE = 256). I already addressed a MAX_LOCAL_WORK_SIZE issue and I think I fixed it.
GpuVerify is a static analysis tool you don't need to run it on the device,you can run it on your pc.
Regarding the workgroup size : even if the hardware supports up to 256 local workgroup size, we strongly recommend to limit it to 128.
If your kernel is fairly complex you will have to set an attribute in front of your kernel to give the compiler a hint : see http://community.arm.com/message/28323#28323
Thanks for all the help. I added the __attribute__ and reviewed the kernel with GPUVerify and the issue is solved.
Instead of using a fixed 256 (or 128 as Anthony recommended) work items per work group, I recommend to use clGetKernelWorkGroupInfo and query CL_KERNEL_WORK_GROUP_SIZE. This way you will get the largest work group that the device allows with that kernel (so it can change for different kernels also): On AMD, you will probably get 256, on nVidia 1024, on Intel GPUs 512 (CPU can vary vastly), on Mali this can vary among 64/128/256 based on how complex your kernels are (more register usage per workitem = smaller max work group). On Qualcomm Adreno, it can also get any multiple of 16 (I constantly get annoyed by a work group or 80 or 192 threads, so I have to fix the code not to assume a power-of-two work group). You're out of luck if the algorithm requires a per-determined work group size, but that should be rare.
Hi lrdxgm,
What you say is mostly true, however if your kernel is ALU bound, then you will benefit from forcing the local workgroup size to 128 because the extra memory accesses caused by the register spilling will be hidden by the ALU operations and the GPU utilisation will be much better resulting in better performance.
Hope this makes sense.