Hey, I am new to OpenCL. I bumped into an issue when trying to leverage the GPU for some math operations (basically matrix multiplication).
Following is my sample code:
char *allocptr = malloc(WIDTH*HEIGHT*2);
cl_mem buffer = clImportMemoryARM(context, CL_MEM_READ_WRITE, NULL, allocp$
if (error == CL_SUCCESS) {
    printf("success\n");
} else {
    printf("error %d.\n", error);
}
The code compiles without any issue; however, it throws error -6 at run time. Can anyone shed some light here?
The clinfo output is pasted below as well.
firefly@firefly:~$ sudo clinfo -a
Platform #0
  Name:       ARM Platform
  Vendor:     ARM
  Version:    OpenCL 1.2 v1.r14p0-01rel0-git(966ed26).f44c85cb3d2ceb87e8be88e7592755c3
  Profile:    FULL_PROFILE
  Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory

  Device #0
    Name:                           Mali-T860
    Type:                           GPU
    Vendor:                         ARM
    Vendor ID:                      140517376
    Profile:                        FULL_PROFILE
    Available:                      Yes
    Version:                        OpenCL 1.2 v1.r14p0-01rel0-git(966ed26).f44c85cb3d2ceb87e8be88e7592755c3
    Driver version:                 1.2
    Compiler available:             Yes
    Address space size:             64
    Little endian:                  Yes
    Error correction support:       No
    Address alignment (bits):       1024
    Smallest alignment (bytes):     128
    Resolution of timer (ns):       1000
    Max clock frequency (MHz):      200
    Max compute units:              4
    Max constant args:              8
    Max constant buffer size:       64 kB
    Max mem alloc size:             489 MB 942 kB
    Max parameter size:             1024
    Command-queue supported props:  Out of order execution
                                    Profiling
    Execution capabilities:         OpenCL kernels
    Global memory size:             1 GB 935 MB 696 kB
    Global memory cache size:       256 kB
    Global memory line cache size:  64
    Local memory size:              32 kB
    Local memory type:              Global
    Global memory cache type:       Read write
    Max work group size:            256
    Max work item dimensions:       3
    Max work item sizes:            (256, 256, 256)
    Image support:                  Yes
    Max 2D image height:            65536
    Max 2D image width:             65536
    Max 3D image depth:             65536
    Max 3D image height:            65536
    Max 3D image width:             65536
    Max read image args:            128
    Max write image args:           8
    Max samplers:                   16
    Preferred vector width char:    16
    Preferred vector width short:   8
    Preferred vector width int:     4
    Preferred vector width long:    2
    Preferred vector width float:   4
    Preferred vector width double:  2
    Half precision float capability:
      Denorms
      Inf and NaNs
      Round to nearest even rounding mode
      Round to zero rounding mode
      Round to +ve and -ve infinity rounding modes
      IEEE754-2008 fused multiply-add
    Single precision float capability:
      Denorms
      Inf and NaNs
      Round to nearest even rounding mode
      Round to zero rounding mode
      Round to +ve and -ve infinity rounding modes
      IEEE754-2008 fused multiply-add
    Double precision float capability:
      Denorms
      Inf and NaNs
      Round to nearest even rounding mode
      Round to zero rounding mode
      Round to +ve and -ve infinity rounding modes
      IEEE754-2008 fused multiply-add
    Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory
Hi,
Your code got mangled on the way to the forum (the call to clImportMemoryARM looks incomplete). What size are you passing to clImportMemoryARM?
One suspicious thing is that your device doesn't report cl_arm_import_memory_host, which should be present when host imports are supported. That said, some older versions of the driver failed to advertise it, and since the feature was originally introduced for host imports, I'm fairly confident it's there on your device.
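If you want to double-check from code rather than clinfo, you can query the device's extension string with the standard clGetDeviceInfo call; a minimal sketch:

char extensions[4096] = {0};
/* CL_DEVICE_EXTENSIONS returns a space-separated list of extension names. */
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions) - 1, extensions, NULL);
if (strstr(extensions, "cl_arm_import_memory_host")) {
    printf("host imports advertised\n");
}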
Taking a step back, do you absolutely need zero-copy imports for your application, or would standard buffers created with clCreateBuffer suffice?
Regards,
Kevin
Kevin - thanks for your prompt response.
Unfortunately, we are working on a near-real-time video application running on an i.MX8, and the current CPU usage is almost at the cap. Hence we are trying to offload some computation (e.g., feature point tracking, matrix computation, HOG feature calculation) to the GPU. Initially we tried clCreateBuffer, but it looks like the gain from offloading the compute to the GPU is offset by the memory copies (data needs to be passed to the GPU and results passed back to the CPU on a per-frame basis), and consequently we cannot meet the real-time requirement (33 ms per frame). This is why we are checking whether zero-copy can help.
I reattached the code below; can you take another look? BTW, what is error code -6?
#define WIDTH  100
#define HEIGHT 100

cl_platform_id platform;
status = clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
commandQueue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);

char *allocptr = malloc(WIDTH * HEIGHT * 2);
cl_mem buffer = clImportMemoryARM(context, CL_MEM_READ_WRITE, NULL,
                                  allocptr, WIDTH * HEIGHT * 2, &error);
if (error == CL_SUCCESS) {
    printf("success\n");
} else {
    printf("error %d.\n", error);
}
Your code looks correct.
-6 means CL_OUT_OF_HOST_MEMORY (see https://github.com/KhronosGroup/OpenCL-Headers/blob/master/CL/cl.h#L183), but it's also used as a generic error by some parts of the driver when no existing API error code applies. I suspect that's what's happening here.
I suggest you try passing the CL_IMPORT_TYPE_HOST_ARM import type explicitly (see https://www.khronos.org/registry/OpenCL/extensions/arm/cl_arm_import_memory.txt).
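Something along these lines (untested sketch; the property names come from the extension spec linked above):

/* Tell the driver explicitly that this is a host allocation being imported. */
const cl_import_properties_arm props[] = {
    CL_IMPORT_TYPE_ARM, CL_IMPORT_TYPE_HOST_ARM,
    0
};
cl_mem buffer = clImportMemoryARM(context, CL_MEM_READ_WRITE, props,
                                  allocptr, WIDTH * HEIGHT * 2, &error);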
>Initially we tried clCreateBuffer, but it looks like the gain from offloading the compute to the GPU is offset by the memory copies
How did you come to that conclusion? What were you measuring?
Kévin
Kevin -
1) I tried to explicitly pass CL_IMPORT_TYPE_HOST_ARM to the API using the following code. The resulting error code changed to -64 (i.e., #define CL_INVALID_PROPERTY -64). Any hint?
char *allocptr = malloc(WIDTH * HEIGHT * 2);
const cl_import_properties_arm properties[3] = {
    CL_IMPORT_TYPE_ARM,
    CL_IMPORT_TYPE_HOST_ARM,
    0
};
cl_mem buffer = clImportMemoryARM(context, CL_MEM_READ_WRITE, properties,
                                  allocptr, WIDTH * HEIGHT * 2, &error);
//cl_mem buffer = clImportMemoryARM(context, CL_MEM_READ_WRITE, NULL, allocptr, WIDTH*HEIGHT*2, &error);
if (error == CL_SUCCESS) {
    printf("success\n");
} else {
    printf("error %d.\n", error);
}
2) W.r.t. your question about why we came to the conclusion that using clCreateBuffer is slower than using the CPU: the details are below; hope you can shed some light here.
a. We are using the following code to simulate multiple matrix multiplications (e.g., x times per frame at 30 fps):
for (int i = 0; i < 100; i++) {
    float ii = 0.12 * ((rand() % (10 - 1)) + 1);
    float jj = 0.4 * ((rand() % (10 - 1)) + 1);
    for (int j = 0; j < A.rows; j++) {
        float* data = A.ptr<float>(j);
        for (int k = 0; k < A.cols; k++) {
            data[k] = ii;
        }
    }
    for (int j = 0; j < B.rows; j++) {
        float* data = B.ptr<float>(j);
        for (int k = 0; k < B.cols; k++) {
            data[k] = jj;
        }
    }
    GPUResult = GpuFast::instance()->calcMatrix(A, B);
}
b. The code for the GpuFast class is:
// memObjects are created as below when the class instance is created
memObjects[0] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float) * szA, NULL, NULL);
memObjects[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float) * szB, NULL, NULL);
memObjects[2] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float) * szC, NULL, NULL);

cv::Mat GpuFast::calcMatrix(cv::Mat a, cv::Mat b)
{
    cl_event prof_event;
    int szA = Ndim * Pdim;
    int szB = Pdim * Mdim;
    int szC = Ndim * Mdim;
    struct timeval start, end;

    gettimeofday(&start, NULL);
    status = clEnqueueWriteBuffer(commandQueue, memObjects[0], CL_TRUE, 0,
                                  sizeof(float) * szA, a.ptr<float>(0), 0, NULL, NULL);
    if (status) {
        printf("clEnqueueWriteBuffer A error status=%d\n", (int)status);
    }
    status = clEnqueueWriteBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
                                  sizeof(float) * szB, b.ptr<float>(0), 0, NULL, NULL);
    if (status) {
        printf("clEnqueueWriteBuffer B error status=%d\n", (int)status);
    }

    //cl_int clnum = NWITEMS;
    status = clSetKernelArg(kernel, 0, sizeof(int), &Ndim);
    status = clSetKernelArg(kernel, 1, sizeof(int), &Mdim);
    status = clSetKernelArg(kernel, 2, sizeof(int), &Pdim);
    status = clSetKernelArg(kernel, 3, sizeof(cl_mem), &memObjects[0]);
    status = clSetKernelArg(kernel, 4, sizeof(cl_mem), &memObjects[1]);
    status = clSetKernelArg(kernel, 5, sizeof(cl_mem), &memObjects[2]);
    if (status) {
        cout << "clSetKernelArg error" << endl;
    }

    global[0] = (size_t)Ndim;
    global[1] = (size_t)Mdim;
    status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global, NULL,
                                    0, NULL, &prof_event);
    if (status) {
        cout << "clEnqueueNDRangeKernel error" << endl;
    }
    gettimeofday(&end, NULL);
    printf("CPU data to GPU use time %ldms\n",
           1000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1000);

    //clFinish(commandQueue);
    clWaitForEvents(1, &prof_event);
    cl_ulong ev_start_time = 0;
    cl_ulong ev_end_time = 0;
    cl_int ret = clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START,
                                         sizeof(cl_ulong), &ev_start_time, NULL);
    ret = clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END,
                                  sizeof(cl_ulong), &ev_end_time, NULL);
    double run_time = (double)(ev_end_time - ev_start_time);
    printf("\nExecution time in milliseconds = %0.3f ms\n", run_time / 1000000.0);

    struct timeval start2, end2;
    gettimeofday(&start2, NULL);
    status = clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE, 0,
                                 sizeof(float) * szC, result, 0, NULL, NULL);
    if (status) {
        printf("clEnqueueReadBuffer error status=%d\n", (int)status);
    }
    gettimeofday(&end2, NULL);
    printf("GPU result to CPU use time %ldms\n",
           1000 * (end2.tv_sec - start2.tv_sec) + (end2.tv_usec - start2.tv_usec) / 1000);

    return cv::Mat(Ndim, Mdim, CV_32FC1, result);
}
c. What we observe is that each clEnqueueWriteBuffer costs roughly 2 ms. This is quite significant for our delay budget, hence we are now looking for an alternative solution.
1) I'm afraid your driver doesn't support the feature. In R14, the feature was still under an experimental build flag that your platform provider chose not to enable. There's nothing we can do to help. You can try to contact the platform provider to get an updated driver or one with experimental features turned on. Sorry.
2) Thanks for posting the code. Before I comment on the code, I should point out that writing efficient matrix multiplication kernels is no simple task. I encourage you to have a look at our Compute Library (https://github.com/ARM-software/ComputeLibrary) and see if that provides what you need directly.
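For instance, a single-precision GEMM through the library's OpenCL backend looks roughly like the sketch below. It is based on the cl_sgemm example in the repository; the exact API may differ between releases, so treat it as an outline rather than working code:

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/CL/CLFunctions.h"
#include "arm_compute/runtime/CL/CLScheduler.h"

using namespace arm_compute;

void gemm_example(size_t M, size_t N, size_t K)
{
    CLScheduler::get().default_init();  // set up the OpenCL context/queue for the Mali device

    // TensorShape is (width, height), i.e. (cols, rows).
    CLTensor a, b, dst;
    a.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    b.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    CLGEMM gemm;
    gemm.configure(&a, &b, nullptr, &dst, 1.0f, 0.0f);  // dst = 1.0 * a * b

    a.allocator()->allocate();
    b.allocator()->allocate();
    dst.allocator()->allocate();

    // Fill the inputs via map/unmap instead of explicit copies (same for b).
    a.map();
    /* ... write matrix data through a.buffer() ... */
    a.unmap();

    gemm.run();
    CLScheduler::get().sync();
}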
Having said that, here are a couple of comments on your code:
- clEnqueueWriteBuffer will always perform a copy. You may want to consider using clEnqueueMapBuffer and writing data directly through the mapped pointer if possible. The same comment applies to the final read: you may be able to use a map operation and avoid one copy (see the sketch after this list).
- Most of your operations are blocking. This means that you'll pay the cost of dispatch and synchronisation multiple times. Try making all commands non-blocking apart from the final clEnqueue{Read,Map}Buffer; you can use queue profiling on copy or map commands as well. Generally speaking, you want to give the OpenCL driver as big a batch of commands as possible.
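To make that concrete, here is a rough sketch combining both points. It assumes buffers created with CL_MEM_ALLOC_HOST_PTR (so the driver can allocate host-visible memory itself, no import extension needed) and a queue/kernel set up as in your code; untested:

// Let the driver allocate memory that both CPU and GPU can see.
cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                             sizeof(float) * szA, NULL, &error);
cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                             sizeof(float) * szC, NULL, &error);

// Map the input and write matrix data straight into it; this replaces
// the separate host array plus clEnqueueWriteBuffer copy.
float *pA = (float *)clEnqueueMapBuffer(commandQueue, bufA, CL_TRUE, CL_MAP_WRITE,
                                        0, sizeof(float) * szA, 0, NULL, NULL, &error);
/* ... fill pA ... */
clEnqueueUnmapMemObject(commandQueue, bufA, pA, 0, NULL, NULL);

// Enqueue the kernel without waiting on it.
clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

// One blocking map at the very end is the only synchronisation point.
float *pC = (float *)clEnqueueMapBuffer(commandQueue, bufC, CL_TRUE, CL_MAP_READ,
                                        0, sizeof(float) * szC, 0, NULL, NULL, &error);
/* ... consume results through pC ... */
clEnqueueUnmapMemObject(commandQueue, bufC, pC, 0, NULL, NULL);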
Let me know how you get on :).
Kevin - thanks for your valuable input. Will try to optimize the code as you recommend; let me get back to you once I have an update.