I am using the Odroid-XU3 board. It has the Samsung Exynos 5422 SoC, which combines big.LITTLE CPU clusters with an ARM Mali-T628 MP6 GPU. I would like to run the CPU and the GPU in parallel on different sections of the array that I am processing. Currently, to enforce coherency I have to use clEnqueueMapBuffer and clEnqueueUnmapMemObject. Using these OpenCL functions causes a performance degradation that makes running the CPU and GPU in parallel pointless. I have the following questions.
1) Are the caches of the ARM CPU and the GPU on this SoC coherent?
2) The GPU shows up as two devices in OpenCL. Are these GPUs cache coherent?
3) Is there any way to enforce coherency other than using the MapBuffer and UnmapMemObject CL functions?
Hi Kiran,
1) Like Pete mentioned, GPU cache coherency is transparent to the application; the driver will take care of it for you.
2) According to the CL specs it is not legal to share a buffer between two devices. In practice, if the two devices don't share cache lines it will work; you just need to be careful with the memory access patterns in your kernels.
3) The OpenCL driver will always use cached memory for buffers and images. If you want to use uncached memory, or handle the cache maintenance yourself, then you have to use OpenCL / EGL extensions to import externally allocated memory instead of letting the driver decide for you.
Have a look at https://www.khronos.org/registry/egl/extensions/KHR/EGL_KHR_image_base.txt
and https://www.khronos.org/registry/cl/api/1.2/cl_egl.h
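As a rough sketch of what the external-allocation path looks like, assuming the driver exposes the cl_khr_egl_image extension and that an EGLImage has already been created (e.g. from a dma_buf; that part is platform specific and not shown here). The helper name is illustrative:

#include <CL/cl.h>
#include <CL/cl_egl.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

/* Import an existing EGLImage into OpenCL. The extension entry point
 * has to be looked up at run time. Error handling kept minimal. */
cl_mem import_egl_image(cl_context ctx, cl_platform_id platform,
                        EGLDisplay dpy, EGLImageKHR img, cl_int *err)
{
    clCreateFromEGLImageKHR_fn create_from_egl =
        (clCreateFromEGLImageKHR_fn)
        clGetExtensionFunctionAddressForPlatform(platform,
                                                 "clCreateFromEGLImageKHR");
    if (!create_from_egl) {
        *err = CL_INVALID_OPERATION; /* extension not available */
        return NULL;
    }
    return create_from_egl(ctx, (CLeglDisplayKHR)dpy, (CLeglImageKHR)img,
                           CL_MEM_READ_WRITE, NULL, err);
}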
What exactly are you hoping to achieve by handling cache maintenance yourself?
Cache operations tend to take a lot of time, but cached memory still tends to be much faster than uncached.
Also please make sure the CPU on your platform is using the "performance" governor and not "ondemand", as cache maintenance operations usually fail to be detected as CPU utilisation by the governor and will therefore be performed at a low frequency:
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
(On a big.LITTLE system you may need to repeat this for the other cluster, e.g. cpu4.)
Hope this helps,
Anthony
I don't want to use uncached memory since there is a lot of spatial and temporal locality in the data, and I want to take advantage of that.
I am using, and would like to keep using, cached memory. My requirement is to do processing simultaneously on both the CPU and the GPU. The workloads I am using are data-parallel, so the CPU and GPU can work on different sections of the arrays in parallel. For this, I allocate OpenCL buffers (with CL_MEM_ALLOC_HOST_PTR) and map them so that they can also be used by the CPU. If there are no dependencies that is fine, but if there are dependencies between the data computed on the CPU and the GPU, then some kind of memory synchronisation has to be performed so that both the CPU and the GPU read the updated values. I have been using map/unmap to do this, but sometimes this kills the performance, possibly due to the overhead of map and unmap or due to cache invalidation on the CPU side.
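Roughly, this is the pattern (a minimal sketch with illustrative names, error checking omitted, and the CPU and GPU steps shown serially):

#include <CL/cl.h>
#include <stddef.h>

void process_shared(cl_context ctx, cl_command_queue q, cl_kernel k, size_t n)
{
    cl_int err;
    /* Let the driver allocate host-accessible (cached) memory. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                n * sizeof(float), NULL, &err);

    /* Map so the CPU can work on its section of the array. */
    float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, n * sizeof(float),
                                           0, NULL, NULL, &err);
    for (size_t i = 0; i < n / 2; ++i)
        p[i] = (float)i;               /* CPU fills the first half */

    /* Unmap before the GPU touches the data. */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);

    /* GPU processes the second half (assuming the kernel indexes
     * the buffer with get_global_id). */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    size_t offset = n / 2, gsize = n - n / 2;
    clEnqueueNDRangeKernel(q, k, 1, &offset, &gsize, NULL, 0, NULL, NULL);
    clFinish(q);
    clReleaseMemObject(buf);
}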
In one of your previous replies you say that OpenCL buffers can't be mapped or unmapped. I find that they can be mapped and unmapped.
Sorry for the confusion: I only meant externally allocated buffers/images cannot be mapped / unmapped using clEnqueueMap* / Unmap* (i.e. buffers using EGL or GLES interop, or relying on a dma_buf allocation).
Despite the misleading name, all the clEnqueueMap* / Unmap* operations do is perform cache maintenance.
So the performance hit you get is due to the time it takes to perform the CPU cache maintenance (which is why I suggested double-checking that your CPU governor was actually set to performance mode).
The only other thing I can suggest is to use the "offset" and "cb" parameters of clEnqueueMapBuffer to only map the region you're about to work on.
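Something along these lines (a sketch with illustrative names, assuming a float buffer where the CPU works on the first half):

#include <CL/cl.h>
#include <stddef.h>

/* Map only the sub-range [0, n/2) of the buffer so the cache
 * maintenance covers just the region the CPU will touch. */
void map_first_half(cl_command_queue q, cl_mem buf, size_t n)
{
    cl_int err;
    size_t half = (n / 2) * sizeof(float);
    float *lo = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE,
                                            CL_MAP_READ | CL_MAP_WRITE,
                                            0,    /* offset */
                                            half, /* cb: size of mapped region */
                                            0, NULL, NULL, &err);
    /* ... CPU works on lo[0 .. n/2 - 1] ... */
    clEnqueueUnmapMemObject(q, buf, lo, 0, NULL, NULL);
}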
Hello Anthony,
Thanks again for your comments. I have a few more questions.
I checked and found that "performance" is the CPU governor in use.
Are you sure that Map/Unmap only does cache maintenance? It has to map the memory allocated for the OpenCL buffer into the CPU's virtual address space, doesn't it? I was thinking that clEnqueueMap is basically doing an mmap (which involves setting up the page tables and invalidating/cleaning the cache) and munmap (which tears down the page tables and invalidates/cleans the cache).
If Map/Unmap only affects the CPU, then how am I managing to do the synchronisation correctly?
For example, I have an array A[1..10].
Case 1: The CPU modifies locations A[1..10] and after that I call map/unmap, which forces the entries in the CPU's cache to be flushed/invalidated. But the GPU's cache might still contain the old values for locations A[1..10]. How does the GPU get the updated values after map/unmap?
Case 2: The GPU modifies locations A[1..10] and after that I call map/unmap, which forces the entries in the CPU's cache to be flushed/invalidated. But the GPU's cache might not have been flushed. How then does the CPU get the updated values after map/unmap?
--Kiran
The way Map / Unmap are implemented is down to the implementer. In our case the memory gets mapped when it is allocated and remains mapped for the entire life of the buffer / image, which means Map / Unmap only perform cache maintenance, not any memory mapping.
As mentioned before, the driver takes care of automatically synchronising the GPU caches before and after every GPU job; the CPU then gets the updated values by calling map / unmap. (Because Map / Unmap are associated with a command queue, you get a dependency between the GPU jobs and the Map / Unmap operations.)
Case 1:
map -> unmap -> enqueueNDRangeKernel: there is a dependency of the enqueued kernel on the unmap operation, so the GPU cache maintenance will happen after the unmap.
Case 2:
enqueueNDRangeKernel -> map: there is a dependency of the map operation on the kernel execution, so the GPU cache maintenance will happen before the map.
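To make the two cases concrete, here is a sketch using events to spell out the dependencies (with an in-order queue they are implicit anyway; names are illustrative):

#include <CL/cl.h>
#include <stddef.h>

void two_cases(cl_command_queue q, cl_kernel k, cl_mem buf, size_t n)
{
    cl_int err;
    size_t bytes = n * sizeof(float);
    cl_event unmapped, kernel_done;

    /* Case 1: CPU writes, then the GPU kernel reads. The kernel waits
     * on the unmap, so the GPU cache maintenance happens after it. */
    float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, bytes, 0, NULL, NULL, &err);
    /* ... CPU writes to p ... */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, &unmapped);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 1, &unmapped, &kernel_done);

    /* Case 2: the GPU kernel writes, then the CPU reads. The map waits
     * on the kernel, so the GPU cache maintenance happens before it. */
    p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ,
                                    0, bytes, 1, &kernel_done, NULL, &err);
    /* ... CPU reads the values produced by the GPU ... */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);

    clReleaseEvent(unmapped);
    clReleaseEvent(kernel_done);
}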
Thanks Anthony. You have been a great help.