I am using the ODROID-XU3 board, which has the Samsung Exynos 5422 SoC. The SoC has big.LITTLE CPUs and the ARM Mali-T628 MP6 GPU. I would like to run the CPU and the GPU in parallel on different sections of the array that I am processing. Currently, to enforce coherency, I have to use clEnqueueMapBuffer and clEnqueueUnmapMemObject. These OpenCL calls degrade performance enough that running the CPU and GPU in parallel becomes pointless. I have the following questions.
1) Are the caches of the ARM CPU and the GPU on this SoC coherent?
2) The GPU shows up as two devices in OpenCL. Are these GPUs cache coherent?
3) Is there any way to enforce coherency other than using the MapBuffer and UnmapMemObject CL functions?
Sorry for the confusion: I only meant that externally allocated buffers/images cannot be mapped / unmapped using clEnqueueMap* / Unmap* (i.e. buffers using EGL or GLES interops, or relying on a dma_buf allocation).
Despite the misleading name, all that the clEnqueueMap* / Unmap* operations do is perform cache maintenance.
So the performance hit you get is due to the time it takes to perform the CPU cache maintenance (which is why I suggested you double check that your CPU governor was actually set to run in performance mode).
The only other thing I can suggest is to use the "offset" and "cb" parameters of clEnqueueMapBuffer to map only the zone you're about to work on.
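As a minimal sketch of that suggestion, the helper below computes the byte range for one slice of an array so that only that slice is mapped. The `map_region` type and the `sub_region` function are hypothetical names introduced here for illustration, not part of OpenCL; only the clEnqueueMapBuffer / clEnqueueUnmapMemObject calls shown in the comment are real API, and `queue` and `buf` there are assumed to already exist.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the byte range of a buffer to map. */
typedef struct {
    size_t offset; /* byte offset, passed as the "offset" argument */
    size_t cb;     /* byte count, passed as the "cb" argument      */
} map_region;

/* Convert an element range [first_elem, first_elem + n_elems) into
 * the byte offset and byte count expected by clEnqueueMapBuffer. */
static map_region sub_region(size_t first_elem, size_t n_elems,
                             size_t elem_size)
{
    map_region r;
    assert(elem_size > 0);
    r.offset = first_elem * elem_size;
    r.cb = n_elems * elem_size;
    return r;
}

/*
 * Possible use, assuming the GPU handles elements 0..767 of a
 * 1024-float buffer and the CPU handles elements 768..1023:
 *
 *   map_region r = sub_region(768, 256, sizeof(cl_float));
 *   cl_int err;
 *   cl_float *p = clEnqueueMapBuffer(queue, buf, CL_TRUE,
 *                                    CL_MAP_READ | CL_MAP_WRITE,
 *                                    r.offset, r.cb,
 *                                    0, NULL, NULL, &err);
 *   // ... CPU works on p[0] .. p[255] ...
 *   clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
 */
```

Mapping only the CPU's slice should limit the cache maintenance to that range, though how much time this saves depends on the driver.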
Hope this helps,
Thanks again for your comments. I have a few more questions.
I checked and found that performance is the CPU governor in use.
Are you sure that Map/Unmap only does cache maintenance? It has to map the memory location allocated for the OpenCL buffer into the CPU's virtual memory, doesn't it? I was thinking that clEnqueueMap is basically doing an mmap (which sets up the page table and invalidates/cleans the cache) and a munmap (which tears down the page table and invalidates/cleans the cache).
If Map/Unmap is only associated with the CPU, then how am I managing to do the synchronization correctly?
For example, I have an array A[1..10].
Case 1: The CPU modifies locations A[1..10] and after that I call map/unmap, which forces the entries in the CPU's cache to be flushed/invalidated. But the GPU's cache might still contain the old values for locations A[1..10]. How does the GPU get the updated values after map/unmap?
Case 2: The GPU modifies locations A[1..10] and after that I call map/unmap, which forces the entries in the CPU's cache to be flushed/invalidated. But the GPU's cache might not have been flushed. Then how does the CPU get the updated values after map/unmap?
How Map / Unmap are implemented is down to the implementer. In our case the memory gets mapped when it is allocated and remains mapped for the entire life of the buffer / image, which means Map / Unmap only perform cache maintenance and no memory mapping.
As mentioned before, the driver takes care of automatically synchronising the GPU caches before and after every GPU job; the CPU then gets the updated values by calling map / unmap. (Because Map / Unmap are associated with a command queue, there is a dependency between the GPU jobs and Map / Unmap.)
map -> unmap -> enqueueNDRangeKernel : The enqueued kernel depends on the unmap operation, so the GPU cache maintenance happens after the unmap.
enqueueNDRangeKernel -> map : The map operation depends on the kernel execution, so the GPU cache maintenance happens before the map.
Thanks Anthony. You are of great help.