I am using the Odroid XU3 board. It has the Samsung Exynos5422 SoC, which combines big.LITTLE ARM CPU clusters with the ARM Mali T628 MP6 GPU. I would like to run the CPU and the GPU in parallel on different sections of the array that I am processing. Currently, to enforce coherency I have to use clEnqueueMapBuffer and clEnqueueUnmapMemObject. Using these OpenCL functions causes a performance degradation that makes running the CPU and GPU in parallel pointless. I have the following questions.
1) Are the caches of the ARM CPU and the GPU on this SoC coherent?
2) The GPU shows up as two devices in OpenCL. Are these GPU devices cache coherent with each other?
3) Is there any way to enforce coherency other than using the MapBuffer and UnmapMemObject CL functions?
Hi kiranchandramohan,
1) The caches on the GPU side will automatically be cleaned / invalidated by the GPU driver as needed; the CPU caches, however, need to be updated manually, which is why you need to call map / unmap.
2) That's because the 6 cores of the Mali T628 MP6 are not all cache coherent with each other, therefore they appear as one cluster of 4 cores and a second of 2 cores; each cluster translates into a separate OpenCL device. This is specific to the Mali T628: all the cores in the earlier and later models are cache coherent and therefore appear as a single OpenCL device.
3) When using an OpenCL buffer or image backed by an externally allocated memory region, it is the application's responsibility to update the CPU caches (you basically can't call Map / Unmap on such buffers).
On Android for example you can create an EGLImageKHR from a gralloc buffer then create a cl_image from it using clCreateFromEGLImageKHR.
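As a rough sketch of that path (assuming an already initialised EGLDisplay, a cl_context created on the Mali device, support for the EGL_ANDROID_image_native_buffer and cl_khr_egl_image extensions, and a gralloc-backed native buffer; names are illustrative and error checking is omitted):

```c
#include <CL/cl.h>
#include <CL/cl_egl.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

/* Extension entry point, fetched at runtime. */
typedef cl_mem (*pfn_clCreateFromEGLImageKHR)(
    cl_context, CLeglDisplayKHR, CLeglImageKHR, cl_mem_flags,
    const cl_egl_image_properties_khr *, cl_int *);

cl_mem image_from_gralloc(cl_platform_id platform, cl_context ctx,
                          EGLDisplay dpy, EGLClientBuffer gralloc_buffer)
{
    PFNEGLCREATEIMAGEKHRPROC create_egl_image =
        (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
    pfn_clCreateFromEGLImageKHR create_cl_image =
        (pfn_clCreateFromEGLImageKHR)clGetExtensionFunctionAddressForPlatform(
            platform, "clCreateFromEGLImageKHR");

    /* Wrap the gralloc buffer (an ANativeWindowBuffer) in an EGLImage. */
    EGLint attribs[] = { EGL_IMAGE_PRESERVED_KHR, EGL_TRUE, EGL_NONE };
    EGLImageKHR img = create_egl_image(dpy, EGL_NO_CONTEXT,
                                       EGL_NATIVE_BUFFER_ANDROID,
                                       gralloc_buffer, attribs);

    /* Import the EGLImage as a cl_mem image. */
    cl_int err;
    return create_cl_image(ctx, (CLeglDisplayKHR)dpy, (CLeglImageKHR)img,
                           CL_MEM_READ_WRITE, NULL, &err);
}
```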
Hope this helps,
Thanks,
Anthony
Hello abarbier,
Thanks for the informative quick reply.
1) You say that the GPU driver will automatically clean the caches. Does this happen when the clReleaseMemObject function is called? Or does it happen by default because the GPU caches are write-through?
2) So what is the right way to clean/invalidate the GPU caches so that the two GPU devices read the updated data?
3) I am using OpenCL buffers created with the CL_MEM_ALLOC_HOST_PTR flag and map them so that they can be used by the CPU, relying on map and unmap to clean/invalidate the caches. Also, how can the cache lines corresponding to an OpenCL buffer be updated without using the map/unmap OpenCL calls?
--Kiran
So what is the right way to clean/invalidate the GPU caches so that the two GPU devices read the updated data?
I don't think you need to do anything other than express the dependencies between commands in the CL queues correctly; the driver handles the rest, i.e. GPU cache coherency should be totally transparent to the application.
HTH,
Pete
Hi Kiran,
1) Like Pete mentioned GPU cache coherency is transparent to the application, the driver will take care of it for you.
2) According to the CL specs it is not legal to share a buffer between two devices. In practice, if the two devices don't share cache lines it will work; you just need to be careful with the memory access patterns in your kernels.
3) The OpenCL driver will always use cached memory for buffers and images. If you want to use uncached memory or handle the cache maintenance yourself, you have to use the OpenCL / EGL extensions to import externally allocated memory instead of letting the driver decide for you.
Have a look at https://www.khronos.org/registry/egl/extensions/KHR/EGL_KHR_image_base.txt
and https://www.khronos.org/registry/cl/api/1.2/cl_egl.h
What exactly are you hoping to achieve by handling cache maintenance yourself?
Cache operations tend to take a lot of time, but cached memory tends to still be much faster than uncached.
Also, please make sure the CPU on your platform is using a "performance" governor and not an "ondemand" one, as cache maintenance operations usually fail to be detected as CPU utilisation by the governor and will therefore be performed at a low frequency:
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
I don't want to use uncached memory, since there is a lot of spatial and temporal locality in the data and I want to take advantage of that.
I am using and would like to keep using cached memory. My requirement is that I want to do processing simultaneously on both the CPU and the GPU. The workloads that I am using are data-parallel, so the CPU and GPU can work on different sections of the arrays in parallel. For this, I allocate OpenCL buffers (with CL_MEM_ALLOC_HOST_PTR) and map them so that they can be used by the CPU as well. If there are no dependencies then this is fine, but if there are dependencies between the data computed on the CPU and the GPU, then some kind of memory synchronization has to be performed so that both the CPU and the GPU read the updated values. I have been using map/unmap to do this, but sometimes this kills the performance, possibly due to the overhead of map and unmap or due to cache invalidation on the CPU side.
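For reference, here is a minimal sketch of the pattern I am describing. The kernel is assumed to take the buffer as argument 0 and to only touch indices [0, GPU_PART); N and the split point are placeholders:

```c
#include <CL/cl.h>

#define N        (1u << 20)
#define GPU_PART (N / 2)   /* GPU handles [0, GPU_PART), CPU the rest */

void split_work(cl_context ctx, cl_command_queue queue, cl_kernel kernel)
{
    cl_int err;
    /* Driver-allocated storage that the CPU can map. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                N * sizeof(float), NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* Map only the CPU's section; it must not overlap the GPU's section. */
    float *cpu = (float *)clEnqueueMapBuffer(
        queue, buf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
        GPU_PART * sizeof(float), (N - GPU_PART) * sizeof(float),
        0, NULL, NULL, &err);

    /* Kick off the GPU over its section... */
    size_t gws = GPU_PART;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFlush(queue);

    /* ...and work on the CPU's section in parallel. */
    for (size_t i = 0; i < N - GPU_PART; ++i)
        cpu[i] *= 2.0f;                    /* placeholder CPU work */

    /* Unmap performs the CPU-side cache maintenance; this is the
     * synchronisation point before anyone consumes both halves. */
    clEnqueueUnmapMemObject(queue, buf, cpu, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(buf);
}
```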
In one of your previous replies you say that OpenCL buffers can't be mapped or unmapped, but I find that they can be.
Sorry for the confusion: I only meant that externally allocated buffers/images cannot be mapped / unmapped using clEnqueueMap* / Unmap* (i.e. buffers using the EGL or GLES interops, or relying on a dma_buf allocation).
Despite the misleading name, all that the clEnqueueMap* / Unmap* operations do is perform cache maintenance.
So the performance hit you see is due to the time it takes to perform the CPU cache maintenance (which is why I suggested you double-check that your CPU governor was actually set to performance mode).
The only other thing I can suggest is to use the "offset" and "cb" parameters of clEnqueueMapBuffer to map only the zone you're about to work on.
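For example, something along these lines, which maps just a small window of a large buffer for reading; the 64 KiB offset and 4 KiB window are arbitrary, and `queue` / `buf` come from elsewhere:

```c
#include <CL/cl.h>

/* Sketch: map just the sub-range the CPU is about to read, so only those
 * cache lines need maintenance. */
void read_window(cl_command_queue queue, cl_mem buf)
{
    size_t offset = 64 * 1024;  /* start of the region of interest (bytes) */
    size_t cb     = 4 * 1024;   /* the "cb"/"size" parameter: bytes mapped */
    cl_int err;
    char *p = (char *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                         offset, cb, 0, NULL, NULL, &err);
    /* ... read p[0 .. cb) on the CPU ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}
```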
Hello Anthony,
Thanks again for your comments. I have a few more questions.
I checked and found that "performance" is the CPU governor in use.
Are you sure that Map/Unmap does only cache maintenance? It has to map the memory allocated for the OpenCL buffer into the CPU's virtual address space, doesn't it? I was thinking that clEnqueueMap is basically doing an mmap (which involves setting up the page tables and invalidating/cleaning the cache) and that Unmap does a munmap (which tears down the page tables and invalidates/cleans the cache).
If Map/Unmap only affects the CPU, then how am I managing to do the synchronization correctly?
For example, I have an array A[1..10].
Case 1: The CPU modifies locations A[1..10] and after that I call map/unmap, which forces the entries in the CPU's cache to be flushed/invalidated. But the GPU's cache might still contain the old values for locations A[1..10]. How does the GPU get the updated values after map/unmap?
Case 2: The GPU modifies locations A[1..10] and after that I call map/unmap, which forces the entries in the CPU's cache to be flushed/invalidated. But the GPU's cache might not have been flushed. Then how does the CPU get the updated values after map/unmap?
How Map / Unmap are implemented is down to the implementer. In our case the memory gets mapped when it is allocated and remains mapped for the entire life of the buffer / image, which means Map / Unmap only perform cache maintenance, not any memory mapping.
As mentioned before, the driver will take care of automatically synchronising the GPU caches before and after every GPU job; the CPU then gets the updated values by calling map / unmap. (Because Map / Unmap are associated with a command queue, you get a dependency between the GPU jobs and Map / Unmap.)
Case 1:
map -> unmap -> enqueueNDRangeKernel: the enqueued kernel depends on the unmap operation, therefore the GPU cache maintenance will happen after the unmap.
Case 2:
enqueueNDRangeKernel -> map: the map operation depends on the kernel execution, therefore the GPU cache maintenance will happen before the map.
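Expressed as code, the two cases look roughly like this on a single in-order queue (the buffer, kernel and size are placeholders, and the kernel's buffer argument is assumed to be set already; with an out-of-order queue you would pass the map/unmap events in the kernel's wait list instead):

```c
#include <CL/cl.h>

/* The in-order queue provides the dependencies; the driver schedules the
 * GPU-side cache maintenance around each job automatically. */
void case_examples(cl_command_queue q, cl_mem buf, cl_kernel kernel, size_t n)
{
    cl_int err;
    size_t gws = n;

    /* Case 1: CPU produces, GPU consumes. */
    void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, n * sizeof(float), 0, NULL, NULL, &err);
    /* ... CPU writes through p ... */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL); /* cleans CPU caches */
    /* Ordered after the unmap, so the GPU job sees the CPU's writes. */
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

    /* Case 2: GPU produces, CPU consumes. */
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    /* Ordered after the kernel: the driver cleans the GPU caches around
     * the job, and the map invalidates any stale CPU lines. */
    p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ,
                           0, n * sizeof(float), 0, NULL, NULL, &err);
    /* ... CPU reads the up-to-date values through p ... */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
}
```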
Thanks Anthony. You are of great help.
Does this mean that the Mali T628 MP6 is a worse GPU if it cannot use all 6 cores together? Why are the cores divided into two groups?
Hi kubussz,
Please define which GPU you are comparing the T628-MP6 against in order to label it 'worse'.
Perhaps I can help by explaining a few things:
Before the Mali T628 (the T604 and T624 for example), you could only have up to a maximum of 4 cores, and those 4 cores were coherent with each other.
With the need for more cores, we created the Mali T628, which allows a maximum of 2 clusters of 4 cores, i.e. 8 cores. However, because the cores work in clusters of 4, they are not coherent across clusters (but are still coherent within each cluster). The T628-MP6 mentioned above is configured as one cluster of 4 and a second cluster of 2.
This was a temporary solution for the demand of more cores. After the T628, we redesigned the GPU to remove this limitation and get rid of the cluster design. From the T7xx series onwards, you can have more than 4 cores and all of them are coherent.
Please also note that this cluster design of the T628 does not affect performance for graphical tasks. The GPU is capable of utilising all 6 cores (in the T628-MP6) for graphical applications without issue. The coherency issue only comes in when targeting the OpenCL API.
If you have a choice of GPUs to target, you should first assess your algorithm to determine what level of performance you are aiming for. A lot of CL applications we see can easily run on just 2 cores, for example.
If you do need more than 4 cores' worth of performance, then the T628 or a later line of GPUs is what you should be targeting.
If you are not willing to split your algorithm across the two CL devices (the two clusters of the T628), then you should be targeting the T7xx and later lines of GPUs.
Please note that we have seen others easily utilise both clusters of the T628 this way, so it may not be as difficult as you seem to believe it is.
I do not fully understand your second question about which processors are divided... I hope there is enough information above to have answered it, however.
If you have any further questions, feel free to ask.
Kind Regards,
Michael McGeagh
Thank you very much for your answer; it explained a lot to me.
Do I understand correctly that the creation of two clusters of cores "was a temporary solution for the demand of more cores" and was not intended to improve performance?
I would also like to ask about the naming. Why do you use the name MP6 (cores)? How do these cores differ from the cores of e.g. the Adreno 330?
"This was a temporary solution for the demand of more cores" and was not intended to improve performance?
I disagree with that statement. Having 2 clusters does increase performance.
Graphics, the primary use-case of a GPU, can utilise both clusters without issue and thus having 2 vs 1 gives you increased performance.
Compute, however, will only utilise 1 cluster by default, unless you tell it to utilise both 'cl devices' (clusters), in which case you will get increased performance.
By increased performance, I am comparing with a single cluster version such as a T624. So in both cases, you will get the same performance or better.
The Mali-T6xx family of GPUs had a naming convention where the last digit denotes the maximum number of cores the silicon partner can configure the GPU to have.
The T604 can have 1 to 4 cores. The T622 can have 1 to 2 cores. The T624 can have 1 to 4 cores. The T628 can have 1 to 8 cores.
The MPx suffix is the actual number of cores in that piece of silicon.
A silicon partner may license the T628 and create several versions from that single license. They may create an MP2 version for their low-end SoCs, and an MP8 for their high-end SoCs.
The naming scheme changed with the T7xx and later families of GPUs, as we can now scale to greater than 9 cores and we understood the confusion caused by the older scheme. So now we do not have different maximum core configurations, but just license the GPU as is. That is why the T7xx line only has 2 options: the T720 and the T760. Like before, it is the MPx suffix that denotes the actual number of cores in that SoC.
Regarding your question on the comparison with Adreno, that is a more complex matter that has already been answered before. It is about the terminology used: basically, one of our "cores" is not equivalent to one of Adreno's "cores".
For more, feel free to read this: The Mali GPU: An Abstract Machine, Part 3 - The Midgard Shader Core
And this: Multicore or Multi-pipe GPUs: Easy steps to becoming multi-frag-gasmic
I hope this helps. Let me know if you have any further questions.
So the problem of using all 6 cores applies only to OpenCL?
To clarify:
Graphical applications, such as those using OpenGL ES, do not need to worry about the two clusters. Their applications will run on all 6 cores (T628-MP6) without modification.
Compute applications, such as those using OpenCL, do need to worry about the two clusters. If they do not modify their application to utilise both CL devices in parallel, then they will only utilise 4 out of the 6 cores. With modification however, they can run on all 6 cores.
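To make that modification concrete, here is a minimal sketch of targeting both clusters from one context. The kernel name "my_kernel" is a placeholder and is assumed to take no arguments; error handling is omitted:

```c
#include <CL/cl.h>

void run_on_both_clusters(cl_platform_id platform, const char *src, size_t n)
{
    cl_device_id dev[2];
    cl_uint ndev = 0;
    /* On a T628-MP6 this reports two GPU devices: the 4-core cluster
     * and the 2-core cluster. */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, dev, &ndev);

    cl_context ctx = clCreateContext(NULL, ndev, dev, NULL, NULL, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, ndev, dev, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "my_kernel", NULL);

    /* One queue per cluster; each gets a disjoint slice of the NDRange
     * via the global work offset. */
    cl_command_queue q[2];
    size_t half = n / 2;
    for (cl_uint i = 0; i < ndev; ++i) {
        q[i] = clCreateCommandQueue(ctx, dev[i], 0, NULL);
        size_t offset = i * half;
        size_t gws    = (i == 0) ? half : n - half;
        clEnqueueNDRangeKernel(q[i], kernel, 1, &offset, &gws,
                               NULL, 0, NULL, NULL);
        clFlush(q[i]);           /* let both clusters run concurrently */
    }
    for (cl_uint i = 0; i < ndev; ++i)
        clFinish(q[i]);
}
```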