Cache Coherence for big.LITTLE CPUs and Arm Mali-T628 MP6 GPU on the ODROID-XU3 board

I am using the ODROID-XU3 board, which has the Samsung Exynos5422 SoC. The SoC has big.LITTLE CPUs and the Arm Mali-T628 MP6 GPU. I would like to run the CPU and the GPU in parallel on different sections of the array that I am processing. Currently, to enforce coherency, I have to use clEnqueueMapBuffer and clEnqueueUnmapMemObject. These OpenCL calls degrade performance so much that running the CPU and GPU in parallel becomes pointless. I have the following questions.

1) Are the caches of the ARM CPU and the GPU on this SoC coherent?

2) The GPU shows up as two devices in OpenCL. Are these GPUs cache coherent?

3) Is there any way to enforce coherency other than using the clEnqueueMapBuffer and clEnqueueUnmapMemObject functions?

  • Hi kubussz,

    Which GPU are you comparing the T628-MP6 to when you label it as 'worse'?

    Perhaps I can help by explaining a few things:

    Before the Mali T628 (the T604 and T624, for example), a GPU could have at most 4 cores, and those 4 cores were coherent with one another.

    With the need for more cores, we created the Mali T628, which allows a maximum of 2 clusters of 4 cores, i.e. 8 cores. However, because the cores work in clusters of 4, they are not coherent across clusters (although they are still coherent within each cluster). The T628-MP6 mentioned above is configured as one cluster of 4 and a second cluster of 2.

    This was a temporary solution to the demand for more cores. After the T628, we redesigned the GPU to remove this limitation and drop the cluster design. From the T7xx series onwards, you can have more than 4 cores, and all are coherent.

    Please also note that the cluster design of the T628 does not affect performance for graphics tasks. The GPU can use all 6 cores (in the T628-MP6) for graphics applications without issue. The coherency issue only arises when targeting the OpenCL API.

    If you have a choice of GPUs to target, you should first assess your algorithm to determine what level of performance you are aiming for. A lot of the CL applications we see can easily run on just 2 cores, for example.

    If you do need more than 4 cores of performance, then the T628 or a later GPU is what you should be targeting.

    If you are not willing to separate your algorithm across the two CL devices (the two clusters of the T628), then you should be targeting the T7xx series or later GPUs.

    Please note that we have seen others easily utilise both clusters of the T628 this way, so it may not be as difficult as you seem to believe it is.

    I do not fully understand your second question about which processors are divided, but I hope the information above answers it.

    If you have any further questions, feel free to ask.

    Kind Regards,

    Michael McGeagh
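
For readers wondering what "separating the algorithm across the two CL devices" could look like in practice, one simple approach is to give each cluster a contiguous slice of the array in proportion to its core count (4 and 2 on the T628-MP6). The sketch below is only an illustration of that partitioning arithmetic in plain C. The `Slice` type and the `split_by_cores` helper are hypothetical names, not part of OpenCL or of any Arm library, and a proportional split is an assumption; the best ratio for a real kernel would have to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: a contiguous range of array elements for one device. */
typedef struct {
    size_t offset; /* index of the first element in the slice */
    size_t count;  /* number of elements in the slice */
} Slice;

/*
 * Hypothetical helper: split n elements across ndev devices in proportion
 * to each device's core count. The last device takes whatever remains, so
 * the slices always cover exactly n elements with no overlap.
 */
static void split_by_cores(size_t n, const unsigned *cores, size_t ndev,
                           Slice *out)
{
    size_t total = 0;
    for (size_t i = 0; i < ndev; i++) {
        total += cores[i];
    }
    assert(ndev > 0 && total > 0);

    size_t offset = 0;
    for (size_t i = 0; i < ndev; i++) {
        size_t count = (i + 1 == ndev) ? n - offset
                                       : (n * cores[i]) / total;
        out[i].offset = offset;
        out[i].count = count;
        offset += count;
    }
}
```

With core counts of {4, 2} and 600 elements, the first slice covers elements 0 to 399 and the second covers 400 to 599. Each slice could then be placed in its own buffer (or sub-buffer) and enqueued on the command queue of the corresponding CL device, so that the two clusters never write to the same region.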
