Cache Coherence for big.LITTLE CPUs and Arm Mali-T628 MP6 GPU on the ODROID-XU3 board

I am using the ODROID-XU3 board, which has the Samsung Exynos5422 SoC. The SoC has big.LITTLE CPUs and the Arm Mali-T628 MP6 GPU. I would like to run the CPU and the GPU in parallel on different sections of the array that I am processing. Currently, to enforce coherency, I have to use clEnqueueMapBuffer and clEnqueueUnmapMemObject. These OpenCL calls degrade performance so much that running the CPU and GPU in parallel becomes pointless. I have the following questions.

1) Are the caches of the ARM CPU and the GPU on this SoC coherent?

2) The GPU shows up as two devices in OpenCL. Are these two devices cache coherent with each other?

3) Is there any way to enforce coherency other than using the clEnqueueMapBuffer and clEnqueueUnmapMemObject functions?

0 Michael McGeagh over 9 years ago in reply to kubussz

Hi kubussz,
Please define which GPU you are comparing the T628-MP6 to in order to label it as 'worse'.
Perhaps I can help by explaining a few things:
Before the Mali T628 (the T604 and T624, for example), you could have a maximum of 4 cores, and those 4 cores were coherent with each other.
With the need for more cores, we created the Mali T628, which allows a maximum of 2 clusters of 4 cores, i.e. 8 cores. However, because the cores work in clusters of 4, they are not coherent across clusters (but are still coherent within each cluster). The T628-MP6 mentioned above is configured as one cluster of 4 and a second cluster of 2.
This was a temporary solution to the demand for more cores. After the T628, we redesigned the GPU to remove this limitation and get rid of the cluster design. From the T7xx series onwards, you can have more than 4 cores, and all are coherent.
Please also note that the cluster design of the T628 does not affect performance for graphics tasks. The GPU can use all 6 cores (in the T628-MP6) for graphics applications without issue. The coherency issue only arises when targeting the OpenCL API.
If you have a choice of GPUs to target, you should first assess your algorithm to determine what level of performance you are aiming for. A lot of the CL applications we see can easily run on just 2 cores, for example.
If you do need more than 4 cores of performance, then the T628 or a later GPU is what you should be targeting.
If you are not willing to split your algorithm across the two CL devices (the two clusters of the T628), then you should be targeting the T7xx and later GPUs.
Please note that we have seen others easily utilise both clusters of the T628 this way, so it may not be as difficult as you seem to believe.
I do not fully understand your second question of which processors are divided... I hope there is enough information above to have answered this however.
If you have any further questions, feel free to ask.
Kind Regards,
Michael McGeagh

0 kubussz over 9 years ago in reply to Michael McGeagh

Thank you very much for your answer; it explained a lot.
So do I understand correctly that creating two clusters of cores ("This was a temporary solution for the demand of more cores") was not intended to improve performance?

I would also like to ask you about naming. Why do you use the name MP6 (cores)? How do they differ from the cores of, e.g., the Adreno 330?

0 Michael McGeagh over 9 years ago in reply to kubussz

Hi kubussz,
"This was a temporary solution for the demand of more cores" and was not intended to improve performance?
I disagree with that statement. Having 2 clusters does increase performance.
Graphics, the primary use-case of a GPU, can utilise both clusters without issue and thus having 2 vs 1 gives you increased performance.
Compute, however, will only utilise 1 cluster by default, unless you tell it to utilise both CL devices (clusters), in which case you will have increased performance.
By increased performance, I am comparing with a single cluster version such as a T624. So in both cases, you will get the same performance or better.
I would like to ask you about naming. Why do you use the name MP6 (cores)? How do they differ from the cores of, e.g., the Adreno 330?
In the Mali-T6xx family of GPUs, the naming convention is that the last digit denotes the maximum number of cores the Silicon Partner can configure the GPU to have.
The T604 can have 1 to 4 cores. The T622 can have 1 to 2 cores. The T624 can have 1 to 4 cores. The T628 can have 1 to 8 cores.
The MPx suffix is the actual number of cores in that piece of silicon.
A silicon partner may license the T628 and create several versions from that single license. They may create an MP2 version for their low-end SoCs and an MP8 version for their high-end SoCs.
The naming scheme changed with the T7xx and later family of GPUs as we can now scale to greater than 9 cores, and we understood the confusion faced with the older scheme. So now we do not have different maximum core configurations, but just license the GPU as is. That is why the T7xx only has 2 options. The T720 and the T760. Like before, it is the MPx suffix that denotes the actual number of cores in that SoC.
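To make the T6xx convention above concrete, here is a minimal sketch in C that decodes a name into its maximum and actual core counts. The function name parse_mali_t6xx and the exact string form "T628-MP6" are assumptions made for illustration only; this is not an Arm API, and it deliberately rejects T7xx names because the convention changed there.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not an Arm API): decode a name written as
 * "T6xx-MPn", e.g. "T628-MP6".
 *   max_cores    = last digit of the model number (T6xx convention)
 *   actual_cores = number after the "MP" suffix
 * Returns 0 on success, -1 if the name does not follow this form or
 * if the MP count exceeds the maximum for that model.
 */
static int parse_mali_t6xx(const char *name, int *max_cores, int *actual_cores)
{
    assert(name != NULL && max_cores != NULL && actual_cores != NULL);

    /* Expect 'T', '6', two more digits, then "-MP" and a digit. */
    if (strlen(name) < 8 || name[0] != 'T' || name[1] != '6' ||
        !isdigit((unsigned char)name[2]) ||
        !isdigit((unsigned char)name[3]) ||
        strncmp(name + 4, "-MP", 3) != 0 ||
        !isdigit((unsigned char)name[7])) {
        return -1;
    }

    int max = name[3] - '0';     /* e.g. the '8' in T628 */
    int actual = atoi(name + 7); /* e.g. the 6 in MP6 */

    if (actual < 1 || actual > max) {
        return -1;
    }

    *max_cores = max;
    *actual_cores = actual;
    return 0;
}
```

For "T628-MP6" this yields a maximum of 8 cores and an actual count of 6, matching the description above; a name such as "T624-MP6" is rejected because the T624 tops out at 4 cores.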
Regarding your question on comparison with Adreno. That is a more complex matter that has already been answered before. It is about the terminology used. Basically one of our "cores" is not equivalent to one of Adreno's "core".
For more, feel free to read this: The Mali GPU: An Abstract Machine, Part 3 - The Midgard Shader Core
And this: Multicore or Multi-pipe GPUs: Easy steps to becoming multi-frag-gasmic
I hope this helps. Let me know if you have any further questions.
Kind Regards,
Michael McGeagh

0 kubussz over 9 years ago in reply to Michael McGeagh

So the problem with using all 6 cores applies only to OpenCL?

0 Michael McGeagh over 9 years ago in reply to kubussz

To clarify:
Graphical applications, such as those using OpenGL ES, do not need to worry about the two clusters. Their applications will run on all 6 cores (T628-MP6) without modification.
Compute applications, such as those using OpenCL, do need to worry about the two clusters. If they do not modify their application to utilise both CL devices in parallel, then they will only utilise 4 out of the 6 cores. With modification however, they can run on all 6 cores.
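As a rough illustration of the modification described above, the sketch below splits an array of n elements between the two CL devices in proportion to their core counts (4 and 2 on the T628-MP6). The names slice_t and split_by_cores are hypothetical, and the proportional split assumes throughput scales with core count, which may not hold for every kernel. It only computes the ranges; creating the command queues and enqueueing a kernel per device is left to the application, as is any rounding to a work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one contiguous slice of the input array. */
typedef struct {
    size_t offset; /* index of the first element in the slice */
    size_t count;  /* number of elements in the slice */
} slice_t;

/*
 * Split n elements between two devices in proportion to their core
 * counts. Assumption: work per element is uniform and throughput
 * scales with the number of cores. Device A gets the first slice and
 * device B gets the remainder, so the two slices never overlap.
 */
static void split_by_cores(size_t n, unsigned cores_a, unsigned cores_b,
                           slice_t *a, slice_t *b)
{
    size_t total = (size_t)cores_a + cores_b;
    assert(total > 0 && a != NULL && b != NULL);

    /* floor(n * cores_a / total), computed without overflowing n * cores_a */
    size_t first = (n / total) * cores_a + ((n % total) * cores_a) / total;

    a->offset = 0;
    a->count = first;
    b->offset = first;
    b->count = n - first;
}
```

With n = 600 and core counts of 4 and 2, the first device would get elements 0 to 399 and the second elements 400 to 599. Because the slices are disjoint, neither cluster writes to data the other one reads, which is what makes the lack of coherency between clusters manageable.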
Kind Regards,
Michael McGeagh

0 kubussz over 9 years ago in reply to Michael McGeagh

Thank you very much for your reply; now everything is clear. But can't you modify the drivers to operate all 6 cores for OpenCL?
At what frequency can the Mali-T628 MP6 run?

0 Peter Harris over 9 years ago in reply to kubussz

But can't you modify the drivers to operate all 6 cores for OpenCL?
No; you'd need hardware memory coherency between the two core clusters to make it work transparently, which Mali-T628 doesn't support. Only the application has enough knowledge to know how to split the work safely across the two devices, so that's why we expose it as two separate OpenCL devices to the application. You can use all 6 cores; it just takes a little more effort.
As mcgeagh has pointed out, we added the hardware support to the newer Mali GPUs (Anything in Mali-T700 series onwards), so in newer chipsets this is no longer an issue.
At what frequency can the Mali-T628 MP6 run?
ARM just license the GPU IP to our silicon partners; the achievable top frequency depends on many aspects of physical implementation. This question is best aimed at the supplier of a specific chip.
HTH,
Pete

0 kubussz over 9 years ago in reply to Peter Harris

As mcgeagh has pointed out, we added the hardware support to the newer Mali GPUs (Anything in Mali-T700 series onwards), so in newer chipsets this is no longer an issue.
Hmm, but I think the Mali-T720 also has this problem?
link: http://www.arm.com/products/multimedia/mali-gpu/high-area-efficiency/mali-t720.php

0 Peter Harris over 9 years ago in reply to kubussz

Mali-T720 is designed for low and mid-end devices which want to save silicon area; I'm not aware of any implementation with more than 4 cores.

0 kubussz over 9 years ago in reply to Peter Harris

Thank you very much for your answer.
At what frequency can the Mali-T628 MP6 run?
ARM just license the GPU IP to our silicon partners; the achievable top frequency depends on many aspects of physical implementation. This question is best aimed at the supplier of a specific chip.
But can the Mali-T628 MP6 run at 695 MHz?

0 Michael McGeagh over 9 years ago in reply to kubussz

Hi kubussz
I will repeat what peterharris has already said:
ARM just license the GPU IP to our silicon partners; the achievable top frequency depends on many aspects of physical implementation. This question is best aimed at the supplier of a specific chip.
We cannot answer that question as it is not controlled by us. Please ask your silicon provider of your targeted SoC to see if they can run the GPU at that clock frequency.
Kind Regards,
Michael McGeagh

0 kubussz over 9 years ago in reply to Michael McGeagh

Thank you for your answer.