Hello guys (peterharris), I am using a Mali-T628 GPU on the Odroid XU3 board with the Exynos 5422 chip. I have a couple of questions regarding OpenCL on the Mali GPU:
1. Can we get information of active threads or work groups per shader core for Mali-T6xx similar to active warps or blocks per SM in Nvidia using the occupancy tool?
2. Can we get the assembly code (or intermediate representation) for an OpenCL code running on Mali-T6xx similar to PTX of Nvidia?
I understand you cannot tell me how each of the instructions in OpenCL threads actually gets mapped and executed on the functional units inside the Tripipe, but if we could have some assembly code, that might be useful for predicting the performance of OpenCL threads on this GPU. Thanks!
Hi alprakas,
1. DS-5 Streamline does not show you per-shader-core information. Since what happens on which core is completely transparent to the running application, we do not see any value for developers in per-shader-core statistics.
2. The Instruction Set Architecture (ISA) for the GPUs is strictly proprietary and confidential, and we currently have no plans to release this ISA information publicly.
Regarding predicting the performance of OpenCL, we are continually improving our tools to assist with this in a way that makes sense to developers. We have added OpenCL support to MGD with the GPUVerify tool, and have introduced a new CL timeline view in DS-5 Streamline. We have also produced material on our website that explains some optimisation techniques that help considerably with OpenCL on embedded mobile devices.
If you have any specific questions regarding optimisation advice, please do not hesitate to ask.
Kind Regards,
Michael McGeagh
Thank you mcgeagh and peterharris for your comments. I think I should have given a bit more background on the problem we are trying to solve. We are not Android/game developers. This is purely an academic project, trying to understand the implications of using embedded GPUs for GPGPU applications. Basically, we would like to find a way to predict whether a particular application kernel is better suited to the embedded GPU than to, say, multi-core big.LITTLE CPUs, an FPGA, or a DSP. This is where we need a way to roughly predict the application's performance on each of these processing elements.
In any case, the one thing that both of you have mentioned and that I have not tried yet is to look at the OpenCL timeline in DS-5. It just so happens that we have been using the previous version of DS-5, which lacks that feature. Using the latest version, as you guys know, requires us to recompile the gator module. We will try this now and report whether it gives us any insight into our problem.
Thanks!
Alok
Can we get information of active threads or work groups per shader core for Mali-T6xx similar to active warps or blocks per SM in Nvidia using the occupancy tool?
In addition to McGeagh's comment, it's also worth remembering that mobile GPUs have far fewer cores than desktop GPUs so load balancing is less of an issue. If you design your compute kernel pipeline well (lots of threads per kernel, avoid serialization), then the workload will be statistically flat (i.e. approximately the same number of workgroups are run on each core). DS-5 Streamline does give you counters for the shader cores, but they are averaged over all of the cores implemented in your SoC.
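As a rough illustration of the "statistically flat" point, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate under the assumption of a perfectly even spread; the function name `workgroups_per_core` and the `num_cores` parameter are hypothetical, it is not a Mali or OpenCL API, and the real hardware scheduler gives no such guarantee.

```python
def workgroups_per_core(global_size, local_size, num_cores):
    """Estimate work-groups per shader core for a 1D NDRange.

    Assumption (not a hardware guarantee): work-groups are spread
    as evenly as possible across the cores.
    """
    if global_size <= 0 or local_size <= 0 or num_cores <= 0:
        raise ValueError("all sizes must be positive")
    # Total work-groups, rounded up in case of a partial last group.
    total_groups = -(-global_size // local_size)
    # Every core gets `base` groups; the first `extra` cores get one more.
    base, extra = divmod(total_groups, num_cores)
    return [base + 1 if i < extra else base for i in range(num_cores)]


if __name__ == "__main__":
    # Hypothetical launch: 1M work-items, 128 per work-group, 4 cores.
    print(workgroups_per_core(1024 * 1024, 128, 4))
```

With many work-groups per core the per-core counts differ by at most one, which is why the averaged counters are usually representative; with only a handful of work-groups the imbalance becomes relatively larger.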
HTH, Pete