Hello guys, peterharris, I am using a Mali-T628 GPU on the Odroid XU3 board with the Exynos 5422 chip. I have a couple of questions regarding OpenCL on the Mali GPU:
1. Can we get information of active threads or work groups per shader core for Mali-T6xx similar to active warps or blocks per SM in Nvidia using the occupancy tool?
2. Can we get the assembly code (or intermediate representation) for an OpenCL code running on Mali-T6xx similar to PTX of Nvidia?
I understand you cannot tell me how each of the instructions in an OpenCL thread actually gets mapped to and executed on the functional units inside the Tripipe, but if we could have some assembly code, that might be useful for predicting the performance of OpenCL threads on this GPU. Thanks!
Can we get information of active threads or work groups per shader core for Mali-T6xx similar to active warps or blocks per SM in Nvidia using the occupancy tool?
In addition to McGeagh's comment, it's also worth remembering that mobile GPUs have far fewer cores than desktop GPUs, so load balancing is less of an issue. If you design your compute kernel pipeline well (lots of threads per kernel, no serialization), then the workload will be statistically flat (i.e. approximately the same number of work groups run on each core). DS-5 Streamline does give you counters for the shader cores, but they are averaged over all of the cores implemented in your SoC.
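As a rough illustration of the "statistically flat" point, here is a minimal Python sketch that hands each work group to whichever core becomes free first and then counts the work groups per core. It is a toy model, not a description of how the Mali job manager actually schedules work: the function names, the core count of 4, and the random per-work-group cost are all assumptions made up for the example.

```python
import heapq
import random


def simulate_dispatch(num_workgroups, num_cores, seed=0):
    """Toy model: give each work group to the core that becomes free first.

    Returns the number of work groups each core ended up running.
    The per-work-group cost is a made-up random value, not a measurement.
    """
    rng = random.Random(seed)
    # Min-heap of (time at which the core becomes free, core id).
    free_at = [(0.0, core) for core in range(num_cores)]
    heapq.heapify(free_at)
    counts = [0] * num_cores
    for _ in range(num_workgroups):
        now, core = heapq.heappop(free_at)
        counts[core] += 1
        cost = rng.uniform(0.5, 1.5)  # hypothetical runtime of one work group
        heapq.heappush(free_at, (now + cost, core))
    return counts


def imbalance(counts):
    """Spread between busiest and idlest core, relative to the mean count."""
    mean = sum(counts) / len(counts)
    return (max(counts) - min(counts)) / mean


if __name__ == "__main__":
    cores = 4  # assumed core count, for illustration only
    for workgroups in (6, 64, 4096):
        counts = simulate_dispatch(workgroups, cores)
        print(workgroups, counts, round(imbalance(counts), 3))
```

In this model the relative spread between cores shrinks as the number of work groups grows, which is why a kernel with many threads tends to load the cores evenly while a kernel with only a handful of work groups does not.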
HTH, Pete
Thank you mcgeagh and peterharris for your comments. I think I should have given a bit more background on the problem we are trying to solve. We are not Android/game developers. This is a purely academic project, trying to understand the implications of using embedded GPUs for GPGPU applications. Basically, we would like to find a way to predict whether a particular application kernel is better suited to the embedded GPU than to, say, multi-core big.LITTLE CPUs, an FPGA, a DSP, etc. This is why we need a way to roughly predict the application's performance on each of these processing elements.
In any case, the one thing both of you have mentioned that I have not tried yet is to look at the OpenCL timeline in DS-5. It just so happens that we have been using the previous version of DS-5, which does not have that feature. Using the latest version, as you guys know, requires us to recompile the gator module. We will try this now and report whether it gives us any insights into our problem.
Thanks!
Alok