Hi guys:
I'm developing an OpenCL application on the MTK P60 (Mali-G72 MP3), but I have run into some problems.
The application runs successfully on a Snapdragon 660 (Adreno 512 GPU), taking about 10 ms. But when I run it on the Mali-G72 MP3, it costs 60 ms! When I check the GPU utilization, it is almost 100 percent.
Firstly, I couldn't find any specification of the Mali-G72's FLOPS performance. (The Adreno 512's FLOPS performance is 255 GFLOPS.)
Secondly, according to benchmarks, the performance of the G72 MP3 should be close to the Adreno 512. I can't figure out why it performs so badly on the G72 MP3.
Any discussion is welcome. :)
Not all FLOPS are equal - there is no point counting adders if you need multipliers, etc. - so FLOPS numbers are generally not that useful.
Assuming a GPU clocked at 650 MHz, you get 12 fp32 FMAs per core per clock. If you count an FMA as 2 FLOPS, then you get 12 * 2 * 3 * 650M ≈ 46.8 GFLOPS of fp32 FMAs. If you write well-vectorized fp16 then you get double that.
If you can share your CL kernel we can probably provide more targeted advice.
Cheers, Pete
My kernel code can't be posted publicly, but I can share it with you by private message. (I don't know how to do that here.)
Generally, when testing FLOPS, the GPU should be running at full capacity in float or half precision, which is what I wonder about. I suspect that the G72 is not fully loaded during the 60 ms run. Is there any way I can confirm this?
DS-5 Streamline should let you capture performance counters for the GPU.
I profiled my project and the timeline graph is above. Each cycle is about 120 ms. Something seems weird.
(1) I enqueue kernels continually to the command queue, but there seems to be a tiny idle time, and the GPU restarts at each new pass.
(2) When it enters a new pass, the 'Mali Core Cycles' counter falls (I don't know what it means) instead of keeping a high value.
I don't know if this is enough to get some useful information.
The low spot does indeed look strange - the shader core definitely isn't fully loaded, even though the GPU cycles counter is high (so something is queued on the GPU).
Normally this occurs because there is a high volume of very small kernels which are not able to parallelize and fully load the GPU because they are so small with a low thread count. Without knowing exactly what you are trying to do it's going to be hard to provide more specific advice.
That's right. I enqueue more than a hundred kernels to the queue as one pass and repeat it. About 80% of them are very small kernels (like the ReLU and sum operations in a CNN), and several convolution kernels cost 80% of the time.
Peter Harris said: "very small kernels which are not able to parallelize and fully load the GPU because they are so small with a low thread count"
I don't quite understand what those words mean. When kernels are small but the GPU cycles counter is high, will that affect the GPU load? I have tuned their work-group sizes, and each small kernel can dispatch hundreds of threads. How can the GPU cores not be fully loaded?
Well, I checked my work-group sizes again. They are indeed small; most of them are like [2,1,1] or [1,4,1].
These work-group sizes were chosen by auto-tuning. (I profile the command queue, search all the possible work-group combinations, and record the duration of each combination, then pick the minimum cost.) But I discovered that when queue profiling is enabled, the total time doubles, which means the timing statistics are not accurate in this condition. So maybe this method of searching for a good work-group size is flawed.
I will fix it manually then.
Small kernels with serial data dependencies are a bad fit for GPUs. A Mali-G72 MP3 can run 1152 threads concurrently, so you need kernels with tens of thousands of threads to fully load the hardware. If you only have "hundreds" then you spend most of your time ramping up new workloads and then ramping down again ...
I have tuned my work groups, and that does not work. I think I need some other approach.
Above are the work-group sizes of the poorly performing kernels on the Adreno and the Mali-G72. Even on the Adreno, the work-group sizes are not big.
Peter Harris said: "If you only have 'hundreds' then you spend most of your time ramping up new workloads and then ramping down again ..."
What exactly happens in the GPU when "ramping up workloads"? Is it flushing the cache, or enqueuing the kernels from the CPU to the GPU?
Is there anything I can do to make full use of the GPU with these small kernels?
Small work groups are not the problem - the problem is the small overall task size. Each kernel is only able to spawn 32 threads, and Mali GPUs have a finite number of concurrent compute dispatches which can be running simultaneously (exactly how many varies, but 8 is a good rule of thumb). 32 * 8 = 256 threads, which is only about 20% of the total thread capacity of the GPU you are using.
The only advice I can give for Mali is to design your algorithm to have fewer small kernels, or interleave them with much larger ones. You're aiming to have > 1000 threads running.
Hi Harris:
I have done some tests and found where the bottleneck is.
My original kernel code was written for the Adreno. I use many vector local variables in the code. The Adreno has 2 compute units and supports a maximum work-group size of 1024 per CU, but the G72 only supports a work-group size of 384 per CU. I suspect that the Mali GPU has much fewer hardware resources per CU than the Adreno, which results in few work-items running concurrently.
So I tuned the kernel code, mainly reducing the number of vector variables and moving them into loops. The performance improved! Time consumption decreased from 66 ms to 36 ms.
But I have another problem. I run the program from a command-line window. When I run only the command-line program, the time is 66 ms. But when I run the command-line program while the system camera is open, the time becomes 36 ms. Why would the system camera enhance performance? It seems that the camera "warms up" the GPU device.
It is possibly related to DVFS (dynamic voltage and frequency scaling). When idle, the device will run in a low-power state; it may take some time for the CPU, GPU, and memory system to select and stabilize at a frequency when new workloads start running and increase demand. With more things running - such as the camera - it may select a higher frequency for the heavily loaded components more rapidly.
Can I change the DVFS mode directly? I remember that I can set DVFS to "powersave" or "performance" to control the Adreno GPU. Is there a similar way to control a Mali GPU?
The DVFS implementation isn't provided by Arm - it's implemented by the chipset manufacturer, so you'd have to check with them, sorry.
Kind regards, Pete