I've been profiling a 3D scene on the Samsung Galaxy S7 and I've noticed that the CPU time spent in glDrawElements and glDrawArrays is much larger than on devices with Adreno or PowerVR GPUs.
For context: in an effort to improve performance on Mali devices, I moved all the OpenGL calls to a separate render thread. After that change, the render thread is now the bottleneck for the entire application, at a frame time of ~50-60ms in a scene with 335 draw calls (measured after letting the device sit for 5 minutes so that it thermally throttles).
Normally I would put this down to being GPU-bound, but I ran a DS-5 capture on the device and saw that the GPU's vertex and fragment work takes much less time than this (~30ms when the device throttles).
Is there any explanation for why the GL calls take so long while the GPU isn't at 100% utilization? It looks like every GL call is more expensive on Mali, for some reason.
Here's an attached picture of our DS-5 capture, with the render thread isolated in CPU Activity.
In addition, the Unreal Engine mobile optimization guidelines recommend keeping scenes to <= 700 draw calls. While I'm not using the Unreal Engine, is this nevertheless a realistic target for this GPU?
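To put the figures quoted above in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, pessimistically, that the entire 50-60ms render-thread frame time is spent issuing the 335 draw calls, and that the cost scales linearly with draw count; neither assumption is confirmed by the capture, and the function names are made up for illustration.

```python
# Rough estimate only. Assumes ALL render-thread time goes to draw calls
# and that cost scales linearly with the number of draws (both unverified).

def per_draw_cost_us(frame_ms, draw_calls):
    """Average CPU cost per draw call in microseconds."""
    return frame_ms * 1000.0 / draw_calls


def projected_frame_ms(cost_us, draw_calls):
    """Frame time in milliseconds for a given per-draw cost and draw count."""
    return cost_us * draw_calls / 1000.0


MEASURED_DRAW_CALLS = 335   # from the throttled scene described above
TARGET_DRAW_CALLS = 700     # the draw call ceiling quoted from the guidelines

for frame_ms in (50.0, 60.0):
    cost = per_draw_cost_us(frame_ms, MEASURED_DRAW_CALLS)
    projected = projected_frame_ms(cost, TARGET_DRAW_CALLS)
    print("%.0f ms frame: %.1f us per draw, %.1f ms at %d draws"
          % (frame_ms, cost, projected, TARGET_DRAW_CALLS))
```

Under those assumptions the per-draw cost comes out at roughly 149 to 179 microseconds, and 700 draws at the same cost would take roughly 104 to 125ms on the render thread. This is an upper bound, since some of the frame time is certainly spent outside draw calls.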
Hi cedega,
Can you provide a systrace captured while the device is throttled? I had a look at the traces you sent and noticed some serialization between the main thread and the render thread. Specifically, the render thread waits for the main thread before it starts. Are the game and render threads sharing something that needs synchronization? When the device throttles, I would expect the main thread to take longer to complete as well, and the sync point to make overall performance suffer.
If possible, it would be good to have a test APK so that we can understand whether there is a specific reason why you are seeing this high CPU usage.
-DDD
The main thread serializes a CPU-side command buffer and hands it to the render thread, signalling via a semaphore. This process is double-buffered, so when the application is GPU-bound the render thread runs 1 frame behind the main thread, and the two can execute in parallel.
I don't think the time spent sleeping on the semaphore is shown in the systrace, which may make it look like there is resource contention.
The render thread can also occasionally be interrupted by a blocking request from the main thread, but these events shouldn't be frequent.
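For readers unfamiliar with this pattern, the double-buffered handoff described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual engine code: the class and function names are hypothetical, the "commands" are placeholder strings rather than GL calls, and the occasional blocking request from the main thread is left out.

```python
import threading

NUM_BUFFERS = 2  # double-buffered: main may record one frame while render executes another


class FrameHandoff:
    """Hypothetical handoff object shared by the main and render threads."""

    def __init__(self):
        self.buffers = [None] * NUM_BUFFERS
        self.free = threading.Semaphore(NUM_BUFFERS)  # slots the main thread may fill
        self.ready = threading.Semaphore(0)           # slots the render thread may consume


def main_thread(handoff, frame_count):
    for frame in range(frame_count):
        handoff.free.acquire()    # sleeps here if the render thread has fallen behind
        handoff.buffers[frame % NUM_BUFFERS] = ["draw %d" % frame]  # stand-in for serialized commands
        handoff.ready.release()   # wake the render thread


def render_thread(handoff, frame_count, executed):
    for frame in range(frame_count):
        handoff.ready.acquire()   # sleeps here until a command buffer is available
        commands = handoff.buffers[frame % NUM_BUFFERS]
        executed.extend(commands)  # stand-in for issuing the GL calls
        handoff.free.release()    # hand the slot back to the main thread


def run(frame_count):
    """Run both loops and return the commands in the order they were executed."""
    handoff = FrameHandoff()
    executed = []
    renderer = threading.Thread(target=render_thread,
                                args=(handoff, frame_count, executed))
    renderer.start()
    main_thread(handoff, frame_count)
    renderer.join()
    return executed
```

Calling run(5) executes five frames in submission order. With two slots, the main thread can finish recording the next frame while the render thread is still executing the current one, and only blocks on the free semaphore once both slots are in use.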
I've attached the throttled systrace for the S7. I can send an APK as well, but it would have to be done privately.
S7_throttled.zip
Hi cedega,
The systrace you sent looks more like what I would have expected. Since the render thread is the bottleneck, it doesn't wait on the main thread before it starts executing again, as it did in your previous systrace. I see that the render thread takes 10ms more when throttled, as you mentioned, and the whole execution is around 40ms.
If you can send the APK to me, I will try to have a look at it.