Draw call performance on the Mali-T880 MP12

I've been profiling a 3D scene on the Samsung Galaxy S7 and I've noticed that the CPU time spent in glDrawElements and glDrawArrays is much higher than on Adreno and PowerVR GPUs.

For some context, in an effort to improve performance on Mali devices, I moved all the OpenGL calls to a separate render thread. After that change, the render thread is now bottlenecking the entire application at a frame time of roughly 50-60ms in a scene with 335 draw calls (measured after letting the device sit for 5 minutes so that it thermal throttles).

While I would normally put this down to being GPU-bound, I ran a DS-5 capture on the device and noticed that the GPU's vertex and fragment work takes a lot less time than this (around 30ms when the device throttles).

Is there any explanation for why the GL calls take so long while the GPU isn't at 100% utilization? It looks like every GL call is more expensive on Mali, for some reason.

Here's an attached picture of our DS-5 capture, with the render thread isolated in the CPU Activity view.

In addition, the Unreal Engine mobile optimization guidelines recommend keeping scenes to <= 700 draw calls. While I'm not using the Unreal Engine, is this nevertheless a realistic target for this GPU?

Daniele Di Donato over 7 years ago

Hi cedega,

I agree with you that the application is CPU bound at the moment. The GPU execution (vertex and fragment) is clearly serialized, which happens because the CPU doesn't provide enough work in time.

We usually suggest <= 500 draw calls, depending on how many vertices you are currently drawing. This also depends on the device's CPU configuration. What is the frequency of the core that the thread runs on? It's not visible in the screenshot you attached.

Each draw call has a fixed cost that is independent of the number of vertices drawn. This means that draw calls with a lot of vertices amortize this fixed cost better.

If you can, we suggest that you:
- Batch as many draw calls as possible (draw objects that share the same GL state with a single draw call).
- If you are building against OpenGL ES 3.0, use instancing to draw groups of the same object.
- If it's a VR app, use the Multiview extension to almost halve the CPU draw call cost.
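
As a rough illustration of the batching suggestion, the sketch below groups queued draws that share the same GL state so that each group could be submitted as one draw call. The `Draw` record, its fields, and `batch_by_state` are hypothetical names made up for this example, not part of any engine or driver API. It also assumes draw order does not matter, which is generally only true for opaque geometry.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Draw:
    """One queued draw (hypothetical record, for illustration only)."""
    shader: int       # program object id
    texture: int      # bound texture id
    index_count: int  # number of indices this draw submits


def batch_by_state(draws):
    """Group draws sharing (shader, texture) and sum their index counts.

    Returns a dict mapping state key -> total index count, i.e. one
    entry per draw call that would remain after batching.
    """
    batches = defaultdict(int)
    for d in draws:
        batches[(d.shader, d.texture)] += d.index_count
    return dict(batches)


draws = [
    Draw(shader=1, texture=7, index_count=36),
    Draw(shader=1, texture=7, index_count=36),
    Draw(shader=2, texture=7, index_count=600),
    Draw(shader=1, texture=7, index_count=36),
]
batches = batch_by_state(draws)
print(len(draws), "draws ->", len(batches), "batches")
```

A real renderer would additionally have to merge the vertex and index data of each group into shared buffers (or use instancing for repeated objects) before the group can actually be issued as a single call.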

I understand that all the draw calls are issued from one thread, but is there anything else running on that thread (culling algorithms, game logic, etc.)?

Regards,

DDD

Peter Harris over 7 years ago in reply to Daniele Di Donato

Hitting 60ms a frame with only 335 draw calls sounds much slower than we would expect; we see plenty of applications with 500+ draws hitting 60FPS, so you're at about 1/5th of what we would consider "normal".
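
To make that comparison concrete, here is the back-of-the-envelope arithmetic using only the figures quoted in this thread (335 draws at about 60ms, versus 500 draws at 60FPS). It attributes the whole frame time to draw calls, which is a simplification, and the helper name is made up for this example.

```python
def frame_time_per_draw_us(frame_time_ms, draw_calls):
    """Average frame time per draw call, in microseconds."""
    return frame_time_ms * 1000.0 / draw_calls


observed = frame_time_per_draw_us(60.0, 335)          # this application
typical = frame_time_per_draw_us(1000.0 / 60.0, 500)  # 500 draws at 60FPS

print(f"observed: {observed:.0f} us/draw")
print(f"typical:  {typical:.0f} us/draw")
print(f"ratio:    {observed / typical:.1f}x")
```

That works out to roughly 179 microseconds per draw against roughly 33, a factor of a little over 5.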

Can you check what your CPU frequency is, and which CPU type you are running on? If I had to hazard a guess it sounds like your application is locked to a "LITTLE" CPU running at a relatively low frequency, rather than migrating across to the big CPU which has higher software performance.
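
For reference, on Linux-based devices the current frequency of each core is usually exposed through the cpufreq sysfs files. The sketch below reads them. The helper name `read_cpu_freqs_khz` and its `root` parameter are made up for this example, and on a stock Android device the files may only be readable through `adb shell`.

```python
import glob
import os
import re


def read_cpu_freqs_khz(root="/sys/devices/system/cpu"):
    """Return {cpu_index: current_frequency_in_kHz} for cores exposing cpufreq."""
    freqs = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq", "scaling_cur_freq")
    for path in glob.glob(pattern):
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        match = re.fullmatch(r"cpu(\d+)", cpu_dir)
        if not match:
            continue
        try:
            with open(path) as f:
                freqs[int(match.group(1))] = int(f.read().strip())
        except (OSError, ValueError):
            # Offline cores or restricted permissions: skip this core.
            continue
    return freqs


if __name__ == "__main__":
    for cpu, khz in sorted(read_cpu_freqs_khz().items()):
        print(f"cpu{cpu}: {khz / 1e6:.2f} GHz")
```

Sampling this while the application runs, together with the core the render thread is scheduled on, shows whether the thread sits on a big or a LITTLE core and how far that core has throttled.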

cedega over 7 years ago in reply to Daniele Di Donato

Hey Daniele and Peter,

The render thread is scheduled to a big core, which starts off at 2.26GHz and throttles to 1.25GHz within 5 minutes. The 4 little cores on the device are a consistent 1.59GHz.

The render thread only executes OpenGL commands -- all scene / game related work is done on the main thread (which is usually waiting for the render thread). The render thread performs the OpenGL commands for the previous frame so it executes in parallel with the main thread.

The device's performance is good until it gets thermal throttled. However, it looks like the bulk of the CPU work is done by the render thread, which I suspect is causing the aggressive thermal throttling on the device.

I've attached Android systraces of my application running on the S7 and on a Redmi Note 4 (which has a Mali-T880 MP4) while not thermal throttled. HH_Render and HH_Main are the render thread and main thread respectively.

systrace.zip

Daniele Di Donato over 7 years ago in reply to cedega

Hi cedega

Can you provide a systrace taken while the device is throttled? I had a look at the traces you sent and I noticed a bit of serialization between the main thread and the render thread. Specifically, I see the render thread waiting for the main thread before it starts. Are the game and render threads sharing something that needs synchronization? When the device throttles, I would expect the main thread to also take longer to complete, and the sync point would then make the overall performance suffer.

If possible, it would be good to have a test APK to understand whether there is a specific reason why you are seeing this high CPU usage.

-DDD

cedega over 7 years ago in reply to Daniele Di Donato

The main thread serializes a CPU-side command buffer and sends it to the render thread via semaphore. This process is double-buffered, so the render thread will be rendering 1 frame behind in the case of being GPU-bound (and hence able to execute in parallel).
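
For readers unfamiliar with this pattern, a minimal sketch of a double-buffered hand-off using two counting semaphores follows. It is not the actual implementation from the application: `run_frames` and all names inside it are hypothetical, and the "command buffer" is just a string standing in for serialized GL commands.

```python
import threading

NUM_BUFFERS = 2  # double-buffered, as described above


def run_frames(frame_count):
    """Main thread records command buffers; render thread executes them."""
    buffers = [None] * NUM_BUFFERS
    free = threading.Semaphore(NUM_BUFFERS)  # buffers the main thread may fill
    ready = threading.Semaphore(0)           # buffers waiting to be executed
    executed = []

    def render_thread():
        for frame in range(frame_count):
            ready.acquire()                  # sleep until a frame is submitted
            commands = buffers[frame % NUM_BUFFERS]
            executed.append(commands)        # stand-in for issuing GL calls
            free.release()                   # hand the buffer back

    worker = threading.Thread(target=render_thread)
    worker.start()
    for frame in range(frame_count):
        free.acquire()                       # blocks if both buffers are in use
        buffers[frame % NUM_BUFFERS] = f"commands for frame {frame}"
        ready.release()                      # signal the render thread
    worker.join()
    return executed


frames = run_frames(5)
print(frames)
```

With two buffers the main thread can record frame N+1 while the render thread executes frame N, but it blocks before recording frame N+2 until frame N has been consumed.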

I don't think the sleeping from the semaphore is shown in the systrace, which may make it look like there is resource contention.

The render thread can also be occasionally interrupted via a blocking request from the main thread, but these events shouldn't be frequent.

I've attached the throttled systrace for the S7. I can send an APK as well, but it would have to be done privately.

S7_throttled.zip

Daniele Di Donato over 7 years ago in reply to cedega

Hi cedega,

The systrace you sent looks more like what I would have expected. Since the render thread is the bottleneck, it no longer waits on the main thread before it starts executing again, as it did in your previous systrace. I see the render thread takes 10ms more when throttled, as you mentioned, and the whole execution is around 40ms.

If you can send the APK to me, I will try to have a look at it.