Hi,
I am trying to find the practical limit on the number of triangles per frame that the Mali-400 can render while sustaining 60 FPS on a 1024x600 display, with a Wayland integration on a ZynqMP+.
With the program and hardware setup described below, I could reach around 32 000 triangles per frame before performance dips below 60 FPS. This number is lower than I expected considering the "0.11 Mtriangles/sec/MHz" reported in the ZynqMP+ datasheet (page 2). What steps could I take to render more triangles per frame?
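To put the datasheet figure next to the measurement, here is a rough sketch in C. It assumes the "0.11 Mtriangles/sec/MHz" number scales linearly with the 600 MHz clock and is a best-case peak; the function names are made up for illustration.

```c
#include <assert.h>

/* Peak triangles per frame implied by a "Mtriangles/sec/MHz" datasheet
 * figure, assuming it scales linearly with the GPU clock. */
static double peak_triangles_per_frame(double mtri_per_sec_per_mhz,
                                       double clock_mhz, double fps)
{
    assert(fps > 0.0);
    return mtri_per_sec_per_mhz * 1e6 * clock_mhz / fps;
}

/* Fraction of that peak reached by a measured triangle count. */
static double fraction_of_peak(double measured_tri_per_frame, double peak)
{
    assert(peak > 0.0);
    return measured_tri_per_frame / peak;
}
```

With `peak_triangles_per_frame(0.11, 600.0, 60.0)` this gives about 1.1 million triangles per frame, so 32 000 triangles per frame is roughly 3% of the quoted peak.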
To render as many triangles as possible, I reused the sample program "weston-simple-egl" from the Weston (Wayland compositor) project. I changed the rendering to draw a fullscreen window (1024x600) with a GL_TRIANGLE_STRIP spanning around 95% of the screen. I tested the program with 32 bits per pixel (bpp) and 16 bpp, but saw no significant gain. The Mali GPU on the system is clocked at 600 MHz. The vertex and fragment shaders simply pass the vertices and the fragments through unchanged.
The bottleneck seems to be the `eglSwapBuffers` call. It takes more and more time as the number of triangles rises. With 32 000 triangles, it can take up to 18 ms (!), which explains the FPS drop. Unfortunately, `eglSwapBuffers` is implemented by the closed-source library libmali, so I couldn't dig deeper. I assume the `eglSwapBuffers` call returns when an IRQ comes back from the GPU indicating that the queued jobs are done.
So, in summary, am I effectively hitting a hardware limit at 32 000 triangles per frame under Wayland, or is there something I could do to improve performance?
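For anyone wanting to reproduce the timing, this is a minimal sketch of how a blocking call can be measured with `clock_gettime`. The callback `swap_placeholder` is a hypothetical stand-in; in a real program it would wrap `eglSwapBuffers(display, surface)`. This is not necessarily how the 18 ms figure was obtained.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/* Milliseconds elapsed between two CLOCK_MONOTONIC timestamps. */
static double timespec_diff_ms(const struct timespec *start,
                               const struct timespec *end)
{
    return (end->tv_sec - start->tv_sec) * 1e3 +
           (end->tv_nsec - start->tv_nsec) / 1e6;
}

/* Runs fn(arg) once and returns how long the call blocked, in ms. */
static double time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;
    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return timespec_diff_ms(&t0, &t1);
}

/* Hypothetical placeholder: a real version would call
 * eglSwapBuffers() on the display and surface passed through ctx. */
static void swap_placeholder(void *ctx)
{
    (void)ctx;
}
```

Calling `time_call_ms(swap_placeholder, NULL)` once per frame and logging the result shows whether the time spent in the swap grows with the triangle count.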
> I think the client doesn't have to wait for the compositor to give it a new buffer?
Assuming you are not CPU-bound, then it will have to block and wait somewhere (you can't have the application running at 300FPS, but the hardware only at 30FPS; some wait is needed to bring them in sync). It's usually either in eglSwapBuffers or in the first draw call to the window surface.
> Is there something I could do to understand a bit in more detail how is libmali communicating with the Wayland Server?
It's not my area of expertise - I'll see if I can find someone who can help - but this type of system integration is normally handled by the device manufacturer, not by us.
Hi Peter,
I did a bit more digging into this. I realized that for the same amount of triangles, the percentage of the screen they cover has a huge impact on performance. If the triangles don't span more than about half the screen, the FPS is steady at 60 for 32K triangles, but if I render them over the whole screen (bigger triangles), FPS drops.
By recompiling the ARM Mali kernel drivers with -DDEBUG, I could get information about what is going on with the job queuing.
In the first scenario, where the triangles span less than half of the screen, the GP (geometry processor?) job takes 1.2 ms from queuing to completion. 58 us after that, the 2 PP (pixel processor?) jobs are queued (1 per PP core, I guess) and take 2.8 ms to complete. Roughly, it seems to take the GPU 4 ms to render what I'm asking of it. This is with logging enabled in the kernel, so in reality it's going a bit faster.
In the second scenario, where the triangles cover almost the whole screen, the GP job takes 4.8 ms from queuing to completion; 60 us after that, the PP jobs are queued and take 9 ms to complete. That puts the time the GPU needs to render the scene at about 13.8 ms. To get a smooth 60 FPS, rendering has to finish within about 16.7 ms, so that leaves barely any time for the compositor to do its work, hence the missed frames.
I am quite surprised that, for the same number of vertices, the time to render a basic scene can vary so much. Is there anything you know of that could help speed up rendering here?
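The budget arithmetic for the two scenarios can be written out as a small C sketch. The struct and function names are made up for illustration; the numbers to feed in are the ones read from the kernel log above.

```c
#include <assert.h>

/* One frame's GPU timings as read from the kernel driver debug log. */
struct gpu_frame_timing {
    double gp_ms;   /* GP job, queuing to completion */
    double gap_ms;  /* delay before the PP jobs are queued */
    double pp_ms;   /* PP jobs, queuing to completion */
};

/* Total GPU time for the frame: GP, then the gap, then PP. */
static double gpu_total_ms(const struct gpu_frame_timing *t)
{
    return t->gp_ms + t->gap_ms + t->pp_ms;
}

/* Time left in the frame for everything else (compositor, CPU work). */
static double headroom_ms(const struct gpu_frame_timing *t, double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps - gpu_total_ms(t);
}
```

With `{1.2, 0.058, 2.8}` the total is about 4.06 ms and the headroom at 60 FPS is about 12.6 ms. With `{4.8, 0.060, 9.0}` the total is 13.86 ms, which leaves only about 2.8 ms.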
Thanks!
Fragment shaders will run for every pixel that each triangle covers, even where opaque triangles overlap, unless the fragments are killed by early-ZS testing. You should see a linear increase in cost from the extra fragments and blended layers as triangle size increases.
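As a back-of-the-envelope illustration of that linear cost, the fragment count can be estimated from screen coverage and the number of overlapping layers. This is only a sketch: the function names are made up, and `layers` is an assumed average overdraw factor (1.0 means no overlap).

```c
#include <assert.h>

/* Approximate fragments shaded per frame: covered pixels times overdraw. */
static double fragments_per_frame(int width, int height,
                                  double coverage, double layers)
{
    assert(coverage >= 0.0 && coverage <= 1.0 && layers >= 0.0);
    return (double)width * height * coverage * layers;
}

/* Average number of fragments each triangle contributes. */
static double fragments_per_triangle(double fragments, int triangles)
{
    assert(triangles > 0);
    return fragments / triangles;
}
```

For a 1024x600 surface, 95% coverage with a single layer is 583 680 fragments per frame against 307 200 at 50% coverage; spread over 32 000 triangles, the 95% case averages about 18 fragments per triangle.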
For opaque geometry it's important to render from front-to-back with ZS testing enabled (on later Mali architectures this would get killed by hidden surface removal, but Mali-400 doesn't have that).
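A minimal sketch of the front-to-back ordering in C, assuming each opaque object carries a view-space depth. The `draw_item` struct and its fields are hypothetical; adapt them to however your scene is stored.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-object record used to order opaque draw calls. */
struct draw_item {
    float depth;  /* distance from the camera, larger means farther */
    int   id;     /* identifies which object to draw */
};

/* qsort comparator: nearest objects first. */
static int cmp_front_to_back(const void *a, const void *b)
{
    const struct draw_item *x = a;
    const struct draw_item *y = b;
    return (x->depth > y->depth) - (x->depth < y->depth);
}

/* Sorts the opaque draw list so the nearest geometry is submitted first. */
static void sort_front_to_back(struct draw_item *items, size_t count)
{
    assert(items != NULL || count == 0);
    qsort(items, count, sizeof *items, cmp_front_to_back);
}
```

Issue the draw calls in the sorted order with `glEnable(GL_DEPTH_TEST)` and `glDepthFunc(GL_LESS)` set. This assumes the EGL config was chosen with a depth buffer (a non-zero `EGL_DEPTH_SIZE`); without one the depth test has nothing to test against.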
HTH, Pete