I am trying to find the practical limit on the number of triangles per frame that the Mali-400 can render while sustaining 60 FPS on a 1024x600 display with a Wayland integration on a ZynqMP+.
With the program and hardware setup described below, I could reach around 32 000 triangles per frame before performance dips below 60 FPS. This number is lower than I expected considering the "0.11 Mtriangles/sec/MHz" reported in the ZynqMP+ datasheet (page 2). What steps could I take to render more triangles per frame?
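For reference, the datasheet figure can be turned into a per-frame budget. The sketch below assumes the quoted rate scales linearly with the GPU clock and can be sustained for a whole frame, so it is a theoretical peak rather than something a real scene should be expected to hit. The function names are made up for illustration.

```c
#include <stdio.h>

/* Theoretical triangles per frame from a datasheet rate given in
 * Mtriangles/sec/MHz. Assumption: the rate scales linearly with the clock. */
double peak_triangles_per_frame(double mtri_per_sec_per_mhz,
                                double clock_mhz, double fps)
{
    double tri_per_sec = mtri_per_sec_per_mhz * 1e6 * clock_mhz;
    return tri_per_sec / fps;
}

/* Prints the budget for the setup described in this post
 * (0.11 Mtriangles/sec/MHz, 600 MHz, 60 FPS). */
void print_budget(void)
{
    double peak = peak_triangles_per_frame(0.11, 600.0, 60.0);
    printf("theoretical peak: %.0f triangles/frame\n", peak);
    printf("measured 32000 is %.1f%% of that\n", 100.0 * 32000.0 / peak);
}
```

With these inputs the peak works out to about 1.1 million triangles per frame, so 32 000 is roughly 3% of it.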
To render as many triangles as possible, I reused the sample program "weston-simple-egl" from the Weston (Wayland compositor) project. I changed the rendering to draw a fullscreen window (1024x600) with a GL_TRIANGLE_STRIP spanning around 95% of the screen. I tested the program with 32 bits per pixel (bpp) and 16 bpp, but couldn't make any significant gain. The Mali GPU on the system is clocked at 600 MHz. The vertex and fragment shaders simply pass the vertices and fragments through unchanged.
The bottleneck seems to be the `eglSwapBuffers` call. It takes more and more time as the number of triangles rises. With 32 000 triangles, it can take up to 18 ms (!), which explains the FPS drop. Unfortunately, `eglSwapBuffers` is implemented by the closed-source library libmali, so I couldn't dig deeper. I assume the `eglSwapBuffers` call returns when an IRQ comes back from the GPU indicating that the queued jobs are done.
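In case it helps anyone reproduce the measurement, here is a minimal sketch of how a single call can be timed with a monotonic clock. `elapsed_ms` and `time_call_ms` are hypothetical helper names, and the callback stands in for a small wrapper around `eglSwapBuffers(display, surface)`; it is not necessarily how the test program above was instrumented.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Milliseconds elapsed between two CLOCK_MONOTONIC samples. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1e3 +
           (end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Times one call of fn(arg) and returns its duration in milliseconds.
 * In the real program, fn would be a wrapper that calls
 * eglSwapBuffers(display, surface); here it is a generic callback. */
double time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ms(t0, t1);
}
```

Keep in mind that this measures how long the call blocks on the CPU side, which is not necessarily the same as the GPU time spent on the frame.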
So, in summary, am I actually hitting a hardware limit at 32 000 triangles per frame under Wayland, or is there something I could do to improve performance?
Calling eglSwapBuffers() will block until the next window buffer is available, which "on average" is related to rendering performance, although there are some queuing effects here. Specifically, the window system won't release an old buffer until it has a new one to replace it, and we need the new buffer to start queuing commands for it.
The performance does sound low for a simple triangle grid test app; we'd expect ~10 cycles per vertex, not 300. Is the other performance you are seeing (e.g. fragment shading a simple quad with a blit texture) consistent with a 600 MHz GPU?
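For anyone following along, the "300" figure can be reproduced from the numbers in the question. This is a rough sketch that assumes the entire 60 FPS frame budget is spent on geometry and that a strip of N triangles has about N vertices (exactly N + 2 for a single strip); the function names are illustrative only.

```c
/* GPU clock cycles available per vertex at a given frame rate. */
double cycles_per_vertex(double clock_hz, double fps, double vertices)
{
    return clock_hz / fps / vertices;
}

/* Vertices that fit in one frame at a given per-vertex cost in cycles. */
double vertices_per_frame(double clock_hz, double fps, double cycles_each)
{
    return clock_hz / fps / cycles_each;
}
```

At 600 MHz and 60 FPS with 32 000 vertices, `cycles_per_vertex` gives 312.5 cycles. At a cost of 10 cycles per vertex, the same budget would cover about one million vertices per frame.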
Thank you for the quick answer, Peter.
The performance was consistent with what you're suggesting before I started using a windowing system (with Qt eglfs). I checked the registers driving the GPU clock, and they indicate that the GPU is indeed running at 600 MHz.
There might be a synchronization issue between Weston and the client app? From my understanding of the Wayland protocol and the EGL integration, the rendering should be double buffered, so we should be drawing into the back buffer, which isn't currently bound to the surface. I think the client doesn't have to wait for the compositor to give it a new buffer? I haven't dug too deep into the code, but Mesa seems to do it that way.
Is there something I could do to understand in a bit more detail how libmali communicates with the Wayland server?
> I think the client doesn't have to wait for the compositor to give it a new buffer?
Assuming you are not CPU-bound, then it will have to block and wait somewhere (you can't have the application running at 300 FPS but the hardware only at 30 FPS; some wait is needed to bring them in sync). It's usually either in eglSwapBuffers or in the first draw call to the window surface.
> Is there something I could do to understand in a bit more detail how libmali communicates with the Wayland server?
It's not my area of expertise - I'll see if I can find someone who can help - but this type of system integration is normally handled by the device manufacturer not us.
I did a bit more digging into this. I realized that for the same number of triangles, the percentage of the screen they cover has a huge impact on performance. If the triangles don't span more than about half the screen, the FPS is steady at 60 for 32K triangles, but if I render them over the whole screen (bigger triangles), the FPS drops.
By recompiling the Mali ARM kernel drivers with -DDEBUG, I could get information about what is going on with the job queuing.
In the first scenario, where the triangles span less than half of the screen, the GP (geometry processor?) job takes 1.2 ms from queuing to completion. 58 us after that, the 2 PP (pixel processor?) jobs are queued (1 per PP core, I guess) and take 2.8 ms to complete. Roughly, it seems to take the GPU 4 ms to render what I'm asking of it. This is with logging enabled in the kernel, so in reality it's going a bit faster.
In the second scenario, where the triangles cover almost the whole screen, the GP job takes 4.8 ms from queuing to completion; 60 us after that, the PP jobs are queued and take 9 ms to complete. We can estimate the time it takes the GPU to render the scene at about 13.8 ms. To get a smooth 60 FPS, we need to finish rendering the scene within 16 ms, so that leaves barely any time for the compositor to do its work, hence the missed frames.
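To make the comparison between the two scenarios explicit, here is a small sketch that adds up the logged timings and compares them with the 60 FPS frame budget. It assumes the GP job, the queuing gap and the PP jobs run strictly one after another within a frame, which is how I read the logs; the struct and function names are made up for this example.

```c
/* Per-frame GPU timings taken from the kernel driver debug logs, in ms. */
struct gpu_frame {
    double gp_ms;   /* GP job, queuing to completion */
    double gap_ms;  /* delay between GP completion and PP queuing */
    double pp_ms;   /* PP jobs, queuing to completion */
};

/* Total GPU time for one frame, assuming the stages run back to back. */
double gpu_total_ms(struct gpu_frame f)
{
    return f.gp_ms + f.gap_ms + f.pp_ms;
}

/* Time left in the frame for everything else (compositor included). */
double headroom_ms(struct gpu_frame f, double fps)
{
    return 1000.0 / fps - gpu_total_ms(f);
}
```

With `{1.2, 0.058, 2.8}` the total is about 4.06 ms, leaving about 12.6 ms of headroom at 60 FPS. With `{4.8, 0.060, 9.0}` the total is about 13.86 ms, leaving only about 2.8 ms.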
I am quite surprised that, for the same number of vertices, the time to render a basic scene can vary so much. Is there anything that you know of that could help speed up the rendering here?
Fragment shaders will run for every pixel that each triangle intersects, even if the triangles are opaque, unless the fragments are killed by early-ZS testing. You should get a linear increase in cost due to extra fragments and blended layers as triangle size increases.
For opaque geometry it's important to render from front-to-back with ZS testing enabled (on later Mali architectures this would get killed by hidden surface removal, but Mali-400 doesn't have that).
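As an illustration of the front-to-back advice, the sketch below sorts a list of opaque objects by depth on the CPU before they are drawn. The `draw_item` struct, its fields and the function names are hypothetical; the actual draw loop depends on the application. It also assumes the EGL config was chosen with a depth buffer (a non-zero `EGL_DEPTH_SIZE`) and that depth testing is enabled with `glEnable(GL_DEPTH_TEST)` and `glDepthFunc(GL_LESS)`, with the depth buffer cleared each frame.

```c
#include <stdlib.h>

/* Hypothetical per-object record: view-space distance of the object from
 * the camera (larger = farther away) and an application-defined mesh id. */
struct draw_item {
    float depth;
    int   mesh_id;
};

/* qsort comparator: nearer objects sort before farther ones. */
static int nearer_first(const void *a, const void *b)
{
    float da = ((const struct draw_item *)a)->depth;
    float db = ((const struct draw_item *)b)->depth;
    return (da > db) - (da < db);
}

/* Sorts opaque objects front-to-back so that, with depth testing enabled,
 * fragments of farther objects can be rejected by the early-ZS test
 * instead of being shaded and then overwritten. */
void sort_front_to_back(struct draw_item *items, size_t count)
{
    qsort(items, count, sizeof *items, nearer_first);
}
```

The application would call `sort_front_to_back` once per frame on its opaque objects and then issue the draw calls in array order. Blended geometry still has to be drawn back-to-front after the opaque pass.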