Hi,
I am trying to find the practical limit on the number of triangles per frame that the Mali-400 can render while sustaining 60 FPS on a 1024x600 display with a Wayland integration on a Zynq UltraScale+ MPSoC (ZynqMP).
With the program and hardware setup described below, I can reach around 32 000 triangles per frame before performance dips below 60 FPS. This number is lower than I expected considering the "0.11 Mtriangles/sec/MHz" reported in the ZynqMP datasheet (page 2). What steps could I take to render more triangles per frame?
To render as many triangles as possible, I reused the sample program "weston-simple-egl" from the Weston (Wayland compositor) project. I changed the rendering to draw a fullscreen window (1024x600) with a GL_TRIANGLE_STRIP spanning around 95% of the screen. I tested the program with 32 bits per pixel (bpp) and 16 bpp, but saw no significant gain. The Mali GPU on the system is clocked at 600 MHz. The vertex and fragment shaders are pass-through: they respectively pass the vertices and the fragments through as is.
The bottleneck seems to be the `eglSwapBuffers` call. It takes more and more time as the number of triangles rises. With 32 000 triangles, it can take up to 18 ms (!), which explains the FPS drop. Unfortunately, `eglSwapBuffers` is implemented by the closed-source library libmali, so I couldn't dig deeper. I assume the `eglSwapBuffers` call returns when an IRQ comes back from the GPU indicating that the queued jobs are done.
So, in summary, am I actually hitting a hardware limit at 32 000 triangles per frame under Wayland, or is there something I could do to improve performance?
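For reference, the datasheet figure quoted above can be turned into a per-frame ceiling with a small calculation. This is only arithmetic on the numbers from the post (0.11 Mtriangles/sec/MHz, 600 MHz, 60 FPS). The function name is made up for illustration, and the result is a theoretical peak, not a measurement.

```c
/* Theoretical peak triangles per frame, given:
 *   mtri_per_sec_per_mhz - datasheet rate in Mtriangles/sec/MHz
 *   clock_mhz            - GPU clock in MHz
 *   fps                  - target frame rate
 * Hypothetical helper, not part of any Mali or Xilinx API. */
double peak_triangles_per_frame(double mtri_per_sec_per_mhz,
                                double clock_mhz, double fps)
{
    /* Mtriangles/sec/MHz -> triangles/sec at the given clock. */
    double triangles_per_sec = mtri_per_sec_per_mhz * 1e6 * clock_mhz;

    /* Spread one second of triangle throughput over the frames. */
    return triangles_per_sec / fps;
}
```

Calling `peak_triangles_per_frame(0.11, 600.0, 60.0)` gives about 1.1 million triangles per frame, roughly 34 times the 32 000 observed.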
Hi gchamp,
Calling eglSwapBuffers() will block until the next window buffer is available, which "on average" is related to rendering performance, although there are some queuing effects here. Specifically, the window system won't release an old buffer until it has a new one to replace it, and we need the new buffer to start queuing commands for it.
The performance does sound low for a simple triangle grid test app; we'd expect ~10 cycles per vertex, not 300. Is the other performance you are seeing (e.g. fragment shading a simple quad with a blit texture) consistent with a 600 MHz GPU?
Cheers, Pete
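The "300 cycles" figure can be reproduced from the numbers in this thread. The sketch below assumes that a long triangle strip adds about one vertex per triangle, so 32 000 triangles is roughly 32 000 vertices. The function name is hypothetical.

```c
/* GPU clock cycles available per primitive when the GPU runs at
 * clock_hz and has to process primitives_per_frame primitives
 * fps times per second. Hypothetical helper for illustration. */
double cycles_per_primitive(double clock_hz, double fps,
                            double primitives_per_frame)
{
    double cycles_per_frame = clock_hz / fps;
    return cycles_per_frame / primitives_per_frame;
}
```

`cycles_per_primitive(600e6, 60.0, 32000.0)` comes out at 312.5 cycles per vertex. At the expected ~10 cycles per vertex, the same 10 million cycles per frame would cover about 1 000 000 vertices.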
Thank you for the quick answer, Peter.
The performance was consistent with what you're suggesting here before I started using a windowing system (with Qt eglfs). I checked the registers driving the GPU clock and they indicate that the GPU is actually driven at 600 MHz.
Might there be a synchronization issue between Weston and the client app? From my understanding of the Wayland protocol and the EGL integration, rendering should be double buffered, so we should be drawing into the back buffer, which isn't currently bound to the surface. I think the client doesn't have to wait for the compositor to give it a new buffer? I haven't dug too deep into the code, but Mesa seems to do it that way.
Is there something I could do to understand in a bit more detail how libmali communicates with the Wayland server?
> I think the client doesn't have to wait for the compositor to give it a new buffer?
Assuming you are not CPU-bound, then it will have to block and wait somewhere (you can't have the application running at 300 FPS but the hardware only at 30 FPS; some wait is needed to bring them in sync). It's usually either in eglSwapBuffers or in the first draw call to the window surface.
> Is there something I could do to understand in a bit more detail how libmali communicates with the Wayland server?
It's not my area of expertise - I'll see if I can find someone who can help - but this type of system integration is normally handled by the device manufacturer not us.
Hi Peter,
I did a bit more digging into this. I realized that for the same amount of triangles, the percentage of the screen they cover has a huge impact on performance. If the triangles don't span more than about half the screen, the FPS is steady at 60 for 32K triangles, but if I render them over the whole screen (bigger triangles), FPS drops.
By recompiling the Arm Mali kernel drivers with -DDEBUG, I could get information about what is going on with the job queuing.
In the first scenario, where the triangles span less than half of the screen, the GP (geometry processor?) job takes 1.2 ms from queuing to completion. 58 us after that, the 2 PP (pixel processor?) jobs are queued (1 per PP core, I guess) and take 2.8 ms to complete. Roughly, it seems to take the GPU 4 ms to render what I'm asking of it. This is with logging enabled in the kernel, so in reality it's a bit faster.
In the second scenario, where the triangles cover almost the whole screen, the GP job takes 4.8 ms from queuing to completion; 60 us after that, the PP jobs are queued and take 9 ms to complete. We can estimate the time it takes the GPU to render the scene at about 13.8 ms. To get a smooth 60 FPS, we need to finish rendering the scene within about 16.7 ms, so that leaves barely any time for the compositor to do its work, hence the missed frames.
I am quite surprised that, for the same number of vertices, the time to render a basic scene can vary so much. Is there anything you know of that could help speed up rendering here?
Thanks!
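The two scenarios above can be compared against the 60 FPS frame budget with a short calculation. It assumes that the GP job, the queuing gap and the PP jobs of one frame run strictly one after the other, as the kernel log timings suggest. The helper names are hypothetical.

```c
/* Time available per frame, in milliseconds, at a given frame rate. */
double frame_budget_ms(double fps)
{
    return 1000.0 / fps;
}

/* GPU time for one frame, assuming the GP job, the gap before the
 * PP jobs are queued, and the PP jobs do not overlap (assumption). */
double gpu_frame_time_ms(double gp_ms, double gap_ms, double pp_ms)
{
    return gp_ms + gap_ms + pp_ms;
}
```

With the measured values, `gpu_frame_time_ms(1.2, 0.058, 2.8)` is about 4.06 ms and `gpu_frame_time_ms(4.8, 0.060, 9.0)` is about 13.86 ms. Against `frame_budget_ms(60.0)`, about 16.67 ms, the second scenario leaves roughly 2.8 ms for everything else, including the compositor.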
Fragment shaders run for every pixel that each triangle covers, even where opaque triangles overlap, unless the fragments are killed by early-ZS testing. You should see a linear increase in cost from the extra fragments and blended layers as triangle size increases.
For opaque geometry it's important to render from front-to-back with ZS testing enabled (on later Mali architectures this would get killed by hidden surface removal, but Mali-400 doesn't have that).
HTH, Pete
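As an illustration of the front-to-back advice, here is a minimal sketch that orders opaque draw items by view-space depth before they are submitted. The `draw_item` struct and function names are hypothetical, and the actual draw loop is left out. In a real renderer the depth test would also need to be enabled, for example with `glEnable(GL_DEPTH_TEST)` and `glDepthFunc(GL_LESS)`.

```c
#include <stdlib.h>

/* Hypothetical record for one opaque draw call. */
struct draw_item {
    int   id;          /* application-defined handle */
    float view_depth;  /* distance from the camera; smaller is nearer */
};

/* qsort comparator: nearer items sort first. */
static int cmp_front_to_back(const void *a, const void *b)
{
    const struct draw_item *x = a;
    const struct draw_item *y = b;

    return (x->view_depth > y->view_depth) - (x->view_depth < y->view_depth);
}

/* Sort opaque items so the nearest geometry is drawn first, letting
 * early-ZS testing reject the hidden fragments of later draws. */
void sort_front_to_back(struct draw_item *items, size_t count)
{
    qsort(items, count, sizeof *items, cmp_front_to_back);
}
```

Blended (non-opaque) geometry is a separate case and is usually drawn afterwards, back to front.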
To give some closure on this topic: I upgraded to Linux 5.7 and switched to the open-source lima drivers. Performance seems slightly better, particularly since dynamic heap memory management was implemented. With Xilinx's binary blob, I was seeing "PLBU out of memory" interrupts coming back from the GPU for most frames.
As for the heavy performance drop when switching to QtWayland, it seems I was hit very hard by this Qt bug, https://bugreports.qt.io/browse/QTBUG-76813, which caused frequent 100 ms freezes.
Thank you for the answers and help provided in the thread.
I am trying to switch from the Xilinx Mali kernel drivers to the lima kernel drivers and I am kind of stuck.
Weston on Wayland runs with GPU support using the mali.ko and libMali.so provided by Xilinx under Ubuntu 20.
But doing the same thing with lima and a self-compiled Mesa library is another matter.
So far I was able to load the lima.ko module and build the Mesa drivers, but when running Weston I only get software rendering, no GPU acceleration.
Could you please give me a hint what you have done? Like:
- kernel settings in petalinux
- device tree binding of gpu
- what to do with the Mesa libraries. Maybe I am just missing some kind of links.
Regards,
p00chie
> kernel settings in petalinux
I'm not using petalinux, so I have little insight as to what to change there. The defconfig used to compile the kernel must have `CONFIG_DRM_LIMA` and `CONFIG_DRM_XLNX`.
> device tree binding of gpu
I changed the interrupt-names in zynqmp.dtsi so they match what lima_device.c is looking for:
diff --git a/arch/arm64/boot/dts/xilinx/zynqmp.dtsi b/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
index b0b306ed796d..97e776231428 100644
--- a/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
+++ b/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
@@ -462,7 +462,7 @@
 			reg = <0x0 0xfd4b0000 0x0 0x10000>;
 			interrupt-parent = <&gic>;
 			interrupts = <0 132 4>, <0 132 4>, <0 132 4>, <0 132 4>, <0 132 4>, <0 132 4>;
-			interrupt-names = "IRQGP", "IRQGPMMU", "IRQPP0", "IRQPPMMU0", "IRQPP1", "IRQPPMMU1";
+			interrupt-names = "gp", "gpmmu", "pp0", "ppmmu0", "pp1", "ppmmu1";
 			clock-names = "gpu", "gpu_pp0", "gpu_pp1";
 			power-domains = <&zynqmp_firmware PD_GPU>;
 		};
lima_device.c also looks for the clock-names `bus` and `core`, so I changed the driver code to use the clocks `gpu`, `gpu_pp0` and `gpu_pp1`. I couldn't really find any docs on those clocks, so I can only attest that, empirically, it works.
Maybe you already did that, since otherwise there are errors in `dmesg` when lima.ko is loaded.
> what to do with the Mesa libraries. Maybe I am just missing some kind of links.
Yes, mesa requires a small patch so it knows it can use Xilinx' drm driver.
I'm really not an expert in the Linux graphics ecosystem, but from what I could gather lima is a `render only` driver and Xilinx's DRM driver is `display only` (I think Xilinx's DRM driver was never merged upstream, so make sure to use the latest one from their fork), and there's a bit of glue code involved to link them, as you said.
This patch is valid for mesa 19.1.6:
---
 src/gallium/drivers/kmsro/Android.mk | 1 +
 src/gallium/targets/dri/meson.build  | 1 +
 src/gallium/targets/dri/target.c     | 1 +
 3 files changed, 3 insertions(+)

diff --git a/src/gallium/drivers/kmsro/Android.mk b/src/gallium/drivers/kmsro/Android.mk
index 7c39f97..dbcb389 100644
--- a/src/gallium/drivers/kmsro/Android.mk
+++ b/src/gallium/drivers/kmsro/Android.mk
@@ -50,5 +50,6 @@ GALLIUM_TARGET_DRIVERS += repaper
 GALLIUM_TARGET_DRIVERS += st7586
 GALLIUM_TARGET_DRIVERS += st7735r
 GALLIUM_TARGET_DRIVERS += sun4i-drm
+GALLIUM_TARGET_DRIVERS += xlnx
 $(eval GALLIUM_LIBS += $(LOCAL_MODULE) libmesa_winsys_kmsro)
 endif
diff --git a/src/gallium/targets/dri/meson.build b/src/gallium/targets/dri/meson.build
index 8da21b3..ab57908 100644
--- a/src/gallium/targets/dri/meson.build
+++ b/src/gallium/targets/dri/meson.build
@@ -85,6 +85,7 @@ foreach d : [[with_gallium_kmsro, [
                'st7735r_dri.so',
                'stm_dri.so',
                'sun4i-drm_dri.so',
+               'xlnx_dri.so',
              ]],
              [with_gallium_radeonsi, 'radeonsi_dri.so'],
              [with_gallium_nouveau, 'nouveau_dri.so'],
diff --git a/src/gallium/targets/dri/target.c b/src/gallium/targets/dri/target.c
index f71f690..e8f4340 100644
--- a/src/gallium/targets/dri/target.c
+++ b/src/gallium/targets/dri/target.c
@@ -110,6 +110,7 @@ DEFINE_LOADER_DRM_ENTRYPOINT(st7586)
 DEFINE_LOADER_DRM_ENTRYPOINT(st7735r)
 DEFINE_LOADER_DRM_ENTRYPOINT(stm)
 DEFINE_LOADER_DRM_ENTRYPOINT(sun4i_drm)
+DEFINE_LOADER_DRM_ENTRYPOINT(xlnx)
 #endif
 
 #if defined(GALLIUM_LIMA)
Finally, you can use `kmscube` to test the setup before debugging in Weston directly. You should get an output like this when everything works correctly (plus a 3D cube on your display):
# kmscube
eglGetPlatformDisplayEXT
Using display 0x55bab3fbd0 with EGL version 1.4
===================================
EGL information:
  version: "1.4"
  vendor: "Mesa Project"
  client extensions: "EGL_EXT_client_extensions EGL_EXT_device_base EGL_EXT_device_enumeration EGL_EXT_device_query EGL_EXT_platform_base EGL_KHR_client_get_all_proc_addresses EGL_KHR_debug EGL_EXT_platform_device EGL_EXT_platform_wayland EGL_KHR_platform_wayland EGL_MESA_platform_gbm EGL_KHR_platform_gbm EGL_MESA_platform_surfaceless"
===================================
OpenGL ES 2.x information:
  version: "OpenGL ES 2.0 Mesa 20.1.0"
  shading language version: "OpenGL ES GLSL ES 1.0.16"
  vendor: "lima"
  renderer: "Mali400"
===================================
Thanks for your support!
After making the changes, Mesa 19.1.6 failed to build :(
I switched to 20.1.0, the version shown in your output, made the changes and installed it again.
Here is my build config:
meson build/ --buildtype release --prefix=/usr/local --libdir=lib/aarch64-linux-gnu -Dgallium-drivers=lima,kmsro,swrast -Dplatforms=x11,drm,surfaceless,wayland -Dvulkan-drivers= -Ddri-drivers= -Dllvm=false
Unfortunately I couldn't build kmscube, but it's in the Ubuntu 20.10 repo.
After running kmscube I got this output:
Using display 0x558dcae390 with EGL version 1.4
===================================
EGL information:
  version: "1.4"
  vendor: "Mesa Project"
  client extensions: "EGL_EXT_client_extensions EGL_EXT_device_base EGL_EXT_device_enumeration EGL_EXT_device_query EGL_EXT_platform_base EGL_KHR_client_get_all_proc_addresses EGL_KHR_debug EGL_EXT_platform_device EGL_EXT_platform_wayland EGL_KHR_platform_wayland EGL_EXT_platform_x11 EGL_KHR_platform_x11 EGL_MESA_platform_gbm EGL_KHR_platform_gbm EGL_MESA_platform_surfaceless"
  display extensions: "EGL_ANDROID_blob_cache EGL_ANDROID_native_fence_sync EGL_EXT_buffer_age EGL_EXT_image_dma_buf_import EGL_EXT_image_dma_buf_import_modifiers EGL_KHR_cl_event2 EGL_KHR_config_attribs EGL_KHR_create_context EGL_KHR_create_context_no_error EGL_KHR_fence_sync EGL_KHR_get_all_proc_addresses EGL_KHR_gl_colorspace EGL_KHR_gl_renderbuffer_image EGL_KHR_gl_texture_2D_image EGL_KHR_gl_texture_3D_image EGL_KHR_gl_texture_cubemap_image EGL_KHR_image EGL_KHR_image_base EGL_KHR_image_pixmap EGL_KHR_no_config_context EGL_KHR_partial_update EGL_KHR_reusable_sync EGL_KHR_surfaceless_context EGL_EXT_pixel_format_float EGL_KHR_wait_sync EGL_MESA_configless_context EGL_MESA_drm_image EGL_MESA_image_dma_buf_export EGL_MESA_query_driver EGL_WL_bind_wayland_display "
===================================
OpenGL ES 2.x information:
  version: "OpenGL ES 2.0 Mesa 20.1.0 (git-7de17e2520)"
  shading language version: "OpenGL ES GLSL ES 1.0.16"
  vendor: "lima"
  renderer: "Mali400"
  extensions: "GL_EXT_blend_minmax GL_EXT_multi_draw_arrays GL_EXT_texture_format_BGRA8888 GL_OES_compressed_ETC1_RGB8_texture GL_OES_depth24 GL_OES_element_index_uint GL_OES_fbo_render_mipmap GL_OES_mapbuffer GL_OES_rgb8_rgba8 GL_OES_standard_derivatives GL_OES_stencil8 GL_OES_texture_3D GL_OES_texture_npot GL_OES_vertex_half_float GL_OES_EGL_image GL_OES_depth_texture GL_OES_packed_depth_stencil GL_OES_get_program_binary GL_APPLE_texture_max_level GL_EXT_discard_framebuffer GL_EXT_read_format_bgra GL_EXT_frag_depth GL_NV_fbo_color_attachments GL_OES_EGL_image_external GL_OES_EGL_sync GL_OES_vertex_array_object GL_EXT_occlusion_query_boolean GL_EXT_unpack_subimage GL_NV_draw_buffers GL_NV_read_buffer GL_NV_read_depth GL_NV_read_depth_stencil GL_NV_read_stencil GL_EXT_draw_buffers GL_EXT_map_buffer_range GL_KHR_debug GL_KHR_texture_compression_astc_ldr GL_NV_pixel_buffer_object GL_OES_required_internalformat GL_OES_surfaceless_context GL_EXT_separate_shader_objects GL_EXT_compressed_ETC1_RGB8_sub_texture GL_EXT_draw_elements_base_vertex GL_EXT_texture_border_clamp GL_KHR_context_flush_control GL_OES_draw_elements_base_vertex GL_OES_texture_border_clamp GL_KHR_no_error GL_KHR_texture_compression_astc_sliced_3d GL_KHR_parallel_shader_compile "
===================================
failed to set mode: Invalid argument
So I think the driver should be OK. Maybe there is something missing in the DRM or GPU setup?
First, the dmesg output for lima:
[ 10.085688] lima fd4b0000.gpu: IRQ pmu not found
[ 10.090471] lima fd4b0000.gpu: IRQ ppmmu2 not found
[ 10.095394] lima fd4b0000.gpu: IRQ ppmmu3 not found
[ 10.100322] lima fd4b0000.gpu: gp - mali400 version major 1 minor 1
[ 10.100353] lima fd4b0000.gpu: pp0 - mali400 version major 1 minor 1
[ 10.100373] lima fd4b0000.gpu: pp1 - mali400 version major 1 minor 1
[ 10.100385] lima fd4b0000.gpu: IRQ pp2 not found
[ 10.105041] lima fd4b0000.gpu: IRQ pp3 not found
[ 10.109699] lima fd4b0000.gpu: l2 cache 64K, 4-way, 64byte cache line, 128bit external bus
[ 10.166862] lima fd4b0000.gpu: bus rate = 599999994
[ 10.166871] lima fd4b0000.gpu: mod rate = 599999994
[ 10.172464] [drm] Initialized lima 1.0.0 20190217 for fd4b0000.gpu on minor 1
Second, the dmesg output for DRM:
[ 3.537453] OF: graph: no port node found in /amba/zynqmp-display@fd4a0000
[ 3.544426] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 3.551034] [drm] No driver support for vblank timestamp query.
[ 3.557007] xlnx-drm xlnx-drm.0: bound fd4a0000.zynqmp-display (ops 0xffffffc010cf8740)
[ 3.734818] Console: switching to colour frame buffer device 240x75
[ 3.757765] zynqmp-display fd4a0000.zynqmp-display: fb0: xlnxdrmfb frame buffer device
[ 3.765918] [drm] Initialized xlnx 1.0.0 20130509 for fd4a0000.zynqmp-display on minor 0
[ 3.774046] zynqmp-display fd4a0000.zynqmp-display: ZynqMP DisplayPort Subsystem driver probed
Did kmscube work for you?
After messing around with custom Weston launches without success, simply using weston-launch did it: the Mali/lima/Utgard stack is working, back to Valhalla :)
cat /proc/interrupts
root@bcp-linux:/etc/ld.so.conf.d# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
  3:      954639     823223     807559     890756  GICv2  30 Level  arch_timer
  6:           0          0          0          0  GICv2  67 Level  zynqmp_ipi
  7:           0          0          0          0  GICv2 175 Level  arm-pmu
  8:           0          0          0          0  GICv2 176 Level  arm-pmu
  9:           0          0          0          0  GICv2 177 Level  arm-pmu
 10:           0          0          0          0  GICv2 178 Level  arm-pmu
 12:           0          0          0          0  GICv2 156 Level  zynqmp-dma
 13:           0          0          0          0  GICv2 157 Level  zynqmp-dma
 14:           0          0          0          0  GICv2 158 Level  zynqmp-dma
 15:           0          0          0          0  GICv2 159 Level  zynqmp-dma
 16:           0          0          0          0  GICv2 160 Level  zynqmp-dma
 17:           0          0          0          0  GICv2 161 Level  zynqmp-dma
 18:           0          0          0          0  GICv2 162 Level  zynqmp-dma
 19:           0          0          0          0  GICv2 163 Level  zynqmp-dma
 20:         155          0          0          0  GICv2 164 Level  gpmmu, ppmmu0, ppmmu1, gp, pp0, pp1
 21:           0          0          0          0  GICv2 109 Level  zynqmp-dma
 22:           0          0          0          0  GICv2 110 Level  zynqmp-dma
 23:           0          0          0          0  GICv2 111 Level  zynqmp-dma
 24:           0          0          0          0  GICv2 112 Level  zynqmp-dma
 25:           0          0          0          0  GICv2 113 Level  zynqmp-dma
 26:           0          0          0          0  GICv2 114 Level  zynqmp-dma
 27:           0          0          0          0  GICv2 115 Level  zynqmp-dma
 28:           0          0          0          0  GICv2 116 Level  zynqmp-dma
 29:           1          0          0          0  GICv2 144 Level  fd070000.memory-controller
 30:      525015          0          0          0  GICv2  89 Level  eth0, eth0
 32:           0          0          0          0  GICv2  49 Level  cdns-i2c
 33:           0          0          0          0  GICv2  42 Level  ff960000.memory-controller
 34:           0          0          0          0  GICv2  57 Level  axi-pmon, axi-pmon
 35:           0          0          0          0  GICv2 155 Level  axi-pmon, axi-pmon
 36:           0          0          0          0  GICv2  47 Level  ff0f0000.spi
 37:           0          0          0          0  GICv2  58 Level  ffa60000.rtc
 38:           0          0          0          0  GICv2  59 Level  ffa60000.rtc
 39:       82759          0          0          0  GICv2  81 Level  mmc0
 40:         814          0          0          0  GICv2  53 Level  xuartps
 41:           0          0          0          0  GICv2  88 Level  ams-irq
 42:      742775          0          0          0  GICv2 154 Level  fd4c0000.dma
 43:        7692          0          0          0  GICv2 151 Level  fd4a0000.zynqmp-display
 45:           0          0          0          0  GICv2 122 Edge   M_AXI_S2O
 46:           0          0          0          0  GICv2 126 Edge   M_AXI_O2S
 47:           0          0          0          0  GICv2 123 Edge   M_AXI_S2O_INTR0
 48:           0          0          0          0  GICv2 124 Edge   M_AXI_S2O_INTR1
 49:           0          0          0          0  GICv2 125 Edge   M_AXI_S2O_INTR2
 50:           0          0          0          0  GICv2 127 Edge   M_AXI_O2S_INTR0
 51:           0          0          0          0  GICv2 128 Edge   M_AXI_O2S_INTR1
 84:        7647          0          0          0  GICv2  97 Level  xhci-hcd:usb1
 85:           3          0          0          0  GICv2 101 Level  dwc3-otg
IPI0:     101043     227560     313069     254156  Rescheduling interrupts
IPI1:       1709       6727       6785       6559  Function call interrupts
IPI2:          0          0          0          0  CPU stop interrupts
IPI3:          0          0          0          0  CPU stop (for crash dump) interrupts
IPI4:          0          0          0          0  Timer broadcast interrupts
IPI5:          0          0          0          0  IRQ work interrupts
IPI6:          0          0          0          0  CPU wake-up interrupts
But within Weston, glmark2-es2-wayland failed with
error: import buffer not properly aligned
Can you start it?
> So I think the driver should be OK. Maybe there is something missing in the DRM or GPU setup?
I think you're right, everything seems initialized correctly in the logs.
> Did kmscube work for you?
Yes it works. It seems you have the error "failed to set mode: Invalid argument". I think this is an issue with the buffer format kmscube uses by default. I have this patch locally for kmscube:
diff --git a/common.c b/common.c
index b6f3e9b..d772a79 100644
--- a/common.c
+++ b/common.c
@@ -43,7 +43,7 @@ gbm_surface_create_with_modifiers(struct gbm_device *gbm,
 const struct gbm * init_gbm(int drm_fd, int w, int h, uint64_t modifier)
 {
 	gbm.dev = gbm_create_device(drm_fd);
-	gbm.format = GBM_FORMAT_XRGB8888;
+	gbm.format = GBM_FORMAT_RGB565;
 	gbm.surface = NULL;
 
 	if (gbm_surface_create_with_modifiers) {
> But within Weston, glmark2-es2-wayland failed with
> error: import buffer not properly aligned
> Can you start it?
It starts, but it doesn't render correctly: the image doesn't appear on screen. I also have this patch in Mesa; the check it relaxes seems to be the cause of your error:
Subject: [PATCH] lima: lima_resource: relax stride check

See https://gitlab.freedesktop.org/mesa/mesa/-/issues/3070

Suggested-by: Vasily Khoruzhick <anarsoul@gmail.com>
---
 src/gallium/drivers/lima/lima_resource.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/gallium/drivers/lima/lima_resource.c b/src/gallium/drivers/lima/lima_resource.c
index 4644ea4..fd7614d 100644
--- a/src/gallium/drivers/lima/lima_resource.c
+++ b/src/gallium/drivers/lima/lima_resource.c
@@ -351,8 +351,21 @@ lima_resource_from_handle(struct pipe_screen *pscreen,
    stride = util_format_get_stride(pres->format, width);
    size = util_format_get_2d_size(pres->format, stride, height);
 
-   if (res->levels[0].stride != stride || res->bo->size < size) {
-      debug_error("import buffer not properly aligned\n");
+   if (res->tiled && res->levels[0].stride != stride) {
+      fprintf(stderr, "tiled imported buffer has mismatching stride: %d (BO) != %d (expected)",
+                 res->levels[0].stride, stride);
+      goto err_out;
+   }
+
+   if (!res->tiled && res->levels[0].stride < stride) {
+      fprintf(stderr, "linear imported buffer has mismatching stride: %d (BO) < %d (expected)",
+                 res->levels[0].stride, stride);
+      goto err_out;
+   }
+
+   if (res->bo->size < size) {
+      fprintf(stderr, "imported bo size is smaller than expected: %d (BO) < %d (expected)\n",
+                 res->bo->size, size);
       goto err_out;
    }
 
--
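To make the effect of the relaxed check easier to see, here is a standalone sketch of the same decision logic, detached from Mesa. The function name and parameters are hypothetical and only mirror the three conditions in the patch above. It is not Mesa code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when an imported buffer would pass the relaxed check:
 *  - tiled buffers must match the expected stride exactly,
 *  - linear buffers may have a larger (padded) stride,
 *  - the buffer object must be at least the expected size.
 * Hypothetical helper mirroring the patch logic. */
bool import_layout_ok(bool tiled,
                      uint32_t bo_stride, uint32_t expected_stride,
                      uint32_t bo_size, uint32_t expected_size)
{
    if (tiled && bo_stride != expected_stride)
        return false;

    if (!tiled && bo_stride < expected_stride)
        return false;

    return bo_size >= expected_size;
}
```

For example, a linear 1024x600 RGB565 buffer has an expected stride of 2048 bytes. If it is imported with a padded stride of 4096 bytes, the original `!=` comparison rejects it, while this relaxed logic accepts it.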
Weston has a few sample clients to test with as well, such as "weston-simple-egl", which does work for me.
Thanks for the advice about GBM_FORMAT_RGB565.
I had to change it in another spot, but I forgot about it.
Are you still on the 5.4.0 kernel from Xilinx?
I think a lot of work has gone into the lima driver in the Linux kernel, but I couldn't merge the current lima sources into the 5.4 Xilinx kernel sources because there were way too many changes. Maybe with a newer version from Xilinx I could give it another try.
The official way, with libMali and Weston 9 under Ubuntu 20.10, seems a bit more advanced, since for example glmark2 works without any errors under Wayland.
Yes, I'm using Xilinx's 5.4.0 kernel, but I cherry-picked a few patches from the upstream lima driver, mainly dynamic heap memory. It's possible libMali is more battle-tested than the open-source stack. For my use case, lima was doing a better job and I have insight into the whole stack for debugging, so I went with that. But your mileage may vary :-P
This fix finally resolved the problem of many applications not loading at all.
But the applications still don't display correctly. After resizing the application window, it becomes more or less OK.
Could you get around the imaging artifacts?