
Mali deadlock with X server grab

Hi,

We are working with Mali-400 driver r3p2-01rel0 on Exynos4412, running gnome-shell under Linux/X11.

base: BUILD=RELEASE ARCH=arch_011_udd PLATFORM=default_7a TRACE=0 THREAD= GEOM= CORES=MALI400 USING_MALI400=1 TARGET_CORE_REVISION=0x0101 TOPLEVEL_REPO_URL=Linux-r3p2-01rel0 REVISION=Linux-r3p2-01rel0 CHANGED_REVISION=Linux-r3p2-01rel0 REPO_URL=Linux-r3p2-01rel0 BUILD_DATE=Fri Jan 11 14:58:31 UTC 2013 CHANGE_DATE=Linux-r3p2-01rel0 TARGET_TOOLCHAIN=gcc HOST_TOOLCHAIN=gcc TARGET_TOOLCHAIN_VERSION=gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)  HOST_TOOLCHAIN_VERSION=gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)  TARGET_SYSTEM=gcc-arm-linux-gnueabihf HOST_SYSTEM=gcc-arm-linux-gnueabihf CPPFLAGS= CUSTOMER=internal VARIANT=mali400-r3p2-gles11-gles20-linux-ump-x11 HOSTLIB=direct INSTRUMENTED=FALSE USING_MRI=FALSE MALI_TEST_API= UDD_OS=linux

We are facing a problem with gnome-shell that is easy to reproduce: the UI often hangs while minimizing windows or opening new windows. I have traced this down to a deadlock.

At the point of hang, one thread is waiting for a reply from X:

#0 0xb656ed30 in poll () at ../sysdeps/unix/syscall-template.S:81
#1 0xb587dfa2 in poll (__timeout=-1, __nfds=1, __fds=0xb40fe988)
  at /usr/include/arm-linux-gnueabihf/bits/poll2.h:46
#2 _xcb_conn_wait (c=c@entry=0x17ae08, cond=cond@entry=0xb40fe9d8,
  vector=0x0, count=0x0) at ../../src/xcb_conn.c:400
#3 0xb587edb0 in wait_for_reply (c=c@entry=0x17ae08,
  request=, e=e@entry=0xb40fea7c) at ../../src/xcb_in.c:395
#4 0xb587ef3a in xcb_wait_for_reply (c=0x17ae08, request=36, e=0xb40fea7c)
  at ../../src/xcb_in.c:425
#5 0xb5e22644 in _XReply () from /usr/lib/arm-linux-gnueabihf/libX11.so.6
#6 0xb5627b9a in DRI2SwapBuffers ()
  from /usr/lib/arm-linux-gnueabihf/libEGL.so.1

The main gnome-shell thread is hung trying to acquire a Mali lock:

#0  __libc_do_syscall ()
    at ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:43
#1  0xb6630c44 in __lll_lock_wait (futex=futex@entry=0x10180c, private=0)
    at ../ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c:46
#2  0xb662d4b0 in __GI___pthread_mutex_lock (mutex=0x10180c)
    at pthread_mutex_lock.c:64
#3  0xb5b9219e in _mali_osu_lock_wait ()
   from /usr/lib/arm-linux-gnueabihf/libEGL.s
#4  0xb5bd2e80 in glDeleteTextures ()
   from /usr/lib/arm-linux-gnueabihf/libEGL.so
#5  0xb5fff5ca in _cogl_delete_gl_texture (gl_texture=42)
    at ./driver/gl/cogl-pipeline-opengl.c:212
#6  0xb6024ae4 in _cogl_texture_2d_free (tex_2d=0x19f6858)
    at ./cogl-texture-2d.c:72
#7  _cogl_object_texture_2d_indirect_free (obj=0x19f6858)
    at ./cogl-texture-2d.c:56
#8  0xb600b950 in _cogl_object_default_unref (object=0x19f6858)
    at ./cogl-object.c:96
#9  0xb600b8c4 in cogl_object_unref (obj=<optimized out>)
    at ./cogl-object.c:104
#10  0xb601d646 in _cogl_pipeline_layer_free (layer=0x19e7f60)
    at ./cogl-pipeline-layer.c:630
#11  _cogl_object_pipeline_layer_indirect_free (obj=0x19e7f60)
    at ./cogl-pipeline-layer.c:52
#12  0xb600b950 in _cogl_object_default_unref (object=0x19e7f60)
    at ./cogl-object.c:96
#13  0xb600b8c4 in cogl_object_unref (obj=<optimized out>)
    at ./cogl-object.c:104

What has happened here is the following race:

  1. The DRI2SwapBuffers thread acquires the Mali lock.
  2. The main thread sends an X_GrabServer request to X. This causes X to ignore all other clients, including the client that is used by the DRI2SwapBuffers thread. This server grab is done by the window manager library (mutter).
  3. The main thread attempts to start a GL operation, e.g. the glDeleteTextures call above. It tries to take the Mali lock, but as the lock is already held, the main thread blocks.
  4. The DRI2SwapBuffers thread continues and sends the DRI2 SwapBuffers message to X, and blocks waiting for a response.

X is deaf to the message sent in step 4, since another client (in step 2) issued GrabServer. So the DRI2SwapBuffers thread sits around forever waiting for a response, with the Mali lock held. The client that issued GrabServer itself is hung trying to obtain the Mali lock to do some GL op, so it will never ungrab the server. Deadlock!
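
The cycle can be made explicit with a small wait-for graph. This is only an illustrative sketch of the reasoning above, not code from Mali, Xlib or mutter; the party names and both helper functions are hypothetical. Each party waits on at most one other party, and a deadlock exists when following the chain leads back to where it started.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the parties in the race described above. */
enum party {
    MAIN_THREAD,  /* holds the X server grab, wants the Mali lock   */
    SWAP_THREAD,  /* holds the Mali lock, wants a reply from X      */
    X_SERVER,     /* only serves the connection that holds the grab */
    PARTY_COUNT,
    NOBODY = -1
};

/* Fill waits_on[] with the situation after steps 1 to 4.
 * waits_on[p] is the party that p is blocked on, or NOBODY. */
static void build_wait_graph(int waits_on[PARTY_COUNT], bool main_holds_grab)
{
    waits_on[MAIN_THREAD] = SWAP_THREAD; /* step 3: blocked on the Mali lock */
    waits_on[SWAP_THREAD] = X_SERVER;    /* step 4: blocked in _XReply       */
    /* While the grab is held, X will not answer the swap thread until the
     * main thread ungrabs, so X effectively waits on the main thread. */
    waits_on[X_SERVER] = main_holds_grab ? MAIN_THREAD : NOBODY;
}

/* Follow the chain of waits from start. With at most one outgoing edge per
 * party, PARTY_COUNT hops are enough to either run out of edges or come
 * back to start. Returns true if start is part of a cycle. */
static bool has_deadlock(const int waits_on[PARTY_COUNT], int start)
{
    assert(start >= 0 && start < PARTY_COUNT);
    int cur = start;
    for (int hops = 0; hops < PARTY_COUNT; hops++) {
        cur = waits_on[cur];
        if (cur == NOBODY)
            return false;
        if (cur == start)
            return true;
    }
    return false;
}
```

With the grab held, has_deadlock() reports a cycle for both threads; without the grab, X answers the swap thread, the Mali lock is eventually released, and the chain ends.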

Any solutions or workarounds appreciated. The best I can think of is to make sure no client ever does any kind of GL operation while it has the server grabbed. As the scope of that is enormous, it does not seem optimal.

  • Hi dsd, I apologize that no one has got back to you on your two questions yet. I am highlighting them to the team now, and we'll get a response to you as soon as possible.

    Best wishes,

    Ellie

  • Hi,

      May I ask what version of GNOME you use in your project?

      Are you porting Ubuntu to the Mali-based device, or something else? Was the UI that hangs developed by yourself, or is it the stock one?

      As we can see in the description of XGrabServer:

      The XGrabServer function disables processing of requests and close downs on all other connections than the one this request arrived on. You should not grab the X server any more than is absolutely necessary.

      How do you use XGrabServer in this case?

      Are the main thread and DRI2SwapBuffers thread in the same process?

      And does the main thread call any OpenGL API while it holds the server grab?

    Thanks,

    Frank.

  • Hi dsd,

    Have you managed to solve this issue? If not, would it be possible to help us understand the problem better by providing the information Frank asks for above?

    Thanks.

  • Hi,

    We are using GNOME 3.8 and 3.10. The hang has been seen on custom and generic GNOME shells.

    XGrabServer is used by the window manager (mutter), and this is quite normal for a window manager. In this case, mutter emits signals to the desktop shell which does rendering in response, all while Mutter is holding the server grab.

    Maybe you can clarify how Mali behaves here, as I am having trouble finding a solid way to determine which library starts each thread, and from which point. From what I have gathered, the window manager and desktop shell are single-threaded, and I believe that Mali creates an extra thread and a new X connection and, for some reason, sends SwapBuffers requests to the server from that thread. The main WM/shell thread does do rendering while a server grab is held on that thread's X connection, and that is when we hit the deadlock.

    In the meantime, I have been improving mutter so that it does not emit any signals (which could cause GL rendering) while holding server grabs. But, assuming my statement above is correct (Mali creating its own thread and separate X client and doing SwapBuffers there), this problem could be seen in other contexts.
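
    One way to keep GL work out of a server grab is to queue it while the grab is held and run it after the ungrab. The sketch below illustrates that idea only; it is not the change made to mutter, and every name in it (grab_guard, guard_grab, guard_run_or_defer, guard_ungrab, count_call) is hypothetical. The real XGrabServer and XUngrabServer calls are left as comments so the example has no X11 dependency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEFERRED 16

typedef void (*work_fn)(void *data);

/* Tracks nested server grabs and the work postponed while one is held. */
struct grab_guard {
    int grab_depth;             /* number of nested grabs currently held */
    size_t pending;             /* number of queued callbacks            */
    work_fn fn[MAX_DEFERRED];
    void *data[MAX_DEFERRED];
};

static void guard_grab(struct grab_guard *g)
{
    /* Real code would call XGrabServer(dpy) on the outermost grab. */
    g->grab_depth++;
}

/* Run fn now if no grab is held, otherwise queue it.
 * Returns false only if the queue is full and fn was neither run nor queued. */
static bool guard_run_or_defer(struct grab_guard *g, work_fn fn, void *data)
{
    if (g->grab_depth == 0) {
        fn(data);
        return true;
    }
    if (g->pending == MAX_DEFERRED)
        return false;
    g->fn[g->pending] = fn;
    g->data[g->pending] = data;
    g->pending++;
    return true;
}

static void guard_ungrab(struct grab_guard *g)
{
    assert(g->grab_depth > 0);
    if (--g->grab_depth > 0)
        return; /* an outer grab is still held */

    /* Real code would call XUngrabServer(dpy) and XFlush(dpy) here,
     * before any of the queued GL work runs. */

    /* Copy the queue first so callbacks may safely grab and defer again. */
    work_fn fn[MAX_DEFERRED];
    void *data[MAX_DEFERRED];
    size_t n = g->pending;
    for (size_t i = 0; i < n; i++) {
        fn[i] = g->fn[i];
        data[i] = g->data[i];
    }
    g->pending = 0;
    for (size_t i = 0; i < n; i++)
        fn[i](data[i]);
}

/* Example callback standing in for a GL operation: counts invocations. */
static void count_call(void *data)
{
    (*(int *)data)++;
}
```

    A caller would wrap anything that may end up in GL (for example, freeing a texture) in guard_run_or_defer(); the work then runs immediately outside a grab and only after guard_ungrab() inside one.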

  • Hi dsd,

    Can you please let me know what device you are testing the driver with? We cannot reproduce the problem on the Odroid-U2 running kernel 3.8 and the r3p2-01rel4 Mali driver.

    The r3p2-01rel0 driver was released at the end of last year. Our latest kernel driver is r4p0-00rel0 and is available from malideveloper.com.

    Thanks,

    Tu

  • Hi,

    Thanks for looking into this. We are testing on the ODROID-U2. Unfortunately the bug is not easy to reproduce; you have to try a lot of window operations (minimize, maximize, open, close) before it bites.

    Nevertheless, I believe I have posted enough technical information above to highlight the bug in Mali. Mali's threaded design is incompatible with any X client that might perform GL operations while holding a server grab. If ARM is not prepared to support such a scenario, or is not prepared to fix this, then it would be unfortunate but actually not a big issue from our end any more. I have now improved Mutter (the GNOME window manager) so that it never takes server grabs, and my changes were accepted upstream for the next version.

    However, I believe the threaded model that has bitten us here is part of a wider design problem that is also biting us in other areas, such as in the thread "Bad interaction with DRI2 for vsync".

    So it would still be worth working on the larger problem. Right now Mali starts a new thread and a new X connection, and uses that new thread/connection to perform SwapBuffers. This seems bizarre, given that SwapBuffers is an asynchronous operation that should always return immediately, but as we see in the other thread, it looks like there was some confusion here and the original Mali stack has implemented this as a synchronous, blocking op (which explains why libMali would then create a new thread and connection). Would be great if that can be fixed.

    We would love to try r4p0 and in fact I already downloaded the source code you mentioned, compiled it and checked that it loads - great. However we cannot actually test this until someone provides us with the r4p0 version of libMali/libEGL/libGLESv2. Hardkernel have some DDK licensing difficulties at the moment so they can't provide this. If ARM could provide a new binary libMali for Mali-400/armhf/Linux/X11/UMP it would be ideal.

    Thanks,

    Daniel

  • Hi dsd

    The Mali driver may create another "Display" through XOpenDisplay and use the returned connection to do the swap, so if XGrabServer is used, a deadlock may happen.

    I think the root cause is that gnome-shell just passes EGL_DEFAULT_DISPLAY into eglGetDisplay(), so the Mali driver has no X connection to work with and creates its own instead.

    Can you please try the following in gnome-shell:

    Display* dpy = XOpenDisplay(...);

    ...

    EGLDisplay egl_dpy = eglGetDisplay((EGLNativeDisplayType)dpy);

    ....

    so that the Mali driver uses this "dpy"? It should then not deadlock whether or not XGrabServer is used.

  • dsd wrote:

    We would love to try r4p0 and in fact I already downloaded the source code you mentioned, compiled it and checked that it loads - great. However we cannot actually test this until someone provides us with the r4p0 version of libMali/libEGL/libGLESv2. Hardkernel have some DDK licensing difficulties at the moment so they can't provide this. If ARM could provide a new binary libMali for Mali-400/armhf/Linux/X11/UMP it would be ideal.

    Hi dsd,

    Thanks for your feedback, hopefully HardKernel can release soon as we typically do not release binaries to anyone who isn't under at least an LUL licence agreement.

    Thanks,

    Chris

  • Hi,

    Thanks for acknowledging the issue.

    I checked, and gnome-shell (via cogl) does already do the equivalent of what you say.

    https://git.gnome.org/browse/cogl/tree/cogl/cogl-xlib-renderer.c#n176

    https://git.gnome.org/browse/cogl/tree/cogl/winsys/cogl-winsys-egl-x11.c#n265

    But let's take a simpler example: es2_info from mesa-demos. This one very clearly does what you say. http://cgit.freedesktop.org/mesa/demos/tree/src/egl/opengles1/es1_info.c#n227

    Running this in gdb (log below), I very clearly observe the following:

    1. Mali creates threads when eglInitialize is called.
    2. Mali calls XOpenDisplay (for a second time) even though we correctly passed in an X display, as you suggested.

    # gdb es2_info
    GNU gdb (GDB) 7.5.91.20130417-cvs-ubuntu
    Copyright (C) 2013 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "arm-linux-gnueabihf".
    For bug reporting instructions, please see:
    ...
    Reading symbols from /root/mesa-demos-8.0.1+git20110129+d8f7d6b/src/egl/opengles2/es2_info...done.
    (gdb) b eglInitialize
    Breakpoint 1 at 0x8b08
    (gdb) b XOpenDisplay
    Breakpoint 2 at 0x8b50
    (gdb) run
    Starting program: /root/mesa-demos-8.0.1+git20110129+d8f7d6b/src/egl/opengles2/es2_info 
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
    
    Breakpoint 2, 0xb6e476b4 in XOpenDisplay ()
       from /usr/lib/arm-linux-gnueabihf/libX11.so.6
    (gdb) info threads
      Id   Target Id         Frame 
    * 1    Thread 0xb6ff83c0 (LWP 10531) "es2_info" 0xb6e476b4 in XOpenDisplay ()
       from /usr/lib/arm-linux-gnueabihf/libX11.so.6
    (gdb) bt
    #0  0xb6e476b4 in XOpenDisplay ()
       from /usr/lib/arm-linux-gnueabihf/libX11.so.6
    #1  0x0000903e in main (argc=1, argv=0xbefff744) at es2_info.c:246
    

    So far, es2_info has called XOpenDisplay; it will later pass this display into Mali. We have just 1 thread. Good.

    (gdb) c
    Continuing.
    
    Breakpoint 1, 0xb6f6244a in eglInitialize ()
       from /usr/lib/arm-linux-gnueabihf/libGLESv2.so
    (gdb) info threads
      Id   Target Id         Frame 
    * 1    Thread 0xb6ff83c0 (LWP 10531) "es2_info" 0xb6f6244a in eglInitialize ()
       from /usr/lib/arm-linux-gnueabihf/libGLESv2.so
    (gdb) bt
    #0  0xb6f6244a in eglInitialize ()
       from /usr/lib/arm-linux-gnueabihf/libGLESv2.so
    #1  0x00008cd0 in main (argc=1, argv=0xbefff744) at es2_info.c:259
    

    Now es2_info calls eglInitialize. Still only 1 thread. So far so good. Let's step to the next line in es2_info.

    (gdb) n
    Single stepping until exit from function eglInitialize,
    which has no line number information.
    [New Thread 0xb6be6470 (LWP 10534)]
    [New Thread 0xb63e6470 (LWP 10535)]
    
    Breakpoint 2, 0xb6e476b4 in XOpenDisplay ()
       from /usr/lib/arm-linux-gnueabihf/libX11.so.6
    (gdb) bt          
    #0  0xb6e476b4 in XOpenDisplay ()
       from /usr/lib/arm-linux-gnueabihf/libX11.so.6
    #1  0xb6f67326 in __egl_platform_initialize ()
       from /usr/lib/arm-linux-gnueabihf/libGLESv2.so
    #2  0xb6f62d0c in __egl_main_open_mali ()
       from /usr/lib/arm-linux-gnueabihf/libGLESv2.so
    #3  0xb6f61cb0 in _egl_initialize ()
       from /usr/lib/arm-linux-gnueabihf/libGLESv2.so
    #4  0xb6f6246c in eglInitialize ()
       from /usr/lib/arm-linux-gnueabihf/libGLESv2.so
    #5  0x00008cd0 in main (argc=1, argv=0xbefff744) at es2_info.c:259
    (gdb) info threads
      Id   Target Id         Frame 
      3    Thread 0xb63e6470 (LWP 10535) "es2_info" __libc_do_syscall ()
        at ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:43
      2    Thread 0xb6be6470 (LWP 10534) "es2_info" 0xb6dd0928 in ioctl ()
        at ../sysdeps/unix/syscall-template.S:81
    * 1    Thread 0xb6ff83c0 (LWP 10531) "es2_info" 0xb6e476b4 in XOpenDisplay ()
       from /usr/lib/arm-linux-gnueabihf/libX11.so.6
    

    Now, before we reach the next line in es2_info (i.e. still during eglInitialize), 2 threads have been created and Mali has run XOpenDisplay() again. There should be no need for this XOpenDisplay call, because we already passed the X display in to eglGetDisplay. But Mali goes ahead and runs XOpenDisplay() anyway, and my suspicion is that it later uses this connection from the "SwapBuffers thread".

  • Hi dsd

    I think the XOpenDisplay call is not needed if an X connection has been passed in, and we should use the passed-in dpy to call into the X server.

    Thanks for your information, we will work on it.

  • Hi dsd

    I have run into a problem. Our GLES driver is multi-threaded, so currently we first call XInitThreads and then call XOpenDisplay to get a connection of our own. This avoids concurrent access to Xlib, and the application can keep using its own X connection.

    If the driver used the application's X connection handle instead, it would break in the following case:

    The application is single-threaded and does not call XInitThreads, while the GLES driver creates several threads that use the application's X connection handle. This will crash Xlib if the application's thread and a GLES driver thread access Xlib at the same time.

    So I do not think I can finish this in a short time. Can you please avoid calling any GL and EGL functions between XGrabServer and XUngrabServer? I will continue to investigate whether there is a good solution.

  • Thanks for looking into this. Yes, I had also seen that the Mali driver creates internal threads and in such a situation I can see why you would make the internal thread have its own X connection.

    I have fixed GNOME/mutter not to do GL operations under XGrabServer so there is no immediate pressure from this end, but I think it is only a matter of time until someone else runs into this issue under another context.

    I also think the reasons for creating an internal Mali thread are not totally valid. It seems like this internal thread is there just to run SwapBuffers calls? But in a correctly implemented setup, SwapBuffers is asynchronous and does not block, so there is no clear reason why it would need its own thread. This is explained a bit in the thread "Bad interaction with DRI2 for vsync".

    I'm glad to hear that you are looking into solving this going forward. Without this superfluous extra thread, Mali will be better and more reliable as a result.

  • Hi dsd,

    dsd wrote:

    But in a correctly implemented setup, the SwapBuffer is asynchronous, it does not block

    Forgive me if this is irrelevant, as a lot of this conversation is regarding Linux internals with which I am not intimately familiar, but eglSwapBuffers is not necessarily an asynchronous call. In a single-buffered environment, it has no effect and returns immediately, but in double (or more) buffered environments it will wait until there is a buffer available to be written into. For example, you might complete rendering to the back buffer, but the actual "swap" to "copy" the contents of that buffer to the front buffer can only occur at VSYNC if VSYNC is enabled, so this will not return until the sync has occurred, the buffers have been swapped, and rendering can continue.

  • Totally agreed about eglSwapBuffers semantics on the application side.

    I was referring to DRI2's SwapBuffers call, which libMali calls as part of its eglSwapBuffers implementation. That one is designed to be non-blocking, and such is the case in ARM's latest X driver (xf86-video-armsoc), but the fact that Mali seems to create a dedicated DRI2SwapBuffers thread (the cause of this issue and others) seems to be in disagreement with that.

  • Understood, my bad!

    All seems well then, I assume sunsun will reply again when the issue is fixed and we can let you know what release it will be in.

    Thanks,

    Chris