Mali deadlock with X server grab

Hi,

We are working with Mali-400 driver r3p2-01rel0 on Exynos4412, running gnome-shell under Linux/X11.

base: BUILD=RELEASE ARCH=arch_011_udd PLATFORM=default_7a TRACE=0 THREAD= GEOM= CORES=MALI400 USING_MALI400=1 TARGET_CORE_REVISION=0x0101 TOPLEVEL_REPO_URL=Linux-r3p2-01rel0 REVISION=Linux-r3p2-01rel0 CHANGED_REVISION=Linux-r3p2-01rel0 REPO_URL=Linux-r3p2-01rel0 BUILD_DATE=Fri Jan 11 14:58:31 UTC 2013 CHANGE_DATE=Linux-r3p2-01rel0 TARGET_TOOLCHAIN=gcc HOST_TOOLCHAIN=gcc TARGET_TOOLCHAIN_VERSION=gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)  HOST_TOOLCHAIN_VERSION=gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)  TARGET_SYSTEM=gcc-arm-linux-gnueabihf HOST_SYSTEM=gcc-arm-linux-gnueabihf CPPFLAGS= CUSTOMER=internal VARIANT=mali400-r3p2-gles11-gles20-linux-ump-x11 HOSTLIB=direct INSTRUMENTED=FALSE USING_MRI=FALSE MALI_TEST_API= UDD_OS=linux

We are facing a problem with gnome-shell that is easy to reproduce: the UI often hangs while minimizing windows or opening new windows. I have traced this down to a deadlock.

At the point of hang, one thread is waiting for a reply from X:

#0 0xb656ed30 in poll () at ../sysdeps/unix/syscall-template.S:81
#1 0xb587dfa2 in poll (__timeout=-1, __nfds=1, __fds=0xb40fe988)
  at /usr/include/arm-linux-gnueabihf/bits/poll2.h:46
#2 _xcb_conn_wait (c=c@entry=0x17ae08, cond=cond@entry=0xb40fe9d8,
  vector=0x0, count=0x0) at ../../src/xcb_conn.c:400
#3 0xb587edb0 in wait_for_reply (c=c@entry=0x17ae08,
  request=, e=e@entry=0xb40fea7c) at ../../src/xcb_in.c:395
#4 0xb587ef3a in xcb_wait_for_reply (c=0x17ae08, request=36, e=0xb40fea7c)
  at ../../src/xcb_in.c:425
#5 0xb5e22644 in _XReply () from /usr/lib/arm-linux-gnueabihf/libX11.so.6
#6 0xb5627b9a in DRI2SwapBuffers ()
  from /usr/lib/arm-linux-gnueabihf/libEGL.so.1

The main gnome-shell thread is hung trying to acquire a Mali lock:

#0  __libc_do_syscall ()
    at ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:43
#1  0xb6630c44 in __lll_lock_wait (futex=futex@entry=0x10180c, private=0)
    at ../ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c:46
#2  0xb662d4b0 in __GI___pthread_mutex_lock (mutex=0x10180c)
    at pthread_mutex_lock.c:64
#3  0xb5b9219e in _mali_osu_lock_wait ()
    from /usr/lib/arm-linux-gnueabihf/libEGL.s
#4  0xb5bd2e80 in glDeleteTextures ()
    from /usr/lib/arm-linux-gnueabihf/libEGL.so
#5  0xb5fff5ca in _cogl_delete_gl_texture (gl_texture=42)
    at ./driver/gl/cogl-pipeline-opengl.c:212
#6  0xb6024ae4 in _cogl_texture_2d_free (tex_2d=0x19f6858)
    at ./cogl-texture-2d.c:72
#7  _cogl_object_texture_2d_indirect_free (obj=0x19f6858)
    at ./cogl-texture-2d.c:56
#8  0xb600b950 in _cogl_object_default_unref (object=0x19f6858)
    at ./cogl-object.c:96
#9  0xb600b8c4 in cogl_object_unref (obj=<optimized out>)
    at ./cogl-object.c:104
#10 0xb601d646 in _cogl_pipeline_layer_free (layer=0x19e7f60)
    at ./cogl-pipeline-layer.c:630
#11 _cogl_object_pipeline_layer_indirect_free (obj=0x19e7f60)
    at ./cogl-pipeline-layer.c:52
#12 0xb600b950 in _cogl_object_default_unref (object=0x19e7f60)
    at ./cogl-object.c:96
#13 0xb600b8c4 in cogl_object_unref (obj=<optimized out>)
    at ./cogl-object.c:104

What has happened here is the following race:

  1. The DRI2SwapBuffers thread acquires the Mali lock.
  2. The main thread sends an X_GrabServer request to X. This causes X to ignore all other clients, including the client that is used by the DRI2SwapBuffers thread. This server grab is done by the window manager library (mutter).
  3. The main thread attempts to start some GL operation e.g. glDeleteTextures above. It attempts to take the Mali lock, but as this is already taken, the main thread blocks.
  4. The DRI2SwapBuffers thread continues and sends the DRI2 SwapBuffers message to X, and blocks waiting for a response.

X is deaf to the message sent in step 4, since another client (in step 2) issued GrabServer. So the DRI2SwapBuffers thread sits around forever waiting for a response, with the Mali lock held. The client that issued GrabServer itself is hung trying to obtain the Mali lock to do some GL op, so it will never ungrab the server. Deadlock!
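
To make the ordering concrete, here is a small Python simulation of the four steps. It is only a sketch: `FakeXServer`, `run_race`, the connection names and the plain `threading.Lock` standing in for the Mali lock are all invented for illustration, and none of it is real Xlib or Mali code. Timeouts replace the infinite waits so the script terminates instead of hanging like gnome-shell does.

```python
import threading


class FakeXServer:
    """Toy stand-in for X: while one client holds the grab,
    requests from every other client go unanswered."""

    def __init__(self):
        self._cond = threading.Condition()
        self._grab_owner = None

    def grab(self, client):
        with self._cond:
            self._grab_owner = client

    def ungrab(self, client):
        with self._cond:
            if self._grab_owner == client:
                self._grab_owner = None
                self._cond.notify_all()

    def round_trip(self, client, timeout):
        """Return True if the server answered, False if we gave up waiting."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._grab_owner in (None, client), timeout)


def run_race(timeout=0.2):
    server = FakeXServer()
    driver_lock = threading.Lock()        # hypothetical stand-in for the Mali lock
    lock_taken = threading.Event()
    server_grabbed = threading.Event()
    swap_gave_up = threading.Event()
    main_gave_up = threading.Event()
    result = {}

    def swap_thread():
        with driver_lock:                 # step 1: take the driver lock
            lock_taken.set()
            server_grabbed.wait()         # let step 2 happen first
            # step 4: send the swap request and wait for the reply
            result["swap_answered"] = server.round_trip("swap-conn", timeout)
            swap_gave_up.set()
            main_gave_up.wait()           # keep the lock, as the real thread would

    def main_thread():
        lock_taken.wait()
        server.grab("main-conn")          # step 2: grab the server
        server_grabbed.set()
        # step 3: a GL call needs the driver lock
        got_lock = driver_lock.acquire(timeout=timeout)
        result["main_got_lock"] = got_lock
        main_gave_up.set()
        swap_gave_up.wait()               # keep the grab, as the real thread would
        if got_lock:
            driver_lock.release()
        server.ungrab("main-conn")        # only reachable thanks to the timeouts

    threads = [threading.Thread(target=swap_thread),
               threading.Thread(target=main_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(run_race())
```

Both entries in the returned dict come back `False`: the swap thread never gets its reply and the main thread never gets the lock. In the real driver neither wait has a timeout, which is why the hang is permanent.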

Any solutions or workarounds appreciated. The best I can think of is to make sure no client ever does any kind of GL operation while it has the server grabbed. As the scope of that is enormous, it does not seem optimal.
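
One way to limit the scope of that workaround would be to funnel GL teardown through a guard that queues work while the grab is held and runs it after the ungrab. The sketch below is purely illustrative: `GrabGuard`, `run_gl` and the callbacks passed in are hypothetical names, not mutter or cogl API, and it assumes everything happens on a single thread.

```python
class GrabGuard:
    """Defer GL-like callbacks while a server grab is held.

    grab_server / ungrab_server are caller-supplied callables standing in
    for XGrabServer / XUngrabServer (assumed, not real bindings).
    """

    def __init__(self, grab_server, ungrab_server):
        self._grab = grab_server
        self._ungrab = ungrab_server
        self._depth = 0          # supports nested grabs
        self._deferred = []

    def __enter__(self):
        if self._depth == 0:
            self._grab()
        self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._depth -= 1
        if self._depth == 0:
            self._ungrab()
            # The server is listening to everyone again, so it is now
            # safe to run operations that may need the driver lock.
            pending, self._deferred = self._deferred, []
            for op in pending:
                op()
        return False

    def run_gl(self, op):
        """Run op now, or queue it if the server is currently grabbed."""
        if self._depth > 0:
            self._deferred.append(op)
        else:
            op()


def demo():
    log = []
    guard = GrabGuard(lambda: log.append("grab"),
                      lambda: log.append("ungrab"))
    with guard:
        guard.run_gl(lambda: log.append("delete texture 42"))
        log.append("x11 work")
    guard.run_gl(lambda: log.append("delete texture 43"))
    return log


if __name__ == "__main__":
    print(demo())
```

The texture deletion requested inside the grab only runs after the ungrab, so the driver lock is never requested while X is ignoring other clients. This still relies on every GL call site going through the guard, which is the part that does not scale.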

  • Thanks for looking into this. Yes, I had also seen that the Mali driver creates internal threads and in such a situation I can see why you would make the internal thread have its own X connection.

    I have fixed GNOME/mutter not to do GL operations under XGrabServer so there is no immediate pressure from this end, but I think it is only a matter of time until someone else runs into this issue under another context.

    I also think the reasons for creating an internal Mali thread are not totally valid. It seems like this internal thread is there just to run SwapBuffers calls? But in a correctly implemented setup, the SwapBuffer is asynchronous, it does not block, so there is no clear reason why this would need its own thread. This is explained a bit in "Bad interaction with DRI2 for vsync".

    I'm glad to hear that you are looking into solving this going forward. Without this superfluous extra thread, Mali will be better and more reliable as a result.

  • Hi dsd,

    dsd wrote:

    But in a correctly implemented setup, the SwapBuffer is asynchronous, it does not block

    Forgive me if this is irrelevant, as a lot of this conversation is about Linux internals with which I am not intimately familiar, but eglSwapBuffers is not necessarily an asynchronous call. In a single-buffered environment it has no effect and returns immediately, but in a double-buffered (or deeper) environment it will wait until there is a buffer available to be written into. For example, you might complete rendering to the back buffer, but the actual "swap" or "copy" of that buffer's contents to the front buffer can only occur at VSYNC if VSYNC is enabled, so the call will not return until the sync has occurred, the buffers have been swapped, and rendering can continue.

  • Totally agreed about eglSwapBuffers semantics on the application side.

    I was referring to DRI2's SwapBuffers call, which libMali calls as part of its eglSwapBuffers implementation. That one is designed to be non-blocking, and such is the case in ARM's latest X driver (xf86-video-armsoc), but the fact that Mali seems to create a dedicated DRI2SwapBuffers thread (the cause of this issue and others) seems to be in disagreement with that design.

  • Understood, my bad!

    All seems well then, I assume sunsun will reply again when the issue is fixed and we can let you know what release it will be in.

    Thanks,

    Chris

  • Sorry for the late response.

    For eglSwapBuffers, the current implementation creates another thread that waits for the GPU's interrupt; once the GPU job is done, this thread calls DRI2SwapBuffers. We do this because we only have a very simple DDX driver, which makes it difficult to guarantee that GPU rendering is done before the swap.
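
As a rough picture of that design (not the DDK's actual code), the sketch below hands each frame to a worker thread that waits for a "GPU done" signal before presenting. `SwapWorker`, `present` and `gpu_done` are made-up names: `present` stands in for the DRI2SwapBuffers request and the `threading.Event` stands in for the GPU interrupt. Whether the real eglSwapBuffers returns to the application before the swap completes is not stated here; the sketch simply assumes it does.

```python
import queue
import threading


class SwapWorker:
    """Present frames from a helper thread once their GPU job has finished."""

    def __init__(self, present):
        self._present = present          # hypothetical stand-in for DRI2SwapBuffers
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def swap_buffers(self, frame, gpu_done):
        """Called on the application thread; only enqueues the frame."""
        self._jobs.put((frame, gpu_done))

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:              # sentinel from close()
                break
            frame, gpu_done = job
            gpu_done.wait()              # stand-in for the GPU interrupt
            self._present(frame)         # the swap request is sent only now

    def close(self):
        """Finish queued frames, then stop the worker thread."""
        self._jobs.put(None)
        self._thread.join()


if __name__ == "__main__":
    presented = []
    worker = SwapWorker(presented.append)
    done = threading.Event()
    worker.swap_buffers("frame-0", done)
    print(presented)                     # nothing presented before the GPU finishes
    done.set()
    worker.close()
    print(presented)
```

The point of interest for this thread is the `_present` call: it runs on the worker, so if it blocks on an X reply while holding a driver-wide lock, any other thread needing that lock is stuck too.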

    For the XGrabServer issue, I changed our DDK to use xcb instead of xlib, because xcb supports multi-threading very well. This was intended to work in two cases:

    1. The application is single-threaded, uses xlib, and doesn't call XInitThreads; the Mali DDK uses xcb, which supports multi-threading.

    2. The application is multi-threaded, uses xlib, and calls XInitThreads; the Mali DDK uses xcb.

    That way the application can access the X server through xlib while the Mali DDK accesses it through xcb, and it is said (MixingCalls) that xlib and xcb can work together very well.

    Unfortunately, after implementing the xcb change, I found many issues like:

    [xcb] Unknown sequence number while processing queue
    [xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
    [xcb] Aborting, sorry about that.
    pixmap_test: xcb_io.c:274: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.
    Aborted


    Even when XInitThreads is called, similar errors still occur at random, so I reverted my xcb changes. I currently have no idea how to fix the XGrabServer issue.