Receiving GPU fence timeouts and GPU crashes on Android 5.1.1. Mali-400 MP OpenGL ES 2.0 Android Rockchip 3126 processor. EGL implementation 1.4 Linux-r6p0-01rel1

Good Day,

I know this question is about older technology, and this may be entirely the wrong forum, but I have nowhere else to turn. Hopefully one of the experts here can point me in the right direction. We have developed a kiosk-like app on a generic Android tablet that is used for point-of-sale operations. The application works without issue on an existing tablet running Android 4.1. However, because parts have been discontinued, we have been forced to move to a new version of the tablet and, with it, a new version of Android. On the new tablet, the graphics subsystem crashes after some period of operation, which can be as short as 20 minutes or longer than a week. The only way to recover from this state is to power cycle the unit.

When the crash happens, the application is not executing anything that should tax the system. After startup the app defaults to its basic mode of operation: it displays a text marketing message or a PNG logo splash screen, alternating once every 15 seconds; a digital clock at the bottom center of the screen that updates once every second; and static location information (name of establishment, etc.) in the bottom right-hand corner. No animation, no streaming, just a very basic kiosk-like application.

What we see in the log (see snippet below), every 30 seconds or so, are fence timeout messages, then a listing of objects, then the message that the GPU failed to restart. Once this happens the entire graphics subsystem stops functioning, although we can still connect to the unit via the Android Debug Bridge (adb), which is how we retrieve the logs. If anyone could point us in the right direction on how to resolve this issue, it would be greatly appreciated. Thank you in advance for any assistance.

Cordially,

Dale

========

<<< KERNEL LOG SNIPPET >>>>

<6>[259628.544422] fence timeout on [d8534880] after 500ms

.

.

<4>[259568.463295] objs:

<4>[259568.463295] --------------

<4>[259568.463295] fb-timeline sw_sync: 286483

<4>[259568.463295]   pt signaled@24.203036: 255

<4>[259568.463295]   pt signaled@2925.669034: 3516

<4>[259568.463295]   pt signaled@243109.259577: 264000

<4>[259568.463295]   pt signaled@263766.516201: 286460

<4>[259568.463295]   pt signaled@263787.033239: 286483

<4>[259568.463295]   pt signaled@263787.033240: 286483

<4>[259568.463295]   pt active: 286484

<4>[259568.463295]   pt active: 286484

<4>[259568.463295]

<4>[259568.463295] mali-170-gp Mali: oldest (286484) next (286484)

<4>[259568.463295]

<4>[259568.463295]

<4>[259568.463295] mali-170-pp Mali: oldest (286484) next (286484)

.

.

.

<4>[259627.137394] Mali: Executor GP: Job 1146132 Timeout on Mali_GP

<4>[259627.137485] Mali: Dump Group Mali_GP
<4>[259627.137535] Mali: 0x0000: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.137608] Mali: 0x0010: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.137681] Mali: 0x0020: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.137753] Mali: 0x0030: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.137825] Mali: 0x0040: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.137897] Mali: 0x0050: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.137969] Mali: 0x0060: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.138041] Mali: 0x0070: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.138113] Mali: 0x0080: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.138184] Mali: 0x0090: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.138257] Mali: 0x00a0: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.138329] Mali: Dump Group MMU
<4>[259627.138375] Mali: 0x0000: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.138448] Mali: 0x0010: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.138520] Mali: 0x0020: 0xffffffff 0xffffffff 0xffffffff 0xffffffff
<4>[259627.324242]
<4>[259627.325415]
<4>[259627.325486]
<4>[259627.362487] Mali: ERR: drivers/gpu/arm/mali400/mali/common/mali_gp.c
<4>[259627.362580] mali_gp_hard_reset() 140
<4>[259627.362580] Mali GP: The hard reset loop didn't work, unable to recover
<4>[259627.362660]
<4>[259627.521234]
<4>[259627.522242]
<4>[259627.522313]
<4>[259627.551750] Mali: ERR: drivers/gpu/arm/mali400/mali/common/mali_mmu.c
<4>[259627.551782] mali_mmu_raw_reset() 279
<4>[259627.551782] Reset request failed, MMU status is 0xFFFFFFFF

  • Hi Dale,
    Thanks for your report. I am an engineer from the Mali-400 team. Firstly, I do not know whether you are able to replace the GPU kernel driver with the debug version. If you are, you could try the following debug methods:
    1) Run "mount -t debugfs none /sys/kernel/debug/" to mount the mali debugfs node.
    2) Run "echo 1 > /sys/kernel/debug/mali/power/always_on", which keeps the GPU power always on.
    3) Run "cat /sys/kernel/debug/mali/state_dump" or "cat /sys/kernel/debug/mali/timeline_dump", which outputs more GPU information.
    Secondly, if you can also build our kernel driver, we have a fix for one known issue found in r6p0, which may be the root cause of this issue. I think you could try it if possible.
    Otherwise, you will have to raise this issue with the tablet vendor (Rockchip), and then we can check the issue with the SoC customer.

    Brs,
    Luffy
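
The three debug steps above can be collected into one small script. This is only a sketch: it assumes a root shell on the tablet and a driver that actually exposes these debugfs nodes. The `DEBUGFS` and `OUT_DIR` variables and the function names are hypothetical, added here for illustration; only the paths under `mali/` come from the reply.

```shell
#!/bin/sh
# Sketch only. DEBUGFS and OUT_DIR are hypothetical knobs, not driver settings.
DEBUGFS="${DEBUGFS:-/sys/kernel/debug}"
OUT_DIR="${OUT_DIR:-/data/local/tmp}"

mali_mount_debugfs() {
    # Mount debugfs only if the mali directory is not already visible.
    [ -d "$DEBUGFS/mali" ] || mount -t debugfs none "$DEBUGFS"
}

mali_force_power_on() {
    # Ask the driver to keep the GPU powered (step 2 in the reply).
    node="$DEBUGFS/mali/power/always_on"
    [ -e "$node" ] || { echo "missing $node" >&2; return 1; }
    echo 1 > "$node"
}

mali_capture() {
    # Save state_dump and timeline_dump, when readable, into OUT_DIR.
    for name in state_dump timeline_dump; do
        src="$DEBUGFS/mali/$name"
        if [ -r "$src" ]; then
            cat "$src" > "$OUT_DIR/mali_$name.txt"
        fi
    done
}
```

A typical run would be `mali_mount_debugfs && mali_force_power_on && mali_capture`, after which the saved files can be pulled off the unit over adb.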

  • Luffy,
    Thank you for your response. I believe I do have the ability to replace the driver, but I need a bit more information on which file(s) to replace. With that said, the system may already be running the debug version. In a non-crashed state I was able to execute the command:

    cat /sys/kernel/debug/mali/state_dump

    the output of which is below. I will capture the output of this command once in the crashed state if that would be of value. Secondly, I'd be more than happy to try and recompile the driver, assuming that Rockchip or the tablet vendor has not made any modifications to it. The tablet supplier will not provide the sources to the system, which is one of the reasons I ended up reaching out to your organization directly. Is there some way I can compile the driver myself? Thanks again for all of your assistance, it is greatly appreciated...

    Dale
    =====


    root:/ # cat /sys/kernel/debug/mali/state_dump
    cat /sys/kernel/debug/mali/state_dump
    Mali device driver -5c43549
    License: GPL

    GP queues
    Queue depth: 0
    Normal priority queue is empty
    High priority queue is empty
    PP queues
    Queue depth: 0
    Normal priority queue is empty
    High priority queue is empty

    GP group is in state INACTIVE
    GP Group: ddd26e80
    state: INACTIVE
    SW power: Off
    Power domain: id 12
    Mask: 0x1000
    Use count: 0
    Current power state: Off
    Wanted power state: Off
    Power domain: id 12
    Mask: 0x1000
    Use count: 0
    Current power state: Off
    Wanted power state: Off
    GP: Mali_GP
    GP running job: (null)
    Physical PP groups in WORKING state (count = 0):
    Physical PP groups in IDLE state (count = 0):
    Physical PP groups in INACTIVE state (count = 2):
    Physical PP Group: dde23180
    state: INACTIVE
    SW power: Off
    Power domain: id 12
    Mask: 0x1000
    Use count: 0
    Current power state: Off
    Wanted power state: Off
    Power domain: id 12
    Mask: 0x1000
    Use count: 0
    Current power state: Off
    Wanted power state: Off
    PP #1: Mali_PP1
    PP running job: (null), subjob 0
    Physical PP Group: dde23000
    state: INACTIVE
    SW power: Off
    Power domain: id 12
    Mask: 0x1000
    Use count: 0
    Current power state: Off
    Wanted power state: Off
    Power domain: id 12
    Mask: 0x1000
    Use count: 0
    Current power state: Off
    Wanted power state: Off
    PP #0: Mali_PP0
    PP running job: (null), subjob 1
    Physical PP groups in DISABLED state (count = 0):
  • Hi Dale,

    1) The /sys/kernel/debug/mali/state_dump output looks fine.

    2) Have you tried to reproduce this issue after setting /sys/kernel/debug/mali/power/always_on to 1?

    3) SoC customers usually make some customized changes on top of the original Mali400 DDK, so without their sources we are unable to modify the kernel driver. I am sorry, but you may have to report this issue to Rockchip.
  • Luffy,
    Yes, I tried setting /sys/kernel/debug/mali/power/always_on to 1.

    That also worked. The state_dump command now reports that SW power is On. To better explain what is occurring to Rockchip and/or the tablet vendor, can you provide some insight into what you believe the root cause of the issue is? I understand it may be a bug in the driver, but what are the conditions that cause it? Is there any workaround we could use from within our application to avoid the issue? From the command above, you indicate that we can leave the GPU always on. This is not a problem for us, as we are not concerned with battery operation; the unit is always plugged in. Will this avoid the issue? I will also reach out to the tablet vendor/Rockchip for support to see if they might be able to provide the source to the driver.

    Thank you again for your help...
  • Hi Dale,
    I think there are two possible reasons for this issue:
    1) The GPU itself is behaving abnormally, which may be caused by a faulty customized GPU power switch. According to your log, the Dump Group registers all return 0xFFFFFFFF, which is why I suggested keeping the GPU power always_on and then checking whether the issue goes away. If it does not, I suggest reporting the issue to Rockchip, who can help check whether this is the root cause.

    2) The issue may also be caused by a DDK feature, the Dirty Bit Optimization. It removes the readback command by recording the location of the readback command, then outputs only the modified pixels in each GPU tile in place of the traditional readback. In some corner cases it can break the GPU command stream, and the resulting wrong GPU command hangs the GPU. It is hard to prevent an app from triggering the Dirty Bit Optimization, but this bug has already been fixed after r7p0, so you could ask Rockchip whether the Mali driver can be updated to r7p0.
    I do not know whether this helps; please feel free to let me know if you have any other questions.

    Brs,
    Luffy
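
The first possibility can be checked mechanically against a saved kernel log. The sketch below tests whether every register word in the Mali dump lines reads back as 0xffffffff, which commonly suggests the block was unpowered or unclocked when it was read. The function name and return codes are hypothetical, and the line format is assumed to match the snippet posted in this thread.

```shell
# Sketch only: succeeds (0) when every dumped register word is 0xffffffff,
# returns 1 when at least one real value is present, 2 when no dump lines exist.
mali_dump_all_ones() {
    log="$1"
    # Keep only dump lines such as "Mali: 0x0010: 0xffffffff ...".
    dump=$(grep 'Mali: 0x[0-9a-fA-F]*: ' "$log") || return 2
    # Drop everything up to the offset column, put one word per line,
    # then look for any word that is not 0xffffffff.
    if printf '%s\n' "$dump" \
        | sed 's/.*Mali: 0x[0-9a-fA-F]*: //' \
        | tr ' ' '\n' \
        | grep -v '^$' \
        | grep -qiv '^0xffffffff$'; then
        return 1
    fi
    return 0
}
```

For example, `mali_dump_all_ones kernel.log && echo "GPU registers unreadable"` would flag a log like the one in the original post.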

  • Luffy,
    Thank you again for your response. For item 1, I can test that here without issue. I can set up the system to keep the GPU always on and run the test. This will take a few days, maybe a week, to confirm, as the crash happens sporadically. For item 2, I have already contacted the vendor to see how we can include the updated driver. With the information you have provided, I can ensure that we receive at least version r7p0. I will keep you posted on our progress over the next couple of weeks. Thank you again for your team's support; it has been invaluable.

    Best regards,
    Dale
    =====
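
Since the always-on test may need to run for up to a week, a small summary of each saved kernel log makes the runs easier to compare. The sketch below counts the failure signatures that appear in the log snippet from the original post. The function name and output format are hypothetical, and the match strings are assumed to stay the same across driver builds.

```shell
# Sketch only: summarise the failure signatures seen in this thread's kernel log.
mali_log_summary() {
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    timeouts=$(grep -c 'fence timeout on' "$log" || true)
    gp_timeouts=$(grep -c 'Timeout on Mali_GP' "$log" || true)
    if grep -q "hard reset loop didn't work" "$log"; then
        reset="failed"
    else
        reset="ok"
    fi
    echo "fence_timeouts=$timeouts gp_job_timeouts=$gp_timeouts hard_reset=$reset"
}
```

One way to use it is to save the log on a host machine with `adb shell dmesg > kernel.log` and then run `mali_log_summary kernel.log` after each test period.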