Not sure if this is the right place to ask this question, but how do I disable the watchdog timer on my Note 3, which has a Mali-T628? All I could find was DVFS stuff inside /sys/devices/platform/mali.0. Is there something in the user-mode driver I could modify and recompile?
I am trying to run some GLES 3.0 benchmarks and if I increase the number of iterations inside the pixel shader beyond a certain limit, I get a black framebuffer. I figured it must be related to the watchdog timer.
There is a general assumption in the GPU scheduler that shader programs are "well behaved" - every sane real world use case will be trying to render quickly.
If a GPU rendering task takes too long to complete and there is other work pending, then the Mali kernel driver tries to pre-empt it (a soft-stop); if it does not pre-empt quickly because very long-running threads are still executing, then that job gets killed (a hard-stop). Jobs which are soft-stopped can be scheduled back on to the GPU later once they are assigned a new timeslice; hard-stopped jobs are not restartable.
If you are getting black rendering, that sounds like you either (1) have a job which is failing to soft-stop quickly enough, and so it is getting hard-stopped by the Mali kernel driver, or (2) the window system downstream of Mali is timing out the fence used for frame composition.
If the issue is (1) then the soft-stop and hard-stop timeouts can be configured in the kernel drivers for your platform - just grep for softstop and hardstop and you'll find them (the standard kernel drivers for Mali, excluding any platform configuration from specific vendors, can be found on the Mali Developer Center). I'm not sure what Samsung configure these values to, but the defaults are relatively generous (10ms timeslices, 100ms hard-stop timeouts if you fail to soft-stop).
If the issue is (2) then the problem is not an issue with the Mali drivers; some of the downstream drivers (such as the display controller) may be making assumptions about how long a frame may take before they time out and just use whatever is in memory. For the Note 3 these downstream parts are controlled by Samsung, not by ARM.
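As a rough illustration of how the two timeouts interact, here is a toy Python model. It is not Mali driver code: the function name, its parameters, and the simplification that a soft-stop succeeds once the longest in-flight thread has finished are all assumptions made for this sketch. Only the 10ms and 100ms defaults come from the post above.

```python
# Toy model of the soft-stop / hard-stop behaviour described in the post.
# Hypothetical names throughout; this is NOT how the kernel driver is written.

SOFT_STOP_TIMESLICE_MS = 10.0   # default timeslice quoted in the post
HARD_STOP_TIMEOUT_MS = 100.0    # default hard-stop timeout quoted in the post


def job_outcome(job_runtime_ms: float,
                longest_thread_ms: float,
                other_work_pending: bool) -> str:
    """Classify what happens to a job under the assumed model.

    job_runtime_ms     -- total time the job would need on the GPU
    longest_thread_ms  -- run time of its slowest single shader thread
    other_work_pending -- whether another job is waiting for the GPU
    """
    # No contention, or the job fits in its timeslice: nothing to pre-empt.
    if not other_work_pending or job_runtime_ms <= SOFT_STOP_TIMESLICE_MS:
        return "completes"
    # Assumption: the job can yield once its in-flight threads drain.
    if longest_thread_ms <= HARD_STOP_TIMEOUT_MS:
        return "soft-stopped"   # restartable in a later timeslice
    return "hard-stopped"       # killed, not restartable


if __name__ == "__main__":
    print(job_outcome(200.0, 2.0, True))    # many short threads
    print(job_outcome(200.0, 150.0, True))  # one very long thread
```

Under this model, a job made of many short threads is only ever soft-stopped, while a single thread running past the hard-stop timeout gets the job killed.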
However, I'd generally question the need for very very long running threads in a benchmark. Assuming you are taking longer than 100ms for a single fragment thread (which is what would be needed for Mali to fail to soft-stop cleanly with the default timeouts), then that seems a little overkill; there should be no need for a single GPU thread to run for 50 million cycles to get a good benchmark of the hardware capability. Having shorter running threads and more of them (more vertices and/or more pixels) would seem a more pragmatic change.
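For reference, the 50 million cycle figure can be reproduced with a line of arithmetic. The sketch below assumes a 500 MHz GPU clock, chosen only because it matches the 100ms and 50 million numbers above; the actual clock on a given Note 3 may differ, and gpu_cycles is just an illustrative helper name.

```python
def gpu_cycles(duration_ms: int, clock_hz: int) -> int:
    """Number of GPU clock cycles that elapse in duration_ms milliseconds."""
    return duration_ms * clock_hz // 1000


if __name__ == "__main__":
    ASSUMED_CLOCK_HZ = 500_000_000  # assumption: 500 MHz, not a measured value
    HARD_STOP_TIMEOUT_MS = 100      # default hard-stop timeout from the post
    print(gpu_cycles(HARD_STOP_TIMEOUT_MS, ASSUMED_CLOCK_HZ))
```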
Hope that helps, Pete
One other thought is that if things are getting killed mid-render then you should get stale data in the framebuffer (because the rendering is interrupted, and the new values won't have overwritten what was there already) - so I wouldn't expect to see a completely black framebuffer. The one exception is if it is one of the very first renders (first three frames) that is timing out and the first tile is getting killed, in which case getting black is valid, as we will get zeroed memory from when the new window surface is allocated.
If you render 10 frames of a bright red clear-color first to populate all of the window buffers with a recognizable non-black pattern, do you still get black when the benchmark is running, or do you start getting red buffers?
If you still get black then it is likely not a problem with jobs getting killed, but more likely that you have a precision problem in your shaders.
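The reasoning behind the red-clear test can be sketched as a small simulation. This is a deliberately simplified model, not real GLES code: it assumes three window buffers that start zero-filled, treats a killed job as leaving the whole buffer untouched (the real granularity is per tile), and all names are hypothetical.

```python
BLACK = (0.0, 0.0, 0.0)
RED = (1.0, 0.0, 0.0)
WHITE = (1.0, 1.0, 1.0)


def displayed_colors(prime_frames, benchmark_frames, job_killed,
                     shader_color, num_buffers=3):
    """Return the color seen for each benchmark frame in the toy model."""
    # Assumption: freshly allocated window buffers are zero-filled (black).
    buffers = [BLACK] * num_buffers
    frame = 0
    # Priming pass: clear each buffer in turn to a recognizable color.
    for _ in range(prime_frames):
        buffers[frame % num_buffers] = RED
        frame += 1
    shown = []
    for _ in range(benchmark_frames):
        slot = frame % num_buffers
        if not job_killed:
            buffers[slot] = shader_color  # the render completed normally
        # If the job was killed, the stale contents stay in the buffer.
        shown.append(buffers[slot])
        frame += 1
    return shown


if __name__ == "__main__":
    # Killed jobs after priming show red, not black.
    print(displayed_colors(10, 3, True, WHITE))
    # Completed jobs whose shader outputs black still show black.
    print(displayed_colors(10, 3, False, BLACK))
```

In this model, black after priming can only come from a render that completed and wrote black, which is why the test separates killed jobs from a shader problem.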
HTH, Pete
"there should be no need for a single GPU thread to run for 50 million cycles to get a good benchmark of the hardware capability. Having shorter running threads and more of them (more vertices and/or more pixels) would seem a more pragmatic change."
Yes, and I totally agree with you. But my benchmark is not really a normal bench, it's a microbenchmark to figure out the texture filtering rate. I launch a pixel shader on a full-screen quad and every thread does texture fetches from L1 cache repeatedly. There is no concept of frames here as I don't care about the image quality and am just rendering into an off-screen buffer.
The problem I am facing is that if each thread (all threads must do the same amount of work) reads more than 2048 texels then I get black results but not the correct number of 4 bilinear filtered pixels/clock. I am getting close to around 3.6 but not 4 - I can only reach 1.66 GTexels/s, whereas GFXBench reaches up to 1.9 GTexels/s on my device. I did what you suggested too - fewer reads per thread, but a lot of batches sent to the GPU. But then the driver seems to be optimizing away all these identical drawcalls writing to the same framebuffer. So the only choice I have here is to try out the options you suggested, i.e. increasing the soft-stop (or hard-stop) timeout and looking at the downstream driver. Thanks a lot for your help Peter!
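For anyone following the numbers, this is roughly how such figures relate to each other. The sketch below is plain arithmetic in Python with hypothetical helper names; it assumes the 3.6 texels/clock and 1.66 GTexels/s values come from the same run, which the post implies but does not state outright.

```python
def texel_rate(width: int, height: int, fetches_per_thread: int,
               seconds: float) -> float:
    """Texels per second for one full-screen pass of the microbenchmark."""
    return width * height * fetches_per_thread / seconds


def implied_clock_hz(texels_per_second: float,
                     texels_per_clock: float) -> float:
    """GPU clock implied by a measured rate and a per-clock throughput."""
    return texels_per_second / texels_per_clock


def efficiency(measured_per_clock: float, peak_per_clock: float) -> float:
    """Fraction of the expected peak throughput actually achieved."""
    return measured_per_clock / peak_per_clock


if __name__ == "__main__":
    # Figures quoted in the post; the pairing is an assumption.
    clock = implied_clock_hz(1.66e9, 3.6)
    print(f"implied clock: {clock / 1e6:.1f} MHz")
    print(f"efficiency vs. 4 texels/clock: {efficiency(3.6, 4.0):.0%}")
```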
"The problem I am facing is that if each thread (all threads must do the same amount of work) reads more than 2048 texels then I get black results but not the correct number of 4 bilinear filtered pixels/clock"
What is your shader doing? I suspect the black results are due to a precision problem in your shaders exceeding the maximum representable range of a variable. Are you able to share?
Most graphics shaders are very short - even high-end content like the GFXBench 3.0 Manhattan test typically only uses a handful of texture accesses - so if you have too many unique accesses I wonder if you are hitting some other limit unrelated to the main texturing unit.
"But then the driver seems to be optimizing all these exactly same drawcalls writing to the same framebuffer."
Multiple opaque drawcalls to the framebuffer won't work - we can kill the overdrawn pixels in hardware - see "Killing Pixels - A New Optimization for Shading on ARM Mali GPU". Try turning on blending, as this forces us to keep the overdrawn fragments (we need their color to blend against).
Cheers, Pete
Hmm, my shader is fetching from a texture inside a loop and writing out the results once to the framebuffer. All the results are added inside a vec4 variable. I should try using highp instead of lowp/mediump qualifiers then.
On Mali mediump is fp16 precision - so the dynamic range is quite small, and if you start using a significant number of bits to represent non-fractional digits you rapidly run out of the fractional part. Try highp - it sounds like it might help.
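To make the fp16 point concrete, here is a small Python sketch that emulates half-precision accumulation on the CPU using the standard struct module's 'e' (IEEE 754 binary16) format. It only illustrates the rounding behaviour, under the assumption that every add is rounded to fp16; to_fp16 and accumulate_fp16 are hypothetical helper names, and this does not model what the Mali shader compiler actually does with mediump.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate_fp16(value: float, count: int) -> float:
    """Add `value` to an accumulator `count` times, rounding to fp16 each time."""
    acc = 0.0
    step = to_fp16(value)
    for _ in range(count):
        acc = to_fp16(acc + step)
    return acc


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        print(n, accumulate_fp16(1.0, n))
```

With this emulation, repeatedly adding 1.0 stops growing at 2048, because the next representable fp16 value above 2048 is 2050 and the halfway case rounds back down. Smaller per-fetch values stall even earlier.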
Okay, so I tried both suggestions:
- Used highp for the output color and intermediate variable - the black output is still present when increasing the texel fetches beyond 2k
- Enabled blending to prevent the pixel-killing optimization
And still no luck :[
Anyways, it's a good thing you guys report the texel fill rate, which is 1 bilinear/clock/unit and 1/2 trilinear/clock/unit. And FP16 is also full-rate, which I measured.
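Those per-unit rates translate into a peak per-clock figure as sketched below. The unit count of 4 used in the example is an assumption inferred from the 4 bilinear pixels/clock target mentioned earlier in the thread, not a confirmed hardware spec, and peak_texels_per_clock is an illustrative name.

```python
# Per-unit filtering rates as quoted in the post (texels per clock per unit).
RATE_PER_UNIT = {
    "bilinear": 1.0,
    "trilinear": 0.5,
}


def peak_texels_per_clock(units: int, filter_mode: str) -> float:
    """Peak filtered texels per clock for a given number of texture units."""
    return units * RATE_PER_UNIT[filter_mode]


if __name__ == "__main__":
    ASSUMED_UNITS = 4  # assumption, see the note above
    for mode in ("bilinear", "trilinear"):
        print(mode, peak_texels_per_clock(ASSUMED_UNITS, mode))
```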