What is the GPU/driver doing if not shading?

I ran Streamline on the SimpleTriangle example from the Mali OpenGL ES SDK for Android v1.6.0. Each frame, it renders a triangle that covers half of the frame, to the default framebuffer. What I observe is that most of the time is spent on something other than vertex or fragment processing. What is the GPU/driver actually doing during this time? Note that I don't mean the time between frames, but the time between vertex and fragment processing.

I have tried this example on two platforms, one with Mali-400 and the other with Mali-450. Both give the same result.

Below is an illustration of the behavior when rendering a single frame. As you can see, the middle part is a significant portion of the processing of the frame.

[Image: what-is-the-gpu-doing-part.png]

Below is a trace of the OpenGL API calls for a single frame.

glClearColor(red=0.0, green=0.0, blue=0.0, alpha=1.0)
glClear(mask=GL_DEPTH_BUFFER_BIT|GL_COLOR_BUFFER_BIT)
glUseProgram(program=3)
glVertexAttribPointer(index=0, size=2, type=GL_FLOAT, normalized=GL_FALSE, stride=0, pointer=0x776e3af0)
glEnableVertexAttribArray(index=0)
glDrawArrays(mode=GL_TRIANGLES, first=0, count=3)
eglSwapBuffers(dpy=0x1, surface=0x77474988)

  • Hi sogartar,

    Sorry but I do not understand your question.

    "Note that I don't mean the time between frames, but the time between vertex and fragment processing."

    According to that streamline chart, there is no time between the vertex (ORANGE) and fragment (BLUE) processing... the fragment happens immediately after the vertex finishes.

    "As you can see, the middle part is a significant portion of the processing of the frame."

    The middle part (highlighted time region) is the Fragment activity, and is the GPU running your fragment shader on each pixel in the framebuffer to render your output. It is not idle during this time.

    If my answers didn't help, please try to explain your issue further so we can try to help.

    Kind Regards,

    Michael McGeagh

  • I was probably not clear.

    There is fragment activity, but during that time the fragment processors are not rasterizing fragments; that happens only later, as the image shows. I was wondering what the GPU is doing during that time.

  • Hi sogartar,

    As McGeagh mentioned, in that highlighted region the GPU fragment activity is your fragment shader running on each pixel. If I understand correctly, you are asking about the "Fragments rasterized count" counter? This counter counts the fragments rasterized from triangles. More details are available in the Wikipedia article on rasterisation.

    In that highlighted region you see these counters idle because rasterization needs to happen before any fragment shading can start. If you scroll to the left, you should see these counters with larger values for that highlighted GPU fragment activity.

    HTH,

    Wasim

  • Hi Wasim,

    To be honest, your reply did not make much sense to me.

    If you collect the total bus reads/writes of the fragment processors alongside the number of rasterized fragments, you will always find them matching in time. In the above case there won't be much reading, because the contents of the buffer are not preserved before drawing, so nothing is uploaded to tile memory before fragment shading. The total bus writes, on the other hand, would match the size of the buffer in memory. This means that the whole process of uploading to tile memory, running the fragment shader program, and downloading the tile back to main memory happens only at the end of GPU fragment activity. This is when the Mali-4xx FPs are active. So the highlighted area in the image can't be where each pixel is shaded.

  • Please note that there is a difference between "Counters" and "Activity" (Hence the difference in chart looks).

    Counters are collected at set times. Reading a counter returns its current value and resets it to zero; it then continues counting until the next time it is read.

    Activity, however, is different and is not derived from hardware counters. It is a rough percentage utilisation of the GPU, reported separately for vertex and fragment processing.
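
    The read-and-reset behaviour described above can be sketched in a few lines of Python. This is only an illustrative model: the class name `ReadAndResetCounter` and its methods are hypothetical and are not part of Streamline or the Mali driver.

```python
class ReadAndResetCounter:
    """Hypothetical model of a counter that is zeroed each time it is read."""

    def __init__(self):
        self._value = 0

    def add(self, n):
        # Events keep accumulating between reads.
        self._value += n

    def read(self):
        # Return what has accumulated since the previous read, then reset.
        value = self._value
        self._value = 0
        return value


counter = ReadAndResetCounter()
counter.add(300)
counter.add(200)
first = counter.read()    # everything counted since the previous read
counter.add(50)
second = counter.read()   # only what was counted after the first read
print(first, second)
```

    Each read therefore reports a delta for the interval since the previous read, not a running total.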

    Streamline is telling you that in your highlighted region, the GPU is active, and doing work.

    The hardware counters tell you what specific part(s) inside the vertex and/or fragment core(s) were active between the time it was last checked and the current check.

    I hope this helps explain things further.

    Kind Regards,

    Michael McGeagh

  • This seems reasonable, but there is one thing. Data is sampled at 1 kHz. Doesn't this mean that the Fragment Processor counters should show an increase at least once during this 2.8 ms window? I never observe this. The counters never increase at the beginning of fragment activity, always at the end, so it can't be a matter of an occasional drop in sampling frequency.

    As you can see in the image, vertex activity always falls between Vertex Processor counter increases. This is what I expect to see for the fragment activity as well.
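
    The expectation in this post can be checked with simple arithmetic, under its own assumption that counters are read on every 1 ms tick. The sketch below counts how many ticks land inside a 2.8 ms window for every possible alignment of the window; the function name `ticks_in_window` is made up for illustration.

```python
TICK_US = 1000  # 1 kHz sampling, i.e. one tick every 1000 microseconds


def ticks_in_window(start_us, end_us, tick_us=TICK_US):
    """Number of tick boundaries t with start_us <= t < end_us."""
    first_tick_index = -(-start_us // tick_us)  # ceil(start / tick)
    end_tick_index = -(-end_us // tick_us)      # ceil(end / tick)
    return end_tick_index - first_tick_index


# Try every alignment of a 2.8 ms window relative to the tick grid.
counts = [ticks_in_window(s, s + 2800) for s in range(TICK_US)]
print(min(counts), max(counts))
```

    Under that assumption a 2.8 ms window always contains at least two ticks, which is why the absence of any counter increase during the window looks surprising.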

  • Could you export your capture (there is an option for this within Streamline) and send it to me for further investigation?

    This could be an issue with how Streamline is presenting the information, or it could be correct behaviour... I can't quite tell from the screenshot.

    Kind Regards,

    Michael McGeagh

  • The export command is grayed out for me, probably because this is the community edition. I am sending you the apc directory.

  • Hi Sogartar

    You are seeing two different things here:

    i) On Mali-4x0, hardware counter values are sampled once, at the end of a fragment or vertex job. They are not sampled continuously every 1 ms. You will therefore see a single spike, representing the job's total value for a given hardware counter, at the end of a period of activity. This simplifies the design without, at least in theory, compromising the information provided, because you cannot usefully associate a 1 ms sample of a counter with any particular vertex, shader, or line of code: you don't know in what order things are happening internally. Contrast this with a CPU, which provides a program counter to correlate against.

    ii) In the UI, the "Activity" information has a higher resolution timestamp than the counter information. At high zoom levels you can then see the slight discrepancy this introduces between where the spike appears (on a 1 ms tick) and the end of the activity chart (at a higher resolution).
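
    To make the difference in point i) concrete, here is a small Python sketch comparing what a chart would show if a counter were read on every 1 ms tick with what it shows when the total is reported once at the end of the job. It is a toy model, not Streamline's implementation: the function names, the job timings, the total of 2800, the uniform rate of work, and the choice to place the spike on the first tick at or after the job end are all assumptions made for illustration.

```python
TICK_US = 1000  # 1 ms sampling tick, expressed in microseconds


def periodic_samples(job_start_us, job_end_us, total_count):
    """Per-tick counts if the counter were read (and reset) on every tick."""
    duration = job_end_us - job_start_us
    samples = {}
    counted_so_far = 0
    tick = (job_start_us // TICK_US + 1) * TICK_US  # first tick after job start
    while True:
        t = min(tick, job_end_us)
        # Assumption: work accrues uniformly over the job's duration.
        done = total_count * (t - job_start_us) // duration
        samples[tick] = done - counted_so_far
        counted_so_far = done
        if tick >= job_end_us:
            break
        tick += TICK_US
    return samples


def end_of_job_samples(job_end_us, total_count):
    """A single sample with the job total, on the first tick at or after job end."""
    tick = -(-job_end_us // TICK_US) * TICK_US  # round up to a tick boundary
    return {tick: total_count}


# Hypothetical 2.8 ms fragment job running from 1.2 ms to 4.0 ms.
periodic = periodic_samples(1200, 4000, 2800)
per_job = end_of_job_samples(4000, 2800)
print(periodic)
print(per_job)
```

    Both views add up to the same total; the per-job view simply places all of it in one spike at the end of the activity.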

    Finally, you need to be aware of one other detail:

    iii) The counter values are only recorded in 1 ms samples, but they too initially have a higher resolution timestamp. At high enough zoom levels a counter value will fall between two 1 ms samples. Streamline attempts to interpolate the results, so it divides the count proportionately between the two 1 ms samples. This can make the single count I explained above look like two counts. You can either zoom out to get them totalled up into a single count, or stretch the cursor out (click on the extreme left or right of the blue lozenge with the timestamp) to cover multiple samples.
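
    The splitting described in point iii) might look roughly like the following Python sketch. The exact weighting Streamline uses is not stated in this thread; a linear split by distance to the two neighbouring ticks is assumed here, and the function name `split_between_ticks` and the numbers are hypothetical.

```python
TICK_US = 1000  # 1 ms sample spacing, in microseconds


def split_between_ticks(timestamp_us, count, tick_us=TICK_US):
    """Divide one counter value between the two ticks surrounding its timestamp.

    Assumption: the share given to each tick is linear in how close the
    timestamp is to that tick.
    """
    lower = (timestamp_us // tick_us) * tick_us
    upper = lower + tick_us
    offset = timestamp_us - lower
    upper_share = count * offset // tick_us  # closer to upper -> larger share
    lower_share = count - upper_share        # remainder keeps the total exact
    return {lower: lower_share, upper: upper_share}


# A single end-of-job count of 2800, timestamped a quarter of the way
# between the 4 ms and 5 ms ticks.
shares = split_between_ticks(4250, 2800)
print(shares)
print(sum(shares.values()))
```

    Summing over a selection that covers both ticks recovers the original single count, which is the effect of zooming out or stretching the cursor.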

    And finally, finally:

    iv) If you have two back-to-back periods of fragment activity, you may end up with a view where there appears to be one period of activity with more than one counter spike. You can be sure there is more than one period of activity, i.e. more than one job, because of the multiple spikes. Zooming in should reveal the gap.

    BTW, Mali-T6xx/T7xx/T8xx have a different design. One key difference is that counters are sampled every 1 ms.

    Does that help?

    R