
ARM Mali 400 performance analysis using the DS-5 Streamline

Dear ARM forum,

I am using DS-5 Streamline to analyze my application's performance on an ARM Mali-400.

I am seeing GPU vertex processor activity for 3 milliseconds, followed by an inactive period of 13 milliseconds, followed by 34 milliseconds of GPU pixel processor activity.

Questions:

1. I am trying to understand why there is such a long inactive period. How can I analyze this period for its performance impact?

2. Streamline provides many performance-measuring events, but there is very poor documentation on what each event captures and how to use it for GPU performance analysis.

3. I want to measure GPU vertex processor performance (how many triangles it processes in one frame, and how much time that takes) and GPU pixel processor performance (how many pixels it processes in one frame, and how much time that takes).

4. Is there a document that discusses how to use all of these events for performance analysis?

Thanks,

Ravinder Are

  • Thanks Peter for your reply.

    I could not find any information in the Mali GPU Application Optimization Guide (Mali Developer Center) on how to get the following:

    a) number of triangles processed per second

    b) number of pixels processed per second

    c) bandwidth consumed per second

    d) frame rate

    It would be great if you could provide the details on this.

    Thanks,

    Ravinder Are

  • Hi Ravinder,

    Based off the counter names and descriptions that DS-5 Streamline gives you:

    a) Mali-4xx Software Counters: Geometry Statistics: Triangles

    "The total number of triangles passed to GLES per-frame."

    b) Mali Fragment Processor: Mali-4xx FP: Fragment rasterized count

    "Number of fragment rasterized. Fragments/(Quads*4) gives average actual fragments per quad."

    c) For bandwidth you will need 4 counters:

    Mali Fragment Processor: Mali-4xx FP: Total bus reads

    "Total number of 64-bit words read from the bus."

    Mali Fragment Processor: Mali-4xx FP: Total bus writes

    "Total number of 64-bit words written to the bus."

    Mali Vertex Processor: Mali-4xx VP: Words read, system bus

    "Total number of 64 bit words read by the GP2 from the system bus per frame."

    Mali Vertex Processor: Mali-4xx VP: Words written, system bus

    "Total number of 64 bit words written by the GP2 to the system bus per frame."

    Adding those 4 together gives you the complete GPU bandwidth used. Multiply this number by 8 to get the value in bytes.
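As a rough illustration of that arithmetic, here is a minimal Python sketch. The function names and the sample counter values are made up for the example, and it assumes all four counters were read over the same time window.

```python
BYTES_PER_WORD = 8  # each counter reports 64-bit words


def gpu_bandwidth_bytes(fp_reads, fp_writes, vp_reads, vp_writes):
    """Sum the four bus counters (in 64-bit words) and convert to bytes."""
    total_words = fp_reads + fp_writes + vp_reads + vp_writes
    return total_words * BYTES_PER_WORD


def bytes_per_second(total_bytes, window_seconds):
    """Average bandwidth over the window the counters were sampled in."""
    return total_bytes / window_seconds


# Hypothetical counter values for a one-second window, not real measurements.
sample_bytes = gpu_bandwidth_bytes(1_000_000, 500_000, 20_000, 10_000)
print(sample_bytes, "bytes,", bytes_per_second(sample_bytes, 1.0), "bytes/s")
```

Note that the two VP counter descriptions say "per frame", so check that the values you add together cover the same period before trusting the total.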

    d) This is actually non-trivial due to Streamline being a time based profiling tool.

    There is a counter that 'may' be enabled in your BSP:

    Mali-4xx Filmstrip: 1:10

    "captures every 10th frame"

    If you can use this, you can visually see how many thumbnails are produced in your capture and multiply by 10. This is obviously only accurate to within 10 frames.

    Another method, assuming you are not vertex limited, is to measure the time between Vertex Activity spikes. Each frame is 'likely' to only issue one vertex activity spike. Note in some composition environments like android's triple buffering composition, there will be a second smaller spike per frame.
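The filmstrip and vertex-spike estimates can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, the thumbnail count and spike timestamps are assumed to be read off the Streamline timeline by hand, and the spike list should contain only the main spike of each frame (not the smaller composition spike).

```python
from statistics import median

FRAMES_PER_THUMBNAIL = 10  # the "1:10" filmstrip captures every 10th frame


def fps_from_filmstrip(thumbnail_count, capture_seconds):
    """Estimate frame rate from the number of filmstrip thumbnails in a capture."""
    return thumbnail_count * FRAMES_PER_THUMBNAIL / capture_seconds


def fps_from_vertex_spikes(spike_times_ms):
    """Estimate frame rate from the start times (ms) of per-frame vertex spikes."""
    gaps = [b - a for a, b in zip(spike_times_ms, spike_times_ms[1:])]
    if not gaps:
        raise ValueError("need at least two spikes to measure an interval")
    # The median gap is less sensitive to an occasional dropped or extra spike.
    return 1000.0 / median(gaps)


# Hypothetical readings: 12 thumbnails in 2 s, and spikes 50 ms apart.
print(fps_from_filmstrip(12, 2.0))
print(fps_from_vertex_spikes([0, 50, 100, 150]))
```

The filmstrip figure inherits the "within 10 frames" uncertainty, so it gets more useful the longer the capture is.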

    Another is to use Streamline annotations to mark the eglSwapBuffers calls, so you can see when they occur on the timeline.

    If you have source code access, however, you may find it best to just measure within the app itself.
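Measuring inside the app usually comes down to counting buffer swaps against a clock. The sketch below shows the logic in Python purely for illustration; `FrameCounter` and `frame_done` are hypothetical names, and in a real OpenGL ES application the same idea would be written in the app's own language, with the call placed right after eglSwapBuffers.

```python
import time


class FrameCounter:
    """Counts completed frames and reports the average frame rate."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock          # injectable so the counter can be tested
        self._start = clock()
        self._frames = 0

    def frame_done(self):
        """Call once per frame, right after the buffer swap returns."""
        self._frames += 1

    def fps(self):
        """Average frames per second since the counter was created."""
        elapsed = self._clock() - self._start
        return self._frames / elapsed if elapsed > 0 else 0.0


# Usage sketch: the sleep stands in for rendering plus the swap call.
counter = FrameCounter()
for _ in range(5):
    time.sleep(0.01)
    counter.frame_done()
print(f"{counter.fps():.1f} fps")
```

Averaging over a second or more smooths out per-frame jitter; logging individual frame times as well helps if you are chasing stutter rather than average rate.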

    I hope that helps.

    Kind Regards,

    Michael McGeagh

  • Hi Michael and Peter,

    Did you get a chance to look into the Streamline log I shared?

    In each frame I am seeing some VP activity, followed by some idle time, followed by some PP activity.

    My questions are:

    1. Why is the VP and PP activity not parallel? Why does one run after the other?

        Am I doing something wrong that prevents the parallelism?

    2. Why is there idle time in my application? Why can't the PP start immediately?

        I am not using vsync, and I have double buffering in my processing.

    3. I am running a Qt-based OpenGL ES 2.0 application with a simple vertex shader and a simple fragment shader.

    I need your support in analyzing this.

    Thanks,

    Ravinder Are

  • As I said before, Streamline isn't going to help answer questions about idle time. It's a performance profiler - you can't profile "nothing running" - all of the counters are zero.

    The causes of (1) and (2) are probably the same thing. Serialization means idle time and things not overlapping.

    Usual suspects:

    • Window system fences for framebuffers not being released by the previous user of the buffer (normally the display controller or compositor).
    • Vsync and fewer than 3 framebuffers
    • Application CPU load is too high (e.g. CPU limited) - doesn't look like it in your case
    • Application calling sleep()
    • Application using glFinish, glReadPixels, or waiting on an OpenGL ES level synchronization primitive and draining the rendering pipeline.

    Less likely suspects:

    • Kernel not processing interrupts quickly enough (e.g. another driver is disabling IRQs, and not re-enabling them for a long time).
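To put a number on how serialized a frame is, you can compare the VP and PP activity intervals read off the timeline. This is a small Python sketch with a made-up helper name, using the figures from the first post (3 ms vertex, 13 ms gap, 34 ms fragment) as input; it only quantifies the gap, it does not say which of the suspects above is responsible.

```python
def frame_breakdown(vp_interval, pp_interval):
    """Given (start_ms, end_ms) for VP and PP activity in one frame,
    return the total span, the overlapped time, and the idle time."""
    vp_start, vp_end = vp_interval
    pp_start, pp_end = pp_interval
    span = max(vp_end, pp_end) - min(vp_start, pp_start)
    overlap = max(0, min(vp_end, pp_end) - max(vp_start, pp_start))
    busy = (vp_end - vp_start) + (pp_end - pp_start) - overlap
    return {"span_ms": span, "overlap_ms": overlap, "idle_ms": span - busy}


# 3 ms of VP work, a 13 ms gap, then 34 ms of PP work.
print(frame_breakdown((0, 3), (16, 50)))
```

For those figures the result is a 50 ms span with no overlap and 13 ms idle, so roughly a quarter of the frame is spent with neither processor active.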

    HTH,
    Pete

  • Thanks Peter, it's useful information.