
ARM Mali 400 performance analysis using the DS-5 Streamline

Dear ARM forum,

I am using DS-5 Streamline to analyze my application's performance on an ARM Mali-400.

I am seeing GPU vertex processor activity for 3 milliseconds, followed by an inactive period of 13 milliseconds, followed by 34 milliseconds of GPU pixel processor activity.

Questions:

1. I am trying to understand why there is such a long inactive period. How can I analyze this period for its performance impact?

2. Streamline provides many performance measurement events, but there is very little documentation on what each event captures and how to use it for GPU performance analysis.

3. I want to measure GPU vertex processor performance: how many triangles it processes in one frame, and how much time that takes.

   I also want to measure GPU pixel processor performance: how many pixels it processes in one frame, and how much time that takes.

4. Is there a document that discusses how to analyze all the events for performance analysis?

Thanks,

Ravinder Are

  • Hi Ravinder,

    Can I please get more information from you so that I can help?

    Which device are you testing this on? What version of the driver? What version of DS-5? What OS is running and what version?

    In addition, are you able to export and upload the Streamline capture so I can inspect it to see if there is something obvious happening?

    Are you seeing a similar length of inactivity on other devices, whether Mali-based or not?

    Regarding documentation, I agree there is a lot of room for improvement. I can only say that this tool primarily targets silicon partners and OEMs. For developers, this tool is trickier to use because it assumes some prior knowledge and access to sources. However, we are continually improving our tools and documentation to make them more accessible to everyone.

    To answer your other questions in more detail, I will first need the above questions answered.

    Kind Regards,

    Michael McGeagh

  • "GPU vertex processor activity for 3 milliseconds followed by an inactive period of 13 milliseconds then followed by 34 milliseconds GPU pixel processor activity."

    If you are debugging the inactive period, why do you think the GPU performance counters are going to help? If the GPU is inactive, then by definition all of the GPU counters will be zero for that period ...

    Assuming you are running on something like Android, the most likely reason for a long idle delay is that the driver is waiting for a framebuffer fence from SurfaceFlinger. The GPU can run vertex processing early, because it doesn't need a buffer to render into, but can only render into the framebuffer when the fence is signaled by the window system (otherwise we may corrupt a buffer which is still being scanned out on to the screen).

    This incoming fence is outside of the Mali driver's control, so check that your system is correctly configured with triple buffering, and that your display controller or compositor stack is correctly signalling fences.

    HTH,

    Pete

  • Thanks Michael and Peter for your replies.

    Let me add more details on my system environment:

    Target Chip: A53+Mali400 chip

    Target Board OS: Linux OS 64 Bit

    Mali Drivers: r5p1-01rel0-64bit

    Target Application: OpenGL ES 2.0, fbdev-based, double buffering

    ARM Streamline Performance Analyzer: Version 5.23, Build 20151109_152210

    Host OS: Windows 7 64 bit

    Streamline log file: I have attached the log file.

    Questions:

    1. I am using double buffering, but you mentioned triple buffering. Why is triple buffering needed, and what additional advantage does it give? I am using simple fbdev, and I am showing only graphics content on the display.

    2. My interest is in finding out the GPU VP and PP performance:

        a) number of triangles processed per second

        b) number of pixels processed per second

        c) bandwidth consumed per second

        d) frame rate

    Please explain how to calculate the above four performance figures using the Streamline analyzer.

    3. Please provide information on the Streamline events that Mali-400 supports, their purpose, and how I can use these events for GPU performance analysis.

    Thanks,

    Ravinder Are

  • "I am using double buffering, but you mentioned triple buffering? Why need triple buffering?"

    Multiple buffering - Wikipedia, the free encyclopedia

    Summary: in a system with vsync, double buffering locks you to an integer fraction of the display refresh rate. If your system can't quite run at 60 FPS then it will snap down to 30 FPS; if it can't hit 30 FPS, then it snaps down to 20 FPS, and so on.

    "Please give information on how to calculate above four performance data using the streamline analyzer."
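
    As a rough illustration of that snapping effect, here is a minimal Python sketch. It assumes a 60 Hz display and a frame that can only be presented on a vsync boundary; the function name `double_buffered_fps` is made up for this example and is not part of any Mali or Streamline API.

```python
import math


def double_buffered_fps(render_ms: float, refresh_hz: float = 60.0) -> float:
    """Effective frame rate when each frame must wait for the next vsync.

    Assumption: with only two buffers the renderer cannot start the next
    frame until the swap happens, so a frame occupies a whole number of
    vsync periods, rounded up.
    """
    vsync_ms = 1000.0 / refresh_hz
    periods = max(1, math.ceil(render_ms / vsync_ms))
    return refresh_hz / periods


if __name__ == "__main__":
    for render_ms in (16.0, 17.0, 34.0):
        print(render_ms, "ms per frame ->", double_buffered_fps(render_ms), "FPS")
```

    A frame that takes 17 ms just misses the 16.7 ms deadline and is shown one full period later, which is why the rate drops straight from 60 to 30 rather than to something like 58.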

    This might be a good place to start in terms of using DS-5 Streamline for graphics performance analysis:

    Mali GPU Application Optimization Guide - Mali Developer Center

    See Chapter 7 (Utgard Optimization Workflows) for the parts relevant to Mali-400.

    HTH,

    Pete

  • Thanks Peter for your Reply.

    I did not find any information in the Mali GPU Application Optimization Guide - Mali Developer Center document on how to get the

        a) number of triangles processed per second

        b) number of pixels processed per second

        c) bandwidth consumed per second

        d) frame rate

    It would be great if you could provide details on this.

    Thanks,

    Ravinder Are

  • Hi Ravinder,

    Based on the counter names and descriptions that DS-5 Streamline gives you:

    a) Mali-4xx Software Counters: Geometry Statistics: Triangles

    "The total number of triangles passed to GLES per-frame."

    b) Mali Fragment Processor: Mali-4xx FP: Fragment rasterized count

    "Number of fragment rasterized. Fragments/(Quads*4) gives average actual fragments per quad."

    c) For bandwidth you will need 4 counters:

    Mali Fragment Processor: Mali-4xx FP: Total bus reads

    "Total number of 64-bit words read from the bus."

    Mali Fragment Processor: Mali-4xx FP: Total bus writes

    "Total number of 64-bit words written to the bus."

    Mali Vertex Processor: Mali-4xx VP: Words read, system bus

    "Total number of 64 bit words read by the GP2 from the system bus per frame."

    Mali Vertex Processor: Mali-4xx VP: Words written, system bus

    "Total number of 64 bit words written by the GP2 to the system bus per frame."

    Adding those 4 together gives you the complete GPU bandwidth used, in 64-bit words. Multiply this number by 8 to get the value in bytes.

    d) This is actually non-trivial because Streamline is a time-based profiling tool.
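
    The arithmetic for the four bus counters can be sketched in Python as follows. This is only an illustration: the function and parameter names are hypothetical, and it assumes all four counter values were read over the same sampling window.

```python
BYTES_PER_WORD = 8  # the counters are described as counting 64-bit words


def gpu_bandwidth_bytes(fp_reads: int, fp_writes: int,
                        vp_reads: int, vp_writes: int) -> int:
    """Total GPU bus traffic in bytes for one sampling window.

    Arguments are the raw values of the four counters (FP total bus
    reads/writes, VP words read/written), all taken over the same window.
    """
    total_words = fp_reads + fp_writes + vp_reads + vp_writes
    return total_words * BYTES_PER_WORD


def bytes_per_second(total_bytes: int, window_seconds: float) -> float:
    """Convert a byte count for one window into an average rate."""
    return total_bytes / window_seconds


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    total = gpu_bandwidth_bytes(1000, 500, 200, 100)
    print(total, "bytes in window")
    print(bytes_per_second(total, 0.5), "bytes/s")
```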

    There is a counter that 'may' be enabled in your BSP:

    Mali-4xx Filmstrip: 1:10

    "captures every 10th frame"

    If you can use this, you can visually see how many thumbnails are produced in your capture and multiply by 10. This is obviously only accurate to within 10 frames.

    Another method, assuming you are not vertex limited, is to measure the time between Vertex Activity spikes. Each frame is 'likely' to only issue one vertex activity spike. Note that in some composition environments, such as Android's triple-buffered composition, there will be a second, smaller spike per frame.

    Another is to use Streamline annotations and mark eglSwapBuffers so you can see when it is called on the timeline.

    If you have source code access, however, you may find it best to just measure within the app itself.
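
    Both estimates reduce to simple arithmetic once you have read the numbers off the timeline. A minimal Python sketch, with hypothetical function names and hand-entered values (nothing here reads a Streamline capture directly):

```python
from typing import Sequence


def fps_from_filmstrip(thumbnails: int, capture_seconds: float,
                       every_nth: int = 10) -> float:
    """Estimate FPS from the number of filmstrip thumbnails in a capture.

    Assumes one thumbnail per `every_nth` frames, so the frame count is
    only known to within `every_nth` frames.
    """
    return thumbnails * every_nth / capture_seconds


def fps_from_vertex_spikes(spike_times_s: Sequence[float]) -> float:
    """Estimate FPS from the start times (in seconds) of vertex spikes.

    Assumes exactly one vertex activity spike per frame; filter out any
    smaller compositor spikes before calling this.
    """
    if len(spike_times_s) < 2:
        raise ValueError("need at least two spikes to measure an interval")
    elapsed = spike_times_s[-1] - spike_times_s[0]
    return (len(spike_times_s) - 1) / elapsed


if __name__ == "__main__":
    print(fps_from_filmstrip(thumbnails=12, capture_seconds=6.0))
    print(fps_from_vertex_spikes([0.0, 0.05, 0.10, 0.15]))
```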
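
    As a language-neutral sketch of measuring inside the app, the idea is to record a timestamp once per frame, right after the buffer swap, and divide the frame count by the elapsed time. The `FrameRateCounter` class below is hypothetical and written in Python for brevity; a real OpenGL ES application would do the equivalent in its own language around its eglSwapBuffers call.

```python
import time


class FrameRateCounter:
    """Counts frames and reports the average rate since the first frame."""

    def __init__(self, clock=time.monotonic):
        # `clock` is injectable so the counter can be tested without waiting.
        self._clock = clock
        self._first = None
        self._last = None
        self._frames = 0

    def frame_done(self) -> None:
        """Call once per frame, immediately after the buffer swap."""
        now = self._clock()
        if self._first is None:
            self._first = now  # first call only establishes the start time
        else:
            self._frames += 1
        self._last = now

    def fps(self) -> float:
        """Average frames per second between the first and last call."""
        if self._frames == 0 or self._last == self._first:
            return 0.0
        return self._frames / (self._last - self._first)


if __name__ == "__main__":
    counter = FrameRateCounter()
    for _ in range(5):
        time.sleep(0.02)  # stand-in for rendering and swapping one frame
        counter.frame_done()
    print(round(counter.fps(), 1), "FPS")
```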

    I hope that helps.

    Kind Regards,

    Michael McGeagh

  • Hi Michael and Peter,

    Did you get a chance to look into the Streamline log I shared?

    Within a frame, I am seeing some VP activity, followed by some idle time, followed by some PP activity.

    My questions are:

    1. Why is VP and PP activity not parallel? Why does one run after the other?

        Am I doing something wrong that prevents the parallelism?

    2. Why is there idle time in my application? Why can't the PP start immediately?

        I am not using vsync, and I have double buffering in my processing.

    3. I am running a Qt-based OpenGL ES 2.0 application with a simple vertex shader and a simple fragment shader.

    I need your support in analyzing this.

    Thanks,

    Ravinder Are

  • As I said before, Streamline isn't going to help answer questions about idle time. It's a performance profiler - you can't profile "nothing running" - all of the counters are zero.

    The causes of (1) and (2) are probably the same thing. Serialization means idle time and things not overlapping.
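
    To put numbers on what serialization costs, here is a small Python sketch using the 3 ms / 13 ms / 34 ms timings reported at the top of the thread. It is an idealized model, not a description of what the driver does: the function names are made up, and the "pipelined" case assumes vertex work for the next frame overlaps perfectly with pixel work for the current one.

```python
def serialized_frame_ms(vp_ms: float, idle_ms: float, pp_ms: float) -> float:
    """Frame time when VP, the idle gap, and PP run strictly one after another."""
    return vp_ms + idle_ms + pp_ms


def pipelined_frame_ms(vp_ms: float, pp_ms: float) -> float:
    """Steady-state frame time if VP and PP overlap across frames.

    Idealized assumption: throughput is limited only by the slower stage.
    """
    return max(vp_ms, pp_ms)


if __name__ == "__main__":
    serial = serialized_frame_ms(3.0, 13.0, 34.0)
    overlapped = pipelined_frame_ms(3.0, 34.0)
    print("serialized:", serial, "ms,", 1000.0 / serial, "FPS")
    print("pipelined: ", overlapped, "ms,", round(1000.0 / overlapped, 1), "FPS")
```

    Under these assumptions the pixel processor is the real limit at 34 ms per frame, and the 13 ms gap plus the non-overlapped vertex work account for the rest of the 50 ms.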

    Usual suspects:

    • Window system fences for framebuffers not being released by the previous user of the buffer (normally the display controller or compositor).
    • Vsync with fewer than 3 framebuffers
    • Application CPU load is too high (e.g. CPU limited) - doesn't look like it in your case
    • Application calling sleep()
    • Application using glFinish, glReadPixels, or waiting on an OpenGL ES level synchronization primitive and draining the rendering pipeline.

    Less likely suspects:

    • Kernel not processing interrupts quickly enough (e.g. another driver is disabling IRQs, and not re-enabling them for a long time).

    HTH,
    Pete

  • Thanks Peter, it's useful information.