
Mali Performance Optimization

Note: This was originally posted on 17th November 2010 at http://forums.arm.com

I thought we could use a forum to share ideas on how to optimize graphics rendering on Mali-based devices.  So, here's my first post on identifying bottlenecks in the rendering pipeline.

We usually use frame rate, or frames per second (FPS), to measure graphics rendering performance.  System FPS is the overall rendering performance when all system components (CPU, GPU, memory, display) are hooked up together.  What system FPS fails to reveal, however, is how individual components in the rendering pipeline perform.  Knowing each component's performance and locating bottlenecks is the first necessary step in optimizing graphics rendering on any Mali-based system.

Graphics rendering on Mali is a frame-based pipelined process that involves several processing units.  The process begins with the graphics application, running on the CPU, making API calls to the Mali driver.  The Mali driver then sets up the data in system memory that the Mali GPU needs to render a frame.  The GPU can't start rendering until the CPU has completely set up the data for that frame.

Within the Mali GPU, the geometry processor (GP) consumes the data previously set up by the CPU and passes its results on to the pixel processor (PP).  The PP can't start rendering until the GP has completely set up the required data in memory for that frame.

You get the picture.  Data dependency between processing blocks means a bottleneck in any of the processing cores throttles the whole system FPS.  In addition, processing cores need to access system memory so high memory latencies can potentially be a bottleneck as well.

Using the Mali performance analysis tool (PAT), in conjunction with instrumented Mali drivers, one can usually spot a CPU-bound use-case easily.  The measured system FPS would be significantly lower than GP or PP FPS (measured by PAT).  See attached image.
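As a rough illustration of the comparison described above, here is a minimal Python sketch.  It assumes you have already read the three FPS figures off PAT by hand; the function name `classify_bottleneck` and the `margin` parameter are made up for this example and are not part of PAT or the Mali driver.  The 0.9 margin is an arbitrary stand-in for "significantly lower".

```python
def classify_bottleneck(system_fps, gp_fps, pp_fps, margin=0.9):
    """Guess which stage limits the frame rate.

    system_fps: measured end-to-end frame rate.
    gp_fps, pp_fps: per-core frame rates as reported by a profiler.
    margin: assumed threshold; system FPS below margin * slowest GPU
            stage is treated as "significantly lower".
    """
    slowest_gpu_fps = min(gp_fps, pp_fps)
    if system_fps < margin * slowest_gpu_fps:
        # Neither GPU core is the limit, so suspect the CPU side.
        return "CPU"
    # Otherwise the slower of the two GPU cores is the likely limit.
    return "GP" if gp_fps <= pp_fps else "PP"


if __name__ == "__main__":
    print(classify_bottleneck(system_fps=20, gp_fps=60, pp_fps=45))
```

A result of "CPU" here only means "not the GP or PP"; memory latency or the display could still be the real cause, so treat it as a starting point.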

Has anyone found any interesting bottlenecks, or tricky ones to spot?

  • Note: This was originally posted on 19th January 2011 at http://forums.arm.com


    Is it possible for you now to check the CPU load, i.e. how much the application and the drivers are each consuming, separately? I mean, it is relevant to find the split between application and driver load; that way we would know which area to look at for optimization.
  • Note: This was originally posted on 20th January 2011 at http://forums.arm.com

    Hi,

    Yes, checking CPU load is important in finding the application bottleneck, as it may not always be the GPU's vertex or fragment processing stages.

    Under Linux, you can run 'top' to see CPU load. For example:

    top - 14:18:01 up 1 min,  1 user,  load average: 0.65, 0.29, 0.10
    Tasks:  24 total,   1 running,  23 sleeping,   0 stopped,   0 zombie
    Cpu(s): 47.4%us, 25.4%sy,  0.0%ni,  2.6%id,  0.0%wa, 22.8%hi,  1.8%si,  0.0%st
    Mem: 213540k total,   100892k used,   112648k free,  4228k buffers
    Swap:     0k total,     0k used,     0k free, 83568k cached

      PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM TIME+  COMMAND
    1097 root   20   0 32412 4108 2216 S 90.7  1.9   0:04.52 main
    1098 root   20   0  2332 1080  892 R  6.2  0.5   0:00.53 top


    This application's process (main) is showing 90% CPU load, so the CPU is quite likely to be the bottleneck. In fact, this application is doing video decode on the CPU and showing the result on a spinning cube, so the GPU will be lightly loaded.
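If you want a rough user/kernel split out of the same output, the `Cpu(s):` summary line can be parsed with a few lines of Python.  This is only a sketch: it assumes the line layout shown in the sample above (other versions of top format it differently), and `parse_cpu_summary` is a name invented for this example.  In top's notation, `us` is user-space time, `sy` is kernel time, and `hi`/`si` are hardware and software interrupt time.

```python
import re


def parse_cpu_summary(line):
    """Turn a top 'Cpu(s):' line into a dict such as {'us': 47.4, 'sy': 25.4}."""
    line = line.strip()
    if not line.startswith("Cpu(s):"):
        raise ValueError("not a Cpu(s) summary line")
    # Each field looks like '47.4%us'; capture the number and the label.
    return {name: float(value)
            for value, name in re.findall(r"([\d.]+)%(\w+)", line)}


if __name__ == "__main__":
    sample = ("Cpu(s): 47.4%us, 25.4%sy,  0.0%ni,  2.6%id,"
              "  0.0%wa, 22.8%hi,  1.8%si,  0.0%st")
    cpu = parse_cpu_summary(sample)
    kernel_side = cpu["sy"] + cpu["hi"] + cpu["si"]
    print("user:", cpu["us"], "kernel-side:", kernel_side)
```

Keep in mind that user time covers the application plus any user-space libraries it calls, so this split is only a first hint at where to look, not an exact application-versus-driver breakdown.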

    This application:


    top - 14:24:17 up 2 min,  1 user,  load average: 1.26, 0.58, 0.23
    Tasks:  24 total,   3 running,  21 sleeping,   0 stopped,   0 zombie
    Cpu(s): 21.8%us, 25.5%sy,  0.0%ni, 38.2%id,  0.0%wa, 13.6%hi,  0.9%si,  0.0%st
    Mem: 213540k total,   100292k used,   113248k free,  5276k buffers
    Swap:     0k total,     0k used,     0k free, 80448k cached

      PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM TIME+  COMMAND
    1113 root   20   0 22460 2676 1592 R 49.3  1.3   0:20.84 main_pixmap_ump
    1114 root   20   0  2332 1080  892 R  5.5  0.5   0:02.23 top


    is only using 50% of the CPU, so the bottleneck is likely in the GPU, either the fragment or vertex processing stage. You can often find which by running the application with a very small (e.g. 16x16) resolution. At this resolution the application shouldn't be fragment bound, so if the speed doesn't go up the application was probably vertex limited. If the speed does go up, the application was probably fragment bound.
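The small-resolution test above boils down to comparing two frame rates, which could be sketched in Python like this.  The function name `classify_gpu_stage` and the 1.2x `speedup_threshold` are assumptions made for illustration; what counts as "the speed goes up" is a judgment call for your own application.

```python
def classify_gpu_stage(fps_full_res, fps_tiny_res, speedup_threshold=1.2):
    """Guess whether a GPU-bound application is vertex or fragment limited.

    fps_full_res: frame rate at the normal resolution.
    fps_tiny_res: frame rate at a very small resolution (e.g. 16x16).
    speedup_threshold: assumed minimum ratio that counts as "faster".
    """
    if fps_full_res <= 0:
        raise ValueError("fps_full_res must be positive")
    speedup = fps_tiny_res / fps_full_res
    # A clear speed-up at tiny resolution points at fragment processing;
    # little or no change points at vertex processing.
    return "fragment" if speedup >= speedup_threshold else "vertex"


if __name__ == "__main__":
    print(classify_gpu_stage(fps_full_res=30, fps_tiny_res=90))
```

This only makes sense once you have ruled out the CPU as the bottleneck, since a CPU-bound application would also show no speed-up at 16x16.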

    If you want to see where the time on the CPU is being spent, you could consider building your program with profiling enabled for gprof (the GNU profiling tool; with GCC this means compiling and linking with -pg), or running on a system with OProfile support (a kernel module which collects performance data). Both of these should show you how much CPU time is being spent in which functions.

    Under Android there is also a tool called the Dalvik Debug Monitor. You can find it in the SDK tools directory, called ddms.bat. This tool opens a window with several options, including a tab labeled SysInfo. This tab shows a CPU load pie chart which you can use to see the CPU load percentage of your application.

    Hope this helps, Pete
  • Note: This was originally posted on 22nd February 2011 at http://forums.arm.com

    Hi,

    I could add these points.

    - Shader optimization (e.g. size)
    - Review the data structures used to create the frame buffers
    - Size of the data sent to the GPU (e.g. size of textures)
    - Check that it uses new drivers (with the profiling tool attached)
    - Environmental conditions in the lab, such as
        1. EMI (Electromagnetic Interference)
        2. RFI (Radio Frequency Interference)
        3. ESD (Electrostatic Discharge)


    Let me know if you get a 2X Performance with this :)

    Ravi
  • Note: This was originally posted on 24th February 2011 at http://forums.arm.com

    Tim, btw,

    Is it possible to use GLBenchmark for performance optimization?

    Ravindran
  • Note: This was originally posted on 2nd March 2011 at http://forums.arm.com

    Is it possible to use SPECviewperf for performance benchmarking, or any Embassy benchmarking tool (from the point of view of third parties like us)?
  • Note: This was originally posted on 3rd March 2011 at http://forums.arm.com

    Hi Ravindran,
    I suppose you could use GLbenchmark, as well as any other test case, to analyse bottlenecks in your system, which could point to opportunities for system design improvements.
    --Tim
