Mali Performance Optimization

Note: This was originally posted on 17th November 2010 at http://forums.arm.com

I thought we could use a forum to share ideas on how to optimize graphics rendering on Mali-based devices.  So, here's my first post on identifying bottlenecks in the rendering pipeline.

We usually use frame rate, or frames per second (FPS), to measure graphics rendering performance.  System FPS is the overall rendering performance when all system components (CPU, GPU, memory, display) are hooked up together.  What system FPS fails to reveal, however, is how individual components in the rendering pipeline perform.  Knowing each component's performance and locating bottlenecks is the first necessary step in optimizing graphics rendering on any Mali-based system.

Graphics rendering on Mali is a frame-based pipelined process that involves several processing units.  The process begins with the graphics application running on the CPU making API calls to the Mali driver.  The Mali driver then sets up the data in system memory that the Mali GPU needs to render a frame.  The GPU can't start rendering until the CPU has completely set up the data for that frame.

Within the Mali GPU, the geometry processor (GP) consumes the data previously set up by the CPU and passes its results on to the pixel processor (PP).  The PP can't start rendering until the GP has completely set up the required data in memory for that frame.

You get the picture.  The data dependency between processing blocks means that a bottleneck in any one of them throttles the whole system FPS.  In addition, the processing cores need to access system memory, so high memory latency can be a bottleneck as well.
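To make that concrete, here is a minimal sketch of the idea, assuming a simplified model in which the CPU, GP and PP each work on a different frame at the same time, so the steady-state frame rate is set by the slowest stage.  The function name and the per-frame timings are made up for illustration and are not measurements from any real device.

```python
def pipeline_fps(stage_ms):
    """Return (slowest stage, steady-state FPS) for a fully pipelined renderer.

    stage_ms maps a stage name to its per-frame time in milliseconds.
    Simplifying assumption: stages overlap perfectly on different frames,
    so throughput is limited only by the slowest stage.
    """
    slowest = max(stage_ms, key=stage_ms.get)
    return slowest, 1000.0 / stage_ms[slowest]


# Hypothetical per-frame times in milliseconds.
stages = {"CPU": 25.0, "GP": 8.0, "PP": 12.5}
bottleneck, fps = pipeline_fps(stages)
print(bottleneck, fps)

# Speeding up a stage that is not the bottleneck leaves the frame rate unchanged.
faster_pp = dict(stages, PP=6.0)
print(pipeline_fps(faster_pp))
```

In this model, halving the PP time changes nothing because the CPU still needs 25 ms per frame, which is why finding the limiting stage comes before any optimization.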

Using the Mali performance analysis tool (PAT) in conjunction with instrumented Mali drivers, one can usually spot a CPU-bound use case easily: the measured system FPS is significantly lower than both the GP FPS and the PP FPS reported by PAT.  See attached image.
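The comparison above can be written down as a small rule of thumb.  This is only a sketch: the `classify` function, the 0.8 margin used to decide what counts as "significantly lower", and the FPS figures are all assumptions for illustration, and the inputs would have to be read off the tool by hand.

```python
def classify(system_fps, gp_fps, pp_fps, margin=0.8):
    """Guess the limiting unit from three FPS readings.

    gp_fps and pp_fps are the rates each GPU unit could sustain on its own.
    margin is an assumed threshold: a system FPS below margin * the slower
    GPU unit is treated as "significantly lower".
    """
    gpu_limit = min(gp_fps, pp_fps)
    if system_fps < margin * gpu_limit:
        # Neither GPU unit explains the low frame rate, so look at the CPU
        # side first (memory or display limits are also possible).
        return "likely CPU-bound"
    return "likely GP-bound" if gp_fps <= pp_fps else "likely PP-bound"


# Hypothetical readings.
print(classify(system_fps=30.0, gp_fps=120.0, pp_fps=75.0))
print(classify(system_fps=58.0, gp_fps=140.0, pp_fps=60.0))
```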

Has anyone found any interesting bottlenecks, or tricky ones to spot?

  • Note: This was originally posted on 20th January 2011 at http://forums.arm.com

    Hi,

    Yes, checking CPU load is important in finding the application bottleneck, as it may not always be in the GPU's vertex or fragment processing stages.

    Under Linux, you can run 'top' to see CPU load. For example:

    top - 14:18:01 up 1 min,  1 user,  load average: 0.65, 0.29, 0.10
    Tasks:  24 total,   1 running,  23 sleeping,   0 stopped,   0 zombie
    Cpu(s): 47.4%us, 25.4%sy,  0.0%ni,  2.6%id,  0.0%wa, 22.8%hi,  1.8%si,  0.0%st
    Mem: 213540k total,   100892k used,   112648k free,  4228k buffers
    Swap:     0k total,     0k used,     0k free, 83568k cached

      PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM TIME+  COMMAND
    1097 root   20   0 32412 4108 2216 S 90.7  1.9   0:04.52 main
    1098 root   20   0  2332 1080  892 R  6.2  0.5   0:00.53 top


    This application's process (main) is showing 90% CPU load, so the CPU is quite likely to be the bottleneck. In fact, this application is doing video decode on the CPU and showing the result on a spinning cube, so the GPU will be lightly loaded.

    This application:


    top - 14:24:17 up 2 min,  1 user,  load average: 1.26, 0.58, 0.23
    Tasks:  24 total,   3 running,  21 sleeping,   0 stopped,   0 zombie
    Cpu(s): 21.8%us, 25.5%sy,  0.0%ni, 38.2%id,  0.0%wa, 13.6%hi,  0.9%si,  0.0%st
    Mem: 213540k total,   100292k used,   113248k free,  5276k buffers
    Swap:     0k total,     0k used,     0k free, 80448k cached

      PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM TIME+  COMMAND
    1113 root   20   0 22460 2676 1592 R 49.3  1.3   0:20.84 main_pixmap_ump
    1114 root   20   0  2332 1080  892 R  5.5  0.5   0:02.23 top


    is only using about 50% of the CPU, so the bottleneck is likely in the GPU, in either the fragment or the vertex processing stage. You can often find out which by running the application at a very small resolution (e.g. 16x16). At this resolution the application shouldn't be fragment bound, so if the frame rate doesn't go up, the application was probably vertex limited. If the frame rate does go up, the application was probably fragment bound.

    If you want to see where the time on the CPU is being spent, you could consider building your program with profiling enabled (for example, GCC's -pg option) and analysing the result with gprof (the GNU profiling tool), or running on a system with OProfile support (a kernel module which collects performance data). Both of these should show you how much CPU time is being spent in which functions.

    Under Android there is also a tool called the Dalvik Debug Monitor. You can find it in the SDK tools directory, where it is called ddms.bat on Windows (ddms on Linux and Mac OS X). This tool opens a window with several options, including a tab labeled SysInfo. This tab shows a CPU load pie chart which you can use to see the CPU load percentage of your application.

    Hope this helps, Pete
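
The checks described in this reply can be strung together as a rough decision procedure.  The sketch below is not part of the original reply: the function names, the 85% "CPU busy" threshold, the 1.2x speed-up factor and the frame rates are assumptions chosen for illustration.  The only real data is the two process lines copied from the top output above, and the parser assumes the column layout shown there (%CPU in the ninth column).

```python
def cpu_percent_from_top_line(line):
    """Extract %CPU from a top process line laid out as in the output above:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND."""
    return float(line.split()[8])


def likely_bottleneck(cpu_percent, fps_full, fps_small,
                      cpu_busy=85.0, speedup=1.2):
    """Apply the reply's heuristics in order.

    cpu_busy and speedup are assumed thresholds, not values from the thread.
    fps_full is the frame rate at normal resolution, fps_small the frame
    rate at a very small resolution such as 16x16.
    """
    if cpu_percent >= cpu_busy:
        return "CPU"
    if fps_small >= speedup * fps_full:
        # Far fewer fragments made it faster, so fragment work was limiting.
        return "fragment"
    # Shrinking the render target did not help, so suspect vertex work.
    return "vertex"


main_line = "1097 root   20   0 32412 4108 2216 S 90.7  1.9   0:04.52 main"
ump_line = "1113 root   20   0 22460 2676 1592 R 49.3  1.3   0:20.84 main_pixmap_ump"

# The frame rates below are hypothetical.
print(likely_bottleneck(cpu_percent_from_top_line(main_line), 30.0, 30.0))
print(likely_bottleneck(cpu_percent_from_top_line(ump_line), 30.0, 55.0))
print(likely_bottleneck(cpu_percent_from_top_line(ump_line), 30.0, 31.0))
```

The thresholds will need tuning per platform, and on a multi-core system the per-process %CPU reported by top may need interpreting differently.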