The Streamline performance analyzer is a sample-based profiler which can present detailed performance information about the Arm CPU and Mali GPU present in a device. Recent versions of Streamline have included a set of predefined templates which can be used to easily select a set of data sources to use, and control how they are visualized. The latest release of Streamline, included in Arm Mobile Studio 2019.0 and Arm Development Studio 2019.0, includes a number of improvements to the Mali GPU templates for the Mali Bifrost GPU family. This article walks through the use of the template for a Mali-G72 GPU.
This blog assumes that the reader has familiarity with graphics terminology, in particular relating to tile-based rendering GPU architectures. Some useful quick-start guides on these topics can be found here:
Once you have followed the Quick Start Guide to set up your application and install the gator daemon to the target, it is time to select some data sources and start profiling. Connect to your device and bring up the Counter Selection dialog. In the Counter Selection dialog select the appropriate template for your device from the drop-down menu.
This will automatically select all of the data sources necessary to render the template's visualization. Click Save, and then capture a trace of your application. Once the initial data analysis has completed, the default Timeline visualization will be presented:
This shows an alphabetical list of the charts and data series that were captured. The first thing we therefore need to do is select the same template we used for capture for visualization:
This will change the Timeline to display a pre-defined visualization, designed by our in-house performance analysis team. This will order the charts in a more methodical sequence, and make use of mathematical expressions to combine multiple raw counters to derive more readable metrics such as percentage utilization of a functional unit.
The initial view that the Timeline presents gives us 1 second per on-screen sample, which is too coarse for debugging graphics content because we are most interested in how well we are processing frames, which are typically between 16 and 32 milliseconds in length. The first step in the analysis is therefore to zoom in the view until single frames become distinct.
In the application shown in the sample we have added instrumentation to the source code to generate a Streamline marker annotation whenever the application calls eglSwapBuffers(). These are visible as the red ticks on the time track above the charts.
Once you can see individual frames it is possible to make an initial assessment of the current system behavior:
In our example above we can see that the CPUs are all going completely idle for a significant proportion of the frame, so we are not CPU bound. We can also see that the GPU is active all of the time, so the GPU is highly likely to be the processor limiting this application's performance.
In terms of breaking down the GPU workload further we can see that the fragment shading queue is the one active all of the time, with the non-fragment queue used for all geometry and compute processing going idle for most of the frame. You would therefore look to optimize fragment workload for this application if you wanted to improve performance.
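As a rough illustration, this first-pass triage can be written down as a small Python helper. The function name, its inputs, and the 95% "busy" threshold are assumptions made for this sketch; they are not Streamline features, and the percentages would be read off the charts by hand or taken from exported data.

```python
def classify_bottleneck(busiest_cpu_thread_pct, fragment_queue_pct,
                        non_fragment_queue_pct, busy_threshold_pct=95.0):
    """Return a rough first guess at the processor limiting performance.

    All inputs are utilization percentages (0-100) over the frames of
    interest. The threshold is an assumed cut-off, not an Arm guideline.
    """
    # A single CPU thread that is busy all of the time suggests CPU bound.
    if busiest_cpu_thread_pct >= busy_threshold_pct:
        return "cpu"
    # Otherwise look at which GPU work queue is active all of the time.
    if (fragment_queue_pct >= busy_threshold_pct
            and fragment_queue_pct >= non_fragment_queue_pct):
        return "gpu-fragment"
    if non_fragment_queue_pct >= busy_threshold_pct:
        return "gpu-non-fragment"
    return "unclear"
```

For the example capture above, something like `classify_bottleneck(40, 100, 20)` would return `"gpu-fragment"`, matching the conclusion that fragment workload is the place to start.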
The following sections in this tutorial work through each of the charts in the template, and explain what they mean and what changes they could imply for an application developer looking to improve performance.
The CPU charts show the overall usage of the CPUs in the system.
The CPU Activity charts show the per-CPU utilization, computed as a percentage of time the CPU was active, split by processor type if you have big.LITTLE clustering present. This is based on OS scheduling event data. The CPU Cycles chart shows the number of cycles that each CPU was active, measured using the CPU performance monitoring unit (PMU). By considering both of these together we can assess the overall application software load; a high utilization and a high CPU cycle count indicate that the CPU is both very busy and running at a high clock frequency.
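One way to consider the two charts together is to divide the cycle count by the active time, which estimates the clock frequency the CPU was running at while it was busy. The sketch below assumes you have one sample's cycle count, its utilization percentage, and the sample duration; the function and parameter names are hypothetical.

```python
def effective_frequency_hz(active_cycles, utilization_pct, sample_seconds):
    """Estimate the clock frequency of a CPU while it was active.

    active_cycles:   cycles counted by the PMU during the sample
    utilization_pct: percentage of the sample the CPU was active (0-100)
    sample_seconds:  duration of the sample in seconds
    """
    active_seconds = (utilization_pct / 100.0) * sample_seconds
    if active_seconds <= 0:
        return 0.0  # CPU was idle, so no frequency can be inferred
    return active_cycles / active_seconds
```

For example, 1.2 billion cycles at 60% utilization over a 1 second sample works out at roughly 2 GHz, which would indicate a busy CPU running at a high clock frequency.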
The process view at the bottom of the Timeline tab shows the application thread activity, allowing an identification of which threads are causing the measured load. Selecting one or more threads from the list will filter the CPU related charts so that only the load from the selected threads is shown. When a thread-level filter is active the chart title background changes to a blue-tinted color to indicate that not all of the measured load is currently visible.
If the application is not hitting its performance target and has a single CPU thread which is active all of the time then it is likely to be CPU bound. Improvements to frame time will require software optimizations to reduce the cost of this thread's workload. Streamline provides native software profiling via program counter sampling, in addition to the performance counter views. Software profiling is beyond the scope of this tutorial, so please refer to the Streamline User Guide for more information.
The GPU workload charts show the overall usage of the GPU.
The Mali Job Manager Cycles chart shows the number of GPU cycles spent with work running, both for the GPU as a whole and for the two parallel hardware work queues for Non-fragment and Fragment work. The Mali Job Manager Utilization charts show the same data normalized as a percentage against GPU active cycles.
For GPU bound content the dominant work queue should be active all of the time, with the other queue running in parallel to it. If a GPU bound application is not achieving good parallelism, check for API calls which drain the rendering pipeline, such as glFinish() or synchronous use of glReadPixels(), or Vulkan dependencies which are too conservative to allow for stage overlap of multiple render passes (including overlap across frames).
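The normalization used by the utilization charts, plus a simple parallelism check, can be sketched as follows. The counter arguments are hypothetical names for the three cycle counts, and the overlap figure is only a lower-bound estimate that assumes GPU active time is covered by the two queues.

```python
def queue_utilization(gpu_active, non_fragment_active, fragment_active):
    """Normalize queue cycle counts against GPU active cycles.

    Returns percentages for each queue, plus a lower-bound estimate of
    the time both queues were running in parallel.
    """
    if gpu_active <= 0:
        return {"non_fragment_pct": 0.0, "fragment_pct": 0.0, "overlap_pct": 0.0}
    # If the two queues together exceed GPU active time, the excess must
    # have been spent with both queues running at once.
    overlap = max(0, non_fragment_active + fragment_active - gpu_active)
    return {
        "non_fragment_pct": 100.0 * non_fragment_active / gpu_active,
        "fragment_pct": 100.0 * fragment_active / gpu_active,
        "overlap_pct": 100.0 * overlap / gpu_active,
    }
```

A GPU bound capture where one queue sits near 100% and the overlap estimate is close to the other queue's utilization is getting good parallelism; an overlap near zero suggests the queues are being serialized.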
The Tiler active counter in this chart is not always directly useful, as the tiler is normally active for the entire duration of geometry processing, but it can give an indication of how much compute shading is present. Any large gap between Non-fragment active and Tiler active may be caused by application compute shaders.
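A minimal sketch of that gap calculation is shown below. The argument names are hypothetical, and the result is only a hint: a gap between the two counters may be caused by compute shaders, but it is not a direct measurement of compute work.

```python
def non_fragment_tiler_gap(non_fragment_active, tiler_active):
    """Return the gap between Non-fragment active and Tiler active.

    The result is (gap in cycles, gap as a percentage of Non-fragment
    active cycles).
    """
    gap_cycles = max(0, non_fragment_active - tiler_active)
    if non_fragment_active <= 0:
        return gap_cycles, 0.0
    return gap_cycles, 100.0 * gap_cycles / non_fragment_active
```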
The IRQ active counter shows the number of cycles the GPU has an interrupt pending with the CPU. An IRQ pending rate of ~2% of GPU cycles is normal, but applications can cause a higher rate of interrupts by enqueuing a large number of small render passes or compute dispatches.
Note: A high IRQ overhead can also be indicative of a system integration issue, such as CPU interrupts being masked for a long time by a privileged kernel activity. It is not usually possible to fix a high IRQ overhead of this kind using application changes.
The memory system charts show the behavior seen at the GPU memory interface, both in terms of memory traffic generated by the GPU and how effectively the system is handling that traffic.
The Mali External Bus Bandwidth chart shows the total read and write bandwidth generated by the application. Reducing memory bandwidth can be an effective application optimization goal, as external DDR memory accesses are very energy intensive. Later charts can help identify which application resource types are the cause of the traffic.
The Mali External Bus Stall Rate chart shows the percentage of GPU cycles with a bus stall, indicating how much back-pressure the GPU is getting from the external memory system. Stall rates of up to 5% are considered normal; a stall rate much higher than this is indicative of a workload which is generating more traffic than the memory system can handle. Stall rates can be reduced by reducing overall memory bandwidth, or improving access locality.
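The stall rate check can be expressed as a short helper. This is a sketch under stated assumptions: the function names are hypothetical, the inputs are the stall and GPU active cycle counts for the same time window, and the 5% limit is the "normal" figure quoted above, used here as a simple cut-off.

```python
NORMAL_STALL_LIMIT_PCT = 5.0  # "up to 5% is considered normal"


def bus_stall_rate(stall_cycles, gpu_active_cycles):
    """Percentage of GPU active cycles that saw an external bus stall."""
    if gpu_active_cycles <= 0:
        return 0.0
    return 100.0 * stall_cycles / gpu_active_cycles


def needs_bandwidth_review(stall_cycles, gpu_active_cycles,
                           limit_pct=NORMAL_STALL_LIMIT_PCT):
    """True if the stall rate is above the assumed normal limit."""
    return bus_stall_rate(stall_cycles, gpu_active_cycles) > limit_pct
```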
The Mali External Bus Read Latency chart shows a stacked histogram of the response latency of external memory accesses. Mali GPUs are designed for an external memory latency of up to 170 GPU cycles, so seeing a high percentage of reads in the slower bins may indicate a memory system performance issue. DDR performance is not constant, and latency will increase when the DDR is under high load, so reducing bandwidth can be an effective method to reduce latency.
Note: It is expected that a small proportion of memory accesses will be in the slower bins, as the DDR is a shared resource and there will be competing traffic from other parts of the system.
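To put a number on "a high percentage of reads in the slower bins", the histogram can be summarized as the share of reads falling in bins that start beyond the 170 cycle design latency. The sketch below assumes the histogram is available as a mapping from each bin's lower bound in cycles to a read count; that layout and the bin boundaries in the example are illustrative assumptions, not the exact bins Streamline reports.

```python
def slow_read_share(histogram, design_latency_cycles=170):
    """Percentage of reads in bins that start above the design latency.

    histogram maps a bin's lower latency bound (in GPU cycles) to the
    number of reads that landed in that bin.
    """
    total = sum(histogram.values())
    if total <= 0:
        return 0.0
    slow = sum(count for lower_bound, count in histogram.items()
               if lower_bound > design_latency_cycles)
    return 100.0 * slow / total
```

With a hypothetical histogram of `{0: 700, 128: 200, 192: 60, 256: 40}`, the two slowest bins hold 100 of 1000 reads, giving a slow share of 10%.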
The Mali External Bus Outstanding Reads/Writes charts show another set of stacked histograms, this time showing the percentage of allowed memory accesses the GPU has queued in the memory system. If a high percentage of the histogram is in the 75-100% bin, the GPU may be running out of available transactions. This will stall new memory requests until an older request has retired. Reducing memory bandwidth or improving access locality in DDR may improve performance.
The geometry charts show the amount of geometry being processed by the GPU, and the behavior of the primitive culling unit.
The Mali Primitive Culling chart shows the absolute number of primitives being processed, how many are killed by each culling stage, and how many are visible. A single vertex is much more expensive to process than a single fragment because it has high memory bandwidth requirements, so you should aim to reduce total primitive count per frame as much as possible.
The Mali Primitive Culling Rate chart shows the percentage of primitives entering each culling stage that are killed by it, and the percentage of visible primitives. The culling pipeline runs as a number of serial stages:
For a 3D scene it is expected that ~50% of the primitives are back-facing and killed by the facing test culling unit. If the Culled by facing test rate is much lower than this, review whether the facing test is correctly enabled.
It is standard best practice for an application to cull out-of-frustum draw calls on the CPU, so the Culled by frustum test rate should be kept as low as possible. If more than 10% of input primitives are getting killed at this stage, review for effectiveness of CPU-side culling. In addition it can be worth reviewing batch sizes, as overly large batches of objects can reduce culling efficiency.
The final culling rate, Culled by sample test, measures the percentage of primitives killed because they are so small that they will hit no rasterization sample points. Dense geometry is very expensive, both in terms of the direct vertex processing cost and reduced fragment shading efficiency, so this number should be kept as close to 0% as possible. If a high number of primitives are being killed here review both static mesh density, and the effectiveness of any dynamic level-of-detail selection.
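Because the stages run in series, each rate is relative to the primitives entering that stage rather than to the total input. The sketch below shows the arithmetic, assuming the stage order follows the order discussed above (facing test, then frustum test, then sample test) and that the visible percentage is relative to the total input. The function and argument names are hypothetical.

```python
def culling_rates(input_primitives, facing_culled, frustum_culled, sample_culled):
    """Per-stage culling rates for a serial culling pipeline.

    Each stage's rate is the percentage of primitives entering that
    stage which it kills. "visible" is the percentage of all input
    primitives that survive every stage.
    """
    rates = {}
    remaining = input_primitives
    stages = (("facing", facing_culled),
              ("frustum", frustum_culled),
              ("sample", sample_culled))
    for name, culled in stages:
        rates[name] = 100.0 * culled / remaining if remaining > 0 else 0.0
        remaining -= culled  # survivors feed the next stage
    rates["visible"] = (100.0 * remaining / input_primitives
                        if input_primitives > 0 else 0.0)
    return rates
```

For 1000 input primitives with 500, 50, and 45 killed by the three stages, the rates are 50% for the facing test, 10% for the frustum test (50 of the 500 that reached it), and 10% for the sample test (45 of 450), leaving 40.5% visible.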
The Mali Geometry Threads chart shows the absolute number of shading requests generated by Mali's index-driven vertex shading algorithm. This design splits the application's vertex shader into two pieces, one piece which computes position and one which computes the other varyings. The varying shader is only run for vertices belonging to primitives which survive clipping and culling. It is possible to review multiple things at this point.
The shader front-end charts show the behavior of the fixed-function unit which turns primitives into fragment threads to be shaded.
The Mali Core Primitives chart shows the number of primitives that are loaded for rasterization. Note that Mali will load large primitives once per tile, so a single primitive will be included in this count once for every tile it intersects.
The Mali Early ZS Testing Rate chart shows the depth (Z) and stencil (S) testing and culling rates in the front-end. Early ZS testing is much less expensive than late ZS testing, so aim for nearly all fragments to be Early ZS tested by minimizing use of shader discard, alpha-to-coverage, and shader generated depth values. The FPK killed counter reports the proportion of quads killed by Mali's Forward Pixel Kill hidden surface removal scheme. A high proportion of quads being killed by FPK indicates a back-to-front render order; reversing this to a front-to-back render order will kill quads earlier during early-ZS testing which will reduce energy consumption.
The Mali Late ZS Testing Rate chart shows the depth and stencil testing and culling rates in the back-end after fragment shading. Killing a high proportion of quads during late ZS testing indicates a potential efficiency issue because these fragments are killed after they have been shaded.
Note: A render pass using an existing depth or stencil attachment as a starting state, rather than a clear color, will trigger a late ZS operation as part of the reload process. This may not be avoidable, but aim to minimize the number of render passes starting without a clear of all attachments.
The Mali Core Warps chart shows the number of warps created by the compute front-end (which includes all non-fragment workloads) and the fragment front-end. Note that the warp width can vary from product to product.
The final two charts show the average shader core processing cost per thread. For GPU bound content there are two possible optimization objectives for shader workloads:
This set of charts looks at the rate at which the shader core is producing pixels.
The Mali Pixels chart shows the total number of pixels shaded by all shader cores, allowing an assessment of the total number of pixels required to produce a frame.
The Mali Overdraw chart shows the average number of fragments shaded per output pixel. High levels of overdraw can reduce performance, even if the cost per fragment is low. Aim to minimize the number of layers of transparent fragments in use to reduce overdraw.
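The overdraw figure is a simple ratio, sketched below with hypothetical argument names for the number of fragments shaded and the number of output pixels over the same window.

```python
def overdraw(fragments_shaded, output_pixels):
    """Average number of fragments shaded per output pixel."""
    if output_pixels <= 0:
        return 0.0
    return fragments_shaded / output_pixels
```

As a worked example, a 1920x1080 frame has 2,073,600 output pixels, so shading 5,184,000 fragments to produce it corresponds to an overdraw of 2.5.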
The shader core is the heart of the GPU, so it should be no surprise that we have a large number of counters allowing inspection of shader core workloads. The first set of charts aim to give an at-a-glance view of the overall shader core utilization.
Note: The shader core "compute" data path is used for processing all non-fragment workloads, so any compute related counter will also include vertex shading workloads.
The Mali Core Utilization chart shows the percentage utilization of the three major parts of the shader core.
The Mali Core Unit Utilization chart shows the percentage utilization of the major pipelines inside the Execution core.
For shader content which is shader core bound, identifying the unit which is most heavily loaded using this chart is a good way to determine where to target optimizations.
The Mali Workload Properties chart contains a variety of component series which indicate interesting behaviors of the workload.
The Warp divergence rate series reports the percentage of instructions executed when there is control flow divergence across the warp, causing some execution lanes to be masked out. Aim to minimize control flow divergence, as it can rapidly erode shader execution efficiency.
The Partial warp rate series reports the percentage of warps which contain thread slots with no coverage. This occurs due to a fragment quad intersecting the edge of a primitive, resulting in a fragment with no hit samples. A high percentage of partial warps may be indicative of an application with a high number of microtriangles, or triangles which are very thin. Aim to minimize the number of partial warps, as they can also rapidly erode shader execution efficiency.
The Tile CRC kill rate series reports the percentage of tiles which are killed due to a CRC match, indicating that the computed color matches the color already in memory. A high kill rate may indicate an optimization opportunity, if the application is able to identify and draw only the changed parts of the screen.
The Execution Engine is used for executing all shader instructions, including all of the arithmetic workloads. The Execution engine utilization series in the Mali Core Unit Utilization chart described earlier can be used to determine whether the application is arithmetic processing limited.
The Mali Core Varying Cycles chart reports the usage of the fixed-function varying interpolator, broken down by data precision.
For content which is varying bound there are three possible optimization opportunities:
The texture unit is a complex unit which handles all texture sampling and filtering. It can have variable performance depending upon texture format and filtering mode being used.
The Mali Core Texture Cycles chart reports the total usage of the fixed-function texture filtering unit.
The Mali Core Texture CPI chart reports the average number of texture cycles per request, giving some idea of the number of multi-cycle operations being triggered by use of more complex filtering modes. For texture bound content reducing CPI by using simpler filtering modes can be an effective means to improve performance.
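CPI here is cycles divided by requests, so a value close to 1.0 means most requests complete in a single filtering cycle. A minimal sketch, with hypothetical argument names for the two counts:

```python
def texture_cpi(texture_cycles, texture_requests):
    """Average texture filtering cycles per texture request."""
    if texture_requests <= 0:
        return 0.0
    return texture_cycles / texture_requests
```

For instance, 1500 filtering cycles for 1000 requests gives a CPI of 1.5, meaning a meaningful share of requests are triggering multi-cycle operations.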
The Mali Core Texture Usage Rate chart reports statistics about the types of texture access being made.
The Mali Core Texture Bytes/Cycle chart reports how many bytes need to be fetched from the L2 and from external memory for each filtering cycle. External accesses are particularly energy intensive, so it is recommended to use compressed textures and mipmaps, as well as ensuring good sample locality to reduce cache pressure.
The load/store unit provides generic read/write data memory access, as well as image and atomic access.
The Mali Core Load/Store Cycles chart reports the access types made to the load/store cache. Reads and Writes are either "full" or "short", where a short access does not make full use of the available data bus width. Reducing the number of short accesses by making vectorized memory accesses, and accessing adjacent data in spatially adjacent threads in a compute shader, can help improve performance.
The Mali Core Load/Store Bytes/Cycle chart reports how many bytes need to be fetched from the L2 and from external memory per read, and how many bytes are written per write. Interpreting this counter can be difficult without knowledge of the algorithms being used, but it can be useful for investigating compute shader performance where the memory access and data usage pattern is known. For example, identifying content where most loads are coming from external memory rather than the L2 cache could indicate that the working set is too large.
The final set of charts show the memory bandwidth generated by a shader core, broken down by the unit which is generating the traffic.
For content which has an overall bandwidth problem, these counters can help identify which data resource is responsible for generating the most traffic.
This article has shown how to perform an initial performance review of a graphical application using an Arm CPU and a Mali GPU, using performance counter information to identify dominant workloads and possible causes of performance slowdown. Predefined templates built into Streamline can be used to quickly and efficiently capture the necessary counters to support a methodical review workflow, allowing a step-by-step review of the various blocks and behaviors in the design.