Mali Bifrost Family Performance Counters

October 19, 2016

41 minute read time.

Analysis and optimization of graphics and compute content running on a GPU is an important task when trying to build a top quality system integration, or a compelling high performance application. For developers working with the public APIs, such as OpenGL ES and OpenCL, the GPU is a black box which is very difficult to analyze based solely on the API visible behaviors. Frame pipelining and asynchronous processing of submitted work effectively decouple the application’s visible performance from the API calls which define the workload being executed, making analysis of performance an activity based on expert knowledge and intuition rather than direct measurement.

Tools such as ARMDS-5 Streamline provide developers access to the GPU hardware performance counters, the principle means to determine the behavior inside the black box beneath the API and identify any problem areas which need optimization. This work guide assumes that DS-5 Streamline is the tool being used for performance analysis, and follows the DS-5 naming conventions for the counters.

1 Performance Counter Infrastructure

The Bifrost GPU family supports many performance counters which can all be captured simultaneously. Performance counters are provided for each functional block in the design:

Job Manager
Tiler
Shader core(s)
L2 cache(s)

See my earlier blog series for an introduction to the Bifrost GPU architecture - it introduces some of the fundamental concepts which are important to understand, and which place the more detailed information in this document in context.

1.1 Supported Counters

The GPUs in the Bifrost family implement a large number of performance counters natively in the hardware, and it is also generally useful to generate some derived counters by combining one or more of the raw hardware counters in useful and interesting ways. This document will describe all of the counters exported from DS-5 Streamline, and some of the useful derived counters which can be derived from them. DS-5 Streamline allows custom performance counter graphs to be created using equations, so all of these performance counters can be directly visualized in the GUI.

1.2 Counter Implementation Caveats

The hardware counter implementation in the GPU is designed to be low cost, such that it has minimal impact on performance and power. Many of the counters are close approximations of the behavior described in this document in order to minimize the amount of additional hardware logic required to generate the counter signals, so some small deviations from what you may expect may be encountered.

2 Job Manager Counters

This section describes the counters implemented by the Mali Job Manager component.

2.1 Top Level Activity

These counters provide information about the overall number of cycles that the GPU was processing a workload, or waiting for software to handle workload completion interrupts.

2.1.1 JM.GPU_ACTIVE

Availability: All

This counter increments every cycle that the GPU either has any workload queued in a Job slot. Note that this counter will increment any cycle a workload is present even if the GPU is totally stalled waiting for external memory to return data; that is still counted as active time even though no forward progress was made.

2.1.2 JM.GPU_UTILIZATION (Derived)

Availability: All

If the GPU operating frequency is known then overall GPU utilization can be calculated as:

JM.GPU_UTILIZATION = JM.GPU_ACTIVE / GPU_MHZ

Well pipelined applications which are not running at vsync and keeping the GPU busy should achieve a utilization of around 98%. Lower utilization than this typically indicates one of the following scenarios:

Content running at vsync.
- In this scenario the GPU goes idle as it has no need to run until next vsync signal.
Content which is bottlenecked by the CPU.
- In this scenario the application or driver is causing high CPU load, and cannot build new workloads for the GPU quickly enough to keep it busy.
Content which is oscillating between CPU and the GPU activity.
- In this scenario the application is using APIs which break the frame-level pipeline needed to keep the GPU busy. The most common causes are calls to glReadPixels() or glFinish(), as these explicitly drain the pipeline, but other API calls can cause stalls if used in a blocking manner before their result is ready. These include calls such as glClientWaitSync(), glWaitSync(), or glGetQueryObjectuiv().

Collecting GPU activity and CPU activity as part of the same DS-5 Streamline data capture can help disambiguate between the cases above. This type of analysis is explored in more detail in my blog on Mali performance.

It is important to note that most modern devices support Dynamic Voltage and Frequency Scaling (DVFS) to optimize energy usage, which means that the GPU frequency is often not constant while running a piece of content. It is recommended that platform DVFS is disabled, locking the CPU, GPU and memory bus at a fixed frequency, if possible as it makes performance analysis much easier, and results more reproducible. The method for doing this is device specific, and many not be possible at all on production devices; please refer to your platform's documentation for details.

2.1.3 JM.JS0_ACTIVE

Availability: All

This counter increments every cycle that the GPU has a Job chain running in Job slot 0. This Job slot is used solely for the processing of fragment Jobs, so this corresponds directly to fragment shading workloads.

For most graphics content there are orders of magnitude more fragments than vertices, so this Job slot will usually be the dominant Job slot which has the highest processing load. In content which is not hitting vsync and the GPU is the performance bottleneck, it is normal for JS0_ACTIVE to be approximately equal to GPU_ACTIVE. In this scenario vertex processing can run in parallel to the fragment processing, allowing fragment processing to run all of the time.

2.1.4 JM.JS0_UTILIZATION (Derived)

Availability: All

The percentage JS0 utilization can be calculated as:

JM.JS0_UTILIZATION = JM.JS0_ACTIVE / JM.GPU_ACTIVE

In content which is not hitting vsync and the GPU is the performance bottleneck it is normal for this utilization metric to be close to 1.0 (100%). Fragment processing is normally the dominant workload, and a utilization of close to 100% shows that vertex processing is running in parallel to the fragment processing, allowing maximum utilization of the functional units in the hardware.

2.1.5 JM.JS1_ACTIVE

Availability: All

This counter increments every cycle the GPU has a Job chain running in Job slot 1. This Job slot can be used for compute shaders, vertex shaders, and tiling workloads. This counter cannot disambiguate between these workloads.

2.1.6 JM.JS1_UTILIZATION (Derived)

Availability: All

The percentage JS1 utilization can be calculated as:

JM.JS1_UTILIZATION = JM.JS1_ACTIVE / JM.GPU_ACTIVE

2.1.7 JM.IRQ_ACTIVE

Availability: All

This counter increments every cycle the GPU has an interrupt pending, awaiting handling by the driver running on the CPU. Note that this does not necessarily indicate lost performance because the GPU can still process Job chains from other Job slots, as well as process the next work item in the interrupt generating Job slot, while an interrupt is pending.

If a high JM.IRQ_ACTIVE cycle count is observed alongside other counters which make it look like the GPU is starving for work, such as a low SC.COMPUTE_ACTIVE and SC.FRAG_ACTIVE, this may indicate a system performance issue. Possible causes include:

A system where the CPU is fully utilized, causing a delay in scheduling IRQ handlers.
A system where a device driver, which may not be the Mali device driver, has IRQs masked for a long period of time, stopping the CPU receiving new interrupt notifications.
Processing a very high number of small framebuffers or small compute workloads, resulting in a high frequency of job completion interrupts to the CPU.

2.2 Task Dispatch

This section looks at the counters related to how the Job Manager issues work to shader cores.

2.2.1 JM.JS0_TASKS

Availability: All

This counter increments every time the Job Manager issues a task to a shader core. For JS0 these tasks correspond to a single 32x32 pixel screen region, although not all of these pixels may be rendered due to viewport or scissor settings.

2.2.2 JM.PIXEL_COUNT (Derived)

Availability: All

A approximation of the total scene pixel count can be computed as:

JM.PIXEL_COUNT = JM.JS0_TASKS * 32 * 32

3 Shader Core Counters

This section describes the counters implemented by the Mali Shader Core. For the purposes of clarity this section talks about either fragment workloads or compute workloads. Vertex, Geometry, and Tessellation workloads are treated as a one dimensional compute problem by the shader core, so are counted as a compute workload from the point of view of the counters in this section.

The GPU hardware records separate counters per shader core in the system. DS-5 Streamline shows the average of all of the shader core counters.

3.1 Shader Core Activity

These counters show the total activity level of the shader core.

3.1.1 SC.COMPUTE_ACTIVE

Availability: All

This counter increments every cycle at least one compute task is active anywhere inside the shader core, including the fixed-function compute frontend, or the programmable execution core.

3.1.2 SC.FRAG_ACTIVE

Availability: All

This counter increments every cycle at least one fragment task is active anywhere inside the shader core, including the fixed-function fragment frontend, the programmable execution core, or the fixed-function fragment backend.

3.1.3 SC.EXEC_CORE_ACTIVE

Availability: All

This counter increments every cycle at least one quad is active inside the programmable execution core. Note that this counter does not give any idea of total utilization of the shader core resources, but simply gives an indication that something was running.

3.1.4 SC.EXEC_CORE_UTILIZATION (Derived)

Availability: All

An approximation of the overall utilization of the execution core can be determined using the following equation:

SC.EXEC_CORE_UTILIZATION = SC.EXEC_CORE_ACTIVE / JM.GPU_ACTIVE

A low utilization of the execution core indicates possible lost performance, as there are spare shader core cycles which could be used if they could be accessed. There are multiple possible root causes of low utilization. The most common cause is content with a significant number tiles which do not require any fragment shader program to be executed. This may occur because:

Screen regions are simply a clear color and contain no drawn geometry.
Screen regions contain significant amount of geometry which only does a depth/stencil update and this update can be entirely resolved at the point of early-zs, prior to fragment shading.

Other causes include:

Screen regions containing a high level of front-to-back geometry, resulting in one layer being drawn and multiple redundant layers being killed by early-zs. If the cost of loading the redundant geometry exceeds the cost of shading the visible layer then some idle time will be observed.
Screen regions containing a high level of opaque back-to-front geometry, resulting in one layer being drawn and multiple redundant layers being killed by Forward Pixel Kill (FPK); see Killing Pixels - A New Optimization for Shading on ARM Mali GPUs.

3.2 Compute Frontend Events

These counters show the task and thread issue behavior of the shader core's fixed function compute frontend which issues work into the programmable core.

3.2.1 SC.COMPUTE_QUADS

Availability: All

This counter increments for every compute quad spawned by the shader core. One compute quad is spawned for every four work items (compute shaders), vertices (vertex and tessellation evaluation shaders), primitives (geometry shaders), or control points (tessellation control shaders). To ensure full utilization of the four thread capacity of a quad any compute workgroups should be a multiple of four in size.

3.2.2 SC.COMPUTE_QUAD_CYCLES (Derived)

Availability: All

This counter calculates an average compute cycles per compute quad, giving some measure of the per-quad processing load.

SC.COMPUTE_QUAD_CYCLES = SC.COMPUTE_ACTIVE / SC.COMPUTE_QUADS

Note that in most cases the dominant cost here is the programmable code running on the execution core, and so there will be some cross-talk caused by compute and fragment workloads running concurrently on the same hardware. This counter is therefore indicative of cost, but does not reflect precise costing.

3.3 Fragment Frontend Events

These counters show the task and thread issue behavior of the shader core's fixed-function fragment frontend. This unit is significantly more complicated than the compute frontend, so there are a large number of counters available.

3.3.1 SC.FRAG_PRIMITIVES_RAST

Availability: All

This counter increments for every primitive entering the frontend fixed-function rasterization stage; these primitives are guaranteed to be inside the current tile being rendered.

Note that this counter will increment once per primitive per tile in which that primitive is located. If you wish to know the total number of primitives in the scene without factoring in tiling effects see the Tiler block's primitive counters.

3.3.2 SC.FRAG_QUADS_RAST

Availability: All

This counter increments for every 2x2 pixel quad which is rasterized by the rasterization unit. The quads generated have at least some coverage based on the current sample pattern, but may subsequently be killed by early depth and stencil testing and as such never issued to the programmable core.

3.3.3 SC.FRAG_QUADS_EZS_TEST

Availability: All

This counter increments for every 2x2 pixel quad which is subjected to ZS testing. We want as many quads as possible to be subject to early ZS testing as it is significantly more efficient than late ZS testing, which will only kill threads after they have been fragment shaded.

3.3.4 SC.FRAG_QUADS_EZS_UPDATE

Availability: All

This counter increments for every 2x2 pixel quad which has completed an early ZS update operation. Quads which have a depth value which depends on shader execution, or which have indeterminate coverage due to use of discard statements in the shader or the use of alpha-to-coverage, may be early ZS tested but cannot do an early ZS update.

3.3.5 SC.FRAG_QUADS_EZS_KILLED

Availability: All

This counter increments for every 2x2 pixel quad which is completely killed by early ZS testing. These killed quads will not generate any further processing in the shader core.

3.3.6 SC.FRAG_QUADS_KILLED_BY_OVERDRAW (Derived)

Availability: All

This derived counter increments for every 2x2 pixel quad which survives early-zs testing but that is overdrawn by an opaque quad before spawning as fragment shading threads in the programmable core.

SC.FRAG_QUADS_KILLED_BY_OVERDRAW = SC.FRAG_QUADS_RAST - SC.FRAG_QUADS_EZS_KILL - SC.FRAG_QUADS

If a significant percentage of the total rasterized quads are overdrawn, this is indicative that the application is rendering in a back-to-front order which means that the early-zs test is unable to kill the redundant workload. Schemes such as Forward Pixel Kill can minimize the cost, but it is recommended that the application renders opaque geometry front-to-back as early-zs testing provides stronger guarantees of efficiency.

3.3.7 SC.FRAG_QUADS_OPAQUE

Availability: All

This counter increments for every 2x2 pixel quad which is architecturally opaque – i.e. not using blending, shader discard, or alpha-to-coverage – that survives early-zs testing. Opaque fragments are normally more efficient for the GPU to handle, as only the top opaque layer needs to be drawn, so we recommend ensuring opacity of draw calls whenever possible.

3.3.8 SC.FRAG_QUADS_TRANSPARENT (Derived)

Availability: All

This counter increments for every 2x2 pixel quad which is architecturally transparent – i.e. using blending, shader discard, or alpha-to-coverage – that survives early-zs testing. Note that transparent in this context implies either alpha transparency, or a shader-dependent coverage mask.

SC.FRAG_QUADS_TRANSPARENT = SC.FRAG_QUADS_RAST - SC.FRAG_QUADS_EZS_KILL - SC.FRAG_QUADS_OPAQUE

3.3.9 SC.FRAG_QUAD_BUFFER_NOT_EMPTY

Availability: All

This counter increments every cycle the fragment unit is active, and the pre-pipe buffer contains at least one 2x2 pixel quad waiting to be executed in the execution core. If this buffer drains the frontend will be unable to spawn a new quad if an execution core quad slot becomes free.

If this counter is low relative to SC.FRAG_ACTIVE then the shader core may be running out of rasterized quads to turn in to fragment quads, which can in turn cause low utilization of the functional units in the execution core if the total number of quads active in the execution core drops too far. Possible causes for this include:

Tiles which contain no geometry.
Tiles which contain a lot of geometry which can be dropped at early-zs, either because it is redundant and it is killed, or because it is a simple depth and stencil update which can be resolved without fragment shader execution.
Tiles which contain triangles from a large number of different drawing operations, causing state loading to become a bottleneck. It is recommended that industry best practice, such as draw batching, is used to minimize the number of unique drawing operations present in a frame.

3.3.10 SC.FRAG_QUADS

Availability: All

This counter increments for every fragment quad created by the GPU.

In most situations a single quad contains threads for four fragments spanning a 2×2 pixel region of the screen. If an application is rendering to a multi-sampled render target with GL_SAMPLE_SHADING enabled then shader evaluation is per-sample rather than per pixel and one fragment thread will be generated for example sample point covered. For example, an 8xMSAA render target using sample rate shading will generate two fragment quads per screen pixel covered by the primitive.

3.3.11 SC.FRAG_PARTIAL_QUADS

Availability: All

This counter increments for every fragment quad which contains at least one thread slot which has no sample coverage, and is therefore indicative of lost performance. Partial coverage in a 2×2 fragment quad will occur if its sample points span the edge of a triangle, or if one or more sample points fail an early-zs test.

3.3.12 SC.FRAG_PARTIAL_QUAD_PERCENTAGE (Derived)

Availability: All

This counter calculates an percentage of spawned quads that have partial coverage.

SC.FRAG_PARTIAL_QUAD_PERCENTAGE = SC.FRAG_PARTIAL_QUADS / SC.FRAG_QUADS

A high percentage of partial quads indicates possible problems with meshes containing high numbers of small triangles; the ratio of the total edge length of a primitive to the screen area of a primitive increases as primitives shrink, so quads which span primitive edges become more common.

Partial coverage issues can be reduced by using object meshes which contain larger triangles. One common optimization technique which helps reduce the frequency of microtriangles is the use of dynamic model level of detail selection. In these schemes, each object mesh is generated at multiple detail levels during content generation, and an appropriate mesh is chosen per draw call based on the distance between the object and the camera. The further the object is from the camera, the lower the selected mesh complexity needs to be.

3.3.13 SC.FRAG_QUAD_CYCLES (Derived)

Availability: All

This counter calculates an average fragment cycles per fragment quad, giving some measure of the per-quad processing cost.

SC.FRAG_QUAD_CYCLES = SC.FRAG_ACTIVE / SC.FRAG_QUADS

Note that in most cases the dominant cost here is the programmable code running on the execution core, so there will be some cross-talk caused by compute and fragment workloads running concurrently on the same hardware. This counter is therefore indicative of cost, but does not reflect precise costing.

3.4 Fragment Backend Events

These counters record the fragment backend behavior.

3.4.1 SC.FRAG_THREADS_LZS_TEST

Availability: All

This counter increments for every thread triggering late depth and stencil (ZS) testing.

3.4.2 SC.FRAG_THREADS_LZS_KILLED

Availability: All

This counter increments for every thread killed by late ZS testing. These threads are killed after their fragment program has executed, so a significant number of threads being killed at late ZS implies a significant amount of lost performance and/or wasted energy performing rendering which has no useful visual output.

The main causes of threads using late-zs are:

Fragment shader programs using explicit discard statements
Fragment shader programs using implicit discard (alpha-to-coverage).
Fragment shader programs with side-effects on shared resources, such as shader storage buffer objects, images, or atomics.

3.4.3 SC.FRAG_NUM_TILES

Availability: All

This counter increments for every tile rendered. The size of a physical tile can vary from 16×16 pixels (largest) downwards. The size of physical tile actually used depends on the number of bytes of memory needed to store the working set for each pixel; the largest tile size allows up to 128-bits per pixel of color storage – enough for a single 32-bit per pixel render target using 4xMSAA, or 4x32-bit per pixel surfaces using multiple-render targets (MRT). Requiring more than that will result in proportionally smaller tile sizes.

The total storage required per pixel depends on the use of:

Multi-sample anti-aliasing (MSAA)
Multiple render targets (MRT)
The size of the attached per-pixel color data format(s)
The use of pixel local storage (PLS); see Pixel Local Storage.

In general the larger tile sizes are more efficient than smaller tile sizes, especially for content with high geometry complexity. This counter cannot be used to directly determine the physical tile sizes used.

3.4.4 SC.FRAG_TILES_CRC_CULLED

Availability: All

This counter increments for every physical rendered tile which has its writeback cancelled due to a matching transaction elimination CRC hash. If a high percentage of the tile writes are being eliminated this implies that you are re-rendering the entire screen when not much has changed, so consider using scissor rectangles to minimize the amount of area which is redrawn. This isn't always easy, especially for window surfaces which are pipelines using multiple buffers, but EGL extensions such as these may be supported on your platform which can help manage the partial frame updates:

3.5 Execution Engine Events

These counters look at the behavior of the arithmetic execution engine.

3.5.1 SC.EE_INSTRS

Availability: All

This counter increments for every arithmetic instruction architecturally executed for a quad in an execution engine. This counter is normalized based on the number of execution engines implemented in the design, so gives the per engine performance, rather than the total executed application workload.

3.5.2. SC.EE_UTILIZATION (Derived)

Availability: All

The peak performance is one arithmetic instruction per engine per cycle, so the effective utilization of the arithmetic hardware can be computed as:

SC.EE_UTILIZATION = SC.EE_INSTRS / SC.EXEC_CORE_ACTIVE

3.5.3 SC.EE_INSTRS_DIVERGED

Availability: All

This counter increments for every arithmetic instruction architecturally executed where there is control flow divergence in the quad resulting in at least one lane of computation being masked out. Control flow divergence erodes arithmetic execution efficiency because it implies some arithmetic lanes are idle, so should be minimized when designing shader effects.

3.6 Load/Store Cache Events

These counters look at the behavior of the load/store pipe.

3.6.1 SC.LSC_READS_FULL

Availability: All

This counter increments for every LS cache access executed which returns 128-bits of data.

3.6.2 SC.LSC_READS_SHORT

Availability: All

This counter increments for every LS cache access executed which returns less than 128-bits of data.

Full width data loads make best use of the cache, so where possible efficiency can be improved by merging short loads together.

Maximize data locality in attributes, varyings, and uniform buffers within a thread, for example packing data into adjacent vector elements and structure fields.
Minimize the amount of unused data in vector data types, uniform buffer control structures, and interleaved vertex attribute buffers.
Write compute shaders so adjacent threads sharing a quad access adjacent addresses in memory, allowing multiple loads to return data from the same cache line.

3.6.3 SC.LSC_WRITES_FULL

Availability: All

This counter increments for every LS cache access executed which writes 128-bits of data.

3.6.4 SC.LSC_WRITES_SHORT

Availability: All

This counter increments for every LS cache access executed which writes less than 128-bits of data.

Full width data writes make best use of the cache, so where possible efficiency can be improved by merging short writes together. See LS_READ_SHORT section for advice on how this can be achieved.

3.6.5 SC.LSC_ATOMICS

Availability: All

This counter increments for atomic operation issued to the LS cache.

3.6.6 SC.LSC_ISSUES (Derived)

Availability: All

This counter counts the total number of load/store cache access operations issued. Each operation is executed with single cycle throughput, but latency of response depends on cache hit rate and external memory system performance.

SC.LSC_ISSUES = SC.LSC_READS_FULL + SC.LSC_READS_SHORT + 
                SC.LSC_WRITES_FULL + SC.LSC_WRITES_SHORT +
                SC.LSC_ATOMICS

3.6.7 SC.LSC_UTILIZATION (Derived)

Availability: All

The utilization of the load/store cache can be determined as:

SC.LSC_UTILIZATION = SC.LSC_ISSUES / SC.EXEC_CORE_ACTIVE

3.6.8 SC.LSC_READ_BEATS

Availability: All

This counter increments for every 16 bytes of data fetched from the L2 memory system.

3.6.9 SC.LSC_L2_BYTES_PER_ISSUE (Derived)

Availability: All

The average number of bytes read from the L2 cache per load/store L1 cache access can be given as.

SC.LSC_L2_BYTES_PER_ISSUE = (SC.LSC_READ_BEATS * 16) / SC.LSC_ISSUES

This gives some idea of level one cache efficiency, although does require some knowledge of how the application is using non-texture data to interpret. For example some use cases expect to have good cache hit rates and reuse the same data many times from different threads, whereas other use cases are data streaming use cases are use each data item exactly once.

3.6.10 SC.LSC_READ_BEATS_EXTERNAL

Availability: All

This counter increments for every 16 bytes of data fetched from the L2 memory system which missed in the L2 cache and required a fetch from external memory.

3.6.11 SC.LSC_EXTERNAL_BYTES_PER_ISSUE (Derived)

Availability: All

The average number of bytes read from the external memory interface per load/store L1 cache access can be given as.

SC.LSC_EXTERNAL_BYTES_PER_ISSUE = (SC.LSC_READ_BEATS_EXTERNAL * 16) / SC.LSC_ISSUES

This gives some idea of level two cache efficiency, although does require some knowledge of how the application is using non-texture data to interpret. For example some use cases expect to have good cache hit rates and reuse the same data many times from different threads, whereas other use cases are data streaming use cases are use each data item exactly once.

3.6.12 SC.LSC_WRITE_BEATS

Availability: All

This counter increments for every 16 bytes of data written to the L2 memory system.

3.7 Texture Pipe Events

This counter set looks at the texture pipe behavior.

Note: The texture pipe event counters increment per thread (fragment), not per quad.

3.7.1 SC.TEX_INSTRS

Availability: All

This counter increments for every architecturally executed texture instruction.

3.7.2 SC.TEX_ISSUES

Availability: All

This counter increments for every texture issue cycle used. Some instructions take more than one cycle due to multi-cycle data access and filtering operations:

2D bilinear filtering takes one cycle.
2D trilinear filtering takes two cycles.
3D bilinear filtering takes two cycles.
3D trilinear filtering takes four cycles.
Sampling from multi-plane YUV may take multiple cycles on some implementations.
Sampling from wide (>= 16-bit per channel) textures may take multiple cycles.

Note: sampling from a depth texture only requires a single channel to be returned and so only takes a single cycle, even though it would otherwise qualify as a wide data format.

3.7.3 SC.TEX_UTILIZATION (Derived)

Availability: All

The texture unit utilization is computed as:

SC.TEX_UTILIZATION = SC.TEX_ISSUES / SC.EXEC_CORE_ACTIVE

3.7.4 SC.TEX_CPI (Derived)

Availability: All

The average cycle usage of the texture unit per instruction can be computed as:

SC.TEX_CPI = SC.TEX_ISSUES / SC.TEX_INSTRS

The best case CPI is 1.0; CPI above 1.0 implies the use of multi-cycle texture instructions. The following counters give a direct view of two of the sources of multi-cycle texture operations:

SC.TEX_INSTR_3D (see )
SC.TEX_INSTR_TRILINEAR (see section)

If both of these counter sources are zero then the third source of multi-cycle operations (for which a direct counter does not exist) is accesses to wide channel texture formats such as the OpenGL ES 3.x 16-bit and 32-bit per channel integer and floating point formats, or multi-plane YUV formats.

3.7.5 SC.TEX_INSTR_3D

Availability: All

This counter increments for every architecturally executed texture instruction which is accessing a 3D texture. These will take at least two cycles to process, and may take four cycles if trilinear filtering is used.

3.7.6 SC.TEX_INSTR_TRILINEAR

Availability: All

This counter increments for every architecturally executed texture instruction which is using a trilinear (GL_LINEAR_MIPMAP_LINEAR) minification filter. These will take at least two cycles to process, and may take four cycles if a 3D texture is being sampled from.

In content which is texture filtering throughput limited, switching from trilinear filtering to bilinear filtering (GL_LINEAR_MIPMAP_NEAREST) may improve performance.

3.7.7 SC.TEX_INSTR_MIPMAP

Availability: All

This counter increments for every architecturally executed texture instruction which is accessing a texture which has mipmaps enabled. Mipmapping provides improved 3D texturing quality, as it provides some pre-filtering for minified texture samples, and also improves performance as it reduces pressure on texture caches. It is highly recommended that mipmapping is used for all 3D texturing operations reading from static input textures.

3.7.8 SC.TEX_INSTR_COMPRESSED

Availability: All

This counter increments for every architecturally executed texture instruction which is accessing a texture which is compressed, including both application-level texture compression such as ETC and ASTC, as well as internal texture compression such as AFBC framebuffer compression. Texture compression can significantly improve performance due to reduced pressure on the texture data caches and external memory system. It is recommended that all input assets from the application use compression whenever it is possible to do so.

3.7.9 SC.TEX_READ_BEATS

Availability: All

This counter increments for every 16 bytes of texture data fetched from the L2 memory system.

3.7.10 SC.TEX_L2_BYTES_PER_ISSUE (Derived)

Availability: All

The average number of bytes read from the L2 cache per texture L1 cache access can be given as.

SC.TEX_L2_BYTES_PER_ISSUE = (SC.TEX_READ_BEATS * 16) / SC.TEX_ISSUES

This gives some idea of level one cache efficiency, although does require some knowledge of how the application is using texture data to interpret. For example some use cases expect to have good cache hit rates and reuse the same data many times from different threads, whereas other use cases are data streaming use cases are use each data item exactly once.

3.7.11 SC.TEX_READ_BEATS_EXTERNAL

Availability: All

This counter increments for every 16 bytes of texture data fetched from the L2 memory system which missed in the L2 cache and required a fetch from external memory.

3.7.12 SC.TEX_EXTERNAL_BYTES_PER_ISSUE (Derived)

Availability: All

The average number of bytes read from the external memory interface per texture operation can be given as.

SC.TEX_EXTERNAL_BYTES_PER_ISSUE = (SC.TEX_READ_BEATS_EXTERNAL * 16) / SC.TEX_ISSUES

This gives some idea of level two cache efficiency, although does require some knowledge of how the application is using texture data to interpret. For example some use cases expect to have good cache hit rates and reuse the same data many times from different threads, whereas other use cases are data streaming use cases are use each data item exactly once.

3.8 Varying Unit Events

This counter set looks at the varying unit behavior:

3.8.1 SC.VARY_INSTR

Availability: All

This counter increments for every architecturally executed varying unit instruction for a fragment quad.

3.8.2 SC.VARY_ISSUES_16

Availability: All

This counter increments for every architecturally executed cycle of “mediump” 16-bit varying interpolation.

Interpolating mediump fp16 values is twice as fast as interpolating highp fp32 values, so should be used whenever it is suitable. Most use cases which contribute to computing an 8-bit unorm color value can safely use fp16 precision

3.8.3 SC.VARY_ISSUES_32

Availability: All

This counter increments for every architecturally executed cycle of “highp” 32-bit varying interpolation.

Interpolating highp fp32 values is half the performance and twice the bandwidth of interpolating medium fp16 values, so should only be used for cases where the additional floating point precision is necessary. The most common use cases requiring high-precision varyings are texture sampling coordinates, and anything related to accurately computing 3D position in the scene.

3.8.4 SC.VARY_UTILIZATION (Derived)

Availability: All

The utilization of the varying unit can be determined as:

SC.VARY_UTILIZATION = (SC.VARY_ISSUES_16 + SC.VARY_ISSUES_32) / SC.EXEC_CORE_ACTIVE

4 Tiler Counters

The tiler counters provide details of the workload of the fixed function tiling unit, which places primitives into the tile lists which are subsequently read by the fragment frontend during fragment shading.

4.1 Tiler Activity

These counters show the overall activity of the tiling unit.

4.1.1 TI.ACTIVE

Availability: All

This counter increments every cycle the tiler is processing a task. The tiler can run in parallel to vertex shading and fragment shading so a high cycle count here does not necessarily imply a bottleneck, unless the SC.COMPUTE_ACTIVE counter in the shader cores are very low relative to this.

4.2 Tiler Primitive Occurrence

These counters give a functional breakdown of the tiling workload given to the GPU by the application.

4.2.1 TI.PRIMITIVE_POINTS

Availability: All

This counter increments for every point primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.

4.2.2 TI.PRIMITIVE_LINES

Availability: All

This counter increments for every line segment primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.

4.2.3 TI.PRIMITIVE_TRIANGLES

Availability: All

This counter increments for every triangle primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.

4.2.4 TI.INPUT_PRIMITIVES (Derived)

Availability: All

This derived counter contains the total number of primitives entering primitive assembly.

TI.INPUT_PRIMITIVES = TI.PRIMITIVE_POINTS + TI.PRIMITIVE_LINES + TI.PRIMITIVE_TRIANGLES

4.3 Tiler Visibility and Culling Occurrence

These counters give a breakdown of how the workload has been affected by clipping and culling. The culling schemes are applied in the order shown below:

This order impacts the interpretation of the counters in terms of comparing the culling rates against the total number of primitives entering and leaving each stage.

4.3.1 TI.CULLED_FACING

Availability: All

This counter is incremented for every primitive which is culled due to the application of front-face or back-face culling rules. For most meshes approximately half of the triangles are back facing so this counter should typically be similar to the visible primitives, although lower is always better.

4.3.2 TI.CULLED_FRUSTUM

Availability: All

This counter is incremented for every primitive which is culled due to being totally outside of the clip-space volume. Application-side culling should be used to minimize the amount of out-of-shot geometry being sent to the GPU as it is expensive in terms of bandwidth and power. One of my blogs looks at application side culling in more detail

4.3.3 TI.CULLED_COVERAGE

Availability: All

This counter is incremented for every microtriangle primitive which is culled due to lack of any coverage of active sample points.

4.3.4 TI.PRIMITIVE_VISIBLE

Availability: All

This counter is incremented for every primitive which is visible, surviving all types of culling which are applied.

Note: Visible in this context simply means that a primitive is inside the viewing frustum, facing in the correct direction, and has at least some sample coverage. Primitives which are visible at this stage still may generate no rendered fragments; for example ZS testing during fragment processing may determine that a primitive is entirely occluded by other primitives.

4.3.5 TI.CULLED_FACING_PERCENT (Derived)

Availability: All

This counter determines the percentage of primitive inputs into the facing test which are culled by it.

TI.CULLED_FACING_PERCENT = TI.CULLED_FACING / TI_INPUT_PRIMITIVES

In typical 3D content it is expected that approximately half of the input primitives will be culled by the facing tests, as the side of a model which is facing away from the camera is not visible and can be dropped without fragment shading. If a low percentage of primitives are culled by the facing tests in a 3D application this implies that the application may not be enabling the back-face test for everything which could benefit from it; check the application draw calls for opaque objects are enabling GL_CULL_FACE correctly.

4.3.6 TI.CULLED_FRUSTUM_PERCENT (Derived)

Availability: All

This counter determines the percentage of primitive inputs into the frustum test which are culled by it.

TI.CULLED_FRUSTUM_PERCENT = TI.CULLED_FRUSTUM / (TI.INPUT_PRIMITIVES - TL.CULLED_FACING)

One of the most important optimizations an application can perform is efficiently culling objects which are outside of the visible frustum, as these optimizations can be applied quickly by exploiting scene knowledge such as object bounding volume checks (see Mali Performance 5: An Application's Performance Responsibilities for more information on application culling techniques). It is expected that some triangles will be outside of the frustum – CPU culling is normally approximate, and some objects may be span frustum boundary – but this should be minimized as it indicates that redundant vertex processing is occurring.

4.3.7 TI.CULLED_COVERAGE_PERCENT (Derived)

Availability: All

This counter determines the percentage of primitive inputs into the coverage test which are culled by it.

TI.CULLED_COVERAGE_PERCENT = TI.CULLED_FRUSTUM / (TI.INPUT_PRIMITIVES - TL.CULLED_FACING - TI.CULLED_FRUSTUM)

A significant number of triangles being culled due to the coverage test indicates that the application is using very dense models which are producing small microtriangles; even if the triangles which produce no coverage are killed it is expected that there will also be a number of visible triangles which cover a small number of sample points, which are still disproportionately expensive to process relative to their screen coverage.

Microtriangles are expensive to process for a number of reasons.

On mobile devices they are most expensive due to the bandwidth cost they incur. The vertex shader has to read the vertex attributes and write the varyings, and the fragment shader has to read and interpolate the varyings, which are typically bulky floating point vector data types. For example, the simplest vertex consisting of only a vec4 fp32 position attribute requires two 128-bit reads and one 128-bit write, a total of 24 bytes of memory bandwidth used. The cost of the vertex bandwidth is amortized over the number of fragments that a triangle generates. A triangle covering 50 pixels will effectively cost 0.5 bytes per pixel in terms of vertex bandwidth, which is equivalent of the cost of a single ETC2 compressed texture fetch. A microtriangle covering a two pixels will cost 12 bytes per pixel, and is therefore likely to generate stalls on the memory system.

Note: This example is for the “best case” microtriangle consisting of only of a position; most real applications will also have additional per-vertex attributes, such as vertex normals and texture coordinates. Applications loading between 50 and 100 bytes of input data per vertex are common.

Fragment workloads are always spawned as 2×2 pixel quads; quads which span the edges of a triangle may contain partial sample coverage, in which one or more of the fragments in the quad does not contribute to the final render, but which costs some performance to process. Microtriangles cause an increase in partial quads, as there are more edges per unit area shaded. The shader core counter SC.FRAG_PARTIAL_QUADS (see section 3.3.11) may provide additional evidence of the existence of microtriangles.

4.3.8 TI.FRONT_FACING

Availability: All

This counter is incremented for every triangle which is front-facing. This counter is incremented after culling, so only counts visible primitives which are actually emitted into the tile list.

This counter is not directly useful for performance profiling as there is no rendering performance difference between front-facing and back-facing triangles, but is useful used for debugging culling related logic problems, and stencil test logic problems as stencil testing can do different things for front-facing and back-facing triangles.

4.3.9 TI.BACK_FACING

Availability: All

This counter is incremented for every triangle which is back-facing. This counter is incremented after culling, so only counts visible primitives which are actually emitted into the tile list.

If you are not using back-facing triangles for some special algorithmic purpose, such as Refraction Based on Local Cubemaps, then a high value here relative to the total number of triangles may indicate that the application has forgotten to turn on back-face culling. For most opaque geometry no back facing triangles should be expected.

4.4 Shading Requests

These counters track the workload requests for the Index-Driver Vertex Shading pipeline, one of the new features introduced in the Bifrost GPU architecture.

4.4.1 TI.IDVS_POSITION_SHADING_REQUEST

Availability: All

This counter is incremented for every batch of triangles which have been position shaded. Each batch consists of 4 vertices from sequential index ranges.

4.4.2 TI.IDVS_VARYING_SHADING_REQUEST

Availability: All

This counter is incremented for every batch of triangles which have been varying shaded. Each batch consists of 4 vertices from sequential index ranges.

5 L2 Cache Counters

This section documents the behavior of the L2 memory system counters.

In systems which implement multiple L2 caches or bus interfaces the counters presented in DS-5 Streamline are the sum of the counters from all of the L2 counter blocks present, as this gives the aggregate memory system usage.

All derivations in this document are computations per slice, so it may be necessary to divide these by the number of cache slices present in your design when using user-level equations in DS-5 Streamline.

5.1 Internal Cache Usage

These counters profile the internal use of the L2 cache versus the available cycle capacity.

5.1.1 L2.ANY_LOOKUP

Availability: All

The counter increments for any L2 read or write request from an internal master, or snoop request from an internal or external master.

5.1.2 L2.INTERNAL_UTILIZATION (Derived)

Availability: All

Each L2 cache slice can process a single read, write, or snoop operation per clock cycle. The internal utilization of the L2 cache by the processing masters in the system can be determined via the equation:

LS.INTERNAL_UTILIZATION = L2.ANY_LOOKUP / JM.GPU_ACTIVE

5.2 Internal Traffic Profile

These counters profile the internal read traffic into the L2 cache from the various internal masters.

5.2.1 L2.READ_REQUEST

Availability: All

The counter increments for every read transaction received by the L2 cache.

5.2.2 L2.EXTERNAL_READ_REQUEST

Availability: All

The counter increments for every read transaction sent by the L2 cache to external memory.

5.2.3 L2.READ_MISS_RATE (Derived)

Availability: All

The counter gives an indication of the number of reads which are missing and being sent on the L2 external interface to main memory.

L2.READ_MISS_RATE = L2.EXTERNAL_READ_REQUEST / L2.READ_REQUEST

5.2.4 L2.WRITE_REQUEST

Availability: All

The counter increments for every write transaction received by the L2 cache.

5.2.5 L2.EXTERNAL_WRITE_REQUEST

Availability: All

The counter increments for every write transaction sent by the L2 cache to external memory.

5.2.6 L2.WRITE_MISS_RATE (Derived)

Availability: All

The counter gives an indication of the number of writes which are missing and being sent on the L2 external interface to main memory.

L2.WRITE_MISS_RATE = L2.EXTERNAL_WRITE_REQUEST / L2.WRITE_REQUEST

Note: In most cases writes to main memory are necessary and not a bad thing, for example writing vertex data to intermediate storage for later use during fragment shading, or when writing back the final color contents of a tile at the end of a frame. A high write miss rate is therefore not necessarily indicative of a performance problem if those writes were always intended to be sent to main memory.

5.3 External Read Traffic Events

These counters profile the external read memory interface behavior. Note that this includes traffic from the entire GPU L2 memory subsystem, not just traffic from the L2 cache, as some types of access will bypass the L2 cache.

5.3.1 L2.EXTERNAL_READ_BEATS

Availability: All

This counter increments on every clock cycle a read beat is read off the external AXI bus.

5.3.2 L2.EXTERNAL_READ_BYTES (Derived)

Availability: All

With knowledge of the bus width used in the GPU the beat counter can be converted into a raw bandwidth counter.

L2.EXTERNAL_READ_BYTES = SUM(L2.EXTERNAL_READ_BEATS * L2.AXI_WIDTH_BYTES)

Note: Most implementations of a Bifrost GPU use a 128-bit (16 byte) AXI interface, but a 64-bit (8 byte) interface is also possible to reduce the area used by a design. This information can be obtained from your chipset manufacturer.

5.3.3 L2.EXTERNAL_READ_UTILIZATION (Derived)

Availability: All

The GPU can issue one read beat per clock per implemented cache slice. The total utilization of the AXI read interface can be determined per cache slice using:

L2.EXTERNAL_READ_UTILIZATION = L2.EXTERNAL_READ_BEATS / SC.GPU_ACTIVE

Note: This utilization metric ignores any frequency changes which may occur downstream of the GPU. If you have, for example, a 600MHz GPU connected to a 300MHz AXI bus of the same data width then it will be impossible for the GPU to achieve more than 50% utilization of its native interface because the AXI bus is unable to provide the data as quickly as the GPU can consume it.

5.3.4 L2.EXTERNAL_READ_STALL

Availability: All

This counter increments every cycle that the GPU is unable to issue a new read transaction to AXI, because AXI is unable to accept the request. If this number is high it may indicate that the AXI bus is suffering from high contention due to accesses from other sources, or that the GPU is clocked faster than the AXI bus it is connected to.

5.3.5 L2 Read Latency Histogram

Availability: All

The L2 interface implements a six entry histogram which tracks the response latency for the external reads. The counter for the sixth level is synthesized from multiple raw counter values.

Histogram Range	Counter Equation
0-127 Cycles	L2.EXT_RRESP_0_127
128-191 Cycles	L2.EXT_RRESP_128_191
192-255 Cycles	L2.EXT_RRESP_192_255
256-319 Cycles	L2.EXT_RRESP_256_319
320-383 Cycles	L2.EXT_RRESP_320_383
> 383 Cycles	L2.EXTERNAL_READ_BEATS - L2.EXT_RRESP_0_127 - L2.EXT_RRESP_128_191 - L2.EXT_RRESP_192_255 - L2.EXT_RRESP_256_319 - L2.EXT_RRESP_320_383

Mali shader cores are designed to tolerate an external read response latency of 170 GPU cycles; systems reporting significantly higher latency than this for a high percentage of transactions will observe some reduction in performance, as the shader core will stall more often waiting for main memory to provide data.

5.3.6 L2 Read Outstanding Transaction Histogram

Availability: All

The L2 interface implements a four entry histogram which tracks the outstanding transaction levels for the external reads. The counter for the fourth level is synthesized from multiple raw counter values.

Histogram Range	Counter Equation
0-25%	L2.EXT_READ_CNT_Q1
25-50%	L2.EXT_READ_CNT_Q2
50-75%	L2.EXT_READ_CNT_Q3
75%-100%	L2.EXTERNAL_READ - L2.EXT_READ_CNT_Q1 - L2.EXT_READ_CNT_Q2 - L2.EXT_READ_CNT_Q3

The number of currently outstanding transactions gives some idea of how many concurrent memory requests the shader core has queued on the AXI bus. This will not directly cost performance unless we completely run out of transactions; content with a high percentage of transactions in the 75-100% range may be losing performance because it is unable to construct new requests to be sent onto the AXI interface.

Note: The maximum number of outstanding transactions available is a synthesis time option when implementing the GPU. The total number of outstanding transaction count should be selected to ensure that the GPU can keep data requests queued on the external DDR controller. In a system with 170 cycles of read response latency, and a typical transaction size of 4 data beats, at least 170/4 (42) outstanding transactions are required.

5.4 External Write Traffic Events

These counters profile the external write memory interface behavior. Note that this includes traffic from the entire GPU L2 memory subsystem, not just traffic from the L2 cache, as some types of access will bypass the L2 cache.

5.4.1 L2.EXTERNAL_WRITE_BEATS

Availability: All

This counter increments on every clock cycle a write beat is read off the external AXI bus.

5.4.2 L2.EXTERNAL_WRITE_BYTES (Derived)

Availability: All

With knowledge of the bus width used in the GPU the beat counter can be converted into a raw bandwidth counter.

L2.EXTERNAL_WRITE_BYTES = SUM(L2.EXTERNAL_WRITE_BEATS * L2.AXI_WIDTH_BYTES)

5.4.3 L2.EXTERNAL_WRITE_UTILIZATION (Derived)

Availability: All

The GPU can issue one read beat per clock per implemented cache slice. The total utilization of the AXI write interface can be determined per cache slice using:

L2.EXTERNAL_WRITE_UTILIZATION = L2.EXTERNAL_WRITE_BEATS / SC.GPU_ACTIVE

5.4.4 L2.EXTERNAL_WRITE_STALL

Availability: All

This counter increments every cycle that the GPU is unable to issue a new write transaction to AXI, because AXI is unable to accept the request. If this number is high it may indicate that the AXI bus is suffering from high contention due to accesses from other sources, or that the GPU is clocked faster than the AXI bus it is connected to.

5.4.5 L2 Write Outstanding Transaction Histogram

Availability: All

The L2 interface implements a four entry histogram which tracks the outstanding transaction levels for the external writes. The counter for the fourth level is synthesized from multiple raw counter values.

Histogram Range	Counter Equation
0-25%	L2.EXT_WRITE_CNT_Q1
25-50%	L2.EXT_WRITE_CNT_Q2
50-75%	L2.EXT_WRITE_CNT_Q3
75%-100%	L2.EXTERNAL_WRITE - L2.EXT_WRITE_CNT_Q1 - L2.EXT_WRITE_CNT_Q2 - L2.EXT_WRITE_CNT_Q3

Note: The maximum number of outstanding transactions available is a synthesis time option when implementing the GPU. The total number of outstanding transaction count should be selected to ensure that the GPU can keep data requests queued on the external DDR controller. In a system with 90 cycles of write response latency, and a typical transaction size of 4 data beats, at least 90/4 (23) outstanding transactions are required.

6 Conclusions

This document has defined all of the Mali Bifrost family performance counters available via DS-5 Streamline, as well as some derived-counters which can be derived from them. Hopefully this provides a useful starting point for your application optimization activity when using Mali GPUs.

We also publish a Mali Application Optimization Guide. You can visit this by clicking on the link below:

[CTAToken URL = "https://developer.arm.com/docs/dui0555/b/introduction/about-optimization" target="_blank" text="Read Mali optimization guide" class ="green"]

descosmos over 3 years ago

Excuse me, I've checked so much resource to get the information about SC.LSC_WRITE_BEATS you mentioned. In this post, it means "This counter increments for every 16 bytes of data written to the L2 memory system".
However, the android document describe BEATS_RD_TEX is that "Texture read beats from L2 cache. This counter increments for every read beat received by the texture unit".

Now I'm confused by the question is that Is read beat equals 32 in Valhall architecture?

Thank you very much.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
ky.wang over 5 years ago

Hi Peter,

Thanks for such great article and help me a lot to understand Bifrost architecture.

Here is a question that I saw this

"For example, the simplest vertex consisting of only a vec4 fp32 position attribute requires two 128-bit reads and one 128-bit write, a total of 24 bytes of memory bandwidth used"

Could you help me to understand why 2x read + 1x write of 32bits operation takes 24 bytes?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
floragui over 5 years ago in reply to chunyi

in G51/G76 TEX_ISSUE is called TEX_FILT_NUM_OPERATIONS and TEX_INSTRUCTION can be calculated by TEX_MSGI_NUM_QUADS*4
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
floragui over 5 years ago in reply to chunyi

Hi Chunyi, the TEX_ISSUE in G76 is called TEX_FILT_NUM_OPERATIONS and TEX_INSTR can be calculated by TEX_MSGI_NUM_QUADS*4
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
chunyi over 5 years ago

Hello, Peter, How to interpret the Texture Unit related counter in G51 / G76 architecture?

It seems the TEX_ISSUE/TEX_Instruction were disappeared. Hence, I don't know how to derive the TEX_UTILIZATION and SC.TEX_CPI. Thanks.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Graphics, Gaming, and VR blog

Arm Performance Studio: A look back, and a look forward

Peter Harris

Arm Performance Studio release 2024.6 release bringing you quality-of-life improvements and bug fixes. Read this blog post for more information about other features in this release.
- December 20, 2024
The future of AI for games

Ian Bolton

Arm sponsored the AI and Games Conference at Goldsmiths in London, read about the day that brought experts and enthusiasts together for talks on the intersection of AI & gaming.
- November 29, 2024
Hidden Surface Removal in Immortalis-G925: The Fragment Prepass

Tord Øygard

Arm's Immortalis and Mali GPUs are energy efficient. In this blog post fragment pre-pass for Arm GPUs is discussed with Immortalis-G925, Mali-G725 & Mali-G625.
- November 28, 2024