Analysis and optimization of graphics and compute content running on a GPU is an important task when trying to build a top quality system integration, or a compelling high performance application. For developers working with the public APIs, such as OpenGL ES and OpenCL, the GPU is a black box which is very difficult to analyze based solely on the API visible behaviors. Frame pipelining and asynchronous processing of submitted work effectively decouple the application’s visible performance from the API calls which define the workload being executed, making analysis of performance an activity based on expert knowledge and intuition rather than direct measurement.
Tools such as ARM DS-5 Streamline provide developers access to the GPU hardware performance counters, the principal means to determine the behavior inside the black box beneath the API and to identify any problem areas which need optimization. This guide assumes that DS-5 Streamline is the tool being used for performance analysis, and follows the DS-5 naming conventions for the counters.
The Bifrost GPU family supports many performance counters, all of which can be captured simultaneously. Performance counters are provided for each functional block in the design.
See my earlier blog series for an introduction to the Bifrost GPU architecture - it introduces some of the fundamental concepts which are important to understand, and which place the more detailed information in this document in context.
The GPUs in the Bifrost family implement a large number of performance counters natively in the hardware, and it is also generally useful to generate derived counters by combining one or more of the raw hardware counters in useful and interesting ways. This document describes all of the counters exported by DS-5 Streamline, and some of the useful counters which can be derived from them. DS-5 Streamline allows custom performance counter graphs to be created using equations, so all of these derived counters can be visualized directly in the GUI.
The hardware counter implementation in the GPU is designed to be low cost, such that it has minimal impact on performance and power. Many of the counters are close approximations of the behavior described in this document in order to minimize the amount of additional hardware logic required to generate the counter signals, so small deviations from the described behavior may be encountered.
This section describes the counters implemented by the Mali Job Manager component.
These counters provide information about the overall number of cycles that the GPU was processing a workload, or waiting for software to handle workload completion interrupts.
Availability: All
This counter increments every cycle that the GPU has any workload queued in a Job slot. Note that this counter will increment on any cycle a workload is present, even if the GPU is totally stalled waiting for external memory to return data; that is still counted as active time even though no forward progress was made.
If the GPU operating frequency is known then overall GPU utilization can be calculated as:
JM.GPU_UTILIZATION = JM.GPU_ACTIVE / GPU_MHZ
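For example, a minimal Python sketch of this calculation, assuming the JM.GPU_ACTIVE delta has been captured over a one second sample window and that DVFS has been disabled so the GPU frequency is fixed (all values below are purely illustrative):

gpu_active_cycles = 480_000_000  # JM.GPU_ACTIVE delta over a one second sample (illustrative)
gpu_mhz = 600                    # fixed GPU clock in MHz (illustrative)
cycles_available = gpu_mhz * 1_000_000
gpu_utilization = gpu_active_cycles / cycles_available
print(f"GPU utilization: {gpu_utilization:.1%}")  # 80.0% for these values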
Well pipelined applications which are not running at vsync and are keeping the GPU busy should achieve a utilization of around 98%. Lower utilization than this typically indicates one of the following scenarios: the content is hitting vsync and the GPU is running out of work, the content is CPU limited so the GPU is starved of work, or the content is serializing the CPU and GPU processing via blocking API calls such as glReadPixels(), glFinish(), glClientWaitSync(), glWaitSync(), or glGetQueryObjectuiv().
Collecting GPU activity and CPU activity as part of the same DS-5 Streamline data capture can help disambiguate between the cases above. This type of analysis is explored in more detail in my blog on Mali performance.
It is important to note that most modern devices support Dynamic Voltage and Frequency Scaling (DVFS) to optimize energy usage, which means that the GPU frequency is often not constant while running a piece of content. If possible, it is recommended that platform DVFS is disabled, locking the CPU, GPU, and memory bus at a fixed frequency, as this makes performance analysis much easier and results more reproducible. The method for doing this is device specific, and may not be possible at all on production devices; please refer to your platform's documentation for details.
This counter increments every cycle that the GPU has a Job chain running in Job slot 0. This Job slot is used solely for the processing of fragment Jobs, so this corresponds directly to fragment shading workloads.
For most graphics content there are orders of magnitude more fragments than vertices, so this Job slot will usually be the dominant Job slot with the highest processing load. In content which is not hitting vsync and where the GPU is the performance bottleneck, it is normal for JS0_ACTIVE to be approximately equal to GPU_ACTIVE. In this scenario vertex processing can run in parallel with the fragment processing, allowing fragment processing to run all of the time.
The percentage JS0 utilization can be calculated as:
JM.JS0_UTILIZATION = JM.JS0_ACTIVE / JM.GPU_ACTIVE
In content which is not hitting vsync and where the GPU is the performance bottleneck, it is normal for this utilization metric to be close to 1.0 (100%). Fragment processing is normally the dominant workload, and a utilization close to 100% shows that vertex processing is running in parallel with the fragment processing, allowing maximum utilization of the functional units in the hardware.
This counter increments every cycle the GPU has a Job chain running in Job slot 1. This Job slot can be used for compute shaders, vertex shaders, and tiling workloads. This counter cannot disambiguate between these workloads.
The percentage JS1 utilization can be calculated as:
JM.JS1_UTILIZATION = JM.JS1_ACTIVE / JM.GPU_ACTIVE
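As an illustration, a small Python sketch which computes both Job slot utilizations from captured counter deltas and flags the case where neither slot is keeping the GPU busy; the counter values and the 90% threshold are illustrative assumptions, not fixed rules:

gpu_active = 500_000_000  # illustrative counter deltas for one sample window
js0_active = 470_000_000  # fragment Job slot
js1_active = 150_000_000  # compute / vertex / tiling Job slot
js0_utilization = js0_active / gpu_active
js1_utilization = js1_active / gpu_active
print(f"JS0: {js0_utilization:.1%}, JS1: {js1_utilization:.1%}")
if max(js0_utilization, js1_utilization) < 0.90:
    # Neither slot dominates; look for CPU-side serialization or API pipeline drains
    print("No Job slot is keeping the GPU busy")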
This counter increments every cycle the GPU has an interrupt pending, awaiting handling by the driver running on the CPU. Note that this does not necessarily indicate lost performance because the GPU can still process Job chains from other Job slots, as well as process the next work item in the interrupt generating Job slot, while an interrupt is pending.
If a high JM.IRQ_ACTIVE cycle count is observed alongside other counters which make it look like the GPU is starved of work, such as a low SC.COMPUTE_ACTIVE and SC.FRAG_ACTIVE, this may indicate a system performance issue. Possible causes include the CPU being slow to service the GPU interrupts, for example due to high CPU load or a low CPU DVFS frequency.
This section looks at the counters related to how the Job Manager issues work to shader cores.
This counter increments every time the Job Manager issues a task to a shader core. For JS0 these tasks correspond to a single 32x32 pixel screen region, although not all of these pixels may be rendered due to viewport or scissor settings.
An approximation of the total scene pixel count can be computed as:
JM.PIXEL_COUNT = JM.JS0_TASKS * 32 * 32
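As a worked example, a short Python sketch estimating the expected JS0 task and pixel counts for a single full-screen 1080p render pass; note that the estimate rounds up to whole 32x32 regions, so it slightly over-counts the true resolution:

import math
width, height = 1920, 1080
js0_tasks = math.ceil(width / 32) * math.ceil(height / 32)  # 60 * 34 = 2040 tasks
pixel_count = js0_tasks * 32 * 32                           # 2,088,960 (vs 2,073,600 real pixels)
print(js0_tasks, pixel_count)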
This section describes the counters implemented by the Mali Shader Core. For the purposes of clarity this section talks about either fragment workloads or compute workloads. Vertex, Geometry, and Tessellation workloads are treated as a one dimensional compute problem by the shader core, so are counted as a compute workload from the point of view of the counters in this section.
The GPU hardware records separate counters per shader core in the system. DS-5 Streamline shows the average of all of the shader core counters.
These counters show the total activity level of the shader core.
This counter increments every cycle that at least one compute task is active anywhere inside the shader core, including the fixed-function compute frontend and the programmable execution core.
This counter increments every cycle that at least one fragment task is active anywhere inside the shader core, including the fixed-function fragment frontend, the programmable execution core, and the fixed-function fragment backend.
This counter increments every cycle at least one quad is active inside the programmable execution core. Note that this counter does not give any idea of total utilization of the shader core resources, but simply gives an indication that something was running.
An approximation of the overall utilization of the execution core can be determined using the following equation:
SC.EXEC_CORE_UTILIZATION = SC.EXEC_CORE_ACTIVE / JM.GPU_ACTIVE
A low utilization of the execution core indicates possible lost performance, as there are spare shader core cycles which could be used if they could be accessed. There are multiple possible root causes of low utilization. The most common cause is content with a significant number of tiles which do not require any fragment shader program to be executed. This may occur because:
Other causes include:
These counters show the task and thread issue behavior of the shader core's fixed function compute frontend which issues work into the programmable core.
This counter increments for every compute quad spawned by the shader core. One compute quad is spawned for every four work items (compute shaders), vertices (vertex and tessellation evaluation shaders), primitives (geometry shaders), or control points (tessellation control shaders). To ensure full utilization of the four thread capacity of a quad, compute workgroup sizes should be a multiple of four.
This derived counter gives the average number of compute cycles per compute quad, giving some measure of the per-quad processing load.
SC.COMPUTE_QUAD_CYCLES = SC.COMPUTE_ACTIVE / SC.COMPUTE_QUADS
Note that in most cases the dominant cost here is the programmable code running on the execution core, and so there will be some cross-talk caused by compute and fragment workloads running concurrently on the same hardware. This counter is therefore indicative of cost, but does not reflect precise costing.
These counters show the task and thread issue behavior of the shader core's fixed-function fragment frontend. This unit is significantly more complicated than the compute frontend, so there are a large number of counters available.
This counter increments for every primitive entering the frontend fixed-function rasterization stage; these primitives are guaranteed to be inside the current tile being rendered.
Note that this counter will increment once per primitive per tile in which that primitive is located. If you wish to know the total number of primitives in the scene without factoring in tiling effects see the Tiler block's primitive counters.
This counter increments for every 2x2 pixel quad which is rasterized by the rasterization unit. The quads generated have at least some coverage based on the current sample pattern, but may subsequently be killed by early depth and stencil testing and as such never issued to the programmable core.
This counter increments for every 2x2 pixel quad which is subjected to ZS testing. We want as many quads as possible to be subject to early ZS testing as it is significantly more efficient than late ZS testing, which will only kill threads after they have been fragment shaded.
This counter increments for every 2x2 pixel quad which has completed an early ZS update operation. Quads which have a depth value which depends on shader execution, or which have indeterminate coverage due to use of discard statements in the shader or the use of alpha-to-coverage, may be early ZS tested but cannot do an early ZS update.
This counter increments for every 2x2 pixel quad which is completely killed by early ZS testing. These killed quads will not generate any further processing in the shader core.
This derived counter increments for every 2x2 pixel quad which survives early-zs testing but is overdrawn by an opaque quad before being spawned as fragment shading threads in the programmable core.
SC.FRAG_QUADS_KILLED_BY_OVERDRAW = SC.FRAG_QUADS_RAST - SC.FRAG_QUADS_EZS_KILL - SC.FRAG_QUADS
If a significant percentage of the total rasterized quads are overdrawn, this is indicative that the application is rendering in a back-to-front order which means that the early-zs test is unable to kill the redundant workload. Schemes such as Forward Pixel Kill can minimize the cost, but it is recommended that the application renders opaque geometry front-to-back as early-zs testing provides stronger guarantees of efficiency.
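A minimal Python sketch computing the overdrawn quads as a percentage of all rasterized quads; the counter values are illustrative:

frag_quads_rast = 4_000_000      # SC.FRAG_QUADS_RAST (illustrative)
frag_quads_ezs_kill = 1_200_000  # SC.FRAG_QUADS_EZS_KILL
frag_quads = 2_500_000           # SC.FRAG_QUADS actually spawned as threads
killed_by_overdraw = frag_quads_rast - frag_quads_ezs_kill - frag_quads
overdraw_ratio = killed_by_overdraw / frag_quads_rast
print(f"Quads lost to overdraw: {overdraw_ratio:.1%}")  # 7.5% for these values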
This counter increments for every 2x2 pixel quad which is architecturally opaque – i.e. not using blending, shader discard, or alpha-to-coverage – that survives early-zs testing. Opaque fragments are normally more efficient for the GPU to handle, as only the top opaque layer needs to be drawn, so we recommend ensuring opacity of draw calls whenever possible.
This counter increments for every 2x2 pixel quad which is architecturally transparent – i.e. using blending, shader discard, or alpha-to-coverage – that survives early-zs testing. Note that transparent in this context implies either alpha transparency, or a shader-dependent coverage mask.
SC.FRAG_QUADS_TRANSPARENT = SC.FRAG_QUADS_RAST - SC.FRAG_QUADS_EZS_KILL - SC.FRAG_QUADS_OPAQUE
This counter increments every cycle the fragment unit is active, and the pre-pipe buffer contains at least one 2x2 pixel quad waiting to be executed in the execution core. If this buffer drains the frontend will be unable to spawn a new quad if an execution core quad slot becomes free.
If this counter is low relative to SC.FRAG_ACTIVE then the shader core may be running out of rasterized quads to turn into fragment quads, which can in turn cause low utilization of the functional units in the execution core if the total number of quads active in the execution core drops too far. Possible causes for this include:
This counter increments for every fragment quad created by the GPU.
In most situations a single quad contains threads for four fragments spanning a 2×2 pixel region of the screen. If an application is rendering to a multi-sampled render target with GL_SAMPLE_SHADING enabled then shader evaluation is per-sample rather than per-pixel, and one fragment thread will be generated for every covered sample point. For example, an 8xMSAA render target using sample rate shading will generate two fragment quads per screen pixel covered by the primitive.
This counter increments for every fragment quad which contains at least one thread slot which has no sample coverage, and is therefore indicative of lost performance. Partial coverage in a 2×2 fragment quad will occur if its sample points span the edge of a triangle, or if one or more sample points fail an early-zs test.
This derived counter gives the percentage of spawned quads which have partial coverage.
SC.FRAG_PARTIAL_QUAD_PERCENTAGE = SC.FRAG_PARTIAL_QUADS / SC.FRAG_QUADS
A high percentage of partial quads indicates possible problems with meshes containing high numbers of small triangles; the ratio of the total edge length of a primitive to the screen area of a primitive increases as primitives shrink, so quads which span primitive edges become more common.
Partial coverage issues can be reduced by using object meshes which contain larger triangles. One common optimization technique which helps reduce the frequency of microtriangles is the use of dynamic model level of detail selection. In these schemes, each object mesh is generated at multiple detail levels during content generation, and an appropriate mesh is chosen per draw call based on the distance between the object and the camera. The further the object is from the camera, the lower the selected mesh complexity needs to be.
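A minimal Python sketch of such a scheme, assuming each object provides a list of meshes ordered from most to least detailed and that the distance thresholds are tuned for the content; all names and numbers here are illustrative:

import math
def select_lod(camera_pos, object_pos, lod_meshes, thresholds=(10.0, 30.0, 80.0)):
    # Pick a mesh by camera distance; closer objects get the more detailed meshes
    distance = math.dist(camera_pos, object_pos)
    for level, limit in enumerate(thresholds):
        if distance < limit:
            return lod_meshes[min(level, len(lod_meshes) - 1)]
    return lod_meshes[-1]  # beyond the last threshold use the coarsest mesh
print(select_lod((0, 0, 0), (0, 0, 50), ["high", "medium", "low"]))  # -> "low"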
This derived counter gives the average number of fragment cycles per fragment quad, giving some measure of the per-quad processing cost.
SC.FRAG_QUAD_CYCLES = SC.FRAG_ACTIVE / SC.FRAG_QUADS
Note that in most cases the dominant cost here is the programmable code running on the execution core, so there will be some cross-talk caused by compute and fragment workloads running concurrently on the same hardware. This counter is therefore indicative of cost, but does not reflect precise costing.
These counters record the fragment backend behavior.
This counter increments for every thread triggering late depth and stencil (ZS) testing.
This counter increments for every thread killed by late ZS testing. These threads are killed after their fragment program has executed, so a significant number of threads being killed at late ZS implies a significant amount of lost performance and/or wasted energy performing rendering which has no useful visual output.
The main causes of threads using late-zs are:
This counter increments for every tile rendered. The size of a physical tile can vary from 16×16 pixels (largest) downwards. The size of physical tile actually used depends on the number of bytes of memory needed to store the working set for each pixel; the largest tile size allows up to 128-bits per pixel of color storage – enough for a single 32-bit per pixel render target using 4xMSAA, or 4x32-bit per pixel surfaces using multiple-render targets (MRT). Requiring more than that will result in proportionally smaller tile sizes.
The total storage required per pixel depends on the number of color render targets in use, the bits per pixel of each, and the number of MSAA samples.
In general the larger tile sizes are more efficient than smaller tile sizes, especially for content with high geometry complexity. This counter cannot be used to directly determine the physical tile sizes used.
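As a rough guide, a small Python sketch estimating the color storage needed per pixel and comparing it against the 128-bit budget of the largest tile size; treating an over-budget result as simply implying proportionally smaller tiles follows the description above:

def color_bits_per_pixel(target_bits, msaa_samples=1):
    # Sum the bits per pixel of each color render target, multiplied by the sample count
    return sum(target_bits) * msaa_samples
print(color_bits_per_pixel([32], msaa_samples=4))      # 128: fits the largest 16x16 tile
print(color_bits_per_pixel([32, 32, 32, 32]))          # 128: four RGBA8 MRTs also fit
print(color_bits_per_pixel([64, 64], msaa_samples=4))  # 512: expect proportionally smaller tiles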
This counter increments for every physical rendered tile which has its writeback cancelled due to a matching transaction elimination CRC hash. If a high percentage of the tile writes are being eliminated this implies that you are re-rendering the entire screen when not much has changed, so consider using scissor rectangles to minimize the area which is redrawn. This isn't always easy, especially for window surfaces which are pipelined using multiple buffers, but EGL extensions which help manage partial frame updates may be supported on your platform.
These counters look at the behavior of the arithmetic execution engine.
This counter increments for every arithmetic instruction architecturally executed for a quad in an execution engine. This counter is normalized based on the number of execution engines implemented in the design, so gives the per engine performance, rather than the total executed application workload.
The peak performance is one arithmetic instruction per engine per cycle, so the effective utilization of the arithmetic hardware can be computed as:
SC.EE_UTILIZATION = SC.EE_INSTRS / SC.EXEC_CORE_ACTIVE
This counter increments for every arithmetic instruction architecturally executed where there is control flow divergence in the quad resulting in at least one lane of computation being masked out. Control flow divergence erodes arithmetic execution efficiency because it implies some arithmetic lanes are idle, so should be minimized when designing shader effects.
These counters look at the behavior of the load/store pipe.
This counter increments for every LS cache access executed which returns 128-bits of data.
This counter increments for every LS cache access executed which returns less than 128-bits of data.
Full width data loads make best use of the cache, so where possible efficiency can be improved by merging short loads together.
This counter increments for every LS cache access executed which writes 128-bits of data.
This counter increments for every LS cache access executed which writes less than 128-bits of data.
Full width data writes make best use of the cache, so where possible efficiency can be improved by merging short writes together. See the SC.LSC_READS_SHORT section for advice on how this can be achieved.
This counter increments for every atomic operation issued to the LS cache.
This counter counts the total number of load/store cache access operations issued. Each operation is executed with single cycle throughput, but latency of response depends on cache hit rate and external memory system performance.
SC.LSC_ISSUES = SC.LSC_READS_FULL + SC.LSC_READS_SHORT + SC.LSC_WRITES_FULL + SC.LSC_WRITES_SHORT + SC.LSC_ATOMICS
The utilization of the load/store cache can be determined as:
SC.LSC_UTILIZATION = SC.LSC_ISSUES / SC.EXEC_CORE_ACTIVE
This counter increments for every 16 bytes of data fetched from the L2 memory system.
The average number of bytes read from the L2 cache per load/store L1 cache access can be given as:
SC.LSC_L2_BYTES_PER_ISSUE = (SC.LSC_READ_BEATS * 16) / SC.LSC_ISSUES
This gives some idea of level one cache efficiency, although it does require some knowledge of how the application is using non-texture data to interpret. For example, some use cases expect good cache hit rates and reuse the same data many times from different threads, whereas other use cases are data streaming use cases which use each data item exactly once.
This counter increments for every 16 bytes of data fetched from the L2 memory system which missed in the L2 cache and required a fetch from external memory.
The average number of bytes read from the external memory interface per load/store L1 cache access can be given as:
SC.LSC_EXTERNAL_BYTES_PER_ISSUE = (SC.LSC_READ_BEATS_EXTERNAL * 16) / SC.LSC_ISSUES
This gives some idea of level two cache efficiency, although it does require some knowledge of how the application is using non-texture data to interpret. For example, some use cases expect good cache hit rates and reuse the same data many times from different threads, whereas other use cases are data streaming use cases which use each data item exactly once.
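A small Python sketch computing both ratios side by side from captured counter deltas, which can make it easier to see whether poor efficiency originates in the L1 or the L2 cache; the values are illustrative:

lsc_issues = 10_000_000            # SC.LSC_ISSUES (illustrative)
lsc_read_beats = 3_000_000         # 16-byte beats read from the L2
lsc_read_beats_external = 500_000  # of which missed in the L2 and went to external memory
l2_bytes_per_issue = (lsc_read_beats * 16) / lsc_issues
external_bytes_per_issue = (lsc_read_beats_external * 16) / lsc_issues
print(l2_bytes_per_issue, external_bytes_per_issue)  # 4.8 and 0.8 bytes per access here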
This counter increments for every 16 bytes of data written to the L2 memory system.
This counter set looks at the texture pipe behavior.
Note: The texture pipe event counters increment per thread (fragment), not per quad.
This counter increments for every architecturally executed texture instruction.
This counter increments for every texture issue cycle used. Some instructions take more than one cycle due to multi-cycle data access and filtering operations.
Note: sampling from a depth texture only requires a single channel to be returned and so only takes a single cycle, even though it would otherwise qualify as a wide data format.
The texture unit utilization is computed as:
SC.TEX_UTILIZATION = SC.TEX_ISSUES / SC.EXEC_CORE_ACTIVE
The average cycle usage of the texture unit per instruction can be computed as:
SC.TEX_CPI = SC.TEX_ISSUES / SC.TEX_INSTRS
The best case CPI is 1.0; CPI above 1.0 implies the use of multi-cycle texture instructions. The following counters give a direct view of two of the sources of multi-cycle texture operations:
SC.TEX_INSTR_3D
SC.TEX_INSTR_TRILINEAR
If both of these counter sources are zero then the third source of multi-cycle operations (for which a direct counter does not exist) is accesses to wide channel texture formats such as the OpenGL ES 3.x 16-bit and 32-bit per channel integer and floating point formats, or multi-plane YUV formats.
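A small Python sketch of this breakdown, attributing one extra issue cycle each to 3D and trilinear accesses and treating the remainder as wide-format or YUV accesses; this attribution is a simplification (some instructions combine several sources), and the values are illustrative:

tex_instrs = 8_000_000           # SC.TEX_INSTRS (illustrative)
tex_issues = 10_000_000          # SC.TEX_ISSUES
tex_instr_3d = 500_000           # SC.TEX_INSTR_3D
tex_instr_trilinear = 1_000_000  # SC.TEX_INSTR_TRILINEAR
tex_cpi = tex_issues / tex_instrs               # 1.25: some instructions took extra cycles
extra_cycles = tex_issues - tex_instrs
unattributed = extra_cycles - tex_instr_3d - tex_instr_trilinear
print(tex_cpi, unattributed)  # remaining 500,000 cycles likely wide-format accesses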
This counter increments for every architecturally executed texture instruction which is accessing a 3D texture. These will take at least two cycles to process, and may take four cycles if trilinear filtering is used.
This counter increments for every architecturally executed texture instruction which is using a trilinear (GL_LINEAR_MIPMAP_LINEAR) minification filter. These will take at least two cycles to process, and may take four cycles if a 3D texture is being sampled from.
In content which is texture filtering throughput limited, switching from trilinear filtering to bilinear filtering (GL_LINEAR_MIPMAP_NEAREST) may improve performance.
This counter increments for every architecturally executed texture instruction which is accessing a texture which has mipmaps enabled. Mipmapping provides improved 3D texturing quality, as it provides some pre-filtering for minified texture samples, and also improves performance as it reduces pressure on texture caches. It is highly recommended that mipmapping is used for all 3D texturing operations reading from static input textures.
This counter increments for every architecturally executed texture instruction which is accessing a texture which is compressed, including both application-level texture compression such as ETC and ASTC, as well as internal texture compression such as AFBC framebuffer compression. Texture compression can significantly improve performance due to reduced pressure on the texture data caches and external memory system. It is recommended that all input assets from the application use compression whenever it is possible to do so.
This counter increments for every 16 bytes of texture data fetched from the L2 memory system.
The average number of bytes read from the L2 cache per texture L1 cache access can be given as:
SC.TEX_L2_BYTES_PER_ISSUE = (SC.TEX_READ_BEATS * 16) / SC.TEX_ISSUES
This gives some idea of level one cache efficiency, although it does require some knowledge of how the application is using texture data to interpret. For example, some use cases expect good cache hit rates and reuse the same data many times from different threads, whereas other use cases are data streaming use cases which use each data item exactly once.
This counter increments for every 16 bytes of texture data fetched from the L2 memory system which missed in the L2 cache and required a fetch from external memory.
The average number of bytes read from the external memory interface per texture operation can be given as:
SC.TEX_EXTERNAL_BYTES_PER_ISSUE = (SC.TEX_READ_BEATS_EXTERNAL * 16) / SC.TEX_ISSUES
This gives some idea of level two cache efficiency, although it does require some knowledge of how the application is using texture data to interpret. For example, some use cases expect good cache hit rates and reuse the same data many times from different threads, whereas other use cases are data streaming use cases which use each data item exactly once.
This counter set looks at the varying unit behavior:
This counter increments for every architecturally executed varying unit instruction for a fragment quad.
This counter increments for every architecturally executed cycle of “mediump” 16-bit varying interpolation.
Interpolating mediump fp16 values is twice as fast as interpolating highp fp32 values, so fp16 should be used whenever it is suitable. Most use cases which contribute to computing an 8-bit unorm color value can safely use fp16 precision.
This counter increments for every architecturally executed cycle of “highp” 32-bit varying interpolation.
Interpolating highp fp32 values is half the performance and twice the bandwidth of interpolating mediump fp16 values, so should only be used for cases where the additional floating point precision is necessary. The most common use cases requiring high-precision varyings are texture sampling coordinates, and anything related to accurately computing 3D position in the scene.
The utilization of the varying unit can be determined as:
SC.VARY_UTILIZATION = (SC.VARY_ISSUES_16 + SC.VARY_ISSUES_32) / SC.EXEC_CORE_ACTIVE
The tiler counters provide details of the workload of the fixed function tiling unit, which places primitives into the tile lists which are subsequently read by the fragment frontend during fragment shading.
These counters show the overall activity of the tiling unit.
This counter increments every cycle the tiler is processing a task. The tiler can run in parallel to vertex shading and fragment shading, so a high cycle count here does not necessarily imply a bottleneck unless the SC.COMPUTE_ACTIVE counters in the shader cores are very low relative to this counter.
These counters give a functional breakdown of the tiling workload given to the GPU by the application.
This counter increments for every point primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.
This counter increments for every line segment primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.
This counter increments for every triangle primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.
This derived counter contains the total number of primitives entering primitive assembly.
TI.INPUT_PRIMITIVES = TI.PRIMITIVE_POINTS + TI.PRIMITIVE_LINES + TI.PRIMITIVE_TRIANGLES
These counters give a breakdown of how the workload has been affected by clipping and culling. The culling schemes are applied in the order shown below:
This order impacts the interpretation of the counters in terms of comparing the culling rates against the total number of primitives entering and leaving each stage.
This counter is incremented for every primitive which is culled due to the application of front-face or back-face culling rules. For most meshes approximately half of the triangles are back facing so this counter should typically be similar to the visible primitives, although lower is always better.
This counter is incremented for every primitive which is culled due to being totally outside of the clip-space volume. Application-side culling should be used to minimize the amount of out-of-shot geometry being sent to the GPU as it is expensive in terms of bandwidth and power. One of my blogs looks at application-side culling in more detail.
This counter is incremented for every microtriangle primitive which is culled due to lack of any coverage of active sample points.
This counter is incremented for every primitive which is visible, surviving all types of culling which are applied.
Note: Visible in this context simply means that a primitive is inside the viewing frustum, facing in the correct direction, and has at least some sample coverage. Primitives which are visible at this stage still may generate no rendered fragments; for example ZS testing during fragment processing may determine that a primitive is entirely occluded by other primitives.
This counter determines the percentage of primitive inputs into the facing test which are culled by it.
TI.CULLED_FACING_PERCENT = TI.CULLED_FACING / TI.INPUT_PRIMITIVES
In typical 3D content it is expected that approximately half of the input primitives will be culled by the facing tests, as the side of a model which is facing away from the camera is not visible and can be dropped without fragment shading. If a low percentage of primitives are culled by the facing tests in a 3D application this implies that the application may not be enabling the back-face test for everything which could benefit from it; check the application draw calls for opaque objects are enabling GL_CULL_FACE correctly.
This counter determines the percentage of primitive inputs into the frustum test which are culled by it.
TI.CULLED_FRUSTUM_PERCENT = TI.CULLED_FRUSTUM / (TI.INPUT_PRIMITIVES - TI.CULLED_FACING)
One of the most important optimizations an application can perform is efficiently culling objects which are outside of the visible frustum, as these optimizations can be applied quickly by exploiting scene knowledge such as object bounding volume checks (see Mali Performance 5: An Application's Performance Responsibilities for more information on application culling techniques). It is expected that some triangles will be outside of the frustum – CPU culling is normally approximate, and some objects may span the frustum boundary – but this should be minimized as it indicates that redundant vertex processing is occurring.
This counter determines the percentage of primitive inputs into the coverage test which are culled by it.
TI.CULLED_COVERAGE_PERCENT = TI.CULLED_COVERAGE / (TI.INPUT_PRIMITIVES - TI.CULLED_FACING - TI.CULLED_FRUSTUM)
A significant number of triangles being culled due to the coverage test indicates that the application is using very dense models which are producing small microtriangles; even if the triangles which produce no coverage are killed it is expected that there will also be a number of visible triangles which cover a small number of sample points, which are still disproportionately expensive to process relative to their screen coverage.
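Putting the three stages together, a Python sketch of the staged culling breakdown using the pipeline order described above; the coverage-culled counter is referred to here as TI.CULLED_COVERAGE by analogy with the other names, and the values are illustrative:

input_primitives = 1_000_000  # TI.INPUT_PRIMITIVES (illustrative)
culled_facing = 450_000       # TI.CULLED_FACING
culled_frustum = 100_000      # TI.CULLED_FRUSTUM
culled_coverage = 50_000      # coverage-culled primitives
after_facing = input_primitives - culled_facing
after_frustum = after_facing - culled_frustum
facing_pct = culled_facing / input_primitives   # 45.0%
frustum_pct = culled_frustum / after_facing     # 18.2% of what survived the facing test
coverage_pct = culled_coverage / after_frustum  # 11.1% of what survived the frustum test
visible = after_frustum - culled_coverage       # 400,000 primitives reach the tile lists
print(facing_pct, frustum_pct, coverage_pct, visible)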
Microtriangles are expensive to process for a number of reasons.
On mobile devices they are mostly expensive due to the bandwidth cost they incur. The vertex shader has to read the vertex attributes and write the varyings, and the fragment shader has to read and interpolate the varyings, which are typically bulky floating point vector data types. For example, the simplest vertex consisting of only a vec4 fp32 position attribute requires two 128-bit reads and one 128-bit write, a total of 24 bytes of memory bandwidth used. The cost of the vertex bandwidth is amortized over the number of fragments that a triangle generates. A triangle covering 50 pixels will effectively cost 0.5 bytes per pixel in terms of vertex bandwidth, which is equivalent to the cost of a single ETC2 compressed texture fetch. A microtriangle covering two pixels will cost 12 bytes per pixel, and is therefore likely to generate stalls on the memory system.
Note: This example is for the “best case” microtriangle consisting only of a position; most real applications will also have additional per-vertex attributes, such as vertex normals and texture coordinates. Applications loading between 50 and 100 bytes of input data per vertex are common.
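The amortization can be illustrated with a short Python sketch, assuming (as in the example above) roughly one new vertex of bandwidth per triangle in a well-indexed mesh:

def vertex_bytes_per_pixel(bytes_per_vertex, pixels_covered):
    # Vertex memory traffic amortized over the pixels the triangle shades
    return bytes_per_vertex / pixels_covered
print(vertex_bytes_per_pixel(24, 50))  # ~0.5 bytes per pixel for a 50 pixel triangle
print(vertex_bytes_per_pixel(24, 2))   # 12 bytes per pixel for a microtriangle
print(vertex_bytes_per_pixel(75, 2))   # 37.5 bytes per pixel for a more realistic vertex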
Fragment workloads are always spawned as 2×2 pixel quads; quads which span the edges of a triangle may contain partial sample coverage, in which one or more of the fragments in the quad does not contribute to the final render, but which costs some performance to process. Microtriangles cause an increase in partial quads, as there are more edges per unit area shaded. The shader core counter SC.FRAG_PARTIAL_QUADS (see section 3.3.11) may provide additional evidence of the existence of microtriangles.
This counter is incremented for every triangle which is front-facing. This counter is incremented after culling, so only counts visible primitives which are actually emitted into the tile list.
This counter is incremented for every triangle which is back-facing. This counter is incremented after culling, so only counts visible primitives which are actually emitted into the tile list.
If you are not using back-facing triangles for some special algorithmic purpose, such as Refraction Based on Local Cubemaps, then a high value here relative to the total number of triangles may indicate that the application has forgotten to turn on back-face culling. For most opaque geometry no back facing triangles should be expected.
These counters track the workload requests for the Index-Driver Vertex Shading pipeline, one of the new features introduced in the Bifrost GPU architecture.
This counter is incremented for every batch of vertices which has been position shaded. Each batch consists of 4 vertices from sequential index ranges.
This counter is incremented for every batch of vertices which has been varying shaded. Each batch consists of 4 vertices from sequential index ranges.
This section documents the behavior of the L2 memory system counters.
In systems which implement multiple L2 caches or bus interfaces the counters presented in DS-5 Streamline are the sum of the counters from all of the L2 counter blocks present, as this gives the aggregate memory system usage.
All derivations in this document are computations per slice, so it may be necessary to divide these by the number of cache slices present in your design when using user-level equations in DS-5 Streamline.
These counters profile the internal use of the L2 cache versus the available cycle capacity.
The counter increments for any L2 read or write request from an internal master, or snoop request from an internal or external master.
Each L2 cache slice can process a single read, write, or snoop operation per clock cycle. The internal utilization of the L2 cache by the processing masters in the system can be determined via the equation:
L2.INTERNAL_UTILIZATION = L2.ANY_LOOKUP / JM.GPU_ACTIVE
These counters profile the internal read traffic into the L2 cache from the various internal masters.
The counter increments for every read transaction received by the L2 cache.
The counter increments for every read transaction sent by the L2 cache to external memory.
This derived counter gives an indication of the proportion of reads which miss in the L2 cache and are sent on the external interface to main memory.
L2.READ_MISS_RATE = L2.EXTERNAL_READ_REQUEST / L2.READ_REQUEST
The counter increments for every write transaction received by the L2 cache.
The counter increments for every write transaction sent by the L2 cache to external memory.
This derived counter gives an indication of the proportion of writes which miss in the L2 cache and are sent on the external interface to main memory.
L2.WRITE_MISS_RATE = L2.EXTERNAL_WRITE_REQUEST / L2.WRITE_REQUEST
Note: In most cases writes to main memory are necessary and not a bad thing, for example writing vertex data to intermediate storage for later use during fragment shading, or when writing back the final color contents of a tile at the end of a frame. A high write miss rate is therefore not necessarily indicative of a performance problem if those writes were always intended to be sent to main memory.
These counters profile the external read memory interface behavior. Note that this includes traffic from the entire GPU L2 memory subsystem, not just traffic from the L2 cache, as some types of access will bypass the L2 cache.
This counter increments on every clock cycle a read beat is read off the external AXI bus.
With knowledge of the bus width used in the GPU the beat counter can be converted into a raw bandwidth counter.
L2.EXTERNAL_READ_BYTES = SUM(L2.EXTERNAL_READ_BEATS * L2.AXI_WIDTH_BYTES)
Note: Most implementations of a Bifrost GPU use a 128-bit (16 byte) AXI interface, but a 64-bit (8 byte) interface is also possible to reduce the area used by a design. This information can be obtained from your chipset manufacturer.
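For example, a small Python sketch converting a beat count into bandwidth, assuming a 128-bit AXI interface and counters captured over a one second window; both values are illustrative:

axi_width_bytes = 16              # 128-bit AXI; may be 8 bytes on some implementations
external_read_beats = 50_000_000  # L2.EXTERNAL_READ_BEATS delta over one second (illustrative)
read_bandwidth = external_read_beats * axi_width_bytes
print(f"External read bandwidth: {read_bandwidth / 1e6:.0f} MB/s")  # 800 MB/s here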
The GPU can issue one read beat per clock per implemented cache slice. The total utilization of the AXI read interface can be determined per cache slice using:
L2.EXTERNAL_READ_UTILIZATION = L2.EXTERNAL_READ_BEATS / JM.GPU_ACTIVE
Note: This utilization metric ignores any frequency changes which may occur downstream of the GPU. If you have, for example, a 600MHz GPU connected to a 300MHz AXI bus of the same data width then it will be impossible for the GPU to achieve more than 50% utilization of its native interface because the AXI bus is unable to provide the data as quickly as the GPU can consume it.
This counter increments every cycle that the GPU is unable to issue a new read transaction to AXI, because AXI is unable to accept the request. If this number is high it may indicate that the AXI bus is suffering from high contention due to accesses from other sources, or that the GPU is clocked faster than the AXI bus it is connected to.
The L2 interface implements a six entry histogram which tracks the response latency for the external reads. The counter for the sixth level is synthesized from multiple raw counter values.
0-127 cycles: L2.EXT_RRESP_0_127
128-191 cycles: L2.EXT_RRESP_128_191
192-255 cycles: L2.EXT_RRESP_192_255
256-319 cycles: L2.EXT_RRESP_256_319
320-383 cycles: L2.EXT_RRESP_320_383
> 383 cycles: L2.EXTERNAL_READ_BEATS - L2.EXT_RRESP_0_127 - L2.EXT_RRESP_128_191 - L2.EXT_RRESP_192_255 - L2.EXT_RRESP_256_319 - L2.EXT_RRESP_320_383
Mali shader cores are designed to tolerate an external read response latency of 170 GPU cycles; systems reporting significantly higher latency than this for a high percentage of transactions will observe some reduction in performance, as the shader core will stall more often waiting for main memory to provide data.
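A small Python sketch deriving the synthesized final bucket and the fraction of beats in the slowest part of the histogram; the bucket values are illustrative, and treating everything from 256 cycles upwards as high latency is a simplifying assumption:

external_read_beats = 10_000_000  # L2.EXTERNAL_READ_BEATS (illustrative)
rresp = {"0-127": 6_000_000, "128-191": 2_000_000, "192-255": 1_000_000,
         "256-319": 500_000, "320-383": 300_000}  # raw L2.EXT_RRESP_* buckets
over_383 = external_read_beats - sum(rresp.values())  # synthesized "> 383 cycles" bucket
high_latency = (rresp["256-319"] + rresp["320-383"] + over_383) / external_read_beats
print(over_383, f"{high_latency:.1%}")  # 200,000 beats and 10.0% for these values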
The L2 interface implements a four entry histogram which tracks the outstanding transaction levels for the external reads. The counter for the fourth level is synthesized from multiple raw counter values.
0-25%: L2.EXT_READ_CNT_Q1
25-50%: L2.EXT_READ_CNT_Q2
50-75%: L2.EXT_READ_CNT_Q3
75-100%: L2.EXTERNAL_READ - L2.EXT_READ_CNT_Q1 - L2.EXT_READ_CNT_Q2 - L2.EXT_READ_CNT_Q3
The number of currently outstanding transactions gives some idea of how many concurrent memory requests the shader core has queued on the AXI bus. This will not directly cost performance unless we completely run out of transactions; content with a high percentage of transactions in the 75-100% range may be losing performance because it is unable to construct new requests to be sent onto the AXI interface.
Note: The maximum number of outstanding transactions available is a synthesis time option when implementing the GPU. The outstanding transaction count should be selected to ensure that the GPU can keep data requests queued on the external DDR controller. In a system with 170 cycles of read response latency, and a typical transaction size of 4 data beats, at least 170/4 (42) outstanding transactions are required.
These counters profile the external write memory interface behavior. Note that this includes traffic from the entire GPU L2 memory subsystem, not just traffic from the L2 cache, as some types of access will bypass the L2 cache.
This counter increments on every clock cycle a write beat is sent on the external AXI bus.
With knowledge of the bus width used in the GPU the beat counter can be converted into a raw bandwidth counter:
L2.EXTERNAL_WRITE_BYTES = SUM(L2.EXTERNAL_WRITE_BEATS * L2.AXI_WIDTH_BYTES)
The GPU can issue one write beat per clock per implemented cache slice. The total utilization of the AXI write interface can be determined per cache slice using:
L2.EXTERNAL_WRITE_UTILIZATION = L2.EXTERNAL_WRITE_BEATS / JM.GPU_ACTIVE
This counter increments every cycle that the GPU is unable to issue a new write transaction to AXI, because AXI is unable to accept the request. If this number is high it may indicate that the AXI bus is suffering from high contention due to accesses from other sources, or that the GPU is clocked faster than the AXI bus it is connected to.
The L2 interface implements a four entry histogram which tracks the outstanding transaction levels for the external writes. The counter for the fourth level is synthesized from multiple raw counter values.
0-25%: L2.EXT_WRITE_CNT_Q1
25-50%: L2.EXT_WRITE_CNT_Q2
50-75%: L2.EXT_WRITE_CNT_Q3
75-100%: L2.EXTERNAL_WRITE - L2.EXT_WRITE_CNT_Q1 - L2.EXT_WRITE_CNT_Q2 - L2.EXT_WRITE_CNT_Q3
Note: The maximum number of outstanding transactions available is a synthesis time option when implementing the GPU. The outstanding transaction count should be selected to ensure that the GPU can keep data requests queued on the external DDR controller. In a system with 90 cycles of write response latency, and a typical transaction size of 4 data beats, at least 90/4 (23) outstanding transactions are required.
This document has defined all of the Mali Bifrost family performance counters available via DS-5 Streamline, as well as some useful counters which can be derived from them. Hopefully this provides a useful starting point for your application optimization activity when using Mali GPUs.
We also publish a Mali Application Optimization Guide. You can visit this by clicking on the link below:
Read the Mali optimization guide: https://developer.arm.com/docs/dui0555/b/introduction/about-optimization