Understanding Mali GPU Hardware Counters

Hi ,

I have read your blog on Mali GPU Hardware Counters. I have a few questions. 

The Mali Job Manager Cycles:GPU cycles counter gives the total number of cycles the GPU was active. If I execute a compute workload (not graphics), I should be able to predict the execution time of the kernel from the Tripipe cycles counter. There is always a difference in value between Mali Job Manager Cycles:GPU cycles and Mali Core Cycles:Tripipe cycles. What do these extra cycles signify? I know that the values reported by Streamline are averaged across all the shader cores, but still, what do these extra cycles signify?

I would also like to know exactly what the Mali Core Cycles:Compute cycles and Mali Compute Threads:Compute cycles awaiting descriptors counters report.

I ask because I ran an OpenCL benchmark with zero arithmetic instructions, but the values of Mali Core Cycles:Compute cycles and Mali Compute Threads:Compute cycles awaiting descriptors are still not zero, while Mali Compute Threads:Compute tasks and Mali Compute Threads:Compute threads started were zero.

Also, the Tripipe cycles counter value should be equal to the maximum of the cycles spent in the arithmetic, load/store, and texture pipelines, but even when there are no texture or arithmetic instructions, the value of Mali Core Cycles:Tripipe cycles is not the same as the Mali Load/Store Pipe:LS instruction issues counter. Why is this happening? If I am executing only memory instructions, Mali Core Cycles:Tripipe cycles should be equal to the Mali Load/Store Pipe cycles; instead I see that Mali Core Cycles:Compute cycles, Mali Compute Threads:Compute cycles awaiting descriptors, and Mali Core Cycles:Tripipe cycles have similar values.

 

It would be helpful if you could give some insight into these behaviours.

P.S. I am doing an academic project in which I am modeling the performance of OpenCL kernels on Mali GPUs.

P.P.S. I am not an Android developer looking at optimizations.

  • maasa said:
    There is always a difference in value between the Mali Job Manager Cycles:GPU cycles and Mali Core Cycles:Tripipe cycles. What does this extra cycles signify.

    The Job Manager is responsible for turning a compute dispatch into smaller pieces of work which can be distributed over the shader cores in the system. To do that it has to load some control structures from main memory and ensure memory coherency with the CPU's view of the world before starting any work, and then ensure any results are memory coherent with the CPU's view of the world at the end of the dispatch.

    This Job Manager overhead is normally a small (< 10K cycles) fixed cost on any compute work submission, and will be amortized and pipelined if you submit a compute queue containing multiple kernels to execute.

    maasa said:
    I also would like to know what exactly does the  Mali Core Cycles:Compute cycles and Mali Compute Threads:Compute cycles awaiting descriptors counters report ??.

    Compute cycles increments any time that a shader core is processing any part of a compute job (or a non-fragment job for graphics). This includes any cycle where something is either in the fixed function thread setup unit, or in the shader core itself. Descriptors are the control-plane state for the work being submitted - so the other counter shows how long the front-end is waiting to load descriptors. Note that for a pipelined compute submission containing multiple kernels this descriptor loading is pipelined, so a cycle spent waiting here doesn't necessarily mean a lost cycle of tripipe processing time in non-trivial applications.

    In general:

    "Job Manager: GPU Active Cycles" > "Shader Core: Compute Active Cycles" > "Shader Core: Tripipe Active Cycles"

    The tripipe is the heart of the system for running code, but there are a few thin layers of overhead which are incurred to determine how to drive it.

    There is some overhead to determine that we don't need to do anything at all (e.g. we don't spend hardware to optimize cases which don't occur in real world scenarios - sending zero sized jobs is an application issue really), which is why you see some Compute Active time even if you don't spawn any threads.
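
    As a rough illustration of the ordering above, the two layers of overhead can be read off as differences between the three counters. This is only a sketch: the function name and the sample values are made up for the example, and it assumes the per-core counters have already been reduced to one comparable value per counter.

```python
def cycle_overheads(gpu_active, compute_active, tripipe_active):
    """Split GPU active cycles into the layers described above.

    Assumes GPU active >= compute active >= tripipe active, as in the
    general relationship quoted in the thread.
    """
    if not (gpu_active >= compute_active >= tripipe_active >= 0):
        raise ValueError("expected GPU active >= compute active >= tripipe active >= 0")
    return {
        # Cycles where the GPU was active but no shader core was processing compute work.
        "job_manager_overhead": gpu_active - compute_active,
        # Cycles spent in fixed function thread setup rather than in the tripipe.
        "thread_setup_overhead": compute_active - tripipe_active,
        "tripipe": tripipe_active,
    }


# Hypothetical counter values, not measurements from real hardware.
breakdown = cycle_overheads(
    gpu_active=1_050_000, compute_active=1_045_000, tripipe_active=1_000_000
)
print(breakdown)
```

    For a zero sized job the tripipe term would be close to zero and almost everything would land in the two overhead buckets.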

    Also the tripipe cycles counter value should be equal to the maximum of cycles spent in Arithmetic/LS-pipeline/Texture pipeline but even when there are no texture and Arithmetic instructions, the value of Mali Core Cycles:Tripipe cycles is not the same as Mali Load/Store Pipe:LS instruction issues counter

    maasa said:
    the value of Mali Core Cycles:Tripipe cycles is not the same as Mali Load/Store Pipe:LS instruction issues counter. Why this is happening?

    The architectural best case is one cycle per instruction for the load/store pipe, but that assumes cache hits for either reads or writes. Real-world performance will vary depending on how your addresses are spread across memory and how those addresses land temporally. If you are frequently going out to external memory (i.e. both the L1 and L2 caches miss) then the performance issues may be beyond Mali - we don't control the external memory system linking the GPU to the DDR memory controller.

    Also note that a lot depends on how big your workloads are - it sounds like you are running some small test workloads if you are running zero sized kernels. Remember GPUs are massively multithreaded and rely on running hundreds of threads concurrently in each shader core to ensure that the functional units stay busy, and the 1 cycle per pipe throughput assumes full thread occupancy for a reasonably large workload. If you are running very small test loads then it is likely that you are losing a lot of utilization to ramp up and ramp down time, and may simply lack enough threads to keep the hardware busy. Very rough rule of thumb - aim for compute kernels which are at least hundreds of thousands of work items.

    maasa said:
    Mali Core Cycles:Compute cycles , Mali Compute Threads:Compute cycles awaiting descriptors and  Mali Core Cycles:Tripipe cycles have similar values??

    Ideally Compute cycles == Tripipe cycles; this shows that the workload is mostly spending its time running in the shader core (i.e. there is minimal overhead from the fixed function thread setup units). I wouldn't worry too much about the waiting for descriptors counter; because descriptor loads are pipelined in parallel with the tripipe running, it's not really showing you anything particularly useful.

    HTH, 

    Pete

  • Hi ,

    Thank you for your quick reply. I have a few questions.

    In the blog, LS CPI is calculated as LS-Instruction Issues/ LS-instructions

    Does the Mali Load/Store Pipe:LS instruction issues counter report the cycles taken to complete all the load/store instructions of all the threads in the kernel? That is, does the counter include the memory stall cycles as well?

    How can I get the cycles taken for executing arithmetic instructions and the cycles taken for executing LS instructions separately? The tripipe cycles gives the max of (A1/A2/LS/Tex).

    In general:

        GPU Active Cycles = Tripipe Cycles + Overhead1 Cycles
        Tripipe Cycles = max of (A1/A2/LS/Tex) + Overhead2 Cycles
        Compute Cycles == Tripipe Cycles

    Is my understanding correct?

  • maasa said:
    Does the Mali Load/Store Pipe:LS instruction issues counter report the cycles taken to complete all the load/store instructions of all the threads in the kernel. I meant does the counter adds the memory stall cycles as well?

    It depends which Mali GPU you are using; on Mali-T600 and Mali-T620 you can use the issues counter to measure reissues due to cache misses. Later products changed how cache misses are handled, so the issues counter no longer shows cache misses for loads. Memory stalls are handled asynchronously - other threads can progress while some threads are blocked on memory misses - so a counter which measures stalled cycles isn't useful for measuring actual hardware throughput.
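
    For the GPUs where the issues counter does include reissues, the LS CPI from the question (issues divided by instructions) gives a quick feel for how far the pipe is from the one cycle per instruction best case. A minimal sketch, with a made-up function name and made-up counter values:

```python
def ls_cpi(ls_issues, ls_instructions):
    """LS cycles per instruction, computed as issues / instructions."""
    if ls_instructions <= 0:
        raise ValueError("need at least one load/store instruction")
    return ls_issues / ls_instructions


# Hypothetical counter values, not measurements from real hardware.
cpi = ls_cpi(ls_issues=1500, ls_instructions=1000)
extra_issues = 1500 - 1000  # issues beyond the one-per-instruction best case
print(cpi, extra_issues)
```

    A value near 1.0 would suggest few reissues; on products where load misses no longer show up in the issues counter, this ratio says nothing about cache misses for loads.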

    maasa said:
    How can I get the cycles taken for executing Arithmetic instructions  and cycles taken for executing LS -instructions separately ?

    The issues counters for the three pipelines are a close approximation.

    maasa said:
    Tripipe Cycles =  max of (A1/A2/LS/) + Overhead2 Cycles

    For a fully loaded pipe with sufficient threads the reality would be closer to "max(A1+ Overhead A1, A2 + Overhead A2, LS + Overhead LS, Tex + Overhead Tex)"; all three pipelines have different caches and caching structures specialized for their use, and some pipelines have different issue rates for different workloads, so the overhead is different in each case.

    Note that the A* loading will be identical, so you can ignore the differences between the multiple arithmetic pipes in the design (the arithmetic counters are effectively reporting A1 only).

    This relationship breaks down in cases where you don't have enough threads to keep the critical path pipelines busy; e.g. running a 1x1 kernel which generates a single thread will have very low utilization even if that one thread issues an instruction successfully on every cycle it is eligible to do so because there simply isn't enough work there to keep the pipelines full. GPUs are throughput machines relying on having a large pool of threads, and the throughput equations assume you have "sufficient" threads in the core.
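
    The max() relationship above can be sketched as a small model. The per-pipe overhead terms are not exposed as counters, so the overhead values below are pure placeholders, as are the function name and the issue cycle counts; the model also assumes there are enough threads to keep the critical pipe busy.

```python
def estimate_tripipe_cycles(pipes):
    """Return (critical_pipe, estimated_cycles).

    pipes maps a pipe name to (issue_cycles, overhead_cycles). The estimate
    is max(issue + overhead) over the pipes, which only holds for a fully
    loaded core with sufficient threads.
    """
    totals = {name: issue + overhead for name, (issue, overhead) in pipes.items()}
    critical = max(totals, key=totals.get)
    return critical, totals[critical]


# Hypothetical values. "A" stands for the arithmetic pipes as a whole,
# since the A* loading is identical.
pipes = {
    "A": (400_000, 20_000),
    "LS": (600_000, 90_000),
    "Tex": (0, 0),
}
critical_pipe, cycles = estimate_tripipe_cycles(pipes)
print(critical_pipe, cycles)
```

    With these numbers the load/store pipe is the critical path; a measured tripipe count far above the estimate would point at too few threads rather than at any single pipe.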

    HTH, 
    Pete

  • Hi ,

    Thanks once again for the quick reply. I am using a Mali-T628 GPU, so I guess I can use the LS instruction issues counter to get the cycles taken for executing load/store instructions.

    But I only have a Mali Arithmetic Pipe:A instructions counter. There is no A instruction issues counter on Mali-T628. So how do I get the cycles taken for executing arithmetic instructions?

  • All arithmetic instructions are single cycle throughput, so one instruction = one active cycle. One caveat is that there are some forms of slowdown possible (e.g. if you get a high density of instruction cache misses), but they are not visible in the counters.

  • Hi

    I assume the cache line size is 64 bytes for both the Mali load/store cache and the Mali L2 cache.

    For some kernels, when I compute the total Mali L1 misses (average misses reported by Streamline * 4) and the L2 hits, I find L2 hits > L1 misses. How can this happen?

    Also, in those cases, how do I find out the number of L2 misses?

  • maasa said:
    I assume the cache line size is 64 bytes

    Yes.

    maasa said:
    For some kernels, when compute total Mali L1 misses (avg misses given by streamline * 4) and L2 hits, How can this happen?

    Not all L2 accesses are from the L1 LSC, so you would expect some hits from other sources - e.g. loading control structures and shader programs. Hard to give a precise answer without knowing your kernels.

    maasa said:
    Also in those cases, how do I get to know about L2 misses?

    For Midgard GPUs you have an L2 read lookups counter and an L2 read hits counter. Misses is lookups minus hits.

    Note that as a GPU is a massively multi-threaded design it's not uncommon to have parallel lookups from multiple threads and shader cores hitting the same addresses, which may get optimized in a manner which is impossible on a traditional CPU architecture.
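
    The lookups minus hits arithmetic, sketched with made-up values (the function name is only for illustration, and the inputs are assumed to be the L2 read lookups and L2 read hits counter totals over the same capture window):

```python
def l2_read_misses(read_lookups, read_hits):
    """L2 read misses as lookups minus hits."""
    if read_hits > read_lookups:
        raise ValueError("hits cannot exceed lookups for the same window")
    return read_lookups - read_hits


# Hypothetical counter values, not measurements from real hardware.
lookups, hits = 10_000, 9_200
misses = l2_read_misses(lookups, hits)
hit_rate = hits / lookups
print(misses, hit_rate)
```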

  • Hi ,

    I am using a Mali-T628 GPU and there is no L2 lookups counter in Streamline v5.26.2.

    All I have are  Mali L2 Cache Reads:L2 read hits, Mali L2 Cache Writes:L2 write hits, Mali L2 Cache Reads:Read snoops and Mali L2 Cache Reads:Write snoops counters.

    The sum of all these will give me L2 hits. But there is no counter to tell the total L2 lookups.

  • Hi ,

    Since I do not have L2 read lookup counters, is it correct to use Mali L2 Cache Ext Writes:External read beats + Mali L2 Cache Ext Writes:External write beats as a proxy for L2 cache misses?

    Do the read/write beats counters give the number of transactions that reach the DRAM?

  • No - the beats count is the number of bus data beat cycles. A single transaction is normally multiple data beats (e.g. 64 byte transactions with 16 byte bus = 4 beats per transaction). 
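
    To turn a beat count into an approximate transaction count, divide by the beats per transaction. A sketch using the example figures above (64 byte transactions on a 16 byte bus); both sizes are assumptions that depend on the actual system, the function name is made up, and shorter transactions would make the result an underestimate.

```python
def beats_to_transactions(beats, transaction_bytes=64, bus_bytes=16):
    """Estimate bus transactions from data beats, assuming fixed-size transactions."""
    if bus_bytes <= 0 or transaction_bytes % bus_bytes != 0:
        raise ValueError("transaction size must be a whole number of bus beats")
    beats_per_transaction = transaction_bytes // bus_bytes
    return beats / beats_per_transaction


# Hypothetical beat count, not a measurement from real hardware.
transactions = beats_to_transactions(4000)
print(transactions)
```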

  • Hi ,

    Thanks.

    So do you have any suggestions for getting L2 misses in the absence of L2 read lookup counters?

    Also, does Mali L2 Cache Ext Reads:External bus stalls (AR) + Mali L2 Cache Ext Writes:External bus stalls (W) give the total number of stall cycles due to external memory requests?

  • There definitely should be L2 read and write lookup counters available for Mali-T62x.

    https://github.com/ARM-software/gator/blob/master/daemon/events-Mali-T62x_hw.xml

    What do you get in your counter selection list in Streamline?

    Stall counter definitions are here (e.g.):

    https://community.arm.com/graphics/b/blog/posts/mali-midgard-family-performance-counters#jive_content_id_534_L2_EXT_AR_STALL

    Cheers, 
    Pete

  • Hi ,

    These are the L2 counters that are visible in my Streamline selection:

    Mali L2 Cache Ext Reads:External bus stalls (AR)    

    Mali L2 Cache Ext Reads:External write bytes    

    Mali L2 Cache Ext Writes:External bus stalls (W)  

    Mali L2 Cache Ext Writes:External read bytes  

    Mali L2 Cache Reads:L2 read hits

    Mali L2 Cache Reads:Read snoops    

    Mali L2 Cache Writes:L2 write hits    

    Mali L2 Cache Writes:Write snoops

  • Hi

    I just realised that Streamline is showing the events available in events-Mali-Midgard_hw.xml and not those in events-Mali-T62x_hw.xml.

    Can you please let me know how I can change gator to use events-Mali-T62x_hw.xml instead of events-Mali-Midgard_hw.xml?