
Understanding Mali GPU Hardware Counters

Hi,

I have read your blog on Mali GPU Hardware Counters. I have a few questions. 

The Mali Job Manager Cycles:GPU cycles counter gives the total number of cycles the GPU was active. If I execute a compute workload (not graphics), I should be able to predict the execution time of the kernel from the Tripipe cycles counter. However, there is always a difference between Mali Job Manager Cycles:GPU cycles and Mali Core Cycles:Tripipe cycles. What do these extra cycles signify? I know that the values reported by Streamline are averages across all the shader cores, but I still don't understand what the extra cycles represent.
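
For reference, this is roughly how I intended to predict the execution time (the clock frequency and the counter reading below are placeholder values, not measurements from a real run):

    # Sketch of my intended estimate: kernel time ~= Tripipe cycles / core clock.
    # Both numbers here are made-up placeholders.
    GPU_FREQ_HZ = 600e6          # assumed shader core clock (600 MHz)
    tripipe_cycles = 3_000_000   # per-core average reported by Streamline

    predicted_time_s = tripipe_cycles / GPU_FREQ_HZ
    print(f"Predicted kernel time: {predicted_time_s * 1e3:.2f} ms")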

I would also like to know what exactly the Mali Core Cycles:Compute cycles and Mali Compute Threads:Compute cycles awaiting descriptors counters report.

I ask because I ran an OpenCL benchmark with zero arithmetic instructions, yet the values of Mali Core Cycles:Compute cycles and Mali Compute Threads:Compute cycles awaiting descriptors were not zero, while Mali Compute Threads:Compute tasks and Mali Compute Threads:Compute threads started were zero.

Also, the Tripipe cycles counter should equal the maximum of the cycles spent in the arithmetic, load/store, and texture pipelines. But even when there are no texture or arithmetic instructions, the value of Mali Core Cycles:Tripipe cycles is not the same as the Mali Load/Store Pipe:LS instruction issues counter. Why is this happening? If I am executing only memory instructions, Mali Core Cycles:Tripipe cycles should be equal to the load/store pipe cycles; instead I see that Mali Core Cycles:Compute cycles, Mali Compute Threads:Compute cycles awaiting descriptors, and Mali Core Cycles:Tripipe cycles have similar values.

 

It would be helpful if you could give some insight into these behaviours.

P.S. I am doing an academic project in which I am modeling the performance of OpenCL kernels on Mali GPUs.

P.P.S. I am not an Android developer looking at optimizations.

Reply
  • maasa said:
    Does the Mali Load/Store Pipe:LS instruction issues counter report the cycles taken to complete all the load/store instructions of all the threads in the kernel? I mean, does the counter add the memory stall cycles as well?

    It depends which Mali GPU you are using; for Mali-T600 and Mali-T620 you can use the issues counter to measure reissues due to cache misses. Later products changed how cache misses are handled, so the issues counter no longer shows cache misses for loads. Memory stalls are handled asynchronously - other threads can progress while some threads are blocked on memory misses - so a counter which measures stalled cycles isn't useful for measuring actual hardware throughput.

    maasa said:
    How can I get the cycles taken for executing arithmetic instructions and the cycles taken for executing LS instructions separately?

    The issues counters for the three pipelines are a close approximation.

    maasa said:
    Tripipe Cycles = max(A1, A2, LS) + Overhead2 Cycles

    For a fully loaded pipe with sufficient threads the reality would be closer to "max(A1 + Overhead A1, A2 + Overhead A2, LS + Overhead LS, Tex + Overhead Tex)"; the three pipeline types have different caches and caching structures specialized for their use, and some pipelines have different issue rates for different workloads, so the overhead is different in each case.

    Note that the A* loading will be identical, so you can ignore the differences between the multiple arithmetic pipes in the design (the arithmetic counters are effectively reporting A1 only).

    This relationship breaks down in cases where you don't have enough threads to keep the critical-path pipelines busy; e.g. a 1x1 kernel which generates a single thread will have very low utilization, even if that one thread successfully issues an instruction on every cycle it is eligible to, simply because there isn't enough work to keep the pipelines full. GPUs are throughput machines that rely on having a large pool of threads, and the throughput equations assume you have "sufficient" threads in the core.
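
    As a rough illustration of that "max of the pipes" estimate (all counter readings are made-up placeholders, and the per-pipe overhead is modelled here as a fractional inflation of the issue cycles purely as an assumption, not numbers from a real trace):

        # Sketch of the estimate described above: busy tripipe cycles are
        # approximately the max over the pipes of issue cycles plus a
        # per-pipe overhead (modelled here as a placeholder fraction).
        def estimate_tripipe_cycles(a1, a2, ls, tex,
                                    overhead_a=0.05, overhead_ls=0.10, overhead_tex=0.05):
            return max(a1 * (1 + overhead_a),
                       a2 * (1 + overhead_a),   # A1 and A2 loading is identical
                       ls * (1 + overhead_ls),
                       tex * (1 + overhead_tex))

        # Placeholder per-core counter averages from a hypothetical capture:
        print(estimate_tripipe_cycles(a1=1_200_000, a2=1_200_000,
                                      ls=2_500_000, tex=0))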

    HTH, 
    Pete
