Understanding Mali GPU Hardware Counters

Hi,

I have read your blog on Mali GPU Hardware Counters. I have a few questions. 

The Mali Job Manager Cycles:GPU cycles counter gives the total number of cycles the GPU was active. If I execute a compute workload (not graphics), I should be able to predict the execution time of the kernel from the Tripipe cycles counter. However, there is always a difference between Mali Job Manager Cycles:GPU cycles and Mali Core Cycles:Tripipe cycles. I know that the values reported by Streamline are averages across all the shader cores, but what do these extra cycles signify?

I would also like to know exactly what the Mali Core Cycles:Compute cycles and Mali Compute Threads:Compute cycles awaiting descriptors counters report.

I ask because I ran an OpenCL benchmark with zero arithmetic instructions, yet the values of Mali Core Cycles:Compute cycles and Mali Compute Threads:Compute cycles awaiting descriptors were not zero, while Mali Compute Threads:Compute tasks and Mali Compute Threads:Compute threads started were zero.

Also, the Tripipe cycles counter value should be equal to the maximum of the cycles spent in the arithmetic, load/store, and texture pipelines. But even when there are no texture or arithmetic instructions, the value of Mali Core Cycles:Tripipe cycles is not the same as the Mali Load/Store Pipe:LS instruction issues counter. Why is this happening? If I am executing only memory instructions, Mali Core Cycles:Tripipe cycles should be equal to the Mali Load/Store Pipe cycles. Instead, I see that Mali Core Cycles:Compute cycles, Mali Compute Threads:Compute cycles awaiting descriptors, and Mali Core Cycles:Tripipe cycles have similar values.

 

It would be helpful if you could give some insight into these behaviours.

P.S. I am doing an academic project in which I am modeling the performance of OpenCL kernels on Mali GPUs.

P.P.S. I am not an Android developer looking at optimizations.

  • maasa said:
    I assume the cache line size is 64 bytes

    Yes.

    maasa said:
    For some kernels, when I compute total Mali L1 misses (avg misses given by Streamline * 4) and L2 hits, how can this happen?

    Not all L2 accesses are from the L1 LSC, so you would expect some hits from other sources - e.g. loading control structures and shader programs. Hard to give a precise answer without knowing your kernels.

    maasa said:
    Also in those cases, how do I get to know about L2 misses?

    For Midgard GPUs you have an L2 read lookups counter and an L2 read hits counter. Misses are lookups minus hits.

    Note that as a GPU is a massively multi-threaded design it's not uncommon to have parallel lookups from multiple threads and shader cores hitting the same addresses, which may get optimized in a manner which is impossible on a traditional CPU architecture.
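
The arithmetic discussed in this reply can be sketched in a few lines of Python. This is only an illustration: the function names and the counter values are hypothetical, and the factor of 4 is taken from the question, assuming it is the number of shader cores that Streamline averages over.

```python
def l2_read_misses(read_lookups: int, read_hits: int) -> int:
    """L2 read misses, computed as lookups minus hits."""
    if read_hits > read_lookups:
        raise ValueError("read hits cannot exceed read lookups")
    return read_lookups - read_hits


def total_from_core_average(avg_per_core: float, num_cores: int) -> float:
    """Scale a per-core average back up to a total across all shader cores."""
    return avg_per_core * num_cores


# Hypothetical counter values, not taken from any real capture.
lookups = 120_000
hits = 90_000

misses = l2_read_misses(lookups, hits)
hit_rate = hits / lookups

# Assumption: 4 shader cores, matching the "* 4" in the question.
total_l1_misses = total_from_core_average(avg_per_core=2_500.0, num_cores=4)

print(f"L2 read misses: {misses}")
print(f"L2 read hit rate: {hit_rate:.2f}")
print(f"Total L1 misses: {total_l1_misses:.0f}")
```

Because the L2 also serves requests that do not come from the L1 load/store cache, the total L1 misses computed this way need not match the L2 lookups.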
