Understanding Mali GPU Hardware Counters

maasa over 7 years ago

I have read your blog on Mali GPU Hardware Counters. I have a few questions.

The Mali Job Manager Cycles:GPU cycles counter gives the total number of cycles the GPU was active. If I execute a compute workload (not graphics), I should be able to predict the execution time of the kernel from the Tripipe cycles counter. There is always a difference between Mali Job Manager Cycles:GPU cycles and Mali Core Cycles:Tripipe cycles. I know that the values reported by Streamline are averages across all the shader cores, but what do these extra cycles signify?

I would also like to know exactly what the Mali Core Cycles:Compute cycles and Mali Compute Threads:Compute cycles awaiting descriptors counters report.

I ask because I ran an OpenCL benchmark with zero arithmetic instructions, yet the values of Mali Core Cycles:Compute cycles and Mali Compute Threads:Compute cycles awaiting descriptors were not zero, while Mali Compute Threads:Compute tasks and Mali Compute Threads:Compute threads started were zero.

Also, the Tripipe cycles counter value should be equal to the maximum of the cycles spent in the arithmetic, load/store, and texture pipelines, but even when there are no texture or arithmetic instructions, the value of Mali Core Cycles:Tripipe cycles is not the same as the Mali Load/Store Pipe:LS instruction issues counter. Why is this happening? If I am executing only memory instructions, Mali Core Cycles:Tripipe cycles should be equal to Mali Load/Store Pipe cycles; instead I see that Mali Core Cycles:Compute cycles, Mali Compute Threads:Compute cycles awaiting descriptors, and Mali Core Cycles:Tripipe cycles have similar values.

It would be helpful if you could give some insight into these behaviours.

P.S. I am doing an academic project in which I am modeling the performance of OpenCL kernels on Mali GPUs.

P.P.S. I am not an Android developer looking at optimizations.

Peter Harris over 7 years ago in reply to maasa

maasa said:
I assume the cache line size is 64 bytes

Yes.

maasa said:
For some kernels, when compute total Mali L1 misses (avg misses given by streamline * 4) and L2 hits, How can this happen?

Not all L2 accesses are from the L1 LSC, so you would expect some hits from other sources - e.g. loading control structures and shader programs. Hard to give a precise answer without knowing your kernels.

maasa said:
Also in those cases, how do I get to know about L2 misses?

For Midgard GPUs you have an L2 read lookups counter and an L2 read hits counter. Misses are lookups minus hits.

Note that as a GPU is a massively multi-threaded design it's not uncommon to have parallel lookups from multiple threads and shader cores hitting the same addresses, which may get optimized in a manner which is impossible on a traditional CPU architecture.
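
As a rough illustration of the subtraction described above, here is a minimal Python sketch. It is not part of Streamline or gator; the function names and the sample counter values are hypothetical, and clamping the result at zero is an assumption made to tolerate small sampling skew between the two counters.

```python
def l2_read_misses(read_lookups: int, read_hits: int) -> int:
    """Estimate L2 read misses as lookups minus hits."""
    if read_lookups < 0 or read_hits < 0:
        raise ValueError("counter values must be non-negative")
    # Assumption: clamp at zero in case the two counters were
    # sampled at slightly different moments.
    return max(read_lookups - read_hits, 0)


def l2_read_hit_rate(read_lookups: int, read_hits: int) -> float:
    """Fraction of L2 read lookups that hit; 0.0 when there were no lookups."""
    if read_lookups == 0:
        return 0.0
    return min(read_hits / read_lookups, 1.0)


if __name__ == "__main__":
    # Hypothetical counter values, not measured data.
    lookups, hits = 1000, 750
    print("misses:", l2_read_misses(lookups, hits))
    print("hit rate:", l2_read_hit_rate(lookups, hits))
```

Because parallel lookups to the same address can be merged, a figure derived this way is best read as an estimate, not an exact count of memory fetches.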

maasa over 7 years ago in reply to Peter Harris

Hi Peter Harris,

I am using a Mali-T628 GPU, and there is no L2 lookups counter in Streamline v5.26.2.

All I have are Mali L2 Cache Reads:L2 read hits, Mali L2 Cache Writes:L2 write hits, Mali L2 Cache Reads:Read snoops and Mali L2 Cache Reads:Write snoops counters.

The sum of all these will give me L2 hits. But there is no counter to tell the total L2 lookups.

maasa over 7 years ago in reply to Peter Harris

Hi Peter Harris,

Since I do not have L2 read lookup counters, is it correct to use Mali L2 Cache Ext Writes:External read beats + Mali L2 Cache Ext Writes:External write beats as a proxy for L2 cache misses?

Does the read/write beats counter give the number of transactions that reach the DRAM?

Peter Harris over 7 years ago in reply to maasa

No - the beats count is the number of bus data beat cycles. A single transaction is normally multiple data beats (e.g. 64 byte transactions with 16 byte bus = 4 beats per transaction).
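
To make the arithmetic in that example concrete, here is a small Python sketch that converts a beat count into bytes and an estimated transaction count. The function names are hypothetical, and the 16-byte bus width and 64-byte transaction size are simply the figures from the example above; the actual bus width depends on the SoC, so treat both defaults as assumptions.

```python
def beats_to_bytes(beats: int, bus_width_bytes: int = 16) -> int:
    """Bytes moved over the bus: one beat transfers one bus width of data."""
    return beats * bus_width_bytes


def beats_to_transactions(beats: int,
                          bus_width_bytes: int = 16,
                          transaction_bytes: int = 64) -> float:
    """Estimate the transaction count, assuming every transaction is full size."""
    if transaction_bytes % bus_width_bytes != 0:
        raise ValueError("transaction size must be a multiple of the bus width")
    # e.g. 64-byte transactions on a 16-byte bus = 4 beats per transaction
    beats_per_transaction = transaction_bytes // bus_width_bytes
    return beats / beats_per_transaction


if __name__ == "__main__":
    # Hypothetical beat count, not measured data.
    print(beats_to_bytes(400))
    print(beats_to_transactions(400))
```

If some transactions are shorter than 64 bytes, the real transaction count would be higher than this estimate.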

maasa over 7 years ago in reply to Peter Harris

Hi Peter Harris,

Thanks.

So do you have any suggestions for getting L2 misses in the absence of L2 read lookup counters?

Also, does Mali L2 Cache Ext Reads:External bus stalls (AR) + Mali L2 Cache Ext Writes:External bus stalls (W) give the total number of stall cycles due to external memory requests?

Peter Harris over 7 years ago in reply to maasa

There definitely should be L2 read and write lookup counters available for Mali-T62x.

https://github.com/ARM-software/gator/blob/master/daemon/events-Mali-T62x_hw.xml

What do you get in your counter selection list in Streamline?

Stall counter definitions are here (e.g.):

https://community.arm.com/graphics/b/blog/posts/mali-midgard-family-performance-counters#jive_content_id_534_L2_EXT_AR_STALL

Cheers,
Pete

maasa over 7 years ago in reply to Peter Harris

Hi Peter Harris,

These are the L2 counters that are visible in my Streamline selection:

Mali L2 Cache Ext Reads:External bus stalls (AR)

Mali L2 Cache Ext Reads:External write bytes

Mali L2 Cache Ext Writes:External bus stalls (W)

Mali L2 Cache Ext Writes:External read bytes

Mali L2 Cache Reads:L2 read hits

Mali L2 Cache Reads:Read snoops

Mali L2 Cache Writes:L2 write hits

Mali L2 Cache Writes:Write snoops

maasa over 7 years ago in reply to Peter Harris

Hi Peter Harris

I just realised that Streamline is showing the events available in events-Mali-Midgard_hw.xml and not those in events-Mali-T62x_hw.xml.

Can you please let me know how I can change gator to use events-Mali-T62x_hw.xml instead of events-Mali-Midgard_hw.xml?