Mali GPU performance counters query

Hi, I'm optimizing my compute shader using Streamline. I have some questions about the counters:

1. My GPU is a Mali-G78 MP24 with 24 cores. Are GPU Cycles and Mali Core Cycles the sum across all 24 cores?

2. GPU cycles is 400k and execution core active cycles is 288k. Does that mean the execution core was idle for 400k - 288k = 112k cycles? What is the execution core waiting for during those 112k cycles (memory fetches or something else)?

3. If the execution core is waiting for memory data to return, do the Load/Store Cycles include the waiting cycles? Do the execution core active cycles include the waiting cycles?

Thank you!

Parents
  • Before you go too much further, I'd recommend applying an analysis template matching your GPU (https://developer.arm.com/tools-and-software/graphics-and-gaming/arm-mobile-studio/learn/get-started/get-started-with-streamline/capture-a-profile). This makes the charts a lot easier to understand (although a lot won't be relevant for a compute-only use case).

    1. My GPU is a Mali-G78 MP24 with 24 cores. Are GPU Cycles and Mali Core Cycles the sum across all 24 cores?

    Shader core counters are presented as a per-core figure, so they are normalized for the GPU size in the design. 

    2. GPU cycles is 400k and execution core active cycles is 288k. Does that mean the execution core was idle for 400k - 288k = 112k cycles? What is the execution core waiting for during those 112k cycles (memory fetches or something else)?

    Yes. I can't tell what it's waiting for from the counters you've shared, but one common cause is GPU cache maintenance, which ensures that the CPU and GPU see the same view of memory. Some designs have hardware coherency, which can skip the cache maintenance, but most don't.

    The other possibility is that your compute job is too small to fill the entire GPU (too few workgroups), so some of the GPU cores are idle. Because the shader core counters are averaged across cores, this can show up as a reduction in the reported value. Unfortunately, individual core values cannot be recovered, as the averaging is done on the device before the tool sees them.

    3. If the execution core is waiting for memory data to return, do the Load/Store Cycles include the waiting cycles? Do the execution core active cycles include the waiting cycles?

    Yes, execution core active will increment any cycle that a thread is live in the core, even if that thread is blocked waiting on memory.  

    HTH, 
    Pete
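
A minimal sketch of the arithmetic behind question 2, using the 400k and 288k figures from the thread. The second half is purely illustrative: the per-core values (18 cores busy, 6 idle) are made-up numbers, chosen only to show how averaging across cores can lower the reported figure when a job does not fill the GPU. The variable names are assumptions, not Streamline counter identifiers.

```python
# Figures quoted in the thread.
gpu_cycles = 400_000
execution_core_active = 288_000  # reported as a per-core average

# Cycles in which, on average, no thread was live in the execution core.
idle_cycles = gpu_cycles - execution_core_active
utilization = execution_core_active / gpu_cycles

print(f"idle cycles: {idle_cycles}")       # 112000
print(f"utilization: {utilization:.0%}")   # 72%

# Hypothetical per-core breakdown (NOT recoverable from the tool):
# 18 of 24 cores active for the whole window, 6 cores with no work.
per_core_active = [400_000] * 18 + [0] * 6
averaged = sum(per_core_active) / len(per_core_active)

print(f"averaged per-core value: {averaged:.0f}")  # 300000
```

In this made-up case no busy core ever stalls, yet the averaged counter still sits well below GPU cycles, which is why the averaged value alone cannot separate "all cores partly idle" from "some cores completely idle".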

Children
  • Hi, thank you for the reply.

    Do the Load/Store Cycles also include the cycles spent waiting on memory?

    Yes, execution core active will increment any cycle that a thread is live in the core, even if that thread is blocked waiting on memory.  
  • Yes. I can't tell what it's waiting for from the counters you've shared, but one common cause is GPU cache maintenance, which ensures that the CPU and GPU see the same view of memory. Some designs have hardware coherency, which can skip the cache maintenance, but most don't.

    The other possibility is that your compute job is too small to fill the entire GPU (too few workgroups), so some of the GPU cores are idle. Because the shader core counters are averaged across cores, this can show up as a reduction in the reported value. Unfortunately, individual core values cannot be recovered, as the averaging is done on the device before the tool sees them.

    Is there any counter that can indicate GPU cache maintenance?

    Or are there other counters that can help me find the cause of the 112k idle cycles?

    I have dispatched the shader with the maximum number of threads.

  • Do the Load/Store Cycles also include the cycles spent waiting on memory?

    No. The load/store cycles are the actual data cache access cycles.
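
A toy model of the two counter definitions given in the answers, in case it helps to see them side by side. The per-cycle timeline and the state names ("idle", "access", "wait", "compute") are invented for illustration and do not come from real hardware or from Streamline; the only rules taken from the thread are that execution core active counts any cycle with a live thread (including one blocked on memory) and that load/store cycles count only actual data cache accesses.

```python
from collections import Counter

# Hypothetical per-cycle states for one shader core (toy data).
#   idle    : no thread live in the core
#   access  : a load/store actually accesses the data cache
#   wait    : a thread is live but blocked waiting on memory
#   compute : a thread is live and doing arithmetic
timeline = ["idle", "access", "wait", "wait", "wait", "access", "compute", "idle"]

counts = Counter(timeline)

# Increments whenever a thread is live, even if it is blocked on memory.
execution_core_active = len(timeline) - counts["idle"]

# Increments only on cycles that really access the data cache.
load_store_cycles = counts["access"]

print(execution_core_active)  # 6
print(load_store_cycles)      # 2
```

Under these assumptions the three "wait" cycles show up in execution core active but not in load/store cycles, matching the two answers above.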