GPU's cycles in Streamline

Hi guys,

I need some clarification regarding the GPU Active ($MaliGPUCyclesGPUActive) counter on a Mali-G76 (not really specific to this GPU though) in a Samsung S10e.

I'm getting a total of ~310 mega-cycles in 1 sec. From a quick look at the DVFS table file (found in /sys/devices/platform/18500000.mali/dvfs_table) it seems the max frequency is 702MHz (for some reason I can't find any Arm official page with the max frequency of this GPU though), and I'm relying on my interpretation of the values in there.

So, my question is, is it possible that the GPU is being underutilized or that DVFS has kicked-in (although I made sure to have the device running on top of an ice-pack); or both?

Cheers.

  • Hi JPJ, 

    For some reason I can't find any Arm official page with the max frequency of this GPU though.

    The max frequency entirely depends on the silicon implementation, so it can vary from device to device depending on silicon process, and target performance/area/power trade off used by our silicon partner.

    So, my question is, is it possible that the GPU is being underutilized or that DVFS has kicked-in  

    On mobile you'll be getting DVFS all the time; tuning CPU and GPU performance to match the workload demand is critical for getting good energy efficiency. 

    In general, for a high-end part like the S10 you nearly always want to run the GPU well below maximum frequency. The high core count is really designed to allow complex content to run at under-drive voltages to gain energy efficiency. Energy per operation is proportional to V^2, so high frequencies get expensive pretty quickly. Running well below 700MHz should be "normal" for this device, yes.
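    As a rough illustration of the V^2 point, here is a minimal Python sketch. The two voltages are made-up operating points chosen for the example, not values from any real DVFS table, and the function name is hypothetical.

```python
def relative_energy_per_op(voltage: float, reference_voltage: float) -> float:
    """Energy per operation relative to a reference point, assuming E is proportional to V^2."""
    if voltage <= 0 or reference_voltage <= 0:
        raise ValueError("voltages must be positive")
    return (voltage / reference_voltage) ** 2


# Hypothetical operating points in volts (illustrative only).
low_v, high_v = 0.60, 0.80
ratio = relative_energy_per_op(high_v, low_v)
print(f"{ratio:.2f}x energy per operation at {high_v} V versus {low_v} V")
```

    Under these assumed voltages the same work costs roughly 1.8x the energy at the higher operating point, which is why sustained high frequencies get expensive.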

    You can use the peak performance for short periods, but if you tried to run all the cores in a high-end device at max performance for a prolonged period you'd definitely overheat the device.  

    HTH, 
    Pete

  • Thanks Pete! I've massively oversimplified the problem, so thanks for the clarification. 

    What I meant by the "max frequency" was perhaps a representative frequency (e.g. as in this page https://developer.arm.com/ip-products/graphics-and-multimedia/mali-gpus/mali-t860-and-mali-t880-gpus for the T880-MP16)? So, I was searching for the equivalent for the Mali-G76. I assume the frequency given in that page is not useful to an individual user, but rather to silicon implementers, right?

    Regardless, I was trying to establish a parallel with an old Arm presentation from the Nordic Game Conference 2014, "Performance Optimization and Debugging Mobile Games". Although there's no text, my interpretation was that I could somehow benchmark the value I was getting in the GPU Active counter against a set frequency.

    Having said all this, is there any meaningful use for that specific counter apart from using it to compare relative work across the different queues/job types (e.g. Fragment, Tiler, Interrupt)?

  • I assume the frequency given in that page is not useful to an individual user, but rather to silicon implementers, right?

    Yes, agreed. Those are really just guidance for what we'd expect the design to be able to achieve on some reference silicon implementation at nominal voltage with a middle-of-the-road transistor choice. A specific implementation can significantly deviate from that (different silicon process, different target top frequency, different transistor choices). For the same GPU design from Arm it's not unusual to have a 2x spread in maximum frequency on different implementations (small and fast vs wide and slow).

    Although there's no text, my interpretation was that I could somehow benchmark the value I was getting in the GPU Active counter against a set frequency.

    It's generally possible to get a feel for what a specific device can cope with, but this may need some empirical testing. Unfortunately nothing you can just read off a spec sheet.

    In the mass market devices with fewer cores, such as a Mali-G72 MP2 or MP3, you are generally able to run those close to their top frequency. There simply are not enough cores to cause thermal issues even if you run them fast. Getting those designs to hit 750MHz or 800MHz under normal usage scenarios is common.

    The high-end devices are more challenging, because the main issue is achieving thermally sustainable performance, not peak performance. For these systems you have a 2-3 Watt total budget (CPU, GPU, memory), and so you may find that different content has a different sustainable GPU frequency, because it uses more or less CPU processing or memory bandwidth. You may also find environmental issues are critical; phones rely on passive cooling, so a user in Brazil may hit problems when a different user in Iceland playing the same game on the same device may not. Setting rough budgets for these devices is possible, but it's hard to fine tune them without real data from a specific title.

    Having said all this, is there any meaningful use for that specific counter apart from using it to compare relative work across the different queues/job types (e.g. Fragment, Tiler, Interrupt)?

    For developer use the two queue counters are generally the most useful, as each corresponds to a specific workload. The GPU Active counter simply increments on any cycle in which at least one of the queue counters increments, so it gives an aggregate. Some useful things to look for:

    * The two queue counters are useful individually, but showing that they are getting good overlap is also important. If you are GPU bound you really want the dominant queue to be very similar to the GPU Active counter (i.e. pipelining well without bubbles).

    * The GPU Active counter can show what frequency you are running at (zoom in, find a flat spot and measure the aggregate cycles over that window). Optimizing and getting that frequency as low as possible is important for energy efficiency, even if not VSYNC limited (lower frequency = lower voltage = V^2 energy savings). You could do this with the queue counters too, but GPU Active may cover more of the runtime.
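    Both checks above come down to simple ratios, sketched here in Python. The function names are hypothetical, the 310 mega-cycles in 1 second figure is the one from the original question, and the fragment queue count is invented for illustration. Note that the cycles-per-second figure only equals the clock frequency if the GPU is active for the whole window, which is why a flat, fully busy spot is needed; otherwise it is a lower bound.

```python
def average_frequency_mhz(active_cycles: int, window_seconds: float) -> float:
    """Active cycles per second, in MHz, over a measurement window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return active_cycles / window_seconds / 1e6


def queue_utilization(queue_active_cycles: int, gpu_active_cycles: int) -> float:
    """Fraction of GPU-active cycles during which a given queue was active."""
    if gpu_active_cycles <= 0:
        return 0.0
    return queue_active_cycles / gpu_active_cycles


# Figure from the original question: ~310 mega-cycles in 1 second.
gpu_active = 310_000_000
print(f"{average_frequency_mhz(gpu_active, 1.0):.0f} MHz average")

# Invented fragment queue count, for illustration only.
fragment_active = 300_000_000
print(f"fragment queue covers {queue_utilization(fragment_active, gpu_active):.0%} of GPU active time")
```

    A dominant queue ratio close to 100% suggests the work is pipelining well; a noticeably lower ratio points at bubbles worth investigating.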

    HTH, 
    Pete

  • Thanks so much for the highly detailed explanation Pete! Much appreciated!

  • zoom in, find a flat spot and measure the aggregate cycles over that window

    I can definitely confirm that, thanks. I also looked into /sys/devices/platform/18500000.mali/clock and watched the frequency change (325000 matched what I was seeing in Streamline at certain points). This frequency is listed in /sys/devices/platform/18500000.mali/dvfs_table.

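    For anyone wanting to repeat this cross-check, here is a minimal Python sketch that samples the clock node. It assumes the node holds a plain integer such as 325000 and that the script can read the file directly (for example on a rooted device, or adapted to read the output of adb). The path is the one from this thread and is specific to this device; the function names are hypothetical.

```python
import time
from pathlib import Path

# Path quoted in this thread for this particular device; other devices differ.
MALI_CLOCK = Path("/sys/devices/platform/18500000.mali/clock")


def read_gpu_clock(path: Path = MALI_CLOCK) -> int:
    """Return the first integer found in the clock node.

    Assumption: the node contains a plain integer such as 325000.
    """
    return int(path.read_text().split()[0])


def sample_gpu_clock(path: Path = MALI_CLOCK, samples: int = 10,
                     interval_s: float = 0.1) -> list[int]:
    """Read the clock node repeatedly and return the raw readings."""
    readings = []
    for _ in range(samples):
        readings.append(read_gpu_clock(path))
        time.sleep(interval_s)
    return readings


if __name__ == "__main__" and MALI_CLOCK.exists():
    # Print the distinct values seen, to compare against Streamline and dvfs_table.
    print(sorted(set(sample_gpu_clock())))
```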
  • I have to extend thanks for the detailed explanation as well. Great stuff!

    I have two additional questions:

    1. Is there a recommended way to quickly benchmark the device on app startup and set various budgets / LOD tweaks accordingly, so that the application runs as close as possible to the thermally sustainable threshold?

    2. Is there a recommended way to track energy usage / temperature so that the application can adapt at run-time to the conditions the device is currently in and keep the workload sustainable yet optimal?

    Also, any resource pointers / links relevant to the above questions are more than welcome.

    Thank you in advance,
    Milan

  • Hi Speedym, 

    Would you mind raising this as a new discussion thread, please? It makes searching the forums a lot easier for other users.

    Thanks, 
    Pete

  • Hi Pete!

    Yup, good idea, will do.