
Mali GPU performance counters query

Dear Team

I came across ARM's HWCPipe library for accessing Mali GPU performance counters. I would like to know: if I sample the performance counters every millisecond from a separate process, independent of any graphics application (just the HWCPipe APIs), does that give me the global/system-wide count for a particular counter? And when multiple graphics processes are running, how can I know which graphics process the count belongs to at a specific time?
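
For reference, here is a minimal sketch of that separate-process scenario. The read_gpu_counter helper below is hypothetical, a stand-in for however an enabled counter is read through HWCPipe; the point it illustrates is that whichever process runs the loop, the values it reads are device-wide and carry no process identity.

#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

// Hypothetical stand-in for reading one enabled GPU counter; replace the body
// with the real HWCPipe sampling call used in your integration.
uint64_t read_gpu_counter()
{
    return 0;  // placeholder value
}

int main()
{
    uint64_t previous = read_gpu_counter();
    while (true)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        uint64_t current = read_gpu_counter();
        // This delta covers everything the GPU did in the last millisecond,
        // across all processes on the system; the hardware counters carry no
        // information about which process generated the work.
        std::cout << "delta = " << (current - previous) << '\n';
        previous = current;
    }
}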

I know that if I use the HWCPipe APIs inside a graphics process, then I can get per-process counters.

I would also like to understand how the gator daemon dumps the counters per process. If you could provide any pointers, that would be very helpful.

Thank you.

Best Regards,

Vikash

  • So it means my only option is to use the gator daemon to get per-process counter information and use ARM DS to visualize it. Could you give me some pointers on how the gator daemon does it?

  • It doesn't - the counter data is the same. FTrace events (scheduling) do contain process information and can be used to modulate the counter data. 
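
    To make "modulate" concrete, here is a rough sketch. It is not the gator implementation, and every struct and field name in it is made up: the hardware counter samples stay device-wide, while the ftrace events supply a timeline of which process had work on the GPU, so each sampled delta can be attributed to whichever PID was active at that time.

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <vector>

    struct CounterSample
    {
        uint64_t timestamp_ns;  // when the device-wide sample was taken
        uint64_t delta;         // counter increase since the previous sample
    };

    struct JobEvent
    {
        uint64_t timestamp_ns;  // ftrace timestamp
        int      pid;           // process that submitted the GPU work
        bool     active;        // true = job started, false = job finished
    };

    // Attribute each counter delta to whichever PID had a job active on the
    // GPU during that interval. A real tool must also handle multiple job
    // slots, overlapping jobs, and idle time; this ignores all of that.
    std::map<int, uint64_t> attribute(const std::vector<CounterSample>& samples,
                                      const std::vector<JobEvent>& events)
    {
        std::map<int, uint64_t> per_pid;
        int current_pid = -1;  // -1 means GPU idle / unknown
        std::size_t e = 0;
        for (const auto& s : samples)
        {
            // Replay ftrace events up to this sample's timestamp.
            while (e < events.size() && events[e].timestamp_ns <= s.timestamp_ns)
            {
                current_pid = events[e].active ? events[e].pid : -1;
                ++e;
            }
            if (current_pid >= 0)
            {
                per_pid[current_pid] += s.delta;
            }
        }
        return per_pid;
    }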

  • Thank you, Peter. Where can I find which ftrace events the Mali GPU provides? Are the events for GPU task scheduling and context switching?

  • Hi Pete,

    Thank you for the pointers. I have one more query regarding the derived counter below, documented for the Bifrost family.

    What does SUM refer to in the formula? Does it mean we have to read the counter for a certain duration, do the math, and report the bytes? Or how does it work?

    5.3.2 L2.EXTERNAL_READ_BYTES (Derived)

    Availability: All

    With knowledge of the bus width used in the GPU the beat counter can be converted into a raw bandwidth counter.

    L2.EXTERNAL_READ_BYTES = SUM(L2.EXTERNAL_READ_BEATS * L2.AXI_WIDTH_BYTES)
    

    Note: Most implementations of a Bifrost GPU use a 128-bit (16 byte) AXI interface, but a 64-bit (8 byte) interface is also possible to reduce the area used by a design. This information can be obtained from your chipset manufacturer.

    Thank you.

    Best Regards,

    Vikash

  • You may have multiple parallel L2 cache slices, depending on the number of shader cores in the GPU design. Each slice reports counters separately, so if you have multiple slices you need to add them together to get the total bandwidth for a given sample period.
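
    As a worked sketch of that SUM (the helper names here are illustrative, not part of any Arm API): read L2.EXTERNAL_READ_BEATS from every L2 slice over one sample period, multiply by the AXI width in bytes (16 for a 128-bit interface, 8 for a 64-bit one, per the note above), and add the slices together; dividing by the period length then gives bandwidth.

    #include <cstdint>
    #include <vector>

    // Total bytes read over the external AXI port during one sample period.
    // beats_per_slice holds L2.EXTERNAL_READ_BEATS read from each L2 cache slice.
    uint64_t external_read_bytes(const std::vector<uint64_t>& beats_per_slice,
                                 uint64_t axi_width_bytes)
    {
        uint64_t total = 0;
        for (uint64_t beats : beats_per_slice)
        {
            total += beats * axi_width_bytes;  // bytes moved by this slice
        }
        return total;
    }

    // Convert the byte total into bandwidth for the sample period.
    double external_read_bandwidth(uint64_t bytes, double period_seconds)
    {
        return static_cast<double>(bytes) / period_seconds;
    }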

  • Hi Peter Harris,

    Could you please provide more details here? What does this mean and how does it work? I am checking whether such a thing is possible using HWCPipe or not.

    Which event is responsible for scheduling? I can see that the following events are available.

    drm:drm_vblank_event_delivered
    drm:drm_vblank_event_queued
    drm:drm_vblank_event
    mali:mali_jit_trim
    mali:mali_jit_trim_from_region
    mali:mali_jit_report_gpu_mem
    mali:sysgraph_gpu
    mali:sysgraph
    mali:mali_jit_report_pressure
    mali:mali_jit_report
    mali:mali_jit_free
    mali:mali_jit_alloc
    mali:mali_mmu_page_fault_grow
    mali:mali_total_alloc_pages_change
    mali:mali_page_fault_insert_pages
    mali:mali_pm_status
    mali:mali_job_slots_event
    power:gpu_frequency
    gpu_mem:gpu_mem_total

    Best Regards,

    Vikash

  • I'd expect it to be mali:mali_job_slots_event, but I'm not 100% sure.
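
    If you want to inspect it by hand, a minimal sketch along these lines enables one ftrace event and streams the raw records. The paths follow the standard tracefs layout, but the exact event directory depends on your kernel and Mali driver version, so treat events/mali/mali_job_slots_event as an assumption and check what exists under /sys/kernel/tracing/events/ on your target (root is required).

    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        const std::string tracefs = "/sys/kernel/tracing";

        // Enable the job-slot event (path assumed; verify on your device).
        std::ofstream enable(tracefs + "/events/mali/mali_job_slots_event/enable");
        if (!enable)
        {
            std::cerr << "could not open the enable file; check the event path\n";
            return 1;
        }
        enable << "1" << std::endl;

        // Stream raw trace records; each line includes the task name, PID and
        // timestamp, which is the process information used to correlate GPU
        // activity with counter samples.
        std::ifstream pipe(tracefs + "/trace_pipe");
        std::string line;
        while (std::getline(pipe, line))
        {
            std::cout << line << '\n';
        }
        return 0;
    }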