Dear Team
I came across the HWCPipe library from ARM for accessing Mali GPU performance counters. If I sample the performance counters every millisecond from a separate process, independent of any graphics application (using just the HWCPipe APIs), does that show the global, system-wide count for that counter? And when multiple graphics processes are running, how can I tell which process a count belongs to at a specific time?
I know that if I use the HWCPipe APIs inside a graphics process, I can get per-process counters.
I would also like to understand how the gator daemon dumps the counters per process. Any pointers would be very helpful.
Thank you.
Best Regards,
Vikash
Thank you for your quick reply. Does that mean it is the global state of the counter and may include usage from multiple processes? Is there any way to get per-process performance counters using HWCPipe?
Yes, it's global. No means to filter to a single process.
So my only option is to use the gator daemon to get per-process counter information and ARM DS to visualize it. Could you give some pointers on how the gator daemon does it?
It doesn't - the counter data is the same. FTrace events (scheduling) do contain process information and can be used to modulate the counter data.
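As a rough illustration of what "modulating" global counter data with scheduling information could look like, here is a minimal sketch. It assumes you have already turned the scheduling events into a list of (start, end, pid) intervals during which each process had work on the GPU; that interval format and the function name are hypothetical, and this is not a description of how gator or Streamline actually does it. Each counter delta is simply split between processes in proportion to how long each one was active during the sample window, which is only an approximation.

```python
from collections import defaultdict

def attribute_counter(samples, intervals):
    """Split global counter deltas across processes by GPU activity overlap.

    samples:   list of (t0, t1, delta) global counter samples
    intervals: list of (start, end, pid) activity intervals (hypothetical
               format, derived from scheduling trace events)
    """
    per_pid = defaultdict(float)
    for t0, t1, delta in samples:
        overlaps = defaultdict(float)
        for start, end, pid in intervals:
            overlap = min(t1, end) - max(t0, start)
            if overlap > 0:
                overlaps[pid] += overlap
        busy = sum(overlaps.values())
        if busy == 0:
            # No process was seen on the GPU in this window.
            per_pid[None] += delta
            continue
        for pid, overlap in overlaps.items():
            per_pid[pid] += delta * overlap / busy
    return dict(per_pid)

# Hypothetical data: one 10-unit sample, two processes active for 5 units each.
shares = attribute_counter([(0, 10, 100)], [(0, 5, 1), (5, 10, 2)])
```

If two processes have work on the GPU at the same time, the hardware counters cannot separate them, so any split of this kind remains an estimate.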
Thank you, Peter. Where can I find which ftrace events the Mali GPU provides? Are there events for GPU task scheduling and context switching?
Source code for the Mali kernel drivers can be found here:
* https://developer.arm.com/tools-and-software/graphics-and-gaming/mali-drivers
HTH, Pete
Hi Pete,
Thank you for the pointers. I have one more query regarding the derived counter below, documented for the Bifrost family.
What does SUM refer to in the formula? Does it mean we have to read the counter for a certain duration, do the math, and report the bytes? Or how does it work?
Availability: All
With knowledge of the bus width used in the GPU the beat counter can be converted into a raw bandwidth counter.
L2.EXTERNAL_READ_BYTES = SUM(L2.EXTERNAL_READ_BEATS * L2.AXI_WIDTH_BYTES)
Note: Most implementations of a Bifrost GPU use a 128-bit (16 byte) AXI interface, but a 64-bit (8 byte) interface is also possible to reduce the area used by a design. This information can be obtained from your chipset manufacturer.
You may have multiple parallel L2 cache slices, depending on the number of shader cores in the GPU design. Each slice reports counters separately, so if you have multiple slices you need to add them together to get the total bandwidth for a given sample period.
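Reading the quoted text, SUM appears to mean adding up the per-slice values for one sample period. A minimal sketch of that arithmetic, assuming the per-slice beat counts have already been read for a single sample period (the function names and the sample values are hypothetical, and the 16-byte default is only the common case mentioned in the note above):

```python
def external_read_bytes(beats_per_slice, axi_width_bytes=16):
    """Sum L2.EXTERNAL_READ_BEATS over all L2 slices and convert to bytes."""
    return sum(beats * axi_width_bytes for beats in beats_per_slice)

def bytes_per_second(total_bytes, period_ms):
    """Convert the bytes moved in one sample period into a bandwidth."""
    return total_bytes * 1000 / period_ms

# Hypothetical sample: two L2 slices, one 1 ms sample period.
beats = [1000, 1500]
total = external_read_bytes(beats)        # (1000 + 1500) * 16 bytes
bandwidth = bytes_per_second(total, 1)    # bytes per second for that period
```

For a design with a 64-bit AXI interface, pass axi_width_bytes=8 instead.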
Hi Peter Harris,
Could you please provide more details here? What does this mean and how does it work? I am checking whether such a thing is possible using HWCPipe.
Which event is responsible for scheduling? I can see that the following events are available.
drm:drm_vblank_event_delivered
drm:drm_vblank_event_queued
drm:drm_vblank_event
mali:mali_jit_trim
mali:mali_jit_trim_from_region
mali:mali_jit_report_gpu_mem
mali:sysgraph_gpu
mali:sysgraph
mali:mali_jit_report_pressure
mali:mali_jit_report
mali:mali_jit_free
mali:mali_jit_alloc
mali:mali_mmu_page_fault_grow
mali:mali_total_alloc_pages_change
mali:mali_page_fault_insert_pages
mali:mali_pm_status
mali:mali_job_slots_event
power:gpu_frequency
gpu_mem:gpu_mem_total
I'd expect it to be mali:mali_job_slots_event, but not 100% sure.