Mali offline compiler - L/S cycles meaning

Hi, 

It's slightly unclear to me what the reported L/S cycles refer to. Since malioc does not take memory latency etc. into account, are those cycles just related to the number of instructions issued to fetch attribute data and store the pre-interpolated varying results?

e.g. 

                               A      LS       T    Bound
Total instruction cycles:   20.60   35.00    0.00       LS
Shortest path cycles:       16.60   29.00    0.00       LS
Longest path cycles:          N/A     N/A     N/A      N/A

Cheers

  • Hi JPJ, 

    The aim is that it reports architectural throughput for the Load/Store pipeline, i.e. the number of active cycles spent doing useful work.

    In terms of what it counts:

    • For Midgard family GPUs the LS pipe includes the interpolator - there is no separate varying pipeline.
    • For Bifrost and Valhall family GPUs the LS pipe excludes the interpolation - the interpolator is a separate unit reported as the "V" pipe in the reports. 

    L/S includes any non-texture memory access (attributes, UBOs, SSBOs, atomics, images, local memory in compute shaders, stack spills, programmatic tile access).

    One caveat on the newer hardware (Bifrost / Valhall) is that the LS metric may over-estimate. The hardware can merge LS accesses for threads in the same warp if they hit the same cache line, but the compiler cannot know if this happens at compile time so you get the conservative number.
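
    To make that concrete, here's a toy model of the merging (Python, purely for illustration - the warp width and cache line size are made-up round numbers, not the exact figures for any particular core):

    # Toy model of the LS merge behaviour described above.
    # WARP_SIZE and CACHE_LINE are illustrative values only.
    WARP_SIZE = 16
    CACHE_LINE = 64  # bytes

    def merged_ls_transactions(byte_addresses):
        # What the hardware might actually issue for one warp-wide load:
        # one transaction per distinct cache line touched by the warp.
        return len({addr // CACHE_LINE for addr in byte_addresses})

    # Compile-time view: malioc cannot see the addresses, so it has to
    # assume no merging happens and charge the worst case.
    static_estimate = WARP_SIZE

    # Threads reading consecutive 4-byte elements all land in one line...
    coalesced = [thread * 4 for thread in range(WARP_SIZE)]
    # ...threads striding far apart each hit their own line.
    scattered = [thread * 4096 for thread in range(WARP_SIZE)]

    print(static_estimate)                    # 16 (conservative report)
    print(merged_ls_transactions(coalesced))  # 1  (fully merged at runtime)
    print(merged_ls_transactions(scattered))  # 16 (no merging possible)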

    HTH, 
    Pete

  • Thanks for the reply Pete. So, my interpretation of "architectural throughput" is that it's a combination of memory-related cycles (hits/misses, latency) and instruction cycles - is this assumption correct?
    Regarding the interpolator cycles, wouldn't those scale with the number of pixels a primitive spans? So, more pixels, more cycles?

  • "Architectural throughput" is just the processing cost of "doing" the instruction. Most of the time the GPU can hide misses and fetch latency - we have other things to run in parallel - so that's all ignored for the purposes of this metric. 

    For the interpolator costing, the cycle cost here is per fragment, so primitive size doesn't matter for this metric (but it would for determining total draw call cost - you need to scale these numbers by your screen coverage).
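
    As a back-of-envelope sketch of that scaling (assuming a fragment shader report - both figures below are placeholders, not tool output):

    # Total draw cost scales the per-fragment report numbers by coverage.
    # Both figures are made up for illustration.
    ls_cycles_per_fragment = 35.0   # per-fragment LS cycles from a report
    covered_fragments = 250_000     # fragments the draw actually shades

    print(ls_cycles_per_fragment * covered_fragments)  # 8,750,000 LS cycles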

    HTH, 
    Pete

  • Thanks for the clarification Pete! So, for the Midgard case this metric mixes vertex and fragment stage cost: a "fixed" cost for the 3 vertices (in the case of a triangle) and a variable, coverage-dependent cost for the fragment side, correct?
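
    Roughly, the mental model I have is something like this (Python sketch; the cycle figures are placeholders rather than real malioc output, and vertex reuse across triangles is ignored):

    # Per-triangle LS cost on Midgard, as I understand it: a fixed cost for
    # shading the three vertices plus a coverage-dependent cost per fragment
    # (which, on Midgard, also covers the interpolation).
    # Placeholder figures, not real report numbers.
    vertex_ls_cycles = 10.0     # per vertex, from the vertex shader report
    fragment_ls_cycles = 35.0   # per fragment, from the fragment shader report

    def triangle_ls_cycles(covered_fragments):
        return 3 * vertex_ls_cycles + covered_fragments * fragment_ls_cycles

    print(triangle_ls_cycles(100))     # small triangle
    print(triangle_ls_cycles(10_000))  # large triangle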