Mali Offline Compiler - arithmetics cycles vs texture cycles

As the documentation and other sources explain, when you work with the Mali Offline Compiler you should first focus on the stage with the highest cycle count in its output (i.e. the arithmetic, load/store, or texture stage).

One thing I noticed is that in pretty much any shader texture unit is never a bottleneck.
Example:

Hardware: Mali-T720 r1p1
Architecture: Midgard
Driver: r23p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 4
Uniform registers: 1
Stack spilling: false

                                A      LS       T    Bound
Total instruction cycles:   22.00    1.00    5.00        A
Shortest path cycles:       10.75    1.00    5.00        A
Longest path cycles:        10.75    1.00    5.00        A

A = Arithmetic, LS = Load/Store, T = Texture


Every texture instruction usually takes one cycle (on Midgard at least). 

So if you add another texture, you need to do something with it, at least blend it with other computation, which means arithmetic cycles will go up as well. So, as I said, texture cycles are almost never higher than the other columns.
So when I work on optimizing shaders, my current intuition is to still be quite aggressive and try to reduce texture fetches as much as possible. And usually I don't trade off arithmetic against texture fetches, i.e. I don't move computation from arithmetic to a baked texture unless it's something very expensive.

Another thing: the Mali Offline Compiler assumes that each texture fetch is bilinear and that the texture has mipmaps.
We currently mostly use bilinear filtering without mipmaps on mobile.
Rationale: when you start using mipmaps, you also need trilinear filtering, otherwise the transition between mipmap levels becomes visible.
Trilinear filtering means double the cycles, and more memory throughput is needed (fetching 8 texels instead of 4 for bilinear).
On the other hand, not using mipmaps means poor cache utilization, which also means more memory throughput is needed. I have no idea which is better in practice. I guess it depends on the project and hardware. Or is there a universal answer?

And also fetching texture means latency, this latency is hidden to some degree if shader use relatively small amount but I assume it's still there.

Once I switch to another project in the company, I'll have time to do extensive tests related to the cost of textures and hopefully build some intuition.
As I am impatient and curious, I do hope other more experienced devs will share their intuition here.

So my questions:
1. Is it a good strategy to aggressively optimize out texture fetches and treat them as very expensive (even if texturing is not the bottleneck according to the Mali Offline Compiler)? Should I adjust the score reported by the Mali Offline Compiler, i.e. multiply it by 2 (so it's trilinear), or should I use a GPU profiler and look at GPU metrics like memory throughput to make the final decision? How do you do it in practice?

2. Bilinear without mipmaps vs trilinear with mipmaps: what do you think is better in practice? How do you choose what to use? Does it depend on the hardware? We do need to support Midgard devices (we support very old devices, we're a mobile development company).

3. If you can share any links/books/resources explaining anything above which might help me, please do. I have already read the official Mali documentation and optimization guides.

  • Hi Mikhail, 

    One thing I noticed is that in pretty much any shader texture unit is never a bottleneck.

    Mali-T720 is an older GPU, with the lowest ALU:TEX ratio of any Midgard GPU, so it's going to be more arithmetic limited than any other Midgard GPU and any of the more modern GPU architecture families. It's really tuned for user interface rendering and casual gaming - complex 3D shaders are not going to perform as well as they would on higher-end products.

    Midgard GPUs are increasingly rare in devices - Mali-T720 was launched 10 years ago, and the last new Midgard GPU was released in 2016. 

    Is Mali-T720 really a GPU you need to target? A lot of the later devices with Midgard GPUs are Mali-T880-based, which has a lot more arithmetic performance than a Mali-T720. I also suspect you'd get a substantially different result on a newer Bifrost or Valhall GPU. If you want to try out a more modern entry-level device configuration I'd suggest Mali-G51 (Bifrost architecture) or Mali-G57 (Valhall architecture).

    Is it good strategy to aggressively optimize out texture fetches and treat them as very expensive thing (even if it's not a bottleneck by Mali offline compiler). 

    GPUs are designed to be efficient at texturing, so I wouldn't optimize it out for the sake of it if it's not the bottleneck. However ...

    We currently mostly use bilinear filtering without mipmaps on mobile. Rationale: when you start using mipmaps, you also need trilinear filtering, otherwise the transition between mipmap levels becomes visible.

    In general I'd always recommend using mipmaps for 3D content, even with bilinear filtering. The visible filtering line on the mipmap boundary tends to be less objectionable than under-sampling shimmer. In addition, the under-sampling gives you poor locality in the texture cache. If the 4 samples for a fragment quad hit different cache lines because of missing mips causing sparse sample locations then you'll take a 4x hit on filtering performance (which is more expensive than the 2x cost of trilinear filtering), so that's definitely one to watch out for.

    For the Valhall hardware we doubled the effective texturing performance, so trilinear is definitely feasible there. For Valhall I would also test the performance of trilinear samples with 2x MAX_ANISOTROPY - this can significantly improve texture image quality at glancing viewing angles - and the performance hit is usually manageable (for a 2x MAX with trilinear sub-samples, the cost of a sample is between 1x and 4x the cost of a bilinear sample, depending on orientation).
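
    As a rough sketch of the "should I multiply the texture score" question, the snippet below rescales the T column of the Mali-T720 report from the original post and recomputes which unit is bound. The helper names are hypothetical, and treating the 4x cache-locality hit as a plain cycle multiplier is a simplifying assumption, not something the offline compiler does.

```python
# Hypothetical helpers: rescale the texture column of an offline compiler
# report by a filtering-cost multiplier and recompute the bound unit.

def bound_unit(cycles):
    """Return the name of the unit with the highest cycle count."""
    return max(cycles, key=cycles.get)

def scale_texture(cycles, multiplier):
    """Return a copy of the report with the T column multiplied."""
    scaled = dict(cycles)
    scaled["T"] = cycles["T"] * multiplier
    return scaled

# Shortest-path cycles from the Mali-T720 report in the original post.
t720 = {"A": 10.75, "LS": 1.00, "T": 5.00}

# Multipliers taken from the discussion: the report assumes bilinear with
# mipmaps (1x), trilinear doubles the cost (2x), and a fragment quad whose
# 4 samples hit different cache lines takes a 4x hit (assumed worst case).
scenarios = {
    "bilinear + mipmaps": 1,
    "trilinear + mipmaps": 2,
    "bilinear, no mipmaps, worst case": 4,
}

results = {}
for name, multiplier in scenarios.items():
    scaled = scale_texture(t720, multiplier)
    results[name] = (scaled["T"], bound_unit(scaled))
    print(f"{name}: T = {scaled['T']:.2f}, bound = {bound_unit(scaled)}")
```

    Under these assumptions this particular shader stays arithmetic bound with trilinear filtering, and only becomes texture bound in the worst-case no-mipmap scenario. A profiler on a real device remains the way to confirm it.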

    3. If you can share with me any links/books/resources explaining anything above which might help me - please do share them as well.

    I'm not aware of much beyond our optimization guides that goes into more detail. Perhaps I need to write one =)

    Cheers, 
    Pete

  • Thank you, Pete :) I secretly hoped for you to answer my question :) 

    That Mali-T720 report above was just an example from one of our past projects.
    For the upcoming project I will reevaluate our target devices, so Mali-T720 will go out.
    My current intuition is that our users still use quite a few newer Midgard devices, so I plan to support them. I need to recheck the ratio of devices through our analytics. I don't remember it off the top of my head, and I could be wrong for 2022.

    Sorry if I am asking the same question again, I want to clarify this.

    So let's say I have this shader (it's from G31, I lost the full report).

    Work registers: 20
    Uniform registers: 12
    Stack spilling: false
    16-bit arithmetic: 74%
    
                                    A      LS       V       T    Bound
    Total instruction cycles:    4.12    0.00    1.38    2.00        A
    Shortest path cycles:        4.00    0.00    1.38    2.00        A
    Longest path cycles:         4.12    0.00    1.38    2.00        A
    

    According to the Mali Offline Compiler, texture operations take 2 cycles and arithmetic operations take 4 cycles.
    And let's say the textures are bilinear with mipmaps enabled.

    So does this report mean that if I reduce texture operations to 1 cycle, I won't gain anything performance-wise?
    I guess I might get some energy savings and maybe less heat, but the shader will execute in more or less the same time?
    Can I fully trust this info from the Mali Offline Compiler, or is it an approximation and the real situation on device is more complicated?




  • So does this report mean that if I reduce texture operations to 1 cycle, I won't gain anything performance-wise?
    I guess I might get some energy savings/maybe less heat. But shader will execute in more or less the same time? 

    Correct - texturing will run in parallel to the arithmetic, and arithmetic is the critical path. 

    Footnote - Mali-G31 is a lot like Mali-T720 - the arithmetic performance is cut down to save silicon area for simple user interface use cases. I'm not 100% confident on this one, but IIRC the Mali-G31 is rarely found in phones - it's intended for embedded consumer electronics use cases (DTV and set top box, etc).
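
    The "critical path" reading of the report can be sketched as below. This is a simplified model (the units are assumed to overlap perfectly, so the slowest one sets the cost), and the `critical_path` helper and the reduced-arithmetic figure are hypothetical.

```python
# Simplified model (an assumption, not a hardware guarantee): the units run
# in parallel, so the estimated cost is set by the slowest one.

def critical_path(cycles):
    """Return (unit, cycles) for the unit with the highest cycle count."""
    unit = max(cycles, key=cycles.get)
    return unit, cycles[unit]

# "Total instruction cycles" row from the Mali-G31 report above.
g31 = {"A": 4.12, "LS": 0.00, "V": 1.38, "T": 2.00}

baseline = critical_path(g31)
fewer_texture_cycles = critical_path({**g31, "T": 1.00})
fewer_arithmetic_cycles = critical_path({**g31, "A": 3.00})  # hypothetical

print(baseline)                 # ('A', 4.12)
print(fewer_texture_cycles)     # ('A', 4.12)
print(fewer_arithmetic_cycles)  # ('A', 3.0)
```

    Halving the texture cycles leaves the estimate unchanged, while trimming arithmetic moves it, which is why the bound column is the one to look at first.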

  • 1. By saying "texturing will run in parallel to the arithmetic" - can you elaborate a little bit more about it?

    Do I have correct understanding about this?

    So hardware executes multiple threads in lockstep (warp). 
    It comes to instruction which fetches texture.
    If the texture data is in cache, this instruction takes 1-4 cycles (depending on filtering/anisotropy).
    If not, it will take a lot more (hundreds or thousands of cycles).
    Once warp is blocked - hardware swaps it to other warp (and saves its registers into registers storage, which can become full so then hardware will have to wait)

    So by saying "texturing will run in parallel to the arithmetic" you mean that arithmetic unit will execute one warp while texturing unit will execute different warp and load/storage will execute third warp and so on - so different stages are more or less always busy with executing different warps - that's the way how latency is hidden.

    Core doesn't execute multiple instructions of single thread/reorder them/anything like that - things which happen on CPU - GPU works differently.
    Am I correct about stuff written above?

    2. Can you disclose size of cache lines and count of cycles to fetch texture in case of cache miss? 


  • Once warp is blocked - hardware swaps it to other warp (and saves its registers into registers storage, which can become full so then hardware will have to wait)

    Each shader core has capacity and register storage for hundreds of concurrent threads, so if one thread blocks the hardware can just select another one to run. No save/restore needed - it's an instant zero-cost switch.

    So by saying "texturing will run in parallel to the arithmetic" you mean that arithmetic unit will execute one warp while texturing unit will execute different warp and load/storage will execute third warp and so on - so different stages are more or less always busy with executing different warps - that's the way how latency is hidden.

    Yes, that's the general idea.

    2. Can you disclose size of cache lines and count of cycles to fetch texture in case of cache miss? 

    For line size, assuming 64 bytes is a good starting point for planning purposes (for both CPU and GPU).

    The latency of a cache miss - tens of cycles if you hit in L2, hundreds of cycles if you end up in DRAM. However, note that GPUs can hide most cache miss latency - we can just pick another thread to run that has data available.
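
    To make the 64-byte planning figure concrete, here is a toy sketch of how many cache lines a 2x2 fragment quad touches when its sample points are close together (mipmapped) versus far apart (minified without mipmaps). The texel format, the 4x4 block layout, and the one-texel-per-fragment simplification are all assumptions for illustration, not a description of the real Mali texture layout.

```python
LINE_BYTES = 64      # planning figure from the reply above
BYTES_PER_TEXEL = 4  # assumption: uncompressed RGBA8
BLOCK = 4            # assumption: one line holds a 4x4 texel block

def lines_touched(x0, y0, stride):
    """Count distinct cache lines hit by a 2x2 fragment quad.

    `stride` is the distance in texels between the sample points of
    neighbouring fragments. Each fragment is simplified to one texel.
    """
    lines = set()
    for dy in (0, 1):
        for dx in (0, 1):
            x = x0 + dx * stride
            y = y0 + dy * stride
            lines.add((x // BLOCK, y // BLOCK))
    return len(lines)

mipmapped = lines_touched(0, 0, stride=1)   # about one texel per pixel
no_mipmaps = lines_touched(0, 0, stride=8)  # texture minified 8x, no mips

print(mipmapped, no_mipmaps)
```

    With these assumed numbers the mipmapped quad stays inside one line, while the sparse quad spreads over four, which is the poor-locality case described earlier in the thread.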

  • This video might help introduce some of the concepts here:

    www.youtube.com/watch

  • Thanks again Peter. I watched through the videos, it was helpful

    Do I have correct understanding now? 

    i.e. each shader core has a list of threads (how many depends on architecture, Midgard - 256, Valhall - 1024 and usage of registers by shader program)

    Midgard executes a single thread at a time (because it's a vector architecture).
    Bifrost/Valhall execute a warp (i.e. 8/16 threads at the same time in lockstep).

    Once thread/warp stalls - core selects another one and this is considered "free"

    1. Am I correct that when thread/warp finishes, it's being removed from core thread set and immediately replaced by something else i.e. core pulls more work from some queue? or does it finish all thread set and then takes next batch? (I don't see reason for it but who knows)

    2. Does it mean that, for example, a Midgard core can effectively wait for 256 texture fetches in parallel without any problems?

    3. In theoretical situation where there is no texture cache (for simplicity of calculations) total execution time for fetching all 256 texels will be approximately 256+time_of_one_fetch (hundreds of cycles) instead of 256*time_of_one_fetch.

    4. if you can answer this one: can I apply this rough understanding model to all modern mobile GPUs (from other major vendors) or there are some caveats and better to study their documentation?

  • Midgard executes a single thread at a time (because it's a vector architecture).
    Bifrost/Valhall execute a warp (i.e. 8/16 threads at the same time in lockstep).

    Yes. Just to be clear "at a time" = per instruction issue. You can have multiple threads live at different stages in the pipeline.

    1. Am I correct that when thread/warp finishes, it's being removed from core thread set and immediately replaced by something else i.e. core pulls more work from some queue

    Yes, the shader core has queues of work waiting to become threads (the next compute work items, or the next set of rasterized fragments) as soon as there is capacity for them.

    2. Does it mean that, for example, a Midgard core can effectively wait for 256 texture fetches in parallel without any problems?

    Yes, that's the general idea. In reality if a very high percentage of your total thread pool is waiting for data you probably start to run out of things to do, so "without any problems" is going to be an optimistic outlook =0.

    3. In theoretical situation where there is no texture cache (for simplicity of calculations) total execution time for fetching all 256 texels will be approximately 256+time_of_one_fetch (hundreds of cycles) instead of 256*time_of_one_fetch.

    Yes, that's the idea.
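
    The two estimates in that question can be compared with a back-of-the-envelope sketch. The 300-cycle latency is a hypothetical stand-in for "hundreds of cycles", and one fetch issued per cycle is an assumption; real hardware will not overlap this cleanly.

```python
def serialized_cycles(num_fetches, latency):
    """Each fetch waits for the previous one to complete."""
    return num_fetches * latency

def overlapped_cycles(num_fetches, latency, issue_cycles=1):
    """One fetch is issued every `issue_cycles`, and the waits overlap.

    Uses the thread's approximation: issue time for all fetches plus
    the latency of one fetch.
    """
    return num_fetches * issue_cycles + latency

THREADS = 256         # Midgard thread count mentioned in the question
MISS_LATENCY = 300    # hypothetical DRAM miss latency in cycles

serial = serialized_cycles(THREADS, MISS_LATENCY)
overlapped = overlapped_cycles(THREADS, MISS_LATENCY)

print(serial, overlapped)  # 76800 556
```

    The gap between the two numbers is the latency hiding described above. It shrinks as soon as too many threads are waiting at once and the core runs out of other work.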

    can I apply this rough understanding model to all modern mobile GPUs (from other major vendors)

    I don't know if this is entirely accurate for other vendors - I don't know their microarchitectures - but I'd expect all GPUs to broadly fit this working model. 

    or there are some caveats

    There are always caveats =)

    Cheers, 
    Pete

    1. The best strategy for optimizing texture fetches would depend on the specific hardware and project you're working with. The Mali offline compiler provides a good starting point for identifying potential performance bottlenecks, but it's not always an accurate representation of what's happening on the actual hardware. To make the final decision, it's best to use a GPU profiler and look at metrics such as memory throughput to determine the actual performance impact of your optimizations.

    2. The choice between bilinear filtering without mipmaps and trilinear filtering with mipmaps is a tradeoff between filtering cost and visual quality. Bilinear filtering without mipmaps costs fewer filtering cycles per sample and has no visible transition between mipmap levels, but it under-samples minified textures, which causes shimmer and poor texture cache locality. Trilinear filtering with mipmaps gives better visual quality at roughly twice the filtering cost of bilinear. It's important to consider the specific hardware and project you're working with, as well as the target audience, when making this decision.

    3. Here are a few resources that might be helpful for further learning: