Low-level shader optimization resources

I recently started using Mali Offline Compiler and got great results with it: it helped me improve performance in the game we develop quite significantly.

The problem is that I still don't have a good understanding of which code changes will actually help and which won't.

I know some general rules, e.g. use half precision wherever you can, spot multiply-add (MADD) opportunities, avoid division, and so on.
In a lot of cases where I see an optimization opportunity, I modify the code, compile it with malioc, and get the same result (or sometimes an even worse one).

In a way I try semi-random things and just check with the profiler whether they help. This approach works to some degree (I end up with faster shaders after a few tries), but I really want to dig deeper and understand what is going on.

I am also preparing a talk about Mali Offline Compiler for other devs in the company I work for, so I want to share verified information and give people real answers (rather than "try these N things and hope some of them help").

What doesn't help is that Unity also has a transpiler from HLSL to GLSL that applies optimizations of its own (it ignores IEEE 754 and freely reorders operations), which makes it a lot harder to understand what's going on.
At least Unity's output (the final GLSL) is visible. I have no idea what optimizations Mali Offline Compiler applies: it doesn't provide any assembly code, only numbers.

My current approach is the following: take the results for the earliest GPU we support (Mali T720) and optimize shaders for it.
For now I ignore the differences between vector (Midgard) and scalar GPUs and assume that if shaders are fast enough for Midgard, they will be fast enough for newer GPUs too :)

In theory I can check output for high quality shader variants for newer GPUs (scalar) but currently I am not doing it.

So my questions are:

1. When I recognize a place for an optimization, apply it, and get exactly the same result from the compiler, does that mean the optimization is totally useless? What about other compiler versions, other GPUs, GPUs from different vendors, etc.?

What I do now: I leave this "optimization" (which does nothing according to malioc) in the code as long as it doesn't make the code more complex.
My reasoning is that some other compilers (older versions or different chips) may not apply this optimization automatically, so I do it manually. Is this an OK approach or not?

2. Can you recommend a better way to understand what's happening?

How do you usually do it? My only guess (I haven't tried it yet) is to use some other shader compiler (for example one for desktop GPUs), get assembly from it, and use that to understand what's going on. But this will only work for simple cases and only if the ISA is very similar. I am not experienced in low-level work, so I am assuming a lot here and have no idea whether this is a viable approach.

3. Can you recommend any resources about shader optimization?

e.g. how to optimize for vector GPUs, how to optimize for scalar GPUs, common optimizations and tricks, guides about how everything works, etc.

So any documentation, videos, or books which might help me understand this topic better, i.e. how to write the fastest theoretically possible shaders.
I have a basic understanding of x86/x64 assembly (for CPUs), but I don't mind if these guides are more advanced; I am willing to learn the prerequisites if needed.

P.S. I really have hope you'll share your knowledge but everyone else is welcome too :) 

  • Hi Mikhail, 

    The short answer is that this is definitely one of the more challenging areas to completely optimize. Most of the useful details of the GPU pipeline (from any vendor) are generally not public, so you are inevitably working blind some of the time.

    One important point to note about the Midgard GPUs is that they have significant visibility issues in the offline compiler because one instruction is really a bundle of multiple sub-operations (up to 2 scalar and 3 vector ops). Optimizations often remove a sub-operation from an instruction bundle, but you will not see an improvement in the offline compiler cycle count unless a whole bundle is removed.

    For this reason I'd generally suggest measuring on the newer scalar GPUs - the visibility you get tends to be better because removing one scalar op will show up in the offline compiler metrics. Most optimizations tend to be portable across architectures.

    In terms of general recommendations for things to try:

    Use mediump as much as possible (including for uniform inputs, buffer inputs, and vertex outputs) - good for computational cost, good for register pressure.

    Branches on Midgard have bad performance due to their impact on instruction bundling, so try to avoid them. Use compile-time specialization as much as possible rather than e.g. branches based on uniforms. The scalar GPUs are much better here, but still generally see some minor benefit from specialization (at the CPU expense of needing to compile and manage more shader variants).

    Use literal constants in the shader source rather than uniforms if values never really change. It gives the compiler more ability to optimize computation, constant storage, and unroll loops at compile time.

    Write clean code. Load data from memory at the right precision, and try to avoid casting between mediump/highp as it's not always free. Remove redundant floating point operations from the source - we tend to be quite conservative on removing and reordering float operations because it has a habit of introducing NaN/infinities/negative values in places the developer didn't expect them - so don't assume the compiler can remove them.

    Don't use "invariant" unless you _really_ need it. It can have major impacts on performance because it disables a lot of optimizations for the invariant variables.
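
    The compile-time specialization recommended above means something on the CPU side has to generate and manage the shader variants. A minimal sketch of that bookkeeping follows. Everything in it is hypothetical: the shader, the flag names USE_FOG and USE_TINT, and the helper functions are invented for illustration and are not part of any Arm or Unity tool. The only GLSL rule it relies on is that the #version directive must stay first, so the generated #define lines go directly after it.

```python
from itertools import product

# Hypothetical fragment shader; the feature flags replace uniform-based branches.
BASE_SHADER = """#version 300 es
precision mediump float;
uniform sampler2D u_albedo;
in vec2 v_uv;
out vec4 o_color;
void main() {
    vec4 c = texture(u_albedo, v_uv);
#if USE_FOG
    c.rgb = mix(c.rgb, vec3(0.5), 0.25);  // literal constants, not uniforms
#endif
#if USE_TINT
    c.rgb *= vec3(1.0, 0.9, 0.8);
#endif
    o_color = c;
}
"""


def specialize(source, flags):
    """Insert one '#define NAME 0|1' per flag right after the #version line."""
    version_line, rest = source.split("\n", 1)
    if not version_line.startswith("#version"):
        raise ValueError("expected #version on the first line")
    defines = "".join(
        f"#define {name} {int(enabled)}\n" for name, enabled in sorted(flags.items())
    )
    return f"{version_line}\n{defines}{rest}"


def all_variants(source, flag_names):
    """Build every on/off combination, keyed by the tuple of flag values."""
    variants = {}
    for values in product((False, True), repeat=len(flag_names)):
        variants[values] = specialize(source, dict(zip(flag_names, values)))
    return variants


variants = all_variants(BASE_SHADER, ["USE_FOG", "USE_TINT"])
print(len(variants), "variants generated")
```

    Each variant can then be compiled with malioc separately. The variant count doubles with every flag, which is the CPU-side cost mentioned above.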

    We have a whole collection of recommendations here if you've not found it:


    HTH,
    Pete
  • In terms of applicability across vendors, most of what we recommend is relatively generic advice so I'd hope it helps across the board (or at least doesn't make things worse).

    There are definitely code generation issues that end up driver-version specific, or hardware-version specific, so I can't promise that everything in the offline compiler reports will be perfectly portable though ;)

    Cheers,
    Pete

  • Wow, I didn't expect such a quick answer :) Thank you! :)

    A few things I want to ask.

    1. So, am I correct that vector GPUs also have scalar ALUs?
    I am not familiar with this concept of sub-operations and bundles. Can you please tell me a little more, or point me to where I can read about it?

    So you're saying that in 1 cycle the arithmetic unit does 2 scalar and 3 vector ops?
    What does that mean? Does it mean that, let's say, in 1 cycle I will be able to do 2 scalar multiplications and 3 vec4 multiplications (so 5 operations in total)?

    2. I am a little confused by the non-whole numbers, e.g. 1.5 cycles, in the Mali compiler output. What do they mean exactly?
    I get the idea that a lower number is better than a higher one, but I am still puzzled about what 0.5 cycles actually means.

    3. You're saying that I won't see any change when I compile for Midgard but I will see one when compiling for Bifrost. Am I correct that if the result for Bifrost gets better, I should expect the result for Midgard to be better too (i.e. malioc is just hiding it from me), or does it mean it will be faster on Bifrost and the same on Midgard?

  • 1. So, am I correct that vector GPUs also have scalar ALUs? 

    For Midgard this is a good overview: https://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/5

    So you're saying that in 1 cycle arithmetic unit does 2 scalar and 3 vector ops? 

    Yes, assuming the compiler can fill all 5 sub-ops with useful work (it often can't, so some sub-ops are unused that cycle).

    2. I am a little bit confused by non whole numbers i.e. 1.5 cycles in Mali compiler output. What does it mean exactly?

    Most Mali GPUs have multiple arithmetic units - e.g. Mali-T860 has two arithmetic units per core, and Mali-T880 has three. The cycle costs in the offline compiler are the throughput per core, so normalized for pipeline count. A shader with 3 arithmetic instructions would be 1.5 cycles on Mali-T860 and 1 cycle on Mali-T880.
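
    The normalization described above is plain division. A small sketch, assuming the reported cost is simply the instruction count divided by the number of arithmetic units per core (the function name is made up for this example; the real offline compiler model may include more than this):

```python
from fractions import Fraction


def normalized_cycles(instructions, arithmetic_units):
    """Per-core throughput cost: instructions shared across the arithmetic units."""
    return Fraction(instructions, arithmetic_units)


# The two GPUs and unit counts quoted in the reply above.
for gpu, units in (("Mali-T860", 2), ("Mali-T880", 3)):
    print(gpu, float(normalized_cycles(3, units)))
```

    This is where fractional values such as 1.5 come from: the core as a whole retires 3 instructions every 1.5 cycles when two units work in parallel.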

    3. When you're saying that I won't be able to see any changes when I compile for Midgard but I would be able to see when it's compiled for Bifrost: am I correct that if I see that result for Bifrost becomes better - I should expect result for Midgard to be better too (i.e. malioc confuses me) or does it mean it will be faster on Bifrost and the same on Midgard? 

    The latter. Removing one scalar op will go faster on Bifrost/Valhall, but will only help on Midgard if a whole instruction bundle becomes empty (i.e. if it means that none of the 5 sub-ops are used).
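
    A toy model of why the Midgard numbers can stay flat, under loud assumptions: each bundle is represented only by how many of its 5 sub-op slots are used, the Midgard cost is the number of non-empty bundles, and the scalar cost is the total op count. The bundle contents are invented, and the model ignores that a real compiler may repack the remaining sub-ops into fewer bundles.

```python
MAX_SUB_OPS = 5  # up to 2 scalar + 3 vector sub-ops per Midgard bundle (from the thread)


def midgard_cost(bundles):
    """Toy Midgard metric: a bundle costs the same however many slots it uses."""
    assert all(0 <= used <= MAX_SUB_OPS for used in bundles)
    return sum(1 for used in bundles if used > 0)


def scalar_cost(bundles):
    """Toy scalar metric: every individual op is visible in the count."""
    return sum(bundles)


before = [3, 1, 2]         # hypothetical shader: slots used in each of 3 bundles
after_partial = [2, 1, 2]  # one sub-op removed, but its bundle is still in use
after_empty = [3, 0, 2]    # the only sub-op of the second bundle removed

for name, shader in (("before", before),
                     ("after_partial", after_partial),
                     ("after_empty", after_empty)):
    print(name, "midgard:", midgard_cost(shader), "scalar:", scalar_cost(shader))
```

    In this model both edits lower the scalar count, but only the one that empties a whole bundle changes the Midgard count.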

    HTH, 
    Pete