So I recently started using the Mali Offline Compiler and got great results with it, improving performance in the game we develop quite significantly. The problem I am still struggling with is that I don't have a good understanding of which code changes will actually help and which won't.

I know some general things, i.e. use halfs (half precision) wherever you can, recognize MADD patterns, avoid division and so on (the small sketch below shows the kind of thing I mean). But in a lot of cases where I see an optimization opportunity, I make the code modification, compile it with malioc and get the same result (or sometimes even a worse one). In a way I try semi-random things and just check with the profiler whether they help or not. This approach works to some degree (in the end I get faster shaders after a few tries), but I really want to dig deeper and gain some understanding.

I am also preparing a talk about the Mali Offline Compiler for other devs in the company I work at, so I want to share verified information with people and give them answers (not "try these N things and hope some of them benefit you").
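To make those general rules concrete, here is a minimal sketch of the kind of shader I have in mind (the uniform/varying names are made up for illustration, not taken from our actual code):

```glsl
// Minimal illustrative GLSL ES 1.00 fragment shader, not from our game.
precision mediump float;           // mediump maps to fp16 on Mali: less register pressure

uniform sampler2D u_tex;
uniform mediump vec3 u_tint;
uniform mediump float u_invScale;  // pass 1.0 / scale from the CPU instead of dividing per fragment
varying mediump vec2 v_uv;

void main()
{
    mediump vec3 base = texture2D(u_tex, v_uv).rgb;

    // a * b + c should map to a single MADD
    mediump vec3 color = base * u_tint + vec3(0.1);

    // multiply by a precomputed reciprocal rather than dividing
    color *= u_invScale;

    gl_FragColor = vec4(color, 1.0);
}
```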
What doesn't help is that Unity also has a transpiler from HLSL to GLSL, so it applies some optimizations on its own, and it's a lot harder to understand what's going on because sometimes Unity does the optimization for you (it ignores IEEE 754 and freely reorders things). At least the Unity output (the final GLSL) is visible; what optimizations the Mali Offline Compiler applies I have no idea, since it doesn't show any assembly, only numbers.

So my current approach, as I said, is the following: take the results for the earliest supported GPU (Mali-T720) and optimize shaders for it. For now I ignore the differences between vector (Midgard) and scalar GPUs and assume that if shaders are fast enough for Midgard, they will be fast enough for newer GPUs too :) In theory I could check the output of the high-quality shader variants on newer (scalar) GPUs, but currently I am not doing that.

So my questions are:

1. In a situation where I spot a place for optimization, make the change and get exactly the same result from the compiler, does that mean the optimization is totally useless? What about other compiler versions, other GPUs, GPUs from different vendors, etc.? What I do now is leave the "optimization" (which does nothing according to malioc) in the code as long as the code doesn't become more complex, reasoning that some other compiler (an older version or a different chip) might not do it automatically, so I do it by hand. Is that an OK approach or not? (A concrete example of this kind of change is at the end of this post.)

2. Can you recommend a better way to understand what's happening? How do you usually do it? My only guess (I haven't tried it yet) is to use some other shader compiler (for example one for desktop GPUs), get the assembly from it and use that to understand what's going on. But this will only work for simple cases and only if the ISAs are very similar. I am not experienced in low-level stuff, so I am assuming a lot here; no idea whether this is a viable approach.

3. Can you recommend any resources about shader optimization? For example: how to optimize for vector GPUs, how to optimize for scalar GPUs, common optimizations and tricks, guides on how everything works, etc. Any documentation, videos or books that might help me understand this topic better, i.e. how to write the fastest theoretically possible shaders. I have a basic understanding of x86/x64 assembly (for the CPU), and I don't mind if these guides are more advanced; I am willing to learn the prerequisites if needed.

P.S. Peter Harris, I really hope you'll share your knowledge, but everyone else is welcome too :)
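For question 1, this is the kind of hand change I mean (names made up; I am assuming the compiler folds the constant division into a multiply itself, which would explain why the malioc numbers don't move):

```glsl
// Illustrative only: the kind of rewrite where malioc reports the same
// cycle counts before and after.
precision mediump float;

uniform sampler2D u_tex;
uniform mediump float u_brightness;
varying mediump vec2 v_uv;

void main()
{
    mediump vec3 c = texture2D(u_tex, v_uv).rgb;

    // Version A (original):
    //   mediump vec3 outc = (c * u_brightness) / 2.0;
    // Version B (hand-"optimized"): presumably the compiler already folds
    // the constant division into a multiply, so the report is unchanged.
    mediump vec3 outc = (c * u_brightness) * 0.5;

    gl_FragColor = vec4(outc, 1.0);
}
```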
In terms of applicability across vendors, most of what we recommend is relatively generic advice so I'd hope it helps across the board (or at least doesn't make things worse). There are definitely code generation issues that end up driver-version specific, or hardware-version specific, so I can't promise that everything in the offline compiler reports will be perfectly portable though ;)
Cheers, Pete