I recently started using Mali Offline Compiler and got great results with it: we improved performance in the game we develop quite significantly. The problem I am still struggling with is that I don't have a good understanding of which code changes will actually help and which won't.

I know some general rules, e.g. use halfs wherever you can, recognize MADD, avoid division, and so on. But in a lot of cases where I see an optimization opportunity, I modify the code, compile it with malioc, and get the same result (or sometimes an even worse one). In a way I try semi-random things and just check with the profiler whether they help. This approach works to some degree (in the end I get faster shaders after some tries), but I really want to dig deeper and understand what is going on.

I am also preparing a talk about Mali Offline Compiler for other devs in the company I work for, so I want to share verified information and give people real answers (not "try these N things and hope some of them help").
What doesn't help is that Unity also has a transpiler from HLSL to GLSL that applies some optimizations of its own (it ignores IEEE 754 and freely reorders operations), so it's a lot harder to understand what's going on. At least the Unity output (the final GLSL) is visible. What optimizations Mali Offline Compiler applies, I have no idea: it doesn't provide any assembly code, only numbers.

So my current approach, as I said, is the following: take the results for the earliest supported GPU (Mali-T720) and optimize shaders for it. For now I ignore the differences between vector (Midgard) and scalar GPUs, and I assume that if shaders are fast enough for Midgard, they will be fast enough for newer GPUs too :) In theory I can check the output of the high-quality shader variants for newer (scalar) GPUs, but currently I am not doing it.

So my questions are:

1. When I recognize a place for optimization, apply it, and get exactly the same result from the compiler, does it mean this optimization is totally useless? What about other compiler versions, other GPUs, GPUs from different brands, etc.? What I am doing now is leaving this "optimization" (which doesn't do anything according to malioc) in the code if the code doesn't become more complex. The reasoning is that maybe some other compilers (older versions or different chips) won't recognize this optimization automatically, so I am doing it manually. Is this an OK approach or not?

2. Can you recommend a better way to understand what's happening? How do you usually do it? My only guess (I haven't tried it yet) is to use some other shader compiler (for example one for desktop GPUs), get assembly from it, and try to use that to understand what's going on. But this will only work for some simple cases and only if the ISA is very similar. I am not experienced in low-level stuff, so I assume a lot here and have no idea if this is a viable approach.

3. Can you recommend any resources about shader optimization? For example: how to optimize for vector GPUs, how to optimize for scalar GPUs, common optimizations and tricks, and guides about how everything works. So any documentation, videos, or books which might help me understand this topic better, i.e. how to write the fastest theoretically possible shaders. I have a basic understanding of x86/x64 assembly (for CPU), but I don't mind if these guides are more advanced; I am willing to learn the prerequisites if needed.

P.S. Peter Harris, I really hope you'll share your knowledge, but everyone else is welcome too :)
Wow, didn't expect that quick of an answer :) Thank you! :) A few things I want to ask.

1. So, am I correct that vector GPUs also have scalar ALUs? I am not familiar with this concept of sub-operations and bundles. Can you please tell me a little more or point me to where I can read about it? So you're saying that in 1 cycle an arithmetic unit does 2 scalar and 3 vector ops? What does that mean? Does it mean that, let's say, in 1 cycle I will be able to do 2 scalar multiplications and 3 vec4 multiplications (so 5 operations in total)?

2. I am a little bit confused by non-whole numbers, e.g. 1.5 cycles, in the Mali compiler output. What does it mean exactly? I get the idea that a lower number is better than a higher number, but I am still puzzled about what 0.5 cycles actually means.

3. When you say that I won't be able to see any changes when I compile for Midgard but I would be able to see them when compiling for Bifrost: am I correct that if I see the result for Bifrost get better, I should expect the result for Midgard to be better too (i.e. malioc confuses me), or does it mean it will be faster on Bifrost and the same on Midgard?
Mikhail Golub said: 1. So, am I correct that vector GPUs also have scalar ALUs?
For Midgard this is a good overview: https://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/5
Mikhail Golub said: So you're saying that in 1 cycle arithmetic unit does 2 scalar and 3 vector ops?
Yes, assuming the compiler can fill all 5 sub-ops with useful work (it often can't, so some sub-ops are unused that cycle).
Mikhail Golub said: 2. I am a little bit confused by non whole numbers i.e. 1.5 cycles in Mali compiler output. What does it mean exactly?
Most Mali GPUs have multiple arithmetic units - e.g. Mali-T860 has two arithmetic units per core, and Mali-T880 has three. The cycle costs in the offline compiler are the throughput per core, so normalized for pipeline count. A shader with 3 arithmetic instructions would be 1.5 cycles on Mali-T860 and 1 cycle on Mali-T880.
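To make the arithmetic concrete, here is a minimal sketch of that normalization. The function name and the lookup table are made up for illustration (malioc does not expose anything like this); the unit counts are the ones quoted above.

```python
from fractions import Fraction

# Arithmetic units per shader core, as quoted in the answer above.
ARITH_UNITS_PER_CORE = {"Mali-T860": 2, "Mali-T880": 3}

def normalized_arith_cycles(instruction_count, gpu):
    """Hypothetical helper: per-core throughput cost, i.e. the number of
    arithmetic instructions divided by the number of arithmetic units."""
    return Fraction(instruction_count, ARITH_UNITS_PER_CORE[gpu])

# A shader with 3 arithmetic instructions:
print(float(normalized_arith_cycles(3, "Mali-T860")))  # 1.5
print(float(normalized_arith_cycles(3, "Mali-T880")))  # 1.0
```

So a fractional value such as 1.5 is an average issue cost per core, not a claim that any single instruction takes half a cycle.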
Mikhail Golub said: 3. When you're saying that I won't be able to see any changes when I compile for Midgard but I would be able to see when it's compiled for Bifrost: am I correct that if I see that result for Bifrost becomes better - I should expect result for Midgard to be better too (i.e. malioc confuses me) or does it mean it will be faster on Bifrost and the same on Midgard?
The latter. Removing one scalar op will go faster on Bifrost/Valhall, but will only help on Midgard if a whole instruction bundle becomes empty (i.e. if it means that none of the 5 sub-ops are used).
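A toy model of that difference, as a sketch only. It assumes each Midgard bundle is described just by how many of its 5 sub-op slots are in use, that the compiler does not repack sub-ops after an edit, and that the scalar architectures pay roughly per individual op (it ignores that a vec4 op expands to several scalar ops there). All names are hypothetical.

```python
def midgard_cost(bundles):
    """Bundle-based cost: one unit per bundle with at least one sub-op in use."""
    return sum(1 for used_sub_ops in bundles if used_sub_ops > 0)

def scalar_cost(bundles):
    """Simplified scalar-style cost: every individual op counts."""
    return sum(bundles)

before = [5, 3, 1]   # sub-ops in use in each of three bundles
trimmed = [5, 2, 1]  # one op removed, but its bundle is still non-empty
emptied = [5, 3, 0]  # one op removed, and its bundle is now empty

print(midgard_cost(before), scalar_cost(before))    # 3 9
print(midgard_cost(trimmed), scalar_cost(trimmed))  # 3 8
print(midgard_cost(emptied), scalar_cost(emptied))  # 2 8
```

In this model both edits lower the scalar-style cost, but only the edit that empties a bundle lowers the bundle count, which matches the "faster on Bifrost, same on Midgard" outcome described above.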