
Why does reordering uniforms affect arithmetic cycles?

Hello.


I've recently started using the Mali Offline Compiler to get insight into our shaders, and I'm getting confusing results from it that I can't explain.

I have one quite big shader.

It has a block of uniforms, quite a large one because it's an uber shader.
I noticed that if I reorder the uniforms, I get different results from the Mali compiler.

#if HLSLCC_ENABLE_UNIFORM_BUFFERS
UNITY_BINDING(0) uniform UnityPerMaterial {
#endif
	UNITY_UNIFORM vec4 _MainTex_ST;
	UNITY_UNIFORM float _MainTexUVSet2;
	UNITY_UNIFORM vec4 _SecondaryTex_ST;
	UNITY_UNIFORM mediump vec4 _SecondaryColor;
	UNITY_UNIFORM float _SecondaryTexUVSet2;
	UNITY_UNIFORM vec4 _MaskTex_ST;
	UNITY_UNIFORM float _MaskTexUVSet2;
	UNITY_UNIFORM vec4 _DissolveTex_ST;
	UNITY_UNIFORM float _DissolveTexUVSet2;
	UNITY_UNIFORM mediump vec3 _MainColorBright;
	UNITY_UNIFORM mediump vec3 _MainColorMid;
	UNITY_UNIFORM mediump vec3 _MainColorDark;
	UNITY_UNIFORM mediump vec4 _MainColor;
	UNITY_UNIFORM vec2 _MainTexScrollSpeed;
	UNITY_UNIFORM vec2 _SecondaryTexScrollSpeed;
	UNITY_UNIFORM vec2 _DissolveTexScrollSpeed;
	UNITY_UNIFORM mediump float _Intensity;
	UNITY_UNIFORM mediump float _PSDriven;
	UNITY_UNIFORM mediump float _DissolveAmount;
	UNITY_UNIFORM mediump float _DissolveSoftness;
	UNITY_UNIFORM int _ScrollMainTex;
	UNITY_UNIFORM int _ScrollSecondaryTex;
	UNITY_UNIFORM int _ScrollDissolveTex;
	UNITY_UNIFORM int _MultiplyWithVertexColor;
	UNITY_UNIFORM int _MultiplyWithVertexAlpha;
	UNITY_UNIFORM int _UseGradientMap;
	UNITY_UNIFORM int _UseStepMasking;
	UNITY_UNIFORM float _Curvature;
	UNITY_UNIFORM mediump float _StepBorder;
	UNITY_UNIFORM mediump float _UseRForSecondaryTex;
	UNITY_UNIFORM mediump float _UseRForMask;
	UNITY_UNIFORM mediump float _MaskSecondTexWithFirst;
	UNITY_UNIFORM mediump float _UseRAsAlpha;
#if HLSLCC_ENABLE_UNIFORM_BUFFERS
};
#endif
 

So if I take, let's say, the _Curvature uniform and move it so it's before any other half/int variable, the results change.
Here are the results from the fragment shader:

Mali Offline Compiler v7.4.0 (Build 330167)
Copyright 2007-2021 Arm Limited, all rights reserved

Configuration
=============

Hardware: Mali-T720 r1p1
Architecture: Midgard
Driver: r23p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 4
Uniform registers: 0
Stack spilling: false

                                A      LS       T    Bound
Total instruction cycles:   16.00    9.00    4.00        A
Shortest path cycles:       10.00    9.00    3.00        A
Longest path cycles:        10.25    9.00    3.00        A

A = Arithmetic, LS = Load/Store, T = Texture

And then they become:

Mali Offline Compiler v7.4.0 (Build 330167)
Copyright 2007-2021 Arm Limited, all rights reserved

Configuration
=============

Hardware: Mali-T720 r1p1
Architecture: Midgard
Driver: r23p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 4
Uniform registers: 0
Stack spilling: false

                                A      LS       T    Bound
Total instruction cycles:   16.00    9.00    4.00        A
Shortest path cycles:        9.50    9.00    3.00        A
Longest path cycles:         9.75    9.00    3.00        A

A = Arithmetic, LS = Load/Store, T = Texture


This uniform is only used in the vertex shader, but somehow it also affects the fragment shader results.

Why are the arithmetic cycles now different?

Right now I have no idea what affects it, how to optimize for it in the best possible way, or whether I should even bother.
But when a shader executes in, let's say, 10 cycles and reordering fields can make it execute in 9 or even 8 cycles, that is 10-20% of performance to be gained, so I would like to understand what's going on under the hood.

Is there a way to get disassembly from the Mali compiler?
Right now it is a black box to me.

I am attaching both shaders and the output from the Mali compiler in case someone wants to take a look.

mali.zip

  • Midgard is a vector architecture with 128-bit vector registers and SIMD instructions, rather than the scalar operations of more modern GPUs. The compiler's ability to auto-vectorize is sensitive to the ordering of values in registers: if variables don't "align" in the same SIMD lanes, then the compiler either has to run operations multiple times or swizzle registers at runtime, which isn't always free.

    Later Midgard GPUs don't have this problem as the uniform loads are converted into uniform register access, which can hide alignment issues and repack vectors, so you hit the same performance for both shaders. For Mali-T720 you will have to deal with how things map into vectors - sorry.
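
    As a rough illustration of the lane-alignment point, here is a minimal GLSL ES 3.00 fragment shader sketch. It assumes the two vec2 scroll speeds from the uniform block above are repacked by hand into one vec4, so that values consumed together already sit in matching lanes. The names _ScrollSpeedMainSecondary, _ScrollEnable and _ElapsedTime are hypothetical, as is the shader body itself; it is not the shader from the attachment, and whether it saves cycles on a given core is something to check with the Mali Offline Compiler.

    ```glsl
    #version 300 es
    precision highp float;

    // Hypothetical packed uniform replacing two separate vec2 uniforms:
    // xy = _MainTexScrollSpeed, zw = _SecondaryTexScrollSpeed
    uniform vec4 _ScrollSpeedMainSecondary;
    // Hypothetical toggles stored as 0.0 or 1.0 instead of int:
    // x = _ScrollMainTex, y = _ScrollSecondaryTex
    uniform vec2 _ScrollEnable;
    // Hypothetical elapsed time in seconds
    uniform float _ElapsedTime;

    uniform mediump sampler2D _MainTex;
    uniform mediump sampler2D _SecondaryTex;

    in vec4 vs_TEXCOORD0;          // xy = main UV, zw = secondary UV
    out mediump vec4 SV_Target0;

    void main() {
        // Both UV sets are scrolled with full-width vec4 arithmetic,
        // because each operand already has its data in the matching lanes.
        vec4 uv = vs_TEXCOORD0
                + _ScrollSpeedMainSecondary * (_ScrollEnable.xxyy * _ElapsedTime);

        mediump vec4 mainCol = texture(_MainTex, uv.xy);
        mediump vec4 secondaryCol = texture(_SecondaryTex, uv.zw);
        SV_Target0 = mainCol * secondaryCol;
    }
    ```

    With separate vec2 and int uniforms, the same maths would be written as two independent vec2 expressions, and it would be up to the compiler to merge them. Packing by hand removes that dependency on declaration order, at the cost of less readable material properties.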

    Cheers, 
    Pete

  • Thank you very much for the quick answer.

    Do you have any recommendations on how to understand this better, i.e. how do I write code in a way that helps the compiler vectorize?
    Do you maybe have a link to a guide?

    Right now I am thinking of not bothering with it, especially since you said it's not a problem on later Midgard GPUs. What's your recommendation?

    And a slightly unrelated question.

    We're making a mobile game and we want good performance on the widest possible range of devices. It's both Android and iOS, and not just Mali devices but other GPUs too.

    Mali has the best developer tools, so thanks for that :) and that's why I am mostly using Streamline and the Mali Offline Compiler to optimize things.

    My current strategy is to optimize shaders for the oldest GPU supported by the Mali Offline Compiler, which is the T720, and then hope that all other devices will do better than this one.

    I also assume that devices from other manufacturers with a similar vector architecture will probably benefit from exactly the same optimizations.

    Is this a valid strategy?

    My fear is that I over-optimize for one device and it won't really help with the others, so I end up wasting my time.
    So far the results are good, i.e. the Mali Offline Compiler has really helped me increase the performance of our game.

  • I'm glad you're finding the tools useful =)

    For shader optimization, if you want to target the entry-level lowest common denominator, I think there are really three major classes of interesting device in terms of giving different results:

    • Mali-T720 (SIMD, but without the uniform constant register optimization later GPUs have).
    • Mali-T820 (SIMD, but with the uniform constant register optimization).
    • Mali-G52 (Scalar instruction set).

    There were a lot of Mali-T720-based devices sold, but it's an old product now (first released 9 years ago) so I'd agree with your position that it's not worth worrying too much about. 

    Mali-T820 is Midgard (SIMD) which is a few years newer than Mali-T720, but still relatively old (first released 7 years ago). There are still a lot of Midgard devices kicking around, so it's probably still worth checking but I wouldn't totally rewrite your shaders for it, especially if those changes are detrimental to Mali-G52. 

    All (?) modern GPUs use scalar warp instruction sets (including both Mali and GPUs from other vendors), so the Mali-G52 results should be more indicative of what you will see on any hardware released in the last 5 years. (Mali-G31 is a more restrictive target, but it is mostly found in embedded devices, so I wouldn't worry about that one unless you know you have users using it.)

    Cheers, 
    Pete