Why does reordering uniforms affect arithmetic cycles?

Hello.


I recently started using Mali Offline Compiler to get insight into our shaders, and I'm getting confusing results that I can't explain.

I have one quite large shader.

It has a large block of uniforms because it's an uber shader.
I noticed that if I reorder the uniforms, I get different results from the Mali compiler.

#if HLSLCC_ENABLE_UNIFORM_BUFFERS
UNITY_BINDING(0) uniform UnityPerMaterial {
#endif
	UNITY_UNIFORM vec4 _MainTex_ST;
	UNITY_UNIFORM float _MainTexUVSet2;
	UNITY_UNIFORM vec4 _SecondaryTex_ST;
	UNITY_UNIFORM mediump vec4 _SecondaryColor;
	UNITY_UNIFORM float _SecondaryTexUVSet2;
	UNITY_UNIFORM vec4 _MaskTex_ST;
	UNITY_UNIFORM float _MaskTexUVSet2;
	UNITY_UNIFORM vec4 _DissolveTex_ST;
	UNITY_UNIFORM float _DissolveTexUVSet2;
	UNITY_UNIFORM mediump vec3 _MainColorBright;
	UNITY_UNIFORM mediump vec3 _MainColorMid;
	UNITY_UNIFORM mediump vec3 _MainColorDark;
	UNITY_UNIFORM mediump vec4 _MainColor;
	UNITY_UNIFORM vec2 _MainTexScrollSpeed;
	UNITY_UNIFORM vec2 _SecondaryTexScrollSpeed;
	UNITY_UNIFORM vec2 _DissolveTexScrollSpeed;
	UNITY_UNIFORM mediump float _Intensity;
	UNITY_UNIFORM mediump float _PSDriven;
	UNITY_UNIFORM mediump float _DissolveAmount;
	UNITY_UNIFORM mediump float _DissolveSoftness;
	UNITY_UNIFORM int _ScrollMainTex;
	UNITY_UNIFORM int _ScrollSecondaryTex;
	UNITY_UNIFORM int _ScrollDissolveTex;
	UNITY_UNIFORM int _MultiplyWithVertexColor;
	UNITY_UNIFORM int _MultiplyWithVertexAlpha;
	UNITY_UNIFORM int _UseGradientMap;
	UNITY_UNIFORM int _UseStepMasking;
	UNITY_UNIFORM float _Curvature;
	UNITY_UNIFORM mediump float _StepBorder;
	UNITY_UNIFORM mediump float _UseRForSecondaryTex;
	UNITY_UNIFORM mediump float _UseRForMask;
	UNITY_UNIFORM mediump float _MaskSecondTexWithFirst;
	UNITY_UNIFORM mediump float _UseRAsAlpha;
#if HLSLCC_ENABLE_UNIFORM_BUFFERS
};
#endif

Say I take the _Curvature uniform and move it so that it comes before every other half/int variable.
Here are the fragment shader results:

Mali Offline Compiler v7.4.0 (Build 330167)
Copyright 2007-2021 Arm Limited, all rights reserved

Configuration
=============

Hardware: Mali-T720 r1p1
Architecture: Midgard
Driver: r23p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 4
Uniform registers: 0
Stack spilling: false

                                A      LS       T    Bound
Total instruction cycles:   16.00    9.00    4.00        A
Shortest path cycles:       10.00    9.00    3.00        A
Longest path cycles:        10.25    9.00    3.00        A

A = Arithmetic, LS = Load/Store, T = Texture

And then they become

Mali Offline Compiler v7.4.0 (Build 330167)
Copyright 2007-2021 Arm Limited, all rights reserved

Configuration
=============

Hardware: Mali-T720 r1p1
Architecture: Midgard
Driver: r23p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 4
Uniform registers: 0
Stack spilling: false

                                A      LS       T    Bound
Total instruction cycles:   16.00    9.00    4.00        A
Shortest path cycles:        9.50    9.00    3.00        A
Longest path cycles:         9.75    9.00    3.00        A

A = Arithmetic, LS = Load/Store, T = Texture


This uniform is only used in the vertex shader, but somehow it also affects the fragment shader results.

Why are the arithmetic cycles different now?

Right now I have no idea what affects this, how best to optimize for it, or whether I should even bother.
But when a shader executes in, say, 10 cycles and reordering fields can make it execute in 9 or even 8, that is 10-20% of performance to be gained, so I would like to understand what's going on under the hood.

Is there a way to get a disassembly from the Mali compiler?
Right now it is a black box to me.

I am attaching both shaders and the output from the Mali compiler in case someone wants to take a look.

mali.zip

  • Midgard is a vector architecture with 128-bit vector registers and SIMD instructions, not more modern scalar operations. The ability of the compiler to auto-vectorize is sensitive to the ordering of values in registers - if variables don't "align" in the same SIMD lanes then the compiler either has to run operations multiple times or swizzle registers at runtime, which isn't always free.

    Later Midgard GPUs don't have this problem as the uniform loads are converted into uniform register access, which can hide alignment issues and repack vectors, so you hit the same performance for both shaders. For Mali-T720 you will have to deal with how things map into vectors - sorry.

    Cheers, 
    Pete
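
To make the lane-alignment point above a bit more concrete, here is a rough sketch in Python that computes which 32-bit lane of a 128-bit vector each uniform would start in. It assumes the block is laid out with std140 rules and that the members shown start at offset 0; both are assumptions for illustration, not something confirmed in this thread. The helper name `layout` and the four-member subset are hypothetical. Precision qualifiers such as mediump are ignored because they do not change std140 offsets.

```python
# std140 base alignment and size, in bytes, for the types used in the block.
# Assumption: the uniform block follows std140 rules.
STD140 = {
    "float": (4, 4),
    "int": (4, 4),
    "vec2": (8, 8),
    "vec3": (16, 12),
    "vec4": (16, 16),
}


def layout(members):
    """Return {name: (byte_offset, starting_lane)} for (type, name) pairs.

    starting_lane is the 32-bit lane (0-3) inside a 16-byte vector.
    """
    offset = 0
    result = {}
    for type_name, name in members:
        align, size = STD140[type_name]
        # Round the running offset up to the member's base alignment.
        offset = (offset + align - 1) // align * align
        result[name] = (offset, (offset % 16) // 4)
        offset += size
    return result


# Hypothetical subset of the block, assumed to start at offset 0.
ORIGINAL = [
    ("int", "_UseGradientMap"),
    ("int", "_UseStepMasking"),
    ("float", "_Curvature"),
    ("float", "_StepBorder"),
]

# Same members with _Curvature moved to the front.
REORDERED = [
    ("float", "_Curvature"),
    ("int", "_UseGradientMap"),
    ("int", "_UseStepMasking"),
    ("float", "_StepBorder"),
]

if __name__ == "__main__":
    for label, members in (("original", ORIGINAL), ("reordered", REORDERED)):
        print(label)
        for name, (offset, lane) in layout(members).items():
            print(f"  {name:<18} offset {offset:>2}  lane {lane}")
```

In this sketch, moving _Curvature shifts every scalar it jumps over by one lane, even though those values are never touched by the vertex-only uniform itself. Whether such a shift actually costs or saves cycles depends on how the compiler vectorizes the fragment shader, which this calculation does not model.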
