Why does reordering uniforms affect arithmetic cycles?

Hello.


I've recently started using Mali Offline Compiler to get insight into our shaders, and I'm getting confusing results that I can't explain.

I have one fairly large shader.

It has a block of uniforms, quite a large one because it's an uber shader.
I noticed that if I reorder the uniforms, I get different results from the Mali compiler.

#if HLSLCC_ENABLE_UNIFORM_BUFFERS
UNITY_BINDING(0) uniform UnityPerMaterial {
#endif
	UNITY_UNIFORM vec4 _MainTex_ST;
	UNITY_UNIFORM float _MainTexUVSet2;
	UNITY_UNIFORM vec4 _SecondaryTex_ST;
	UNITY_UNIFORM mediump vec4 _SecondaryColor;
	UNITY_UNIFORM float _SecondaryTexUVSet2;
	UNITY_UNIFORM vec4 _MaskTex_ST;
	UNITY_UNIFORM float _MaskTexUVSet2;
	UNITY_UNIFORM vec4 _DissolveTex_ST;
	UNITY_UNIFORM float _DissolveTexUVSet2;
	UNITY_UNIFORM mediump vec3 _MainColorBright;
	UNITY_UNIFORM mediump vec3 _MainColorMid;
	UNITY_UNIFORM mediump vec3 _MainColorDark;
	UNITY_UNIFORM mediump vec4 _MainColor;
	UNITY_UNIFORM vec2 _MainTexScrollSpeed;
	UNITY_UNIFORM vec2 _SecondaryTexScrollSpeed;
	UNITY_UNIFORM vec2 _DissolveTexScrollSpeed;
	UNITY_UNIFORM mediump float _Intensity;
	UNITY_UNIFORM mediump float _PSDriven;
	UNITY_UNIFORM mediump float _DissolveAmount;
	UNITY_UNIFORM mediump float _DissolveSoftness;
	UNITY_UNIFORM int _ScrollMainTex;
	UNITY_UNIFORM int _ScrollSecondaryTex;
	UNITY_UNIFORM int _ScrollDissolveTex;
	UNITY_UNIFORM int _MultiplyWithVertexColor;
	UNITY_UNIFORM int _MultiplyWithVertexAlpha;
	UNITY_UNIFORM int _UseGradientMap;
	UNITY_UNIFORM int _UseStepMasking;
	UNITY_UNIFORM float _Curvature;
	UNITY_UNIFORM mediump float _StepBorder;
	UNITY_UNIFORM mediump float _UseRForSecondaryTex;
	UNITY_UNIFORM mediump float _UseRForMask;
	UNITY_UNIFORM mediump float _MaskSecondTexWithFirst;
	UNITY_UNIFORM mediump float _UseRAsAlpha;
#if HLSLCC_ENABLE_UNIFORM_BUFFERS
};
#endif

Say I take the _Curvature uniform and move it so that it comes before any other half/int variable.
Here are the results for the fragment shader:

Mali Offline Compiler v7.4.0 (Build 330167)
Copyright 2007-2021 Arm Limited, all rights reserved

Configuration
=============

Hardware: Mali-T720 r1p1
Architecture: Midgard
Driver: r23p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 4
Uniform registers: 0
Stack spilling: false

                                A      LS       T    Bound
Total instruction cycles:   16.00    9.00    4.00        A
Shortest path cycles:       10.00    9.00    3.00        A
Longest path cycles:        10.25    9.00    3.00        A

A = Arithmetic, LS = Load/Store, T = Texture

And then they become:

Mali Offline Compiler v7.4.0 (Build 330167)
Copyright 2007-2021 Arm Limited, all rights reserved

Configuration
=============

Hardware: Mali-T720 r1p1
Architecture: Midgard
Driver: r23p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 4
Uniform registers: 0
Stack spilling: false

                                A      LS       T    Bound
Total instruction cycles:   16.00    9.00    4.00        A
Shortest path cycles:        9.50    9.00    3.00        A
Longest path cycles:         9.75    9.00    3.00        A

A = Arithmetic, LS = Load/Store, T = Texture


This uniform is only used in the vertex shader, but somehow it also affects the fragment shader results.

Why are the arithmetic cycles different now?

Right now I have no idea what affects this, how best to optimize for it, or whether I should even bother.
But when a shader executes in, let's say, 10 cycles, and reordering fields can make it execute in 9 or even 8 cycles, that is 10-20% of performance to be gained, so I would like to understand what's going on under the hood.
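
To put numbers on it, here is a quick sketch that works out the relative difference between the two reports above. The cycle figures are copied from the compiler output; the helper name `relative_change` is made up for illustration and is not part of any Mali tooling.

```python
# Arithmetic-cycle figures copied from the two Mali Offline Compiler reports.
# `relative_change` is a hypothetical helper, not part of any Mali tool.

def relative_change(first: float, second: float) -> float:
    """Return the percentage reduction going from `first` to `second`."""
    return (first - second) / first * 100.0

# (first report, second report) arithmetic cycles per path
reports = {
    "shortest path": (10.00, 9.50),
    "longest path": (10.25, 9.75),
}

changes = {name: relative_change(a, b) for name, (a, b) in reports.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:.1f}% fewer arithmetic cycles")
```

For these two reports the difference comes out at roughly 5% on both paths, while the total instruction cycles stay at 16.00. The 10-20% figure corresponds to the hypothetical drop from 10 cycles to 9 or 8.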

Is there a way to get a disassembly from the Mali compiler?
Right now it is a black box to me.

I am attaching both shaders and the output from the Mali compiler in case someone wants to take a look.

mali.zip