Constant buffers and varyings size and layout optimization

I'm trying to get a deeper understanding of shader optimization and have some basic questions about constant and varying buffers.

I read this thread, and also parts of the Mali optimization guide about uniforms, but I still have some questions.

So for context: we use Unity with SRP batching enabled (which means the uniforms for multiple draw calls get grouped into one big buffer, containing each draw call's uniforms at some offset).
I am optimizing one shader which has a huge buffer of uniforms, e.g.

    half4 _MainTex_ST;
    half4 _SecondaryTex_ST;
    half4 _MaskTex_ST;
    int _MainTexUVSet2;
    int _SecondaryTexUVSet2;
    int _MaskTexUVSet2;
    half4 _MainColor;
    half4 _SecondaryColor;         
    half4 _MainColorBright;
    ... // rest

It's currently 324 bytes long according to the Unity compiler output.
This buffer is used by an ubershader (we use conditional compilation to enable/disable features), which is why it's so big.

As I understand it, in a constant buffer:
- all scalar GLSL types (half, float, int) take 4 bytes and are 4-byte aligned
- 2-component vectors are 8-byte aligned
- 3- and 4-component vectors are 16-byte aligned
1. Is the above correct on all Mali hardware and all graphics APIs (i.e. OpenGL ES and Vulkan; it should not be API-dependent, right)?

What I read in the guide is that uniforms are promoted to registers, so access is essentially "free", but you need to keep the size under 128 bytes.

2. Does that mean the combined size of the used fields must stay under 128 bytes?
3. Does it matter if some fields are unused but the buffer itself is big? Should I optimize the buffer size, i.e. apply conditional compilation there as well? My current assumption is that it doesn't matter on the GPU side, but it might help Unity with SRP batching (since Unity needs to upload multiple such buffers to the GPU, so it would upload less data).
4. What's the general strategy for optimizing such buffers, and should I even care? E.g. should I pack multiple things together and remove as much padding as possible, or think about how to group fields so the code vectorizes better? How do you approach this (if it makes sense to approach at all)? I tried different approaches and got inconclusive results so far: I can't really predict how performance will be affected, but I do see that the Mali compiler output changes, and sometimes in opposite directions depending on the architecture (i.e. Midgard gets worse while Bifrost gets better, or vice versa). Is it too much effort for too little benefit?
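
For example, one packing change I considered (the field name is made up): replacing the three scalar UV-set flags with a single vector, so the padding before the following half4 becomes explicit and the flags can be used in vector form:

    // instead of:
    //   int _MainTexUVSet2; int _SecondaryTexUVSet2; int _MaskTexUVSet2;
    int4 _UVSetFlags; // x = main, y = secondary, z = mask, w = unused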

and one more thing:

Sometimes I use a uniform as a kind of toggle, which is later used in a lerp or an if branch to either do or not do something (where I don't want to introduce a new #ifdef).
I can have this toggle as either a float/half or an int:
half _Toggle1;
int _Toggle2;

and later somewhere

OUT.result = lerp(_Color1, _Color2, _Toggle1);

if (_Toggle2) {
  OUT.result = _Color2;
} else {
  OUT.result = _Color1;
}
I noticed that the output from the Mali offline compiler does change when I experiment with the types, and I'm not sure why yet. What should I use for such toggles? Does it depend on how the toggle is used later, or are the fluctuations I see just a result of whatever vectorization and other optimizations the compiler does, making it hard to predict what's better in a specific case?

That wraps up my questions about uniforms (for now :))

And I have similar questions about varyings.

One thing: when I write a shader in Unity in HLSL, the varyings look like a struct:
struct Varyings
{
    float4 position : SV_POSITION;

    float2 mainTexCoord : TEXCOORD0;
    float2 secondTexCoord : TEXCOORD1;
    half4 customData : TEXCOORD2; // x - intensity, y - dissolveAmount, z - step masking (ps)
    #if defined(_USEDISSOLVE_ON)
        float2 dissolveTexCoord : TEXCOORD6; // dissolve uvs
    #endif
};

But in GLSL it's specified in a different way, i.e. field by field (no struct anymore):
varying vec4 v_color;
varying vec4 v_color2;

1. I assume that in varyings halfs are actually 2 bytes, not 4 (unlike in constant buffers). Is that correct? And float and int are 4 bytes?
2. Can you please explain whether I should care about padding and packing in varyings, or not? Is the struct layout an HLSL concept that isn't related to the hardware? Is it just a matter of using halfs whenever possible, and maybe sometimes packing related things together if that allows writing the code in a vector-friendly form?
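
For example, is this kind of manual packing worth doing? (Merging the two UV varyings from the struct above into one float4; the attribute names IN.uv0/IN.uv1 are made up, TRANSFORM_TEX is the standard Unity macro.)

    // instead of:
    //   float2 mainTexCoord   : TEXCOORD0;
    //   float2 secondTexCoord : TEXCOORD1;
    float4 texCoords : TEXCOORD0; // xy = main UVs, zw = secondary UVs

    // vertex shader:
    OUT.texCoords.xy = TRANSFORM_TEX(IN.uv0, _MainTex);
    OUT.texCoords.zw = TRANSFORM_TEX(IN.uv1, _SecondaryTex);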

3. Am I correct that a smaller size helps in two ways: fewer load/store (LS) instructions in the fragment shader, and, after the vertex shader executes and its result is written back to main memory (because these are tile-based GPUs), less data to write and therefore more bandwidth left for other things?

I know I asked a lot of questions this time; hopefully this will be useful for other people visiting this forum as well.