
Constant buffers and varyings size and layout optimization

I am trying to get a deeper understanding of shader optimization and have some basic questions about constant buffers and varyings.

I read this thread (https://community.arm.com/support-forums/f/graphics-gaming-and-vr-forum/53873/will-data-precision-affect-uniform-block-layout) and also the parts of the Mali optimization guide about uniforms, but I still have some questions:

For context: we use Unity with SRP batching enabled (https://docs.unity3d.com/Manual/SRPBatcher.html), which means the uniforms for multiple draw calls get grouped into one big buffer, with each draw call's uniforms at some offset.
I am optimizing one shader which has a huge buffer of uniforms, e.g.

CBUFFER_START(UnityPerMaterial)
    half4 _MainTex_ST;
    half4 _SecondaryTex_ST;
    half4 _MaskTex_ST;
    int _MainTexUVSet2;
    int _SecondaryTexUVSet2;
    int _MaskTexUVSet2;
    half4 _MainColor;
    half4 _SecondaryColor;         
    half4 _MainColorBright;
    ... // rest
CBUFFER_END

It's currently 324 bytes according to the Unity compiler output.
This buffer is used with an ubershader (we use conditional compilation to enable/disable some features), which is why it's so big.

  • All GLSL data types are 4 bytes in size
  • scalar types (half, float, int) are 4-byte aligned
  • vectors of size 2 are 8-byte aligned
  • vectors of size 3 or 4 are 16-byte aligned

1. Is the above correct on all Mali hardware and graphics APIs (i.e. OpenGL ES and Vulkan; it should not be API dependent, right)?

What I read in the guide: uniforms are promoted to registers and are essentially "free", but you need to watch that the size stays under 128 bytes.

2. Does that mean the combined size of the used fields must be under 128 bytes?
3. Does it matter if some fields are unused but the buffer is big? Should I optimize the buffer size, i.e. use conditional compilation there as well? My current assumption: it doesn't matter on the GPU side, but it might help Unity with SRP batching (since it basically needs to upload multiple such buffers to the GPU, so it would have less to upload).
4. What's the general strategy for optimizing such buffers? Should I even care, or is it pointless? Should I pack multiple things together and remove as much padding as possible, or maybe think about how to group things so the code vectorizes better? How do you approach this (if it makes sense to approach at all)? I tried different approaches and got inconclusive results so far, i.e. I can't really predict how performance will be affected, but I do see that the Mali compiler output changes, sometimes in opposite ways depending on the architecture (e.g. Midgard gets worse while Bifrost gets better, or vice versa). Is it too much effort for too little benefit?

and one more thing:

Sometimes I use a uniform as a kind of toggle which is later used in a lerp or an if branch to either do or not do something (where I don't want to introduce a new #ifdef).
I can have this toggle as either a float/half or an int, e.g.
half _Toggle1;
int _Toggle2;

and later somewhere

OUT.result = lerp(_Color1, _Color2, _Toggle1);

or 

if (_Toggle2) {
  OUT.result = _Color2;
} else {
  OUT.result = _Color1;
}


I noticed that the output from the Mali offline compiler does change when I experiment with types, and I am not sure why yet. What should I use for such toggles? Does it depend on how the toggle is used later, or are the fluctuations I see just from whatever vectorization/other optimizations the compiler does, making it hard to predict what's better in a specific case?

That's it for my questions about uniforms (for now :))

And I have some similar questions about varyings.

One thing to note: when I write a shader in Unity in HLSL, the varyings look like a struct:
struct Varyings 
{
    float4 position : SV_POSITION;

    float2 mainTexCoord : TEXCOORD0;
    float2 secondTexCoord : TEXCOORD1;
    
    half4 customData : TEXCOORD2; // x - intensity, y - dissolveAmount, z - step masking (ps)
    
    #if defined(_USEDISSOLVE_ON)
        float2 dissolveTexCoord : TEXCOORD6; // dissolve uvs
    #endif
};


But in GLSL they are specified in a different way, i.e. field by field (no struct anymore):
varying vec4 v_color;
varying vec4 v_color2;


1. I assume that in varyings halfs are actually 2 bytes, not 4 bytes as they are in constant buffers. Is that correct? And float and int are 4 bytes?
2. Can you please explain whether I should care about padding and packing in varyings or not? Is the struct just an HLSL concept that is not related to the hardware?
Is it just about using halfs whenever possible, and maybe sometimes packing related things together if that allows writing the code in a vector-friendly form?

3. Am I correct that a smaller size will help with two things, i.e. fewer LS instructions in the fragment shader, and also less data to write after the vertex shader executes and its result is written back to main memory (because of tile-based GPUs), leaving more bandwidth for other stuff?

I know I asked a lot of questions this time; hopefully this will be useful for other people visiting this forum as well.

    All GLSL data types are 4 bytes, ... type size ... and alignment

    It depends on the interface memory layout used. Assuming a uniform buffer with std140 layout, your data looks correct, although watch out for arrays, which have weird padding rules in std140, or for Vulkan code using GL_EXT_scalar_block_layout, which basically removes alignment and padding constraints beyond the native primitive type.
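
    For illustration, a rough GLSL sketch of how part of a block like UnityPerMaterial could end up laid out under std140 (the offsets follow the rules above; _Weights is a hypothetical array member added only to show the array padding rule):

    #version 310 es
    precision mediump float;

    // Hypothetical std140 mirror of part of the HLSL cbuffer above.
    layout(std140) uniform UnityPerMaterial
    {
        vec4  _MainColor;          // offset  0: vec4 is 16-byte aligned, 16 bytes
        int   _MainTexUVSet2;      // offset 16: scalar, 4-byte aligned
        int   _SecondaryTexUVSet2; // offset 20
        vec4  _SecondaryColor;     // offset 32: padded up from 24 to the next 16-byte boundary
        float _Weights[4];         // offset 48: hypothetical array; std140 rounds each element
                                   // up to 16 bytes, so it takes 64 bytes instead of 16
    };

    out vec4 fragColor;

    void main()
    {
        fragColor = mix(_MainColor, _SecondaryColor, _Weights[0]);
    }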

    For Vulkan there is also the option of true 16-bit types in memory via GL_EXT_shader_16bit_storage and VK_KHR_16bit_storage (core in Vulkan 1.1).
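
    A minimal Vulkan GLSL sketch of that (assuming the relevant VK_KHR_16bit_storage feature is enabled for uniform buffers; the member names are made up):

    #version 450
    #extension GL_EXT_shader_16bit_storage : require

    // Sketch only: these members take 2 bytes each in the buffer, but the
    // extension only covers loads/stores, so values are widened to 32-bit
    // for the actual maths.
    layout(set = 0, binding = 0, std140) uniform Params
    {
        f16vec4   u_mainColor;   // hypothetical names
        float16_t u_intensity;
    };

    layout(location = 0) out vec4 fragColor;

    void main()
    {
        fragColor = vec4(u_mainColor) * float(u_intensity);
    }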

    Is the above correct on all Mali hardware and graphics APIs...?

    It's hardware independent, unless you use an implementation-defined memory layout that must be queried, but different layouts exist. Hans-Kristian wrote a good blog on interface layouts here:

    What I read in the guide: uniforms are promoted to registers and are essentially "free", but you need to watch that the size stays under 128 bytes.

    The 128 byte limit is for push constants in Vulkan, but using push constants is not required for register promotion (unless on a very very early Vulkan driver). Uniform storage can hold 128 32-bit values (packing two 16-bit values together, where possible), but don't assume the application gets all of them. Mali Offline Compiler can tell you about uniform usage per shader stage.

    Does that mean the combined size of the used fields must be under 128 bytes [or the higher limit above]?

    No.

    Does it matter if some fields are unused but the buffer is big?

    In extreme cases it might. Try to keep commonly used fields together, but I doubt it will make a significant difference for normal content.

    Should I optimize the buffer size, i.e. use conditional compilation there as well?

    Unless you have evidence of a problem, I'd suggest not. Splitting pipelines on the CPU probably causes other problems you want to avoid, so there is value in not diverging.

    What's the general strategy to optimize such buffers?

    Broadly:

    • Keep frequently used values together.
    • Fold out uniform-on-uniform or uniform-on-constant computation on the CPU (e.g. don't upload M, V, and P uniform matrices and multiply them in the vertex shader; instead multiply on the CPU and upload MVP as a single uniform); see the sketch after this list.
    • Still tag variables in the interface block that can be used as 16-bit values as mediump. Even though they will still be 32 bits in memory, they can be used as 16-bit values in uniform registers/shader maths.
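
    A minimal sketch of the fold-out point above (the uniform and attribute names are made up for illustration):

    #version 310 es

    // Before: three matrix uniforms multiplied per-vertex (uniform-on-uniform work):
    //     uniform mat4 u_model, u_view, u_proj;
    //     gl_Position = u_proj * u_view * u_model * a_position;

    // After: the CPU computes proj * view * model once per draw and uploads the result.
    uniform mat4 u_mvp;                    // hypothetical precomputed MVP matrix

    layout(location = 0) in vec4 a_position;

    void main()
    {
        gl_Position = u_mvp * a_position;
    }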

    ... Midgard worse, Bifrost better or vice versa

    The Midgard instruction set and register management is ... complicated ... so I would entirely expect strange results there. Bifrost onwards should be more predictable.

    Sometimes I use a uniform as a kind of toggle which is later used in a lerp or an if branch to either do or not do something (where I don't want to introduce a new #ifdef). I can have this toggle as either a float/half or an int.

    Toggle type shouldn't matter.

    Branches are expensive on Midgard due to side-effects they have on instruction scheduling, so you may see some benefit by trying to avoid them there, although the compiler is good at converting simple branches into not-branch code sequences. Uniform branches are cheap on newer GPUs, so this isn't worth doing on Bifrost onwards.
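
    For reference, the not-branch form of your toggle example is just the mix()/lerp() you already have. A sketch, with placeholder uniform names and assuming the toggle is exactly 0.0 or 1.0:

    #version 310 es
    precision mediump float;

    uniform vec4  u_color1;   // placeholder for _Color1
    uniform vec4  u_color2;   // placeholder for _Color2
    uniform float u_toggle;   // placeholder toggle, 0.0 or 1.0

    out vec4 fragColor;

    void main()
    {
        // Branchless select: equivalent to "if (u_toggle != 0.0) color2 else color1"
        // when the toggle is exactly 0.0 or 1.0.
        fragColor = mix(u_color1, u_color2, u_toggle);
    }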

    I assume that in varyings halfs are actually 2 bytes, not 4 bytes as they are in constant buffers. Is that correct? float and int are 4 bytes?

    Yes.

    Can you please explain if I should care about padding and packing in varyings or not?

    If you have half-precision varyings, it is worth packing them in at least multiples of vec2 to get the best interpolator performance (i.e. half-precision vec2 + vec2 is faster than half-precision vec3 + float).
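
    As a sketch in GLSL terms (the names and packing are made up, but mirror the customData comment in your Varyings struct):

    #version 310 es

    // Slower pattern: a mediump vec3 plus a stray mediump float breaks vec2 packing:
    //     out mediump vec3  v_misc;
    //     out mediump float v_dissolve;

    // Faster pattern: the same four half values regrouped as two vec2s (or one vec4).
    out mediump vec2 v_miscA;   // x = intensity,  y = dissolveAmount
    out mediump vec2 v_miscB;   // x = stepMask,   y = spare

    layout(location = 0) in vec4 a_position;

    void main()
    {
        gl_Position = a_position;
        v_miscA = vec2(1.0, 0.5);   // placeholder per-vertex values
        v_miscB = vec2(0.0);
    }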

    Am I correct that a smaller varying size will help with two things, i.e. fewer LS instructions in the fragment shader, and also less data to write after the vertex shader executes and its result is written back to main memory...?

    Yes. Smaller varyings give:
    • Smaller memory bandwidth between vertex/fragment stages.
    • Faster interpolator performance on load into the fragment shader.
    • Denser register storage (can pack two 16-bit values into a 32-bit register).
    • Faster shader performance for vector maths ops (can do 32-bit per lane, which is either scalar fp32, or vec2 fp16, so vector ops are twice as fast in 16-bit).
    • Lower power (fewer bits to toggle).

    I know I asked a lot of questions this time; hopefully this will be useful for other people visiting this forum as well.

    All good questions, and you'd done a lot of reading already, so no problem =)