Varying unit (V) reported by Offline Compiler

I'd like to know what it means when a fragment shader is bound by the Varying unit (V), as in our case.

According to: https://developer.arm.com/documentation/101863/7-4/Mali-GPU-pipelines/Mali-Bifrost-architecture

The varying pipeline is a dedicated pipeline which implements the varying interpolator.

Does it mean that the shader spends more cycles interpolating varyings than on ALU operations, and that reducing the number of varyings could reduce the fragment shader cycle count?

For example:

Mali Offline Compiler v7.4.0 (Build 330167)
Copyright 2007-2021 Arm Limited, all rights reserved

Configuration
=============

Hardware: Mali-G71 r0p1
Architecture: Bifrost
Driver: r32p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 24
Uniform registers: 12
Stack spilling: false
16-bit arithmetic: 60%

                               A      LS       V       T   Bound
Total instruction cycles:   1.42    0.00    3.50    2.00       V
Shortest path cycles:       1.42    0.00    3.50    2.00       V
Longest path cycles:        1.42    0.00    3.50    2.00       V

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

  • Please see the attached shader. malioc reports V (Varying unit) = 3.00 (-c MALI-G71), but if I change the first line from:

    u_xlat16_0.xy = vs_TEXCOORD0.xy;

    to:

    u_xlat16_0.xy = texture(_BaseMap, vs_TEXCOORD0.xy).xy;

    V increases to 3.25. Why?

    Full fragment shader:

    #version 300 es
    
    uniform mediump sampler2D _BaseMap;
    precision highp float;
    precision highp int;
    in mediump vec4 vs_TEXCOORD0;
    in mediump vec4 vs_TEXCOORD2;
    in highp vec4 vs_TEXCOORD5;
    in mediump vec4 vs_TEXCOORD7;
    in mediump vec4 vs_TEXCOORD9;
    in mediump vec4 vs_TEXCOORD10;
    layout(location = 0) out mediump vec4 SV_Target0;
    mediump vec4 u_xlat16_0;
    mediump vec4 u_xlat16_1;
    void main()
    {
        u_xlat16_0.xy = vs_TEXCOORD0.xy;
        //u_xlat16_0.xy = texture(_BaseMap, vs_TEXCOORD0.xy).xy; // Replace the previous line with this
    
        u_xlat16_0.z = float(0.0);
        u_xlat16_0.w = float(0.0);
        u_xlat16_0.xyz = u_xlat16_0.xyz + vs_TEXCOORD2.xyz;
        u_xlat16_0.xyz = u_xlat16_0.xyz + vs_TEXCOORD5.xyz;
        u_xlat16_0 = u_xlat16_0 + vs_TEXCOORD7;
        u_xlat16_0 = u_xlat16_0 + vs_TEXCOORD9;
        u_xlat16_1.xyz = vs_TEXCOORD10.xyz;
        u_xlat16_1.w = 0.0;
        u_xlat16_0 = u_xlat16_0 + u_xlat16_1;
        u_xlat16_1.x = vs_TEXCOORD2.y * 0.449999988 + 0.550000012;
    #ifdef UNITY_ADRENO_ES3
        u_xlat16_1.x = min(max(u_xlat16_1.x, 0.0), 1.0);
    #else
        u_xlat16_1.x = clamp(u_xlat16_1.x, 0.0, 1.0);
    #endif
        u_xlat16_1.x = u_xlat16_1.x * 0.199999988 + 0.800000012;
        u_xlat16_1.xyz = u_xlat16_1.xxx * vec3(0.5, 0.5, 0.5);
        u_xlat16_1.w = 1.0;
        SV_Target0 = u_xlat16_0 * vec4(1.00000001e-07, 1.00000001e-07, 1.00000001e-07, 1.00000001e-07) + u_xlat16_1;
        return;
    }

  • Does it mean that the shader spends more cycles interpolating varyings than on ALU operations, and that reducing the number of varyings could reduce the fragment shader cycle count?

    Yes, this shader is varying bound.

    V increases to 3.25. Why?

    Texture coordinates nearly always need more than mediump precision to get enough sub-texel accuracy for stable filtering, so the compiler will implicitly promote the precision of varyings used in texture lookups to highp. Highp interpolation is half the speed of mediump interpolation. 
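
    As a minimal sketch of what that promotion implies for shader authors, the fragment shader below declares the texture coordinate as highp explicitly, so its interpolation cost is visible in the source rather than implicit, and keeps a non-UV varying at mediump. The names v_uv, v_lighting and fragColor are hypothetical and not taken from the attached shader; running both variants through the same malioc configuration and comparing the V column is one way to check the effect on a given target.

    ```glsl
    #version 300 es
    precision mediump float;

    uniform mediump sampler2D _BaseMap;

    // Hypothetical varying: used as a texture coordinate, so it is declared
    // highp up front instead of relying on implicit promotion.
    in highp vec2 v_uv;

    // Hypothetical varying: never used for a texture lookup, so it can
    // stay at mediump.
    in mediump vec4 v_lighting;

    layout(location = 0) out mediump vec4 fragColor;

    void main()
    {
        // Sample with the highp coordinate, then modulate at mediump.
        mediump vec4 base = texture(_BaseMap, v_uv);
        fragColor = base * v_lighting;
    }
    ```

    The matching vertex shader output for v_uv would need to be declared highp as well for this to make sense; that half is omitted here.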

  • This makes sense and surprises me at the same time. Thanks a lot, because I would never have figured it out on my own.
    - Is there any counter in Streamline that helps us detect this kind of promotion?
    - Are there any other precision surprises we should expect from the compiler?
    - Is there any way to defeat this optimization (say, when the texture is very small)?

  • Texture coordinates nearly always need more than mediump precision to get enough sub-texel accuracy for stable filtering, so the compiler will implicitly promote the precision of varyings used in texture lookups to highp. Highp interpolation is half the speed of mediump interpolation. 

    Pete, does that mean the varyings will also be stored at full precision, or does the promotion only happen on loading and subsequent interpolation?

    We tend to pack varyings with mixed semantics in an attempt to achieve "optimal" packing: say, 2 UVs with 2 lighting params in a single mediump vec4. Assuming we can't find "free" lanes in other mediump varyings, I assume we have to bite the bullet in cases like this and let those other varyings cost more than they would otherwise need, right?