
varying vs computation performance in fragment shader

I want to know which is more expensive: passing a value as a varying, or computing it in the fragment shader?

Take the following as an example:
I need the value A * (1 - factor), where factor is calculated in the vertex shader and passed to the fragment shader as a varying. There are two ways to get the same result:
1. A is a uniform of the vertex shader. A * (1 - factor) is calculated in the vertex shader and the result is passed to the fragment shader as a varying, which the fragment shader uses directly. In this case, the main overhead should be varying interpolation.
2. A is a uniform of the fragment shader. A * (1 - factor) is calculated in the fragment shader, which then uses the result directly. In this case, the main cost should be the calculation in the fragment shader.

Which of these two solutions performs better? Also, where can I find varying interpolation throughput data for Arm GPUs? For example, how many floats can be interpolated per cycle?

// solution 1:
uniform float A;
varying float result;
varying float factor;
void vs()
{
factor = ...;
result = A * (1.0 - factor);
}

void fs()
{
 // uses result directly in other computations
}

// solution 2:
varying float factor;
void vs()
{
factor = ...;

}

uniform float A;
void fs()
{
// ...
float result = A * (1.0 - factor);
// ...
}

  • Hi Shawn, 

    For Mali most uniform loads are effectively "free" (they get promoted into registers), so what you have here is a fairly straight trade-off between bandwidth (number of varyings written) and computation (number of fragment evaluations).

    It looks like both of your scenarios end up with a single varying (solution one only needs to pass "result" to the fragment shader, solution two only needs to pass "factor" to the fragment shader). Therefore, assuming you have fewer vertices than fragments, solution one will be the better option (bandwidth is the same, but computation is lower as vertex count is lower). 
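
As a short check that the two solutions really produce the same per-fragment value: A * (1 - factor) is linear in factor and A is constant across the primitive, so interpolating the result is the same as computing it from the interpolated factor. The sketch below assumes the interpolation weights w_i of the primitive's vertices sum to 1 (true for barycentric interpolation, with or without perspective correction); w_i and f_i (the per-vertex factor values) are notation introduced here, not taken from the post.

```latex
\sum_i w_i \, A\,(1 - f_i)
  = A \Bigl( \sum_i w_i - \sum_i w_i f_i \Bigr)
  = A \Bigl( 1 - \sum_i w_i f_i \Bigr),
  \qquad \text{given } \sum_i w_i = 1 .
```

The left-hand side is what solution one interpolates; the right-hand side is what solution two computes from the interpolated factor. This equivalence would not hold for a non-linear expression of factor.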

    In the general case this is really about reducing bandwidth as much as possible; bandwidth that hits DDR is expensive. If you have very dense meshes (lots of vertices to write) then minimizing the number of varyings is the preferred option, provided that the computation moved to the fragment shader is relatively simple.
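
To make the dense-mesh case concrete, here is a minimal sketch in GLSL ES 1.00 style. It assumes a hypothetical second uniform B that also needs scaling by (1 - factor); B, a_position and a_factor are illustrative names, not from the original question. Rather than writing two results per vertex, only factor is written as a varying and both cheap multiplies move to the fragment shader.

```glsl
// ---- vertex shader (separate source) ----
attribute vec4 a_position;     // hypothetical attribute
attribute float a_factor;      // hypothetical stand-in for the elided factor computation
varying mediump float factor;  // the only varying written per vertex

void main()
{
    factor = a_factor;
    gl_Position = a_position;
}

// ---- fragment shader (separate source) ----
precision mediump float;
uniform float A;
uniform float B;               // hypothetical second uniform
varying mediump float factor;

void main()
{
    float k = 1.0 - factor;    // shared term, computed once per fragment
    gl_FragColor = vec4(A * k, B * k, 0.0, 1.0);
}
```

Whether this wins depends on the vertex-to-fragment ratio of the actual content, so it is worth measuring on the target device rather than assuming.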

    In terms of core performance, we have a data sheet (link below), although interpolation isn't currently on it. In general for Mali we can interpolate either 128 bits or 256 bits per clock cycle, depending on the shader core (recent Bifrost cores are larger 2-pixel-per-clock cores, so they have double the interpolation performance, although the ratio per thread is the same). Most importantly, Mali GPUs can interpolate fp16 values faster than fp32 values, so use mediump wherever you can.

     https://developer.arm.com/-/media/Arm%20Developer%20Community/PDF/Mali%20GPU%20datasheet/Arm%20Mali%20GPU%20Datasheet%202020.pdf
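
As a minimal sketch of the mediump advice applied to solution one (GLSL ES 1.00 style): the varying is declared mediump in both shaders so it can be interpolated as fp16. The attribute names a_position and a_factor are hypothetical stand-ins for whatever the real vertex shader uses to compute factor.

```glsl
// ---- vertex shader (separate source) ----
attribute vec4 a_position;      // hypothetical attribute
attribute float a_factor;       // hypothetical stand-in for the elided factor computation
uniform float A;
varying mediump float result;   // fp16 varying

void main()
{
    float factor = a_factor;
    result = A * (1.0 - factor);
    gl_Position = a_position;
}

// ---- fragment shader (separate source) ----
precision mediump float;
varying mediump float result;   // same mediump declaration as in the vertex shader

void main()
{
    // Use the interpolated value directly; here it is just written out as grey.
    gl_FragColor = vec4(vec3(result), 1.0);
}
```

This only makes sense when the value's range and precision fit comfortably in fp16; values such as large world-space positions or strongly magnified texture coordinates may still need highp.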

    HTH, 
    Pete
