God Ray effect (light scattering) with Mali GPU

Note: This was originally posted on 10th May 2012 at http://forums.arm.com

Hello everyone,

I'm implementing the God Ray effect for an Android game on Mali-400 based devices (specifically the Samsung I9100 and Samsung I9300).
I followed this article: http://fabiensanglard.net/lightScattering/index.php
but the effect does not look right (as you can see in the attached file).
The same code works correctly on Win32 and on Adreno- and PowerVR-based devices.
I think there is a problem with the texture coordinates passed from the vertex shader to the fragment shader; the interpolation computation may be causing this issue.

I hope you can give us some ideas on this.

Thank you.





    Hi,

    thanks for the link, an interesting read!

    I believe the issue is rooted in floating point precision. For power efficiency and performance reasons the Mali-400 is designed to support mediump (16-bit) floating point numbers in the fragment processor (which conforms to the Khronos GLSL ES specification).

    I suspect the fragment shader on the web page was written to assume highp (32-bit) floating point numbers. The way it has been written, it will suffer from loss of precision the more iterations (NUM_SAMPLES) are completed. This is because it is creating a delta (the light to texel vector) and then scaling it down by the number of samples.

    This creates a small floating point number - for example if the texel was (0.5, 0.5) and the light was at (0.25, 0.75) then the delta vector becomes (0.25, -0.25) which is then divided by number of samples to become (0.0025, -0.0025).

    Then, in the loop it is modifying the original texture coordinate - (0.5, 0.5) in the example - by subtracting the delta and storing the result and repeating the process each iteration.

    The problem with this method is it loses precision in the low order bits early on, and can never recover them. Subtracting a relatively small number from a relatively larger one repeatedly also causes problems - there are only so many bits to hold the range between the magnitude of the large part and the precision of the small part.

    The method could be reimplemented by trying to preserve the precision as long as possible before sampling the texture. I believe (though have not yet tested) that precision would be kept for longer with a method similar to this:

    With a quick worked example, I think the original method grows to about 2.7% error by the 100th iteration, whereas the method below appears to be more stable, oscillating around ~0.003%-0.027% error.


    precision mediump float;

    uniform vec2 u_v2LightPos;
    uniform sampler2D u_s2dFirstPass;
    varying vec2 v_v2TexCoord;

    const int k_iNumSamples = 100;

    void main()
    {
        // Keep the number as big as possible - no need to scale down by NumSamples yet.
        vec2 v2Delta = v_v2TexCoord - u_v2LightPos;
        int iSample;
        vec3 v3Color = vec3(0.0);
        for(iSample = 1; iSample <= k_iNumSamples; iSample++)
        {
            vec2 v2TexCoord = v2Delta * float(-iSample); // Multiply by iteration rather than accumulating.
            v2TexCoord /= float(k_iNumSamples);          // Then divide, which may lose precision.
            v2TexCoord += v_v2TexCoord;                  // Then add to the larger number; precision loss may also occur here.
            vec3 v3Texel = texture2D(u_s2dFirstPass, v2TexCoord).rgb;
            v3Color += v3Texel;
        }
        gl_FragColor = vec4(v3Color, 1.0);
    }
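    Not something from the original thread, but the precision argument can be sanity-checked on the CPU by rounding every intermediate result to fp16 (11 significant bits, mimicking mediump arithmetic) for the x component of the worked example above:

```python
import math

def f16(x):
    """Round a float to the nearest fp16 value (11 significant bits;
    subnormals and overflow are ignored, which is fine for this range)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2048) / 2048, e)

NUM_SAMPLES = 100
tex, light = 0.5, 0.25              # x components of the worked example
delta = f16(tex - light)            # 0.25, exactly representable

# Original method: scale the delta down once, then accumulate each step.
step = f16(delta / NUM_SAMPLES)     # ~0.0025, already rounded
coord_acc = tex
for _ in range(NUM_SAMPLES):
    coord_acc = f16(coord_acc - step)

# Proposed method: recompute the coordinate from the full-sized delta.
i = NUM_SAMPLES
coord_rec = f16(f16(f16(delta * -i) / NUM_SAMPLES) + tex)

err_acc = abs(coord_acc - light)    # the 100th sample should land on the light
err_rec = abs(coord_rec - light)
print(err_acc, err_rec)
```

    With these numbers the accumulating loop overshoots the light position by roughly 0.006 (a couple of percent of the 0.25 travel), while the recomputed coordinate lands on it exactly - in the same ballpark as the error figures quoted above.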

    Having said that, the whole approach to the effect looks like it will be sub-optimal on most embedded GPUs as it stands. For example, on a Mali-400 our offline shader compiler reports the loop body will compile to 2 cycles so every fragment rasterised would take 2 cycles * 100 samples = 200 cycles.

    If we assumed an 800x480 FBO, a 266MHz Mali-400 with 4 fragment processors, that would mean it couldn't exceed 12FPS just rendering the FBO. Are you seeing framerates that low, or are you reducing the size of the FBO and/or number of samples to compensate?
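    As a rough cross-check (my arithmetic, not the compiler's), the ideal-case ceiling works out just under 14FPS; the extra per-fragment work beyond the loop body would pull the practical figure down toward 12FPS:

```python
fragments = 800 * 480          # FBO size
cycles_per_fragment = 200      # 2 cycles per loop body * 100 samples
cores = 4                      # fragment processors on a Mali-400 MP4
clock_hz = 266e6

cycles_per_frame = fragments * cycles_per_fragment / cores
fps_ceiling = clock_hz / cycles_per_frame
print(round(fps_ceiling, 1))   # ideal-case ceiling, loop bodies only
```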

    The heart of the problem is using the fragment shader to do that many texture lookup operations in a loop. A shortcut would be to try generating the "light scattering" image from the "light and occluder" image by a different method. One such method would be to take the occluder image and additive-blend it over itself, with the 2nd copy centred at the light's coordinates and scaled slightly bigger. You could repeat this process a few times to get a similar effect to a radial blur but with much less fragment shader cost.

    I mocked up a very simple test of this using The GIMP, scaling the original occlusion image by 120%, 150% then 200% and using an additive blend with a 20% opacity each time, centred around the light position. It looks like some parameters would need tweaking, but that's the general idea - a cheaper radial blur effect using a few FBO operations.
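    As a toy illustration of that multi-pass idea (hypothetical code, 1-D and nearest-neighbour sampling for brevity), using the same scale factors and 20% opacity as the mock-up:

```python
# Toy 1-D sketch of the additive "scale and blend" god-ray shortcut.
def scale_about(img, centre, factor):
    """Resample img scaled up by `factor` about `centre` (nearest neighbour)."""
    out = []
    for x in range(len(img)):
        src = int(round(centre + (x - centre) / factor))
        out.append(img[src] if 0 <= src < len(img) else 0.0)
    return out

def god_rays(img, centre, passes=((1.2, 0.2), (1.5, 0.2), (2.0, 0.2))):
    """Additively blend progressively larger copies of img over itself."""
    result = list(img)
    for factor, opacity in passes:
        scaled = scale_about(img, centre, factor)
        result = [min(1.0, a + opacity * b) for a, b in zip(result, scaled)]
    return result

# A small bright light around index 8 of a 17-texel strip; the rest occluded.
image = [0.0] * 17
for i in (7, 8, 9):
    image[i] = 1.0
rays = god_rays(image, centre=8.0)
```

    Texels just outside the light blob (e.g. index 10) pick up energy from the scaled copies, giving the streaks radiating away from the light; a real implementation would render each pass as a textured quad with additive blending into the FBO.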

    Please let me know your thoughts. Cheers, Pete
