Dependent texture reads performance

Hi,

Are dependent texture reads still an issue on ARM hardware? (I'm speaking about low-end OpenGL ES 3.0 HW).

Cheers.

  • For Mali there was never any specific issue with dependent texture reads, other than the usual cost of managing cache misses on the lookup path. That cost is not specific to dependent reads; the same problem can occur for non-dependent reads too.

    Any GPU cache miss can only be hidden if the shader core has "other work" to do - either non-dependent work from the same thread, or other threads to run. If you have a high number of cache misses then stalls cannot be completely hidden and start to eat into your overall content efficiency because the GPU runs out of work to do.
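    To make the latency-hiding argument concrete, here is a rough sketch of a toy round-robin model. It is not a description of how any Mali core actually schedules work: the function name, the parameters, and the assumption that every thread does a fixed amount of work and then waits on exactly one miss are all illustrative.

```python
def stall_fraction(resident_threads, work_cycles_per_thread, miss_latency_cycles):
    """Fraction of time a (hypothetical) shader core sits idle.

    Toy model: each thread issues `work_cycles_per_thread` cycles of useful
    work, then waits `miss_latency_cycles` for a cache miss to resolve.
    While one thread waits, the core can run any other resident thread.
    """
    if resident_threads < 1:
        raise ValueError("need at least one resident thread")
    if work_cycles_per_thread <= 0 or miss_latency_cycles < 0:
        raise ValueError("cycle counts must be positive / non-negative")

    # One thread's full period: its own work plus the miss it waits on.
    period = work_cycles_per_thread + miss_latency_cycles
    # Useful work available in that period, capped at a fully busy core.
    busy = min(period, resident_threads * work_cycles_per_thread)
    return 1.0 - busy / period


if __name__ == "__main__":
    # Illustrative numbers only, not measured on real hardware.
    for threads in (1, 2, 4, 8):
        print(threads, stall_fraction(threads, 10, 30))
```

    In this model the stall disappears once the other resident threads supply enough work to cover the miss latency, and it grows as either the latency rises (for example, misses going to DRAM) or the number of resident threads falls.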

    Entry-level GPUs can be more susceptible to problems here because the L2 cache is smaller, so you are more likely to get cache pressure causing evictions, followed by misses that need fetches from DRAM.

    Things that help:

    • Maximize thread occupancy by keeping the work register count below the point where register allocation reduces occupancy. For Bifrost onwards this means using 32 work registers or fewer.
    • Use texture compression as much as you can to minimize cache misses on the lookup path. For framebuffers, make sure you stay on a path that gives you AFBC lossless compression (e.g. do not use them as images in compute shaders).

    ... but the impact here is very content dependent, so if in doubt benchmark your usage on the devices you care about. 
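    As a rough illustration of the register-count point above, the sketch below encodes the 32-register threshold as a simple check. The threshold comes from the reply; the assumption that a shader using 33 to 64 work registers runs at half the thread count, and the function name itself, are illustrative and should be checked against the documentation or offline compiler report for your target GPU.

```python
FULL_OCCUPANCY_MAX_REGISTERS = 32   # threshold quoted above for Bifrost onwards
ASSUMED_MAX_REGISTERS = 64          # assumption: upper limit per thread


def relative_occupancy(work_registers):
    """Return the assumed fraction of maximum thread occupancy.

    Assumption (not stated in the thread): exceeding 32 work registers
    halves the number of threads the core can keep resident.
    """
    if not 0 <= work_registers <= ASSUMED_MAX_REGISTERS:
        raise ValueError("work register count outside the assumed 0-64 range")
    if work_registers <= FULL_OCCUPANCY_MAX_REGISTERS:
        return 1.0
    return 0.5


if __name__ == "__main__":
    # Hypothetical register counts, e.g. as read from a shader compiler report.
    for regs in (24, 32, 33, 48):
        print(regs, relative_occupancy(regs))
```

    A check like this could sit in a build step that flags shaders crossing the threshold, but the register counts have to come from your own tooling and the real cost still needs measuring on device.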

    HTH, 
    Pete

  • Thanks for the quick answer! Just to clarify my question: by "dependent texture reads" I meant the ability of old OpenGL ES 2.0 devices to pre-fetch texture data before even running the fragment shader, IF the texture coordinates were used unchanged in the fragment shader.

  • Don't think Mali ever did that - with enough threads running you don't need to bother. 

  • Cool, thanks! This was a common performance optimization on PowerVR OpenGL ES 2.0 HW (old iPhones too), so I was wondering whether it applied here.