
Dependent texture reads performance

Hi,

Are dependent texture reads still an issue on ARM hardware? (I'm speaking about low-end OpenGL ES 3.0 HW).

Cheers.

  • For Mali there was never any specific issue with dependent texture reads, beyond the usual cost of handling cache misses on the lookup path (which is not specific to dependent reads; the same problem can occur for non-dependent reads too). 
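
    For clarity, here is a minimal sketch of the distinction, purely for illustration (the sampler names and the distortion-map use case are hypothetical, not taken from this thread). The second fetch is "dependent" because its coordinates are computed from the result of the first fetch, so it cannot be issued until that result has arrived:

        /* GLSL ES 3.0 fragment shader, embedded as a C string literal. */
        static const char *frag_src =
            "#version 300 es\n"
            "precision mediump float;\n"
            "uniform sampler2D u_distortion; // hypothetical offset map\n"
            "uniform sampler2D u_color;      // hypothetical colour texture\n"
            "in vec2 v_uv;\n"
            "out vec4 o_color;\n"
            "void main() {\n"
            "    // Non-dependent read: coordinates come straight from a varying.\n"
            "    vec2 offset = texture(u_distortion, v_uv).rg - 0.5;\n"
            "    // Dependent read: coordinates are computed from the first\n"
            "    // result, so this fetch waits for the first one to complete.\n"
            "    o_color = texture(u_color, v_uv + offset * 0.1);\n"
            "}\n";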

    Any GPU cache miss can only be hidden if the shader core has "other work" to do - either non-dependent work from the same thread, or other threads to run. If you have a high number of cache misses then stalls cannot be completely hidden and start to eat into your overall content efficiency because the GPU runs out of work to do.

    Entry-level GPUs can be more susceptible to problems here because the L2 cache is smaller, so you are more likely to see cache pressure causing evictions, followed by subsequent misses that end up needing fetches from DRAM. 

    Things that help:

    • Maximize thread occupancy by keeping the shader's work register usage below the point where register allocation starts to reduce occupancy. For Bifrost onwards this means 32 or fewer work registers.
    • Use texture compression as much as you can to minimize cache misses on the lookup path; a minimal upload sketch follows this list. For framebuffers, make sure you stay on the path that gives you AFBC lossless compression (e.g. avoid accessing them as images in compute shaders, which prevents AFBC being used).
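
    As a concrete illustration of the texture compression point, below is a minimal sketch of uploading a pre-compressed ETC2 texture in OpenGL ES 3.0 (the function and variable names are hypothetical, and producing the ETC2 payload offline is not shown). RGBA8_ETC2_EAC stores 8 bits per texel versus 32 for RGBA8, so the same texture occupies a quarter of the cache footprint:

        #include <GLES3/gl3.h>

        /* Upload a pre-compressed ETC2/EAC texture. ETC2 support is mandatory
         * in OpenGL ES 3.0, so no extension query is needed for this format. */
        GLuint upload_etc2_texture(const void *etc2_data, GLsizei etc2_size,
                                   GLsizei width, GLsizei height)
        {
            GLuint tex = 0;
            glGenTextures(1, &tex);
            glBindTexture(GL_TEXTURE_2D, tex);

            glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA8_ETC2_EAC,
                                   width, height, 0, etc2_size, etc2_data);

            /* Only level 0 is uploaded here; glGenerateMipmap cannot build mip
             * chains for compressed formats, so real content should supply
             * pre-compressed mip levels as well. Filters are set so the texture
             * is complete with a single level. */
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
            return tex;
        }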

    ... but the impact here is very content-dependent, so if in doubt, benchmark your usage on the devices you care about. 

    HTH, 
    Pete
