I'm having some conflicting thoughts on how to interpret the charts from two Streamline captures of the same scenario over 4 s (from a Mali-G71; Samsung A20e), so I was wondering what your view is on the following.
For instance, while the number of L2 and external texture reads per cycle has increased (from Before to After), the total texture bytes read from both the L2 cache and external memory are lower.
My interpretation is that in After we're performing fewer filtering operations, but those operations are more bandwidth-expensive (less coherent, or a heavier texture format?), hence the increase in bytes/cycle. However, all in all there's still an improvement because we're reading less data in total. Is this a reasonable reading of the charts?
On the flip side, there's the load/store (L/S) side of things. There's an overall (positive) drop in all the metrics (and an increase in the full read cycles), so the L/S picture seems clearer to me than the texture unit one.
Any views are appreciated. Thanks!
Hi JPJ,
Yes, that's my reading of it too. Long-hand explanation:
The critical numbers for system performance are the absolute quantities: Texturing active cycles, and the total texture bytes read from the L2 cache and from external memory. In your "after" case you have substantial drops in all three.
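To make that explicit with a tiny sketch (made-up numbers, not taken from your captures): the per-cycle ratios are just the byte totals divided by Texturing active cycles, so they can rise simply because the active cycle count falls faster than the totals do.

```python
# Toy numbers only (not from the captures). The absolute totals are what the
# system pays for; the per-cycle ratios are totals / texturing active cycles.

def summarize(name, active_cycles, l2_bytes, ext_bytes):
    print(f"{name}: {active_cycles:,} active cycles, "
          f"{l2_bytes:,} L2 bytes ({l2_bytes / active_cycles:.2f} B/cycle), "
          f"{ext_bytes:,} external bytes ({ext_bytes / active_cycles:.2f} B/cycle)")

# "After" reads fewer bytes overall, but the active cycle count drops even
# faster, so both bytes-per-cycle ratios go up despite the real improvement.
summarize("Before", active_cycles=1_000_000, l2_bytes=2_000_000, ext_bytes=800_000)
summarize("After",  active_cycles=600_000,   l2_bytes=1_500_000, ext_bytes=700_000)
```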
The texture bytes per cycle numbers provide an indicator of how well the texture cache is working. A "good" number depends on the content texture formats being used, so there is no right answer.
For example, if you have an application blitting two compressed textures (e.g. 4bpp) and an uncompressed texture (e.g. 32bpp), then the expected number here is roughly (0.5 + 0.5 + 4) / 3 ≈ 1.7 bytes per access, taking bpp / 8 as the bytes fetched per texel and ignoring filtering and cache effects.
If you optimize this to remove one of the compressed layers, the average per access goes up to (0.5 + 4) / 2 = 2.25 bytes, even though the overall scene load drops from 5 to 4.5 bytes per pixel.
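If it helps, here's the same back-of-envelope arithmetic as a sketch you can plug your own formats into; my assumptions are one texel fetched per access, bpp / 8 bytes per texel, and no cache or filtering effects:

```python
# Back-of-envelope bytes-per-access arithmetic for the blit example above.
# Assumes one texel fetched per texture access; ignores cache reuse and filtering.
COMPRESSED_4BPP = 4 / 8      # 0.5 bytes per texel
UNCOMPRESSED_32BPP = 32 / 8  # 4.0 bytes per texel

def report(name, layers):
    total = sum(layers)            # bytes fetched per blitted pixel
    average = total / len(layers)  # average bytes per texture access
    print(f"{name}: total {total} B/pixel, average {average:.2f} B/access")

report("Two compressed + one uncompressed", [COMPRESSED_4BPP, COMPRESSED_4BPP, UNCOMPRESSED_32BPP])
report("One compressed + one uncompressed", [COMPRESSED_4BPP, UNCOMPRESSED_32BPP])
```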
There is some indication this is exactly what is happening in your case. Your percentage of compressed and percentage of mipmapped textures both drop from ~36% to ~30%, indicating that a higher percentage of the total is now uncompressed texture data.
*EDIT* Added an answer to the LS comment too
The load/store data tends to behave a little more rationally because the counters are all counting physical accesses (full or partial reads of 64-byte cache lines), not a higher-level concept like "a vertex" or even "an attribute". Cache line sizes don't change, so the ratios of "accesses" to L2/external traffic should be more consistent unless you start thrashing the cache.
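For example, a rough consistency check you could run on the L/S read side; the 64-byte line size is fixed, but the average size of a partial read below is purely a guess for illustration:

```python
# Rough estimate of L/S read traffic from the access counters alone.
# The 64-byte cache line size is fixed; AVG_PARTIAL_READ_BYTES is a made-up
# assumption and will vary with the workload's access patterns.
CACHE_LINE_BYTES = 64
AVG_PARTIAL_READ_BYTES = 16  # illustrative assumption only

def estimated_ls_read_bytes(full_read_accesses, partial_read_accesses):
    """Estimate bytes moved by load/store reads from the access counters."""
    return (full_read_accesses * CACHE_LINE_BYTES
            + partial_read_accesses * AVG_PARTIAL_READ_BYTES)

# Made-up counter values: compare the estimate against the measured L2/external
# read bytes to see whether the accesses-to-traffic ratio stays consistent.
print(estimated_ls_read_bytes(full_read_accesses=100_000, partial_read_accesses=25_000))
```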
HTH, Pete
Thanks so much for the quick and clear answer, Pete! That's exactly what I was looking for :)
FYI: Edited in an additional comment on the LS counters above after you hit accept.
Cheers! It actually refreshed while I was finishing reading - I was just happy with the answer at that point, but it certainly made it more complete :).