Hi,
We've analysed our main shader (which presumably accounts for most of the pixels in the 3D pass). The shader is largely ALU bound on most architectures (see the trimmed malioc report below):
Before optimization (Mali-G71):
                            A     LS    V     T
Total instruction cycles:   6.8   0.0   4.0   2.0
After optimization:
                            A     LS    V     T
Total instruction cycles:   4.7   0.0   4.8   2.0
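For what it's worth, here is a rough sketch of how I read those numbers. It assumes that the unit with the highest cycle count is the one the shader is bound by, and that the columns stand for arithmetic (A), load/store (LS), varying (V) and texture (T). Both are my interpretation of the report, and the helper name `bound_unit` is just something I made up for illustration.

```python
# Cycle counts copied from the trimmed malioc report above (Mali-G71).
# Assumption: A = arithmetic, LS = load/store, V = varying, T = texture.
UNITS = ("A", "LS", "V", "T")
before = dict(zip(UNITS, (6.8, 0.0, 4.0, 2.0)))
after = dict(zip(UNITS, (4.7, 0.0, 4.8, 2.0)))


def bound_unit(cycles):
    """Return (unit, cycles) for the unit with the highest cycle count.

    Hypothetical helper: assumes the slowest unit is the static bottleneck.
    """
    unit = max(cycles, key=cycles.get)
    return unit, cycles[unit]


for label, cycles in (("before", before), ("after", after)):
    unit, value = bound_unit(cycles)
    print(f"{label}: bound by {unit} at {value} cycles")
```

If that reading is right, the static bottleneck moves from A (6.8 cycles) to V (4.8 cycles) after the change, so the report itself may no longer describe an ALU-bound shader.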
We made this optimization because we were convinced, perhaps wrongly, that the Shader Core Unit was ALU bound (image attached).
(LEFT: after optimization; RIGHT: before optimization)
After applying the optimization, though, we didn't notice any significant improvement in ALU utilization, either over the whole frame or within the region shown above (which I believe corresponds to the 3D pass): ~69%, down from ~70%. My suspicion is that this might be related to the Partial Coverage Rate values - according to your blog, a high rate can be caused by sliver/micro triangles. The execution core utilization drops significantly midway through, and I can't identify any other culprit. So, if we really are losing performance to that kind of geometry, would that explain why the optimization was ineffective?
Cheers!