GPU performance problem when not binding output texture

Hi, our game has a lot of shaders, and it is getting harder to add more shader permutations.
So we tried a trick to reduce the number of permutations. We have an uber shader that outputs depth to a separate render target.
We can't use a depth resolve here because we're using MSAA, and we can't store depth in the alpha channel because all available bits are already used for HDR color.
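
For a sense of scale on the permutation problem, here is a rough sketch. It assumes every permutation axis is an independent on/off toggle, which is a simplification, since real pipelines usually prune invalid combinations. The helper name and the figure of 10 toggles are made up for illustration and are not our real numbers.

```cpp
#include <cstdint>

// Hypothetical helper: number of shader variants when every feature is an
// independent on/off toggle (valid for toggle_count < 64).
constexpr std::uint64_t permutation_count(unsigned toggle_count) {
    return std::uint64_t{1} << toggle_count;
}

// Illustrative figure only, not taken from our game.
constexpr unsigned kExistingToggles = 10;

// Under this assumption, adding a "writes extra depth" toggle doubles
// the number of variants that have to be compiled and shipped.
static_assert(permutation_count(kExistingToggles) == 1024,
              "10 toggles give 2^10 variants");
static_assert(permutation_count(kExistingToggles + 1) ==
                  2 * permutation_count(kExistingToggles),
              "one more toggle doubles the count");
```

That doubling is what we are trying to avoid by keeping a single shader for both cases.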

We don't want to add a new permutation that separates the shaders that need to write the extra depth from those that don't.
So we tried to trick the GPU: on low-end devices we simply don't bind the separate depth render target, hoping that without the extra bandwidth the performance would match that of a dedicated permutation. I tested this on a Mali-G76 MP16 GPU, and the result suggests it actually makes performance worse.

The FPS drops from 40 to 20, and non-fragment cycles grow from 10M to 22M.

Memory bandwidth and load/store instruction counts go down as expected, but non-fragment cycles go up massively.
Does this mean we shouldn't use this no-binding trick to save shader permutations?
