glGetProgramBinary unsupported?

When I query the binary, I get an opaque binary blob and nothing human-readable. I was expecting to see the generated assembly code, the way Nvidia returns it. It's really difficult to write a max-FLOPS test without seeing this assembly. Moreover, the Midgard architecture is a mix of old-school VLIW and scalar, so I never know whether scalar or vector MULs are being generated from my code.

Parents

0 Priyadarshi Sharma over 10 years ago in reply to Priyadarshi Sharma

I also measured the throughput of VADD and VMUL separately, and here are the results:
vec4 ADD : 21.3 GFLOP/s
vec4 MUL : 42.3 GFLOP/s
vec4 MADD : 42.6 GFLOP/s
The shaders for these are very similar; only the instructions inside the loop change. Here is the vec4 MUL shader, for example:
#version 300 es
layout( location = 0 ) out highp vec4 color;
uniform highp vec4 u0;
uniform highp vec4 u1;
uniform highp vec4 u2;
uniform highp vec4 u3;
uniform lowp int numLoopIterations;
void main()
{
        highp vec4 v0 = u0;
        highp vec4 v1 = u1;
        highp vec4 v2 = u2;
        highp vec4 v3 = u3;
        highp vec4 v4 = u0 + u1;
        highp vec4 v5 = u1 + u2;
        for( lowp int i = 0; i < numLoopIterations; i++ )
        {
                v0 = v1 * v2;
                v1 = v2 * v3;
                v2 = v3 * v4;
                v3 = v4 * v5;
                v4 = v5 * v0;
                v5 = v0 * v1;
        }
        color = v0 + v1 + v2 + v3 + v4 + v5;
}
Is it possible that VMUL can be executed on a separate unit, but VADD cannot?
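
For readers who want to reproduce figures like these, the following is a minimal sketch of how a GFLOP/s number could be derived from the loop in the shader above. It counts six vec4 statements per iteration and four lanes per vec4, and ignores the setup additions and the final sum. The fragment count, iteration count and elapsed time are hypothetical inputs; the post does not state how the measurement was taken.

```python
# Sketch: deriving a GFLOP/s figure for the benchmark shader above.
# All run parameters (fragments, iterations, seconds) are hypothetical;
# the thread does not state them.

VEC4_LANES = 4          # components per vec4
OPS_PER_ITERATION = 6   # six vec4 statements in the loop body


def flops_per_fragment(iterations, flops_per_lane_op=1):
    """Floating-point operations one fragment executes inside the loop.

    flops_per_lane_op is 1 for ADD or MUL, 2 for MADD (multiply plus add).
    """
    return iterations * OPS_PER_ITERATION * VEC4_LANES * flops_per_lane_op


def gflops(fragments, iterations, seconds, flops_per_lane_op=1):
    """Throughput in GFLOP/s, ignoring setup and final-sum instructions."""
    total = fragments * flops_per_fragment(iterations, flops_per_lane_op)
    return total / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical run: 1920x1080 fragments, 100 iterations, 0.1 s per draw.
    print(gflops(1920 * 1080, 100, 0.1))
    # Ratio of the MUL and ADD figures posted above.
    print(42.3 / 21.3)
```

The ratio of the posted MUL and ADD figures comes out very close to 2, which is what prompts the question about a second unit.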

Children

+1 Mark Bellamy over 10 years ago in reply to Priyadarshi Sharma

Yes, under certain circumstances the compiler can use the multiply functionality from the dot product unit to perform a VMUL, but not a VADD. This means you can do a VADD + VMUL or a VMUL + VMUL for 8 flops a cycle, whereas VADD on its own is only 4 flops a cycle.
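
As a rough consistency check, the per-cycle rates in this reply can be compared with the measured figures from the question. The sketch below divides each measurement by the stated flops per cycle; if the explanation holds, both should imply roughly the same aggregate cycle rate. The name `implied_rate` and the idea of an "aggregate cycle rate" are illustrative assumptions, since the thread gives neither clock speed nor core count.

```python
# Sketch: check the posted figures against the per-cycle rates in the reply.
# implied_rate is a derived, illustrative quantity (aggregate giga-cycles per
# second across all arithmetic pipes); the thread states no clock or core count.

MEASURED_GFLOPS = {"ADD": 21.3, "MUL": 42.3}   # figures from the question
FLOPS_PER_CYCLE = {"ADD": 4, "MUL": 8}         # rates from the reply


def implied_rate(op):
    """Aggregate giga-cycles per second implied by one measurement."""
    return MEASURED_GFLOPS[op] / FLOPS_PER_CYCLE[op]


if __name__ == "__main__":
    for op in MEASURED_GFLOPS:
        print(op, round(implied_rate(op), 3))
```

The two implied rates agree to within about one percent, so the measurements are consistent with 4 flops a cycle for VADD and 8 flops a cycle for paired VMULs.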

0 Priyadarshi Sharma over 10 years ago in reply to Mark Bellamy

Thanks for the info!