glGetProgramBinary unsupported?

When I query the binary, I really do get a binary and nothing human-readable. I was expecting to see the generated assembly code, the way Nvidia returns it. It's really difficult to write a max-FLOPS test without seeing this assembly. Moreover, the Midgard architecture is a mix of old-school VLIW and scalar, so I never know whether scalar or vector MULs are being generated from my code.

0 Priyadarshi Sharma over 10 years ago in reply to Chris Varnsverry

So I was finally able to achieve 42 Gflops/s using just MADDs on the Note 3. I measured the clock varying between 420 and 480 MHz, so assuming a 450 MHz average case:
8 flops * 2 pipes * 6 cores * 0.45 GHz = 43.2 Gflops/s
The kernel I posted above has only one problem: register spilling to main memory. So I reduced the vec4 variables from 8 to 6 and ran the shader through the offline compiler:
7 work registers used, 5 uniform registers used, spilling not used.
                A       L/S     T       Total   Bound
Cycles:         15      0       0       15      A
Shortest Path:  4.5     0       0       4.5     A
Longest Path:   1       -1      -1      -1      A
I suspected register spilling might become an issue when using vec4 registers and this tool confirmed it. Btw, what is the difference between shortest and longest path?
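
As a sanity check on the arithmetic above, here is a minimal Python sketch of the peak estimate. The function and parameter names are made up for illustration. The figures (8 flops per pipe per cycle, 2 pipes, 6 cores, an assumed 450 MHz average clock, 42 Gflops/s measured) are the ones quoted in this post.

```python
def peak_gflops(flops_per_pipe_per_cycle, pipes_per_core, cores, clock_ghz):
    """Theoretical peak in GFLOPS: flops per cycle over the whole GPU times the clock in GHz."""
    return flops_per_pipe_per_cycle * pipes_per_core * cores * clock_ghz


# Figures quoted in the post; 0.45 GHz is an assumed average, not a fixed clock.
peak = peak_gflops(8, 2, 6, 0.45)
measured = 42.0
utilisation = measured / peak

print(f"peak = {peak:.1f} GFLOPS, measured = {measured:.1f} GFLOPS "
      f"({utilisation:.1%} of peak)")
```

Note that the clock has to be in GHz for the result to come out in GFLOPS; with 0.45 taken as MHz the estimate would be a thousand times too small.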

0 Peter Harris over 10 years ago in reply to Priyadarshi Sharma

Btw, what is the difference between shortest and longest path?
Shortest path assumes no conditional block or loop is executed (irrespective of the actual values of those conditions), so it's the shortest instruction sequence from main entry to the end of the program, but it can be optimistic.
Longest path assumes every conditional block is executed once, but it doesn't understand loops, so in your case it reports -1 as an unknown value.
HTH,
Pete
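
The two definitions above can be illustrated with a toy Python model. This is only a sketch of the description in this reply, not how the offline compiler is implemented. The program representation (a list of `("op", cycles)`, `("if", body)` and `("loop", body)` tuples) and all cycle counts are invented for the example.

```python
UNKNOWN = -1  # stands in for the "-1" the tool prints when it cannot give a bound


def shortest_path(program):
    """Cycles when no conditional block or loop body is executed."""
    return sum(payload for kind, payload in program if kind == "op")


def longest_path(program):
    """Cycles when every conditional block runs once; UNKNOWN if a loop is present."""
    total = 0
    for kind, payload in program:
        if kind == "op":
            total += payload
        elif kind == "if":
            inner = longest_path(payload)
            if inner == UNKNOWN:
                return UNKNOWN
            total += inner
        elif kind == "loop":
            # The trip count is not known statically, so no bound is reported.
            return UNKNOWN
        else:
            raise ValueError(f"unknown node kind: {kind}")
    return total


# Hypothetical programs with made-up cycle counts.
branchy = [("op", 2), ("if", [("op", 3)]), ("op", 1)]
looped = [("op", 2), ("loop", [("op", 5)]), ("op", 1)]

print(shortest_path(branchy), longest_path(branchy))
print(shortest_path(looped), longest_path(looped))
```

In this model a shader whose work sits almost entirely inside a loop gets a small shortest-path figure and an unknown longest-path figure, which is the pattern seen in the output quoted earlier in the thread.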

0 Priyadarshi Sharma over 10 years ago in reply to Priyadarshi Sharma

I also measured the flops of VADD and VMUL separately, and here are the results:
vec4 ADD : 21.3 Gflops/s
vec4 MUL : 42.3 Gflops/s
vec4 MADD : 42.6 Gflops/s
The shaders for these are very similar; only the instruction inside the loop is changed. Here is the vec4 MUL one, for example:
#version 300 es
layout( location = 0 ) out highp vec4 color;
uniform highp vec4 u0;
uniform highp vec4 u1;
uniform highp vec4 u2;
uniform highp vec4 u3;
uniform lowp int numLoopIterations;
void main()
{
        highp vec4 v0 = u0;
        highp vec4 v1 = u1;
        highp vec4 v2 = u2;
        highp vec4 v3 = u3;
        highp vec4 v4 = u0 + u1;
        highp vec4 v5 = u1 + u2;
        for( lowp int i = 0; i < numLoopIterations; i++ )
        {
                v0 = v1 * v2;
                v1 = v2 * v3;
                v2 = v3 * v4;
                v3 = v4 * v5;
                v4 = v5 * v0;
                v5 = v0 * v1;
        }
        color = v0 + v1 + v2 + v3 + v4 + v5;
}
Is it possible that VMUL can be executed on a separate unit but VADD cannot?
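
For reference, a shader like the one above can be turned into a Gflops/s figure roughly as follows. This is a hedged Python sketch: the function name is made up, and the fragment count, iteration count and elapsed time in the example call are placeholder values, not measurements from this thread. It counts only the loop body (six vec4 multiplies, 4 flops each) and ignores the handful of adds outside the loop, which is reasonable when the iteration count is large.

```python
def measured_gflops(fragments, loop_iterations, seconds,
                    vec4_ops_per_iteration=6, flops_per_vec4_op=4):
    """Achieved GFLOPS for a loop-dominated shader run once per fragment.

    flops_per_vec4_op is 4 for a vec4 ADD or MUL and 8 for a vec4 MADD
    (one multiply and one add per component).
    """
    total_flops = (fragments * loop_iterations
                   * vec4_ops_per_iteration * flops_per_vec4_op)
    return total_flops / seconds / 1e9


# Placeholder inputs for illustration only.
example = measured_gflops(fragments=1_000_000, loop_iterations=1000, seconds=1.0)
print(f"{example:.1f} GFLOPS")
```

The elapsed time needs to cover only the GPU work, so the draw call should be timed after the pipeline has drained (for example with `glFinish` or a timer query) rather than at submission.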

+1 Mark Bellamy over 10 years ago in reply to Priyadarshi Sharma

Yes, under certain circumstances the compiler can use the multiply functionality from the dot product to perform a VMUL but not a VADD. So this means you can do a VADD + VMUL or VMUL + VMUL for 8 flops a cycle, whereas VADD is only 4 flops a cycle.
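
The measurements earlier in the thread line up with these per-cycle figures. A small Python sketch of the back-calculation, assuming the 2 pipes per core, 6 cores and 450 MHz average clock quoted in the earlier post (the clock was measured as varying, so the results are approximate):

```python
PIPES_PER_CORE = 2
CORES = 6
CLOCK_GHZ = 0.45  # assumed average clock; the measured range was 420-480 MHz

# Measured throughput in Gflops/s, as reported in the thread.
measured = {"vec4 ADD": 21.3, "vec4 MUL": 42.3, "vec4 MADD": 42.6}


def flops_per_pipe_per_cycle(gflops):
    """Back out the flops issued per pipe per cycle from a measured GFLOPS figure."""
    return gflops / (PIPES_PER_CORE * CORES * CLOCK_GHZ)


implied = {name: flops_per_pipe_per_cycle(g) for name, g in measured.items()}
for name, value in implied.items():
    print(f"{name}: {value:.2f} flops per pipe per cycle")
```

Under these assumptions the ADD result comes out close to 4 flops per pipe per cycle, and the MUL and MADD results close to 8.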

0 Priyadarshi Sharma over 10 years ago in reply to Mark Bellamy

Thanks for the info!