Hi,
I was using SHOC to benchmark the Arndale Octa board and the newest Samsung Galaxy Note 10.1, both of which have Mali-T628 MP6 inside.
What I focused on is the GFLOPS figure reported by the MAdd_{1,2,4,8,16} tests, and the results range from 2 to 24 GFLOPS, which seems unreasonably low.
According to my theoretical calculation, the T628 MP6 should be capable of around 102 GFLOPS.
It does not seem to be a problem with the benchmark itself, since the same program reported 30-40 GFLOPS on the Nexus 10 (Mali-T604 MP4).
That is, the T628 MP6 achieved lower GFLOPS than the T604 MP4 when running the MaxFlops test in SHOC.
How can this happen?
Yoshi
Hi Yoshi,
I would certainly follow Pete's recommendations regarding DVFS. It's worth adding that although the Mali-T628 MP6 does have 6 GPU cores - 2 more than in the Nexus 10 - those 6 cores are split 4:2 between two core-groups. OpenCL will see these as two separate devices and won't automatically split a job across both groups; by default it will run the job on the 4-core group. So - depending on relative GPU frequency, and taking any DVFS issues into account - I would expect the performance of the two devices to be roughly equivalent.
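You can see the split for yourself by enumerating the GPU devices with the standard OpenCL queries - a minimal sketch, error checking omitted:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id devices[4];
        cl_uint num_devices = 0;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 4, devices, &num_devices);

        /* On a T628 MP6 I would expect two GPU devices to be reported:
         * one with 4 compute units and one with 2. */
        for (cl_uint i = 0; i < num_devices; i++) {
            char name[128];
            cl_uint cus;
            clGetDeviceInfo(devices[i], CL_DEVICE_NAME,
                            sizeof(name), name, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(cus), &cus, NULL);
            printf("Device %u: %s, %u compute units\n", i, name, cus);
        }
        return 0;
    }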
Hope that's useful,
Tim
Issue #1 to check: We have seen some benchmarks hit problems where either the individual tests are very short, or where the CPU and GPU block waiting for each other.
The reason for this is DVFS (frequency scaling power management). Most schemes follow a similar design to CPUFreq and are based on the utilization of the processing unit: if the GPU or CPU is "mostly idle", the DVFS driver will drop the frequency (and voltage) to save power. These schemes work well if you have a pipelined workload, where both the CPU and GPU are kept loaded and there is no blocking behavior unless you have run out of work.
For very short tests which do not push the GPU into a significant period of load (say < 25 ms), it is likely that the test is too short to trigger a DVFS frequency change, so if your device is otherwise mostly idle the GPU will be running at its lowest operating frequency while the benchmark runs.
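One way to rule this out at the application level is to make the measured region long enough to pull the governor up - roughly like this, assuming queue, kernel and global_size have already been set up:

    #include <CL/cl.h>

    void run_warm(cl_command_queue queue, cl_kernel kernel,
                  size_t global_size)
    {
        /* Warm-up: give the DVFS governor a sustained load long enough
         * to raise the GPU frequency before anything is measured. */
        for (int i = 0; i < 50; i++)
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                                   NULL, 0, NULL, NULL);
        clFinish(queue);

        /* Timed region: keep the GPU continuously busy, flush once at
         * the end rather than blocking after every kernel. */
        for (int i = 0; i < 100; i++)
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                                   NULL, 0, NULL, NULL);
        clFinish(queue);
    }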
For tests which run on the CPU, then block waiting for the GPU, which in turn blocks waiting for the CPU, you can end up in a situation where both processing units are on the critical path but neither is heavily utilized (in the worst case the loading is ~50% on each processor). In this case the DVFS policy will generally decide that both processors are under-loaded and drop the frequency of both. However, as the under-utilization is baked into the workload by the blocking waits, utilization won't go up after the frequency drops, so the frequency gets dropped again ... and eventually you get stuck at Vmin.
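In code, the problem pattern looks roughly like this (buf, host_ptr and prepare_next_input are hypothetical stand-ins for the benchmark's own buffers and CPU-side work):

    #include <CL/cl.h>

    void prepare_next_input(void *p);  /* hypothetical CPU-side work */

    void ping_pong(cl_command_queue queue, cl_kernel kernel, cl_mem buf,
                   void *host_ptr, size_t size, size_t gsz, int iterations)
    {
        for (int i = 0; i < iterations; i++) {
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsz, NULL,
                                   0, NULL, NULL);
            /* Blocking read: the CPU idles while the GPU works ... */
            clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_ptr,
                                0, NULL, NULL);
            /* ... then the GPU idles while the CPU works. */
            prepare_next_input(host_ptr);
            clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_ptr,
                                 0, NULL, NULL);
        }
        /* Neither side ever looks busy to its governor, so both clocks sink. */
    }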
To test this theory, try disabling DVFS on both the CPU and the GPU (see the sketch below).
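On Linux the CPU side can usually be pinned via the CPUFreq sysfs interface; the GPU side is driver-specific, so the Mali path below is only a placeholder - check the documentation for your kernel's Mali driver. A rough sketch, run as root:

    #include <stdio.h>

    static void write_sysfs(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return; }
        fputs(value, f);
        fclose(f);
    }

    int main(void)
    {
        /* One node per CPU core; cpu0 shown here. */
        write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                    "performance");
        /* Hypothetical Mali DVFS switch - the real path varies per device. */
        write_sysfs("/path/to/mali/dvfs", "off");
        return 0;
    }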
Issue #2 to check: The Mali-T600 and T700 series GPUs are vector processing units, so each running thread executes SIMD-style instructions. The compiler can auto-vectorize, but to get the best throughput out of the GPU it works best if the OpenCL kernels use the vector types and built-in functions directly, as these are guaranteed to vectorize cleanly, rather than relying on the compiler to piece vector code together from scalar operations.
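As a toy example, a multiply-add kernel written with float4 and the mad() built-in maps directly onto the SIMD units, whereas the scalar version leaves the vectorization up to the compiler:

    /* Scalar version - relies on the compiler to vectorize: */
    __kernel void madd_scalar(__global const float *a,
                              __global const float *b,
                              __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = a[i] * b[i] + a[i];
    }

    /* Vector version - float4 arithmetic and the mad() built-in are
     * guaranteed to vectorize cleanly: */
    __kernel void madd_vec4(__global const float4 *a,
                            __global const float4 *b,
                            __global float4 *out)
    {
        size_t i = get_global_id(0);
        out[i] = mad(a[i], b[i], a[i]);
    }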
Issue #3 to check: The Mali maths units are relatively flexible - so if you can structure a problem to use 8-bit or 16-bit integers (char/short) rather than 32-bit int, or fp16 rather than fp32, then you can get considerably higher throughput.
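For example, the vector kernel above can be rewritten with 16-bit floats, packing twice as many elements into each SIMD operation - assuming the device exposes the cl_khr_fp16 extension (check CL_DEVICE_EXTENSIONS, but the Mali-T600 series should report it):

    /* Sketch: the same multiply-add using 16-bit floats. half8 packs
     * twice as many elements per operation as float4. */
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    __kernel void madd_half8(__global const half8 *a,
                             __global const half8 *b,
                             __global half8 *out)
    {
        size_t i = get_global_id(0);
        out[i] = a[i] * b[i] + a[i];
    }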
HTH, Pete