Performance and SGEMM example

Hi,

I've gotten most of my code working with the Chromebook development platform. Simple kernels run fine and show a nice performance speedup. However, more complex kernels perform poorly.

As per the suggestions of some members of this forum, I've reorganized one of my OpenCL kernels to avoid using local memory and to use the float4 vectors. The routine I was optimizing was a simple covariance calculation for large data sets. Since this is essentially a modified matrix multiply, I used the provided sgemm.cl as an example. This resulted in a pretty substantial speed up in my code (about a factor of 2-3).
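To make the "modified matrix multiply" remark concrete, here is a minimal CPU sketch of a covariance calculation. It is not the poster's kernel: the function name `covariance`, the row-major layout, and the `n - 1` normalization are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Covariance of an n x d row-major data matrix X (n samples, d variables).
// Returns a d x d row-major matrix C = Xc^T * Xc / (n - 1), where Xc is X
// with each column's mean subtracted. Names and layout are hypothetical.
std::vector<float> covariance(const std::vector<float>& X,
                              std::size_t n, std::size_t d) {
    assert(n > 1 && X.size() == n * d);

    // Step 1: column means.
    std::vector<float> mean(d, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < d; ++j)
            mean[j] += X[i * d + j];
    for (std::size_t j = 0; j < d; ++j)
        mean[j] /= static_cast<float>(n);

    // Step 2: the matrix-multiply-shaped part, C = Xc^T * Xc, then scaling.
    std::vector<float> C(d * d, 0.0f);
    for (std::size_t a = 0; a < d; ++a) {
        for (std::size_t b = 0; b < d; ++b) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < n; ++i)
                acc += (X[i * d + a] - mean[a]) * (X[i * d + b] - mean[b]);
            C[a * d + b] = acc / static_cast<float>(n - 1);
        }
    }
    return C;
}
```

The triple loop in step 2 has the same shape as SGEMM, which is why tuning advice for sgemm.cl carries over to this kind of routine.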

Since I can't guarantee that a client's system has a functioning OpenCL system, I have a standard CPU version of the same function. Running both of these on the Chromebook I get the following performance figures for a data set:

Using local memory (i.e. optimized for nVidia): 24.5 seconds

Using vectors and no local memory (i.e. optimized for Mali): 9.26 seconds

Using CPUs with calls to BLAS (no OpenCL): 6.99 seconds

So here, despite the optimization effort, it is still quicker to use the CPU to do the processing.

Just checking to see if I did anything wrong, I checked the provided sgemm.cpp file. As a comparison, I ran the BLAS CPU version of sgemm (installed using apt-get) against the provided OpenCL example. For the default matrix size of (2048x2048), I get the following run times:

OpenCL sgemm: 84 seconds

CPU BLAS sgemm: 3 seconds

So the single-core CPU version is about 30 times faster than the OpenCL version!

I've checked, and the output is the same for both cases.

I suspect that performance is being choked in the OpenCL code by the sheer number of loads from global memory. On systems with local memory, this data can be cached so performance doesn't choke. I assume the CPU is caching data fairly well and speeding this up.

If this is indeed the case, performance will be bad on anything other than extremely simple kernels.

Am I making some sort of mistake here? Is there anything I can do to mitigate this? Essentially 90% of the processing time in most of my stuff is a matrix multiply.

Thanks for all of your help.

--Mike

mike winter over 11 years ago in reply to Tim Hartley

The suggested fix dropped the run time for sgemm down to 10 seconds.
It didn't seem to have any effect on my code run times, however. I still get the same 9.25 seconds for the Mali optimized covariance.
Your point with respect to power consumption is well taken. It is also nice that your mapped pointer implementation works, so you can shift back to the CPU. As far as I know this doesn't work on nVidia cards.
Is there any hope of future chips having dedicated fast local memory?
Thanks for all of your help and hard work.

Peter Harris over 11 years ago in reply to mike winter

> It didn't seem to have any effect on my code run times, however. I still get the same 9.25 seconds for the Mali optimized covariance.

I suspect you are running into one of the quirks of the DVFS implementation in our BSP. Most mobile DVFS policies are geared towards sustained workloads - so frequency choices are based on average utilization of the hardware over a time window. For pipelined workloads which keep both the CPU and GPU running at the same time and the dominant processor fully loaded, this approach works well. For a graphics-heavy game, for example, the GPU will be fully loaded and drift towards a high frequency, while the CPU will start off perhaps 50% loaded and drift towards a lower frequency. As graphics is pipelined there is a constant queue of work available for the GPU, it never goes idle, so utilization stays high.

Many off-the-shelf OpenCL kernels do not fit this model - some push buffers to the CPU, do some work, then switch to the GPU and do some work, and then flip back to the CPU again. Even though all cores are as busy as the design lets them be, their utilization is not 100%, so the DVFS governor will generally try to drop the frequency. Because the idle time is algorithmic in nature, rather than performance related, it doesn't go away even with a lower frequency, so the frequency drops some more, and soon you are running at the lowest operating point ...
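The downward spiral described here can be reproduced with a toy model. The sketch below is an illustration only, not the governor in any BSP: the `Governor` struct, its thresholds, and the operating points are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy utilization-based DVFS governor. All values are hypothetical.
struct Governor {
    std::vector<double> freqs_mhz;  // operating points, ascending
    std::size_t level;              // index of the current operating point
    double down_threshold;          // step down when utilization is below this
    double up_threshold;            // step up when utilization is above this

    void update(double utilization) {
        if (utilization < down_threshold && level > 0) {
            --level;
        } else if (utilization > up_threshold && level + 1 < freqs_mhz.size()) {
            ++level;
        }
    }
    double freq() const { return freqs_mhz[level]; }
};

// Ping-pong workload: in each window the CPU runs cpu_work, then the GPU
// runs gpu_work, never overlapping. Returns the GPU frequency (MHz) after
// the given number of governor windows.
double simulate_ping_pong(Governor cpu, Governor gpu,
                          double cpu_work, double gpu_work, int windows) {
    assert(cpu_work > 0.0 && gpu_work > 0.0);
    for (int w = 0; w < windows; ++w) {
        double t_cpu = cpu_work / cpu.freq();
        double t_gpu = gpu_work / gpu.freq();
        double total = t_cpu + t_gpu;
        // Each processor idles while the other runs, so lowering the
        // frequency of both leaves the utilization ratio unchanged.
        cpu.update(t_cpu / total);
        gpu.update(t_gpu / total);
    }
    return gpu.freq();
}
```

With equal CPU and GPU work, each side sees 50% utilization in every window, so both governors keep stepping down until they reach the lowest operating point, even though neither processor could have been busier.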

Our general advice is, where possible, pipeline your workloads so the CPU is preparing the next task, while the GPU is processing the last one, and keep the most heavily loaded part of the system fully utilized. That plays nicely with the default DVFS you will see today in most devices, but that said ... OpenCL is still a maturing technology so we are very interested in getting more developer input about what their use cases look like - we can undoubtedly do more with our partners to refine power policies here.
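A back-of-the-envelope timing model shows why pipelining helps the governor. This is a sketch under simple assumptions (equal-sized batches, one CPU stage feeding one GPU stage, no transfer cost); the function names are made up for the example.

```cpp
#include <algorithm>
#include <cassert>

// Total time for n equal batches when CPU preparation and GPU processing
// strictly alternate with no overlap.
double serial_time(int n, double t_cpu, double t_gpu) {
    assert(n > 0);
    return n * (t_cpu + t_gpu);
}

// Total time for a two-stage pipeline: the CPU prepares batch i+1 while
// the GPU processes batch i. After the first batch fills the pipeline,
// one batch completes every max(t_cpu, t_gpu).
double pipelined_time(int n, double t_cpu, double t_gpu) {
    assert(n > 0);
    return t_cpu + (n - 1) * std::max(t_cpu, t_gpu) + t_gpu;
}

// Fraction of the total wall-clock time during which the GPU is busy.
double gpu_utilization(int n, double t_gpu, double total) {
    assert(total > 0.0);
    return n * t_gpu / total;
}
```

For example, with 10 batches, 1 time unit of CPU preparation and 2 units of GPU work per batch, the serial schedule takes 30 units with the GPU busy two thirds of the time, while the pipelined schedule takes 21 units with the GPU busy about 95% of the time, which is the kind of load a utilization-based governor rewards with a high frequency.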

> Is there any hope of future chips having dedicated fast local memory?

Mobile GPUs and desktop GPUs operate at very different scales, so our global memory access is already relatively fast in comparison. We've not yet seen much evidence that we need a faster local memory - GPUs are already much more latency tolerant than a CPU, and you can already share memory in the GPU caches if it is "hot", so that provides most of the locality benefits. If you have a particular use case we'd love to hear it.

Kind regards,
Pete

Tim Hartley over 11 years ago in reply to mike winter

Hi Mike,
Just a further update which you might find useful when writing your own kernels.
The SGEMM sample in the SDK doesn't set the workgroup size when queueing the kernel - it just passes NULL as the workgroup size parameter. We made this configurable and found a sweet spot with a workgroup size of 4 x 16 or 8 x 16. We believe this modifies the memory accesses sufficiently to reduce the cache misses... in our tests the SGEMM sample was then running in around 7.5s. It's certainly worth experimenting with this, particularly for kernels with many load/store operations.
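For anyone experimenting along these lines, a small host-side helper can make the work-group size configurable. This is a sketch, not the SDK's code: `round_up`, `NDRange2D`, and `make_range` are hypothetical names, and which dimension gets 4 and which gets 16 is an assumption to be checked by measurement.

```cpp
#include <cassert>
#include <cstddef>

// Rounds a global work size up to the next multiple of the local size.
// OpenCL 1.x requires each global dimension to be divisible by the
// matching local dimension when an explicit local size is given.
std::size_t round_up(std::size_t global, std::size_t local) {
    assert(local > 0);
    return ((global + local - 1) / local) * local;
}

struct NDRange2D {
    std::size_t global[2];
    std::size_t local[2];
};

// Builds 2D global and local sizes for a given amount of work and a
// chosen work-group shape. If rounding enlarges the global size, the
// kernel must ignore work-items that fall outside the real problem.
NDRange2D make_range(std::size_t work_x, std::size_t work_y,
                     std::size_t local_x, std::size_t local_y) {
    NDRange2D r;
    r.local[0] = local_x;
    r.local[1] = local_y;
    r.global[0] = round_up(work_x, local_x);
    r.global[1] = round_up(work_y, local_y);
    return r;
}

// With a real command queue and kernel, the sizes would replace the NULL
// local-size argument in the standard enqueue call:
//   clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
//                          r.global, r.local, 0, NULL, NULL);
```

Trying a handful of shapes such as 4 x 16 and 8 x 16 and timing each one is usually the quickest way to find the sweet spot for a given kernel.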
Regards,
Tim

Chris Varnsverry over 11 years ago in reply to Tim Hartley

And a further update, we have managed to get the SGEMM kernel example down to ~2.5 seconds runtime by transposing the matrix, removing the 4 scalar loads and replacing them with a vector load, and changing the work-group size to 4x16. This significantly improves the cache problems which were crippling it before. In the future the SDK examples should be updated to include these changes.
So, faster AND lower power than the CPU
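The access pattern behind the transpose and vector-load changes can be sketched on the CPU. This is not the updated SDK kernel, only an illustration: `transpose` and `sgemm_transposed` are hypothetical names, and the inner group of four floats stands in for a single float4 load (for instance `vload4` in OpenCL C).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes an n x n row-major matrix.
std::vector<float> transpose(const std::vector<float>& M, std::size_t n) {
    assert(M.size() == n * n);
    std::vector<float> T(n * n);
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            T[c * n + r] = M[r * n + c];
    return T;
}

// Computes C = A * B given Bt = transpose(B). Because B is transposed,
// both operands are read along rows, so each step touches four contiguous
// floats from A and four from Bt instead of four strided scalar loads.
std::vector<float> sgemm_transposed(const std::vector<float>& A,
                                    const std::vector<float>& Bt,
                                    std::size_t n) {
    assert(n % 4 == 0 && A.size() == n * n && Bt.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const float* a = &A[i * n];
            const float* b = &Bt[j * n];
            float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
            for (std::size_t k = 0; k < n; k += 4)
                for (std::size_t l = 0; l < 4; ++l)
                    acc[l] += a[k + l] * b[k + l];  // one float4 multiply-add
            C[i * n + j] = (acc[0] + acc[1]) + (acc[2] + acc[3]);
        }
    }
    return C;
}
```

The transpose costs one extra pass over B, but every load in the inner loop is then contiguous, which is friendlier to the cache than walking down a column of B.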
Hope this helps,
Chris

mike winter over 11 years ago in reply to Chris Varnsverry

Thanks,
This is good news. My workflow is almost all matrix multiplies or matrix multiply-like operations.
I look forward to seeing the new SDK code.
Thanks!
--Mike

Ryan Booth over 10 years ago in reply to mike winter

Hi,
I'm just changing this to a discussion rather than a question as I think there is no one right answer here. Please continue with the discussion if anything more comes up.