This is a demo created internally at ARM by Anthony Barbier.
Mali OpenCL Flag Demo
The demo shows the performance improvements you can achieve when using OpenCL™ on a Mali-powered device.
The application simulates a cloth flag using a model of roughly 6000 vertices. Every frame, for each of these vertices, the application calculates the effect of gravity, wind and the spring forces between the vertices.
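As a rough illustration of the per-vertex work described above, here is a minimal sketch in plain C. It is not the demo's actual code: the grid layout, the `ClothParams` fields, the `vertex_force` name and the simplified linear spring model (force proportional to how far the offset to each neighbour deviates from its rest offset) are all assumptions made for illustration.

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;

static Vec3 vec_add(Vec3 a, Vec3 b) { return (Vec3){a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 vec_sub(Vec3 a, Vec3 b) { return (Vec3){a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 vec_scale(Vec3 a, float s) { return (Vec3){a.x * s, a.y * s, a.z * s}; }

/* Hypothetical simulation parameters (not taken from the demo). */
typedef struct {
    Vec3  gravity;   /* gravitational acceleration */
    Vec3  wind;      /* constant wind force applied to every vertex */
    float stiffness; /* spring constant k */
    float mass;      /* mass of one vertex */
} ClothParams;

/* Total force on the vertex at (col, row) of a width x height grid.
 * The function only reads `pos` and `rest`, so the result for one
 * vertex does not depend on the result for any other vertex. */
Vec3 vertex_force(const Vec3 *pos, const Vec3 *rest,
                  int width, int height, int col, int row,
                  const ClothParams *p)
{
    /* Left, right, up and down neighbours. */
    static const int offsets[4][2] = {{-1, 0}, {1, 0}, {0, -1}, {0, 1}};

    assert(width > 0 && height > 0);
    assert(col >= 0 && col < width && row >= 0 && row < height);

    int self = row * width + col;

    /* Gravity (mass * acceleration) plus wind. */
    Vec3 force = vec_add(vec_scale(p->gravity, p->mass), p->wind);

    for (int i = 0; i < 4; ++i) {
        int nc = col + offsets[i][0];
        int nr = row + offsets[i][1];
        if (nc < 0 || nc >= width || nr < 0 || nr >= height)
            continue; /* no neighbour beyond the edge of the cloth */

        int n = nr * width + nc;
        Vec3 current = vec_sub(pos[n], pos[self]);   /* offset now */
        Vec3 at_rest = vec_sub(rest[n], rest[self]); /* offset at rest */

        /* Simplified linear spring: k * (current offset - rest offset). */
        force = vec_add(force, vec_scale(vec_sub(current, at_rest), p->stiffness));
    }
    return force;
}
```

A CPU version would call `vertex_force` in a loop over every vertex each frame. Because each call reads shared input and produces its own output, the same body could in principle be run as one work-item per vertex.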
The demo is shown running on the Samsung Exynos 5250 Arndale Board from InSignal, which has a dual-core ARM® Cortex®-A15 CPU and a quad-core ARM Mali™-T604 GPU.
The first version shown is written in multithreaded C and runs on the CPU (without using ARM Neon™ technology). It uses 100% of both cores of the dual-core Cortex-A15 CPU but achieves only around 4-5 fps. Visually this is not a nice result: the scene is too slow, so the movement of the cloth is not smooth. The GPU is underutilised in this version (less than 1% utilisation), even though it is a system resource that could be put to good use.
Next, the OpenCL version is shown running on a Mali-T604 GPU. In this version, we render two flags (~12000 vertices) at around 36 fps. The flag looks much better now, and the intended simulation effect is much more obvious. CPU usage in this version has fallen to single-digit percentages, leaving the CPU free for other tasks, for more features, or to sleep and reduce power usage. Overall this is a 16x performance improvement over the CPU version of the code (2x the number of vertices, 8x the frames per second).
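The 16x figure can be checked as vertex updates per second. This is only a back-of-the-envelope check: it assumes 4.5 fps as the midpoint of the quoted 4-5 fps range and takes the approximate vertex counts at face value.

```latex
\begin{align*}
\text{CPU: } & 6000 \times 4.5\ \text{fps} = 27\,000\ \text{vertex updates/s} \\
\text{GPU: } & 12000 \times 36\ \text{fps} = 432\,000\ \text{vertex updates/s} \\
\text{Speedup: } & \frac{432\,000}{27\,000} = 16
\end{align*}
```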
This goes to show that for parallel applications such as this one, OpenCL on a Mali device can provide superior performance. Each data point in this application can be calculated independently of all the others. Because the Mali GPU is very good at parallel processing (up to 256 hardware threads per core), it can easily outperform the CPU, which is designed more for good sequential performance (one hardware thread per core).
The other interesting thing shown in this demo is efficient OpenGL ES and OpenCL interoperability. In the application, OpenCL is used to manipulate the flag model data and OpenGL ES is then used to render it to the screen. Typically, the model data would be manipulated on the host (CPU) side of the application and then uploaded into a VBO (Vertex Buffer Object) so that the GPU has access to it for OpenGL ES rendering. In a naïve system, you can imagine that this demo would have to copy the updated model data between the OpenCL memory and the OpenGL ES VBO every frame.
Thankfully, this is not the case, as it would increase memory usage (increasing power usage) and reduce performance by needlessly copying memory. Instead, the two APIs can share the same piece of memory directly.
Hopefully we will have an example of this in one of our Mali SDKs soon.