
ARM Mali Graphics


Evaluating compute performance on mobile platforms: an introduction

Using the GPU for compute-intensive processing is all about improving performance compared to using the CPU only. But how do we measure performance in the first place? In this post, I'll touch upon some basics of benchmarking compute workloads on mobile platforms to ensure we are on solid ground when talking about performance improvements.


Benchmarking basics

To measure performance, we select a workload and a metric of its performance. Because workloads are often called benchmarks, the process of evaluating performance is usually called benchmarking.


Selecting a representative workload is a bit of a dark art, so we will leave this topic for another day. Selecting a metric is more straightforward.


The most widely used metric is the execution time. To state it bluntly: the lower the execution time, the faster the system. In other words, lower is better.


Frequently, the chosen metric is inversely proportional to the execution time. So, the higher the metric is, the lower the execution time is. In other words, higher is better. For example, when measuring memory bandwidth, the usual metric is the amount of data copied per unit time. As this metric is inversely proportional to the execution time, higher is better.
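As a sketch with made-up numbers (the buffer size and timings below are hypothetical), the inverse relationship between a bandwidth metric and execution time can be shown in a few lines of Python:

```python
# Hypothetical timings for copying the same 256 MiB buffer on two systems.
BYTES = 256 * 1024 * 1024

def bandwidth_gib_s(bytes_copied, seconds):
    """Memory-bandwidth metric: data copied per unit time (higher is better)."""
    return bytes_copied / seconds / 2**30

slow = bandwidth_gib_s(BYTES, 0.100)   # copy took 100 ms -> 2.5 GiB/s
fast = bandwidth_gib_s(BYTES, 0.025)   # copy took 25 ms  -> 10.0 GiB/s

# The metric is inversely proportional to execution time:
# a quarter of the time gives four times the bandwidth.
assert fast == 4 * slow
```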


Benchmarking pitfalls

Benchmarking on mobile platforms can be tricky. Running experiments back to back can produce unexpected performance variation, and so can dwindling battery charge, hot room temperature or an alignment of stars. Fundamentally, we are talking about battery powered, passively cooled devices which tend to like saving their battery charge and keeping their temperature down. In particular, dynamic voltage and frequency scaling (DVFS) can get in the way. Controlling these factors (or at least accounting for them) is key to meaningful performance evaluation on mobile platforms.


Deciding what to measure and how to measure it deserves special attention. In particular, when focussing on optimising device code (kernels), it's important to measure kernel execution time directly, because host overheads can hide effects of kernel optimisations.
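As an illustrative host-side sketch only (a real OpenCL measurement would instead read the CL_PROFILING_COMMAND_START/END timestamps from the kernel's event via clGetEventProfilingInfo), separating warm-up runs from timed runs looks like this; `run_kernel` is a stand-in workload, not a real device kernel:

```python
import time

def run_kernel():
    # Stand-in for a device kernel; replace with a real kernel launch.
    sum(i * i for i in range(100_000))

def time_kernel(kernel, warmup=2, repeats=10):
    """Time only the kernel itself, skipping warm-up runs that may include
    one-off host overheads such as compilation or memory allocation."""
    for _ in range(warmup):
        kernel()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        timings.append(time.perf_counter() - start)
    return min(timings)  # least interference from the rest of the system

print(f"kernel time: {time_kernel(run_kernel) * 1e3:.2f} ms")
```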


To illustrate some of the pitfalls, I have created an IPython Notebook which I encourage you to view before peeking into our next topic.


Sample from IPython Notebook

What's next?


Using GPU settings that are ill-suited for evaluating performance is common but should not bite you once you've become aware of it. However, even when all known experimental factors are carefully controlled for, experiments on real systems may produce noticeably different results from run to run. To properly evaluate performance, what we really need is a good grasp of basic statistical concepts and techniques...
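To make that concrete, here is a minimal sketch (with invented timings) of the kind of summary statistics that run-to-run variation calls for:

```python
import statistics

# Hypothetical execution times (seconds) from ten runs of the same experiment.
runs = [0.102, 0.098, 0.105, 0.099, 0.101, 0.097, 0.110, 0.100, 0.103, 0.098]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)      # sample standard deviation
sem = stdev / len(runs) ** 0.5      # standard error of the mean

# Rough 95% confidence interval (normal approximation; for only 10 runs a
# Student's t multiplier of ~2.26 would be more appropriate than 1.96).
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem
print(f"{mean:.4f} s +/- {1.96 * sem:.4f} s (95% CI approx. [{lo:.4f}, {hi:.4f}])")
```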


Are you snoring already? I too used to think that statistics was dull and impenetrable. (A confession: statistics was the only subject I flunked at university, I swear!) It turns out to be anything but dull when you apply it to optimising performance! If you are at ARM TechCon on 1-3 October 2014, come along to my live demo, or just wait a little longer and I will tell you all you need to know!

Throughout this year, application developers have continued to release a vast range of apps using both the OpenGL® ES 2.0 and 3.0 APIs. While the more recent API offers a wider range of features, and performance can be better on GPUs that support OpenGL ES 3.0 onwards, the backwards compatibility of OpenGL ES versions means the success and longevity of more cost-optimized OpenGL ES 2.0 GPUs look set to continue. A consequence of this trend is that demand for the ARM® Mali™-450 MP graphics processor, implementing a design that is optimised for OpenGL ES 2.0 acceleration, has never been higher.


The momentum behind ARM’s 64-bit ARMv8-A application processor architecture is growing, enabling more complex applications within strict power budgets. We were able to announce last week the 50th license of the technology across 27 different companies, showing that the demand for greater compute capabilities across a wide range of applications is strong.


This market support gave us the opportunity to further optimize the performance of our Mali-450 drivers to support 64-bit builds of OpenGL ES 2.0 apps. So, that’s exactly what we’ve done, with a brand new set of 64-bit Mali-450 drivers released to our partners recently. An example of where we see a Mali-450 GPU and a Cortex-A53 CPU successfully combined is the entry-level smartphone market, where cost efficiency is important but the implementation of a 64-bit CPU can offer all-important differentiation from the competition. With this release, ARM is making it easier for the mass market to access the latest technology advances while providing silicon partners with a wider choice of which GPU can be paired with which CPU.


So watch out for the new wave of 64-bit devices based on Mali-450 MP and rest assured that the Mali drivers have been optimised for the feature set of the 64-bit CPU.  The only thing you should see is increased app performance, and a few more CPU cycles available – we’re sure you’ll do great things with them.

The ARM Ecosystem continues to drive innovation, diversity and opportunity across the entire industry at an astonishing pace, bringing the benefits of semiconductor technology to all potential users across the world. The changes appearing in the cost-efficient segment are especially exciting: there are a huge number of opportunities for silicon vendors and OEMs to successfully differentiate their products for this market. Examples include the growing number of customers looking to upgrade from feature phone to smartphone technology as initiatives such as Android One (launched yesterday in India) emerge and gain momentum; the ability to bring high-performing technology, showcasing fast frame rates, great displays and long-lasting batteries, into the mainstream; new applications emerging that offer desirable new functionality and capabilities; and new form factors placing mobile silicon into a variety of exciting and affordable new markets. The mass market is discovering a whole host of features which two years ago were only available in premium devices. With all this change taking place, it is no wonder that the industry is seeing shipments of superphones waning, making way for the era of the mass market.


The mass market (entry level & mid range) is predicted to total 80% of total smartphone shipments by 2017 (Source: Mixture of ARM & Gartner estimates)


But just how big is the global mass market opportunity? In ARM’s results statement, we predicted that the mobile app processor market would be worth $20bn in 2018, of which the total addressable market for the mass market would be $10bn. The main geographical areas driving this ongoing smartphone growth are emerging markets such as China, India, Russia and Brazil, as the graph from Credit Suisse shown below predicts. With 1.75 billion people already owning a smartphone, there are still over 5 billion who are yet to experience full mobile connectivity. China and India alone are predicted to bring over 400 million new users to this market in 2014.



Emerging markets will be the long term driver for smartphone shipment volumes (Source: Credit Suisse, The Wireless View 2014)

ARM® Mali GPUs have rapidly become the de facto GPU for the mass market and for Android devices as a whole. Thanks to the low energy, low silicon area yet feature rich elements of our cost-efficient roadmap, we are now the most commonly deployed GPU in all new smartphone models with the fastest growing market share across all GPU vendors - in 4Q13 73 new Mali-based smartphones were introduced into the market.  In fact, over 75% of all application processors coming out of APAC now have an ARM Mali GPU inside. The first set of Android One devices, whose goal is to bring affordable smartphone technology to emerging markets, is entirely based on Mediatek's MT6582 SoC featuring a Mali-400 MP2 GPU.



ARM Mali GPUs took the #1 spot in 4Q13 among new models (Source: Bank of America Merrill Lynch Global Research estimates)

The Mali-400 GPU has driven success in this market for all its customers since its announcement in June 2008 and continues to be popular in emerging markets where great hardware and software has to be brought together in an affordable manner. Beyond the Android One smartphones, it can be found in a range of popular devices ranging from smartphones to wearables:


  • Oppo Joy (Mediatek MT6572)
  • Huawei Honor 3C (Mediatek MT6582)
  • Alcatel One Touch Idol X Plus (Mediatek MT6592)
  • Samsung Galaxy S5 Mini (Samsung Exynos 3 Quad)
  • Omate TrueSmart Smartwatch (Mediatek MT6572)


However, as the technology behind these devices is evolving at such a fast pace, tomorrow’s mass market consumers will be demanding more from their devices than their current counterparts. For this reason, ARM has developed a long-term GPU IP roadmap that specifically meets the needs of silicon partners addressing this market, ensuring that as consumer values evolve the ARM Ecosystem has everything it needs to continue its success.


For example, OpenGL® ES 3.0 will become the universal standard for developing mobile games and applications. Already, over 20% of devices support this API, according to stats from Android. Mass market consumers will expect to be able to enjoy the latest titles as soon as they come out and getting the most out of them will require a GPU which supports the most popular standards. As another example, the trend for higher resolutions continues and a mass market GPU will be required that has the computational power to deliver the desired performance at higher pixel densities. The ARM Mali-T720 GPU has been developed to meet these needs of future generations of mass market consumers, offering both higher computation capacity and API support up to and including OpenGL ES 3.1.


The opportunities in the mass market are seemingly endless and ARM IP is historically proven to be the leader in this field, offering functional, energy-efficient graphics within the smallest possible silicon area. Our mid-range GPU roadmap is advancing in line with the market with new GPUs ready to become the Mali-400 of the future, combining the best of ARM’s traditional mass market offering with the new requirements of a future age. For more information about ARM’s mass market offerings, visit www.arm.com.



Chris Doran, COO of Geomerics, had a recent conversation with GamingBolt to discuss recent developments with Enlighten, how Geomerics is supporting indie game developers, and two major items on the roadmap.


Geomerics has come a long way in the last few years. They are now officially backed by the UK government to set new benchmarks in the movie industry. They are also working closely with EA on games like Star Wars Battlefront and Mirror’s Edge. It’s safe to assume that Geomerics is aware of where the next generation of lighting and graphics technology is heading.


Geomerics Interview: Realizing The Full Potential of Enlighten Using The New Console Cycle « GamingBolt.com: Video Game…

Tom Olson wrote a fantastic series of blogs about performance metrics and how to interpret them. His blog about triangles per second pretty much changed the industry: very quickly, companies had to stop talking nonsense about triangles per second being in any way a useful metric. Now along comes this ground-breaking research, and the whole comparison basis, the industry-standard metric of uselessness, is challenged once more. What shall we do? It is about as useful as:

  • an umbrella in the desert?
  • a concrete lifebelt?
  • a glass hammer?

In this second part of Energy Efficiency in GPU Applications I will show some real SoC power consumption numbers and how they correlate with the workload coming from an application.

Study: How Application Workload Affects Power Consumption


We carried out a brief study to find out how an application workload affects SoC power consumption. The idea of the study was to develop a small micro-benchmark that runs at just above 60fps on the target devices, i.e. it is always vsync limited. Here is a screen shot from the micro-benchmark (it is called Torus Test):



To leave some room for optimization we added a few deliberate performance issues to the original version of the micro-benchmark:


  • The vertex count is too high
  • The texture size is too high
  • The fragment shader consumes too many cycles
  • Back-face culling is not enabled


We wanted to see how power consumption is affected when we reduce the workload by fixing each of the above performance issues individually. All of these performance issues and the related optimizations are somewhat artificial compared to real applications. The micro-benchmark was deliberately written so that none of these extreme optimizations has any major visual impact; with a real-world application you probably wouldn't be able to decrease the texture resolution from 1920x1920 to 96x96 without a drastic impact on the visual quality of the application. However, the effect of the optimizations described here is the same as the effect of optimizing real applications: you improve the energy efficiency of your application by reducing GPU cycles and bandwidth consumption.


At ARM we have a few development SoCs that can be used for measuring actual SoC power consumption which we were able to use in the study. The micro-benchmark allows the measurement of system FPS in offscreen rendering mode without the vsync limit, as described previously.  In the result graphs we use the frame time instead of the system FPS (frame time = 1s / system FPS), because that corresponds to the number of GPU cycles that consume power on the GPU.  We also used the L2 cache external bandwidth counters for measuring the bandwidth consumed by the GPU. By using these metrics we wanted to see how the workload in the application and GPU correlates with the power consumption in the SoC. Here are the results.
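The frame-time conversion used in the result graphs is simple; as a quick sketch in Python (the 500 MHz GPU clock below is a hypothetical figure for illustration):

```python
def frame_time_ms(system_fps):
    """Frame time in milliseconds: frame time = 1 s / system FPS."""
    return 1000.0 / system_fps

def gpu_cycles_per_frame(system_fps, gpu_hz):
    # Upper bound on active GPU cycles per frame, assuming the GPU
    # never idles within the frame.
    return gpu_hz / system_fps

print(frame_time_ms(60))                       # ~16.67 ms at the vsync rate
print(gpu_cycles_per_frame(120, 500_000_000))  # offscreen 120 fps at 500 MHz
```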


Decreasing Vertex Count

The micro-benchmark allows us to configure how many vertices are drawn in each frame. We tested three different values (4160, 2940 and 1760). The following graph shows how the vertex count correlates with the frame time and SoC Power:



This micro-benchmark is not very vertex heavy but still the correlation between vertex count and SoC power consumption is clear. When decreasing the vertex count, power is not only saved by reduced vertex shader processing, but also because there is less external bandwidth needed to copy vertex data to/from the vertex shading core. Therefore we can also see the correlation between vertex count and external bandwidth in the above graph.


Decreasing Texture Size

The micro-benchmark uses a generated texture for texture mapping, which makes it possible to configure the texture size. We tested the performance with three different texture sizes (1920x1920, 960x960 and 96x96). Each object is textured with a separate texture object instance. As expected, the texture size doesn't affect the frame time much but it affects the external bandwidth. We found the following correlation between texture size, external bandwidth and SoC power:



Notice that the bandwidth doesn't decrease linearly with the number of texels in a texture. This is because with a smaller texture size there is a much better hit rate in the L2 cache, which quickly reduces the external bandwidth.


Decreasing Fragment Shader Cycles

The micro-benchmark implements a Phong shading model with a configurable number of light sources.  We tested the performance with three different values for the number of light sources (5, 3, and 1). The Mali Shader Compiler outputs the following cycle count values for these configurations:


Light Sources | Arithmetic Cycles | Load/Store Cycles | Texture Pipe Cycles | Total Cycles


We found the following correlation between the number of fragment shader cycles, frame time and SoC power:




Adding Back-Face Culling and Putting All Optimizations Together

Finally, we tested the SoC power consumption impact when enabling back-face culling and when including all the previous optimization at the same time:



With all these optimizations we managed to reduce SoC power consumption to less than 40% of the original version of the micro-benchmark. At the same time the frame time was reduced to less than 30% and the bandwidth to less than 10% of the original. Note that such a large relative bandwidth reduction is possible because writing the onscreen frame buffer to external memory consumes very little bandwidth in this micro-benchmark: Transaction Elimination was enabled in the device, and it is very effective with this application because there are lots of tiles filled with the constant background color that don't change between frames.



I hope this blog and the case study example have helped you to better understand the factors which impact energy efficiency, and the extent to which SoC power consumption can be reduced by optimizing GPU cycles and bandwidth in an application. As the processing capacity of embedded GPUs keeps growing, an application developer can often shift the focus from performance optimization to energy efficiency optimization, which means implementing the desired visual output without consuming cycles or bandwidth unnecessarily. You should also consider the trade-off between improved visual quality and increased power consumption: is that last piece of "eye candy" which increases processing requirements by 20% really worth a 20-36% drop in battery life for the end users of the application?

If you have any further questions, please don’t hesitate to ask them in the comments below.

Back in June I had the pleasure of visiting Barcelona in Spain for the first time to give a presentation at Gamelab.

I was lucky enough to attend some of the other talks given and was impressed by the quality and diversity of the presentations.

Particularly enjoyable were Tim Schafer's presentation on creativity in game development and a panel discussion ("The future of mobile entertainment") about mobile game development, which had some interesting points on the technical difficulties faced by developers.


I gave the attached presentation with our great partner Will Eastcott from PlayCanvas. We ran through:

  • an introduction to WebGL
  • how the team at PlayCanvas uses WebGL in their open source, cloud-based game engine
  • the importance of good tools for performance analysis and debugging of mobile games
  • how you can use the Mali Graphics Debugger, ARM DS-5 Streamline and the Mali Offline Compiler to analyse your code, identify problems and find solutions

In this blog I will talk about energy efficiency in embedded GPUs and what an application programmer can do to improve the energy efficiency of their application. I have split this blog into two parts; in the first part I will give an introduction to the topic of energy efficiency and in the second part I will show some real SoC power measurements by using an in-house micro-benchmark to demonstrate the extent to which a variety of factors impact frame rendering time, external bandwidth and SoC power consumption.


Energy Efficiency in the GPU/Device


Let's look first at what energy efficiency means from the GPU's perspective.  At a high level the energy is consumed by the GPU and its associated driver in three different ways:


  • GPU is running active cycles in the hardware to perform its computation tasks in one or more of its cores.
  • GPU/driver is issuing memory transactions to read data from, or write data to, the external memory.
  • GPU driver code is executed in the CPU either in the user mode or in the kernel mode.


On most devices Vertical Synchronization (vsync) synchronizes the frame rate of an application with the screen display rate. Using vsync not only removes tearing, but it also reduces power consumption by preventing the application from producing frames faster than the screen can display them. When vsync is enabled on the device the application cannot draw frames faster than the vsync rate (vsync rate is typically 60fps on modern devices so we can keep that as our working assumption in the discussion). On the other hand, in order to give the best possible user experience the application/GPU should not draw frames significantly slower than the vsync rate i.e. 60fps. Therefore the device/GPU tries hard to keep the frame rate always at 60fps, while also trying to use as little power as possible.


A device typically has power management functionality for both the GPU and the CPU in order to adjust their operating frequencies based on the current workload. This functionality is referred to as DVFS (Dynamic Voltage and Frequency Scaling). DVFS allows the device to handle both normal and peak workloads in an energy efficient fashion by adjusting the clock frequency to provide just enough performance for the current workload, which in turn allows the voltage to drop, as the transistors no longer need to be driven as hard to meet the more relaxed timing constraints. The energy consumed per clock is proportional to V², so if we drop the frequency enough to allow a voltage reduction of 20%, energy efficiency improves by 36%. Using a higher clock frequency than needed means a higher voltage and consequently higher power consumption, so the power management tries to keep the clock frequency as low as possible while still keeping the frame rate at the vsync rate. When the GPU is under extremely high load, some vendors allow the GPU to run at an overdrive frequency - a frequency which requires a voltage higher than the nominal voltage for the silicon process - which can provide a short performance boost but cannot be sustained for long periods. If high workload from an application keeps the GPU frequency overdriven for a long time, the SoC may overheat, and as a consequence the GPU is forced to use a lower clock frequency to allow the SoC to cool down, even if the frame rate drops under 60fps. This behavior is referred to as thermal throttling.
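The V² relationship quoted above is easy to check numerically:

```python
def energy_saving_from_voltage_drop(voltage_drop):
    """Fractional energy-per-clock saving for a given fractional voltage
    drop, using energy per clock proportional to the voltage squared."""
    return 1.0 - (1.0 - voltage_drop) ** 2

# A 20% voltage reduction: 1 - 0.8^2 = 0.36, the 36% improvement quoted above.
print(round(energy_saving_from_voltage_drop(0.20), 2))  # 0.36
```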


Device vendors often differentiate their devices by making their own customizations to the power management. As a result two devices having the same GPU may have different power management functionality. The ARM® Mali™ GPU driver provides an API to SoC vendors that can be used for implementing power management logic based on the ongoing workload in the GPU.


In addition to DVFS, some systems may also adjust the number of active GPU cores to find the most energy efficient configuration for the given GPU workload. Typically, DVFS provides just a few available operating frequencies and enabling/disabling cores can be used for fine-tuning the processing capacity for the given workload to save power.


In its simplest form the power management is implemented locally for the GPU, i.e. the GPU power management is based only on the ongoing GPU workload and the temperature of the chip. This is not optimal, as there can be several other sub-systems on the chip which all "compete" with each other to get maximum performance for their own processing until the thermal limit is exceeded and all sub-systems are forced to operate at lower capacity. A more intelligent power management scheme maintains a power budget for the entire SoC and allocates power to the different sub-systems in a way that avoids thermal throttling.


Energy Efficiency in Applications

From an application point of view the power management functionality provided by the GPU/device means that the GPU/device always tries to adjust the processing capacity for the workload coming from the application. This adjustment happens automatically in the background and if the application workload doesn't exceed the maximum capacity of the GPU, the frame rate remains constantly at the vsync rate regardless of the application workload. The only side effect from the high application workload is that the battery runs out faster and you can feel the released energy as a higher temperature of the device.


Most applications don't need to create a higher workload than the GPU's maximum processing capacity i.e. the power management is able to keep the frame rate constantly at the vsync level. The interval between two vsync points is 1/60 seconds and if the GPU completes a frame faster than that, the GPU sits idle until the next frame starts. If the GPU constantly has lots of idle time before the next vsync point, the power management may decrease the GPU clock frequency to a lower level to save power.
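A quick sketch of the headroom involved (the frame workload times are hypothetical):

```python
VSYNC_INTERVAL_S = 1 / 60  # ~16.7 ms between vsync points

def gpu_idle_fraction(frame_workload_s):
    """Fraction of each vsync interval the GPU sits idle (0 if vsync-limited)."""
    return max(0.0, 1.0 - frame_workload_s / VSYNC_INTERVAL_S)

# A frame rendered in 8 ms leaves the GPU idle for over half the interval,
# giving the power management headroom to drop to a lower DVFS operating point.
assert gpu_idle_fraction(0.008) > 0.5
```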


Screenshot from Streamline of a GPU and CPU idling each frame when the DVFS frequency selected is too high


As the maximum processing capacity of modern GPUs keeps growing, it is often no longer necessary for an application developer to optimize the application for better performance; instead it can be optimized for better energy efficiency, and that is the topic of this blog.


How to Make an Application Energy Efficient

In order to be energy efficient the application should:


  • Render frames with the least number of GPU cycles
  • Consume the least amount of external memory bandwidth
  • Generate the least amount of CPU load either directly in the application code or indirectly by using the OpenGL® ES API in a way that causes unnecessary CPU load in the driver


But hey, aren't these the same things that you used to focus on when optimizing your application for better performance? Yes, pretty much! To explain this further:


  • Every GPU cycle that you save when rendering a frame means more idle time in the GPU before the next vsync point. In the best case the idle time becomes long enough to allow the power management to use a lower GPU frequency or enable a smaller number of cores
  • Reducing bandwidth load doesn't always improve performance as GPUs are designed to tolerate high memory latencies without affecting performance. However, reducing bandwidth can improve energy efficiency significantly
  • The same as for bandwidth, extra CPU load may not impact performance but it definitely can increase the power consumption


So the task of improving energy efficiency becomes very similar to the task of optimizing the performance of an application. For that task you can find lots of useful tips in the Mali GPU Application Optimization Guide.

How Do You Measure Energy Efficiency?


There is one topic that may require some more attention: how can you measure the energy efficiency of your application? Measuring the actual SoC power consumption might not be practical. It might also be problematic to measure the system FPS of your application if vsync is enabled on your device and you cannot turn it off.


ARM provides a tool called DS-5 Streamline for system-wide performance analysis. Using DS-5 Streamline to detect performance bottlenecks is explained in Peter Harris's blog Mali Performance 1: Checking the Pipeline and Lorenzo Dal Col's blogs starting with Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel, and also in the Mali GPU Application Optimization Guide. In short, DS-5 Streamline allows you to measure the main components of energy efficiency with the following charts / HW counters:


GPU cycles:

  • Mali Job Manager Cycles: GPU cycles
    • This counter increments any clock cycle the GPU is doing something
  • Mali Job Manager Cycles: JS0 cycles
    • This counter increments any clock cycle the GPU is fragment shading
  • Mali Job Manager Cycles: JS1 cycles
    • This counter increments any clock cycle the GPU is vertex shading or tiling


External memory bandwidth:

  • Mali L2 Cache: External read beats
    • Number of external bus read beats
  • Mali L2 Cache: External write beats
    • Number of external bus write beats
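The beat counters can be converted into a bandwidth figure; this is a sketch, and the bus width and beat counts below are invented for illustration (the bytes per beat are SoC-specific, so check your platform's documentation rather than assuming a value):

```python
def external_bandwidth_mb_s(read_beats, write_beats, bus_width_bytes, seconds):
    """Approximate external bandwidth from the L2 external beat counters.

    bus_width_bytes is the data moved per beat; for example a 128-bit bus
    moves 16 bytes per beat, but this varies between SoCs.
    """
    return (read_beats + write_beats) * bus_width_bytes / seconds / 1e6

# e.g. 50M read beats + 25M write beats at 16 bytes/beat over one second:
print(external_bandwidth_mb_s(50_000_000, 25_000_000, 16, 1.0))  # 1200.0 MB/s
```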


CPU load:

  • CPU Activity
    • The percentage of the CPU time spent in system or user code


Another very useful tool for measuring GPU cycles is the Mali Offline Shader Compiler which allows you to see how many GPU cycles are spent in the arithmetic, load/store and texture pipes in the shader core. Each saved cycle in the shader code means thousands/millions of saved cycles in each frame, as the shader is executed for each vertex/fragment.
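The per-frame arithmetic behind that claim can be sketched as follows (the resolution and overdraw figures are illustrative assumptions, not compiler output):

```python
def fragment_cycles_per_frame(cycles_per_fragment, width, height, overdraw=1.0):
    """Rough per-frame cost of a fragment shader: one invocation per covered
    fragment, scaled by the average overdraw factor."""
    return cycles_per_fragment * width * height * overdraw

# Saving a single cycle per fragment at 1080p with no overdraw avoids
# ~2 million cycles per frame, ~124 million per second at 60 fps.
saved = fragment_cycles_per_frame(1, 1920, 1080)
print(saved, saved * 60)
```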


If you want to measure the performance of an application on a vsync-limited device, you can do it by rendering in offscreen mode using FBOs. This is the trick used by some benchmark applications to get rid of vsync and resolution limitations in the performance measurement. The vsync limitation applies only to the onscreen frame buffer, not to offscreen framebuffers implemented with FBOs. It is possible to measure performance by rendering to an FBO that has the same resolution and configuration (color and depth buffer bit depths) as the onscreen frame buffer. After setting up the FBO and binding it with glBindFramebuffer(), your rendering functions don't see any difference between the onscreen frame buffer and an FBO as the render target. However, in order to make the performance measurement work correctly you need to do a few things:


  • You need to consume your FBO rendering results in the onscreen frame buffer. This is necessary because if you render something to an FBO and don't use your rendering results for anything visible, there is no guarantee that the GPU actually renders anything. After rendering to an FBO you can down-sample your output texture into a small area in the onscreen frame buffer. This guarantees that the GPU must render the frame image into an FBO as expected.
  • The offscreen rendering should be implemented with two different FBOs in order to simulate double buffering functionality. After rendering a frame to an FBO, you should down-sample the output texture to the onscreen buffer, and then swap to another FBO that is used for rendering the next frame.
  • You should use glDiscardFramebufferExt (OpenGL ES 2.0) or glInvalidateFramebuffer (OpenGL ES 3.0) to discard depth/stencil buffers right after the rendering of a frame to an FBO is complete. This is necessary to avoid writing the depth/stencil buffer out to main memory on a Mali GPU (the same thing happens for the onscreen frame buffer when you call eglSwapBuffers()). You can find some details of this topic in Mali Performance 2: How to Correctly Handle Framebuffers.
  • After rendering a suitable number of offscreen frames (for example 100) and down-sampling them to a small area in the onscreen frame, you can call eglSwapBuffers() as normal to present the frame in the onscreen buffer. You can measure the offscreen FPS by dividing the total number of rendered offscreen frames by the total rendering time measured when eglSwapBuffers() returns.
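The final FPS calculation from the steps above can be sketched in host-side pseudocode; `render_frame` here is a hypothetical stand-in for "render to the current FBO, down-sample to the onscreen buffer, swap to the other FBO":

```python
import time

def measure_offscreen_fps(render_frame, frames=100):
    """Offscreen FPS = frames rendered / elapsed time, measured from the
    start of rendering until eglSwapBuffers() would return."""
    start = time.perf_counter()
    for _ in range(frames):
        render_frame()
    elapsed = time.perf_counter() - start
    return frames / elapsed

fps = measure_offscreen_fps(lambda: sum(range(10_000)))
print(f"{fps:.0f} offscreen frames per second")
```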


There is a small overhead in the performance measurement when using this method because of down-sampling the offscreen frames to the onscreen frame, but nevertheless it should give you quite representative FPS results without the vsync limitation.


Is it really worth it?


You might ask how significant an energy saving you can really get by optimizing your application. We will focus on that in the next part of this blog, Energy Efficiency in GPU Applications, Part 2, where I will present a small micro-benchmark that will show how much you can reduce real SoC power consumption by optimizing your application.

Over the past couple of weeks, ARM and Collabora have been working together closely to showcase all the benefits that can be extracted from Wayland for media content playback use cases and beyond.


This week in particular, ARM and Collabora are showing at SIGGRAPH 2014 a face-off between the near 30-year old X11 and the up and coming Wayland.


Leveraging ARM Mali as deployed in the Samsung Chromebook 2, Collabora has, with the help of ARM, developed an environment that makes it possible to clearly see the advantages of Wayland, particularly with the latest drivers made available by ARM for Mali.


The best way to find out more about this is to watch the video we've produced at SIGGRAPH:


Details can be found on our blog and are also available here:

Wayland on MALI

Over the past several years at Collabora, we have worked on Linux's graphics stack from top to bottom, from kernel-level hardware enablement through to the end applications. A particular focus has always been performance: not only increasing average throughput and performance metrics, but ensuring consistent results every time. One of the core underpinnings of the Linux graphics stack from its very inception has been the X Window System, which recently celebrated its 29th anniversary. Collabora have been one of the most prolific contributors to X.Org for the past several years, supporting its core development, but over the past few years we have also been working on its replacement - Wayland. Replacing something such as X is not to be taken lightly; we view Wayland as the culmination of the last decade of the work by the entire open-source graphics community. Wayland reached 1.0 maturity in 2012, and since then has shipped in millions of smart TVs, set-top boxes, IVI systems, and more.

This week at SIGGRAPH together with ARM, we have been showcasing some of our recent development on Wayland, as well as on the entire graphics stack, to provide best-in-class media playback with GStreamer.

'Every frame is perfect'

Wayland's core value proposition for end users is simple: every frame must be perfect. By that we mean that the user will never see any unintended or partially-rendered content, or any graphical glitches such as tearing. In X11, the server performs rendering on behalf of its clients; this not only requires expensive, parallelisation-destroying synchronisation with the GPU, but rendering is often an unwanted side effect of unrelated requests. In contrast, Wayland's buffer-oriented model places the client firmly in control of what the user will see.

The user will only ever be shown exactly the content that the client requests, in the exact way that it requests it: painstaking care has been taken to ensure that not only do these intermediate states not exist, but that any unnecessary synchronisation has been removed. The combination of perfect frames and lower latency results in a natural, fluid-feeling user experience.

Power and resource efficient

Much of the impetus for Wayland's development came from ARM-based devices, such as smart TVs and set-top boxes, digital signage, and mobile, where not only is power efficiency key, but increased demands such as 4K media mean that, in order to ship a functioning product in the first place, the hardware must be pushed right to the margins of its capabilities. To achieve these demanding targets, the window system must make full use of all IP blocks provided by the platform, particularly hardware media decoders and any video overlays provided by the display controller. Not only must it use these blocks, it must also eliminate any copies of the content made along the way.

X11 has two core problems which preclude it from making full use of these features. Firstly, as X11 provides a rendering-command rather than a buffer-driven interface to clients, it is extremely difficult to integrate with hardware media decoders without making a copy of the full decoded media frame, consuming valuable memory bandwidth and time. Secondly, the X11 server is fundamentally unaware of the scene graph produced by the separate compositor, which precludes the use of hardware overlays: the only interface it provides for this is OpenGL ES rendering, requiring another copy of the content. This increased memory bandwidth and power usage makes it extremely difficult to ship compelling products in a media-led environment.

By contrast, Wayland's buffer-driven model is a natural fit for the hardware media engines of today and tomorrow, and the integration of the display server and compositor makes it easy to use the full functionality of the display controller to provide low-power media display, whilst reserving as much memory bandwidth as possible so that other applications can run without contending with media playback for crucial system resources, and so that systems can be pushed to their limits, such as 4K content on relatively low-spec hardware.

A first-class media experience

To complement our hundreds of man-years of work on the industry-standard GStreamer media framework, which has proven to scale from playback on mobile devices to serving huge live broadcast streams, Collabora has worked to ensure that Wayland provides a first-class experience when used together with GStreamer. Our recent development work on both Wayland itself and GStreamer's Wayland support ensures that GStreamer can realise its full potential when used together with Wayland.

All media playback naturally occurs in a 'zero-copy' fashion, from hardware decoding engines into either the 3D GPU or display controller, thanks to DMA-BUF buffer passing, new in version 3.16 of the Linux kernel. The Wayland subsurface mechanism allows videos to be streamed separately to UI content, rather than combined by the client as they are today in X11. This separation allows the display server to make a frame-by-frame decision as to how to present it: using power-efficient hardware overlays, or using the more flexible and capable 3D GPU. This step allows maximum UI flexibility whilst also making the most of hardware IP blocks. The scaling mechanism also allows the compositor to scale the video at the last minute, potentially using high-quality scaling and filtering engines within the display controller, as well as reducing precious memory bandwidth usage when upscaling videos.

Deep buffer queues are also possible for the first time, with both GStreamer and Wayland supporting ahead-of-time buffer queueing, where every buffer has a target time attached. Under this model, it is possible for the client to queue up a large number of frames in advance, offload them all to the compositor, and then go to sleep whilst they are autonomously displayed, saving CPU usage and power. Wayland also provides GStreamer with feedback on when exactly their buffers were shown on screen, allowing it to automatically adjust its internal pipeline and clock for the tightest possible A/V sync.

Easier deployment and support

In contrast to the X11 model of providing a driver specific to the combination of X server version, display controller and 3D GPU, Wayland offers vendors the ability to deploy drivers written according to external, well-tested, vendor-independent APIs. These drivers are required to perform only limited, well-scoped tasks, making validation, performance testing, and support much easier than under X11. This model makes it possible for vendors to deploy a single well-tested solution for Wayland, and for end users to deploy them in the knowledge that they will have reliable performance and functionality.

We are demonstrating all this at SIGGRAPH, on the ARM booth at stand #933 in the Mobility Pavilion on the Exhibition Hall. We are showing a side-by-side comparison of Wayland and X11 on Samsung Chromebook 2 machines (Samsung Exynos 5800 Octa hardware, with an ARM Mali-T628 GPU), demonstrating Collabora's expertise from the very bottom of the stack to the very top. Collabora's in-house Singularity OS runs a Linux 3.16-rc5 kernel, containing changes bound for upstream to improve and stabilise hardware support, and an early preview of atomic modesetting support inside the Exynos kernel modesetting driver for the display controller. The Wayland machine runs Weston with the new DMA-BUF and buffer-queueing extensions on top of atomic modesetting, demonstrating that videos played through GStreamer can be seamlessly switched between display controller hardware overlays and the Mali 3D GPU, using the DMA-BUF import EGL extension. The X11 machine runs the ChromeOS X11 driver, with a client which plays video through OpenGL ES at all times. The power usage, frame 'lateness' (difference between target display time and actual time), and CPU usage are shown, with Wayland providing a dramatic improvement in all these metrics.

Chinese Version中文版:SIGGRAPH、OpenGL ES 3.1 和下一代 OpenGL

It’s that time of year again – SIGGRAPH is here! For computer graphics artists, teachers, freaks and geeks of all descriptions, it’s like having Midsummer, Christmas, and your birthday all in the same week. By the time you read this, I’ll be in beautiful Vancouver BC, happily soaking up the latest in graphics research, technology, animation, and associated general weirdness along with the other 15,000-plus attendees. I can’t wait!


This year, SIGGRAPH has a special personal connection for me: my office-mate Dave Shreiner is this year’s general chair (amazingly, he’s still got all his hair – quite a lot of it actually), and my other office-mate Jesse Barker is chair of SIGGRAPH Mobile. (Jesse’s got no hair at all, but with him it’s a style choice.) My own job at SIGGRAPH is a lot less grand, but it’s something I love doing: In my capacity as OpenGL® ES working group chair, I’ll be co-hosting the Khronos OpenGL / OpenGL ES Birds of a Feather (BOF) session. That’s where the working groups report back to the user community about what’s going on in the ecosystem, what the committee has been doing, and what the future might hold. This year’s OpenGL ES update will mostly focus on the growing market presence of OpenGL ES 3.0, and on OpenGL ES 3.1, which we released earlier this year and which is starting to enter the market in a big way. It’s great stuff – but it’s not the big news.


There’s a change coming


By the standards of, well, standards, the OpenGL APIs have been an amazing success. OpenGL has stood unchallenged for twenty years as the cross-platform 3D API. Its mobile cousin, OpenGL ES, has grown phenomenally over the past ten years; with the mobile industry now shipping a billion and a half OpenGL ES devices per year, it has become the main driver of OpenGL adoption. One-point-five billion is a mind-boggling number, and we’re suitably humbled by the responsibility it implies.  But the APIs are not without problems: the programming model they present is frankly archaic, they have trouble taking advantage of multicore CPUs, they are needlessly complex, and there is far too much variability between implementations. Even highly skilled programmers find it frustrating trying to get predictable performance out of them. To some extent, OpenGL is a victim of its own success – I doubt that there are many APIs that have been evolving for twenty years without accumulating some pretty ugly baggage. But that doesn't change the central fact: OpenGL needs to change.

The Khronos working groups have known this for a long time; top developers (hi Rich!) have been telling us every chance they get.  But now, with OpenGL ES 3.1 finished but still early in its adoption cycle, we finally feel like we have an opportunity to do something about it. So at this year’s SIGGRAPH, Khronos is announcing the Next Generation OpenGL initiative, a project to redesign OpenGL along modern lines. The new API will be leaner and meaner, multicore and multithread-friendly. It will give applications much greater control over CPU and GPU workloads, making it easier to write performance-portable code. The work has already started, and we’re making rapid progress, thanks to strong commitment and active participation from the whole industry, including several of the world's top game engine companies.


Needless to say, ARM is fully behind this new direction, and we’re investing significant engineering resources in making sure it meets its goals and runs well on our Mali GPUs. We are of course also continuing to invest in the ecosystem for ‘traditional’ OpenGL ES, which will remain the dominant  mobile graphics API for quite some time to come.


That’s all I’ve got for now. If you’re going to be at SIGGRAPH, I hope you’ll come by the OpenGL / OpenGL ES BOF and after-party, 5-7pm on Wednesday at the Marriott Pinnacle, and say hi.  If not, drop me a line below…


Tom Olson is Director of Graphics Research at ARM. After a couple of years as a musician (which he doesn't talk about), and a couple more designing digital logic for satellites, he earned a PhD and became a computer vision researcher. Around 2001 he saw the coming tidal wave of demand for graphics on mobile devices, and switched his research area to graphics.  He spends his working days thinking about what ARM GPUs will be used for in 2016 and beyond. In his spare time, he chairs the Khronos OpenGL ES Working Group.

Olga Kounevitch

Rockchip Rock the Boat

Posted by Olga Kounevitch Aug 12, 2014

Chinese Version 中文版:瑞芯微电子打破现状


Today at SIGGRAPH, ARM will be showcasing the graphics capabilities of its highest-end product, the ARM Mali-T760 GPU, available to the public for the first time in the shape of the Rockchip RK3288 processor in the PiPO Pad P1 and the Teclast P90HD. The announcement of the Mali-T760 GPU’s release in October last year seems like a lifetime ago from where we’re sitting – ARM has managed to squeeze in so many activities since then – but when you compare it to the traditional lifespan of delivering a brand new chip to the market, the speed at which Rockchip has been able to deliver the RK3288 has been incredible.


Historically, it has often taken fabless semiconductor companies 2-3 years to move from initial design idea to sample silicon to having a prototype end-product to having the final production OEM device ready to ship.


The problem is, this no longer holds true in all cases. For some partners, extracting the highest possible performance, best energy efficiency and lowest die area from the IP which ARM delivers is their key differentiation point at the launch of a new SoC. For others, it's time to market, and their competitive advantage comes from being the first to put a new feature, functionality or, in this case, GPU in the hands of consumers.

Rockchip, by working closely with ARM, their suppliers and their customers, have been able to reduce this time from idea to ready-to-ship consumer product down to 9 months.


How Did It Happen?


ARM has collaborated closely with Rockchip over a period of many years, helping them deliver best-in-class SoCs to the marketplace. Rockchip are extremely experienced in designing Mali GPU IP into their silicon – they have been licensees of Mali technology since the days of the Mali-55 GPU. Their engineers know ARM designs well and were able to apply this experience to the new design, along with some of the tools and software used previously when developing an ARM-based chip. Combine this with the benefits of being lead partner along with Samsung, LG and Mediatek in the launch of the new GPU and you have yourself a winner.


There are many advantages to being a lead partner for ARM. Rockchip were able to participate in the development of the product, ensuring their suggestions were considered, but most importantly they gained early access to the IP. This early access enabled Rockchip engineers to start work on their silicon design extremely early on in the lifecycle of the Mali-T760.  ARM also provided regular updates to the project as they were made and delivered detailed support, ensuring that by the time the Mali-T760 was announced, ARM and Rockchip had already done a lot of the legwork needed to bring the first iteration of the RK3288 to market. As Trina Watt, VP Solutions Marketing at ARM put it: “Such a phenomenal achievement in terms of getting end-user devices to the market in only seven months was made possible due to the close collaboration and commitment from both parties.”


Rockchip will continue to develop and refine their software offering over future iterations,  enhancing the processor’s energy efficiency and performance in order to get the most from the IP which they have licensed.

What Does This Mean for the Future of the Mobile Industry?


Firstly, for consumers it means that the latest mobile technology will reach your hands sooner than ever before – the days of hearing about sixteen-core GPUs with 400% increases in energy efficiency and performance, then waiting three years before the GPU is in an appreciable form in your pocket, are over.


For the mobile industry, it means there is change in the air. With companies like Rockchip now setting the bar for fast tape outs and racing to be the first to market, the question will be to what extent other silicon partners can continue to spend two to three years on chip development.


ARM offers a range of Physical IP products to help reduce time to market for silicon partners. For example, ARM POP IP is a combination of physical IP with acceleration technology which guides licensees to produce the best ARM processor implementations (whether that is highest-performing or most efficient) in the fastest time. It embodies the knowledge of both our processor teams and our physical IP engineering teams to resolve common implementation challenges for the silicon partner. ARM POP IP is currently available for Mali-T628 and Mali-T760 GPUs.




In addition, to provide choice in the market, ARM works closely with leading EDA partners on the integration and optimization of ARM IP deliverables with advanced design flows. ARM collaborates directly with each partner in the development and validation of various design flows and methodologies, enabling a successful path from RTL to foundry-ready GDSII. For example, ARM processor-based Implementation Reference Methodologies (iRMs) enable ARM licensees to customize, implement, verify and characterize soft ARM processors. For Mali GPUs, the Synopsys design flow enables a predictable route to silicon and a basis for custom methodology development.

The Potential of the ARM Ecosystem


“Consumers are increasingly becoming more sophisticated and desire to get hold of the latest technology in their hands as soon as possible” - said Chen Feng, CMO of Rockchip. “In order to do so we need to find new ways of working with our partners across the entire length of the supply chain. Having worked closely and found success with ARM and the ARM Ecosystem over so many years already, we knew that, though the targets were demanding, between us we had the strengths and capabilities to make it happen. The Mali-T760 is an extremely promising GPU and we are proud to be the first to bring it to the hands of consumers.”


If you want to see ARM’s latest GPU in action, come to the ARM booth at SIGGRAPH and discover how the ARM Ecosystem is continuing to expand the mobile experience, with new GPUs, advanced processor technology and innovative additions to the graphics industry.



Today at SIGGRAPH a new demo is being brought to the public as the result of 18 months of collaboration between teams at ARM, Samsung Research UK and Szeged University in Hungary.  It demonstrates massively accelerated mobile web rendering on an ARM® Mali-T628 GPU based Chromebook with 1.5 to 4.5 times higher performance compared to other solutions on the market (depending on the type of content run). The solution enables a smoother experience and is not just applicable to web browsing, but can also hugely improve the user experience on browser based UIs such as those in modern GPU-enabled DTVs.


The solution, named TyGL, is a new backend for WebKit which addresses the challenge that HTML5 developers currently face when balancing graphics-rich web content against the constrained rendering capabilities of mobile CPUs. Rasterization has typically been done mainly on the CPU. While this suffices for PCs, rendering on a mobile CPU is far less efficient due to the constraints imposed by battery power (such as lower clock frequencies); leaving a parallel task like this to the GPU can result in a much smoother experience. However, using 3D graphics hardware such as Mali GPUs to render 2D content such as web pages is an extremely challenging task. Raster engines are designed to draw various graphics primitives one by one with frequently changing attributes, rather than drawing several primitives of the same type in a single draw call, which is the sort of task GPUs are optimized for. So while CPU rendering is slow, GPU rendering is complex to achieve efficiently, because WebKit issues many draw calls, each with less data than is optimal for the GPU.


The other challenge which HTML5 developers face with the WebKit ports that are currently available is the level of abstraction between layout and painting to the screen. By abstracting too far from the underlying accelerated API, the developer can lose the ability to code to the API's strengths, leading to a suboptimal implementation.


The Solution


TyGL seeks to cut down the level of abstraction in current ports and offer a web rendering solution that is fully optimized for the GPU whose only dependency is the OpenGL® ES 2.0 API, supported by the majority of application processors. It is a backend for WebKit, the open source application framework that can be used to build web-browser like functionality into an application. Major WebKit based products include embedded browsers from companies such as Espial, ACCESS and Company 100.


Both ARM and Szeged University conducted in-depth profiling of common webpages using the QtTestBrowser. The results showed that the majority of active CPU time was spent in libQtGui – the Rendering/Painting API used to render the content on the screen. GPUs are far more efficient at rendering to screens than CPUs and it was proposed that if the drawing commands of the pipeline were able to be done using the OpenGL ES 2.0 API, the performance could be improved considerably.


TyGL Pipeline


The diagram above outlines the pipeline of TyGL.  It applies three different processes to drawing text, images and paths, but the differences are in the preparation phases only. Even there, some similarities can be noticed. First, in the case of text and paths, some or all of the affine transformation is applied to the input in order to ensure higher quality output, which is then rendered to an alpha image with any remaining transformation being applied. Finally, the pipeline paths join at the point where GPU acceleration becomes efficient: colouring, clipping, blending and compositing. This common part of the pipeline is fully processed by the GPU. Each stage of the common pipeline is associated with an OpenGL ES fragment shader, which performs the necessary computations on each output pixel in a highly parallel fashion. Software-based graphics libraries such as Cairo usually have similar pipeline stages, executed sequentially and communicating through temporary buffers, but TyGL can do this more efficiently.


Example webpage rendered by TyGL




Preparations are underway to open source the TyGL backend for WebKit imminently. Early results show that the port is successful at improving the performance and efficiency of 2D rasterization in the browser while remaining lightweight enough to reduce the overhead and abstraction of currently available solutions. By releasing this code into the open source arena, it is hoped that all browser vendors that make use of WebKit will be able to benefit from ARM and its partners' leadership in the domain of 2D rasterization on embedded GPUs.


If you want to find out more about TyGL, come and visit the brains behind it at the ARM Booth at SIGGRAPH this week.

Ellie Stone

ARM Makes the World Mobile

Posted by Ellie Stone Aug 12, 2014



The SIGGRAPH exhibition floor is currently buzzing with activity as staff from all companies ready themselves for the grand opening tomorrow morning. All the ARM staff in attendance are smoothing out the final creases on the booth and making sure that everything will be perfect when attendees hit the showfloor tomorrow morning at 9:30am. Unfortunately, I can't quite leak a photo of the booth at this point in time, but I can show you a picture of an awesome statue outside the Vancouver Convention Center as a loosely related teaser:


"Digital Orca" statue outside the Vancouver Convention Center



So what does the ARM Booth have to offer this year? Besides the opportunity to win a Samsung Galaxy Note 10.1 each day, here are a couple more reasons to visit Booth #933:


Firstly, we have a number of fantastic demos from partners which are the results of many months' collaboration with ARM. More information will be released very soon concerning the latest GPU technology in the Rockchip-based devices on display and also concerning the hardware accelerated web rendering solution on show at the Samsung Research pod - check back in on the ARM Mali Graphics blog tomorrow (Updated: Rockchip Rock the Boat) if you're curious (or of course, if you're at SIGGRAPH, pop by the booth and take a look yourself!). Besides the demo from the Research department at Samsung, Samsung LSI will also be on the ARM booth demonstrating the great capabilities of the latest devices powered by the ARM® Mali-T628 GPU-based Exynos 5 Octa processor, including the Odroid-XU3 development board.


Also joining us are Collabora, who are showing how next-generation open source graphics technologies will provide power efficiency and great multimedia performance simultaneously. Their demonstration exemplifies the latest developments in the GStreamer media framework, the Wayland window system, and the Linux kernel, benchmarking power/CPU/GPU utilization and frame-time accuracy between the new Wayland and legacy X11 window systems. The Collabora demo makes use of the full breadth of ARM Mali GPUs and many features of the Samsung Exynos 5 Octa platform, including its powerful media decoding engine and display controller. In addition, Simplygon are showcasing their automatic 3D game content optimization solution, PlayCanvas  will show their cloud-hosted and real-time collaborative HTML5 & WebGL game engine which gives developers all they need to create stunning 3D games in your browser or on mobile devices, including some amazing tools, and Geomerics will be showcasing the Transporter demo, the latest and greatest demonstration of Enlighten technology which was recently integrated into Unity to provide dynamic global illumination.




On the ARM side, we have a number of demos coming to you for the very first time. Firstly, following the announcement of the Juno board last month, attendees will be able to see 64-bit content running on a quad-core ARM Cortex®-A53 CPU and dual-core Cortex-A57 in ARM big.LITTLE™ configuration. With this solution available in the market, developers will be able to more easily deliver the next generation of content for Android OS-based devices.


Secondly, we have a brand new demo showcasing the benefits of the Pixel Local Storage extension to the OpenGL® ES 3.0 API which promotes a new method of achieving bandwidth efficiency. The most significant difference between mobile GPUs and their desktop equivalents is the limited availability of sustained memory bandwidth. With advances in bandwidth expected to be incremental for many years, mobile graphics must be tailored to work efficiently in a bandwidth-scarce environment. This is true at all levels of the hardware-software stack. This demo shows that deferred rendering could be made bandwidth efficient by exploiting the on-chip memory used to store tile framebuffer contents in many tile-based GPUs. ARM is giving an unmissable talk for those interested in this subject this Wednesday at 10:45am in Rooms 109-111.


Some more familiar demos will also be on show, highlighting the benefits of ASTC Full Profile, the OpenGL ES 3.1 feature Compute Shaders and the unique tools ARM offers. For more information on these demos, check out Daniele Di Donato's blog Inside the Demo: GPU Particle Systems with ASTC 3D textures, Sylwester Bala's blog Get started with compute shaders or Lorenzo Dal Col's writings on Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel.


If you're at the show, we look forward to seeing you soon! Otherwise, keep an eye on our social media channels throughout the week for regular updates on ARM's activities at SIGGRAPH.

At SIGGRAPH 2014 we presented the benefits of the OpenGL® ES 3.0 API and the newly introduced OpenGL ES 3.1. The Adaptive Scalable Texture Compression format (ASTC) is one of the biggest introductions to the OpenGL ES API. The demo I'm going to talk about is a case study of the usage of 3D textures in the mobile space and how ASTC can compress them to provide a huge memory reduction. 3D textures weren't available in the core OpenGL ES spec up to version 2.0, and the workaround was to use hardware-dependent extensions or 2D texture arrays. Now with OpenGL ES 3.x, 3D textures are embedded in the core specification and ready to use… if only they were not so big! Uncompressed 3D textures cost a huge amount of memory (for example, a 256x256x256 texture in RGBA8888 format uses circa 67MB), which cannot be afforded on a mobile device.


Why did we use ASTC?

The same texture can instead be compressed using different levels of compression with ASTC, giving a saving of ~80% when using the highest quality settings. For those unfamiliar with the ASTC texture compression format, it is a block-based compression algorithm where LxM (or LxMxN in the case of 3D textures) blocks of texels are compressed together into a single 128-bit block. The L, M, N values are one of the compression quality factors and represent the number of texels per block dimension. For 3D textures, the dimensions allowed vary from 3 to 6, as reported in the table below:


Block Dimension    Bit Rate (bits per texel)
3x3x3              4.74
4x3x3              3.56
4x4x3              2.67
4x4x4              2.00
5x4x4              1.60
5x5x4              1.28
5x5x5              1.02
6x5x5              0.85
6x6x5              0.71
6x6x6              0.59


Since the compressed block size is always 128 bits for all block dimensions, the bit rate is simply 128/#texels_in_a_block. One of the features of ASTC is that it can also compress HDR values (typically 16 bits per channel). Since we needed to store high-precision floating-point values in the textures in the demo, we converted the float values (32 bits per channel) to half-float format (16 bits per channel) and used ASTC to compress those textures. This way, the loss of precision is smaller than with the usual 32-bit to 8-bit conversion and compression. It is worth noting that using the HDR formats doesn't increase the size of the compressed texture, because each compressed block still uses 128 bits. Below you can see a 3D texture rendered simply using slicing planes. The compression formats used are (from left to right): uncompressed, ASTC 3x3x3, ASTC 4x4x4, ASTC 5x5x5.
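The sizes quoted above follow directly from the 128-bit block rule. As an illustration (my own helper, not part of the demo code), the compressed size is just the number of LxMxN blocks needed to cover the volume, times 16 bytes per block:

```c
#include <assert.h>
#include <stddef.h>

/* Number of blocks along one axis: ceil(extent / block_dim). */
static size_t blocks(size_t extent, size_t block_dim)
{
    return (extent + block_dim - 1) / block_dim;
}

/* ASTC-compressed size of a 3D texture in bytes: every block is 128 bits
 * (16 bytes) regardless of block dimension and of LDR vs HDR content. */
static size_t astc3d_size(size_t w, size_t h, size_t d,
                          size_t bw, size_t bh, size_t bd)
{
    return blocks(w, bw) * blocks(h, bh) * blocks(d, bd) * 16;
}
```

For the 256x256x256 example above, RGBA8888 needs 256*256*256*4 = 67,108,864 bytes, while ASTC 3x3x3 needs only 86*86*86 blocks, about 10.2MB, in line with the large savings quoted above for the highest quality setting.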



For those interested in the details of the algorithm, an open source ASTC evaluation encoder/decoder is available at http://malideveloper.arm.com/develop-for-mali/tools/astc-evaluation-codec/ and a video of an internal demo ported to ASTC is available at https://www.youtube.com/watch?v=jEv-UvNYRpk. The demo is also available for viewing on the ARM booth #933 at SIGGRAPH this week.


Demo Overview

The main objective of the demo was to use the new OpenGL ES 3.0 API to realize realistic particle systems where motion physics as well as collisions are managed entirely on the GPU. The demo shows two scenes, one which simulates confetti, the other smoke.




Transform Feedback for physics simulation

The first feature I want to talk about, which is used for the physics simulation, is Transform Feedback. The physics simulation steps typically output a set of buffers using the previous step results as inputs. These kind of algorithms, called explicit methods in numerical analysis, are well suited to being used with Transform Feedback because it allows the results of vertex shader execution to get back into a buffer that can subsequently be mapped for CPU read or used as the input buffer for other shaders.  In the demo, each particle is mapped to a vertex and the input parameters (position, velocity and lifetime) are stored in an input vertex buffer while the outputs are bound to the transform feedback buffer. Because the whole physics simulation runs on the GPU, we needed a way to give to each particle the knowledge of the objects in the scene (this is now less problematic using Compute Shaders. See below for details). 3D textures helped us in this case because they can represent volumetric information and can be easily sampled in the vertex shader as a classic texture. The 3D textures are generated from the 3D mesh of various objects using a free tool called Voxelizer (http://techhouse.brown.edu/~dmorris/voxelizer/) and the voxel data contain the normal of the surface for voxels on the mesh surface or the direction and the distance to the nearest point on the surface in the case of voxels inside the object. 3D textures can be used to represent various types of data such as a simple mask for occupied or free areas in a scene, density maps or 3D noise. When uploading the files generated from Voxelizer, we convert the floating point values to half-float and then compress the 3D texture using ASTC HDR. In the demo, we use different compression block dimensions to show the differences between uncompressed and compressed textures. Such differences included memory size, memory read bandwidth reduction and energy consumption per frame. 
The smallest block size (3x3x3) gives us a ~90% size reduction, and our biggest texture goes down from ~87MB to ~7MB. Below is a table of size, bandwidth and energy measurements for the various models we used on a Samsung Galaxy Note 10.1 (2014 Edition).


Texture resolution              128x128x128  180x255x255  255x181x243  78x75x127  43x97x127

Texture size (MB)
  ASTC 3x3x3                    1.27         6.12         6.72         0.45       0.34
  ASTC 4x4x4                    0.52         2.63         2.87         0.19       0.14
  ASTC 5x5x5                    0.28         1.32         1.48         0.10       0.07

Memory read bandwidth (MB/s)
  ASTC 3x3x3                    342.01       285.78       206.39       374.19     228.05
  ASTC 4x4x4                    327.63       179.43       175.21       368.13     224.26
  ASTC 5x5x5                    323.10       167.90       162.89       366.18     222.76

Energy consumption per frame, DDR2 (mJ)
  ASTC 3x3x3                    2.31         1.93         1.39         2.53       1.54
  ASTC 4x4x4                    2.
  ASTC 5x5x5                    2.

Energy consumption per frame, DDR3 (mJ)
  ASTC 3x3x3                    1.90         1.59         1.15         2.08       1.27
  ASTC 4x4x4                    1.82         1.00         0.97         2.04       1.24
  ASTC 5x5x5                    1.79         0.93         0.90         2.03       1.24
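The arithmetic behind these savings is easy to sketch. ASTC stores one 128-bit (16-byte) block whatever the block dimensions are, so the compressed size depends only on the block count. Below is a rough Python sketch, assuming four half-float channels (8 bytes) per uncompressed voxel; the exact figures in the table also depend on format and padding details, but this lands in the same ballpark:

```python
import math

def astc_size_bytes(dim, block):
    """Compressed size: every ASTC block is 128 bits (16 bytes),
    so the size is just (number of blocks) * 16."""
    blocks = 1
    for d, b in zip(dim, block):
        blocks *= math.ceil(d / b)
    return blocks * 16

def raw_size_bytes(dim, bytes_per_voxel=8):
    """Uncompressed size, assuming 4 half-float channels per voxel."""
    w, h, d = dim
    return w * h * d * bytes_per_voxel

dim = (180, 255, 255)                       # the largest texture in the demo
raw = raw_size_bytes(dim)                   # ~89 MiB
packed = astc_size_bytes(dim, (3, 3, 3))    # ~6.6 MiB
print(raw, packed, 1 - packed / raw)        # reduction is over 90%
```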


Instancing for efficiency

Another feature introduced in OpenGL ES 3.0 is Instancing. It lets us specify geometry once and reuse it multiple times in different locations with a single draw call. In the demo we use it for the confetti rendering: instead of defining a vertex buffer of 2500*4 vertices (we render 2500 particles as quads in the confetti scene), we define a vertex buffer of just 4 vertices and call:


glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, 2500);


where GL_TRIANGLE_STRIP specifies the type of primitive to render, 0 is the start index inside the enabled vertex buffers, 4 specifies the number of vertices needed to render one instance of the geometry (4 vertices per quad) and 2500 is the number of instances to render. Inside the vertex shader, the gl_InstanceID built-in variable is available and contains the index of the current instance. This variable can, for example, be used to index an array of matrices or to do per-instance calculations. A divisor can also be specified for each active vertex buffer; it controls how the vertex shader advances through that buffer from one instance to the next.
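The divisor rule can be stated precisely. The small Python sketch below models (it is not real GL code) which buffer element each vertex-shader invocation reads: a divisor of 0 advances the attribute per vertex, and a divisor of N advances it once every N instances:

```python
def attrib_index(vertex_id, instance_id, divisor):
    """Which element of a vertex buffer an invocation reads, per the
    OpenGL ES 3.0 instancing rules."""
    if divisor == 0:
        return vertex_id            # per-vertex attribute (e.g. quad corners)
    return instance_id // divisor   # per-instance attribute

# 4-vertex quad in instance 7: corner positions advance per vertex,
# a divisor-1 attribute (e.g. a particle position) stays fixed per instance.
quad = [attrib_index(v, 7, 0) for v in range(4)]      # [0, 1, 2, 3]
per_inst = [attrib_index(v, 7, 1) for v in range(4)]  # [7, 7, 7, 7]
print(quad, per_inst)
```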

The smoke scene

In the smoke scene, the smoke is rendered using a noise texture and some maths to compute the final colour as if it were a 3D volume. To give the smoke a transparent look we need to combine the colours of different overlapping particles. To do so we use additive blending and disable the z-test when rendering the particles. This gives a nice result even without sorting the particles by z-value (which would otherwise require mapping the buffer on the CPU). Another reason for disabling the z-test is to implement soft particles.

The Mali-T6xx series of GPUs can use a specific extension in the fragment shader to read back the values of the framebuffer (colour, depth and stencil) without having to render to texture. This feature makes it easier to implement soft particles, and in the demo we use a simple approach. First, we render all the solid objects so that their z-values are written into the depth buffer. Then, when rendering the smoke, the Mali extension lets us read the depth value of the object, compare it with the depth of the current particle fragment (to see if it is behind the object) and fade the colour accordingly. This technique eliminates the sharp profile that would otherwise be formed by the particle quad intersecting the geometry due to the z-test (another reason we had to disable it).
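The fade itself reduces to a small function. This is a hypothetical sketch of the comparison described above (the names and the linear fade range are mine, not from the demo), assuming linear depth values where larger means further away:

```python
def soft_particle_fade(scene_depth, particle_depth, fade_range):
    """Soft-particle fade factor: 0 where the particle fragment is at or
    behind the solid geometry, ramping linearly to 1 as it moves
    fade_range depth units in front of it."""
    t = (scene_depth - particle_depth) / fade_range
    return max(0.0, min(1.0, t))

print(soft_particle_fade(5.0, 6.0, 1.0))  # particle behind the object: fully faded
print(soft_particle_fade(5.0, 4.5, 1.0))  # just in front: partial fade
print(soft_particle_fade(5.0, 1.0, 1.0))  # far in front: fully visible
```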


Blurring the smoke

During development the smoke effect looked nice, but we wanted it to be denser and blurrier. To achieve this we decided to render the smoke into an off-screen render buffer with a lower resolution than the main screen. This both blurs the smoke (the lower resolution removes the higher frequencies) and lets us increase the number of particles for a denser look. The current implementation uses a 640x360 off-screen buffer that is up-scaled to 1080p resolution in the final image. A naïve approach causes jaggies on the outline of an object when the smoke flows near it, due to the blending of the up-sampled low resolution buffer. To almost eliminate this effect, we apply a bilateral filter to the off-screen buffer: the product of a Gaussian filter on the colour texture and a linear weighting factor given by the difference in depth. The depth factor is useful on the edge of a model because it gives a higher weight to neighbour texels whose depth is similar to that of the current pixel, and a lower weight when the difference is larger (for a pixel on the edge of a model, some of the neighbouring pixels will still be on the model while others will be far in the background).
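The per-texel weight described above can be sketched as follows. This is an illustrative Python model (parameter names are mine), combining a spatial Gaussian with the linear depth-similarity factor the text describes:

```python
import math

def bilateral_weight(offset, depth_center, depth_sample, sigma_space, depth_scale):
    """Weight of one neighbour texel in the bilateral filter:
    a spatial Gaussian multiplied by a linear depth-similarity factor
    that falls to zero across depth edges."""
    w_space = math.exp(-(offset * offset) / (2.0 * sigma_space * sigma_space))
    w_depth = max(0.0, 1.0 - abs(depth_center - depth_sample) / depth_scale)
    return w_space * w_depth

# Same depth: full weight. Across a depth edge: weight suppressed.
print(bilateral_weight(0, 1.0, 1.0, 1.0, 1.0))  # on-model neighbour
print(bilateral_weight(0, 0.0, 2.0, 1.0, 1.0))  # background neighbour
```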



Bonus track

The recently released OpenGL ES 3.1 spec introduced Compute Shaders as a method for general-purpose computing on the GPU (a sort of subset of OpenCL™, but in the same context as OpenGL, so no context switching is needed!).


An introduction to Compute Shaders is also available at:

Get started with compute shaders




I would like to point out some useful websites that helped me understand Instancing and Transform Feedback:

Transform Feedback:




ASTC Evaluation Codec:




The topic of this blog was presented recently to students in a workshop at Brains Eden Gaming Festival 2014 at Anglia Ruskin University in Cambridge [1]. We wanted to provide students with an effective and low cost technique to implement reflections when developing games for mobile devices.

Early Reflection Implementations

From the very beginning, graphics developers have tried to find cheap alternatives for implementing reflections. One of the first solutions was spherical mapping, which simulates reflections or lighting upon objects without going through expensive ray-tracing or lighting calculations. This approach has several disadvantages, but the main problem is the distortion introduced when mapping a picture onto a sphere. In 1999, it became possible to use cubemaps with hardware acceleration.
Figure 1: Spherical mapping.

Cubemaps solved the problems of image distortion, viewpoint dependency and computational inefficiency related to spherical mapping. Cube mapping uses the six faces of a cube as the map shape. The environment is projected onto each side of a cube and stored as six square textures, or unfolded into six regions of a single texture. The cubemap is generated by rendering the scene from a given position with six different camera orientations, each with a 90 degree view frustum representing one cube face. Source images are sampled directly; no distortion is introduced by resampling into an intermediate environment map.


Figure 2: Cubemaps.
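Which of the six faces a lookup samples is decided by the direction vector alone: the face on the axis with the largest absolute component. A small Python sketch of that selection rule (a model of what the hardware lookup does, not an actual API):

```python
def cube_face(v):
    """Select the cubemap face a direction vector samples:
    the face on the axis with the largest absolute component."""
    x, y, z = v
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        return "+X" if x > 0 else "-X"
    if ay >= az:
        return "+Y" if y > 0 else "-Y"
    return "+Z" if z > 0 else "-Z"

print(cube_face((1.0, 0.2, 0.3)))   # mostly along +X
print(cube_face((0.0, -1.0, 0.0)))  # straight down
```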

To implement reflections based on cubemaps we just need to evaluate the reflected vector R and use it to fetch the texel from the cubemap CubeMap using the available texture lookup function texCUBE:

float4 col = texCUBE(CubeMap, R);

Expression 1.



Figure 3: Reflections based on infinite cubemaps.

With this approach we can only reproduce reflections correctly from a distant environment where the cubemap position is not relevant. This simple and effective technique is mainly used in outdoor lighting, for example, to add reflections of the sky. If we try to use this technique in a local environment we get inaccurate reflections.


Figure 4: Reflection on the floor calculated wrongly with an infinite cubemap.


Local Reflections

The main reason why this reflection fails is that in Expression 1 there is no binding to the local geometry. For example, according to Expression 1, if we were walking on a reflective floor looking at it from the same angle, we would always see the same reflection on it. As the direction of the view vector does not change, the reflected vector is always the same and Expression 1 always gives the same result. This is not what happens in the real world, where reflections depend on both the viewing angle and the viewing position.
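This position independence is visible in the reflection formula itself. A minimal Python sketch of the GLSL-style reflect() function shows that it takes only a view direction and a normal, so every fragment of the floor seen from the same angle produces the same lookup vector:

```python
def reflect(incident, normal):
    """GLSL-style reflect: R = I - 2*dot(N, I)*N (normal assumed unit length)."""
    d = sum(i * n for i, n in zip(incident, normal))
    return tuple(i - 2.0 * d * n for i, n in zip(incident, normal))

# Two fragments of a flat floor, metres apart, viewed from the same angle:
# reflect() never sees their positions, so both get the identical R.
view = (0.0, -1.0, 0.0)  # looking straight down
up = (0.0, 1.0, 0.0)
print(reflect(view, up))
```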


The solution to this problem was first proposed by Kevin Bjorke [2] in 2004. For the first time, a binding to the local geometry was introduced into the procedure for calculating the reflection:


Figure 5: Reflections using local cubemaps.


While this approach gives good results on surfaces with a near-spherical shape, on planar reflective surfaces the reflection shows noticeable deformations. Another drawback of this method is the relative complexity of the algorithm used to calculate the intersection point with the bounding sphere, which requires solving a second-degree equation.


A few years later, in 2010, a better solution was proposed [3] in a thread of a developer forum at gamedev.net. The new approach replaced the bounding sphere with a box, solving the shortcomings of Bjorke's method: the deformations and the complexity of finding the intersection point.


Figure 6: Introducing a bounding box.


A more recent work [4] uses this approach to simulate more complex ambient specular lighting using several cubemaps, and proposes an algorithm to evaluate the contribution of each cubemap and blend them efficiently on the GPU.

At this point we must clearly distinguish between local and infinite cubemaps.

Figure 7 shows the same scene from Figure 4 but this time with correct reflections using local cubemaps.

Figure 7: Reflection on the floor correctly calculated with a local cubemap.


Shader Implementation

The shader implementation in Unity of reflections using local cubemaps is provided below. In the vertex shader, we calculate the three quantities we need to pass to the fragment shader as interpolated values: the vertex position, the view direction and the normal, all in world coordinates:

vertexOutput vert(vertexInput input)
{
    vertexOutput output;
    output.tex = input.texcoord;
    // Transform vertex coordinates from local to world.
    float4 vertexWorld = mul(_Object2World, input.vertex);
    // Transform normal to world coordinates.
    float4 normalWorld = mul(float4(input.normal, 0.0), _World2Object);
    // Final vertex output position.
    output.pos = mul(UNITY_MATRIX_MVP, input.vertex);
    // ----------- Local correction ------------
    output.vertexInWorld = vertexWorld.xyz;
    output.viewDirInWorld = vertexWorld.xyz - _WorldSpaceCameraPos;
    output.normalInWorld = normalWorld.xyz;
    return output;
}

In the fragment shader, the reflected vector is found along with its intersection point with the volume box. The new, locally corrected reflection vector is built and used to fetch the reflection texture from the local cubemap. Finally, the texture and reflection colours are combined to produce the output colour:

float4 frag(vertexOutput input) : COLOR
{
    float4 reflColor = float4(1, 1, 0, 0);
    // Find reflected vector in WS.
    float3 viewDirWS = normalize(input.viewDirInWorld);
    float3 normalWS = normalize(input.normalInWorld);
    float3 reflDirWS = reflect(viewDirWS, normalWS);
    // Working in World Coordinate System.
    float3 localPosWS = input.vertexInWorld;
    float3 intersectMaxPointPlanes = (_BBoxMax - localPosWS) / reflDirWS;
    float3 intersectMinPointPlanes = (_BBoxMin - localPosWS) / reflDirWS;
    // Looking only for intersections in the forward direction of the ray.
    float3 largestParams = max(intersectMaxPointPlanes, intersectMinPointPlanes);
    // Smallest value of the ray parameters gives us the intersection.
    float distToIntersect = min(min(largestParams.x, largestParams.y), largestParams.z);
    // Find the position of the intersection point.
    float3 intersectPositionWS = localPosWS + reflDirWS * distToIntersect;
    // Get local corrected reflection vector.
    float3 localCorrReflDirWS = intersectPositionWS - _EnviCubeMapPos;
    // Lookup the environment reflection texture with the right vector.
    reflColor = texCUBE(_Cube, localCorrReflDirWS);
    // Lookup the texture colour.
    float4 texColor = tex2D(_MainTex, input.tex.xy);
    return _AmbientColor + texColor * _ReflAmount * reflColor;
}

In the above fragment shader code, _BBoxMax and _BBoxMin are the maximum and minimum points of the bounding volume, and _EnviCubeMapPos is the position where the cubemap was created. These values are passed to the shader from the script below:
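For sanity-checking the shader math on the CPU, the local correction can be transcribed line by line into Python. This sketch assumes, as the shader does implicitly, that no component of the reflected direction is exactly zero:

```python
def local_corrected_dir(pos, refl_dir, bbox_min, bbox_max, cubemap_pos):
    """Local cubemap correction: intersect the reflected ray with the
    bounding box, then aim the lookup vector from the cubemap capture
    position at the intersection point."""
    largest = []
    for p, d, lo, hi in zip(pos, refl_dir, bbox_min, bbox_max):
        t_hi = (hi - p) / d
        t_lo = (lo - p) / d
        largest.append(max(t_hi, t_lo))  # forward intersection per axis pair
    t = min(largest)                     # first box face the ray exits through
    hit = tuple(p + d * t for p, d in zip(pos, refl_dir))
    return tuple(h - c for h, c in zip(hit, cubemap_pos))

# Unit box centred at the origin, ray from the centre:
print(local_corrected_dir((0.0, 0.0, 0.0), (1.0, 2.0, 1.0),
                          (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0),
                          (0.0, 0.0, 0.0)))
```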

public class InfoToReflMaterial : MonoBehaviour
{
    // The proxy volume used for local reflection calculations.
    public GameObject boundingBox;

    void Start()
    {
        Vector3 bboxLength = boundingBox.transform.localScale;
        Vector3 centerBBox = boundingBox.transform.position;
        // Min and max BBox points in world coordinates.
        Vector3 BMin = centerBBox - bboxLength / 2;
        Vector3 BMax = centerBBox + bboxLength / 2;
        // Pass the values to the material.
        gameObject.renderer.sharedMaterial.SetVector("_BBoxMin", BMin);
        gameObject.renderer.sharedMaterial.SetVector("_BBoxMax", BMax);
        gameObject.renderer.sharedMaterial.SetVector("_EnviCubeMapPos", centerBBox);
    }
}
The values for _AmbientColor and _ReflAmount as well as the main texture and cubemap texture are passed to the shader as uniforms from the properties block:


Properties
{
    _MainTex ("Base (RGB)", 2D) = "white" {}
    _Cube ("Reflection Map", Cube) = "" {}
    _AmbientColor ("Ambient Color", Color) = (1, 1, 1, 1)
    _ReflAmount ("Reflection Amount", Float) = 0.5
}

SubShader
{
    Pass
    {
        CGPROGRAM
        #pragma glsl
        #pragma vertex vert
        #pragma fragment frag
        #include "UnityCG.cginc"

        // User-specified uniforms.
        uniform sampler2D _MainTex;
        uniform samplerCUBE _Cube;
        uniform float4 _AmbientColor;
        uniform float _ReflAmount;
        uniform float _ToggleLocalCorrection;

        // ---- Passed from the script InfoToReflMaterial.cs ----
        uniform float3 _BBoxMin;
        uniform float3 _BBoxMax;
        uniform float3 _EnviCubeMapPos;

        struct vertexInput
        {
            float4 vertex : POSITION;
            float3 normal : NORMAL;
            float4 texcoord : TEXCOORD0;
        };

        struct vertexOutput
        {
            float4 pos : SV_POSITION;
            float4 tex : TEXCOORD0;
            float3 vertexInWorld : TEXCOORD1;
            float3 viewDirInWorld : TEXCOORD2;
            float3 normalInWorld : TEXCOORD3;
        };

        // Vertex shader: vert() {…}
        // Fragment shader: frag() {…}

        ENDCG
    }
}

The algorithm for calculating the intersection point with the bounding volume is based on the parametric representation of the reflected ray from the local position (the fragment). A more detailed explanation of the ray-box intersection algorithm can be found in [4] in the References.


Filtering Cubemaps

One of the advantages of implementing reflections using local cubemaps is the fact that the cubemap is static, i.e. it is generated during development rather than at run-time. This gives us the opportunity to apply any filtering to the cubemap images to achieve a given effect.

As an example, the image below shows reflections using a cubemap to which a Gaussian filter was applied to achieve a “frosty” effect. The CubeMapGen [5] tool from AMD was used to apply the filtering. To give an idea of how expensive this process can be, it took more than one minute to filter a 256-pixel cubemap on a PC.


Figure 8: Gaussian filter applied to reflections in Figure 3.

A specific tool was developed for Unity to generate cubemaps and save the cubemap images separately for later import into CubeMapGen. Detailed information about this tool, and about the whole process of exporting cubemap images from Unity to CubeMapGen, applying filtering and reimporting them back into Unity, can be found in the References section [4].




Reflections based on static local cubemaps are an effective tool for implementing high quality, realistic reflections and a cheap alternative to reflections generated at run-time. This is especially important on mobile devices, where performance and memory bandwidth consumption are critical to the success of many games.


Additionally, reflections based on static local cubemaps allow developers to apply filters to the cubemap to achieve complex effects that would otherwise be prohibitively expensive at run-time, even on high-end PCs.

The inherent limitation of static cubemaps when dealing with dynamic objects can be solved easily by combining static reflections with reflections generated at run-time. This topic will be examined in a future blog.



[1] Reflections based on local cubemaps. Presentation at Brains Eden, 2014 Gaming Festival at Anglia Ruskin University in Cambridge.  http://malideveloper.arm.com/downloads/ImplementingReflectionsinUnityUsingLocalCubemaps.pdf

[2] GPU Gems, Chapter 19: Image-Based Lighting. Kevin Bjorke, 2004. http://http.developer.nvidia.com/GPUGems/gpugems_ch19.html

[3] Cubemap Environment Mapping. 2010. http://www.gamedev.net/topic/568829-box-projected-cubemap-environment-mapping/?&p=4637262

[4] Image-based Lighting Approaches and Parallax-corrected Cubemap. Sebastien Lagarde. SIGGRAPH 2012.


[5] CubeMapGen. http://developer.amd.com/tools-and-sdks/archive/legacy-cpu-gpu-tools/cubemapgen/
