
ARM Mali Graphics


Over the past couple of weeks, ARM and Collabora have been working together closely to showcase all the benefits that can be extracted from Wayland for media content playback use cases and beyond.


This week in particular, ARM and Collabora are showing at SIGGRAPH 2014 a face-off between the nearly 30-year-old X11 and the up-and-coming Wayland.


Leveraging ARM Mali as deployed in the Samsung Chromebook 2, Collabora has, with the help of ARM, developed an environment that makes it possible to clearly see the advantages of Wayland, particularly with the latest drivers made available by ARM for Mali.


The best way to find out more about this is to watch the video we've produced at SIGGRAPH:


Details can be found on our blog and are also available here:

Wayland on MALI

Over the past several years at Collabora, we have worked on Linux's graphics stack from top to bottom, from kernel-level hardware enablement through to the end applications. A particular focus has always been performance: not only increasing average throughput and performance metrics, but ensuring consistent results every time. One of the core underpinnings of the Linux graphics stack from its very inception has been the X Window System, which recently celebrated its 29th anniversary. Collabora have been one of the most prolific contributors to X.Org for the past several years, supporting its core development, but over the past few years we have also been working on its replacement - Wayland. Replacing something such as X is not to be taken lightly; we view Wayland as the culmination of the last decade of the work by the entire open-source graphics community. Wayland reached 1.0 maturity in 2012, and since then has shipped in millions of smart TVs, set-top boxes, IVI systems, and more.

This week at SIGGRAPH together with ARM, we have been showcasing some of our recent development on Wayland, as well as on the entire graphics stack, to provide best-in-class media playback with GStreamer.

'Every frame is perfect'

Wayland's core value proposition for end users is simple: every frame must be perfect. By that we mean that the user will never see any unintended or partially-rendered content, or any graphical glitches such as tearing. In X11, the server performs rendering on behalf of its clients, which not only requires expensive, parallelisation-destroying synchronisation with the GPU, but is often an unwanted side effect of unrelated requests. In contrast, Wayland's buffer-oriented model places the client firmly in control of what the user will see.

The user will only ever be shown exactly the content that the client requests, in the exact way that it requests it: painstaking care has been taken to ensure that not only do these intermediate states not exist, but that any unnecessary synchronisation has been removed. The combination of perfect frames and lower latency results in a natural, fluid-feeling user experience.

Power and resource efficient

Much of the impetus for Wayland's development came from ARM-based devices, such as smart TVs and set-top boxes, digital signage, and mobile, where not only is power efficiency key, but increased demands such as 4K media mean that, in order to ship a functioning product in the first place, the hardware must be pushed right to the margins of its capabilities. In order to achieve these demanding targets, the window system must make full use of all IP blocks provided by the platform, particularly hardware media decoders and any video overlays provided by the display controller. Not only must it use these blocks, but it must eliminate any copies of the content made along the way.

X11 has two core problems which preclude it making full use of these features. Firstly, as X11 provides a rendering-command rather than a buffer-driven interface to clients, it is extremely difficult to integrate with hardware media decoders without making a copy of the full decoded media frame, consuming valuable memory bandwidth and time. Secondly, the X11 server is fundamentally unaware of the scene graph produced by the separate compositor, which precludes use of hardware overlays: the only interface it provides for doing this is OpenGL ES rendering, requiring another copy of the content. This increased memory bandwidth and power usage makes it extremely difficult to ship compelling products in a media-led environment.

By contrast, Wayland's buffer-driven model is a natural fit for the hardware media engines of today and tomorrow, and the integration of the display server and compositor makes it easy to use the full functionality of the display controller to provide low-power media display. This reserves as much memory bandwidth as possible, either for other applications to run without having to contend with media playback for crucial system resources, or for pushing systems to their limits, such as playing 4K content on relatively low-spec systems.

A first-class media experience

To complement our hundreds of man-years of work on the industry-standard GStreamer media framework, which has proven to scale from playback on mobile devices to serving huge live broadcast streams, Collabora has worked to ensure that Wayland provides a first-class experience when used together with GStreamer. Our recent development work on both Wayland itself and GStreamer's Wayland support ensures that GStreamer can realise its full potential when used together with Wayland.

All media playback naturally occurs in a 'zero-copy' fashion, from hardware decoding engines into either the 3D GPU or display controller, thanks to DMA-BUF buffer passing, new in version 3.16 of the Linux kernel. The Wayland subsurface mechanism allows videos to be streamed separately from UI content, rather than combined by the client as they are today in X11. This separation allows the display server to make a frame-by-frame decision as to how to present the video: using power-efficient hardware overlays, or using the more flexible and capable 3D GPU. This step allows maximum UI flexibility whilst also making the most of hardware IP blocks. The scaling mechanism also allows the compositor to scale the video at the last minute, potentially using high-quality scaling and filtering engines within the display controller, as well as reducing precious memory bandwidth usage when upscaling videos.

Deep buffer queues are also possible for the first time, with both GStreamer and Wayland supporting ahead-of-time buffer queueing, where every buffer has a target time attached. Under this model, it is possible for the client to queue up a large number of frames in advance, offload them all to the compositor, and then go to sleep whilst they are autonomously displayed, saving CPU usage and power. Wayland also provides GStreamer with feedback on exactly when its buffers were shown on screen, allowing it to automatically adjust its internal pipeline and clock for the tightest possible A/V sync.
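To make the ahead-of-time queueing idea concrete, here is a minimal sketch in Python. It is not the Weston or GStreamer implementation: the class and function names (QueuedFrame, FrameQueue, simulate) are hypothetical, and the policy of dropping older frames that are already due is an assumption made for illustration. The client queues every frame up front with a target time, and the toy compositor picks a frame at each vblank without any further client work.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedFrame:
    target_time_ms: float              # when the client wants this frame shown
    name: str = field(compare=False)   # hypothetical label for the buffer


class FrameQueue:
    """Toy compositor-side queue: frames are ordered by their target time."""

    def __init__(self):
        self._heap = []

    def queue(self, frame):
        heapq.heappush(self._heap, frame)

    def frame_for_vblank(self, vblank_time_ms):
        """Return the newest frame that is due, or None to keep the current one."""
        chosen = None
        while self._heap and self._heap[0].target_time_ms <= vblank_time_ms:
            # Assumption: if several frames are due, older ones are dropped.
            chosen = heapq.heappop(self._heap)
        return chosen


def simulate(frames, refresh_ms, vblank_count):
    """Queue all frames in advance, then walk the vblanks.

    Returns (display time, frame name, lateness) tuples, where lateness is
    the difference between actual display time and target time.
    """
    q = FrameQueue()
    for f in frames:
        q.queue(f)
    shown = []
    for i in range(vblank_count):
        t = i * refresh_ms
        frame = q.frame_for_vblank(t)
        if frame is not None:
            shown.append((t, frame.name, t - frame.target_time_ms))
    return shown


if __name__ == "__main__":
    # 25 fps video (40 ms per frame) on a display that refreshes every 20 ms.
    video = [QueuedFrame(i * 40.0, f"frame{i}") for i in range(4)]
    for when, name, lateness in simulate(video, refresh_ms=20.0, vblank_count=8):
        print(f"t={when:5.1f} ms  show {name}  lateness={lateness:.1f} ms")
```

The lateness value returned for each frame corresponds to the kind of feedback described above: the client learns how far the real display time drifted from the time it asked for.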

Easier deployment and support

In contrast to the X11 model of providing a driver specific to the combination of X server version, display controller and 3D GPU, Wayland offers vendors the ability to deploy drivers written according to external, well-tested, vendor-independent APIs. These drivers are required to perform only limited, well-scoped tasks, making validation, performance testing, and support much easier than under X11. This model makes it possible for vendors to deploy a single well-tested solution for Wayland, and for end users to deploy them in the knowledge that they will have reliable performance and functionality.

We are demonstrating all this at SIGGRAPH, on the ARM booth at stand #933 in the Exhibition Hall. We are showing a side-by-side comparison of Wayland and X11 on Samsung Chromebook 2 machines (Samsung Exynos 5800 Octa hardware, with an ARM Mali-T628 GPU), demonstrating Collabora's expertise from the very bottom of the stack to the very top. Collabora's in-house Singularity OS runs a Linux 3.16-rc5 kernel, containing changes bound for upstream to improve and stabilise hardware support, and an early preview of atomic modesetting support inside the Exynos kernel modesetting driver for the display controller. The Wayland machine runs Weston with the new DMA-BUF and buffer-queueing extensions on top of atomic modesetting, demonstrating that videos played through GStreamer can be seamlessly switched between display controller hardware overlays and the Mali 3D GPU, using the DMA-BUF import EGL extension. The X11 machine runs the ChromeOS X11 driver, with a client which plays video through OpenGL ES at all times. The power usage, frame 'lateness' (difference between target display time and actual time), and CPU usage are shown, with Wayland providing a dramatic improvement in all these metrics.

It’s that time of year again – SIGGRAPH is here! For computer graphics artists, teachers, freaks and geeks of all descriptions, it’s like having Midsummer, Christmas, and your birthday all in the same week. By the time you read this, I’ll be in beautiful Vancouver BC, happily soaking up the latest in graphics research, technology, animation, and associated general weirdness along with the other 15,000-plus attendees. I can’t wait!


This year, SIGGRAPH has a special personal connection for me: my office-mate Dave Shreiner is this year’s general chair (amazingly, he’s still got all his hair – quite a lot of it actually), and my other office-mate Jesse Barker is chair of SIGGRAPH Mobile. (Jesse’s got no hair at all, but with him it’s a style choice.) My own job at SIGGRAPH is a lot less grand, but it’s something I love doing: In my capacity as OpenGL® ES working group chair, I’ll be co-hosting the Khronos OpenGL / OpenGL ES Birds of a Feather (BOF) session. That’s where the working groups report back to the user community about what’s going on in the ecosystem, what the committee has been doing, and what the future might hold. This year’s OpenGL ES update will mostly focus on the growing market presence of OpenGL ES 3.0, and on OpenGL ES 3.1, which we released earlier this year and which is starting to enter the market in a big way. It’s great stuff – but it’s not the big news.


There’s a change coming


By the standards of, well, standards, the OpenGL APIs have been an amazing success. OpenGL has stood unchallenged for twenty years as a cross-platform 3D API. Its mobile cousin, OpenGL ES, has grown phenomenally over the past ten years; with the mobile industry now shipping a billion and a half OpenGL ES devices per year, it has become the main driver of OpenGL adoption. One-point-five billion is a mind-boggling number, and we’re suitably humbled by the responsibility it implies. But the APIs are not without problems: the programming model they present is frankly archaic, they have trouble taking advantage of multicore CPUs, they are needlessly complex, and there is far too much variability between implementations. Even highly skilled programmers find it frustrating trying to get predictable performance out of them. To some extent, OpenGL is a victim of its own success – I doubt that there are many APIs that have been evolving for twenty years without accumulating some pretty ugly baggage. But that doesn't change the central fact: OpenGL needs to change.

The Khronos working groups have known this for a long time; top developers (hi Rich!) have been telling us every chance they get.  But now, with OpenGL ES 3.1 finished but still early in its adoption cycle, we finally feel like we have an opportunity to do something about it. So at this year’s SIGGRAPH, Khronos is announcing the Next Generation OpenGL initiative, a project to redesign OpenGL along modern lines. The new API will be leaner and meaner, multicore and multithread-friendly. It will give applications much greater control over CPU and GPU workloads, making it easier to write performance-portable code. The work has already started, and we’re making rapid progress, thanks to strong commitment and active participation from the whole industry, including several of the world's top game engine companies.


Needless to say, ARM is fully behind this new direction, and we’re investing significant engineering resources in making sure it meets its goals and runs well on our Mali GPUs. We are of course also continuing to invest in the ecosystem for ‘traditional’ OpenGL ES, which will remain the dominant  mobile graphics API for quite some time to come.


That’s all I’ve got for now. If you’re going to be at SIGGRAPH, I hope you’ll come by the OpenGL / OpenGL ES BOF and after-party, 5-7pm on Wednesday at the Marriott Pinnacle, and say hi.  If not, drop me a line below…


Tom Olson is Director of Graphics Research at ARM. After a couple of years as a musician (which he doesn't talk about), and a couple more designing digital logic for satellites, he earned a PhD and became a computer vision researcher. Around 2001 he saw the coming tidal wave of demand for graphics on mobile devices, and switched his research area to graphics.  He spends his working days thinking about what ARM GPUs will be used for in 2016 and beyond. In his spare time, he chairs the Khronos OpenGL ES Working Group.

Olga Kounevitch

Rockchip Rock the Boat

Posted by Olga Kounevitch Aug 12, 2014

Today at SIGGRAPH, ARM will be showcasing the graphics capabilities of its highest-end product, the ARM Mali-T760 GPU, available to the public for the first time in the shape of the Rockchip RK3288 processor in the PiPO Pad P1 and the Teclast P90HD. The announcement of the Mali-T760 GPU’s release in October last year seems like a lifetime ago from where we’re sitting – ARM has managed to squeeze in so many activities since then – but when you compare it to the traditional lifespan of delivering a brand new chip to the market, the speed at which Rockchip has been able to deliver the RK3288 has been incredible.


Historically, it has often taken fabless semiconductor companies 2-3 years to move from initial design idea to sample silicon to having a prototype end-product to having the final production OEM device ready to ship.


The problem is, this is no longer holding true in all cases. For some partners, extracting the highest possible performance, best energy efficiency and lowest die area from the IP which ARM delivers is their key differentiation point at the launch of a new SoC. For others, it is time to market, and their competitive advantage comes from being the first to put a new feature, functionality or, in this case, GPU in the hands of consumers.

Rockchip, by working closely with ARM, their suppliers and their customers, have been able to reduce this time from idea to ready-to-ship consumer product down to 9 months.


How Did It Happen?


ARM has collaborated closely with Rockchip over a period of many years, helping them deliver best-in-class SoCs to the marketplace. Rockchip are extremely experienced in designing Mali GPU IP into their silicon – they have been licensees of Mali technology since the days of the Mali-55 GPU. Their engineers know ARM designs well and were able to apply this experience to the new design, along with some of the tools and software used previously when developing an ARM-based chip. Combine this with the benefits of being lead partner along with Samsung, LG and Mediatek in the launch of the new GPU and you have yourself a winner.


There are many advantages to being a lead partner for ARM. Rockchip were able to participate in the development of the product, ensuring their suggestions were considered, but most importantly they gained early access to the IP. This early access enabled Rockchip engineers to start work on their silicon design extremely early on in the lifecycle of the Mali-T760.  ARM also provided regular updates to the project as they were made and delivered detailed support, ensuring that by the time the Mali-T760 was announced, ARM and Rockchip had already done a lot of the legwork needed to bring the first iteration of the RK3288 to market. As Trina Watt, VP Solutions Marketing at ARM put it: “Such a phenomenal achievement in terms of getting end-user devices to the market in only seven months was made possible due to the close collaboration and commitment from both parties.”


Rockchip will continue to develop and refine their software offering over future iterations,  enhancing the processor’s energy efficiency and performance in order to get the most from the IP which they have licensed.

What Does This Mean for the Future of the Mobile Industry?


Firstly, for consumers it means that the latest mobile technology will be reaching your hands sooner than ever before – the days of hearing about sixteen-core GPUs with 400% increases in energy efficiency and performance, and then waiting three years before the GPU is in an appreciable form in your pocket, are over.


For the mobile industry, it means there is change in the air. With companies like Rockchip now setting the bar for fast tape outs and racing to be the first to market, the question will be to what extent other silicon partners can continue to spend two to three years on chip development.


ARM offers a range of Physical IP products to help reduce time to market for silicon partners. For example, ARM POP IP is a combination of physical IP with acceleration technology which guides licensees to produce the best ARM processor implementations (whether that is highest performing or most efficient) in the fastest time. It implements both the knowledge of our processor teams and the physical IP engineering teams to resolve common implementation challenges for the silicon partner. ARM POP IP is currently available for Mali-T628 and Mali-T760 GPUs.




In addition, to provide choice in the market, ARM works closely with leading EDA partners for integration and optimization of ARM IP deliverables with advanced design flows. ARM directly collaborates with each partner in the development and validation of various design flows and methodologies, enabling a successful path from RTL to foundry-ready GDSII. For example, ARM processor-based Implementation Reference Methodologies (iRMs) enable ARM licensees to customize, implement, verify and characterize soft ARM processors. For Mali GPUs, the Synopsys design flow enables a predictable route to silicon, and a basis for custom methodology development.

The Potential of the ARM Ecosystem


“Consumers are increasingly becoming more sophisticated and desire to get hold of the latest technology in their hands as soon as possible” - said Chen Feng, CMO of Rockchip. “In order to do so we need to find new ways of working with our partners across the entire length of the supply chain. Having worked closely and found success with ARM and the ARM Ecosystem over so many years already, we knew that, though the targets were demanding, between us we had the strengths and capabilities to make it happen. The Mali-T760 is an extremely promising GPU and we are proud to be the first to bring it to the hands of consumers.”


If you want to see ARM’s latest GPU in action, come to the ARM booth at SIGGRAPH and discover how the ARM Ecosystem is continuing to expand the mobile experience, with new GPUs, advanced processor technology and innovative additions to the graphics industry.



Today at SIGGRAPH a new demo is being brought to the public as the result of 18 months of collaboration between teams at ARM, Samsung Research UK and Szeged University in Hungary.  It demonstrates massively accelerated mobile web rendering on an ARM® Mali-T628 GPU based Chromebook with 1.5 to 4.5 times higher performance compared to other solutions on the market (depending on the type of content run). The solution enables a smoother experience and is not just applicable to web browsing, but can also hugely improve the user experience on browser based UIs such as those in modern GPU-enabled DTVs.


The solution, named TyGL, is a new backend for WebKit which addresses the challenge that HTML5 developers currently have when balancing graphics-rich web content against the constrained rendering capabilities of mobile CPUs. Rasterization has typically been done mainly on the CPU. While this suffices for PCs, rendering using a mobile CPU is known to be less efficient due to the constraints imposed on the CPUs by their batteries (such as lower clock frequencies). Leaving a parallel task like this to the GPU could result in a much smoother experience. However, using 3D graphics hardware such as Mali GPUs to render 2D content such as web pages is an extremely challenging task. Raster engines are designed to draw various graphics primitives one-by-one with frequently changing attributes, rather than drawing several primitives of the same type using a single draw call, which is the sort of task GPUs are optimized for. So while CPU rendering is slow, GPU rendering is complex to achieve efficiently, because WebKit issues more draw calls, each with less data, than is optimal for GPU usage.
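The draw-call problem can be illustrated with a small Python sketch. This is not TyGL's batching code, which the post does not describe; the function name batch_draw_calls and the dictionary fields (id, kind, state) are hypothetical. The sketch only merges consecutive primitives that share the same type and attributes, because 2D painting is order-dependent and reordering could change the result.

```python
from itertools import groupby


def batch_draw_calls(primitives):
    """Merge consecutive primitives sharing (kind, state) into one batch.

    Each batch stands for one GPU draw call. Only adjacent primitives are
    merged, so the original paint order is preserved.
    """
    batches = []
    for (kind, state), run in groupby(primitives, key=lambda p: (p["kind"], p["state"])):
        batches.append({"kind": kind, "state": state, "ids": [p["id"] for p in run]})
    return batches


if __name__ == "__main__":
    # Hypothetical stream of paint operations for a simple page.
    page = [
        {"id": 0, "kind": "rect", "state": "fill:white"},
        {"id": 1, "kind": "rect", "state": "fill:white"},
        {"id": 2, "kind": "glyph", "state": "fill:black"},
        {"id": 3, "kind": "glyph", "state": "fill:black"},
        {"id": 4, "kind": "glyph", "state": "fill:black"},
        {"id": 5, "kind": "rect", "state": "fill:white"},
    ]
    batches = batch_draw_calls(page)
    print(f"{len(page)} primitives -> {len(batches)} draw calls")
```

When attributes change on nearly every primitive, as the paragraph above describes for raster engines, almost no merging is possible and the number of draw calls stays close to the number of primitives.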


The other challenge which HTML5 developers face with the WebKit ports that are currently available is the level of abstraction between layout and painting to the screen. By abstracting too far from the underlying accelerated API, the developer can lose the ability to code to the API’s strengths, leading to a suboptimal implementation.


The Solution


TyGL seeks to cut down the level of abstraction in current ports and offer a web rendering solution that is fully optimized for the GPU whose only dependency is the OpenGL® ES 2.0 API, supported by the majority of application processors. It is a backend for WebKit, the open source application framework that can be used to build web-browser like functionality into an application. Major WebKit based products include embedded browsers from companies such as Espial, ACCESS and Company 100.


Both ARM and Szeged University conducted in-depth profiling of common webpages using the QtTestBrowser. The results showed that the majority of active CPU time was spent in libQtGui – the Rendering/Painting API used to render the content on the screen. GPUs are far more efficient at rendering to screens than CPUs, and it was proposed that if the drawing commands of the pipeline could be executed using the OpenGL ES 2.0 API, the performance could be improved considerably.


TyGL Pipeline


The diagram above outlines the pipeline of TyGL.  It applies three different processes to drawing text, images and paths, but the differences are in the preparation phases only. Even there, some similarities can be noticed. First, in the case of text and paths, some or all of the affine transformation is applied to the input in order to ensure higher quality output, which is then rendered to an alpha image with any remaining transformation being applied. Finally, the pipeline paths join at the point where GPU acceleration becomes efficient: colouring, clipping, blending and compositing. This common part of the pipeline is fully processed by the GPU. Each stage of the common pipeline is associated with an OpenGL ES fragment shader, which performs the necessary computations on each output pixel in a highly parallel fashion. Software-based graphics libraries such as Cairo usually have similar pipeline stages, executed sequentially and communicating through temporary buffers, but TyGL can do this more efficiently.
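As a rough illustration of the common part of the pipeline, the Python sketch below applies colouring, clipping and blending per output pixel, the way a fragment shader would. It is an assumption-laden sketch, not TyGL's shaders: the function names shade_pixel and composite are hypothetical, the clip is modelled as a simple 0..1 mask, and plain source-over blending with non-premultiplied colour is assumed because the post does not name a blend mode.

```python
def shade_pixel(coverage, clip, colour, dst):
    """Colour, clip and blend one pixel.

    coverage: 0..1 value from the alpha image produced for text or paths
    clip:     0..1 clip mask value (assumed representation)
    colour:   (r, g, b, a) fill colour, non-premultiplied
    dst:      (r, g, b) colour already in the framebuffer
    """
    r, g, b, a = colour
    alpha = a * coverage * clip
    # Source-over blend (assumed blend mode).
    return tuple(s * alpha + d * (1.0 - alpha) for s, d in zip((r, g, b), dst))


def composite(coverage_mask, clip_mask, colour, framebuffer):
    """Apply shade_pixel to every pixel; each pixel is independent of the others."""
    return [
        [shade_pixel(cov, clp, colour, dst) for cov, clp, dst in zip(cov_row, clip_row, fb_row)]
        for cov_row, clip_row, fb_row in zip(coverage_mask, clip_mask, framebuffer)
    ]


if __name__ == "__main__":
    red = (1.0, 0.0, 0.0, 1.0)
    blue_background = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
    coverage = [[1.0, 0.5]]   # fully covered pixel, then an antialiased edge
    clip = [[1.0, 1.0]]       # nothing clipped
    print(composite(coverage, clip, red, blue_background))
```

Because no pixel depends on another, the loop body maps naturally onto the highly parallel per-pixel execution that the paragraph above attributes to the GPU.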


Example webpage rendered by TyGL




Preparations are underway to open source the TyGL backend for WebKit imminently. Early results show that the port is successful at improving the performance and efficiency of 2D rasterization in the browser while remaining lightweight enough to reduce the level of overhead and abstraction in currently available solutions. By releasing this code into the Open Source arena, it is hoped that all browser vendors that make use of WebKit will be able to benefit from ARM and their partners' leadership in the domain of 2D rasterization on embedded GPUs.


If you want to find out more about TyGL, come and visit the brains behind it at the ARM Booth at SIGGRAPH this week.

Ellie Stone

ARM Makes the World Mobile

Posted by Ellie Stone Aug 12, 2014



The SIGGRAPH exhibition floor is currently buzzing with activity as staff from all companies ready themselves for the grand opening tomorrow morning. All the ARM staff in attendance are smoothing out the final creases on the booth and making sure that everything will be perfect when attendees hit the showfloor tomorrow morning at 9:30am. Unfortunately, I can't quite leak a photo of the booth at this point in time, but I can show you a picture of an awesome statue outside the Vancouver Convention Center as a loosely related teaser:


"Digital Orca" Statue outside the Vancouver Convention Center



So what does the ARM Booth have to offer this year? Besides the opportunity to win a Samsung Galaxy Note 10.1 each day, here are a couple more reasons to visit Booth #933:


Firstly, we have a number of fantastic demos from partners which are the results of many months of collaboration with ARM. More information will be released very soon concerning the latest GPU technology in the Rockchip-based devices on display and also concerning the hardware accelerated web rendering solution on show at the Samsung Research pod - check back in on the ARM Mali Graphics blog tomorrow (Updated: Rockchip Rock the Boat) if you're curious (or of course, if you're at SIGGRAPH, pop by the booth and take a look yourself!). Besides the demo from the Research department at Samsung, Samsung LSI will also be on the ARM booth demonstrating the great capabilities of the latest devices powered by the ARM® Mali-T628 GPU-based Exynos 5 Octa processor, including the Odroid-XU3 development board.


Also joining us are Collabora, who are showing how next-generation open source graphics technologies will provide power efficiency and great multimedia performance simultaneously. Their demonstration exemplifies the latest developments in the GStreamer media framework, the Wayland window system, and the Linux kernel, benchmarking power/CPU/GPU utilization and frame-time accuracy between the new Wayland and legacy X11 window systems. The Collabora demo makes use of the full breadth of ARM Mali GPUs and many features of the Samsung Exynos 5 Octa platform, including its powerful media decoding engine and display controller. In addition, Simplygon are showcasing their automatic 3D game content optimization solution, PlayCanvas  will show their cloud-hosted and real-time collaborative HTML5 & WebGL game engine which gives developers all they need to create stunning 3D games in your browser or on mobile devices, including some amazing tools, and Geomerics will be showcasing the Transporter demo, the latest and greatest demonstration of Enlighten technology which was recently integrated into Unity to provide dynamic global illumination.




On the ARM side, we have a number of demos coming to you for the very first time. Firstly, following the announcement of the Juno board last month, attendees will be able to see 64-bit content running on a quad-core ARM Cortex®-A53 CPU and dual-core Cortex-A57 in ARM big.LITTLE™ configuration. With this solution available in the market, developers will be able to more easily deliver the next generation of content for Android OS-based devices.


Secondly, we have a brand new demo showcasing the benefits of the Pixel Local Storage extension to the OpenGL® ES 3.0 API which promotes a new method of achieving bandwidth efficiency. The most significant difference between mobile GPUs and their desktop equivalents is the limited availability of sustained memory bandwidth. With advances in bandwidth expected to be incremental for many years, mobile graphics must be tailored to work efficiently in a bandwidth-scarce environment. This is true at all levels of the hardware-software stack. This demo shows that deferred rendering could be made bandwidth efficient by exploiting the on-chip memory used to store tile framebuffer contents in many tile-based GPUs. ARM is giving an unmissable talk for those interested in this subject this Wednesday at 10:45am in Rooms 109-111.


Some more familiar demos will also be on show, highlighting the benefits of ASTC Full Profile, the OpenGL ES 3.1 feature Compute Shaders and the unique tools ARM offers. For more information on these demos, check out Daniele Di Donato's blog Inside the Demo: GPU Particle Systems with ASTC 3D textures, Sylwester Bala's blog Get started with compute shaders or Lorenzo Dal Col's writings on Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel.


If you're at the show, we look forward to seeing you soon! Otherwise, keep an eye on our social media channels throughout the week for regular updates on ARM's activities at SIGGRAPH.

At SIGGRAPH 2014 we presented the benefits of the OpenGL® ES 3.0 API and the more recently introduced OpenGL ES 3.1. Adaptive Scalable Texture Compression format (ASTC) is one of the biggest introductions to the OpenGL ES API. The demo I’m going to talk about is a case study of the usage of 3D textures in the mobile space and how ASTC can compress them to provide a huge memory reduction. 3D textures weren’t available in the core OpenGL ES spec up to version 2.0 and the workaround was to use hardware dependent extensions or 2D texture arrays. Now with OpenGL ES 3.x, 3D textures are embedded in the core specification and ready to use... if only they were not so big! Using uncompressed 3D textures costs a huge amount of memory (for example, a 256x256x256 texture with RGBA8888 format uses about 67 MB, i.e. 64 MiB), which cannot be afforded on a mobile device.


Why did we use ASTC?

The same texture can instead be compressed using different levels of compression with ASTC, giving a saving of ~80% when using the highest quality settings. For those unfamiliar with the ASTC texture compression format, it is a block-based compression algorithm where LxM (or LxMxN in the case of 3D textures) blocks of texels are compressed together into a single block of 128 bits. The L, M and N values are one of the compression quality factors and represent the number of texels per block dimension. For 3D textures, the dimensions allowed vary from 3 to 6 as reported in the table below:


Block Dimension    Bit Rate (bits per texel)


Since the compressed block size is always 128 bits for all block dimensions, the bit rate is simply 128 divided by the number of texels in a block. One of the features of ASTC is that it can also compress HDR values (typically 16 bits per channel). Since we needed to store high precision floating-point values in the textures in the demo, we converted the float values (32 bits per channel) to half-float format (16 bits per channel) and used ASTC to compress those textures. In this way the loss of precision is smaller than with the usual 32-bit to 8-bit conversion and compression. It is worth noting that using the HDR formats doesn’t increase the size of the compressed texture, because each compressed block will still use 128 bits. Below you can see a 3D texture rendered simply using slicing planes. The compression formats used are (from left to right): uncompressed, ASTC 3x3x3, ASTC 4x4x4, ASTC 5x5x5.
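The bit rate arithmetic is easy to check with a few lines of Python. This is only a worked calculation based on the 128-bit block size stated above; the function names are hypothetical, and the size estimate assumes that partially filled blocks at the texture edges are still stored as whole blocks.

```python
BLOCK_BITS = 128  # every ASTC block compresses to 128 bits


def astc_bits_per_texel(l, m, n=1):
    """Bit rate for an LxM (2D) or LxMxN (3D) block footprint."""
    return BLOCK_BITS / (l * m * n)


def astc_compressed_bytes(width, height, depth, block):
    """Compressed size of a 3D texture; edge blocks count as whole blocks."""
    bl, bm, bn = block
    # -(-a // b) is integer ceiling division.
    blocks = (-(-width // bl)) * (-(-height // bm)) * (-(-depth // bn))
    return blocks * BLOCK_BITS // 8


def uncompressed_bytes(width, height, depth, bytes_per_texel):
    return width * height * depth * bytes_per_texel


if __name__ == "__main__":
    raw = uncompressed_bytes(256, 256, 256, 4)  # RGBA8888, 4 bytes per texel
    print(f"uncompressed: {raw} bytes")
    for block in [(3, 3, 3), (4, 4, 4), (5, 5, 5)]:
        size = astc_compressed_bytes(256, 256, 256, block)
        rate = astc_bits_per_texel(*block)
        saving = 100.0 * (1.0 - size / raw)
        print(f"ASTC {block[0]}x{block[1]}x{block[2]}: {rate:.2f} bits/texel, "
              f"{size} bytes, {saving:.1f}% smaller than RGBA8888")
```

For the 256x256x256 example, a 4x4x4 footprint works out to 2 bits per texel and 4,194,304 bytes, against 67,108,864 bytes uncompressed.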



For those interested in the details of the algorithm, an open source ASTC evaluation encoder/decoder is available at http://malideveloper.arm.com/develop-for-mali/tools/astc-evaluation-codec/ and a video of an internal demo ported to ASTC is available at https://www.youtube.com/watch?v=jEv-UvNYRpk. The demo is also available for viewing on the ARM booth #933 at SIGGRAPH this week.


Demo Overview

The main objective of the demo was to use the new OpenGL ES 3.0 API to realize realistic particle systems where motion physics as well as collisions are managed entirely on the GPU. The demo shows two scenes, one which simulates confetti, the other smoke.




Transform Feedback for physics simulation

The first feature I want to talk about, which is used for the physics simulation, is Transform Feedback. Each physics simulation step typically outputs a set of buffers using the results of the previous step as inputs. These kinds of algorithms, called explicit methods in numerical analysis, are well suited to Transform Feedback because it allows the results of vertex shader execution to be written back into a buffer that can subsequently be mapped for CPU reads or used as the input buffer for other shaders. In the demo, each particle is mapped to a vertex and the input parameters (position, velocity and lifetime) are stored in an input vertex buffer while the outputs are bound to the transform feedback buffer.

Because the whole physics simulation runs on the GPU, we needed a way to give each particle knowledge of the objects in the scene (this is now less problematic using Compute Shaders; see below for details). 3D textures helped us here because they can represent volumetric information and can be sampled in the vertex shader as easily as a classic texture. The 3D textures are generated from the 3D meshes of various objects using a free tool called Voxelizer (http://techhouse.brown.edu/~dmorris/voxelizer/). The voxel data contain the surface normal for voxels on the mesh surface, or the direction and distance to the nearest point on the surface for voxels inside the object. 3D textures can be used to represent various types of data such as a simple mask for occupied or free areas in a scene, density maps or 3D noise.

When uploading the files generated by Voxelizer, we convert the floating-point values to half-float and then compress the 3D texture using ASTC HDR. In the demo, we use different compression block dimensions to show the differences between uncompressed and compressed textures. These differences include memory size, memory read bandwidth and energy consumption per frame.
The smallest block size (3x3x3) gives us a ~90% reduction and our biggest texture goes down from ~87MB to ~7MB. Below you can find a table of bandwidth measurements for the various types of models we used on a Samsung Galaxy Note 10.1 (2014 Edition).


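| Texture Resolution | 128x128x128 | 180x255x255 | 255x181x243 | 78x75x127 | 43x97x127 |
|---|---|---|---|---|---|
| Texture size (MB) | | | | | |
| ASTC 3x3x3 | 1.27 | 6.12 | 6.72 | 0.45 | 0.34 |
| ASTC 4x4x4 | 0.52 | 2.63 | 2.87 | 0.19 | 0.14 |
| ASTC 5x5x5 | 0.28 | 1.32 | 1.48 | 0.10 | 0.07 |
| Memory read bandwidth (MB/s) | | | | | |
| ASTC 3x3x3 | 342.01 | 285.78 | 206.39 | 374.19 | 228.05 |
| ASTC 4x4x4 | 327.63 | 179.43 | 175.21 | 368.13 | 224.26 |
| ASTC 5x5x5 | 323.10 | 167.90 | 162.89 | 366.18 | 222.76 |
| Energy consumption, DDR2 (mJ per frame) | | | | | |
| ASTC 3x3x3 | 2.31 | 1.93 | 1.39 | 2.53 | 1.54 |
| ASTC 4x4x4 | 2. | | | | |
| ASTC 5x5x5 | 2. | | | | |
| Energy consumption, DDR3 (mJ per frame) | | | | | |
| ASTC 3x3x3 | 1.90 | 1.59 | 1.15 | 2.08 | 1.27 |
| ASTC 4x4x4 | 1.82 | 1.00 | 0.97 | 2.04 | 1.24 |
| ASTC 5x5x5 | 1.79 | 0.93 | 0.90 | 2.03 | 1.24 |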
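To make the idea of an explicit method from the Transform Feedback section more concrete, here is a minimal CPU-side sketch in Python. It is not the demo's shader code: the `Particle` type, the constant `GRAVITY` and the two-buffer "ping-pong" arrangement are assumptions chosen for illustration. The point is only that each step reads the previous state and writes a new one, which is the pattern Transform Feedback supports on the GPU.

```python
from dataclasses import dataclass

GRAVITY = (0.0, -9.81, 0.0)  # assumed constant acceleration


@dataclass(frozen=True)
class Particle:
    position: tuple
    velocity: tuple
    lifetime: float


def step(particle, dt):
    """One explicit Euler step: outputs depend only on the previous state."""
    new_position = tuple(p + v * dt for p, v in zip(particle.position, particle.velocity))
    new_velocity = tuple(v + a * dt for v, a in zip(particle.velocity, GRAVITY))
    return Particle(new_position, new_velocity, particle.lifetime - dt)


def simulate(particles, dt, steps):
    """Alternate between a read buffer and a write buffer, in the way two
    vertex buffers could swap roles after every transform feedback pass."""
    read_buffer = list(particles)
    for _ in range(steps):
        write_buffer = [step(p, dt) for p in read_buffer]
        read_buffer = write_buffer  # the output becomes the next input
    return read_buffer
```

In the demo the per-particle update also samples the 3D collision textures; that part is left out here.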


Instancing for efficiency

Another feature that was introduced in OpenGL ES 3.0 is Instancing. It permits us to specify geometry only once and reuse it multiple times in different locations with a single draw call. In the demo we use it for the confetti rendering where, instead of defining a vertex buffer of 2500*4 vertices (we render 2500 particles as quads in the confetti scene), we just define a vertex buffer of 4 vertices and call:


glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, 2500 );


where GL_TRIANGLE_STRIP specifies the type of primitive to render, 0 is the start index inside the enabled vertex buffers that represents the positions of the vertices of the quad, 4 specifies the number of indices needed to render one instance of the geometry (4 indices per quad) and 2500 is the number of instances to render. Inside the vertex shader, the gl_InstanceID built-in variable will be available and it will contain the identifier for the current invocation. This variable can, for example, be used to access an array of matrices or do specific calculations for each instance. A divisor can also be specified for each active vertex buffer which specifies how the vertex shader will advance in the vertex buffers for each instance.
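The way the instance ID and the divisor select data can be emulated on the CPU. The Python sketch below is only an illustration of the indexing rule (divisor 0 advances per vertex, divisor d > 0 advances once every d instances); the function names are hypothetical and no OpenGL ES call is made.

```python
def attribute_index(vertex_id, instance_id, divisor):
    """Element fetched from a vertex buffer for one shader invocation.

    divisor == 0: the attribute advances per vertex (the default).
    divisor  > 0: the attribute advances once every `divisor` instances.
    """
    if divisor == 0:
        return vertex_id
    return instance_id // divisor


def draw_arrays_instanced(first, count, instance_count, divisor=1):
    """Yield (instance_id, quad corner index, per-instance index) for
    every vertex shader invocation of an instanced draw."""
    for instance_id in range(instance_count):
        for vertex_id in range(first, first + count):
            yield (instance_id,
                   attribute_index(vertex_id, instance_id, 0),
                   attribute_index(vertex_id, instance_id, divisor))
```

With the arguments used in the demo, `draw_arrays_instanced(0, 4, 2500)` produces 10000 invocations from a 4-vertex buffer.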

The smoke scene

In the smoke scene, the smoke is rendered using a noise texture and some math to compute the final colour as if it were a 3D volume. To give the smoke a transparent look we need to combine the colours of different overlapping particles. To do so we use additive blending and disable the z-test when rendering the particles. This gives a nice result even without sorting the particles by z-value (which would otherwise require mapping the buffer on the CPU). Another reason for disabling the z-test is to realize soft particles. The Mali-T6xx series of GPUs can use a specific extension in the fragment shader to read back the values of the framebuffer (colour, depth and stencil) without having to render-to-texture. This feature makes it easier to realize soft particles, and in the demo we use a simple approach. First, we render all the solid objects so that their z-values are written to the depth buffer. Then we render the smoke and, thanks to the Mali extension, we can read the depth value of the object, compare it with the depth of the current particle fragment (to see if it is behind the object) and fade the colour accordingly. This technique eliminates the sharp edge that forms where the particle quad intersects the geometry due to the z-test (another reason we had to disable it).
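The depth comparison behind soft particles can be sketched as follows. This is a guess at one simple form of the fade, not the demo's shader: it assumes linear depths that grow with distance from the camera, and `fade_range` is a hypothetical tuning parameter.

```python
def soft_particle_fade(scene_depth, particle_depth, fade_range=0.1):
    """Return a factor in [0, 1] used to scale the particle colour.

    0 means the fragment is at or behind the solid surface; 1 means it is
    at least `fade_range` in front of it.
    """
    gap = scene_depth - particle_depth
    return min(max(gap / fade_range, 0.0), 1.0)


def shade(particle_rgb, scene_depth, particle_depth):
    """Scale the particle colour by the fade factor before blending."""
    fade = soft_particle_fade(scene_depth, particle_depth)
    return tuple(channel * fade for channel in particle_rgb)
```

Because the factor falls smoothly to zero as the particle approaches the surface, no hard edge appears where the quad meets the geometry.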


Blurring the smoke

During development the smoke effect looked nice but we wanted it to be more dense and blurry. To achieve this we decided to render the smoke in an off-screen render buffer with a lower resolution than the main screen. This gives us blurred smoke (since the lower resolution removes the higher frequencies) and lets us increase the number of particles to get a denser look. The current implementation uses a 640x360 off-screen buffer that is up-scaled to 1080p resolution in the final image. A naïve approach causes jaggies on the outline of the object when the smoke is flowing near it, due to the blending of the up-sampled low resolution buffer. To almost eliminate this effect, we apply a bilateral filter. The bilateral filter is applied to the off-screen buffer and is given by the product of a Gaussian filter in the colour texture and a linear weighting factor given by the difference in depth. The depth factor is useful on the edge of the model because it gives a higher weight to neighbouring texels with a depth similar to that of the current pixel and a lower weight when the difference is larger (if we consider a pixel on the edge of a model, some of the neighbouring pixels will still be on the model while others will be far in the background).
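A minimal sketch of such a bilateral filter, written in Python for a single-channel image stored as nested lists, is given below. The kernel radius, `sigma` and `depth_range` are assumed values, and the exact form of the linear depth weight is a guess at what the description implies rather than the demo's code.

```python
import math


def bilateral_filter(colour, depth, x, y, radius=1, sigma=1.0, depth_range=0.5):
    """Filter one texel using Gaussian (spatial) times linear (depth) weights."""
    height, width = len(colour), len(colour[0])
    total, weight_sum = 0.0, 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            nx = min(max(x + dx, 0), width - 1)   # clamp to the image edge
            ny = min(max(y + dy, 0), height - 1)
            spatial = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            # Linear falloff: 1 for equal depths, 0 beyond `depth_range`.
            depth_diff = abs(depth[ny][nx] - depth[y][x])
            depth_weight = max(0.0, 1.0 - depth_diff / depth_range)
            weight = spatial * depth_weight
            total += colour[ny][nx] * weight
            weight_sum += weight
    # The centre texel always has weight 1, so weight_sum is never zero.
    return total / weight_sum
```

For a texel on the edge of a model, neighbours in the far background get a depth weight of zero, so their colour does not bleed across the outline.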



Bonus track

The recently released OpenGL ES 3.1 spec introduced Compute Shaders as a method for general computing on the GPU (a sort of subset of OpenCL™, but in the same context of OpenGL so no context switching needed!!). You can see it in action below:


An introduction to Compute Shaders is also available at:

Get started with compute shaders




I would like to point out some useful websites that helped me understand Instancing and Transform Feedback:

Transform Feedback:




ASTC Evaluation Codec:




The topic of this blog was presented recently to students in a workshop at Brains Eden Gaming Festival 2014 at Anglia Ruskin University in Cambridge [1]. We wanted to provide students with an effective and low cost technique to implement reflections when developing games for mobile devices.

Early Reflection Implementations

From the very beginning, graphics developers have tried to find cheap alternatives to implement reflections. One of the first solutions was spherical mapping. Spherical mapping simulates reflections or lighting upon objects without going through expensive ray-tracing or lighting calculations. This approach has several disadvantages, but the main problem is related to the distortions when mapping a picture onto a sphere.  In 1999, it became possible to use cubemaps with hardware acceleration.
Figure 1: Spherical mapping.

Cubemaps solved the problems of image distortion, viewpoint dependency and computational inefficiency associated with spherical mapping. Cube mapping uses the six faces of a cube as the map shape. The environment is projected onto each side of a cube and stored as six square textures, or unfolded into six regions of a single texture. The cubemap is generated by rendering the scene from a given position with six different camera orientations, each with a 90 degree view frustum representing one cube face. Source images are sampled directly. No distortion is introduced by resampling into an intermediate environment map.


Figure 2: Cubemaps.

To implement reflections based on cubemaps we just need to evaluate the reflected vector R and use it to fetch the texel from the cubemap CubeMap using the available texture lookup function texCUBE:

float4 col = texCUBE(CubeMap, R);

Expression 1.



Figure 3: Reflections based on infinite cubemaps.

With this approach we can only reproduce reflections correctly from a distant environment where the cubemap position is not relevant. This simple and effective technique is mainly used in outdoor lighting, for example, to add reflections of the sky. If we try to use this technique in a local environment we get inaccurate reflections.


Figure 4: Reflection on the floor calculated wrongly with an infinite cubemap.


Local Reflections

The main reason why this reflection fails is that Expression 1 has no binding to the local geometry. For example, according to Expression 1, if we were walking on a reflective floor looking at it from the same angle, we would always see the same reflection on it. As the direction of the view vector does not change, the reflected vector is always the same and Expression 1 gives the same result. This is not what happens in the real world, where reflections depend on both viewing angle and viewing position.
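The position independence of Expression 1 can be seen in a few lines of Python. The sketch below re-implements the standard reflection formula R = I - 2(N·I)N; the helper `infinite_cubemap_lookup_dir` is a hypothetical name, and its `fragment_pos` argument is included only to show that it plays no part in the result.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reflect(incident, normal):
    """R = I - 2 (N . I) N, with N assumed to be unit length."""
    scale = 2.0 * dot(normal, incident)
    return tuple(i - scale * n for i, n in zip(incident, normal))


def infinite_cubemap_lookup_dir(fragment_pos, view_dir, normal):
    """Direction used by Expression 1; fragment_pos is deliberately unused."""
    return reflect(view_dir, normal)
```

Two fragments at different places on the floor, seen along the same view direction, therefore fetch exactly the same texel.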


The solution to this problem was first proposed by Kevin Bjorke[2] in 2004. For the first time a binding to the local geometry was introduced in the procedure to calculate the reflection:


Figure 5: Reflections using local cubemaps.


While this approach gives good results on object surfaces with a near-spherical shape, in the case of planar reflective surfaces the reflection shows noticeable deformations. Another drawback of this method is the relative complexity of the algorithm that calculates the intersection point with the bounding sphere, which requires solving a second-degree equation.
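For reference, the second-degree equation in question comes from substituting the ray P + tD into the sphere equation |X - C|² = r². The Python sketch below is a generic ray-sphere intersection written for illustration, not Bjorke's shader code; the function names are hypothetical and the ray origin is assumed to lie inside the sphere.

```python
import math


def ray_sphere_exit(origin, direction, centre, radius):
    """Distance t > 0 at which origin + t * direction leaves the sphere.

    Assumes the origin lies inside the sphere, so exactly one root of the
    quadratic a t^2 + b t + c = 0 is positive.
    """
    oc = tuple(o - c for o, c in zip(origin, centre))
    a = sum(d * d for d in direction)
    b = 2.0 * sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    discriminant = b * b - 4.0 * a * c
    return (-b + math.sqrt(discriminant)) / (2.0 * a)


def sphere_corrected_dir(origin, direction, centre, radius):
    """Vector from the sphere centre to the intersection point."""
    t = ray_sphere_exit(origin, direction, centre, radius)
    hit = tuple(o + t * d for o, d in zip(origin, direction))
    return tuple(h - c for h, c in zip(hit, centre))
```

The square root and the division are the per-fragment cost that the box-based approach avoids.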


A few years later, in 2010, a better solution was proposed [3] in a thread of a developer forum at gamedev.net. The new approach replaced the previous bounding sphere by a box, solving the shortcomings of Bjorke’s method: deformations and complexity of the algorithm to find the intersection point.


Figure 6: Introducing a bounding box.


A more recent work [4] uses this new approach to simulate more complex ambient specular lighting using several cubemaps and proposes an algorithm to evaluate the contribution of each cubemap and efficiently blend on the GPU.

At this point we must clearly distinguish between local and infinite cubemaps:

Figure 7 shows the same scene from Figure 4 but this time with correct reflections using local cubemaps.

Figure 7: Reflection on the floor correctly calculated with a local cubemap.


Shader Implementation

The shader implementation in Unity of reflections using local cubemaps is provided below. In the vertex shader, we calculate the three magnitudes we need to pass to the fragment shader as interpolated values: the vertex position, the view direction and the normal, all of them in world coordinates:

vertexOutput vert(vertexInput input)
{
    vertexOutput output;
    output.tex = input.texcoord;

    // Transform vertex coordinates from local to world.
    float4 vertexWorld = mul(_Object2World, input.vertex);

    // Transform normal to world coordinates.
    float4 normalWorld = mul(float4(input.normal, 0.0), _World2Object);

    // Final vertex output position.
    output.pos = mul(UNITY_MATRIX_MVP, input.vertex);

    // ----------- Local correction ------------
    output.vertexInWorld = vertexWorld.xyz;
    output.viewDirInWorld = vertexWorld.xyz - _WorldSpaceCameraPos;
    output.normalInWorld = normalWorld.xyz;

    return output;
}

In the fragment shader the reflected vector is found along with the intersection point in the volume box. The new local corrected reflection vector is built and it is used to fetch the reflection texture from the local cubemap. Finally the texture and reflection are combined to produce the output colour:

float4 frag(vertexOutput input) : COLOR
{
     float4 reflColor = float4(1, 1, 0, 0);

     // Find reflected vector in WS.
     float3 viewDirWS = normalize(input.viewDirInWorld);
     float3 normalWS = normalize(input.normalInWorld);
     float3 reflDirWS = reflect(viewDirWS, normalWS);

     // Working in World Coordinate System.
     float3 localPosWS = input.vertexInWorld;
     float3 intersectMaxPointPlanes = (_BBoxMax - localPosWS) / reflDirWS;
     float3 intersectMinPointPlanes = (_BBoxMin - localPosWS) / reflDirWS;

     // Looking only for intersections in the forward direction of the ray.
     float3 largestParams = max(intersectMaxPointPlanes, intersectMinPointPlanes);

     // Smallest value of the ray parameters gives us the intersection.
     float distToIntersect = min(min(largestParams.x, largestParams.y), largestParams.z);

     // Find the position of the intersection point.
     float3 intersectPositionWS = localPosWS + reflDirWS * distToIntersect;

     // Get local corrected reflection vector.
     float3 localCorrReflDirWS = intersectPositionWS - _EnviCubeMapPos;

     // Lookup the environment reflection texture with the right vector.
     reflColor = texCUBE(_Cube, localCorrReflDirWS);

     // Lookup the texture color.
     float4 texColor = tex2D(_MainTex, input.tex.xy);

     return _AmbientColor + texColor * _ReflAmount * reflColor;
}

In the above code for the fragment shader, the values _BBoxMax and _BBoxMin are the maximum and minimum points of the bounding volume. The variable _EnviCubeMapPos is the position where the cubemap was created. These values are passed to the shader from the script below:

using UnityEngine;

public class InfoToReflMaterial : MonoBehaviour
{
    // The proxy volume used for local reflection calculations.
    public GameObject boundingBox;

    void Start()
    {
        Vector3 bboxLength = boundingBox.transform.localScale;
        Vector3 centerBBox = boundingBox.transform.position;
        // Min and max BBox points in world coordinates.
        Vector3 BMin = centerBBox - bboxLength/2;
        Vector3 BMax = centerBBox + bboxLength/2;
        // Pass the values to the material.
        gameObject.renderer.sharedMaterial.SetVector("_BBoxMin", BMin);
        gameObject.renderer.sharedMaterial.SetVector("_BBoxMax", BMax);
        gameObject.renderer.sharedMaterial.SetVector("_EnviCubeMapPos", centerBBox);
    }
}

The values for _AmbientColor and _ReflAmount as well as the main texture and cubemap texture are passed to the shader as uniforms from the properties block:


        Properties
        {
            _MainTex ("Base (RGB)", 2D) = "white" { }
            _Cube("Reflection Map", Cube) = "" {}
            _AmbientColor("Ambient Color", Color) = (1, 1, 1, 1)
            _ReflAmount("Reflection Amount", Float) = 0.5
        }

            #pragma glsl
            #pragma vertex vert
            #pragma fragment frag
            #include "UnityCG.cginc"
            // User-specified uniforms
            uniform sampler2D _MainTex;
            uniform samplerCUBE _Cube;
            uniform float4 _AmbientColor;
            uniform float _ReflAmount;
            uniform float _ToggleLocalCorrection;
            // ---- Passed from script InfoToReflMaterial.cs --------
            uniform float3 _BBoxMin;
            uniform float3 _BBoxMax;
            uniform float3 _EnviCubeMapPos;

            struct vertexInput
            {
                float4 vertex : POSITION;
                float3 normal : NORMAL;
                float4 texcoord : TEXCOORD0;
            };

            struct vertexOutput
            {
                float4 pos : SV_POSITION;
                float4 tex : TEXCOORD0;
                float3 vertexInWorld : TEXCOORD1;
                float3 viewDirInWorld : TEXCOORD2;
                float3 normalInWorld : TEXCOORD3;
            };

            Vertex shader {…}
            Fragment shader {…}

The algorithm to calculate the intersection point in the bounding volume is based on the parametric representation of the reflected ray from the local position (fragment). A more detailed explanation of the ray-box intersection algorithm can be found in [4] in the References.
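The same ray-box arithmetic can be followed step by step on the CPU. The Python sketch below mirrors the fragment shader's calculation for one fragment; the function name is hypothetical, and the explicit handling of zero direction components is an assumption made so the example runs without relying on GPU-style infinities.

```python
import math


def local_corrected_dir(pos, refl_dir, bbox_min, bbox_max, cubemap_pos):
    """Intersect the ray pos + t * refl_dir with an axis-aligned box and
    return the vector from the cubemap position to the hit point."""
    largest = []
    for p, d, lo, hi in zip(pos, refl_dir, bbox_min, bbox_max):
        if d == 0.0:
            largest.append(math.inf)  # ray is parallel to this pair of planes
        else:
            # Of the two planes on this axis, keep the one hit going forward.
            largest.append(max((hi - p) / d, (lo - p) / d))
    dist = min(largest)  # the nearest forward plane is where the ray exits
    hit = tuple(p + d * dist for p, d in zip(pos, refl_dir))
    return tuple(h - c for h, c in zip(hit, cubemap_pos))
```

Unlike Expression 1, the result now changes with the fragment position: for a box from (-5, 0, -5) to (5, 4, 5) with the cubemap at (0, 2, 0), a ray going straight up from (1, 0, 0) gives (1, 2, 0), while the same ray from (3, 0, 0) gives (3, 2, 0).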


Filtering Cubemaps

One of the advantages of implementing reflections using local cubemaps is the fact that the cubemap is static, i.e. it is generated during development rather than at run-time. This gives us the opportunity to apply any filtering to the cubemap images to achieve a given effect.

As an example, the image below shows reflections using a cubemap where a Gaussian filter was applied to achieve a “frosty” effect. The CubeMapGen [5] tool (from AMD) was used to apply filtering to the cubemap. To give an idea of how expensive this process can be, filtering a 256-pixel cubemap took more than one minute on a PC.


Figure 8: Gaussian filter applied to reflections in Figure 3.

A specific tool was developed for Unity to generate cubemaps and save cubemap images separately to import later into CubeMapGen. Detailed information about this tool and about the whole process of exporting cubemap images from Unity to CubeMapGen, applying filtering and reimporting back to Unity can be found in the References section [4].




Reflections based on static local cubemaps are an effective tool to implement high quality and realistic reflections and a cheap alternative to reflections generated at run-time. This is especially important on mobile devices where performance and memory bandwidth consumption are critical to the success of many games.


Additionally, reflections based on static local cubemaps allow developers to apply filters to the cubemap to achieve complex effects that would otherwise be prohibitively expensive at run-time, even on high-end PCs.

The inherent limitation of static cubemaps when dealing with dynamic objects can be solved easily by combining static reflections with reflections generated at run-time. This topic will be examined in a future blog.



[1] Reflections based on local cubemaps. Presentation at Brains Eden, 2014 Gaming Festival at Anglia Ruskin University in Cambridge.  http://malideveloper.arm.com/downloads/ImplementingReflectionsinUnityUsingLocalCubemaps.pdf

[2] GPU Gems, Chapter 19. Image-Based Lighting. Kevin Bjorke, 2004. http://http.developer.nvidia.com/GPUGems/gpugems_ch19.html

[3] Cubemap Environment Mapping. 2010. http://www.gamedev.net/topic/568829-box-projected-cubemap-environment-mapping/?&p=4637262

[4] Image-based Lighting approaches and parallax-corrected cubemap. Sebastien Lagarde. SIGGRAPH 2012.


[5] CubeMapGen. http://developer.amd.com/tools-and-sdks/archive/legacy-cpu-gpu-tools/cubemapgen/

ARM Mali Video and Display Technology to Power Next Generation of Atmel Devices


ARM was delighted this week to announce that Atmel is the first company to license the ARM Mali-V500 and ARM Mali-DP500 processors along with the low-power Cortex-A7 CPU. With a multimedia subsystem from ARM, Atmel will be able to benefit from the energy efficiency and small die area of ARM IP and expand the visual experience of cost-conscious, non-smartphone markets such as wearables or automotive entertainment. We'll be very excited to see the first of these devices coming to market over the next couple of years.


For more information, Chris wrote a blog on the subject, Atmel and Mali Team up to Climb New Mountains - and Atmel wrote a blog on their own channel too.

ARM Holdings PLC Reports Results For The Second Quarter And Half Year


The results, released on Tuesday, showed that ARM signed a fantastic 41 new processor licenses in 2Q14, driven by demand for ARM technology in smart mobile devices, consumer electronics and embedded computing chips for the Internet of Things - this takes the number of cumulative ARM licenses to over 1,100. Of these 41, eight were for Mali processors, bringing the cumulative total for ARM Mali processor licenses to 96. ARM enters the second half of the year with a healthy pipeline of opportunities that is expected to both underpin continued strong licence revenue and give rise to an increase in the level of backlog.


Lots of New ARM Mali Tools Released


There's a saying about buses that comes to mind, but this week was definitely the week for developers to be upgrading their mobile tools! If you haven't yet, here are the links to download the latest in ARM's range of tools:


DS-5 v5.19 - new features include making it much easier to autodetect, define, launch and log simulation models from DS-5 thanks to the Models Platform Configuration Editor

OpenGL ES Emulator v1.4 - new features include a single library containing complete EGL/OpenGL ES emulation code for improving library loading scenarios

Mali Graphics Debugger v1.3 - new features include much-demanded frame replay functionality and lower overheads thanks to its new binary format

Mali Offline Shader Compiler v4.3 - upgraded to now support the compiling of OpenGL ES 3.0 shaders and the Mali-T600 series r4p0-00rel0 driver


It Wouldn't Be a Very Good Week Without the Launch of a New Mid-Range Device


And today the featured phone is the Spice Stellar 526, launched into the Indian market featuring an ARM Mali-450 MP4 GPU. According to their PR, Spice Retail Limited believe in "democratizing technology and widening the product range for our customers" and this new device will definitely help bring enriched graphics to consumers of mid-range devices.

We have released a new version of the OpenGL® ES Emulator. The OpenGL ES Emulator is a library that maps OpenGL ES API calls to the OpenGL API. It supports OpenGL ES 2.0 and 3.0, plus additional extensions.

In this release we have implemented:

  • Single library containing complete EGL/OpenGL ES emulation code for improving library loading scenarios.
  • Improvements on how textures are working in various use scenarios.
  • Shader source processing is reflecting extension support correctly.
  • Providing mali-cube executable for installation verification.
  • Debian Software Package (.deb) now available for Ubuntu.


Version 1.4 of the Mali OpenGL ES Emulator is also provided as a DEB package that can be installed on Debian, Ubuntu or other Linux distributions that use compatible software package management systems.


Get the Mali OpenGL ES Emulator

Khronos is a trademark of The Khronos Group Inc. OpenGL is a registered trademark, and OpenGL ES is a trademark, of Silicon Graphics International.

Chinese version (中文版): Atmel and Mali Team up to Climb New Mountains

Le Tour de France is arguably the world’s greatest cycling road race and has been extremely successful on the streets of…well…France. That was until a few weeks ago when “Le Tour” came to the UK.


ARM® Mali™ Multimedia IP is widely regarded as the world’s best too and it has been incredibly successful in mobile, where we are the #1 shipping GPU for Android devices. However, the scalability of our multimedia IP means our partners can also target other markets beyond mobile with the same design.


The interesting thing for me about the Atmel license announced today is not just that they now have access to ARM Cortex®-A7, ARM's most energy efficient processor ever, and ARM’s video and display processors – it is the new and different types of markets Atmel will go on and address with the same ARM Mali IP which has done so well in the mobile market.


Atmel has traditionally not focused on mobile application processors. Instead, they have been very successful in a broad range of other low power markets.


Each ARM Mali GPU, Video and Display processor delivers best-in-class performance in the smallest area. It is this small area advantage that will allow Atmel to target a vast array of markets with a rich multimedia experience.


Atmel plans to use ARM's technology in wearables, toys and even industrial markets. Here, user expectation is that any screen-based device will deliver a similar experience to the latest tablet or smartphone with a smooth 3D user interface, video capture and playback functionality, all at HD resolution and incorporating secure features for protection of data and content. And critically, all of this in a low power budget.


The Mali IP Atmel has licensed includes system-wide power saving technology such as ARM Frame Buffer Compression (see Jem Davies' blog ARM Mali Display - the joined-up story continues), out-of-the-box support for ARM TrustZone® processor security technology, and an optimized software solution delivering maximum efficiency all the way to the glass (see Dave Brown's blog Do Androids have nightmares of botched system integrations).


A few weeks ago, the organisers of the Tour de France brought the race to the UK, speeding right through my tiny little village on the outskirts of Cambridge. The event was a massive success in the UK with crowds surpassing anything seen outside Paris.


I look forward to seeing where Atmel takes ARM’s range of multimedia IP and the success they will see in some very exciting markets outside mobile.

Mali Graphics Debugger v1.3

We are pleased to announce that version 1.3 of Mali Graphics Debugger is finally available to everyone. This is the biggest update since we first released MGD, around a year ago. In this version we have introduced:

  • A new frame replay feature (see User Guide Section 5.2.10 for details).
  • Faster tracing and lower overheads with a new binary format.
  • Memory performance improvements (allows for longer traces and improved performance of the tool in general).
  • Enhancements to the GUI.

While developing these features we also fixed a large number of bugs and improved the user experience (see User Guide Section 9.1 for details).

Get the Mali Graphics Debugger

About Frame Replay

Mali Graphics Debugger now has the ability to replay certain frames, depending on what calls were in that frame. To see if a frame can be replayed you must pause your application in MGD. Once paused, if a frame can be replayed, the replay button will be enabled. Clicking this button will cause MGD to reset the OpenGL ES state of your application back to how it was before the previous frame had been drawn. It will then play back the function calls in that frame. This feature can be combined with the Overdraw, Shadermap, and Fragment Count features. For example, you can pause your application in an interesting position, activate the Overdraw mode and then replay the frame (see Section 5.2.10 in the User Guide for frame replay limitations).


About the New Binary Format

Originally Mali Graphics Debugger was using a text based format to transfer all the data from the target device to the host. Now we’ve decided to switch to a binary format, using Google Protocol Buffers. This has made the interceptor lighter allowing faster transfers and lower overhead while capturing the application. Old text-based traces can still be read and they will be converted to the new binary format when saved again on disk.



Mali Offline Shader Compiler v4.3

We are also happy to announce the release of a new version of the Mali GPU Offline Shader Compiler.

This version is capable of compiling OpenGL ES 3.0 shaders and supports the Mali-T600 series r4p0-00rel0 driver, together with all previous driver versions.

By popular demand, we’ve also re-introduced Mac OS X support, which adds to the already supported Windows and Linux versions.

Get the Mali GPU Offline Shader Compiler

UK Govt Backs Geomerics to Revolutionize the Movie Industry


Earlier this week, Geomerics announced that it has won a £1million award from the UK's Technology Strategy Board (TSB) to bring its real-time graphics rendering techniques from the gaming world to the big screen. Geomerics and its partners will help the film and television services industry become more efficient by decreasing the amount of time spent in rendering, particularly for lighting, which is one of the most time consuming parts of the editing process. Traditionally all editing was done offline and the result was then rendered to bring it to full quality, with the rendering taking 8-12 hours. However, the gaming world has developed techniques that allow full quality graphics sequences to be rendered instantly - and Geomerics is looking to bring them to the film world.


For more information on the technology behind the announcement, check out the Geomerics website.


Hardkernel Release the Odroid-XU3 Development Board


Based on Samsung's Exynos 5422 SoC with its ARM Mali-T628 MP6 GPU, this new development board from Hardkernel offers a heterogeneous multiprocessing solution with great 3D graphics. Thanks to its open source support, the board can run various flavours of Linux, including the latest Ubuntu 14.04, as well as Android 4.4.

Full details on the board are available on the Hardkernel website and it also got a great article on Linux Gizmos.


Today Was Clearly The Day of the Mali-450 MP GPU


Four devices were announced today featuring the Mali-450, two of which had the MediaTek MT6592 at their heart and two a HiSilicon SoC. The Mali-450 has been picking up momentum over the past six months and now we are starting to see it in a range of smartphones, such as the HTC Desire 616 and the Wickedleak Wammy Neo Youth - as well as tablets such as the HP Slate 7 VoiceTab Ultra and HP Slate 8 Plus.


Aricent on Architecting Video Software for Multi-core Heterogeneous Platforms

If you haven't caught it already, our GPU Compute partner Aricent posted a great blog on their section of the community, Parallel Computing: Architecting video software for multi-core heterogeneous platforms. It covers conventional techniques used by software designers to parallelize their code and then proposes the novel "hybrid and massive parallelism based multithreading" model as a potential way to overcome the shortcomings of spatial and functional splitting. It's definitely worth a read if you're interested in programming for multi-core platforms.

I have just returned from a fortnight spent hopping around Asia in support of a series of ARM hosted events we call the Multimedia Seminars, which took place in Seoul (27th June), Taipei (1st July) and Shenzhen (3rd July). Several hundred attendees joined in each location, a quality-dense cross-section from the local semiconductor and consumer industries, including many silicon vendors, OEMs, ODMs and ISVs. All of them were able to hear the great progress made by the ARM ecosystem partners who are developing the use of GPU Compute on ARM® Mali™ GPUs. In this blog I will try to summarise some of the highlights.

The Benefits of GPU Compute

In my presentation at the three sites I was able to illustrate the benefits of GPU Compute using Mali. This was an easy task as at the event many independent software vendors were demonstrating and promoting a vast selection of middleware ported and optimized for Mali.
But what are the benefits of GPU Compute?
  • Reduced power consumption. The architectural characteristics of the Mali-T600 and Mali-T700 series of GPUs enable computation of many parallel workloads much more efficiently than alternative processor solutions. GPU Compute accelerated applications can therefore benefit by consuming less energy, which translates into longer battery life.
  • Improved performance and user experience. Where raw performance is the target, the computation of heavy parallel workloads can also be significantly accelerated through the use of the GPU. This may translate into an increased frame rate, or the ability to carry out more work in the same temporal/power budget, and can result in benefits such as improved UI responsiveness, more robust finger detection for gesture UIs in challenging lighting conditions, more accurate physics simulation, and the ability to apply complex pre-/post-processing effects to multimedia on-device and in real time. In essence: a significantly improved end-user experience.
  • Portability, programmability, flexibility. Heterogeneous compute APIs such as OpenCL™ and RenderScript are designed for concurrency. They allow the developer to migrate some of the load from the CPU to the GPU or other accelerator, or to distribute it between processors in order to enable better load-balancing across system resources. For example, a video codec may offload motion vector calculations to the GPU, enabling the CPU to operate with fewer cores and at lower frequencies, or to be available to compute additional tasks such as video analytics.
  • Reduction of cost, risk and time to market. System designers may be influenced by various cost, flexibility and portability concerns when considering migrating functionality from dedicated hardware accelerators to software solutions which leverage the CPU/GPU subsystem. This approach is made viable and compelling due to the additional computational power provided by the GPU, now exposed through industry standard heterogeneous compute APIs.

Over the last few years ARM has worked very hard to create and develop a strong GPU Compute ecosystem. Collaborations were established across geographies, use-cases and applications, working with partners at all levels of the value chain. These partners were able to translate the benefits of GPU Compute into reality, to the ultimate benefit of end users, and were proudly showcasing their progress at the Multimedia Seminars.

Demonstrating Reduced Power Consumption

Software codec vendors such as Ittiam Systems have for some time been demonstrating optimized HEVC and VP9 ports that make use of GPU Compute on Mali-T600 series GPUs. A software solution leveraging the CPU+GPU compute subsystem can be useful for reducing time to market and reducing risk in the adoption of new standards but, most importantly, it can help to save power.

For the first time ever, Ittiam Systems publicly demonstrated how a software solution leveraging the CPU+GPU compute subsystem is able to save power compared to a solution that does not make use of the GPU. Using an instrumented development board and power probing tools connected to a National Instruments DAQ unit, they were able to demonstrate a typical reduction in power consumption of over 30% for 1080p30 video playback.

Mukund Srinivasan, Director and General Manager of the Consumer and Mobility Business Unit at Ittiam, said: "These paradigm shifts open a unique window of opportunity for focused media-related Intellectual Property providers like Ittiam Systems® to offer highly differentiated solutions that are not only compute efficient but also enhance user experience by way of a longer battery life, thanks to offloading significant compute to the ARM Mali GPU. The main hurdle to cross, in these innovative solutions, comes in the form of how to manage the enhanced compute demands of a complex codec standard on mobile devices, since it bears significantly more complex coding tools, for codecs like H.265 or VP9, as compared to VP8 or H.264. In order to gain the maximum efficiencies offered by the GPU technology, collaborating with a longstanding partner and a technology pioneer like ARM enabled us to generate original solutions to the complex problems posed in the design and implementation of consumer electronic systems. Working closely with ARM, we have been able to showcase not just a prototype or a demo, but a real working product delivering significant power savings, of the order of 30-35% improvement in energy efficiencies, when measured on the whole subsystem, using the ARM Mali GPU on board a silicon chip designed for use in the Mobile market"





Demonstrating Reduced CPU Load

Another benefit of GPU Compute was illustrated by our partner Thundersoft, who have implemented a gender-based real-time face beautifier application and improved its performance using RenderScript on a Mali-T600 GPU. The algorithm first detects the subject’s face and determines its gender, and based on the gender applies a chain of complex image processing filters that enhance the subject’s appearance. These include face whitening, skin tone softening and de-blemishing effects.


The algorithm is very computationally intensive and very taxing on CPU resources, which can at times result in poor responsiveness. Furthermore, a performance of 20+ FPS is required in order to deliver a good user experience and this is not achievable on the CPU alone. Fortunately, the heavy level of data parallelism, and the large proportion of floating point and SIMD-friendly operations, make this use case great for GPU acceleration. Using RenderScript on Mali, Thundersoft were able to improve the performance from a poor 10fps to over 20fps, and at the same time reduce the CPU load from fluctuating between 70-100% to a consistent level below 40%. Dynamic power reduction techniques are therefore able to disable CPU cores and scale down their operating points in order to save power.

Delivering Improved Performance and User Experience

Image processing is proving to be a very fertile area for GPU Compute processing. In their keynote and technical speech, ArcSoft illustrated how they utilised Mali GPUs to improve many of their algorithms including JPEG, photo filters, beautification, Video HDR (NightHawk), and HEVC.
A “nostalgia effect” filter, based on convolution, was optimized using OpenGL® ES. For a 1920x1080 camera preview image, the rendering time was reduced from 80ms down to 20ms using the Mali GPU. This means going from 12.5fps to 50fps.
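The frame-rate figures follow directly from the per-frame time, and a convolution filter suits the GPU because every output pixel can be computed independently of the others. The sketch below illustrates both points in plain Python. The function names and the box-blur kernel are assumptions made for illustration only; the actual kernel used for the "nostalgia effect" is not given here, and this is not ArcSoft's OpenGL ES implementation.

```python
def fps_from_frame_time(ms_per_frame):
    """Frames per second when each frame takes ms_per_frame milliseconds."""
    return 1000.0 / ms_per_frame


def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to a 2D list of floats.

    Border pixels are handled by clamping coordinates to the image edge.
    The kernel is applied without flipping, which gives the same result
    as a true convolution for symmetric kernels such as the one below.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    # Each (y, x) iteration is independent of the others; on a GPU every
    # output pixel would typically be computed by its own thread.
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    sy = min(max(y + ky - 1, 0), height - 1)
                    sx = min(max(x + kx - 1, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky][kx]
            out[y][x] = acc
    return out


# Hypothetical kernel: a simple 3x3 box blur (assumption, not from the text).
BOX_BLUR = [[1.0 / 9.0] * 3 for _ in range(3)]

print(fps_from_frame_time(80))  # 12.5
print(fps_from_frame_time(20))  # 50.0
```

A 1920x1080 frame has just over two million such independent pixels, which is the kind of workload that maps well onto a GPU.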
Another application that benefited is ArcSoft’s implementation of a face beautifier. The camera stream was processed by the CPU, whilst colour-conversion and rendering were moved from the CPU to the GPU. Processing time for a 1920x1080 frame was therefore reduced from 30ms to just 10ms. In practice this meant that the face beautification frame rate was improved from 16fps to 26fps!
Another great example is JPEG processing. OpenCL was used to reimplement the inverse quantization and IDCT modules. Compared with ArcSoft’s NEON-based JPEG decoder, performance in decoding 4000x3000 resolution images improved by 25%. Compared with the OpenCL-based open-source project JPEG-OpenCL, the efficiency of the IDCT increased by as much as 15 times.
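For readers unfamiliar with the two JPEG stages mentioned above, here is a minimal reference sketch of inverse quantization followed by the 8x8 inverse DCT, written in plain Python. It is a textbook formulation, not ArcSoft's OpenCL kernel; the function names are hypothetical, and the +128 level shift and clamping that a full decoder applies afterwards are left out.

```python
import math

N = 8  # JPEG operates on 8x8 blocks


def dequantize(quantized, qtable):
    """Inverse quantization: multiply each coefficient by its table entry."""
    return [[quantized[v][u] * qtable[v][u] for u in range(N)]
            for v in range(N)]


def _scale(k):
    """Normalisation factor: 1/sqrt(2) for the DC term, 1 otherwise."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def idct_8x8(coeffs):
    """Direct (unoptimized) 2D inverse DCT of one 8x8 coefficient block."""
    out = [[0.0] * N for _ in range(N)]
    # Every output sample (y, x) depends only on the 64 input coefficients,
    # so samples, and whole blocks, can be computed in parallel.
    for y in range(N):
        for x in range(N):
            acc = 0.0
            for v in range(N):
                for u in range(N):
                    acc += (_scale(u) * _scale(v) * coeffs[v][u]
                            * math.cos((2 * x + 1) * u * math.pi / 16.0)
                            * math.cos((2 * y + 1) * v * math.pi / 16.0))
            out[y][x] = acc / 4.0
    return out
```

A 4000x3000 image contains 187,500 such 8x8 blocks per colour component at full resolution, each of which can be processed independently, which is why these stages are natural candidates for GPU offload.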

Improved User Experience for Computer Vision Applications

You may have previously seen our partner eyeSight Technologies demonstrate how they have been able to improve the robustness and reliability of their gesture UI engine. Gesture UIs are particularly challenged when lighting conditions are poor, as this adds a lot of noise to the sensor data and reduces the accuracy of gesture detection. As it happens, poorly lit situations are common where gesture UIs are typically used, such as inside a car or in a living room. GPU Compute significantly increases the amount of useful computation that the gesture engine can carry out on the image within the same temporal and energy budget; this enables a significant improvement in the reliability of gestures when lighting is poor.

eyeSight machine vision algorithms make extensive use of machine learning (using neural networks). The capability to learn from a vast amount of data in a reasonable amount of time is a key element for success. However, the computational resources required by neural networks are beyond the capabilities of standard CPUs. eyeSight’s utilization of deep learning methods can greatly benefit from running on GPU processors.
eyeSight have used their extensive knowledge of machine vision technologies and the ARM Mali GPU Compute architecture to optimize their solutions using OpenCL on Mali.

Alva demonstrate video HDR and real-time video stabilization

In his presentation, Oscar Xiao, CEO of Alva Systems, discussed the value of heterogeneous computing for camera based applications, using two examples: real-time video stabilization and HDR photography. Alva optimized their solutions for Mali-400, Mali-450 and Mali-T628. Implementations of their algorithms are available using OpenGL ES and OpenCL APIs. Through the use of the GPU, image stabilization can be carried out comfortably for 1080p video streams at 30fps and above. Alva Systems have also implemented an advanced HDR solution that corrects image distortion (common in multi-frame processing algorithms), removes ghosting and carries out intelligent tone mapping (to enable a more realistic result). Of course all of these features increase the computational requirements of the algorithm. GPU Compute enables real-time computation. Alva were able to measure performance improvements for individual blocks of around 13-15x compared to the reference CPU implementation of the same algorithm.

In Conclusion

Modern compute APIs enable efficient and portable heterogeneous computing. This includes enabling the use of the best processor for the task, the ability to balance workloads across system resources and to offload heavy parallel computation to the GPU. GPU Compute with ARM Mali brings tangible advantages for real world applications, including reduced cost and time to market, improved performance and user experience, and improved energy efficiency (measured on consumer devices). These benefits are being enabled by our ecosystem partners who use GPU Compute on Mali for a variety of applications including: advanced imaging, computer vision, computational photography and media codecs.



Industry leaders take advantage of the capabilities of ARM Mali GPUs to innovate and deliver - be one of them!

It's been a busy time here at Mali-central recently and I have been too busy even to blog about it.


I did an Ask The Experts slot with AnandTech recently, and faced some very interesting questions from the public. You might like to take a look.

We also bared our soul and Ryan Smith wrote a very detailed article about the Mali Midgard architecture.


Then, finally, I faced Anand Shimpi from AnandTech and did a live interview as a Google hangout. The whole thing was recorded and put up on YouTube. You can see it here. I know what I meant to say but what came out of my mouth of course didn't always match that. Oh well...



After a little hiatus in these blogs while I took a trip to Scotland, these blogs are back on the road! So, what has happened in the ARM Mali world this week?

ARM and Geomerics recognized in the Develop 100 Tech List


Develop have published the ultimate list of tech that will influence the future of gaming. Including a multitude of upcoming platforms and the tools, engines and middleware required to make great games run excellently, the list is well worth a read. The ARM-based Raspberry Pi took pride of place in the top spot, followed shortly afterwards by Geomerics' Enlighten technology at #8, which has recently been integrated into Unreal Engine 4 and Unity 5 (#2 and #3 in the Tech List respectively). It was great to see the Mali Developer Center at #33 being recognized for its efforts to help developers fully utilize the opportunities the mobile market offers through its broad range of developer tools, supporting documentation and broad collaboration across the gaming industry.

ARM release 64-bit hardware development platform, "Juno"


Following the announcement of 64-bit Android at Google IO, this week Linaro announced a port of the Android Open Source Project (AOSP) to the ARMv8-A architecture. At the same time, the Juno development board was released, sporting a dual-core ARM Cortex-A57 CPU and quad-core Cortex-A53 in big.LITTLE configuration, plus a quad-core ARM Mali-T624 GPU for 3D graphics acceleration and GPU Compute support. Altogether, this provides the ARM Ecosystem with a strong foundation on which we can accelerate Android availability on 64-bit silicon and drive the next generation of Android mobile experiences.


Ask the experts: Jem Davies answers your questions on AnandTech


The CPU folk did their "Ask the Experts" a while back and now it's the turn of the GPU to have the spotlight! This week ARM's Jem Davies has been answering your questions on AnandTech with topics ranging from mobile versus desktop graphics features to GPU Compute and the future of graphics APIs. Jem is also starring in a Google Hangout on AnandTech next week, Monday 7 July - tune in for what is set to be an informative and detailed debate.


Good luck to all teams competing in the Brains Eden Game Jam this weekend!


Following our "Developing for Mobile" Workshop at Brains Eden, the teams in Cambridge will set to tomorrow to develop the best game of the weekend. With the importance of mobile gaming growing rapidly, all teams have been given access to a Google Nexus 10 and to ARM experts, who are present throughout the event to guide students in how to get the best performance from these devices. Details of how they get along will be reported next week!
