ARM Mali Graphics

Ok, I did it. I downloaded Pokemon Go. Yes, I was trying to resist; yes, it was futile; yes, it’s an awesome concept. While I’m a strong believer in virtual reality as a driving force in how we will handle much of our lives in the future (see my extensive blog series on the subject), I can see that apps like this have the potential to take Augmented Reality (AR) mainstream much faster than VR. Given the safety (and aesthetic) issues inherent in walking around with a headset on, AR lets you enter a semi-immersive environment while still seeing the world around you, although that doesn’t negate the need for a warning not to walk blindly into traffic mid-game. By overlaying graphics, user interface and interactive elements on the real-world environment we get a much more ‘real life’ feel to gaming. The fact that it also gets a generation of console gamers on their feet and out into the big wide world is just an added bonus.


It turns out Pokemon Go isn’t its developer’s first attempt at this kind of application. Back in 2012 they sought users to test a beta version of a similar real-world game based on spies: you followed the map on your phone to relevant locations to solve puzzles, make drops and so on. You could argue that the reason this has taken off when that didn’t is that it now has the marketing superpower of Pokemon and Nintendo behind it, but I think it’s a little more than that. All anyone in the tech industry has been talking about in recent months is VR, AR and computer vision, and this uses two of the three straight away. Not only that, but it does so in a form that’s accessible to absolutely everyone with a smartphone (and, in its early days, an external battery pack for those who want to play for more than about ten minutes).

ross.hookway & alexmercer: Catching the Pokemon bug at the ARM Cambridge campus


The idea of playing an adventure-style game in my home city appeals to me anyway, and the fact that Pokemon Go overlays itself onto your actual surroundings, rather than just placing a point on an animated map, makes it a whole lot more relatable. This is where computer vision comes in: your phone has to be able to recognise and interpret the locations and landmarks it sees in order to use AR to realistically overlay the Pokemon onto your surroundings. Without computer vision it could prove difficult to avoid bugs like trapping Pokemon in unreachable environments, or enticing people into dangerous situations.


There’s been something of a misconception that you need ‘special’ chips to be able to do computer vision, and that the additional silicon is unfeasible in the mobile form factor, but this just isn’t the case. Not only can you do this level of basic computer vision entirely on the CPU, but some companies also have engines that can detect whether your device has an ARM Mali GPU and automatically redirect some of the workload to it. This frees up the processing power and bandwidth of the CPU while tapping the superior graphical capabilities of the existing GPU, with no need for additional hardware.


The huge and lightning-fast adoption of Pokemon Go, in spite of its quite considerable bugs and glitches, demonstrates just how keen we are to jump on board with the next big thing in smartphones. It also shows that a new, and potentially confusing, technology can reach global uptake simply through clever and compelling packaging. While I fully expect the game to be optimized and bug free in a very short time, it will also no doubt prompt a wave of similar concept applications. I’ll be interested to see how this develops and whether (or maybe when) it will make AR truly the next big thing.

This year’s Siggraph, the 43rd international conference and exhibition on Computer Graphics & Interactive Techniques, takes place from the 24th to 28th July in Anaheim, California. A regular event on the ARM calendar, we’re looking forward to another great turnout, with heaps to do and see from all the established faces in the industry as well as some of the hot new tech on the scene.

Moving Mobile Graphics

A particularly exciting part of Siggraph this year is the return of the popular Moving Mobile Graphics course. Taking place on Sunday 24th July from 2pm to 5.15pm, this half day course will take you through a technical introduction to the very latest in mobile graphics techniques, with particular focus on mobile VR. Talks and speakers will include:

  • Welcome & Introduction - Sam Martin, ARM
  • Best Practices for Mobile - Andrew Garrard, Samsung R&D UK
  • Advanced Real-time Shadowing - mbjorge, ARM
  • Video Processing with Mobile GPUs - Jay Yun, Qualcomm
  • Multiview Rendering for VR - Cass Everitt, Oculus
  • Efficient use of Vulkan in UE4 - Niklas Smedberg, Epic Games
  • Making EVE: Gunjack - Ray Tran, CCP Games Asia

Visit the course page for more information. Slides will be available after the event so sign up to our Graphics & Multimedia Newsletter to be sure to receive all the latest in ARM Mali news.


Tech Talk

We’ll also be giving a talk on Practical Analytic 2D Signed-Distance Field Generation. Unlike existing methods, which first rasterize a path to a bitmap and then derive the SDF, we calculate the minimum distance from each pixel to the nearest segment directly from a path description made up of line segments and Bezier curves. Our method is novel because none of the existing techniques work in vector space, and our distance calculations are done in canonical quadratic space. Be sure to come along to Ballroom B on Thursday from 15:45 to 17:15 to learn about this groundbreaking technique.
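The line-segment case of this kind of calculation can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the talk’s actual code, and Bezier segments additionally require minimizing a polynomial, which is omitted here:

```python
import math

def dist_point_segment(px, py, ax, ay, bx, by):
    """Exact (analytic) distance from point P to line segment AB."""
    abx, aby = bx - ax, by - ay
    apx, apy = px - ax, py - ay
    ab_len2 = abx * abx + aby * aby
    # Project P onto the infinite line, then clamp to the segment's extent
    t = 0.0 if ab_len2 == 0 else max(0.0, min(1.0, (apx * abx + apy * aby) / ab_len2))
    cx, cy = ax + t * abx, ay + t * aby  # closest point on the segment
    return math.hypot(px - cx, py - cy)

def signed_distance(px, py, segments, inside):
    """Minimum distance over all segments; the sign comes from a separate
    inside/outside (winding) test, which is not computed here."""
    d = min(dist_point_segment(px, py, *seg) for seg in segments)
    return -d if inside else d
```

Evaluating this per pixel gives the field directly in vector space, with no intermediate rasterization step.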


Poster session

Elsewhere at the event we’ll be talking about Optimized Mobile Rendering Techniques Based on Local Cubemaps. The static nature of the local cubemap allows for faster and higher quality rendering, and because we use the same texture every frame we get high quality shadows and reflections with none of the pixel instabilities present in other runtime rendering techniques. Also, as only read operations are involved when using static cubemaps, bandwidth use is halved, which is especially important in mobile devices where bandwidth must be carefully balanced at runtime. Our Connected Community members have already produced a number of blogs on this subject, demonstrating how to work with soft dynamic shadows, reflections and refractions amongst other great techniques. Check these out here and come along at the event to speak to our experts!

Some devices, applications or use cases require the absolute peak of performance capability in order to deliver on their requirements. Some devices, applications or use cases however, need to save every little bit of energy expenditure in order to deliver extended battery power and run within the bounds of a thermally limited form factor. So how do we decide which end of the spectrum to target? Here in Team Mali, we don’t. Mali, the number 1 shipping GPU in the world, has reached such heights partly because it is able to target every single use case across this range. From the most powerful of mobile VR headsets needing lightning-fast refresh rates, to the tiniest of smartwatches required to run for as long as physically possible, there really is a Mali GPU for every occasion.


This mini-series of blogs will first introduce the overall scalability and flexibility of the ARM Mali range before taking a deeper dive into two products from either end of the spectrum. We will examine how these products have incorporated Mali in order to target the perfect balance of performance and efficiency their device requires. Not only does this flexibility help our partners reduce their time to market but it also means they can carefully balance resources to target the ideal positioning for their product.


So many choices, so little time

There are three product roadmaps in the Mali family: Ultra low power, High area efficiency and High performance. These groupings allow partners to easily select the right set of products for their device’s needs. The Ultra low power range includes the Mali-400 GPU, one of the first in the ARM range of GPUs and still the world’s favourite option with over 25%* market share all by itself. The latest product in this roadmap is Mali-470, featuring advanced energy saving features to bring smartphone quality graphics to low power devices like wearables and Internet of Things applications. It halves the power consumption of the already hyper efficient Mali-400 in order to provide even greater device battery life and extended end use.


The high area efficiency roadmap is focused on providing optimum performance in the smallest possible silicon area, reducing the cost of production for mass market smartphones, tablets and DTVs. IP in this roadmap includes Mali-T820 and Mali-T830, a pairing of products which combines the cost and energy saving features of their predecessor, Mali-T720, with the superior power of the simultaneously released high performance Mali-T860. The first cost efficient ARM Mali GPUs to feature ARM Frame Buffer Compression, these represented a big step up in the flexibility to balance power and performance.


The high performance roadmap is exactly as you might expect based on the name. It features the latest and greatest in GPU design to optimize performance for high end use cases and premium mobile devices. The Mali-T880 represents the highest performing GPU based upon ARM’s famous Midgard architecture and is powering many of today’s high end devices including the Samsung Galaxy S7, the Huawei P9 smartphone as well as a whole host of awesome standalone VR products. You may have read recently of our brand new high performance GPU on the market, Mali-G71. The change in naming format indicates another step up in Mali GPU architecture with the advent of the Bifrost architecture. The successor to Midgard, Bifrost has been strategically designed to support Vulkan, the new graphics API from Khronos, which is giving developers a lot more control as well as a great new feature set especially for mobile graphics. Not only that but it’s also been designed to exceed the requirements of today’s advanced content, like 360 video and high end gaming, and support the advanced requirements of growing industries like virtual reality, augmented reality and computer vision.


The possibilities are endless…

A large part of the flexibility inherent in the Mali range of products is down to the inbuilt scalability. Mali-400 came into being as the first dual core implementation of the original Mali-200 GPU once it became apparent there was a lot to be gained from this approach. High end Midgard based GPUs like Mali-T860 and Mali-T880 scale from 1 to 16 cores to allow even greater choice for our partners. We’ve seen configurations featuring up to 12 of those available cores at the top end of today’s premium smartphone to support specific use cases like mobile VR, where the requirements push the boundaries of mobile power limits. The new Bifrost GPU, Mali-G71, takes that to another level again with the ability to scale up to a possible 32 cores. The additional options were deemed necessary in order to comfortably support not only today’s premium use cases like mobile VR, but also allow room to adapt to the growing content complexity we’re seeing every day.


After the customer has established their required number of cores there is still a lot of scope for flexibility within the configuration itself. Balances can be reached between power, performance and efficiency in the way the chipset is implemented in order to provide another level of customizable options. The following images show a basic example of the flexibility inherent in the configuration of just one Mali based chipset but this is just the tip of the iceberg.



Example optimization points of one Mali GPU




Practical application

In the next blog we’ll be examining an example of a Mali implementation in a current high performance device and how the accelerated performance and graphical capability supports next-level mobile content. Following on from that we’ll look at a device with requirements to keep power expenditure to a minimum and how Mali’s superior power and bandwidth saving technologies have been implemented to achieve this. The careful balance between power and efficiency is an eternal problem in the industry but one we are primed to address with the flexibility and scalability of the ARM Mali range.


*Unity Mobile (Android) Hardware Stats 2016-06

Recently we released V4.0 of the Mali Graphics Debugger. This is a key release that greatly improves the Vulkan support in the tool. The improvements are as follows:


Frame Capture has now been added for Vulkan: This is a hugely popular feature that has been available to OpenGL ES MGD users for several years. Essentially it is a snapshot of your scene after every draw call as it is rendered on target. This means that if there is a rendering defect in your scene you immediately know which draw call is responsible. It is also a great way to see how your scene is composed, which draw calls contribute to your scene, and which draw calls are redundant.


Property Tracking for Vulkan: As MGD tracks all of the API calls that occur while an application runs, it has pretty extensive knowledge of all of the graphics API assets that exist in the application. This spans everything from shaders to textures. Here is a list of Vulkan assets that are now tracked in MGD: pipelines, shader modules, pipeline layouts, descriptor pools, descriptor sets, descriptor set layouts, images, image views, device memories, buffers and buffer views.


Don't forget, you can have your say on the features we develop in the future by filling out this short survey.


Vulkan & Validation Layers

Posted by solovyev Jul 6, 2016

Why the validation layers?

Unlike OpenGL, Vulkan drivers don't have a global context, don't maintain global state and don't have to validate inputs from the application side. The goal is to reduce the CPU time consumed by the drivers and give applications a bit more freedom in engine implementation. This approach is feasible because a reasonably good application or game should not provide incorrect input to the drivers in release mode, so all the internal checks drivers usually do are a waste of CPU time. During development and debugging, however, a mechanism for detecting invalid input is a useful and powerful tool which can make a developer's life a lot easier. In Vulkan, all input validation has therefore been moved into a separate standalone module called the validation layers. While debugging or preparing a graphics application for release, running the validation layers is good self-assurance that the application is not making any obvious mistakes. While "clean" validation layers don't necessarily guarantee a bug-free application, they’re a good step towards a happy customer. The validation layers are an open source project belonging to the Khronos community, so everyone is welcome to contribute or raise an issue.


My application runs OK on this device. Am I good to ship it?

No, you are not! The Vulkan specification is the result of contributions from multiple vendors, and as such the Vulkan API offers functionality that may matter to vendor A but be somewhat irrelevant to vendor B. This is especially true for Vulkan operations that are not directly observable by applications, for instance layout transitions, execution of memory barriers, etc. While applications are required to manage resources correctly, you don't know exactly what happens on a given device when, for example, a memory barrier is executed on an image sub-resource; it depends heavily on the specifics of the memory architecture and GPU. From this perspective, mistakes in areas such as sharing of resources, layout transitions, selecting visibility scopes and transferring resource ownership may have different consequences on different architectures. This is a critical point: incorrectly managed resources may not show up on this device due to the implementation options chosen by the vendor, but may prevent the application from running on another device, powered by another vendor's GPU.


Frequently observed application issues with the Vulkan driver on Mali.


External resources ownership.

Resources like presentable images are treated as external to the Vulkan driver, meaning that it doesn’t have ownership of them. The driver obtains a lock on such an external resource on a temporary basis to execute a rendering operation or a series of rendering operations, and when this is done the resource is released back to the system. When ownership changes to the driver, the external resource has to be mapped and given valid entries in the MMU tables so that it can be correctly read/written on the GPU. Once the graphics operations involving the resource are finished, it has to be released back to the system and all its MMU entries invalidated. It is the application's responsibility to tell the driver at which stage the ownership of a given external resource is supposed to change, by providing this information as part of the render pass creation structure or as part of the execution of a pipeline barrier.


For example, when a presentable resource is expected to be in use by the driver, its layout is transitioned from VK_IMAGE_LAYOUT_PRESENT_SRC_KHR to VK_IMAGE_LAYOUT_GENERAL or VK_IMAGE_LAYOUT_COLOR{DEPTH_STENCIL}_ATTACHMENT_OPTIMAL. When rendering to the attachment is done and it's expected to be presented on the display, the layout needs to be transitioned back to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR.
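This round trip forms a small, checkable state machine. A toy sketch: the layout names below are real Vulkan enums, but the transition table is a deliberate simplification for illustration, not the full rules from the specification:

```python
# Illustrative only: the expected layout round trip for a presentable
# (swapchain) image in this scenario.
VALID_TRANSITIONS = {
    "VK_IMAGE_LAYOUT_PRESENT_SRC_KHR": {
        "VK_IMAGE_LAYOUT_GENERAL",
        "VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL",
        "VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL",
    },
    "VK_IMAGE_LAYOUT_GENERAL": {"VK_IMAGE_LAYOUT_PRESENT_SRC_KHR"},
    "VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL": {"VK_IMAGE_LAYOUT_PRESENT_SRC_KHR"},
    "VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL": {"VK_IMAGE_LAYOUT_PRESENT_SRC_KHR"},
}

def check_transition(old_layout, new_layout):
    """True if this simplified model allows the transition."""
    return new_layout in VALID_TRANSITIONS.get(old_layout, set())
```

The real validation layers perform checks of roughly this shape, on the full transition rules, every time a barrier or render pass changes a layout.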


Incorrectly used synchronization

Vulkan object lifetime is another critical area in Vulkan applications. The application must ensure that Vulkan objects, or the pools they were allocated from, are destroyed or reset only when they are no longer in use. The consequence of incorrectly managing object lifetimes is unpredictable; the most likely problem is MMU faults, resulting in rendering issues and loss of the device. Most of these situations can be caught and reported by the validation layers. For example, if the application tries to reset a command pool while a command buffer allocated from it is still in flight, the validation layers should intercept it with the following report:


[DS] Code 54: Attempt to reset command pool with command buffer (0xXXXXXXXX) which is in use


Another example: when the application tries to record commands into a command buffer which is still in flight, the validation layers should intercept it with the following report:


[MEM] Code 9: Calling vkBeginCommandBuffer() on active CB 0xXXXXXXXX before it has completed.

You must check CB fence before this call.


Memory requirements violation.

Vulkan applications are responsible for providing memory backing for image or buffer objects via the appropriate calls to vkBindBufferMemory or vkBindImageMemory. The application must not make assumptions about the memory requirements of an object, even if it is, for example, a VkImage created with VK_IMAGE_TILING_LINEAR tiling, as there is no guarantee of contiguous memory. Allocations must be based on the size and alignment values returned by vkGetImageMemoryRequirements or vkGetBufferMemoryRequirements. Data upload to the sub-resource must then respect the sub-resource layout values: offset to the start of the sub-resource, size, and row/array/depth pitch. Violating the memory requirements of a Vulkan object can result in segmentation faults or MMU faults on the GPU and eventually VK_ERROR_DEVICE_LOST. It’s recommended to run the validation layers as protection against these kinds of issues. While the validation layers can detect situations like memory overflow, cross-object memory aliasing and mapping/unmapping issues, insufficient memory being bound isn't currently detected.
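The size/alignment contract amounts to simple arithmetic. A sketch with made-up values; on a real device the requirements come from vkGetImageMemoryRequirements or vkGetBufferMemoryRequirements, never from assumptions about tiling:

```python
def aligned_offset(offset, alignment):
    """Round offset up to the next multiple of alignment (a power of two),
    as required before binding an object at that offset."""
    return (offset + alignment - 1) & ~(alignment - 1)

# Hypothetical requirements, standing in for a driver's reported values
req_alignment = 4096
bind_offset = aligned_offset(12345, req_alignment)  # first legal offset at or after 12345
assert bind_offset % req_alignment == 0
```

The same rounding applies to row/array/depth pitches when uploading sub-resource data.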



Developers have used reflections extensively in traditional game development and we can therefore expect the same trend in mobile VR games.  In a previous blog I discussed the importance of rendering stereo reflections in VR to achieve a successful user experience and demonstrated how to do this on Unity. In this blog I demonstrate how to render stereo reflections in Unity specifically for Google Cardboard because, while Unity has built-in support for Samsung Gear VR, for Google Cardboard it uses the Google VR SDK for Unity.


This latest VR SDK supports building VR applications on Android for both Daydream and Cardboard. The use of an external SDK in Unity leads to some specific differences when implementing stereo reflections. This blog addresses those differences and provides a stereo reflection implementation for Google Cardboard.


Combined reflections – an effective way of rendering reflections


In previous blogs 1,2 I discussed the advantages and limitations of reflections based on local cubemaps. Combined reflections have proved an effective way of overcoming the main limitation of this rendering technique derived from the static nature of the cubemap. In the Ice Cave demo, reflections based on local cubemaps are used to render reflections from static geometry while planar reflections rendered at runtime using a mirrored camera are used to render reflections from dynamic objects.



Figure 1. Combining reflections from different types of geometry.


The static nature of the local cubemap does have a positive impact in that it allows for faster and higher quality rendering. For example, reflections based on local cubemaps are up to 2.8 times faster than planar reflections rendered at runtime. The fact that we use the same texture every frame guarantees high quality reflections with no pixel instabilities which are present with other techniques that render reflections to texture every frame.


Finally, as there are only read operations involved when using static local cubemaps, the bandwidth use is halved. This feature is especially important in mobile devices where bandwidth is a limited resource. The conclusion here is that when possible, use local cubemaps to render reflections. When combining with other techniques they allow us to achieve higher quality at very low cost.


In this blog I show how to render stereo reflections for Google Cardboard, both for reflections based on local cubemaps and for runtime planar reflections rendered using the mirrored camera technique. We assume here that the shader of the reflective material, which combines reflections from static and dynamic objects, is the same as in the previous blog.


Rendering stereo planar reflections from dynamic objects


In the previous blog I showed how to set up the cameras responsible for rendering planar reflections for left and right eyes. For Google Cardboard we need to follow the same procedure but when creating the cameras we need to correctly set the viewport rectangle as shown below:



Figure 2. Viewport settings for reflection cameras.


The next step is to attach the script below to each reflection camera:


void OnPreRender() {
    SetUpReflectionCamera();
    // Invert winding
    GL.invertCulling = true;
}

void OnPostRender() {
    // Restore winding
    GL.invertCulling = false;
}


The method SetUpReflectionCamera positions and orients the reflection camera, but its implementation differs from the one provided in the previous blog. The Google VR SDK directly exposes the main left and right cameras, which appear in the hierarchy as children of the Main Camera:



Figure 3. Main left and right cameras exposed in the hierarchy.


Note that LeftReflectionCamera and RightReflectionCamera game objects appear disabled because we render those cameras manually.


As we can directly access the main left and right cameras the SetUpReflectionCamera method can build the worldToCameraMatrix of the reflection camera without any additional steps:


void SetUpCamera() {
    // Set up reflection camera.
    // Find the reflection plane: position and normal in world space
    Vector3 pos = chessBoard.transform.position;

    // Reflection plane normal in the direction of the Y axis
    Vector3 normal = Vector3.up;
    float d = -Vector3.Dot(normal, pos) - clipPlaneOffset;
    Vector4 reflectionPlane = new Vector4(normal.x, normal.y, normal.z, d);

    Matrix4x4 reflectionMatrix = Matrix4x4.zero;
    CalculateReflectionMatrix(ref reflectionMatrix, reflectionPlane);

    // Update left reflection camera considering main left camera position and orientation
    Camera reflCamLeft = gameObject.GetComponent<Camera>();

    // Set view matrix
    Matrix4x4 m = mainLeftCamera.GetComponent<Camera>().worldToCameraMatrix * reflectionMatrix;
    reflCamLeft.worldToCameraMatrix = m;

    // Set projection matrix
    reflCamLeft.projectionMatrix = mainLeftCamera.GetComponent<Camera>().projectionMatrix;
}



The code snippet shows the implementation of the SetUpCamera method for the left reflection camera. The mainLeftCamera is a public variable that must be populated by dragging and dropping the Main Camera Left game object. For the right reflection camera the implementation is exactly the same, but uses the Main Camera Right game object instead.


The implementation of the function CalculateReflectionMatrix is provided in the previous blog.
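For reference, the standard reflection matrix about the plane n.x + d = 0 (the result a function like CalculateReflectionMatrix produces) can be sketched in plain Python; the names and layout here are mine, not the blog's Unity code:

```python
def reflection_matrix(nx, ny, nz, d):
    """4x4 row-major reflection about the plane n.x + d = 0 (n unit length):
    M = I - 2*n*n^T in the upper 3x3, translation -2*d*n."""
    return [
        [1 - 2 * nx * nx,    -2 * nx * ny,    -2 * nx * nz, -2 * d * nx],
        [   -2 * ny * nx, 1 - 2 * ny * ny,    -2 * ny * nz, -2 * d * ny],
        [   -2 * nz * nx,    -2 * nz * ny, 1 - 2 * nz * nz, -2 * d * nz],
        [              0,               0,               0,           1],
    ]

def transform(m, p):
    """Apply the 4x4 matrix to a 3D point (w = 1)."""
    x, y, z = p
    return tuple(m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3] for i in range(3))

# Reflecting about the plane y = 0 flips the y coordinate
assert transform(reflection_matrix(0.0, 1.0, 0.0, 0.0), (1.0, 2.0, 3.0)) == (1.0, -2.0, 3.0)
```

Multiplying the main camera's worldToCameraMatrix by this matrix, as the script above does, yields the mirrored view.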


The rendering of the reflection cameras is handled by the main left and right cameras. We attach the script below to the main right camera:


using UnityEngine;
using System.Collections;

public class ManageRightReflectionCamera : MonoBehaviour {
    public GameObject reflectiveObj;
    public GameObject rightReflectionCamera;
    private Vector3 rightMainCamPos;

    void OnPreRender() {
        rightReflectionCamera.GetComponent<Camera>().Render();
        reflectiveObj.GetComponent<Renderer>().material.SetTexture("_ReflectionTex",
            rightReflectionCamera.GetComponent<Camera>().targetTexture);
        rightMainCamPos = gameObject.GetComponent<Camera>().transform.position;
        reflectiveObj.GetComponent<Renderer>().material.SetVector("_StereoCamPosWorld",
            new Vector4(rightMainCamPos.x, rightMainCamPos.y, rightMainCamPos.z, 1));
    }
}





This script issues the rendering of the right reflection camera and updates the reflection texture _ReflectionTex in the shader of the reflective material. Additionally, the script passes the position of the right main camera to the shader in world coordinates.


A similar script is attached to the left main camera to handle the rendering of the left reflection camera.  Replace the public variable rightReflectionCamera with leftReflectionCamera.


The reflection texture _ReflectionTex is updated in the shader by the left and right reflection cameras alternately. It is worth checking in the shader that the reflection cameras are in sync with the main camera rendering. To do this, we can set the reflection cameras to update the reflection texture with different colours. The screenshot below, taken from the device, shows a stable picture of the reflective surface (chessboard) for each eye.



Figure 4. Left/Right main camera synchronization with runtime reflection texture.


The OnPreRender method in the script can be further optimized, as it was in the previous blog, to ensure that it only runs when the reflective object needs to be rendered. Refer to the previous blog for how to use the OnWillRenderObject callback to determine when the reflective surface needs to be rendered.


Rendering stereo reflections based on local cubemap from static objects


To render reflections based on static local cubemaps we need to calculate the reflection vector in the fragment shader and apply the local correction to it. The locally corrected reflection vector is then used to fetch the texel from the cubemap and render the reflection 1. Rendering stereo reflections based on static local cubemaps means that we need to use different reflection vectors for each eye.
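The local correction step is essentially a ray-box intersection: intersect the reflection vector with the scene's bounding volume, then rebuild the fetch vector from the cubemap's capture position. A scalar Python sketch of the idea; function and variable names are mine, and a real implementation lives in the fragment shader:

```python
def local_correct(frag_pos, R, bbox_min, bbox_max, cubemap_pos):
    """Correct reflection vector R (leaving frag_pos) for a local cubemap
    captured at cubemap_pos inside the axis-aligned box [bbox_min, bbox_max].
    Assumes frag_pos is inside the box and R has no zero component."""
    # Distance along R to the box walls (slab method): per axis, the far
    # intersection; overall, the nearest exit point.
    t = min(
        max((bbox_max[i] - frag_pos[i]) / R[i], (bbox_min[i] - frag_pos[i]) / R[i])
        for i in range(3)
    )
    hit = [frag_pos[i] + t * R[i] for i in range(3)]    # point on the box wall
    return [hit[i] - cubemap_pos[i] for i in range(3)]  # corrected fetch vector
```

Without this correction, the cubemap is sampled as if it were infinitely far away, and reflections of nearby geometry drift as the camera moves.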


The view vector D is built in the vertex shader and is passed as a varying to the fragment shader:


D = vertexWorld - _WorldSpaceCameraPos;


In the fragment shader, D is used to calculate the reflection vector R, according to the expression:


R = reflect(D, N);


where N is the normal to the reflective surface.


To implement stereo reflections we need to provide the vertex shader with the positions of the left and right main cameras to calculate two different view vectors and thus two different reflection vectors.


The last instruction in the scripts attached to the main left and right cameras sends the position of the main left/right cameras to the shader and updates the uniform _StereoCamPosWorld. This uniform is then used in the vertex shader to calculate the view vector:


D = vertexWorld - _StereoCamPosWorld;
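Numerically, the per-eye calculation is just a subtraction and a reflect. A small Python sketch of what the vertex/fragment shader pair computes; the eye positions are invented for illustration:

```python
def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def reflect(D, N):
    """GLSL-style reflect: R = D - 2*dot(D, N)*N, with N unit length."""
    d = sum(D[i] * N[i] for i in range(3))
    return [D[i] - 2 * d * N[i] for i in range(3)]

vertex_world = [0.0, 0.0, 0.0]
N = [0.0, 1.0, 0.0]             # reflective surface normal (chessboard faces up)
left_eye = [-0.03, 1.6, -2.0]   # invented eye positions; in Unity they come
right_eye = [0.03, 1.6, -2.0]   # from the main left/right cameras

R_left = reflect(sub(vertex_world, left_eye), N)    # the two reflection vectors
R_right = reflect(sub(vertex_world, right_eye), N)  # differ slightly per eye
```

That small difference between R_left and R_right is exactly what produces the sensation of depth in the reflection.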


Once reflections from both static and dynamic objects have been implemented in “stereo mode” we can feel the depth in the reflections rendered in the chessboard when seen through the Google Cardboard headset.


Figure 5. Stereo reflections on the chessboard.



The local cubemap technique for reflections allows rendering of high quality and efficient reflections from static objects in mobile games. When combined with other techniques it allows us to achieve higher reflection quality at very low cost.


Implementing stereo reflections in VR contributes to the realistic building of our virtual world and achieving the sensation of full immersion we want the VR user to enjoy. In this blog we have shown how to implement stereo reflections in Unity for Google Cardboard with minimum impact on performance.




  1. Reflections Based on Local Cubemaps in Unity
  2. Combined Reflections: Stereo Reflections in VR

The recently released Mali-G71 GPU is our most powerful and efficient graphics processor to date and is all set to take next generation high performance devices by storm. The Mali family of GPUs is well known for providing unbeatable flexibility and scalability in order to meet the broad-ranging needs of our customers but we’ve taken another step forward with this latest product. ARM®’s brand new Bifrost architecture, which forms the basis of the Mali-G71, will enable future generations of Mali GPUs to power all levels of devices from mass market to premium mobile. In a few short blogs I’m going to take a look at some of the key features that make Bifrost unique and the benefits they bring to ARM-powered mobile devices.


The first feature we’re going to look at is the innovative introduction of clauses for shader execution. In a traditional setup, the control flow might change between any two instructions, so we need to make sure that the execution state is committed to the architectural registers after each instruction and retrieved at the start of the next. This means the instructions are executed sequentially, with a scheduling decision made before each one.


Classic Instruction Execution


The revolutionary changes ARM has implemented in the Bifrost architecture mean instructions are grouped together and executed in clauses. These clauses provide more flexibility than a Very Long Instruction Word (VLIW) instruction set in that they can be of varying lengths and can contain multiple instructions for the same execution unit. However, the control flow within each clause is much more tightly controlled than in a traditional architecture. Once a clause begins, execution runs from start to finish without interruption or loss of predictability, which means the control flow logic doesn’t need to be executed after every individual instruction. Branches may only appear at the end of clauses, so their effects are isolated in the system, and a quad’s program counter can never change within a clause, allowing us to eliminate costly edge cases. If you examine how typical shaders are written, you will also find that they have large basic blocks, which automatically makes them a good fit for the clause system. Since instructions within a clause execute back-to-back without interruption, we get the predictability we need to optimize aggressively.
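As a toy illustration of the grouping rule (a branch may only end a clause), here is how an instruction stream might be split; the instruction names and the maximum clause length are invented, not the real Bifrost encoding:

```python
def split_into_clauses(instructions, max_len=8):
    """Group a linear instruction stream into clauses: a branch always ends
    its clause, and clauses are capped at max_len instructions."""
    clauses, current = [], []
    for instr in instructions:
        current.append(instr)
        if instr.startswith("branch") or len(current) == max_len:
            clauses.append(current)
            current = []
    if current:
        clauses.append(current)
    return clauses

program = ["fma", "add", "ld", "branch.eq", "mul", "add", "branch"]
clauses = split_into_clauses(program)
# clauses == [['fma', 'add', 'ld', 'branch.eq'], ['mul', 'add', 'branch']]
```

Every instruction inside a clause is then known to execute back-to-back, which is what makes the aggressive optimization described above possible.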


Clause Execution


As is the case in a classic instruction set, the instructions work on values stored in a register file. Each instruction reads values from the registers and then writes the results back to the same register file shortly afterwards. Instructions can then be chained in sequence because each register is known to retain the value written to it.

The register file itself is generally something of a power drain due to the large number of accesses it receives. Since wire length contributes to dynamic power (long wires have more capacitance), the larger the register file, or the further away it is, the higher the power required to address it. The Bifrost architecture allocates a thread of execution to exactly one execution unit for its entire duration, so that its working values can be stored in that Arithmetic Logic Unit (ALU)’s register file close by. Another optimization uses the clause system's predictability to eliminate back-to-back accesses to the register file, further reducing the overall power requirements for register access.


In a fine-grained, multi-threaded system we need to allow threads to request variable-latency operations, such as memory accesses, and sleep and wake, very quickly. We implement this using a lightweight dependency system. Dependencies are discovered by the compiler, which removes runtime complexity, and each clause can both request a variable-latency operation and also depend on the results of previous operations. Clauses always execute in order, and may continue to execute even if unrelated operations are pending. While waiting for a previous result, clauses from other quads can be scheduled, and this gives us a lot of run-time flexibility to deal with variable latencies with manageable complexity. Again, by executing this only at clause boundaries we reduce the power cost of the system.


The implementation of clause shaders not only reduces overhead by spreading it across several instructions, it also guarantees the sequential execution of all instructions contained in a clause, giving us significant scope for optimization thanks to that predictability, along with an overall power saving. This is just one of the many features of the Bifrost architecture which will allow new Mali based systems to perform more efficiently than ever before, including for high end use cases such as virtual reality and computer vision.


Many thanks to seanellis for his technical wizardry and don't forget to check back soon for the next blog in the Bitesize Bifrost series!

Have your say on the development of the Streamline and Mali Graphics Debugger (MGD) products. Is there a feature in MGD/Streamline that you would love to have and that would make your development so much easier? Or is there a particular part of MGD/Streamline that frustrates you, where you have ideas on how we can improve it? Fill in the short survey below and let us know; we are always looking for feedback to improve our products in a way that matters to you.

Late May sees the annual Pint of Science events take place in pubs (and other somewhat unusual venues) around Cambridge, the UK and indeed the rest of the world. Designed to combine the city’s love of learning with its love of libation, the event has grown more popular every year with new venues and themes popping up every which way you look. Covering every aspect of science, and arguably every nuance of drinking culture, these volunteer run events are a great way to learn and laugh simultaneously.


ARM is always keen to support the furthering of knowledge and share some of the wisdom of our experts with the masses. The recent announcement of our acquisition of Apical, the Loughborough-based imaging and vision gurus, gave us a whole new thing to talk about this year: computer vision. ARM Fellow and general legend Jem Davies was on hand at the Architect pub to talk about why he thinks this is such an important industry development. He explained the different approaches required to make the most of the information we can receive from computer vision, as well as the difficulty of processing and somehow interpreting the overwhelming quantity of mobile data produced every day. With 3.7 exabytes of mobile data traffic every single month, and more than half of that being mobile video, it was eye-opening to think about how we can possibly store it all, let alone view it and take anything meaningful from it.



Jem explained that whilst seeing and understanding images comes naturally to us, our fundamental lack of understanding about what actually happens in the brain means it’s very difficult to provide this ability to computers. Deep learning and neural networks can come into play here, letting us effectively train an artificial ‘brain’ to understand what we’re showing it and why it’s important. This is where the mountain of data comes in handy, because it acts as a textbook for the computer to learn about our world and begin to recognise it.


For example, Spirit™, part of the Apical product line, is a system which takes object recognition and expands upon it to extract huge amounts of valuable data. It recognises people even in crowded and confusing situations and can be used to help assess large groups and provide early warnings of suspicious activity or potential trouble. Just think how valuable something like this could be in spotting dangerous over-crowding on a subway platform or a collapsed reveller at a concert. Not only do computer vision and deep learning enable these possibilities, they will also be key to the mainstream adoption of things like autonomous vehicles, providing the mechanism that allows the vehicle to see what’s happening around it and make smart decisions for safety and efficiency. It soon became clear how many exciting opportunities this kind of technology presents and I for one can’t wait to see where we can take it next.

attendees.jpg

Meanwhile, across town at La Raza some of us were getting chemical with our educational evening. Molecular cocktails were the order of the day and we were shown some super cool techniques like gelification (yes, apparently it’s a word), which allows you to make solid cocktails in a range of different ways, the miniature jelly Long Island iced tea being a highlight for me. We also had the opportunity to try ‘spherification’, the technique by which you can make tasty fruity bubbles to add to a cocktail. Using a sodium alginate solution, you can cause a skin to form around a drop of fruit syrup (or similar), holding it together whilst still keeping the centre liquid. It was great to be able to have a go at it ourselves and see just how much fun science can be.


With Pint of Science events taking place all over the world I highly recommend you check them out next time they’re in town. Not only can you learn a lot but you can also have a lot of fun in the process. Initiatives like this are really helping open up the sciences to the masses and get a lot more people interested in the tech that makes our world work.

Mali, the #1 shipping family of GPUs in the world, is celebrating 10 years with ARM this month! In honour of the occasion I’m going to take a look at some of the key milestones along the way and how Mali has developed to become the GPU of choice for today’s devices. Back in early 2006 Mali was just a twinkle in ARM’s eye; it wasn’t until June of that year that ARM announced the acquisition of Norwegian graphics company Falanx, and ARM Mali was born.

Mali 10_RGB_Balloon.png

This of course is not the real beginning of Mali’s story. Before Mali became part of the ARM family she was created by the Falanx team in Trondheim, Norway. In 1998 a small group of university students were tinkering with CPUs when someone suggested they try their hand at graphics. By 2001 a team of five had managed to prototype the Malaik 3D GPU with the intention of targeting the desktop PC market. They scouted a series of potential investors and whilst there was plenty of interest, they never quite got the support they were hoping for in order to break into the market.

Capture.JPGOriginal (and short-lived) Falanx branding 2001, and their final logo, edvardsorgard's handwriting codified


Research showed them that the mobile market had the most potential for new entrants and that an IP model was potentially their best option. With that in mind, they set about building the GPU to make it happen. Having revised the architecture to target the smaller, sleeker requirements of the mobile market, the Falanx team felt the Malaik name needed streamlining too.

4 falanx founders.jpgThe four final Falanx founders


Mario Blazevic, one of the founders originally from Croatia, recognized “mali” as the Croatian word for “small” and this was deemed just right for the new mobile architecture. So, armed with the very first incarnation of Mali, they set about selling it. The prototype became Mali-55 and the SoC which featured it reached great success in millions of LG mobile phones. By this time they were six people and one development board and the dream was alive and well.


Meanwhile, ARM was very interested in the GPU market and had an eye on Falanx as a potential provider. Jem Davies, ARM fellow and VP of technology, was convinced the Falanx team’s culture, aspiration and skillset were exactly the right fit and ultimately recommended we move forward. Over the course of a year, and a few sleepless nights for the Falanx team, the conversations were had, the value was established and the ARM acquisition of Falanx was completed on June 23rd 2006.

6 falanx whole team.jpg  The Falanx team at acquisition


In February 2007 the Mali-200 GPU was the first to be released under the ARM brand and represented the start of a whole new level of graphics performance. It wasn’t long before it became apparent that the Mali-200 had a lot of unexploited potential, and so its multi-core version, the Mali-400, entered development. The first major licence proved the catalyst for success when its performance took the world by storm, and Mali-400 was well on its way to where it stands today, as the world’s most popular GPU with a market share of over 20% all by itself. Mali-400 is a rockstar of the graphics game and still the go-to option for power-sensitive devices.


In late 2010 the continued need for innovation saw us announce the start of a ‘New Era In Embedded Graphics With the Next-Generation Mali GPU’. The Mali-T604, the first GPU to be built on the Midgard architecture, prompted a ramping up of development activities and Mali began to expand into the higher performance end of the market whilst still maintaining the incredible power efficiency so vital for the mobile form.


At Computex 2013 the Mali-V500 became the first ARM video processor and complemented the Mali range of GPUs perfectly. Now on its way to a third generation, the Mali VPU is a product gaining more and more importance, particularly in emerging areas like computer vision and content streaming. Just a year on from that we were celebrating the launch of the Mali-DP500 display processor, and the very first complete Mali Multimedia Suite became a possibility. Part of the strength of the ARM Mali Multimedia Suite is the cohesive way the products work together and fully exploit bandwidth saving technologies like ARM Frame Buffer Compression. This allows our partners to utilise an integrated suite of products and reduce their time to market. Another key Mali milestone came in mid-2014 when the Mali-T760 GPU became a record breaker by appearing in its first SoC configuration less than a year after it was launched. By the end of the year ARM partners had shipped 550 million Mali GPUs during 2014.


This year saw the launch of the third generation of Mali GPU architecture, Bifrost. Bifrost is designed to meet the growing needs and complexity of mobile content like Virtual Reality and Augmented Reality, and new generation graphics APIs like Vulkan. The first product built on the Bifrost architecture is the Mali-G71 high performance GPU for premium mobile use cases. Scalable to 32 cores, it is flexible enough to allow SoC vendors to customise the perfect balance of performance and efficiency and differentiate their device for their specific target market.


Today Mali is the number 1 shipping GPU in the world: 750 million Mali-based SoCs were shipped in 2015 alone. As the Mali family of GPUs goes from strength to strength I’d like to take this opportunity to wish her and her team a very happy birthday!

6 mali infographic.png

We’ve recently been talking about a brand new video processor about to join the ARM Mali Multimedia Suite (MMS) of GPU, Video & Display IP. Egil, our next generation Mali video processor due for release later this year, takes a step forward in functionality and performance to meet the needs of advancing video content. With more than 500 hours of video uploaded to YouTube every single minute, it’s no surprise that optimizing video processing in sync with the full suite of products has been a key focus for us.


The MMS comprises software drivers and hardware optimized to work together right out of the box, with the aim of maximizing efficiency, enabling faster time to market and vastly reducing potential support requirements. It has been designed to optimize performance between the various IP blocks through use of bandwidth saving technologies such as ARM Frame Buffer Compression (AFBC). AFBC can be implemented across the entire range of multimedia IP within an SoC and, depending on the type of content, can produce bandwidth savings of up to 50%. An AFBC-capable display controller or an AFBC-capable GPU can directly read the compressed frames produced by an AFBC-capable video decoder, such as Egil, reducing overall pipeline latency.


ARM approaches video processing in a different way from other IP providers. We believe it is better to provide all the codecs required in a unified Video IP solution, controlled by a single API, making it easier to develop flexible, multi-standard SoCs. To do this we analyse the codecs to be supported, establish which functions are required and develop hardware blocks to address each function - such as motion estimation & compensation, transforms, bitstreams and so on.

Mali_Video-Block-Diagram-Expanded.png


The hardware IP is developed as a core to operate at a set performance level, with multiple cores being used to address higher performance points. The ‘base core’ in Egil is designed to operate at 1080p60 (1080p at 60 frames per second), which will provide two Full HD encode and/or decode streams running simultaneously at 30 frames per second – assuming a 28HPM manufacturing process. To address 4K UHD 2160p at 60 frames per second (as for a 4K UHD TV) would require a four-core implementation.


At the same time as developing the hardware IP we also develop firmware to manage the video IP and interface with the host software. The firmware manages codec implementation, multi-core synchronization and communication requirements as well as additional specialist functions such as error concealment, picture reordering and rate control, saving the hardware and host CPU from getting involved in these steps at all. The result is unified video IP providing an easy to use, multi-standard, scalable solution capable of simultaneous encode and decode of multiple video streams, potentially even using different codecs at different resolutions!

Multi_Video_TV.png


Brand new in Egil is VP9 encode and decode capability, making it the first multi-standard video processor IP to support VP9 encode. We’ve also significantly enhanced HEVC encode, and we deliver an Android reference software driver. Whilst currently OpenMAX IL based, this will be updated to V4L2 as that is introduced in future versions of Android. This driver takes responsibility for setting up a particular video session, allocating memory and gating power dynamically, and dramatically reduces the CPU load. The built-in core scheduler manages multiple encode/decode streams and maps single or multiple video streams across multiple cores for maximum performance. This makes the new Mali video processor perfect for video conferencing and allows you to seamlessly share your viewed content with others. Not only that, but it means you can view multiple content streams at once, allowing you to keep one eye on the game throughout your meeting!


Another exciting aspect of ARM’s presence in the video space is our involvement with the Alliance for Open Media. As a founding member we’ve been working with leading internet companies in pursuit of an open and royalty-free AOMedia Video codec. We are heavily involved in this Joint Development Foundation project to define and develop media technologies addressing marketplace demand for a high quality open standard for video compression and delivery over the web. The timeline for the new codec, AV1, is to freeze the AV1 bitstream in Q1 2017, with first hardware support expected in the following year.


The multi-standard nature of our new processor allows both encode and decode, supporting new and legacy codecs alike, all in a single piece of IP. Scalable to every level of use case, this next generation processor provides the perfect balance of efficiency and performance within the low power requirements of the mobile device. Egil is due for launch later in 2016.

I just lost a few hours trying to play with the Z index between draw calls, in order to try Z-layering, as advised by peterharris in my question For binary transparency : Clip, discard or blend ?.

However, for reasons I did not understand, the Z layer seemed to be completely ignored. Only the glDraw calls order was taken into account.


I really tried everything:

glEnable(GL_DEPTH_TEST);
glDepthRangef(0.1f, 1.0f);


Still... each glDrawArrays call drew pixels over previously drawn pixels that had a lower Z value. I switched the function provided to glDepthFunc, switched the Z values, ... same result.

I really started to think that Z-layering only worked for one draw batch...


Until I searched the OpenGL wiki for "Depth Buffer" information and stumbled upon Common Mistakes: Depth Testing Doesn't Work:

Assuming all of that has been set up correctly, your framebuffer may not have a depth buffer at all. This is easy to see for a Framebuffer Object you created. For the Default Framebuffer, this depends entirely on how you created your OpenGL Context.

"... Not again ..."


After a quick search for "EGL Depth buffer" on the web, I found the EGL manual page, eglChooseConfig - EGL Reference Pages, which stated this:



    Must be followed by a nonnegative integer that indicates the desired depth buffer size, in bits. The smallest depth buffers of at least the specified size is preferred. If the desired size is zero, frame buffer configurations with no depth buffer are preferred. The default value is zero.

    The depth buffer is used only by OpenGL and OpenGL ES client APIs.

The solution

Adding EGL_DEPTH_SIZE, 16 to the configuration array provided to eglChooseConfig solved the problem.
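For reference, a minimal sketch of such a configuration array might look like this (the surrounding attributes are illustrative; EGL_DEPTH_SIZE is the actual fix):

```c
#include <EGL/egl.h>

/* Sketch of an attribute list for eglChooseConfig. Without
 * EGL_DEPTH_SIZE the default is zero, so configs with no depth buffer
 * are preferred and depth testing silently does nothing. */
static const EGLint config_attribs[] = {
    EGL_RED_SIZE,        5,
    EGL_GREEN_SIZE,      6,
    EGL_BLUE_SIZE,       5,
    EGL_DEPTH_SIZE,     16,  /* the crucial line: request a 16-bit depth buffer */
    EGL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT,
    EGL_NONE                 /* terminator */
};
```

You would then pass this as the second argument of eglChooseConfig, e.g. `eglChooseConfig(display, config_attribs, &config, 1, &num_config);`.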


I should have known.

Here's a quick tip to convert pictures to raw format with FFMPEG, in order to use them as textures in OpenGL with no extra conversion:

BMP files

ffmpeg -vcodec bmp -i /path/to/texture-file.bmp -vcodec rawvideo -f rawvideo -pix_fmt rgb565 texture.raw

PNG files

ffmpeg -vcodec png -i /path/to/texture-file.png -vcodec rawvideo -f rawvideo -pix_fmt rgb565 texture.raw


Loading a raw format picture as a texture in OpenGL

int fd = open("texture.raw", O_RDONLY);
read(fd, texture_buffer, raw_picture_file_size_in_bytes);
close(fd);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, picture_width, picture_height, 0, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, texture_buffer);


The biggest advantage is that FFMPEG will implicitly flip the picture upside-down during the conversion, meaning that the upper-left of your original texture is at UV coordinates (0.0, 0.0) instead of (0.0, 1.0).


Got this quick tip from BMPとRAWデータ(RGB8888/RGB565)の相互変換メモ - Qiita (a Japanese note on converting between BMP and raw RGB8888/RGB565 data)


Eye Heart VR

Posted by freddijeffries Jun 3, 2016

eye heart.png

Welcome to the next installment of my VR blog series. In previous VR blogs we’ve considered the importance of clear focus to a VR experience, as well as the essential requirement to keep ‘motions to photons’ latency below 20ms in order to avoid unnecessary visual discomfort (and vomiting). This time we’re going to look at eye tracking and the impact it could have on the way we use VR in the future. Eye tracking is not new – people have been doing it for nearly twenty years – but head mounted displays for VR could be the catalyst technology needed to unlock its true potential.


Say what you see

One of the aspects of VR that is still presenting a challenge is how to provide a quality user experience when navigating a VR environment. Current systems have a couple of options: The Samsung Gear VR uses a control pad on the side of the display to allow you to press and hold to move or tap to engage with objects. Google recently announced they will release a motion-enabled remote controller for their Daydream VR platform later this year and all tethered VR systems have fully-tracked controllers that mirror your hand movements in the virtual world. Alongside these there’s also growing interest in making more use of your eyes.


Eye tracking is currently getting a lot of hype on the VR circuit. The ability to follow the path of your pupil across the screen has wide-ranging uses for all sorts of applications, from gaming to social media to shopping. We don’t necessarily have full control over our eye movements as our eyes function as an extension of our brain. This unconscious motion is very different from how we interact with our hands, so there is work still to be done to design just the right user interfaces for eye tracking. How, for example, do you glance across something without accidentally selecting it? Just think of the dangerous spending possibilities of selecting items to add to your cart simply by staring longingly at them!


Several eye tracking solutions are emerging from companies such as The Eye Tribe, Eyefluence and SMI, as well as eye tracking headsets such as FOVE. At GDC 2016 MediaTek were able to demonstrate eye tracking with the Helio x20. In all cases the path of your vision is minutely tracked throughout the VR experience. The only calibration typically required is a simple process of looking at basic shapes in turn so the sensors can latch on to your specific eye location and movement. This suggests eye tracking could be easy to adopt and use with mainstream audiences without specialist training. The first use for eye tracking that springs to mind is, as usual, gaming controls and there have indeed been demos released using modified Oculus and Samsung Gear VR headsets which use a built in eye tracking sensor to control direction and select certain objects simply by focussing steadily on them. FOVE have also shown how a depth-of-field effect could be driven from the area of the scene you are looking at, to give the illusion of focal depth. 


An additional potential benefit of eye tracking in VR is the ability to measure the precise location of each eye and use it to calculate the interpupillary distance (IPD) of the user. This measurement is the distance between the centres of your pupils and changes from person to person.  Some VR headsets, such as the HTC Vive, provide a mechanical mechanism for adjusting the distance between the lenses to match your IPD but many more simply space the lenses to match the human average. Having an accurate IPD measurement of the user would allow for more accurate calibration or image correction, resulting in a headset that would always perfectly suit your vision. Your eyes can also move slightly within the confines of the headset. Being able to detect and adjust for this in real time would allow even more precise updates of the imagery to further enhance the immersion of the VR experience.

eye.JPGEye tracking allows the view to update in real time based on exactly where you’re looking in the scene


Beneficial blurriness

Foveated rendering is a power saving rendering technique inspired by the way our eyes and vision work. We have a very wide field of vision with the ability to register objects far to the side of the direction in which we are looking. However, those images and objects in the edges of our field of vision do not appear in perfect clarity to us. This is because our fovea – the small region in the centre of our retina that provides clear central vision – has a very limited field of view. Without eye tracking we can’t tell where the VR user is looking in the scene at any given moment, so we have to render the whole scene to the highest resolution in order to retain the quality of the experience. Foveated rendering uses eye tracking to establish the exact location of your pupil and display only the area of the image that our fovea would see in full resolution. This allows the elements of the scene that are outside of this region to be rendered at a lower resolution, or potentially multiple lower resolutions at increasing distances from the focal point. This adds complexity but saves GPU processing power and system bandwidth and reduces the amount of pressure placed on the power limits of the mobile device, whilst your brain interprets the whole scene as appearing in high resolution. This therefore allows headset manufacturers to utilize this processing power elsewhere, such as in higher quality displays and faster refresh rates.


The High Performance range in the ARM® Mali™ family of GPUs is ideal for the heavy requirements VR places on the mobile device. Achieving ever higher levels of performance and energy efficiency, the flexible and scalable nature of Mali GPUs allows partners to design their SoC to their individual requirements. Partners Deepoon and Nibiru have recently launched awesome Mali-powered standalone VR devices for this very reason and the recently released Mali-G71 GPU takes this another step further. Not only does it double the number of available cores but it also provides 40% bandwidth savings and 20% more energy efficiency to allow SoC manufacturers to strike their ideal balance between power and efficiency.

VR_Foveated_rendering.jpgHow foveated rendering displays only the immediate field of view in high resolution


Verify with vision

Another potentially game-changing use of eye tracking is for security and authentication. Retinal scanning is not an unfamiliar concept in high-end security systems so to extend the uses of eye tracking to this end is a logical step. The ability to read the user’s retinal ID for in-app purchases, secure log in and much more not only reduces boring verification steps but simultaneously makes devices and applications much more secure than they were before! So once you’ve used your unique retinal ID to access your virtual social media platform, it doesn’t stop there right? Of course not, social VR would be a weird place to hang out if your friends’ avatars never looked you in the eye. Eye tracking can take this kind of use case to a whole new level of realism and really start to provide an alternative way to catch up with people you maybe rarely get to see in person. Early examples are already inciting much excitement for the added realism of being able to interpret eye contact and body language.


Seemingly simple innovations like this can actually have a huge impact on an emerging technology like VR and provide incremental improvements to the level of quality we’re able to reach in a VR application. Foveated rendering in particular is a huge step up for bandwidth reduction in the mobile device so with advancements like these we’re getting ever closer to making VR truly mainstream.


Stride argument in OpenGL ES 2.0

Posted by myy May 31, 2016

I'm putting this information here, as it took me way more time than it should have to understand how the stride argument works in glVertexAttribPointer.

This argument is extremely important if you want to pack data in the same order as they are accessed by the CPU/GPU.


When reading the manual, I thought that stride was the number of bytes the OpenGL implementation would skip after reading size elements from the provided array.


However, it actually works like this. glVertexAttribPointer will:

  • Start reading data from the provided address,
  • Read size elements from the address,
  • Pass the values to the corresponding GLSL attribute,
  • Jump stride bytes from the address it started reading from,
  • Repeat this procedure count times, where count is the third argument passed to glDrawArrays.


So, for example, let's take a float array stored at memory address 0x20000, containing the following 15 elements:

GLfloat arr[] = {
  /* 0x20000 */ -1.0f, 1.0f, 1.0f, 0.0f, 1.0f,
  /* 0x20014 */ -1.0f, 0.0f, 1.0f, 0.0f, 0.0f,
  /* 0x20028 */  0.0f, 1.0f, 1.0f, 1.0f, 1.0f
};


If you use glVertexAttribPointer like this:

glVertexAttribPointer(your_glsl_attrib_index, 3, GL_FLOAT, GL_FALSE, 20, arr);


And then use glDrawArrays, the OpenGL implementation will do something akin to this:

  • Copy the address arr (0x20000).
  • Start reading {-1.0f, 1.0f, 1.0f} from the copied address (referred as copy_arr here) and pass these values to the GLSL attribute identified by your_glsl_attrib_index.
  • Do something like copy_arr += stride, stepping forward stride bytes. At this point, copy_arr == 0x20014.

Then, on the second iteration, it will read {-1.0f, 0.0f, 1.0f} from the new copy_arr address, redo copy_arr += stride and continue like this for each iteration.


Here's a concise diagram summarising this.


