
ARM Mali Graphics



Reducing power consumption and optimizing CPU utilization in a multi-core architecture are key to satisfying the increasing demand for sustained, high-quality graphics while maintaining long battery life. The new Vulkan API facilitates this, and this blog covers a real demo recording that shows the improvements in power efficiency and CPU usage Vulkan provides compared to OpenGL ES.


Vulkan unifies graphics and compute across multiple platforms in a single API. Until now, developers had the OpenGL graphics API for desktop environments and OpenGL ES for mobile platforms. The GL APIs were designed for previous generations of GPU hardware, and while hardware capabilities and technology evolved, the APIs took somewhat longer to catch up. With Vulkan, the latest capabilities of modern GPUs can be exploited.


Vulkan gives developers far more control of the hardware resources than OpenGL ES. For instance, memory management in the Vulkan API is much more explicit than in previous APIs. Developers can allocate and deallocate memory in Vulkan, whereas in OpenGL the memory management is hidden from the programmer.
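As a rough illustration of this explicitness (this is not taken from the demo, and the variable names are illustrative), allocating and releasing the backing memory for a buffer in Vulkan looks something like the sketch below; memory-type selection and error handling are omitted for brevity.

VkMemoryRequirements reqs;
vkGetBufferMemoryRequirements(device, buffer, &reqs);

VkMemoryAllocateInfo allocInfo = {};
allocInfo.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize  = reqs.size;
allocInfo.memoryTypeIndex = chosenMemoryType;   // picked from reqs.memoryTypeBits

VkDeviceMemory memory;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);   // explicit allocation
vkBindBufferMemory(device, buffer, memory, 0);

// ... use the buffer ...

vkFreeMemory(device, memory, nullptr);                    // explicit release, decided by the application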


The Vulkan API has much lower CPU overhead than OpenGL ES, thanks to its support for multithreading. Multithreading is a key feature for mobile, as mainstream mobile devices generally have between four and eight cores.
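To sketch how this works in practice (illustrative code, not the demo's source, with hypothetical names), each recording thread typically gets its own command pool, records a secondary command buffer for its slice of the scene, and the main thread then stitches the results into a single primary command buffer:

#include <thread>
#include <vector>
#include <vulkan/vulkan.h>

// One command pool per thread: Vulkan command pools require external
// synchronization, so sharing a single pool across threads is not allowed.
static void recordChunk(VkDevice device, uint32_t queueFamily, VkCommandBuffer* outCmd)
{
    VkCommandPoolCreateInfo poolInfo = {};
    poolInfo.sType            = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
    poolInfo.queueFamilyIndex = queueFamily;
    VkCommandPool pool = VK_NULL_HANDLE;
    vkCreateCommandPool(device, &poolInfo, nullptr, &pool);

    VkCommandBufferAllocateInfo allocInfo = {};
    allocInfo.sType              = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
    allocInfo.commandPool        = pool;
    allocInfo.level              = VK_COMMAND_BUFFER_LEVEL_SECONDARY;
    allocInfo.commandBufferCount = 1;
    vkAllocateCommandBuffers(device, &allocInfo, outCmd);
    // vkBeginCommandBuffer / record this thread's draws / vkEndCommandBuffer ...
}

void recordSceneInParallel(VkDevice device, uint32_t queueFamily,
                           std::vector<VkCommandBuffer>& secondaryCmds)
{
    std::vector<std::thread> workers;
    for (size_t i = 0; i < secondaryCmds.size(); ++i)
        workers.emplace_back(recordChunk, device, queueFamily, &secondaryCmds[i]);
    for (std::thread& t : workers)
        t.join();
    // The primary command buffer then gathers the results with vkCmdExecuteCommands.
}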


On the left-hand side of the video image, you can see the OpenGL ES CPU utilization at the bottom. The OpenGL ES API makes a single CPU core work very hard. On the right-hand side, you can see the difference the Vulkan API brings with improved threading. The multithreading capability allows the system to balance the workload across multiple CPUs, lower the voltage and frequency, and run the code on the smaller (LITTLE) CPU cores.



Fig.1 Video screen capture, showcasing CPU utilisation


With regards to energy consumption, the video shows an energy dial at the top which demonstrates the improved system efficiency that Vulkan brings. Running the sequence through to the end, measured on a real SoC, the multithreading benefits bring a considerable saving in energy consumption. Even at this very early stage of software development on Vulkan, we saw an overall system power saving of around 15%.


Fig.2 Video screen capture, showcasing overall system power saving


To get you started using the Vulkan API, there is a wealth of developer resources here, from an SDK with sample code, to tutorials and developer tools to profile and debug your Vulkan application.

On 29th September, as promised at Google I/O, Unity released the first developer preview of their upcoming Vulkan renderer. Developers have been eagerly awaiting this release since Android Nougat was announced on 22nd August with Vulkan support as one of its key features.


Here at ARM we have been supporting graphics developers’ uptake of the Vulkan API since Khronos launched it publicly in February. ARM Mali graphics debugger and driver support were made available on release day and we’ve subsequently provided a set of educational developer blogs on using Vulkan, a Vulkan SDK and sample code. We also gave a series of talks and demonstrations on Vulkan at GDC, the world’s largest game developer conference, just a few weeks after the API was launched. All of our developer resources and content can be found here:

Fig. 1: An example of a Vulkan demo developed by ARM


Developer resources and tools are not all we provide at ARM. Not only were we heavily involved in the development of Vulkan as part of Khronos’s Working Group, but we’ve also collaborated closely with Unity, the leading game engine platform downloaded by over 5 million game developers, to support this renderer release.

The results of this collaboration have been great news for mobile game developers as the ARM Mali-based Samsung Galaxy S7 (European version) has been recommended (and tested) as the first Android developer platform to run Unity’s initial Vulkan Renderer Preview. Developers can download the first preview release here: Get the experimental build from Unity’s beta page.


At this early stage of development, the main benefit Vulkan brings to the Unity engine is speed, thanks to the multithreading feature. Current mobile devices have multi-core CPUs, and the ability to carefully balance workloads across these cores is key to achieving these improvements. The increase in power efficiency is realized by balancing workloads across several CPUs to reduce voltage and frequency, while the increase in performance and speed comes from the ability to use the full compute resources of the CPU cores.

We in the ARM Mali team are pleased to be able to support such important industry advancement and look forward to seeing what our broad ecosystem of developers can do with the first Vulkan Renderer on Unity!


To learn more about Unity's Vulkan Renderer Preview:

In previous blogs we’ve looked at the scalability of the Mali™ family of GPUs which allows partners’ implementations to be tailored to fit all levels of device across multiple price, performance and area points. We’ve also taken a closer look at a high performance Mali implementation in Nibiru’s standalone VR headsets.


This time we’re exploring the other end of the Mali spectrum: Ultra low power. Today, the most shipped GPU in the world is still the Mali-400. Based on our original Utgard architecture, Mali-400 is the GPU of choice for devices where minimizing power consumption is key. Since the Mali-400 GPU was released, further optimizations have been applied in the design and implementation of subsequent Ultra-low power GPUs, Mali-450 and Mali-470.


As you’ll know if you’ve read my previous blogs, VR places a whole lot of pressure on the power and thermal limitations of the mobile form factor. To ensure a great, immersive experience you need a solid framerate, high resolution and super low latency, amongst other things. Achieving this for top-end content like AAA gaming can often require the highest performance hardware and a greater power budget than a mid-range SoC can support. That, however, doesn’t necessarily mean you need to queue up and pay out for the next big flagship smartphone just to get on board with mobile VR.


In the tech industry it can often take a long time for high end content, use cases, or applications to become sufficiently well understood and developed to trickle down to the more mainstream device. The beauty of mobile VR is that the flexibility of the medium means you’re not locked out altogether just because you don’t want to spend on a top of the line device. In spite of the comparatively recent take off of VR products, every day use cases are already starting to become available and accessible to all on mainstream hardware. Whilst you wouldn’t want to try high end gaming (you’d almost certainly feel sick, if your system handled it at all) there are other, arguably more useful, ways in which the virtual world can change our lives.


Virtual spaces are where VR can meet mainstream devices to support a vast majority of business, social and communications needs. Whether you want to collaborate with overseas colleagues or just catch up with friends, virtual spaces allow you to interact in a more lifelike manner and can be supported within a much lower power budget than more complex content. The beauty of this concept is that there’s no need to navigate around a fully interactive virtual environment as you do for VR gaming. Users can be limited to a smaller setting such as a virtual boardroom, bar or café, which reduces the rendering complexity. This means you don’t need the highest performance SoC to support devices targeted at this type of content, as one of our innovative partners has recently shown.


Actions Semiconductor (Actions) is a leading Chinese fabless semiconductor company providing dedicated multimedia SoC solutions for mobile devices. Founded in 2001 and publicly listed in 2005, Actions now has ~700 employees and one of the most informed and influential engineering teams in the industry.


One of their most recent products, the V700, is an SoC expressly designed for the cost-efficient end of the virtual reality market. Based on a 64-bit quad-core ARM® Cortex®-A53 processor with the TrustZone® security system, graphics are provided by the powerful but highly efficient Mali-450 MP6 GPU, which delivers excellent 3D/2D rendering within a very small power and bandwidth budget, making it ideal for mid-range standalone VR devices.


When asked why they chose the ARM Mali family of processors for this device Actions explained that it was very important to them to enable high quality VR content for the mainstream market. Not everyone is interested in spending vast sums of money on emerging technologies, particularly when there’s still some (in my opinion, misplaced) skepticism in the industry about the uptake of VR. Supporting VR content such as virtual spaces for social and business uses allows more people to access and utilize this exciting new technology. The superior power and bandwidth saving features of the products in the Mali Multimedia Suite make them the perfect choice for such a power hungry application as VR. In-built optimizations and synchronized technologies such as ARM Frame Buffer Compression and TrustZone allow our partners to achieve the high quality and security they need without limiting uptake to high-earning consumers.


It’s always great to see partners like Actions take such leaps in supporting exciting new Mali-based products and it will be interesting to watch the emergence of virtual spaces for the mainstream user in the coming months.

I lost a few days wondering why some textures were completely distorted when loaded in OpenGL.

The thing is, they were only distorted when the colour components were packed as GL_UNSIGNED_SHORT_5_5_5_1 or GL_UNSIGNED_SHORT_4_4_4_4. When packing colour components as GL_UNSIGNED_BYTE (RGBA8888), the textures were loaded correctly.


Why?


Since I'm using a small personal Ruby hack to generate raw textures from BMP with the desired colour packing, I really thought the problem was in the Ruby code. After verifying that the generated 4444 and 5551 textures were the exact counterpart of the working 8888 textures, and tracing the OpenGL glTexImage2D calls to be sure that the data was sent correctly, I wondered whether a special parameter needed to be passed to glTexImage2D after all.


Ok, maybe I missed something in the glTexImage2D manual...


Sure did...


width × height texels are read from memory, starting at location data. By default, these texels are taken from adjacent memory locations, except that after all width texels are read, the read pointer is advanced to the next four-byte boundary. The four-byte row alignment is specified by glPixelStorei with argument GL_UNPACK_ALIGNMENT, and it can be set to one, two, four, or eight bytes.


The solution


Either:

  • have textures with a width that is a multiple of 4, or
  • call glPixelStorei(GL_UNPACK_ALIGNMENT, 2); before calling glTexImage2D (see the sketch below).
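For example, a minimal sketch of the second option when uploading a 4444-packed texture (variable names are illustrative):

// Rows of 16-bit texels are only 2-byte aligned, so tell GL not to skip to
// the next 4-byte boundary at the end of each row.
glPixelStorei(GL_UNPACK_ALIGNMENT, 2);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_RGBA, GL_UNSIGNED_SHORT_4_4_4_4, pixels);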


RTFM, as they always say!

In a previous blog we talked about running Mali Graphics Debugger (MGD) on a non-rooted device. In this blog we will focus on how you can add support for MGD, on a non-rooted device, to your Unreal Engine application. The plan we are going to follow is very simple:

  1. Add the interceptor library to the build system
  2. Edit the activity to load the interceptor library
  3. Install the MGD Daemon application on the target device


For our first step, we will need to download a version of Unreal Engine from the sources available on Github. For more information on this step, please see Epic’s guide.


Once you have a working copy of the engine, we can focus on getting MGD working. You will first need to locate the android-non-root folder in your MGD installation directory, and your Unreal Engine installation folder (where you cloned the repository). Copy the android-non-root folder to Engine\Build\Android\Java\.


Next, we will need to change the Android makefile to ensure that the interceptor is properly packaged inside the engine build. To do this, edit the makefile under “Engine/Build/Android/Java/jni/” and add this line at the end: include $(LOCAL_PATH)/../android-non-root/ It should look like this:

LOCAL_PATH := $(call my-dir)

include $(CLEAR_VARS)

include $(LOCAL_PATH)/../android-non-root/


We will now tell the main game activity that it needs to load the MGD library. Locate the activity source inside Engine\Build\Android\Java\src\com\epicgames\ue4\ and edit the onCreate function to look like this:

public void onCreate(Bundle savedInstanceState)
{
     // Load the MGD interceptor library before anything else. The library name
     // "MGD" matches the "libMGD not loaded" message below; adjust it if your
     // MGD version ships the interceptor under a different name.
     try {
          System.loadLibrary("MGD");
     }
     catch( UnsatisfiedLinkError e ){
          Log.debug( "libMGD not loaded" );
     }

     // create splashscreen dialog (if launched by SplashActivity)
     Bundle intentBundle = getIntent().getExtras();
     // Unreal Engine code continues here


Engine-wise we are all set; we will now prepare the device. Install the MGD daemon on the target phone using the following command from inside the android-non-root folder:

adb install -r MGDDaemon.apk


Now before running your app you will need to run this command from the host PC (please ensure that the device is visible by running adb devices first):

adb forward tcp:5002 tcp:5002


Run the MGD daemon application on the target phone and activate the daemon itself:



At that point you can connect MGD on the host PC to the device, start your application and begin debugging it. Please refer to the MGD manual for more in-depth information on how to use it.

Following these steps you should be able to use MGD with Unreal applications on any Mali-based platform. If you have any issues please raise them on the community and someone will be more than happy to assist you through the process.

In the first Bitesize Bifrost blog we introduced you to our new GPU architecture, Bifrost, and looked specifically at the extensive optimization and power saving benefits provided by clause shaders.  This time around we’re looking at system coherency, which allows the CPU and GPU to more effectively collaborate on workloads, and why this was considered an important focus for our newest GPU architecture.


In earlier systems there was no coherency between the CPU and GPU. If you created data on the CPU but wanted the GPU to be able to work on it, the CPU would need to write the data to main memory first. This allowed the GPU to see and access the data in order to process it. However, as the CPU operates with a cache, it was difficult to be certain that all data had actually reached main memory rather than simply sitting in the cache. This meant the cache contents needed to be written back to main memory and cleared (flushed) to ensure all the data was available to the GPU.


The issue this raises is that should you forget to flush the cache, you can’t be sure of the consequences. In some instances all the data would have been written out to main memory and you’d have no problem, or the data may be only marginally out of date and still not cause major issues. However, if the data is largely outdated you can experience serious, visible errors which are difficult to diagnose due to the different timings in the debugger affecting what’s in the cache. This makes it hard to reproduce the error and subsequently address it.


Additionally, as CPU cache sizes grow the cost of flushing them grows too. This can mean it’s only efficient to use the GPU for large, data heavy jobs which make the cache clean worthwhile and that the majority of jobs are therefore quicker and easier to keep on the CPU because of this overhead.


Our previous generation of GPU architecture, Midgard, used a concept known as IO coherency, which was originally used for input/output peripherals. This allows the GPU to check the CPU’s cache when it requests data from memory and effectively ask the CPU to confirm if it has the requested data in its cache. If it has, the GPU will copy the data into its own cache directly from the CPU cache, without going via the external memory. This way the memory latency is significantly reduced, as is external read bandwidth. However, this was a one-way system. Whilst the GPU also has caches of its own, in an IO-coherent system, the CPU cannot peek into the GPU’s caches.


As most of the required data in a graphics system flows from CPU to GPU rather than the other way around, this is an efficient tool for graphical tasks. Also, as GPU caches tend to be smaller, cleaning them at the end of a rendering pass is comparatively less costly and occurs at a single, regulated point in time making it less likely to be missed.


However, compute workloads can be vastly varying in size and the data needs to be able to travel between the CPU and GPU in both directions. This is why our new Bifrost architecture introduces full system coherency to products in the High Performance roadmap, allowing both the CPU and GPU to access each other’s caches. This eliminates the need for software to clean the caches and allows the CPU and GPU to collaborate on smaller jobs as well as larger ones. This extends the potential uses of the GPU’s compute capability and removes the risk of producing those difficult to detect errors that occur when a cache clean operation is missed.
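As an illustration of the kind of usage this enables (an OpenCL 2.0 sketch of my own, not taken from the blog; context, kernel and queue are assumed to exist), fine-grained shared virtual memory lets the CPU and GPU pass a pointer back and forth with no copies or explicit cache maintenance:

/* Error handling omitted for brevity. */
float* data = (float*)clSVMAlloc(context,
                                 CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                 n * sizeof(float), 0);

for (size_t i = 0; i < n; ++i)               /* CPU writes directly ...        */
    data[i] = (float)i;

clSetKernelArgSVMPointer(kernel, 0, data);   /* ... GPU reads the same pointer */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);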


As the Bifrost architecture is capable of scaling to 32 cores we’ve redesigned the level two cache to feature a modular design which is accessible by the cores as a single cache. This cache size is configurable to allow partners to balance just the right size and bandwidth for their specific system.


The single logical cache makes it simple for software to work with, both in the driver and on the GPU, so we can make the most of reusing cached data between shader cores. Partial cache line support means that we can effectively use it as a merging write buffer, resulting in fewer partial writes to DRAM and improving overall bandwidth utilization. The GPU also supports TrustZone™ memory protection, working to enforce restrictions on protected memory accesses.


As we look towards our next range of Bifrost based GPUs further advancements are on their way, so stay tuned and we’ll keep you up to date with the very latest in mobile graphics.


As you may have seen, Virtual Reality (VR) is getting increasingly popular. From its modern origins on desktop, it has quickly spread to other platforms, mobile being the most popular. Every time a new mobile VR demo comes out I am stunned by its quality; each time it is a giant leap forward for content quality. As of today, mobile VR is leading the way; because it is based on our everyday phone it is the most accessible form, and because you are not bound to a particular location or wrapped in cables, you can use it wherever you want, whenever you want.


As we all know, smooth framerate is critical in VR, where just a slight swing in framerate can cause nausea. The problem we are therefore facing is simple, yet hard to address. How can we keep reasonable performance while increasing the visual quality as much as possible?


As everybody in the industry is starting to talk about multiview, let us pause and take a bit of time to understand what it is, what kind of improvements one can expect, and why you should definitely consider adding it to your pipeline.


Stereoscopic rendering

What is stereoscopic rendering? The theoretical details are beyond the scope of this post, but the important point is that we need to trick your brain into thinking that the object is truly three-dimensional, not flat on the screen. To do this you need to give the viewer two points of view on the object, or in other words, emulate the way the eyes work. We therefore generate two cameras with a slight horizontal offset, one on the left, the other on the right. While they share the same projection matrix, their view matrices are not the same. That way, we have two different viewpoints on the same scene.

Fig. 1: Stereo camera setup.

Now, let us have a look at an abstract of a regular pipeline for rendering stereo images:

  1. Compute and upload left MVP matrix
  2. Upload Geometry
  3. Emit the left eye draw call
  4. Compute and upload right MVP matrix
  5. Upload Geometry
  6. Emit the right eye draw call
  7. Combine the left and right images onto the backbuffer
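Written out as hedged, GL-style pseudocode (function and variable names are illustrative), the loop above looks like this:

for (int eye = 0; eye < 2; ++eye)
{
    computeMVP(eye, mvp);                                           /* steps 1 and 4 */
    glUniformMatrix4fv(mvpLocation, 1, GL_FALSE, mvp);
    uploadGeometry();                                               /* steps 2 and 5 */
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0); /* steps 3 and 6 */
}
composeEyesToBackbuffer();                                          /* step 7 */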


We can see a clear pattern here: we are emitting two draw calls and sending the same geometry twice. While Vertex Buffer Objects can mitigate the latter, doubling the draw calls is still a major issue, as it adds significant overhead on your CPU. That is where multiview kicks in, as it allows you to render the same scene from multiple points of view with a single draw call.


Multiview Double Action Extension

Before going into the details of the expected improvements, I would like to have a quick look at the code needed to get multiview up and running. Multiview currently exists in two major flavors: OVR_multiview and OVR_multiview2. While they share the same underlying construction, OVR_multiview restricts the usage of the gl_ViewID_OVR variable to the computation of gl_Position. This means you can only use the view ID inside the vertex shader position computation step; if you want to use it inside your fragment shader or in other parts of your shaders, you will need to use multiview2.


As antialiasing is one of the key requirements of VR, multiview also comes in a version with multisampling called OVR_multiview_multisampled_render_to_texture. This extension is built against the specification of OVR_multiview2 and EXT_multisampled_render_to_texture.


Some devices might only support some of the multiview extensions, so remember to always query your OpenGL ES driver before using one of them. This is the code snippet you may want to use to test if OVR_multiview is available in your driver:

const GLubyte* extensions = GL_CHECK( glGetString( GL_EXTENSIONS ) );
char * found_extension = strstr( (const char*)extensions, "GL_OVR_multiview" );
if (NULL == found_extension)
     exit( EXIT_FAILURE );


In your code, multiview manifests itself on two fronts: during the creation of your framebuffer and inside your shaders, and you will be amazed how simple it is to use.

PFNGLFRAMEBUFFERTEXTUREMULTISAMPLEMULTIVIEWOVRPROC glFramebufferTextureMultisampleMultiviewOVR =
    (PFNGLFRAMEBUFFERTEXTUREMULTISAMPLEMULTIVIEWOVRPROC)eglGetProcAddress("glFramebufferTextureMultisampleMultiviewOVR");

/* target, attachment, texture, level, samples, base view index, number of views
 * (the sample count of 4 here is illustrative) */
glFramebufferTextureMultisampleMultiviewOVR(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, textureID, 0, 4, 0, 2);


That is more or less all you need to change in your engine code. More or less, because instead of sending a single view matrix uniform to your shader you need to send an array filled with the different view matrices.
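For example, assuming the shader below declares uniform mat4 MVP[2], both per-eye matrices can be uploaded in a single call (variable names are illustrative):

GLint mvpLocation = glGetUniformLocation(program, "MVP[0]");
glUniformMatrix4fv(mvpLocation, 2, GL_FALSE, (const GLfloat*)eyeMVPs);  /* 2 x mat4 */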

Now for the shader part:

#version 300 es

#extension GL_OVR_multiview : enable

layout(num_views = 2) in;

in vec3 vertexPosition;

uniform mat4 MVP[2];

void main(){
     gl_Position = MVP[gl_ViewID_OVR] * vec4(vertexPosition, 1.0f);
}


Simple, isn’t it?


Multiview will automatically run the shader multiple times, and increment gl_ViewID_OVR to make it correspond to the view currently being processed.

For more in depth information on how to implement multiview, see the sample code and article "Using Multiview Rendering".


Why use Multiview?

Now that you know how to implement multiview, I will try to give you some insights as to what kind of performance improvements you can expect.

The Multiview Timeline

Before diving into the numbers, let’s discuss the theory.

Fig. 2: Regular Stereo job scheduling timeline.


In this timeline, we can see how our CPU-GPU system is interacting in order to render a frame using regular stereo. For more in depth information on how GPU scheduling works on Mali, please see Peter Harris’ blogs.


First the CPU works to get all the information ready, then the vertex jobs are executed and finally the fragment jobs. On this timeline, the light blue sections are all the jobs related to the left eye, the dark blue to the right eye and the orange to the composition (rendering our two eyes side by side into a buffer).


Fig. 3: Multiview job scheduling timeline.

In comparison, this is the same frame rendered using multiview. As expected since our CPU is only sending one draw call, we are only processing once on the CPU. Also, on the GPU the vertex job is smaller since we are not running the non-multiview part of the shader twice. The fragment job, however, remains the same as we still need to evaluate each pixel of the screen one by one.

Relative CPU Time

As we have seen, multiview is mainly working on the CPU by reducing the number of draw calls you need to issue in order to draw your scene. Let us consider an application where our CPU is lagging behind our GPU, or in other words is CPU bound.

Fig. 4: Scene used to measure performances.


In this application the number of cubes changes over time, starting from one and going up to one thousand. Each of them is drawn using a different draw call - obviously we could use batching, but that’s not the scope here. As expected, the more cubes we add, the longer the frame takes to render. On the graph below, where smaller is better, we have measured the relative CPU time between regular stereo (blue) and multiview (red). If you remember the timeline, this result was expected, as multiview halves our number of draw calls and therefore our CPU time.

Fig. 5: Relative CPU time between multiview and regular stereo. The smaller the better, with the number of cubes on the x-axis and the relative time on the y-axis.

Multiview in red, and regular stereo in blue.


Relative GPU Time

On the GPU we are running vertex and fragment jobs. As we have seen in the timeline (Fig. 3), they are not equally affected by multiview; in fact only vertex jobs are. On Midgard and Bifrost based Mali GPUs, only the multiview-related parts of the vertex shaders are executed for each view.

In our previous example we looked at relative CPU time, this time we have recorded the relative GPU Vertex jobs time. Again, the smaller the better, regular stereo in blue and multiview in red.

Fig. 6: Relative GPU time between multiview and regular stereo. The smaller the better, with the number of cubes on the x-axis and the relative time on the y-axis.

Multiview in red, and regular stereo in blue.


The savings are immediately visible on this chart as we are no longer computing most of the shader twice.

Wrap it up

From our measurements, multiview is the perfect extension for CPU bound applications, where you can expect improvements of between 40% and 50%. If your application is not yet CPU bound, multiview should not be overlooked, as it can also improve your vertex processing time at a very limited cost.


It is worth noting that multiview renders to an array of textures inside a framebuffer, so the result is not directly ready for the front buffer. You will first need to render the two views side by side; this composition step is mandatory, but in most cases the time needed to do so is small compared to the rendering time and can be neglected. Moreover, this step can be integrated directly into the lens deformation or timewarp process.


Multiview Applications

The obvious way, and the one already discussed in this article, is to use multiview in your VR rendering pipeline. Both of your views are then rendered using the same draw calls onto a shared framebuffer. If we try to think outside the box though, it opens up a whole new field in which we can innovate.

Foveated Rendering

Each year our device screens get bigger and bigger, our content becomes increasingly complicated, and our rendering time stays the same. We have already seen what we can save on the CPU side, but sometimes fragment shaders are the real bottleneck. Foveated rendering is based on the physical properties of the human eye, where only 1% of the eye (called the fovea) is mapped to 50% of our visual cortex.


Foveated rendering uses this property to only render high resolution images in the center of your view, allowing us to render a low resolution version on the edges.

Fig. 7: Example of an application using foveated rendering.


For more information on foveated rendering and eye tracking applications, you can have a look at Freddi Jeffries’ blog Eye Heart VR. Stay tuned for a follow-up of this blog on foveated rendering theory.


We then need to render four versions of the same scene, two per eye, one high, one low resolution. Multiview makes this possible by sending only one draw call for all four views.
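In shader terms this only changes the view count and the size of the matrix array; a minimal sketch is shown below (the projection scaling for the low-resolution views is assumed to be handled when building the matrices):

#version 300 es
#extension GL_OVR_multiview : enable
layout(num_views = 4) in;                 // left/right eye, each at high and low resolution
in vec3 vertexPosition;
uniform mat4 MVP[4];

void main(){
     gl_Position = MVP[gl_ViewID_OVR] * vec4(vertexPosition, 1.0);
}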

Stereo Reflections

Fig. 8: A different reflection for each eye, demonstrated here in Ice Cave VR.


Reflections are a key factor for achieving true immersion in VR; however, as with everything in VR, they have to be in stereo. I won’t discuss the details of real time stereo reflections here; please see Roberto Lopez Mendez’s article Combined Reflections: Stereo Reflections in VR for that. In short, this method is based on the use of a secondary camera rendering a mirrored version of the scene. Multiview can help us achieve the stereo reflection at little more than the cost of a regular reflection, thus making real time reflections viable in mobile VR.


As we have seen throughout this article, multiview is a game changer for mobile VR, as it allows us to lighten the load on our applications and finally treat the two similar views as one. Each draw call we save is a new opportunity for artists and content creators to add more life to the scenes and improve the overall VR experience.


If you are using your own custom engine and OpenGL ES 3.0 for your project, you can already start working with multiview on some ARM Mali based devices, like the Samsung Galaxy S6 and S7. Multiview is also drawing increased attention from industry leaders. Oculus, starting from Mobile SDK 1.0.3, directly supports multiview on Samsung Gear VR, and if you are using a commercial engine such as Unreal, plans are in progress to support multiview inside the rendering pipeline.

We have recently announced the first GPU in the Mali Bifrost architecture family, the Mali-G71. While the overall rendering model it implements is similar to previous Mali GPUs (the Bifrost family is still a deeply pipelined tile-based renderer; see the first two blogs in this series, The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining and The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering, for more information), there are sufficient changes in the programmable shader core to require a follow-up to the original "Abstract Machine" blog series.


In this blog, I introduce the block-level architecture of a stereotypical Bifrost shader core, and explain what performance expectations application developers should have of the hardware when it comes to content optimization and understanding the hardware performance counters exposed via tools such as DS-5® Streamline. This blog assumes you have read the first two parts in the series, so I would recommend starting with those if you have not read them already.


GPU Architecture


The top-level architecture of a Bifrost GPU is the same as the earlier Midgard GPUs.





The Shader Cores


Like Midgard, Bifrost is a unified shader core architecture, meaning that only a single class of shader core which is capable of executing all types of shader programs and compute kernels exists in the design.


The exact number of shader cores present in a particular silicon chip varies; our partners can choose how many shader cores they implement based on their performance needs and silicon area constraints. The Mali-G71 GPU can scale from a single core for low-end devices all the way up to 32 cores for the highest performance designs.


Work Dispatch


The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling/compute workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue.


The workload in each queue is broken into smaller pieces and dynamically distributed across all of the shader cores in the GPU, or in the case of tiling workloads to a fixed function tiling unit. Workloads from both queues can be processed by a shader core at the same time; for example, vertex processing and fragment processing for different render targets can be running in parallel (see the first blog for more details on this pipelining methodology).


Level 2 Cache and Memory Bandwidth


The processing units in the system share a level 2 cache to improve performance and to reduce memory bandwidth caused by repeated data fetches. The size of the L2 cache is configurable by our silicon partners depending on their requirements, but is typically 64KB per shader core in the GPU.


The number of bus ports out of the GPU to main memory, and hence the available memory bandwidth, depends on the number of shader cores implemented. In general we aim to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256-bits of memory bandwidth (for both read and write) per clock cycle. The maximum number of AXI ports has been increased over Midgard allowing larger configurations with more than 12 cores to access a higher peak-bandwidth per clock if the downstream memory system can support it.


Note that the available memory bandwidth depends on both the GPU (frequency, AXI port width) and the downstream memory system (frequency, AXI data width, AXI latency). In many designs the AXI clock will be lower than the GPU clock, so not all of the theoretical bandwidth of the GPU is actually available to applications.


The Bifrost Shader Core


All Mali shader cores are structured as a number of fixed-function hardware blocks wrapped around a programmable core. The programmable core is the largest area of change in the Bifrost GPU family, with a number of significant changes over the Midgard "Tripipe" design discussed in the previous blog in this series:




The Bifrost programmable Execution Core consists of one or more Execution Engines – three in the case of the Mali-G71 – and a number of shared data processing units, all linked by a messaging fabric.


The Execution Engines


The Execution Engines are responsible for actually executing the programmable shader instructions, each including a single composite arithmetic processing pipeline as well as all of the required thread state for the threads that the execution engine is processing.


The Execution Engines: Arithmetic Processing


The arithmetic units in Bifrost implement a quad-vectorization scheme to improve functional unit utilization. Threads are grouped into bundles of four, called a quad, and each quad fills the width of a 128-bit data processing unit. From the point of view of a single thread this architecture looks like a stream of scalar 32-bit operations, which makes achieving high utilization of the hardware a relatively straightforward task for the shader compiler. The example below shows how a vec3 arithmetic operation may map onto a pure SIMD unit (pipeline executes one thread per clock):













... vs a quad-based unit (pipeline executes one lane per thread for four threads per clock):




The advantage in terms of the ability to keep the hardware units full of useful work, irrespective of the vector length in the program, is clearly highlighted by these diagrams. The power efficiency and performance provided by narrower-than-32-bit types is still critically important for mobile devices, so Bifrost maintains native support for int8, int16, and fp16 data types, which can be packed to fill the 128-bit data width of the data unit. A single 128-bit maths unit can therefore perform 8x fp16/int16 operations per clock cycle, or 16x int8 operations per clock cycle.
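In shader terms this simply means preferring the narrower precisions where they are sufficient; a small illustrative GLSL fragment (not from a real workload) is shown below.

// mediump (fp16) values can be packed two-per-lane into the 128-bit data unit,
// doubling arithmetic throughput compared to highp (fp32).
mediump vec3 blendColors(mediump vec3 a, mediump vec3 b, mediump float t)
{
    return mix(a, b, t);
}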


The Execution Engines: Thread State


To improve performance and performance scalability for complex programs, Bifrost implements a substantially larger general-purpose register file for the shader programs to use. The Mali-G71 provides 64x 32-bit registers while still allowing the maximum thread occupancy of the GPU, removing the earlier trade off between thread count and register file usage described in this blog: ARM Mali Compute Architecture Fundamentals.


The size of the fast constant storage, used for storing OpenGL ES uniforms and Vulkan push constants, has also been increased which reduces cache-access pressure for programs using lots of constant storage.


Data Processing Unit: Load/Store Unit


The load/store unit handles all general purpose (non-texture) memory accesses, including vertex attribute fetch, varying fetch, buffer accesses, and thread stack accesses. It includes 16KB L1 data cache per core, which is backed by the shared L2 cache.


The load/store cache can access a single 64-byte cache line per clock cycle, and accesses across a thread quad are optimized to reduce the number of unique cache access requests required. For example, if all four threads in the quad access data inside the same cache line that data can be returned in a single cycle.


Note that this load/store merging functionality can significantly accelerate many data access patterns found in common OpenCL compute kernels, which are often memory access limited, so maximizing its utility in algorithm design is a key optimization objective. It is also worth noting that even though the Mali arithmetic units are scalar, data access patterns still benefit from well-written vector loads, so we still recommend writing vectorized shader and kernel code whenever possible.
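A minimal illustrative OpenCL kernel (not from a real application) shows the kind of access pattern that suits the merging behaviour: consecutive work-items touch consecutive, vector-sized chunks of the same cache lines.

__kernel void scale(__global const float4* restrict in,
                    __global float4* restrict out,
                    const float k)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * k;   /* one 16-byte vector access per work-item */
}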


Data Processing Unit: Varying Unit


The varying unit is a dedicated fixed-function varying interpolator. It implements a similar optimization strategy to the programmable arithmetic units; it vectorizes interpolation across the thread quad to ensure good functional unit utilization, and includes support for faster fp16 optimization.


The unit can interpolate 128-bits per quad per clock; e.g. interpolating a mediump (fp16) vec4 would take two cycles per four thread quad. Optimization to minimize varying value vector length, and aggressive use of fp16 rather than fp32 can therefore improve application performance.
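In practice that just means declaring varyings no wider and no more precise than they need to be; an illustrative fragment shader input block (assuming fp16 is acceptable for these values) might look like this:

in mediump vec2 vTexCoord;   // 2 x fp16: a quarter of the interpolation cost of a highp vec4
in mediump vec4 vColor;      // fp16 is usually sufficient for color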


Data Processing Unit: ZS/Blend


The ZS and Blend unit is responsible for handling all accesses to the tile-memory, both for built-in OpenGL ES operations such as depth/stencil testing and color blending, as well as programmatic access to the tile buffer needed for functionality such as:


Unlike the earlier Midgard designs, where the LS Pipe was a monolithic pipeline handling load/store cache access, varying interpolation, and tile-buffer accesses, Bifrost has implemented three smaller and more efficient parallel data units.  This means that tile-buffer access can run in parallel to varying interpolation, for example. Graphics algorithms making use of programmatic tile buffer access, which all tended to be very LS Pipe heavy on Midgard, should see a measurable reduction in contention for processing resources.


Data Processing Unit: Texture Unit


The texture unit implements all texture memory accesses. It includes 16KB L1 data cache per core, which is backed by the shared L2 cache. The architecture performance of this block in Mali-G71 is the same as the earlier Midgard GPUs; it can return one bilinear filtered (GL_LINEAR_MIPMAP_NEAREST) texel per clock. For example interpolating a bilinear texture lookup for each thread in a four thread quad would take four cycles.


Some texture access modes require multiple cycles to generate data:

  • Trilinear filtering (GL_LINEAR_MIPMAP_LINEAR) requires two bilinear samples per texel and so requires two cycles per texel.
  • Volumetric 3D textures require twice as many cycles as a 2D texture; e.g. trilinear filtered 3D textures would take 4 cycles, bilinear filtered 3D textures would take 2 cycles.
  • Wide type texture formats (16-bits or more per color channel) may require multiple cycles per pixel.


One exception to the wide format rule, which is a new optimization in Bifrost, is depth texture sampling. Sampling from DEPTH_COMPONENT16 or DEPTH_COMPONENT24 textures, which is commonly needed for both shadow mapping techniques and deferred lighting algorithms, has been optimized and is now a single cycle lookup, doubling the performance relative to GPUs in the Midgard family.


The Bifrost Geometry Flow


In addition to the shader core change, Bifrost introduces a new Index-Driven Vertex Shading (IDVS) geometry processing pipeline. Earlier Mali GPUs processed all of the vertex shading before tiling, often resulting in wasted computation and bandwidth related to the varyings which only related to culled triangles (e.g. outside of the frustum, or failing a facing test).




The IDVS pipeline splits the vertex shader into two halves; one processing the position, and one processing the remaining varyings.




This flow provides two significant optimizations:

  • The index buffer is read first, and vertex shading is only submitted for small batches of vertices where at least one vertex in each batch is referenced by the index buffer. This allows vertex shading to jump spatial gaps in the index buffer.
  • Varying shading is only submitted for primitives which survive the clip-and-cull phase; this removes a significant amount of redundant computation and bandwidth for vertices contributing only to triangles which are culled.


To get the most benefit from the Bifrost geometry flow it is useful to partially deinterleave packed vertex buffers: place attributes contributing to position in one packed buffer, and attributes contributing to non-position varyings in a second packed buffer. This means that the non-position varyings are not pulled into the cache for vertices which are culled and never contribute to an on-screen primitive. My colleague stacysmith has written a good blog on optimizing buffer packing to exploit this type of geometry processing pipeline here: Eats, Shoots and Interleaves.
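A hedged sketch of that split using standard GL attribute setup (buffer names, attribute indices and layouts are illustrative):

/* Buffer 0: position-contributing attributes only */
glBindBuffer(GL_ARRAY_BUFFER, positionVBO);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat), (void*)0);

/* Buffer 1: non-position varyings (normal + UV), interleaved together */
glBindBuffer(GL_ARRAY_BUFFER, varyingVBO);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 5 * sizeof(GLfloat), (void*)0);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 5 * sizeof(GLfloat),
                      (void*)(3 * sizeof(GLfloat)));

glEnableVertexAttribArray(0);
glEnableVertexAttribArray(1);
glEnableVertexAttribArray(2);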


Performance Counters


Like the earlier Midgard GPUs, Bifrost hardware supports a large number of performance counters to enable application developers to profile and optimize their applications. More detail on the performance counters available to application developers for the Bifrost architecture can be found here:


Mali Bifrost Family Performance Counters


Comments and questions welcomed as always,




Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard with other engineers to determine how to get the best performance out of combined hardware and software compute sub-systems.

As a graphics blogger I’m always interested in the next big thing in tech and particularly, VR, so when I saw a news item about a guy cycling the length of the country from the comfort of his living room I just had to know more.


I, for one, hate spin classes. I go because I know logically that it’s doing me good but my common complaint (to just about anyone who’ll listen) is ‘but what’s the point of pedalling your backside off when you’re going nowhere?!’ Well it’s almost as if innovative developer Aaron Puzey heard my lament and decided to address it with his Cycle VR app, so clearly I had to track him down. A little bit of cyber stalking and a desperate bid for more information led me to the kind of innovation that is at the heart of what we’re trying to do in bringing VR and AR to the mass market.


In deciding which platform to target in developing his app, Aaron realised that whilst console and desktop options are taking off with releases like the HTC Vive, ‘In two years time EVERYONE will have a phone capable of VR. It seems like an obvious market to head for.’ In choosing the Samsung Gear VR as his mobile platform of choice, Aaron was able to utilize the superior visual quality of the Galaxy S6’s AMOLED display as well as the high powered Mali-T760 MP8 GPU as part of the inbuilt Exynos 7420 SoC. The Mali High Performance range of GPUs supports the demanding performance requirements of VR whilst saving maximum power and bandwidth to ensure a super slick experience.

The stereoscopic display allows for minor differences between each eye, providing a sense of depth


We know that some of the key challenges to a successful VR experience are latency, framerate and resolution, but these tricky areas didn’t actually present a problem for Aaron. As he was working with a single mesh and single texture, with very little geometry to add complexity, he was able to achieve the quality he needed without too many issues. The biggest struggle was transforming the data from Google Maps Street View into a working 3D model because of the lack of available information, but with a little digging Aaron was able to make use of work already done on this elsewhere. He initially attempted to stream the data live, but the lack of multithreading in Unity meant this was causing a stall on every new texture load. He’s discovered the best way around this is to cache the required data prior to each session and run it offline until a workaround to prevent the stalls can be found. The camera moves smoothly from one panorama to the next, producing some visual distortion but keeping the motion of the bike as realistic as possible.


We’ve seen lots of different ways of navigating a VR environment beginning to crop up, from VR chairs and stools which respond like a Segway when you lean in the direction of travel, to fully encapsulated treadmills. The latter let you move, walk and run freely around your virtual environment without the risk of crashing into people, pets or objects. However, instead of relying on expensive, dedicated hardware like these, Aaron simply customised his own existing exercise bike using a simple cadence monitor to record the RPM. Whilst it doesn’t measure the amount of effort put in, just the distance travelled, with the simple addition of adjusting the bike’s friction setting to emulate real road conditions, Aaron could get a pretty accurate output.


So, whilst still in its early stages, Aaron has high hopes for the project and is looking for the right partner to take it to the broader market. With plans to enhance the user experience, including adding multiplayer capability so you can race your friends cross country, I for one can’t wait to get my hands on the commercial version and ditch dull spin classes for good!


Got a great developer story? Get in touch!


Twitter: @FreddiJeffries

In the first of my Mali™ Power & Efficiency blogs we looked at the inbuilt flexibility and scalability of the Mali range of GPUs. It’s this that allows ARM® partners to target exactly the right performance and efficiency balance to suit their specific product, whether that’s a low power smartwatch or a top of the range premium smartphone. In this blog we’re going to take a deeper dive into the high performance end of this spectrum and look at one key ecosystem partner, Nibiru, who are implementing High Performance Mali GPUs in their range of awesome, standalone VR headsets.


Anyone who’s read my previous blogs on the growing market for Virtual Reality knows I feel strongly that VR and AR are set to change the way we work, live and play. Like most things in life however, it’s not that simple. In order to achieve a truly immersive, high quality VR experience there are some technical challenges we need to overcome. We’ve discussed the need for clear focus to help our brains believe what we’re seeing, we’ve talked about how low latency is key to avoiding nausea and dizziness, and we’ve looked at the future of the field with eye tracking and foveated rendering.


As you can see from the image below, there are a lot of intricate elements to be balanced in a VR headset and ensuring each of them is just right is not an easy task. By featuring a high quality, WQHD (2560x1440p), 5.67” Samsung AMOLED display, Nibiru ensures the user can experience the clearest imagery with the crispest possible colors due to the advanced technology of the screen. Every single pixel in an AMOLED display provides its own light source through the film which sits behind it, whereas a typical LCD screen is continuously backlit by white LEDs. Because colors are achieved by individually updating the colored LEDs behind the screen, it is possible to get brighter and sharper hues with stronger saturation. You can also turn off sections of the panel to achieve a deeper, truer black than is typically possible on a continuously-lit LCD. This is also beneficial for VR due to the latency reduction benefits discussed previously.




So how do we make all this come together into a truly awesome VR product? The answer is power.

A High Performance GPU is essential to achieving a truly great VR experience and Nibiru recognised this when they started designing their VR products. Focusing on mobile VR, Nibiru initially launched their VR OS and VR Launcher to support virtual reality via smartphone and their VR ROM when they began designing standalone devices. With around three million headsets shipped in 2016 so far, this is a company getting ahead of the VR curve. Their latest high end product, the Pro One Plus, is due for release towards the end of 2016 and uses one of the most powerful Mali-based SoCs available, the Samsung Exynos 8890. This SoC features an MP12 configuration of Mali-T880, the highest performing Mali GPU currently appearing in devices. Powering the Samsung Galaxy S7, and therefore the Samsung Gear VR, the Exynos 8890 has already proven its merits in the high performance smartphone space and is a perfect fit for a standalone VR device like Nibiru’s.


The Exynos implementation of MP12 is the highest number of cores we’ve seen in a Mali-T880 based chipset but we’re due for yet another step up with the recently released Mali-G71 which can scale up to 32 cores, double that available in the Mali-T880. Operating on Nibiru’s in-house VR OS this new device has 3GB RAM, 32GB in-built memory, HDMI input and supports customized third party VR apps for gaming, video streaming and more. It’s also optimized for Google Play and YouTube to make sure you never run out of awesome content.

Pro One Plus provisional design (powered by Nibiru)


So why did Nibiru choose ARM Mali to power their devices? Nibiru co-founder Tony Chia explained that it was very important to them to choose a GPU that could effectively provide sufficient performance levels to ensure a smooth VR experience with minimal latency. He went on to explain that ‘user experience is very important to us and to make sure we can bring a great mobile VR experience to the mass market we had to have the right hardware in place from the beginning. Our initial focus has been around providing excellent VR video and experience based applications rather than high end gaming due the challenges of interacting with a virtual environment. ARM Mali GPUs allow us to choose an SoC that gives us peak performance whilst still saving power and extending battery life as long as possible.’  Not only are Mali GPUs scalable to allow multiple core implementation options but even the very way the chipset is configured allows vast scope for customization too. ARM Mali’s specialized bandwidth saving technologies like ARM Frame Buffer Compression (AFBC) and Adaptive Scalable Texture Compression (ASTC) contribute to efficiency by reducing bandwidth and freeing up power where it’s needed most.


With its sleek wireless design, Nibiru’s next generation, standalone VR headset represents the future of mobile VR. As we continue to work together on the ARM & Nibiru Joint Innovation Lab we aim to help streamline the game development process and enable fantastic content to complement it. Here in the ARM Mali team we can’t wait to see what they come up with next!

Ok I did it. I downloaded Pokemon Go. Yes I was trying to resist, yes it was futile, yes it’s an awesome concept. Whilst a strong believer in Virtual Reality as a driving force in how we will handle much of our lives in the future (see my extensive blog series on the subject), I can see that apps like this have the potential to take Augmented Reality (AR) mainstream much faster than VR. What with the safety (and aesthetic) issues inherent in walking round with a headset on, AR allows you to enter a semi immersive environment but still see the world around you. Although that fact doesn’t negate the need for a warning not to walk blindly into traffic mid-game. By overlaying graphics, user interface and interactive elements over the real world environment we can experience a much more ‘real life’ feel to gaming. The fact that it also gets a generation of console gamers on their feet and out into the big wide world is just an added bonus.


It turns out Pokemon Go isn’t the company’s first attempt at this kind of application. Back in 2012 they sought users to test a beta version of a similar real world game based on spies. The idea was that you followed the map on your phone to relevant locations to solve puzzles, make drops etc. You could argue that the reason this has taken off when that didn’t is that it now has the marketing superpower of Pokemon and Nintendo behind it, but I think it’s a little more than that. All anyone in the tech industry has been talking about in recent months is VR, AR and Computer Vision and this uses two of the three straight away. Not only that but it does so in a form that’s accessible to absolutely everyone with a smartphone (and in its early days, an external battery pack for those who want to use it for more than about ten minutes).

ross.hookway & alexmercer catching the Pokemon bug at the ARM Cambridge campus


The idea of playing an adventure style game in my home city appeals to me anyway. The fact that Pokemon Go overlays itself onto your actual surroundings, rather than just as a point on an animated map, makes it a whole lot more relatable. This is where Computer Vision comes in, as your phone has to be able to recognise and interpret the locations and landmarks it sees in order to use AR to realistically overlay the Pokemon onto your surroundings. Without computer vision it could prove difficult to avoid bugs like trapping Pokemon in unreachable environments, or enticing people into dangerous situations.


There’s been something of a misconception that you need ‘special’ computer vision chips for tasks like this, and that the additional silicon required would be unfeasible in the mobile form factor, but this just isn’t the case. Not only can you do this level of basic computer vision entirely on the CPU, but some companies also have an engine which can recognize if your device has an ARM Mali GPU and automatically redirect some of the workload to it. Not only does this free up the processing power and bandwidth of the CPU, it also allows us to access the superior graphical capabilities of the existing GPU with no need for additional hardware.


The huge and lightning fast adoption of Pokemon Go, in spite of its quite considerable bugs and glitches, demonstrates just how keen we are to jump on board with the next big thing in smartphones. It also shows that a new, and potentially confusing, technology can reach global uptake simply due to clever and compelling packaging. Whilst I fully expect the game to be optimized and bug free in a very short time, it will also no doubt prompt a wave of similar concept applications. I’ll be interested to see how this develops and whether (or maybe when) it will make AR truly the next big thing.

This year’s Siggraph is the 43rd international conference and exhibition on Computer Graphics & Interactive Techniques and takes place from the 24th to 28th July in Anaheim, California. A regular event on the ARM calendar, we’re looking forward to another great turn out with heaps to do and see from all the established faces in the industry as well as some of the hot new tech on the scene.

Moving Mobile Graphics

A particularly exciting part of Siggraph this year is the return of the popular Moving Mobile Graphics course. Taking place on Sunday 24th July from 2pm to 5.15pm, this half day course will take you through a technical introduction to the very latest in mobile graphics techniques, with particular focus on mobile VR. Talks and speakers will include:

  • Welcome & Introduction - Sam Martin, ARM
  • Best Practices for Mobile - Andrew Garrard, Samsung R&D UK
  • Advanced Real-time Shadowing - Marius Bjorge, ARM
  • Video Processing with Mobile GPUs - Jay Yun, Qualcomm
  • Multiview Rendering for VR - Cass Everitt, Oculus
  • Efficient use of Vulkan in UE4 - Niklas Smedberg, Epic Games
  • Making EVE: Gunjack - Ray Tran, CCP Games Asia

Visit the course page for more information. Slides will be available after the event so sign up to our Graphics & Multimedia Newsletter to be sure to receive all the latest in ARM Mali news.


Tech Talk

We’ll also be giving a talk on Practical Analytic 2D Signed-Distance Field Generation. Unlike existing methods, which first rasterize a path to a bitmap and then derive the SDF from it, we calculate the minimum distance from each pixel to the nearest segment directly from a path description comprised of line segments and Bezier curves. The method is novel because none of the existing techniques work in vector space, and our distance calculations are done in canonical quadratic space. Be sure to come along to Ballroom B on Thursday from 15:45-17:15 to learn about this ground-breaking technique.
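
To give a feel for working directly in vector space, here is a minimal, hedged sketch (the type and function names are ours, not code from the talk) of the exact distance from a sample point to a single line segment. A full generator would take the minimum of this over every segment and Bezier curve in the path and then apply the sign from the path's winding:

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };

static float dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }

// Exact (analytic) distance from sample point p to the line segment a->b.
// Only the line-segment case is shown; curves need their own closest-point
// solve before taking the overall minimum.
float distanceToSegment(Vec2 p, Vec2 a, Vec2 b)
{
    const Vec2 ab{ b.x - a.x, b.y - a.y };
    const Vec2 ap{ p.x - a.x, p.y - a.y };

    const float lenSq = dot(ab, ab);
    if (lenSq == 0.0f)                     // degenerate segment: a == b
        return std::sqrt(dot(ap, ap));

    // Project p onto the segment and clamp to its endpoints.
    const float t = std::clamp(dot(ap, ab) / lenSq, 0.0f, 1.0f);

    const Vec2 closest{ a.x + t * ab.x, a.y + t * ab.y };
    const Vec2 d{ p.x - closest.x, p.y - closest.y };
    return std::sqrt(dot(d, d));
}
```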


Poster session

Elsewhere at the event we’ll be talking about Optimized Mobile Rendering Techniques Based on Local Cubemaps. The static nature of the local cubemap allows for faster and higher quality rendering, and the fact that we use the same texture every frame guarantees high quality shadows and reflections without the pixel instabilities that are present with other runtime rendering techniques. Also, as only read operations are involved when using static cubemaps, bandwidth use is halved, which is especially important on mobile devices where bandwidth must be carefully balanced at runtime. Our Connected Community members have already produced a number of blogs on this subject, demonstrating how to work with soft dynamic shadows, reflections and refractions amongst other great techniques. Check these out here and come along to the event to speak to our experts!
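
The blogs linked above cover the details, but at the heart of the technique is a "local correction" of the lookup vector against a bounding volume of the local environment. The following is a minimal, hedged C++ sketch of that correction for the reflection case (the vector type and function names are ours; shadows use the same intersection idea):

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };

// Correct a reflection direction so it can sample a static cubemap captured
// at cubemapPos inside the axis-aligned box [bboxMin, bboxMax]. The ray from
// the shaded point along dir is intersected with the box, and the hit point
// is re-expressed relative to the cubemap origin; sampling the cubemap with
// the returned vector gives the locally corrected result.
// Assumes dir has no zero components.
Vec3 localCorrect(Vec3 dir, Vec3 bboxMin, Vec3 bboxMax,
                  Vec3 shadedPos, Vec3 cubemapPos)
{
    // Ray parameter t at which each pair of box planes is reached.
    const Vec3 tA{ (bboxMax.x - shadedPos.x) / dir.x,
                   (bboxMax.y - shadedPos.y) / dir.y,
                   (bboxMax.z - shadedPos.z) / dir.z };
    const Vec3 tB{ (bboxMin.x - shadedPos.x) / dir.x,
                   (bboxMin.y - shadedPos.y) / dir.y,
                   (bboxMin.z - shadedPos.z) / dir.z };

    // Keep the exit (largest) intersection per axis; the nearest of those
    // is where the ray leaves the box.
    const Vec3 tFar{ std::max(tA.x, tB.x),
                     std::max(tA.y, tB.y),
                     std::max(tA.z, tB.z) };
    const float t = std::min(std::min(tFar.x, tFar.y), tFar.z);

    const Vec3 hit{ shadedPos.x + dir.x * t,
                    shadedPos.y + dir.y * t,
                    shadedPos.z + dir.z * t };
    return { hit.x - cubemapPos.x, hit.y - cubemapPos.y, hit.z - cubemapPos.z };
}
```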

Some devices, applications or use cases require the absolute peak of performance in order to deliver on their requirements. Others need to save every little bit of energy expenditure in order to deliver extended battery life and run within the bounds of a thermally limited form factor. So how do we decide which end of the spectrum to target? Here in Team Mali, we don’t. Mali, the number 1 shipping GPU in the world, has reached such heights partly because it is able to target every single use case across this range. From the most powerful of mobile VR headsets needing lightning-fast refresh rates, to the tiniest of smartwatches required to run for as long as physically possible, there really is a Mali GPU for every occasion.

MALI RGB 2015.jpg

This mini-series of blogs will first introduce the overall scalability and flexibility of the ARM Mali range before taking a deeper dive into two products from either end of the spectrum. We will examine how these products have incorporated Mali in order to target the perfect balance of performance and efficiency their device requires. Not only does this flexibility help our partners reduce their time to market but it also means they can carefully balance resources to target the ideal positioning for their product.


So many choices, so little time

There are three product roadmaps in the Mali family: Ultra low power, High area efficiency and High performance. These groupings allow partners to easily select the right set of products for their device’s needs. The Ultra low power range includes the Mali-400 GPU, one of the first in the ARM range of GPUs and still the world’s favourite option with over 25%* market share all by itself. The latest product in this roadmap is Mali-470, featuring advanced energy saving features to bring smartphone quality graphics to low power devices like wearables and Internet of Things applications. It halves the power consumption of the already hyper-efficient Mali-400 in order to provide even greater device battery life and extended use.


The high area efficiency roadmap is focused around providing optimum performance in the smallest possible silicon area to reduce cost of production for mass market smartphones, tablets and DTVs. IP in this roadmap includes Mali-T820 & Mali-T830, a pairing of products which incorporates the cost and energy saving features of their predecessor, Mali-T720, with the superior power of the simultaneously released high performance Mali-T860. The first cost efficient ARM Mali GPUs to feature ARM Frame Buffer Compression, these represented a big step up in terms of the flexibility to balance power and performance.


The high performance roadmap is exactly as you might expect based on the name. It features the latest and greatest in GPU design to optimize performance for high end use cases and premium mobile devices. The Mali-T880 is the highest performing GPU based upon ARM’s famous Midgard architecture and is powering many of today’s high end devices, including the Samsung Galaxy S7 and the Huawei P9 smartphone, as well as a whole host of standalone VR products. You may also have read recently about our brand new high performance GPU on the market, Mali-G71. The change in naming format marks another step up in Mali GPU architecture with the advent of the Bifrost architecture. The successor to Midgard, Bifrost has been strategically designed to support Vulkan, the new graphics API from Khronos, which gives developers a lot more control as well as a great new feature set tailored to mobile graphics. It has also been designed to exceed the requirements of today’s advanced content, like 360 video and high end gaming, and to support growing areas such as virtual reality, augmented reality and computer vision.


The possibilities are endless…

A large part of the flexibility inherent in the Mali range of products is down to its inbuilt scalability. Mali-400 came into being as the first dual core implementation of the original Mali-200 GPU, once it became apparent there was a lot to be gained from this approach. High end Midgard based GPUs like Mali-T860 and Mali-T880 scale from 1 to 16 cores to allow even greater choice for our partners. We’ve seen configurations featuring up to 12 of those available cores at the top end of today’s premium smartphones to support specific use cases like mobile VR, where the requirements push the boundaries of mobile power limits. The new Bifrost GPU, Mali-G71, takes that to another level again with the ability to scale up to 32 cores. The additional options were deemed necessary in order to comfortably support not only today’s premium use cases like mobile VR, but also to allow room to adapt to the growing content complexity we’re seeing every day.


After the customer has established their required number of cores there is still a lot of scope for flexibility within the configuration itself. Balances can be reached between power, performance and efficiency in the way the chipset is implemented in order to provide another level of customizable options. The following images show a basic example of the flexibility inherent in the configuration of just one Mali based chipset but this is just the tip of the iceberg.

  config table.png


Example optimization points of one Mali GPU

config graph.png



Practical application

In the next blog we’ll be examining an example of a Mali implementation in a current high performance device and how the accelerated performance and graphical capability supports next-level mobile content. Following on from that we’ll look at a device with requirements to keep power expenditure to a minimum and how Mali’s superior power and bandwidth saving technologies have been implemented to achieve this. The careful balance between power and efficiency is an eternal problem in the industry but one we are primed to address with the flexibility and scalability of the ARM Mali range.


*Unity Mobile (Android) Hardware Stats 2016-06

Recently we released V4.0 of the Mali Graphics Debugger. This is a key release that greatly improves the Vulkan support in the tool. The improvements are as follows:


Frame Capture has now been added to Vulkan: This is a hugely popular feature that has been available for OpenGL ES MGD users for several years. Essentially it is a snapshot of your scene after every draw call as it is rendered on target. This means if there is a rendering defect in your scene you immediately know which draw call is responsible. It is also a great way to see how your scene is composed, which draw calls contribute to your scene, and which draw calls are redundant.


Property Tracking for Vulkan: As MGD tracks all of the API calls an application makes, it has extensive knowledge of all of the graphics API assets that exist in the application, spanning everything from shaders to textures. Here is the list of Vulkan assets that are now tracked in MGD: pipelines, shader modules, pipeline layouts, descriptor pools, descriptor sets, descriptor set layouts, images, image views, device memories, buffers and buffer views.


Don't forget you can have your say on features we develop in the future by filling out this short survey:


Vulkan & Validation Layers

Posted by solovyev Jul 6, 2016

Why the validation layers?

Unlike OpenGL, Vulkan drivers don't have a global context, don't maintain a global state and don't have to validate inputs from the application side. The goal is to reduce CPU consumption by the drivers and give applications a bit more freedom in engine implementation. This approach is feasible because a reasonably good application or game should not provide incorrect input to the drivers in release mode, so all the internal checks drivers usually do are a waste of CPU time. However, during the development and debugging stages, a mechanism for detecting invalid input is a useful and powerful tool which can make a developer's life a lot easier. With Vulkan, all input validation has therefore been moved out of the driver into a separate standalone module called the validation layers. While debugging or preparing a graphics application for release, running the validation layers is a good self-assurance that no obvious mistakes are being made by the application. While "clean" validation layers don't necessarily guarantee a bug-free application, they’re a good step towards a happy customer. The validation layers are an open source project belonging to the Khronos community, so everyone is welcome to contribute or raise an issue:
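
As a hedged illustration (not ARM-specific code), enabling the layers typically amounts to listing them at instance creation time. The layer name below is the standard_validation meta-layer current at the time of writing; newer loaders expose VK_LAYER_KHRONOS_validation instead:

```cpp
#include <vulkan/vulkan.h>

// Enable validation in debug builds only; release builds keep the
// low-overhead path that Vulkan was designed for.
VkInstance createInstanceWithValidation()
{
    VkApplicationInfo appInfo{};
    appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    appInfo.pApplicationName = "validation-demo";
    appInfo.apiVersion = VK_API_VERSION_1_0;

#ifdef NDEBUG
    const uint32_t layerCount = 0;
    const char* const* layers = nullptr;
#else
    static const char* kLayers[] = { "VK_LAYER_LUNARG_standard_validation" };
    const uint32_t layerCount = 1;
    const char* const* layers = kLayers;
#endif

    VkInstanceCreateInfo createInfo{};
    createInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    createInfo.pApplicationInfo = &appInfo;
    createInfo.enabledLayerCount = layerCount;
    createInfo.ppEnabledLayerNames = layers;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&createInfo, nullptr, &instance) != VK_SUCCESS) {
        // Handle the error in real code; the layers may simply not be installed.
    }
    return instance;
}
```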


My application runs OK on this device. Am I good to ship it?

No you are not! The Vulkan specification is the result of contributions from multiple vendors, and as such there is functionality in the Vulkan API that matters a great deal to Vendor A but may be somewhat irrelevant to Vendor B. This is especially true for Vulkan operations that are not directly observable by applications, for instance layout transitions, execution of memory barriers, etc. While applications are required to manage resources correctly, you don't know exactly what happens on a given device when, for example, a memory barrier is executed on an image sub-resource; it depends heavily on the specifics of the memory architecture and GPU. From this perspective, mistakes in areas such as sharing of resources, layout transitions, selecting visibility scopes and transferring resource ownership may have different consequences on different architectures. This is a critical point: incorrectly managed resources may not show up on one device due to the implementation options chosen by its vendor, yet may prevent the application from running on another device powered by another vendor's hardware.


Frequently observed application issues with the Vulkan driver on Mali.


External resource ownership.

Resources like presentable images are treated as external to the Vulkan driver, meaning that it doesn’t have ownership of them. The driver obtains a lock on such an external resource on a temporary basis to execute a certain rendering operation or series of rendering operations, and when this is done the resource is released back to the system. When ownership is transferred to the driver, the external resource has to be mapped and given valid entries in the MMU tables in order to be correctly read/written on the GPU. Once the graphics operations involving the resource are finished, it has to be released back to the system and all the MMU entries invalidated. It is the application's responsibility to tell the driver at which stage the ownership of a given external resource is supposed to change, by providing this information as part of the render pass creation structure or as part of the execution of a pipeline barrier.


For example, when a presentable resource is expected to be in use by the driver, its layout is transitioned from VK_IMAGE_LAYOUT_PRESENT_SRC_KHR to VK_IMAGE_LAYOUT_GENERAL or VK_IMAGE_LAYOUT_COLOR{DEPTH_STENCIL}_ATTACHMENT_OPTIMAL. When rendering to the attachment is done and it is expected to be presented on the display, the layout needs to be transitioned back to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR.
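
As a hedged sketch of the second half of that example (stage and access masks are illustrative and will vary with the application; the same transition can also be expressed through the render pass's finalLayout), a pipeline barrier handing a swapchain image back for presentation might look like this:

```cpp
#include <vulkan/vulkan.h>

// Record a barrier that hands a swapchain image back for presentation:
// the image moves from the layout used for rendering to
// VK_IMAGE_LAYOUT_PRESENT_SRC_KHR.
void transitionToPresent(VkCommandBuffer cmd, VkImage swapchainImage)
{
    VkImageMemoryBarrier barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstAccessMask = 0;  // Presentation engine access is handled by the WSI.
    barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = swapchainImage;
    barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                         VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}
```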


Incorrectly used synchronization

The lifetime of Vulkan objects is another critical area in Vulkan applications. The application must ensure that Vulkan objects, or the pools they were allocated from, are destroyed or reset only when they are no longer in use. The consequence of incorrectly managing object lifetimes is unpredictable; the most likely outcome is MMU faults that result in rendering issues and loss of the device. Most of these situations can be caught and reported by the validation layers. For example, if the application tries to reset a command pool while a command buffer allocated from it is still in flight, the validation layers should intercept it with the following report:


[DS] Code 54: Attempt to reset command pool with command buffer (0xXXXXXXXX) which is in use


Another example: when the application tries to record commands into a command buffer which is still in flight, the validation layers should intercept it with the following report:


[MEM] Code 9: Calling vkBeginCommandBuffer() on active CB 0xXXXXXXXX before it has completed.

You must check CB fence before this call.
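
A hedged sketch of the pattern that second message asks for, waiting on the fence associated with the previous submission before reusing the command buffer (the handle names are ours):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Safely reuse a command buffer: wait for the fence signalled by its
// previous vkQueueSubmit, so the GPU has finished with it before we reset
// and record again. Assumes the command pool was created with
// VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT.
void beginReuse(VkDevice device, VkCommandBuffer cmd, VkFence inFlightFence)
{
    // Block until the previous submission has completed.
    vkWaitForFences(device, 1, &inFlightFence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &inFlightFence);

    vkResetCommandBuffer(cmd, 0);

    VkCommandBufferBeginInfo beginInfo{};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
    vkBeginCommandBuffer(cmd, &beginInfo);
}
```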


Memory requirements violation.

Vulkan applications are responsible for providing memory backing for image and buffer objects via the appropriate calls to vkBindBufferMemory or vkBindImageMemory. The application must not make assumptions about the memory requirements of an object, even if it is, for example, a VkImage created with VK_IMAGE_TILING_LINEAR tiling, as there is no guarantee of contiguous memory. Allocations must be made based on the size and alignment values returned by vkGetImageMemoryRequirements or vkGetBufferMemoryRequirements. Data upload to a sub-resource must then be done with respect to the sub-resource layout values, such as the offset to the start of the sub-resource, its size, and its row/array/depth pitch. Violating the memory requirements of a Vulkan object often results in segmentation faults or MMU faults on the GPU and, eventually, VK_ERROR_DEVICE_LOST. It’s recommended to run the validation layers as a means of protection against this kind of issue; they can detect situations like memory overflow, cross-object memory aliasing and mapping/unmapping issues, although binding insufficient memory isn't currently detected.
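
A hedged sketch of that pattern for an image follows; the memory type index is assumed to have been chosen by matching memoryTypeBits against the device's memory properties, which is omitted here for brevity:

```cpp
#include <vulkan/vulkan.h>

// Allocate and bind memory for an image based strictly on what the driver
// reports, never on assumptions about the image's dimensions or tiling.
VkDeviceMemory allocateAndBindImageMemory(VkDevice device, VkImage image,
                                          uint32_t memoryTypeIndex)
{
    VkMemoryRequirements memRequirements{};
    vkGetImageMemoryRequirements(device, image, &memRequirements);

    VkMemoryAllocateInfo allocInfo{};
    allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.allocationSize = memRequirements.size;   // not width * height * bpp
    allocInfo.memoryTypeIndex = memoryTypeIndex;       // must satisfy memoryTypeBits

    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &allocInfo, nullptr, &memory);

    // Offset 0 trivially satisfies memRequirements.alignment; any non-zero
    // offset must be a multiple of it. Uploads to a linear image should then
    // respect vkGetImageSubresourceLayout (offset, size, row pitch).
    vkBindImageMemory(device, image, memory, 0);
    return memory;
}
```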
