
ARM Mali Graphics


Chinese Version 中文版: 节约带宽才是王道 (Saving Bandwidth Is King)


Building an efficient, high-performing System-on-Chip (SoC) is becoming an increasingly complex task. The growing demands of bandwidth-heavy applications mean that system components must improve efficiency with each generation to address the additional bandwidth these apps consume. And this is true across all markets: while high-end mobile computing platforms strive for ever greater performance alongside better energy efficiency, SoCs targeting the mainstream still need to deliver a premium-style feature set and performance density - and at the same time reduce manufacturing costs and time to market too!


If you look closely at typical user interactions with consumer devices, you will realize that they very often centre on a combination of text, audio, still images, animation and video - in other words, a multimedia experience. What is significant about this is that media-intensive use cases require the transfer of large amounts of data: the more advanced the user experience, the higher the requirement for system bandwidth. And the higher the bandwidth consumption, the higher the power consumption.


But which specific use cases and functional requirements are driving this near-exponential increase in bandwidth demand?


  • Screen sizes and resolutions have rapidly increased across a wide range of form factors and performance points. The number of mobile devices with HD screen resolutions is growing fast and tablets now often have a 2.5K screen. There is no sign of this trend slowing down.


  • The number of frames per second that a media system has to deliver has increased, and display refresh rates are expected to provide a more compelling user experience. In reality, 60FPS has been a must for some time now, and with some advanced use cases we are moving up to 120FPS.


  • The amount of compute throughput required to calculate each scene has increased. High-end games and use cases require more complex calculations to produce each of the final pixels. Increasing arithmetic throughput on a per-pixel basis simply means that there is more data being transferred and processed.


So in summary, there are more pixels which all have to be delivered faster – and at the same time more work is required for each of these pixels every frame. All of this requires not only higher computational capabilities but also more bandwidth. Unless SoC designers think about this in advance when designing a media system, a lot more power will be consumed when delivering a quality user experience.
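To put rough numbers on this, a quick back-of-envelope calculation (using an illustrative 2.5K panel at 60FPS and 32-bit colour - the figures are examples, not measurements) shows how quickly frame buffer traffic adds up:

```python
# Illustrative numbers only: a 2.5K tablet panel at 60FPS, 32-bit colour.
width, height = 2560, 1600
bytes_per_pixel = 4                        # RGBA8888
fps = 60

frame_bytes = width * height * bytes_per_pixel
write_bandwidth = frame_bytes * fps        # writing the final frame buffer only

print(frame_bytes // 2**20)                # 15 (MiB per frame, rounded down)
print(round(write_bandwidth / 1e9, 2))     # 0.98 (GB/s for the write alone)
```

And that is only the final write: texture reads, overdraw and composition multiply this figure several times over.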


So what can we do about that? A typical media system consists of a number of IP components, each of which has slightly different functionality and characteristics. Each stage of the media processing pipeline is handled by a separate block, and each of them (as Sean explained in his blog) has inputs, intermediate data and outputs - all of which contribute to the total power budget.




Looking deeper, the typical media pipeline consists of a number of interactions between the GPU, Video and Display Processors, and it requires the passing of a certain amount of data between each of these components. If we take a glass-half-full view, this means there are plenty of opportunities to optimize these interactions and provide components that save bandwidth by working together efficiently. This is exactly why ARM has developed a range of bandwidth-reducing technologies: to allow increasingly complex media within the power capacity and thermal limit of mobile devices.

Motion Search Elimination - Lowering Latency and Power


Let’s start by looking at how optimizations already applied to Mali GPUs can be extended to other media components. New use cases require an innovative approach, and nowadays we are seeing more and more that require wireless delivery of audio and video from tablets, mobile phones and other consumer devices to large screens such as that of a DTV. Both the sending and receiving devices must support compression of the video stream using a codec such as H.264. In a typical use case, instead of writing the frame buffer to the screen memory, the Display Processor will send it to the Video Encoder, and the compressed frame will then be sent over the Wi-Fi network.




Motion Search Elimination extends the concept of Transaction Elimination - introduced last year in Mali GPUs and described below - to the Display and Video Processors. Each of them maintains a signature per tile, and when the Display Processor writes the frame buffer out, the Video Processor can eliminate motion search for tiles whose signatures match. Why does this matter? Motion estimation is an expensive part of the video pipeline, so skipping the search for selected tiles lowers the latency of Wi-Fi transmission, lowers bandwidth consumption and, as a result, lowers power across the entire SoC.
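As a rough sketch of the signature-matching idea (not ARM's actual hardware scheme - here CRC32 stands in for the tile signature, and the tile contents are made up):

```python
import zlib

def signatures(tiles):
    # CRC32 stands in for the hardware-maintained per-tile signature
    return [zlib.crc32(t) for t in tiles]

prev_frame = [b'\x00' * 256, b'\x11' * 256, b'\x22' * 256]  # 3 tiles
curr_frame = [b'\x00' * 256, b'\xff' * 256, b'\x22' * 256]  # only tile 1 changed

# Tiles whose signatures match need no motion search at all
skip = [i for i, (a, b) in enumerate(zip(signatures(prev_frame),
                                         signatures(curr_frame))) if a == b]
print(skip)  # [0, 2]
```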


Transaction Elimination – Saving External Bandwidth


Transaction Elimination (TE) is a key bandwidth saving feature of the ARM Mali Midgard GPU architecture that allows significant energy savings when writing out frame buffers. In a nutshell, when TE is enabled, the GPU compares the current frame buffer with the previously rendered frame and performs a partial update only to the particular parts of it that have been modified.


With that, the amount of data that needs to be transmitted per frame to external memory is significantly reduced. TE can be used by every application, for all frame buffer formats supported by the GPU, irrespective of the frame buffer precision requirements. It is highly effective even on first-person shooters and video streams. Given that in many other popular graphics applications, such as user interfaces and casual games, large parts of the frame buffer remain static between two consecutive frames, frame buffer bandwidth savings from TE can reach up to 99%.
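A simple worked example shows why the savings can be so large for mostly static content (the tile counts below are hypothetical):

```python
# Hypothetical UI frame: 1000 tiles, of which only 50 changed (e.g. a clock).
tile_bytes = 16 * 16 * 4        # a 16x16 RGBA tile
total_tiles, changed_tiles = 1000, 50

full_write = total_tiles * tile_bytes       # without Transaction Elimination
partial_write = changed_tiles * tile_bytes  # with TE: changed tiles only
saved_percent = (full_write - partial_write) * 100 // full_write
print(saved_percent)  # 95
```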




Smart Composition - Reducing Bandwidth and Workloads


So is there anything else we could do to minimize the amount of data processed through the GPU? Smart Composition is another technology developed to reduce bandwidth while reading in textures during frame composition and, as outlined by Plout in his blog, it builds on the previously described Transaction Elimination.



By analyzing frames prior to final frame composition, Smart Composition determines whether a given portion of the frame needs to be rendered or whether the previously rendered and composited portion can be reused. If that portion of the frame can be reused, it is not read from memory again or re-composited, saving additional computational effort.


AFBC - Bandwidth Reduction in the Media System


Now let’s look more closely at the interactions between the GPU, Video and Display Processors. One of the most bandwidth-intensive use cases is video post-processing. In many use cases, the GPU is required to read a video and apply effects, using video streams as textures in 2D or 3D scenes. In such cases, ARM Frame Buffer Compression (AFBC), a lossless image compression protocol and format with fine-grained random access, reduces the overall system-level bandwidth and power by up to 50% by minimizing the amount of data transferred between IP blocks within an SoC.



When AFBC is used in an SoC, the Video Processor simply writes out the video streams in the compressed format, and the GPU reads them and only decompresses them in on-chip memory. Exactly the same optimization is applied to the output buffers intended for the screen: whether it is the GPU or the Video Processor producing the final frame buffers, they will be compressed, so the Display Processor reads them in the AFBC format and only decompresses when moving data to the display memory. AFBC is described in more detail in Ola’s blog Mali-V500 video processor: reducing memory bandwidth with AFBC.
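AFBC itself is a proprietary hardware format, but the principle - compress losslessly at the producer, decompress on-chip at the consumer - can be illustrated with a generic lossless codec such as zlib standing in for it:

```python
import zlib

# A synthetic frame with the kind of redundancy real content often has.
frame = bytes(x % 16 for x in range(64 * 64)) * 4

# Producer (e.g. Video Processor) compresses once before writing out...
compressed = zlib.compress(frame)
# ...consumer (e.g. GPU) decompresses on read; nothing is lost.
assert zlib.decompress(compressed) == frame
print(len(compressed) < len(frame) // 2)  # True: far fewer bytes on the bus
```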



ASTC - Flexibility, Reduced Size and Improved Quality


But what about interactions between the GPU and a graphics application such as a high-end game or user interface? This is the perfect opportunity to optimize the amount of memory that texture assets require. Adaptive Scalable Texture Compression (ASTC) technology was developed by ARM and AMD, donated to Khronos, and has been adopted as an official extension to both the OpenGL® and OpenGL® ES graphics APIs. ASTC is a major step forward in reducing memory bandwidth and thus energy use, all while maintaining image quality.


The ASTC specification includes two profiles: LDR and Full, both of which are already supported on Mali-T62X GPUs and above and are described in more detail by Tom Olson and Stacy Smith.
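One detail worth seeing in numbers: an ASTC block always occupies 128 bits, and the bit rate scales because the block footprint (the number of texels a block covers) varies:

```python
BLOCK_BITS = 128  # every ASTC block is stored in exactly 128 bits
for w, h in [(4, 4), (6, 6), (8, 8), (12, 12)]:
    print(f"{w}x{h}: {BLOCK_BITS / (w * h):.2f} bpp")
# 4x4: 8.00 bpp
# 6x6: 3.56 bpp
# 8x8: 2.00 bpp
# 12x12: 0.89 bpp
```

Larger footprints trade quality for a lower bit rate, which is exactly the "scalable" part of the name.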

Mali OpenGL ES Extensions – Efficient Deferred Shading


To finish, let’s explore another great opportunity to optimize system bandwidth. Modern games apply various post-processing effects, and textures are often combined with the frame buffer. This means that memory is written out through the external bus and then read back multiple times to achieve advanced graphics effects. This type of deferred rendering requires the transfer of a significant amount of external data and consumes a lot of power. But Mali GPUs have a tile-based rendering architecture, which means that fragment shading is performed tile by tile using on-chip memory, and only when all the contents of a tile have been processed is the tile data written back out to external memory.




This is a perfect opportunity to employ deferred shading without the need to write out the data through the external bus. ARM has introduced two advanced OpenGL ES extensions that enable developers to achieve console-like effects within the mobile bandwidth and power budget: Shader Framebuffer Fetch and Shader Pixel Local Storage. For more information on these extensions, read Jan-Harald’s blog.
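To get a feel for the traffic at stake, consider a hypothetical 1080p G-buffer with four RGBA8 attachments: a classic deferred pipeline writes it all out and reads it back every frame, while keeping it in on-chip tile memory avoids that round trip entirely (the figures are illustrative, not measured):

```python
# Hypothetical 1080p G-buffer: four RGBA8 render targets.
w, h, bytes_pp, targets = 1920, 1080, 4, 4

gbuffer_bytes = w * h * bytes_pp * targets
external_traffic = 2 * gbuffer_bytes   # write out, then read back, per frame
print(external_traffic // 2**20)       # 63 (MiB/frame avoided via tile memory)
```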



Anything Else?


So have we exhausted all of the possibilities to minimize system bandwidth with the technologies described in this blog? The good news is … of course not! At ARM we are positively obsessed with finding new areas for optimizations and making mobile media systems even more power efficient. In each generation our Silicon Partners, OEMs and end consumers help us to discover new use cases which are posing different challenges and requirements. With this there is a constant stream of new opportunities to get our innovation engines going and design even more efficient SoCs.


Got any questions on the technologies outlined above? Let us know in the comments section below.


Following on from my previous blog Mali Tutorial Programme for Novice Developers, I am pleased to announce that the first complete semester of 12 tutorials has now been finished and released. As a reminder, these tutorials are meant for people with no prior graphics or even Android experience. Over the course of the 12 tutorials we will take you from a simple Android application all the way to an application that loads models from industry-standard modelling packages and applies lighting and normal mapping to them correctly. A getting started guide is also included to help you set up your computer to build Android applications.


These tutorials are meant to follow on from each other, with each one building on the previous one; once you get to the simple cube, most of the later tutorials are based on it. To download these tutorials, all you need to do is download the Mali Android SDK from the Mali Developer Center.


Here is a brief summary of the 12 tutorials and what is included in each:


1) First Android Native Application: An introduction to creating a basic Android application that uses both the Android SDK and the Android NDK.


2) Introduction to Shaders: A brief introduction to shaders and the graphics pipeline. This tutorial is a great companion to the rest of the tutorials and gives you better insight into some of the concepts used later on. It is also great to come back to when you have completed the later tutorials, as it will help to answer some of the questions you may have.


3) Graphics Setup: This tutorial teaches you all the setup required to run OpenGL® ES graphics applications on an Android platform. It briefly talks about EGL, surface and contexts - just enough for you to be able to draw graphics in the next tutorial.


4) Simple Triangle: Finally you get to draw something to the screen! It is only a triangle, but this triangle will be the basis for nearly everything you choose to do with mobile graphics.


5) Simple Cube: In this tutorial we explore how to use 3D objects. Mathematical transformations are also discussed so that you are able to move and rotate the cube at will.


6) Texture Cube: Once we have the cube it is time to start making it more realistic. An easy way to do this is through texturing. You can think about it like wallpapering the cube with an image. This allows you to add a lot of detail really simply.


7) Lighting: Next we add a realistic approximation to lighting to give the scene more atmosphere. We also go through some of the maths that is involved in the lighting approximations.


8) Normal Mapping: This is a way to make our lighting look even more realistic without a heavy computational cost at runtime. It works by doing most of the computation offline and baking the result into a texture.


9) Asset Loading: This is the tutorial where we get to move away from using a standard cube. This tutorial teaches you how to import objects generated from third party tools. This means you can add objects into your application like characters, furniture and even whole buildings.


10) Vertex Buffer Objects: Bandwidth is a huge limiting factor when writing a graphics application for mobile. In this tutorial we explore one way to reduce this by sending vertex information only once.


11) Android File Loading: Up until now all of our textures and shaders have been embedded in the C or Java files that we have been using. This tutorial shows you how to split them out into separate files and bundle them into your APK. It also teaches you how to extract your files from the APK again at runtime.


12) Mipmapping and Compressed Textures: As a follow on from Vertex Buffer Objects, this tutorial explores two other ways of reducing bandwidth. OpenGL ES supports the use of certain compressed texture formats. This tutorial explores those as well as using smaller versions of the same texture to deliver not only better looking results, but also a reduction in bandwidth.


Got any questions or feedback concerning these tutorials? Let me know in the comments section below.

I am interrupting my blog series to share what I think is a rather elegant way to quickly get up and running with OpenCL on the ARM® Mali-T604 GPU powered Chromebook. Please bear in mind that this is not ARM's "official guide" (which can be found here). However, it's a useful alternative to the official guide if, for example, you don't have a Linux PC or just want to use Chrome OS day in and day out.


You will need:


How fast you complete the installation will depend on how quickly you can copy and paste instructions from this guide, how fast your Internet connection is and how fast your memory card is (I will give an approximate time for each step, measured using 30 MB/s and 45 MB/s cards). The basic OpenCL installation should take up to half an hour; PyOpenCL and NumPy about an hour; further SciPy libraries about 3-4 hours. Most of the time, however, you will be able to leave the Chromebook unattended, beavering away compiling packages from source.


Finally, the instructions are provided "as is", you use them at your own risk, and so on, and so forth... (The official guide also contains an important disclaimer.)


Installing OpenCL

Enabling Developer Mode

NB: Enabling Developer Mode erases all user data - do a back up first.


Enter Recovery Mode by holding the ESC and REFRESH (↻ or F3) buttons, and pressing the POWER button. In Recovery Mode, press Ctrl+D and ENTER to confirm and enable Developer Mode.


Entering developer shell (1 min)

Open the Chrome browser and press Ctrl-Alt-T.

Welcome to crosh, the Chrome OS developer shell.

If you got here by mistake, don't panic!  Just close this tab and carry on.

Type 'help' for a list of commands.

Don't panic, keep the tab opened and carry on to enter the shell:

crosh> shell
chronos@localhost / $ uname -a
Linux localhost 3.8.11 #1 SMP Mon Sep 22 22:27:45 PDT 2014 armv7l SAMSUNG EXYNOS5 (Flattened Device Tree) GNU/Linux


Preparing an SD card (5 min)

Insert a blank SD card (denoted as /dev/mmcblk1 in what follows):

chronos@localhost / $ sudo parted -a optimal /dev/mmcblk1
GNU Parted 3.1
Using /dev/mmcblk1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
Warning: The existing disk label on /dev/mmcblk1 will be destroyed 
and all data on this disk will be lost. Do you want to continue?
Yes/No? Y
(parted) unit mib
(parted) mkpart primary 1 -1
(parted) name 1 root
(parted) print
Model: SD SU08G (sd/mmc)
Disk /dev/mmcblk1: 7580MiB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start    End      Size      File system  Name  Flags
 1      1.00MiB  7579MiB  7578MiB                root

(parted) quit

Make sure the card is not mounted, then format it e.g.:

chronos@localhost / $ sudo mkfs.ext3 /dev/mmcblk1p1

NB: If you use a card that is less than 8 GB, you may need to reserve enough inodes when you format the card e.g.:

chronos@localhost / $ sudo mkfs.ext3 /dev/mmcblk1p1 -j -T small

Mount the card and check that it's ready:

chronos@localhost / $ sudo mkdir -p ~/gentoo
chronos@localhost / $ sudo mount -o rw,exec -t ext3 /dev/mmcblk1p1 ~/gentoo
chronos@localhost / $ df -h ~/gentoo
/dev/mmcblk1p1  7.2G   17M  6.8G   1% /home/chronos/user/gentoo
chronos@localhost / $ df -hi ~/gentoo
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/mmcblk1p1   475K    11  475K    1% /home/chronos/user/gentoo

Installing Gentoo Linux (10-15 min)

chronos@localhost / $ cd ~/gentoo
chronos@localhost ~/gentoo $ ls -la
total 36
drwxr-xr-x  3 root    root            4096 Oct  7 21:37 .
drwx--x--- 33 chronos chronos-access 16384 Oct  7 21:43 ..
drwx------  2 root    root           16384 Oct  7 21:37 lost+found

Download the latest stage 3 archive for armv7a_hardfp:

chronos@localhost ~/gentoo $ sudo wget http://distfiles.gentoo.org/releases/arm/autobuilds/latest-stage3-armv7a_hardfp.txt
chronos@localhost ~/gentoo $ sudo wget http://distfiles.gentoo.org/releases/arm/autobuilds/`cat latest-stage3-armv7a_hardfp.txt | grep stage3-armv7a_hardfp`

Extract the downloaded archive right onto the card e.g.:

chronos@localhost ~/gentoo $ sudo tar xjpf stage3-armv7a_hardfp-20140819.tar.bz2

Clean up:

chronos@localhost ~/gentoo $ sudo rm stage3-armv7a_hardfp-20140819.tar.bz2
chronos@localhost ~/gentoo $ sudo rm latest-stage3-armv7a_hardfp.txt
chronos@localhost ~/gentoo $ ls -la
total 92
drwxr-xr-x  21 root root  4096 Oct  9 19:12 .
drwxr-xr-x  21 root root  4096 Oct  9 19:12 ..
drwxr-xr-x   2 root root  4096 Aug 20 14:44 bin
drwxr-xr-x   2 root root  4096 Aug 20 07:16 boot
drwxr-xr-x  17 root root  3760 Oct  9 18:59 dev
-rwxr--r--   1 root root    85 Oct  7 21:38 enter.sh
drwxr-xr-x  33 root root  4096 Oct  9 19:12 etc
drwxr-xr-x   2 root root  4096 Oct  7 22:14 fbdev
drwxr-xr-x   2 root root  4096 Aug 20 07:16 home
drwxr-xr-x   8 root root  4096 Oct  9 19:08 lib
drwx------   2 root root 16384 Oct  7 20:37 lost+found
drwxr-xr-x   2 root root  4096 Aug 20 07:16 media
drwxr-xr-x   2 root root  4096 Aug 20 07:16 mnt
drwxr-xr-x   2 root root  4096 Aug 20 07:16 opt
dr-xr-xr-x 195 root root     0 Jan  1  1970 proc
drwx------   5 root root  4096 Oct  8 20:46 root
drwxr-xr-x   3 root root  4096 Aug 20 14:43 run
drwxr-xr-x   2 root root  4096 Aug 20 14:54 sbin
-rwxr--r--   1 root root   192 Oct  7 21:38 setup.sh
dr-xr-xr-x  12 root root     0 Oct  9 18:58 sys
drwxrwxrwt   5 root root  4096 Oct  9 19:11 tmp
drwxr-xr-x  12 root root  4096 Oct  7 22:20 usr
drwxr-xr-x   9 root root  4096 Aug 20 07:16 var


Downloading OpenCL drivers (4 min)

Go to the page listing Mali-T6xx Linux drivers and download mali-t604_r4p0-02rel0_linux_1+fbdev.tar.gz. Make sure you carefully read and accept the associated licence terms.

chronos@localhost ~/gentoo $ sudo tar xvzf ~/Downloads/mali-t604_r4p0-02rel0_linux_1+fbdev.tar.gz

This will create ~/gentoo/fbdev which we will use later.


Entering Gentoo Linux (2 min)

Similar to crouton, we will use chroot to enter our Linux environment.


Create two scripts and make them executable:

chronos@localhost ~/gentoo $ sudo vim ~/gentoo/setup.sh
mount -t proc /proc $GENTOO_DIR/proc
mount --rbind /sys  $GENTOO_DIR/sys
mount --rbind /dev  $GENTOO_DIR/dev
cp /etc/resolv.conf $GENTOO_DIR/etc
chronos@localhost ~/gentoo $ sudo vim ~/gentoo/enter.sh
LC_ALL=C chroot $GENTOO_DIR /bin/bash
chronos@localhost ~/gentoo $ sudo chmod u+x ~/gentoo/setup.sh ~/gentoo/enter.sh

Execute the scripts:

chronos@localhost ~/gentoo $ sudo ~/gentoo/setup.sh
chronos@localhost ~/gentoo $ sudo ~/gentoo/enter.sh

Note that the ~/gentoo directory will become the root (/) directory once we enter our new Linux environment. For example, ~/gentoo/fbdev will become /fbdev inside the Linux environment.


Installing OpenCL header files (2 min)

Download OpenCL header files from the Khronos OpenCL registry:

localhost / # mkdir /usr/include/CL && cd /usr/include/CL
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/opencl.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl_platform.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl_gl.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl_ext.h


Installing OpenCL driver (2 min)

Change properties on the downloaded OpenCL driver files and copy them to /usr/lib:

localhost / # chown root /fbdev/*
localhost / # chgrp root /fbdev/*
localhost / # chmod 755 /fbdev/*
localhost / # mv /fbdev/* /usr/lib
localhost / # rmdir /fbdev



By now you should have a mint Linux installation complete with the OpenCL drivers and headers, so you can start playing with OpenCL!

When you reboot, you just need to mount the card and execute the setup script again:

chronos@localhost / $ sudo mount -o rw,exec -t ext3 /dev/mmcblk1p1 ~/gentoo
chronos@localhost / $ sudo ~/gentoo/setup.sh

Then you can pop in and out of the Linux environment with:

chronos@localhost / $ sudo ~/gentoo/enter.sh
localhost / # exit
chronos@localhost / $

But the fun just begins here! Follow the instructions below to install PyOpenCL and SciPy libraries for scientific computing.


Installing PyOpenCL

Configuring Portage (15 min)

Portage is Gentoo's package management system.

localhost / # echo "MAKEOPTS=\"-j2\"" >> /etc/portage/make.conf
localhost / # echo "ACCEPT_KEYWORDS=\"~arm\"" >> /etc/portage/make.conf
localhost / # mkdir /etc/portage/profile
localhost / # mkdir /etc/portage/package.use
localhost / # mkdir /etc/portage/package.unmask
localhost / # mkdir /etc/portage/package.accept_keywords
localhost / # mkdir /etc/portage/package.keywords
localhost / # touch /etc/portage/package.keywords/dependences

Perform an update:

localhost / # emerge --sync
localhost / # emerge --oneshot portage
localhost / # eselect news read


Selecting Python 2.7 (1 min)

localhost / # eselect python set python2.7


Installing NumPy (30-40 min)

Install NumPy with LAPACK as follows.

localhost / # echo "dev-python/numpy lapack" >> /etc/portage/package.use/numpy
localhost / # echo "dev-python/numpy -lapack" >> /etc/portage/profile/package.use.mask
localhost / # emerge --autounmask-write dev-python/numpy
localhost / # python -c "import numpy; print numpy.__version__"


Installing PyOpenCL (5-10 min)

Install PyOpenCL.

localhost / # cd /tmp
localhost tmp # wget https://pypi.python.org/packages/source/p/pyopencl/pyopencl-2014.1.tar.gz
localhost tmp # tar xvzf pyopencl-2014.1.tar.gz
localhost tmp # cd pyopencl-2014.1
localhost pyopencl-2014.1 # python configure.py
localhost pyopencl-2014.1 # make install
localhost pyopencl-2014.1 # cd examples
localhost examples # python demo.py
(0.0, 241.63054)
localhost examples # python -c "import pyopencl; print pyopencl.VERSION_TEXT"


Installing scientific libraries

If you would like to follow my posts on benchmarking (e.g. see the intro), I recommend you install packages from the SciPy family.


Installing IPython (30-45 min)

localhost / # emerge --autounmask-write dev-python/ipython
localhost / # ipython --version


Installing IPython Notebook (3-7 min)

Install IPython Notebook to enjoy a fun blend of Chrome OS and IPython experience.


localhost / # emerge dev-python/jinja dev-python/pyzmq www-servers/tornado
localhost / # ipython notebook
2014-05-08 06:49:08.424 [NotebookApp] Using existing profile dir: u'/root/.ipython/profile_default'
2014-05-08 06:49:08.440 [NotebookApp] Using MathJax from CDN: http://cdn.mathjax.org/mathjax/latest/MathJax.js
2014-05-08 06:49:08.485 [NotebookApp] Serving notebooks from local directory: /
2014-05-08 06:49:08.485 [NotebookApp] The IPython Notebook is running at:
2014-05-08 06:49:08.486 [NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
2014-05-08 06:49:08.486 [NotebookApp] WARNING | No web browser found: could not locate runnable browser.

Open in a new Chrome tab to start creating your own IPython Notebooks!


Installing Matplotlib (35-50 min)

localhost / # emerge --autounmask-write dev-python/matplotlib
localhost / # python -c "import matplotlib; print matplotlib.__version__"


Installing SciPy (45-60 min)

localhost / # emerge --autounmask-write sci-libs/scipy
localhost / # python -c "import scipy; print scipy.__version__"


Installing Pandas (55-80 min)

localhost / # emerge --autounmask-write dev-python/pandas
localhost / # etc-update
Scanning Configuration files...
The following is the list of files which need updating, each
configuration file is followed by a list of possible replacement files.
1) /etc/portage/package.keywords/dependences (1)
Please select a file to edit by entering the corresponding number.
              (don't use -3, -5, -7 or -9 if you're unsure what to do)
              (-1 to exit) (-3 to auto merge all files)
                           (-5 to auto-merge AND not use 'mv -i')
                           (-7 to discard all updates)
                           (-9 to discard all updates AND not use 'rm -i'): -3
Replacing /etc/portage/package.keywords/dependences with /etc/portage/package.keywords/._cfg0000_dependences
mv: overwrite '/etc/portage/package.keywords/dependences'? y
Exiting: Nothing left to do; exiting.
localhost / # emerge dev-python/pandas
localhost / # python -c "import pandas; print pandas.__version__"

Mali at Techcon 2014

Posted by Tim Hartley Oct 2, 2014

Techcon is 10!  Yes, for the tenth time ARM Techcon is up and running from 1 to 3 October at the Santa Clara Convention Center.  Ahead of three days of presentations, demonstrations, tutorials, keynotes, exhibitions and panels from the great and the good of the embedded community, Peter Hutton, Executive VP & President, Products Group, kicked it all off by demonstrating how widely supported ARMv8 is – from entry-level phones through to servers.  Joining him on stage was Dr. Tom Bradicich from HP, announcing two enterprise-class ARMv8 servers.  And joining him was one of his first customers, Dr. Jim Ang from Sandia Labs.  If this went on, there was going to be no room left on the stage.


For Mali-philes, graphics, display and video are of course on show here in force.  There’ll be some great, enigmatically named talks.  Tom Cooksey’s “Do Androids Have Nightmares of Botched System Integrations?” will showcase the critical parts of the Android media subsystem and how three key Mali products – Graphics, Display and Video – can come together to become something greater than the sum of their parts.  Brad Grantham is speaking on “Optimizing Graphics Development with Parameterized Batching”, and Tom Olson is tackling bandwidth-efficient rendering using pixel local storage.  Anton Lokhmotov’s GPU Compute optimisation guide “Love your Code?  Optimize it the Right Way” will attempt the impossible by mixing live demonstrations on a Chromebook with absolutely no PowerPoint slides at all.


Mali is also much in evidence on the show floor, all part of the buzzing ARM Techcon Expo.  Live demonstrations are showcasing ASTC texture encoding, transaction elimination with Mali-T600 and some of the power-saving properties of Ittiam Systems’ HEVC decoder running on an energy-efficient combination of CPU and Mali GPU.


As well as Anton’s talk there is plenty to keep GPU Compute fans happy.  Roberto Mijat and I presented a talk this morning about how ARM is working with developers to optimise applications using GPU Compute on Mali.  And Roberto is back with a panel discussion in the Expo, “Meet the Revolutionaries who are Making GPU Compute a Reality!”, with representatives from Ittiam, Khronos and ArcSoft discussing developments in this growing field.


Do watch this space... there'll be more detail and blogs about the talks soon.  And if you’re in the area, do come by and check it out!

Chinese Version 中文版: NEON驱动OpenCL强化异构多处理 (NEON-Powered OpenCL Boosts Heterogeneous Multi-Processing)

OpenCL - First Mali and Now NEON


I am currently in Santa Clara for ARM TechCon where the latest technologies from ARM and its partners will be on show from tomorrow. There will be a number of exciting announcements from ARM this week, but the one that I have been most involved in is the launch today of a new product that supports OpenCL™ on CPUs with ARM® NEON™ technology and also on the already supported ARM Mali™ Midgard GPUs. NEON is a 128-bit SIMD (Single Instruction, Multiple Data) architecture extension included in all the latest ARM Cortex®-A class processors, so along with Mali GPUs it’s already widely available in current generation devices and an extremely suitable candidate to benefit from the advantages of OpenCL.


What is OpenCL Anyway?


It’s worth starting with a brief explanation of why support for the OpenCL compute API is important. There are a number of industry trends that create challenges for the software developer. For example, heterogeneous multiprocessing is great for performance and efficiency, but the diversity of instruction sets can often lead to a lack of portability. Another example is that parallel computing gets a task done more quickly, but programming parallel systems is notoriously difficult. This is where OpenCL comes in. It is a computing language (OpenCL C) that enables easier, portable and more efficient programming across heterogeneous platforms, and it is also an API that coordinates parallel computation on those heterogeneous processors. OpenCL load balances tasks across all the available processors in a system; it even simplifies the programming of multi-core NEON by treating it as a single OpenCL device. This is all about efficiently matching the ‘right task to the right processor’.
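The "same kernel, many work-items" model is easy to see even without an OpenCL runtime. A plain-Python analogue of the classic OpenCL C vector-add kernel, where the loop index plays the role of get_global_id(0), looks like this:

```python
def vec_add(gid, a, b, c):
    # Analogue of an OpenCL C kernel body: c[i] = a[i] + b[i]
    c[gid] = a[gid] + b[gid]

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
c = [0] * len(a)
for gid in range(len(c)):   # the OpenCL runtime launches these in parallel
    vec_add(gid, a, b, c)
print(c)  # [11, 22, 33, 44]
```

The crucial difference is that an OpenCL runtime dispatches the work-items concurrently across whatever devices are available - NEON-capable CPU cores, a Mali GPU, or both.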


Figure 1: OpenCL is especially suited to parallel processing of large data sets


Where Can I Use OpenCL?


OpenCL can be used wherever an algorithm lends itself to parallelisation and is being used to process a large data-set. Examples of such algorithms and use-cases can be found in many types of device and include:



Camera and imaging

  1. The stabilization, editing, correction and enhancement of images; stitching panoramic images
  2. Face, smile and landmark recognition (for tagging with metadata)
  3. Computer vision, augmented reality


Digital TV

  1. Upscaling, downscaling; conversion from 2D to Stereo 3D
  2. Support for emerging codec standards (e.g. HEVC)
  3. Pre- and post-processing (stabilizing, transcoding, colour-conversion)
  4. User interfaces: multi-viewer gesture-based UI and speech control



Automotive

  1. Advanced Driver Assistance Systems (ADAS)
  2. Lane departure and collision warnings; road sign and pedestrian detection
  3. Dashboard, infotainment, advanced navigation and dynamic cruise control


A Tale of Two Profiles


OpenCL supports two ‘profiles’:


  1. A ‘Full Profile’, which provides the full set of OpenCL features
  2. An ‘Embedded Profile’, which is a strict subset of the Full Profile – and is provided for compatibility with legacy systems


The OpenCL for NEON driver and the OpenCL for Mali Midgard GPU driver both support Full Profile. The heritage of OpenCL from desktop systems means that most existing OpenCL software algorithms have been developed for Full Profile. This makes ARM’s Full Profile support very attractive to programmers who can develop on desktop using mature tools with increased productivity and get products to market faster. Another key benefit is that floating point calculations in OpenCL Full Profile are compliant with the IEEE-754 standard, guaranteeing the precision of results.
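As a sketch of what the IEEE-754 guarantee means in practice, the snippet below emulates single-precision ("float") storage using Python's standard struct module; the `to_fp32` helper is purely illustrative and not part of any OpenCL API. The point is that correctly rounded single-precision results are deterministic, so any compliant Full Profile implementation must produce exactly the same bits:

```python
import struct

def to_fp32(x):
    """Round a Python float (binary64) to IEEE-754 binary32, as an OpenCL
    'float' would be stored. Illustrative helper, not an OpenCL API."""
    return struct.unpack('f', struct.pack('f', x))[0]

# 0.1 is not exactly representable in binary32:
print(to_fp32(0.1))                          # 0.10000000149011612
# Correctly rounded single-precision addition is deterministic, so every
# compliant implementation adding these two values must produce:
print(to_fp32(to_fp32(0.1) + to_fp32(0.2)))  # 0.30000001192092896
```

This determinism is what lets developers validate an algorithm on a desktop OpenCL implementation and trust that a compliant mobile implementation will match it bit-for-bit for basic operations.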


OpenCL for NEON and Mali - Better Together


The OpenCL for NEON and the Mali Midgard GPU drivers are designed to operate together within the same OpenCL context. This close-coupling of the drivers enables them to operate with maximum efficiency. For example, memory coherency and inter-queue dependencies are resolved automatically within the drivers. We refer to this version of OpenCL for NEON as the ‘plug-in’ because it ‘plugs into’ the Mali Midgard GPU OpenCL driver.



Figure 2: The benefits of keeping the CPU and GPU in one CL_Context


And Not Forgetting the Utgard GPUs - Mali-400 MP & Mali-450 MP


There is also a ‘standalone’ version of OpenCL for NEON that is available to use alongside Mali Utgard GPUs, such as the Mali-400 MP and Mali-450 MP. These particular GPUs focus on supporting graphics APIs really efficiently, but not compute APIs such as OpenCL. Therefore adding OpenCL support on the CPU with NEON is an excellent way to add compute capability into the system. The ‘standalone’ version is also suitable for use when there is no GPU in the system.


Reaching Out


In addition, as the diagram below shows, the ARM OpenCL framework can be connected to other OpenCL frameworks in order to extend OpenCL beyond NEON and Mali GPUs to proprietary hardware devices, for example those built with FPGA fabric. This is achieved by using the Khronos Installable Client Driver (ICD) which is supported by the ARM OpenCL framework.


Figure 3: Using the Khronos ICD to connect the ARM OpenCL context with other devices


In Summary


We've seen that OpenCL for NEON will enhance compute processing on any platform that uses a Cortex-A class processor with NEON. This is true whether the platform includes a Mali Midgard GPU, an Utgard GPU, or maybe has no graphics processor at all. However, the coupling of NEON with a Midgard GPU delivers the greatest efficiencies.


As algorithms for mobile use cases become more complex, technologies such as OpenCL for NEON are increasingly important for their successful execution. The OpenCL for NEON product is available for licensing immediately; if you would like further information please contact your local ARM sales representative.

Further Reading


For more information on OpenCL, Compute and current use cases that are being developed by the ARM Ecosystem:


Realizing the Benefits of GPU Compute for Real Applications with Mali GPUs

Interested in GPU Compute? You have choices!

GPU Compute, OpenCL and RenderScript Tutorials on the Mali Developer Center

The Mali Ecosystem demonstrates GPU Compute solutions at the 2014 Multimedia Seminars


ARM is an official Khronos Adopter and an active contributor to OpenCL as a Working Group Member

Evaluating compute performance on mobile platforms: an introduction

Using the GPU for compute-intensive processing is all about improving performance compared to using the CPU only. But how do we measure performance in the first place? In this post, I'll touch upon some basics of benchmarking compute workloads on mobile platforms to ensure we are on solid ground when talking about performance improvements.


Benchmarking basics

To measure performance, we select a workload and a metric of its performance. Because workloads are often called benchmarks, the process of evaluating performance is usually called benchmarking.


Selecting a representative workload is a bit of a dark art so we will leave this topic for another day. Selecting a metric is more straightforward.


The most widely used metric is execution time. Put bluntly, the lower the execution time, the faster the system. In other words, lower is better.


Frequently, the chosen metric is inversely proportional to the execution time. So, the higher the metric is, the lower the execution time is. In other words, higher is better. For example, when measuring memory bandwidth, the usual metric is the amount of data copied per unit time. As this metric is inversely proportional to the execution time, higher is better.
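The inversion between the two kinds of metric can be sketched in a couple of lines of Python (the run times and transfer size below are invented for illustration):

```python
def bandwidth_gib_per_s(bytes_copied, seconds):
    """Derive a 'higher is better' bandwidth metric from a
    'lower is better' execution time."""
    return bytes_copied / seconds / 2**30

t_a, t_b = 0.5, 0.25                      # two runs copying the same 1 GiB
bw_a = bandwidth_gib_per_s(2**30, t_a)    # 2.0 GiB/s
bw_b = bandwidth_gib_per_s(2**30, t_b)    # 4.0 GiB/s
assert (t_b < t_a) == (bw_b > bw_a)       # lower time <=> higher bandwidth
```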


Benchmarking pitfalls

Benchmarking on mobile platforms can be tricky. Running experiments back to back can produce unexpected performance variation, and so can dwindling battery charge, hot room temperature or an alignment of stars. Fundamentally, we are talking about battery powered, passively cooled devices which tend to like saving their battery charge and keeping their temperature down. In particular, dynamic voltage and frequency scaling (DVFS) can get in the way. Controlling these factors (or at least accounting for them) is key to meaningful performance evaluation on mobile platforms.


Deciding what to measure and how to measure it deserves special attention. In particular, when focussing on optimising device code (kernels), it's important to measure kernel execution time directly, because host overheads can hide effects of kernel optimisations.
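A toy model, with hypothetical numbers, of why measuring at the wrong boundary misleads: a 2x kernel speedup can look like a much smaller end-to-end speedup once fixed host overhead (enqueue, map/unmap, driver time) is included in the measurement.

```python
def end_to_end_speedup(kernel_ms, host_overhead_ms, optimised_kernel_ms):
    """Speedup observed when timing the whole dispatch, where a fixed host
    overhead is paid regardless of kernel time. Numbers are hypothetical."""
    before = kernel_ms + host_overhead_ms
    after = optimised_kernel_ms + host_overhead_ms
    return before / after

kernel_only_speedup = 4.0 / 2.0                   # kernel got 2x faster
end_to_end = end_to_end_speedup(4.0, 6.0, 2.0)    # but only 1.25x overall
print(kernel_only_speedup, end_to_end)  # 2.0 1.25
```

Timing the kernel directly (for example via OpenCL event profiling) reports the 2x; timing around the whole dispatch reports the diluted 1.25x and can hide a genuinely successful optimisation.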


To illustrate some of the pitfalls, I have created an IPython Notebook which I encourage you to view before peeking into our next topic.


Sample from the IPython Notebook

What's next?


Using GPU settings that are ill-suited for evaluating performance is common but should not bite you once you've become aware of it. However, even when all known experimental factors are carefully controlled for, experiments on real systems may produce noticeably different results from run to run. To properly evaluate performance, what we really need is a good grasp of basic statistical concepts and techniques...


Are you snoring already? I too used to think that statistics was dull and impenetrable. (A confession: statistics was the only subject I flunked at university, I swear!) Apparently not so, when you apply it to optimising performance! If you are at the ARM TechCon on 1-3 October 2014, come along to my live demo, or just wait a little bit longer and I will tell you all you need to know!

Throughout this year, application developers have continued to release a vast range of apps using both the OpenGL® ES 2.0 and 3.0 APIs. The more recent API offers a wider range of features, and performance can be better on GPUs that support OpenGL ES 3.0 onwards; nevertheless, thanks to the backwards compatibility of OpenGL ES versions, the success and longevity of more cost-optimized OpenGL ES 2.0 GPUs looks set to continue. A consequence of this trend is that demand for the ARM® Mali™-450 MP graphics processor, a design optimised for OpenGL ES 2.0 acceleration, has never been higher.


The momentum behind ARM’s 64-bit ARMv8-A application processor architecture is growing, enabling more complex applications within strict power budgets. We were able to announce last week the 50th license of the technology across 27 different companies, showing that the demand for greater compute capabilities across a wide range of applications is strong.


This market support gave us the opportunity to further optimize the performance of our Mali-450 drivers to support 64-bit builds of OpenGL ES 2.0 apps. So, that’s exactly what we’ve done, with a brand new set of 64-bit Mali-450 drivers that were released to our partners recently. Examples of where we see a Mali-450 GPU and Cortex-A53 CPU successfully combined is the entry-level smartphone market, where cost efficiency is important but the implementation of a 64-bit CPU can offer the all-important differentiation from the competition. With this release, ARM is making it easier for the mass market to access the latest technology advances while providing silicon partners with a wider choice of which GPU can be paired with which CPU.


So watch out for the new wave of 64-bit devices based on Mali-450 MP and rest assured that the Mali drivers have been optimised for the feature set of the 64-bit CPU.  The only thing you should see is increased app performance, and a few more CPU cycles available – we’re sure you’ll do great things with them.

Chinese Version 中文版:ARM Mali GPU:为新兴市场带来高质量图形


The ARM Ecosystem continues to drive innovation, diversity and opportunity across the entire industry at an astonishing pace, bringing the benefits of semiconductor technology to all potential users across the world. The changes appearing in the cost-efficient segment are especially exciting: there are a huge number of opportunities for silicon vendors and OEMs to successfully differentiate their products for this market. Examples include the growing number of customers looking to upgrade from feature phone to smartphone technology as initiatives such as Android One (launched yesterday in India) emerge and gain momentum; the ability to bring high performing technology, showcasing fast frame rates, great displays and long-lasting batteries, into the mainstream; new applications offering desirable new functionality and capabilities; and new form factors placing mobile silicon into a variety of exciting and affordable new markets. The mass market is discovering a whole host of features that two years ago were only available in premium devices. With all this change taking place, it is no wonder that the industry is seeing shipments of superphones waning, making way for the era of the mass market.


The mass market (entry level & mid range) is predicted to total 80% of total smartphone shipments by 2017 (Source: Mixture of ARM & Gartner estimates)


But just how big is the global mass market opportunity? In ARM’s results statement, we predicted that the mobile app processor market would be worth $20bn in 2018 of which the total addressable market for the mass market would be $10bn. The main geographical areas driving this ongoing smartphone growth are emerging markets such as China, India, Russia, Brazil, as the graph from Credit Suisse shown below predicts. With 1.75 billion people already owning a smartphone, there are still over 5 billion left who are yet to experience fully mobile connectivity. China and India alone are predicted to bring over 400 million new users to this market in 2014.



Emerging markets will be the long term driver for smartphone shipment volumes (Source: Credit Suisse, The Wireless View 2014)

ARM® Mali GPUs have rapidly become the de facto GPU for the mass market and for Android devices as a whole. Thanks to the low energy, low silicon area yet feature rich elements of our cost-efficient roadmap, we are now the most commonly deployed GPU in all new smartphone models with the fastest growing market share across all GPU vendors - in 4Q13 73 new Mali-based smartphones were introduced into the market.  In fact, over 75% of all application processors coming out of APAC now have an ARM Mali GPU inside. The first set of Android One devices, whose goal is to bring affordable smartphone technology to emerging markets, is entirely based on Mediatek's MT6582 SoC featuring a Mali-400 MP2 GPU.



ARM Mali GPUs took the #1 spot in 4Q13 among new models (Source: Bank of America Merrill Lynch Global Research estimates)

The Mali-400 GPU has driven success in this market for all its customers since its announcement in June 2008 and continues to be popular in emerging markets, where great hardware and software have to be brought together in an affordable manner. Beyond the Android One smartphones, it can be found in a range of popular devices, from smartphones to wearables:


  • Oppo Joy (Mediatek MT6572)
  • Huawei Honor 3C (Mediatek MT6582)
  • Alcatel One Touch Idol X Plus (Mediatek MT6592)
  • Samsung Galaxy S5 Mini (Samsung Exynos 3 Quad)
  • Omate TrueSmart Smartwatch (Mediatek MT6572)


However, as the technology behind these devices is evolving at such a fast pace, tomorrow’s mass market consumers will be demanding more from their devices than their current counterparts. For this reason, ARM has developed a long-term GPU IP roadmap that specifically meets the needs of silicon partners addressing this market, ensuring that as consumer values evolve the ARM Ecosystem has everything it needs to continue its success.


For example, OpenGL® ES 3.0 will become the universal standard for developing mobile games and applications. Already, over 20% of devices support this API, according to stats from Android. Mass market consumers will expect to be able to enjoy the latest titles as soon as they come out and getting the most out of them will require a GPU which supports the most popular standards. As another example, the trend for higher resolutions continues and a mass market GPU will be required that has the computational power to deliver the desired performance at higher pixel densities. The ARM Mali-T720 GPU has been developed to meet these needs of future generations of mass market consumers, offering both higher computation capacity and API support up to and including OpenGL ES 3.1.


The opportunities in the mass market are seemingly endless and ARM IP is historically proven to be the leader in this field, offering functional, energy-efficient graphics within the smallest possible silicon area. Our mid-range GPU roadmap is advancing in line with the market with new GPUs ready to become the Mali-400 of the future, combining the best of ARM’s traditional mass market offering with the new requirements of a future age. For more information about ARM’s mass market offerings, visit www.arm.com.



Chris Doran, COO of Geomerics had a recent conversation with GamingBolt to discuss recent developments with Enlighten, how Geomerics is supporting indie game developers, and two major items on the roadmap.


Geomerics has come a long way in the last few years. They are now officially backed by the UK government to set new benchmarks in the movie industry. They are also working closely with EA on games like Star Wars Battlefront and Mirror’s Edge. It’s safe to assume that Geomerics are aware of where the next generation of lighting and graphics technology is heading.


Geomerics Interview: Realizing The Full Potential of Enlighten Using The New Console Cycle « GamingBolt.com: Video Game…

Tom Olson wrote a fantastic series of blogs about performance metrics and how to interpret them. His blog about triangles per second pretty much changed the industry: very quickly, companies had to stop talking nonsense about triangles per second as if it were a useful metric. Now along comes this ground-breaking serious technology research, and the whole basis of comparison, that industry-standard metric of uselessness, is challenged. What shall we do? As useful as:

  • an umbrella in the desert?
  • a concrete lifebelt?
  • a glass hammer?

In this second part of Energy Efficiency in GPU Applications, following on from Part 1, I will show some real SoC power consumption numbers and how they correlate with the workload coming from an application.

Study: How Application Workload Affects Power Consumption


We made a brief study to find out how an application workload affects SoC power consumption. The idea of the study was to develop a small micro-benchmark that runs at just above 60fps on the target devices, i.e. it is always vsync limited. Here is a screenshot from the micro-benchmark (it is called Torus Test):



To leave some room for optimization we added a few deliberate performance issues to the original version of the micro-benchmark:


  • The vertex count is too high
  • The texture size is too high
  • The fragment shader consumes too many cycles
  • Back-face culling is not enabled


We wanted to see how power consumption is affected when we reduce the workload by fixing each of the above performance issues individually. All of these performance issues and their related optimizations are somewhat artificial compared with real applications: the micro-benchmark was written on purpose in a way that none of these extreme optimizations has any major visual impact, whereas with a real-world application you probably wouldn't be able to decrease the texture resolution from 1920x1920 to 96x96 without a drastic impact on the visual quality. However, the effect of the optimizations described here is the same as the effect of optimizing real applications: you improve the energy efficiency of your application by reducing GPU cycles and bandwidth consumption.


At ARM we have a few development SoCs that can be used for measuring actual SoC power consumption which we were able to use in the study. The micro-benchmark allows the measurement of system FPS in offscreen rendering mode without the vsync limit, as described previously.  In the result graphs we use the frame time instead of the system FPS (frame time = 1s / system FPS), because that corresponds to the number of GPU cycles that consume power on the GPU.  We also used the L2 cache external bandwidth counters for measuring the bandwidth consumed by the GPU. By using these metrics we wanted to see how the workload in the application and GPU correlates with the power consumption in the SoC. Here are the results.


Decreasing Vertex Count

The micro-benchmark allows us to configure how many vertices are drawn in each frame. We tested three different values (4160, 2940 and 1760). The following graph shows how the vertex count correlates with the frame time and SoC Power:



This micro-benchmark is not very vertex heavy but still the correlation between vertex count and SoC power consumption is clear. When decreasing the vertex count, power is not only saved by reduced vertex shader processing, but also because there is less external bandwidth needed to copy vertex data to/from the vertex shading core. Therefore we can also see the correlation between vertex count and external bandwidth in the above graph.


Decreasing Texture Size

The micro-benchmark uses a generated texture for texture mapping, which makes it possible to configure the texture size. We tested the performance with three different texture sizes (1920x1920, 960x960 and 96x96). Each object is textured with a separate texture object instance. As expected, the texture size doesn't affect the frame time much but it affects the external bandwidth. We found the following correlation between texture size, external bandwidth and SoC power:



Notice that the bandwidth doesn't decrease linearly with the number of texels in a texture. This is because with a smaller texture size there is a much better hit rate in the L2 cache, which quickly reduces the external bandwidth.
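A back-of-envelope Python model makes this non-linearity plausible. The RGBA8 texel size is real; the 256 KiB L2 size and the all-or-nothing cache behaviour are deliberate simplifications for illustration:

```python
BYTES_PER_TEXEL = 4          # RGBA8
L2_BYTES = 256 * 1024        # hypothetical L2 cache size

def texture_bytes(side):
    """Total size of a square RGBA8 texture."""
    return side * side * BYTES_PER_TEXEL

def steady_state_external_reads(side):
    """Toy model: if the whole texture fits in L2, steady-state external
    reads approach zero; otherwise assume every texel fetch goes external."""
    size = texture_bytes(side)
    return 0 if size <= L2_BYTES else size

for side in (1920, 960, 96):
    print(side, texture_bytes(side), steady_state_external_reads(side))
```

In this toy model, the 96x96 texture (36 KiB) fits entirely in L2 and generates almost no external traffic, while the 1920x1920 texture (14 MiB) cannot, which is why the measured bandwidth drops far faster than the texel count alone would suggest.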


Decreasing Fragment Shader Cycles

The micro-benchmark implements a Phong shading model with a configurable number of light sources.  We tested the performance with three different values for the number of light sources (5, 3, and 1). The Mali Shader Compiler outputs the following cycle count values for these configurations:


Light Sources | Arithmetic Cycles | Load/Store Cycles | Texture Pipe Cycles | Total Cycles


We found the following correlation between the number of fragment shader cycles, frame time and SoC power:




Adding Back-Face Culling and Putting All Optimizations Together

Finally, we tested the SoC power consumption impact when enabling back-face culling and when including all the previous optimization at the same time:



With all these optimizations we managed to reduce the SoC power consumption to less than 40% of the original version of the micro-benchmark. At the same time the frame time was reduced to less than 30% and the bandwidth to less than 10% of the original. The large relative bandwidth reduction is possible because writing the onscreen frame buffer to external memory consumes very little bandwidth in this micro-benchmark: Transaction Elimination was enabled on the device, and it is very effective with this application because there are lots of tiles filled with the constant background color that don't change between frames.
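As a rough sanity check on those figures: if we assume the GPU can race to idle between frames, the energy actively spent rendering one frame scales with average power times active frame time. This is a back-of-envelope sketch, not how the measurements in the study were taken:

```python
# Relative figures from the study (original micro-benchmark = 1.0):
power_rel      = 0.40   # SoC power: reduced to less than 40%
frame_time_rel = 0.30   # frame time: reduced to less than 30%

# Upper-bound sketch: energy per rendered frame ~ average power x active
# frame time, assuming the GPU idles between frames.
energy_per_frame_rel = power_rel * frame_time_rel
print(round(energy_per_frame_rel, 2))  # 0.12, i.e. roughly an 8x reduction
```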



I hope this blog and the case study example have helped you to better understand the factors which impact energy efficiency, and the extent to which SoC power consumption can be reduced by optimizing GPU cycles and bandwidth in an application. As the processing capacity of embedded GPUs keeps growing, an application developer can often shift focus from performance optimization to energy efficiency optimization: implementing the desired visual output without consuming cycles or bandwidth unnecessarily. You should also consider the trade-off between improved visual quality and increased power consumption; is that last piece of "eye candy" which increases processing requirements by 20% really worth a 20-36% drop in battery life for the end users of the application?

If you have any further questions, please don’t hesitate to ask them in the comments below.

Back in June I had the pleasure of visiting Barcelona in Spain for the first time to give a presentation at Gamelab.

I was lucky enough to attend some of the other talks given and was impressed by the quality and diversity of the presentations.

Particularly enjoyable were Tim Schafer's presentation on creativity in game development and a panel discussion ("The future of mobile entertainment") about mobile game development, which had some interesting points on the technical difficulties faced by developers.


I gave the attached presentation with our great partner Will Eastcott from PlayCanvas. We ran through:

  • an introduction to WebGL
  • how the guys at PlayCanvas use WebGL in their open source, cloud based game engine
  • the importance of good tools for performance analysis and debugging of mobile games
  • how you can use the Mali Graphics Debugger, ARM DS-5 Streamline and the Mali Offline Compiler to analyse your code, identify problems and find solutions

In this blog I will talk about energy efficiency in embedded GPUs and what an application programmer can do to improve the energy efficiency of their application. I have split this blog into two parts; in the first part I will give an introduction to the topic of energy efficiency and in the second part I will show some real SoC power measurements by using an in-house micro-benchmark to demonstrate the extent to which a variety of factors impact frame rendering time, external bandwidth and SoC power consumption.


Energy Efficiency in the GPU/Device


Let's look first at what energy efficiency means from the GPU's perspective.  At a high level the energy is consumed by the GPU and its associated driver in three different ways:


  • GPU is running active cycles in the hardware to perform its computation tasks in one or more of its cores.
  • GPU/driver is issuing memory transactions to read data from, or write data to, the external memory.
  • GPU driver code is executed in the CPU either in the user mode or in the kernel mode.


On most devices Vertical Synchronization (vsync) synchronizes the frame rate of an application with the screen display rate. Using vsync not only removes tearing, but it also reduces power consumption by preventing the application from producing frames faster than the screen can display them. When vsync is enabled on the device the application cannot draw frames faster than the vsync rate (vsync rate is typically 60fps on modern devices so we can keep that as our working assumption in the discussion). On the other hand, in order to give the best possible user experience the application/GPU should not draw frames significantly slower than the vsync rate i.e. 60fps. Therefore the device/GPU tries hard to keep the frame rate always at 60fps, while also trying to use as little power as possible.


A device typically has power management functionality for both GPU and CPU in order to adjust their operating frequencies based on the current workload. This functionality is referred to as DVFS (Dynamic Voltage and Frequency Scaling). DVFS allows the device to handle both normal and peak workloads in an energy efficient fashion by adjusting the clock frequency to provide just enough performance for the current workload, which in turn allows us to drop the voltage as we do not need to drive the transistors as hard to meet the more relaxed timing constraints. The energy consumed per clock is proportional to V², so if we drop the frequency to allow a voltage reduction of 20% then energy efficiency improves by 36%. Using a higher clock frequency than needed means higher voltage and consequently higher power consumption, so the power management tries to keep the clock frequency as low as possible while still keeping the frame rate at the vsync rate. When the GPU is under extremely high load some vendors allow the GPU to run at an overdrive frequency - a frequency which requires a voltage higher than the nominal voltage for the silicon process - which can provide a short performance boost, but cannot be sustained for long periods. If a high application workload keeps the GPU frequency overdriven for a long time, the SoC may overheat; as a consequence the GPU is forced to use a lower clock frequency to let the SoC cool down, even if the frame rate drops below 60fps. This behavior is referred to as thermal throttling.
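The voltage/energy arithmetic quoted above is easy to verify:

```python
def energy_saving_from_voltage_drop(v_scale):
    """Energy per clock is proportional to V^2, so scaling voltage by
    v_scale scales energy per clock by v_scale**2."""
    return 1.0 - v_scale ** 2

# A 20% voltage reduction (v_scale = 0.8) gives the 36% saving from the text:
print(round(energy_saving_from_voltage_drop(0.8), 2))  # 0.36
```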


Device vendors often differentiate their devices by making their own customizations to the power management. As a result two devices having the same GPU may have different power management functionality. The ARM® Mali™ GPU driver provides an API to SoC vendors that can be used for implementing power management logic based on the ongoing workload in the GPU.


In addition to DVFS, some systems may also adjust the number of active GPU cores to find the most energy efficient configuration for the given GPU workload. Typically, DVFS provides just a few available operating frequencies and enabling/disabling cores can be used for fine-tuning the processing capacity for the given workload to save power.


In its simplest form the power management is implemented locally for the GPU, i.e. the GPU power management is based only on the ongoing GPU workload and the temperature of the chip. This is not optimal, as there can be several other sub-systems on the chip which all "compete" with each other to get maximum performance for their own processing until the thermal limit is exceeded and all sub-systems are forced to operate at a lower capacity. A more intelligent power management scheme maintains a power budget for the entire SoC and allocates power to the different sub-systems in a way that avoids thermal throttling.


Energy Efficiency in Applications

From an application point of view the power management functionality provided by the GPU/device means that the GPU/device always tries to adjust the processing capacity for the workload coming from the application. This adjustment happens automatically in the background and if the application workload doesn't exceed the maximum capacity of the GPU, the frame rate remains constantly at the vsync rate regardless of the application workload. The only side effect from the high application workload is that the battery runs out faster and you can feel the released energy as a higher temperature of the device.


Most applications don't need to create a higher workload than the GPU's maximum processing capacity i.e. the power management is able to keep the frame rate constantly at the vsync level. The interval between two vsync points is 1/60 seconds and if the GPU completes a frame faster than that, the GPU sits idle until the next frame starts. If the GPU constantly has lots of idle time before the next vsync point, the power management may decrease the GPU clock frequency to a lower level to save power.
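The idle-time budget is simple arithmetic, sketched here in Python:

```python
VSYNC_HZ = 60
FRAME_BUDGET_MS = 1000.0 / VSYNC_HZ   # ~16.67 ms between vsync points

def idle_ms(render_ms):
    """GPU idle time per frame when rendering beats the next vsync."""
    return max(0.0, FRAME_BUDGET_MS - render_ms)

# A frame rendered in 5 ms leaves ~11.67 ms of idle time per frame:
# headroom that DVFS can convert into a lower clock frequency.
print(round(idle_ms(5.0), 2))  # 11.67
```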


Screenshot from Streamline of a GPU and CPU idling each frame when the DVFS frequency selected is too high


As the maximum processing capacity of modern GPUs keeps growing, it is often not necessary for an application developer to optimize an application for better performance; instead the goal is better energy efficiency, and that is the topic of this blog.


How to Make an Application Energy Efficient

In order to be energy efficient the application should:


  • Render frames with the least number of GPU cycles
  • Consume the least amount of external memory bandwidth
  • Generate the least amount of CPU load either directly in the application code or indirectly by using the OpenGL® ES API in a way that causes unnecessary CPU load in the driver


But hey, aren't these the same things that you used to focus on when optimizing your application for better performance? Yes, pretty much! To explain this further:


  • Every GPU cycle that you save when rendering a frame means more idle time in the GPU before the next vsync point. In the best case the idle time becomes long enough to allow the power management to use a lower GPU frequency or enable a smaller number of cores
  • Reducing bandwidth load doesn't always improve performance as GPUs are designed to tolerate high memory latencies without affecting performance. However, reducing bandwidth can improve energy efficiency significantly
  • The same as for bandwidth, extra CPU load may not impact performance but it definitely can increase the power consumption


So the task of improving energy efficiency becomes very similar to the task of optimizing the performance of an application. For that task you can find lots of useful tips in the Mali GPU Application Optimization Guide.

How Do You Measure Energy Efficiency?


There is one topic that may require some more attention: how can you measure the energy efficiency of your application? Measuring the actual SoC power consumption might not be practical. It might also be problematic to measure the system FPS of your application if vsync is enabled on your device and you cannot turn it off.


ARM provides a tool called DS-5 Streamline for system-wide performance analysis. Using DS-5 Streamline for detecting performance bottlenecks is explained in Peter Harris's blog Mali Performance 1: Checking the Pipeline and Lorenzo Dal Col's blogs starting with Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel, and also in the Mali GPU Application Optimization Guide. In short, DS-5 Streamline allows you to measure the main components of energy efficiency with the following charts / HW counters:


GPU cycles:

  • Mali Job Manager Cycles: GPU cycles
    • This counter increments any clock cycle the GPU is doing something
  • Mali Job Manager Cycles: JS0 cycles
    • This counter increments any clock cycle the GPU is fragment shading
  • Mali Job Manager Cycles: JS1 cycles
    • This counter increments any clock cycle the GPU is vertex shading or tiling


External memory bandwidth:

  • Mali L2 Cache: External read beats
    • Number of external bus read beats
  • Mali L2 Cache: External write beats
    • Number of external bus write beats


CPU load:

  • CPU Activity
    • The percentage of the CPU time spent in system or user code


Another very useful tool for measuring GPU cycles is the Mali Offline Shader Compiler which allows you to see how many GPU cycles are spent in the arithmetic, load/store and texture pipes in the shader core. Each saved cycle in the shader code means thousands/millions of saved cycles in each frame, as the shader is executed for each vertex/fragment.
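To see why a single shader cycle matters, multiply it out. The figures below (1080p at 60fps, one fragment shaded per pixel, no overdraw) are illustrative:

```python
def shader_cycles_per_second(cycles_per_fragment, width, height, fps=60):
    """Fragment shader cost: cycles per fragment x fragments shaded per
    second (assumes one fragment per pixel; all numbers illustrative)."""
    return cycles_per_fragment * width * height * fps

# Saving a single cycle per fragment in a full-screen 1080p shader:
saved = shader_cycles_per_second(1, 1920, 1080)
print(saved)  # 124416000 cycles/s, i.e. ~124 MHz of shader core load
```

Overdraw multiplies the fragment count further, so in real scenes the saving per removed cycle is usually even larger.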


If you want to measure the performance of an application on a vsync-limited device, you can do so by rendering in offscreen mode using FBOs. This is the trick some benchmark applications use to remove vsync and resolution limitations from the performance measurement. The key point is that the vsync limitation applies only to the onscreen frame buffer, not to offscreen framebuffers implemented with FBOs. You can therefore measure performance by rendering to an FBO that has the same resolution and configuration (color and depth buffer bit depths) as the onscreen frame buffer. After setting up the FBO and binding it with glBindFramebuffer(), your rendering functions cannot tell whether the render target is the onscreen frame buffer or an FBO. However, to make the performance measurement work correctly you need to do a few things:


  • You need to consume your FBO rendering results in the onscreen frame buffer. This is necessary because if you render something to an FBO and don't use your rendering results for anything visible, there is no guarantee that the GPU actually renders anything. After rendering to an FBO you can down-sample your output texture into a small area in the onscreen frame buffer. This guarantees that the GPU must render the frame image into an FBO as expected.
  • The offscreen rendering should be implemented with two different FBOs in order to simulate double buffering functionality. After rendering a frame to an FBO, you should down-sample the output texture to the onscreen buffer, and then swap to another FBO that is used for rendering the next frame.
  • You should use glDiscardFramebufferEXT (OpenGL ES 2.0) or glInvalidateFramebuffer (OpenGL ES 3.0) to discard depth/stencil buffers right after the rendering of a frame to an FBO is complete. This is necessary to avoid writing the depth/stencil buffer out to main memory on a Mali GPU (the same effect happens for the onscreen frame buffer when you call eglSwapBuffers()). You can find more details on this topic in Mali Performance 2: How to Correctly Handle Framebuffers.
  • After rendering a suitable number of offscreen frames (for example 100) and down-sampling them to a small area in the onscreen frame, you can call eglSwapBuffers() as normal to present the frame in the onscreen buffer. You can measure the offscreen FPS by dividing the total number of rendered offscreen frames by the total rendering time measured when eglSwapBuffers() returns.
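The bookkeeping in the last step is plain arithmetic, sketched below with made-up timing values; in a real application the elapsed time would come from timestamps taken with a monotonic clock around the eglSwapBuffers() call:

```python
# Illustrative only: computing offscreen FPS for the FBO trick described
# above. The timing values are invented; in a real app you would timestamp
# eglSwapBuffers() returns with a monotonic clock.
frames_per_swap = 100         # offscreen frames rendered per onscreen swap
swaps = 5                     # number of eglSwapBuffers() calls measured
elapsed_seconds = 2.5         # total time when the last swap returned

total_offscreen_frames = frames_per_swap * swaps
offscreen_fps = total_offscreen_frames / elapsed_seconds

print(f"Offscreen frames rendered: {total_offscreen_frames}")
print(f"Offscreen FPS: {offscreen_fps:.1f}")
```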


There is a small overhead in the performance measurement when using this method because of down-sampling the offscreen frames to the onscreen frame, but nevertheless it should give you quite representative FPS results without the vsync limitation.


Is it really worth it?


You might ask how significant an energy saving you can really get by optimizing your application. We will focus on that in the next part of this blog, Energy Efficiency in GPU Applications, Part 2, where I will present a small micro-benchmark that will show how much you can reduce real SoC power consumption by optimizing your application.

Over the past couple of weeks, ARM and Collabora have been working together closely to showcase all the benefits that can be extracted from Wayland for media content playback use cases and beyond.


This week in particular, ARM and Collabora are showing at SIGGRAPH 2014 a face-off between the nearly 30-year-old X11 and the up-and-coming Wayland.


Leveraging the ARM Mali GPU as deployed in the Samsung Chromebook 2, Collabora has, with the help of ARM, developed an environment that makes it possible to clearly see the advantages of Wayland, particularly with the latest drivers made available by ARM for Mali.


The best way to find out more about this is to watch the video we've produced at SIGGRAPH:


Details can be found on our blog and are also available here:

Wayland on MALI

Over the past several years at Collabora, we have worked on Linux's graphics stack from top to bottom, from kernel-level hardware enablement through to the end applications. A particular focus has always been performance: not only increasing average throughput and performance metrics, but ensuring consistent results every time. One of the core underpinnings of the Linux graphics stack from its very inception has been the X Window System, which recently celebrated its 29th anniversary. Collabora have been one of the most prolific contributors to X.Org for the past several years, supporting its core development, but over the past few years we have also been working on its replacement - Wayland. Replacing something such as X is not to be taken lightly; we view Wayland as the culmination of the last decade of work by the entire open-source graphics community. Wayland reached 1.0 maturity in 2012, and since then has shipped in millions of smart TVs, set-top boxes, IVI systems, and more.

This week at SIGGRAPH together with ARM, we have been showcasing some of our recent development on Wayland, as well as on the entire graphics stack, to provide best-in-class media playback with GStreamer.

'Every frame is perfect'

Wayland's core value proposition for end users is simple: every frame must be perfect. By that we mean that the user will never see any unintended or partially-rendered content, or any graphical glitches such as tearing. In X11, the server performs rendering on behalf of its clients, which not only requires expensive, parallelisation-destroying synchronisation with the GPU, but often occurs as an unwanted side effect of unrelated requests. In contrast, Wayland's buffer-oriented model places the client firmly in control of what the user will see.

The user will only ever be shown exactly the content that the client requests, in the exact way that it requests it: painstaking care has been taken to ensure that not only do these intermediate states not exist, but that any unnecessary synchronisation has been removed. The combination of perfect frames and lower latency results in a natural, fluid-feeling user experience.

Power and resource efficient

Much of the impetus for Wayland's development came from ARM-based devices such as smart TVs, set-top boxes, digital signage, and mobile, where not only is power efficiency key, but increased demands such as 4K media mean that, in order to ship a functioning product at all, the hardware must be pushed right to the margins of its capabilities. To achieve these demanding targets, the window system must make full use of all the IP blocks provided by the platform, particularly hardware media decoders and any video overlays provided by the display controller. Not only must it use these blocks, it must also eliminate any copies of the content made along the way.

X11 has two core problems which preclude it making full use of these features. Firstly, as X11 provides a rendering-command rather than a buffer-driven interface to clients, it is extremely difficult to integrate with hardware media decoders without making a copy of the full decoded media frame, consuming valuable memory bandwidth and time. Secondly, the X11 server is fundamentally unaware of the scene graph produced by the separate compositor, which precludes the use of hardware overlays: the only interface it provides for this is OpenGL ES rendering, requiring another copy of the content. This increased memory bandwidth and power usage makes it extremely difficult to ship compelling products in a media-led environment.

By contrast, Wayland's buffer-driven model is a natural fit for the hardware media engines of today and tomorrow, and the integration of the display server and compositor makes it easy to use the full functionality of the display controller to provide low-power media display. This reserves as much memory bandwidth as possible for other applications, which no longer have to contend with media playback for crucial system resources, and makes it feasible to push systems to their limits, for example with 4K content on relatively low-spec hardware.

A first-class media experience

To complement our hundreds of man-years of work on the industry-standard GStreamer media framework, which has proven to scale from playback on mobile devices to serving huge live broadcast streams, Collabora has worked to ensure that Wayland provides a first-class experience when used together with GStreamer. Our recent development work on both Wayland itself and GStreamer's Wayland support ensures that GStreamer can realise its full potential when used together with Wayland. All media playback naturally occurs in a 'zero-copy' fashion, from hardware decoding engines into either the 3D GPU or display controller, thanks to DMA-BUF buffer passing, new in version 3.16 of the Linux kernel.

The Wayland subsurface mechanism allows videos to be streamed separately from UI content, rather than combined by the client as they are today in X11. This separation allows the display server to make a frame-by-frame decision as to how to present the video: using power-efficient hardware overlays, or using the more flexible and capable 3D GPU. This allows maximum UI flexibility whilst also making the most of hardware IP blocks. The scaling mechanism also allows the compositor to scale the video at the last minute, potentially using high-quality scaling and filtering engines within the display controller, as well as reducing precious memory bandwidth usage when upscaling videos.

Deep buffer queues are also possible for the first time, with both GStreamer and Wayland supporting ahead-of-time buffer queueing, where every buffer has a target time attached. Under this model, the client can queue up a large number of frames in advance, offload them all to the compositor, and then go to sleep whilst they are autonomously displayed, saving CPU usage and power. Wayland also provides GStreamer with feedback on exactly when its buffers were shown on screen, allowing it to automatically adjust its internal pipeline and clock for the tightest possible A/V sync.

Easier deployment and support

In contrast to the X11 model of providing a driver specific to the combination of X server version, display controller and 3D GPU, Wayland offers vendors the ability to deploy drivers written according to external, well-tested, vendor-independent APIs. These drivers are required to perform only limited, well-scoped tasks, making validation, performance testing, and support much easier than under X11. This model makes it possible for vendors to deploy a single well-tested solution for Wayland, and for end users to deploy them in the knowledge that they will have reliable performance and functionality.

We are demonstrating all this at SIGGRAPH, on the ARM booth at stand #933 in the Mobility Pavilion on the Exhibition Hall. We are showing a side-by-side comparison of Wayland and X11 on Samsung Chromebook 2 machines (Samsung Exynos 5800 Octa hardware, with an ARM Mali-T628 GPU), demonstrating Collabora's expertise from the very bottom of the stack to the very top. Collabora's in-house Singularity OS runs a Linux 3.16-rc5 kernel, containing changes bound for upstream to improve and stabilise hardware support, and an early preview of atomic modesetting support inside the Exynos kernel modesetting driver for the display controller. The Wayland machine runs Weston with the new DMA-BUF and buffer-queueing extensions on top of atomic modesetting, demonstrating that videos played through GStreamer can be seamlessly switched between display controller hardware overlays and the Mali 3D GPU, using the DMA-BUF import EGL extension. The X11 machine runs the ChromeOS X11 driver, with a client which plays video through OpenGL ES at all times. The power usage, frame 'lateness' (difference between target display time and actual time), and CPU usage are shown, with Wayland providing a dramatic improvement in all these metrics.

Chinese Version中文版:SIGGRAPH、OpenGL ES 3.1 和下一代 OpenGL

It’s that time of year again – SIGGRAPH is here! For computer graphics artists, teachers, freaks and geeks of all descriptions, it’s like having Midsummer, Christmas, and your birthday all in the same week. By the time you read this, I’ll be in beautiful Vancouver BC, happily soaking up the latest in graphics research, technology, animation, and associated general weirdness along with the other 15,000-plus attendees. I can’t wait!


This year, SIGGRAPH has a special personal connection for me: my office-mate Dave Shreiner is this year’s general chair (amazingly, he’s still got all his hair – quite a lot of it actually), and my other office-mate Jesse Barker is chair of SIGGRAPH Mobile. (Jesse’s got no hair at all, but with him it’s a style choice.) My own job at SIGGRAPH is a lot less grand, but it’s something I love doing: In my capacity as OpenGL® ES working group chair, I’ll be co-hosting the Khronos OpenGL / OpenGL ES Birds of a Feather (BOF) session. That’s where the working groups report back to the user community about what’s going on in the ecosystem, what the committee has been doing, and what the future might hold. This year’s OpenGL ES update will mostly focus on the growing market presence of OpenGL ES 3.0, and on OpenGL ES 3.1, which we released earlier this year and which is starting to enter the market in a big way. It’s great stuff – but it’s not the big news.


There’s a change coming


By the standards of, well, standards, the OpenGL APIs have been an amazing success. OpenGL has stood unchallenged for twenty years as the cross-platform 3D API. Its mobile cousin, OpenGL ES, has grown phenomenally over the past ten years; with the mobile industry now shipping a billion and a half OpenGL ES devices per year, it has become the main driver of OpenGL adoption. One-point-five billion is a mind-boggling number, and we’re suitably humbled by the responsibility it implies.  But the APIs are not without problems: the programming model they present is frankly archaic, they have trouble taking advantage of multicore CPUs, they are needlessly complex, and there is far too much variability between implementations. Even highly skilled programmers find it frustrating trying to get predictable performance out of them. To some extent, OpenGL is a victim of its own success – I doubt that there are many APIs that have been evolving for twenty years without accumulating some pretty ugly baggage. But that doesn't change the central fact: OpenGL needs to change.

The Khronos working groups have known this for a long time; top developers (hi Rich!) have been telling us every chance they get.  But now, with OpenGL ES 3.1 finished but still early in its adoption cycle, we finally feel like we have an opportunity to do something about it. So at this year’s SIGGRAPH, Khronos is announcing the Next Generation OpenGL initiative, a project to redesign OpenGL along modern lines. The new API will be leaner and meaner, multicore and multithread-friendly. It will give applications much greater control over CPU and GPU workloads, making it easier to write performance-portable code. The work has already started, and we’re making rapid progress, thanks to strong commitment and active participation from the whole industry, including several of the world's top game engine companies.


Needless to say, ARM is fully behind this new direction, and we’re investing significant engineering resources in making sure it meets its goals and runs well on our Mali GPUs. We are of course also continuing to invest in the ecosystem for ‘traditional’ OpenGL ES, which will remain the dominant mobile graphics API for quite some time to come.


That’s all I’ve got for now. If you’re going to be at SIGGRAPH, I hope you’ll come by the OpenGL / OpenGL ES BOF and after-party, 5-7pm on Wednesday at the Marriott Pinnacle, and say hi.  If not, drop me a line below…


Tom Olson is Director of Graphics Research at ARM. After a couple of years as a musician (which he doesn't talk about), and a couple more designing digital logic for satellites, he earned a PhD and became a computer vision researcher. Around 2001 he saw the coming tidal wave of demand for graphics on mobile devices, and switched his research area to graphics.  He spends his working days thinking about what ARM GPUs will be used for in 2016 and beyond. In his spare time, he chairs the Khronos OpenGL ES Working Group.
