
ARM Mali Graphics


ARM has released major upgrades for the graphics debugger (Mali Graphics Debugger v2.0) and the OpenGL® ES emulator (Mali OpenGL ES Emulator v2.0), and an update for the offline shader compiler (Mali Offline Compiler v4.5), adding support for OpenGL ES 3.1 and many other features. This is the last update for 2014, and concludes a year full of releases and new features, such as overdraw and shader map modes, support for Android KitKat and Lollipop, ASTC textures, frame replay and support for all the ARM® Mali™ GPU cores in the offline compiler. In addition to those, today we present three new releases, including key features like live shader editing, support for unrooted Android devices, compute shaders and indirect draw calls.

Content from Major Upgrade for Mali Graphics Development Tools Mali Developer Center

Mali Graphics Debugger v2.0

Mali Graphics Debugger allows developers to trace OpenGL ES and OpenCL™ API calls in their application and understand, frame by frame, their effect on the application to help identify possible issues. We support Android and Linux devices with ARM Cortex® CPUs and Mali GPUs, and Linux, Windows and Mac OS X hosts.

Key New Features

  • OpenGL ES 3.1 support
    This means that all OpenGL ES 3.1 function calls will be present in a trace.
  • Live editing
    • Support added for changing both the fragment and vertex shader of a program and then replaying the frame to view the results.
    • Support added to override textures in an application and replace them with a new texture that will aid in diagnosing any issues with incorrect texture coordinates.
    • Support added for overriding the precision of all elements in a shader and then replaying the frame to view the results (force highp/mediump/lowp modes)
  • New mid-level hierarchy in the outline view, so draw calls are now separated per framebuffer as well as per frame, allowing the user to better visualize render passes.
  • New Android application provided to support unrooted devices
  • New view for compute shaders.
  • User can now filter by frame feature to highlight interesting frames in larger traces.
  • Support for making notes alongside function calls has been added. This allows important functions in a trace to be located easily.
  • Support for most OpenGL ES extensions.

To learn more about Mali Graphics Debugger and performance analysis of graphics applications, you can watch the tutorials and presentations on YouTube.

Get Mali Graphics Debugger

Mali Graphics Debugger Daemon Application for Android

With the objective of making the installation of the graphics debugger on Android targets easier, we have developed an Android application that runs the required daemon. This eliminates the need to manually install executables on the Android device. The application (APK) works on rooted and unrooted devices.




Mali OpenGL ES Emulator v2.0

The OpenGL ES Emulator is a library that maps OpenGL ES 3.1 API calls to the OpenGL API. By running on a standard PC, the emulator helps software development and testing of next generation OpenGL ES 3.1 applications since no embedded platform is required. We support Linux and Windows PCs.

Key New Features

  • OpenGL ES 3.1 support

We have put a lot of effort into making the emulator as close as possible to the Khronos standard OpenGL ES 3.1 API, running the conformance suite and achieving a good conformance score on different platforms. Some tests cannot easily pass on desktop PCs, due to the nature of the underlying OpenGL driver; nevertheless we passed more than 94% of the conformance tests on our test platform.

Get OpenGL ES Emulator



Mali Offline Compiler v4.5

Mali Offline Compiler is a command line tool that translates vertex shaders and fragment shaders written in the OpenGL ES Shading Language (ESSL) into binary vertex shaders and binary fragment shaders for execution on the Mali GPUs. It generates statistics that are useful to optimize shader code.

Key New Features

  • OpenGL ES 3.1 shader language support, including compute shaders
  • Support for Mali ‘Midgard’ driver version r5p0 (for Mali-T600 and Mali-T700 Series)
  • Support for Mali ‘Utgard’ driver version r5p0 (for Mali-400 MP and Mali-450 MP)


Get Mali Offline Compiler




Support and Other Activities

As always, tools provided by ARM are supported in the ARM Connected Community. You can ask a question in the Mali Developer Forums, follow us on Twitter or Sina Weibo, or watch our YouTube and YouKu channels.


Lorenzo Dal Col is the Product Manager of DS-5 Streamline and Mali GPU Tools. He first used ARM technology when, in 2007, he created a voice-controlled robot at university. He has experience in machine learning, image processing and computer vision. He joined ARM in 2011 to work on 3D graphics, developing performance analysis and debug tools.

I have previously shared how to install OpenCL on the Samsung XE303C12 Chromebook powered by the ARM® Mali-T604 GPU. I have found that things are slightly different on the newer Samsung XE503C12 Chromebook ("Chromebook 2") powered by the ARM® Mali-T628 GPU, so I decided to provide an update. As before, please bear in mind that this is not ARM's "official guide" (which can be found here). However, it's a useful alternative to the official guide if, for example, you don't have a Linux PC or just want to use Chrome OS day in and day out.


You will need:


How quickly you complete the installation will depend on how fast you can copy-and-paste instructions from this guide (Ctrl-C) into the shell (Shift-Ctrl-V), how fast your Internet connection is and how fast your memory card is. (I will give an approximate time for each step, measured when using a rather slow 30 MB/s card.) The basic OpenCL installation should take up to half an hour; PyOpenCL and NumPy about an hour; further SciPy libraries about 4 hours. Most of the time, however, you will be able to leave the Chromebook unattended, beavering away while compiling packages from source.


Finally, the instructions are provided "as is", you use them at your own risk, and so on, and so forth... (The official guide also contains an important disclaimer.)


Installing OpenCL

Enabling Developer Mode

NB: Enabling Developer Mode erases all user data, so back up first.


Enter Recovery Mode by holding the ESC and REFRESH (↻ or F3) buttons, and pressing the POWER button. In Recovery Mode, press Ctrl+D and ENTER to confirm and enable Developer Mode.


Entering developer shell (1 min)

Open the Chrome browser and press Ctrl-Alt-T.

Welcome to crosh, the Chrome OS developer shell.

If you got here by mistake, don't panic!  Just close this tab and carry on.

Type 'help' for a list of commands.

Don't panic, keep the tab opened and carry on to enter the shell:

crosh> shell
chronos@localhost / $ uname -a
Linux localhost 3.8.11 #1 SMP Wed Dec 10 14:41:54 PST 2014 armv7l SAMSUNG EXYNOS5 (Flattened Device Tree) GNU/Linux


Preparing a Micro SD card (5 min)

Insert a blank Micro SD card (denoted as /dev/mmcblk1 in what follows):

chronos@localhost / $ mount | grep "SD Card"
/dev/mmcblk1p1 on /media/removable/SD Card type vfat (rw,nosuid,nodev,noexec,relatime,dirsync,uid=1000,gid=1000,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,utf8,flush,errors=remount-ro)

Unmount the card and run fdisk:

chronos@localhost / $ sudo umount /dev/mmcblk1p1
chronos@localhost / $ sudo /sbin/fdisk /dev/mmcblk1

Welcome to fdisk (util-linux 2.24).
Changes will remain in memory only, until you decide to write them. Be careful before using the write command. Command (m for help):

Enter 't' to change a partition type, then '83' to change the partition type to 'Linux', and finally 'w' to apply the change:

Command (m for help): t
Selected partition 1
Hex code (type L to list all codes): 83
If you have created or modified any DOS 6.x partitions, please see the fdisk documentation for additional information.
Changed type of partition 'W95 FAT32 (LBA)' to 'Linux'.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
chronos@localhost / $ 

Format the card e.g. using ext3:

chronos@localhost / $ sudo /sbin/mkfs.ext3 /dev/mmcblk1p1

NB: If you use a card that is less than 8 GB, you may need to reserve enough inodes when you format the card e.g.:

chronos@localhost / $ sudo /sbin/mkfs.ext3 /dev/mmcblk1p1 -j -T small

Mount the card and check that it's ready:

chronos@localhost / $ sudo mkdir -p ~/gentoo
chronos@localhost / $ sudo mount -o rw,exec -t ext3 /dev/mmcblk1p1 ~/gentoo
chronos@localhost / $ df -h ~/gentoo
Filesystem      Size  Used Avail Use% Mounted on
/dev/mmcblk1p1   15G   38M   14G   1% /home/chronos/user/gentoo
chronos@localhost / $ df -hi ~/gentoo
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/mmcblk1p1   951K    11  951K    1% /home/chronos/user/gentoo
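
The inode check above can also be scripted. Here is a minimal Python sketch, assuming a Python interpreter is available where you run it; `inode_usage` is an illustrative helper name of my own, not part of any tool used in this guide. It reads the same counters that `df -hi` reports, via the standard `os.statvfs` call.

```python
import os

def inode_usage(path):
    """Return (total, free, used_percent) inode figures for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_files   # total number of inodes on the filesystem
    free = st.f_ffree    # number of free inodes
    # Some filesystems report zero inodes; avoid dividing by zero in that case.
    used_percent = 0.0 if total == 0 else 100.0 * (total - free) / total
    return total, free, used_percent

total, free, pct = inode_usage(os.path.expanduser("~"))
print("inodes: %d total, %d free, %.1f%% used" % (total, free, pct))
```

Pointing it at the mount point (`~/gentoo` in this guide) instead of the home directory would report on the card itself.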

Installing Gentoo Linux (15 min)

chronos@localhost / $ cd ~/gentoo
chronos@localhost ~/gentoo $ ls -la
total 32
drwxr-xr-x  3 root    root            4096 Dec  9 21:31 .
drwx--x--- 30 chronos chronos-access 12288 Dec  9 21:38 ..
drwx------  2 root    root           16384 Dec  9 21:31 lost+found

Download the latest stage 3 archive for armv7a_hardfp:

chronos@localhost ~/gentoo $ sudo wget http://distfiles.gentoo.org/releases/arm/autobuilds/latest-stage3-armv7a_hardfp.txt
chronos@localhost ~/gentoo $ sudo wget http://distfiles.gentoo.org/releases/arm/autobuilds/`cat latest-stage3-armv7a_hardfp.txt | grep stage3-armv7a_hardfp`
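
For those who prefer to see what the `cat ... | grep ...` substitution is doing, here is a rough Python sketch of the same idea. The sample listing text is an assumption about the file's layout (comment lines starting with `#`, then a relative path); `stage3_url` is a hypothetical helper, and only the base URL and the archive name come from this guide.

```python
BASE_URL = "http://distfiles.gentoo.org/releases/arm/autobuilds/"

def stage3_url(listing_text, base_url=BASE_URL):
    """Pick the stage3 archive path out of the listing and turn it into a full URL."""
    for line in listing_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "stage3-armv7a_hardfp" in line:
            # Keep only the first field in case extra columns follow the path.
            return base_url + line.split()[0]
    raise ValueError("no stage3-armv7a_hardfp entry found")

# Assumed example of what the listing might contain.
SAMPLE = """# assumed sample listing
20141023/stage3-armv7a_hardfp-20141023.tar.bz2
"""

print(stage3_url(SAMPLE))
```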

Extract the downloaded archive right onto the card e.g.:

chronos@localhost ~/gentoo $ sudo tar xjpf stage3-armv7a_hardfp-20141023.tar.bz2

Clean up:

chronos@localhost ~/gentoo $ sudo rm latest-stage3-armv7a_hardfp.txt
chronos@localhost ~/gentoo $ sudo rm stage3-armv7a_hardfp-20141023.tar.bz2


Downloading OpenCL drivers (4 min)

Go to the page listing Mali-T6xx Linux drivers and download release r4p0-02rel0 for Mali-T62x fbdev (mali-t62x_r4p0-02rel0_linux_1+fbdev.tar.gz). Make sure you carefully read and accept the associated licence terms.

chronos@localhost ~/gentoo $ sudo tar xvzf ~/Downloads/mali-t62x_r4p0-02rel0_linux_1+fbdev.tar.gz 

This will create ~/gentoo/fbdev which we will use later.


Entering Gentoo Linux (2 min)

Similar to crouton, we will use chroot to enter our Linux environment.


Create two scripts and make them executable:

chronos@localhost ~/gentoo $ sudo vim ~/gentoo/setup.sh
#!/bin/sh
GENTOO_DIR=/home/chronos/user/gentoo
mount -t proc /proc $GENTOO_DIR/proc
mount --rbind /sys  $GENTOO_DIR/sys
mount --rbind /dev  $GENTOO_DIR/dev
cp /etc/resolv.conf $GENTOO_DIR/etc
chronos@localhost ~/gentoo $ sudo vim ~/gentoo/enter.sh
#!/bin/sh
GENTOO_DIR=/home/chronos/user/gentoo
LC_ALL=C chroot $GENTOO_DIR /bin/bash
chronos@localhost ~/gentoo $ sudo chmod u+x ~/gentoo/setup.sh ~/gentoo/enter.sh

Execute the scripts:

chronos@localhost ~/gentoo $ sudo ~/gentoo/setup.sh
chronos@localhost ~/gentoo $ sudo ~/gentoo/enter.sh

Note that the ~/gentoo directory will become the root (/) directory once we enter our new Linux environment. For example, ~/gentoo/fbdev will become /fbdev inside the Linux environment.
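
The path mapping is simple enough to write down as code. This is only an illustrative sketch (the helper name `path_inside_chroot` is mine, and nothing in the installation depends on it); it shows how a host path under the chroot directory translates to the path seen from inside.

```python
import os.path

def path_inside_chroot(host_path, chroot_dir):
    """Translate a host path under chroot_dir into the path seen inside the chroot."""
    rel = os.path.relpath(os.path.normpath(host_path), os.path.normpath(chroot_dir))
    if rel == ".":
        return "/"  # the chroot directory itself becomes the root
    if rel == ".." or rel.startswith(".." + os.sep):
        raise ValueError("%s is not inside %s" % (host_path, chroot_dir))
    return "/" + rel

print(path_inside_chroot("/home/chronos/user/gentoo/fbdev",
                         "/home/chronos/user/gentoo"))
```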


Installing OpenCL header files (2 min)

Download OpenCL header files from the Khronos OpenCL registry:

localhost / # mkdir /usr/include/CL && cd /usr/include/CL
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/opencl.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl_platform.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl_gl.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl_ext.h


Installing OpenCL driver (2 min)

Change properties on the downloaded OpenCL driver files and copy them to /usr/lib:

localhost / # chown root /fbdev/*
localhost / # chgrp root /fbdev/*
localhost / # chmod 755 /fbdev/*
localhost / # mv /fbdev/* /usr/lib
localhost / # rmdir /fbdev



By now you should have a mint Linux installation complete with the OpenCL drivers and headers, so you can start playing with OpenCL!
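
A quick sanity check at this point is to see whether the dynamic loader can find the OpenCL library. The sketch below is a suggestion rather than part of the official procedure; it assumes the driver archive provided a library named `libOpenCL.so` (adjust the name if yours differs) and uses only the standard `ctypes` module.

```python
import ctypes

def opencl_library_loads(name="libOpenCL.so"):
    """Return True if the dynamic loader can open the named library, False otherwise."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True

if opencl_library_loads():
    print("OpenCL library found")
else:
    print("OpenCL library not found; check the files copied to /usr/lib")
```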

When you reboot, you just need to mount the card and execute the setup script again:

chronos@localhost / $ sudo mount -o rw,exec -t ext3 /dev/mmcblk1p1 ~/gentoo
chronos@localhost / $ sudo ~/gentoo/setup.sh

Then you can pop in and out of the Linux environment with:

chronos@localhost / $ sudo ~/gentoo/enter.sh
localhost / # exit
chronos@localhost / $

But the fun just begins here! Follow the instructions below to install PyOpenCL and SciPy libraries for scientific computing.


Installing PyOpenCL

Configuring Portage (15 min)

Portage is Gentoo's package management system.

localhost / # echo "MAKEOPTS=\"-j4\"" >> /etc/portage/make.conf
localhost / # echo "ACCEPT_KEYWORDS=\"~arm\"" >> /etc/portage/make.conf
localhost / # mkdir /etc/portage/profile
localhost / # mkdir /etc/portage/package.use
localhost / # mkdir /etc/portage/package.unmask
localhost / # mkdir /etc/portage/package.accept_keywords
localhost / # mkdir /etc/portage/package.keywords
localhost / # touch /etc/portage/package.keywords/dependences

Perform an update:

localhost / # emerge --sync --quiet
localhost / # emerge --oneshot portage
localhost / # eselect news read

NB: If any emerge command below fails, rerun it with the --autounmask-write flag; then run etc-update and answer '-3' followed by 'y'. Running emerge again should now get the build started e.g.:

localhost / # emerge --autounmask-write dev-python/pandas
localhost / # etc-update
Scanning Configuration files...
The following is the list of files which need updating, each
configuration file is followed by a list of possible replacement files.
1) /etc/portage/package.keywords/dependences (1)
Please select a file to edit by entering the corresponding number.
              (don't use -3, -5, -7 or -9 if you're unsure what to do)
              (-1 to exit) (-3 to auto merge all files)
                           (-5 to auto-merge AND not use 'mv -i')
                           (-7 to discard all updates)
                           (-9 to discard all updates AND not use 'rm -i'): -3
Replacing /etc/portage/package.keywords/dependences with /etc/portage/package.keywords/._cfg0000_dependences
mv: overwrite '/etc/portage/package.keywords/dependences'? y
Exiting: Nothing left to do; exiting.
localhost / # emerge dev-python/pandas


Selecting Python 2.7 (1 min)

localhost / # eselect python set python2.7


Installing NumPy (40 min)

Install NumPy with LAPACK as follows.

localhost / # echo "dev-python/numpy lapack" >> /etc/portage/package.use/numpy
localhost / # echo "dev-python/numpy -lapack" >> /etc/portage/profile/package.use.mask
localhost / # emerge dev-python/numpy
localhost / # python -c "import numpy; print numpy.__version__"


Installing PyOpenCL (7 min)

Install PyOpenCL.

localhost / # cd /tmp
localhost tmp # wget https://pypi.python.org/packages/source/p/pyopencl/pyopencl-2014.1.tar.gz
localhost tmp # tar xzf pyopencl-2014.1.tar.gz
localhost tmp # cd pyopencl-2014.1
localhost pyopencl-2014.1 # python configure.py
localhost pyopencl-2014.1 # make install
localhost pyopencl-2014.1 # cd examples
localhost examples # python demo.py
Choose device(s):
[0] <pyopencl.Device 'Mali-T628' on 'ARM Platform' at 0x-49b96370>
[1] <pyopencl.Device 'Mali-T628' on 'ARM Platform' at 0x-49b96270>
Choice, comma-separated [0]:0
Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
(0.0, 241.52145)
localhost examples # python -c "import pyopencl; print pyopencl.VERSION_TEXT"

(That's right! The Exynos 5420 chip effectively has two Mali-T62x GPUs: GPU 0 has 4 cores; GPU 1 has 2 cores. How cool is that?)
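
If you want to see the two devices for yourself without the interactive prompt, something along these lines should work. Treat it as a sketch: it uses PyOpenCL's `get_platforms`, `get_devices`, `name` and `max_compute_units`, and falls back to an empty list if PyOpenCL or a platform is unavailable. The function name is my own. The `print` call is written so that it runs under Python 2.7 as well as Python 3.

```python
def list_opencl_devices():
    """Return (platform name, device name, compute units) tuples, or [] if unavailable."""
    try:
        import pyopencl as cl
    except ImportError:
        return []  # PyOpenCL is not installed
    devices = []
    try:
        for platform in cl.get_platforms():
            for device in platform.get_devices():
                devices.append((platform.name, device.name,
                                device.max_compute_units))
    except cl.Error:
        return []  # no usable OpenCL platform was found
    return devices

for entry in list_opencl_devices():
    print("%s / %s: %d compute units" % entry)
```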


Installing scientific libraries

If you would like to follow my posts on benchmarking (e.g. see the intro), I recommend you install packages from the SciPy family.


Installing IPython (45 min)

localhost / # emerge dev-python/ipython
localhost / # ipython --version


Installing IPython Notebook (5 min)

Install IPython Notebook to enjoy a fun blend of Chrome OS and IPython experience.


localhost / # emerge dev-python/jinja dev-python/pyzmq www-servers/tornado
localhost / # ipython notebook
2014-05-08 06:49:08.424 [NotebookApp] Using existing profile dir: u'/root/.ipython/profile_default'
2014-05-08 06:49:08.440 [NotebookApp] Using MathJax from CDN: http://cdn.mathjax.org/mathjax/latest/MathJax.js
2014-05-08 06:49:08.485 [NotebookApp] Serving notebooks from local directory: /
2014-05-08 06:49:08.485 [NotebookApp] The IPython Notebook is running at:
2014-05-08 06:49:08.486 [NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
2014-05-08 06:49:08.486 [NotebookApp] WARNING | No web browser found: could not locate runnable browser.

Open in a new Chrome tab to start creating your own IPython Notebooks!


Installing Matplotlib (50 min)

localhost / # emerge dev-python/matplotlib
localhost / # python -c "import matplotlib; print matplotlib.__version__"


Installing SciPy (60 min)

localhost / # emerge sci-libs/scipy
localhost / # python -c "import scipy; print scipy.__version__"


Installing Pandas (80 min)

localhost / # emerge dev-python/pandas
localhost / # python -c "import pandas; print pandas.__version__"

TyGL is now open source!

Posted by mattspencer Dec 18, 2014

You might remember that back in August we released details of a program ARM had been working on in conjunction with Szeged University and Samsung Research UK.


TyGL is a new backend for WebKit which demonstrates a huge acceleration in mobile web rendering. While it was developed and tested on an ARM Mali-T628 GPU based Chromebook, it will work on any GPU conforming to OpenGL ES 2.0 and higher and has been shown to achieve a performance uplift of up to eleven times. Full details about the process through which TyGL manages this huge boost are available in one of our previous blogs, TyGL: Hardware Accelerated Web Rendering.




Key Features of TyGL include:

  • GPU involvement in web rendering pipeline

Clever batching of draw calls delivers better results on GPUs (see previous blogs on batching for more information).  While the Graphics Context API can result in frequent state changes if implemented in the wrong way, TyGL is designed to catch this problem and group commands together to reduce draw calls – and in this way improve performance.


  • Automatic shader generation

TyGL enables efficient batching by generating complex shaders from multiple shader fragments – and ensures the batches fit into the shader cache of the GPU.


  • Trapezoid based path rendering

This section of the engine is under continuous improvement and is planning to take advantage of new GPU capabilities such as the Pixel Local Storage extension for OpenGL ES in order to get that extra bit of performance.


  • No software fallback

This is a complete GPU-based hardware accelerated solution with no dependency on legacy software and no need to sacrifice optimizations for compatibility.
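
To make the batching idea from the first feature a little more concrete, here is a toy Python sketch. It is emphatically not TyGL's implementation: the command tuples and the `batch_by_state` helper are invented for illustration. It simply groups consecutive drawing commands that share the same state, so that each group could be issued as a single draw call.

```python
from itertools import groupby

def batch_by_state(commands):
    """Group consecutive (state, primitive) commands that share a state into one batch."""
    return [(state, [primitive for _, primitive in group])
            for state, group in groupby(commands, key=lambda cmd: cmd[0])]

# Hypothetical stream of drawing commands: (render state, primitive).
commands = [
    ("solid-fill", "rect A"),
    ("solid-fill", "rect B"),
    ("gradient", "path C"),
    ("solid-fill", "rect D"),
]

batches = batch_by_state(commands)
for state, primitives in batches:
    print("%s: %s" % (state, ", ".join(primitives)))
```

Four commands become three batches here; the fewer state changes there are in the stream, the larger the saving.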

This month we were delighted to open source the TyGL port, which is now available for the world to see on GitHub. We will truly value community involvement in implementing additional features and improving the code. Take a look, let us know what you think and feel free to contact us if you want more information about the project!


Further Reading

TyGL: Hardware Accelerated Web Rendering

TyGL on GitHub

Blog by our partners, Szeged University

How to build TyGL

This year at GDC I gave a presentation on our exhibition booth about using ASTC with different types of textures to get the best visual results. It’s interesting that in the past whenever I spoke about ASTC it was always about how it works, rather than how to use it, which is bizarre because that’s not really what developer education is about.


It would be like a driving instructor turning up and lecturing you for the full hour on the science behind the internal combustion engine.


I did go on to write a fairly long guide to understanding the various settings and options you get when compressing in ASTC for GPU Pro, and the release of that roughly coincided with my booth talk at GDC. Those present on the day may have noticed the presentation wasn’t up to my usual standard. I can only apologise, I was very ill and dragged myself out of the hotel to give the talk before immediately returning at the end.


I’d like to use this as an opportunity to reiterate some of that content in the form of a blog, to clarify some parts I missed or stumbled over on the day. For those who weren’t there or just want to relive the presentation, I’ve attached a recording of it here.



The first topic I covered is a really basic introduction to texture compression in general, including a few notes on why textures should be compressed in the first place. With the ubiquity of gif, jpg and png image compression formats, surprisingly few people stop to think about the size of raw pixel data. Whether you have an alpha channel or not, cache alignment means you’re essentially always packing one in raw image data, making a 32 bits per pixel (bpp) cost. With even modest textures weighing in around a million pixels each, you can see how this might get quite heavy.
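
To put a number on that, here is a small Python sketch of the arithmetic. The 1024x1024 size is just an example of a texture with roughly a million pixels.

```python
def raw_texture_bytes(width, height, bits_per_pixel=32):
    """Size in bytes of an uncompressed texture at the given bit depth."""
    return width * height * bits_per_pixel // 8

size = raw_texture_bytes(1024, 1024)
print("%d bytes (%.1f MiB)" % (size, size / (1024.0 * 1024.0)))
```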


It’s not the size of the texture that causes the real problem, it’s the fact that you have to constantly look that data up, as the GPU taps into the main memory to pull that data into its cache whilst shading fragments, all of which compounds the bandwidth usage of the application. The solution to this is not compressed image files unpacked into GPU memory, it’s compressed textures in the GPU memory that the GPU can unpack as needed. This places interesting constraints on the compression formats used.


Firstly the pixels need to be accessed randomly. PNG is all well and good for compressing a whole image but to unpack a single pixel you have to unpack the entire line it’s on. Maybe it’d be worthwhile if you were reading in order along that axis, but if you’re sampling across the lines, you end up unpacking far more data than you need. Compression relies on grouping data to compressed bundles, so optimally these bundles need to be blocks of pixels, not lines, allowing the block to be decompressed into the cache and sampled randomly in any direction.


As this implies, the blocks have to be completely standalone. Other than the compression format, there must be no external information such as a dictionary or symbol table to decode the block. Finally, blocks have to line up in memory in a regular formation, or your decompressor won’t know where to look in the data to find a specific block. Scanning through to find it is not an option.
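
The regular layout is what makes random access cheap: the location of any block is a piece of arithmetic, not a search. The sketch below assumes a simple row-major arrangement of 16-byte blocks (real GPUs may tile memory differently), and the function name is mine.

```python
BLOCK_BYTES = 16  # every block occupies the same number of bytes

def block_offset(x, y, tex_width, block_w, block_h):
    """Byte offset of the block containing pixel (x, y), assuming row-major block order."""
    blocks_per_row = (tex_width + block_w - 1) // block_w  # round up for partial blocks
    block_x = x // block_w
    block_y = y // block_h
    return (block_y * blocks_per_row + block_x) * BLOCK_BYTES

# Pixel (5, 9) of a 256-pixel-wide texture stored in 4x4 blocks.
print(block_offset(5, 9, 256, 4, 4))
```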


This is why texture compression has its own specialist formats for the task. In the older ARM® Mali™ GPUs, we only supported the ETC and later ETC2 formats because those are Khronos standards. There’s a pretty good reason for sticking to standards: the landscape of compression formats, in terms of both capability and availability, is rather sparsely populated. Your choice of format might not just lock your application into a certain bitrate or channel specification; a proprietary format could also lock it to specific hardware.


ASTC is a texture compression format designed to solve this problem from the ground up, allowing different bit rates, different pixel formats, even different combinations of spatial dimensions to be picked for any given texture. So maybe you want a 2D high bit rate normal map with just X and Y data, or maybe you want a low bit rate 3D HDR RGBA texture? ASTC can do both, and more.


If you want to know how that even works, I already wrote about that at length here. If you want to know how to get the best results from it, you’re in the right place.


The quality of a compressed texture is controlled with three main factors: the bit rate, the limits and the error factors. I’ll tackle these from the easiest to understand to the hardest.


Bit Rates and Block Size


ASTC, as you may know, can encode in different block modes. The dimensions of a single block are called its footprint. Whereas other texture formats have fixed footprints, ASTC has various block footprints from 4x4 to 12x12 (and from 3x3x3 to 6x6x6). What stays the same in ASTC is the data size used to encode it, at exactly 128 bits. So if those 128 bits encode a 4x4 block (16 pixels), that’s 8bpp, whereas the 12x12 block (144 pixels) is a staggering 0.89bpp. If you think that’s impressive, a 6x6x6 block is 216 pixels, making it 0.59 bits per pixel. Ordinarily at this point there would just be a reminder that higher bit rate leads to higher quality and move on, but you’ve spent the time to read this far so I’ll explode that myth for you.

A 128 bit block can represent 2^128 different binary combinations, each of which will map to a specific layout of pixels. The smallest block size, 4x4, contains 16 pixels, which at 32bpp (RGBA) can represent 2^512 different combinations of pixel data. For those not used to thinking in binary, that means you have less than one in a googol’s chance of getting an exact match (a googol is one with a hundred zeros). That may seem very small, but the whole point is that you don’t need an exact match for every outcome, and the best texture compression formats are geared towards the 2^128 pixel layouts most likely to make sense as part of a larger image.
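
The bit rate for any footprint is just 128 bits divided by the number of pixels in the block. A few lines of Python make the arithmetic explicit (the helper name is mine):

```python
BLOCK_BITS = 128  # every ASTC block is encoded in 128 bits

def bits_per_pixel(*footprint):
    """Bit rate for a block footprint given as 2 or 3 dimensions."""
    pixels = 1
    for dimension in footprint:
        pixels *= dimension
    return BLOCK_BITS / float(pixels)

for footprint in [(4, 4), (12, 12), (6, 6, 6)]:
    label = "x".join(str(d) for d in footprint)
    print("%s: %.2f bpp" % (label, bits_per_pixel(*footprint)))
```

This prints 8.00 bpp for 4x4, 0.89 bpp for 12x12 and 0.59 bpp for 6x6x6.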


The point is, if you’re using 12x12 blocks, there are 2^4608 combinations. The probability of getting an exact match on a block that size is less than one in one with a thousand zeroes, which we don’t even have a proper name for; it also means there’s a much lower chance of even getting a passable match for it. The compressor will have to pick the best configuration it can, and hope you don’t notice.
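
If you’d like to check those odds, the sketch below works in logarithms to keep the numbers manageable. It assumes 32bpp raw data and a 128 bit block, and the function name is my own.

```python
import math

def exact_match_log10_odds(block_w, block_h, raw_bpp=32, block_bits=128):
    """log10 of the fraction of raw pixel layouts that a compressed block can hit exactly."""
    raw_bits = block_w * block_h * raw_bpp
    # 2^block_bits encodable layouts out of 2^raw_bits possible ones.
    return (block_bits - raw_bits) * math.log10(2)

print("4x4:   about 1 in 10^%.0f" % -exact_match_log10_odds(4, 4))
print("12x12: about 1 in 10^%.0f" % -exact_match_log10_odds(12, 12))
```

The 4x4 case comes out at roughly 1 in 10^116, comfortably beyond a googol, and the 12x12 case at roughly 1 in 10^1349.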


Limits and Leeway


Which leads us neatly onto limits, or how hard should the compressor try to find a good match? The whole point of texture compression algorithms is that they have a fast deterministic decompression function, but after a few intelligent choices, the best the compressor can do is try out different combinations and see how close they are. This means the more it checks, the more likely it is to find a good one. You don’t necessarily want to check them all; that would take a very long time. This is why you have to set limits. The limits can be things like “only try so many combinations, then give up and pick the best we found” or “if you find one that’s suitably close to the original, use that and stop looking” or even “if you try a few patterns with two partitions and it’s no better than those using a single partition, don’t bother trying three or four partitions” (the concept of partitions is explained in this blog post)
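
The first two kinds of limit can be sketched in a few lines of Python. This is a toy search over integers, not the compressor’s actual algorithm; the function and parameter names are made up for the illustration.

```python
def search_with_limits(candidates, error_of, max_trials, good_enough):
    """Try candidates in order; stop at the trial limit or once one is close enough."""
    best, best_error, trials = None, float("inf"), 0
    for candidate in candidates:
        if trials >= max_trials:
            break  # "only try so many combinations, then give up"
        trials += 1
        error = error_of(candidate)
        if error < best_error:
            best, best_error = candidate, error
        if best_error <= good_enough:
            break  # "suitably close to the original, stop looking"
    return best, best_error, trials

target = 37
distance = lambda candidate: abs(candidate - target)

print(search_with_limits(range(100), distance, max_trials=50, good_enough=2))
print(search_with_limits(range(100), distance, max_trials=10, good_enough=2))
```

The first call stops early with a candidate within the tolerance; the second runs out of trials and has to settle for a much worse match.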


It’s fair to say most people wouldn’t know where to begin setting signal to noise decibel ratings for these kinds of decisions so, handily, the compressor has a few in-built presets from very fast to exhaustive. There’s a chance that it will find the best combination in the very fast presets, but it’s a very low chance. The probability is much higher if you’re willing to wait. The best advice therefore is to iterate your assets with fast or very fast compression, then ship with thorough or exhaustive compression. Curiously, there’s very little difference between the results from thorough and exhaustive, but exhaustive will take a lot longer; this again is down to the relative probabilities involved.

The one question remaining, therefore, is if it’s trying all these different blocks of pixels to see how close they are to the same block in the raw image, how is it comparing them?


Priority and Perception


In order to tell which one out of a hundred or a thousand or even a hundred thousand proposed blocks is the best, you need to be able to compare any pair and say that one is objectively better than the other, then repeat with the best and the next attempt.

The standard way to compare two images is called PSNR, or peak signal to noise ratio: you take your original image, subtract its colour values from those of your resulting image, square the differences so that negative and positive errors count alike, and average them. The ratio part comes from a sort of imaginary maximum error, which would be if an all white image came out all black or vice versa.
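
For reference, PSNR is conventionally computed from the mean squared error and expressed in decibels. Here is a minimal Python sketch for flat lists of 8-bit values; the function name and the sample values are mine.

```python
import math

def psnr(original, compressed, max_value=255):
    """Peak signal to noise ratio, in dB, between two equal-length lists of values."""
    if len(original) != len(compressed):
        raise ValueError("images must have the same number of values")
    mse = sum((a - b) ** 2 for a, b in zip(original, compressed)) / float(len(original))
    if mse == 0:
        return float("inf")  # identical images
    # Compare the actual error against the largest error imaginable.
    return 10 * math.log10(max_value ** 2 / mse)

print("%.2f dB" % psnr([10, 20, 30, 40], [11, 19, 30, 42]))
```

An all white image that came out all black would score 0 dB, and higher numbers mean a closer match.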

But there are different things you might want to preserve.


When the numbers are added together they can have weightings applied to them. Little known fact: the human visual system is more sensitive to high frequency detail in green light than red or blue. Using this knowledge you can add a pre-multiplier to different channels. If you gave a weight of two to the green error, and there were two tiles which differed by roughly the same amount, one mostly in the red channel, one mostly in the green channel, the error in the green channel would be doubled, meaning the one with the red error would be considered a better match.

Alternatively, you could be more concerned about angular error. This is particularly relevant in normal maps where the pixels represent not a colour to be displayed on screen but a field of vectors. In this scenario the ratio of the channels is far more important than simple per channel or overall magnitude differences, and this can be reflected by giving a weight to the angular component.
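
The green weighting example can be shown with a single pixel. This is a simplified sketch using absolute per-channel differences; the function name and pixel values are invented for the illustration.

```python
def weighted_error(original, candidate, weights=(1.0, 1.0, 1.0)):
    """Sum of per-channel absolute errors, each scaled by its channel weight."""
    return sum(w * abs(o - c) for o, c, w in zip(original, candidate, weights))

original = (100, 100, 100)    # RGB
red_off = (110, 100, 100)     # same size of error, in the red channel
green_off = (100, 110, 100)   # same size of error, in the green channel

green_heavy = (1.0, 2.0, 1.0)
print(weighted_error(original, red_off), weighted_error(original, green_off))
print(weighted_error(original, red_off, green_heavy),
      weighted_error(original, green_off, green_heavy))
```

Unweighted, the two candidates tie; with the green channel weighted at two, the candidate with the red error wins.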


One interesting thing that arises as the result of block based comparisons is that errors near the edge of a tile may have positive error within the defined limits, and the errors on the adjacent tile may be negative error within the defined limits, making the step change between two blocks, which should match up of course, larger than the desired error bounds. Block weighting reduces that error by applying additional error weight to boundary mismatches.

If you really want to get under the hood, there are a few settings that tinker with the way individual pixel errors are combined into a full block error. These work by applying weights and pre-multipliers to the mean average error and the standard deviation of the error in a certain radius. I could talk at length about how this may be weighted to favour a tile with a few big errors over a tile with lots of little errors, or certain settings can favour a noisy looking tile over one which smoothes minor details out, and I haven’t even researched all the possibilities yet. Either way it’s a huge topic and one that although I touch upon in the presentation, I’m going to leave alone for now and go into much greater detail at a later date.


Getting Started with ASTC


If you want to try out ASTC you’ve got quite a few options. There are commercial devices available right now with the appropriate hardware to decode ASTC on the GPU even though it’s still a very new technology.


If you’d like to see how it looks without the hardware advantages of memory bandwidth reduction, the OpenGL ES 3.0 emulator can handle ASTC textures (although its underlying technique is to decode them to raw images, the compression artefacts are left intact) so you can try them out in your shaders. To generate ASTC images you have two options: the command line evaluation codec or the texture compression tool. Both of these tools have a lot of preset modes and switches for different use cases.


Things already mentioned like block or channel weighting can be set easily in either tool to clean up specific error cases. Also there are preset modes for normal maps, which map the angular weighted X and Y of the normal to Luminance and Alpha for better compression, and data masking, which tells the encoder to treat errors in different channels separately so that they can encode unrelated non colour data.


Both tools are also capable of encoding volumetric 3D textures. Either of them will accept an uncompressed 3D image file, and the command line tool has commands for accepting an array of 2D slices to build the 3D volume.


In my main auditorium talk at GDC I gave a few more tips on working with compressed textures, and I’ll share those in another blog real soon. For now, download a compressor and have fun playing around with the future of texture compression.




Another year has almost passed for the Mali team and, crikey, has it been a busy one. With the mobile market booming - particularly in the mainstream segment - demand for ARM® Mali GPUs has never been stronger and continues to mount. Our engineers are taking it in their stride, releasing new products across the board to drive the spread of stunning media experiences across the world. Our product roadmap is at its most flexible ever, offering a wide range of GPU, video and display solutions that scale from a single core Mali-400 for smartwatches to the recently announced Mali-T860 to power the next generation of premium smartphones. Meanwhile, in the current generation of premium smartphones, the Mali-T600 Series is shipping in volume and delivering the stunning graphics capabilities of the performance efficient Mali GPU roadmap to consumers who demand only the best media experiences.

If you’re looking this Holiday Season for a mobile device that delivers high quality graphics experiences without draining the battery, why not check out some of the following, all of which are new to the market this year and sport an ARM Mali GPU. You will also have the chance to win some of these fantastic devices as part of the Epic Giveaway; continue reading to find out more.

1.     Samsung Galaxy Note 4

Featuring an Exynos 7 Octa processor with a Mali-T760 MP6 GPU, the Samsung Galaxy Note 4 was released in September 2014 and will be part of the Epic Giveaway in January. With its pixel dense display, vibrant colours and powerful 16MP back facing camera the Samsung Galaxy Note 4 is a feature-rich phone that also provides astonishing performance for compute intensive applications such as high resolution, high end games.

2.     Huawei Honor 6

The Honor 6’s HiSilicon Kirin 920 processor features four ARM® Cortex®-A15 cores and four Cortex-A7 cores in big.LITTLE configuration alongside a Mali-T628 MP4 GPU to offer a good HD display and high performance within an affordable price bracket.

3.     HTC Desire 820s

With the 64-bit, Mali-T760 GPU-based Mediatek MT6752 SoC at its heart the HTC Desire 820s was announced on Sina Weibo, the Chinese social network site, in November.

4.     Pipo Pad-P1

Rockchip were the first to market with a Mali-T760 GPU-based SoC, our most advanced GPU shipping to date, releasing the first devices half a year after the cores were first announced. Their fast time to market is setting the pace for the industry and the implication of the RK3288 is that, in the future, the latest graphics processing will be arriving in the hands of consumers quicker than ever before.

5.     Omate TrueSmart


With a single core Mali-400 GPU at the heart of the Mediatek MT6572, this little smartwatch delivers a smooth UI along with the ability to make calls, navigate and use Android apps independently of your smartphone.

6.     Samsung Galaxy S5

A final device of mention is the Samsung Galaxy S5 which has just entered ARM’s 2014 Epic Giveaway! Released at the start of the year in Barcelona, it sports a Samsung Exynos 5 Octa chipset with octa-core big.LITTLE technology, splitting CPU work across four Cortex-A7 and four Cortex-A15 processors, with an implementation of the Mali-T628 GPU IP delivering the multimedia experience. Enter the Epic Giveaway competition for your chance to get your hands on a Samsung Galaxy S5 this Holiday Season.



The 2014 Epic Giveaway gets underway today. In partnership with HEXUS, ARM is giving you the chance to win amazing new prizes this holiday season! Every day for the next few weeks, we'll be giving away a brand-new ARM-based device. We'll have an array of prizes from ARM and our partners, including Atmel, Nvidia and Samsung, plus many, many more! Each prize draw will be open for seven days, so visit the dedicated competition page to keep tabs on what's up for grabs and what's coming soon.

Click here to find out more and to enter the EPIC Giveaway for your chance to win.

The global smartphone market has witnessed extraordinary growth in recent years with shipments rising by 23% in 2014 to exceed the 1.2bn unit threshold. Market research firm CCS Insight forecasts smartphone shipments of 1.89bn units by 2018. This growth predominantly comes from the lower end of the market, with consumers upgrading from feature phones to mainstream smartphones. ARM® Cortex® CPUs are at the heart of over 95% of the mobile phones and smartphones in the world, and ARM® Mali™ is the #1 licensable GPU with a mature and growing ecosystem of partners.


ARM and its partners have been at the forefront of innovation, delivering energy-efficient and affordable devices to millions of consumers who would like to be able to enjoy features of modern technology like mobile email, web browsing, video streaming and mobile gaming without paying a high price.


The ARM® Mali™-450 MP GPU has been enjoying popularity since its launch and is found in millions of entry to mid-range level smartphones, tablets and set-top boxes. The Mali-450 MP has been designed for the volume market and optimized with a focus on energy and bandwidth savings. It is a perfect energy-efficient, cost and area optimized solution for a market that requires an OpenGL ES 2.0 implementation.


Recently we announced a 64-bit driver for Mali-450MP enabling developers to take full advantage of the latest technology available on the market.

Embarrassingly Parallel

In the world of parallel computing when an algorithm can be easily split into multiple parallel jobs, where the output of each of the jobs doesn’t depend on the output of any other job, it is referred to as “Embarrassingly Parallel” or “Pleasingly Parallel”, whichever you prefer.  The reason for this uncharacteristically prosaic terminology is perhaps inspired by the huge relief such algorithms must bring to the weary parallel compute developer who otherwise has to craft delicate inter-thread communication so that parallel jobs can share their results in whatever their algorithm defines as the correct order.


Let me give you a simple example of such a parallel-friendly algorithm.  Convolution filters are certainly members of the embarrassingly parallel club.  Imagine we have a large array of values:



An example of a convolution filter.  Each transformed pixel value is created by multiplying its current value and the values of the pixels around it against a matrix of coefficients

Each pixel in the image is processed by summing a proportion of its original value with a proportion of the original values of the surrounding pixels.  The proportion of each pixel usually depends on its proximity to the central pixel being processed.  Crucially – and apparently embarrassingly – none of the calculations require knowledge of the result of any of the other calculations.  This makes parallelizing much easier, because each of the pixel calculations can be performed in any order.  Using a parallel compute API like OpenCL, it is then easy to assign each pixel to a work item – or thread – and watch your convolution job split itself across the processing cores you have at your disposal.
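To make that independence concrete, here is a minimal sketch in plain C rather than OpenCL. The function names convolve_pixel and convolve3x3, the 3x3 kernel size and the float pixel format are all assumptions made for illustration. The point is only that each output pixel reads from the source image and never from another result, so the two loops in the driver could be visited in any order, or handed out one pixel per work item.

```c
#include <assert.h>
#include <stddef.h>

/* Compute one output pixel of a 3x3 convolution. It reads only from `src`,
   never from the output, so any pixel can be computed at any time. */
static float convolve_pixel(const float *src, int width, int x, int y,
                            const float kernel[9])
{
    float sum = 0.0f;
    for (int ky = -1; ky <= 1; ky++)
        for (int kx = -1; kx <= 1; kx++)
            sum += src[(size_t)(y + ky) * width + (x + kx)]
                 * kernel[(ky + 1) * 3 + (kx + 1)];
    return sum;
}

/* Serial driver for the interior pixels; border pixels are left untouched.
   In an OpenCL version each (x, y) pair would become one work item. */
static void convolve3x3(const float *src, float *dst, int width, int height,
                        const float kernel[9])
{
    assert(width >= 3 && height >= 3);
    for (int y = 1; y < height - 1; y++)
        for (int x = 1; x < width - 1; x++)
            dst[(size_t)y * width + x] = convolve_pixel(src, width, x, y, kernel);
}
```

With a 3x3 image holding the values 1 to 9 and a kernel of all ones, the single interior pixel comes out as 45, whichever order the pixels are visited in.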


This sort of example is a nice way to showcase parallel programming.  It gets across the message of splitting your job into the smallest processing elements without getting bogged down with too many thorny issues like thread synchronization.  But what of these problem – non-embarrassing – algorithms? How should we tackle those?


Well of course, there’s not one answer.  Life is not that simple.  So we need to resort to an example to showcase the sort of methods at your disposal.


A good one I came across the other day was the Floyd-Steinberg algorithm.  This is the name given to an image dithering algorithm invented by Robert W Floyd and Louis Steinberg in 1976.  It is typically used when you need to reduce the number of colours in an image and still retain a reasonable perception of the relative colour and brightness levels.  This is achieved through pixel dithering.  In other words, an approximation of the required colour in each area of the image is achieved with a pattern of pixels. The result becomes a trade-off: what you lose is the overall level of detail, but what you gain is a reasonable colour representation of the original image.


Here's an example:

colour fs.png

Original image on the left.  On the right the 256-colour dithered version.

In our particular example, we’re going to be looking at converting a grey-scale image – where each pixel can be represented by 256 different levels of grey – to an image only using black and white pixels.


bw fs.png

Grey scale version on the left.  2-colour Floyd-Steinberg version on the right


What you can see in this example – and what Floyd and Steinberg discovered – is this concept of error diffusion, where an algorithm could determine the distribution of pixels from a limited palette to achieve an approximation of the original image.


The algorithm itself is actually quite simple,  and indeed rather elegant.  What you have are three buffers:

  • The original image
  • An error diffusion buffer
  • The destination image

The algorithm defines a method of traversing over an image and for each pixel determining a quantization error – effectively the difference between the pixel’s value and what would be the nearest suitable colour from the available palette.  This determination is made by reference to both the pixel’s current colour and a value read from the error buffer – as written out by previous pixel calculations.  And indeed a proportion of the error calculated for this pixel will then be propagated to neighbouring ones. Here’s how this works:

Step 1: a pixel from the source and a value from the error diffusion buffer are added.  Depending on the result, a white or black pixel is written to the destination and an error value is determined.


Step 2: the value of err is split up and distributed into the error diffusion buffer, across four neighbouring pixels.

The code for doing all this is actually quite simple:

for (int y = 0; y < height - 1; y++)
{
    for (int x = 1; x < width - 1; x++)
    {
        int sum = src[width * y + x] + err_buf[width * y + x];
        int err;

        if (sum > THRESHOLD)
        {
            err                = sum - THRESHOLD;
            dst[width * y + x] = 0xff;              // Write a white pixel
        }
        else
        {
            err                = sum;
            dst[width * y + x] = 0x00;              // Write a black pixel
        }

        err_buf[width * y       + x + 1] += (err * 7) / 16;
        err_buf[width * (y + 1) + x - 1] += (err * 3) / 16;
        err_buf[width * (y + 1) + x    ] += (err * 5) / 16;
        err_buf[width * (y + 1) + x + 1] += (err * 1) / 16;
    }
}


This uses these three buffers:

  • src (the source grey-scale image, 1 byte per pixel)
  • dst (the destination black or white image, 1 byte per pixel)
  • err_buf (the buffer used to hold the distributed error values along the way - 1 byte per pixel and initialised to all-zeros before starting).

width and height refer to the size of the image.  The value of THRESHOLD would typically be set to 128.
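To see the two steps on a single pixel, the following is a small sketch that pulls the threshold test and the error split out of the loop above into two helpers. The names fs_quantise, fs_split_error and fs_shares are hypothetical and exist only for this illustration; the arithmetic is copied from the listing, including its integer division.

```c
#include <assert.h>

#define THRESHOLD 128

/* The four shares of one pixel's error, named by where they land. */
typedef struct {
    int right;        /* (x + 1, y)     receives 7/16 */
    int below_left;   /* (x - 1, y + 1) receives 3/16 */
    int below;        /* (x,     y + 1) receives 5/16 */
    int below_right;  /* (x + 1, y + 1) receives 1/16 */
} fs_shares;

/* Step 1: choose white or black and work out the error, as in the listing. */
static unsigned char fs_quantise(int sum, int *err)
{
    assert(err != 0);
    if (sum > THRESHOLD) {
        *err = sum - THRESHOLD;
        return 0xff;              /* white pixel */
    }
    *err = sum;
    return 0x00;                  /* black pixel */
}

/* Step 2: split the error using the same integer weights as the listing. */
static fs_shares fs_split_error(int err)
{
    fs_shares s;
    s.right       = (err * 7) / 16;
    s.below_left  = (err * 3) / 16;
    s.below       = (err * 5) / 16;
    s.below_right = (err * 1) / 16;
    return s;
}
```

For a source value of 100 with nothing yet in the error buffer, the pixel is written black and err is 100. The shares are then 43, 18, 31 and 6, which add up to 98: integer division quietly drops two units of error for this pixel.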


So I hope you can see the problem here.  We can’t simply assign each pixel’s calculation to an independent work item because we cannot guarantee the order that work items will run.  In OpenCL the order of execution of work items – even the order of execution of work groups – is entirely undefined.  As we progress left to right across the image, and then line by line down the image, each pixel is dependent on the output of 4 previous pixel calculations.


Embarrassingly Serial?


So is there any hope for parallelization here?  On its own perhaps this algorithm is better tackled by the CPU.  But imagine the Floyd-Steinberg filter was part of a filter chain, where there was significant benefit from running the other filters before and after this one on a GPU like the ARM® Mali-T604.



Any move from GPU to CPU will require cache synchronisation, introducing a level of overhead


Here we would need two CPU/GPU synchronization points either side of the Floyd-Steinberg filter.  These are potentially quite expensive.  Not only do the caches need to be flushed back to main memory, but the CPU needs to be ready to take on this job, which could complicate other jobs the CPU might be doing.  So if it was possible to get a GPU version running somehow, even if its processing time was a little slower than the CPU, there might still be some net benefit to the GPU implementation.


Let’s look at the algorithm again and see what we might be able to do.  We can see that the only thing stopping an individual pixel being processed is whether its related error buffer value has been written to by all four pixels: the one to the left, and the three above as follows.


C2 depends on the results of four previous pixel calculations, B1, C1, D1 and B2

From the diagrams we can see that if we want to process pixel C2 we have to wait until B1, C1, D1 and B2 have finished as these all write values into C2’s error buffer location.


If we have a work-item per pixel, each work item would be having to check for this moment and ultimately you could have work items representing pixels quite low down in the image or far to the right that are simply waiting a long time. And if you fill up all the available threads you can run at the same time, and they’re all waiting, you reach deadlock.  Nothing will proceed.  Not good.


What we need to do is to impose some order on all this… some good old-fashioned sequential programming alongside the parallelism.  By serializing parts of the algorithm we can reduce the number of checks a work item would need to do before it can determine it is safe to continue.  One way to do this is to assign an entire row to a single work item.  That way we can ensure we process the pixels in a line from left to right.  The work item processing the row of pixels below then only needs to check the progress of this work item: as long as it is two pixels ahead then it is safe to proceed with the next pixel.  So we would have threads progressing across their row of pixels in staggered form:



Each thread processes a horizontal line of pixels and needs to be staggered as shown here
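A rough way to see what the stagger buys is to count time steps in an idealised model. The sketch below assumes one work item per row, one pixel per step, and as many rows running at once as we like, none of which a real GPU guarantees. It also takes the rule used by the kernel further down, where a row may process pixel x only once the row above has finished pixel x + 2, so each row trails the one above by three steps. The function names are hypothetical.

```c
#include <assert.h>

/* Step (counting from 0) at which pixel (x, y) is processed, for x >= 1.
   Row 0 does pixel 1 at step 0; each row trails the one above by 3 steps,
   because the row above must already have finished pixel x + 2. */
static int earliest_step(int x, int y)
{
    assert(x >= 1 && y >= 0);
    return (x - 1) + 3 * y;
}

/* Total steps for `rows` rows of a `width`-pixel image, where each row
   processes pixels 1 .. width - 2 as in the serial listing. */
static int staggered_steps(int width, int rows)
{
    assert(width >= 3 && rows >= 1);
    return earliest_step(width - 2, rows - 1) + 1;
}
```

Under those assumptions, 128 rows of a 1024-pixel-wide image need 1403 steps, against 1022 x 128 = 130816 steps when the same pixels are visited one at a time. The real gain is far smaller, since it is limited by how many threads the hardware can actually run at once.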

Of course there are a few wrinkles here.  First of all we need to consider workgroups.  Each workgroup – a collection of work items, with each work item processing a line – needs to be run in order.  So the work items in the first workgroup need to process the top set of lines.  The next needs to process the set of lines below this and so on.  But there’s no guarantee that workgroups are submitted to the GPU in sequential order, so simply using the OpenCL function get_group_id – which returns the numerical offset of the current workgroup – won’t do as a way of determining which set of lines is processed.  Instead we can use OpenCL atomics: if the first work item atomically incremented a global value – and then this is used to determine the actual group of lines a workgroup processes – then we can guarantee the lines will be processed in order as they progress down the image.


Here’s a diagram showing how workgroups would share the load within an image:



Each workgroup processes a horizontal band of pixels.  In this case the workgroup size is 128, so the band height is 128 pixels, with each work item (thread) processing a single row of pixels.
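The band assignment can be sketched on the CPU with C11 atomics standing in for the OpenCL atomic increment. This is an illustration under assumptions, not the kernel itself: claim_band and row_for_work_item are hypothetical names, and atomic_fetch_add from stdatomic.h plays the role that atomic_inc plays in OpenCL.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hand out band numbers in the order workgroups actually start running,
   whatever their group id happens to be. */
static unsigned claim_band(atomic_uint *rider)
{
    return atomic_fetch_add(rider, 1u);   /* returns the value before the add */
}

/* Row handled by a work item: band number times band height, plus the
   work item's position inside the band. */
static unsigned row_for_work_item(unsigned band, unsigned local_size,
                                  unsigned local_id)
{
    assert(local_id < local_size);
    return band * local_size + local_id;
}
```

The first workgroup to start claims band 0 and so owns rows 0 to 127 when the workgroup size is 128; the second to start claims band 1, and its work item 5 processes row 133. Because a band number is only handed out to a workgroup that is already running, the band above any running workgroup has always been claimed by a workgroup that has started too.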


So for each line we need a progress marker so that the line below knows which pixel it is safe to calculate up to.  A work item can then sit and wait for the line above if it needs to, ensuring no pixel proceeds until the error distribution values it needs have been written out.


Here’s the rough pseudo code for what the kernel needs to do…


is this the first work item in the workgroup?
    atomic increment the global workgroup rider

    initialize to zero the local progress markers

barrier        // All work items in this workgroup wait until this point is reached

from the global workgroup rider and the local work item id,
determine the line in the image we’re processing

loop through the pixels in the line we’re processing
    wait for the work item representing the line above to
    have completed enough pixels so we can proceed

    do the Floyd-Steinberg calculation for this pixel

    update the progress marker for this line

You may have spotted the next wrinkle in this solution.  The local progress markers are fine for ensuring that individual lines don’t get ahead of themselves – with the exception of the first work item (representing the top line in the group of lines represented by this workgroup).  This first line needs to only progress once the last line of the previous workgroup has got far enough along.  So we need a way of holding markers for the last line of each workgroup as well.  The wait for the first work item then becomes a special case, as does the update of the marker for the last line.


Here’s the initialisation part of the kernel code:


__kernel void fs2( __global uchar         *src,                // The source greyscale image buffer
                   __global uchar         *dst,                // The destination buffer
                   __global uchar         *err_buf,            // The distribution of errors buffer
                   __global uint          *workgroup_rider,    // A single rider used to create a unique workgroup index
                   __global volatile uint *workgroup_progress, // A buffer of progress markers for each workgroup
                   __local volatile uint  *progress,           // The local buffer for each workgroup
                   uint                    width)              // The width of the image
{
      __local volatile uint        workgroup_number;

      /* We need to put the workgroups in some order. This is done by
          the first work item in the workgroup atomically incrementing
          the global workgroup rider. The local progress buffer - used
          by the work items in this workgroup - also needs initialising. */
      if (get_local_id(0) == 0)            // A job for the first work item...
      {
            // Get the global order for this workgroup...
            workgroup_number        = atomic_inc(workgroup_rider);

            // Initialise the local progress markers...
            for (int i = 0; i < get_local_size(0); i++)
                progress[i]        = 0;
      }

      barrier(CLK_LOCAL_MEM_FENCE);        // Wait here so we know progress buffer and
                                           // workgroup_number have been initialised


Note the use of the 'volatile' keyword when defining some of the variables here.  This hints to the compiler that these values can be changed by other threads, thereby avoiding certain optimisations that might otherwise be made.


The barrier in the code is also something to highlight.  There are often better ways than using barriers, typically using some kind of custom semaphore system.  The barrier here however is only used as part of the initialization of the kernel, and is not used subsequently within the loop.  Even so, I implemented a version that used a flag for each workgroup, setting the flag once the initialization has been done during the first work item’s setup phase, and then sitting and checking for the flag to be set for each of the other work items.  It was a useful exercise, but didn’t show any noticeable difference in performance.


With initialization done, it’s time to set up the loop that will traverse across the line of pixels:

      /* The area of the image we work on depends on the workgroup_number determined earlier.
          We multiply this by the workgroup size and add the local id index. This gives us the
          y value for the row this work item needs to calculate. Normally we would expect to
          use get_global_id to determine this, but can't here. */
      int                y = (workgroup_number * get_local_size(0)) + get_local_id(0);
      int                err;
      int                sum;

      for (int x = 1; x < (width - 1); x++)  // Each work item processes a line (ignoring 1st and last pixels)...
      {
          /* Need to ensure that the data in err_buf required by this
              workitem is ready. To do that we need to check the progress
              marker for the line just above us. For the first work item in this
              workgroup, we get this from the global workgroup_progress buffer.
              For other work items we can peek into the progress buffer local
              to this workgroup.

              In each case we need to know that the previous line has reached
              2 pixels on from our own current position... */
          if (get_local_id(0) > 0)          // For all work items other than the first in this workgroup...
          {
              while (progress[get_local_id(0) - 1] < (x + 2));
          }
          else                              // For the first work item in this workgroup...
          {
              if (workgroup_number > 0)
                  while (workgroup_progress[workgroup_number - 1] < (x + 2));
          }

At the top of each loop we need to ensure the line above has got far enough ahead of where this line is.  So the first item in the work group checks on the progress of the last line in the previous workgroup, whilst the other items check on the progress of the line above.


After that, we’re finally ready to do the Floyd-Steinberg calculation for the current pixel:


          sum = src[(width * y) + x] + err_buf[(width * y) + x];

          if (sum > THRESHOLD)
          {
              err                  = sum - THRESHOLD;
              dst[(width * y) + x] = 0xff;
          }
          else
          {
              err                  = sum;
              dst[(width * y) + x] = 0x00;
          }

          // Distribute the error values...
          err_buf[(width * y)       + x + 1] += (err * 7) / 16;
          err_buf[(width * (y + 1)) + x - 1] += (err * 3) / 16;
          err_buf[(width * (y + 1)) + x    ] += (err * 5) / 16;
          err_buf[(width * (y + 1)) + x + 1] += (err * 1) / 16;

The final thing to do within the main loop is to set the progress markers to reflect that this pixel is done:


          /* Set the progress marker for this line...

              If this work item is the last in the workgroup we set
              the global marker so the first item in the next
              workgroup will pick this up.

              For all other workitems we set the local progress marker. */
          if (get_local_id(0) == (get_local_size(0) - 1))      // Last work item in this workgroup?
              workgroup_progress[workgroup_number]  = x;
          else
              progress[get_local_id(0)]              = x;
      }   // End of the loop across the line of pixels


There’s one more thing to do.  We need to set the progress markers to just beyond the width of the image so subsequent lines can complete:

      /* Although this work item has now finished, subsequent lines
          need to be able to continue to their ends. So the relevant
          progress markers need bumping up... */
      if (get_local_id(0) == (get_local_size(0) - 1)) // Last work item in this workgroup?
          workgroup_progress[workgroup_number]        = width + 2;
      else
          progress[get_local_id(0)]                  = width + 2;
}   // End of kernel fs2




A Word about Warp

Before I talk about performance – and risk getting too carried away – it’s worth considering again the following line:


while (progress[get_local_id(0) - 1] < (x + 2));


This loop keeps a work item waiting until a progress marker is updated, ensuring the processing for this particular line doesn’t proceed until it’s safe to do so.  The progress marker is updated by the thread processing the line above.  Other than the use of barriers, inter-thread communication is not specifically ruled out in the specification for OpenCL 1.1 or 1.2.  But neither is it specifically advocated.  In other words, it is a bit of a grey area.  As such, there is a risk that behaviour might vary across different platforms.


Take wavefront (or “warp”)-based GPUs for example.  With wavefront architectures threads (work items) are clustered together into small groups, each sharing a program counter.  This means threads within such a group cannot diverge.  They can go dormant whilst other threads follow a different conditional path, but ultimately they are in lock-step with each other.  This has some advantages when it comes to scalability, but on such hardware the above line will stall: if a thread is waiting for another thread in the same warp, neither of the two can ever progress.


The Mali-T600, -T700 and -T800 series of GPUs are not wavefront based.  With each thread having its own program counter, threads are entirely independent of each other so the above technique runs fine.  But it should be easy enough to accommodate wavefront by replacing the above 'while' loop with a conditional to determine whether the thread can continue:


Current method for the main loop:

for (x = 1; x < (width - 1); x++)
    wait for line above to be >= 2 pixels ahead
    process pixel x
    update progress for this line

Alternative method supporting wavefront-based architectures:

for (x = 1; x < (width - 1); )
    if line above is >= 2 pixels ahead
          process pixel x
          update progress for this line
          increment x

The alternative version allows the loop to iterate regardless of whether the previous line is ready or not.  Note that in this version, x now only increments if the pixel is processed.
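The conditional form of the loop can be tried out on a CPU with a small lock-step simulation. This is a sketch under stated assumptions, not GPU code: fs_pixel, fs_serial and fs_lockstep are hypothetical names, the error buffer is widened to int so that accumulated values cannot wrap (the kernel above uses one byte per pixel), and one sweep of the outer loop stands in for one iteration of a group of threads sharing a program counter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define THRESHOLD 128

/* One Floyd-Steinberg step for pixel (x, y), same arithmetic as the
   listings above but with an int error buffer (an assumption). */
static void fs_pixel(const unsigned char *src, unsigned char *dst,
                     int *err_buf, int width, int x, int y)
{
    int sum = src[width * y + x] + err_buf[width * y + x];
    int err;
    if (sum > THRESHOLD) { err = sum - THRESHOLD; dst[width * y + x] = 0xff; }
    else                 { err = sum;             dst[width * y + x] = 0x00; }
    err_buf[width * y       + x + 1] += (err * 7) / 16;
    err_buf[width * (y + 1) + x - 1] += (err * 3) / 16;
    err_buf[width * (y + 1) + x    ] += (err * 5) / 16;
    err_buf[width * (y + 1) + x + 1] += (err * 1) / 16;
}

/* Serial reference: left to right, then line by line down the image. */
static void fs_serial(const unsigned char *src, unsigned char *dst,
                      int *err_buf, int width, int height)
{
    for (int y = 0; y < height - 1; y++)
        for (int x = 1; x < width - 1; x++)
            fs_pixel(src, dst, err_buf, width, x, y);
}

/* Lock-step simulation: each sweep gives every row exactly one turn. A row
   that is not yet safe skips its turn instead of spinning. Returns the
   number of sweeps needed to finish the image. */
static int fs_lockstep(const unsigned char *src, unsigned char *dst,
                       int *err_buf, int width, int height)
{
    assert(width >= 3 && height >= 2);
    int rows = height - 1;                 /* the last row is never processed */
    int *next = malloc(sizeof *next * (size_t)rows);        /* next x per row */
    int *progress = calloc((size_t)rows, sizeof *progress); /* marker per row */
    assert(next && progress);
    for (int y = 0; y < rows; y++)
        next[y] = 1;

    int remaining = rows, sweeps = 0;
    while (remaining > 0) {
        sweeps++;
        /* Bottom-up, so each row sees the marker the row above had at the
           start of the sweep, as threads in lock-step would. */
        for (int y = rows - 1; y >= 0; y--) {
            int x = next[y];
            if (x >= width - 1)
                continue;                               /* row already done */
            if (y > 0 && progress[y - 1] < x + 2)
                continue;                               /* not safe yet     */
            fs_pixel(src, dst, err_buf, width, x, y);
            progress[y] = x;
            if (++next[y] == width - 1) {               /* row just finished */
                progress[y] = width + 2;
                remaining--;
            }
        }
    }
    free(next);
    free(progress);
    return sweeps;
}
```

Running fs_serial and fs_lockstep over the same source image, each with its own zeroed destination and error buffers, should leave both pairs of buffers identical, because the progress test keeps any two rows from touching the same error buffer entry in the same sweep. It says nothing about how a real wavefront-based GPU would perform.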


It’s also worth mentioning that as all the work items in the same wavefront are in lock-step by design, once the work items have been started further checking between the threads would be unnecessary.  It might be feasible to optimise a kernel for a wavefront-based GPU to take advantage of this.


How did it do?

Technically, the above worked, producing an identical result to the CPU reference implementation.  But what about performance?  The OpenCL implementation ran between 3 and 5 times faster than the CPU implementation.  So there is a really useful uplift from the GPU version. It would also be possible to create a multithreaded version on the CPU, and this would potentially provide some improvement.  But remember that if this filter stage was part of a chain running on the GPU, with the above solution we can now slot this right in amongst the others, further freeing the CPU and removing those pesky sync points.


kernels 2.png

Moving the CPU kernel to the GPU will remove the need for cache synchronization in the above example

And what about further room for improvement?  There are all sorts of optimisation techniques we would normally advocate, and those steps have not been explored in detail here.  But just for example, the byte read and writes could be optimised to load and store several bytes in one operation.  There are links at the bottom of this post to some articles and other resources which go into these in more detail.  With a cursory glance however it doesn’t appear that many of the usual optimisation suspects would apply easily here… but nevertheless I would be fascinated if any of you out there can see some interesting ways to speed this up further.  In the meantime it is certainly encouraging to see the improvement in speed which the Mali GPU brings to the algorithm.


Platform used for case study


CPU: ARM Cortex®-A15 running at 1.7GHz

GPU: ARM Mali-T604 MP4 running at 544MHz

Further resources


For more information about optimising compute on Mali GPUs, please see the various tutorials and documents listed here:

GPU Compute, OpenCL and RenderScript Tutorials - Mali Developer Center



This work by ARM is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. However, in respect of the code snippets included in the work, ARM further grants to you a non-exclusive, non-transferable, limited license under ARM’s copyrights to Share and Adapt the code snippets for any lawful purpose (including use in projects with a commercial purpose), subject in each case also to the general terms of use on this site. No patent or trademark rights are granted in respect of the work (including the code snippets).

Previous blog in the series: Mali Performance 3: Is EGL_BUFFER_PRESERVED a good thing?



In my previous blogs in this series I have looked at the bare essentials for using a tile-based rendering architecture as efficiently as possible, in particular showing how to structure an application's use of framebuffers most efficiently to minimize unnecessary memory accesses. With those basics out of the way I can now start looking in more detail at how to drive the OpenGL ES API most efficiently to get the best out of a platform using Mali, but before I do that, I would like to introduce my five principles of performance optimization.


Principle 1: Know Your Goals


When starting an optimization activity have a clear set of goals for where you want to end up. There are many possible objectives to optimization: faster performance, lower power consumption, lower memory bandwidth, or lower CPU overhead to name the most common ones. The kinds of problems you look for when reviewing an application will vary depending on what type of improvement you are trying to make, so getting this right at the start is of critical importance.


It is also very easy to spend increasingly large amounts of time for smaller and smaller gains, and many optimizations will increase complexity of your application and make longer term maintenance problematic. Review your progress regularly during the work to determine when to say "we've done enough", and stop when you reach this point.


Principle 2: Don't Just Try to Make Things Fast


I am often asked by developers working with Mali how they can make a specific piece of content run faster. This type of question is then often quickly followed up by more detailed questions on how to squeeze a little more performance out of a specific piece of shader code, or how to tune a specific geometry mesh to best fit the Mali architecture. These are all valid questions, but in my opinion often unduly narrow the scope of the optimization activity far too early in the process, and leave many of the most promising avenues of attack unexplored.


Both of the questions above try to optimize a fixed workload, and both make the implicit assumption that the workload is necessary at all. In reality graphics scenes often contain a huge amount of redundancy - objects which are off screen, objects which are overdrawn by other objects, objects where half the triangles are facing away from the user, etc - which contribute nothing to the final render. Optimization activities therefore need to attempt to answer two fundamental questions:


  1. How do I remove as much redundant work from the scene as possible, as efficiently as possible?
  2. How do I fine tune the performance of what is left?


In short - don't just try to make something faster, try to avoid doing it at all whenever possible! Some of this "work avoidance" must be handled entirely in the application, but in many cases OpenGL ES and Mali provide tools which can help, provided you use them correctly. More on this in a future blog.


Principle 3: Graphics is Art not Science


If you are optimizing a traditional algorithm on a CPU there is normally a right answer, and failure to produce that answer will result in a system which does not work. For graphics workloads we are simply trying to create a nice looking picture as fast as possible; if an optimized version is not bit-exact it is unlikely anyone will actually notice, so don't be afraid to play with the algorithms a little if it helps streamline performance.


Optimization activities for graphics should look at the algorithms used, and if their expense does not justify the visual benefits they bring then do not be afraid to remove them and replace them with something totally different. Real-time rendering is an art form, and optimization and performance is part of that art. In many cases smooth framerate and fast performance is more important than a little more detail packed into a single frame.


Principle 4: Data Matters


GPUs are data-plane processors, and graphics rendering performance is often dominated by data-plane problems. Many developers spend a long time looking at OpenGL ES API function call sequences to determine problems, without really looking at the data they are passing into those functions. This is nearly always a serious oversight.


OpenGL ES API call sequences are of course important, and many logic issues can be spotted by looking at these during optimization work, but remember that the format, size, and packing of data assets is of critical importance and must not be forgotten when looking for opportunities to make things faster.


Principle 5: Measure Early, Measure Often


The impact of a single draw call on scene rendering performance is often impossible to tell from the API level, and in many cases seemingly innocuous draw calls have some of the largest performance overheads. I have seen many performance teams sink days or even weeks of time into optimizing something, only to belatedly realise that the shader they have been tuning only contributes 1% of the overall cost of the scene, so while they have done a fantastic job and made it 2x faster, that only improves overall performance by 0.5%.
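The arithmetic behind that 0.5% figure is worth making explicit. The following is a minimal sketch, not part of any ARM tool; the class and method names are hypothetical. It computes the fraction of total frame cost removed when one part of the workload is made faster.

```java
final class SpeedupEstimate {
    /**
     * Fraction of total frame cost removed when a part that accounts for
     * `fraction` of the cost (0..1) is made `speedup` times faster.
     */
    static double overallSaving(double fraction, double speedup) {
        // The part still costs fraction / speedup afterwards; the rest is saved.
        return fraction * (1.0 - 1.0 / speedup);
    }
}
```

A shader worth 1% of the frame made 2x faster gives `overallSaving(0.01, 2.0)`, which is 0.005, or 0.5%. The same 2x improvement applied to something worth 60% of the frame would save 30%, which is why it pays to measure before choosing what to tune.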


I always recommend measuring early and often, using tools such as DS-5 Streamline to get an accurate view of the GPU hardware performance via the integrated hardware performance counters, and Mali Graphics Debugger to work out which draw calls are contributing to that rendering workload. Use the performance counters not only to identify hot spots to optimize, but also to sanity check what your application is doing against what you expect it to be doing. For example, manually estimate the number of pixels, texels, or memory accesses, per frame and compare this estimate against the counters from the hardware. If you see twice as many pixels as expected being rendered then there are possibly some structural issues to investigate first which could give much larger wins than simple shader tuning.
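As a rough illustration of that sanity check, here is a small sketch that compares a hand-made pixel estimate against a measured value. The class and method names are hypothetical, and the measured number is assumed to be whatever per-frame pixel count you have read out of your profiling tool; no real counter API is used here.

```java
final class PixelBudgetCheck {
    /** Expected shaded pixels per frame for a resolution and an assumed average overdraw. */
    static long expectedPixels(int width, int height, double averageOverdraw) {
        return Math.round((double) width * height * averageOverdraw);
    }

    /** Ratio of measured to expected pixels; values well above 1.0 deserve a closer look. */
    static double ratio(long measuredPixels, long expectedPixels) {
        return (double) measuredPixels / expectedPixels;
    }
}
```

For a 1920x1080 frame with an assumed average overdraw of 1.5, the estimate is 3,110,400 pixels. If the hardware reports 6,220,800, the ratio is 2.0, and the application is shading twice as many pixels as you thought it was.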


Next Time


Optimizations in graphics are best tackled when they are made a structural part of how an application presents data to the OpenGL ES API, so in my next blog I will be looking at some of the things an application might want to consider when trying very hard to not do any work at all.




Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.



This last blog in the Movie Vision App series, following on from The Movie Vision App: Part 1 and The Movie Vision App: Part 2, will discuss two final movie effect filters.



Movie Vision Filters: “Follow The White Rabbit…”


This is the most intriguing and complex filter in the Movie Vision demonstration. The camera preview image is replaced by a grid of small characters (primarily Japanese Kana). The characters are coloured varying shades of green reminiscent of old computer displays. The brightness is also manipulated to create the appearance of some characters ‘falling’ down the image. The overall impression is that the image is entirely composed of green, computer-code-like characters.


      //Run the WhiteRabbitScript with the RGB camera input allocation.
      mWhiteRabbitScript.forEach_root(mWhiteRabbitInAlloc, mWhiteRabbitOutAlloc);
      //Make the heads move, dependent on the speed.
      for(int hp = 0; hp < mScreenWidth / mCharacterSize; hp++) {
          //... (lines omitted in this excerpt)
          //If the character string has reached the bottom of the screen, wrap it back around.
          if(mHeadPos[hp] > mScreenHeight + 150) {
              mHeadPos[hp] = 0;
              mStrChar[hp] = mGen.nextInt(8)+1;
              mStrLen[hp] = mGen.nextInt(100)+50;
              mUpdate = true;
          }
      }
      //If a character string has reached the bottom, update the allocations with new random values.
      if(mUpdate) {
          mUpdate = false;
          //... (lines omitted in this excerpt)
      }

“Follow the White Rabbit” excerpt from processing of each camera frame


The Java component of this image filter does the standard RenderScript setup, but also populates several arrays to use in mapping the image to characters. The number of columns and rows of characters is calculated and a random character index is set for each column. A set of head positions and string lengths is also randomly generated for each column. These correspond to areas that will be drawn brighter than the rest of the image, to give the impression of falling strings of characters. On receipt of each camera preview frame, the standard YUV to RGB conversion is performed. Then, the RenderScript image effect script’s references to the character, position and length arrays are updated, and the script kernel is executed. Afterwards, the head positions are adjusted so that the vertical brighter strings appear to fall down the image (and wrap back to the top).


static const int character1[mWhiteRabbitArraySize] = {0, 0, 1, 0, 0, 0,
                                                  0, 0, 1, 0, 0, 0,
                                                  1, 1, 1, 1, 1, 1,
                                                  0, 0, 1, 0, 0, 1,
                                                  0, 1, 0, 0, 0, 1,
                                                  1, 0, 0, 0, 1, 1};
static const int character2[mWhiteRabbitArraySize] = {0, 0, 1, 1, 1, 0,
                                                  1, 1, 1, 1, 1, 1,
                                                  0, 0, 0, 1, 0, 0,
                                                  0, 0, 0, 1, 0, 0,
                                                  0, 0, 1, 0, 0, 0,
                                                  0, 1, 0, 0, 0, 0};

“Follow the White Rabbit” RenderScript character setup


This is by far the most complicated RenderScript kernel in the Movie Vision app. The script file starts with eight statically defined characters from the Japanese Kana alphabet. These are defined as 6x6 arrays. The first line of the script execution is a conditional statement – the script only executes on every eighth pixel in both the x and y direction. So, the script executes ‘per character’ rather than ‘per pixel’. As we use 6x6 characters, this gives a one pixel border to each character. The output colour for the current position is set to a default green value, based on the input colour. The character index, header position and length values are retrieved from the arrays managed by the Java class. Next, we determine if the character corresponding to the current pixel is in our bright ‘falling’ string, and adjust the green value appropriately: brightest at the head, gradually fading behind and capped at a lower maximum value elsewhere. If the current character position isn’t at the front of the falling string, we also pseudo randomly change the character to add a dynamic effect to the image. Next, some basic skin tone detection is used to further brighten the output if skin is indeed detected. Finally, the output values for all pixels in the current character position are set.


      //Sets the initial green colour, which is later modified depending on the in pixel.
      refCol.r = 0;
      refCol.g = in->g;
      refCol.b = in->g & 30;
      //If the Y position of this pixel is the same as the head position in this column.
      if(y == currHeadPos)
          refCol.g = 0xff; //Set it to solid green.
      //If the character is within the bounds of the falling character string for that column, make it darker the further away
      //from the head it is.
      else if((y < currHeadPos && y >= (currHeadPos - currStringLength)) || (y < currHeadPos && (currHeadPos - currStringLength) < 0))
          refCol.g = 230 - ((currHeadPos - y));
      else if(refCol.g > 150) //Cap the green at 150.
          refCol.g -= 100;
      //... (lines omitted in this excerpt)
      refCol.g += refCol.g | 200; //For every other character, make it brighter.
      //If the current character isn't the head, randomly change it.
      if(y != currHeadPos)
          theChar += *(int*)rsGetElementAt(stringChars, (y/mWhiteRabbitCharSize));
      //Basic skin detection to highlight people.
      if(in->r > in->g && in->r > in->b) {
          if(  in->r > 100 && in->g > 40
            && in->b > 20 && (in->r - in->g) > 15)
              refCol.g += refCol.g & 255;
      }
      //Loop through the binary array of the current character.
      for(int py = 0; py < mWhiteRabbitCharSize; py++){
          for(int px = 0; px < mWhiteRabbitCharSize; px++){
              out[(py*mWidth)+px].r = 0;
              out[(py*mWidth)+px].g = 0;
              out[(py*mWidth)+px].b = 0;
              if(theChar == 1) {
                  if(character1[(py*(mWhiteRabbitCharSize))+px] == 1)
                      out[(py*mWidth)+px] = refCol;
              }else if(theChar == 2) {
                  if(character2[(py*(mWhiteRabbitCharSize))+px] == 1)
                      out[(py*mWidth)+px] = refCol;
              }
              //... (the remaining characters are omitted in this excerpt)
          }
      }

Excerpts of “Follow the White Rabbit” Renderscript Kernel root function
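To make the head, tail and background brightness rules easier to follow in isolation, here is a small plain-Java sketch of just that part of the logic. It is not the RenderScript kernel: the class and method names are hypothetical, the alternating-character brightening and skin detection are left out, and the second half of the kernel's tail condition is dropped on the assumption that `y` is never negative (in which case it is already covered by the first half).

```java
final class FallingStringBrightness {
    /**
     * Green level for a character cell at row y, given the head position and
     * length of the falling string in that column and the cell's base green.
     */
    static int greenLevel(int y, int headPos, int stringLength, int baseGreen) {
        if (y == headPos) {
            return 0xff;                        // head of the string: solid green
        }
        if (y < headPos && y >= headPos - stringLength) {
            return clamp(230 - (headPos - y));  // tail: fades with distance from the head
        }
        // Everywhere else: pull bright cells down so the falling string stands out.
        return baseGreen > 150 ? baseGreen - 100 : baseGreen;
    }

    static int clamp(int v) {
        return Math.max(0, Math.min(255, v));
    }
}
```

With a head at row 100 and a string length of 50, row 100 returns 255, row 90 returns 220, and a bright background cell with a base green of 200 at row 40 is reduced to 100.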



Movie Vision Filters: “Why So Serious?”



This filter mimics a sonar vision effect. Part of this is a simple colour mapping to a blue toned image. In addition, areas of the image are brightened relative to the amplitude of sound samples from the microphone.


    mRecorder = new MediaRecorder();
    try {
        //... (lines omitted in this excerpt)
        mRecording = true;
    } catch (IOException ioe){
        mRecording = false;
    }

“Why so serious?” setting up the microphone


The Java side of this filter does the standard configuration for a RenderScript kernel. It also sets up the Android MediaRecorder to constantly record sound, but dumps the output to /dev/null. A set of look-up tables, similar to the ‘Get to the chopper’ filter, are used to do the colour mapping. References to these are passed to the script. For each camera preview frame, the maximum sampled amplitude since the last frame and a random x and y position are passed to the RenderScript kernel. The image is converted to RGB and then the image effect kernel is executed.


    //If the current pixel is within the radius of the circle, apply the 'pulse' effect colour.
    if (((x1*x1)+(y1*y1)) < (scaledRadius*scaledRadius)){
        dist = sqrt((x1*x1)+(y1*y1));
        if (dist < scaledRadius){
            effectFactor = (dist/scaledRadius) * 2;
            lightLevel *= effectFactor;
            blue -= lightLevel;
        }
    }
    //Lookup the RGB values based on the external lookup tables.
    uchar R = *(uchar*)rsGetElementAt(redLUT, blue);
    uchar G = *(uchar*)rsGetElementAt(greenLUT, blue);
    uchar B = *(uchar*)rsGetElementAt(blueLUT, blue);
    //Clamp the values between 0-255
    R > 255? R = 255 : R < 0? R = 0 : R;
    G > 255? G = 255 : G < 0? G = 0 : G;
    B > 255? B = 255 : B < 0? B = 32 : B;
    //Set the final output RGB values.
    out->r = R;
    out->g = G;
    out->b = B;
    out->a = 0xff;

“Why So Serious?” RenderScript Kernel root function


The RenderScript kernel calculates a brightness, radius and offset for a ‘pulse’ effect based on the amplitude and position passed to it. If the current pixel is within the pulse circle, it is brightened considerably. The output colour channels for the pixel are then set based on the lookup tables defined in the Java file.
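The pulse calculation can be sketched on its own in plain Java. This is an illustration rather than the kernel itself: the class name is hypothetical, `dx` and `dy` are assumed to be the pixel's offset from the pulse centre (our reading of `x1` and `y1` in the excerpt), and the result is clamped before it is used as a table index, which is a choice made for this sketch.

```java
final class PulseSketch {
    /**
     * Returns the lookup-table index for a pixel after applying the pulse.
     * blue is the pixel's starting index, lightLevel the pulse strength.
     */
    static int pulseIndex(int blue, float lightLevel, float dx, float dy, float radius) {
        float distSq = dx * dx + dy * dy;
        if (distSq < radius * radius) {
            float dist = (float) Math.sqrt(distSq);
            float effectFactor = (dist / radius) * 2f;  // 0 at the centre, approaching 2 at the rim
            blue -= (int) (lightLevel * effectFactor);
        }
        return Math.max(0, Math.min(255, blue));        // keep the index inside a 256-entry table
    }
}
```

With a radius of 10 and a light level of 40, a pixel halfway out (offset 5, 0) moves from index 200 to 160, while a pixel outside the circle keeps its original index.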



Movie Vision: Conclusions


Can you guess which movies inspired “Follow the White Rabbit” and “Why So Serious?” ?


At the beginning of this blog series we stated that the Movie Vision app was conceived as a demonstration to highlight heterogeneous computing capabilities in mobile devices. Specifically, we used RenderScript on Android to show the GPU Compute capabilities of ARM® Mali™ GPU technology. As a proof of concept and a way to explore one of the emerging GPU computing programming frameworks, Movie Vision has been very successful: RenderScript has proven to be an easy to use API. It is worth noting that it is highly portable, leveraging both ARM CPU and GPU technology. The Movie Vision App explored a fun and entertaining use-case, but it is only one example of the potential of heterogeneous approaches like GPU Compute.


We hope you have enjoyed this blog series, and that this inspires you to create your own applications that explore the capabilities of ARM technology.




Creative Commons License

This work by ARM is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. However, in respect of the code snippets included in the work, ARM further grants to you a non-exclusive, non-transferable, limited license under ARM’s copyrights to Share and Adapt the code snippets for any lawful purpose (including use in projects with a commercial purpose), subject in each case also to the general terms of use on this site. No patent or trademark rights are granted in respect of the work (including the code snippets).

We are pleased to release a new version of the Mali OpenGL ES Emulator, v2.0-BETA*, which adds support for OpenGL ES 3.1.

From Khronos website:

“OpenGL ES 3.1 provides the most desired features of desktop OpenGL 4.4 in a form suitable for mobile devices,” said Tom Olson, chair of the OpenGL ES working group and Director of Graphics Research at ARM. “It provides developers with the ability to use cutting-edge graphics techniques on devices that are shipping today.”

The OpenGL® ES Emulator is a library that maps OpenGL ES 3.1 API calls to the OpenGL API. By running on a standard PC, the emulator helps software development and testing of next generation OpenGL ES 3.1 applications since no embedded platform is required.

Get the Mali OpenGL ES Emulator




Following on from The Movie Vision App: Part 1, in Part 2, we’ll immediately move on and discuss two more image filters implemented for the project.



Movie Vision Filters: “An Old Man Dies…”



This filter is only slightly more complex than the “I’ll be back” effect described in the previous blog. The camera preview image is filtered to a black and white, grainy ‘comic book’ style, but any objects detected as red retain their colour.

The Java portion of the filter does the standard RenderScript initialisation. A Script Intrinsic is used to convert the YUV camera preview data to an RGB array, and a second custom script applies the actual visual effect.


/*
 * Function: root
 * param uchar4 *in     The current RGB pixel of the input allocation.
 * param uchar4 *out    The current RGB pixel of the output allocation.
 * param uint32_t x     The X position of the current pixel.
 * param uint32_t y     The Y position of the current pixel.
 */
void root(const uchar4 *in, uchar4 *out, uint32_t x, uint32_t y){
    //The black and white output char.
    uchar4 bw;
    //Range between -120 and 120, 120 being the highest contrast.
    //We're applying this to make a high-contrast image.
    int contrast = 120;
    //Float literals stop the division being truncated to an integer.
    float factor = (255.0f * (contrast + 255)) / (255.0f * (255 - contrast));
    int c = trunc(factor * (in->r - 128) + 128)-50;
    if(c >= 0 && c <= 255)
        bw.r = c;
    else
        bw.r = in->r;
    //Now determine if we apply a 'grain' effect to this pixel - every 4th pixel
    //If the current pixel is divisible by 4, apply a 'grain' effect.
    if(x % 4 == 0 && y % 4 == 0)
        bw.r &= in->g;
    //Finally determine if this pixel is 'red' enough to be left as red...
    //Red colour threshold.
    if (in->r > in->g+55 && in->r > in->b+60) {
        //Only show the red channel.
        bw.g = 0;
        bw.b = 0;
    } else {
        //Make all colour channels the same (Black & White).
        bw.g = bw.r;
        bw.b = bw.r;
    }
    //Set the output pixel to the new black and white one.
    *out = bw;
}

“An Old Man Dies” RenderScript Kernel root function


First, we apply a formula to calculate a colour value for the pixel that will result in a high contrast black & white image. The value for every fourth pixel is further modified to stand out, resulting in a slight grain effect. Finally, if the pixel’s red colour value exceeds a certain threshold, only the red channel for that pixel is shown. Otherwise, the blue and green channels are set to the same value as the red to achieve the black & white look.
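The three steps can be tried out on individual pixel values with a plain-Java sketch like the one below. It is an illustration only: the class and method names are hypothetical, channels are passed as ints in the range 0 to 255, and it assumes the contrast factor is meant to be evaluated in floating point.

```java
final class OldManDiesSketch {
    static final int CONTRAST = 120; // same contrast setting as the kernel excerpt

    /** Contrast factor from the kernel's formula, evaluated in floating point. */
    static float contrastFactor(int contrast) {
        return (255f * (contrast + 255)) / (255f * (255 - contrast));
    }

    /** Returns {r, g, b} for one pixel at (x, y); all channels are 0..255. */
    static int[] filterPixel(int r, int g, int b, int x, int y) {
        // High-contrast grey level; out-of-range results fall back to the input red.
        int c = (int) (contrastFactor(CONTRAST) * (r - 128) + 128) - 50;
        int grey = (c >= 0 && c <= 255) ? c : r;
        // Grain: pixels whose x and y are both multiples of 4 are masked with green.
        if (x % 4 == 0 && y % 4 == 0) {
            grey &= g;
        }
        // Strongly red pixels keep only their red channel.
        boolean isRed = r > g + 55 && r > b + 60;
        return isRed ? new int[] {grey, 0, 0} : new int[] {grey, grey, grey};
    }
}
```

A mid-grey pixel (150, 150, 150) away from the grain grid comes out as (139, 139, 139), whereas a strongly red pixel such as (200, 50, 50) keeps its red channel and loses green and blue.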

Once again, can you guess the movie that inspired this filter?



Movie Vision Filters: “Get To The Chopper…”


This filter creates a ‘thermal camera’ effect, and also applies a Heads Up Display (HUD) type decoration. The colour filtering utilises RenderScript, whilst the HUD leverages Android’s built in face detection. A set of look-up tables map specific colour ranges to output colours. Thermal cameras generally map infrared to a narrow set of output colours. This image filter mimics this by mapping input image colours to a similar set of output colours.


/*
 * Creates the lookup table used for the 'heat map', splitting the image into
 * 16 different colours.
 */
private void createLUT() {
    final int SPLIT = 8;
    for (int ct = 0; ct < mMaxColour/SPLIT; ct++){
        for (int i = 0; i < SPLIT; i++){
            switch (ct) {
                /* The following cases define a set of colours. */
                case (7):
                    mRed[(ct*SPLIT) +i] = 0;
                    mGreen[(ct*SPLIT) +i] = 255;
                    mBlue[(ct*SPLIT) +i] = 0;
                    break;
                case (6):
                    mRed[(ct*SPLIT) +i] = 128;
                    mGreen[(ct*SPLIT) +i] = 255;
                    mBlue[(ct*SPLIT) +i] = 0;
                    break;
                //... (the remaining cases are omitted in this excerpt)
            }
        }
    }
}

“Get to the Chopper” creating look-up tables


On the Java side, along with the typical setup of Allocation objects to pass input to and receive output from RenderScript, three look-up tables are defined: one each for the red, green and blue colour channels. Each look-up table is essentially an array of 256 values, giving the output value for each of the possible input values (0 to 255) of the colour channel. Each frame is again first converted to RGB before being passed to the image effect RenderScript kernel. After the RenderScript filtering, the decoration drawing callback is used to draw a red, triangular ‘targeting’ reticule on any faces that were detected by the Android face detection API.
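A banded look-up table of this kind can also be built from a small palette rather than a switch statement. The sketch below is one way of doing that in plain Java and is not the app's code: the class name, the palette argument and the 256-entry table size are assumptions made for illustration, while the band width of 8 follows the excerpt above.

```java
final class HeatMapLut {
    static final int MAX_COLOUR = 256; // assumption: one entry per 8-bit input value
    static final int SPLIT = 8;        // entries per colour band, as in the excerpt

    final int[] red = new int[MAX_COLOUR];
    final int[] green = new int[MAX_COLOUR];
    final int[] blue = new int[MAX_COLOUR];

    /** bandColours[n] = {r, g, b} for band n; the palette itself is illustrative. */
    HeatMapLut(int[][] bandColours) {
        for (int i = 0; i < MAX_COLOUR; i++) {
            // Inputs past the end of the palette reuse its last colour.
            int band = Math.min(i / SPLIT, bandColours.length - 1);
            red[i] = bandColours[band][0];
            green[i] = bandColours[band][1];
            blue[i] = bandColours[band][2];
        }
    }
}
```

Inputs 0 to 7 then map to the first palette colour, 8 to 15 to the second, and so on, so each table can be indexed directly with a channel value.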


    //Basic skin detection.
    //These values specifically filter out skin colours.
    if(in->r > in->g+10 && in->r > in->b+5 && in->g < 120) {
        //If skin has been detected, apply the 'hotter' colours.
        out->r = in->r & 40;
        out->g = in->g & 40;
        out->b = 24;
        out->a = 0xff;
    } else {
        //Use the external lookup allocations to determine the colour.
        out->r = *(uchar*)rsGetElementAt(redLUT, in->r);
        out->g = *(uchar*)rsGetElementAt(greenLUT, in->g);
        out->b = *(uchar*)rsGetElementAt(blueLUT, in->b);
    }

“Get to the Chopper” RenderScript Kernel root function


The RenderScript script for this effect is very simple. For each pixel, it first checks if the RGB values fall within a range considered a ‘skin tone’. If so, the output is forced to the ‘hot’ output colours. Otherwise, the output values for the pixel are set directly from the pre-configured look-up tables for each channel.


Which movie inspired “Get to the Chopper”? As a hint, it features the same actor as “I’ll be back”.


That concludes this second Movie Vision App blog. Read on for the most complex image effects of the Movie Vision App and some concluding comments in The Movie Vision App: Part 3!








The Applied Systems Engineering (ASE) team at ARM is responsible for both implementing and presenting technical demonstrations that show ARM® technology in context. These demonstrations make their way to trade shows and other events and are also often discussed in online features such as this series of blogs. Here we introduce our Movie Vision App which explores the use of GPU Compute.



The Movie Vision Concept


The Movie Vision app was conceived as a demonstration to highlight emerging heterogeneous computing capabilities in mobile devices. Specifically, it makes use of Android’s RenderScript computation framework. There is a great deal of discussion about GPU Compute’s capabilities and a variety of exciting new use-cases have been highlighted. On the ARM Connected Community, you can find discussions on the advantages and benefits of GPU Compute, the methods available and details on the compute architecture provided by ARM® Mali™ GPUs. The objective of the Movie Vision application is to provide a visually compelling demonstration that leverages the capabilities of ARM technology and GPU Compute, to explore the RenderScript API and to provide an example application for the Mali Developer community.


Movie Vision takes the preview feed from an Android device’s camera, applies a visual effect to that feed and displays it on the device screen. A number of visual effect filters have been implemented, each modeled on various special effects seen in popular movies over the years. Frame-by-frame, this breaks down to applying one or more mathematical operations to a large array of data – a task well suited to the kind of massive parallelism that GPU Compute provides.


In this series of blogs, we’ll go through each of the visual effect filters we implemented and use these to explore and understand the capabilities of the RenderScript API.





RenderScript is well described on the Android Developers website, an excellent place to start for details and instructions on its use. To summarize, from a developer’s standpoint, in using RenderScript you will be writing your high-performance ‘kernel’ in C99 syntax and then utilizing a Java API to manage the use of this by your application, and to manage the data going into and out of it. These kernels are parallel functions executed on every element of the data you pass into the script. Under the hood, RenderScript code is first compiled to intermediate byte-code, and then further compiled at runtime to machine code by a sequence of Low Level Virtual Machine (LLVM) compilers. This is not conceptually dissimilar to the standard Android or Java VM model, but obviously more specialized. The final machine code is generated by an LLVM on the device, and optimized for that device. On the Google Nexus 10 used for development of the Movie Vision application, RenderScript would thus make use of either the dual core ARM Cortex®-A15 CPU or the GPU Compute capabilities of the Mali-T604 GPU.



Movie Vision Application Structure


The Movie Vision app has the following structure:




Main Android activity class
- UI layout/functionality
- Setup of camera preview
- Setup of face detection

ImageFilterOverlayView
- Allows rendering of icons & text decorations on top of filtered camera preview image

Image Filters (Java)
- Set-up for sharing data with RenderScript kernels
- Callback for each camera preview frame
- Callback for rendering decorations

Image Filters (RenderScript)
- Application of image filter operations to image data



The functionality of the app is fairly simple. The Android Camera API provides a hook to receive a camera preview callback, each such call delivering a single frame from the camera. The main Movie Vision Activity receives this and passes the frame data to the currently selected image filter. After the frame has been processed, the resulting image is rendered to the screen. A further callback to the selected filter allows decorations such as text or icons to be rendered on top of the image. The Android AsyncTask class is used to decouple image processing from the main UI thread.


Each Image Filter shares some relatively common functionality. All of them perform a conversion of the camera preview data from YUV to RGB. The data from the camera is in YUV, but the filter algorithms and output for Movie Vision require RGB values. The Android 4.2 releases included updates to RenderScript which added an “Intrinsic” script to perform this operation. A RenderScript Intrinsic is an efficient implementation of a common operation. These include Blends, Blurs, Convolutions and other operations – including this YUV to RGB conversion. More information can be found on the Android Developer Website. Each Image Filter Java class also configures its ‘effect’ script. Configuration generally consists of allocating some shared data arrays (using the Allocation object) for input and output and allocating or setting any other values required by the script.


// RenderScript Setup
mTermScript = new ScriptC_IllBeBack(mRS, res, R.raw.IllBeBack);

Type.Builder tb = new Type.Builder(mRS, Element.RGBA_8888(mRS));

mScriptOutAlloc = Allocation.createTyped(rs, tb.create());
mScriptInAlloc = Allocation.createSized(rs, Element.U8(mRS), (mHeight * mWidth) + ((mHeight / 2) * (mWidth / 2) * 2));

Initial set up of a RenderScript kernel
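The size passed to the input allocation above is the byte count of a YUV 4:2:0 frame: one luma byte per pixel, plus one U and one V byte for every 2x2 block of pixels. The sketch below reproduces that arithmetic and adds a per-pixel conversion for illustration. The class and method names are hypothetical, and the conversion uses the common full-range BT.601 (JFIF) coefficients, which is an assumption; the RenderScript intrinsic may well use different ones.

```java
final class YuvSketch {
    /** Bytes in a YUV 4:2:0 frame: full-resolution Y plus quarter-resolution U and V. */
    static int yuv420Size(int width, int height) {
        return (height * width) + ((height / 2) * (width / 2) * 2);
    }

    /** Converts one full-range YUV sample to packed 0xRRGGBB (BT.601 / JFIF coefficients). */
    static int yuvToRgb(int y, int u, int v) {
        int r = clamp(Math.round(y + 1.402f * (v - 128)));
        int g = clamp(Math.round(y - 0.344136f * (u - 128) - 0.714136f * (v - 128)));
        int b = clamp(Math.round(y + 1.772f * (u - 128)));
        return (r << 16) | (g << 8) | b;
    }

    static int clamp(int c) {
        return Math.max(0, Math.min(255, c));
    }
}
```

For a 1280x720 preview frame this gives 921,600 luma bytes plus 460,800 chroma bytes, 1,382,400 bytes in total, compared with 3,686,400 bytes for the same frame in RGBA_8888.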



Movie Vision Filters: “I’ll Be Back”



This filter applies a famous movie effect with a red tint and an active Heads-Up Display that highlights faces. It is probably the simplest Movie Vision effect in terms of the RenderScript component. However, a desired additional feature of this filter highlights some challenges.


/*
 * Processes the current frame. The filter first converts the YUV data
 * to RGB via the conversion script. Then the Renderscript kernel is run, which applies a
 * red hue to the image. Finally the filter looks for faces and objects, and on finding one
 * draws a bounding box around it.
 * @param data The raw YUV data.
 * @param bmp Where the result of the ImageFilter is stored.
 * @param lastMovedTimestamp Last time the device moved.
 */
public void processFrame(byte[] data, Bitmap bmp, long lastMovedTimestamp){
    mConv.forEach_root(mScriptInAlloc, mScriptOutAlloc);
    mTermScript.forEach_root(mTermOutAlloc, mTermOutAlloc);
    //... (the remainder of the method is omitted in this excerpt)
}

“I’ll Be Back” processing each camera frame


The Java component of this filter is relatively straightforward. The YUV/RGB conversion and image effect RenderScript kernels are configured. For each camera preview frame, we convert to RGB and pass the image to the effect kernel. After that, in a second pass on the frame, we render our HUD images and text if any faces have been detected. This draws a box around the face and prints out some text to give the impression that the face is being analyzed.


/*
 * Function: root
 * param uchar4 *in     The current RGB pixel of the input allocation.
 * param uchar4 *out    The current RGB pixel of the output allocation.
 */
void root(const uchar4 *in, uchar4 *out){
  uchar4 p = *in;
  //Extracting the red channel, ignoring the green and blue channels. Creates the red 'HUE' effect.
  out->r = p.r & 0xff;
  out->g = p.g & 0x00;
  out->b = p.b & 0x00;
  out->a = 0xff;
}

“I’ll Be Back” RenderScript Kernel root function


The RenderScript component is very simple. The output green and blue channels are zeroed, so that just the red channel is visible in the final image.


Initially, an attempt was made to add pseudo object detection to this effect, such that ‘objects’ as well as faces would be highlighted by the HUD. A prototype using the OpenCV library was implemented, using an included library implementation of an algorithm for Contour Detection. It is worth noting that this approach would not utilise GPU Compute and would run only on the CPU. Contour Detection is a relatively complex multi-stage computer vision algorithm. First, a Sobel Edge Detection filter is applied to bring out the edges of the image. Then, a set of steps to identify joined edges (contours) is applied. The prototype object detection then interpreted this to find contours in the image that were likely to be objects. Generally, large, rectangular regions were chosen. One of these would be selected and highlighted with the HUD decorations as an ‘object of interest’.
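For readers unfamiliar with the first stage, the following is a minimal plain-Java sketch of a Sobel filter on a greyscale image. It is not the OpenCV implementation used in the prototype; the class name and the choice to clamp the gradient magnitude to 255 and leave border pixels at zero are assumptions made for illustration.

```java
final class SobelSketch {
    /** Returns the gradient magnitude (clamped to 0..255) for each interior pixel. */
    static int[][] sobel(int[][] grey) {
        int h = grey.length, w = grey[0].length;
        int[][] out = new int[h][w];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                // Horizontal gradient: right-hand column minus left-hand column.
                int gx = -grey[y - 1][x - 1] + grey[y - 1][x + 1]
                       - 2 * grey[y][x - 1] + 2 * grey[y][x + 1]
                       - grey[y + 1][x - 1] + grey[y + 1][x + 1];
                // Vertical gradient: bottom row minus top row.
                int gy = -grey[y - 1][x - 1] - 2 * grey[y - 1][x] - grey[y - 1][x + 1]
                       + grey[y + 1][x - 1] + 2 * grey[y + 1][x] + grey[y + 1][x + 1];
                out[y][x] = Math.min(255, (int) Math.hypot(gx, gy));
            }
        }
        return out;
    }
}
```

Each output pixel depends only on its 3x3 neighbourhood in the input, which is why this stage on its own maps well to a per-pixel kernel; it is the later contour-joining steps that need multiple passes and intermediate data.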


The issue with this object detection prototype was that it required several passes of algorithmic steps, with intermediate data sets. Porting this to RenderScript to take advantage of GPU Compute would have resulted in several chained together RenderScript scripts.  At the time of initial development this resulted in some inherent inefficiencies, although the addition of ‘Script Groups’ in Android 4.2 will have helped to address this. For now, porting the Contour Detection algorithm to RenderScript remains an outstanding project.


That concludes the first blog in this series on the Movie Vision App. Carry on reading with The Movie Vision App: Part 2 for examples of increasingly complex visual effect filters. Can you guess the movie that inspired the “I’ll be back” filter?





Mali Graphics Debugger v1.3.2 now supports the most recent version of Android, 5.0 “Lollipop”.

Mali Graphics Debugger is an advanced API tracer tool for OpenGL ES, EGL and OpenCL. It allows developers to trace their graphics and compute applications to debug issues and analyze performance.

It also supports Linux targets and runs on Windows, Linux and Mac.

Get Mali Graphics Debugger

Mali Graphics Debugger v1.3.2

Gaming has always been one of the top applications for GPUs, not to mention that it accounts for the majority of app revenues on smart mobile devices. That is why ARM is running the Game Developer Days in major cities this year. With China being one of the top 3 mobile game markets, hosting the ARM Game Developer Day in China is a no-brainer. Partnering with Chengdu Mobile Internet Society, we hosted our very first ARM China Game Developer Day in Chengdu, the provincial capital of Sichuan province and the biggest city in Western China. The location of the event was also strategic. It was held in a coffee shop inside the Tianfu Software Park, one of the top high-tech industrial parks in China, making it convenient for the local developers to join.



In addition to in-depth technical topics covering the latest Mali technology and ARMv8 architecture, we also invited speakers from the 3 top game engines (Cocos, Unity3d and Unreal) to share how their game engines are best optimized for the ARM platform. Here is the agenda for our ARM Developer Day in Chengdu.











Win On ARM - The challenges and trends of the next generation Mobile Games
Leon Zhang/章立, APAC Ecosystem Marketing Manager, ARM

ARM Mali architecture overview and ARM OpenGL ES extensions
Joel Liang/梁宇洲, Senior Ecosystem Engineer, ARM

ARM CPU & Mali GPU Synergy Development, Deep Optimization using ARM tools, with live Cocos2d-x, Unity, Unreal demos on best practices
Nathan Li/李陈鲁, Staff Engineer, Tech Leader, ARM

Unity5 and Enlighten Realtime Global Illumination Technology
Zhenpin Guo/郭振平, Technology Director APAC, Unity

Lunch and Q&A

Benefits of multithread and big.LITTLE, NEON and overview of ARMv8 benefits for game developers
Alan Chuang/莊智鑫, Ecosystem Marketing Manager, ARM

Unreal Engine4 and Optimization experiences on Mali GPU
Jack Porter, Engine Development and Support Lead, Epic Games

Tea Break/Demo

Cocos, not only an engine - enabling your games on ARM 64bit and accelerating the game development
Kenan Liu/刘克男, Cocos Technical Marketing Manager, Chukong Technology

What's new in OpenGL ES 3.1 - ASTC Full Profile (HDR & 3D Textures) and Compute Shaders
Frank Lei/雷宇, Senior DevRel Engineer, ARM

Q&A and Lucky Draw




Leon Zhang - ARM (Win On ARM - The challenges and trends of the next generation Mobile Games)

Joel Liang - ARM (ARM Mali Architecture Overview and ARM OpenGL ES extensions)

Nathan Li - ARM (Performance Analysis with Mali Tools)

Zhenpin Guo - Unity (Unity5 and Enlighten Realtime Global Illumination Technology)

Alan Chuang - ARM (Benefits of Multithread and big.LITTLE, NEON and Overview of ARMv8 architecture)

Jack Porter - Epic Games (Unreal Engine4 and Optimization experiences on Mali GPU)

Kenan Liu - Chukong (Cocos - not only an engine)

Frank Lei - ARM (What's New in OpenGL ES 3.1 - ASTC Full Profile and Compute Shaders)


The turnout was surprisingly good. We had 120 people registered for the event, but a total of 131 people showed up. Many of them were from well-known game companies here in China, and several partners even flew in from other cities to join the event. There were good Q&As during the presentations and even more interaction at the breaks. Not only did our participants enjoy the technical presentations, but our guest speakers also enjoyed chatting with the developers.





For a recap of the event, check out the wrap up summary from our local partner here.

For more events like this and other technical information, check out our Mali Developer Center.

Today, the Media Processing team at ARM is delighted to announce the launch of five new products: the ARM® Mali-T860, Mali-T830 and Mali-T820 graphics processors, the Mali-V550 video processor and the Mali-DP550 display processor.


The changing market


We’ve discussed the opportunities emerging in the growing mainstream market previously in these blogs. With well over 1Bn consumers already, each of whom has different requirements in terms of price, performance and feature-set, our partners need a choice of semiconductor IP which enables them to address the diversity of demands within this high-volume segment. ARM has long understood the fact that one size and one feature set does not address the needs of every market segment or best serve the needs of partners who are all looking to quickly differentiate their products to gain a competitive advantage. With this in mind, we aim to deliver a scalable roadmap of core IP, bringing our partners choice as well as enabling them to accelerate their time to market and freeing their engineers to bring more innovation and diversity to this accelerating market.


At the same time as this diversification in device type is taking place, ARM and its partners are also seeing important trends in mobile content consumption that need to be taken into account when designing the next generation of semiconductor IP. Jakub Lamik, our Director of Product Marketing, discusses some of the important trends such as increased pixel density, increased screen resolutions and increasingly complex content in his blog last week and explains the inter-core technology which ARM offers that helps our partners deal with the increasing strain this content applies to mobile devices.


When you take both of these aspects into consideration, there is a range of challenges which our partners face in producing successful end devices. Central to all of them is the need to offer a range of price and performance points in an energy-efficient fashion in order to enable the latest content across the entire breadth of the market.


ARM’s new suite of integrated Mali IP


Since the development of the first mobile phone, ARM has worked with our partners to develop technology that continually extends the capabilities of the mobile within its fixed power budget. Today we are launching five new products which address the diverse media needs of the mainstream market. The suite offers options for cost efficiency, performance efficiency and the ability to get to market faster, all combined with innate energy efficiencies provided by the ability to allocate tasks across the system to the most appropriate processor, be that CPU, GPU, video or display.



Introducing the Mali-T860 GPU


The Mali-T860 scales across sixteen cores to offer the best performance for the lowest energy consumption of any Midgard GPU. Building on the technical advances of our previous generations of GPUs, it offers a 45% improvement in energy efficiency compared to the Mali-T628 in the same configuration and process node. With micro-architectural enhancements such as quad prioritization and improved early Z test throughput, performance is improved across both casual and advanced gaming content. It is the perfect GPU for an end device targeted at the most demanding consumers who want a great visual experience at an affordable price point.




Because the key focus of the Mali-T860 is on performance efficiency, it delivers this extra performance within an impressively small energy budget by incorporating support for a range of bandwidth-reducing technologies, including ARM Frame Buffer Compression, Smart Composition and Transaction Elimination. Native hardware support for 10-bit YUV has also been added to make this GPU an ideal accompaniment to the Mali-V550 video processor and Mali-DP550 display processor, so that users can experience the best visual quality when watching content in a DTV and STB market that is increasingly moving to 4K. 10-bit YUV is available across the entire media suite released today, whether as native hardware support, as in the Mali-T860, or as a configuration option, as in the Mali-T820.


For more information on the Mali-T860 GPU, visit its product page.


Introducing the Mali-T830 and Mali-T820 GPUs


Entering the cost efficient roadmap are the Mali-T820 and Mali-T830. These two GPUs are an evolution of the Mali-T720, recently announced as the GPU in the MediaTek MT6735. Having been developed alongside the Mali-T860, they have also inherited some important features from this performance efficient GPU, such as quad prioritization. As a result, they offer not only area and energy efficiencies compared to previous generations, but performance advancements as well.

The Mali-T820 is optimized for entry-level products and achieves up to 40% higher performance density compared to the Mali-T622. The Mali-T830, by comparison, balances area, performance and energy efficiency to deliver maximal performance from a minimal silicon area. It has an additional arithmetic pipeline compared to the Mali-T820 and offers up to 55% more performance than the Mali-T622 GPU in the same configuration and process node. It is ideal for bringing more advanced 3D gaming and arithmetically complex use cases to consumers of mainstream smartphones, tablets and DTVs.




Together, the Mali-T820 and Mali-T830 introduce ARM Frame Buffer Compression (AFBC) to the cost efficient roadmap for the first time. This will ensure that the system-wide bandwidth savings made possible by AFBC (up to 50%) will appear in more affordable devices over the next couple of years, enabling these to deliver high quality multimedia experiences to consumers for longer.
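To give a feel for what a saving of that size could mean, here is a rough back-of-the-envelope sketch. The frame size, pixel format, frame rate and access count are illustrative assumptions rather than figures from ARM, and applying the quoted system-wide upper bound to frame buffer traffic alone is a simplification.

```python
# Illustrative assumptions (not from the article): a 1920x1080 frame buffer,
# 4 bytes per pixel, 60 frames per second, and each frame written once by the
# GPU and read once by the display.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 4
FPS = 60
ACCESSES_PER_FRAME = 2  # one write + one read (assumed)

# Upper bound on the bandwidth saving quoted in the article for AFBC.
MAX_SAVING_PERCENT = 50


def framebuffer_traffic(width, height, bytes_per_pixel, fps, accesses):
    """Return uncompressed frame buffer traffic in bytes per second."""
    return width * height * bytes_per_pixel * fps * accesses


uncompressed = framebuffer_traffic(
    WIDTH, HEIGHT, BYTES_PER_PIXEL, FPS, ACCESSES_PER_FRAME
)
# Best case only: the article says "up to" 50%, so real savings may be lower.
best_case = uncompressed * (100 - MAX_SAVING_PERCENT) // 100

print(f"Uncompressed traffic: {uncompressed / 1e6:.1f} MB/s")
print(f"Best case with AFBC:  {best_case / 1e6:.1f} MB/s")
```

Actual savings depend on the content and on which blocks in the system read and write compressed buffers, so treat the result as an upper bound under these assumptions.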


For more information on the Mali-T820 and Mali-T830 GPUs, visit their product pages.


Introducing the Mali-V550 video processor


The Mali-V550 is ARM’s next generation, low-bandwidth, multi-core, multi-codec video encode and decode IP. It is the IP industry’s first single-core video encode and decode solution for HEVC; the combination of encode and decode functionality on a single core and its ability to maximize re-use across multiple codecs ensure that the Mali-V550 maintains its strong area efficiency leadership.


The Mali-V550 is a multi-core solution out of the box, scalable to 4K resolution at 120fps or 1080p at 480fps with an 8-core configuration. The architecture supports multiple video streams across multiple cores as well as simultaneous encode and decode. For example, a Mali-V550 MP4 can decode eighteen 720p30 streams in parallel, or handle any combination of encode and decode. These streams may use different coding standards and are time multiplexed on a frame basis.
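The figures above are consistent with each other if you think of throughput in pixels per second. The sketch below checks that arithmetic. It assumes that 4K means 3840x2160, that throughput scales linearly with the number of cores, and that pixel rate is the only limit; none of these assumptions is stated in the announcement.

```python
def pixel_rate(width, height, fps):
    """Pixels per second for a stream of the given size and frame rate."""
    return width * height * fps


# Figures quoted for the 8-core configuration (4K assumed to be 3840x2160).
rate_4k120 = pixel_rate(3840, 2160, 120)
rate_1080p480 = pixel_rate(1920, 1080, 480)

# Both quoted operating points correspond to the same pixel rate.
assert rate_4k120 == rate_1080p480

# Assumption: throughput scales linearly with the number of cores.
per_core_rate = rate_4k120 // 8
mp4_rate = per_core_rate * 4  # Mali-V550 MP4 = four cores

# How many 720p30 streams fit into the MP4 pixel budget?
rate_720p30 = pixel_rate(1280, 720, 30)
streams = mp4_rate // rate_720p30

print(f"Per-core rate: {per_core_rate:,} pixels/s")
print(f"720p30 streams on an MP4: {streams}")  # prints 18
```

Under these assumptions one core handles the equivalent of a single 1080p60 stream, and four cores give exactly the eighteen 720p30 streams mentioned above.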

Motion search elimination, introduced in Jakub’s blog last week, enables the video processor to avoid a large amount of processing, mainly in the motion search engine but sometimes also the entire reconstruction. It is most beneficial in Wi-Fi display scenarios, when encoding and sending static content (such as a user interface or 2D games) to an external display. In such situations, it can lower memory bandwidth by up to 35% and also reduce latency.

While system power, performance and silicon area are all critical for our SoC partners, this cannot come at the expense of visual quality. The Mali-V550 is robust against external memory latency: video processing can continue for over 5000 cycles without external memory access, and the Mali-V550 can hide more than 300 clock cycles of static latency from a slow memory system without dropping a frame. This means that consumers will benefit from smooth playback with no dropped frames when experiencing multimedia on a device with the Mali-V550 video processor. The Mali-V550 also maintains support for AFBC.


For more information on the Mali-V550 video processor, visit its product page.


Introducing the Mali-DP550 display processor


The Mali-DP550 completes the suite of IP launched today and offers efficient media processing right to the glass.


One way of delivering system-wide energy efficiency is to enable each task to be executed on the most appropriate processor. We have talked about this a lot before in these blogs in the case of GPU Compute enabling applications such as computational photography. When a Mali-DP550 is deployed in a mobile media system, it too can offload basic tasks from the GPU or CPU, such as user interface composition, scaling, rotation, post-processing and display control. It does all of this in a single pass, so there is no need to go out to memory, which brings extra bandwidth and power savings.


The principal additional feature of the Mali-DP550 is its co-processor interface which enables partners to easily integrate third party or proprietary display IP with the display processor. As the mainstream market diversifies and grows, delivering the right choice of application processors so that consumers can buy a device without compromise requires the ability to differentiate and deliver products quickly and simply. Display is regularly an important differentiating factor for our partners, and with this co-processor interface our partners can continue to use their proprietary display algorithms while benefiting from the advantages that licensing a highly functional core IP block can bring.


For more information on the Mali-DP550 display processor, visit its product page.


Why choose a media system from ARM?


ARM offers each of the IP blocks above as separate licensable products, but the advantages come when you employ an entirely ARM-based system. ARM partners discover system-wide bandwidth efficiencies, reduced time to market and the ability to focus engineering on critical differentiation. Thanks to our bandwidth saving technologies, the availability of an integrated software stack and system-wide performance analysis tools such as DS-5 Streamline, employing an integrated ARM-based media system is simple and very effective.  And importantly, partners can be reassured of the quality of the new products they license because of the proven verification and validation processes that ARM implements consistently across our entire IP range, from CPUs to display IP.




The ARM Mali media IP products are available for immediate licensing and initial consumer devices are expected to appear in late 2015 and early 2016.
