
ARM Mali Graphics


Chinese version (中文版): 百视通/ARM HTML5技术论坛顺利举行(附资料下载) — "BesTV/ARM HTML5 Technical Forum Held Successfully (materials available for download)"

Dear all, good day. HTML5 is one of the hottest terms in Internet technology today, and I believe most of you have heard of it before. Commercial adoption of HTML5 has seen ups and downs over the years, so much so that it is said every year is "year one" of HTML5. But with the HTML5 specification now finalized, 2015 looks set to be HTML5's hottest year yet. According to statistics, over 50 HTML5-related conferences will be held in China in 2015, and news keeps coming of HTML5 companies launching IPOs and HTML5 startups winning billion-level valuations.



As the global leader in IP technology, ARM would never be absent from the HTML5 boom. Yesterday, the BesTV/ARM HTML5 technical forum was held at the BesTV New Media Institute (Shanghai). The forum was run by BesTV and ARM, and invited upstream and downstream guests from SoC vendors, ODMs/OEMs, ISPs, solution vendors, channels and the developer community to discuss the status and future of HTML5 technology.


Dr. Wen Li, senior VP of BesTV and vice dean of the BesTV New Media Institute, gave the opening speech and a technical talk, "HTML5 will be the main technical choice in the smart TV and home entertainment area". He analyzed the advantages of HTML5 and introduced BesTV's excellent work in a China government-sponsored core technology project and the TVOS project led by China's SARFT.




Matt Spencer, the UI and browser marketing manager from ARM's UK headquarters, gave the talk "HTML5 newest trends", covering the latest progress in global HTML5 technology and specifications, as well as ARM's positive role and strategy.



The top two HTML5 game engines in China then came on stage one after the other. Jianyun Bao, senior engineer in the developer service department of the Cocos2d-x community, the mobile gaming giant in China, gave the talk "Cocos HTML5 solution", introducing their end-to-end support for HTML5 games across the open-source engine, tool chain, runtime, channel access, test/analysis and payment systems. Next, Xinlei Zhang, technical evangelist of the fast-rising HTML5 game engine Egret, presented their equally comprehensive HTML5 game solution spanning open-source engine, tool chain, runtime platform, channel access, test/analysis and payment systems. He also introduced Lark, their newly launched mobile application framework.




Overseas HTML5 game engines were also represented at the forum. Joel Liang, senior ecosystem engineer at ARM, gave the talk "PlayCanvas: a 3D game engine based on WebGL". He showed the advanced WebGL-based GPU algorithms in the PlayCanvas engine, which can give HTML5 games the visual quality of a Hollywood movie. Most of this technology was developed in cooperation between PlayCanvas and ARM.



Finally, Xiao Shen, development director at Thunder Software Technology Co., Ltd (Thundersoft), a leading solution vendor in China, gave the talk "Thundersoft experience in HTML5 development". They have many mature products, reference designs and technology reserves in HTML5 solutions that help SoC vendors, ODMs/OEMs and ISPs do rapid prototype and production development.



In this forum the attendees raised many questions and the discussion was very lively; the event was a great success. Since many people who could not attend are keen to see the content, we are sharing the slides in the enclosed files. Your comments are welcome.

Thanks a lot.

Machine Vision is perhaps one of the few remaining areas in technology that can still lead you to say “I didn’t know computers could do that”.  The recent pace of development has been relentless.  On the one hand you have the industry giants competing to outdo each other on the ImageNet challenge, surpassing human vision recognition capabilities along the way; on the other there is significant, relentless progress in bringing this technology to smart, mobile devices.  May 11 and 12 saw the annual Santa Clara, California gathering of industry experts and leaders to discuss the latest developments at the Embedded Vision Alliance Summit.  Additionally, this year ARM was proud to host a special seminar linked to the main event to discuss developments in computer vision on ARM processor technologies.  In this blog I’m going to provide my perspective on some of the highlights from both events.



The Santa Clara Convention Centre, California.  Host to both the ARM workshop and the EVA Summit


Computer Vision on ARM Seminar, 11 May, Santa Clara, CA


It was my great pleasure to host this event and for those of you who were there I hope you enjoyed the afternoon’s presentations and panel discussion.  The proceedings from the seven partner presentations can all be downloaded from here.  The idea of this event – the first of its kind ARM has held on computer vision – was intended to bring together leaders and experts in computer vision from across the ARM ecosystem.  The brief was to explore the subjects of processor selection, optimisation, balancing workloads across processors, debugging and more, all in the context of developing computer vision applications.  This covered both CPU and NEON™ optimisations, as well as working with Mali™ GPUs.


With a certain degree of cross-over, the seminar program was divided into three broad themes:

Optimising Computer Vision for NEON

  • Dr. Masaki Satoh, a research engineer from Morpho, talked about the benefits and main technical aspects of NEON acceleration, including a focus on specific algorithmic optimisations using NEON SIMD (Single Instruction, Multiple Data) instructions.
  • Wikitude are a leader in Augmented Reality applications on mobile devices and in his talk CTO Martin Lechner highlighted the use of NEON to accelerate this particular computer vision use case.

Real Computer Vision Use Cases on ARM

  • Ken Lee, founder and CEO of Van Gogh Imaging, showcased their work developing real-time 3D object recognition applications using 3D stereoscopic sensors, including optimisation via NEON and their early exploration of further acceleration via Mali GPUs.
  • Gian Marco Iodice, Compute Engineer at ARM, discussed his work on accelerating a real-time dense passive stereo vision algorithm using OpenCL™ on ARM Mali GPUs.
  • Real-time image stabilization running entirely in software was the subject of the presentation by Dr. Piotr Stec, Project Manager at FotoNation.  His analysis covered the complete processing pipeline for this challenging use case and discussed where optimisations were most effective.

Processor selection, Benchmarking and Optimising

  • Jeff Bier, president of BDTI and founder of the Embedded Vision Alliance discussed the important area of processor selection and making intelligent choices when selecting benchmarking metrics for computer vision applications.
  • Tim Hartley (that’s me!) discussed the importance of whole-system measurements when developing computer vision applications and demonstrated profiling techniques that can be applied across heterogeneous CPU and GPU processor combinations.



Jeff Bier from BDTI gets things going with his presentation about processor selection and benchmark criteria

Panel Discussion

In addition to the above presentations Roberto Mijat hosted a panel discussion looking at current and future trends in computer vision on mobile and embedded platforms.  The panel included the following industry experts:


  • Laszlo Kishonti, CEO of Kishonti and of new venture AdasWorks, a company creating software for the heavily computer vision dependent Advanced Driver Assistance Systems (ADAS) market.  Working with ARM, AdasWorks has explored accelerating some of their ADAS-related computer vision algorithms using a combination of ARM CPUs and a Mali GPU.  In this video, recorded at a previous event, Tim Hartley from ARM talks about some of the earlier optimisation work around AdasWorks using the DS-5 Streamline profiler.
  • Michael Tusch, CEO of Apical, a developer of computer vision and image processing IP, covering future algorithm development for imaging along with display control systems and video analytics.  Apical are a long-time collaborator on computational photography and have much experience using the GPU as well as the CPU for image processing acceleration.  In the previously recorded video here Michael talks about Apical's work and their experience using GPU Compute to enable hardware-based graphics acceleration.
  • Tim Droz, GM of SoftKinetic, a developer of 3D sensor and camera modules as well as 3D middleware, covering issues around 3D recognition, time-of-flight systems, camera reference designs for gesture sensing and shared software stacks.  This video, recorded at GDC 2013, shows an example of SoftKinetic’s work with GPGPU on Mali for their gesture-based systems.


It was a fascinating and wide-ranging discussion with some great audience questions. Roberto asked the panelists what had stood out for them with computer vision developments to date.  Laszlo talked about the increasing importance of intelligence embedded in small chips within cameras themselves.  Michael Tusch echoed this, highlighting the problem of high quality video in IP cameras causing saturation over networks.  Having analysis embedded within the cameras and then only uploading selective portions, or even metadata describing the scene, would mitigate this significantly.  Tim Droz stressed the importance of the industry moving away from the pixel count race and concentrating instead on sensor quality.


Roberto then asked about the panelists’ views on the most compelling future trends in the industry.  Michael Tusch discussed the importance, in the smart homes and businesses of the future, of being able to distinguish and identify multiple people within a scene, in different poses and sizes, and to determine the trajectories of objects.  This will need flexible vision-processing abstractions with the aim of understanding the target you are trying to identify: you cannot assume one size or algorithm will fit all cases.  Michael foresees, just as GPUs do for graphics, the advent of engines capable of enabling this flexible level of abstraction for computer vision applications.


Laszlo Kishonti talked about future health-care automation, including sensor integration in hospitals and the home; about how artificial intelligence in computer vision for security is going to become more important; and about how vision is going to enable the future of autonomous driving.  Laszlo also described the need for what he sees as the third generation of computer vision algorithms.  These will require levels of sophistication that can, for example, differentiate between a small child walking safely hand-in-hand with an adult and one at risk of running out into the road.  This kind of complex mix of recognition and semantic scene analysis was, said Laszlo, vital before fully autonomous vehicles can be realized.  It brought home to me both the importance of ongoing research in this area and how much further computer vision still has to develop as a technology.


Tim Droz talked about the development of new vector processors flexible enough for a variety of inputs, about HDR (high dynamic range, combining multiple images from different exposures) becoming ubiquitous, and about low-level OpenCL implementations in RTL.  He also talked about plenoptic light-field cameras, which allow re-focusing after an image is taken, becoming much smaller and more efficient in the future.


The panel ended with a lively set of questions from the audience, wrapping up a fascinating discussion.



Gian Marco Iodice talks about accelerating a real-time dense passive stereo vision algorithm

Overall it was a real pleasure to see so many attendees so engaged with the afternoon and we are grateful to all of you who joined us on the day.  Thanks also to all our partners and panellists whose efforts led to a fascinating set of presentations and discussions.

The presentations from the seminar can be downloaded here: EVA Summit 2015 and ARM’s Computer Vision Seminar - Mali Developer Center


Embedded Vision Summit, 12 May, Santa Clara, CA

The annual Embedded Vision Summit is the industry event hosted by the Embedded Vision Alliance, a collection of around 50 companies working in the computer vision field.  Compared to the 2014 event, this year the Summit grew by over 70%, a real reflection of the growing momentum and importance of embedded vision across all industries.  Over 700 attendees had access to 26 presentations on a wide range of computer vision subjects arranged into six conference tracks.  The exhibition area showcased the latest work from 34 companies.


See below for links to more information about the proceedings and for downloading the presentations.


Dr. Ren Wu, Distinguished Scientist from Baidu delivered the first of two keynotes, exploring what is probably the hottest topic of the hour: visual intelligence through deep learning.  Dr. Wu has pioneered work in this area, from training supercomputers through to deployment on mobile and Internet of Things devices.  And for robot vacuum cleaner fans – and that’s all of you surely – the afternoon keynote was from Dr. Mike Aldred from Dyson who talked about the development of their 360° vision (and ARM!) enabled device which had earlier entertained everyone as it trundled around the exhibition area, clearing crumbs thrown at it by grown men and women during lunch.



ARM showcased two new partner demos at the Summit exhibition: SLAMBench acceleration on Mali GPU by the PAMELA consortium and video image stabilization in software with Mali acceleration by FotoNation

The six conference tracks covered a wide range of subject areas.  Following on from Ren Wu’s keynote, Deep Learning and CNNs (Convolutional Neural Networks) made a notable mark with its own track this year.  And there were tracks covering vision libraries, vision algorithm development, 3D vision, business and markets, and processor selection.  In this final track, Roberto Mijat followed on from ARM’s previous day’s seminar with an examination of the role of GPUs in accelerating vision applications.



Roberto Mijat discusses the role of the integrated GPU in mobile computer vision applications

A list of all the speakers at this year's Summit can be found here: 2015 Embedded Vision Summit Speakers

All the papers from the event can be downloaded here (registration required): 2015 Embedded Vision Summit Replay

Hey everybody,


ARM have arrived and the talks have started! We're eagerly awaiting the Game Jam kicking off at 8pm CEST. See all the details on the Mali Developer Center



ARM at Shayla Games

We have with us a load of Samsung GearVR kits for developers to work with and optimise for mobile, along with our help and advice, available all weekend! The GearVR kits work with the Samsung Galaxy Note 4, which has the latest Mali-T760 GPU inside. Full spec below:


Samsung Galaxy Note 4

Samsung Exynos 7 Octa

ARM Mali-T760 GPU (MP6)

ARM big.LITTLE™ processing technology

ARM Cortex®-A53 CPU (MP4)

ARM Cortex-A57 CPU (MP4)


We're excited to see all the new innovations in the VR space and what the developers can put together in just a weekend.


You can see the full agenda for the event at the Shayla Games website



Have you ever heard about the Taoyuan effect in reflections? Probably not, so I invite you to read this blog to learn about this effect and the application of local cubemaps in graphics.


Several blogs published in this community cover different uses of local cubemaps. The first was about reflections, followed by two more blogs that describe novel uses of local cubemaps to render shadows and refractions with great quality and performance.


These new techniques, developed entirely by the ecosystem demo team in ARM’s Media Processing Group, are especially relevant to developing for mobile devices, where runtime resources must be carefully balanced. They offer not only great performance but also high-quality rendering, which makes them appealing to desktop and console developers as well. Fetching a texture from a static cubemap guarantees very precise and stable shadows, reflections and refractions, compared with the pixel instabilities we would get when rendering these effects at runtime with a moving camera using conventional techniques.


Thanks to these different uses of the local cubemap techniques, we can achieve very high quality graphics with existing hardware.


Unity Unite events in Asia


Recently Sylwester Bala and I attended Unity Unite events in Seoul, Beijing and Taipei, where we presented the talk “Enhancing Your Unity Mobile Games”. I delivered the talk in Seoul, Nathan Li in Beijing and Sylwester in Taipei. In Seoul and Beijing these were joint talks with Carl Callewaert from Unity (see here).


In the talk we introduced the concept of local cubemap and how it can be used to render reflections, and we expanded this technique for rendering shadows in an innovative way. During the talk we used several short videos to illustrate the concept and the advantages of these techniques, which helped to deliver a clear and understandable message to the audience. A couple of videos were particularly useful to show how our new shadows technique can render dynamic and soft shadows.


In the final part of the talk Carl gave a live demo of some of the most important improvements in Unity 5: Global Illumination, Physically-Based Shading and Reflection Probes. Unity’s implementation of Reflection Probes is based on the local cubemap rendering technique. Unity developers now have access to a simple but powerful technique to add reflections to their games in a further optimized way, which is particularly relevant for mobile platforms. Carl also mentioned that Unity might consider implementing the new shadows technique in the engine, as it is closely tied to the Reflection Probe feature.

In all three cities the talks were well received and well attended, reflecting the interest in these techniques. Carl showed a live Unity demo with a great example of how reflections based on local cubemaps work. His example clearly demonstrated the advantages of this technique and how easily it can be used in the Unity engine.



Top left: Roberto Lopez Mendez at Seoul. Top right: Nathan Li at Beijing.

Bottom Left: Carl Callewaert at Beijing. Bottom right: Sylwester Bala at Taipei.



Reflections and shadows in games


During a long night walk in Seoul, Sylwester Bala and I talked about how reflections and shadows can contribute to improving the visual experience of games. We walked along Yeongdong Avenue, past high buildings full of neon and glass; reflections and shadows were on practically every surface. Modern cities are full of polished, highly reflective glass and metal surfaces, virtually everywhere.


Therefore, when rendering this kind of environment in games, we need to consider the fact that the local cubemap technique might offer a very efficient way of rendering reflections and shadows in combination with other rendering techniques.


Figure 1. Reflections on the facade of the Coex building at Seoul.



It is well known that reflections and shadows are key topics in any game; without them, any virtual world looks flat and unrealistic. Take reflections, for example. Reflections change every time the camera updates its position and orientation. We could prebake the reflections from the static environment into a texture, but the final effect would be disappointing. An effective way of rendering high-quality, optimized reflections is to use local cubemaps. With this technique we prebake the environment into static cubemaps, and later at runtime we render the reflections by fetching the texture from the cubemaps using the local corrected view vector, then combining the contributions of the local cubemaps for a given position.
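The local correction step can be sketched in plain C (a simplified, hypothetical helper; in a real engine this runs in the fragment shader, and the box and cubemap positions here are illustrative):

```c
#include <math.h>

typedef struct { float x, y, z; } Vec3;

/* Distance along direction d, from a point p inside the AABB, to the
 * box boundary. A zero direction component yields +inf, which the
 * min() below discards. */
static float dist_to_box(Vec3 p, Vec3 d, Vec3 bmin, Vec3 bmax) {
    float tx = ((d.x >= 0.0f ? bmax.x : bmin.x) - p.x) / d.x;
    float ty = ((d.y >= 0.0f ? bmax.y : bmin.y) - p.y) / d.y;
    float tz = ((d.z >= 0.0f ? bmax.z : bmin.z) - p.z) / d.z;
    float t = tx < ty ? tx : ty;
    return t < tz ? t : tz;
}

/* Local corrected view vector: intersect the reflection ray with the
 * scene bounding box, then re-aim the lookup from the cubemap origin
 * towards that intersection point. */
Vec3 local_corrected(Vec3 frag_pos, Vec3 refl_dir,
                     Vec3 bmin, Vec3 bmax, Vec3 cubemap_pos) {
    float t = dist_to_box(frag_pos, refl_dir, bmin, bmax);
    Vec3 v = {
        frag_pos.x + t * refl_dir.x - cubemap_pos.x,
        frag_pos.y + t * refl_dir.y - cubemap_pos.y,
        frag_pos.z + t * refl_dir.z - cubemap_pos.z
    };
    return v;  /* use this vector to sample the static cubemap */
}
```

The corrected vector is then used in place of the raw reflection vector when sampling the prebaked cubemap.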



And what do we do with dynamic geometry? We obviously can’t prebake reflections from dynamic geometry. In this case, we can render reflections/shadows from dynamic geometry at runtime using traditional techniques and combine them with the reflections/shadows rendered using the local cubemap technique.



Unfortunately, the use of local cubemaps for reflections is not yet widely adopted, despite having been available for 15 years. With the implementation of the Reflection Probe in Unity 5, the technique of reflections based on local cubemaps is now becoming available to more than half of developers worldwide, which is pretty good. Providing support for our shadows technique based on local cubemaps in the Unity engine would be as simple as rendering the transparency of the environment into the alpha channel of the same cubemap used for reflections.



The Taoyuan effect



At the very end of our journey, heading to the flight gate at the Taiwan Taoyuan International Airport in Taipei, Sylwester and I were walking through a long corridor with a very polished, reflective floor. On one side, the wall was projecting a message in Chinese characters, which was perfectly reflected on the floor. The picture drew my attention and I pointed out to Sylwester that it was a clear use case for reflections based on local cubemaps.



Nevertheless Sylwester, who always pays extra attention to detail, pointed out that further away from the wall the reflections became more blurred. We looked at each other: it was clear that this effect could be implemented for reflections in the same way we had implemented soft shadows.


Figure 2. Reflections on the corridor at the Taiwan Taoyuan International Airport.

In the shadows technique, we implemented the blurred effect by fetching the texture from an interpolated cubemap mipmap level. The magnitude passed to the fetching function is proportional to the distance from the fragment to the intersection point of the fragment-to-light vector with the scene-bounding box (a more detailed explanation of the shadows technique is in this blog).
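That distance-to-mipmap mapping can be sketched as a small helper (the names and the clamped linear ramp are assumptions; the actual demo may shape the response differently):

```c
/* Map the fragment-to-intersection distance to a cubemap mipmap
 * level: nearby occluders give sharp shadows (low LOD), distant
 * ones give blurred shadows (high LOD). */
float shadow_lod(float dist, float max_dist, float max_lod) {
    float lod = (dist / max_dist) * max_lod;
    if (lod < 0.0f) return 0.0f;
    return lod > max_lod ? max_lod : lod;
}
```

The returned (generally fractional) level is passed to the texture fetch so the hardware interpolates between adjacent mipmap levels, producing a smooth blur gradient.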


As with reflections based on local cubemaps, our technique offers very clear advantages in quality and resource savings when rendering shadows. Regardless of its limitations (mainly derived from its static nature), I would like to encourage developers to try the technique and explore it beyond the use case we presented in our talk. You never know what a technique can do for you until you start to use it and push it to new extremes. If you are skeptical about that, please continue reading.


The effect was caused by the Chinese symbols being carved into the wall, with the light source behind them, such that light scattered at the borders of the symbols produced a soft pattern of light beams.



I decided to implement this effect in the same demo we used to show reflections and shadows: the chess room demo. The results are below. Fig. 3 shows the reflections on the chessboard using the standard technique of local cubemap. Fig. 4 shows a blurred reflection based on the distance from the fragment to the intersection point of the reflection vector with the scene-bounding box. The further the reflection is from the real object, the more blurred it is rendered.



  Figure 3. Standard reflections on the chess board based on local cubemap.



Figure 4. Reflections on the chessboard based on local cubemap, using an interpolated mipmap level based on the distance from the fragment to the intersection point of the reflection vector with the scene bounding box.




This example shows how powerful and flexible the technique of local cubemaps is. It allows rendering of not only soft shadows but also the reflections coming from soft light beams.





What other new applications of local cubemaps will come? As local cubemaps become more popular, developers will find new applications for this technique; it is just a matter of time. But one thing is certain: this simple technique has proved to be a very effective and powerful way to render reflections, shadows and refractions.

My last blog looked at some of the critical areas an application has to implement efficiently to get the best performance out of 3D content, such as broad-brush culling of large sections of a scene that are guaranteed not to be visible, so they are not sent to the GPU at all. In one of the follow-on comments to that blog, seanlumly01 asked "Is there a performance penalty for an application modifying textures between draw calls?". It is a really good question, but the answer is non-trivial, so I deferred to this blog post to answer it fully.


Pipelined Rendering


The most important thing to remember when it comes to resource management is the fact that OpenGL ES implementations are nearly all heavily pipelined.  This is discussed in more detail in this earlier blog, but in summary ...


When you call glDraw...() to draw something the draw does not happen instantly, instead the command which tells the GPU how to perform that draw is added to a queue of operations to be performed at some point in future. Similarly, eglSwapBuffers() does not actually swap the front and back buffer of the screen, but really just tells the graphics stack that the application has finished composing a frame of rendering and queues that frame for rendering. In both cases the logical specification of the behaviour  - the API calls - and the actual processing of the work on the GPU are decoupled by a buffering process which can be tens of milliseconds in length.


Resource Dependencies


For the most part, OpenGL ES defines a synchronous programming model. Apart from a few explicit exceptions, when you make a draw call rendering must appear to have happened at the point that the draw call was made, with pixels on screen correctly reflecting the state of any command flags, textures, or buffers at that point in time (based either on API function calls or previously specified GPU commands). This appearance of synchronous rendering is an elaborate illusion maintained by the driver stack underneath the API, which works well but does place some constraints on the application behavior if you want to achieve the best performance and lowest CPU overheads.


Due to the pipelining process outlined earlier, enforcing this illusion of synchronicity means that a pending draw call which reads a texture or buffer effectively places a modification lock on that resource until that draw operation has actually completed rendering on the GPU.



For example, if we had a code sequence:


glBindTexture(1)       // Bind texture 1, version 1
glDrawElements(...)    // Draw reading texture 1, version 1
glTexSubImage2D(...)   // Modify texture 1, so it becomes version 2
glDrawElements(...)    // Draw reading the texture 1, version 2


... then we cannot allow the glTexSubImage2D() to modify the texture memory until the first draw call has actually been processed by the GPU, otherwise the rendering of the first draw call will not correctly reflect the state of the GL at the point the API call was made (we need it to render the draw using the contents of the physical memory which reflect texture version 1, not version 2). A lot of what OpenGL ES drivers spend their time doing is tracking resource dependencies such as this one to make sure that the synchronous programming "illusion" is maintained, ensuring that operations do not happen too early (before the resources are available) or too late (after a later resource modification has been made).


Breaking Resource Dependencies


In scenarios where a resource dependency conflict occurs - for example, a buffer write is requested while that buffer still has a pending read lock - the Mali drivers cannot apply the resource modification immediately without some special handling; there are multiple possible routes open to the drivers to resolve the conflict automatically.


Pipeline Finish


We could drain the rendering pipeline to the point where all pending reads and writes from the GPU for the conflicted resource are resolved. After the finish has completed we can process the modification of the resource as normal. If this happens part way through the drawing of a framebuffer you will incur incremental rendering costs where we are forced to flush the intermediate render state to main memory; see this blog for more details.


Draining the pipeline completely means that the GPU will then go idle waiting for the CPU to build the next workload, which is a poor use of hardware cycles, so this tends to be a poor solution in practice.


Resource Ghosting


We can maintain both the illusion of the synchronous programming model and process the application update immediately, if we are willing to spend a bit more memory. Rather than modifying the physical contents of the current resource memory, we can simply create a new version of the logical texture resource, assembling the new version from both the application update and any of the data from the original buffer (if the modification is only a partial buffer or texture replacement). The latest version of the resource is used for any operations at the API level, older versions are only needed until their pending rendering operations are resolved, at which point their memory can be freed. This approach is known as resource ghosting, or copy-on-write.
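The ghosting mechanism can be sketched as a copy-on-write version chain (a deliberately simplified model; a real driver tracks GPU read locks per command stream and handles partial texture updates, formats, and so on):

```c
#include <stdlib.h>
#include <string.h>

/* One physical allocation backing a logical texture. Old versions
 * stay alive until their pending GPU reads resolve. */
typedef struct Version {
    unsigned char *mem;
    int pending_reads;          /* draws still referencing this copy */
    struct Version *older;
} Version;

typedef struct { Version *current; size_t size; } Texture;

void tex_init(Texture *t, size_t size) {
    t->size = size;
    t->current = calloc(1, sizeof(Version));
    t->current->mem = calloc(1, size);
}

/* API-level write: if the current version is still being read by the
 * GPU, ghost it - allocate a new version seeded from the old data so
 * a partial update still sees the right base contents. */
void tex_write(Texture *t, size_t off, const void *src, size_t len) {
    if (t->current->pending_reads > 0) {
        Version *v = calloc(1, sizeof(Version));
        v->mem = malloc(t->size);
        memcpy(v->mem, t->current->mem, t->size);
        v->older = t->current;
        t->current = v;
    }
    memcpy(t->current->mem + off, src, len);
}
```

In this model a draw call would increment pending_reads on the version it samples, and the driver frees a version once its count drops to zero and a newer version exists.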


This is the most common approach taken by drivers as it leaves the pipeline intact and ensures that the GPU hardware stays busy. The downsides of this approach are additional memory footprint while the ghost resources are alive, and some additional processing load to allocate and assemble the new resource versions in memory.


It should also be noted that resource ghosting isn't always possible; in particular when resources are imported from external sources using a memory sharing API such as UMP, Gralloc, dma_buf, etc. In these cases other drivers, such as cameras, video decoders, and image processors may be writing into these buffers and the Mali drivers have no way to know whether this is happening or not. In these cases we generally cannot apply copy-on-write mechanisms, so the driver tends to block and wait for pending dependencies to resolve. For most applications you don't have to worry about this, but if you are working with buffers sourced from other media accelerators this is one to watch out for.


Application Overrides


Given that resource dependencies are a problem on all hardware rendering systems due to pipeline depth, it should come as no surprise that more recent versions of OpenGL ES come with some features which allow application developers to override the purely synchronous rendering illusion to get more fine control if it is needed.


The glMapBufferRange() function in OpenGL ES 3.0 allows application developers to map a buffer into the application's CPU address space. When mapping, the application can specify the access flag GL_MAP_UNSYNCHRONIZED_BIT, which loosely translates as the "don't worry about resource dependencies, I know what I am doing" bit. When a buffer mapping is unsynchronized the driver does not attempt to enforce the synchronous rendering illusion, and the application can modify areas of the buffer which are still referenced by pending rendering operations, causing incorrect rendering for those operations if the buffer updates are made erroneously.
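The call pattern looks roughly like this. Note this is a compilable sketch: the GL entry points are stubbed with hypothetical stand-ins so the sequence can be shown without a live context; in a real application they come from <GLES3/gl3.h> and operate on a bound buffer object.

```c
#include <string.h>

/* Hypothetical stand-ins for the GLES3 entry points and enums, so
 * this sketch runs without a GL context. */
#define GL_ARRAY_BUFFER           0x8892
#define GL_MAP_WRITE_BIT          0x0002
#define GL_MAP_UNSYNCHRONIZED_BIT 0x0020

unsigned char buffer_store[1024];   /* pretend GPU-side storage */

void *glMapBufferRange(unsigned target, long offset, long length,
                       unsigned access) {
    (void)target; (void)length; (void)access;
    return buffer_store + offset;
}
unsigned char glUnmapBuffer(unsigned target) { (void)target; return 1; }

/* Write vertex data without dependency tracking: the driver will not
 * ghost or stall, but the application must be certain no pending draw
 * still reads the mapped range. */
void update_verts_unsynchronized(const float *verts, size_t bytes,
                                 long offset) {
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, offset, (long)bytes,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(ptr, verts, bytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
```

Mapping only the range a frame actually touches keeps the window for accidental conflicts as small as possible.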


Working With Resource Dependencies


In addition to the direct use of features such as GL_MAP_UNSYNCHRONIZED_BIT, many applications work with the knowledge that resource usage is pipelined to create flexible rendering without causing excessive ghosting overheads.


Separate Out Volatile Resources


Ghosting can be made less expensive by ensuring that volatile resources are separated from static resources, keeping the memory regions which need to be allocated and copied as small as possible. For example, ensure that animated glyphs which are updated using glTexSubImage2D() do not share a texture atlas with static images which never change, and that models which are animated in software on the CPU (via attribute or index updates) are not in the same buffer as static models.


Batch Updates


The overheads related to buffer updates can be reduced, and the number of ghosted copies minimized, by performing most of the resource updates in a single block (either one large update or multiple sequential sub-buffer/texture updates), ideally before any rendering to an FBO has occurred. Avoid interleaving resource updates with draw calls that reference the same resource, such as an update-draw-update-draw sequence on a single buffer, unless you are able to use GL_MAP_UNSYNCHRONIZED_BIT. It is usually much more efficient to batch the same set of updates together before issuing the draw calls.
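To see why batching matters, consider a toy model of the driver's copy-on-write behaviour (the types and names here are purely illustrative, not the real driver's logic): an update to a buffer that is still referenced by in-flight rendering forces a ghost copy, whereas an update to an idle buffer is free.

```c
#include <stdbool.h>

/* Toy model (illustrative only): a buffer referenced by in-flight
 * rendering must be ghosted (copied) before the CPU can modify it. */
typedef struct {
    bool in_flight;    /* referenced by pending GPU work? */
    int ghost_copies;  /* copy-on-write allocations incurred */
} Buffer;

static void update(Buffer *b) {
    if (b->in_flight) {
        b->ghost_copies++;   /* driver ghosts the old contents */
        b->in_flight = false;
    }
}

static void draw(Buffer *b) { b->in_flight = true; }

/* update-draw-update-draw: every update after the first draw ghosts */
int interleaved_cost(int n) {
    Buffer b = { false, 0 };
    for (int i = 0; i < n; i++) { update(&b); draw(&b); }
    return b.ghost_copies;
}

/* all updates batched before any draws: no ghosting at all */
int batched_cost(int n) {
    Buffer b = { false, 0 };
    for (int i = 0; i < n; i++) update(&b);
    for (int i = 0; i < n; i++) draw(&b);
    return b.ghost_copies;
}
```

In this model four interleaved update/draw pairs trigger three ghost copies, while the batched version triggers none; the real driver's accounting is more subtle, but the shape of the cost is the same.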




Application Pipelined Resources


If the application wants to make performance more predictable and avoid the overheads of ghosting and memory reallocation in the driver, one technique it can apply is to explicitly create multiple copies of each volatile resource in the application, one for each frame of latency present in the rendering pipeline (typically 3 for a system such as Android). The resources are used in a round-robin sequence, so by the time the next modification of a resource occurs the pending rendering using that resource should have completed. This means that the application's modifications can be committed directly to physical memory without needing special handling in the driver.
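A minimal sketch of such an application-side pool, with illustrative names (the backing store here is plain CPU memory; in a real renderer each slot would be a GL buffer or texture):

```c
#define POOL_SIZE 3   /* one copy per frame of pipeline latency */

typedef struct {
    unsigned char *copies[POOL_SIZE];  /* one allocation per in-flight frame */
    int current;                       /* slot to write this frame */
} ResourcePool;

/* Return the copy that is safe to write this frame, then advance
 * round-robin; after POOL_SIZE frames we wrap back to a copy whose
 * pending rendering should have completed. */
unsigned char *pool_acquire(ResourcePool *p) {
    unsigned char *buf = p->copies[p->current];
    p->current = (p->current + 1) % POOL_SIZE;
    return buf;
}
```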


There is no easy way to determine the pipeline length of an application, but it can be empirically tested on a device by inserting a fence object by calling glFenceSync() after a draw call using a texture, and then polling that fence object by calling glClientWaitSync() with a timeout of zero just before making the modifications N frames later. If this wait returns GL_TIMEOUT_EXPIRED then the rendering is still pending and you need to add an additional resource version to the resource pool you are using.


Thanks to Sean for the good question, and I hope this answers it!




Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali GPUs even better.

Originally posted on ASTC Evaluation Codec now on GitHub - Mali Developer Center

The evaluation codec for Adaptive Scalable Texture Compression (ASTC) technology is now available on GitHub, making it easier for you to access, browse and contribute feedback. Previously the ASTC codec was released as a ZIP archive on malideveloper.arm.com, including the source code and example binaries, however we wanted to use GitHub to bring ASTC to the heart of the developer open-source community.


What is ASTC technology?

ASTC technology, developed by ARM and AMD, is an official extension to both the OpenGL® and OpenGL® ES graphics APIs. ASTC is a major step forward in terms of image quality, reducing memory bandwidth and thus energy use. It is the outcome of many years of research, development and engineering, and it is now implemented in a number of GPUs across the whole industry.

ASTC is widely supported by all major hardware vendors and it is free to use. Google’s Android Extension Pack (GL_ANDROID_extension_pack_es31a) also requires support for ASTC. If you are a game developer, Unreal Engine 4 and Unity 4.3 already support ASTC and for those of you building your own game engine, you can clone the GitHub repository to start using ASTC yourself.

A number of cutting-edge developer tools also support ASTC such as:

  • ARM® Mali GPU Texture Compression Tool – can compress and decompress textures to ASTC, show the preview and let you experiment with different parameters and block sizes.
  • OpenGL ES Emulator – an OpenGL ES library for desktop PCs that lets you experiment with ASTC textures even where they are not supported in hardware.
  • Mali Graphics Debugger – can trace, capture and visualize ASTC textures used by your application.

For the technical details on ASTC, please refer to the blog posts by Tom Olson and Sean Ellis.


Example of ASTC used to encode normal maps

Why is this now on GitHub?

With the source code available on GitHub, it is even easier for every developer to access, clone, browse and contribute feedback and improvements. Since ASTC is a standard, the whole community benefits from it, and there is now a straightforward way to share fixes and new features.


Availability and support

As always, tools provided by ARM are supported in the ARM Connected Community. You can ask a question in the Mali Developer Forum, follow us on Twitter, Sina Weibo, or watch our ARM YouTube and ARM YouKu channels.

For a long time, I've not really been interested in 3D programming.

(Well, I've done some minor OpenGL programming many years ago, but I must admit that I'm more of a dev-tools programmer)


After watching the video Alban linked to in ARM Processor - Sowing the Seeds of Success - Computerphile, I decided to watch some more from Computerphile.


The videos by John Chapman are spoken in very clear and easy-to-understand English, and they explain advanced technology in a way that is easy to follow.

If you're new to 3D programming, this might be a good starting point.


A Universe of Triangles



True Power of the Matrix



Triangles to Pixels



Visibility Problem



Lights and Shadows in Graphics


The Embedded Vision Summit 2015 is nearly upon us.  This annual gathering of experts interested in this highly dynamic area is a day of fascinating presentations and demonstrations of leading-edge developments in vision-enabled products.  The Summit is on 12 May.


This year ARM® is hosting a special half-day seminar connected with the Summit on the day before.  Titled “Enabling Computer Vision on ARM”, the event will see a number of industry-leading developers presenting their experiences in computer vision across a variety of ARM platforms and use cases.


Growth in Computer Vision

Computer vision is seeing phenomenal growth in adoption and deployment.  The increased power, efficiency and variety of processors are enabling many new use cases while revolutionizing existing vision applications previously confined to the desktop.  These are now increasingly possible on energy-efficient mobile devices and across market segments including automotive, retail, medical and industrial.


In this seminar, a selection of computer vision experts and leaders in their fields will present their experiences working with ARM-based systems across a variety of real use cases.  Attendees will learn how to:


  • Resolve common issues encountered when implementing complex vision algorithms on embedded processors in areas such as object recognition and augmented reality
  • Balance workloads across processors and processor types
  • Debug heterogeneous vision applications on ARM-based systems to remove design bottlenecks and improve performance and efficiency


Seminar program:


  • Jeff Bier, President, BDTI
    Title: Benchmarking Metrics and Processor Selection for Computer Vision
    This presentation looks at the long-term trends in computer vision applications and processors and the challenges these pose to benchmarking vision applications.  As the complexity of applications, processors and the heterogeneous design of systems increases, so do the challenges in measuring their performance in meaningful ways.  Being able to assess the relative performance of processors and processor types under various combinations and configurations is a vital factor in matching systems to particular use cases.  For example, mobile use cases are becoming a key focus for software development and these systems increasingly rely on heterogeneous configurations to increase processing efficiency.  To use these processors efficiently, developers must determine the optimal mapping of their applications onto the SoC’s heterogeneous processing cores.

  • Dr. Masaki Satoh, Morpho Inc
    Title: Development of Image and Vision Processing Software and Optimizations for ARM
    This presentation will give technical insight on the benefit of NEON Acceleration, including detailing actual performance improvement.  From the developer perspective, the presentation will examine specific algorithms and how they are optimized with NEON™.  The future of imaging will also be examined, looking at the potential of GPU compute and other heterogeneous combinations, deep learning image recognition engines accelerated through NEON and OpenCL, and research and development into automotive products.

  • Dr. Piotr Stec, Project Manager in the Imaging Field, FotoNation
    Title: Video Image Stabilization for Mobile Devices
    The presentation will show the processing steps needed to perform video stabilization on mobile devices. We will show the algorithm flow, indicating the steps that need to be taken to perform the algorithm and the data flow between its various components. Some parts of the algorithm proved to be particularly challenging in terms of achieving suitable performance, and the most complex parts did not always turn out to be the bottleneck. We will show how those difficulties were overcome on a device using an ARM chipset and what gains were achieved in terms of processing time.  The last part of the presentation will be a live demonstration of the working algorithm.

  • Gian Marco Iodice, Compute Engineer, ARM
    Title: Real-time Dense Passive Stereo Vision: A Case Study in Optimizing Computer Vision Applications Using OpenCL on ARM
    Abstract: Passive stereo vision is a powerful visual sensing technique aimed at inferring depth without using any structured light. Nowadays, as it offers low-cost and reliable solutions, it finds application in many real use cases, such as natural user interfaces, industrial automation, autonomous vehicles, and many more. Since stereo vision algorithms are extremely computationally expensive, resulting in very high CPU load, the aim of this presentation is to demonstrate the feasibility of this task on a low-power mobile ARM Mali GPU. In particular, the presentation will focus on a local stereo vision method based on a novel extension of the census transform, which exploits the highly parallel execution capability of mobile Graphics Processing Units with OpenCL.  The presentation will also show the approaches and strategies used to optimize the OpenCL code in order to reach significant performance benefits on the GPU.

  • Martin Lechner, CTO, Wikitude
    Title: Utilizing NEON for Accelerated Computer Vision Processing in Augmented Reality Scenarios
    In the core of the Wikitude SDK runs an engine that heavily relies on different computer vision algorithms to get information about the current environment of the user. As those algorithms can be very computationally intensive, a major part in our work is to optimize and specifically design the algorithms for the architecture on mobile devices.  As most of the current mobile phones have either armv7 or armv8 architectures the ARM NEON SIMD-instruction set provides a huge possibility for improving the performance of computer vision algorithms. This presentation will focus on how the NEON instruction set can be used to improve the performance of general image processing functions. It will also include a discussion of our experience with the NEON instruction set, specifically the process on how to find the hotspots in the code and how those functions can be tested and debugged as well as our experience with porting the NEON functionality from armv7 to armv8.

  • Tim Hartley, Technical Marketing Manager, ARM
    Title: Measuring the Whole System: Holistic Profiling of CPU and GPU for Optimal Vision Applications on ARM Platforms
    Developers of sophisticated vision applications need all the processing power they can lay their hands on, and using OpenCL on a GPU can be a vital additional compute resource.  But spreading the workload amongst processors and processor types brings its own problems and difficulties, and traditional application optimization techniques are not always effective in this brave new heterogeneous world.  The key to achieving performance is twofold: getting access to hardware counters for all the processors in your system, and then understanding what those numbers are telling you.  In this talk, I will examine the tools and techniques available to profile these sorts of applications and will use real case studies from vision applications. Using tools like DS5 Streamline I will show how to extract meaningful performance numbers and how to interpret them.

  • Ken Lee, Founder and CEO, Van Gogh Imaging
    Title: Using ARM Processors to Implement Real-Time 3D Object Recognition on Mobile Devices
    Abstract: Diverse applications such as 3D printing, augmented reality, medical, parts inspections, and ecommerce can benefit significantly from the ability of 3D computer vision to separate a scene into discrete objects and then recognize and analyze them reliably. The 3D approach is much more robust and accurate than the traditional 2D approach and is now possible with embedded 3D sensors and powerful processors in mobile devices.  This discussion will focus on how real-time 3D computer vision can now be implemented in the ARM CPU.  Further, we will discuss how these algorithms can be further accelerated using ARM Mali GPU with OPENCL implementation.



The seminar will include an industry panel discussion. Experts from the worlds of vision IP, ADAS (Advanced Driver Assistance Systems), and image sensors and recognition software will discuss future trends in technology for ARM-based systems.


To register for free for the ARM Seminar:


There’s more about the Embedded Vision Summit here:

At GDC 2015, ARM and PlayCanvas unveiled the Seemore WebGL demo. If you haven’t seen it yet, it takes WebGL graphics to a whole new level.




So why did we build this demo? We had two key goals:

Put amazing demo content in the hands of you, the developer

Seemore WebGL is the first conference demo that has been developed to run specifically in the web browser. This is great, because you can run it for yourself and do so on any device. Nothing to download and install - hit a link, and you’re immediately dropped into a stunning 3D experience. And better yet, you can learn from the demo and use that knowledge in your own creations.

Demonstrate console quality graphics on mobile

ARM Mali GPUs pack a serious graphical punch and Seemore is designed to fully demonstrate this power. We have taken advanced graphical features seen in the latest generation of console titles and optimized them to be completely mobile friendly. And best of all, all of this technology is open sourced on GitHub.

It's not practical to examine all of the engine updates we made to bring Seemore to life. So instead, let’s examine three of the more interesting engine features that were developed for the project.


Prefiltered Cubemaps

This is the generation and usage of prefiltered cubemaps. Each mipmap level stores the environment reflection at a different level of surface roughness, from mirror-like to diffuse.




How did we do it?
First, we added a cubemap filtering utility to the engine (GPU-based importance sampling). The next step was to expose this functionality in the PlayCanvas Editor. This technique uses Phong lobes of different sizes to pre-blur each mip level. Runtime shaders either use the EXT_shader_texture_lod extension (where supported) or fall back to mip levels stored as individual textures that are interpolated manually.

Show me the code!


Further reading:



Box-projected cubemaps

This feature makes cubemaps work as if projected onto the insides of a box, instead of being infinitely far away (as with a regular skybox cubemap). This technique is widely used in games for interior reflection and refraction.



How did we do it?

This effect is implemented using a world-space AABB projection. Refraction uses the same code as reflection but with a different ray direction, so the projection automatically applies to it as well.

Show me the code!


Further reading:



Custom shader chunks

Standard material shaders in PlayCanvas are assembled from multiple code 'chunks'. Often, you don't want to replace the whole shader, but you'd like to only change some parts of it, like adding some procedural ambient occlusion or changing the way a surface reflects light.


This feature was required in Seemore to achieve the following:


  • Dual baked ambient occlusion. The main plant uses 2 AO maps for open and closed mouth states which are interpolated dynamically.


  • Fake foliage translucency. This attenuates emission to make it appear as though light is scattered on the back-faces of leaves in a hemispherically lit room. The plant’s head uses a more complex version of the effect, calculating per-vertex procedural light occlusion.


  • Plant/tentacle animation. Procedural code that drives vertex positions/normals/tangents.

How did we do it?

Shader chunks are stored in the engine codebase as .vert and .frag files that contain snippets of GLSL. You can find all of these files here. Here’s an example chunk that applies exponential-squared fog to a fragment:

uniform vec3 fog_color;
uniform float fog_density;

vec3 addFog(inout psInternalData data, vec3 color) {
    float depth = gl_FragCoord.z / gl_FragCoord.w;
    float fogFactor = exp(-depth * depth * fog_density * fog_density);
    fogFactor = clamp(fogFactor, 0.0, 1.0);
    return mix(fog_color, color, fogFactor);
}

Each chunk file’s name becomes its name at runtime, with PS or VS appended, depending on whether the chunk forms part of a vertex or pixel shader. In the case above, the filename is fogExp2.frag. It’s a simple matter to replace this fragment on a material. Simply do:

  material.chunks.fogExp2PS = myCustomShaderString;

Show me the code!



So there you have it. A brief insight into some of the latest technology in the PlayCanvas Engine. Want to find out more? Head over to GitHub, watch, star and fork the codebase - get involved today!



Since our successful demonstration at SIGGRAPH, ARM and Collabora have continued to work together on providing the best possible platform for media playback and presentation. Numerous applications, such as digital signage, IVI, tablets and remote monitoring, all require accurate, high-quality and low-power video presentation, with as low a thermal envelope as possible.


Collabora has made significant contributions to, and maintenance of, both the standard open-source GStreamer media framework, and the next-generation Wayland window system. Combining the two has allowed us to bring out the full extent of the capabilities of GStreamer, for years used in broadcast television with its exacting standards, and Wayland's lightweight and flexible design, which above all else emphasises accuracy and perfect end results.


The result is a system providing perfectly synchronised network video presentation: three displays, all powered by separate ODROID-XU3 systems using the Samsung Exynos 5422 SoC with an ARM Mali-T620 GPU. Each system displays one segment of the video, with one acting as the co-ordinator to keep timing consistent across all three segments. From the user's point of view, the video appears as one consistent whole.


Network synchronisation with GStreamer

GStreamer is the reference open-source media framework, used in everything from audio playback on embedded systems to huge farms powering broadcast TV. GStreamer's pipeline concept provides a flexible and lightweight transport to suit almost any use case. In this particular instance, we are using GStreamer to load H.264 content from disk, pass it to a hardware H.264 decoder, feed the resulting frames to Wayland, and then feed the timing information from Wayland back to the master device.

Core to GStreamer's flexibility and applicability is its excellent support for clock control, which makes it possible to synchronise multiple disparate sources. Its clock handling supports both hardware and software sources and sinks, and allows the most precise matching possible between the input audio and video clocks and the output device's actual capabilities.
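At its core, slaving one device's clock to another means observing the master clock and translating local timestamps by the measured offset. The sketch below reduces this to a single observation with illustrative names; GStreamer's real clock slaving (gst_clock_add_observation() and friends) performs a regression over many observations so it can track skew and network jitter as well:

```c
#include <stdint.h>

/* Minimal clock-slaving model (illustrative): estimate the offset
 * between a remote master clock and the local clock from a single
 * observation, then translate local timestamps into master time. */
typedef struct { int64_t offset_ns; } SlavedClock;

void slaved_clock_observe(SlavedClock *c, int64_t local_ns, int64_t master_ns) {
    c->offset_ns = master_ns - local_ns;  /* ignores network delay */
}

int64_t slaved_clock_to_master(const SlavedClock *c, int64_t local_ns) {
    return local_ns + c->offset_ns;
}
```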

GStreamer's measurements were then supplemented by an open-source distributed media control system called Aurena, which uses these measurement reports and targets from GStreamer to synchronise playback across all three devices.

The work we did with GStreamer to enhance its Wayland and H.264 hardware decoding support has already been merged into the upstream open-source project and is fully hardware-independent.



Accurate display with Wayland

The next-generation Wayland window system allows us to make the most efficient possible use of the hardware IP blocks, not only maximising throughput (thus increasing the highest achievable resolution, or number of streams, without sacrificing quality), but also providing predictable presentation.

Wayland's design goal of 'every frame is perfect' means that the content shown to the user must always be complete, coherent, and well-timed. The frame-based model employed is a significant stride over legacy X11 and DirectFB systems, and the consideration given to timing concerns allows us to make sure that the media is always delivered as close to on time as possible, without unsightly visual artifacts such as tearing.

Building on this solid and well-tested core, Collabora developed multiple extensions to Wayland. The first ensures that no copies of the video data are made in the compositing process, preserving precious memory bandwidth, latency, and overall system responsiveness. This extension uses the latest Khronos Group EGL extensions, as supported by ARM's MALI GPU.

However, even without this copy stage, as video usage continues to push at the margins of hardware performance – one recent customer project involved 4K output of nine 1080p H.264 streams on an embedded system – we realise that it might not be physically possible to obtain full frame-rate at all times. To compensate for this, Collabora developed an additional Wayland extension, allowing not only real-time feedback of actual hardware presentation time, but ahead-of-time frame queueing.

This feedback mechanism allows GStreamer to dynamically adjust its clock to obtain perfect synchronisation both across devices and between audio/video, whilst the ahead-of-time queueing gives the hardware the best possible chance to make frame deadlines, as well as preserving power by allowing the hardware to enter sleep states for longer.

This work is all either included with current Wayland releases, or actively being discussed and developed as part of the upstream open-source community.



Hardware enablement


Neither GStreamer nor Wayland required any hardware-specific development or tweaking. However, in order to make this work possible, Collabora has worked extensively on the kernel drivers for the Exynos 5422 SoC found inside the ODROID-XU3. Bringing the Exynos hardware support up to speed with the latest developments in the Kernel Modesetting and Video4Linux 2 subsystems, as well as fixing bugs found in our automated stress-testing laboratory, allowed this work to proceed without a hitch.


Far from being throwaway, this work is being merged into the upstream Linux kernel and U-Boot projects, as part of our ongoing commitment to working closely with the open source community to raise the bar for quality and functionality. As this and other platforms rapidly adopt these improvements and become able to run this work, device manufacturers are able to select from the greatest possible choice of vendors. Our work with our partners, including ARM, the wider open source community, and membership of the Khronos Group, continues to deliver benefits for the entire ecosystem, not just one platform or device.


This open standards-based approach allows platform selection to be driven by the true capabilities of the hardware and cost/logistics concerns, rather than having to fret about software capability and vendor lock-in.



Further development


Of course, the power of GStreamer, Wayland, and a standards-based Linux system does not just stop there: its extensibility includes support for OpenGL ES and EGL, as well as arbitrary client-defined content. Not only can applications like web browsers take advantage of this synchronisation to embed seamless media content in HTML5 displays, but the source data can be anything from a single OpenGL ES application to a web browser, or anything else imaginable. Whether providing for immersive gaming experiences or large-scale digital signage, the underlying technology is flexible and capable enough to deal with any needs.



ARM is a registered trademark of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved.

Marius Bjørge

Pixel Local Storage

Posted by Marius Bjørge Apr 17, 2015

It's now been a year since we announced the Shader Pixel Local Storage extension. Here I'll recap what we've done since the time of the release.


What is Pixel Local Storage?

I recommend reading Jan-Harald Fredriksen's blog post Pixel Local Storage on ARM® Mali™ GPUs for background information about what Pixel Local Storage is and the advantages of exposing it.


Order Independent Transparency

At SIGGRAPH 2014 we presented "Efficient Rendering with Tile Local Storage", with detailed use cases mixing advanced techniques such as deferred shading and order independent transparency. The problem with transparency is that blending operations tend to be non-commutative, meaning that the end result is highly sensitive to the shading order of the blended fragments. Using Pixel Local Storage we implemented a full depth-peeling approach and compared it against approximate approaches such as Multi-Layer Alpha Blending and Adaptive Range blending. Not only that: we implemented all of this very efficiently on top of a fully deferred shading renderer. Please see Efficient Rendering with Tile Local Storage for more details.



Collaboration with Epic

We integrated Pixel Local Storage into Epic's Unreal Engine 4. This enabled more efficient HDR rendering as well as features such as bloom and soft particles.



Sample code

We've also released a couple of samples showing how to use Pixel Local Storage in your own code.


Shader Pixel Local Storage


This sample implements deferred shading using pixel local storage.





This sample uses Pixel Local Storage to render translucent geometry.





  1. Pixel Local Storage on ARM® Mali™ GPUs
  2. Supporting the development of mobile games at GDC 2015
  3. Efficient Rendering with Tile Local Storage

Most people use Mali Graphics Debugger (MGD) to help debug OpenGL® ES applications on Linux or Android. However, graphics is not the only API that is supported by MGD; in fact, MGD supports applications that use the OpenCL™ API as well. This means that if you run MGD with an application that uses OpenCL you will get the same function-level tracing that you would get with OpenGL ES. You will also get access to the kernel source code in much the same way you would get access to your OpenGL ES shader source code.


With release 2.1 of the Mali Graphics Debugger, the OpenCL feature set has been improved by the inclusion of GPUVerify support. GPUVerify is a tool for the formal analysis of kernels written in OpenCL and was partly funded by the EU FP7 CARP project http://carpproject.eu/. The tool can prove that kernels don’t suffer from the following three issues:

  • Intra-group data races: This is when there is a data race between work items in the same work group.
  • Inter-group data races: This is when there is a data race between work items in different work groups.
  • Barrier divergence: This is when a kernel breaks the rules for barrier synchronization in conditional code, as defined in the OpenCL specification.


The tool was created by Imperial College London and more information about the tool and the issues it can diagnose can be found by visiting http://multicore.doc.ic.ac.uk/GPUVerify.
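To make the first category concrete: an intra-group data race occurs when two work items in the same work group can write the same memory location without intervening synchronization. The toy checker below (plain C, not GPUVerify's actual analysis, which is static rather than dynamic) replays each work item's write target and flags collisions:

```c
#define MAX_LOCATIONS 256

/* Toy dynamic race check (illustrative): count how many work items in
 * one work group write each output location; any location written by
 * more than one work item is a potential intra-group data race. */
int has_intra_group_race(int local_size, int (*target_of)(int id)) {
    int writes[MAX_LOCATIONS] = { 0 };
    for (int id = 0; id < local_size; id++)
        writes[target_of(id)]++;
    for (int i = 0; i < MAX_LOCATIONS; i++)
        if (writes[i] > 1)
            return 1;  /* two work items touched location i */
    return 0;
}

/* out[id] = ...     : every work item owns its element, race-free */
int write_own_element(int id) { return id; }
/* out[id / 2] = ... : pairs of work items collide, racy */
int write_shared_element(int id) { return id / 2; }
```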


In essence, if you provide GPUVerify with the source code of an OpenCL kernel along with the local and global work group sizes, it can check your kernel for the issues highlighted above. One of the main reasons for including support for this tool in MGD is that the information required to run GPUVerify is already known from tracing your application. The steps needed to use this tool with MGD are outlined below:


Step 1: Download Mali Graphics Debugger v2.1 and GPUVerify

Mali Graphics Debugger can be downloaded from http://malideveloper.arm.com/develop-for-mali/tools/software-tools/mali-graphics-debugger/. If you are using an older version of MGD you will need to upgrade. Version 2.1 includes much more than just GPUVerify support; a few items are listed below:

  • Support for ARMv8 (Android 64-bit)
  • Support for tracing Android Extension Pack
  • Many improvements to the Frame Overrides feature.


GPUVerify can be downloaded in binary form for both Linux and Windows from the following location: http://multicore.doc.ic.ac.uk/tools/GPUVerify/download.php. GPUVerify does support Mac OS X as well but you will have to build it from source.


Step 2: Make sure GPUVerify is working stand-alone

Using GPUVerify successfully inside MGD is easier once it is established that GPUVerify works in a stand-alone capacity. Once it has been downloaded, the process of making it work is straightforward thanks to the documentation provided on GPUVerify's website, along with a section that provides common troubleshooting advice. A good way to check that it is working correctly is to try GPUVerify with some of the examples that are provided with the Mali OpenCL SDK.


The following image is of the output from GPUVerify running the HelloWorld example from the Mali OpenCL SDK:



Step 3: Making MGD Aware of the Location of GPUVerify

As GPUVerify is not shipped with MGD, you must tell MGD where you installed it. To do this in MGD do the following:


  • Click Edit -> Preferences


  • Click on the Text box next to "Path to GPUVerify" and provide the location of the GPUVerify binary.



Step 4: Run a normal OpenCL trace in MGD

As mentioned previously, MGD will try to fill in the prerequisite information for GPUVerify from the trace. It does this by looking for several functions; the most important are clCreateProgramWithSource, clCreateKernel and clEnqueueMapBuffer.


If you don't use clCreateKernel to create your kernels, MGD can also obtain the information from clCreateKernelsInProgram. As shown in the image above, MGD also captures the build options used in clBuildProgram to pass to GPUVerify. The more details that can be passed to GPUVerify, the more accurately it can analyze your kernels.


Step 5: Running GPUVerify from MGD

To run the tool you need to select Debug -> Launch GPU Verify. MGD will then present a new dialog box which summarizes all of the information MGD managed to pull out of the trace. You are free to fill this information in or even change the information. One of the reasons you may want to do this is to try new work group sizes or global work sizes, to see if there are any unforeseen issues with different kernel parameters.


Step 6: Analyzing the Results

The results are displayed in the console view in MGD. Here are the results of the HelloWorld example from above running through MGD:




Mali Graphics Debugger can be used to do much more than debug and trace graphics applications. It can be used to debug and trace OpenCL applications as well. With the inclusion of GPUVerify support it is now possible to debug possible data race conditions in your kernels as well as barrier divergence issues. MGD will send as much data as it can to GPUVerify by analyzing the trace of your OpenCL application.

OpenGL ES is the standard API for 2D and 3D graphics on embedded systems, which includes mobile phones, tablets, smart TVs, consoles, and other appliances and vehicles. The API is a well-defined subset of desktop OpenGL.


In our ARM Mali Developer Center, we have two OpenGL ES Software Development Kits, one for Android and one for Linux environments. These SDKs are aimed mainly at beginners and intermediate users, with a guide on how to get your system properly configured and set up, as well as how to build and run the OpenGL ES 2.0 and the latest OpenGL ES 3.x sample code.






The SDKs contain tutorials and sample code. The tutorials start from the basic concepts, such as an introduction to shaders and setting up the graphics pipeline to render to the display, followed by examples that render basic geometry and apply textures and lighting to it. The more advanced tutorials cover our latest ARM Mali features, such as implementing deferred shading using the tile buffer available through the Shader Pixel Local Storage OpenGL ES extension, as well as samples illustrating the use of Adaptive Scalable Texture Compression (ASTC) and the very latest OpenGL ES 3.1 APIs, supported in the latest Android Lollipop release.


The most important OpenGL ES 3.1 feature is Compute Shaders, which allow the GPU to be used for general-purpose computing; the SDK therefore includes a dedicated tutorial called Introduction to Compute Shaders. The tutorials refer to sample code, and a detailed summary of the Compute Shader sample code available in the SDK is given in Hans-Kristian’s blog.


The ARM Mali-T6xx and ARM Mali-T7xx series support OpenGL ES 3.1, and the latest Android L OS is capable of running OpenGL ES 3.1 applications. At MWC, Samsung launched the Galaxy S6, based on Android L and supporting OpenGL ES 3.1. Other existing devices in the market that support the latest API are the Nexus 10 tablet and the Galaxy Note 4 smartphone, both of which need their OS upgraded to Android L.



About me

Hi, I am Hans-Kristian Arntzen! This is my first post here. I work in the Mali use cases team where we explore the latest mobile APIs to find efficient ways of implementing modern graphics techniques on the ARM Mali architecture.

Sometimes, we create small tech demos which result in Mali SDK samples, smaller code examples which you can take inspiration from when developing your own applications.

Since August, I've been writing quite a lot of code for OpenGL ES 3.1, and I will summarize what we have done with it over the last few months.


About OpenGL ES 3.1

OpenGL ES 3.1 is an update to OpenGL ES 3.0 which recognizes that OpenGL ES 3.0 capable hardware is already capable of much more, for example compute. OpenGL ES 3.1 brings GPU compute support directly to OpenGL ES, so there is no longer any need to interface with external APIs to expose the compute capabilities of the hardware. The interface for compute is very clean, powerful and easy to use.


Compute support in graphics APIs means there are many more opportunities for applications to offload parallel work to the GPU than before, and being able to do this on mobile hardware is very exciting.

See Here comes OpenGL® ES 3.1! for more details.


Mali driver support for OpenGL ES 3.1

We released the r5p0 driver in December with support for OpenGL ES 3.1. The driver for Linux and Android platforms can be found here: Drivers - Mali Developer Center.


Update to the Mali OpenGL ES SDK

The latest Linux and Android OpenGL ES SDK has new sample code for compute shaders.

Mali OpenGL ES SDK for Linux - Mali Developer Center

Mali OpenGL ES SDK for Android - Mali Developer Center

The samples can be built for Linux development platforms with fbdev.


There is also OpenGL ES emulator support included (OpenGL ES Emulator - Mali Developer Center), so you can run the Linux fbdev samples on your desktop on Linux and Windows.

If your desktop implementation supports X11/EGL on Linux, you should be able to run the samples without the emulator by leveraging the GL_ARB_ES3_1_compatibility extension, which was promoted to core in OpenGL 4.5.


Introduction to Compute Shaders

Introduction to compute shaders - Mali Developer Center

Compute is a new subject for many graphics programmers. This document tries to explain the different mindset you need to use GPU compute effectively, along with the new APIs found in OpenGL ES 3.1.

It goes through the major features of compute and covers some of the more difficult subjects in depth, such as synchronization, memory ordering and execution barriers.

It is recommended that you read this before studying the examples below unless you're already familiar with compute shaders.


Particle Flow Simulation with Compute Shaders

Particle Flow Simulation with Compute Shaders - Mali Developer Center


This sample implements a modern particle system. It uses compute shaders to sort particles back-to-front, which is critical for correct alpha blending.

Since we can now sort on the GPU, we can offload the entire particle system to it.
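The ordering criterion itself is simple; here is a minimal CPU sketch of it (the sample actually performs the sort on the GPU with a parallel sorting algorithm, and the helper below is just an illustration):

```python
# Back-to-front ordering for alpha blending: particles must be drawn
# farthest-first so each blend composites over everything behind it.

def sort_back_to_front(particles, camera):
    """particles: list of (x, y, z) positions; returns them farthest-first."""
    def dist_sq(p):
        # Squared distance is enough for ordering; avoids a sqrt per particle.
        return sum((a - b) ** 2 for a, b in zip(p, camera))
    return sorted(particles, key=dist_sq, reverse=True)

particles = [(0, 0, 1), (0, 0, 5), (0, 0, 3)]
ordered = sort_back_to_front(particles, camera=(0, 0, 0))
# ordered → [(0, 0, 5), (0, 0, 3), (0, 0, 1)]
```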


It also implements a 4-layer opacity shadow map for some sweet volumetric shadow effects, and simplex noise to add turbulence to the particles.

Combining all these techniques allows you to create a very nice particle system.


Occlusion Culling with Hierarchical-Z

Occlusion Culling with Hierarchical-Z - Mali Developer Center


Culling is important in complex scenes to keep vertex work down, as mentioned in this blog post: Mali Performance 5: An Application's Performance Responsibilities

For game objects, there are many sophisticated CPU-based solutions which often rely on baking data structures based on how the scene is put together.

For example, in indoor scenes with separate rooms, it makes sense to only consider rendering the room you're in and objects from rooms which are visible from it. Doing this computation on the fly could get expensive, but once the information is baked, it is fairly simple.


However, when we add a large amount of "chaotic" elements to a more dynamic scene, it becomes more difficult to bake anything and we need to compute this on the fly. We have to look for some more general solutions for these scenarios.

The sample shows how you can use a low-resolution depth map and bounding spheres to efficiently cull entire instances in parallel before they are even rendered. It can also be combined with level-of-detail sorting to reduce geometry load even further.

Finally, the result is drawn with indirect draws, a new feature of OpenGL ES 3.1.
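The core visibility test can be sketched on the CPU as follows. Assuming a low-resolution depth buffer that stores the farthest occluder depth per texel, an instance whose bounding sphere's nearest depth lies behind every covering texel is occluded. The real sample runs this in a compute shader against a depth mip chain; the function and values below are a simplified illustration:

```python
# Toy Hierarchical-Z test: compare a bounding sphere's nearest depth against
# the farthest occluder depth over the texels its screen footprint covers.

def sphere_visible(hiz, center_depth, radius, x0, x1, y0, y1):
    """hiz: 2D grid of max (farthest) occluder depths; (x0..x1, y0..y1) is
    the sphere's screen footprint in texels."""
    nearest = center_depth - radius
    max_occluder = max(hiz[y][x] for y in range(y0, y1) for x in range(x0, x1))
    # Visible unless the sphere's nearest point is behind every occluder.
    return nearest < max_occluder

hiz = [[0.5, 0.5],
       [0.5, 0.9]]  # bottom-right texel has a distant occluder
visible = sphere_visible(hiz, center_depth=0.8, radius=0.05, x0=0, x1=2, y0=0, y1=2)
occluded = not sphere_visible(hiz, center_depth=0.8, radius=0.05, x0=0, x1=1, y0=0, y1=1)
```

Taking the max over the footprint keeps the test conservative: an instance is only culled when it is provably hidden, never when the coarse depth map lacks detail.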


Using these kinds of techniques allows you to offload big "particle-like" systems to the GPU efficiently.


Game Developers Conference 2015


At GDC 2015 we presented updated best practices for OpenGL ES 3.1 on Mali along with a newly developed tech demo. I manned our tech booth on the expo floor most of the time, where I got to show my demo to other people, which was quite exciting.


Best Practices for OpenGL ES 3.1 on Mali

There are certain things you should think about when developing for Mali. During our work with OpenGL ES 3.1, we have found some general performance tips you should take into account.

Compute exposes more low-level details about the architecture, and to get optimal performance on a particular architecture, you might need some specific optimizations.

If you are experienced with compute on desktop, you might find that many general truths about desktop performance don't necessarily apply to mobile! Sometimes the performance tips are the opposite of what you'd do on desktop.

If you have used OpenCL on Mali before, best practices for OpenCL also apply for compute shaders.



I presented at GDC 2015 along with Tom Olson (Chair of the Khronos OpenGL ES and Vulkan working groups, Director of Graphics Research at ARM) and Dan Galpin (Developer Advocate, Google).

The presentation goes through OpenGL ES 3.1 (with a focus on compute), some of the techniques I mentioned in this post, best practices for OpenGL ES 3.1 and AEP on Mali, and a small sneak peek at early Vulkan experiments on Mali.


Unleash the Benefits of OpenGL ES 3.1 and Android Extension Pack:



At the ARM Lecture Theater at GDC 2015, I also gave a short presentation focused exclusively on compute. It goes into a bit more detail on compute shader basics compared to the full-length GDC talk:



Caveats with r5p0 release

Unfortunately, there are some performance bugs affecting certain features in the r5p0 release. You might stumble into them when developing for OpenGL ES 3.1.

  • Indirect draws can slow down a lot compared to regular draws.
  • Compute shaders with small work groups (e.g. 4 or 8 threads) are much slower (3-4x) than compute shaders using 64 or 128 threads.


These issues have been addressed and should be fixed in future driver releases.



Occlusion Culling with Compute Shaders demo

I am very excited about compute shaders and culling, so much so that I wanted to create a demo for GDC. We do have the Occlusion Culling sample code in the SDK, but it is far too bare-bones to show at an event.

This is the demo I got to show to many people passing by our booth.


Instead of dull green spheres, I went for some procedurally generated asteroids. Even though they are instanced, all the asteroids look slightly different thanks to a 4-component RGBA8 heightmap: every asteroid has its own random weighting factors, which make them look a bit different. They also have independent radii, rotation axes and rotation speeds, which makes the scene look fairly complex. The diffuse and normal textures are shared by all asteroids; they too are generated procedurally with Perlin noise and compressed with ASTC LDR.




There are over 27,000 asteroids in the scene, spread out across a big sphere around the camera.

At the highest quality, each individual asteroid has over 2,500 triangles. If we were to naively draw this without any kind of optimization, we would get a triangle count in the ballpark of 50+ million, which is extreme overkill.


We need some culling. The first and obvious optimization is frustum culling, which can remove most of the asteroids outright. We can do this on the GPU very efficiently and in parallel, since it's just a couple of dot products per instance after all.

All the asteroids in the scene are represented as a flat linear array of per-instance data such as position, base radius, rotation axis, rotation speed and heightmap weighting factors. We combine frustum culling with the physics update (rotating the asteroids and creating a final rotation quaternion per asteroid). Since we need to update every asteroid anyway, we might as well do frustum culling while the data is in cache!
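The per-instance frustum test is just the dot products mentioned above. A CPU sketch (the demo does this per instance in a compute shader; the single plane used here is an illustrative stand-in for the six frustum planes):

```python
# Sphere-vs-frustum test: a bounding sphere survives if it is not fully
# behind any plane. Planes are (nx, ny, nz, d) with inward-facing normals,
# so a signed distance below -radius means "completely outside".

def inside_frustum(center, radius, planes):
    for nx, ny, nz, d in planes:
        # Signed distance from the sphere center to the plane: one dot product.
        if nx * center[0] + ny * center[1] + nz * center[2] + d < -radius:
            return False  # fully behind this plane: cull the instance
    return True

# A single plane x >= 0 (normal pointing +x) as a trivial example:
planes = [(1.0, 0.0, 0.0, 0.0)]
in_front = inside_frustum((2.0, 0.0, 0.0), 1.0, planes)
behind = inside_frustum((-3.0, 0.0, 0.0), 1.0, planes)
```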




Now we're looking at ~2000 asteroids being rendered, but frustum culling alone is not enough! We also need LOD sorting to get the vertex count low enough.

The idea behind LOD sorting is that objects far away don't need high detail. We can add this technique on top of plain frustum culling and reduce the vertex count a lot. After these optimizations, we're looking at 500-600k triangles per frame, a 100x reduction from before. We can also use cheaper vertex shaders for objects far away, which reduces the vertex load even more. All of this can be done efficiently in compute shaders; it's just a question of pushing per-instance data to one of several instance buffers if it passes the frustum test.
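A CPU sketch of that bucketing step might look like the following. The distance thresholds are made-up illustration values; in the demo each bucket corresponds to an instance buffer feeding a draw with a different mesh detail level:

```python
# LOD sorting: append each surviving instance to one of several per-LOD
# instance lists based on its distance from the camera. LOD 0 is the most
# detailed; higher LODs use fewer triangles and cheaper vertex shaders.

def bucket_by_lod(instances, camera, thresholds=(10.0, 30.0)):
    """instances: list of (x, y, z) positions; returns one list per LOD."""
    buckets = [[] for _ in range(len(thresholds) + 1)]
    for pos in instances:
        dist = sum((a - b) ** 2 for a, b in zip(pos, camera)) ** 0.5
        lod = sum(dist > t for t in thresholds)  # count thresholds exceeded
        buckets[lod].append(pos)
    return buckets

lods = bucket_by_lod([(5, 0, 0), (20, 0, 0), (100, 0, 0)], camera=(0, 0, 0))
# lods → [[(5, 0, 0)], [(20, 0, 0)], [(100, 0, 0)]]
```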



Here we see close objects in white, getting darker as the LOD factor increases.


We can also use different shading for close objects. Here, close asteroids get the full bling, with normal mapping and specular highlights from the skydome, while objects farther away are diffuse-only with spherical harmonics for diffuse lighting.

This kept the fragment shading load down quite a bit. The screenshot shows the debugged normals. The normals without normal mapping look a bit funky, but that's because they are computed directly in the vertex shader by sampling the heightmap multiple times. With shading applied, it looks fine.


But we can do even better. You might have noticed the transparent, glass-like wall in front of the asteroids? It is supposed to be opaque. We wanted this to be a space station interior or something cool, but unfortunately we ran out of time before GDC.




The main point here is that there is a lot of stuff going on behind the occluder in the scene. There is no reason why we should waste precious bandwidth and cycles on vertex shading asteroids which are never seen.

Enter Occlusion Culling!


We can go from this:



to this:



After this optimization we cull over half the asteroids in the scene on average, and we are looking at a very manageable 200-300k triangles.

My hope for the future is that we'll be able to easily do all kinds of scene management directly on the GPU. It's not feasible to do everything on the GPU quite yet, and the CPU is still very capable of tasks like these, but we can definitely accelerate massively instanced cases.




The skydome is procedurally generated with FBM noise. It is HDR and is used for all the lighting in the scene. I compressed it with ASTC HDR instead of RGB9E5, a 32-bit shared-exponent format that would have been pretty much the only reasonable alternative without ASTC HDR.
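For reference, RGB9E5 stores three 9-bit mantissas under a single 5-bit shared exponent. A sketch of the encode/decode arithmetic, following the OpenGL shared-exponent rules (simplified for illustration):

```python
import math

# RGB9E5 shared-exponent packing sketch: three 9-bit mantissas share one
# 5-bit exponent, so all channels are quantized on the same scale.
N, B, E_MAX = 9, 15, 31                       # mantissa bits, bias, max exponent
MAX_VAL = (2**N - 1) / 2**N * 2**(E_MAX - B)  # largest representable value

def encode_rgb9e5(r, g, b):
    r, g, b = (min(max(c, 0.0), MAX_VAL) for c in (r, g, b))
    maxc = max(r, g, b)
    # Shared exponent chosen from the largest channel.
    exp = max(-B - 1, math.floor(math.log2(maxc)) if maxc > 0 else -B - 1) + 1 + B
    scale = 2.0 ** (exp - B - N)
    if math.floor(maxc / scale + 0.5) == 2**N:  # mantissa overflowed: bump exponent
        exp += 1
        scale *= 2.0
    return tuple(int(c / scale + 0.5) for c in (r, g, b)), exp

def decode_rgb9e5(mantissas, exp):
    scale = 2.0 ** (exp - B - N)
    return tuple(m * scale for m in mantissas)

packed = encode_rgb9e5(1.0, 0.5, 0.25)
decoded = decode_rgb9e5(*packed)
```

The shared exponent is what makes the format compact but also what limits it: a very bright channel forces a coarse scale on the dim channels, which is part of why ASTC HDR was the more attractive option here.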



I squeezed out 60 fps at 1080p with 4x MSAA on a Samsung Galaxy Note 10.1 (Mali-T628 MP6) and a Samsung Galaxy Note 4 (Korean version, Mali-T760 MP6) with all culling applied, which I'm quite happy with.

I used DS-5 Streamline (ARM DS-5 Streamline - Mali Developer Center) to find bottlenecks while tuning, along with the Mali Offline Compiler (Mali Offline Compiler - Mali Developer Center) to fine-tune the shaders (mediump varyings can make a lot of difference!).


Lighting in Games

Posted by Ellie Stone Apr 13, 2015


Light is sight


When we start talking about the importance of lighting at Geomerics, we often refer directly to light’s importance in setting the mood and atmosphere of a scene, but that is getting ahead of ourselves. Step one of light in physics: without light, there is no sight. Everything we see in the real world is the result of light reflecting off surfaces and into our eyes. If we turn off the lights in a room, close the curtains and stuff the doorframe with fabric to stop light leaking in, the objects within it are still there, our eyes are still there, but the objects remain unseen. Defining and highlighting form is the first step in lighting; it lets us see details in objects and determine how they are shaped. Smooth surfaces grade softly between light and shadow, while sharp edges deliver distinct changes.

Going from this basic first step to effectively using lighting to set the mood, intensity and atmosphere of a scene is a long jump. There is no magic formula for getting the perfect combination of light, shadow and color to achieve the desired artistic vision for the environment, in part because it is, like any art, subjective. The combination depends so much on the ambience being created - mystery horrors, for example, will tend to use low lighting and lots of shadows, punctuated by pools of light to grab your attention and draw you in, sometimes even plunging the player into darkness with the only light source being a torch controlled by the player.

The variety of lighting


When designing a game, there are many different light sources available: for example directional, ambient, spot and point lights. The source of a directional light is infinitely far away, so that by the time they reach the viewer all of its rays are parallel; sunlight is a good example. Directional lights can be stationary or movable. Ambient lighting casts soft rays equally on every part of a scene without any specific direction, and so provides light but no shadow; it has no real source. Spotlights emit from a single source in a cone shape, with an inner cone angle that provides full brightness and an outer cone angle that allows softening at the edges of where the light falls; these are often used for torches. Point lights are much like real-world lightbulbs or candles: they emit from a single point in all directions.
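As a rough sketch, the diffuse contribution of these light types differs mainly in how the light direction is derived at each surface point. The helpers and values below are illustrative, not any particular engine's API:

```python
import math

# Toy Lambert-style diffuse terms for the three directed light types.
# Vectors are 3-tuples; intensities and attenuation are omitted for clarity.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def directional(normal, light_dir):
    # All rays are parallel, so only the fixed direction matters.
    return max(0.0, dot(normal, normalize(tuple(-x for x in light_dir))))

def point(normal, surface_pos, light_pos):
    # Emits in all directions; the light direction varies per surface point.
    to_light = normalize(tuple(l - s for l, s in zip(light_pos, surface_pos)))
    return max(0.0, dot(normal, to_light))

def spot(normal, surface_pos, light_pos, spot_dir, inner_cos, outer_cos):
    # A point light restricted to a cone, fading between the inner and
    # outer cone angles (given here as cosines).
    to_light = normalize(tuple(l - s for l, s in zip(light_pos, surface_pos)))
    cone = dot(normalize(spot_dir), tuple(-x for x in to_light))
    falloff = min(1.0, max(0.0, (cone - outer_cos) / (inner_cos - outer_cos)))
    return point(normal, surface_pos, light_pos) * falloff
```

Ambient light needs no direction at all: it is simply a constant term added to every surface, which is why it provides light but no shadow.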


Each of these light sources provides a different type of direct lighting, and their effect is computed by the rendering engine. However, to simulate the physics of lighting in the real world it is important to also calculate the indirect lighting, or global illumination, of a scene. Global illumination takes into account the way in which light travels from its source, hits an object and is then absorbed, reflected, scattered or refracted by every subsequent surface it encounters. Thanks to its calculation of indirect lighting, soft shadows, colour bleeding and more, it is perfect for producing the kind of style required by architectural visualization, interior renders, scenes with direct sunlight and photorealistic renders. For example, light reflecting off a red leather seat cushion will “bleed” colour onto the wall next to it; depending on the colour of the wall this could produce a reddish glow (if the wall is white) or purple (if the wall is blue). Alternatively, a great effect is where light leaks from one room to its neighbour, gently illuminating the new room through just an open crack in a door.


Bear in mind that an exact simulation of how light works is not necessarily required. All that’s needed is something that is good enough to fool the (admittedly, clever) human eye.


Solving the global illumination challenge


One option for achieving global illumination in a scene is an offline lightmap bake. This gives the illusion that light is being cast onto an object, but what you’re actually seeing is just the effect of the light baked onto the texture. This technique delivers high quality results, but the iteration time is slow and it has limited runtime possibilities - for example, the baked light won’t have any effect on moving objects, nor can it be turned on or off during play. Another technique is “bounce lighting”, where artists add light sources into the game at strategic positions in order to simulate global illumination - for example, at the point where light would be reflected, a new light source is added with the desired properties. In comparison, this has a fast iteration time, but it can take a very high number of iterations to achieve physical correctness, it is hard to achieve dynamism, and the number of light sources may be limited by the engine in use.


Enlighten is a third option for achieving accurate, lightweight and dynamic global illumination. Enlighten uses real time radiosity to compute the interaction between a scene’s geometry and light. It contains a unique and highly optimised runtime library that generates lightmaps of bounce lighting in real time. The lightmap generation occurs on the CPU and is simply added to the rest of the direct lighting on the GPU. This approach can be further combined with lightmaps generated offline, so only the lights and materials that need to be updated at run time incur any cost. In this way, Enlighten offers a highly scalable solution suitable for all gaming platforms, from PC and console all the way down to mobile, and all lighting requirements, from fully baked to totally dynamic. Because the scene’s lighting and materials can also be updated dynamically at runtime in the editor (as well as in the game), rapid iteration is possible. By taking into account the indirect light, surface properties and specularity in a scene, it generates an extremely high quality and realistic output. For example, by enabling the bounce lighting to pick up the colour properties of the surfaces in the scene, Enlighten naturally ties together the geometry and lighting in an environment. In addition, its ability to update material properties at runtime creates a host of new gameplay opportunities, as demonstrated with the Subway demo, where destruction was achieved by making walls transparent.

More information on Enlighten is available at www.geomerics.com.
