1 2 3 Previous Next

ARM Mali Graphics

266 posts

HSA_logo.png

The Heterogeneous System Architecture 1.0 specifications have now been released.

 

Jem Davies has talked about earlier releases of the HSA specification which gave an overview of the Programmer's model, and it's great to see the System Architecture, Programmer's Reference Manual and Runtime specifications finalised and available for download. The HSA System Architecture specifies hardware which fits into modern SoCs and a clean Programming Model for software and compiler design which fits well into modern operating systems. It's well suited to CPUs, GPUs, DSPs and any other devices which support offloading of computation tasks, programmable or otherwise.

For developers, HSA provides a simple and consistent interface to the hardware and then gets out of the way. Moreover, HSA exposes hardware in a standardised way that has support from a large number of companies in both the Desktop and Mobile space. One of the key things this allows for is opening up acceleration API design to more developers; HSA is meant to be built on.

What do you do with it?

You build APIs on top of it.

There's a lot of active debate on what the right API for accelerating general purpose programs in heterogenous systems is, it's still a young area, as evidenced by a large number of APIs available today. We're still trying to find the right API (or APIs) to make it easier to speed up programs and make them more power efficient. HSA makes it easier for any person or company to prototype or develop better tools and APIs for heterogeneous systems.

For domain specific issues it also means that the API design on top of HSA can be made to suit the problem at hand, and those who best understand their problem can design the solution.

What doesn't it do?

HSA is a low level interface with a focus on direct hardware support, so it's not typically something you would program to directly if you're writing applications. It's also got a relatively high bar to entry, so if writing your own compiler front-end isn't something you had in mind, you should probably look at APIs built on top of HSA.

The good thing is, there are already API implementations written on top of HSA to get you started. Take a look at the open source OpenCL C compiler and C++ AMP implementations that are already available.

What's next?

HSA simplifies the programming model by providing mechanisms consistent with those seen for CPU development.

  • Sharing of page tables means system allocations are available on HSA agents simply by using standard OS allocation mechanisms; the process address space is now simply "extended" to all capable HSA agents.
  • Full coherency support provides an intuitive parallel programming model, and better performance when compared to software methods when sharing resources
  • User mode queues with a defined hardware packet format simplifies generating work for offload. Be it asynchronous batching of commands or immediate blocking work, the low latency of user mode queues is well suited.
  • HSAIL provides a standard compiler target for frontends. It can also be built to at the same time as host CPU code, simplifying toolchain integration when using accelerators.

HSA_arch_simple.png

The hope is that HSA can become the basis for innovation for heterogeneous compute APIs. As implementations continue to advance and overheads to use of accelerators continue to reduce, it will become easier to offload small pieces of work to accelerators.

Benefits for mobile development

HSA allows for communicating agents at the hardware level, without CPU involvement. This means fewer cycles where your CPU needs to be active and the CPU isn't unnecessarily used as a communication path between two independent units. It also means (with the help of support libraries and OS software) that common formats can be used on multiple devices in the system and no copies are requred for reformatting data shared between devices. With the flat address space of the HSAIL model, programmers can target normal CPU algorithms more easily to other devices and not have to specialise algorithms as heavily for features like local scratch memory or segmented addressing.

Mobile SoCs are also more constrained by the power and thermal budget available in the small form factors of phones and tablets. As a result, mobile development is more sensitive to overheads from copying or driver validation; HSA avoids this by adding hardware support for coherency that can be easily extended to the non-CPU devices in the sysem

Full coherency and a shared address space mean algorithms can be working on different areas of an input buffer without having to carefully partition work statically to cache line boundaries. This allows programs to run across more devices in the system, often at a lower frequency and voltage, resulting in lower total energy for equivalent computations. This can be used to either speed up the computation, or to conserve battery.

A promising future

Having HSA as a low level hardware interface means it is naturally applicable to more problems and opens up the development of higher level APIs which can now accelerate general purpose compute on GPUs, DSPs and other devices. The task based queue interface implemented in user process mapped memory allows for low overhead task dispatch and minimal CPU involvement.

HSA hardware complements existing standards such as OpenCL, SYCL, C++ AMP and OpenMP 4, with features useful for many of these APIs. HSA isn't a revolution but an evolution and standardisation of key hardware features that open up development to a much larger community.

RealtimeUK are a Creative CG Studio that has spent the last 18 years forging a strong reputation within the games industry for the quality of our Cinematics, Animation and Marketing Imagery. Whilst our team often work with developers in creating in-engine assets, it has been the quality of our trailer work for which we have become best known.  Trailers for such games as 'War Thunder', 'World of Tanks', 'Smite' and 'Total War' have become the benchmark for what can be achieved, in terms of visual fidelity, by using a pre-rendered solution.  Our studio has always been very ambitious in this regard and has always tried to maintain its position in being a leading creative production studio that delivers exceptional quality by pioneering cutting edge solutions for the video games industry.

 

With this in mind, when ARM first approached us with the ‘Ice Cave Demo’ project, our team were very keen to explore the ways in which we could collaborate with one of the world’s leading developers of Semiconductor IP. With their technical know-how in real-time mobile graphics technology and our strength in producing compelling visuals, it seemed like a match made in heaven.   It was a great opportunity to explore the ways in which we could collaboratively push the quality of the next generation of mobile graphics processors.

 

We initially treated this production the same as any other – what will be the most effective way of realizing this production and making it as visually compelling as possible? What struck us first about the piece was its ambition – even to produce the ice cave using our existing pre-rendered pipeline approach would prove challenging enough. The brief specified the need for real-time global illumination, reflection, refraction and soft shadows. Add in the budgets of the ARM® Mali™-T760 GPU-based platform and it was clear that we would have to plan the project carefully. 

 

Fortunately, we already had a head start having recently completed 'SMITE: Battleground of the Gods' in which we had invested in exploring cutting edge techniques for creating ice and snow.


Smite_T2_image_23_eshot.jpg

Given the ambitions we had in our own mind for the quality we wanted to achieve, the first steps we took were to highlight any technical concerns that stood in the way of us effectively realizing this production.  Having done this, we then made some creative decisions that would best enable us to create something extremely 'high end' that would allow us to make the most of this collaboration and enable us both to get the best out of the platform. This process helped us realise the challenge of working on mobile devices, where budgets are very small in comparison to other platforms. Armed with this information, we created a brief for our pre-production department to produce a piece of concept that would allow us to get the maximum visual quality, respect the restricted performance budget of mobile devices and also help ARM showcase the features they wanted to demo. This was further refined in line with discussions with ARM.


Slide_03.jpg

Using this as a starting point, the team at RealtimeUK further refined the vision following useful technical meetings with the team at ARM.  The budgets for this production were dictated by the platform upon which the demo would run.  In this instance, the target was to show the demo running on one of ARM’s latest designs in its ARM Cortex®-A Series - the powerful processor that is used in most of the world’s mobiles and tablets.

 

Whilst its core specs were incredibly powerful, they were far from the kind of processing power that we were used to where pre-rendered trailers could take up to two hours per frame to render.  Although initially daunting, the only real deviation away from our usual pipeline was that all the assets had to be built more economically. Whereas all assets are ordinarily built with great integrity and technical precision as standard, it is usually not so much of an issue if a piece of geometry is built with excessive triangle counts for a pre-rendered production.  Of course, this is not the case where the geometry is intended for an in-engine production where every tri and UV Map counts.  With this in mind, our artists carefully followed the guidelines set by the team at ARM and modified the authoring pipeline so that the assets could easily be integrated into Unity.  Using Unity on this project helped speed up the project flow more easily and was a new process for ARM who had, up until this point, used their own proprietary engine to realise their technical demos.  In particular, Unity’s prefab feature was a big help for our collaboration, with Unity helping to refine our workflow and techniques. Having a large community of Unity users was helpful to both parties and made for a less complicated production as it removed the need to learn a bespoke engine.

 

Once these factors had been taken into consideration, the asset creation itself was relatively straight forward with our team generating the normal, diffuse, specular and any other  maps to the usual RealtimeUK standards.  A separate dedicated team was used to create and rig the characters that would feature in the demo as were the effects which were created with a more artistic lead approach.

 

Once all the assets were built, the team at RealtimeUK were able to render them to match its intended vision.  Using our own well tested pipeline in this process, which replicated the final lighting rig, we were able to share a pre-rendered version of the assets as a movie with the team at ARM. The intention with this was that the ARM team would then be able to realise our vision in Unity by replicating it as closely as possible, using the shaders but basing them on the properties of our visual target sequence that they had created specifically for the project.

 

The end result is something we’re all proud of at RealtimeUK  and has been regarded as a huge success by ARM. Both RealtimeUK and ARM have learnt from the creation of this demo and it is something that ARM’s ecosystem will benefit from, with the results feeding in to the 'ARM Guide to Unity: Enhancing your Mobile Games' which is available to view on the Mali Developer CenterAs ever, our team has constantly pushed what is achievable in terms of graphical fidelity, regardless of the final platform. Whilst no one will deny that there still remains a clear discrepancy between what can be achieved in-engine and from a pre-rendered pipeline, it’s nice to know that collaborative efforts between studios like RealtimeUK and technology pioneers like ARM are closing the gap.

Previous blog in the series: Mali Performance 4: Principles of High Performance Rendering

 

Welcome to the next instalment of my blog series looking at graphics rendering performance on Mali GPUs using OpenGL ES. This time around I'll be looking at some of the important application-side optimizations which should be considered when developing a 3D application, before making any OpenGL ES calls at all. Future blogs will look in more detail at specific areas of usage for the OpenGL ES API. Note that the techniques outlined in this blog are not Mali specific, and should work well on any GPU.

 

With Great Power Comes Great Responsibility

The OpenGL ES API specifies a serial stream of drawing operations which are turned into hardware commands for the GPU to perform, with explicit state control over how those drawing operations are to be processed. This low level mode of operation gives the application a huge amount of control over how the GPU performs its rendering tasks, but also means that the device driver has very little knowledge about the whole scene that the application is trying to render. This lack of global knowledge means that the device driver cannot significantly restructure the command stream that it sends to the GPU, so there is a burden of responsibility on the application to send sensible rendering commands through the API in order to achieve maximum efficiency and high performance. The first rule of high performance rendering and optimization is "Do Less Work", and that needs to start in the application before any OpenGL ES API calls have happened.

 

Geometry Culling

All GPUs can perform culling, discarding primitives which are outside of the viewing frustum or which are facing away from the camera. This is a very low level cull which is applied primitive-by-primitive, and which can only be applied after the vertex shader has computed the clip-space coordinate for each vertex. If an entire object is outside of the frustum, this can be a tremendous waste of GPU processing time, memory bandwidth, and energy.

 

The most important optimization which a 3D application should therefore perform is early culling of objects which are not visible in the final render, skipping the OpenGL ES API calls completely for these objects. There are a number of methods which can be used here, with varying degrees of complexity, a few examples of which are outlined below.

 

Object Bounding Box

The simplest scheme is for the application to maintain a bounding box for each object, which has vertices at the min and max coordinate in each axis. The object-space to clip-space computation for 8 vertices is sufficiently light-weight that it can be computed in software on the CPU for each draw operation, and the box can be tested for intersection with the clip-space volume. Objects which fail this test can be dropped from the frame rendering.

 

For very geometrically complex objects that cover a large amount of screen space it can be useful to break the object up into smaller pieces, each with its own bounding box, allowing some sections of the object to be rejected if the current camera position would benefit from it.

 

BoundingBox.png

 

The images above show one of our old Mali tech demos, an interactive a fly through of an indoor science fiction space station environment. The final 3D render is shown on the left and a content debug view, which shows the bounding boxes of the various objects in the scene are highlighted in blue, is shown on the right.

 

Scene Hierarchy Bounding Boxes

This type of bounding box scheme can be taken further, and turned into a more complete scene data structure for the world being rendered. Bounding boxes could be constructed for each building in a world, and for each room in each building, for example. If a building is off-screen then it can be rejected quickly, based on a single bounding box check, instead of needing hundreds of such checks for all of the individual objects which that building contains. In this hierarchy the rooms are only tested if their parent building is visible, and renderable objects are only tested if their parent room is visible. This type of hierarchy scheme doesn't really change the workload which gets sent to the GPU for rendering, but can really help make the CPU performance overhead of all of the checks much more manageable.

 

Portal Visibility

In many game worlds simple bounding box checks against the viewing frustum will remove a lot of redundant work, but still leave a significant amount present. This is especially common in worlds consisting of interconnected rooms, as from many camera angles the view of the spatially adjacent rooms will be entirely blocked by a wall, floor, or ceiling, but be close enough to be inside the viewing frustum (and so pass a simple bounding box cull).

 

The bounding box scheme can therefore be supplemented with pre-calculated visibility knowledge, allowing for more aggressive culling of objects in the scene. For example in the scene consisting of three rooms shown below, there is no way that any object inside Room C can be seen by the player standing in Room A, so the application can simply skip issuing OpenGL ES calls for all objects inside Room C until the player moves into Room B.

 

blog5-image1.png

This type of visibility culling is often factored into game designs by the level designers; games can achieve higher visual quality and frame rates if the level design keeps a consistently small number of rooms visible at any point in time. For this reason many games using indoor settings make heavy use of S and U shaped rooms and corridors as they guarantee no line of sight through that room if the doors are placed appropriately.

 

This scheme can be taken further, allowing us to cull even Room B in our test floor plan in some cases, by testing the coordinates of the portals - doors, windows, etc. - linking the current room and the adjacent rooms against the frustum. If no portal linking Room A and Room B is visible from the current camera angle, then we can also discard the rendering of Room B.

 

blog5-image2.png

These types of broad-brush culling checks are very effective at reducing GPU workload, and are impossible for the GPU or GPU drivers to perform automatically - we simply don't have this level of knowledge of the scene being rendered - so it is critical that the application performs this type of early culling.

 

Face Culling

It should go without saying that in addition to not sending off-screen geometry to the GPU, the application should ensure that the render state for the objects which are visible is set efficiently. For culling purposes this means enabling back-face culling for opaque objects, allowing the GPU to discard the triangles facing away from the camera as early as possible.

 

Render Order

OpenGL ES provides a depth buffer which allows the application to send in geometry in any order, and the depth-test ensures that the correct objects end up in the final render. While throwing geometry at the GPU in any order is functional, it is measurably more efficient if the application draws objects using a front-to-back order, as this maximizes the effectiveness of the early depth and stencil test unit (see The Mali GPU: An Abstract Machine, Part 3 - The Shader Core for more information on early depth and stencil testing). If you render objects using a back-to-front order then there is a good chance that the GPU will have spent some cycles rendering some fragments, only to later overdraw them with a fragment which is closer to the camera, which is a waste of precious GPU cycles!

 

It is not a requirement that triangles are sorted perfectly, which would be very expensive in terms of CPU cycles; we are just aiming to get it mostly right.  Performing an object-based sort using the bounding boxes or even just using the object origin coordinate in world space is often sufficient here; anywhere where we get triangles slightly out of order will be tidied up by the full depth test in the GPU.


Remember that blended triangles need to be rendered back-to-front in order to get the correct blend results, so it is recommended that all opaque geometry is rendered first in a front-to-back order, and then blended triangles are drawn last.

 

Using Server-Side Resources

OpenGL ES uses a client-server memory model; client-side memory resembles resources owned by the application and driver, server-side resembles resources owned by the GPU hardware. Transfer of resources from application to server-side is generally expensive:

 

  • The driver must allocate memory buffers to contain the data.
  • Data must be copied from the application buffer into the driver-owned memory buffer.
  • Memory must be made coherent with the GPU view of memory. On unified memory architectures this may imply cache maintenance, on card-based graphics architectures this may mean an entire DMA transfer from system memory into the dedicated graphics RAM.

 

blog5-4.png

For reasons dating back to the early OpenGL implementations - namely that geometry processing was performed on the CPU, and did not using the GPU hardware at all - OpenGL and OpenGL ES have a number of APIs which allow client-side buffers for geometry to be passed into the API for every draw operation.

 

  • glVertexAttribPointer allows the user to specific per-vertex attribute data.
  • glDrawElements allows the user to specify per-draw index data.

 

Using client-side buffers specified this way is very inefficient. In most cases the models used each frame do not change, so this simply forces the drivers to perform a huge amount of work allocating memory and transferring the data to the graphics server for no benefit. As a much more efficient alternative OpenGL ES allows the application to upload data for both vertex attribute and index information to server-side buffer objects, which can typically be done at level load time. The per-frame data traffic for each draw operation when using buffer objects is just a set of handles telling the GPU which of these buffer objects to use, which for obvious reasons is much more efficient.

 

The one exception to this rule is the use of Uniform Buffer Objects (UBOs), which are server-side storage for per-draw-call constants for used by the shader programs. As uniform values are shared by every vertex and fragment thread in a draw-call it is important that they can be accessed by the shader core as efficiently as possible, so the device drivers will generally aggressively optimize how they are packaged in memory to maximize hardware access efficiency. It is preferable that small volumes of uniform data per draw call should be set directly via the glUniform<x>() family of functions, instead of using server-side UBOs, as this gives the driver far more control over how the uniform data is passed to the GPU. Uniform Buffer Objects should still be used for large uniform arrays, such as long arrays of matrices used for skeletal animation in a vertex shader.

 

State Batching

OpenGL ES is a state-based API with a large number of state settings which can be configured for each drawing operation. In most non-trivial scenes there will be multiple render states in use, so the application must typically perform a number of state change operations to set up the configuration for each draw operation before the draw itself can be issued.

 

There are two useful goals to bear in mind when trying to get the best performance out of the GPU and minimizing CPU overhead of the drivers:

  • Most state changes are low cost, but not free, as the driver will have to perform error checking and set some state in an internal data structure.
  • GPU hardware is designed to handle relatively large batches of work, so each draw operation should be relatively large.

 

One of the most common forms of application optimization to improve both of these areas is draw call batching, where multiple objects using the same render state are pre-packaged into the same data buffers and as such can be rendered using a single draw operation. This reduces CPU load as we have fewer state changes and draw operations to package for the GPU, and gives the GPU bigger batches of work to process. My colleague Stacy Smith has an entire blog dedicated to effective batching here: Game Set and Batch. There is no hard-and-fast rule on how many draw calls a single frame should contain, as a lot depends on the system hardware capability and the desired frame rate, but in general we would recommend no more than a few hundred draw calls per frame.

 

It should also be noted that there is sometimes a conflict between getting the best batching and removing the most redundant work via depth sorting and culling. Provided that the draw call count remains sensible and your system is not CPU limited, it is generally better to remove GPU workload via improved culling and front-to-back object render order.

 

Maintain The Pipeline

As discussed in one of my previous blogs, modern graphics APIs maintain an illusion of synchronous execution but are really deeply pipelined to maximize performance.  The application must avoid using API calls which break this rendering pipeline or performance will rapidly drop as the CPU blocks waiting for the GPU to finish, and the GPU goes idle waiting for the CPU to give it more work to do. See Mali Performance 1: Checking the Pipeline for more information on this topic - it is was important enough to warrant and entire blog in its own right!

Summary

This blog has looked as some of the critical application-sde optimizations and behaviours which must be considered in order to achieve a successful high performance 3D render using Mali. In summary the key things to remember are:

 

  • Only send objects to the GPU which have some chance of being visible.
  • Render opaque objects in a front-to-back render order.
  • User server-side data resources stored in buffer objects, not client-side resources.
  • Batch render state to avoid unnecessary driver overhead.
  • Maintain the deep rendering pipeline without application triggered pipeline drains.

 

Tune in next time and I'll start looking at some of the more technical aspects of using the OpenGL ES API itself, and hopefully even manage to include some code samples!

 

TTFN,

Pete

 


Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

You have probably all seen the announcements at GDC 2015 (going on right now). The big engine guys are battling it out for the attentions of developers large and small. Epic Games are giving away Unreal Engine 4 for free and Unity Technologies have released Unity 5 (of which the Personal Edition contains all engine features). But it doesn’t stop there; Chukong Technologies continues to improve the widely adopted open source engine Cocos2d-x and PlayCanvas are showcasing at GDC right now with their stunning WebGL engine and ARM has been excited to sponsor their PLAYHACK March competition with a great prize of a Samsung Chromebook 2 (Exynos 5 Octa and ARM® Mali™-T628 MP6 GPU) for the winning entry.

 

As all the major players continue to offer amazing engines with incredible features and continue to reduce the cost to developers, the barrier to entry reduces too. This is great news for all developers but especially for new aspiring indies keen to break into the industry.

 

As the ecosystem continues to expand and the audience of games becomes ever wider, ARM is incredibly excited to take on the challenge of ARMing (pardon the pun) developers with the tools and education they need to optimise their games for both performance and energy consumption. At GDC 2015 we’re proud to be, once again, working with key games industry players.

 

MoonTemple_640x359.jpg

Moon Temple, a collaboration between ARM and Epic Games

 

We have been collaborating with Epic Games to port Unreal Engine 4 to 64-bit and add ASTC texture compression. This can be seen in our joint demo called Moon Temple based on an existing demo from Epic. As well as 64-bit and ASTC, you’ll also see Pixel Local Storage in action; this allows us to perform energy efficient rendering by keeping data on-chip to reduce bandwidth consumption. Full details of this demo were explained in our joint talk with Epic – “Unreal Engine 4 Mobile Graphics on ARM CPU and GPU Architecture (Presented by ARM)”. Keep an eye out on the GDC Video Vault if you missed it.


IceCave_640.jpg

Ice Cave demo from ARM; created with Unity 5

 

Last year, ARM released the “ARM Guide to Unity: Enhancing Your Mobile Games”, a popular guide teaching beginner and intermediate developers how to optimise for Mali and implement cool effects. We’ll be updating the guide again this year but as a special sneak peek into what is coming, check out the Ice Cave demo from our internal demo team. Created with the latest Unity 5 (now available), this demo makes use of real-time global illumination (employing Enlighten, the new lighting solution inside Unity 5) and adds some cool reflection, refraction and soft shadows. All to be detailed in the ARM Guide to Unity later this year! ARM continues to work ever more closely with Unity and we can’t wait for what the future holds. Be sure to check out our GDC talk “Enhancing Your Unity Mobile Games (Presented by ARM)“,  also to be available in the GDC Video Vault.

 

Cocos2d-x is a very popular open source engine and we’re very pleased to have been working closely with the creators Chukong Technologies for a long time now. ARM is especially proud to have been able to help optimise the engine for Mali and help with the transition as the mobile world moves to ARMv8 and 64-bit. To make great games, you need a great engine backed by useful, easy to use and powerful tools. Recently you may have seen the announcement for the integration of ARM DS-5 into the Cocos IDE. This will allow developers the ability to efficiently debug their C++ code and continue to optimise for ARM and push the boundaries!

 

SeemoreWebGL_640.jpg

Seemore WebGL created with PlayCanvas

 

The barriers to entry for game developers are reducing but the possibilities continue to widen. ARM enables partners to create high performing energy efficient devices and we’re seeing more and more content running across multiple platforms. Console quality engines are already running on smartphones and tablets and that trend continues with browser technology. We’ve been collaborating with PlayCanvas to showcase what is possible with HTML5 and WebGL on current generation mobile devices. PlayCanvas have created a stunning rendition of our popular demo SeeMore. With stunning lighting and physically based rendering, all running on last year’s Samsung Galaxy Note 10.1 2014 Edition (Mali-T628 MP6), developers can now easily deploy high quality visuals across browser-based apps and games.

 

At time of writing, there are only two days remaining of GDC 2015, if you’re lucky enough to be there, be sure to head over to the ARM Booth 1624 and check out all the demos and technology on show!

Enlighten 3 with Forge.png

“What made Leonardo’s paintings so revolutionary was his use of light and shadow, rather than lines, to define three-dimensional objects.” – The National Gallery (nationalgallery.org.uk)

 

Great artists use lighting to convey emotions and tell stories. This is true whatever the medium, be it paint, film, or the latest video game. For computer-generated imagery, an accurate simulation of how light interacts with materials is essential and this is where global illumination (GI) - how light bounces around a scene - can be used to deliver incredible visual realism.

 

The big challenge in computer graphics is performing dynamic global illumination in real time as it has traditionally been computationally intense. And this is exactly why dynamic GI is interesting to ARM in our mission to deploy efficient technology wherever computing happens.

 

In 2013, ARM acquired Geomerics and their Enlighten technology, the game industry’s most advanced dynamic lighting solution. Enlighten is incredibly scalable – from fully baked to totally dynamic lighting, from PC and console to mobile and from small rooms to large environments.

 

While Enlighten is and always will be optimized to scale and run on any hardware platform, ARM’s design teams benefit from understanding the type of processing required to deliver cutting edge games; this in turn influences and informs our processor roadmaps.

 

This week at GDC we launched Enlighten 3 with Forge. The innovation in Enlighten 3 ensures it remains at the cutting edge of lighting technology; it also includes a new lighting editor and workflow tool called Forge which makes it easier for artists and developers to take advantage of the incredible visual quality on offer in Enlighten.

 

You can find more details on Enlighten 3 and Forge on the Geomerics website.

 

Since taking over the running of Geomerics in ARM I have been staggered by the popularity of the technology. Whether it is 40,000 YouTube hits in a week for a demo video or standing room only in a series of customer meetings in Japan a couple of weeks ago, the developer mindshare we have with Enlighten is significant. When we released our Realistic Rendering demo in 2014 Epic

 

Games founder and CEO Tim Sweeney said:

“This is gorgeous!  I remember having dreams about this kind of dynamic indirect lighting back when I was building the Unreal Engine 1 renderer!”

 

 

2015 looks set to be even more exciting than 2014 as we see Enlighten reach tens of thousands of developers via Unity 5.

 

Steven Spielberg once said,

You shouldn't dream your film, you should make it!”

 

 


...maybe with Enlighten that should apply to your game as well.


Over 25,000 game developers go yearly to San Francisco for the Game Developer Conference (GDC) in order to see and hear the latest features and capabilities of the game engines, games middleware, developer tools and hardware platforms.


At today’s Google Developers day it was announced that a whooping $7 billion revenue has been given to their app developers. And this is set to continue growing thanks to lower cost smartphones and tablets exposing millions of people of all ages and from all walks of life, to video games for the first time.


From a technical perspective, each year we are seeing, on average, a 30 to 50 per cent increase in the performance of mobile devices. The computational power of mobile GPUs is already largely on par with that of the Xbox 360 and PlayStation 3. There are still challenges around, like the availability of memory bandwidth, but ARM is developing techniques to overcome these which developers can reach via our sample code, tutorials, tools and developer guides available at our developer portal, and our latest demos of these techniques will be shown and explained during our GDC talks.


The GDC developer audience have extremely divers educational needs, from the game artist creating the game assets, the visual environment and characters, to the game developers using a specific game engine or middleware, and to the developers designing their own game engine or not using any. Therefore, we have shaped our developer tutorials and resources to fit the audience diversity.


At GDC 2015, we start our talk sessions with “Unreal Engine 4: Mobile Graphics on ARM CPU and GPU architecture”, showing first of all, how Epic Games’ game engine has been ported into the latest ARMv8 architecture, showcasing the results with a bespoke game demo from Epic Games called Moon Temple.

For game developers, the ARMv8 architecture mainly translates to porting their game to a 64-bit OS, and the latest Android “L” already includes 64-bit support.  Apple has also mandated the support of 64-bit for all new iOS 8 apps. The session continues with the tile-based ARM® Mali™ GPU architecture, showing how to reduce external memory bandwidth by keeping memory transactions localized to fast on-chip memory. The light bloom effect of the Moon Temple demo is developed using that technique, implemented via the Khronos OpenGL® ES extension “Shader Pixel Local Storage”. Other sample code using this extension are also available here. Another highlight of the talk is the ASTC integration into Unreal Engine 4. ASTC is a texture compression standard developed by ARM and adopted by Khronos. ASTC allows free choice of multiple bit rates across all supported input texture formats, from LDR to HDR formats, as well as the ability to compress 3D textures. At our developer portal we have sample code and further tutorials on it.


Furthermore, the Enlighten middleware by Geomerics, an ARM company enables dynamic global illumination and is also available pre-integrated into Unreal Engine. A full session is dedicated to it which reveals the latest features and advances in Enlighten, and the collaboration with Unreal Engine and Unity.


Another hot topic for developers is to learn how best to use the latest API features and for mobile and embedded devices, OpenGL ES is the 3D graphics API of choice. Our talk Unleash the benefits of OpenGL ES 3.1 and Android Extension Pack (AEP), focuses on the main new highlight which is compute shaders, allowing the GPU to be used for general-purpose computing. Previously, developers had to learn a different API (such as OpenCL™) if they wanted to use GPU Compute. The session covers compute shader techniques and the best coding practises on Mali Midgard GPUs. It showcases a few of the sample codes which are already available at our developer portal. The other highlight of the talk is the Android Extension Pack (AEP) and its best coding practices. AEP requires OpenGL ES 3.1 and it is an optional feature in the latest Android “L” OS release. AEP enables around 20 other extensions, including tessellation, geometry shaders and ASTC.


Tools are key for developers so that they can debug and profile their code, finding out where the performance bottlenecks are so they can optimize their application. At the talk How to Optimize your Mobile Game with ARM Tools and Practical Examples  the Mali Graphics Debugger (MGD) and DS-5 Streamline are shown, with further live sessions at our ARM booth lecture theatre. The MGD traces all the API calls that the graphics application makes; in particular it supports OpenGL ES 2.0, 3.x and EGL. The tool is complementary to DS-5 Streamline, which gives a system wide view of the performance of the application. MGD v2.1 has just been launched to be showcased at GDC 2015, and the key features include the support for Android 64-bit targets and its capability of tracing the Android Extension Pack functions.


Last but not least, there is a talk session aimed at Unity developers: Enhancing your Unity Mobile Games. Unity is the most widely used game engine, and from our developer surveys from developer events and Mali Developer Center, we understand that up to 50% of game developers use Unity. The session is given jointly with Unity and RealtimeUK, the company who created the 3D assets for the brand new Ice Cave demo, premiering this week on the ARM booth. Developers will learn the differences when developing for mobile, as well as the bottlenecks they might encounter and how to overcome them, referring to all the work done in our ARM Guide to Unity. It goes on to cover the use of the local cubemap technique for reflections, and then, inspired by the technique, we show a new way of rendering dynamic soft shadows in real-time, which is one of the key additions on our ARM Guide to Unity refresh to be released later this year.

Banner for GDC 2015.jpg

Just as September marks the turn of the year for schoolchildren; April marks the turn of the year for taxes; so too does the Game Developers Conference mark the climax of the year for anyone in the gaming industry. ARM is no different. The work of our ecosystem team begins and ends in March, with demos being finalized, developer guides written off and tools being released all for this time. With GDC coinciding with MWC in Barcelona this year, mobile game developers can definitely expect a week full of exciting announcements.

 

In the field of mobile game development, ARM recognizes the challenges. While mobile devices have the biggest reach of any gaming platform, the thermal and battery constraints have not traditionally made them a straightforward target for visually stunning games. However, increasingly advanced processors and energy efficient technologies are hitting the market each year and with IP such as the ARM® Mali™-T880 GPU and ARM Cortex®-A72 processors in the pipeline, designed specifically to deliver high-end gaming, tomorrow's premium mobile experiences are being redefined.

 

This year we have a stunning lineup of new demos that form a one-stop-shop for cutting-edge mobile development techniques, all based on the latest hardware. If you’re starting to work with APIs such as OpenGL® ES 3.1 or WebGL, come and find out how to use compute shaders for occlusion culling, or how WebGL games can rival the visual quality of those built in OpenGL ES. For those working with popular game engines such as Unity or Unreal, we have brand new demos featuring battery-saving techniques such as Pixel Local Storage and ASTC as well as tips for driving up visual quality in mobile games using Enlighten’s global illumination solution, reflections, refractions and shadows.  64-bit mobile gaming is now present in leading engines and we will be showcasing the performance improvements available both on the booth and in our sponsored sessions.

 

TessellationResult.pngThis week we announced updates to three of our most popular Mali graphics tools including a plug-in to Unreal Engine 4 for the Offline Shader Compiler. The Offline Shader Compiler allows you to analyze your materials and get advanced mobile statistics while previewing the number of arithmetic, load & store and texture instructions in your code. The OpenGL ES Emulator receives support for geometry and tessellation shaders and enables users to start developing for the Android Extension Pack (AEP) as well as OpenGL ES 2.0, 3.0 and 3.1. The Mali Graphics Debugger has gained support for 64-bit Android, improved live shader editing and now enables the Android Extension Pack (AEP) to be traced. The upgrades to the Emulator and the Debugger are available for download now; the Offline Shader Compiler plug-in is being previewed at GDC.

 

Joining us on the ARM booth will be partners who share our ambition to make the production of high-quality mobile games as easy as possible. Cocos2d-x, who recently announced the integration of ARM’s DS-5 Streamline into the Cocos Code IDE to enable developers to simply optimize their games, will be sharing their extremely popular engine with attendees. Tencent, the world-leading, free-to-play publisher and #1 brand in China will join the ARM booth with their innovative titles for mobile. Simplygon’s automatic 3D asset optimization middleware is ideal for increasing the performance of your mobile game. For those facing the challenge of smartphone market diversity, Testin’s quality assurance testing suite is a blessing for confirming the performance of your application across a variety of devices. PlayCanvas’ ever-popular WebGL game engine that’s free, open source and backed by amazing developer tools will be showing a new demo featuring some well known ARM characters!

 

All of these demonstrations will be accompanied by live sessions and in-depth explanations by the engineers who developed them on the in-booth ARM lecture theatre. The full schedule and more information about ARM at GDC is available at Mali Developer Center. We look forward to seeing you on the ARM booth #1624 next week!

If you have followed my instructions on installing OpenCL on the Samsung Chromebook or on the Samsung Chromebook 2, you may be wondering what's next. Well, optimising your code for the ARM® Mali GPUs, of course! If you are serious about using your Chromebook as a development board, you may want to know how to connect to it remotely via ssh, and use it with the lid closed. In this blog post, I'll explain how. All the previous disclaimers still apply.

 

Enabling remote access to Chromebook

I assume your Chromebook is already in the developer mode (and on the dev-channel if you are really brave).

Making the root file system writable

Open the Chrome browser, press Ctrl-Alt-T and carry on to enter the shell:

Welcome to crosh, the Chrome OS developer shell.

If you got here by mistake, don't panic!  Just close this tab and carry on.

Type 'help' for a list of commands.

crosh> shell
chronos@localhost / $ 

Using sudo, run the make_dev_ssd.sh with the --remove_rootfs_verification flag:

chronos@localhost / $ sudo /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification   

  ERROR: YOU ARE TRYING TO MODIFY THE LIVE SYSTEM IMAGE /dev/mmcblk0.

  The system may become unusable after that change, especially when you have
  some auto updates in progress. To make it safer, we suggest you to only
  change the partition you have booted with. To do that, re-execute this command
  as:

    sudo ./make_dev_ssd.sh --remove_rootfs_verification --partitions 4

  If you are sure to modify other partition, please invoke the command again and
  explicitly assign only one target partition for each time  (--partitions N )
  
ERROR: IMAGE /dev/mmcblk0 IS NOT MODIFIED.

Note the number after the --partitions flag and rerun the previous command with this number e.g.:

chronos@localhost / $ sudo /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions 4
Kernel B: Disabled rootfs verification.
Backup of Kernel B is stored in: /mnt/stateful_partition/backups/kernel_B_20150221_224038.bin
Kernel B: Re-signed with developer keys successfully.
Successfully re-signed 1 of 1 kernel(s)  on device /dev/mmcblk0.

Finally, reboot:

chronos@localhost / $ sudo reboot

 

Creating host keys

Create keys for sshd to use:

chronos@localhost / $ sudo ssh-keygen -t dsa -f /mnt/stateful_partition/etc/ssh/ssh_host_dsa_key
Generating public/private dsa key pair.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /mnt/stateful_partition/etc/ssh/ssh_host_dsa_key.
Your public key has been saved in /mnt/stateful_partition/etc/ssh/ssh_host_dsa_key.pub.
chronos@localhost / $ sudo ssh-keygen -t rsa -f /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key.
Your public key has been saved in /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key.pub.

You can leave the passphrase empty (hit the Enter key twice).

 

Enabling password authentication

Change the PasswordAuthentication setting in /etc/ssh/sshd_config to 'yes':

chronos@localhost / $ sudo vim /etc/ssh/sshd_config
# Force protocol v2 only
Protocol 2

# /etc is read-only.  Fetch keys from stateful partition
# Not using v1, so no v1 key
HostKey /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key
HostKey /mnt/stateful_partition/etc/ssh/ssh_host_dsa_key

PasswordAuthentication yes
UsePAM yes
PrintMotd no
PrintLastLog no
UseDns no
Subsystem sftp internal-sftp

 

Starting sshd

Allow inbound ssh traffic via port 22 and start sshd:

chronos@localhost / $ sudo /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT 
chronos@localhost / $ sudo /usr/sbin/sshd

Change the root password (no, I'm not showing you mine):

chronos@localhost / $ sudo passwd
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully

 

Connecting from another computer

Check the IP address of your Chromebook:

chronos@localhost / $ ifconfig
lo: flags=73  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10        loop  txqueuelen 0  (Local Loopback)
        RX packets 72  bytes 5212 (5.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 72  bytes 5212 (5.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

mlan0: flags=4163  mtu 1500
        inet 192.168.1.70  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::26f5:aaff:fe26:ee0a  prefixlen 64  scopeid 0x20
        ether 24:f5:aa:26:ee:0a  txqueuelen 1000  (Ethernet)
        RX packets 10522  bytes 3356427 (3.2 MiB)
        RX errors 0  dropped 8  overruns 0  frame 0
        TX packets 6516  bytes 1956509 (1.8 MiB)
        TX errors 3  dropped 0 overruns 0  carrier 0  collisions 0

(In this case, the IP address is 192.168.1.70.)

You should now be able to connect from another computer e.g.:

[lucy@theskyofdiamonds] ssh root@192.168.1.70
localhost ~ # whoami
root

 

Making sshd start on system startup

To make sshd start on system startup, add a script to /etc/init e.g.

chronos@localhost / $ sudo vim /etc/init/sshd.conf
start on started system-services
script
  /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT 
  /usr/sbin/sshd
end script

(A two-space indent is sufficient for the script block.)

 

Enabling passwordless connection from another computer

Generate a public/private key pair from another computer e.g.:

[lucy@theskyofdiamonds] ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/lucy/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/lucy/.ssh/id_rsa.
Your public key has been saved in /home/lucy/.ssh/id_rsa.pub.

Copy the public key to your Chromebook e.g.:

[lucy@theskyofdiamonds] ssh-copy-id root@192.168.1.70
The authenticity of host '192.168.1.70 (192.168.1.70)' can't be established.
RSA key fingerprint is 58:2d:89:e7:52:5c:b4:85:1e:79:e0:23:e8:36:f0:c2.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
Password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@192.168.1.70'"

and check to make sure that only the key(s) you wanted were added.

[lucy@theskyofdiamonds] ssh root@192.168.1.70
Last login: Sun Feb 22 00:08:45 GMT 2015 from 192.168.1.74 on ssh
localhost ~ #

 

Keeping your Chromebook awake with the lid closed

With the lid open, your Chromebook's GPU may be rendering graphics and running compute tasks concurrently. This may create undesired noise when obtaining performance counters. To keep the Chromebook awake when you close the lid, connect to the Chromebook and disable power management features:

[lucy@theskyofdiamonds] ssh root@192.168.1.70
Last login: Sun Feb 22 00:48:41 GMT 2015 from 192.168.1.74 on ssh
localhost ~ # stop powerd

Check that when you close the lid, you can still "talk" to the Chromebook e.g. launch tasks.

To enable power management again, enter:

localhost ~ # start powerd

Chinese Version 中文版: 联发科技借助 Mali™-T720 扩展移动市场份额

We’ve all come to expect our portable gadgets to wow us with their smooth, 3D, feature-rich displays and their ability to support fast-action, console-quality games. But I wonder how many people stop to think about the innovation and technology that underpins those stunning user experiences.

Phone 3.png

Innovation

Here at ARM we’re all about innovation – it’s what we do! The resulting technologies are then licensed to the world’s leading semiconductor companies who make the chips for all those exciting products that we have come to love. One of ARM’s leading technology brands is Mali – and ARM® Mali™ GPUs are at the heart of a wide range of successful consumer devices – providing that ‘wow factor’ we talked about at the beginning.

 

But let’s just step back for a moment and consider how the ‘innovate, develop, implement’ cycle can be sustained at such a fast pace that each product generation has features even more compelling than the last. The key principle here is to blend the introduction of new technologies and features with the smart reuse of existing hardware and software platforms - in order to leverage the maximum return from every investment and to ensure rapid time to market (TTM).

 

Scalability

Mali-T720.png

The scalability of Mali GPUs perfectly aligns with the reuse paradigm; the performance of a design can be tuned by simply varying the number of cores within the GPU. This, combined with reuse of the same driver and software framework, means a wide range of products - from entry-level, cost-sensitive designs through to those that are high-end and feature-rich - can quickly be brought to market.


MediaTek

One of ARM’s SoC partners is MediaTek, a company based in Taiwan. MediaTek always impresses me with the speed at which it innovates and brings product to market. I’m pleased to say that MediaTek is a partner for Mali GPU technology across its product range. A pair of recent announcements highlight how MediaTek has used the scalability of Mali GPUs and ARM Cortex® processors to good effect. In October last year, MediaTek announced details of their MT6735, an SoC for the mainstream that uses a four-core Cortex-A53 processor and a Mali-T720 MP2 GPU. In the past few days MediaTek has followed this with the announcement of the MT6753. This latest SoC is aimed at high-end applications and uses an eight-core Cortex-A53 processor combined with a Mali-T720 MP3 GPU.

 

The use of common processor and GPU types across the product range allows the TTM benefits of scalability to be realized; MediaTek comments, ‘MT6753 is compatible with the previously released MT6735, which can significantly shorten the product development cycle’. According to Mr Hsieh, president of MediaTek, there will also be a variant of the MT6735 to address particular low-end market requirements – the MT6735M.

 

These news pieces from MediaTek are yet another indication that 2015 is going to be an exciting year for Mali GPUs and Cortex processors.

Though they may be reluctant to acknowledge it openly, I think my three kids are quite fortunate. They are growing up in an environment where gaming is ubiquitous

Premium Launch 4.png

- we have two consoles, each child has access to their own tablet, and there are a number of PCs dotted around as well. There are even board games for when they finally tire of looking at a screen. This offers an amazing range of opportunities, as well as an interesting case study of the preferences of the next generation.

 

It is clear that there is a natural hierarchy of preferences among my children, with mobile at the top. Console gaming goes in cycles depending on whether a new game is out, and the PC is still the dominant platform for Minecraft, but the kids have an attachment to their tablets that goes beyond the console or PC experience. I think this has a lot to do with a sense of ownership, and the feeling of holding a premium device in your hands. There is a unique tactile experience when playing on these devices that remains exciting.

 

(I should point out here that my youngest daughter is only 4 so has to share iPad time with her mum. However, she did manage to customise the home screen by neatly putting all of her games into a sub-folder, which was both impressive for a 4 year old, and disconcerting for her mum who thought she has deleted Candy Crush).

 

For me this underlines the first part of the premium mobile experience – the raw quality of the hardware. Even if the game you are playing is quite simple, or you are just consuming content, the feel of a premium device in your hand that is light, sleek and looks cool, remains a genuine experience.

 

The second aspect of the premium experience is the content itself. This has been slower in arriving as many developers are reluctant to make games that only run on the latest high-end devices. But this is starting to change as more of the console developers target mobile. These days a high-end console game is an enormously complex beast to put together. This is why many developers chose  to license in technology to get their game made – few can afford to invest in their own technology across the board as well as employ the talented art teams needed to produce a top-end game. This level of complexity makes it very difficult to strip back a game so it will run on an older mobile device. The consequence of this is that few developers have simultaneously targeted console and mobile. Most have opted to make a console game and then hand over the task of a mobile port to a different studio, with some mixed results.

 

We are now approaching a point where this situation is about to change, and the driver for this change is the rapid progress being made in mobile hardware. The latest mobile devices promise multi-core performance, fast shared memory and a powerful GPU. You can see this in the stunning specifications for ARM’s latest Cortex-A72 CPU and Mali-T880 GPU designs. Together these represent another significant quality step in mobile performance and architecturally they closely resemble the latest consoles – the PlayStation 4 and Xbox One. The arrival of desktop style graphics APIs on mobile is also making life easier for developers to target both platforms simultaneously.

 

There is also convergence in the other direction. Mobile devices are designed with connectivity in mind from the ground up. Without connectivity my tablet defaults back to being just a MP3 player. This has not always been the case with consoles, but the new generation are also built around connectivity and content sharing, often with second-screen functionality built in.

 

These trends are all helping to create an environment where developers can simultaneously target console and premium mobile devices with the same game, which is a very attractive proposition. For developers and publishers it offers a route out of the current mindset of free-to-play with in-game purchases, pay-to-win and irritating adverts. For me these do not define a premium experience. But if your game is targeted at those who are attracted to the latest premium devices, I think you have an audience ready to pay a sensible amount for a quality game that does not constantly ask the player for more money.

 

The key final step in making this vision a reality is having the technology to simultaneously deploy the same content on console and premium mobile. The main engine vendors, Epic and Unity, are already at this point, as are some of the main technology providers. Geomerics, now an ARM company, originally developed the real-time global illumination technology Enlighten for the PC and console space. Enlighten has been mobile ready for over a year now, and on the latest mobile devices is able to run with the same quality settings used on the new generation of consoles.

 

The possibilities for the next five years are spectacular, which brings me back to my first point. The natural adopters of premium mobile content are the current generation of kids growing up with these devices, and their minds are already set – they will continue to want the latest, fastest mobile platform which can play the best games out there.

Pavel Krajcevski and Dinesh Manocha over at the University of North Carolina at Chapel Hill, have produced several papers recently that have discussed ASTC. They had a paper at High Performance Graphics 2014: SegTC: Fast Texture Compression using Image Segmentation in which they start with a good introduction and review of the state of the art of compressed textures and in particular methods of actually compressing those textures to be used later on GPUs. Needless to say, they mention ASTC as a significant advance over existing methods, which made us rather proud. They then go on to discuss new methods of compression of textures, by first computing a segmentation of the image into superpixels to identify homogeneous areas based on a given metric and using that to define partitionings for partition-based compression formats, including ASTC.

 

Their recent paper has been accepted for the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. In this paper with the catchy title of Compressed Coverage Masks for Path Rendering on Mobile GPUs, they look at methods of accelerating resolution-independent curve-rendering on mobile GPUs, preferably in real time. They find using ASTC to be a significant advance in this area, seeing good speed-ups overall from using the compressed coverage masks but also much bigger memory footprint (and bandwidth, and thus power) savings compared to older methods such as ETC2. For those interested, the list of other papers is here.

 

Both papers are highly readable, and I encourage you to have a look. It's clear there is more work to be done in this area, particularly research into more efficient ways of compressing images (and other data) into ASTC, and we look forward to seeing it.

Chinese Version 中文版: 没有免费的性能

I spent much of my teenage years sat in front of a monitor with a keyboard and mouse blasting away friends on the other side of the town in a bit of first person shooter action. The rest of the time I would be thinking about what graphics card I would need to run the same game at a slightly higher resolution or with more effects enabled. If I had two cards, would that double the frame rate giving me some kind of edge on my friends? The arms race in discreet graphics cards was always about delivering the ultimate performance no matter the cost – and ultimately the only cost was that on the consumer’s wallet.

 

Coming back to today’s world, a significant amount of gaming is played on mobile devices. In the mobile GPU space, it’s very easy to be drawn into a similar battle for ultimate performance when comparing GPUs. But here there is one cost that is critical: power or, more specifically, thermal limits. The GPU will always keep giving performance, but if you cannot sustain that performance due to the thermal constraints of a mobile platform, there is little point in having that performance available. Not only that, but you would also want to use your phone to make a call, chat with friends, check e-mails etc, after a few hours of heavy gaming without having to worry about your battery running low.


When ARM talks about graphics performance, we specifically use the term energy efficiency, or delivering the maximum performance within this constrained thermal budget. It’s worth pointing out at this point that the “constrained” thermal budget never increases (~2.5W for total SoC power in a high-end smartphone that also needs to include other components such as CPU, memory etc) so the only way we can keep up with the curve in terms of performance requirements for the latest content is to keep the Mali GPU architecture constantly evolving with new innovative technologies and optimizations.

 

pic1.png

 

Looking at the latest high-end GPU from ARM, the ARM® Mali™-T860, we improved energy efficiency by 45% compared to the Mali-T628 across a wide range of content. That means it is able to deliver 45% more performance within the same thermal budget. The comparison is core for core in the same process node. In reality, as the industry moves forward with process nodes, we see even greater improvements in energy efficiency in end devices.

pic2.png

 

From generation to generation the Mali Midgard GPU family has made step improvements in energy efficiency. These have come both from innovative bandwidth reduction technologies such as AFBC (ARM Frame Buffer Compression) or Transaction Elimination (Jakub Lamik's recent blog titled Should Have Gone to Bandwidth Savers covers these technologies and more in extensive detail) and micro architectural optimizations designed around the content we run every day.

 

pic3.png

Looking at the Mali-T860 GPU we recently launched, ARM focused on real life use cases such as high-end gaming, casual gaming and the user interface for its hardware improvements. Optimizations like quad-prioritization result in significant efficiency improvements for casual gaming and user interface. Given that users spend a large proportion of time playing these types of games or navigating between applications on a device, we feel it is extremely important to focus on such use cases and ensure we are able to handle them in an energy efficient way. Ultimately, the user gets a smoother experience for longer.

 

Another optimization introduced in the Mali-T760 and enhanced in the Mali-T860 is Forward Pixel Kill. This feature reduces the amount of redundant processing the Mali GPU has to do when pixels are occluded. This is especially effective in applications that use inefficient draw ordering.

 

In summary, when comparing GPUs in our industry, performance alone is not a useful metric when energy efficiency is not included in the mix. Mali GPUs have been designed from the ground up to be extremely energy efficient not only within the GPU itself but also from a system wide perspective. We will continue to innovate in this area for each new generation of Mali GPU products.

Chinese Version中文版:在 Mali GPU 上利用 DS-5 Streamline 优化复杂的 OpenCL™ 应用程序

Heterogeneous applications – those running code on multiple processors like a CPU and a GPU at the same time – are inherently difficult to optimize.  Not only do you need to consider how optimally the different parts of code that run on the different processors are performing, but you also need to take into account how well they are interacting with each other.  Is either processor waiting around unnecessarily for the other?  Are you copying large amounts of memory unnecessarily?  What level of utilisation are you making of the GPU?  Where are the bottlenecks?  The complexities of understanding all these are not for the squeamish.

 

Performance analysis tools are, of course, the answer, at least in part.  DS-5 Streamline performance analyzer is one of these tools and recently saw the addition of some interesting new features targeting OpenCL.  Streamline is one of the components of ARM DS-5 Development Studio, the end-to-end suite of tools for software development on any ARM processor.

 

So, armed with DS-5 Streamline and a complex, heterogeneous application how should you go about optimization?  In this blog I aim to give you a starting point, introducing the DS-5 tool and a few concepts about optimization along the way.

 

DS-5 Streamline Overview


pic1.jpg


DS-5 Streamline allows you to attach to a live device and retrieve hardware counters in real time.  The counters you choose are displayed in a timeline, and this can include values from both the CPU and GPU in the same trace.  The image above, for example, shows a timeline with a number of traces.  From the top there’s the dual-core CPU activity in green, the GPU’s graphics activity in light blue and the GPU’s compute activity in red.  Following that are various hardware counter and other system traces.

 

As well as the timeline, on the CPU side you can drill down to the process you want to analyse and then profile performance within the various parts of the application, right down to system calls.  With Mali GPUs you can specify performance counters and graph them right alongside the CPU.  This allows you to profile both graphics and OpenCL compute jobs, allowing for highly detailed analysis of the processing being done in the cores and their components. A recently added feature, the OpenCL timeline, takes this a step further making it possible to analyse individual kernels amongst a chain of kernels.

 

Optimization Workflow


So with the basics described, what is the typical optimization process for complex heterogeneous applications?

 

When the intention is to create a combined CPU and GPU solution for a piece of software you might typically start with a CPU-only implementation.  This gets the gremlins out of the algorithms you need to implement and then acts both as a golden reference for the accuracy of computations being performed, and as a performance reference so you know the level of benefit the move to multiple processor types is giving you.

 

Often the next step is then to create a “naïve” port.  This is where the transition of code from CPU to GPU is functional but relatively crude. You wouldn’t necessarily expect a big – or indeed any – uplift in performance at this stage, but it’s important to establish a working heterogeneous model if nothing else.

 

At this point you would typically start thinking about optimization.  Profiling the naïve port is probably a good next step as this can often highlight the level of utilisation within your application and from there you can deduce where to concentrate most of your efforts.  Often what you’re looking for at this stage is a hint as to the best way to implement the parallel components of your algorithm.

 

Of course to get the very best out of the hardware you’re using it is vital to have a basic understanding at least of the architecture you are targeting.  So let’s start with a bit of architectural background for the Mali GPU.

 

The OpenCL Execution Model on Mali GPUs


Firstly, here’s how the OpenCL execution model maps onto Mali GPUs.

 

pic2.jpg

 

Work items are simply threads on the shader pipeline, each one with its own registers, program counter, stack pointer and private stack. Up to 256 of these can run on a core at a time, each capable of natively processing vector data.

 

OpenCL work groups – collections of work items – also work on an individual core. Workgroups can have barriers, local atomics and cached local memory.

 

The ND Range, the entire work load for an OpenCL job, splits the workgroups up and assigns them around the available Mali GPU cores. Global atomics are supported, and we have cached global memory.

 

As we’ll see, relative to some other GPU architectures, Mali GPU cores are relatively sophisticated devices capable of handling hundreds of threads in flight at any one time.

 

The Mali GPU Core

 

Let’s take a closer look inside one of these cores:

 

pic3.jpg

 

Here we see the dual ALU, the load/store and the texture pipelines. Threads come in at the top and enter one of these pipes, circle round back up to the top for the next instruction until the thread completes, at which point it exits at the bottom.  We would typically have a great many threads running this way spinning around the pipelines instruction by instruction.

 

Load/Store

 

So let’s imagine the first instruction is a load.  It enters and is executed in the load/store pipe.  If the data is available, the thread can loop round on the next cycle for the next instruction.  If the data hasn’t yet arrived from main memory, the instruction will have to wait in the pipe until it’s available.

 

ALUs

 

Imagine then the next instruction is arithmetic.  The thread now enters one of the arithmetic pipes.  ALU instructions support SIMD – single instruction, multiple data – allowing operations on several components at a time.  The instruction format itself is VLIW – very long instruction word – supporting several operations per instruction.  This could include, for example, a vector add, a vector multiply and various scalar operations all in one instruction.  This can give the effect of certain operations appearing “as free” because the arithmetic units within the ALU can perform many of these in parallel within a single cycle.  Finally there is a built in function library – the “BIFL” – which has hardware acceleration for many mathematical and other operational functions.

 

So this is a complex and capable core, designed to keep many threads in flight at a time, and thereby hide latency.  Latency hiding is what this is ultimately all about. We don’t care if an individual thread has to wait around for some data to arrive as long as the pipelines can get on with processing other threads.

 

Each of these pipelines is independent from the other and likewise the threads are entirely independent from other threads.  The total time for a program to be executed is then defined by the pipeline that needs the most cycles to let every thread execute all the instructions in its program. If we have predominantly load/store operations for example, the load/store pipe will become the limiting factor.  So in order to optimize a program we need to find which pipeline this is allowing us to target optimization efforts effectively.

 

Hardware Counters

 

To help determine this we need to access the GPU’s hardware counters. These will identify which parts of the cores are being exercised by a particular job.  In turn this helps target our efforts towards tackling bottlenecks in performance.

 

There are a large number of these hardware counters available. For example there are counters for each core and counters for individual components within a core, allowing you to peek inside and understand what is going on with the pipelines themselves.  And we have counters for the GPU as a whole, including things like the number of active cycles.

 

Accessing these counters is where we come back to DS-5 Streamline.  Let’s look at some screenshots of Streamline at work.

 

pic4.jpg

The first thing to stress is that what we see here is a whole-system view.  The vertical green bars in the top line shows the CPU, the blue bars below that show the graphics part of the application running on the GPU, and the red bars show the compute-specific parts of the application on the GPU.

 

crop.jpg

 

There are all sorts of ways to customise this – I’m not going to go into huge amounts of detail here, but you can select from a wide variety of counter information for your system depending on what it is you need to measure. Streamline allows you to isolate counters against specific applications for both CPU and GPU, allowing you to focus in on what you need to see.

crop2.jpg

 

Looking down the screen you can see an L2 cache measurement - the blue wavy trace in the middle there -  and further down we’ve got a counter showing activity in the Mali GPU’s arithmetic pipelines.  We could scroll down to find more and indeed zoom in to get a more detailed view at any point.

 

DS-5 Streamline can often show you very quickly where the problem lies in a particular application.  The next image was taken from a computer vision application running on the CPU and using OpenCL on the GPU.  It would run fine for a number of seconds, and then seemingly randomly would suddenly slow down significantly, with the processing framerate dropping in half.

crop3.jpg

 

You can see the trace has captured the moment this slowdown happened. To the left of the timeline marker we can see the CPU and GPU working reasonably efficiently.  Then this suddenly lengthens out, we see a much bigger gap between the pockets of GPU work, and the CPU activity has grown significantly.  The red bars in amongst the green bars at the top represent increased system activity on the platform.  This trace and others like it were invaluable in showing that the initial problem with this application lay with how it was streaming and processing video.

 

One of the benefits of having the whole system on view is that we get a holistic picture of the performance of the application across multiple processors and processor types, and this was particularly useful in this example.

 

crop4.jpg

Here we’ve scrolled down the available counters in the timeline to show some others – in particular the various activities within the Mali GPU’s cores.  You can see counter lines for a number of things, but in particular the arithmetic, load-store and texture pipes – along with cache hits, misses etc.  Hovering over any of these graphs at any point in the timeline will show actual counter numbers.

 

crop5.jpg

 

Here for example we can see the load/store pipe instruction issues at the top, and actual instructions on the bottom.  The difference in this case is a measure of the load/store re-issues necessary at this point in the timeline – in itself a measure of efficiency of memory accesses.  What we are seeing at this point represents a reasonably healthy position in this regard.

 

The next trace is from the same application we were looking at a little earlier, but this time with a more complex OpenCL filter chain enabled.

 

crop6.jpg

 

If we look a little closer we can see how efficiently the application is running.  We’ve expanded the CPU trace – the green bars at the top – to show both the cores we had on this platform.  Remember the graphics elements are the blue bars, with the image processing filters represented by the red.

 

mag.jpg

 

Looking at the cycle the application is going through for each frame:

 

  1. Firstly there is CPU activity leading up to the compute job.
  2. Whilst the compute job then runs, the CPU is more or less idle.
  3. With the completion of the compute filters, the CPU does a small amount of processing, setting up the graphics render.
  4. The graphics job then runs, rendering the frame before the sequence starts again.

 

So in a snapshot we have this holistic and heterogeneous overview of the application and how it is running.  Clearly we could aim for much better performance here by pipelining the workload to avoid the idle gaps we see.  There is no reason why the CPU and GPU couldn’t be made to run more efficiently in parallel, and this trace shows that clearly.

 

OpenCL Timeline

 

There are many features of DS-5 Streamline, and I’m not going to attempt to go into them all.  But there’s one in particular I’d like to show you that links the latest Mali GPU driver release to the latest version of DS-5 (v5.20), and that’s the OpenCL Timeline.

 

pic1.jpg

 

In this image we’ve just enabled the feature – it’s the horizontal area at the bottom.  This shows the running of individual OpenCL kernels, the time they take to run, any overhead of sync-points between CPU and GPU etc.

 

crop7.jpg

 

Here we have the name of each kernel being run along with the supporting host-side setup processes   If we hover over any part of this timeline…

 

crop8.jpg

… we can see details about the individual time taken for that kernel or operation.  In terms of knowing how then to target optimizations, this is invaluable.

 

Here’s another view of the same feature.

 

pic15.jpg

 

We can click the “Show all dependencies” button and Streamline will show us visually how the kernels are interrelated.  Again, this is all within the timeline, fitting right in with this holistic view of the system.  Being able to do this – particularly for complex, multi-kernel OpenCL applications is becoming a highly valuable tool for developers in helping to understand and improve the performance of ever-more demanding applications.

 

Optimizing Memory Accesses

 

So once you have these hardware counters, what sort of use should you make of them?

 

Generally speaking, the first thing to focus on is the use of memories. The SoC only has one programmer controlled memory in the system – in other words, there is no local memory, it’s all just global.  The CPU and GPU have the same visibility of this memory and often they’ll have a shared memory bus.  Any overlap with memory accesses therefore might cause problems.

 

If we want to shift back and forth between CPU and GPU, we don’t need to copy memory (as you might do on a desktop architecture).  Instead, we only need to do cache flushes.  These can also take time and needs minimising. So we can take an overview with Streamline of the program allowing us to see when the CPU was running and when the GPU was running, in a similar way to some of the timelines we saw earlier.  We may want to optimize our synchronisation points so that the GPU or CPU are not waiting any longer than they need to. Streamline is very good at visualising this.

 

Optimizing GPU ALU Load

 

With memory accesses optimized, the next stage is to look more closely at the execution of your kernels.  As we’ve seen, using Streamline we can zoom into the execution of a kernel and determine what the individual pipelines are doing, and in particular determine which pipeline is the limiting factor.  The Holy Grail here – a measure of peak optimization – is for the limiting pipe to be issuing instructions every cycle.

 

I mentioned earlier that we have a latency-tolerant architecture because we expect to have a great many threads in the system at any one time. Pressure on register usage, however, will limit the number of threads that can be active at a time.  And this can introduce latency issues once the number of threads falls sufficiently.  This is because if there are too many registers per thread, there are not enough registers for as many threads in total.  This manifests itself in there being too few instructions being issued in the limiting pipe.  And if we’re using too many registers there will be spilling of values back to main memory, so we’ll see additional load/store operations as a result.  The compiler manages all this, but there can be performance implications of doing so.

 

An excessive register usage will also result in a reduction in the maximum local workgroup size we can use.

 

The solution is to use fewer registers.  We can use smaller types – if possible.  So switching from 32 bit to 16 bit if that is feasible.  Or we can split the kernel into multiple kernels, each with a reduced number of registers.  We have seen very large kernels which have performed poorly, but when split into 2 or more have then overall performed much better because each individual kernel needs a smaller number of registers.  This allows more threads at the same time, and consequently more tolerance to latency.

 

Optimizing Cache Usage

 

Finally, we look at cache usage.  If this is working badly we would see many L/S instructions spinning around the L/S pipe waiting for the data they have requested to be returned. This involves re-issuing instructions until the data is available.  There are GPU hardware counters that show just what we need, and DS-5 can expose them for us.

 

This has only been a brief look at the world of compute optimization with Mali GPUs.  There’s a lot more out there.  To get you going I’ve included some links below to malideveloper.arm.com for all sorts of useful guides, developer videos, papers and more.

 

Download DS-5 Streamline: ARM DS-5 Streamline - Mali Developer Center Mali Developer Center

Mali-T600 Series GPU OpenCL Developer Guide: Mali-T600 Series GPU OpenCL Developer Guide - Mali Developer Center Mali Developer Center

GPU Compute, OpenCL and RenderScript Tutorials: http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/

Epic Giveaway.png


samsung note 4.jpgThe Samsung Galaxy Note 4 has quickly made itself popular among the ARM Mali graphics team. And it’s not its 515PPI Quad HD Super AMOLED display,

Exynos 7 Octa.png

vivid colors and intuitive UI that has earned it a place in our hearts – it is, as you would expect from ARM engineers, its stunning processor that has caught our eyes.


The Samsung Exynos 7 Octa is the latest mobile application processor to come out of the Samsung LSI team and it boasts considerable improvements over the previous generation – including up to 74% enhanced graphics performance with the ARM Mali-T760 MP6 GPU.  With this boost the Samsung Galaxy Note 4 can deliver superior, more life-like 3D gaming experiences on super HD screens, as well as a smoother, more responsive user interface and performance intensive, up-and-coming applications such as instant image stabilization, video editing or facial recognition. The Mali-T760 incorporates many of ARM’s popular energy efficient technologies, such as ARM Frame Buffer Compression, Adaptive Scalable Texture Compression, Smart Composition and Transaction Elimination – these together with the micro-architectural improvements to the Mali-T760, in particular to the L2 cache interconnect, result in an Exynos SoC that delivers a fantastic graphics experience without overexerting its intrinsic thermal and

samsung exynos 7.png

power budget.


When combined with a 1.9GHz ARM Cortex-A57 MP4, a 1.3GHz Cortex-A53 MP4 processor in big.LITTLE™ configuration with Samsung's HMP (Heterogeneous Multi-Processing) solution, every process can intelligently use the high processing power in such a way that no matter what the multitasking needs are, or what application is being run, there will be no lags and ultimately no unnerving power consumption. In all, the HMP technology, when used with the Cortex-A57 cores and Cortex-A53 cores, provides a 57% CPU performance increase from the previous generation Exynos 5 Octa.

 

The sum of all this is a device that has not only impressed across a range of benchmarks but also delighted critics and the public at large. The Samsung Galaxy Note 4 is an extremely desirable device that delivers the very latest advances in mobile technology – and it can be yours! ARM is giving away a Samsung Galaxy Note 4 as part of the 2014 Epic Giveaway in partnership with HEXUS. To find out more and to enter the EPIC Giveaway for your chance to win, click here and go to HEXUS’ Facebook page.


 

The 2014 Epic Giveaway is underway today. In partnership with HEXUS, ARM is giving you the chance to win amazing new prizes this holiday season! Every day for the next few weeks, we'll be giving away a brand-new ARM-based device. We'll have an array of prizes from ARM and our partners, including Atmel, Nvidia and Samsung, plus many, many more! Each prize draw will be open for seven days, so visit the dedicated competition page to keep tabs on what's up for grabs and what's coming soon.

For GDC 2014 I wrote a presentation intended to capture people’s interest and get them to the ARM lecture theatre using a lot of the buzz surrounding the current uptake in 3D tech. Alas, we never got to see the talk in the lecture theatre as so many partners offered to speak on our stand that we didn’t have time for it, so instead we recorded it in the office to present in ARMflix format.

 

When I first started at ARM on the demo team, the first demo I saw was True//Force, running in stereoscopic 3D on a 36 inch screen. Connected to that screen was a single core ARM® Mali™-400 MP GPU based development board.

 

That demo was adapted to many other platforms through the years, and later we reinvented the 3D camera system for a gesture input UI, again designed for large stereoscopic scenes. Throughout this development we made several interesting observations about pitfalls faced in generating stereoscopic content, which this year we were able to share over the ARMflix channel.

 

 

Admittedly the video above may be a little overlong in its description of 3D screen technology, most of this is already very common knowledge, although the advice  about splitting the signal top/bottom instead of left/right still throws some developers. Since screens with alternating polarised lines already have a reduction in vertical resolution, the reduction in vertical resolution from a top/bottom split screen isn’t seen. A left/right split screen will reduce the horizontal resolution, forcing the screen to interpolate between pixels, and the vertical detail gets lost in the interlacing of lines anyway.

 

The basics of stereoscopy can be understood by thinking about lines of sight coming out from your eyes and intersecting at an object. If the object gets closer those lines converge nearer, if it moves into the distance those lines converge further away. When an object is drawn on a stereoscopic display, each eye sees the object in different places, and where those eye lines cross is where it appears to be in space, which could be in front of or even behind (or inside) the screen.

 

Stereoscopy1.png

 

This projection into different views in order to emulate this difference in what each eye sees is the basis of stereoscopy, but stereoscopy is only one factor used by your brain to discern the shape of the world. If you’re going to implement stereoscopy, it has to remain congruent with other mental models of how the world works.

 

One such model is occlusion. For example, you would never expect to see something through a window which was further away than the object you’re seeing through it. Anything which comes out of the screen in a stereoscopic setup is basically this exact scenario. It’s okay so long as the object is in the middle of the screen, because then it could conceivably be in front of it, but if it clips the edge of the screen your brain becomes aware that it is seeing through the screen like a window, and yet the object being occluded by it is closer. This creates a visual dissonance which breaks the spell, so to speak.

 

Stereoscopy2.png

As it approaches the edge what you will observe is a kind of impossible monocular defect, where the eyes disagree to a degree that it becomes hard to discern spatial information. Monocular defects occur in real life too, imagine peering round the corner of a wall such that only one eye sees an object in the distance, or looking at an object which has a highly angle sensitive iridescence or noisy high gloss texture such that it seems a different colour or pattern in each eye. The difference in these cases is that they arise from genuine circumstance and we can rely on other factors to provide additional information, from small lateral movements of the head to using optical focus as a secondary source of depth information.

 

In a simulation, the focus of the eyes is always on the screen, as there is currently no way to defocus the image on screen to converge at a different focal point for your eyes, and without sophisticated head tracking hardware, your TV can’t correct perspective for you leaning to the left or coming closer.

 

More importantly, as the generation of stereoscopic viewpoints is being faked on a 2D screen, it’s possible to do odd things like occlude something with an object that appears to be behind it. This is a mistake frequently made by overlaying the menu on the screen for each eye but not thinking about how far in or outside the screen it will appear.

 

You could also occlude protrusions from the screen by the screen itself. The imagined reality of watching something stereoscopic on a screen is that the screen is a window you’re looking through into a fully 3D world. Occasionally things may bulge out of this window, and that’s fine, but if they then slide sideways out of view, they’re being occluded by an object behind them: the frame of this virtual window.

 

Best to try to keep the space in front of the screen a little bit special, by keeping the collision plane of the camera and the convergence plane of the projections in the same place. With that in mind, let’s talk about convergence planes.

This is a good point to remind you that this is advice for stereoscopy on screens, as virtual headsets use a very different method of projection.

 

So you’ve got two cameras pointing forwards, some distance apart. This will give you a kind of 3D, but unfortunately you’re not quite done yet. As mentioned before, the distance at which your eyes converge on the screen is the place in 3D space where objects should appear in the same place on the two eye images. But if you draw two visibility frustums pointing parallel, you’ll see that actually there is no point where an object appears in the same place in both. It will tend towards convergence at infinity, when the distance between the frustums is negligible compared to the width of the view plane. That means that the screen is essentially the furthest away anything can be in this setup, whereas what you probably want is a sense of depth into the screen. To achieve this you need to cross the streams, make the eyes look inward a little.

 

It’s not as simple as just rotating the frustums however. Unlike the spherical optics of an actual eye (admittedly rather exaggerated in the video), the flat projection of a frustum will intersect but never truly converge. For convergence of the view plane you must skew the frustum.

 

The equation for this is far simpler than it looks, a multiplication by a matrix that looks like this:

skew matrix.png

 

g is the gradient of the skew, defined as s/d (where s is half the separation between the two cameras, and d is the perpendicular distance between the desired convergence plane and the cameras).

 

So now you need to decide what the eye distance and the convergence plane are. The best advice I can give is to attempt to replicate as best you can the relationship between the physical world and the virtual. Basically, you would expect someone playing a game to sit at least a meter from the screen and have eyes which are between 6 and 7cm apart. However, more important than an exact emulation is the ability to parameterise it for the far distance. The draw frustum has an angular factor in the field of view.

 

Typically the field of view is around 90 degrees, which means the screen size at the convergence plane is sin(45)d or roughly 0.7d so in the scenario above where we assume the screen is about 1 meter away from the cameras, the screen width is 70cm. I’m currently sat in front of a 24 inch monitor and I measure the width as approximately 50cm. If the eyes were placed 6cm apart in this scenario, the far plane would tend towards the virtual 7cm eye separation required for the eye vectors to be parallel. Scaled down onto my 50cm screen however this would be a 5cm eye separation. My eye lines would not be parallel and so the far plane would actually have a tangible optical distance*.

 

Ideally you want the distance of the far plane separation on screen to be the same as your user’s actual physical eye separation. This of course requires parameterisation for user eye separation and screen size. Sticking with 7cm eye separation and a 50cm screen for a moment, let’s look at our options.

 

1. You can move the cameras further apart (by a ratio of 7/5 to a virtual 9.8cm in this case) and adjust the skew gradient to suit.

2. You can make the field of vision (FOV) narrower (to sin-1(0.5) or 30 degrees in this case) so that the convergence plane is the same size as the screen.

 

In case it isn’t obvious which of these is the better option (a 30 degree FOV, really you think that’s okay?), consider the math if the screen is 10cm across. Obviously you wouldn’t see much through a 5.7 degree FOV, so that option is right out.

 

Option 1 has one major drawback however, which is that the whole world’s scale will scale with the eye distance.  In this case the whole world would be 5/7ths its original size. Again, do the math with a 10cm screen and you’ll see that the world would be 1/7th the size.

 

There is a temptation to move the cameras closer to the convergence plane, but to keep the screen the same size and place that plane closer, the FOV would have to widen, creating a mismatch between the perspective in the real and virtual space. What this mismatch does is move things towards the far plane more rapidly, as the sense of depth is an asymptotic power curve linked to perspective. The wider the FOV goes, the higher the power and the tighter the curve. If this curve is different to what you’re used to, the depth will feel very odd and less real.  However, an incorrect convergence or even a lack of convergence is far preferable to the worst sin of stereoscopy: divergence.

Stereoscopy3.png

 

Since the entire basis of stereoscopy is your eyes pointing to two different places on the screen and the imaginary sight lines crossing at the depth being simulated, we know that these lines being parallel means the object is infinitely far away. But what if the images for each eye move even further apart? If the gap is wider than the gap between your eyes the imaginary lines of sight are no longer parallel but actually diverge and this is one of the biggest head ache inducers in stereoscopy.

 

It is a scenario which you can literally never replicate in real life without optical equipment. If the images diverge by a reasonable amount it’s not too bad because you just can’t focus on them, which ruins the picture but not your eyes. If they diverge by a tiny amount your eyes will have a go at achieving binocular focus and you’ll probably end up with the mother of all migraines in very little time. This scenario is the most likely to arise if you don’t allow users to fine tune their own stereoscopy settings.

 

There are other simple mistakes and pitfalls covered in the video at the top of this blog, such as forgetting to make menus stereoscopic and being careful with the perceived depth of overlays. It makes sense for a pop-up menu to have its text at screen depth on the convergence plane so that if the stereoscopy needs adjusting, the menu is still easy to read… but the menu has to be drawn over everything else, so how to avoid objects behind the menu overlapping the convergence plane at these times? Again, this comes down to keeping space in front of the convergence plane special.

 

Which just leaves me time to add one side ramble which isn’t in the video. By far the most popular stereoscopic display currently in use is that on the 3DS hand held game console from Nintendo, and when it first came out people talked about the weirdness of the little slider on the side which seemed to scale the 3D effect on the screen. If it was at the bottom it went 2D, and sliding it up slowly took it to full depth simulation. At the time people wondered why you would want a game to be in half 3D, where it was still stereoscopic, but appeared semi-flattened.

 

The answer is simple: The slider was actually adjusting the eye-separation, so that full depth stereoscopy could be tuned for use by children of all ages. If the anatomical eye gap of the user is wider than the highest setting of the device, it doesn’t matter too much, it just means that the far plane won’t look as far. But if the eye separation was fixed width and someone with a smaller eye gap tried to use it, it would cause line of sight divergence, and all the associated headaches.

 

So make sure you’ve got that slider in your software.

 

* I know you’ll be knocking yourselves out over this so I’ll do the math for you. The far plane would appear to be 2.5 meters behind the screen.

Filter Blog

By date:
By tag: