
ARM Mali Graphics


Le Tour de France is arguably the world’s greatest cycling road race and has been extremely successful on the streets of…well…France. That was until a few weeks ago when “Le Tour” came to the UK.


ARM® Mali™ Multimedia IP is widely regarded as the world’s best too and it has been incredibly successful in mobile, where we are the #1 shipping GPU for Android devices. However, the scalability of our multimedia IP means our partners can also target other markets beyond mobile with the same design.


The interesting thing for me about the Atmel license announced today is not just that they now have access to ARM Cortex®-A7, ARM's most energy efficient processor ever, and ARM’s video and display processors – it is the new and different types of markets Atmel will go on and address with the same ARM Mali IP which has done so well in the mobile market.


Atmel has traditionally not focused on mobile application processors. Instead, they have been very successful in a broad range of other low power markets.


Each ARM Mali GPU, Video and Display processor delivers best-in-class performance in the smallest area. It is this small area advantage that will allow Atmel to target a vast array of markets with a rich multimedia experience.


Atmel plans to use ARM's technology in wearables, toys and even industrial markets. Here, user expectation is that any screen-based device will deliver a similar experience to the latest tablet or smartphone with a smooth 3D user interface, video capture and playback functionality, all at HD resolution and incorporating secure features for protection of data and content. And critically, all of this in a low power budget.


The Mali IP Atmel has licensed includes system-wide power saving technology such as ARM Frame Buffer Compression (see Jem Davies' blog ARM Mali Display - the joined-up story continues), out-of-the-box support for ARM TrustZone® processor security technology and an optimized software solution delivering maximum efficiency all the way to the glass (see Dave Brown's blog Do Androids have nightmares of botched system integrations).


A few weeks ago, the organisers of the Tour de France brought the race to the UK, speeding right through my tiny little village on the outskirts of Cambridge. The event was a massive success in the UK with crowds surpassing anything seen outside Paris.


I look forward to seeing where Atmel takes ARM’s range of multimedia IP and the success they will see in some very exciting markets outside mobile.

Mali Graphics Debugger v1.3

We are pleased to announce that version 1.3 of Mali Graphics Debugger is finally available to everyone. This is the biggest update since we first released MGD, around a year ago. In this version we have introduced:

  • A new frame replay feature (see User Guide Section 5.2.10 for details).
  • Faster tracing and lower overheads with a new binary format.
  • Memory performance improvements (allows for longer traces and improved performance of the tool in general).
  • Enhancements to the GUI.

While we were developing these features we have also fixed a large number of bugs, and made the user experience better (see User Guide Section 9.1 for details).

Get the Mali Graphics Debugger

About Frame Replay

[Figure: Different draw modes in Mali Graphics Debugger]

Mali Graphics Debugger now has the ability to replay certain frames, depending on what calls were in that frame. To see if a frame can be replayed you must pause your application in MGD. Once paused, if a frame can be replayed, the replay button will be enabled. Clicking this button will cause MGD to reset the OpenGL ES state of your application back to how it was before the previous frame had been drawn. It will then play back the function calls in that frame. This feature can be combined with the Overdraw, Shadermap, and Fragment Count features. For example, you can pause your application in an interesting position, activate the Overdraw mode and then replay the frame. (See Section 5.2.10 in the User Guide for Frame Replay Limitations).


About the New Binary Format

Originally Mali Graphics Debugger was using a text based format to transfer all the data from the target device to the host. Now we’ve decided to switch to a binary format, using Google Protocol Buffers. This has made the interceptor lighter allowing faster transfers and lower overhead while capturing the application. Old text-based traces can still be read and they will be converted to the new binary format when saved again on disk.
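To give a feel for why a binary encoding is lighter than text, here is a minimal sketch of the base-128 varint encoding that Protocol Buffers uses for integers. The trace record layout, the function ID and the text line are hypothetical and are not taken from MGD's actual format; real Protocol Buffers messages also carry a field tag in front of each value, which this sketch leaves out.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protocol Buffers base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


# Hypothetical trace record: a function ID followed by three integer arguments.
FUNC_ID_DRAW_ARRAYS = 17  # made-up ID for illustration only
args = (4, 0, 36)

text_record = "glDrawArrays(4, 0, 36)\n".encode("ascii")
binary_record = encode_varint(FUNC_ID_DRAW_ARRAYS) + b"".join(
    encode_varint(a) for a in args
)

print(len(text_record), "bytes as text,", len(binary_record), "bytes as varints")
```

Small values, which dominate most API call arguments, fit in a single byte each, so there is less data to format on the target and less to send to the host.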



Mali Offline Shader Compiler v4.3

We are also happy to announce the release of a new version of the Mali GPU Offline Shader Compiler.

This version is capable of compiling OpenGL ES 3.0 shaders and supports the Mali-T600 series r4p0-00rel0 driver, together with all previous driver versions.

By popular demand, we’ve also re-introduced Mac OS X support, which adds to the already supported Windows and Linux versions.

Get the Mali GPU Offline Shader Compiler

UK Govt Backs Geomerics to Revolutionize the Movie Industry


Earlier this week, Geomerics announced that it has won a £1million award from the UK's Technology Strategy Board (TSB) to bring its real-time graphics rendering techniques from the gaming world to the big screen. Geomerics and its partners will help the film and television services industry become more efficient by decreasing the amount of time spent in rendering, particularly for lighting, which is one of the most time consuming parts of the editing process. Traditionally all editing was done offline and the sequences were then rendered to bring them to full quality, with the rendering taking 8-12 hours. However, the gaming world has developed techniques that allow full quality graphics sequences to be rendered instantly - and Geomerics is looking to bring them to the film world.


For more information on the technology behind the announcement, check out the Geomerics website.


Hardkernel Release the Odroid-XU3 Development Board


Based on Samsung's Exynos 5422 SoC with its ARM Mali-T628 MP6 GPU, this new development board from Hardkernel offers a heterogeneous multiprocessing solution with great 3D graphics. Thanks to its open source support, the board can run various flavours of Linux, including the latest Ubuntu 14.04, as well as Android 4.4.

Full details on the board are available on the Hardkernel website and it also got a great article on Linux Gizmos.


Today Was Clearly The Day of the Mali-450 MP GPU


Four devices were announced today featuring the Mali-450, two of which have the MediaTek MT6592 at their heart and two a HiSilicon SoC. The Mali-450 has been picking up momentum over the past six months and now we are starting to see it in a range of smartphones, such as the HTC Desire 616 and the Wickedleak Wammy Neo Youth - as well as tablets such as the HP Slate 7 VoiceTab Ultra and HP Slate 8 Plus.


Aricent on Architecting Video Software for Multi-core Heterogeneous Platforms

If you haven't caught it already, our GPU Compute partner Aricent posted a great blog on their section of the community, Parallel Computing: Architecting video software for multi-core heterogeneous platforms. It covers conventional techniques used by software designers to parallelize their code and then proposes the novel "hybrid and massive parallelism based multithreading" model as a potential way to overcome the shortcomings of spatial and functional splitting. It's definitely worth a read if you're interested in programming for multi-core platforms.

I have just returned from a fortnight spent hopping around Asia in support of a series of ARM hosted events we call the Multimedia Seminars, which took place in Seoul (27th June), Taipei (1st July) and Shenzhen (3rd July). Several hundred attendees joined in each location, a quality-dense cross-section from the local semiconductor and consumer industries, including many silicon vendors, OEMs, ODMs and ISVs. All of them were able to hear the great progress made by the ARM ecosystem partners who are developing the use of GPU Compute on ARM® Mali™ GPUs. In this blog I will try to summarise some of the highlights.

The Benefits of GPU Compute

In my presentation at the three sites I was able to illustrate the benefits of GPU Compute using Mali. This was an easy task as at the event many independent software vendors were demonstrating and promoting a vast selection of middleware ported and optimized for Mali.
But what are the benefits of GPU Compute?
  • Reduced power consumption. The architectural characteristics of the Mali-T600 and Mali-T700 series of GPUs enable computation of many parallel workloads much more efficiently than alternative processor solutions. GPU Compute accelerated applications can therefore benefit by consuming less energy, which translates into longer battery life.
  • Improved performance and user experience. Where raw performance is the target, the computation of heavy parallel workloads can also be significantly accelerated through the use of the GPU. This may translate into increased frame rate, or the ability to carry out more work in the same temporal/power budget, and can result in benefits such as improved UI responsiveness, more robust finger detection for gesture UIs in challenging lighting conditions, more accurate physics simulation, and the ability to apply complex pre-/post-processing effects to multimedia on-device and in real-time. In essence: a significantly improved end-user experience.
  • Portability, programmability, flexibility. Heterogeneous compute APIs such as OpenCL™ and RenderScript are designed for concurrency. They allow the developer to migrate some of the load from the CPU to the GPU or other accelerator, or to distribute it between processors in order to enable better load-balancing across system resources. For example, a video codec may offload motion vector calculations to the GPU, enabling the CPU to operate with fewer cores and at lower frequencies, or to be available to compute additional tasks, for example video analytics.
  • Reduction of cost, risk and time to market. System designers may be influenced by various cost, flexibility and portability concerns when considering migrating functionality from dedicated hardware accelerators to software solutions which leverage the CPU/GPU subsystem. This approach is made viable and compelling due to the additional computational power provided by the GPU, now exposed through industry standard heterogeneous compute APIs.

Over the last few years ARM has worked very hard to create and develop a strong GPU Compute ecosystem. Collaborations were established across geographies, use-cases and applications, working with partners at all levels of the value chain. These partners were able to translate the benefits of GPU Compute into reality, to the ultimate avail of the end users, and were proudly showcasing their progress at the Multimedia Seminars.

Demonstrating Reduced Power Consumption

Software codec vendors such as Ittiam Systems have been demonstrating for some time HEVC and VP9 optimized ports that make use of GPU Compute on Mali-T600 series GPUs. A software solution leveraging the CPU+GPU compute subsystem can be useful for reducing TTM, reducing risk in the adoption of new standards, but most importantly, it can help to save power.

For the first time ever, Ittiam Systems publicly demonstrated how a software solution leveraging the CPU+GPU compute subsystem is able to save power compared to a solution that does not make use of the GPU. Using an instrumented development board and power probing tools connected to a National Instruments DAQ unit, they were able to demonstrate a typical reduction in power consumption of over 30% for 1080p30 video playback.

Mukund Srinivasan, Director and General Manager of the Consumer and Mobility Business Unit at Ittiam, said: "These paradigm shifts open a unique window of opportunity for focused media-related Intellectual Property providers like Ittiam Systems® to offer highly differentiated solutions that are not only compute efficient but also enhance user experience by way of a longer battery life, thanks to offloading significant compute to the ARM Mali GPU. The main hurdle to cross, in these innovative solutions, comes in the form of how to manage the enhanced compute demands of a complex codec standard on mobile devices, since it bears significantly more complex coding tools, for codecs like H.265 or VP9, as compared to VP8 or H.264. In order to gain the maximum efficiencies offered by the GPU technology, collaborating with a longstanding partner and a technology pioneer like ARM enabled us to generate original solutions to the complex problems posed in the design and implementation of consumer electronic systems. Working closely with ARM, we have been able to showcase not just a prototype or a demo, but a real working product delivering significant power savings, of the order of 30-35% improvement in energy efficiencies, when measured on the whole subsystem, using the ARM Mali GPU on board a silicon chip designed for use in the Mobile market"





Demonstrating reduced CPU Load

Another benefit of GPU Compute was illustrated by our partner Thundersoft, who have implemented a gender-based real-time face beautifier application and improved its performance using RenderScript on a Mali-T600 GPU. The algorithm first detects the subject’s face and determines its gender, and based on the gender applies a chain of complex image processing filters that enhances the subject’s appearance. These include face whitening, skin tone softening and de-blemishing effects.


The algorithm is very computationally intensive and very taxing on CPU resources, which can at times result in poor responsiveness. Furthermore, a performance of 20+ FPS is required in order to deliver a good user experience and this is not achievable on the CPU alone. Fortunately, the heavy level of data parallelism, and large proportion of floating point and SIMD-friendly operations, make this use case great for GPU acceleration. Using RenderScript on Mali, Thundersoft were able to improve the performance from a poor 10fps to over 20fps, and at the same time reduce the CPU load from fluctuating between 70-100% to a consistent < 40%. Dynamic power reduction techniques are therefore able to disable CPU cores or scale down their operating points in order to save power.

Delivering improved Performance and User Experience

Image processing is proving to be a very fertile area for GPU Compute processing. In their keynote and technical speech, ArcSoft illustrated how they utilised Mali GPUs to improve many of their algorithms including JPEG, photo filters, beautification, Video HDR (NightHawk), and HEVC.
A “nostalgia effect” filter, based on convolution, was optimized using OpenGL® ES. For a 1920x1080 camera preview image, the rendering time was reduced from 80ms down to 20ms using the Mali GPU. This means going from 12.5fps to 50fps.
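The frame rate figures quoted for the nostalgia filter follow directly from the per-frame rendering time. A minimal sketch of that arithmetic, assuming the filter is the only per-frame cost (the function name is illustrative):

```python
def frames_per_second(frame_time_ms: float) -> float:
    """Convert a per-frame processing time in milliseconds to frames per second."""
    if frame_time_ms <= 0:
        raise ValueError("frame time must be positive")
    return 1000.0 / frame_time_ms


before = frames_per_second(80.0)  # rendering time before GPU optimization
after = frames_per_second(20.0)   # rendering time on the Mali GPU
print(f"{before:.1f} fps -> {after:.1f} fps ({after / before:.0f}x)")
```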
Another application that benefited is ArcSoft’s implementation of a face beautifier. The camera stream was processed by the CPU, whilst colour-conversion and rendering was moved from the CPU to the GPU. Processing time for a 1920x1080 frame was therefore reduced from 30ms to just 10ms. In practice this meant that the face beautification frame rate was improved from 16fps to 26fps!
Another great example is JPEG processing. OpenCL was used for reconstructing inverse quantization and IDCT modules. Compared with ArcSoft’s Neon based JPEG decoder, performance in decoding 4000x3000 resolution images improved by 25%. Compared with the OpenCL based open-source project JPEG-OpenCL, the efficiency of IDCT increased as much as 15 times.

Improved User Experience for Computer Vision Applications

You may have previously seen our partner eyeSight Technologies demonstrate how they have been able to improve the robustness and reliability of their gesture UI engine. Gesture UIs are particularly challenged when lighting conditions are poor, as this adds a lot of noise to the sensor data and reduces the accuracy of gesture detection. As it happens, poorly lit situations are common where gesture UIs are typically used, such as inside a car or in a living room. GPU Compute significantly increases the amount of useful computation that the gesture engine can carry out on the image within the same temporal and energy budget. This enables a significant improvement in the reliability of gestures when lighting is poor.

eyeSight’s machine vision algorithms make extensive use of machine learning (using neural networks). The capability to learn from a vast amount of data in a reasonable amount of time is a key element for success. However, the computational resources required by neural networks are beyond the capabilities of standard CPUs. eyeSight’s utilization of deep learning methods can greatly benefit from running on GPU processors.
eyeSight have used their extensive knowledge of machine vision technologies and the ARM Mali GPU Compute architecture to optimize their solutions using OpenCL on Mali.

Alva demonstrate video HDR and real-time video stabilization

In his presentation, Oscar Xiao, CEO of Alva Systems, discussed the value of heterogeneous computing for camera based applications, using two examples: real-time video stabilization and HDR photography. Alva optimized their solutions for Mali-400, Mali-450 and Mali-T628. Implementations of their algorithms are available using OpenGL ES and OpenCL APIs. Through the use of the GPU, image stabilization can be carried out comfortably for 1080p video streams at 30fps and above. Alva Systems have also implemented an advanced HDR solution that corrects image distortion (common in multi-frame processing algorithms), removes ghosting and carries out intelligent tone mapping (to enable a more realistic result). Of course all of these features increase the computational requirements of the algorithm. GPU Compute enables real-time computation. Alva were able to measure performance improvement of individual blocks of around 13-15x compared to the reference CPU implementation of the same algorithm.

In Conclusion

Modern compute APIs enable efficient and portable heterogeneous computing. This includes enabling the use of the best processor for the task, the ability to balance workloads across system resources and to offload heavy parallel computation to the GPU. GPU Compute with ARM Mali brings tangible advantages for real world applications, including reduced cost and time to market, improved performance and user experience, and improved energy efficiency (measured on consumer devices). These benefits are being enabled by our ecosystem partners who use GPU Compute on Mali for a variety of applications including: advanced imaging, computer vision, computational photography and media codecs.



Industry leaders take advantage of the capabilities of ARM Mali GPUs to innovate and deliver - be one of them!

It's been a busy time here at Mali-central recently and I have been too busy even to blog about it.


I did an Ask The Experts slot with AnandTech recently, and faced some very interesting questions from the public. You might like to take a look.

We also bared our soul and Ryan Smith wrote a very detailed article about the Mali Midgard architecture.


Then, finally, I faced Anand Shimpi from AnandTech and did a live interview as a Google hangout. The whole thing was recorded and put up on YouTube. You can see it here. I know what I meant to say, but what came out of my mouth of course didn't always match that. Oh well...



After a little hiatus in these blogs while I took a trip to Scotland, these blogs are back on the road! So, what has happened in the ARM Mali world this week?

ARM and Geomerics recognized in the Develop 100 Tech List


Develop have published the ultimate list of tech that will influence the future of gaming. Including a multitude of upcoming platforms and the tools, engines and middleware required to make great games run excellently, the list is well worth a read. The ARM-based Raspberry Pi took pride of place in the top spot, followed shortly afterwards by Geomerics' Enlighten technology at #8, which has recently been integrated into Unreal Engine 4 and Unity 5 (#2 and #3 in the Tech List respectively). It was great to see the Mali Developer Center at #33 being recognized for its efforts to help developers fully utilize the opportunities the mobile market offers through its broad range of developer tools, supporting documentation and broad collaboration across the gaming industry.

ARM release 64-bit hardware development platform, "Juno"


Following the announcement of 64-bit Android at Google IO, this week Linaro announced a port of the Android Open Source Project (AOSP) to the ARMv8-A architecture. At the same time, the Juno development board was released, sporting a dual-core ARM Cortex-A57 CPU and quad-core Cortex-A53 in big.LITTLE configuration, plus a quad core ARM Mali-T624 GPU for 3D graphics acceleration and GPU Compute support. Altogether, this provides the ARM Ecosystem with a strong foundation on which we can accelerate Android availability on 64-bit silicon and drive the next generation of Android mobile experiences.


Ask the experts: Jem Davies answers your questions on AnandTech


The CPU folk did their "Ask the Experts" a while back and now it's the turn of the GPU to have the spotlight! This week ARM's Jem Davies has been answering your questions on AnandTech with topics ranging from mobile versus desktop graphics features to GPU Compute and the future of graphics APIs. Jem is also starring in a Google Hangout on AnandTech next week, Monday 7 July - tune in for what is set to be an informative and detailed debate.


Good luck to all teams competing in the Brains Eden Game Jam this weekend!


Following our "Developing for Mobile" Workshop at Brains Eden, the teams in Cambridge will set to tomorrow to develop the best game of the weekend. With the importance of mobile gaming growing rapidly, all teams have been given access to a Google Nexus 10 and the expertise of ARM experts who are present throughout the event to guide students in how to get the best performance from these devices. Details of how they get along will be reported next week!

Hi, all


AnandTech has posted an article, ARM’s Mali Midgard Architecture Explored. You can read it through the following link.


AnandTech | ARM’s Mali Midgard Architecture Explored

This week saw the exciting launch of the latest ARM® Mali-T628 MP4 GPU based product, the Huawei Honor 6. With its HiSilicon Kirin 920 processor based on ARM big.LITTLE processing technology and promising graphics benchmark scores, this latest offering from Huawei is making huge ripples in the Chinese smartphone market.


An impressive array of chips has been coming out of China in recent months. From the Rockchip RK3288 featuring the Mali-T760 through to the MediaTek MT6732, Chinese semiconductor companies are meeting the growing domestic demands for high performing yet cost efficient smartphones head on. Over the past two years smartphone shipments in China have nearly quadrupled and the market is starting to mature. Like the majority of smartphone markets, it now has two distinct ends. While 1 billion consumers in China have a phone, only about 40% of these are smartphone owners, largely due to budget constraints. This leaves a potential market of 600 million customers for the OEMs who can deliver a desirable yet cost-effective device. At the other end of the scale, and where the latest release from Huawei falls, we have those who desire a premium superphone whose user experience sets the standard for the industry. While competition is strong from international products such as Apple’s iPhone 5S and Samsung’s Galaxy Note 3, shipments from local suppliers are increasing rapidly as their offerings become increasingly competitive with the combined benefit of lower prices.


HiSilicon’s latest chip, the Kirin 920, is one that can comfortably rival these competitors. With a quad-core ARM Cortex®-A15 and quad-core Cortex-A7 in a big.LITTLE processor configuration it offers both high performance for more intensive workloads and energy efficiency for day to day tasks – a truly heterogeneous approach. Combined with the Mali-T628 MP4, the processor is capable of not only driving a stunning 2560x1600 resolution display, 4K video capture and playback and a 13MP camera with HDR support, but also achieving some impressive early results in the AnTuTu benchmark as highlighted in the launch presentation.




On top of it all, it is the first device worldwide to support LTE Cat6, with its maximum, super-fast download speed of 300Mbps.


Devices such as the Huawei Honor 6 will accelerate the expansion of the Android Gaming and GPU Compute ecosystems in China. The gaming experience on these devices with such high levels of processing technology promises to be incredible and as they succeed in reaching the hands of more consumers it will encourage developers to create even more graphically challenging, visually impressive applications to enrich the user experience further. The Huawei Honor 6 is yet another proof point that the Chinese marketplace is the one to be watching.

Modern games and applications really push the boundaries of real-time graphics and user interfaces on mobile and to do this they need all the components of the system to work together to provide the performance those apps need.


But it’s not all about performance; not only do mobile users demand desktop equivalent features, they want it at desktop equivalent quality too! It’s not just enough to push lots of pixels around, they need to be high quality pixels! Don't get me wrong, better performance allows developers to make use of advanced shader techniques to add high quality visual special effects, more detailed geometry in their 3D scenes and more animated objects, such as particles for simulating explosions and weather. However, there are things other than performance that can influence the visual quality in your latest apps and games.


Last week Samsung announced their new Galaxy Tab S with AMOLED display. This is great for users; its vibrant colours and thin design really help improve the user experience. But great displays need great images to start with! This is where the ARM® Mali™ GPU comes in; it accelerates the rendering of apps and games on your mobile. And, the new Galaxy Tab S just happens to have our current flagship GPU, the Mali-T628 MP6.


The Mali-T628 MP6 GPU contains lots of features that help improve the quality of the images, especially real-time 3D graphics used in high-end games. For the techies out there, take ETC2 supported in OpenGL ES 3.0 for example. ETC2 allows the compression of images that contain an opacity component, allowing for higher quality foliage in games. And, how about Adaptive Scalable Texture Compression (ASTC) - the texture compression format designed by ARM and adopted across the industry. It enables high quality compression of images with a much wider range of supported formats. Textures in games can now be much higher quality while retaining small file sizes (which you’ll know about if you’ve ever had to wait while your favourite game downloads to your phone!)
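For a rough sense of the file size trade-off ASTC offers, here is a small sketch of the arithmetic. It relies on one property of the format: every ASTC block occupies 128 bits regardless of its pixel footprint, so a larger footprint means fewer bits per pixel. The texture dimensions and the selection of footprints below are just examples.

```python
ASTC_BLOCK_BITS = 128  # every ASTC block is 128 bits, whatever its footprint


def astc_bits_per_pixel(block_w: int, block_h: int) -> float:
    """Bit rate of a 2D ASTC texture for a given block footprint."""
    return ASTC_BLOCK_BITS / (block_w * block_h)


def astc_size_bytes(width: int, height: int, block_w: int, block_h: int) -> int:
    """Compressed size of one mip level; partial blocks are rounded up."""
    blocks_x = -(-width // block_w)   # ceiling division
    blocks_y = -(-height // block_h)
    return blocks_x * blocks_y * ASTC_BLOCK_BITS // 8


uncompressed = 1024 * 1024 * 4  # 1024x1024 RGBA at 8 bits per channel
for footprint in [(4, 4), (6, 6), (8, 8), (12, 12)]:
    size = astc_size_bytes(1024, 1024, *footprint)
    bpp = astc_bits_per_pixel(*footprint)
    print(f"{footprint[0]}x{footprint[1]}: {bpp:.2f} bpp, "
          f"{size} bytes ({uncompressed / size:.1f}x smaller)")
```

Choosing the footprint per texture lets a developer trade quality against size, for example a small footprint for a detailed character texture and a large one for a soft background.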


Texture compression allows us to save memory bandwidth and further improve visual quality by employing higher resolution textures. Keeping with the theme of bandwidth savings, there’s also ARM’s proprietary Transaction Elimination technology. With no extra effort from the developer (it all happens automatically in the background), bandwidth savings can be made by only updating areas of the screen that have actually changed; again, bandwidth resource that the developer can employ elsewhere to make further improvements to visual quality.
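The idea of only writing out the parts of the screen that changed can be sketched in software as a per-tile signature comparison. This is an illustration of the principle rather than a description of the hardware: the 16x16 tile size, the use of CRC-32 and the single-byte pixels are all assumptions made for the example.

```python
import zlib

TILE = 16  # tile edge in pixels (assumed for this sketch)


def tile_signatures(frame: bytes, width: int, height: int) -> dict:
    """Return {(tile_x, tile_y): crc} for a row-major frame of 1-byte pixels."""
    sigs = {}
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            crc = 0
            for y in range(ty, min(ty + TILE, height)):
                start = y * width + tx
                end = start + min(TILE, width - tx)
                crc = zlib.crc32(frame[start:end], crc)  # running CRC over tile rows
            sigs[(tx // TILE, ty // TILE)] = crc
    return sigs


def tiles_to_write(prev_sigs: dict, new_sigs: dict) -> list:
    """Tiles whose signature differs from the previous frame must be written out."""
    return [tile for tile, sig in new_sigs.items() if prev_sigs.get(tile) != sig]


width, height = 64, 32
frame_a = bytearray(width * height)   # previous frame: all zeros
frame_b = bytearray(frame_a)
frame_b[5 * width + 40] = 255         # change one pixel at (x=40, y=5)

changed = tiles_to_write(
    tile_signatures(bytes(frame_a), width, height),
    tile_signatures(bytes(frame_b), width, height),
)
total = (width // TILE) * (height // TILE)
print(f"{len(changed)} of {total} tiles need writing: {changed}")
```

In this example only the tile containing the changed pixel is written, and the memory traffic for the other seven is skipped.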


Anti-aliasing is another technology that ARM has always employed to improve the visual quality of the images you see in your games and apps. Even at high resolutions, aliasing can be an issue but the Mali range of GPUs can perform anti-aliasing with minimal impact; all developers need to do is turn it on!


We touched on OpenGL ES 3.0 earlier and for my last point, I’d like to mention it again. With OpenGL ES 3.0 that is in devices now, developers can make use of higher dynamic range formats, both for textures and render targets. This means source textures and rendered scenes can make use of the extra colour information that HDR techniques provide.


For a long time now, ARM has driven innovation and visual quality in the graphics industry and the future is no exception. Coming soon we will have ARM Framebuffer Compression (AFBC) and Smart Composition; both technologies help reduce memory bandwidth, allowing developers the freedom to improve those pixels even more!


To create compelling picture-perfect visual experiences, developers don’t just throw pixels on the screen, lots of hard work goes into every tiny detail, every leaf on a tree, every curve on a super car and every scar on an action hero’s chin. They all look great on a high quality display but when you combine that with the performance and quality of Mali GPUs that’s when the images really begin to pop!

The most notable addition to OpenGL® ES when version 3.1 was announced at GDC earlier this year was Compute Shaders. Whilst similar to vertex and fragment shaders, Compute Shaders allow much more general-purpose data access and computation. These have been available on desktop OpenGL® since version 4.3 in mid-2012, but it’s the first time they’ve been available in the mobile API. This brings another player to the compute-on-mobile-GPU game, joining the ranks of OpenCL, RenderScript and others. So what do these APIs do and when should you use them? I’ll attempt to answer these questions in this blog.


When it comes to programming the GPU for non-graphics related jobs, the various tools at our disposal share a common goal: to provide an interface between the GPU and CPU so that packets of work to be executed in parallel can be applied to the GPU’s compute resources. Designing tools that are flexible enough to do this and that allow the individual strengths of the GPU’s architecture to be exploited is a complex process. The strength of the GPU is to run small tasks on a wide range of data as far as possible in parallel, often many millions of times. This is after all what a GPU does when processing pixels. Compute on the GPU is just generalizing this capability. So inevitably there are some similarities in how these tools do what they do.


Let’s take a look at the main options…




Initially developed by Apple and subsequently managed by the Khronos Group, the OpenCL specification was released in late 2008. OpenCL is a flexible framework that can target many types of processor, from CPUs and GPUs to DSPs. To do so you need a conformant OpenCL driver for the processor you’re targeting. Once you have that, a properly written OpenCL application will be compatible with other suitably conformant platforms.


When I said OpenCL is flexible, I was perhaps understating it. Based on a variant of C99, it is very flexible, allowing complex algorithms to be shaped across a wide variety of parallel computing architectures. And it has become very widespread – there are drivers available for hundreds of platforms. See this list for the products that have passed the Khronos conformance tests. ARM supports OpenCL with its family of ARM® Mali™ GPUs. For example, Mali-T604 passed conformance in 2012.


So is there a price for all this flexibility?  Well, it can be reasonably complex to set up an OpenCL job… and there can be quite an overhead in doing so. The API breaks down access to OpenCL-compatible devices into a hierarchy of sub units.


Figure: the OpenCL execution model.


So the host computer can in theory have any number of OpenCL devices. Each of these can have any number of compute units and in turn, each of these compute units can have any number of processing elements. OpenCL workgroups – collections of individual threads called work items – run on these processing elements. How all of this is implemented is platform dependent as long as the end result is compliant with the OpenCL standard. As a result, the boilerplate code to set up access to OpenCL devices has to be very flexible to allow for so many potential variations, and this can seem significant, even for a minimal OpenCL application.
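To make the workgroup and work item relationship a little more concrete, here is a minimal CPU-side sketch in Python. It does not call any OpenCL API; it only mimics how a 1D range of work is split into workgroups and how each work item finds its element. The function name and the sizes are illustrative assumptions, and the sketch assumes a zero global offset and a global size that is a whole multiple of the workgroup size.

```python
# Sketch of how a 1D range of work is split into workgroups and work items.
# The sizes used below are made-up illustrative values, not from a real device.

def enumerate_work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for every work item."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    num_groups = global_size // local_size
    for group_id in range(num_groups):
        for local_id in range(local_size):
            # Mirrors the relationship between OpenCL's get_group_id(),
            # get_local_id() and get_global_id() when the global offset is zero.
            global_id = group_id * local_size + local_id
            yield group_id, local_id, global_id

items = list(enumerate_work_items(global_size=8, local_size=4))
for group_id, local_id, global_id in items:
    print(f"workgroup {group_id}, work item {local_id} -> element {global_id}")
```

On a real device the workgroups would be spread across compute units and run in parallel; the nested loops here are just a serial stand-in for that.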


There are some great samples and a tutorial available in the ARM Mali OpenCL SDK, with a mix of basic through to more complex examples.


From the earliest days of OpenCL targeting mobile GPUs the API has shown great promise, both in terms of accelerating performance and in reduced energy consumption.  Many early projects have concentrated on image and video processing. For an example, see this great write-up of the latest software VP9 decoder from Ittiam.



For more examples of some of the developments using OpenCL on mobile, check out Mucho GPU Compute, amigo! from Roberto Mijat.


One of the real benefits of OpenCL, as well as its flexibility, is the huge range of research and developer activity surrounding the API. There are a large number of other languages – more than 70 at the last count – that compile down to OpenCL, easing its use and allowing its benefits to be harnessed in a more familiar environment. And there are several CL libraries and numerous frameworks exposing the OpenCL API from a wide range of languages. PyOpenCL, for example, provides access to OpenCL via Python. See Anton Lokhmotov's blog on this subject Introducing PyOpenCL.

Because of the required setup and overhead, building an OpenCL job into a pipeline is usually only worth doing when the job is big enough, at the point where this overhead becomes insignificant against the work being done. A great example of this was Ittiam Systems’ recent optimisation of their HEVC and VP9 software video decoders. As not all of the algorithm was suitable for the GPU, Ittiam had to choose how to split the workload between the CPU and GPU. They identified the motion estimation part of the algorithm as being the most likely to present enough parallel computational work to benefit from running on the GPU. The algorithm as a whole is then implemented as a heterogeneous split between the CPU and GPU, with the resulting benefits of reduced CPU workload and reduced overall power usage. See this link for more about Ittiam Systems.  Like most APIs targeting a wide range of architectures, optimisations you make for one platform might need to be tweaked on another, but having the flexibility to address the low level features of a platform to take full advantage of it is one of OpenCL’s real strengths.
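The "big enough" argument can be sketched as a toy cost model: a fixed setup cost plus a per-item cost on the GPU, compared against a per-item cost on the CPU. The function names and the numbers below are hypothetical placeholders for illustration only; real figures have to be measured on the target platform.

```python
# Toy cost model for deciding whether offloading a job to the GPU pays off.
# All timings are hypothetical and share one arbitrary time unit.

def offload_is_worthwhile(n_items, setup, gpu_per_item, cpu_per_item):
    """Return True when fixed setup plus GPU time beats CPU-only time."""
    gpu_total = setup + n_items * gpu_per_item
    cpu_total = n_items * cpu_per_item
    return gpu_total < cpu_total

def break_even_items(setup, gpu_per_item, cpu_per_item):
    """Smallest job size at which the GPU path wins, or None if it never does."""
    saving_per_item = cpu_per_item - gpu_per_item
    if saving_per_item <= 0:
        return None
    # setup + n * gpu < n * cpu  <=>  n > setup / saving_per_item
    return int(setup / saving_per_item) + 1

# Hypothetical example: 2000 units of setup, 1 unit per item on the GPU,
# 5 units per item on the CPU.
threshold = break_even_items(setup=2000, gpu_per_item=1, cpu_per_item=5)
print(threshold)
```

Below the threshold the setup cost dominates and the CPU path is the better choice; above it, the per-item saving takes over.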


Recent Developments

It’s been a busy year so far for Khronos and OpenCL – there have been a number of developments.  Of particular note perhaps is the announcement of version 1.0 of WebCL™, an API that does for OpenCL what WebGL™ does for OpenGL ES by exposing the compute API to JavaScript and bringing compute access into the world of the browser. Of course, support within browsers may take some time – as it did for WebGL – but it’s a sign of OpenCL broadening its appeal.

OpenCL Summary

OpenCL provides an industry standard API that allows the developer to optimise for a supporting platform’s low level architectural features. To help you get going there is a large and growing number of developer resources from a very active community. If the platform you’re planning to develop for supports it, OpenCL can be a powerful tool.





RenderScript is a proprietary compute API developed by Google. It’s been an official part of Android™ OS since the Honeycomb release in July 2011. Back then it was intended as both a graphics and a compute API, but the graphics part has since been deprecated. There are several similarities with OpenCL… it’s based on C99, has the same concept of organising data into 1, 2 or 3 dimensions etc. For a quick primer on RenderScript, see GPU Computing in Android? With ARM Mali-T604 & RenderScript Compute You Can!  by Roberto Mijat or Google’s introduction to RenderScript here.


The process of developing for RenderScript is relatively straightforward. You write your RenderScript C99-based code alongside the Java that makes up the rest of your Android application. The Android SDK creates some additional Java glue to link the two together, and compiles the RenderScripts themselves into bitcode, an intermediate, device-independent format that is bundled with the APK. When the application runs, Android will determine what RenderScript devices are available and capable of running the bitcode in question. This might be a GPU (e.g. Mali-T604) or a DSP.  If one is found, the bitcode is passed onto a driver that creates appropriate machine-level code. If there is no suitable device, Android will default back to running the RenderScript on the CPU.


In this way RenderScript is guaranteed to run on just about any Android device, and even with fallback to the CPU it can provide a useful level of acceleration. So if you are specifically looking for compute acceleration in Android, RenderScript is a great tool.



The very first device with GPU-accelerated RenderScript was Google’s Nexus 10, which used an SoC featuring an ARM Mali-T604 GPU. Early examples of RenderScript applications have shown a significant benefit from using accelerated GPU compute.


As a relatively young API, RenderScript knowhow and examples are not as easy to come by compared to OpenCL, but this is likely to increase. There’s more detail about how to use RenderScript here.


RenderScript Summary

RenderScript is a great way to benefit from accelerated compute in the vast majority of Android devices. Whether this compute is put onto the GPU or not will depend on the device and availability of RenderScript GPU drivers, but even when that isn’t the case there should still be some benefit from running RenderScripts on the CPU. It’s a higher-level API than OpenCL, with fewer configuration options, and as such can be easier to get to grips with, particularly as RenderScript development is streamlined into the existing Android SDK. If you have this setup, you already have all the tools you need to get going.


Compute Shaders


So to the new kid on the block, OpenGL ES 3.1 compute shaders. If you’re used to using vertex and fragment shaders already with OpenGL ES, you’ll fit right in with compute shaders. They’re written in GLSL (OpenGL Shading Language) in pretty much the same way with similar status, uniforms and other properties and have access to many of the same types of data including textures, image types, atomic counters and so on. However, unlike vertex and fragment shaders they’re not built into the same program object and as such are not a part of the same rendering pipeline.


Compute shaders introduce a new general-purpose form of data buffer, the Shader Storage Buffer Object, and mirror the ideas of work items and workgroups used in OpenCL and RenderScript. Other additions to GLSL allow work items to identify their position in the data set being processed and allow the programmer to specify the size and shape of the workgroups.


You might typically use a compute shader in advance of the main rendering pipeline, using the shader’s output as another input to the vertex or fragment stages.


Figure: a compute shader feeding the rendering pipeline.


Though not part of a rendering pipeline, compute shaders are typically used to support one. They’re not as well suited to general purpose compute work as OpenCL or RenderScript - but assuming your use-case is suitable, compute shaders offer an easy way to support access to general purpose computing on the GPU.


For a great introduction to Compute Shaders, do see Sylwester Bala's recent blog Get started with compute shaders.



Compute Shaders Summary

Compute shaders are coming! How quickly depends on the roll-out and adoption of OpenGL ES 3.1, but there’s every chance this technology will find its way into a very wide range of devices as mobile GPUs capable of supporting OpenGL ES 3.1 filter down into the mid-range market over the next couple of years. The same thing happened with the move from OpenGL ES 1.1 to 2.0… nowadays you’d be hard pushed to find a phone or tablet that doesn’t support 2.0.  Relative ease of use combined with growing ubiquity across multiple platforms could just be a winning combination.


See Plout Galatsopoulos' blog on ARM's recent submission for OpenGL ES 3.1 conformance for the Mali-T604, Mali-T628 and Mali-T760 GPUs - and for a great introduction to OpenGL ES 3.1 as a whole, do check out Tom Olson's blog Here comes OpenGL® ES 3.1!



One more thing…


So that’s it.  But as Columbo would say… just one more thing…


OpenGL ES 2.0 Fragment Shaders and Frame Buffer Objects

Although not seen as a power compute user’s weapon of choice, fragment shaders have for a long time been used to run some level of general compute - and they do offer one benefit unique amongst all the main approaches here: ubiquity. Any OpenGL ES 2.0-capable GPU – and that really is just about every smart device out there today – can run fragment shaders. This approach involves thinking of texture maps not necessarily as arrays of texels, but just as a 1D or 2D array of data. As long as the data to be read and written by the shader can be represented by supported texture formats, these values can be sampled and written out for each element in the array. You just set up a Frame Buffer Object and typically render a quad (two triangles making a rectangle) into it, using one or more of these data arrays as texture sources. The fragment shader can then compute more or less whatever it wants from the data in these textures, and output the computed result to the FBO. The resulting texture can then be used as a source for any other fragment shaders in the rendering pipeline.
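The "texture as data array" idea can be mimicked on the CPU with a few lines of Python. This is only a sketch of the data flow and makes no OpenGL ES calls: run_fragment_pass is a hypothetical stand-in for rendering a quad into a Frame Buffer Object, and add_and_scale plays the role of the fragment shader.

```python
# CPU-side sketch of fragment-shader style compute. Nested lists stand in
# for textures; every name here is illustrative, not part of any real API.

def run_fragment_pass(width, height, textures, fragment_fn):
    """Invoke fragment_fn once per output texel and collect a new 'texture'."""
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            output[y][x] = fragment_fn(x, y, textures)
    return output

def add_and_scale(x, y, textures):
    """Example 'shader': sample two source textures at (x, y) and average them."""
    a, b = textures
    return 0.5 * (a[y][x] + b[y][x])

tex_a = [[1.0, 2.0], [3.0, 4.0]]
tex_b = [[3.0, 2.0], [1.0, 0.0]]
result = run_fragment_pass(2, 2, (tex_a, tex_b), add_and_scale)
print(result)
```

The returned array could be passed as one of the textures to a further call of run_fragment_pass, which corresponds to using the FBO's texture as a source for a later shader.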



In this blog I’ve looked at OpenCL, RenderScript, Compute Shaders and fragment shaders as several options for using the GPU for non-graphical compute workloads. Each approach has characteristics that will suit certain applications or developers’ requirements, and all of these tools can be leveraged to both improve performance and reduce energy consumption.  It’s worth noting that the story doesn’t stop here. The world of embedded and mobile heterogeneous computing is evolving fast. The good news is that the Mali GPU architecture is designed to support the latest leading compute APIs, enabling all our customers to achieve improved performance and energy efficiency on a variety of platforms and operating systems.

The octa-core Hisilicon Kirin 920 chipset goes official


Today, Hisilicon launched an impressive new SoC housing four ARM® Cortex®-A15 CPUs clocked between 1.7 and 2.0GHz and four Cortex-A7 cores clocked between 1.3 and 1.6GHz in a big.LITTLE configuration. Designed for the high-end superphone market this latest SoC promises to offer an incredible user experience thanks to its powerful quad-core Mali-T628 GPU that is capable of breathtaking graphical displays, 3D gaming, visual computing, augmented reality, procedural texture generation and voice recognition.


GSM Arena covers the news in this article.

Mediatek announce the MT8127 SoC for quad-core tablets


At the end of last week, MediaTek announced a new chip specially designed to bring advanced multimedia features, outstanding performance and low power consumption to the super-mid market at an agreeable pricepoint. The MT8127 SoC features a quad-core ARM Cortex-A7 processor clocked at 1.5GHz along with a quad-core ARM Mali-450 GPU to enable seamless Full HD video playback. The announcement also included information on a future MT8127-powered device, the ALCATEL ONETOUCH PIXI 8 tablet.


The full press release is available on Mediatek's website.


ARM submits conformance for OpenGL ES 3.1


Also today, the Khronos Group finalized the conformance criteria for the latest version of the OpenGL® ES API, OpenGL ES 3.1. ARM has already submitted conformance for three of its GPUs: the ARM Mali-T604, Mali-T628 and Mali-T760. For full information on this announcement, read Plout Galatsopoulos' blog ARM submits conformance for OpenGL ES 3.1.


Any new devices launched?


This week saw the launch of the Xolo Q1200 smartphone, the next in the series from the Indian smartphone brand. It comes with some cool new apps including gesture controls, voice recognition, float task with dual window feature, cold access apps, and smart reading mode, all powered by a 1.3 GHz quad-core MediaTek MT6582 processor with a dual-core Mali-400 MP GPU.


More information on the launch can be found in this article.

This year at GDC Khronos announced the latest version of the OpenGL® ES API. OpenGL ES 3.1 is taking a step up from OpenGL ES 3.0 to enable new, fascinating mobile graphics content. With headline features such as compute shaders and indirect drawing, which Tom Olson, chair of the OpenGL ES Working Group, describes in detail in this very interesting blog Here comes OpenGL® ES 3.1!, application developers can now use this new API to deliver an even higher quality of graphics within the power constraints of mobile platforms. Here at ARM, we are fully committed to enabling our GPUs with the latest graphics and GPU Compute APIs as soon as possible. Today, Khronos finalised the conformance criteria less than three months after the official OpenGL ES 3.1 announcement and ARM is submitting for OpenGL ES conformance.


Conformance has just been submitted for the highly successful and market proven Mali-T604 and Mali-T628 GPUs as well as for the latest released high-end GPU, the Mali-T760. The first two power the graphics capabilities of bestseller products such as, but not limited to, the Samsung Galaxy S5, Galaxy Note 3, Google Nexus 10 and Galaxy Note Pro 12.2, while the Mali-T760 is expected to become available in commercial products within the next few months. Conformance will soon be submitted for our latest mid-range GPU, the Mali-T720 as well.


One of the key features of OpenGL ES 3.1 is the support for compute shaders. Developers can now use the compute capabilities of the GPU without having to use a different compute API and worry about the interoperability between graphics and compute. Seamlessly integrated in a single API, compute shaders can post-process the frame buffer output and implement astonishing visual effects with higher efficiency and lower complexity. It is also worth mentioning here that ARM has adopted GPU Compute from its very first steps and is creating a vibrant ecosystem of developers who are providing a number of innovative applications for Mali GPUs and establishing them as the de facto architecture for mobile GPU Compute.


A very good example of the life-like effects that can be implemented using the horsepower of OpenGL ES 3.1 running on Mali GPUs can be seen in the video below. In this demo, you can see the advanced physics simulation reflected in the motion of a hanging piece of cloth that gets blown by various shaped objects:



For interested readers, there is a blog Get started with compute shaders, which provides a complete background to this demo written by Sylwester Bala, one of ARM’s Senior Demo Developers.


OpenGL ES 3.1 is backwards compatible with OpenGL ES 3.0 and 2.0, making sure that the developer’s investment is protected, while a new set of features is provided, such as enhanced texturing functionality that includes texture gather, multisample textures and stencil textures. Texture gather allows faster access to neighbouring texels while texture multisampling and stencil textures allow applications the same flexibility in texture processing as in render targets. These extra texture processing features enable crystal clear graphics to be smoothly displayed on a high resolution screen much more efficiently, which means longer battery life for mobile devices without any compromises in quality. Moreover, the enhanced shading language provides more built in functions to the developers, making their life simpler and increasing their productivity.


ARM is one of the first Khronos members to submit conformance for OpenGL ES 3.1 and we are dedicated to supporting our customers and ecosystem partners with the latest and greatest features that graphics technology has to offer. The power and flexibility of the Midgard architecture ensure our partners and developer ecosystem are always enabled with cutting edge technology that delivers best in class graphics within the tight power and area budget required for mobile devices.

Human Machine Interfaces (HMIs) are an incredibly important part of consumer electronics.  A machine that is clumsy and unintuitive to interact with will rarely be a great success. As a result of this, if you look back over the past ten to fifteen years, it’s been one of the leading areas of innovation for the industry. Phones have evolved from having area-consuming PC-like keyboards to soft-control touchscreens and this change enabled larger screens and a better multimedia experience on pocket-sized mass-market mobile devices. In fact, the simplicity of the touchscreen has become so popular that it’s been adapted into many other areas where an easy to use natural user interface (NUI) is key, such as automotive multimedia systems or advanced medical applications.


In its most simple form it is easy to define HMI as “how a user interacts with their device”. However, innovations in HMI have much deeper influences than that.  We can see that advancements in the area of HMI have not only changed how we interact with devices, but also what those devices do for us. It has been a key influencer in unlocking new functionality within our devices and what they mean in our lives. Phones are no longer used simply as a means of interchanging messages. We can now monitor our health, check the weather, play a vast variety of games on them, draw and edit pictures, or surf the internet at ease. The evolution of the HMI goes hand-in-hand with the evolution of a device’s multimedia content.


Today, richly graphical user interfaces, touchscreens and soft controls are the norm. To enable this, processors have had to evolve to offer the increased computational performance required of the device. From a graphical interface perspective, you can see GPU development has been driven by three key areas of demand:


  1. Increasing resolutions: Since the Google Nexus 10 exceeded full HD 1080p resolution with its 2560x1600 (WQXGA) screen, OEMs are continuing to increase pixel density setting 4K (UHD) resolution as the new goal for mobile silicon. To enable this, ARM is not only increasing the number of potential shader core implementations within each GPU (the ARM® Mali™-T760 GPU can host up to sixteen) but we are also improving the internal efficiency of the cores themselves and the memory access in order to ensure that the scaling of cores results in a proportional scaling of performance.
  2. Diversity of screen sizes:  GPUs are suitable not only for HD tablets and UHD DTV displays but also for smaller, wearable devices and IoT applications. The growing diversity of consumer devices is encouraging semiconductor companies to deliver a correspondingly diverse GPU IP portfolio: a processor suitable for any application. With GPUs ranging from the area-efficient Mali-300 to the high-performing, sixteen core ARM Mali-T760 this is exactly what ARM is offering and we are continuing to evolve our roadmaps to deliver great graphical experiences on any device.
  3. Hardware support for more complex content: as NUIs become increasingly life-like, hardware support for features such as the latest APIs becomes crucial in order to enable graphics-rich, intuitive content with smooth frame rates. Not only that, but the raw computational power needed in order to produce these smooth, life-like images that are expected in high end devices puts ever increasing demands on the capabilities of the processors in them.  Again, that’s where the efficiency of ARM CPUs and GPUs comes into play.  Coupled with the configurability and scalability of ARM processors, device manufacturers have the flexibility they need to meet consumer demands, cost efficiently, across the entire market.


I believe the current phase of HMI is still being explored and will continue to see significant innovation. In the world of battery-powered devices, traditional PC games have been adapted from console and controller platforms to mobile. With this shift you can see some developers mimicking console controllers with the touchscreen, whilst others have achieved success with new, simple interfaces tailored to the nature of the game (such as swipes, tilts, etc.). This success is inspiring more developers to either design new applications for these effective HMIs, or even new HMIs tailored to their new game, making the entire multimedia experience ever more intrinsically interactive instead of conforming to traditional HMI methods.


However, that’s all happening today. What really excites me is what we can see coming in the future.


Across nearly all the evolutions in NUI, you can see a desire and trend for effortless and instinctive interaction. Physical push buttons have given way to soft buttons; fixed function devices have had their functionality opened up by this range of application-dependent software-driven controls.  Looking into the near future, I can see the next phase of HMI arriving in the form of our devices “reaching out” to interact with us. Why should I have to remember an easily copied PIN sequence to unlock my device? Why can’t ‘I’ be the key? This trend is at the beginning of its lifecycle with facial recognition capabilities becoming standard in mobile devices and starting to be used for unlocking phones. As another example, why do we still have to find controllers every time we wish to change the channel or volume on the TV? Why can’t we control TVs ourselves via gesture or voice control? Why can’t the TV control itself, reaching out to see if anyone is watching it or whether the content is suitable for the audience (for example if there are children in the room)? As eyeSight’s CEO, Gideon Shmuel, says:


“In order for an interaction solution to be a true enhancement compared to existing UIs it must be simple, intuitive and effortless to control – and to excel even further, an interaction solution should become invisible to the user. This enhanced machine vision understanding will, in fact, deliver a user aware solution that understands and predicts the desired actions, even before a deliberate command has been given.”


The concepts for these new HMIs have existed for a while. But it is only in the past year that technology is starting to catch up in order to provide the desired result within the restricted mobile budget. In most cases when a device is “reaching out” to the user it is using either gesture recognition, motion detection or facial recognition. Two issues had been holding this back. Firstly, the processing budget for UI in embedded and mobile devices was not sufficient to support these pixel-intensive, computationally demanding tasks. Advancements such as GPU Compute, OpenCL™ and ARM big.LITTLE™ processing are addressing this issue, increasing the amount of processing possible within the same time budget, and several companies are seeing success in these areas.


Video interview with eyeSight

Secondly, I believe that the lack of a flexible, adaptable platform on which these tasks can be developed and matured was holding back this technology. However, now devices with performance-efficient GPU Compute technology are entering the market, such as the recently released Galaxy Note 3, and ARM is seeing an explosion in the number of third parties and developers exploring ways in which this new functionality can bring their innovations to life.


Looking even further ahead, it is clear that HMI will become even more complex as machines start to “reach out” to each other as well as to their users. As devices continue to diversify I believe that we will see a burst of innovation in how these devices start to interact and be used either in conjunction or interchangeably with each other. As the Internet of Things picks up its pace, the conversation will be about HMMI rather than simply HMI; then HMMMI.  How will we interact with our devices when all our devices are connected? If my smartphone senses my hands are cold, will it automatically turn the room or car heating up? If I leave my house but accidentally leave the lights on, will they turn themselves off? Will the advancements in NUI on our mobile devices make obsolete any interactions on less interactive devices? Will we even need the mobile devices as an interface with the machine-world or will every device with a processor be able to “reach out” to its environment? The possibilities are vast in a user-aware world, and ARM’s role in this area will continue to be to develop the processor IP which enables continuous, ground breaking innovation.


What are your thoughts on the future of NUI? What will ARM have to do to meet its future needs? Let us know your thoughts in the comments below.

Europe to use ARM CPUs and GPUs to make an exaflop supercomputer 30-50x more energy efficient than the best supercomputers today


In 2011 the Mont Blanc project announced that it could design a higher performing, more energy efficient standard of computer architecture by using the processor technology found in today's embedded and mobile devices. Using the Exynos 5 Dual with its ARM® Mali™-T604 GPU as the basis of their prototype, their new machines are predicted to be able to carry out ten to the power of eighteen operations a second whilst being fifteen to thirty times more energy efficient than the systems used today and are set to completely revolutionize HPC technology.


Next Big Future covered the history of the project to date in this article.


New smartphone market data shows strong growth for entry-level and Android OS-based devices


Worldwide smartphone sales will reach 1.2bn by the end of 2014, an increase of 23.1% over the previous year, according to the latest report from IDC Research, and it is the entry-level market in emerging countries such as India, Indonesia and Russia that is especially drawing the attention of industry analysts. Average selling prices are starting to decrease with a smartphone now expected to sell at $314, but they are also offering far better value for this price as premium technology from previous years becomes affordable to the mass market. In addition, Android is set to continue its leadership, hitting an 80.2% market share by the end of 2014. This is encouraging news for Mali, whose GPU IP is found in over 50% of Android tablets and over 20% of all Android smartphones.


This week's Mali-based device launches


This week saw the launch of the Vodafone Smart Tab 4 into the UK market, an 8 inch tablet with a MediaTek MT8382 processor featuring a quad core ARM Cortex®-A7 CPU and a Mali-400 GPU. Acer announced four new Mediatek based tablets, featuring ARM Mali GPUs, set to be launched in the third quarter of this year. In addition the Alcatel OneTouch Idol X+ and the Wickedleak Wammy Neo with their Mali-450 GPUs were launched into the Indian market and the Huawei Honor 3C with its Mediatek MT6592 was launched in Pakistan.

The Fast Fourier Transform (FFT) is a powerful tool in signal and image processing. One very valuable optimization technique for this type of algorithm is vectorization. This article discusses the motivation, vectorization techniques and performance of FFT on ARM® Mali™ GPUs. For large 1D FFTs (greater than 65536 samples), performance improvements of over 7x are observed.



A few months ago I was involved in a mission to optimize an image processing app. The application had multiple 2D convolution operations using large-radii blurring filters. Image convolutions map well to the GPU since they are usually separable and can be vectorized effectively on Mali hardware. However, with a growing filter size, they eventually hit a brick wall imposed by computational complexity theory.


An illustration of a 2D convolution.
Source: http://www.westworld.be/page/2/

In the figure above, an input image is convolved with a 3x3 filter.  For each input pixel, 9 multiplications and 8 additions are required to produce the corresponding output. When estimating the time complexity of this 2D convolution, each multiplication and addition is assumed to take constant time. Therefore, the time complexity is approximately O(n²) for an n x n image. However, as the filter size grows, the number of operations per pixel increases and the constant time assumption no longer holds. With a non-separable filter, the time complexity quickly approaches O(n⁴) as the filter size becomes comparable to the image size.


In an era of ever increasing digital image resolutions, O(n⁴) time is simply not good enough for modern applications. This is where FFT may offer an alternative computing route. With FFT, convolution operations can be carried out in the frequency domain. The forward and inverse FFT each need O(n² log n) time, a clear advantage over direct convolution in the time/spatial domain, which requires O(n⁴).
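As a small illustration of convolution in the frequency domain, the sketch below compares a direct circular convolution with one computed via FFT. It is a 1D, pure-Python example using a textbook recursive radix-2 FFT, not the GPU implementation discussed in this article; the function names are assumptions made for the example, and input lengths must be a power of two.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. No 1/N scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def circular_convolve_direct(a, b):
    """Direct circular convolution: O(n^2) multiply-adds for length n."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def circular_convolve_fft(a, b):
    """Same result via forward FFTs, pointwise product and inverse FFT."""
    n = len(a)
    product = [p * q for p, q in zip(fft(a), fft(b))]
    # The inverse transform above is unscaled, so divide by n here.
    return [v.real / n for v in fft(product, inverse=True)]

signal = [1, 2, 3, 4, 0, 0, 0, 0]
kernel = [1, 1, 1, 0, 0, 0, 0, 0]
direct = circular_convolve_direct(signal, kernel)
via_fft = circular_convolve_fft(signal, kernel)
print(direct)
print([round(v, 6) for v in via_fft])
```

Zero-padding both inputs, as done here, keeps the circular result equal to the ordinary linear convolution of the non-zero parts.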

The next section assumes basic understanding of FFT. A brief introduction to the algorithm can be found here:



FFT Vectorization on Mali GPUs

For simplicity, a 1D FFT vectorized implementation will be discussed here. Multi-dimensional FFTs are separable operations, thus the 1D FFT can easily be extended to accommodate higher-dimension transforms. The information flow of FFT is best represented graphically by the classic butterfly diagram:


16-point Decimation in Frequency (DIF) butterfly diagram.

The transformation is broken into 4 individual OpenCL™ kernels: 2-point, 4-point, generic and final transformations. The generic and final kernels are capable of varying in size. The generic kernel handles transformations from 8-point to half of the full transformation. The final kernel completes the transformation by computing the full-size butterfly.


FFT operates within the complex domain. The input data is stored in a floating point buffer of real and imaginary pairs:


The structure of the input and output buffer. A complex number consists of two floating point values. The vector width is also shown.

The first stage of the decimation in time (DIT) FFT algorithm is a 2-point discrete Fourier transform (DFT). The corresponding kernel consists of two butterflies. Each of these two butterflies operates on two complex elements as shown:



An illustration of the first stage kernel. A yellow-grey shaded square represents a single complex number. The yellow outline encloses the butterflies evaluated by a single work item. The same operation is applied to cover all samples.

Each work item has a throughput of 4 complex numbers (256 bits). This aligns well with the vector width.
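A serial Python sketch of this first stage might look like the following. It is an assumption-laden illustration rather than the actual OpenCL kernel: it uses Python complex values instead of interleaved floats, and it assumes each 2-point butterfly pairs adjacent elements, with one work item covering 4 consecutive complex numbers.

```python
def two_point_butterfly(a, b):
    """2-point DFT: the twiddle factor is 1, so only a sum and a difference."""
    return a + b, a - b

def first_stage_work_item(data, work_item_id):
    """Process 4 consecutive complex values as two 2-point butterflies, in place."""
    base = 4 * work_item_id
    for offset in (0, 2):
        i = base + offset
        data[i], data[i + 1] = two_point_butterfly(data[i], data[i + 1])

def run_first_stage(data):
    """Serial stand-in for dispatching len(data) / 4 work items."""
    if len(data) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    for work_item_id in range(len(data) // 4):
        first_stage_work_item(data, work_item_id)
    return data

samples = [complex(v) for v in (1, 2, 3, 4, 5, 6, 7, 8)]
run_first_stage(samples)
print(samples)
```

On the GPU the loop over work items disappears, since each work item runs the body independently and in parallel.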

In the second stage, the butterflies have a size of 4 elements. Similar to the first kernel, the second kernel has a throughput of 4 complex numbers, aligning with the vector width. The main distinctions are in the twiddles and the butterfly network:


The illustration for the second stage kernel: a single 4-point butterfly.

  The generic stage is slightly more involved. In general, we would like to:

  • Re-use the twiddle factors
  • Keep the data aligned to the vector width
  • Maintain a reasonable register usage
  • Maintain a good ratio between arithmetic and data access operations


These requirements help to improve the efficiency of memory access and ALU usage. They also help to ensure that optimal numbers of work items can be dispatched at a time. With these requirements in mind, each work item for this kernel is responsible for 4 complex numbers for 4 butterflies. The kernel essentially operates on 4 partial butterflies and has a total throughput of:


4 complex numbers * 2 floats * sizeof(float) * 4 partial butterflies = 128 bytes = 1024 bits per work item


This is illustrated in the following graph:




The graph above represents the 8-point butterfly case for the generic kernel. The left side of the diagram shows 4 butterflies which associate with a work item. The red boxes on the left diagram highlight the complex elements being evaluated by the work item. The drawing on the right is a close-up view of a butterfly. The red and orange lines highlight the relevant information flow of the butterfly.

Instead of evaluating a single butterfly at a time, the kernel works on portions of multiple butterflies. This allows the same twiddle factors to be re-used across the butterflies. For an 8-point transform as shown in the graph above, the butterfly would be distributed across 4 work items. The kernel is parameterizable from 8-point to N/2-point transforms, where N is the total length of the original input.
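The twiddle re-use idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the kernel's actual partitioning: each hypothetical work item here handles one input pair at the same position in 4 consecutive butterflies, so it computes a single twiddle factor and applies it 4 times. The names `generic_work_item` and `generic_stage` and the `group` parameter are invented for this sketch, and the sketch requires the number of butterflies to be a multiple of `group`.

```python
import cmath

def generic_work_item(buf, m, gid, group=4):
    """Apply pair j of an m-point butterfly to `group` consecutive butterflies."""
    half = m // 2
    j = gid % half                  # position within the butterfly
    first = (gid // half) * group   # first butterfly handled by this work item
    w = cmath.exp(-2j * cmath.pi * j / m)  # computed once, re-used `group` times
    for b in range(first, first + group):
        lo = b * m + j
        hi = lo + half
        u, t = buf[lo], w * buf[hi]
        buf[lo], buf[hi] = u + t, u - t

def generic_stage(buf, m, group=4):
    """Run one m-point stage; the loop stands in for parallel work items."""
    butterflies = len(buf) // m
    assert butterflies % group == 0, "sketch assumes a multiple of `group` butterflies"
    for gid in range((butterflies // group) * (m // 2)):
        generic_work_item(buf, m, gid, group)

data = [complex(k, 0) for k in range(32)]
generic_stage(data, 8)  # 4 butterflies of size 8, covered by 4 work items
```

With 32 samples and an 8-point stage, the sketch dispatches 4 work items, and each 8-point butterfly is split across all 4 of them, matching the distribution described above.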


For the final stage, only a single butterfly of size N exists, so twiddle factor sharing is not possible. Therefore, the final stage is just a vectorized butterfly network that is parameterized to a size of N.
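A rough Python sketch of the final stage is shown below. Each hypothetical work item processes a run of adjacent pairs of the single N-point butterfly, and every pair needs its own twiddle factor. The `lanes` parameter is an assumption standing in for the vector width, and `final_work_item` and `final_stage` are invented names.

```python
import cmath

def final_work_item(buf, gid, lanes=4):
    """Process `lanes` adjacent pairs of the single N-point butterfly."""
    n = len(buf)
    half = n // 2
    for lane in range(lanes):
        j = gid * lanes + lane
        w = cmath.exp(-2j * cmath.pi * j / n)  # a distinct twiddle per pair
        u, t = buf[j], w * buf[j + half]
        buf[j], buf[j + half] = u + t, u - t

def final_stage(buf, lanes=4):
    """Run the full-size butterfly; the loop stands in for parallel work items."""
    half = len(buf) // 2
    assert half % lanes == 0, "sketch assumes N/2 is a multiple of `lanes`"
    for gid in range(half // lanes):
        final_work_item(buf, gid, lanes)

data = [complex(k, -k) for k in range(16)]
final_stage(data)
```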



The performance of the 1D FFT implementation described in the last section is compared to a reference CPU implementation. In the graph below, the relative speed up is shown for transforms from 2^6 to 2^17 samples. Please note that the x-axis uses a logarithmic scale:



GPU FFT performance gain over the reference implementation.


We have noticed in our experiments that FFT algorithm performance tends to improve significantly on the GPU between about 4096 and 8192 samples. The speed up continues to improve as the sample size grows. With large sample counts, the performance gain offsets the setup cost of OpenCL. This trend would be more prominent in higher dimension transformations.



FFT is a valuable tool in digital signal and image processing. The 1D FFT implementation presented can be extended to higher dimension transformations. Applications such as computational photography, computer vision and image compression should benefit from this. What is more interesting is that the algorithm scales well on GPUs. The performance will further improve with more optimization and future support of half-float.



