Everyone is excited by Android 4.0 Ice Cream Sandwich (ICS) and no wonder! Google's "most ambitious release to date" rolls out many new features that look very promising indeed: zero-click NFC sharing (Android Beam), a unified OS for smartphones and tablets, a new UI (with resizable widgets), a single-motion panoramic camera, face unlock, live effects and so on. And this time it will all be Open Source. All great, but to be honest, what I am really excited about is a less advertised technology called RenderScript, which in ICS takes another significant stride towards maturity. And why does this excite me? Because RenderScript is the API that will soon enough enable GPU Computing in Android.
Crash course in Android RenderScript
So, what is RenderScript? It is a new programming framework and API for Android, which Google originally introduced in Honeycomb. In reality, some experimental components were pioneered as far back as Eclair, and showcased in things like Live Wallpapers. But it was not until Honeycomb that the RenderScript API was exposed in the Android SDK. With ICS we have a newer, more mature and firmed-up architecture, compiler, runtime and API for RenderScript. You can find more details on the RenderScript API here.
There are two parts to this API: one for 3D graphics and one for compute (aka RenderScript compute). The 3D graphics part of the API addresses several limitations of graphics programming in Android by allowing the developer to batch graphics operations in higher-level, easier-to-program scripts that still execute natively to preserve performance. The RenderScript 3D graphics API sits right on top of OpenGL ES and can be hardware accelerated on the GPU. RenderScript compute complements this by enabling computationally intensive tasks to be offloaded to worker processors such as another CPU, a DSP or... a GPU. Unlike OpenCL, where command queues are statically associated with devices, with RenderScript compute it is the runtime that decides where the job will execute. For now (even in ICS) compute jobs are hardwired to the CPU. Multi-device support will be introduced in a future Android release, and processor devices that aspire to be an eligible target will need to supply a RenderScript compute back-end driver.
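To make the "runtime decides where the job runs" idea concrete, here is a minimal sketch in plain Java. It is only a model: `DispatchSketch`, `Backend`, `CpuBackend`, `register` and `launch` are hypothetical names invented for illustration, and none of them exist in Android, RenderScript or OpenCL. The assumption is simply that the caller hands over a kernel and data, never names a device, and the runtime picks the first registered back-end that says it can take the job, falling back to the CPU.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Hypothetical model of runtime-side device selection.
// None of these types exist in Android, RenderScript or OpenCL.
final class DispatchSketch {

    // Assumed, heavily simplified shape of what a device back-end would supply.
    interface Backend {
        String name();
        boolean canRun(int elementCount);
        int[] run(IntUnaryOperator kernel, int[] input);
    }

    // Always-eligible fallback, mirroring "compute jobs are hardwired to the CPU".
    static final class CpuBackend implements Backend {
        public String name() { return "cpu"; }
        public boolean canRun(int elementCount) { return true; }
        public int[] run(IntUnaryOperator kernel, int[] input) {
            int[] out = new int[input.length];
            for (int i = 0; i < input.length; i++) {
                out[i] = kernel.applyAsInt(input[i]);
            }
            return out;
        }
    }

    private final List<Backend> backends = new ArrayList<>();
    private String lastUsed = "none";

    DispatchSketch() {
        backends.add(new CpuBackend());
    }

    // Newly registered back-ends are tried before the ones already present.
    void register(Backend backend) {
        backends.add(0, backend);
    }

    // The caller never names a device: the runtime picks the first eligible one.
    int[] launch(IntUnaryOperator kernel, int[] input) {
        for (Backend b : backends) {
            if (b.canRun(input.length)) {
                lastUsed = b.name();
                return b.run(kernel, input);
            }
        }
        throw new IllegalStateException("no eligible back-end");
    }

    String lastUsed() {
        return lastUsed;
    }
}
```

With only the CPU back-end present, `new DispatchSketch().launch(x -> x + 1, data)` runs on the CPU. The point of the design is that the same call would be unchanged if an accelerator back-end were registered later.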
Today you can see RenderScript in action in applications such as Books, YouTube, MovieStudio and Live Wallpapers. In the near future, I expect great innovation to take place in Android thanks to RenderScript, in particular around GPU Computing, enabled through the compute part of the API.
How does the developer use RenderScript compute? Well, the majority of the application will be written in Java, using the Dalvik APIs as usual. The developer needs to find performance-critical parts of the algorithm which are suitable for RenderScript compute, even better if they are parallel in nature (for example, repetitive operations over very large datasets). The script (compute kernel) will be written in one or more .rs files using the ScriptC language (fundamentally C99 plus vector data types, atomics, barriers etc.). These .rs files will be compiled into highly optimized portable bitcode, and used to generate .java files (the so-called reflected layer) that enable the main Java application to pass data to and from the script and to trigger its execution. The APK package will include the Java application and relevant files, assets etc. plus the RenderScript portable bitcode. When Dalvik JITs the application, the RenderScript bitcode is also compiled (and cached for later re-use).
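The programming model described above can be sketched in plain Java. This is an illustration of the idea only, not the RenderScript API: a real application would write the kernel in a .rs file and call it through the generated reflected class, whereas here `KernelSketch`, `Kernel` and `forEach` are hypothetical stand-ins and the "worker processors" are assumed to be ordinary CPU threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a compute script: NOT the android.renderscript API.
final class KernelSketch {

    // Plays the role of the per-element function that would live in a .rs file.
    interface Kernel {
        float apply(float in);
    }

    // Plays the role of the reflected layer: pass data in, trigger execution,
    // get data back. How the work is split is hidden from the caller.
    static float[] forEach(Kernel kernel, float[] input, int workers) {
        float[] output = new float[input.length];
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> pending = new ArrayList<>();
            int chunk = (input.length + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int start = w * chunk;
                final int end = Math.min(input.length, start + chunk);
                pending.add(pool.submit(() -> {
                    // Each worker handles a disjoint slice, so no locking is needed.
                    for (int i = start; i < end; i++) {
                        output[i] = kernel.apply(input[i]);
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the slice and surface any worker failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("kernel execution failed", e);
        } finally {
            pool.shutdown();
        }
        return output;
    }

    public static void main(String[] args) {
        float[] data = new float[1 << 16];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        float[] halved = forEach(x -> x * 0.5f, data, 4);
        System.out.println(halved.length + " elements processed");
    }
}
```

The kernel touches one element at a time and carries no state between elements, which is what makes this kind of workload a natural candidate for offloading.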
For more details read Jason Sams's blog post.
GPU Compute: the right tool for the job
To achieve optimal general-purpose computational throughput you need a purpose-designed processor. The ARM® Mali™-T604 GPU is designed to integrate graphics and compute functionality, optimizing interoperation between the two at both the hardware and software driver levels.
RenderScript introduces many additional requirements for precision and support of mathematical functions (native functions, equivalent to OpenCL's BIFLs). In addition to satisfying IEEE 754 precision requirements for single-precision and double-precision floating point, the Mali-T604 GPU implements most of these native functions directly in hardware. The Mali-T604 also natively supports 64-bit integer data types, something not common in competing architectures, as set out in my colleague jemdavies's blog. Barriers and atomics are also implemented directly in hardware. This provides an immense step-up in performance and efficiency for general-purpose computation compared with the current generation of GPUs, which were not designed for it.
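As a rough illustration of why atomics and 64-bit integers matter to compute code, the sketch below has many parallel workers add into a single shared counter whose final value does not fit in 32 bits. It is plain Java running on CPU threads and says nothing about how the Mali-T604 implements these operations; `AtomicSumSketch` and `sumOfSquares` are names made up for this example.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

// Illustrative only: a parallel reduction that needs both an atomic update
// and a 64-bit accumulator to produce the right answer.
final class AtomicSumSketch {

    // Sums i*i for i in [0, n) across parallel workers into one shared counter.
    static long sumOfSquares(int n) {
        AtomicLong total = new AtomicLong();
        IntStream.range(0, n)
                 .parallel()
                 // The cast widens before multiplying so i*i cannot overflow int.
                 .forEach(i -> total.addAndGet((long) i * i));
        return total.get();
    }

    public static void main(String[] args) {
        long result = sumOfSquares(100_000);
        System.out.println(result > Integer.MAX_VALUE
                ? "result needs more than 32 bits"
                : "result fits in 32 bits");
    }
}
```

Without the atomic add, concurrent workers could lose updates. Without the 64-bit accumulator, the total would overflow. On a processor that lacks native support for either, both have to be emulated in software.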
The ARM Mali-T604 GPU is designed to work with the latest version (4) of AMBA (Advanced Microcontroller Bus Architecture), which features the Cache Coherent Interconnect (CCI). Data shared between processors in the system, a natural occurrence in heterogeneous computing, no longer requires costly (in terms of cycles and energy) synchronization via external memory and explicit cache maintenance operations. All of this is now performed in hardware, and is enabled transparently inside the drivers. In addition to reducing memory traffic, CCI avoids superfluous sharing of data: only data genuinely requested by another master is transferred to it, at the granularity of a cache line. There is no longer any need to flush a whole buffer or data structure.
There is more. Task management and event dependencies are optimized in hardware, and task dependency coordination is built entirely into the hardware job manager unit of the Mali-T604. The software driver's responsibility is reduced to handing over the workload to the GPU: all scheduling, prioritization and run-time synchronization take place transparently, behind the scenes.
Typically, GPUs are designed to favour throughput over latency. The Mali-T604 GPU treats generic memory loads and stores as first-class operations with proper latency tolerance.
Typically developers use a blend of APIs during development. The Mali software driver infrastructure is tightly integrated and optimized. All APIs of the Mali software stack architecture share the same high-level API objects, the same address space, the same queues, dependencies and events. This approach reduces code footprint and significantly increases performance. Data structures are shared between APIs and devices, to avoid unnecessary memory copies.
Are you looking forward to unleashing the power of GPU Computing in Android through Mali-T604?