
ARM Mali Graphics


Previous blog in the series: Mali Performance 3: Is EGL_BUFFER_PRESERVED a good thing?

 

In my previous blogs in this series I looked at the bare essentials for using a tile-based rendering architecture as efficiently as possible, in particular showing how to structure an application's use of framebuffers to minimize unnecessary memory accesses. With those basics out of the way I can now start looking in more detail at how to drive the OpenGL ES API to get the best out of a platform using Mali, but before I do that I would like to introduce my five principles of performance optimization.

 

Principle 1: Know Your Goals

 

When starting an optimization activity have a clear set of goals for where you want to end up. There are many possible objectives to optimization: faster performance, lower power consumption, lower memory bandwidth, or lower CPU overhead to name the most common ones. The kinds of problems you look for when reviewing an application will vary depending on what type of improvement you are trying to make, so getting this right at the start is of critical importance.

 

It is also very easy to spend increasingly large amounts of time for smaller and smaller gains, and many optimizations will increase the complexity of your application and make longer term maintenance problematic. Review your progress regularly during the work to determine when to say "we've done enough", and stop when you reach this point.

 

Principle 2: Don't Just Try to Make Things Fast

 

I am often asked by developers working with Mali how they can make a specific piece of content run faster. This type of question is then often quickly followed up by more detailed questions on how to squeeze a little more performance out of a specific piece of shader code, or how to tune a specific geometry mesh to best fit the Mali architecture. These are all valid questions, but in my opinion they often narrow the scope of the optimization activity far too early in the process, and leave many of the most promising avenues of attack unexplored.

 

Both of the questions above try to optimize a fixed workload, and both make the implicit assumption that the workload is necessary at all. In reality graphics scenes often contain a huge amount of redundancy - objects which are off screen, objects which are overdrawn by other objects, objects where half the triangles are facing away from the user, etc - which contribute nothing to the final render. Optimization activities therefore need to attempt to answer two fundamental questions:

 

  1. How do I remove as much redundant work from the scene as possible, as efficiently as possible?
  2. How do I fine tune the performance of what is left?

 

In short - don't just try to make something faster, try to avoid doing it at all whenever possible! Some of this "work avoidance" must be handled entirely in the application, but in many cases OpenGL ES and Mali provide tools which can help, provided you use them correctly. More on this in a future blog.

 

Principle 3: Graphics is Art not Science

 

If you are optimizing a traditional algorithm on a CPU there is normally a right answer, and failure to produce that answer will result in a system which does not work. For graphics workloads we are simply trying to create a nice looking picture as fast as possible; if an optimized version is not bit-exact it is unlikely anyone will actually notice, so don't be afraid to play with the algorithms a little if it helps streamline performance.

 

Optimization activities for graphics should look at the algorithms used, and if their expense does not justify the visual benefits they bring then do not be afraid to remove them and replace them with something totally different. Real-time rendering is an art form, and optimization and performance are part of that art. In many cases a smooth framerate and fast performance are more important than a little more detail packed into a single frame.

 

Principle 4: Data Matters

 

GPUs are data-plane processors, and graphics rendering performance is often dominated by data-plane problems. Many developers spend a long time looking at OpenGL ES API function call sequences to determine problems, without really looking at the data they are passing into those functions. This is nearly always a serious oversight.

 

OpenGL ES API call sequences are of course important, and many logic issues can be spotted by looking at these during optimization work, but remember that the format, size, and packing of data assets is of critical importance and must not be forgotten when looking for opportunities to make things faster.
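As a rough illustration of why format and size matter, here is a minimal sketch of the arithmetic I have in mind. The class and method names are hypothetical, and the calculation deliberately ignores mipmaps, compression, and any driver padding.

```java
// Hypothetical helper: estimates the storage cost of one uncompressed texture level.
final class AssetSize {
    /**
     * @param width        texture width in texels
     * @param height       texture height in texels
     * @param bitsPerTexel storage cost of one texel, e.g. 32 for RGBA8888 or 16 for RGB565
     * @return size in bytes, ignoring mipmaps and padding
     */
    static long textureBytes(int width, int height, int bitsPerTexel) {
        // Widen to long before multiplying so large textures do not overflow int.
        return (long) width * height * bitsPerTexel / 8;
    }
}
```

For a 1024x1024 texture, `AssetSize.textureBytes(1024, 1024, 32)` gives 4,194,304 bytes, while the same texture at 16 bits per texel needs half of that. Every texel fetched from the larger format costs twice the bandwidth, whatever the API call sequence looks like.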

 

Principle 5: Measure Early, Measure Often

 

The impact of a single draw call on scene rendering performance is often impossible to tell from the API level, and in many cases seemingly innocuous draw calls have some of the largest performance overheads. I have seen many performance teams sink days or even weeks of time into optimizing something, only to belatedly realise that the shader they have been tuning only contributes 1% of the overall cost of the scene, so while they have done a fantastic job and made it 2x faster, that only improves overall performance by 0.5%.
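The arithmetic behind that 0.5% figure is worth making explicit. The sketch below uses a hypothetical helper of my own naming; it assumes the tuned part accounts for a fixed fraction of frame cost and that nothing else in the frame changes.

```java
// Hypothetical helper: how much total frame time is saved when one part gets faster.
final class SpeedupEstimate {
    /**
     * @param fraction     share of total frame cost spent in the tuned part (0..1)
     * @param localSpeedup how many times faster the tuned part becomes (must be > 0)
     * @return fraction of total frame time saved (0..1)
     */
    static double timeSaved(double fraction, double localSpeedup) {
        // The tuned part shrinks from `fraction` to `fraction / localSpeedup`.
        return fraction * (1.0 - 1.0 / localSpeedup);
    }
}
```

With `fraction = 0.01` and `localSpeedup = 2.0` the result is 0.005, i.e. the 0.5% above. Making the same shader infinitely fast could never save more than 1%, which is why measuring the fraction first matters so much.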

 

I always recommend measuring early and often, using tools such as DS-5 Streamline to get an accurate view of the GPU hardware performance via the integrated hardware performance counters, and Mali Graphics Debugger to work out which draw calls are contributing to that rendering workload. Use the performance counters not only to identify hot spots to optimize, but also to sanity check what your application is doing against what you expect it to be doing. For example, manually estimate the number of pixels, texels, or memory accesses per frame and compare this estimate against the counters from the hardware. If you see twice as many pixels as expected being rendered then there are possibly some structural issues to investigate first, which could give much larger wins than simple shader tuning.
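A minimal sketch of that sanity check might look like the following. The names are hypothetical, and the shaded-pixel count is assumed to be a per-frame figure you have already read out of your profiling tool by hand; no tool API is being modelled here.

```java
// Hypothetical helper: compares a measured per-frame pixel count with a manual estimate.
final class OverdrawCheck {
    /**
     * @param shadedPixelsPerFrame pixel count per frame taken from the hardware counters
     * @param width                render target width in pixels
     * @param height               render target height in pixels
     * @param expectedLayers       how many times you expect each pixel to be shaded
     * @return ratio of measured to expected work; about 1.0 means the estimate holds
     */
    static double overdrawFactor(long shadedPixelsPerFrame, int width, int height,
                                 double expectedLayers) {
        double expected = (double) width * height * expectedLayers;
        return shadedPixelsPerFrame / expected;
    }
}
```

For a 1920x1080 target where you expect each pixel to be shaded once, a measured 4,147,200 shaded pixels per frame gives a factor of 2.0, which is exactly the "twice as many pixels as expected" situation worth investigating before touching any shader.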

 

Next Time

 

The best optimizations in graphics are those made a structural part of how an application presents data to the OpenGL ES API, so in my next blog I will be looking at some of the things an application might want to consider when trying very hard to not do any work at all.

 

TTFN,
Pete

 


Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

Introduction

 

This last blog in the Movie Vision App series, following on from The Movie Vision App: Part 1 and The Movie Vision App: Part 2, will discuss two final movie effect filters.

 

 

Movie Vision Filters: “Follow The White Rabbit…”

 

This is the most intriguing and complex filter in the Movie Vision demonstration. The camera preview image is replaced by a grid of small characters (primarily Japanese Kana). The characters are coloured in varying shades of green reminiscent of old computer displays. The brightness is also manipulated to create the appearance of some characters ‘falling’ down the image. The overall impression is that the image is entirely composed of green, computer-code-like characters.

 

…
      //Run the WhiteRabbitScript with the RGB camera input allocation.
      mWhiteRabbitScript.forEach_root(mWhiteRabbitInAlloc, mWhiteRabbitOutAlloc);
      //Make the heads move, dependent on the speed.
      for(int hp = 0; hp < mScreenWidth / mCharacterSize; hp++) {
          mHeadPos[hp]+=mSpeeds[mStrChar[hp]];
          //If the character string has reached the bottom of the screen, wrap it back around.
          if(mHeadPos[hp] > mScreenHeight + 150) {
              mHeadPos[hp] = 0;
              mStrChar[hp] = mGen.nextInt(8)+1;
              mStrLen[hp] = mGen.nextInt(100)+50;
              mUpdate = true;
          }
      }
      //If a character string has reached the bottom, update the allocations with new random values.
      if(mUpdate) {
          mStringLengths.copyFrom(mStrLen);
          mStringChars.copyFrom(mStrChar);
          mUpdate = false;
      }
…






“Follow the White Rabbit” excerpt from processing of each camera frame

 

The Java component of this image filter does the standard RenderScript set up, but also populates several arrays to use in mapping the image to characters. The number of columns and rows of characters is calculated and a random index set for each column. A set of header positions and string lengths is also randomly generated for each column. These correspond to areas that will be drawn brighter than the rest of the image, to give the impression of falling strings of characters. On the reception of each camera preview frame, the standard YUV to RGB conversion is performed. Then, the RenderScript image effect script’s references to the character, position and length arrays are updated. The script kernel is executed. Afterwards, the header positions are adjusted so that the vertical brighter strings appear to fall down the image (and wrap back to the top).

 

…
static const int character1[mWhiteRabbitArraySize] = {0, 0, 1, 0, 0, 0,
                                                  0, 0, 1, 0, 0, 0,
                                                  1, 1, 1, 1, 1, 1,
                                                  0, 0, 1, 0, 0, 1,
                                                  0, 1, 0, 0, 0, 1,
                                                  1, 0, 0, 0, 1, 1};
static const int character2[mWhiteRabbitArraySize] = {0, 0, 1, 1, 1, 0,
                                                  1, 1, 1, 1, 1, 1,
                                                  0, 0, 0, 1, 0, 0,
                                                  0, 0, 0, 1, 0, 0,
                                                  0, 0, 1, 0, 0, 0,
                                                  0, 1, 0, 0, 0, 0};
…






“Follow the White Rabbit” RenderScript character setup

 

This is by far the most complicated RenderScript kernel in the Movie Vision app. The script file starts with eight statically defined characters from the Japanese Kana alphabet. These are defined as 6x6 arrays. The first line of the script execution is a conditional statement – the script only executes on every eighth pixel in both the x and y direction. So, the script executes ‘per character’ rather than ‘per pixel’. As we use 6x6 characters, this gives a one pixel border to each character. The output colour for the current position is set to a default green value, based on the input colour. The character index, header position and length values are retrieved from the arrays managed by the Java class. Next, we determine if the character corresponding to the current pixel is in our bright ‘falling’ string, and adjust the green value appropriately: brightest at the head, gradually fading behind and capped at a lower maximum value elsewhere. If the current character position isn’t at the front of the falling string, we also pseudo randomly change the character to add a dynamic effect to the image. Next, some basic skin tone detection is used to further brighten the output if skin is indeed detected. Finally, the output values for all pixels in the current character position are set.

 

…
      //Sets the initial green colour, which is later modified depending on the in pixel.
      refCol.r = 0;
      refCol.g = in->g;
      refCol.b = in->g & 30;
…
//If the Y position of this pixel is the same as the head position in this column.
        if(y == currHeadPos)
            refCol.g = 0xff; //Set it to solid green.
        //If the character is within the bounds of the falling character string for that column, make it darker the further away
        //from the head it is.
        else if((y < currHeadPos && y >= (currHeadPos - currStringLength)) || (y < currHeadPos && (currHeadPos - currStringLength) < 0))
            refCol.g = 230 - ((currHeadPos - y));
        else if(refCol.g > 150) //Cap the green at 150.
            refCol.g -= 100;
        else
            refCol.g += refCol.g | 200; //For every other character, make it brighter.
      //If the current character isn't the head, randomly change it.
      if(y != currHeadPos)
            theChar += *(int*)rsGetElementAt(stringChars, (y/mWhiteRabbitCharSize));
      //Basic skin detection to highlight people.
      if(in->r > in->g && in->r > in->b) {
            if(  in->r > 100 && in->g > 40
              && in->b > 20 && (in->r - in->g) > 15)
                refCol.g += refCol.g & 255;
      }
…
      //Loop through the binary array of the current character.
      for(int py = 0; py < mWhiteRabbitCharSize; py++){
          for(int px = 0; px < mWhiteRabbitCharSize; px++){
                out[(py*mWidth)+px].r = 0;
                out[(py*mWidth)+px].g = 0;
                out[(py*mWidth)+px].b = 0;
                if(theChar == 1) {
                    if(character1[(py*(mWhiteRabbitCharSize))+px] == 1)
                      out[(py*mWidth)+px] = refCol;
                }else if(theChar == 2) {
                    if(character2[(py*(mWhiteRabbitCharSize))+px] == 1)
                      out[(py*mWidth)+px] = refCol;
…






Excerpts of “Follow the White Rabbit” RenderScript Kernel root function
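To make the brightness rule easier to follow outside the kernel, here is a simplified plain-Java sketch of it. This is not the app's code: the class and method names are hypothetical, the 8-pixel cell size is taken from the description above (a 6x6 glyph plus a one pixel border), and the "elsewhere" branch is reduced to a simple cap at 150 rather than the kernel's exact arithmetic.

```java
// Hypothetical, simplified model of the falling-string brightness rule.
final class FallingString {
    // Assumed cell size in pixels: a 6x6 glyph plus a one pixel border on each side.
    static final int CELL = 8;

    /** Index of the character cell that contains pixel coordinate p. */
    static int cellIndex(int p) {
        return p / CELL;
    }

    /**
     * Green level for a character at vertical position y.
     *
     * @param y         vertical position of the character
     * @param headY     position of the head of the falling string in this column
     * @param length    length of the falling string
     * @param baseGreen green level derived from the input image
     */
    static int green(int y, int headY, int length, int baseGreen) {
        if (y == headY) {
            return 255;                              // the head is solid green
        }
        if (y < headY && y >= headY - length) {
            return Math.max(0, 230 - (headY - y));   // fade with distance behind the head
        }
        return Math.min(baseGreen, 150);             // simplified cap outside the string
    }
}
```

For a head at position 100 with a string length of 50, a character at position 90 gets a green level of 220, while one at position 10 is outside the string and is limited to 150.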

 

 

Movie Vision Filters: “Why So Serious?”

 

[Image: whysoserious.png]

This filter mimics a sonar vision effect. Part of this is a simple colour mapping to a blue toned image. In addition, areas of the image are brightened relative to the amplitude of sound samples from the microphone.

 

…
    mRecorder = new MediaRecorder();
    mRecorder.setAudioSource(MediaRecorder.AudioSource.MIC);
    mRecorder.setAudioChannels(1);
    mRecorder.setAudioEncodingBitRate(8);
    mRecorder.setOutputFormat(MediaRecorder.OutputFormat.THREE_GPP);
    mRecorder.setAudioEncoder(MediaRecorder.AudioEncoder.AMR_NB);
    mRecorder.setOutputFile("/dev/null");
    try {
        mRecorder.prepare();
        mRecorder.start();
        mRecorder.getMaxAmplitude();
        mRecording = true;
    } catch (IOException ioe){
        mRecording = false;
    }
…






“Why so serious?” setting up the microphone

 

The Java side of this filter does the standard configuration for a RenderScript kernel. It also sets up the Android MediaRecorder to constantly record sound, but dumps the output to /dev/null. A set of look-up tables, similar to the ‘Get to the chopper’ filter, are used to do the colour mapping. References to these are passed to the script. For each camera preview frame, the maximum sampled amplitude since the last frame and a random x and y position are passed to the RenderScript kernel. The image is converted to RGB and then the image effect kernel is executed.

 

…
    //If the current pixel is within the radius of the circle, apply the 'pulse' effect colour.
    if (((x1*x1)+(y1*y1)) < (scaledRadius*scaledRadius)){
        dist = sqrt((x1*x1)+(y1*y1));
        if (dist < scaledRadius){
            effectFactor = (dist/scaledRadius) * 2;
            lightLevel *= effectFactor;
            blue -= lightLevel;
        }
    }
    //Lookup the RGB values based on the external lookup tables.
    uchar R = *(uchar*)rsGetElementAt(redLUT, blue);
    uchar G = *(uchar*)rsGetElementAt(greenLUT, blue);
    uchar B = *(uchar*)rsGetElementAt(blueLUT, blue);
    //Clamp the values between 0-255
    R > 255? R = 255 : R < 0? R = 0 : R;
    G > 255? G = 255 : G < 0? G = 0 : G;
    B > 255? B = 255 : B < 0? B = 32 : B;
    //Set the final output RGB values.
    out->r = R;
    out->g = G;
    out->b = B;
    out->a = 0xff;
}
...






“Why So Serious?” RenderScript Kernel root function

 

The RenderScript kernel calculates a brightness, radius and offset for a ‘pulse’ effect based on the amplitude and position passed to it. If the current pixel is within the pulse circle, it is brightened considerably. The output colour channels for the pixel are then set based on the lookup tables defined in the Java file.
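The geometry of the pulse can be sketched in plain Java as follows. The names are hypothetical, and the mapping from microphone amplitude to radius is purely our assumption for illustration (it presumes amplitudes in the 0 to 32767 range); only the inside-circle test and the `(dist / radius) * 2` factor mirror the kernel excerpt above.

```java
// Hypothetical plain-Java model of the pulse geometry.
final class Pulse {
    // Assumed upper bound of the sampled amplitude (16-bit audio).
    static final int MAX_AMPLITUDE = 32767;

    /** Assumed linear mapping from amplitude to pulse radius in pixels. */
    static float radiusFor(int amplitude, float maxRadius) {
        int clamped = Math.max(0, Math.min(amplitude, MAX_AMPLITUDE));
        return maxRadius * clamped / (float) MAX_AMPLITUDE;
    }

    /** True when (px, py) is strictly inside the circle; compares squared distances, so no sqrt. */
    static boolean inside(int px, int py, int cx, int cy, float radius) {
        float dx = px - cx;
        float dy = py - cy;
        return dx * dx + dy * dy < radius * radius;
    }

    /** Factor from the kernel excerpt: 0 at the centre, approaching 2 at the rim. */
    static float effectFactor(int px, int py, int cx, int cy, float radius) {
        float dx = px - cx;
        float dy = py - cy;
        float dist = (float) Math.sqrt(dx * dx + dy * dy);
        return (dist / radius) * 2f;
    }
}
```

Because the squared-distance test already establishes that the pixel is inside the circle, the square root is only needed once the pixel is known to be affected.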

 

 

Movie Vision: Conclusions

 

Can you guess which movies inspired “Follow the White Rabbit” and “Why So Serious?”?

 

At the beginning of this blog series we stated that the Movie Vision app was conceived as a demonstration to highlight heterogeneous computing capabilities in mobile devices. Specifically, we used RenderScript on Android to show the GPU Compute capabilities of ARM® Mali™ GPU technology. As a proof of concept and a way to explore one of the emerging GPU computing programming frameworks, Movie Vision has been very successful: RenderScript has proven to be an easy to use API. It is worth noting that it is highly portable, leveraging both ARM CPU and GPU technology. The Movie Vision App explored a fun and entertaining use-case, but it is only one example of the potential of heterogeneous approaches like GPU Compute.

 

We hope you have enjoyed this blog series, and that this inspires you to create your own applications that explore the capabilities of ARM technology.

 

 

 

Creative Commons License

This work by ARM is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. However, in respect of the code snippets included in the work, ARM further grants to you a non-exclusive, non-transferable, limited license under ARM’s copyrights to Share and Adapt the code snippets for any lawful purpose (including use in projects with a commercial purpose), subject in each case also to the general terms of use on this site. No patent or trademark rights are granted in respect of the work (including the code snippets).

We are pleased to release a new version of the Mali OpenGL ES Emulator, v2.0-BETA*, which adds support for OpenGL ES 3.1.

From Khronos website:

“OpenGL ES 3.1 provides the most desired features of desktop OpenGL 4.4 in a form suitable for mobile devices,” said Tom Olson, chair of the OpenGL ES working group and Director of Graphics Research at ARM. “It provides developers with the ability to use cutting-edge graphics techniques on devices that are shipping today.”

The OpenGL® ES Emulator is a library that maps OpenGL ES 3.1 API calls to the OpenGL API. By running on a standard PC, the emulator helps software development and testing of next generation OpenGL ES 3.1 applications since no embedded platform is required.

Get the Mali OpenGL ES Emulator

[Image: opengles31-ring-example-win32]

Introduction

 

Following on from The Movie Vision App: Part 1, in Part 2, we’ll immediately move on and discuss two more image filters implemented for the project.

 

 

Movie Vision Filters: “An Old Man Dies…”

 

[Image: oldmandies.png]

This filter is only slightly more complex than the “I’ll be back” effect described in the previous blog. The camera preview image is filtered to a black and white, grainy ‘comic book’ style, but any objects detected as red retain their colour.

The Java portion of the filter does the standard RenderScript initialisation. A Script Intrinsic is used to convert the YUV camera preview data to an RGB array, and a second custom script applies the actual visual effect.

 

/*
Function: root
param uchar4 *in        The current RGB pixel of the input allocation.
param uchar4 *out       The current RGB pixel of the output allocation.
param uint32_t x        The X position of the current pixel.
param uint32_t y        The Y position of the current pixel.
*/
void root(const uchar4 *in, uchar4 *out, uint32_t x, uint32_t y){
    //The black and white output char.
    uchar4 bw;
    //Range between -120 and 120, 120 being the highest contrast.
    //We're applying this to make a high-contrast image.
    int contrast = 120;
    //Use float literals so the division is not truncated to an integer.
    float factor = (255.0f * (contrast + 255)) / (255.0f * (255 - contrast));
    int c = trunc(factor * (in->r - 128) + 128)-50;
    if(c >= 0 && c <= 255)
        bw.r = c;
    else
        bw.r = in->r;
    //Now determine if we apply a 'grain' effect to this pixel - every 4th pixel in both x and y.
    //If both coordinates of the current pixel are divisible by 4, apply a 'grain' effect.
    if(x % 4 == 0 && y % 4 == 0)
        bw.r &= in->g;
    //Finally determine if this pixel is 'red' enough to be left as red...
    //Red colour threshold.
    if (in->r > in->g+55 && in->r > in->b+60) {
        //Only show the red channel.
        bw.g = 0;
        bw.b = 0;
    } else {
        //Make all colour channels the same (Black & White).
        bw.g = bw.r;
        bw.b = bw.r;
    }
    //Make the pixel fully opaque, then set the output pixel to the new black and white one.
    bw.a = 0xff;
    *out = bw;
}







“An Old Man Dies” RenderScript Kernel root function

 

First, we apply a formula to calculate a colour value for the pixel that will result in a high contrast black & white image. The value for every fourth pixel is further modified to stand out, resulting in a slight grain effect. Finally, if the pixel’s red colour value exceeds a certain threshold, only the red channel for that pixel is shown. Otherwise, the blue and green channels are set to the same value as the red to achieve the black & white look.

Once again, can you guess the movie that inspired this filter?

 

 

Movie Vision Filters: “Get To The Chopper…”

 

This filter creates a ‘thermal camera’ effect, and also applies a Heads Up Display (HUD) type decoration. The colour filtering utilises RenderScript, whilst the HUD leverages Android’s built in face detection. A set of look-up tables map specific colour ranges to output colours. Thermal cameras generally map infrared to a narrow set of output colours. This image filter mimics this by mapping input image colours to a similar set of output colours.

 

/**
* Creates the lookup table used for the 'heat map', splitting the image into
* 16 different colours.
*/
private void createLUT() {
    final int SPLIT = 8;
    for (int ct = 0; ct < mMaxColour/SPLIT; ct++){
        for (int i = 0; i < SPLIT; i++){
            switch (ct) {
                /**
                * The following cases define a set of colours.
                */
                case (7):
                    mRed[(ct*SPLIT) +i] = 0;
                    mGreen[(ct*SPLIT) +i] = 255;
                    mBlue[(ct*SPLIT) +i] = 0;
                    break;
                case (6):
                    mRed[(ct*SPLIT) +i] = 128;
                    mGreen[(ct*SPLIT) +i] = 255;
                    mBlue[(ct*SPLIT) +i] = 0;
                    break;
                …
                …
            }
      }
  }
  redLUT.copyFrom(mRed);
  greenLUT.copyFrom(mGreen);
  blueLUT.copyFrom(mBlue);
  mChopperScript.set_redLUT(redLUT);
  mChopperScript.set_greenLUT(greenLUT);
  mChopperScript.set_blueLUT(blueLUT);
}







“Get to the Chopper” creating look-up tables

 

On the Java side, along with the typical set up of Allocation objects to pass input to and receive output from RenderScript, three look-up tables are defined: one each for the red, green and blue colour channels. Each look-up table is essentially an array of 256 values, giving the output value for each possible input value of the colour channel. Each frame is again first converted to RGB before being passed to the image effect RenderScript kernel. After the RenderScript filtering, the decoration drawing callback is used to draw a red, triangular ‘targeting’ reticule on any faces that were detected by the Android face detection API.
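As a compact illustration of the banding idea, here is a minimal plain-Java sketch of a single-channel look-up table. The class name, method name and band colours are hypothetical and chosen only for this example; the app itself fills its tables with a switch statement and copies them into Allocation objects, as shown in the listing above.

```java
// Hypothetical sketch: builds a banded look-up table for one 8-bit colour channel.
final class HeatMapLut {
    // One entry per possible 8-bit input value (0..255).
    static final int LEVELS = 256;

    /**
     * @param bandColours output value for each band, from darkest input to brightest
     * @return a 256-entry table mapping every input level to its band's output value
     */
    static int[] build(int[] bandColours) {
        int bandSize = LEVELS / bandColours.length;
        int[] lut = new int[LEVELS];
        for (int level = 0; level < LEVELS; level++) {
            // Clamp the band index so leftover levels fall into the last band.
            int band = Math.min(level / bandSize, bandColours.length - 1);
            lut[level] = bandColours[band];
        }
        return lut;
    }
}
```

Calling `HeatMapLut.build(new int[] {0, 128, 255, 64})` yields four bands of 64 levels each, so input levels 0 to 63 map to 0 and levels 64 to 127 map to 128. At run time the kernel then only needs one table read per channel per pixel.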

 

...
    //Basic skin detection.
    //These values specifically filter out skin colours.
    if(in->r > in->g+10 && in->r > in->b+5 && in->g < 120) {
        //If skin has been detected, apply the 'hotter' colours.
        out->r = in->r & 40;
        out->g = in->g & 40;
        out->b = 24;
        out->a = 0xff;
    }
    //Otherwise, use the external lookup allocations to determine the colour.
    else {
        out->r = *(uchar*)rsGetElementAt(redLUT, in->r);
        out->g = *(uchar*)rsGetElementAt(greenLUT, in->g);
        out->b = *(uchar*)rsGetElementAt(blueLUT, in->b);
    }
...







“Get to the Chopper” RenderScript Kernel root function

 

The RenderScript script for this effect is very simple. For each pixel, it first checks if the RGB values fall within a range considered a ‘skin tone’. If so, the output is forced to the ‘hot’ output colours. Otherwise, the output values for the pixel are set directly from the pre-configured look-up tables for each channel.

 

Which movie inspired “Get to the Chopper”? As a hint, it features the same actor as “I’ll be back”.

 

That concludes this second Movie Vision App blog. Read on for the most complex image effects of the Movie Vision App and some concluding comments in The Movie Vision App: Part 3!

 

 

 

 


Introduction

 

The Applied Systems Engineering (ASE) team at ARM is responsible for both implementing and presenting technical demonstrations that show ARM® technology in context. These demonstrations make their way to trade shows and other events and are also often discussed in online features such as this series of blogs. Here we introduce our Movie Vision App which explores the use of GPU Compute.

 

 

The Movie Vision Concept

 

The Movie Vision app was conceived as a demonstration to highlight emerging heterogeneous computing capabilities in mobile devices. Specifically, it makes use of Android’s RenderScript computation framework. There is a great deal of discussion about GPU Compute’s capabilities and a variety of exciting new use-cases have been highlighted. On the ARM Connected Community, you can find discussions on the advantages and benefits of GPU Compute, the methods available and details on the compute architecture provided by ARM® Mali™ GPUs. The objective of the Movie Vision application is to provide a visually compelling demonstration that leverages the capabilities of ARM technology and GPU Compute, to explore the RenderScript API and to provide an example application for the Mali Developer community.

 

Movie Vision takes the preview feed from an Android device’s camera, applies a visual effect to that feed and displays it on the device screen. A number of visual effect filters have been implemented, each modeled on various special effects seen in popular movies over the years. Frame-by-frame, this breaks down to applying one or more mathematical operations to a large array of data – a task well suited to the kind of massive parallelism that GPU Compute provides.

 

In this series of blogs, we’ll go through each of the visual effect filters we implemented and use these to explore and understand the capabilities of the RenderScript API.

 

 

RenderScript

 

RenderScript is well described on the Android Developers website, an excellent place to start for details and instructions on its use. To summarize, from a developer’s standpoint, in using RenderScript you will be writing your high-performance ‘kernel’ in C99 syntax and then utilizing a Java API to manage the use of this by your application, and to manage the data going into and out of it. These kernels are parallel functions executed on every element of the data you pass into the script. Under the hood, RenderScript code is first compiled to intermediate bytecode, and then further compiled at runtime to machine code by a sequence of Low Level Virtual Machine (LLVM) based compilers. This is not conceptually dissimilar to the standard Android or Java VM model, but obviously more specialized. The final machine code is generated by an LLVM-based compiler on the device, and optimized for that device. On the Google Nexus 10 used for development of the Movie Vision application, RenderScript would thus make use of either the dual-core ARM Cortex®-A15 CPU or the GPU Compute capabilities of the Mali-T604 GPU.

 

 

Movie Vision Application Structure

 

The Movie Vision app has the following structure:

[Image: ClassBlockDiagram.png]

 

Class / Function

ImageFilterActivity
- Main Android activity class
- UI layout/functionality
- Setup of camera preview
- Setup of face detection

ImageFilterOverlayView
- Allows rendering of icons & text decorations on top of filtered camera preview image

Image Filters (Java)
- Set-up for sharing data with RenderScript kernels
- Callback for each camera preview frame
- Callback for rendering decorations

Image Filters (RenderScript)
- Application of image filter operations to image data

 

 

The functionality of the app is fairly simple. The Android Camera API provides a hook to receive a camera preview callback, each such call delivering a single frame from the camera. The main Movie Vision Activity receives this and passes the frame data to the currently selected image filter. After the frame has been processed, the resulting image is rendered to the screen. A further callback to the selected filter allows decorations such as text or icons to be rendered on top of the image. The Android AsyncTask class is used to decouple image processing from the main UI thread.

 

Each Image Filter shares some relatively common functionality. All of them perform a conversion of the camera preview data from YUV to RGB. The data from the camera is in YUV, but the filter algorithms and output for Movie Vision require RGB values. The Android 4.2 releases included updates to RenderScript which added an “Intrinsic” script to perform this operation. A RenderScript Intrinsic is an efficient implementation of a common operation. These include Blends, Blurs, Convolutions and other operations – including this YUV to RGB conversion. More information can be found on the Android Developer Website. Each Image Filter Java class also configures its ‘effect’ script. Configuration generally consists of allocating some shared data arrays (using the Allocation object) for input and output and allocating or setting any other values required by the script.

 

/**
* RenderScript Setup
*/
mTermScript = new ScriptC_IllBeBack(mRS, res, R.raw.IllBeBack);


Type.Builder tb = new Type.Builder(mRS, Element.RGBA_8888(mRS));
tb.setX(mWidth);
tb.setY(mHeight);


mScriptOutAlloc = Allocation.createTyped(mRS, tb.create());
mScriptInAlloc = Allocation.createSized(mRS, Element.U8(mRS), (mHeight * mWidth) + ((mHeight / 2) * (mWidth / 2) * 2));














Initial set up of a RenderScript kernel
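For readers unfamiliar with the YUV data involved, the plain-Java sketch below shows the input buffer size calculation used in the set up above, together with a per-pixel conversion. It is only an illustration under stated assumptions: the names are hypothetical, the frame layout is assumed to be a full-resolution Y plane followed by 2x2-subsampled chroma, and the coefficients are the common video-range BT.601 ones, which may differ from what the RenderScript Intrinsic actually uses.

```java
// Hypothetical sketch of YUV frame sizing and per-pixel YUV to RGB conversion.
final class Yuv {
    /** Bytes in a frame: one Y byte per pixel plus two chroma bytes per 2x2 pixel block. */
    static int frameBytes(int width, int height) {
        return width * height + (height / 2) * (width / 2) * 2;
    }

    /** Clamp a value into the 0..255 range of an 8-bit channel. */
    static int clamp(int v) {
        return Math.max(0, Math.min(255, v));
    }

    /** Convert one video-range BT.601 YUV sample (assumed) to packed 0xAARRGGBB. */
    static int toArgb(int y, int u, int v) {
        float c = 1.164f * (y - 16);
        int r = clamp(Math.round(c + 1.596f * (v - 128)));
        int g = clamp(Math.round(c - 0.813f * (v - 128) - 0.391f * (u - 128)));
        int b = clamp(Math.round(c + 2.018f * (u - 128)));
        return 0xFF000000 | (r << 16) | (g << 8) | b;
    }
}
```

For a 640x480 preview frame, `Yuv.frameBytes(640, 480)` is 460,800 bytes, which is 1.5 bytes per pixel compared with 4 bytes per pixel for the RGBA output allocation.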

 

 

Movie Vision Filters: “I’ll Be Back”

 

[Image: illbeback.png]

This filter applies a famous movie effect with a red tint and an active Heads-Up Display that highlights faces. It is probably the simplest Movie Vision effect in terms of the RenderScript component. However, a desired additional feature of this filter highlights some challenges.

 

/**
* Processes the current frame. The filter first converts the YUV data
* to RGB via the conversion script. Then the RenderScript kernel is run, which applies a
* red hue to the image. Finally the filter looks for faces and objects, and on finding one
* draws a bounding box around it.
*
* @param data The raw YUV data.
* @param bmp Where the result of the ImageFilter is stored.
* @param lastMovedTimestamp Last time the device moved.
*/
public void processFrame(byte[] data, Bitmap bmp, long lastMovedTimestamp){
    mScriptInAlloc.copyFrom(data);
    mConv.forEach_root(mScriptInAlloc, mScriptOutAlloc);
    mTermScript.forEach_root(mScriptOutAlloc, mTermOutAlloc);
    mTermOutAlloc.copyTo(bmp);
}














“I’ll Be Back” processing each camera frame

 

The Java component of this filter is relatively straightforward. The YUV/RGB conversion and image effect RenderScript kernels are configured. For each camera preview frame, we convert to RGB and pass the image to the effect kernel. After that, in a second pass on the frame, we render our HUD images and text if any faces have been detected. This draws a box around the face and prints out some text to give the impression that the face is being analyzed.

 

/*
Function: root
param uchar4 *in        The current RGB pixel of the input allocation.
param uchar4 *out      The current RGB pixel of the output allocation.
*/
void root(const uchar4 *in, uchar4 *out){
  uchar4 p = *in;
  // Keep the red channel and zero the green and blue channels to create the red tint.
  out->r = p.r;
  out->g = 0;
  out->b = 0;
  out->a = 0xff;
}

“I’ll Be Back” RenderScript Kernel root function

 

The RenderScript component is very simple. The output green and blue channels are zeroed, so that just the red channel is visible in the final image.
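
For checking the kernel's output on the CPU, the same per-pixel operation can be written in plain Java. This is a rough sketch only: it assumes pixels packed as 32-bit ARGB integers (as returned by Bitmap.getPixels), and RedTintReference is a hypothetical helper that is not part of the app.

```java
// Sketch: CPU reference for the red-tint effect on packed ARGB_8888 pixels.
// RedTintReference is an illustrative name, not Movie Vision code.
final class RedTintReference {
    /** Forces alpha to opaque, keeps red, and zeroes green and blue. */
    static int redOnly(int argb) {
        return 0xFF000000 | (argb & 0x00FF0000);
    }

    /** Applies the tint to every pixel and returns a new array. */
    static int[] apply(int[] pixels) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = redOnly(pixels[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int pixel = 0x80123456; // alpha 0x80, red 0x12, green 0x34, blue 0x56
        System.out.println(Integer.toHexString(redOnly(pixel))); // ff120000
    }
}
```

Comparing a few pixels from the RenderScript output against a reference like this is a cheap way to confirm that the channels have not been swapped.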

 

Initially, an attempt was made to add pseudo object detection to this effect, such that ‘objects’ as well as faces would be highlighted by the HUD. A prototype was implemented using the OpenCV library, which includes an implementation of a Contour Detection algorithm. It is worth noting that this approach would not use GPU Compute and would run only on the CPU. Contour Detection is a relatively complex multi-stage computer vision algorithm. First, a Sobel Edge Detection filter is applied to bring out the edges of the image. Then, a set of steps to identify joined edges (contours) is applied. The prototype object detection then interpreted the result to find contours in the image that were likely to be objects. Generally, large, rectangular regions were chosen. One of these would be selected and highlighted with the HUD decorations as an ‘object of interest’.

 

The issue with this object detection prototype was that it required several passes of algorithmic steps, with intermediate data sets. Porting this to RenderScript to take advantage of GPU Compute would have resulted in several chained together RenderScript scripts.  At the time of initial development this resulted in some inherent inefficiencies, although the addition of ‘Script Groups’ in Android 4.2 will have helped to address this. For now, porting the Contour Detection algorithm to RenderScript remains an outstanding project.

 

That concludes the first blog in this series on the Movie Vision App. Carry on reading with The Movie Vision App: Part 2 for examples of increasingly complex visual effect filters. Can you guess the movie that inspired the “I’ll be back” filter?

 

 

 

Creative Commons License

This work by ARM is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. However, in respect of the code snippets included in the work, ARM further grants to you a non-exclusive, non-transferable, limited license under ARM’s copyrights to Share and Adapt the code snippets for any lawful purpose (including use in projects with a commercial purpose), subject in each case also to the general terms of use on this site. No patent or trademark rights are granted in respect of the work (including the code snippets).

Mali Graphics Debugger v1.3.2 now supports the most recent version of Android, 5.0 “Lollipop”.

Mali Graphics Debugger is an advanced API tracer tool for OpenGL ES, EGL and OpenCL. It allows developers to trace their graphics and compute applications in order to debug issues and analyze performance.

It also supports Linux targets and runs on Windows, Linux and Mac.

Get Mali Graphics Debugger

Mali Graphics Debugger v1.3.2

Gaming has always been one of the top applications for GPUs, and it accounts for the majority of app revenue on smart mobile devices. That is why ARM is running Game Developer Days in major cities this year. With China among the top three mobile game markets, hosting an ARM Game Developer Day there was a no-brainer. Partnering with the Chengdu Mobile Internet Society, we hosted our very first ARM China Game Developer Day in Chengdu, the capital of Sichuan province and the biggest city in Western China. The location of the event was also strategic: it was held in a coffee shop inside the Tianfu Software Park, one of the top high-tech industrial parks in China, making it convenient for local developers to join.

ppt-1.jpg

 

In addition to in-depth technical topics covering the latest Mali technology and the ARMv8 architecture, we invited speakers from the three top game engines (Cocos, Unity3d and Unreal) to share how their game engines are best optimized for the ARM platform. Here is the agenda for our ARM Game Developer Day in Chengdu.

 

Start | Agenda | Speaker
----- | ------ | -------
9:00 | Opening / Registration / Demo |
9:30 | Win On ARM - The challenges and trends of the next generation Mobile Games | Leon Zhang (章立), APAC Ecosystem Marketing Manager, ARM
10:00 | ARM Mali™ GPU architecture overview and ARM OpenGL® ES extensions | Joel Liang (梁宇洲), Senior Ecosystem Engineer, ARM
10:45 | ARM CPU & Mali GPU Synergy Development and Deep Optimization using ARM tools, with live Cocos2d-x, Unity and Unreal demos on best practices | Nathan Li (李陈鲁), Staff Engineer, Tech Leader, ARM
11:45 | Unity5 and Enlighten Realtime Global Illumination Technology | Zhenpin Guo (郭振平), Technology Director APAC, Unity
12:30 | Lunch and Q&A |
13:30 | Benefits of multithread and big.LITTLE, NEON and overview of ARMv8 benefits for game developers | Alan Chuang (莊智鑫), Ecosystem Marketing Manager, ARM
14:30 | Unreal Engine4 and Optimization experiences on Mali GPU | Jack Porter, Engine Development and Support Lead, Epic Games
15:15 | Tea Break / Demo |
15:30 | Cocos, not only an engine - enabling your games on ARM 64bit and accelerating the game development | Kenan Liu (刘克男), Cocos Technical Marketing Manager, Chukong Technology
16:15 | What's new in OpenGL ES 3.1 - ASTC Full Profile (HDR & 3D Textures) and Compute Shaders | Frank Lei (雷宇), Senior DevRel Engineer, ARM
17:15 | Q&A and Lucky Draw |

 

 

 

   Leon Zhang  - ARM (Win On ARM  - The challenges and trends of the next generation Mobile Games)

2.jpg

   Joel Liang  - ARM (ARM Mali Architecture Overview and ARM OpenGL ES extensions)

IMAG4401-s.jpg

   Nathan Li  - ARM (Performance Analysis with Mali Tools)

IMAG4406-s.jpg

  Zhenpin Guo - Unity  (Unity5 and Enlighten Realtime Global Illumination Technology)

5.jpg

  Alan Chuang - ARM (Benefits of Multithread and big.LITTLE, NEON and Overview of ARMv8 architecture)

alan-title.jpg

  Jack Porter - Epic Games  (Unreal Engine4 and Optimization experiences on Mali GPU)

3.jpg

  

   Kenan Liu - Chukong (Cocos -- not only an engine)

4.jpg

  Frank Lei - ARM (What's New in OpenGL ES 3.1 -- ASTC Full Profile and Compute Shaders)

640.jpg


The turnout was surprisingly good. We had 120 people registered for the event, but a total of 131 people showed up. Many of them were from well-known game companies here in China, and several partners even flew in from other cities to join the event. There were good Q&A sessions during the presentations and even more interaction during the breaks. Not only did our participants enjoy the technical presentations, but our guest speakers also enjoyed chatting with the developers.

 

IMAG4404-s.jpg


ppt-4.jpg

圖片1.jpg

For a recap of the event, check out the wrap up summary from our local partner here.

For more events like this and other technical information, check out our Mali Developer Center.


Today, the Media Processing team at ARM is delighted to announce the launch of five new products: the ARM® Mali-T860, Mali-T830 and Mali-T820 graphics processors, the Mali-V550 video processor and the Mali-DP550 display processor.

 

The changing market

 

We’ve discussed the opportunities emerging in the growing mainstream market previously in these blogs. With well over 1Bn consumers already, each of whom has different requirements in terms of price, performance and feature-set, our partners need a choice of semiconductor IP which enables them to address the diversity of demands within this high-volume segment. ARM has long understood the fact that one size and one feature set does not address the needs of every market segment or best serve the needs of partners who are all looking to quickly differentiate their products to gain a competitive advantage. With this in mind, we aim to deliver a scalable roadmap of core IP, bringing our partners choice as well as enabling them to accelerate their time to market and freeing their engineers to bring more innovation and diversity to this accelerating market.

 

At the same time as this diversification in device type is taking place, ARM and its partners are also seeing important trends in mobile content consumption that need to be taken into account when designing the next generation of semiconductor IP. Jakub Lamik, our Director of Product Marketing, discusses some of the important trends such as increased pixel density, increased screen resolutions and increasingly complex content in his blog last week and explains the inter-core technology which ARM offers that helps our partners deal with the increasing strain this content applies to mobile devices.

 

When you take both of these aspects into consideration, there is a range of challenges which our partners face in producing successful end devices. Central to all of them is the need to offer a range of price and performance points in an energy efficient fashion in order to enable the latest content across the entire breadth of the market.

 

ARM’s new suite of integrated Mali IP

 

Since the development of the first mobile phone, ARM has worked with our partners to develop technology that continually extends the capabilities of the mobile within its fixed power budget. Today we are launching five new products which address the diverse media needs of the mainstream market. The suite offers options for cost efficiency, performance efficiency and the ability to get to market faster, all combined with innate energy efficiencies provided by the ability to allocate tasks across the system to the most appropriate processor, be that CPU, GPU, video or display.


Media Suite Launch.png

 

Introducing the Mali-T860 GPU

 

The Mali-T860 scales across sixteen cores to offer the best performance for the lowest energy consumption of any Midgard GPU. Building on the technical advances of our previous generations of GPUs, it offers a 45% improvement in energy efficiency compared to the Mali-T628 in the same configuration and process node. With micro-architectural enhancements such as quad prioritization and improved early Z test throughput, performance is improved across both casual and advanced gaming content. It is the perfect GPU for an end device targeted at the most demanding consumers who want a great visual experience at an affordable price point.

 

Mali-T860.png

 

Because the key focus of the Mali-T860 is on performance efficiency, it delivers this extra performance within an impressively small energy budget by incorporating support for a range of bandwidth reducing technology including ARM Frame Buffer Compression, Smart Composition and Transaction Elimination.  Native hardware support for 10-bit YUV has also been added to make this GPU an ideal accompaniment to the Mali-V550 video processor and Mali-DP550 display processor, so that users can experience the best visual quality when watching content in an increasingly 4K DTV and STB market. 10-bit YUV is available across the entire media suite released today, whether as native hardware support such as in the Mali-T860 or as a configuration option as in the Mali-T820.

 

For more information on the Mali-T860 GPU, visit its product page.

 

Introducing the Mali-T830 and Mali-T820 GPUs

 

Entering the cost efficient roadmap are the Mali-T820 and Mali-T830. These two GPUs are an evolution of the Mali-T720, recently announced as the GPU in the MediaTek MT6735, and, having been developed alongside the Mali-T860, they have also inherited some important features from this performance efficient GPU which enable them to offer not only area and energy efficiencies compared to previous generations, but performance advancements as well, such as quad prioritization.

The Mali-T820 is optimized for entry-level products and achieves up to 40% higher performance density compared to the Mali-T622. Comparatively, the Mali-T830 balances area, performance and energy efficiency to deliver maximal performance from a minimal silicon area. It has an additional arithmetic pipeline compared to the Mali-T820 and offers up to 55% more performance than the Mali-T622 GPU in the same configuration and process node. It is ideal for bringing more advanced 3D gaming and arithmetically complex use cases to consumers of mainstream smartphones, tablets and DTVs.

 

Mali-T820,T830.png

 

Together, the Mali-T820 and Mali-T830 introduce ARM Frame Buffer Compression to the cost efficient roadmap for the first time. This will ensure that the system-wide bandwidth savings made possible by AFBC – up to 50% - will appear in the next couple of years in more affordable devices, enabling these to deliver high quality multimedia experiences to consumers for longer.

 

For more information on the Mali-T820 and Mali-T830 GPUs, visit their product pages.

 

Introducing the Mali-V550 video processor

 

The Mali-V550 is ARM’s next generation, low bandwidth, multi core, multi codec encode & decode video IP. It is the IP industry’s first single-core video encode and decode solution for HEVC; the combination of encode and decode functionality on a single core and its ability to maximize re-use across multiple codecs ensure that the Mali-V550 maintains its strong area efficiency leadership.

 

The Mali-V550 is a multi-core solution out of the box, scalable to 4K resolution at 120fps or 1080p at 480fps with an 8-core configuration.  The architecture supports multiple video streams across multiple cores as well as simultaneous encode and decode. For example, you can decode eighteen 720p30 streams in parallel with a Mali-V550 MP4, or run any combination of encode and decode. These streams may use different coding standards and are time multiplexed on a frame basis.
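
These figures are consistent with each other if you treat each core as a fixed pixel-rate budget. The sketch below works through the arithmetic under some loud assumptions: that "4K" means 3840x2160, that throughput scales linearly with core count, and that pixel rate is the only limit. VideoThroughput is a hypothetical helper for illustration, not an ARM tool.

```java
// Sketch: pixel-rate arithmetic behind the quoted Mali-V550 figures.
// Assumes 4K = 3840x2160 and linear scaling with core count.
final class VideoThroughput {
    /** Pixels per second for a stream of the given size and frame rate. */
    static long pixelsPerSecond(int width, int height, int fps) {
        return (long) width * height * fps;
    }

    /** Whole streams of the given format that fit in a pixel-rate budget. */
    static long streamsThatFit(long budget, int width, int height, int fps) {
        return budget / pixelsPerSecond(width, height, fps);
    }

    public static void main(String[] args) {
        long eightCores = pixelsPerSecond(3840, 2160, 120);
        long sameBudget = pixelsPerSecond(1920, 1080, 480);
        System.out.println(eightCores == sameBudget); // true: both are 995328000 pixels/s

        long perCore = eightCores / 8;   // 124416000 pixels/s, i.e. 1080p at 60fps
        long fourCores = perCore * 4;    // budget of an MP4 configuration
        System.out.println(streamsThatFit(fourCores, 1280, 720, 30)); // 18
    }
}
```

Under these assumptions, the eighteen 720p30 streams quoted for the MP4 configuration use exactly the four-core budget.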


Motion search elimination, introduced in Jakub’s blog last week, enables the video processor to avoid a large amount of processing related mainly to the motion search engine, but also sometimes entire reconstruction.  The best motion search elimination benefits apply to WiFi scenarios, when encoding and sending static content (such as user interface or 2D games) to an external display. In such a situation, it is able to lower memory bandwidth by up to 35% as well as lower latency.

While system power, performance and silicon area are all critical for our SoC partners, this cannot come at the expense of visual quality. The Mali-V550 is robust against external memory latency: video processing can continue for over 5000 cycles without external memory access and the Mali-V550 can hide more than 300 clock cycles of static latency from a slow memory system without dropping a frame. This means that consumers will benefit from smooth playback with no dropped frames when experiencing multimedia on a device with the Mali-V550 video processor. The Mali-V550 also maintains support for AFBC.

 

For more information on the Mali-V550 video processor, visit its product page.

 

Introducing the Mali-DP550 display processor

 

The Mali-DP550 completes the suite of IP launched today and offers efficient media processing right to the glass.

 

One way of delivering system-wide energy efficiency is to enable each task to be executed on the most appropriate processor.  We have talked about this a lot before in these blogs in the case of GPU Compute enabling applications such as computational photography.  When a Mali-DP550 is deployed in a mobile media system, it too can offload basic tasks from the GPU or CPU, such as user interface composition, scaling, rotation, post-processing and display control. It does all of this in a single pass, so there is no need to go out to memory, which brings extra bandwidth and power savings.

 

The principal additional feature of the Mali-DP550 is its co-processor interface which enables partners to easily integrate third party or proprietary display IP with the display processor. As the mainstream market diversifies and grows, delivering the right choice of application processors so that consumers can buy a device without compromise requires the ability to differentiate and deliver products quickly and simply. Display is regularly an important differentiating factor for our partners, and with this co-processor interface our partners can continue to use their proprietary display algorithms while benefiting from the advantages that licensing a highly functional core IP block can bring.

 

For more information on the Mali-DP550 display processor, visit its product page.

 

Why choose a media system from ARM?

 

ARM offers each of the IP blocks above as separate licensable products, but the advantages come when you employ an entirely ARM-based system. ARM partners discover system-wide bandwidth efficiencies, reduced time to market and the ability to focus engineering on critical differentiation. Thanks to our bandwidth saving technologies, the availability of an integrated software stack and system-wide performance analysis tools such as DS-5 Streamline, employing an integrated ARM-based media system is simple and very effective.  And importantly, partners can be reassured of the quality of the new products they license because of the proven verification and validation processes that ARM implements consistently across our entire IP range, from CPUs to display IP.

 

Mali System.png

 

The ARM Mali media IP products are available for immediate licensing and initial consumer devices are expected to appear in late 2015 and early 2016.

Chinese Version中文版 : 节约带宽才是王道

 

Building an efficient and high performing System-on-Chip(SoC) is becoming an increasingly complex task. The growing demands for bandwidth heavy applications mean that system components are required to improve efficiency in each generation to address the additional bandwidth consumption that these apps entail. And this is true across all markets: as high-end mobile computing platforms strive for ever greater performance alongside better energy efficiency, SoCs targeting the mainstream still need to deliver a premium style feature set and performance density - and at the same time reduce manufacturing costs and time to market too!

 

If you look closely at typical user interactions with consumer devices you will realize that they very often centre on a combination of text, audio, still images, animation and video; in other words, a multimedia experience. What is significant about this is that media intensive use cases require the transfer of large amounts of data and the more advanced the user experience, the higher the requirement for increased system bandwidth. But the higher the bandwidth consumption, the higher the power consumption.


BW.png


But which specific use cases and functional requirements are driving this nearly exponential increase in bandwidth demand?

 

  • Screen sizes and resolutions have rapidly increased across a wide range of form factors and performance points. The number of mobile devices with HD screen resolutions is growing fast and tablets now often have a 2.5K screen. There is no sign of this trend slowing down.

 

  • The frames per second that a media system has to deliver have increased.  At the same time, display refresh rates are expected to provide a more compelling user experience. In reality, 60FPS has been a must for some time now and with some advanced use cases we are moving up to 120FPS.

 

  • The amount of computing throughput required to calculate each scene has increased. High-end games and use cases require more complex calculations to represent each of the final pixels. Increasing arithmetic throughput on the per pixel basis simply means that there is more data being transferred and processed.

 

So in summary, there are more pixels which all have to be delivered faster – and at the same time more work is required for each of these pixels every frame. All of this requires not only higher computational capabilities but also more bandwidth. Unless SoC designers think about this in advance when designing a media system, a lot more power will be consumed when delivering a quality user experience.
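
To put rough numbers on "more pixels, delivered faster", here is a small sketch of the bandwidth needed just to write the final frame buffer once per frame. The resolutions, the 32-bit pixel format and the FramebufferBandwidth helper are all assumptions for illustration; real workloads add texture reads, overdraw and composition traffic on top of this.

```java
// Sketch: bytes per second needed to write each final pixel once per frame.
// Resolutions and the 4-byte pixel format are illustrative assumptions.
final class FramebufferBandwidth {
    static long bytesPerSecond(int width, int height, int bytesPerPixel, int fps) {
        return (long) width * height * bytesPerPixel * fps;
    }

    public static void main(String[] args) {
        // 720p at 60 FPS
        System.out.println(bytesPerSecond(1280, 720, 4, 60));   // 221184000
        // An assumed 2560x1600 "2.5K" tablet panel at 60 FPS
        System.out.println(bytesPerSecond(2560, 1600, 4, 60));  // 983040000
        // The same panel at 120 FPS
        System.out.println(bytesPerSecond(2560, 1600, 4, 120)); // 1966080000
    }
}
```

Even in this simplified model, moving from 720p at 60 FPS to a 2.5K panel at 120 FPS multiplies the frame buffer write traffic by almost nine.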

 

So what can we do about that? A typical media system consists of a number of IP components, each of which has slightly different functionality and characteristics. Each stage of the media processing pipeline is handled by a separate block, and each of them (as Sean explained in his blog) has inputs, intermediate data and outputs - all of which contribute to the total power budget.

 

system.png

 

Looking deeper, the typical media pipeline consists of a number of interactions between the GPU, Video and Display Processors and it requires the passing of a certain amount of data between each of these components. If we take a glass half full view, it means that there are plenty of opportunities to optimize these interactions and provide components that save bandwidth by working together in an efficient way. This is exactly why ARM has developed a range of bandwidth reducing technologies: to allow increasingly complex media within the power capacity and thermal limit of mobile devices.


Motion Search Elimination - Lowering Latency and Power

 

Let’s start by looking at how optimizations already applied to Mali GPUs can be applied to other media components. New use cases require an innovative approach. Nowadays we are seeing more and more use cases that require wireless delivery of audio and video from tablets, mobile phones and other consumer devices to large screens such as a DTV. Both sending and receiving devices must support compression of the video stream using algorithms such as H.264. In a typical use case, instead of writing the frame buffer to the screen memory, the Display Processor will send it to the Video Processor for encoding, and the compressed frame will then be sent over the WiFi network.

 

mse.png

 

Motion Search Elimination extends the concept of Transaction Elimination, introduced last year in the Mali GPUs and described below, to Display and Video Processors. Each of them maintains a signature for each tile and when the Display Processor writes the frame buffer out, the Video Processor can eliminate motion search for tiles where signatures match. Why does this matter? Motion estimation is an expensive part of the video pipeline so skipping the search for selected tiles will lower latency of Wi-Fi transmission, lower bandwidth consumption and as a result, lower the entire SoC power.

 

Transaction Elimination – Saving External Bandwidth

 

Transaction Elimination (TE) is a key bandwidth saving feature of the ARM Mali Midgard GPU architecture that allows significant energy savings when writing out frame buffers. In a nutshell, when TE is enabled, the GPU compares the current frame buffer with the previously rendered frame and performs a partial update only to the particular parts of it that have been modified.

TE.png

With that, the amount of data that needs to be transmitted per frame to external memory is significantly reduced. TE can be used by every application for all frame buffer formats supported by the GPU, irrespective of the frame buffer precision requirements. It is highly effective even on first person shooters and video streams. Given that in many other popular graphics applications, such as User Interfaces and casual games, large parts of the frame buffer remain static between two consecutive frames, frame buffer bandwidth savings from TE can reach up to 99%.
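
The idea of comparing per-tile signatures between frames can be sketched in software. This is only an illustration of the concept: the 16x16 tile size and the use of CRC32 as the signature are my assumptions, and TileSignatures is a hypothetical class; the hardware's actual signature scheme is not described here.

```java
// Sketch: per-tile signatures used to skip writing unchanged tiles.
// Tile size (16x16) and CRC32 are assumptions made for illustration.
final class TileSignatures {
    static final int TILE = 16;

    /** One signature per tile, in row-major tile order, for packed 32-bit pixels. */
    static long[] signatures(int[] frame, int width, int height) {
        int tilesX = (width + TILE - 1) / TILE;
        int tilesY = (height + TILE - 1) / TILE;
        long[] sigs = new long[tilesX * tilesY];
        for (int ty = 0; ty < tilesY; ty++) {
            for (int tx = 0; tx < tilesX; tx++) {
                java.util.zip.CRC32 crc = new java.util.zip.CRC32();
                for (int y = ty * TILE; y < Math.min((ty + 1) * TILE, height); y++) {
                    for (int x = tx * TILE; x < Math.min((tx + 1) * TILE, width); x++) {
                        int p = frame[y * width + x];
                        // update(int) consumes only the low eight bits of its argument
                        crc.update(p >>> 24);
                        crc.update(p >>> 16);
                        crc.update(p >>> 8);
                        crc.update(p);
                    }
                }
                sigs[ty * tilesX + tx] = crc.getValue();
            }
        }
        return sigs;
    }

    /** Tiles whose signature changed, i.e. tiles that still have to be written out. */
    static int tilesToWrite(long[] previous, long[] current) {
        int count = 0;
        for (int i = 0; i < current.length; i++) {
            if (previous[i] != current[i]) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int width = 64, height = 32;            // 4 x 2 = 8 tiles
        int[] previous = new int[width * height];
        int[] current = previous.clone();
        current[5 * width + 20] = 0xFFFFFFFF;   // change one pixel in tile (1, 0)
        long[] a = signatures(previous, width, height);
        long[] b = signatures(current, width, height);
        System.out.println(tilesToWrite(a, b) + " of " + b.length + " tiles written");
    }
}
```

In this toy frame only one of the eight tiles differs, so only that tile would be written back to external memory.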

 

TE0.png

 

Smart Composition - Reducing Bandwidth and Workloads

 

So is there anything else we could do to minimize the amount of data processed through the GPU? Smart Composition is another technology developed to reduce bandwidth while reading in textures during frame composition and, as outlined by Plout in his blog, it builds on the previously described Transaction Elimination.


SC.png

 

By analyzing frames prior to final frame composition, Smart Composition determines if any reason exists to render a given portion of the frame or whether the previously rendered and composited portion can be reused. If that portion of the frame can be reused then it is not read from memory again or re-composited, thereby saving additional computational effort.

 

AFBC - Bandwidth Reduction in Media System

 

Now let’s look more closely at interactions between the GPU, Video and Display processors. One of the most bandwidth intensive use cases is video post processing. In many use cases, the GPU is required to read a video and apply effects when using video streams as textures in 2D or 3D scenes. In such cases, ARM Frame Buffer Compression (AFBC), a lossless image compression protocol and format with fine grained random access, reduces the overall system level bandwidth and power by up to 50% by minimizing the amount of data transferred between IP blocks within a SoC.

 

AFBC.png

When AFBC is used in an SoC, the Video Processor will simply write out the video streams in the compressed format and the GPU will read them and only uncompress them in the on-chip memory. Exactly the same optimization will be applied to the output buffers intended for the screen. Whether it is the GPU or Video Processor producing the final frame buffers, they will be compressed so that the Display Processor will read these in the AFBC format and only uncompress when moving to the display memory. AFBC is described in more detail in Ola’s blog Mali-V500 video processor: reducing memory bandwidth with AFBC.


AFBC2.png

 

ASTC - Flexibility, Reduced Size and Improved Quality

 

But what about interactions between the GPU and a graphics application such as a high-end game or user interface? This is the perfect opportunity to optimize the amount of memory that texture assets require. Adaptive Scalable Texture Compression (ASTC) technology was developed by ARM and AMD, donated to Khronos, and has been adopted as an official extension to both the OpenGL® and OpenGL® ES graphics APIs. ASTC is a major step forward in reducing memory bandwidth and thus energy use, all while maintaining image quality.


ASTC.png


The ASTC specification includes two profiles: LDR and Full, both of which are already supported on Mali-T62X GPUs and above and are described in more detail by Tom Olson and Stacy Smith.


Mali OpenGL ES Extensions – Efficient Deferred Shading

 

To finish, let’s explore another great opportunity to optimize system bandwidth. Modern games apply various post processing effects and the textures are often combined with the frame buffer. This means that memory is written out through the external bus and then read back multiple times to achieve advanced graphics effects. This type of deferred rendering requires the transfer of a significant amount of external data and consumes a lot of power. But Mali GPUs are a tile-based rendering architecture, which means that fragment shading is performed tile by tile using on-chip memory, and only when all the contents of a tile have been processed is the tile data written back out to external memory.

 

PLS2.png

 

This is a perfect opportunity to employ deferred shading without the need to write out the data through the external bus. ARM has introduced two advanced OpenGL ES extensions that enable developers to achieve console-like effects within the mobile bandwidth and power budget: Shader Framebuffer Fetch and Shader Pixel Local Storage. For more information on these extensions, read Jan-Harald’s blog.

pls4.png

 

Anything Else?

 

So have we exhausted all of the possibilities to minimize system bandwidth with the technologies described in this blog? The good news is … of course not! At ARM we are positively obsessed with finding new areas for optimizations and making mobile media systems even more power efficient. In each generation our Silicon Partners, OEMs and end consumers help us to discover new use cases which are posing different challenges and requirements. With this there is a constant stream of new opportunities to get our innovation engines going and design even more efficient SoCs.

 

Got any questions on the technologies outlined above? Let us know in the comments section below.

             SDK.jpg

Following on from my previous blog Mali Tutorial Programme for Novice Developers, I am pleased to announce that the first complete semester of 12 tutorials has now been finished and released. As a reminder, these tutorials are meant for people with no prior graphics, or even Android, experience. Over the course of the 12 tutorials we will take you from a simple Android application to being able to create an application that loads models from industry standard modelling packages and lights them and normal maps them correctly.  A getting started guide is also included to help you set up your computer to build Android applications.

 

These tutorials are meant to follow on from each other, with each one building on the previous. However, when you get to the simple cube, most of the later tutorials are based off this. To download these tutorials all you need to do is download the Mali Android SDK from Mali Developer Center.

 

Here is a brief summary of the 12 tutorials and what is included in each:

 

1) First Android Native Application: An introduction to creating a basic Android application that uses both the Android SDK and the Android NDK.

 

2) Introduction to Shaders: A brief introduction to shaders and the graphics pipeline. This tutorial is a great companion to the rest of the tutorials and gives you better insight into some of the concepts used later on. It is also great to come back to when you have completed the later tutorials, as it will help to answer some of the questions you may have.

 

3) Graphics Setup: This tutorial teaches you all the setup required to run OpenGL® ES graphics applications on an Android platform. It briefly talks about EGL, surface and contexts - just enough for you to be able to draw graphics in the next tutorial.

 

4) Simple Triangle: Finally you get to draw something to the screen! It is only a triangle, but this triangle will be the basis for nearly everything you choose to do with mobile graphics.

 

5) Simple Cube: In this tutorial we explore how to use 3D objects. Mathematical transformations are also discussed so that you are able to move and rotate the cube at will.

 

6) Texture Cube: Once we have the cube it is time to start making it more realistic. An easy way to do this is through texturing. You can think about it like wallpapering the cube with an image. This allows you to add a lot of detail really simply.

 

7) Lighting: Next we add a realistic approximation to lighting to give the scene more atmosphere. We also go through some of the maths that is involved in the lighting approximations.

 

8) Normal Mapping: This is a way to make our lighting look even more realistic without a heavy cost on calculating at runtime. This is done by doing most of the computation offline and adding it to a texture.

 

9) Asset Loading: This is the tutorial where we get to move away from using a standard cube. This tutorial teaches you how to import objects generated from third party tools. This means you can add objects into your application like characters, furniture and even whole buildings.

 

10) Vertex Buffer Objects: Bandwidth is a huge limiting factor when writing a graphics application for mobile. In this tutorial we explore one way to reduce this by sending vertex information only once.

 

11) Android File Loading: Up until now all of our textures and shaders have been included in the C or Java files that we have been using. This tutorial allows you to separate them out into separate files and then bundle them into your APK. This tutorial also teaches you how to extract your files out of the APK again at runtime.

 

12) Mipmapping and Compressed Textures: As a follow on from Vertex Buffer Objects, this tutorial explores two other ways of reducing bandwidth. OpenGL ES supports the use of certain compressed texture formats. This tutorial explores those as well as using smaller versions of the same texture to deliver not only better looking results, but also a reduction in bandwidth.

 

Got any questions or feedback concerning these tutorials? Let me know in the comments section below.

I am interrupting my blog series to share what I think is a rather elegant way to quickly get up and running with OpenCL on the ARM® Mali-T604 GPU powered Chromebook. Please bear in mind that this is not ARM's "official guide" (which can be found here). However, it's a useful alternative to the official guide if, for example, you don't have a Linux PC or just want to use Chrome OS day in and day out.

 

You will need:

 

How fast you will complete the installation will depend on how fast you can copy-and-paste instructions from this guide, how fast your Internet connection is and how fast your memory card is (I will give an approximate time for each step measured when using 30 MB/s and 45 MB/s cards). The basic OpenCL installation should take up to half an hour; PyOpenCL and NumPy about an hour; further SciPy libraries about 3-4 hours. Most of the time, however, you will be able to leave the Chromebook unattended, beavering away while compiling packages from source.

 

Finally, the instructions are provided "as is", you use them at your own risk, and so on, and so forth... (The official guide also contains an important disclaimer.)

 

Installing OpenCL

Enabling Developer Mode

NB: Enabling Developer Mode erases all user data - do a back up first.

 

Enter Recovery Mode by holding the ESC and REFRESH (↻ or F3) buttons, and pressing the POWER button. In Recovery Mode, press Ctrl+D and ENTER to confirm and enable Developer Mode.

 

Entering developer shell (1 min)

Open the Chrome browser and press Ctrl-Alt-T.

Welcome to crosh, the Chrome OS developer shell.

If you got here by mistake, don't panic!  Just close this tab and carry on.

Type 'help' for a list of commands.

Don't panic, keep the tab open and carry on to enter the shell:

crosh> shell
chronos@localhost / $ uname -a
Linux localhost 3.8.11 #1 SMP Mon Sep 22 22:27:45 PDT 2014 armv7l SAMSUNG EXYNOS5 (Flattened Device Tree) GNU/Linux

 

Preparing an SD card (5 min)

Insert a blank SD card (denoted as /dev/mmcblk1 in what follows):

chronos@localhost / $ sudo parted -a optimal /dev/mmcblk1
GNU Parted 3.1
Using /dev/mmcblk1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
Warning: The existing disk label on /dev/mmcblk1 will be destroyed 
and all data on this disk will be lost. Do you want to continue?
Yes/No? Y
(parted) unit mib
(parted) mkpart primary 1 -1
(parted) name 1 root
(parted) print
Model: SD SU08G (sd/mmc)
Disk /dev/mmcblk1: 7580MiB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start    End      Size      File system  Name  Flags
 1      1.00MiB  7579MiB  7578MiB                root

(parted) quit

Make sure the card is not mounted, then format it e.g.:

chronos@localhost / $ sudo mkfs.ext3 /dev/mmcblk1p1

NB: If you use a card that is less than 8 GB, you may need to reserve enough inodes when you format the card e.g.:

chronos@localhost / $ sudo mkfs.ext3 /dev/mmcblk1p1 -j -T small

Mount the card and check that it's ready:

chronos@localhost / $ sudo mkdir -p ~/gentoo
chronos@localhost / $ sudo mount -o rw,exec -t ext3 /dev/mmcblk1p1 ~/gentoo
chronos@localhost / $ df -h ~/gentoo
/dev/mmcblk1p1  7.2G   17M  6.8G   1% /home/chronos/user/gentoo
chronos@localhost / $ df -hi ~/gentoo
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/mmcblk1p1   475K    11  475K    1% /home/chronos/user/gentoo
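The two `df` checks above (free space and free inodes) can also be made from Python, which is handy if you later script the setup. This is a minimal sketch using only the standard library; the function name `card_usage` is made up for illustration, and the path you pass is whatever mount point you chose (`~/gentoo` in this guide).

```python
import os

def card_usage(path):
    """Return free space and inode counts for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        # Space available to unprivileged users, in MiB (like `df -h`).
        "free_mib": st.f_bavail * st.f_frsize // (1024 * 1024),
        # Inode totals (like `df -hi`); some filesystems report 0 here.
        "total_inodes": st.f_files,
        "free_inodes": st.f_ffree,
    }

# Example call on the current directory; substitute your own mount point.
print(card_usage("."))
```

If the number of free inodes looks small compared with the free space, that is the situation the `-T small` option above is meant to avoid.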

Installing Gentoo Linux (10-15 min)

chronos@localhost / $ cd ~/gentoo
chronos@localhost ~/gentoo $ ls -la
total 36
drwxr-xr-x  3 root    root            4096 Oct  7 21:37 .
drwx--x--- 33 chronos chronos-access 16384 Oct  7 21:43 ..
drwx------  2 root    root           16384 Oct  7 21:37 lost+found

Download the latest stage 3 archive for armv7a_hardfp:

chronos@localhost ~/gentoo $ sudo wget http://distfiles.gentoo.org/releases/arm/autobuilds/latest-stage3-armv7a_hardfp.txt
chronos@localhost ~/gentoo $ sudo wget http://distfiles.gentoo.org/releases/arm/autobuilds/`cat latest-stage3-armv7a_hardfp.txt | grep stage3-armv7a_hardfp`
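For readers who prefer to see what the `cat | grep` step is doing, here is a rough Python equivalent. It is only a sketch under stated assumptions: I assume the listing file contains comment lines starting with `#` followed by a line whose first field is the relative path of the archive, and the dated subdirectory in the sample text is hypothetical. The function name `stage3_url` is mine.

```python
BASE_URL = "http://distfiles.gentoo.org/releases/arm/autobuilds/"

def stage3_url(listing_text, base_url=BASE_URL):
    """Return the full URL of the stage 3 archive named in the listing file."""
    for line in listing_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "stage3-armv7a_hardfp" in line:
            # The first field is the relative path; ignore anything after it.
            return base_url + line.split()[0]
    raise ValueError("no stage3-armv7a_hardfp entry found")

# Hypothetical listing contents, modelled on the archive name used in this guide.
sample = (
    "# Latest as of (timestamp)\n"
    "20140819/stage3-armv7a_hardfp-20140819.tar.bz2\n"
)
print(stage3_url(sample))
```

Taking only the first field keeps the URL valid even if the listing carries extra columns after the path.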

Extract the downloaded archive right onto the card e.g.:

chronos@localhost ~/gentoo $ sudo tar xjpf stage3-armv7a_hardfp-20140819.tar.bz2

Clean up:

chronos@localhost ~/gentoo $ sudo rm stage3-armv7a_hardfp-20140819.tar.bz2
chronos@localhost ~/gentoo $ sudo rm latest-stage3-armv7a_hardfp.txt
chronos@localhost ~/gentoo $ ls -la
total 92
drwxr-xr-x  21 root root  4096 Oct  9 19:12 .
drwxr-xr-x  21 root root  4096 Oct  9 19:12 ..
drwxr-xr-x   2 root root  4096 Aug 20 14:44 bin
drwxr-xr-x   2 root root  4096 Aug 20 07:16 boot
drwxr-xr-x  17 root root  3760 Oct  9 18:59 dev
-rwxr--r--   1 root root    85 Oct  7 21:38 enter.sh
drwxr-xr-x  33 root root  4096 Oct  9 19:12 etc
drwxr-xr-x   2 root root  4096 Oct  7 22:14 fbdev
drwxr-xr-x   2 root root  4096 Aug 20 07:16 home
drwxr-xr-x   8 root root  4096 Oct  9 19:08 lib
drwx------   2 root root 16384 Oct  7 20:37 lost+found
drwxr-xr-x   2 root root  4096 Aug 20 07:16 media
drwxr-xr-x   2 root root  4096 Aug 20 07:16 mnt
drwxr-xr-x   2 root root  4096 Aug 20 07:16 opt
dr-xr-xr-x 195 root root     0 Jan  1  1970 proc
drwx------   5 root root  4096 Oct  8 20:46 root
drwxr-xr-x   3 root root  4096 Aug 20 14:43 run
drwxr-xr-x   2 root root  4096 Aug 20 14:54 sbin
-rwxr--r--   1 root root   192 Oct  7 21:38 setup.sh
dr-xr-xr-x  12 root root     0 Oct  9 18:58 sys
drwxrwxrwt   5 root root  4096 Oct  9 19:11 tmp
drwxr-xr-x  12 root root  4096 Oct  7 22:20 usr
drwxr-xr-x   9 root root  4096 Aug 20 07:16 var

 

Downloading OpenCL drivers (4 min)

Go to the page listing Mali-T6xx Linux drivers and download mali-t604_r4p0-02rel0_linux_1+fbdev.tar.gz. Make sure you carefully read and accept the associated licence terms.

chronos@localhost ~/gentoo $ sudo tar xvzf ~/Downloads/mali-t604_r4p0-02rel0_linux_1+fbdev.tar.gz

This will create ~/gentoo/fbdev which we will use later.

 

Entering Gentoo Linux (2 min)

Similar to crouton, we will use chroot to enter our Linux environment.

 

Create two scripts and make them executable:

chronos@localhost ~/gentoo $ sudo vim ~/gentoo/setup.sh
#!/bin/sh
GENTOO_DIR=/home/chronos/user/gentoo
mount -t proc /proc $GENTOO_DIR/proc
mount --rbind /sys  $GENTOO_DIR/sys
mount --rbind /dev  $GENTOO_DIR/dev
cp /etc/resolv.conf $GENTOO_DIR/etc
chronos@localhost ~/gentoo $ sudo vim ~/gentoo/enter.sh
#!/bin/sh
GENTOO_DIR=/home/chronos/user/gentoo
LC_ALL=C chroot $GENTOO_DIR /bin/bash
chronos@localhost ~/gentoo $ sudo chmod u+x ~/gentoo/setup.sh ~/gentoo/enter.sh

Execute the scripts:

chronos@localhost ~/gentoo $ sudo ~/gentoo/setup.sh
chronos@localhost ~/gentoo $ sudo ~/gentoo/enter.sh

Note that the ~/gentoo directory will become the root (/) directory once we enter our new Linux environment. For example, ~/gentoo/fbdev will become /fbdev inside the Linux environment.

 

Installing OpenCL header files (2 min)

Download OpenCL header files from the Khronos OpenCL registry:

localhost / # mkdir /usr/include/CL && cd /usr/include/CL
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/opencl.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl_platform.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl_gl.h
localhost / # wget http://www.khronos.org/registry/cl/api/1.1/cl_ext.h

 

Installing OpenCL driver (2 min)

Change properties on the downloaded OpenCL driver files and copy them to /usr/lib:

localhost / # chown root /fbdev/*
localhost / # chgrp root /fbdev/*
localhost / # chmod 755 /fbdev/*
localhost / # mv /fbdev/* /usr/lib
localhost / # rmdir /fbdev

 

Summary

By now you should have a fresh Linux installation complete with the OpenCL drivers and headers, so you can start playing with OpenCL!

When you reboot, you just need to mount the card and execute the setup script again:

chronos@localhost / $ sudo mount -o rw,exec -t ext3 /dev/mmcblk1p1 ~/gentoo
chronos@localhost / $ sudo ~/gentoo/setup.sh
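Because the setup script mounts `/proc`, `/sys` and `/dev` every time it runs, it can be useful to check what is already mounted before running it again. The sketch below parses text in the `/proc/mounts` format (device, mount point, filesystem type, options, ...). The helper names `is_mounted` and `read_mounts` are hypothetical, and the sample text is made up to match the paths used in this guide.

```python
def is_mounted(mount_point, mounts_text):
    """Return True if `mount_point` is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 0 is the device, field 1 is the mount point.
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False

def read_mounts(path="/proc/mounts"):
    """Read the kernel's current mount table (Linux only)."""
    with open(path) as f:
        return f.read()

# Made-up sample in /proc/mounts format.
sample = (
    "/dev/root / ext2 ro,relatime 0 0\n"
    "/dev/mmcblk1p1 /home/chronos/user/gentoo ext3 rw,relatime 0 0\n"
)
print(is_mounted("/home/chronos/user/gentoo", sample))
print(is_mounted("/home/chronos/user/gentoo/proc", sample))
```

On the Chromebook itself you would pass `read_mounts()` instead of the sample text.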

Then you can pop in and out of the Linux environment with:

chronos@localhost / $ sudo ~/gentoo/enter.sh
localhost / # exit
chronos@localhost / $

But the fun is only just beginning! Follow the instructions below to install PyOpenCL and SciPy libraries for scientific computing.

 

Installing PyOpenCL

Configuring Portage (15 min)

Portage is Gentoo's package management system.

localhost / # echo "MAKEOPTS=\"-j2\"" >> /etc/portage/make.conf
localhost / # echo "ACCEPT_KEYWORDS=\"~arm\"" >> /etc/portage/make.conf
localhost / # mkdir /etc/portage/profile
localhost / # mkdir /etc/portage/package.use
localhost / # mkdir /etc/portage/package.unmask
localhost / # mkdir /etc/portage/package.accept_keywords
localhost / # mkdir /etc/portage/package.keywords
localhost / # touch /etc/portage/package.keywords/dependences

Perform an update:

localhost / # emerge --sync
localhost / # emerge --oneshot portage
localhost / # eselect news read

 

Selecting Python 2.7 (1 min)

localhost / # eselect python set python2.7

 

Installing NumPy (30-40 min)

Install NumPy with LAPACK as follows.

localhost / # echo "dev-python/numpy lapack" >> /etc/portage/package.use/numpy
localhost / # echo "dev-python/numpy -lapack" >> /etc/portage/profile/package.use.mask
localhost / # emerge --autounmask-write dev-python/numpy
localhost / # python -c "import numpy; print numpy.__version__"
1.8.2

 

Installing PyOpenCL (5-10 min)

Install PyOpenCL.

localhost / # cd /tmp
localhost tmp # wget https://pypi.python.org/packages/source/p/pyopencl/pyopencl-2014.1.tar.gz
localhost tmp # tar xvzf pyopencl-2014.1.tar.gz
localhost tmp # cd pyopencl-2014.1
localhost pyopencl-2014.1 # python configure.py
localhost pyopencl-2014.1 # make install
localhost pyopencl-2014.1 # cd examples
localhost examples # python demo.py
(0.0, 241.63054)
localhost examples # python -c "import pyopencl; print pyopencl.VERSION_TEXT"
2014.1
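Besides running `demo.py`, a quick way to confirm that PyOpenCL can see the driver is to list the platforms and devices it finds. The following is a minimal sketch rather than part of the original instructions: the function name `list_opencl_devices` is mine, and it deliberately returns an empty list when PyOpenCL or a usable driver is missing. It is written to run under both Python 2.7 and Python 3.

```python
def list_opencl_devices():
    """Return (platform name, device name) pairs, or [] if none are visible."""
    try:
        import pyopencl as cl
    except ImportError:
        return []  # PyOpenCL is not installed in this environment
    found = []
    try:
        for platform in cl.get_platforms():
            for device in platform.get_devices():
                found.append((platform.name, device.name))
    except cl.Error:
        pass  # no usable OpenCL platform or driver was found
    return found

for platform_name, device_name in list_opencl_devices():
    print("%s: %s" % (platform_name, device_name))
```

An empty result after following the steps above would suggest that the driver libraries are not where the loader expects them.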

 

Installing scientific libraries

If you would like to follow my posts on benchmarking (e.g. see the intro), I recommend you install packages from the SciPy family.

 

Installing IPython (30-45 min)

localhost / # emerge --autounmask-write dev-python/ipython
localhost / # ipython --version
1.2.1

 

Installing IPython Notebook (3-7 min)

Install IPython Notebook to enjoy a fun blend of Chrome OS and IPython experience.

 

localhost / # emerge dev-python/jinja dev-python/pyzmq www-servers/tornado
localhost / # ipython notebook
2014-05-08 06:49:08.424 [NotebookApp] Using existing profile dir: u'/root/.ipython/profile_default'
2014-05-08 06:49:08.440 [NotebookApp] Using MathJax from CDN: http://cdn.mathjax.org/mathjax/latest/MathJax.js
2014-05-08 06:49:08.485 [NotebookApp] Serving notebooks from local directory: /
2014-05-08 06:49:08.485 [NotebookApp] The IPython Notebook is running at: http://127.0.0.1:8888/
2014-05-08 06:49:08.486 [NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
2014-05-08 06:49:08.486 [NotebookApp] WARNING | No web browser found: could not locate runnable browser.

Open http://127.0.0.1:8888/ in a new Chrome tab to start creating your own IPython Notebooks!

 

Installing Matplotlib (35-50 min)

localhost / # emerge --autounmask-write dev-python/matplotlib
localhost / # python -c "import matplotlib; print matplotlib.__version__"
1.4.0

 

Installing SciPy (45-60 min)

localhost / # emerge --autounmask-write sci-libs/scipy
localhost / # python -c "import scipy; print scipy.__version__"
0.14.0

 

Installing Pandas (55-80 min)

localhost / # emerge --autounmask-write dev-python/pandas
localhost / # etc-update
Scanning Configuration files...
The following is the list of files which need updating, each
configuration file is followed by a list of possible replacement files.
1) /etc/portage/package.keywords/dependences (1)
Please select a file to edit by entering the corresponding number.
              (don't use -3, -5, -7 or -9 if you're unsure what to do)
              (-1 to exit) (-3 to auto merge all files)
                           (-5 to auto-merge AND not use 'mv -i')
                           (-7 to discard all updates)
                           (-9 to discard all updates AND not use 'rm -i'): -3
Replacing /etc/portage/package.keywords/dependences with /etc/portage/package.keywords/._cfg0000_dependences
mv: overwrite '/etc/portage/package.keywords/dependences'? y
Exiting: Nothing left to do; exiting.
localhost / # emerge dev-python/pandas
localhost / # python -c "import pandas; print pandas.__version__"
0.14.1

Mali at Techcon 2014

Posted by Tim Hartley Oct 2, 2014

Techcon is 10!  Yes, for the tenth time ARM Techcon is up and running from 1 to 3 October at the Santa Clara Convention Center.  Ahead of three days of presentations, demonstrations, tutorials, keynotes, exhibitions and panels from the great and the good in the embedded community, Peter Hutton, Executive VP & President Products Group, kicked it all off by demonstrating how widely supported ARMv8 is – from entry level phones through to servers.  Joining him on stage was Dr. Tom Bradicich from HP, announcing two enterprise class ARMv8 servers.  And joining him was one of his first customers, Dr. Jim Ang from Sandia Labs.  If this went on there was going to be no room left on the stage.

 

For Mali-philes, graphics, display and video are of course on show here in force.  There’ll be some great, enigmatically named talks … Tom Cooksey’s “Do Androids Have Nightmares of Botched System Integrations?” will showcase the critical parts of the Android media subsystem and how three key Mali products – Graphics, Display and Video – can come together to become something greater than the sum of the parts.  Brad Grantham is speaking on “Optimizing Graphics Development with Parameterized Batching” and Tom Olson is tackling “Bandwidth-efficient rendering using pixel local storage”.  Anton Lokhmotov’s GPU Compute optimisation guide “Love your Code?  Optimize it the Right Way” will attempt the impossible by mixing live demonstrations from a Chromebook with absolutely no PowerPoint slides at all.

 

Mali is also very much in evidence on the show floor, all part of the buzzing ARM Techcon Expo.  Live demonstrations are showcasing ASTC texture encoding, transaction elimination with Mali-T600 and some of the power saving properties of Ittiam Systems' HEVC decoder running on an energy efficient combination of CPU and Mali GPU.

 

As well as Anton’s talk there is plenty to keep GPU Compute fans happy.  Roberto Mijat and I presented a talk this morning about how ARM is working with developers to optimise applications using GPU Compute on Mali.  And Roberto is back with a panel discussion in the Expo, “Meet the Revolutionaries who are Making GPU Compute a Reality!”, with representatives from Ittiam, Khronos and ArcSoft discussing developments in this growing field.

 

Do watch this space... there'll be more detail and blogs about the talks soon.  And if you’re in the area, do come by and check it out!

Chinese Version 中文版:NEON驱动OpenCL强化异构多处理

OpenCL - First Mali and Now NEON

 

I am currently in Santa Clara for ARM TechCon where the latest technologies from ARM and its partners will be on show from tomorrow. There will be a number of exciting announcements from ARM this week, but the one that I have been most involved in is the launch today of a new product that supports OpenCL™ on CPUs with ARM® NEON™ technology and also on the already supported ARM Mali™ Midgard GPUs. NEON is a 128-bit SIMD (Single Instruction, Multiple Data) architecture extension included in all the latest ARM Cortex®-A class processors, so along with Mali GPUs it’s already widely available in current generation devices and an extremely suitable candidate to benefit from the advantages of OpenCL.

 

What is OpenCL Anyway?

 

It’s worth starting with a brief explanation of why support for the OpenCL compute API is important. There are a number of industry trends that create challenges for the software developer. For example, heterogeneous multiprocessing is great for performance and efficiency, but the diversity of instruction sets can often lead to a lack of portability. Another example is that parallel computing gets a task done more quickly, but programming parallel systems is notoriously difficult. This is where OpenCL comes in. It is a computing language (OpenCL C) that enables easier, portable and more efficient programming across heterogeneous platforms, and it is also an API that coordinates parallel computation on those heterogeneous processors. OpenCL load balances tasks across all the available processors in a system; it even simplifies the programming of multi-core NEON by treating it as a single OpenCL device. This is all about efficiently matching the ‘right task to the right processor’.

 

Figure 1: OpenCL is especially suited to parallel processing of large data sets

 

Where Can I Use OpenCL?

 

OpenCL can be used wherever an algorithm lends itself to parallelisation and is being used to process a large data-set. Examples of such algorithms and use-cases can be found in many types of device and include:

 

Mobile

  1. The stabilization, editing, correction and enhancement of images; stitching panoramic images
  2. Face, smile and landmark recognition (for tagging with metadata)
  3. Computer vision, augmented reality

 

Digital TV

  1. Upscaling, downscaling; conversion from 2D to Stereo 3D
  2. Support for emerging codec standards (e.g. HEVC)
  3. Pre- and post-processing (stabilizing, transcoding, colour-conversion)
  4. User interfaces: multi-viewer gesture-based UI and speech control

 

Automotive

  1. Advanced Driver Assistance Systems (ADAS)
  2. Lane departure and collision warnings; road sign and pedestrian detection
  3. Dashboard, infotainment, advanced navigation and dynamic cruise control

 

A Tale of Two Profiles

 

OpenCL supports two ‘profiles’:

 

  1. A ‘Full Profile’, which provides the full set of OpenCL features
  2. An ‘Embedded Profile’, which is a strict subset of the Full Profile – and is provided for compatibility with legacy systems

 

The OpenCL for NEON driver and the OpenCL for Mali Midgard GPU driver both support Full Profile. The heritage of OpenCL from desktop systems means that most existing OpenCL software algorithms have been developed for Full Profile. This makes ARM’s Full Profile support very attractive to programmers who can develop on desktop using mature tools with increased productivity and get products to market faster. Another key benefit is that floating point calculations in OpenCL Full Profile are compliant with the IEEE-754 standard, guaranteeing the precision of results.

 

OpenCL for NEON and Mali - Better Together

 

The OpenCL for NEON and the Mali Midgard GPU drivers are designed to operate together within the same OpenCL context. This close-coupling of the drivers enables them to operate with maximum efficiency. For example, memory coherency and inter-queue dependencies are resolved automatically within the drivers. We refer to this version of OpenCL for NEON as the ‘plug-in’ because it ‘plugs into’ the Mali Midgard GPU OpenCL driver.

 


Figure 2: The benefits of keeping the CPU and GPU in one CL_Context

 

And Not Forgetting the Utgard GPUs - Mali-400 MP & Mali-450 MP

 

There is also a ‘standalone’ version of OpenCL for NEON that is available to use alongside Mali Utgard GPUs, such as the Mali-400 MP and Mali-450 MP. These particular GPUs focus on supporting graphics APIs really efficiently, but not compute APIs such as OpenCL. Therefore adding OpenCL support on the CPU with NEON is an excellent way to add compute capability into the system. The ‘standalone’ version is also suitable for use when there is no GPU in the system.

 

Reaching Out

 

In addition, as the diagram below shows, the ARM OpenCL framework can be connected to other OpenCL frameworks in order to extend OpenCL beyond NEON and Mali GPUs to proprietary hardware devices, for example those built with FPGA fabric. This is achieved by using the Khronos Installable Client Driver (ICD) which is supported by the ARM OpenCL framework.


Figure 3: Using the Khronos ICD to connect the ARM OpenCL context with other devices

 

In Summary

 

We've seen that OpenCL for NEON will enhance compute processing on any platform that uses a Cortex-A class processor with NEON. This is true whether the platform includes a Mali Midgard GPU, an Utgard GPU, or maybe has no graphics processor at all. However, the coupling of NEON with a Midgard GPU delivers the greatest efficiencies.

 

As algorithms for mobile use cases become more complex, technologies such as OpenCL for NEON are increasingly important for their successful execution. The OpenCL for NEON product is available for licensing immediately; if you would like further information please contact your local ARM sales representative.

Further Reading

 

For more information on OpenCL, Compute and current use cases that are being developed by the ARM Ecosystem:

 

Realizing the Benefits of GPU Compute for Real Applications with Mali GPUs

Interested in GPU Compute? You have choices!

GPU Compute, OpenCL and RenderScript Tutorials on the Mali Developer Center

The Mali Ecosystem demonstrate GPU Compute solutions at the 2014 Multimedia Seminars

 

ARM is an official Khronos Adopter and an active contributor to OpenCL as a Working Group Member

Evaluating compute performance on mobile platforms: an introduction

Using the GPU for compute-intensive processing is all about improving performance compared to using the CPU only. But how do we measure performance in the first place? In this post, I'll touch upon some basics of benchmarking compute workloads on mobile platforms to ensure we are on solid ground when talking about performance improvements.

 

Benchmarking basics

To measure performance, we select a workload and a metric of its performance. Because workloads are often called benchmarks, the process of evaluating performance is usually called benchmarking.

 

Selecting a representative workload is a bit of a dark art so we will leave this topic for another day. Selecting a metric is more straightforward.

 

The most widely used metric is the execution time. To put it bluntly, the lower the execution time is, the faster the system is. In other words, lower is better.

 

Frequently, the chosen metric is inversely proportional to the execution time. So, the higher the metric is, the lower the execution time is. In other words, higher is better. For example, when measuring memory bandwidth, the usual metric is the amount of data copied per unit time. As this metric is inversely proportional to the execution time, higher is better.
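To make the "higher is better" case concrete, here is a small worked example of a bandwidth metric. It is only an illustration: the buffer size and the three execution times are hypothetical numbers, not measurements, and the function name is mine.

```python
def bandwidth_mb_per_s(bytes_copied, seconds):
    """Bandwidth metric: amount of data copied per unit time, in MB/s."""
    return bytes_copied / float(seconds) / 1e6

size = 64 * 1024 * 1024  # hypothetical 64 MiB buffer

# Halving the execution time doubles the metric.
for seconds in (0.100, 0.050, 0.025):
    print("%.3f s -> %.1f MB/s" % (seconds, bandwidth_mb_per_s(size, seconds)))
```

Each time the execution time halves, the reported bandwidth doubles, which is exactly what "inversely proportional" means here.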

 

Benchmarking pitfalls

Benchmarking on mobile platforms can be tricky. Running experiments back to back can produce unexpected performance variation, and so can dwindling battery charge, hot room temperature or an alignment of stars. Fundamentally, we are talking about battery powered, passively cooled devices which tend to like saving their battery charge and keeping their temperature down. In particular, dynamic voltage and frequency scaling (DVFS) can get in the way. Controlling these factors (or at least accounting for them) is key to meaningful performance evaluation on mobile platforms.

 

Deciding what to measure and how to measure it deserves special attention. In particular, when focussing on optimising device code (kernels), it's important to measure kernel execution time directly, because host overheads can hide effects of kernel optimisations.
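A little arithmetic shows how host overheads can hide a kernel optimisation. The sketch below uses made-up timings (8 ms of host overhead, and a kernel that goes from 2 ms to 1 ms); the function name `speedups` is hypothetical.

```python
def speedups(host_overhead_ms, kernel_before_ms, kernel_after_ms):
    """Return (kernel-only speedup, end-to-end speedup) for one optimisation."""
    kernel_speedup = kernel_before_ms / float(kernel_after_ms)
    total_before = host_overhead_ms + kernel_before_ms
    total_after = host_overhead_ms + kernel_after_ms
    return kernel_speedup, total_before / float(total_after)

# Hypothetical numbers: the kernel itself becomes twice as fast...
kernel, end_to_end = speedups(8.0, 2.0, 1.0)
# ...but the end-to-end time only improves from 10 ms to 9 ms.
print("kernel speedup: %.2fx, end-to-end speedup: %.2fx" % (kernel, end_to_end))
```

With these numbers a 2x kernel improvement shows up as only about 1.11x when the whole run is timed from the host, which is why timing the kernel directly (for example with OpenCL event profiling) gives a clearer picture when optimising device code.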

 

To illustrate some of the pitfalls, I have created an IPython Notebook which I encourage you to view before peeking into our next topic.


Sample from the IPython Notebook

What's next?

 

Using GPU settings that are ill-suited for evaluating performance is common but should not bite you once you've become aware of it. However, even when all known experimental factors are carefully controlled for, experiments on real systems may produce noticeably different results from run to run. To properly evaluate performance, what we really need is a good grasp of basic statistical concepts and techniques...
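As a first taste of what that involves, the sketch below summarises a set of repeated timings instead of trusting a single run. The five timings are invented for illustration (note the one outlier), the function name `summarise` is mine, and the statistics are computed by hand so that the code runs under both Python 2.7 and Python 3.

```python
import math

def summarise(times_ms):
    """Return basic summary statistics for a list of repeated timings."""
    n = len(times_ms)
    mean = sum(times_ms) / float(n)
    # Sample variance (divides by n - 1), so at least two runs are needed.
    variance = sum((t - mean) ** 2 for t in times_ms) / (n - 1)
    ordered = sorted(times_ms)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2.0
    return {
        "mean": mean,
        "stdev": math.sqrt(variance),
        "median": median,
        "min": ordered[0],
        "max": ordered[-1],
    }

# Hypothetical execution times in milliseconds from five runs.
runs = [10.2, 10.4, 10.1, 14.9, 10.3]
stats = summarise(runs)
for name in ("mean", "stdev", "median", "min", "max"):
    print("%-6s %.2f ms" % (name, stats[name]))
```

In this made-up data the single slow run pulls the mean well above the median, a hint that reporting one number from one run can be misleading.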

 

Are you snoring already? I too used to think that statistics was dull and impenetrable. (A confession: statistics was the only subject I flunked at university, I swear!) Apparently not so, when you apply it to optimising performance! If you are at the ARM TechCon on 1-3 October 2014, come along to my live demo, or just wait a little bit longer and I will tell you all you need to know!

Throughout this year, application developers have continued to release a vast range of apps using both the OpenGL® ES 2.0 and 3.0 APIs. While the more recent API offers a wider range of features and performance can be better on GPUs which support OpenGL ES 3.0 onwards, thanks to the backwards compatibility of OpenGL ES versions the success and longevity of more cost-optimized OpenGL ES 2.0 GPUs looks set to continue. A consequence of this trend is that demand for the ARM® Mali™-450 MP graphics processor, implementing a design that is optimised for OpenGL ES 2.0 acceleration, has never been higher.

 

The momentum behind ARM’s 64-bit ARMv8-A application processor architecture is growing, enabling more complex applications within strict power budgets. We were able to announce last week the 50th license of the technology across 27 different companies, showing that the demand for greater compute capabilities across a wide range of applications is strong.

 

This market support gave us the opportunity to further optimize the performance of our Mali-450 drivers to support 64-bit builds of OpenGL ES 2.0 apps. So, that’s exactly what we’ve done, with a brand new set of 64-bit Mali-450 drivers that were released to our partners recently. One example of where we see a Mali-450 GPU and Cortex-A53 CPU successfully combined is the entry-level smartphone market, where cost efficiency is important but the implementation of a 64-bit CPU can offer the all-important differentiation from the competition. With this release, ARM is making it easier for the mass market to access the latest technology advances while providing silicon partners with a wider choice of which GPU can be paired with which CPU.

 

So watch out for the new wave of 64-bit devices based on Mali-450 MP and rest assured that the Mali drivers have been optimised for the feature set of the 64-bit CPU.  The only thing you should see is increased app performance, and a few more CPU cycles available – we’re sure you’ll do great things with them.
