# Blog

1 2 3 Previous Next

314 posts

# Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

Posted by Moritz Pflanzer Feb 8, 2016

In this third and last part of this blog series we are going to extend the mixed-radix FFT OpenCL™ implementation to two dimensions and explain how to perform optimisations to achieve good performance on Mobile ARM® Mali™ GPUs. After a short introduction why two-dimensional Fourier transform is an important and powerful tool the first section covers the mathematical background that lets us reuse the existing one-dimensional approach to build the two-dimensional implementation. Afterwards all the steps are described individually and the details are explained. Finally, at the end of each section we show the performance improvements achieved.

So far only the one-dimensional Fourier transform has been considered to explain the mathematical background and basic optimisations. However, in the field of image processing the two-dimensional variant is much more popular. For instance, a simple high-pass filter can be realised in the frequency domain after the image have been transformed by a two-dimensional Fourier transform. The resulting image could be used to implement an edge detection algorithm.

Original image (spatial domain) High-pass filter via Fourier transform

In general the Fourier transformation can be used to make the convolution operations which are used in many image processing applications more efficient. In particular when the kernel size is big the improvements in terms of the number of operations that have to be executed are significant. In the spatial domain a two-dimensional convolution requires a quadratic number of multiplications and a linear number of additions in the order of the size of the convolution kernel for every element of the image. Therefore the resulting complexity is O(MNmn). In contrast, in the frequency domain the convolution corresponds to a pointwise multiplication of the transformed kernel and image which reduces the complexity for the convolution to O(MN). Nevertheless, since the Fourier transform has to be additionally computed it is not always beneficial to perform the convolution in the frequency domain.

# Background

Due to its separability the two-dimensional Fourier transformation can be expressed as two consecutive one-dimensional transformations. The mathematical principles are no different from those explained in the first part of this blog series which means that this background section will be rather short. It just shows briefly how the general form of the two-dimensional discrete Fourier transform can be split into its one-dimensional parts.

The two dimensional DFT can be expressed in the form

where the trigonometric constant coefficients are defined as

The coordinates k and l of the frequency domain are in the range

Since each of the trigonometric constant coefficients depends only on one of the summation indices the inner sum can be extracted and defined as

which is equal to the definition which was previously used to express the one-dimensional DFT. Only the parameter n had to be added to select the correct value in the two-dimensional input space.

If the newly defined one-dimensional transform is inserted into the original expression for the two-dimensional DFT it is easy to see from the following equation that just another one-dimensional transform is calculated

Because of the commutativity of the operations either of the sums can be extracted. Therefore it does not matter over which dimension the first one-dimensional transform is calculated.

# Implementations

In the background section the separability of the two-dimensional Fourier transform has been used to represent the two-dimensional as two consecutive one-dimensional transforms, one along each dimension. The row-column algorithm is based on this mathematical fact and is used for this implementation. In this algorithm first a one-dimensional FFT is computed individually for every row of the two-dimensional input data. Afterwards a second one-dimensional FFT is performed for every column. But, instead of having two different functions to compute the FFT for the rows and columns, the same function is used and the input data is transposed between and after the two one-dimensional FFTs. This makes sense because the data layout is assumed to be row-major. Thus the most efficient way in terms of cache locality is to load data along the rows.

## Naive row-column algorithm

The naive implementation directly realises the principle of the row-column algorithm and the execution of the 2D FFT is therefore straightforward. The following diagram provides a first overview over the different stages of the 2D FFT and the pseudo code demonstrates the idea of a naive implementation.

### Pseudo code

```// Run the 1D FFT along the rows of the input buffer
// The result is stored to a temporary buffer

// Transpose the intermediate buffer in-place
transpose_complex_list(tmp);

// Run the 1D FFT along the rows of the (transposed) intermediate buffer
// Corresponds to the columns of the original buffer
// The result is stored to the output buffer

// Transpose the output buffer in-place
transpose_complex_list(output);

```

The function to compute the 1D FFT is conceptually the same that has been used in the previous parts of this blog series except that it has to loop over all rows of the source image. As a reminder: The chosen variant of the mixed radix FFT requires the input values to be in digit-reversed order to compute the output values in linear order. Therefore the input values have to be shuffled before the actual FFT can be computed. The reordering is repeated for all rows of the matrix. Similarly the computation of the 1D FFT is repeated for all rows independently of each other.

```/**
* @brief Computes a 1D FFT for each row of the the input
*
* @param[in]  input  The input matrix
* @param[out] output The output matrix
*/
{
// Iterate over all row of the input matrix
for(row_in in input and row_out in output)
{
// Perform digit-reverse reordering
rev_row = digit_reverse(row_in);

// Calculate mixed radix 1D FFT using the reversed row
row_out = ...
}
}
```

## Naive implementation based on OpenCL mixed-radix kernels

This optimisation does not introduce anything terribly new compared to the optimisation of the one-dimensional Fourier transform. The first speedup over the CPU implementation can be achieved by running the computation of the actual FFT on the GPU. For this purpose the buffer with the input values has to be mapped to a CLBuffer. Afterwards the reordering to digit-reversed order is performed on the GPU. The rows of the matrix can be reordered independently such that a one-dimensional kernel is enqueued for each one. The reordering is followed by the first one-dimensional part of the FFT which is now also computed on the GPU. Once the computation has finished the buffers are mapped again to perform the matrix transpose between the two stages of the FFT on the CPU and out-of-place. Subsequently, the rows are another time reordered and the second one-dimensional FFT is computed. Finally the results of the second stage are again transposed to restore the original data layout and copied over to the output buffer.

A loop is used to enqueue separate and independent one-dimensional kernels for each row of the input data. The kernels take the offset of the row within the linearised data representation as argument `buf_offset_float`.

Because each complex element of the buffer is represented through two floating point numbers (real and imaginary part) the row offset has to be doubled to get the actual offset within the buffer.

```// Digit reverse stage
// Uses a 1D NDRange within the loop to process all rows of the buffer
const size_t digit_rev_gws[] = {N};

for(uint row = 0; row < N; ++row)
{
uint buf_offset_float = row * N * 2;
clSetKernelArg(digit_reverse_kernel, 0, sizeof(cl_mem), input_buf);
clSetKernelArg(digit_reverse_kernel, 1, sizeof(cl_mem), digit_reversed_buf);
clSetKernelArg(digit_reverse_kernel, 2, sizeof(cl_mem), idx_bit_reverse_buff);
clSetKernelArg(digit_reverse_kernel, 3, sizeof(cl_uint), &buf_offset_float);
clEnqueueNDRangeKernel(queue, digit_reverse_kernel, 1, NULL, digit_rev_gws, NULL, 0, NULL, NULL);
}

// Perform 2D FFT
// Uses 1D NDRange within the loop to process all rows of the buffer
for(uint row = 0; row < N; ++row)
{
uint buf_stride_float = N * 2;
uint Nx = 1;

// First stage
const size_t first_stage_gws[] = {N / Ny};
clEnqueueNDRangeKernel(queue, first_stage_radix_kernel, 1, NULL, first_stage_gws, NULL, 0, NULL, NULL);

// Update Nx
Nx *= Ny;

// Following stages
for(uint s = 1; s < n_stages; ++s)
{
// ...
}
}
```

```/**
* @brief Computes the first stage of a radix-2 DFT.
*
* @param[in, out] input  The complex array.
* @param[in]      offset The offset of the current row in the array.
*/
kernel void radix_2_first_stage(global float* input, const uint offset)
{
// Each work-item computes a single radix-2
uint idx = get_global_id(0) * 4;

// Load two complex input values
// The row_offset needs to be multiplied by two because complex numbers
// are represented through two float values
float4 in = vload4(0, input + offset + idx);

// Compute DFT N = 2
DFT_2(in.s01, in.s23);

// Store two complex output values

// The row_offset needs to be multiplied by two because complex numbers
// are represented through two float values

vstore4(in, 0, input + offset + idx);
}

```

## Replacing loops with two-dimensional kernels

In the first, naive GPU implementation separate one-dimensional kernels are enqueued for every row of the matrix. The performance can be significantly increased by using kernels with a two-dimensional `NDRange` instead of loops. The second dimension is then used as a replacement for the row offset within the linearised data structure which was previously incremented in the loop. In addition to the offset in the second dimension the offset in the buffer depends also on the size of the matrix which is passed as parameter `N`. Again the offset needs to be multiplied by two to account for the two part complex numbers.

First of all this change reduces the number of kernels and therefore the overhead of enqueuing and dispatching them. Moreover it increases the number of work items which can be executed in parallel since in principle work items from all rows can be executed in parallel. Before, the kernels had to be executed one after the other and thus there was only a comparably small number of work items at any point in time. This reduces the GPU's utilisation, particularly for small matrices.

```// Digit reverse stage
// Uses a 2D NDRange to process all rows of the buffer
const size_t digit_rev_gws[] = {N, N};
clSetKernelArg(digit_reverse_kernel, 0, sizeof(cl_mem), input_buf);
clSetKernelArg(digit_reverse_kernel, 1, sizeof(cl_mem), digit_reversed_buf);
clSetKernelArg(digit_reverse_kernel, 2, sizeof(cl_mem), idx_bit_reverse_buf);
clSetKernelArg(digit_reverse_kernel, 3, sizeof(cl_uint), &N);
clEnqueueNDRangeKernel(queue, digit_reverse_kernel, 2, NULL, digit_rev_gws, NULL, 0, NULL, NULL);

uint buf_stride_float = N * 2;

// First stage
// Uses a 2D NDRange to process all rows of the buffer
uint Nx = 1;

const size_t first_stage_gws[] = {N / Ny, N};
clEnqueueNDRangeKernel(queue, first_stage_radix_kernel, 2, NULL, first_stage_gws, NULL, 0, NULL, NULL);

// Update Nx
Nx *= Ny;

// Following stages
for(uint s = 1; s < n_stages; ++s)
{
// ...
}
```

```/**
* @brief Computes the first stage of a radix-2 DFT.
*
* @param[in, out] input            The complex array.
* @param[in]      buf_stride_float The number of floats per row in the array.
*/
kernel void radix_2_first_stage(global float* input, const uint buf_stride_float)
{
// Each work-item computes a single radix-2
uint idx = get_global_id(0) * 4;

// Compute the offset into the buffer for the current row
// Needs to be multiplied by two because complex numbers
// are represented through two float values
const uint offset = get_global_id(1) * buf_stride_float; // <-- Previously computed in the host code

// Load two complex input values
float4 in = vload4(0, input + offset + idx);

// Compute DFT N = 2
DFT_2(in.s01, in.s23);

// Store two complex output values
vstore4(in, 0, input + offset + idx);
}
```

## Parallel transpose on the GPU

The next optimisation moves the transpose operation so it is also running on the GPU. If it is not performed in-place there are no dependencies between the individual elements at all and the transpose can be fully executed in parallel. For in-place algorithms at least two elements have to swapped in each step as otherwise information would be overwritten.

### Parallel elementwise out-of-place transpose

 This approach resembles the CPU variant of a naive elementwise transpose of a matrix. Each work item reads one element according to its global id from the source buffer and writes it to the corresponding transposed position in the output buffer. The global work group size is thus equal to the size of the matrix (N x N). The elements themselves consist of two floating point values which represent the real and imaginary parts of the complex values. ```/** * @brief Transposes a quadratic matrix of complex numbers stored in row major order. * * @param[in]  input              The complex input array. * @param[out] output             The complex output array. * @param[in]  buf_stride_complex The number of complex numbers per row in the array. */ kernel void transpose(global float2* input, global float2* output, const uint buf_stride_complex) {     uint ix = get_global_id(0);     uint iy = get_global_id(1);     output[ix * buf_stride_complex + iy] = input[iy * buf_stride_complex + ix]; }```

### Parallel blockwise out-of-place transpose

 However, this simple approach is not efficient enough to compensate for the additional driver overhead for launching kernels when performing the transpose on the GPU. To increase the performance the transpose operation has to be executed blockwise instead of elementwise. Instead of reading and writing one pixel per work item a small sub-block of the complete matrix is loaded. It is then locally transposed and written back to the corresponding position in the output matrix. Compared to the naive transpose approach this optimisation increases the cache efficiency and the resource utilisation since each work item operates on more than one pixel. Further it helps to increase the utilisation of the GPU resources by making the kernels larger and slightly more complex. As a result the number of work items decreases which reduces the overhead of dispatching work items for executuion. The choice of two-by-two sub-blocks turned out to be optimal since the local transpose of larger sub-blocks requires more registers. ```/** * @brief Transposes a quadratic matrix of complex numbers stored in row major order. * * @param[in]  input            The complex input array. * @param[out] output           The complex output array. * @param[in]  buf_stride_float The number of floats per row in the array. */ kernel void transpose2x2(global float* input, global float* output, const uint buf_stride_float) {     const uint ix = get_global_id(0);     const uint iy = get_global_id(1);     float4 tmp;     float4 u0, u1;     // Load one sub-block from two rows     u0 = vload4(ix, input + (iy * 2) * buf_stride_float);     u1 = vload4(ix, input + (iy * 2 + 1) * buf_stride_float);     // Transpose the sub-block     tmp = u0;     u0 = (float4)(tmp.s01, u1.s01);     u1 = (float4)(tmp.s23, u1.s23);     // Write the sub-block to the transposed position     vstore4(u0, iy, output + (ix * 2) * buf_stride_float);     vstore4(u1, iy, output + (ix * 2 + 1) * buf_stride_float); } ```

### Parallel blockwise in-place transpose

 Due to its nature a transpose operation can be realised by always swapping to corresponding elements or larger sub-blocks respectively. This allows us to load, transpose and store two sub-blocks per work item in order to reduce the number of work items and to achieve a better balance between arithmetic and load/store operations. As a side-effect the transpose algorithm can be changed from out-of-place to in-place such that the second buffer is no longer needed. As a consequence of the transpose-and-swap approach it would be sufficient to have one work item for each block in the upper triangular matrix. However, if a one-dimensional kernel is enqueued with the smallest possible number of work items the transformation of the linear index into the two-dimensional index is less efficient than simply using a two-dimensional kernel and aborting the work items below the diagonal. The diagonal itself requires special treatment because the diagonal sub-blocks do not need to be exchanged with a different sub-block but only have to be transposed locally. ```/** * @brief Transposes a quadratic matrix of complex numbers stored in row major order. * * @param[in, out] input            The complex array. * @param[in]      buf_stride_float The number of floats per row in the array. */ kernel void transpose2x2(global float* input, const uint buf_stride_float) {     const uint ix = get_global_id(0);     const uint iy = get_global_id(1);     // Abort for sub-blocks below the diagonal     if(ix < iy)     {         return;     }     float4 tmp;     float4 v0, v1, u0, u1;     // Load first sub-block     u0 = vload4(ix, input + (iy * 2) * buf_stride_float);     u1 = vload4(ix, input + (iy * 2 + 1) * buf_stride_float);     // Transpose first sub-block     tmp = u0;     u0 = (float4)(tmp.s01, u1.s01);     u1 = (float4)(tmp.s23, u1.s23);     // Only process a second sub-block if the first one is not on the diagonal     if(ix != iy)     {         // Load second sub-block         v0 = vload4(iy, input + (ix * 2) * buf_stride_float);         v1 = vload4(iy, input + (ix * 2 + 1) * buf_stride_float);         // Transpose second sub-block         tmp = v0;         v0 = (float4)(tmp.s01, v1.s01);         v1 = (float4)(tmp.s23, v1.s23);         // Store second sub-block to transposed position         vstore4(v0, ix, input + (iy * 2) * buf_stride_float);         vstore4(v1, ix, input + (iy * 2 + 1) * buf_stride_float);     }     // Store first sub-block to transposed position     vstore4(u0, iy, input + (ix * 2) * buf_stride_float);     vstore4(u1, iy, input + (ix * 2 + 1) * buf_stride_float); } ```

### Setting local work group size

For the two-dimensional transpose kernel the automatically inferred values for the local work group size were not able to achieve good performance. It turned out to be better to set the local work group size manually to a size of 2 x 4 – or smaller if the buffer (divided by two because each work items handles a 2 x 2 sub-block) is not a multiple of 4. The best size for the local work groups always depends on the kernel and finding a good size can be hard. However, getting the sizes right can help to achieve better caching behaviour and therefore can improve runtime performance.

```// Setting up constants
// For odd matrix sizes padding is necessary
// because the transpose kernel stores two complex values
const size_t padding = 2 * (N % 2);
const size_t buf_stride_float = N * 2 + padding;
const size_t buf_stride_half = buf_stride_float / 2;

clSetKernelArg(transpose_kernel, 0, sizeof(cl_mem), fft_buf);
clSetKernelArg(transpose_kernel, 1, sizeof(cl_uint), &buf_stride_float);

// Set global work group size
const size_t transpose_gws[2] = {buf_stride_half, buf_stride_half};
size_t transpose_lws[2];
size_t *transpose_lws_ptr;

// Determine the largest possible local work group size
// It has to divide the global work group size
if(buf_stride_half % 4 == 0)
{
transpose_lws[0] = 2;
transpose_lws[1] = 4;
transpose_lws_ptr = transpose_lws;
}
else if(buf_stride_half % 2 == 0)
{
transpose_lws[0] = 1;
transpose_lws[1] = 2;
transpose_lws_ptr = transpose_lws;
}
else
{
transpose_lws_ptr = NULL;
}

clEnqueueNDRangeKernel(queue, transpose_kernel, 2, NULL, transpose_gws, transpose_lws_ptr, 0, NULL, NULL);
```

#### Speedup

The following diagram visualises the benefit in terms of a shorter runtime when the mixed-radix variant is used instead of a radix-2 only implementation. Note the logarithmic scaling of the y-axis. Another advantage of a mixed-radix approach over a radix-2 only FFT is the larger number of supported sizes. For instance, FFTs for matrix sizes 120, 1000 and 3000 can not be computed with the radix-2 only approach and are therefore missing in the diagram.

## Conclusion

While this pushes the row-column algorithm to its boundaries it might be possible to achieve better performance if the underlying FFT algorithm is changed. For instance, if the FFT is not computed in-place the digit-reversal could be dropped and moreover the transpose could be merged into the mixed-radix FFT kernels. However, it is difficult to predict the benefits if there are any.

### Useful optimisation aspects

The following list summarises the things that have been considered to increase the performance of the 2D FFT mixed-radix algorithm on Mobile ARM® Mali™ GPUs in this article:

• Instead of enqueuing many one-dimensional kernels in a loop it is better to use one/fewer higher-dimensional kernel/s.
• The optimal size of the sub-blocks for the blocked transpose depends on the available resources. Increasing the block size has only positive effects as long as the cache is big enough and sufficient registers are available.
• Padding buffers can help to generalise kernels and can prevent branches in the control flow. For instance, the mapped buffers are padded to the next larger even size in both dimensions to eliminate edge cases during the transpose operation.
• Loads and stores are most efficient if they are performed according to the data layout because only then the cache efficiency is more optimal.
• It can be more efficient to conditionally abort superfluous work items in a 2D range directly instead of computing the two-dimensional index out of a linear index.
• Find ways to parallelize sequential code and keep work items as independent from each other as possible.
• Compute more than one pixel (or more bytes) per work item.
• Reduce the number of load/store operations by vectorising. This is especially beneficial because the GPU comprises a vector unit (SIMD).

# ARM Vulkan: Built for Modern Mobile Systems

Posted by Mike Smith Feb 1, 2016

Built for Modern Mobile Systems

ARM, along with other key industry members, made a lot of progress since the Vulkan announcement by the Khronos Group back in early 2015 and we’re now in the final stage of development.

## What is Vulkan?

Vulkan is a new, open standard, cross platform API designed from the ground up to enable applications to run more efficiently on modern CPU and GPU architectures spanning desktop, console and most importantly, mobile.

## Who’s going to use Vulkan?

Every application on a mobile device uses graphics in some way, for UI if nothing else. This includes the operating system, with Google in particular having made very clear statements of Vulkan support for Android, suggesting it will very likely become the core graphics API for that platform. Vulkan has received an unprecedented amount of industry support from OS vendors, silicon vendors, game engine developers and software developers.

## Higher performance at lower power

On mobile platforms Vulkan is a game changer, offering lower overhead, more direct control over the GPU and lower CPU usage. The end-user benefit is far more than bigger and better game graphics. Vulkan is a highly efficient API enabling developers to optimize their applications for significantly lower energy consumption than was previously possible.

### Lower CPU usage and minimum energy consumptionMulti-thread / multicore friendly

Traditional graphics APIs were not designed for multi-threaded use and required a lot of synchronization with the CPU to manage draw calls, resulting in high CPU overhead with the CPU becoming a bottleneck, especially on mobile devices. Vulkan was designed from the ground up to minimize driver overhead, enable CPU multi-threading with greater control over memory management for high-efficiency graphics and compute processing performance.

### Direct GPU control with lower driver overhead

Applications using traditional graphics APIs can easily max out CPU usage during complex drawing, driving up energy consumption and generating very high device heat. GPU drivers needed to do a lot of work leveraging CPU to manage multiple rendering and compute processes. Vulkan provides developers with much greater control over GPU processing, enabling them to more easily balance graphics and compute loads across multiple cores. This results in low per core usage, lower overall power consumption, less device heat and considerably longer battery life.

## With great power comes great responsibility

Vulkan has been designed with modern mobile GPU rendering and compute technologies in mind. Application developers can now manage how things are ordered and executed on their own rather than leaving it up to a generic driver. This gives them more work to do but also provides more flexibility in designing applications with better performance and lower energy consumption than was possible using OpenGL ES.

Quote from the Vulkan March 2015 Khronos announcement …

“Since helping found Khronos, ARM has strived to improve the efficiency of standards and deliver compelling graphics to mobile devices with minimum energy consumption,” said Jem Davies, vice president of technology, media processing group, ARM. “Vulkan is a big step forward in enabling our ecosystem of developers to unleash the capabilities of the latest ARM GPU technology.”

ARM, along with other key industry players, have been very busy refining the Vulkan API and ensuring it will deliver the modern day mobile needs. We will offer Vulkan support* on several GPUs shipping today and this support will be a key deliverable for many future Mali GPUs.

*Based on an internal draft Khronos specification of Vulkan, which may change prior to final release. Conformance criteria for this specification have not yet been established.

# Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

Posted by iodice Feb 1, 2016

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

Here we are for the second part of our blog series about the OpenCL™ implementation of Complex to Complex Fast Fourier Transform based on the method mixed-radix on Mobile ARM® Mali GPUs.

Whilst in the first article - Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1 - we presented the basic mathematical background, in the following we are going to play with the 3 main computation blocks behind the FFT mixed-radix which will be analyzed in a step-by-step approach from the point of view of both theory and of its efficient and simple implementation with OpenCL.

The development platform used to carry out our performance analysis will be the Firefly-RK3288 which is an all-in-one high performance multi-core computing platform based on the Rockchip RK3288 SoC featuring an ARM® Cortex®-A17 @1.8GHz and ARM® Mali™-T760 (Mali GPU Driver: r6p0-02rel0).

For all performance evaluations, DVFS has been disabled and the GPU clock has been set at the maximum frequency of 600MHz.

Further information about the development platform can be found at:

# Implementation

Let's take a look at how we store our complex values. The input and output data are sorted into a floating point complex array with 2*N elements where the real and imaginary parts are placed alternately. Two different buffers with 2*N elements have been used to store the input and output values.

The FFT mixed-radix pipeline that we are going to study is composed mainly of 3 computation blocks:

1. Digit-reverse
2. Twiddle factor multiplication

## Indexing scheme

The only real complexity of FFT mixed-radix is the indexing scheme. The indexing scheme expresses the relation between the index n and k between 2 generic stages X and Y. This relation is fundamental because it allows both to know which input must be processed by each radix basic element and to correctly perform the stages of digit reverse and twiddle factors multiplication.

Given 2 generic stages X and Y:

1. n addresses the values in the stage X: n = nx + ny * Nx
2. k addresses the values in the stage Y: k = ky + kx * Ny

with:

• nx = n % Nx, scans the columns - nx [0, Nx - 1]
• ny = floor(n / Nx), scans the rows - ny [0, N / Nx - 1]
• kx = floor(k / Ny), scans the rows - kx [0, N / Ny - 1]
• ky = (k % Ny), scans the columns - ky [0, Ny - 1]

From the above picture we can easily see what nx, ny, kx and ky are. For what concerns Nx and Ny, in the case of just 2 radix stages they are respectively the radix order of stage X and Y. In the case of more than 2 radix stages, Ny is still the radix order of stage Y but Nx becomes the radix products from the first radix stage to the radix stage X.

Nx is also known as span or butterfly span. The span is the offset between each complex input of radix basic element and it was introduced in the first article in reference to the radix-2 FFT algorithm.

i.e.

Let's assume we have N = 8 x 4 x 4 x 3 x 2 (5 stages).

If the stages X and Y were stage 2 and stage 3, Nx would be Nx = 8 x 4 x 4 = 128 and Ny = 3

With a pipeline made up of M radix stages, we would like to have a relation between the index k and n between 2 generic stages, in particular we would like to have something like:

The general expressions for computing these mapping functions are:

where Ni is:

```/**
* @brief Map index k to index n
*
* @param[in] k  Index k
* @param[in] Nx It is the span
* @param[in] Ny It is the radix order of stage Y
*
* @return       Index n
*/
uint map_k_to_n(uint k, uint Nx, uint Ny)
{
uint Ni = Nx * Ny;
uint ky = k % Ny;    // Remainder
uint kx = k / Ny;    // Integer part
uint n = (kx % Nx) + (kx / Nx) * Ni + ky * Nx;
return n;
}
```

```/**
* @brief Map index n to index k
*
* @param[in] n  Index n
* @param[in] Nx It is the span
* @param[in] Ny It is the radix order of stage Y
*
* @return       Index k
*/
uint map_n_to_k(uint n, uint Nx, uint Ny)
{
uint Ni = Nx * Ny;
uint k = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni);
return k;
}
```

Every time we compute a radix stage, it is good to keep the span Nx update.

```/* Init Nx to 1 */
uint Nx = 1;

/* Scan each radix stage */
for(uint s = 0; s < n_stages; ++s)
{
/* Get radix order of stage s */

/* Body for computing twiddle factor multiplication and radix computation */
...

/* Update Nx */
Nx *= Ny;
}
```

## Digit-reverse

Once we have introduced the indexing scheme, we are ready to analyze the main computation blocks of our pipeline. Let's start with the digit-reverse.

The digit-reverse is the first stage of our pipeline which places the input elements in a specific order called "digit-reverse order".

The digit-reverse order would be exactly the order of the output elements if we left the input elements in linear order (0, 1, 2,...).

Since we know the relation between the index n and k, a possible way of knowing the digit-reverse order may be to iterate the mapping function map_n_to_k() from the first to the last stage in order to know how the input index n would be mapped at the end.

```/**
* @brief This function computes the digit reverse index for each input
*
* @param[in]  stage             It contains the radix order for each radix stage
* @param[out] idx_digit_reverse It contains the digit-reverse order index
* @param[in]  n_stages          Total number of radix stages
* @param[in]  N                 Number of input
*/
void digit_reverse(float* stage, uint* idx_digit_reverse, uint n_stages, uint N)
{
/* Scan elements */
for(uint n = 0; n < N; ++n)
{
uint k = n;
uint Nx = stage[0];

/* Scan stages */
for (uint s = 1; s < n_stages; ++s)
{
/* radix of stage s-th */
uint Ny = stage[s];
uint Ni = Ny * Nx;

/* Update k index */
k = (k * Ny) % Ni + (k / Nx) % Ny + Ni * (k / Ni);

/* Update Nx */
Nx *= Ny;
}

/* K is the index of digit-reverse */
idx_digit_reverse[n] = k;
}
}
```

Once we know the digit-reverse index, we can implement a CL kernel that places the input values in the required order.

```/**
* @brief This kernel stores the input in digit-reverse order
*
* @param[in]  input            It contains the input complex values in linear order
* @param[out] output           It contains the output complex values in digit-reverse order
* @param[in]  idx_bit_reverse  It contains the digit-reverse order index
*/
kernel void digit_reverse(global float2* input, global float2* output, global uint* idx_digit_reverse)
{
/* Each work-item stores a single complex values */
const uint n = get_global_id(0);

/* Get digit-reverse index */
const uint idx = idx_digit_reverse[n];

/* Get complex value */
float2 val = (float2)(input[idx]);

/* Store complex value */
output[n] = val;
}
```

The digit-reverse kernel uses 2 different cl_buffers:

2. one for storing the output complex values in digit-reverse order.

## Twiddle Factor Multiplication

Twiddle factor multiplication is the intermediate stage between 2 generic radix stages X and Y. In particular it multiplies all the N output complex elements of radix stage X by a specific trigonometric complex constant before passing those values to the next stage Y.

This trigonometric complex constant -the "twiddle" factor - depends on nx and ky:

Even the twiddle factor multiplication is highly parallelizable and can be implemented by means of OpenCL in the following manner:

```#define M_2PI_F 6.283185482025146484375f

#define TWIDDLE_FACTOR_MULTIPLICATION(phi, input)                 \
{                                                                 \
float2 w, tmp;                                                \
w.x = cos(phi);                                               \
w.y = sin(phi);                                               \
tmp.x = (w.x * input.x) - (w.y * input.y);                    \
tmp.y = (w.x * input.y) + (w.y * input.x);                    \
input = tmp;                                                  \
}

/**
* @brief This kernel computes the twiddle factor multiplication between 2 generic radix stages X and Y
*
* @param[in, out] input It contains the input and output complex values
* @param[in]      Nx    It is the span
* @param[in]      Ny    It is the radix order of stage Y
* @param[in]      Ni    Nx * Ny
*/
kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni)
{
/* Each work-item computes a single complex input */
const uint n = get_global_id(0);

/* Compute nx */
uint nx = n % Nx;

/* Compute k index */
uint k = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni);

/* Compute ky */
uint ky = k % Ny;

/* Compute angle of twiddle factor */
float phi = (-M_2PI_F * nx * ky) / (float)Ni;

/* Multiply by twiddle factor */
TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]);
}
```

An interesting optimization concerns the computation of ky. As we know,

so:

since in the above expression A and B are multiples of Ny, the terms 1 and 3 will be 0 therefore:

but for the identity property of the modulo operator we have:

and then we can compute ky as:

At this point we can simplify our OpenCL kernel in the following manner:

```kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni)
{
/* Each work-item computes a single complex input */
const uint n = get_global_id(0);

/* Compute nx */
uint nx = n % Nx;

/* Compute ky */
uint ky = (n / Nx) % Ny;               // <-

/* Compute angle of twiddle factor */
float phi = (-M_2PI_F * nx * ky) / (float)Ni;

/* Multiply by twiddle factor */
TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]);
}
```

Another important and simple optimization that we can introduce concerns the computation of the angle for the twiddle factor multiplication. As we can observe, the term (-M_2PI_F / (float)Ni) is a constant for all work-items and can therefore be passed as a argument to the CL kernel.

```/**
* @brief This kernel computes the twiddle factor multiplication between 2 generic radix stages X and Y
*
* @param[in, out] input     It contains the input and output complex values
* @param[in]      Nx        It is the span
* @param[in]      Ny        It is the radix order of stage Y
* @param[in]      Ni        Nx * Ny
* @param[in]      exp_const (-M_2PI_F / (float)Ni)
*/
kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni, const float exp_const)
{
/* Each work-item computes a single complex input */
const uint n = get_global_id(0);

/* Compute nx */
uint nx = n % Nx;

/* Compute ky */
uint ky = (n / Nx) % Ny;

/* Compute angle of twiddle factor */
float phi = (float)(nx * ky) * exp_const;                    // <-

/* Multiply by twiddle factor */
TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]);
}
```

With these 2 simple optimizations, we are able to improve the performance of this kernel by a factor of ~1.6/1.7x as illustrated in the following graph.

The radix stage is the main computation block of the pipeline composed of N / S radix-S basic elements.

Each radix-S consists of S inputs and S outputs and computes an optimized DFT of length S.

Since there is not a dependency between each radix basic element of the same stage, even the radix stage is an embarrassingly parallel problem.

In our implementation each radix-S of the same radix stage is performed by a single work-item. As in the first stage the radix basic element takes the input in linear order whilst for the following stages accordingly with the span Nx, we are going to use 2 different kernels for the radix basic element:

1. One kernel for the first stage which takes the input in linear order
2. One kernel for the other radix stages

In the following we are going to describe the implementation of radix 2/3/4/5/7.

Defined as:

can be expressed in the following matrix form:

```#define DFT_2(c0, c1) \
{                     \
float2 v0;        \
v0 = c0;          \
c0 = v0 + c1;     \
c1 = v0 - c1;     \
}

/**
* @brief This kernel computes DFT of size 2 for the first stage
*
* @param[in, out] input It contains the input and output complex values
*/
{
/* Each work-item computes a single radix-2 */
uint idx = get_global_id(0) * 4;

/* Load two complex input values */
float4 in = vload4(0, input + idx);

/* Compute DFT N = 2 */
DFT_2(in.s01, in.s23);

/* Store two complex output values */
vstore4(in, 0, input + idx);
}

/**
* @brief This kernel computes DFT of size 2 for the radix stages after the first
*
* @param[in, out] input It contains the input and output complex value
* @param[in]      Nx    It is the span
* @param[in]      Ni    Nx * Ny
*/
kernel void radix_2(global float2* input, uint Nx, uint Ni)
{
/* Each work-item computes a single radix-2 */
uint kx = get_global_id(0);

/* Compute n index */
uint n = (kx % Nx) + (kx / Nx) * Ni;

/* Load two complex input values */
float2 c0 = input[n];
float2 c1 = input[n + Nx];

/* Compute DFT N = 2 */
DFT_2(c0, c1);

/* Store two complex output values */
input[n] = c0;
input[n + Nx] = c1;
}
```

Defined as:

can be expressed in the following matrix form:

```#define SQRT3DIV2       0.86602540378443f

#define DFT_3(c0, c1, c2)                          \
{                                                  \
float2 v0 = c1 + c2;                           \
float2 v1 = c1 - c2;                           \
c1.x = c0.x - 0.5f * v0.x + v1.y * SQRT3DIV2;  \
c1.y = c0.y - 0.5f * v0.y - v1.x * SQRT3DIV2;  \
c2.x = c0.x - 0.5f * v0.x - v1.y * SQRT3DIV2;  \
c2.y = c0.y - 0.5f * v0.y + v1.x * SQRT3DIV2;  \
c0 = c0 + v0;                                  \
}

/**
* @brief This kernel computes DFT of size 3 for the first stage
*
* @param[in, out] input It contains the input and output complex values
*/
{
/* Each work-item computes a single radix-3 */
uint idx = get_global_id(0) * 6;

/* Load three complex input values */
float4 in0 = vload4(0, input + idx);
float2 in1 = vload2(0, input + idx + 4);

/* Compute DFT N = 3 */
DFT_3(in0.s01, in0.s23, in1.s01);

/* Store three complex output values */
vstore4(in0, 0, input + idx);
vstore2(in1, 0, input + idx + 4);
}

/**
* @brief This kernel computes DFT of size 3 for the radix stages after the first
*
* @param[in, out] input It contains the input and output complex value
* @param[in]      Nx    It is the span
* @param[in]      Ni    Nx * Ny
*/
kernel void radix_3(global float2* input, uint Nx, uint Ni)
{
/* Each work-item computes a single radix-3 */
uint kx = get_global_id(0);

/* Compute n index */
uint n = (kx % Nx) + (kx / Nx) * Ni;

/* Load three complex input values */
float2 c0 = input[n];
float2 c1 = input[n + Nx];
float2 c2 = input[n + 2 * Nx];

/* Compute DFT N = 3 */
DFT_3(c0, c1, c2);

/* Store three complex output values */
input[n] = c0;
input[n + Nx] = c1;
input[n + 2 * Nx] = c2;
}
```

Defined as:

can be expressed in the following matrix form:

```#define DFT_4(c0, c1, c2, c3) \
{                             \
float2 v0, v1, v2, v3;    \
v0 = c0 + c2;             \
v1 = c1 + c3;             \
v2 = c0 - c2;             \
v3.x = c1.y - c3.y;       \
v3.y = c3.x - c1.x;       \
c0 = v0 + v1;             \
c2 = v0 - v1;             \
c1 = v2 + v3;             \
c3 = v2 - v3;             \
}

/**
* @brief This kernel computes DFT of size 4 for the first stage
*
* @param[in, out] input It contains the input and output complex values
*/
{
/* Each work-item computes a single radix-4 */
uint idx = get_global_id(0) * 8;

/* Load four complex input values */
float8 in = vload8(0, input + idx);

/* Compute DFT N = 4 */
DFT_4(in.s01, in.s23, in.s45, in.s67);

/* Store four complex output values */
vstore8(in, 0, input + idx);
}

/**
* @brief This kernel computes DFT of size 4 for the radix stages after the first
*
* @param[in, out] input It contains the input and output complex value
* @param[in]      Nx    It is the span
* @param[in]      Ni    Nx * Ny
*/
kernel void radix_4(global float2* input, uint Nx, uint Ni)
{
/* Each work-item computes a single radix-4 */
uint kx = get_global_id(0);

/* Compute n index */
uint n = (kx % Nx) + (kx / Nx) * Ni;

/* Load four complex input values */
float2 c0 = input[n];
float2 c1 = input[n + Nx];
float2 c2 = input[n + 2 * Nx];
float2 c3 = input[n + 3 * Nx];

/* Compute DFT N = 4 */
DFT_4(c0, c1, c2, c3);

/* Store four complex output values */
input[n] = c0;
input[n + Nx] = c1;
input[n + 2 * Nx] = c2;
input[n + 3 * Nx] = c3;
}
```

Defined as:

can be expressed in the following matrix form:

```#define W5_A    0.30901699437494f
#define W5_B    0.95105651629515f
#define W5_C    0.80901699437494f
#define W5_D    0.58778525229247f

#define DFT_5(c0, c1, c2, c3, c4)               \
{                                               \
float2 v0, v1, v2, v3, v4;                  \
v0 = c0;                                    \
v1 = W5_A * (c1 + c4) - W5_C * (c2 + c3);   \
v2 = W5_C * (c1 + c4) - W5_A * (c2 + c3);   \
v3 = W5_D * (c1 - c4) - W5_B * (c2 - c3);   \
v4 = W5_B * (c1 - c4) + W5_D * (c2 - c3);   \
c0 = v0 + c1 + c2 + c3 + c4;                \
c1 = v0 + v1 + (float2)(v4.y, -v4.x);       \
c2 = v0 - v2 + (float2)(v3.y, -v3.x);       \
c3 = v0 - v2 + (float2)(-v3.y, v3.x);       \
c4 = v0 + v1 + (float2)(-v4.y, v4.x);       \
}

/**
* @brief This kernel computes DFT of size 5 for the first stage
*
* @param[in, out] input It contains the input and output complex values
*/
{
/* Each work-item computes a single radix-5 */
uint idx = get_global_id(0) * 10;

/* Load five complex input values */
float8 in0 = vload8(0, input + idx);
float2 in1 = vload2(0, input + idx + 8);

/* Compute DFT N = 5 */
DFT_5(in0.s01, in0.s23, in0.s45, in0.s67, in1.s01);

/* Store five complex output values */
vstore8(in0, 0, input + idx);
vstore2(in1, 0, input + idx + 8);
}

/**
* @brief This kernel computes DFT of size 5 for the radix stages after the first
*
* @param[in, out] input It contains the input and output complex value
* @param[in]      Nx    It is the span
* @param[in]      Ni    Nx * Ny
*/
kernel void radix_5(global float2* input, uint Nx, uint Ni)
{
/* Each work-item computes a single radix-5 */
uint kx = get_global_id(0);

/* Compute n index */
uint n = (kx % Nx) + (kx / Nx) * Ni;

/* Load five complex input values */
float2 c0 = input[n];
float2 c1 = input[n + Nx];
float2 c2 = input[n + 2 * Nx];
float2 c3 = input[n + 3 * Nx];
float2 c4 = input[n + 4 * Nx];

/* Compute DFT N = 5 */
DFT_5(c0, c1, c2, c3, c4);

/* Store five complex output values */
input[n] = c0;
input[n + Nx] = c1;
input[n + 2 * Nx] = c2;
input[n + 3 * Nx] = c3;
input[n + 4 * Nx] = c4;
}
```

Defined as:

can be expressed in the following matrix form:

```#define W7_A    0.62348980185873f
#define W7_B    0.78183148246802f
#define W7_C    0.22252093395631f
#define W7_D    0.97492791218182f
#define W7_E    0.90096886790241f
#define W7_F    0.43388373911755f

#define DFT_7(c0, c1, c2, c3, c4, c5, c6)                           \
{                                                                   \
float2 v0, v1, v2, v3, v4, v5, v6;                              \
v0 = c0;                                                        \
v1 = W7_A * (c1 + c6) - W7_C * (c2 + c5) - W7_E * (c3 + c4);    \
v2 = W7_C * (c1 + c6) + W7_E * (c2 + c5) - W7_A * (c3 + c4);    \
v3 = W7_E * (c1 + c6) - W7_A * (c2 + c5) + W7_C * (c3 + c4);    \
v4 = W7_B * (c1 - c6) + W7_D * (c2 - c5) + W7_F * (c3 - c4);    \
v5 = W7_D * (c1 - c6) - W7_F * (c2 - c5) - W7_B * (c3 - c4);    \
v6 = W7_F * (c1 - c6) - W7_B * (c2 - c5) + W7_D * (c3 - c4);    \
c0 = v0 + c1 + c2 + c3 + c4 + c5 + c6;                          \
c1 = v0 + v1 + (float2)(v4.y, -v4.x);                           \
c2 = v0 - v2 + (float2)(v5.y, -v5.x);                           \
c3 = v0 - v3 + (float2)(v6.y, -v6.x);                           \
c4 = v0 - v3 + (float2)(-v6.y, v6.x);                           \
c5 = v0 - v2 + (float2)(-v5.y, v5.x);                           \
c6 = v0 + v1 + (float2)(-v4.y, v4.x);                           \
}

/**
* @brief This kernel computes DFT of size 7 for the first stage
*
* @param[in, out] input It contains the input and output complex values
*/
{
/* Each work-item computes a single radix-7 */
uint idx = get_global_id(0) * 14;

/* Load seven complex input values */
float8 in0 = vload8(0, input + idx);
float4 in1 = vload4(0, input + idx + 8);
float2 in2 = vload2(0, input + idx + 12);

/* Compute DFT N = 7 */
DFT_7(in0.s01, in0.s23, in0.s45, in0.s67, in1.s01, in1.s23, in2.s01);

/* Store seven complex output values */
vstore8(in0, 0, input + idx);
vstore4(in1, 0, input + idx + 8);
vstore2(in2, 0, input + idx + 12);
}

/**
* @brief This kernel computes DFT of size 7 for the radix stages after the first
*
* @param[in, out] input It contains the input and output complex value
* @param[in]      Nx    It is the span
* @param[in]      Ni    Nx * Ny
*/
kernel void radix_7(global float2* input, uint Nx, uint Ni)
{
/* Each work-item computes a single radix-7 */
uint kx = get_global_id(0);

/* Compute n index */
uint n = (kx % Nx) + (kx / Nx) * Ni;

/* Load seven complex input values */
float2 c0 = input[n];
float2 c1 = input[n + Nx];
float2 c2 = input[n + 2 * Nx];
float2 c3 = input[n + 3 * Nx];
float2 c4 = input[n + 4 * Nx];
float2 c5 = input[n + 5 * Nx];
float2 c6 = input[n + 6 * Nx];

/* Compute DFT N = 7 */
DFT_7(c0, c1, c2, c3, c4, c5, c6);

/* Store seven complex output values */
input[n] = c0;
input[n + Nx] = c1;
input[n + 2 * Nx] = c2;
input[n + 3 * Nx] = c3;
input[n + 4 * Nx] = c4;
input[n + 5 * Nx] = c5;
input[n + 6 * Nx] = c6;
}
```

# Merging twiddle factor multiplication with radix stage

Generally there are a couple of important benefits to combining 2 OpenCL kernels into one:

1. fewer GPU jobs to dispatch thus reducing a possible driver overhead. In our implementation we are going to have (log2(N) - 1)fewer GPU jobs
2. fewer memory accesses which is a great thing not only for performance but also for power consumption

Regarding this last aspect, we can easily guess that, since both twiddle factor multiplication and radix computation use the same cl_buffer for loading and storing, if we had 2 separated kernels we would access the same memory location twice.

Combining these 2 kernels, the new pipeline becomes:

In order to compute the twiddle factor multiplication inside the radix kernel, we need just a small tweak as we can see, for instance, in the following radix-5 OpenCL kernel:

```/**
* @brief This kernel computes DFT of size 5 for the radix stages after the first
*
* @param[in, out] input     It contains the input and output complex value
* @param[in]      Nx        It is the span
* @param[in]      Ni        Nx * Ny
* @param[in]      exp_const (-M_2PI_F / (float)Ni)
*/
kernel void radix_5(global float2* input, uint Nx, uint Ni, float exp_const)
{
/* Each work-item computes a single radix-5 */
uint kx = get_global_id(0);

/* Compute nx */
uint nx = kx % Nx;                                   // <-

/* Compute n index */
uint n = nx + (kx / Nx) * Ni;                        // <-

/* Load five complex input values */
float2 c0 = input[n];
float2 c1 = input[n + Nx];
float2 c2 = input[n + 2 * Nx];
float2 c3 = input[n + 3 * Nx];
float2 c4 = input[n + 4 * Nx];

/* Compute phi */
float phi = (float)nx * exp_const;                  // <- Please note: there is not ky

/* Multiply by twiddle factor */
TWIDDLE_FACTOR_MULTIPLICATION(phi, c1);
TWIDDLE_FACTOR_MULTIPLICATION(2 * phi, c2);
TWIDDLE_FACTOR_MULTIPLICATION(3 * phi, c3);
TWIDDLE_FACTOR_MULTIPLICATION(4 * phi, c4);

/* Compute DFT N = 5 */
DFT_5(c0, c1, c2, c3, c4);

/* Store five complex output values */
input[n] = c0;
input[n + Nx] = c1;
input[n + 2 * Nx] = c2;
input[n + 3 * Nx] = c3;
input[n + 4 * Nx] = c4;
}
```

From the following graph we can notice that the speed-up becomes considerable (~1.5x) already for small N.

We are approaching the end of this second part but before we finish, we'd like to present a final comparison between mixed-radix and radix-2.

In the first article we introduced mixed-radix as the solution to efficiently overcoming the problem of N not being a power of 2. However if we had N power of 2, what would be the benefit of using mixed-radix rather than radix-2?

As we know, radix-2 is just a special case of mixed-radix. Since in our implementation we have the radix-4 as well, 2 consecutive radix-2 stages can be merged in a single radix-4 stage thus reducing:

1. The number of radix stages to compute
2. The number of memory accesses

The following graph shows the speed-up achievable by mixed-radix with radix-4 against a pure radix-2 implementation. As we can appreciate, the speed-up is already considerable for small N getting 1.7x better performance for N greater than 4096.

# Summary

In this second article we presented the implementation of the 3 main FFT mixed-radix computation blocks by means of OpenCL. Regarding the twiddle factor multiplications we showed how simple changes in CL kernels can significantly speed-up the computation.

We also saw that, although the nature of the pipeline is sequential, each stage represents an embarrassingly parallel problem that can be easily and efficiently implemented with OpenCL.

In the end, with the final comparison between mixed-radix and radix-2, we appreciated that mixed-radix not only provides the important flexibility in the choice of N but also provides a considerable speed-up with N power of 2 if we have the radix-4 as well.

In the next and final article of this blog series we are going to learn how to use the implemented FFT mixed-radix as a building block for computing the FFT for the case of 2 dimensions. This is mostly used in Image Processing, Computer Vision and Machine Learning applications.

Ciao,

Gian Marco

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

# Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Posted by iodice Jan 25, 2016

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

# Introduction

This is the first article of three that will focus on the implementation of Fast Fourier Transform (FFT) using the mixed-radix method on Mobile ARM® Mali™ GPU by means of OpenCL™.

This blog series continues the work of neiltan who analyzed the main strategies for optimizing the radix-2 FFT algorithm in his blog Optimizing Fast Fourier Transformation on ARM Mali GPUs.

In this first post, we are going to build up a minimal background for the 1D complex to complex FFT algorithm by starting to point out the limits of DFT using the direct computation and exploiting the well-known and commonly used radix-2 FFT.

The mathematical prospective of FFT mixed-radix will be used to explain the key concepts of digit-reverse, twiddle factor multiplication and radix computation. This will be covered extensively in the second article from the point of view of an implementation in OpenCL.

Before starting, we'd like to thank Georgios Pinitas and Moritz Pflanzer for their important help in reviewing this article and for their contribution to this implementation.

## Discrete Fourier Transform

The Discrete Fourier Transform (DFT) is a domain transform (time -> frequency, image -> frequency,..) widely used in real-time digital signal processing for a variety of tasks such as spectrum analysis and convolution in the frequency domain.

It is defined as follow:

where:

• x(n) where n [0, N-1] is the n-th element of the input complex data sequence uniformly sampled
• X(k) where k [0, N-1] is the k-th element of the output complex DFT

The inverse function of the DFT, known as the Inverse Discrete Fourier Transform (IDFT), allows us to obtain the original function.

Commonly the above expressions are written in the following tightly-packaged forms:

with

Convolution in the frequency domain corresponds to a simple pointwise multiplication

## Analyzing the complexity of the direct computation of DFT

Generally the complexity of an algorithm can be defined in terms of a number of multiplications. We can easily infer the complexity of direct computation decomposing the N-point DFT in the real and imaginary parts:

where in the above relation we have applied the following trigonometric relation:

Equating the real and imaginary part:

the direct computation would require:

• 2 trigonometric multiplications for N iterations for the real part = 2*N
• 2 trigonometric multiplications for N iterations for the imaginary part = 2*N

Since we have to compute K output complex values, we have:

• 2 * N * K for the real part
• 2 * N * K for the imaginary part

therefore the total multiplications are: 2*N^2 + 2*N^2 = 4 * N^2

declaring the complexity of direct computation as O(N^2).

## Fast Fourier Transform - Cooley–Tukey algorithm

FFT is an efficient algorithm for producing exactly the same result (in the limit of precision) as DFT but with a complexity of O(N log2 N).

This method (and the general idea of FFT) was popularized by a publication of J. W. Cooley and J. W. Tukey in 1965, but it was later discovered that those two authors had independently re-invented an algorithm known to Carl Friedrich Gauss around 1805 (and subsequently rediscovered several times in limited forms).

https://en.wikipedia.org/wiki/Fast_Fourier_transform

The most famous and commonly used FFT's algorithm is radix-2 Cooley–Tukey. It works just with N power of 2 and behind this approach we find the "divide et impera" strategy which recursively breaks down the DFT into many smaller DFTs of size 2 called radix-2.

Essentially it is like we are seeing N composed by log2(N) 2's factors (i.e. N = 2 x 2 x..)

Looking at the picture we have:

• log2(N) radix-2 stages. Each one composed of N/2 radix-2 basic elements
• (log2(N) - 1) twiddle factor multiplications stages

Please note that after the first radix stages, the inputs of each radix basic element will not be in consecutive order. In particular there will be an offset called "span" between each input which will depend on the radix stage.

Often for signal processing applications we design our system with radix-2 FFTs in mind. However, having N power of 2 for exactly fitting radix-2 FFTs can cause a significant performance drop in some applications.

For instance, if we consider the worst case for a 1D FFT, forcing the use of radix-2 would require double the amount of data input and consequently double the amount of computation. Moreover, this loss of performance would be much worse for the 2D case. In applications of image processing we have quite often dimensions not power of 2 and then doubling the size of both rows and columns, it would increase by a factor of 4 the total FFT size.

However, even if we have previously factorized N as N = 2 x 2 x 2..., any factorization would generally be possible.

Factorizing N as N = N1 x N2 x N3 x .. the "divide et impera" strategy remains the same but the new algorithm recursively breaks the DFT into many smaller DFTs of sizes N1, N2, N3 called respectively radix-N1, radix-N2, radix-N3,..

In terms of pipeline stages we are going to have:

• N / Ns DFTs of length Ns for each radix stage, where Ns is the radix order
• Z radix's stage with Z equal to the number of factors used for compounding the original length N.

From this first analysis we can underline that the factorization is the fundamental principle behind the Cooley–Tukey algorithm and the radix-2 algorithm is just a special case of mixed radix.

Virtually mixed-radix allows us to compute the FFT on any length N of input data but as typical implementations use a limited number of optimized radixes such as radix-2, radix-3, radix-5, radix-7, the computation will be restricted to those N that are an expansion of these available factors. In our case we are going to use:

so N must be compound of power of 2,3,5,7.

Please note: In the factorization of N there is not 4 as it is a power of 2. The motivation behind the choice to implement an highly optimized radix-4 will be clear in the second article where we will compare the performance of radix-2 and mixed-radix algorithms.

This section will detail the maths behind the Cooley-Turkey mixed-radix algorithm and will explain concepts such as twiddle factor multiplication and digit-reverse.

Since factorization is the key principle of Cooley–Tukey mixed-radix algorithm, let's start to decompose N in 2 factors, Nx and Ny:

Let's arrange the input data x(n) in a matrix with Nx columns and Ny rows where n-th element can be addressed in the following manner:

with:

• nx = n % Nx, scans the columns - nx [0, Nx - 1]
• ny = floor(n / Nx), scans the rows - ny [0, N / Nx - 1]

i.e.

If we had N = 21 and we factorized N as N = 7 x 3, the 16-th element of the input array would be addressed:

• ny = floor(16 / 7) = 2
• nx = 16 % 7 = 2

address of 16-th element = 2 + 2 * 7

using this notation, the DFT can be written as follow:

Now let's place the output elements X(k) in a matrix as well but with Nx rows and Ny columns. The motivation behind this choice will be clear during the demonstration.

The k-th element will be addressed with the following relation:

For addressing the output elements, we use kx and ky:

• kx = floor(k / Ny), scans the rows - kx [0, N / Ny - 1]
• ky = (k % Ny), scans the columns - ky [0, Ny - 1]

Addressing the output elements is completely upside-down because ky scans the columns whilst kx the rows.

and the DFT becomes:

Let's evaluate the complex exponential:

• Term 1: Always 1 because multiple of 2π
• Term 4: Kind of trigonometric complex constant (twiddle factor)

The exponential is easy to study thanks for choosing the output matrix with Nx rows and Ny columns.

Writing the 4th term as:

the final DFT's expression becomes:

This is the expression behind the computation of mixed-radix algorithm in the case of 2 factors.

If we had for instance N = 21 and we factorized N as N = 7 x 3, we could graphically summarize this final expression in the following manner:

Which can be translated in the following pipeline:

# Summary

1. If Nx and Ny were composite numbers, we could again apply the decomposition to Nx and Ny
2. The outer sum in the final expression is the DFT of each row after first multiplying by a proper trigonometric complex constant called "twiddle factor"
3. The final result does not change by swapping the order of the factors
4. Looking at the picture we can easily guess that each stage of the pipeline is an embarrassingly parallel problem.

In the end from the demonstration we can highlight the 3 main computation blocks behind this algorithm:

1. Digit-reverse stage in order to have the output complex values in the correct order
2. Twiddle factor multiplication stage which multiplies each value by a proper complex constant before transmitting to the next radix stage
3. Radix stage which computes an highly optimized DFT

With this first article we have finished to illustrate the minimal background behind FFT mixed-radix algorithm and we are definitely ready to dive in the next article where we are going to play with the OpenCL kernels

Ciao,

Gian Marco

# References

1. https://community.arm.com/groups/arm-mali-graphics/blog/2014/05/30/fast-fourier-transformation-vectorization-on-arm-mali-gpus
2. http://www.engineeringproductivitytools.com/stuff/T0001/PT07.HTM
3. Fast Fourier Transform - Algorithms and Applications, K. R. Rao, Do Nyeon Kim, Jae Jeong Hwang
4. The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith

# ARM enables power efficient, enhanced visual experiences on 2.5K and 4K mobile devices with the Mali-DP650 Display Processor

Posted by vassilisandroutsopoulos Jan 21, 2016

Yesterday ARM introduced its highest performing and most power efficient display processor technology, the ARM® Mali™-DP650, which targets next generation premium and mass market devices.

The Mali-DP650 delivers performance and power efficiency gains of up to 2X compared to its predecessor, the Mali-DP550, in order to address the shift in mobile panel resolution from Full HD (1920x1080 pixels) to 2.5K formats including WQXGA (2560x1600 pixels) for tablets and WQHD (2560x1440 pixels) for a mobile screen. The Mali-DP550 remains the processor of choice for HD (1280x720 pixels) and Full HD devices.

Mali-DP650 offers enhanced 4K (3840x2160 pixels) capabilities and the ability to send 4K to a larger screen through WIFI streaming (e.g.Miracast) or via a connector (such as HDMI or USB-C). It enables more display processor use cases at 4K further reducing the requirement to engage other hardware blocks such as the GPU, thereby saving power and prolonging battery life.

Like its predecessor, the Mali-DP550, the Mali-DP650 offers an array of advanced features such as 7-layer composition, rotation, high quality scaling and a variety of pre and post-processing functions for video and graphics.

Mali-DP650: Performing functions such as composition, rotation, advanced scaling, colour conversion and tone mapping in a single pass through external memory

Addressing the performance challenges as mobile panels move towards 4K

Previously, it was thought that Full HD video and graphics content offered sufficient detail for a mobile screen. However, there is no doubt that being able to display video, UI and gaming content at even higher resolutions makes the overall output sharper and clearer. This has already led to a visible market presence of mobile screens capable of delivering higher resolutions - WQHD, WQXGA and even 4K is now starting to appear. This, however, poses a trade-off as handling higher resolution display layers natively means a costlier SoC in terms of both size and power consumption. Yet demand for such SoCs is growing and so therefore is the demand for effective solutions to address this issue with minimum impact on area, bandwidth and power.

Mali-DP650 doubles the available AXI bus width to 128-bits and adds significant memory subsystem optimizations which results in doubling the pixel throughput at the specified AXI clock rate. This exceeds performance expectations by being able to composite, process and display a combination of up to three full-frame 4K graphics and 10-bit video layers without any external intervention. This means that heavy composition use cases such as 4K video playback or gaming with UI overlays can be offloaded from the GPU onto the Mali-DP650 display processor, delivering power savings and allowing the GPU to focus on tasks it’s optimized for.

Mali-DP650 in a typical Mass Market Mobile system

Mali-DP650 also provides an MMU pre-fetcher solution to hide the delays caused by page table walks and enhances the overall interoperability with ARM’s CoreLink MMU-500 thus improving its overall tolerance to system latency. This translates to being able to build a lower cost interface to the display processor whilst still being able to deliver the same superior quality required by Full HD and beyond state-of-the-art mobile panels.

ARM Frame Buffer Compression (AFBC) is a system bandwidth saving technology which is known to deliver up to 50% system bandwidth saving if implemented throughout the multimedia system (GPU, Video and Display). Mali-DP650 implements AFBC and further optimizes AFBC efficiency by ensuring superior performance with DDR controllers.

Mali-DP650 display output is compatible with MIPI (Mobile Industry Processor Interface) DPHY transmitters.  The Mali-DP650 offers split-display mode which is necessary to be able to talk to MIPI DPHY mobile panels with screen resolutions beyond Full HD. Split display mode allows for an image to be split into two equal parts at the display output thereby allowing two separate MIPI DSI transmitters to be driven in parallel. This doubles the available data rate making display resolutions up to 4k a reality on mobile or tablet devices.

Variable refresh rate saves power on the mobile panel

A display processor sends out pixel information to be displayed at a refresh rate that matches the rate at which the panel refreshes. In most systems today this is 60 times a second (60Hz/60fps). If the display processor did not send the information at the required rate then the panel would not be able to synchronize with the system and the visual output would not be displayed (or would be displayed with clear visual artefacts). Always sending output frames from the system to the panel at a fixed, and high, refresh rate has a key disadvantage. Even when the system is producing new visual information at a much slower rate (e.g. during video playback or when displaying a static webpage), the display processor is still running at the high refresh rate in order to keep the panel fed with information. In mobile, battery-powered systems this is detrimental to the battery life as the majority of the display subsystem must remain running continuously. To counter that, Mali-DP650 supports variable refresh - a technique whereby the output signal from the display processor is not transmitted at a constant and fixed rate but can vary on a per frame basis. The lowest frame rate will depend on the panel itself but a typical target would be 30Hz, which contributes to a longer battery life.

Enabling more composition use cases at resolutions beyond Full HD

Mali-DP650 can receive the 10-bit per pixel 4K video layer in either AFBC or YUV formats alongside additional UI layers (RGB888a or AFBC formats) and composite them with alpha blending. It can also rotate, downscale or upscale to the required resolution and even tone map the final output before sending it to the display. All of this functionality is performed in a single pass through the display processor without having to write back to external memory, thereby significantly reducing the overall system bandwidth and consequently, the power consumption. In addition, Mali-DP650 can be configured to act as a content passport for users as they move from a smaller mobile screen to a larger screen - 4K streaming video playback to external 4K displays via either HDMI, USB-C or through Wi-Fi transmission.

Another common video use case on a mobile device is a mobile video call via IP. This requires simultaneous video streams, one from the local camera and one from the other device, to be composited as a picture-in-picture (PIP) on the mobile display. The Mali-DP650 display processor enables PIP on the mobile display by allowing composition for the two video layers along with an additional graphics UI overlay, and can also upscale or downscale to meet the requirements of the local mobile panel. This again provides system-wide power savings by enabling composition, rotation and scaling without having to engage the GPU.

And it’s not just about the hardware…

Mali-DP650 is delivered with an Android HW Composer driver and an associated ADF-based Kernel Driver optimized to work alongside Mali GPU and Mali Video drivers. The ADF kernel device drivers are licensed under GPLv2 and are available in the driver section of the Mali Developer Centre. For more information on the Mali Display Android drivers, please check out the following blog: Do Androids have nightmares of botched system integrations?

In summary…

The Mali-DP650 display processor is 4k capable and optimised for 2.5k. It offers new features and performance enhancements that enable beyond Full HD mobile and tablet displays with optimal power usage. Mali-DP650 aims to drive new visual content and gaming experiences on mobile devices and act as a content passport for users as they move from a smaller mobile screen to a larger screen, such as a 4K smart television. Mali-DP650 further expands the Mali family of display processors, including Mali-DP550, which remains the processor of choice for mobile panels below Full HD. Display processors are critical for enabling the best overall user experience and prolonging the battery life of the device and we expect the Mali-DP650 to begin appearing in devices from early 2017.

# Mali Christmas, Popular content of 2015

Posted by Ryan Booth Dec 21, 2015

Hello everyone,

We're all looking forward to Christmas so I wanted to keep this one short, for those who might be checking in over the holidays here's some of the most popular content from the ARM Mali Graphics group this year, enjoy!

We hope you've found all our content this year to be useful and they'll be plenty more next year.

Missed the Graphics week back in April? Take some time to read the great blogs that came out of this week long Community event:

ARM Mali Graphics Week - April 13 -17, 2015

Virtual Reality was a hot topic this year, reflect with this series from freddijeffries:

Is Virtual Reality the Next Big Thing…Again?

Virtual Reality: A Focus on Focus

Virtual Reality: The Ice Cave

and learn about the Mali GPUs set to drive this innovation forward in 2016

Mali-T880 is set to Deliver the Premium Mobile Experience of 2016

Why not learn more about the ARM Mali demos and some of the techniques from these popular blogs:

Inside the Demo: GPU Particle Systems with ASTC 3D textures

Dynamic Soft Shadows Based on Local Cubemap

also worth a look are the Reflections Based on Local Cubemaps in Unity

From Hours to Milliseconds: Project Ice Cave

Achieving the icy wall effect in the Ice Cave demo

Learn to achieve great lighting and Global Illumination in your games with Enlighten by Geomerics

Shedding Light on Global Illumination Techniques

Forging the Future of Video Games with Enlighten 3

Lighting in Games

Looking for some projects over Christmas? Perhaps you've got a new Chromebook (User Space drivers available here), why not get Linux set-up with the below guide :

Linux on Chromebook with ARM® Mali™ GPU

Take a look at the ARM tools and getting them set-up with your device:

lorenzodalcol dived into the Mali Graphics debugger with a two part case study:

Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel

Mali GPU Tools: A Case Study, Part 2 — Frame Analysis with Mali Graphics Debugger

Using DS-5 Streamline with Mali on Samsung Galaxy Note 3 & 10.1

Using DS-5 Streamline with Mali on Samsung Galaxy Note 4

Using DS-5 Streamline with Mali on Samsung Galaxy SIII

Using DS-5 Streamline with Mali on Google Nexus 10

Working on performance on Mali in the new year? Learn from the expert with these series of blogs:

peterharris' series of Mali Performance blogs are a great read:

Mali Performance 1: Checking the Pipeline

Mali Performance 2: How to Correctly Handle Framebuffers

Mali Performance 3: Is EGL_BUFFER_PRESERVED a good thing?

Mali Performance 4: Principles of High Performance Rendering

Mali Performance 5: An Application's Performance Responsibilities

Mali Performance 6: Efficiently Updating Dynamic Resources

Mali Performance 7: Accelerating 2D rendering using OpenGL ES

Want to know more about the Mali GPU? Check out the below series also from peterharris:

The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining

The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering

The Mali GPU: An Abstract Machine, Part 3 - The Shader Core

Merry Christmas,

Ryan

# Shedding Light on Global Illumination Techniques

Posted by Ellie Stone Dec 18, 2015

This blog is based on a recent article written by Geomerics for the publication Making Games. For more information on any of the techniques outlined in this blog, read more here.

Lighting is the most fundamental element of any game environment. Applied correctly, it brings three dimensional geometry to life, amplifies the emotions of a scene and draws the eyes to important areas of focus. Global illumination, the interplay of lights and materials, plays a significant role in creating this balanced scene lighting.

Image 1: The left image shows direct lighting only; the right image shows the difference the addition of global illumination makes.

It is vital that games industry members understand the variety of techniques available and adopt the one that is most suitable for their platform, environment, gameplay and performance budget. In this article we’ll explore a variety of options available, weigh up their advantages and limitations and conclude with recommendations for different scenarios.

### 1. Baked Lighting

Baked lighting is an extremely common technique used for both direct and indirect lighting information as well as other effects such as ambient occlusion. It fully pre-calculates lighting information for static scenes and stores the output in static data structures which are then consumed by fragment shaders. The most commonly used data structures are textures, spherical harmonics probes and cubemaps.

Baking can produce high quality results and is among the most affordable runtime techniques, enabling performance efficient indirect lighting even on mobile devices.  However, everything is static at runtime both in the game and in the editor. Light sources, materials and meshes can’t be changed at runtime. The only option for dynamic content is to swap out entire data sets, which is very costly with respect to memory consumption as well as creating and managing this additional content. This either limits dynamic gameplay options or requires sacrifices with respect to visual quality or dynamic content.

In addition, baked lighting has the lengthiest workflow of all the global illumination solutions listed in this article. In order to see the effect of any changes made to a scene, no matter how small, artists need to run a full lighting bake. Given that in order to get a scene to AAA quality an artist will need to iterate on each individual light many times, baking can have a significant impact on a studio’s bottom line as well as the quality of the final result.

### 2. Light Propagation Volumes

The technique for light propagation volumes was proposed by Kaplanayan and Dachsbacher in 2010 as an alternative to traditional baked lighting and first implemented in CryEngine 3. The core idea behind this technique is to inject virtual point lights into a three-dimensional data structure, aggregate the injected lights for each cell, and finally iteratively propagate them to neighbouring cells.

Light propagation volumes allow lights, materials and objects to be fully dynamic at runtime. They work well with complex geometry and are relatively stable and flicker-free. At basic quality levels the technique can even scale down to mobile platforms. In addition, light propagation volumes can theoretically support multiple bounces, glossy reflections and participating media; however, in reality these extensions are considered too expensive for practical applications and the computational resources are rather used to improve the base lighting quality.

Image 2: Light propagation volumes can support participating media and glossy reflections

The technique has a number of issues that limit its application, the biggest being their low spatial resolution due to memory constraints even when employing cascades. Due to this, light and shadow leaking are a common occurrence and the lighting result is fairly ambient with low contrast. Combined with the fact that runtime performance scales with the number of light sources, light propagation volumes are mostly used for outdoor scenes with only a single directional light source.

### 3. Voxel Cone Tracing

While light propagation volumes calculate lighting for each cell and apply the indirect lighting result by a simple lookup into a volume texture, voxel cone tracing only calculates lighting in areas where there is actual geometry while performing a much more involved aggregation of the indirect lighting in the fragment shader.

Image 3: Comparison of light propagation volumes and voxel cone tracing

Voxel cone tracing provides an elegant light propagation and aggregation model for both indirect lighting as well as specular reflections. It is able to handle complex geometry including transparency and participating media while delivering smooth lighting results at a higher quality than light propagation volumes. However, it is very expensive to achieve these quality levels and the technique cannot scale to more budgeted platforms. Voxel cone tracing still suffers from significant light leaking issues even at medium quality levels and the directional information required to significantly improve quality would further increase memory cost.

### 4. Pre-computed Visibility

Determining the visibility between surfaces is the most expensive part of calculating global illumination. By pre-computing the surface-to-surface visibility information, compressing this data and only at runtime combining it with material and direct lighting information, it is possible to provide high quality dynamic global illumination effects across all gaming platforms – even on mobile.

Image 4: This graph visualizes how pre-computed visibility global illumination works

Pre-computed visibility global illumination is similar to baked lighting in respect to storing its data output in the same three data structures: lightmaps, spherical harmonics probes and cubemaps. The difference is that precomputed visibility makes dynamic global illumination effects such as time of day, changing materials and player controlled lighting efficient enough to occur at runtime no matter the platform. Additionally, once the initial pre-computation is done, all light sources and materials can be changed in real-time in the editor and the resultant effect on the global illumination can be seen instantly. This allows artists to iterate over the lighting of a scene much more efficiently than with baked lighting, resulting in huge time-savings and often better visual quality.

The limitations to this technique include the requirement for enough geometry in the scene to be static to enable a precompute – certain environments, for example procedurally generated, completely dynamic or user-generated content, are not suitable. And while it is still much faster than traditional baked lighting, the pre-computation step may well be an adjustment for artists who are used to other techniques.

ARM’s Enlighten is the industry standard for pre-computed visibility global illumination. It is easily accessible as the default global illumination solution inside Unity 5 and licensable for Unreal Engine 3 and 4 as well as for internal engines.

### Conclusion

The aforementioned techniques for computing global illumination can broadly be sorted into two categories:

 Surface Techniques Volume Techniques Baked lighting Light propagation volumes Pre-computed visibility Voxel cone tracing

Surface techniques compute indirect lighting based on the transfer of light between two-dimensional surfaces, whereas volume techniques rely on an underlying three-dimensional data structure. Surface techniques require different degrees of pre-computation and offer limited support for dynamic objects. However, they are affordable at runtime, don’t suffer from light or shadow leaking and are the only techniques that have proven scalability down to mobile platforms.

So if you are developing a game right now, which global illumination technique is best for you? The answer to this question depends strongly on the game environment and target platforms you are considering.

Provided that your scenes are static, baked lighting is a great option for affordable lighting that scales to mobile.

If your game consists largely of outdoor areas and can be lit by a single directional light source, light propagation volumes is a techniques that you might want to consider.

For those titles which are intended only for high-end PCs, voxel cone tracing is an option. Yet, in many cases it is too expensive, especially with respect to memory consumption.

In all other cases, pre-computed visibility is a scalable, flexible and proven technique. If your game consists of enough static meshes to support the necessary surface-to-surface pre-computation, it offers the benefits of dynamism and rapid artist workflow along with the scalability enabled by baked lighting.

# Mali-T880 is set to Deliver the Premium Mobile Experience of 2016

Posted by freddijeffries Dec 17, 2015

ARM®’s newest high performance GPU, the Mali™-T880, delivers better than ever graphics performance whilst remaining well within the energy constraints of a mobile platform. The exciting possibilities this opens up are beginning to be realised with the recent announcements of the Mali-T880 appearing in key devices by two of ARM’s partners.

As graphics content for mobile devices becomes increasingly complex, it places greater and greater pressure on the computational capabilities of GPUs. For a start, consumer expectations are that mobile gaming will provide the same high quality user experience as that available on console. The recent surge in virtual reality content, which is expected to make the biggest impact in mobile rather than desktop, also means that high capability GPUs are essential for all levels of mobile devices. Even mid-range smartphones these days are required to perform increasingly challenging computational procedures at low cost, so the next generation of high end mobile devices to hit the market are facing ever higher requirements for superior graphics capability. One of the key requirements is to be able to deliver high-end mobile gaming and applications, whilst still constrained by the power (and therefore thermal) limits of mobile devices.

The ARM Mali family of GPUs is perfectly positioned to address the needs of the whole mobile industry from entry level to top of the line, next-generation products. Based on the Mali Midgard architecture, the Mali-T880 is scalable from one to sixteen cores. It features an additional arithmetic pipeline per shader core in order to handle greater computational throughput than its predecessors and is up to 40% more energy efficient. Not only that, but the adoption of bandwidth saving technologies such as ARM Frame Buffer Compression (AFBC), Smart composition and Transaction elimination all allow for the Mali-T880 to be extremely efficient on memory bandwidth.

ARM Mali-T880 GPU Structure

Two of ARM’s partners have recently released their take on the Mali-T880, Samsung and Huawei have both taken advantage of the exceptional performance and leading energy efficiency of the Mali-T880 to fulfil these very current needs. Huawei’s Mate 8 smartphone, announced last week, is based on the HiSilicon Kirin 950 chipset and contains an optimal combination of ARM Cortex-A72 and Cortex-A53 CPUs and the Mali-T880 GPU to provide a harmonized suite of efficient computing power. Samsung’s Exynos 8 Octa uses the Mali-T880 to ensure the highest quality user interface, immersive gaming and virtual reality content.

Huawei’s New Mate 8 Smartphone

2016 is set to feature yet another step forward in high end devices and with the launch of the Samsung Gear VR, consumer expectations for mobile graphics continue to grow. As more and more of ARM’s partners begin to utilise the Mali-T880, keep an eye out for the next generation of devices hitting the market in the new year.

Stay tuned for more news on Mali and Premium mobile, there are heaps of exciting developments to come.

# Virtual Reality: A Focus on Focus

Posted by freddijeffries Dec 14, 2015

As discussed in our virtual reality (VR) intro blog, whilst the VR industry is moving from strength to strength, there are still a few key issues which often stand in the way of developing a truly successful VR experience. For the purposes of this blog series we’ll be focussing on what we consider to be some of the main blockers. We’ll discuss the best ways to overcome these with technology and techniques that are currently available, as well as considering what we foresee in the future of the field.

So, to focus on focus. Our eyes are complicated objects and work very precisely to ensure the images we see are processed clearly and accurately (unless you’re a secret spectacle-wearer like me, in which case they’re letting the side down to be honest). The eye works in a similar way to a magnifying glass by concentrating light through the cornea and lens and directing it to the retina. When an image is in focus, all paths of light from a single pixel travel to a single point on the retina, allowing us to see it clearly. When an image is out of focus however, that same pixel travels to different parts of the eye, causing the image to appear blurry.

Light paths reaching the eye in focus

Light paths reaching the eye out of focus

This is a particular issue in current mobile VR systems as the images you see are all the same distance from your eye, namely just a few centimetres away on the screen of your smartphone. The images remain at this focal depth even if the differences between what is shown to each eye tell us that the image is at a different depth. This conflict between the depth from the stereo image (vergence) and the apparent focal depth (accommodation) makes it difficult for our brains to adjust. This can lead to visual discomfort, headaches and nausea.

The impact on VR

The simplest type of video content produced for VR is monoscopic, using what amounts to a single ‘360°’ camera. In practice this single image will likely use multiple cameras to cover the entire field of view and stitch the separate images together retrospectively. This is the cheapest and easiest method of 360° video production but doesn’t provide any of that vital depth information. A more complicated, but arguably better, approach is to use stereoscopic video which is produced by capturing two 360° images, one for each eye, providing depth perception from the stereo image. This method is more difficult to get right due to the complications of stitching the images for each eye together accurately. Capturing enough information to reproduce the focal depth as well as a stereo image is more complicated still, but advances are being made every day.

Arguably, the most successful way of addressing this issue of focus currently is with the effective use of light field displays. A ‘light field’ is how we describe all the light travelling through a region of space. A simple way to think about a light field display is to think about a regular display that could show you a different picture depending on the direction from which you viewed it. Although it may not be obvious, light field displays allow us to show images that have different focal depths. Objects further away from the viewer would create a noticeably different light field on the display than the same objects closer up. This requires the eye to adjust its focus to see them clearly, removing the brain-confusing mismatch between the focal and stereo depths.

Micro-lens Arrays

One way of creating a light field display to improve focus in VR is with the use of a micro-lens array, which is an overlaid transparent sheet with huge numbers of tiny, raised lenses.

Micro-lens array

Each micro-lens covers a small number of pixels of a regular display. These are beginning to emerge as technologies for wearables such as smartwatches and the image you see changes dependent upon which way you view it, a bit like an advanced version of the lenticular images you get in breakfast cereal boxes. However, this micro-lens method forces a trade off against resolution as it’s effectively turning multiple pixels into one.

Micro-lens display

Micro-lens arrays are also reported to be complicated to produce at present, so we’ll also consider an alternative option.

The Light Field Stereoscope

To take advantage of the depth benefits of stereoscopy, multi-layer displays are currently being researched, where multiple display panels are layered with small gaps separating them. The eye sees each panel at a different focal distance and so by carefully crafting the content displayed on each layer, the display can place images at an appropriate depth. At SIGGRAPH 2015, Stanford University’s Huang et al presented the ‘Light field Stereoscope’ in which two LCD panels are backlit and placed one behind the other in the VR headset, with a spacer between. This allows us to project the background, or distant, images on the rear screen while the front screen displays images in the foreground of the scene you’re viewing. Distances in the middle of this range can be depicted by displaying partial images on each. This approximate light field adds some focal depth to the display, with objects in the foreground occluding those further back. The interaction of the two 2D displays is not the same as a true 4D light field but may well be sufficient. However, while there isn’t the same resolution trade off that we saw with the micro-lens approach, the front LCD panel acts as a diffuser and can therefore introduce some blurring.

Huang et al’s Light Field Stereoscope

What comes next?

To accompany light field displays, in recent years we have seen the emergence of light field cameras such as Lytro, which capture the full 4D light field across a camera aperture. These computational cameras allow you to produce light field images that allow you to refocus or move the view position after capture, opening up all kinds of creative possibilities. These cameras are also ostensibly better than your average camera as they typically capture more light. Next-generation 360° light field cameras offer to take this further and extend the focus and viewpoint-independent properties of light field images to VR. This bodes well for the future of 360° light field video playback on mobile VR devices. It will allow the user to explore the whole scene with freedom of head movement and natural depth perception, all within a VR headset. Another emerging area in VR is positional tracking on mobile, which will allow real time response of the images to the actual physical location of the user’s head, a vital point for achieving a truly immersive experience and something we’ll be considering in more depth in the future.

Follow me on the ARM Connected Community and Twitter to make sure you don’t miss out on the next post in our VR series!

# Is Virtual Reality the Next Big Thing…Again?

Posted by freddijeffries Nov 26, 2015

In this blog series we’ll be looking at the current status of Virtual Reality with a special focus on mobile VR, where it’s heading in the future and what ARM® can do to help us get there. We’ll be covering the common challenges and pitfalls and how these can be avoided to produce a great VR experience. We’ll also keep you up to date with other blogs and discussions around technical tips, tutorials and FAQ’s; so stay tuned for the future of mobile VR.

Where VR is now

Virtual Reality isn’t new, people have been talking about it since the 90s, so why has the industry never quite taken off in the way we might expect? The quick answer is that the technology simply wasn’t there. The hardware was prohibitively expensive and very bulky and the graphics capabilities were too limited to produce a successful VR user experience - unless you consider motion sickness a success. Now however, lower cost hardware based on existing platforms is changing the game, with mobile platforms offering console-like performance. Not only that, but existing mobile devices already contain many of the sensors VR requires, from gyros to accelerometers, opening up a whole world of mobile VR possibilities.

What’s next for VR

The Virtual Reality industry has a forecast worth of US\$30 billion by 2020, and that all has to come from somewhere.

Fig.1 Digi-Capital VR Revenue Forecast

Gaming is of course a huge industry and a high-end, immersive gamer experience can now be literally at your fingertips. Mobile VR allows you to become fully involved in your chosen game at home, work, or while trying to escape the monotony of public transport; but that’s not all VR can do. Researching a university assignment can be a chore, but how about if you could visit the most relevant museums or seminars without having to leave the dorm? VR allows us to see exhibitions in world class museums and galleries without an expensive trip to London, Paris, or anywhere else. Shopping too, isn’t everyone’s favourite pastime, especially around the Christmas rush. Wouldn’t it be great if you could wander the aisles and compare options for your next car, sofa, TV or even pair of shoes, without tripping over pushchairs or being upsold by pushy assistants? All this is possible with the huge technical advances in VR and it’s only a matter of time until this is our standard way of working.

Fig.2 nDreams® Perfect Beach experience allows you to get away from it all without leaving the sofa

So how does VR actually work?

Technology is the key to VR success and this blog series will talk about exactly what you need to make it happen. VR comes in mobile or desktop options, but according to Oculus® Co-founder Palmer Luckey, desktop VR is seriously compromised by the requirement for a ‘cable servant’ to follow the user around preventing trip hazards. So mobile VR is the quickest way forward, and the simplest of the mobile options allows you to simply slot your smartphone into the headset and get started. The headset provides you with a stereoscopic display, with two marginally different images rendered for the left and right eye, allowing the user to experience depth. Barrel distortion is then applied to the rendered images in post processing to counteract the curvature of the lenses.

Fig.3 Marginally different images for each eye allow the perception of depth and barrel distortion applies curvature to the image to counteract the curvature of the lens

Finally, sensors in the device detect the movement of your head and adjust the scene in real time to render the updated view to the headset and allow realistic visual feedback. Going forward, additional sensors will facilitate live hand-tracking for a truly immersive experience, and this can be combined with the use of an inbuilt or add-on controller to allow you to interact fully with your virtual surroundings.

VR Optimisation with Mali GPUs

As with any emerging technology, there are issues that can stand in the way of a truly successful VR user experience. These include low resolution blurring the image and compromising visual quality, or a low frame rate making the display appear stilted or jerky. A major issue experienced when developing for VR is latency, or the time it takes for the on-screen image to catch up with the user’s head movement, and this is one of the key causes of sickness or dizziness in VR users.

The ARM® Mali™ GPU family is the world’s #1 licensable GPU in terms of shipments and is perfectly positioned to deliver an optimum VR experience. Mali GPU architecture enables high resolution and power saving through various features such as Adaptive Scalable Texture Compression (ASTC); and ARM Frame Buffer Compression (AFBC) dramatically reduces system bandwidth, with performance fully scalable across multiple cores. Mali support for extensions to OpenGL ES and EGL reduce latency and improve overall performance.

What we’re doing now

At events like VRTGO ARM recently demonstrated  how great a mobile VR experience can be with the Mali-based Samsung® Gear VR headset, a collaboration from Samsung Mobile and Oculus. The first version was based on the Galaxy Note 4, with the second generation now available for the Galaxy S6, both powered by the Mali-T760. The Ice Cave Demo, featuring Geomerics Enlighten global illumination in collaboration with RealtimeUk; was easily ported to VR on the Samsung Gear VR headset; read about how we did this here.

Superior visuals and a smooth user experience are all possible in mobile VR and throughout this blog series we’ll be discussing the common challenges surrounding developing for VR and how ARM’s technology and Mali GPUs can help you overcome them.

Stay tuned for more on mobile VR!

# Mali Performance 7: Accelerating 2D rendering using OpenGL ES

Posted by peterharris Nov 19, 2015

It's been a while since my last performance blog, but one of those lunchtime coffee discussions about a blog I'd like to write was turned into a working tech demo by wasimabbas overnight, so thanks to him for giving me the kick needed to pick up the digital quill again. This time around I look at 2D rendering, and what OpenGL ES can do to help ...

A significant amount of mobile content today is still 2D gaming or 2D user-interface applications, in which the applications render layers of sprites or UI elements to build up what is on screen. Nearly all of these applications are actually using OpenGL ES to perform the rendering, but few applications actually leverage the 3D functionality available, preferring the simplicity of a traditional back-to-front algorithm using blending to handle alpha transparencies.

This approach works, but doesn’t make any use of the 3D features of the hardware, and so in many cases makes the GPU work a lot harder than it needs to. The impacts of this will vary from poor performance to reduced battery life, depending on the GPU involved, and these impacts are amplified by the trend towards higher screen resolutions in mobile devices. This blog looks at some simple changes to sprite rendering engines which can make the rendering significantly faster, and more energy efficient, by leveraging some of the tools which a 3D rendering API provides.

# Performance inefficiency of 2D content

In 2D games the OpenGL ES fragment shaders are usually trivial – interpolate a texture coordinate, load a texture sample, blend to the framebuffer – so there isn’t much there to optimize. Any performance optimization for this type of content is therefore mostly about finding ways to remove redundant work completely, so that the shader never even runs for some of the fragments.

The figure in the introduction section shows a typical blit of a square sprite onto a background layer; the outer parts of the shield sprite are transparent, the border region is partially transparent so it fades cleanly into the background without any aliasing artifacts, and the body of the sprite is opaque. These sprite blits are rendered on top of what is in the framebuffer in a back-to-front render order, with alpha blending enabled.

There are two main sources of inefficiency here:

• Firstly, the substantial outer region around this sprite is totally transparent, and so has no impact on the output rendering at all, but takes time to process.
• Secondly, the middle part of the sprite is totally opaque, completely obscuring many background pixels underneath it. The graphics driver cannot know ahead of time that the background will be obscured, so these background fragments have to be rendered by the GPU, wasting processing cycles and nanojoules of energy rendering something which is not usefully contributing to the final scene.

This is a relatively synthetic example with only a single layer of overdraw, but we see real applications where over half of all of the rendered fragments of a 1080p screen are redundant. If applications can use OpenGL ES in a different way to remove this redundancy then the GPU could render the applications faster, or use the performance headroom created to reduce the clock rate and operating voltage, and thus save a substantial amount of energy. Either of these outcomes sounds very appealing, so the question is how can application developers achieve this?

# Test scene

For this blog we will be rendering a simple test scene consisting of a cover-flow style arrangement of the shield icon above, but the technique will work for any sprite set with opaque regions. Our test scene render looks like this:

… where each shield icon is actually a square sprite using alpha transparencies to hide the pieces which are not visible.

In traditional dedicated 2D rendering hardware there are not usually many options to play with; the application has to render the sprite layers from back to front to make sure blending functions correctly. In our case the applications are using a 3D API to render a 2D scene, so the question becomes what additional tools does the 3D API give the applications which can be used to remove redundant work?

The principal tool used in full 3D scene rendering to remove redundant work is the depth test. Every vertex in a triangle has a “Z” component in its position, which is emitted from the vertex shader. This Z value encodes how close that vertex is to the camera, and the rasterization process will interpolate the vertex values to assign a depth to each fragment which may need fragment shading.  This fragment depth value can be tested against the existing value stored in the depth buffer and if it is not closer1 to the camera than the current data already in the framebuffer then the GPU will discard the fragment without ever submitting it to the shader core for processing, as it now safely knows that it is not needed.

# Using depth testing in “2D” rendering

Sprite rendering engines already track the layering of each sprite so that they stack correctly, so we can map this layer number to a Z coordinate value assigned to the vertices of each sprite which is sent to the GPU, and actually render our scene as if it has 3D depth. If we then use a framebuffer with a depth attachment, enable depth writes, and render the sprites and background image in front-to-back order (i.e. the reverse order of normal blitting pass which is back-to-front) then the depth test will remove parts of sprites and the background which are hidden behind other sprites.

If we run this for our simple test scene, we get:

Uh oh! Something has gone wrong.

The issue here is that our square sprite geometry does not exactly match the shape of opaque pixels. The transparent parts of the sprites closer to the camera are not producing any color values due to the alpha test, but are still setting a depth value. When the sprites on a lower layer are rendered the depth testing means that the pieces which should be visible underneath the transparent parts of an earlier sprite are getting incorrectly killed, and so only the OpenGL ES clear color is showing.

# Sprite geometry

To fix this issue we need to invest some time into setting up some more useful geometry for our sprites. We can only safely set the depth value when rendering front-to-back for the pixels which are totally opaque in our sprite, so the sprite atlas generation needs to provide two sets of geometry for each sprite. One set, indicated by the green area in the middle image below, covers only the opaque geometry, and the second, indicated by the green area in the right image below, picks up everything else unless it is totally transparent (in which case it can be dropped completely).

Vertices are relatively expensive, so use as little additional geometry as possible when generating these geometry sets. The opaque region must only contain totally opaque pixels, but the transparent region can safely contain opaque pixels and totally transparent pixels without side-effects, so use rough approximations for a "good fit" without trying to get "best fit". Note for some sprites it isn’t worth generating the opaque region at all (there may be no opaque texels, or the area involved may be small), so some sprites may consist of only a single region rendered as a transparent render. As a rule of thumb, if your opaque region is smaller than 256 pixels it probably isn't worth bothering with the additional geometry complexity, but as always it's worth trying and seeing.

Generating this geometry can be relatively fiddly, but sprite texture atlases are normally static so this can be done offline as part of the application content authoring process, and does not need to be done live on the platform at run time.

# Draw algorithm

With the two geometry sets for each sprite we can now render the optimized version of our test scene. First render all of the opaque sprites regions and the background from front-to-back, rendering with depth testing and depth writes enabled. This results in the output below:

Any area where one sprite or the background is hidden underneath another sprite is rendering work which has been saved, as that is an area which has been be removed by the early depth test before shading has occurred.

Having rendered all of the opaque geometry we can now render the transparent region for each sprite in a back-to-front order. Leave depth testing turned on, so that sprites on a lower layer don't overwrite an opaque region from a sprite in a logically higher layer which has already been rendered, but disable depth buffer writes to save a little bit of power.

If we clear the color output of the opaque stage, but keep its depth values, and then draw the transparent pass, we can visualize that the additional rendering added by this pass. This is show in the figure below:

Any area where one of the outer rings is partial indicates an area where work has been saved, as the missing part has been removed by the depth test using the depth value of an opaque sprite region closer to the camera which we rendered in the first drawing pass.

If we put it all together and render both passes to the same image then we arrive back at the same visual output as the original back-to-front render:

... but achieve that with around 35% fewer fragment threads started, which should translate approximately to a 35% drop in MHz required to render this scene. Success!

The final bit of operational logic needed is to ensure that the depth buffer we have added to the scene is not written back to memory. If your application is rendering directly to the EGL window surface then there is nothing to do here, as depth is implicitly discarded for window surfaces automatically, but if your engine is rendering to an off-screen FBO ensure that you add a call to glInvalidateFramebuffer()  (OpenGL ES 3.0 or newer) or glDiscardFramebufferEXT() (OpenGL ES 2.0) before changing the FBO binding away from the offscreen target. See Mali Performance 2: How to Correctly Handle Framebuffers for more details.

# Summary

In this blog we have looked at how the use of depth testing and depth-aware sprite techniques can be used to accelerate rendering using 3D graphics hardware significantly.

Adding additional geometry to provide the partition between opaque and transparent regions of each sprite does add complexity, so care must be taken to minimize the number of vertices for each sprite otherwise the costs of additional vertex processing and small triangle sizes will out-weigh the benefits. For cases where the additional geometry required is too complicated, or the screen region covered is too small, simply omit the opaque geometry and render the whole sprite as transparent.

It’s worth noting that this technique can also be used when rendering the 2D UI elements in 3D games. Render the opaque parts of the UI with a depth very close to near clip plane before rendering the 3D scene, then render the 3D scene as normal (any parts behind the opaque UI elements will be skipped), and finally the remaining transparent parts of the UI can be rendered and blended on top of the 3D output. To ensure that the 3D geometry does not intersect the UI elements glDepthRange() can be used to limit the range of depth values emitted by the 3D pass very slightly, guaranteeing that the UI elements are always closer to the near clip plane than the 3D rendering.

Tune in next time,

Pete

[1] Other depth test functions are possible in OpenGL ES, but this is the common usage which is analogous to the natural real world behaviour.

Pete Harris is the lead engineer for the Mali GPU performance analysis team at ARM. He enjoys spending his time working on a whiteboard to determining how to get the best out of complex graphics sub-systems, and how to make the ARM Mali GPUs even better.

# Virtual Reality: The Ice Cave

Posted by freddijeffries Nov 19, 2015

This blog was written by Kapileshwar Syamasundar during his summer placement at ARM in the ARM Mali Graphics demo team. Kapil did some great work at ARM porting the Ice Cave demo to VR using Unity, we hope you can benefit from this too.

Ice Cave VR

1. Introduction

Ice Cave, the latest demo from ARM Mali Ecosystem, has been shown with great success this year in such major events as GDC, Unite Europe, and Unite Boston. The demo has been developed in Unity and aims to demonstrate that it is possible to render high visual quality content on current mobile devices. A number of highly optimized special effects were developed in-house, specifically for this demo, some of which are based on completely new techniques, for example the rendering of shadows and refractions based on local cubemaps.

Figure 1 View of the cave from the entrance in the Ice Cave demo.

The Ice Cave demo was released at a time when Virtual Reality has become the centre of attention in the game development community, and related events and media. A number of VR demos and games have already been released but VR performance requirements can limit the complexity of VR content and therefore the visual quality of the final VR experience.

It is in this landscape that the Ecosystem demo team decided to port the Ice Cave demo to Samsung Gear VR and this task was assigned to me. In this blog I describe my experience in porting the Ice Cave demo to VR during my eight weeks summer placement in the Ecosystem demo team.

By the time I joined the demo team, Unity had just released a version with VR native support for Oculus Rift and Samsung Gear VR.  Previously, VR support was only available by means of a plugin based on Oculus Mobile SDK, but this had some obvious limitations:

• Each VR device has a different plugin
• Plugins may conflict with each other
• Release of newer VR SDKs / Runtimes can break older games
• Lower level engine optimizations are not possible with plugin approach of two separate cameras

Conversely, the newly released Unity VR native integration lacked both support and sufficient information for developers, and experienced many unresolved issues. Nonetheless, the team was convinced that with the native integration in Unity we would be able to achieve the best possible performance; a key point in guaranteeing a successful VR user experience.

2. Samsung Gear VR

The Samsung Gear VR headset does not have a built in display but has instead been designed to host a mobile phone. At the time of writing, the Samsung Gear VR comes in two versions; one for Samsung Note 4 and another for the latest Samsung Galaxy S6. Some of the main specifications of the Samsung Galaxy S6 version are listed below.

 ·           Sensors: Accelerator, Gyrometer, Geomagnetic, Proximity Figure 2. The Samsung Gear VR for Samsung Galaxy S6. ·           Motion to Photon Latency < 20ms ·           Manual Focal Adjustment ·           Main Physical UI: Touch Pad ·           Oculus’s Asynchronous TimeWarp technology

Samsung Gear VR is powered by Oculus VR software and incorporates the Oculus Asynchronous Time Warp technology. This important feature helps reduce latency, or the time taken to update the display based on the latest head movement; a key issue to avoid in VR devices. Besides the Time Warp technology, the Samsung Gear VR has several sensors which it uses in place of the ones incorporated in the phone.

The Samsung Gear VR has its own hardware and features a touch pad, back button, volume key and, according to the specifications, an internal fan designed to help demist the device while in use.

The key point here however, is that you can insert your Samsung Galaxy S6 into the headset and enjoy an immersive experience with just a smartphone. We are no longer limited to the screen size of the phone and can instead become completely immersed in a virtual world.

3. Main steps to port an app/game to VR in Unity

VR integration in Unity has been achieved following one of the main Unity principles, that it must be simple and easy. The following basic steps are all that are needed to port a game to VR:

·         Unity 5.1 version with VR native support (or any higher version).

·         Obtain the signature file for your device from the Oculus website and place it in Plugins/Android/assets folder.

·         Set the “Virtual Reality Supported” option in Player Settings.

·         Set a parent to camera. Any camera control must set camera position and orientation to the camera parent.

·         Associate the camera control with the Gear VR headset touch pad.

·         Build your application and deploy it on the device. Launch the application.

·         You will be prompted to insert the device into the headset. If the device is not ready for VR you will be prompted to connect to the network where the device will download Samsung VR software.

·         NB. It is useful to set the phone to developer mode to visualize the application running in stereo without inserting into the Gear VR device. You can enable the developer mode only if you have installed previously a VR application appropriately signed.

 Enabling Gear VR developer mode•       Go to your device Settings - Application Manager - Gear VR Service•       Select "Manage storage"•       Tap on the "VR Service Version" six times•       Wait for scan process to complete and you should now see the Developer Mode toggleDeveloper mode allows you to launch the application without the headset and also dock the headset at any time without having Home launch.

Figure 3. Steps to enable VR Developer mode  on Samsung Galaxy S6.

Figure 4 Side by Side view of stereo viewports captures with VR developer mode enabled.

4. Not as simple as it seems. Considering VR specifics.

After following the instructions above, I saw nothing but a black screen when inserting the device into the headset. It took me some time to get the VR application running in order to establish that some existing features had to be changed and others added.

VR is a completely different user experience and this is therefore one of the key issues when porting to VR. The original demo had an animation mode which moved the camera through different parts of the cave to show the main features and effects. However, in VR this animation caused motion sickness to the majority of users, particularly when moving backwards. We therefore decided to remove this mode completely.

We also decided to remove the original UI. In the original Ice Cave demo a tap on the screen triggers a menu with different options but this was unsuitable for VR.  The original navigation system, based on two virtual joysticks, was also unsuitable for VR so we decided to entirely replace it with a very simple user interaction based on the touch pad:

·         Pressing and holding the touch pad moves the camera in the direction the user looks.

·         When you release the pressure the camera stops moving.

·         A double tap resets the camera to the initial position.

This simple navigation system was deemed to be intuitive and easy by all users trying the VR version of the demo.

Figure 5. User interaction with touch pad  on the Samsung Gear VR.

The camera speed was also a feature we considered carefully as many users experienced motion sickness when the camera moved just a little too fast. After some tests we were able to set a value that most people were comfortable with.

Additionally, the camera has to be set as a child of a game object. This is the only way Unity can automatically integrate the head tracking with the camera orientation. If the camera has no parent this link will fail so any translation and rotation of the camera has to be applied to the camera parent node.

In VR, as in reality, it is important to avoid tight spaces so the user doesn’t feel claustrophobic. The original Ice Cave was built with this in mind and provides ample space for the user.

The only effect not imported to VR was the dirty lens effect. In the original Ice Cave demo this effect is implemented as a quad that is rendered on top of the scene. A dirty texture appears with more or less intensity depending on how much the camera is aligned with the sun. This didn’t translate well to VR and so the decision was made to completely remove it from the VR version.

Figure 6. Dirty lens effect implemented in the original Ice Cave demo.

5. Extra features in the Ice Cave VR version

In the original demo the user can pass through the walls to look at the cave from the outside. However in VR this didn’t create a good experience and the sensation of embedding disappeared when you went out of the cave. Instead, I implemented camera collision detection and smooth sliding for when the user moves very close to the walls.

When running a VR application on Samsung Gear VR, people around the user are naturally curious about what the user is actually seeing. We thought that it would be interesting, particularly for events, to stream the content from the VR headset to another device such as a tablet. We decided to explore the possibility of streaming just the camera position and orientation to a second device running a non-VR version of the same application.

The new Unity network API allowed a rapid prototyping and in a few days I had an implementation which worked pretty well. The device actually running the VR version on the Samsung Gear VR works as a server and in each frame sends the camera position and orientation over wireless TCP to a second device that works as a client.

Figure 7. Streaming camera position and orientation from Samsung Gear VR to a second device.

Using the built-in touch pad to control the camera motion proved very successful. Nevertheless, we decide to provide the user with an alternative method of control using an external Bluetooth mini controller readily available elsewhere. This required us to write a plugin to extend the Unity functionality by intercepting the Android Bluetooth events and using them to trigger movement and resetting of the camera. Unfortunately there is not much information available so whilst it was only possible to intercept the messages coming from two keys , this was enough to move/stop and reset the camera.

6. Conclusions

Ice Cave VR was implemented during my summer placement with ARM’s Ecosystem Demo team in less than eight weeks with no previous experience of Unity. This was possible thanks to the native VR integration Unity released on version 5.1. In principle, just a few steps are necessary to port a game to VR, although in practice you need to do some extra work to fine-tune the specific requirements of VR in your game. With this integration, Unity has greatly contributed to the democratisation of VR.

Unity VR integration is still in progress and some reported issues are expected to be solved in coming versions. Nonetheless, the Ice Cave VR version shows that it is possible to run high quality VR content on mobile devices if resources are balanced properly at runtime by using highly optimized rendering techniques.

All the advanced graphics techniques utilised in the Ice Cave demo are explained in detail in the ARM Guide for Unity Developers. In the guide it is possible to find the source code or code snippets of these techniques which allowed me to understand how they work.

What I consider the most relevant in all this is the fact that with mobile VR we are no longer limited to the size of our smartphones to enjoy a game. Now we can be part of a limitless virtual world and enjoy a wonderful VR experience from a tiny smartphone inserted in a head set. This really is an outstanding step forward!

Now you can check out the video

# The "ARM Game Developer Day" is back by popular demand!

Posted by gemmaparis Nov 12, 2015

Be part of the future of mobile game technology. Bring along your project challenges for expert advice on how to maximise performance of your game for mobile platforms. Learn about the latest ARM CPU and Mali GPU architecture, multicore programming and tile-based mobile GPUs and see how to implement highly optimized rendering effects for mobile.

The day will feature:

• VR showcase area with live demos
• Q&A Clinics and the sharing of best coding practice for mobile platforms
• Talks on mobile VR, Geomerics advanced applications of dynamic global illumination plus a panel discussion on “the future of mobile game technology”
• Open discussions with ARM experts about your own project challenges and suggested improvements
• The opportunity to network with top mobile game developers and engineers
• Free hot lunch and refreshments throughout the day

London - ARM Game Developer Day

Thursday 3rd December 2015

9:00am - 6:00pm

Rich Mix, 35-47 Bethnal Green Rd, London E1 6LA

See full agenda and register here to make sure you don’t miss out!

Chengdu - ARM Game Developer Day

Wednesday 16th December 2015

9:00am - 6:00pm

iTOWN Coffee, Building A, Chengdu Tianfu Software Park, China

# Eats, Shoots and Interleaves

Posted by stacysmith Nov 2, 2015

Vertex interleaving

Recently we were asked via the community whether there was an advantage in either interleaving or not interleaving vertex attributes in buffers. For the uninitiated, vertex interleaving is a way of mixing all the vertex attributes into a single buffer. So if you had 3 attributes (let’s call them Position (vec4), Normal(vec4), and TextureCoord(vec2)) uploaded separately they would look like this:

P1xP1yP1zP1w , P2xP2yP2zP2w , P3xP3yP3zP3w ... and so on

N1xN1yN1zN1w , N2xN2yN2zN2w , N3xN3yN3zN3w ... and so on

T1xT1y , T2xT2y , T3xT3y ... and so on

(In this case the commas denote a single vertex worth of data)

The interleaved buffer would look like this:

P1xP1yP1zP1w N1xN1yN1zN1w T1xT1y , P2xP2yP2zP2w N2xN2yN2zN2w T2xT2y ,

P3xP3yP3zP3w N3xN3yN3zN3w T3xT3y ... and so on

(Note the colours for clarity)

… Such that the individual attributes are mixed, with a given block containing all the information for a single vertex. This technique is what the stride argument in the glVertexAttribPointer function and its variants is for, allowing the application to tell the hardware how many bytes it has to jump forwards to get to the same element in the next vertex.

However, even though we all knew about interleaving, none of us could really say whether it was any better or worse than just putting each attribute in a different buffer, because (to put it bluntly) separate buffers are just easier to implement.

So in a twist to usual proceedings I have conferred with arguably the top expert in efficiency on Mali,  Peter Harris. What follows is my interpretation of the arcane runes he laid out before my quivering neurons:

Interleaving is better for cache efficiency…

… Sometimes.

Why does interleaving work at all?

The general idea behind interleaving is related to cache efficiency. Whenever data is pulled from main memory it is loaded as part of a cache line. This single segment of memory will almost certainly contain more than just the information desired, as one cache line is larger than any single data type in a shader program. Once in the local cache the data in the loaded line is more quickly available for subsequent memory reads. If this cache line only contains one piece of required information, then the next data you need is in a different cache line which will have to be brought in from main memory. If however, the next piece of data needed is in the same cache line, the code can fetch directly from the cache and so performs fewer loads from main memory and therefore executes faster.

Without getting into physical memory sizes and individual components, this can be illustrated thusly:

Imagine we have 3 attributes, each of them vec4s. Individually they look like this:

| P1 P2 | P3 P4 | ...

| N1 N2 | N3 N4 | ...

| T1 T2 | T3 T4 | ...

From this point forward those vertical lines represent the boundaries between cache lines. For the sake of argument, the cache lines in this example are 8 elements long, so contain 2 vec4s; but in the real world our cache lines are 64 bytes, large enough to hold four 32-bit precision vec4 attributes. For the sake of clear illustration I’ll be keeping the data small in these examples, so if we want all the data for vertex number 2 we would load 3 cache lines from the non-interleaved data:

P1 P2

N1 N2

T1 T2

If this data is interleaved like so:

| P1 N1 | T1 P2 | N2 T2 | P3 N3 | T3 P4 | N4 T4 | ...

The cache lines fetched from main memory will contain:

T1 P2

N2 T2

(We start from T1 because of the cache line alignment)

Using interleaving we've performed one less cache line fetch. In terms of wasted bandwidth, the non-interleaved case loaded 3 attributes which went unused, but only one unused attribute was fetched in the interleaved case. Additionally, it's quite possible that the T1 P2 cache line wouldn't need to be specifically fetched while processing vertex 2 at all; if the previously processed vertex was vertex 1, it is likely that the data will still be in the cache when we process vertex 2.

Beware misalignment

Cache efficiency can be reduced if the variables cross cache line boundaries. Notice that in this very simple example I said the texture coordinates were vec4s. Ordinarily textures would be held in vec2 format, as shown in the very first explanation of interleaving. In this case, visualising the individual elements of the buffer, the cache boundaries would divide the data in a very nasty way:

PxPyPzPw NxNyNzNw | TxTy PxPyPzPw NxNy | NzNw TxTy PxPyPzPw | …

Notice that our second vertex's normal is split, with the x,y and z,w in different cache lines. Though two cache lines will still contain all the required data, it should be avoided as there is a tiny additional power overhead in reconstructing the attribute from two cache lines. If possible it is recommended to avoid splitting a single vector over two cache lines (spanning a 64-byte cache boundary), which can usually be achieved by suitable arrangement of attributes in the packed buffer. In some cases adding padding data may help alignment, but padding itself creates some inefficiencies as it introduces redundant data into the cache which isn’t actually useful. If in doubt try it and measure the impact.

But it's not always this simple

If we look at the function of the GPU naïvely, all of the above makes sense, however the GPU is a little cleverer than that. Not all attributes need to be loaded by the vertex processor. The average vertex shader looks something like this:

uniform vec4 lightSource;

uniform mat4 modelTransform;

uniform mat4 cameraTransform;

in vec4 position;

in vec4 normal;

in vec2 textureCoord;

in vec2 lightMapCoord;

out float diffuse;

out vec2 texCo;

out vec2 liCo;

void main( void ){

texCo = textureCoord;

liCo = lightMapCoord;

diffuse = dot((modelTransform*normal),lightSource);

gl_Position=cameraTransform*(modelTransform*position);

}

If you look at our outputs diffuse is calculated at this stage, as is gl_Position, but texCo and liCo are just read from the input and passed straight back out without any computation performed. For a deferred rendering architecture this is really a waste of bandwidth as it doesn’t add any value to the data being touched. In Midgard family GPUs (Mali-T600 or higher) the driver understands this (very common) use case and has a special pathway for it. Rather than load it in the GPU and output it again to be interpolated, the vertex processor never really sees attributes of this type. They can bypass the vertex shader completely and are just passed directly to the fragment shader for interpolation.

Here I've used a second set of texture coordinates to make the cache align nicely for this example. If we fully interleave all of the attributes our cache structure looks like this

PxPyPzPw NxNyNzNw | TxTy LxLy PxPyPzPw | NxNyNzNw TxTy LxLy | ...

Here the vertex processor still needs to load in two attributes P and N, for which the cache line loads will either look like:

PxPyPxPw NxNyNzNw

… or …

TxTy LxLy PxPyPzPw | NxNyNzNw TxTy LxLy

… to obtain the required data, depending on which vertex we are loading. In this latter case the T and L components are never used, and will be loaded again separately to feed into the interpolator during fragment shading. It’s best to avoid the redundant data bandwidth of the T and L loads for the vertex shading and the redundant loads of P and N when fragment shading. To do this we can interleave the data into separate buffers, one which contains all of the attributes needed for computation in the vertex shader:

PxPyPzPw NxNyNzNw | PxPyPzPw NxNyNzNw | PxPyPzPw NxNyNzNw | ...

… and one containing all of the attributes which are just passed directly to interpolation:

TxTy LxLy TxTy LxLy | TxTy LxLy TxTy LxLy | TxTy LxLy TxTy LxLy | ...

This means that the vertex shader will only ever need to touch the red and green cache lines, and the fragment interpolator will only ever have to touch the blue and orange ones (as well as any other interpolated outputs from the vertex shader). This gives us a much more efficient bandwidth profile for the geometry processing. In this particular case it also means perfect cache alignment for our vertex processor.

A note on data locality

Caches function best when programs make use of the data in the same cache lines in a small time window. This maximizes the chance that data we have fetched is still in the cache and avoids a refetch from main memory. Cache lines often contain data from multiple vertices which may come from multiple triangles. It is therefore best practise to make sure that these adjacent vertices in memory are also nearby in the 3D model (both in terms of attribute buffers and index buffers). This is called data locality and you normally need look no further than your draw call's indices (if you are not using indexed models you have far bigger problems than cache efficiency to solve). If the indices look like this:

(1, 2, 3) (2, 3, 4) (3, 4, 5) (4, 5, 2) (1, 3, 5) ...

You have good data locality. On the other hand, if they look like this:

(1, 45, 183) (97, 12, 56) (4, 342, 71) (18, 85, 22) ...

… then they're all over the place and you’ll be making your GPU caches work overtime. Most modelling software will have some kind of plugin to better condition vertex ordering, so talk to your technical artists to get that sorted somewhere in the asset production process.

To maximize the cache efficiency it’s also worth reviewing the efficiency of your vertex shader variable types, both in terms of sizes and number of elements. We see a surprising amount of content which declares vector elements and then leaves many channels unused (but allocated in memory and so using valuable cache space); or which uploads highp fp32 data and then uses it in the shader as a medium fp16 value. Removing unused vector elements and converting to narrower data types (provided the OES_vertex_half_float extension is available) is a simple and effective way to maximize cache efficiency, reduce bandwidth, and improve geometry processing performance.

So there you have it, interleaving vertex attributes. It would be remiss of me to tell you to expect immediate vast performance improvements from this technique. At best this will only cleave back a tiny bit of efficiency but in large complex projects where you need to squeeze as much as possible out of the hardware, these tiny improvements can all add up.

Thanks again to Peter Harris, who provided a lot of the information for this blog and also was kind enough to go through it afterwards and take out all my mistakes.

# Improving ARM® Mali™ drivers on fbdev

Posted by Guillaume Tucker Oct 28, 2015

This blog post refers to the public ARM Mali Midgard r6p0 user-space binary drivers for GNU/Linux which are now available for download.  The "fbdev" variant now has support for dma-buf using a standard ioctl, which is explained in detail here.

## What is wrong with fbdev?

The Linux kernel defines a user-space API that lets applications control displays via frame buffer drivers, also known as "fbdev" drivers.  This is achieved via a set of file operations on "/dev/fb*" devices such as ioctl or mmap, which allows direct access to the pixel data of a display.  While this is a simple, widespread, lightweight and powerful interface, it is no longer suitable in the world of modern embedded graphics computing.

For example, it does not provide any means to share the frame buffer directly with another kernel driver such as a GPU driver.  When running an application that renders graphics with a GPU, each frame contains pixels that the GPU generated within its own buffers.  The display controller is a separate entity with its own frame buffer located elsewhere in memory.  So the standard way of displaying the GPU frames in fbdev mode is for the user-space (EGL part of the driver) to copy the data from a GPU buffer to the display controller's frame buffer.  This is done on the CPU, typically by calling memcpy.  This works, but can easily become the biggest bottleneck in the system and is therefore not acceptable in a real product.

There are many other limitations to "fbdev" which are not covered in this blog post, such as managing layers, synchronising applications with the display controller or performing 2D composition in hardware.

## What can we do to fix it?

All new display controller drivers should be implementing the Direct Rendering Manager (DRM) API instead of fbdev.  DRM offers a much more modern set of features and solves many of the problems seen with fbdev.  However, there are still a lot of GPU-accelerated products that control their display using an fbdev Linux driver.  Work needs to be done on each of these to avoid the unacceptably inefficient CPU memcpy and to let the GPU directly write into the frame buffer.

A typical way of achieving this is by using the dma-buf framework in the Linux kernel.  Typically, the frame buffer driver registers itself as a dma-buf exporter and implements a way for user-space to get a file descriptor for this frame buffer (an ioctl).  The user-space driver then passes the dma-buf file descriptor to the GPU kernel driver to import and use.  When rendering happens, the GPU pixels are directly written into the frame buffer using a hardware DMA operation - this is also referred to as "zero-copy" and is much faster than calling memcpy.  However, there is no standard ioctl in the fbdev API to export the frame buffer with dma-buf so each fbdev driver has a slightly different one.  This means the user-space needs to be modified to work with each fbdev driver, which is not compatible with the public standard Mali binary drivers as they must not depend on platform specific display drivers.

## FBIOGET_DMABUF

This is why, starting with r6p0, we are adding a more generic ioctl implementation to export the dma-buf file descriptor defined as a custom extension to the standard fbdev API in supported kernels: the FBIOGET_DMABUF ioctl.  This way, no extra dependency is being added on the user-space binary.  If the ioctl is not available on a given kernel, then the user-space will revert to memcpy.  We have already enabled this in our ODROID-XU3 Linux kernel branch on Github.

We should keep this added functionality in all our new fbdev Mali Midgard binary drivers and Linux kernels for supported platforms, such as ODROID-XU3, Chromebook devices with an ARM Mali GPU, Firefly...

By date:
By tag: