
Enabling GPU Compute on an ARM Mali-T600 GPU creates a power efficient HEVC decode solution

verma_san
February 1, 2014
8 minute read time.

Device makers and service providers are constantly innovating to offer a more compelling HEVC video playback experience. While there has been tremendous progress in wireless technologies to meet the growing demand for high-bandwidth applications, video decoders have traditionally consumed substantial power and bandwidth to deliver the desired level of user experience. Aricent, a premier engineering services and software company, is leveraging ARM Mali Graphics Processing Units (GPUs) and parallel computing technology to develop codec solutions that consume less power and bandwidth while delivering the high-definition video experience consumers expect.

Computational challenges in High Efficiency Video Coding (HEVC)

Computational demands in video coding rose sharply when the Joint Collaborative Team on Video Coding (JCT-VC) announced the HEVC standard for video compression. The higher compression offered by HEVC opens the door to seamlessly streaming Ultra HD content at 30 frames per second (fps) on channels originally designed for Full HD 30 fps media. HEVC is an important step forward for online media hubs, IPTV companies, broadcasters and other network operators, enabling them to deliver Full HD resolutions and beyond at half the bit-rate of H.264 without impacting visual quality.

The higher compression offered by HEVC comes at a cost: it poses massive computational challenges when running solely on a mobile CPU. GPU Compute solutions, however, take advantage of existing GPU hardware to offload certain data-intensive parallel operations, significantly reducing both CPU load and power consumption.

Aricent exposes the benefits of heterogeneous computing with ARM NEON and ARM Mali-T600 GPU technology

ARM continuously strives to deliver the fastest-performance, lowest-power platforms to meet the ever-increasing computational needs of multimedia solutions. Single Instruction, Multiple Data (SIMD) NEON technology, combined with the load/store architecture of ARMv7-based processors (ARM Cortex-A8, A9, A15, etc.), enables parallel processing at the instruction level, where 128-bit wide vectors can be operated on in a single instruction. This means NEON technology can operate on sixteen 8-bit elements or eight 16-bit elements in parallel in a single arithmetic, logical, or memory load/store operation.
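As a rough illustration of the lane-parallel behavior described above, here is a scalar model (Python, for exposition only; real NEON code would use assembly or intrinsics such as `vqaddq_u8`) of what a single 128-bit saturating-add instruction does across sixteen 8-bit lanes:

```python
def vqadd_u8_model(a, b):
    """Scalar model of one 128-bit NEON saturating add: on hardware all
    sixteen 8-bit lanes are summed by a single instruction; here we loop."""
    assert len(a) == len(b) == 16
    return [min(x + y, 255) for x, y in zip(a, b)]

# Each lane clamps independently at the 8-bit maximum.
print(vqadd_u8_model([250] * 16, [10] * 16))  # every lane saturates to 255
```

The loop stands in for what the hardware does in one cycle per instruction; the point is the data shape (sixteen independent 8-bit lanes in one 128-bit register), not the performance.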

ARM Mali-T600 GPUs feature 128-bit SIMD capabilities and parallel computing technology, both of which are now being leveraged by video algorithm developers at Aricent to develop codec solutions with low power consumption and improved performance targeting Ultra HD resolutions. GPU Compute APIs supported by the ARM Mali GPU, such as OpenCL and RenderScript/FilterScript, facilitate quicker implementation of compute-intensive algorithms and reduce time to market. By offloading selected modules of the HEVC video decoder to the GPU, decoding not only becomes faster; considerable energy is also saved that would otherwise be consumed by heavily loading the CPU to carry out the task on its own.

Optimizing an HEVC decoder for GPU Compute: A design approach

GPUs were historically designed to handle graphics workloads, and often, while the CPU is busy with intensive computations, the GPU sits idle. Most modern Systems on Chip (SoCs) have adopted GPUs that support GPU Compute, extending their ability to use the GPU hardware to accelerate non-graphical workloads and more generic computations such as physics, computer vision, advanced image processing and, of course, multimedia codecs. GPU Compute improves the load balancing of computation across the different processors in the system, as Aricent demonstrated in its HEVC implementation.

The development platform used (InSignal Arndale) features an ARM Mali-T604 GPU. This GPU scales up to four cores, each capable of up to 256 parallel threads, so a total of 1024 parallel threads can be executed, with each thread having its own dedicated register set, including program counter and stack. This is what makes the GPU an ideal platform for software that can be massively parallelized, e.g. color conversion of a video frame.

Fig 1:  GPU offloading strategy of HEVC MP Decoder by Aricent

Designing and implementing optimized software, such as a video decoder for GPU Compute, poses numerous challenges, as the software contains sequential as well as parallel code. Identifying the parts that can execute in parallel while critical resources are shared between the CPU and GPU requires careful thought. The idea is to create a flow in which more and more of the software becomes parallel and hence an ideal candidate for GPU Compute acceleration. Figure 1 shows the major building blocks of the HEVC decoder. Modules that operate on a block of data without any dependency on neighboring blocks of the same frame are offloaded to GPU Compute, while modules with such dependencies are processed by the host CPU, the ARM Cortex-A15. The architectural changes needed to resolve block-level dependencies and create greater parallelism are discussed in detail below.

Case Study: HEVC inter-prediction on GPU Compute - design flow considerations

Most video standards process the frame by dividing it into smaller blocks. Inter-prediction is a technique that exploits the temporal redundancy within video frames and predicts the current block from the previously encoded frames based on the motion vectors and other prediction related parameters. Later, the predicted block is added to the residue block to form the reconstructed block.

In HEVC, the frame is divided into many Coding Tree Units (CTUs), and each CTU can further contain multiple Prediction Units (PUs). Inter-prediction has to be performed for each PU separately, using the PUInfo (motion vectors, modes, etc.) decoded in the entropy decoding stage.

In a conventional CPU-based decode pipeline, decoding is done sequentially on a CTU basis, as shown in Figure 2. CTUs are processed in raster-scan order within a frame (or tile), one after another. Assuming a CTU size of 64x64 and a PU size of 8x8, there can be 64 PUs in a CTU that could be processed in parallel. However, that alone would not fully utilize the GPU Compute capabilities.
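The PU count above follows directly from the block geometry; a quick sketch (Python, with the sizes quoted in the text) makes the arithmetic explicit:

```python
def pus_per_ctu(ctu=64, pu=8):
    """Number of same-size PUs that fit in one square CTU."""
    return (ctu // pu) ** 2

def ctus_per_frame(width, height, ctu=64):
    """CTUs in raster-scan order (dimensions here divide evenly)."""
    return (width // ctu) * (height // ctu)

print(pus_per_ctu())               # 64 PUs per 64x64 CTU at 8x8
print(ctus_per_frame(1920, 1152))  # 540 CTUs in the example frame
```

Per-CTU parallelism is thus capped at 64 threads, far below the 1024 the GPU can run, which motivates the frame-level redesign described next.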

Fig 2:  Conventional CPU based decode pipeline

By contrast, consider the flow in Figure 3. Here, entropy decoding of the entire frame (not just one CTU) is completed before moving on to inter-prediction. Because this is a frame-level design, PUInfo is available for all PUs, so the inter-prediction stage can process all the PUs in parallel, fully exploiting the capabilities of GPU Compute, since inter-prediction has no spatial dependency.
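The frame-level staging can be sketched as a toy pipeline (Python; the stub functions, per-CTU PU count, and the thread pool standing in for the GPU are all illustrative, not the actual decoder):

```python
from concurrent.futures import ThreadPoolExecutor

def entropy_decode(ctu):
    """Stub: sequential bitstream work that yields PUInfo for each PU."""
    return [("mv", ctu, i) for i in range(4)]  # pretend 4 PUs per CTU

def inter_predict(pu_info):
    """Stub: pure per-PU work with no spatial dependency."""
    return ("pred",) + pu_info[1:]

frame = list(range(8))  # eight CTUs, for illustration

# Stage 1 (CPU, sequential): entropy-decode PUInfo for the whole frame.
pu_infos = [pu for ctu in frame for pu in entropy_decode(ctu)]

# Stage 2 (GPU-style): predict every PU of the frame in one parallel batch.
with ThreadPoolExecutor() as pool:
    predictions = list(pool.map(inter_predict, pu_infos))

print(len(predictions))  # 32: all PUs of all CTUs processed together
```

The key structural point is that stage 2 receives the whole frame's PUInfo at once, so its parallel width is the frame's PU count rather than one CTU's.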

Fig 3:  GPU Compute friendly decode pipeline

Memory requirements considerations

The proposed design increases the memory requirements of the decoder, as the PUInfo for the entire frame must be stored. Cache performance may also suffer on systems with smaller L1 and L2 caches. To avoid both problems, one may choose to process half or a quarter of the frame at a time instead of the entire frame, keeping the memory footprint low.
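A back-of-the-envelope model shows how staging a fraction of the frame shrinks the PUInfo buffer (Python; the 16 bytes per PU is a hypothetical figure for motion vectors, mode and reference indices, not a number from the text):

```python
def puinfo_bytes(width, height, pu=8, bytes_per_pu=16, fraction=1.0):
    """Rough PUInfo buffer size when only `fraction` of the frame's PUs
    are staged at once. bytes_per_pu is an assumed per-PU record size."""
    pus = (width // pu) * (height // pu)
    return round(pus * fraction) * bytes_per_pu

full = puinfo_bytes(1920, 1152)                 # whole-frame staging
quarter = puinfo_bytes(1920, 1152, fraction=0.25)
print(full, quarter)  # quarter-frame staging needs 25% of the buffer
```

Whatever the real per-PU record size, the buffer scales linearly with the staged fraction, which is the trade-off the paragraph above describes.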

Buffer allocation considerations

While designing a software solution that uses both the CPU and the GPU, there is often a need to share data between the two, so it is important to understand the ways in which the two processors can exchange data.

Buffers allocated by the CPU from the heap (e.g. with malloc) are not visible to the GPU, so to make a buffer visible to both processors one must avoid using malloc (or the new operator in C++). The recommended approach is to use the clCreateBuffer API with cl_mem_flags set to CL_MEM_ALLOC_HOST_PTR, followed by clEnqueueMapBuffer to map the memory and generate virtual addresses for CPU access. However, the buffer supplied to the GPU-accelerated software may already have been allocated with malloc by a higher layer. In such cases one can use clCreateBuffer with cl_mem_flags set to CL_MEM_USE_HOST_PTR, which wraps the existing CPU allocation so the GPU can access it (the driver may copy or cache the contents internally).

Experiments and results: HEVC Inverse Discrete Cosine Transform (IDCT)

Figure 4 shows the relative performance (time taken to process) of HEVC IDCT modules on an ARM Mali-T604 GPU at a resolution of 1920x1152. For small block sizes the GPU's performance is similar to that of NEON; as the block size increases, however, the GPU outperforms NEON assembly, thanks to increased parallelism and an adequately loaded kernel. In more detail:

A frame with a resolution of 1920x1152 can be divided into 138,240 4x4 blocks, but an ARM Mali-T604 GPU can execute at most 1024 threads in parallel. This means the frame must be processed in multiple sequential passes (135), each executing 1024 threads in parallel. If we increase the block size to 8x8 while keeping the frame resolution constant, the frame can be processed in fewer sequential passes (34), exploiting the parallelism more effectively. For a 16x16 block size the GPU results are better still, as the kernel is adequately loaded and far fewer passes are needed to process the frame.
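The pass counts quoted above can be checked in a few lines (Python; one thread per transform block and 1024 threads per pass, as the text assumes):

```python
import math

def gpu_passes(width, height, block, threads_per_pass=1024):
    """Blocks in the frame, and sequential passes needed when each thread
    transforms one block and a pass runs threads_per_pass threads."""
    blocks = (width // block) * (height // block)
    return blocks, math.ceil(blocks / threads_per_pass)

for b in (4, 8, 16):
    blocks, passes = gpu_passes(1920, 1152, b)
    print(f"{b}x{b}: {blocks} blocks, {passes} passes")
```

The 16x16 case needs only nine passes under these assumptions, which is why the larger block size keeps the kernel well fed and the sequential overhead low.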

Fig 4:  Performance comparison - HEVC IDCT Modules

Aricent's HEVC MP Decoder is a major step forward for consumers, who will be able to enjoy higher-resolution, more power-efficient video streaming on their mobile devices. Overcoming the computational barriers that could have kept the benefits of HEVC from going mainstream is a great example of how GPU Compute on ARM Mali GPUs can bring new use cases to mobile by increasing SoC performance and reducing power consumption.

Comments
  • George Wang, over 11 years ago: Thanks for sharing, this is a really awesome showcase of the power of ARM IPs!
  • verma_san, over 11 years ago: Lori, thanks for reading. You can share it with the right audience.
  • Lori Kate Smith, over 11 years ago: Thanks for sharing a detailed and technical blog. As it relates to the Samsung Exynos-based Arndale board, I wanted to share it with those followers.