Optimizing DirectFB with ARM NEON

September 11, 2013

DirectFB (Direct Frame Buffer) is a graphics library that is widely used in embedded systems, especially in the home market. More and more applications and libraries, such as Cairo, GDK, Qt, V8, X11 and WebKit, choose DirectFB as a backend. ARM NEON technology is well suited to 2D acceleration. In this blog, I'll describe how to optimize DirectFB using NEON.

1. Introduction

1.1 DirectFB Introduction

DirectFB (Direct Frame Buffer) is a thin library that provides hardware graphics acceleration, input device handling and abstraction, and an integrated windowing system with support for translucent windows and multiple display layers. It is free software licensed under the terms of the GNU Lesser General Public License (LGPL). The graphics features provided by DirectFB include Rectangle Filling/Drawing; Triangle Filling/Drawing; Line Drawing; Blit; Alpha Blending (texture alpha, alpha modulation); Porter/Duff; Colorizing; Source Color Keying; Destination Color Keying and so on.

Figure-1 DirectFB Architecture

1.2 NEON Introduction

NEON technology is a 128-bit SIMD (Single Instruction, Multiple Data) architecture extension for the ARM Cortex™-A series processors, designed to provide flexible and powerful acceleration for consumer multimedia applications, delivering a significantly enhanced user experience. It has 32 registers, each 64 bits wide, which can also be viewed as 16 registers, each 128 bits wide.
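The dual view of the register file can be pictured with a small portable C model. This is only an illustrative sketch, not NEON code: the `QRegModel` type and the `qreg_split` helper are hypothetical names introduced here. It models one 128-bit Q register whose low and high halves are the two 64-bit D registers that alias it (for example, q1 aliases d2 and d3).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit Q register (not a real NEON type).
 * Register Qn aliases D(2n) as its low half and D(2n+1) as its high half. */
typedef union {
    uint16_t u16[8];   /* eight 16-bit lanes, e.g. eight RGB565 pixels */
    uint64_t d[2];     /* d[0] models D(2n), d[1] models D(2n+1)       */
} QRegModel;

static_assert(sizeof(QRegModel) == 16, "a Q register is 128 bits wide");

/* Copy the four lanes held by each D half into separate arrays. */
static void qreg_split(const QRegModel *q, uint16_t low[4], uint16_t high[4])
{
    memcpy(low,  &q->d[0], 4 * sizeof(uint16_t));  /* lanes 0..3 */
    memcpy(high, &q->d[1], 4 * sizeof(uint16_t));  /* lanes 4..7 */
}
```

With this picture in mind, an instruction that names d2 operates on the first four 16-bit lanes of q1, and one that names d3 operates on the last four.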

NEON technology is widely used in multimedia and signal processing algorithms such as video encode/decode, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound synthesis.

There is a good article, Coding for NEON, posted by Martyn, and you can find additional information at the ARM Infocenter as well.

2. Optimizing DirectFB with NEON

Because DirectFB is a complete hardware abstraction layer with software fallbacks for every graphics operation that is not supported by the underlying hardware, it will run on all platforms, but performance can suffer on platforms without specialized 2D hardware. This is where NEON comes in: for many 2D operations (such as blit, blending and color format conversion), NEON can provide a measure of hardware acceleration.

2.1 Profiling

First, we profile DirectFB to identify hotspots for each operation. The df_dok benchmark is a good example of critical-path code that we want to target in our profiling. It covers most 2D operations, such as filling rectangles/triangles/spans (with blending), drawing rectangles/triangles/spans (with blending), blitting (with blending) and so on. For our work I used the DS-5 Streamline performance analyzer as the profiling tool. The profiling environment is easy to set up, and the tool provides powerful functions for ARM Linux and Android platforms. For more information on how to do profiling with DS-5 Streamline, see ARM DS-5 Using ARM Streamline.

Take the case of the fill-rectangle-blend(rgb16) operation, for example. It's easy to pinpoint the top functions by time spent using DS-5 Streamline. As you can see in the screenshot below, these are Sop_rgb16_to_Dacc (43.82%), Xacc_blend_invsrcalpha (16.29%), Sacc_to_Aop_rgb16 (11.09%), SCacc_add_to_Dacc_C (9.81%).

Figure-2 Snapshot of profiling DirectFB

2.2 Optimization

The next step is to select key functions for optimization according to the profiling results. The functions we targeted are listed below, in the source of the gInit_NEON function. gInit_NEON installs the routines re-implemented for NEON and is called in the DirectFB initialization stage.

static void gInit_NEON( void )
{
     use_neon = 1;

     Sop_PFI_to_Dacc[DFB_PIXELFORMAT_INDEX(DSPF_RGB16)] = Sop_rgb16_to_Dacc_NEON;
     Sop_PFI_to_Dacc[DFB_PIXELFORMAT_INDEX(DSPF_ARGB )] = Sop_argb_to_Dacc_NEON;
     Sacc_to_Aop_PFI[DFB_PIXELFORMAT_INDEX(DSPF_RGB16)] = Sacc_to_Aop_rgb16_NEON;
     Cop_to_Aop_PFI[DFB_PIXELFORMAT_INDEX(DSPF_RGB16)] = Cop_to_Aop_16_NEON;

     SCacc_add_to_Dacc = SCacc_add_to_Dacc_NEON;
     Sacc_add_to_Dacc = Sacc_add_to_Dacc_NEON;

     Xacc_blend[DSBF_INVSRCALPHA-1] = Xacc_blend_invsrcalpha_NEON;
     Xacc_blend[DSBF_SRCALPHA-1] = Xacc_blend_srcalpha_NEON;

     Dacc_modulation[DSBLIT_BLEND_ALPHACHANNEL |
                     DSBLIT_BLEND_COLORALPHA |
                     DSBLIT_COLORIZE] = Dacc_modulate_argb_NEON;
     Dacc_modulation[DSBLIT_COLORIZE] = Dacc_modulate_rgb_NEON;
     Dacc_modulation[DSBLIT_COLORIZE |
                     DSBLIT_BLEND_ALPHACHANNEL] = Dacc_modulate_rgb_NEON;

     Bop_argb_blend_alphachannel_src_invsrc_Aop_PFI[DFB_PIXELFORMAT_INDEX(DSPF_RGB16)]
          = Bop_argb_blend_alphachannel_src_invsrc_Aop_rgb16_NEON;
}
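The pattern used by gInit_NEON, a table of function pointers whose entries are overwritten at initialization, can be sketched in portable C as follows. This is a simplified illustration under stated assumptions rather than DirectFB's code: the `FillFunc` signature, the `FMT_*` indices, `fill_table`, `init_fast_paths` and both fill routines are hypothetical, and the "fast" routine is an unrolled C loop standing in for a NEON implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature; DirectFB's real routines use a different one. */
typedef void (*FillFunc)(uint16_t *dst, uint16_t value, size_t n);

/* Stand-ins for the indices produced by DFB_PIXELFORMAT_INDEX(). */
enum { FMT_RGB16, FMT_ARGB, FMT_COUNT };

/* Generic fallback: always available. */
static void fill16_c(uint16_t *dst, uint16_t value, size_t n)
{
    assert(dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* Stand-in for an optimized routine: four pixels per iteration plus a tail. */
static void fill16_unrolled(uint16_t *dst, uint16_t value, size_t n)
{
    assert(dst != NULL);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = value;
        dst[i + 1] = value;
        dst[i + 2] = value;
        dst[i + 3] = value;
    }
    for (; i < n; i++)
        dst[i] = value;
}

/* Dispatch table: starts out pointing at the generic code. */
static FillFunc fill_table[FMT_COUNT] = { fill16_c, NULL };
static int use_fast = 0;

/* Called once at start-up when the fast path is usable. */
static void init_fast_paths(void)
{
    use_fast = 1;
    fill_table[FMT_RGB16] = fill16_unrolled;
}
```

Callers always go through `fill_table[format]`, so swapping an entry changes the implementation without touching any call site, and formats that have no optimized routine keep whatever the table already held.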

Most of the functions we targeted relate to color format conversion, blitting and blending computation. Take the Sop_rgb16_to_Dacc optimization, listed below, as an example. It is used to convert RGB565 to ARGB8888.

    "vld1.16    {q0}, [%[S]]!              \n\t"  /* Load 8 pixels from Source to q0 */
    "vmov.i16   q4, #0x00FF                \n\t"  /* A: q4 */
    "vshr.u16   q3, q0, #8                 \n\t"
    "vsri.u8    q3, q3, #5                 \n\t"  /* R: q3 */
    "vshl.u16   q2, q0, #5                 \n\t"
    "vshr.u16   q2, q2, #8                 \n\t"
    "vsri.u8    q2, q2, #6                 \n\t"  /* G: q2 */
    "vshl.u16   q1, q0, #11                \n\t"
    "vshr.u16   q1, q1, #8                 \n\t"
    "vsri.u8    q1, q1, #5                 \n\t"  /* B: q1 */
    "vst4.16    {d2, d4, d6, d8}, [%[D]]!  \n\t"
    "vst4.16    {d3, d5, d7, d9}, [%[D]]!  \n\t"  /* Store 8 pixels to Dst */

The effects of each instruction are described in the comments above, but notice that:

1. Data is stored in the GenefxAccumulator structure defined in DirectFB. There are 16 bits for each channel.

typedef union {
     struct {
          u16 b;
          u16 g;
          u16 r;
          u16 a;
     } RGB;
     ...
} GenefxAccumulator;

2. The Alpha channel is just filled with 0x00FF.

3. For R, G and B channels, the color data is shifted into the most significant bits, then shifted right with insert to position each color channel in the result register.
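The three points above can be checked against a scalar C reference that processes one pixel at a time. This is a sketch written to mirror the NEON listing, not DirectFB's generic C fallback: the `Accum16` type and the `rgb16_to_accum_ref` function are hypothetical names, with `Accum16` copying only the b, g, r, a layout of the RGB view shown above. The shift-right-and-insert step corresponds to OR-ing the top bits of each channel back into its low bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the RGB view of GenefxAccumulator:
 * 16 bits per channel, in b, g, r, a order. */
typedef struct {
    uint16_t b, g, r, a;
} Accum16;

/* Scalar reference for the RGB565 -> accumulator conversion. */
static void rgb16_to_accum_ref(const uint16_t *src, Accum16 *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        uint16_t p  = src[i];
        uint16_t r5 = (p >> 11) & 0x1F;   /* 5 bits of red   */
        uint16_t g6 = (p >> 5)  & 0x3F;   /* 6 bits of green */
        uint16_t b5 = p & 0x1F;           /* 5 bits of blue  */

        /* Place each channel in the top of a byte, then copy its own
         * top bits into the vacated low bits (what vsri.u8 achieves). */
        dst[i].r = (uint16_t)((r5 << 3) | (r5 >> 2));
        dst[i].g = (uint16_t)((g6 << 2) | (g6 >> 4));
        dst[i].b = (uint16_t)((b5 << 3) | (b5 >> 2));
        dst[i].a = 0x00FF;                /* alpha is a constant */
    }
}
```

Replicating the top bits means a full-scale 5-bit or 6-bit channel maps to 0xFF rather than 0xF8 or 0xFC, so pure white in RGB565 stays pure white after conversion. A reference like this is also convenient for comparing against the output of the NEON routine on test data.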

2.3 Debug

During the optimization process, a debugger that supports the NEON registers was essential. Again, we used the DS-5 Debugger. Its UI is friendly and it makes debugging NEON code easy. For more information on how to use DS-5 Debugger, refer to ARM DS-5 Using the Debugger.

Figure-3 Snapshot of debugging DirectFB

3. Benchmarking the Results

3.1 Environment

Platform: AML8726 (Cortex-A9, Single core)

DirectFB: 1.5.0

Benchmark case: df_dok
Resolution: 1280 x 768
Rectangle size: 256 x 256
Pixel format: rgb16 and argb

3.2 Result

Figure-4 shows the improvement for the rgb16 format. For fill-blend operations (such as filling rectangles/triangles/spans with blending), the fill rate increased 100% on average. There was a 127% improvement for "Blit with format conversion" and a 140% improvement for "Blit from 32bit (blend)". Additionally, performance of "Blit from 32bit (blend) with colorizing" and "Blit from 8bit palette (blend)" increased more than 75%, and "Blit with mask" improved 45%.

Figure-4 improvement of rgb16 format

Figure-5 shows the improvement for the argb format. Performance of fill-blending operations also increased 100%. There was a 127% improvement for "Blit" and a 140% improvement for "Blit from 32bit (blend)". "Blit from 8bit palette (blend)" and "Blit from 32bit (blend) with colorizing" both increased more than 75%. Performance of "Blit with colorizing" and "Blit with mask" was up more than 58%.

Figure-5 improvement of argb format

4. Conclusion

DirectFB was an ideal package to target for NEON optimizations, making a material difference in performance on our development target. These optimizations will benefit all ARMv7 and Cortex-A processors that support the NEON instructions. The source code for the work described in this blog is available from the Linaro DirectFB Optimization project. I encourage you to leverage this work to accelerate your next DirectFB project, or simply to use it as a reference on how to take advantage of NEON SIMD instructions.

Abel Zhang over 12 years ago

Could ARMCC in DS-5 pro be used to compile the native code of Android NDK? If so, how? We tried to use ARMCC to compile NEON intrinsics in the native code of Android NDK, but found no guidelines or articles. Any suggestions? Thanks a lot...