In the information era, with its increased use of mobile devices to communicate and access information, web browsers are the central tool for navigating the vast amount of content spread across the world-wide data network known as the Internet, since they both fetch and visualize that content. Over the last decade, the visualization capabilities of web browsers have been greatly enhanced by the increase in processing power of general-purpose CPUs and graphics accelerators. Most mobile platforms include a general-purpose SIMD engine, such as ARM NEON, which can be used to process multimedia formats efficiently and enhance the user experience, with up to a 4x improvement as discussed in this article.
The web browser group at the University of Szeged, Hungary, has been actively working on the WebKit browser engine since 2008 in cooperation with ARM Ltd. and industrial partners. Over the last three years we have successfully completed several performance improvements, including accelerating the JavaScript engine, Scalable Vector Graphics (SVG) pixel manipulations and the CSS engine. A number of these improvements were also able to exploit the Symmetric Multiprocessing (SMP) capabilities of recent ARM CPUs efficiently. Memory footprint and space requirements have also been a key area of focus throughout this work with WebKit.
SVG filters are powerful graphical operations which can be used to enhance the visual appearance of common graphical primitives (text, boxes, circles) with effects such as lighting and shadow casting, to name just a few.
Filters consist of filter primitives. Each primitive performs an atomic operation, and their results can be combined to create amazing graphical effects. Some primitives are simple enough to be handled by the underlying graphical subsystem (image moving is an example), while others require software rendering support. The latter are both the most appealing and the most computationally intensive. In this article we share some of the experience we accumulated while implementing ARM NEON-based SVG filters.
The lighting filter produces a shining effect by using the alpha channel as a height map. It is well suited to the ARM NEON instruction set, since both the increased computation power and the larger register size can be exploited in several ways.
The NEON instruction set allows four floating-point numbers to be multiplied simultaneously, which is especially useful for fast normalized dot product calculation. A (0,0,0)->(x,y,z) vector can be efficiently represented in a NEON register as four single-precision floating-point lanes: the first three contain the x, y and z coordinates, and the fourth contains the length (which is redundant information but is useful for optimization purposes). The normalized dot product of two vectors can be obtained as (x1*x2+y1*y2+z1*z2)/(length1*length2).
All four multiplications in this formula can be done by a single NEON VMUL instruction, which multiplies the two registers lane by lane. This example emphasizes the importance of an efficient data layout, which can further improve the benefit gained from SIMD instruction sets.
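The layout above can be modelled in portable C++ to make the arithmetic concrete. This is a scalar sketch rather than the hand-written assembly used in WebKit; the names `Vec4`, `makeVector`, `multiplyLanes` and `normalizedDot` are hypothetical and chosen only for illustration. The loop in `multiplyLanes` stands in for what one four-lane VMUL would do.

```cpp
#include <cmath>

// Hypothetical 4-lane layout mirroring the register described above:
// lanes 0..2 hold x, y, z and lane 3 caches the vector length.
struct Vec4 {
    float lane[4];
};

// Builds the 4-lane representation, computing the length once up front.
inline Vec4 makeVector(float x, float y, float z)
{
    return Vec4{{x, y, z, std::sqrt(x * x + y * y + z * z)}};
}

// Lane-wise product: the scalar equivalent of one 4 x float32 multiply.
inline Vec4 multiplyLanes(const Vec4& a, const Vec4& b)
{
    Vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

// Normalized dot product (cosine of the angle between a and b).
// Assumption: neither vector has zero length.
inline float normalizedDot(const Vec4& a, const Vec4& b)
{
    const Vec4 p = multiplyLanes(a, b);
    // Lanes 0..2 are the coordinate products, lane 3 is length1 * length2.
    return (p.lane[0] + p.lane[1] + p.lane[2]) / p.lane[3];
}
```

Because the length travels with the vector, the divisor falls out of the same multiplication as the three coordinate products, leaving only two additions and one division.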
NEON registers can also be used to hold temporary data, which reduces the number of memory reads and writes. In the lighting filter, the normal vector is calculated from the alpha values of the 3x3 pixel matrix centered on the current pixel. Since the alpha values are processed from left to right, the center and right columns become the next left and center columns, respectively. This shift can be done by a VEXT NEON instruction, as the three u16 alpha values representing the current row are stored in the upper 6 bytes of a D register, so only the new right column needs to be loaded from memory.
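The sliding window can be sketched in scalar C++ as follows. The `WindowRow` struct and the `slide` and `sobelX` functions are hypothetical names, and the Sobel-style horizontal gradient is only an assumed stand-in for the real normal calculation. The point of the sketch is that each step reads just three new values (one per row) instead of nine.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one row of the 3x3 alpha window.
struct WindowRow {
    std::uint16_t left, center, right;
};

// Scalar stand-in for the register shift: the old left value is dropped,
// center and right move down one slot, and the new value enters on the right.
inline void slide(WindowRow& row, std::uint16_t incoming)
{
    row.left = row.center;
    row.center = row.right;
    row.right = incoming;
}

// Horizontal gradient for the interior pixels of the middle row.
// Assumption: all three rows have the same width.
std::vector<int> sobelX(const std::array<std::vector<std::uint16_t>, 3>& rows)
{
    std::vector<int> result;
    const std::size_t width = rows[0].size();
    if (width < 3)
        return result;

    // Prime the window with columns 0 and 1; the first slide completes it.
    WindowRow w[3];
    for (int r = 0; r < 3; ++r)
        w[r] = WindowRow{0, rows[r][0], rows[r][1]};

    for (std::size_t x = 1; x + 1 < width; ++x) {
        // Only the new right column is read from memory here.
        for (int r = 0; r < 3; ++r)
            slide(w[r], rows[r][x + 1]);
        const int right = w[0].right + 2 * w[1].right + w[2].right;
        const int left = w[0].left + 2 * w[1].left + w[2].left;
        result.push_back(right - left);
    }
    return result;
}
```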
The NEON instruction set has other features which are very useful for lighting filters, such as fast conversion of several values at once between integer and floating-point formats, and efficient clamping of the light strength to the 0-1 range with the VMAX and VMIN instructions.
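As a rough scalar illustration of these two steps, the sketch below clamps a light strength to the 0-1 range with a max/min pair and then converts colour channels from integer to float and back. The `Rgb` struct and the `clampUnit` and `applyLight` functions are hypothetical, and rounding to the nearest integer is an assumption made for this example.

```cpp
#include <algorithm>
#include <cstdint>

// Clamp to [0, 1] with a max followed by a min, mirroring a VMAX/VMIN pair.
inline float clampUnit(float v)
{
    return std::min(std::max(v, 0.0f), 1.0f);
}

// Hypothetical 8-bit colour triple.
struct Rgb {
    std::uint8_t r, g, b;
};

// Integer -> float, scale by the clamped strength, float -> integer.
inline std::uint8_t scaleChannel(std::uint8_t channel, float strength)
{
    return static_cast<std::uint8_t>(static_cast<float>(channel) * strength + 0.5f);
}

// Scales the light colour by a strength that may lie outside [0, 1].
inline Rgb applyLight(Rgb lightColor, float strength)
{
    const float s = clampUnit(strength);
    return Rgb{scaleChannel(lightColor.r, s),
               scaleChannel(lightColor.g, s),
               scaleChannel(lightColor.b, s)};
}
```

Clamping before the conversion keeps the scaled value within 0-255, so the narrowing cast cannot overflow.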
Using the optimizations above, the hand-written NEON-optimized assembly lighting filter runs 4 times faster on an ARM Cortex-A9 CPU than its C++ counterpart. The implementation supports both diffuse and specular lighting filters with ambient, point and spot light sources.
The Gaussian blur filter has somewhat less potential to use the extra processing power provided by NEON instructions since the blurring effect is quite simple. The value of the current pixel is simply replaced by the average of its neighbouring pixels.
The average calculation must be applied to each row first, then to each column. This sequence is repeated three times (a total of six passes).
The average must be calculated for all four (red, green, blue, and alpha) channels. All operations on the four channels, including memory transfers and arithmetic operations, can be parallelized using appropriate NEON instructions (e.g., VADD, VLDR, VMUL). The NEON-based algorithm is 4 times faster than the original algorithm, which processes each channel one by one.
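The pass structure described above can be sketched in portable C++ as follows. This is a deliberately naive model, not the WebKit implementation: the `Image` struct and function names are hypothetical, the `radius` parameter is an assumption, and pixels outside the image are simply skipped when averaging, which is an edge policy chosen here for brevity. The inner loops over `c` are the places where a SIMD version would handle all four channels with one instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical RGBA image: width * height pixels, 4 bytes per pixel.
struct Image {
    std::size_t width, height;
    std::vector<std::uint8_t> rgba;
};

// Averages `count` pixels along one line. `stride` is the distance in bytes
// between consecutive pixels (4 for a row, width * 4 for a column).
void boxBlurLine(const std::uint8_t* src, std::uint8_t* dst,
                 std::size_t count, std::size_t stride, int radius)
{
    for (std::size_t i = 0; i < count; ++i) {
        unsigned sum[4] = {0, 0, 0, 0};
        unsigned n = 0;
        for (int d = -radius; d <= radius; ++d) {
            const std::ptrdiff_t j = static_cast<std::ptrdiff_t>(i) + d;
            if (j < 0 || j >= static_cast<std::ptrdiff_t>(count))
                continue; // Assumed edge policy: ignore out-of-range pixels.
            const std::uint8_t* p = src + static_cast<std::size_t>(j) * stride;
            for (int c = 0; c < 4; ++c) // All four channels at once in SIMD.
                sum[c] += p[c];
            ++n;
        }
        for (int c = 0; c < 4; ++c)
            dst[i * stride + c] = static_cast<std::uint8_t>(sum[c] / n);
    }
}

// Rows first, then columns, repeated three times (six passes in total).
void blurThreeTimes(Image& img, int radius)
{
    std::vector<std::uint8_t> tmp(img.rgba.size());
    const std::size_t rowBytes = img.width * 4;
    for (int pass = 0; pass < 3; ++pass) {
        for (std::size_t y = 0; y < img.height; ++y)
            boxBlurLine(img.rgba.data() + y * rowBytes, tmp.data() + y * rowBytes,
                        img.width, 4, radius);
        for (std::size_t x = 0; x < img.width; ++x)
            boxBlurLine(tmp.data() + x * 4, img.rgba.data() + x * 4,
                        img.height, rowBytes, radius);
    }
}
```

A production version would keep a running sum per line instead of re-adding the whole neighbourhood for every pixel; the naive form is used here only to keep the six-pass structure visible.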
In our experience, using the ARM NEON instruction set can considerably speed up computation-intensive algorithms in which the same operation is executed on multiple data items of the same type.
NEON registers can also be used to store temporary data in order to reduce the number of memory transfer operations.
All this work is open source and is available as part of the official WebKit trunk.
Guest Blogger: Zoltan Herczeg is a Senior Developer at the Software Engineering Department of the University of Szeged, Hungary. He is an accepted contributor to several open source projects, including the WebKit browser engine (reviewer status) and the Perl Compatible Regular Expressions (PCRE) library (committer status), and is the maintainer of XEEMU, a cycle-accurate ARM instruction simulator. He holds an MSc in Computer Science.