Writing Code with NEON Technology
ARM NEON™ technology is widely used for multimedia optimization. Its SIMD architecture makes it well suited to many compute-intensive modules in multimedia codecs, such as filtering and de-blocking. This blog explores effective coding techniques to enhance the performance of an audio/video codec.
The compression algorithms used in multimedia codecs process large chunks of data. The work may be simple element-wise operations on arrays or matrices, or more complex loops iterating over large data sets. In many cases, the code can be rearranged so that loop iterations execute efficiently and in parallel.
In the case of a video codec, most of the processing is done on the row pixels and column pixels of a block. For instance, the in-loop de-blocking filter processes multiple pixels along vertical and horizontal edges. These computations can be accelerated using NEON technology by arranging the data in a suitable way. Consider the case of vertical filtering.
NEON load and transpose instructions are used to load all 16 pixels (8-bit depth) of one column into a Q register. As shown in the figure above, the 8 pixels P3, P2, P1, P0, Q0, Q1, Q2, Q3 of one row of the frame buffer are loaded into one D register, and the same is done for the remaining 15 rows, advancing the address by the frame width for each row. To make use of NEON instructions, the loaded row pixels need to be transposed. Vector transpose treats the elements of its operand vectors as elements of 2 x 2 matrices and transposes those matrices. As a first step, the top 8 rows should be loaded into d0, d2, d4, d6, d8, d10, d12, d14 and the bottom 8 rows into d1, d3, d5, d7, d9, d11, d13, d15.
Top 8 rows
VLD1.8 {d0}, [framebuffer], framewidth
VLD1.8 {d2}, [framebuffer], framewidth
VLD1.8 {d4}, [framebuffer], framewidth
VLD1.8 {d6}, [framebuffer], framewidth
VLD1.8 {d8}, [framebuffer], framewidth
VLD1.8 {d10}, [framebuffer], framewidth
VLD1.8 {d12}, [framebuffer], framewidth
VLD1.8 {d14}, [framebuffer], framewidth
Bottom 8 rows
VLD1.8 {d1}, [framebuffer], framewidth
VLD1.8 {d3}, [framebuffer], framewidth
VLD1.8 {d5}, [framebuffer], framewidth
VLD1.8 {d7}, [framebuffer], framewidth
VLD1.8 {d9}, [framebuffer], framewidth
VLD1.8 {d11}, [framebuffer], framewidth
VLD1.8 {d13}, [framebuffer], framewidth
VLD1.8 {d15}, [framebuffer], framewidth
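To make the register layout produced by these loads easier to see, here is a minimal scalar C model of them. It is only a sketch, not NEON code: the array d[n] stands in for register dn, and the names load_edge_rows and load_demo_ok are hypothetical helpers introduced for illustration. The point is that, because Qk is the register pair (d[2k], d[2k+1]), each Q register ends up holding row k in its low half and row k+8 in its high half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { EDGE_ROWS = 16, EDGE_PIXELS = 8 };

/* Scalar model of the 16 post-incremented VLD1.8 loads.
   d[n] stands for NEON register dn; Qk is the pair (d[2k], d[2k+1]). */
void load_edge_rows(const uint8_t *framebuffer, size_t framewidth,
                    uint8_t d[16][EDGE_PIXELS])
{
    assert(framewidth >= EDGE_PIXELS);
    for (int row = 0; row < EDGE_ROWS; ++row) {
        /* rows 0..7 go to d0,d2,...,d14; rows 8..15 go to d1,d3,...,d15 */
        int reg = (row < 8) ? 2 * row : 2 * (row - 8) + 1;
        memcpy(d[reg], framebuffer, EDGE_PIXELS);
        framebuffer += framewidth;  /* post-increment by one frame row */
    }
}

/* Hypothetical self-check: returns 1 if Q0 = row 0 | row 8
   and Q7 = row 7 | row 15 after the loads. */
int load_demo_ok(void)
{
    enum { WIDTH = 32 };
    uint8_t frame[EDGE_ROWS * WIDTH];
    uint8_t d[16][EDGE_PIXELS];

    for (int row = 0; row < EDGE_ROWS; ++row)
        for (int col = 0; col < WIDTH; ++col)
            frame[row * WIDTH + col] = (uint8_t)(row * 16 + col % 16);

    load_edge_rows(frame, WIDTH, d);

    return memcmp(d[0],  &frame[0 * WIDTH],  EDGE_PIXELS) == 0
        && memcmp(d[1],  &frame[8 * WIDTH],  EDGE_PIXELS) == 0
        && memcmp(d[14], &frame[7 * WIDTH],  EDGE_PIXELS) == 0
        && memcmp(d[15], &frame[15 * WIDTH], EDGE_PIXELS) == 0;
}
```

Splitting the rows between even and odd D registers this way is what lets a single Q-register transpose handle the top and bottom 8 rows at the same time.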
VTRN.32, VTRN.16 and VTRN.8 instructions are then applied to the 8 Q registers (16 D registers) to arrange each column's data into one Q register (a total of 8 Q registers), as follows:
Treat the 32-bit partitions of two Q registers as the elements of 2x2 matrices and transpose the 2 internal 2x2 matrices in each of the following register pairs.
VTRN.32 Q0, Q4
VTRN.32 Q1, Q5
VTRN.32 Q2, Q6
VTRN.32 Q3, Q7
Treat the 16-bit partitions of two Q registers as the elements of 2x2 matrices and transpose the 4 internal 2x2 matrices in each of the following register pairs.
VTRN.16 Q0, Q2
VTRN.16 Q1, Q3
VTRN.16 Q4, Q6
VTRN.16 Q5, Q7
Treat the 8-bit partitions of two Q registers as the elements of 2x2 matrices and transpose the 8 internal 2x2 matrices in each of the following register pairs.
VTRN.8 Q0, Q1
VTRN.8 Q2, Q3
VTRN.8 Q4, Q5
VTRN.8 Q6, Q7
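The three VTRN passes above can be checked with a small scalar C model. This is a sketch under stated assumptions rather than NEON code: q128, vtrn, transpose_16x8 and transpose_demo_ok are hypothetical names, and vtrn models the instruction as "for each pair of elements, swap the odd element of the first register with the even element of the second". Starting from Qk = row k | row k+8, the model confirms that each Q register ends up holding one full 16-pixel column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } q128;  /* stands in for one Q register */

/* Model of VTRN.<8*elem_bytes> Qa, Qb: for every element pair,
   swap the odd element of a with the even element of b. */
static void vtrn(q128 *a, q128 *b, size_t elem_bytes)
{
    assert(elem_bytes == 1 || elem_bytes == 2 || elem_bytes == 4);
    for (size_t i = 0; i < 16; i += 2 * elem_bytes) {
        uint8_t tmp[4];
        memcpy(tmp, &a->b[i + elem_bytes], elem_bytes);
        memcpy(&a->b[i + elem_bytes], &b->b[i], elem_bytes);
        memcpy(&b->b[i], tmp, elem_bytes);
    }
}

/* Same register pairing as the VTRN.32 / VTRN.16 / VTRN.8 listing. */
void transpose_16x8(q128 q[8])
{
    for (int i = 0; i < 4; ++i)          /* VTRN.32 Q0,Q4 ... Q3,Q7 */
        vtrn(&q[i], &q[i + 4], 4);
    vtrn(&q[0], &q[2], 2);               /* VTRN.16 */
    vtrn(&q[1], &q[3], 2);
    vtrn(&q[4], &q[6], 2);
    vtrn(&q[5], &q[7], 2);
    for (int i = 0; i < 8; i += 2)       /* VTRN.8 Q0,Q1 ... Q6,Q7 */
        vtrn(&q[i], &q[i + 1], 1);
}

/* Hypothetical self-check: pixel(row, col) = row * 8 + col.
   Returns 1 if q[col] holds column col for all 16 rows. */
int transpose_demo_ok(void)
{
    q128 q[8];
    for (int i = 0; i < 8; ++i)
        for (int c = 0; c < 8; ++c) {
            q[i].b[c]     = (uint8_t)(i * 8 + c);        /* row i     */
            q[i].b[8 + c] = (uint8_t)((i + 8) * 8 + c);  /* row i + 8 */
        }

    transpose_16x8(q);

    for (int c = 0; c < 8; ++c)
        for (int r = 0; r < 16; ++r)
            if (q[c].b[r] != (uint8_t)(r * 8 + c))
                return 0;
    return 1;
}
```

After the transpose, Q0 to Q7 correspond to the columns P3, P2, P1, P0, Q0, Q1, Q2, Q3 of the edge, so each filter tap is available as a whole vector.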
To smooth a single edge, filter operations are performed on the 8 Q registers. Consider the equation for updating edge pixel P0.
P0 = (P2+2P1+2P0+2Q0+Q1+4)>>3
Each edge pixel requires 4 additions, 3 multiplications and one rounding instruction. In NEON these operations are performed on Q registers, so 16 edge pixels are updated with the same 4 additions, 3 multiplications and one rounding instruction. Implementing this module using NEON instructions gives an 80% reduction in cycles.
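As a rough illustration of the P0 update, the following scalar C sketch applies the equation to 16 lanes, which is the work one set of Q-register operations would cover. The function names filter_p0_pixel and filter_p0_lanes are hypothetical, and the 16-bit intermediate is an assumption made here so that the sum (at most 8 x 255 + 4) cannot overflow; a real NEON version would also need to widen the 8-bit pixels before accumulating.

```c
#include <stdint.h>

enum { LANES = 16 };  /* 16 edge pixels, one per 8-bit lane of a Q register */

/* P0' = (P2 + 2*P1 + 2*P0 + 2*Q0 + Q1 + 4) >> 3 for a single pixel.
   The +4 before the shift rounds the result to nearest. */
static uint8_t filter_p0_pixel(uint8_t p2, uint8_t p1, uint8_t p0,
                               uint8_t q0, uint8_t q1)
{
    uint16_t sum = (uint16_t)(p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1);
    return (uint8_t)((sum + 4) >> 3);
}

/* Scalar stand-in for the vector version: every lane gets the same
   sequence of operations, which is what makes the filter SIMD friendly. */
void filter_p0_lanes(const uint8_t p2[LANES], const uint8_t p1[LANES],
                     const uint8_t p0[LANES], const uint8_t q0[LANES],
                     const uint8_t q1[LANES], uint8_t out[LANES])
{
    for (int i = 0; i < LANES; ++i)
        out[i] = filter_p0_pixel(p2[i], p1[i], p0[i], q0[i], q1[i]);
}
```

The weights sum to 8, so a flat region (all five inputs equal) passes through unchanged, which is a quick sanity check for any implementation of this tap.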
Certain compute-intensive modules in audio codecs, such as stereo processing, FFT and filtering, can also be coded effectively using NEON. Efficient use of NEON features gives an average 80% reduction in cycles for video codecs and about 40% for audio codecs.
Since data elements in video processing are either 8-bit or 16-bit, NEON vector instructions and the large register set are well suited to parallel computation on this data. Multiple data elements can be loaded or stored in fewer cycles. NEON also supports aligned loads, which can reduce cycles further. The interleaved loads and stores supported by NEON are well suited to optimizing the FFT.
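To show why interleaved loads help with the FFT, here is a small scalar C sketch of what a two-way de-interleaving load (such as VLD2) does. It assumes, for illustration only, that the FFT input is stored as 16-bit complex samples in re, im, re, im order; the names deinterleave2 and deinterleave_demo_ok are hypothetical. The load splits the stream so that real parts land in one vector and imaginary parts in another, ready for lane-wise arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 2-way de-interleaving load:
   src holds n complex samples as re0, im0, re1, im1, ... */
void deinterleave2(const int16_t *src, size_t n, int16_t *re, int16_t *im)
{
    for (size_t i = 0; i < n; ++i) {
        re[i] = src[2 * i];      /* even elements: real parts      */
        im[i] = src[2 * i + 1];  /* odd elements:  imaginary parts */
    }
}

/* Hypothetical self-check with 4 complex samples. */
int deinterleave_demo_ok(void)
{
    const int16_t src[8] = { 1, -1, 2, -2, 3, -3, 4, -4 };
    int16_t re[4], im[4];

    deinterleave2(src, 4, re, im);

    for (int i = 0; i < 4; ++i)
        if (re[i] != i + 1 || im[i] != -(i + 1))
            return 0;
    return 1;
}
```

Without the interleaved load, the same separation would need extra shuffle or unzip steps after a plain load.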
For further details on different optimization techniques and how the code can be arranged to utilize the NEON features, please refer to:
Partner Blogger: Vivek Arora is a Program Manager at SoCtronics Technologies Pvt. Ltd., responsible for customer program and product management. He holds a Bachelor of Technology in Computer Engineering and an MBA. He has around 12 years of experience, most of it in developing products such as mobile phones and custom embedded platforms for global companies.
Hi Vivek,
The link that you shared for downloading "Optimization of Multimedia Codecs using ARM NEON" is not working. Even after I enter my details in the registration form, it does not display anything.