This blog has been updated and turned into a more formal guide on Arm Developer. You can find the latest guide here:
In part 1 of this series we dealt with how to load and store data with NEON, and part 2 involved how to handle the leftovers resulting from vector processing. Let us move on to doing some useful data processing - multiplying matrices.
In this post, we will look at how to efficiently multiply four-by-four matrices together, an operation frequently used in the world of 3D graphics. We will assume that the matrices are stored in memory in column-major order - this is the format used by OpenGL-ES.
We start by examining the matrix multiply operation in detail, by expanding the calculation, and identifying sub-operations that can be implemented using Neon instructions.
Notice that in the diagram, we multiply each column of the first matrix (in red) by a corresponding single value in the second matrix (blue) then add together the results for each element to give a column of results. This operation is repeated for each of the four columns in the result matrix.
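The column-by-column description above can be stated compactly. With column-major matrices A (red) and B (blue), each column of the result R is a linear combination of the columns of A, weighted by the entries of the corresponding column of B:

```latex
R_{:,j} \;=\; \sum_{k=0}^{3} A_{:,k}\, B_{k,j}, \qquad j = 0, 1, 2, 3
```

Each term of the sum is one vector-by-scalar multiply, and the running sum is the accumulation described below.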
If each column is a vector in a Neon register, we can use the vector-by-scalar multiply instruction to calculate each result column efficiently. The sub-operation highlighted in the diagram can be implemented with this instruction. We must also add together the results for each element of the column, which we do using the accumulating version of the same instruction.
As we are operating on the columns of the first matrix, and producing a column of results, reading and writing elements to and from memory is a linear operation, and requires no interleaving load or store instructions.
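Before moving to assembly, it may help to see the same column-wise decomposition in plain C. This is an illustrative scalar sketch, not the Neon implementation; the function name `matrix_mul_float_ref` is my own. The inner loop accumulates each column of matrix 0 scaled by one element of a column of matrix 1, exactly the sub-operation the Neon code will vectorize:

```c
/* Reference 4x4 column-major multiply: result = m0 * m1.
 * Each result column is the sum of the columns of m0, each scaled
 * by one element of the corresponding column of m1 - the same
 * decomposition the Neon version exploits. */
static void matrix_mul_float_ref(float *result,
                                 const float *m0, const float *m1)
{
    for (int col = 0; col < 4; col++) {
        for (int row = 0; row < 4; row++) {
            float acc = 0.0f;
            for (int k = 0; k < 4; k++) {
                /* column-major: element (row, col) lives at col*4 + row */
                acc += m0[k * 4 + row] * m1[col * 4 + k];
            }
            result[col * 4 + row] = acc;
        }
    }
}
```

In the Neon version, the `row` loop disappears entirely: a whole column of four floats is processed by one vector instruction.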
First, we will look at an implementation that multiplies single precision floating point matrices.
Begin by loading the matrices from memory into Neon registers. The matrices we are multiplying use column-major order, so columns of the matrix are stored linearly in memory. A column can be loaded into Neon registers using VLD1, and written back to memory using VST1.
    vld1.32  {d16-d19}, [r1]!   @ load first eight elements of matrix 0
    vld1.32  {d20-d23}, [r1]!   @ load second eight elements of matrix 0
    vld1.32  {d0-d3}, [r2]!     @ load first eight elements of matrix 1
    vld1.32  {d4-d7}, [r2]!     @ load second eight elements of matrix 1
As Neon has 32 64-bit registers, we can load all of the elements from both input matrices into registers, and still have registers left over for use as accumulators. Here, d16 to d23 hold 16 elements from the first matrix, and d0 to d7 hold 16 elements from the second.
Most Neon instructions can use the register bank in two ways:

- As 32 doubleword (D) registers, d0 to d31, each 64 bits wide.
- As 16 quadword (Q) registers, q0 to q15, each 128 bits wide.

These registers are aliased, so the data in a Q register is the same as that in its two corresponding D registers. For example, q0 is aliased to d0 and d1, and the same data is accessible through either register type. In C terms, this is very similar to a union.
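The union analogy can be made concrete with a short C sketch. This is only an illustration of the aliasing idea (the type name `q_reg` is invented for this example): the same 128 bits are visible either as two 64-bit halves, like d0/d1, or as four 32-bit lanes, like the lanes of q0:

```c
#include <stdint.h>

/* Illustrative model of Q/D register aliasing: one 128-bit value
 * viewed either as two 64-bit halves (the D-register view) or as
 * four 32-bit lanes (the elements of a Q register). */
typedef union {
    uint64_t d[2];  /* D-register view: two 64-bit halves */
    uint32_t s[4];  /* lane view: four 32-bit elements    */
} q_reg;
```

Writing through one member and reading through the other mirrors what happens in hardware: there is only one set of bits, with two names for it.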
For the floating point matrix multiplication example, we will use Q registers frequently, as we are handling columns of four 32-bit floating point numbers, which fit into a single 128-bit Q register.
We can calculate a column of results using just four Neon multiply instructions:
    vmul.f32 q12, q8, d0[0]     @ multiply col element 0 by matrix col 0
    vmla.f32 q12, q9, d0[1]     @ multiply-acc col element 1 by matrix col 1
    vmla.f32 q12, q10, d1[0]    @ multiply-acc col element 2 by matrix col 2
    vmla.f32 q12, q11, d1[1]    @ multiply-acc col element 3 by matrix col 3
Here, the first instruction implements the operation highlighted in the diagram - x0, x1, x2 and x3 (in register q8) are each multiplied by y0 (element 0 in d0), and stored in q12.
Subsequent instructions operate on the other columns of the first matrix, multiplying by corresponding elements of the first column of the second matrix. Results are accumulated into q12 to give the first column of values for the result matrix.
Notice that the scalar used in the multiply instructions refers to D registers; although q0[3] should be the same value as d1[1], and using it would perhaps make more sense here, the GNU assembler I'm using does not accept that format. I have to specify a scalar from a D register. Your assembler may be better.
If we only needed to calculate a matrix-by-vector multiplication (another common operation in 3D graphics), the operation would now be complete, and the result vector could be stored to memory. However, to complete the matrix-by-matrix multiplication, we must execute three more iterations, using the values y4 to yF held in registers q1 to q3.
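For comparison, here is the matrix-by-vector case as a scalar C sketch (the function name `matrix_vec_ref` is my own). It is exactly one iteration of the column calculation above, so the four Neon instructions already shown complete it:

```c
/* Reference column-major 4x4 matrix * 4-vector: out = m * v.
 * This is the single-column case: each element of v scales one
 * column of m, and the scaled columns are summed. */
static void matrix_vec_ref(float *out, const float *m, const float *v)
{
    for (int row = 0; row < 4; row++) {
        float acc = 0.0f;
        for (int k = 0; k < 4; k++)
            acc += m[k * 4 + row] * v[k];
        out[row] = acc;
    }
}
```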
To avoid repeating the same instructions for each column, we can wrap them in an assembler macro:

    .macro mul_col_f32 res_q, col0_d, col1_d
    vmul.f32  \res_q, q8,  \col0_d[0]  @ multiply col element 0 by matrix col 0
    vmla.f32  \res_q, q9,  \col0_d[1]  @ multiply-acc col element 1 by matrix col 1
    vmla.f32  \res_q, q10, \col1_d[0]  @ multiply-acc col element 2 by matrix col 2
    vmla.f32  \res_q, q11, \col1_d[1]  @ multiply-acc col element 3 by matrix col 3
    .endm
The implementation of a four-by-four floating point matrix multiply now looks like this.
    vld1.32  {d16-d19}, [r1]!   @ load first eight elements of matrix 0
    vld1.32  {d20-d23}, [r1]!   @ load second eight elements of matrix 0
    vld1.32  {d0-d3}, [r2]!     @ load first eight elements of matrix 1
    vld1.32  {d4-d7}, [r2]!     @ load second eight elements of matrix 1

    mul_col_f32  q12, d0, d1    @ matrix 0 * matrix 1 col 0
    mul_col_f32  q13, d2, d3    @ matrix 0 * matrix 1 col 1
    mul_col_f32  q14, d4, d5    @ matrix 0 * matrix 1 col 2
    mul_col_f32  q15, d6, d7    @ matrix 0 * matrix 1 col 3

    vst1.32  {d24-d27}, [r0]!   @ store first eight elements of result
    vst1.32  {d28-d31}, [r0]!   @ store second eight elements of result
Using fixed point arithmetic for calculations is often faster than floating point - it requires less memory bandwidth to read and write values that use fewer bits, and multiplication of integer values is generally quicker than the same operations applied to floating point numbers.
However, when using fixed point arithmetic, you must choose the representation carefully to avoid overflow or saturation, whilst preserving the degree of precision in the results that your application requires.
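To make the Q1.14 format concrete, here is a scalar C sketch of a single fixed-point multiply (the function name `q14_mul` is my own). It is the per-element equivalent of the widening multiply and rounding, saturating narrow that the Neon code below performs: in Q1.14, the value 1.0 is stored as 16384 (2^14), the product of two Q1.14 values is a Q2.28 value in 32 bits, and converting back requires a rounding right shift by 14 with saturation to the 16-bit range:

```c
#include <stdint.h>

/* Scalar Q1.14 multiply: widen to 32 bits, rounding right shift
 * by 14, then saturate to the signed 16-bit range. */
static int16_t q14_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q2.28 product       */
    p = (p + (1 << 13)) >> 14;             /* round to nearest    */
    if (p >  32767) p =  32767;            /* saturate to int16   */
    if (p < -32768) p = -32768;
    return (int16_t)p;
}
```

The saturation step matters: without it, 1.5 * 1.5 would wrap around to a large negative value instead of clamping near the top of the representable range.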
Implementing a matrix multiply using fixed point values is very similar to floating point. In this example, we will use Q1.14 fixed-point format, but the operations required are similar for other formats, and may only require a change to the final shift applied to the accumulator. Here is the macro:
    .macro mul_col_s16 res_d, col_d
    vmull.s16  q12, d16, \col_d[0]  @ multiply col element 0 by matrix col 0
    vmlal.s16  q12, d17, \col_d[1]  @ multiply-acc col element 1 by matrix col 1
    vmlal.s16  q12, d18, \col_d[2]  @ multiply-acc col element 2 by matrix col 2
    vmlal.s16  q12, d19, \col_d[3]  @ multiply-acc col element 3 by matrix col 3
    vqrshrn.s32  \res_d, q12, #14   @ shift right and narrow accumulator into
                                    @ Q1.14 fixed point format, with saturation
    .endm
Comparing it to the macro used in the floating point version, you will see that the major differences are:

- The multiplies use VMULL, which widens: two 16-bit inputs produce a 32-bit result, so no precision is lost in the intermediate calculation.
- Accumulation uses VMLAL, the widening multiply-accumulate, for the same reason.
- A final VQRSHRN shifts the 32-bit accumulator right by 14 bits with rounding, narrowing it back to a 16-bit Q1.14 result and saturating if the value is out of range.
The reduction from 32 bits to 16 bits per element also affects memory accesses; the data can be loaded and stored using fewer instructions. The code for a fixed point matrix multiply looks like this:
    vld1.16  {d16-d19}, [r1]    @ load sixteen elements of matrix 0
    vld1.16  {d0-d3}, [r2]      @ load sixteen elements of matrix 1

    mul_col_s16  d4, d0         @ matrix 0 * matrix 1 col 0
    mul_col_s16  d5, d1         @ matrix 0 * matrix 1 col 1
    mul_col_s16  d6, d2         @ matrix 0 * matrix 1 col 2
    mul_col_s16  d7, d3         @ matrix 0 * matrix 1 col 3

    vst1.16  {d4-d7}, [r0]      @ store sixteen elements of result
We will deal with the details of scheduling in a future post, but for now, it is worth seeing the effect of improved instruction scheduling on this code.
In the macro, adjacent multiply instructions write to the same register, so the Neon pipeline must wait for each multiply to complete before it can start the next instruction.
If we take the instructions out of the macro and rearrange them, we can separate the dependent instructions by interleaving others that do not depend on them. These independent instructions can be issued while the earlier multiplies complete in the background.
In this case, we rearrange the code to space out accesses to the accumulator registers.
    vmul.f32 q12, q8, d0[0]     @ rslt col0  = (mat0 col0) * (mat1 col0 elt0)
    vmul.f32 q13, q8, d2[0]     @ rslt col1  = (mat0 col0) * (mat1 col1 elt0)
    vmul.f32 q14, q8, d4[0]     @ rslt col2  = (mat0 col0) * (mat1 col2 elt0)
    vmul.f32 q15, q8, d6[0]     @ rslt col3  = (mat0 col0) * (mat1 col3 elt0)
    vmla.f32 q12, q9, d0[1]     @ rslt col0 += (mat0 col1) * (mat1 col0 elt1)
    vmla.f32 q13, q9, d2[1]     @ rslt col1 += (mat0 col1) * (mat1 col1 elt1)
    ...
    ...
Using this version, matrix multiply performance almost doubles on a Cortex-A8 based system.
You can find details for instruction timings and latencies from the Technical Reference Manual for your Cortex core. With potential performance improvements like those described above, it is well worth spending some time familiarizing yourself with it.
The code for the two functions described above can be found here: matrix_asm_sched.s
    @
    @ NEON matrix multiplication examples
    @
        .syntax unified

    @
    @ matrix_mul_float:
    @ Calculate 4x4 (matrix 0) * (matrix 1) and store to result 4x4 matrix.
    @ matrix 0, matrix 1 and result pointers can be the same,
    @ ie. my_matrix = my_matrix * my_matrix is possible.
    @
    @ r0 = pointer to 4x4 result matrix, single precision floats, column major order
    @ r1 = pointer to 4x4 matrix 0, single precision floats, column major order
    @ r2 = pointer to 4x4 matrix 1, single precision floats, column major order
    @
        .global matrix_mul_float
    matrix_mul_float:
        vld1.32  {d16-d19}, [r1]!   @ load first eight elements of matrix 0
        vld1.32  {d20-d23}, [r1]!   @ load second eight elements of matrix 0
        vld1.32  {d0-d3}, [r2]!     @ load first eight elements of matrix 1
        vld1.32  {d4-d7}, [r2]!     @ load second eight elements of matrix 1

        vmul.f32 q12, q8, d0[0]     @ rslt col0  = (mat0 col0) * (mat1 col0 elt0)
        vmul.f32 q13, q8, d2[0]     @ rslt col1  = (mat0 col0) * (mat1 col1 elt0)
        vmul.f32 q14, q8, d4[0]     @ rslt col2  = (mat0 col0) * (mat1 col2 elt0)
        vmul.f32 q15, q8, d6[0]     @ rslt col3  = (mat0 col0) * (mat1 col3 elt0)

        vmla.f32 q12, q9, d0[1]     @ rslt col0 += (mat0 col1) * (mat1 col0 elt1)
        vmla.f32 q13, q9, d2[1]     @ rslt col1 += (mat0 col1) * (mat1 col1 elt1)
        vmla.f32 q14, q9, d4[1]     @ rslt col2 += (mat0 col1) * (mat1 col2 elt1)
        vmla.f32 q15, q9, d6[1]     @ rslt col3 += (mat0 col1) * (mat1 col3 elt1)

        vmla.f32 q12, q10, d1[0]    @ rslt col0 += (mat0 col2) * (mat1 col0 elt2)
        vmla.f32 q13, q10, d3[0]    @ rslt col1 += (mat0 col2) * (mat1 col1 elt2)
        vmla.f32 q14, q10, d5[0]    @ rslt col2 += (mat0 col2) * (mat1 col2 elt2)
        vmla.f32 q15, q10, d7[0]    @ rslt col3 += (mat0 col2) * (mat1 col3 elt2)

        vmla.f32 q12, q11, d1[1]    @ rslt col0 += (mat0 col3) * (mat1 col0 elt3)
        vmla.f32 q13, q11, d3[1]    @ rslt col1 += (mat0 col3) * (mat1 col1 elt3)
        vmla.f32 q14, q11, d5[1]    @ rslt col2 += (mat0 col3) * (mat1 col2 elt3)
        vmla.f32 q15, q11, d7[1]    @ rslt col3 += (mat0 col3) * (mat1 col3 elt3)

        vst1.32  {d24-d27}, [r0]!   @ store first eight elements of result
        vst1.32  {d28-d31}, [r0]!   @ store second eight elements of result

        mov      pc, lr             @ return to caller

    @
    @ Macro: mul_col_s16
    @
    @ Multiply a four s16 element column of a matrix by the columns of a second
    @ matrix to give a column of results. Elements are assumed to be in Q1.14 format.
    @ Inputs:   col_d - d register containing a column of the matrix
    @ Outputs:  res_d - d register containing the column of results
    @ Corrupts: register q12
    @ Assumes:  the second matrix columns are in registers d16-d19 in column
    @           major order
    @
        .macro mul_col_s16 res_d, col_d
        vmull.s16  q12, d16, \col_d[0]  @ multiply col element 0 by matrix col 0
        vmlal.s16  q12, d17, \col_d[1]  @ multiply-acc col element 1 by matrix col 1
        vmlal.s16  q12, d18, \col_d[2]  @ multiply-acc col element 2 by matrix col 2
        vmlal.s16  q12, d19, \col_d[3]  @ multiply-acc col element 3 by matrix col 3
        vqrshrn.s32  \res_d, q12, #14   @ shift right and narrow accumulator into
                                        @ Q1.14 fixed point format, with saturation
        .endm

    @
    @ matrix_mul_fixed:
    @ Calculate 4x4 (matrix 0) * (matrix 1) and store to result 4x4 matrix.
    @ matrix 0, matrix 1 and result pointers can be the same,
    @ ie. my_matrix = my_matrix * my_matrix is possible
    @
    @ r0 = pointer to 4x4 result matrix, Q1.14 fixed point, column major order
    @ r1 = pointer to 4x4 matrix 0, Q1.14 fixed point, column major order
    @ r2 = pointer to 4x4 matrix 1, Q1.14 fixed point, column major order
    @
        .global matrix_mul_fixed
    matrix_mul_fixed:
        vld1.16  {d16-d19}, [r1]    @ load sixteen elements of matrix 0
        vld1.16  {d0-d3}, [r2]      @ load sixteen elements of matrix 1

        mul_col_s16  d4, d0         @ matrix 0 * matrix 1 col 0
        mul_col_s16  d5, d1         @ matrix 0 * matrix 1 col 1
        mul_col_s16  d6, d2         @ matrix 0 * matrix 1 col 2
        mul_col_s16  d7, d3         @ matrix 0 * matrix 1 col 3

        vst1.16  {d4-d7}, [r0]      @ store sixteen elements of result

        mov      pc, lr             @ return to caller
[CTAToken URL = "https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-4-shifting-left-and-right" target="_blank" text="Read Part 4: Shifting Left and Right" class ="green"]