We have seen how to load and store data with NEON, and how to handle the leftovers resulting from vector processing. Let us move on to doing some useful data processing - multiplying matrices.

## Matrices

In this post, we will look at how to efficiently multiply four-by-four matrices together, an operation frequently used in the world of 3D graphics. We will assume that the matrices are stored in memory in column-major order - this is the format used by OpenGL-ES.

## Algorithm

We start by examining the matrix mutiply operation in detail, by expanding the calculation, and identifying sub-operations that can be implemented using NEON instructions.

Notice that in the diagram, we multiply each column of the first matrix (in red) by a corresponding single value in the second matrix (blue) then add together the results for each element to give a column of results. This operation is repeated for each of the four columns in the result matrix.

If each column is a vector in a NEON register, we can use the **vector-by-scalar multiplication** instruction to calculate efficiently each result column. The sub-operation highlighted in the diagram can be implemented using this instruction. We must also add the results together for each element of the column, which we do using the accumulating version of the same instruction.

As we are operating on the columns of the first matrix, and producing a column of results, reading and writing elements to and from memory is a linear operation, and requires no interleaving load or store instructions.

## Code

### Floating Point

First, we will look at an implementation that multiplies single precision floating point matrices.

Begin by loading the matrices from memory into NEON registers. The matrices we are multiplying use column-major order, so columns of the matrix are stored linearly in memory. A column can be loaded into NEON registers using `VLD1`

, and written back to memory using `VST1`

.

vld1.32 {d16-d19}, [r1]! @ load first eight elements of matrix 0

vld1.32 {d20-d23}, [r1]! @ load second eight elements of matrix 0

vld1.32 {d0-d3}, [r2]! @ load first eight elements of matrix 1

vld1.32 {d4-d7}, [r2]! @ load second eight elements of matrix 1

As NEON has 32 64-bit registers, we can load all of the elements from both input matrices into registers, and still have registers left over for use as accumulators. Here, d16 to d23 hold 16 elements from the first matrix, and d0 to d7 hold 16 elements from the second.

#### An Aside: D and Q registers

Most NEON instructions can use the register bank in two ways:

- As 32
**Double-word registers**, 64-bits in size, named d0 to d31. - As 16
**Quad-word registers**, 128-bits in size, named q0 to q15.

These registers are aliased so that **the data in a Q register is the same as that in its two corresponding D registers**. For example, q0 is aliased to d0 and d1, and the same data is accessible through either register type. In C terms, this is very similar to a union.

For the floating point matrix multiplication example, we will use Q registers frequently, as we are handling columns of four 32-bit floating point numbers, which fit into a single 128-bit Q register.

#### Back to the Code

We can calculate a column of results using just four NEON multiply instructions:

vmul.f32 q12, q8, d0[0] @ multiply col element 0 by matrix col 0

vmla.f32 q12, q9, d0[1] @ multiply-acc col element 1 by matrix col 1

vmla.f32 q12, q10, d1[0] @ multiply-acc col element 2 by matrix col 2

vmla.f32 q12, q11, d1[1] @ multiply-acc col element 3 by matrix col 3

Here, the first instruction implements the operation highlighted in the diagram - x0, x1, x2 and x3 (in register q8) are each multiplied by y0 (element 0 in d0), and stored in q12.

Subsequent instructions operate on the other columns of the first matrix, multiplyingby corresponding elements of the first column of the second matrix. Results are accumulated into q12 to give the first column of values for the result matrix.

Notice that the scalar used in the multiply instructions refers to D registers; although q0[3] should be the same value as d1[1], and using it would perhaps make more sense here, the GNU assembler I'm using does not accept that format. I have to specify a scalar from a D register. Your assembler may be better.

If we only needed to calculate a matrix-by-vector multiplication (another common operation in 3D graphics,) the operation would now be complete, and the result vector can be stored to memory. However, to complete the matrix-by-matrix multiplication, we must execute three more iterations, using values y4 to yF in registers q1 to q3.

.macro mul_col_f32 res_q, col0_d, col1_d

vmul.f32 \res_q, q8, \col0_d[0] @ multiply col element 0 by matrix col 0

vmla.f32 \res_q, q9, \col0_d[1] @ multiply-acc col element 1 by matrix col 1

vmla.f32 \res_q, q10, \col1_d[0] @ multiply-acc col element 2 by matrix col 2

vmla.f32 \res_q, q11, \col1_d[1] @ multiply-acc col element 3 by matrix col 3

.endm

The implementation of a four-by-four floating point matrix multiply now looks like this.

vld1.32 {d16-d19}, [r1]! @ load first eight elements of matrix 0

vld1.32 {d20-d23}, [r1]! @ load second eight elements of matrix 0

vld1.32 {d0-d3}, [r2]! @ load first eight elements of matrix 1

vld1.32 {d4-d7}, [r2]! @ load second eight elements of matrix 1

mul_col_f32 q12, d0, d1 @ matrix 0 * matrix 1 col 0

mul_col_f32 q13, d2, d3 @ matrix 0 * matrix 1 col 1

mul_col_f32 q14, d4, d5 @ matrix 0 * matrix 1 col 2

mul_col_f32 q15, d6, d7 @ matrix 0 * matrix 1 col 3

vst1.32 {d24-d27}, [r0]! @ store first eight elements of result

vst1.32 {d28-d31}, [r0]! @ store second eight elements of result

### Fixed Point

Using fixed point arithmetic for calculations is often faster than floating point - it requires less memory bandwidth to read and write values that use fewer bits, and multiplication of integer values is generally quicker than the same operations applied to floating point numbers.

However, when using fixed point arithmetic, you must choose the representation carefully to avoid overflow or saturation, whilst preserving the degree of precision in the results that your application requires.

Implementing a matrix multiply using fixed point values is very similar to floating point. In this example, we will use Q1.14 fixed-point format, but the operations required are similar for other formats, and may only require a change to the final shift applied to the accumulator. Here is the macro:

.macro mul_col_s16 res_d, col_d

vmull.s16 q12, d16, \col_d[0] @ multiply col element 0 by matrix col 0

vmlal.s16 q12, d17, \col_d[1] @ multiply-acc col element 1 by matrix col 1

vmlal.s16 q12, d18, \col_d[2] @ multiply-acc col element 2 by matrix col 2

vmlal.s16 q12, d19, \col_d[3] @ multiply-acc col element 3 by matrix col 3

vqrshrn.s32 \res_d, q12, #14 @ shift right and narrow accumulator into

@ Q1.14 fixed point format, with saturation

.endm

Comparing it to the macro used in the floating point version, you will see that the major differences are:

- Values are now 16-bit rather than 32-bit, so we can use D registers to hold four inputs.
- The result of multiplying two 16-bit numbers is a 32-bit number. We use
`VMUL`

and**L**`VMLA`

, because they store their results in Q registers, preserving all of the bits of the result using**L**

double-size elements. - The final result must be 16-bits, but the accumulators are 32-bit. We obtain a 16-bit result using
`VQRSHRN`

, a vector, saturating, rounding, narrowing shift right. This adds the correct rounding value to each element, shifts it right, and saturates the result to the new, narrower element size.

The reduction from 32-bits to 16-bits per element also has an effect on the memory accesses; the data can be loaded and stored using fewer instructions. The code for a fixed point matrix multiply looks like this:

vld1.16 {d16-d19}, [r1] @ load sixteen elements of matrix 0

vld1.16 {d0-d3}, [r2] @ load sixteen elements of matrix 1

mul_col_s16 d4, d0 @ matrix 0 * matrix 1 col 0

mul_col_s16 d5, d1 @ matrix 0 * matrix 1 col 1

mul_col_s16 d6, d2 @ matrix 0 * matrix 1 col 2

mul_col_s16 d7, d3 @ matrix 0 * matrix 1 col 3

vst1.16 {d4-d7}, [r0] @ store sixteen elements of result

## Scheduling

We will deal with the details of scheduling in a future post, but for now, it is worth seeing the effect of improved instruction scheduling on this code.

In the macro, adjacent multiply instructions write to the same register, so the NEON pipeline must wait for each multiply to complete before it can start the next instruction.

If we take the instructions out of the macro and rearrange them, we can separate those that are dependent using other instructions that are not dependent. These instructions can be issued whilst the others complete in the background.

In this case, we rearrange the code to space out accesses to the accumulator registers.

vmul.f32 q12, q8, d0[0] @ rslt col0 = (mat0 col0) * (mat1 col0 elt0)

vmul.f32 q13, q8, d2[0] @ rslt col1 = (mat0 col0) * (mat1 col1 elt0)

vmul.f32 q14, q8, d4[0] @ rslt col2 = (mat0 col0) * (mat1 col2 elt0)

vmul.f32 q15, q8, d6[0] @ rslt col3 = (mat0 col0) * (mat1 col3 elt0)

vmla.f32 q12, q9, d0[1] @ rslt col0 += (mat0 col1) * (mat1 col0 elt1)

vmla.f32 q13, q9, d2[1] @ rslt col1 += (mat0 col1) * (mat1 col1 elt1)

...

...

Using this version, matrix multiply performance almost doubles on a Cortex-A8 based system.

You can find details for instruction timings and latencies from the Technical Reference Manual for your Cortex core. With potential performance improvements like those described above, it is well worth spending some time familiarizing yourself with it.

## Source

The code for the two functions described above can be found here: matrix_asm_sched.s