We have seen how to load and store data with NEON, and how to handle the leftovers resulting from vector processing. Let us move on to doing some useful data processing - multiplying matrices.

 

Matrices

 

In this post, we will look at how to efficiently multiply four-by-four matrices together, an operation frequently used in the world of 3D graphics. We will assume that the matrices are stored in memory in column-major order - this is the format used by OpenGL-ES.

 

Algorithm

 

We start by examining the matrix mutiply operation in detail, by expanding the calculation, and identifying sub-operations that can be implemented using NEON instructions.

 

Notice that in the diagram, we multiply each column of the first matrix (in red) by a corresponding single value in the second matrix (blue) then add together the results for each element to give a column of results. This operation is repeated for each of the four columns in the result matrix.

 

 

If each column is a vector in a NEON register, we can use the vector-by-scalar multiplication instruction to calculate efficiently each result column. The sub-operation highlighted in the diagram can be implemented using this instruction. We must also add the results together for each element of the column, which we do using the accumulating version of the same instruction.


As we are operating on the columns of the first matrix, and producing a column of results, reading and writing elements to and from memory is a linear operation, and requires no interleaving load or store instructions.

 

Code

 

Floating Point


First, we will look at an implementation that multiplies single precision floating point matrices.


Begin by loading the matrices from memory into NEON registers. The matrices we are multiplying use column-major order, so columns of the matrix are stored linearly in memory. A column can be loaded into NEON registers using VLD1, and written back to memory using VST1.


     vld1.32  {d16-d19}, [r1]!            @ load first eight elements of matrix 0
     vld1.32  {d20-d23}, [r1]!            @ load second eight elements of matrix 0
     vld1.32  {d0-d3}, [r2]!              @ load first eight elements of matrix 1
     vld1.32  {d4-d7}, [r2]!              @ load second eight elements of matrix 1

 

As NEON has 32 64-bit registers, we can load all of the elements from both input matrices into registers, and still have registers left over for use as accumulators. Here, d16 to d23 hold 16 elements from the first matrix, and d0 to d7 hold 16 elements from the second.

 

An Aside: D and Q registers

Most NEON instructions can use the register bank in two ways:

  • As 32 Double-word registers, 64-bits in size, named d0 to d31.
  • As 16 Quad-word registers, 128-bits in size, named q0 to q15.

 


These registers are aliased so that the data in a Q register is the same as that in its two corresponding D registers. For example, q0 is aliased to d0 and d1, and the same data is accessible through either register type. In C terms, this is very similar to a union.

 

For the floating point matrix multiplication example, we will use Q registers frequently, as we are handling columns of four 32-bit floating point numbers, which fit into a single 128-bit Q register.

Back to the Code

 

We can calculate a column of results using just four NEON multiply instructions:


     vmul.f32    q12, q8, d0[0]      @ multiply col element 0 by matrix col 0
     vmla.f32    q12, q9, d0[1]      @ multiply-acc col element 1 by matrix col 1
     vmla.f32    q12, q10, d1[0]  @ multiply-acc col element 2 by matrix col 2
     vmla.f32    q12, q11, d1[1]  @ multiply-acc col element 3 by matrix col 3


Here, the first instruction implements the operation highlighted in the diagram - x0, x1, x2 and x3 (in register q8) are each multiplied by y0 (element 0 in d0), and stored in q12.

 

Subsequent instructions operate on the other columns of the first matrix, multiplyingby corresponding elements of the first column of the second matrix. Results are accumulated into q12 to give the first column of values for the result matrix.

 

Notice that the scalar used in the multiply instructions refers to D registers; although q0[3] should be the same value as d1[1], and using it would perhaps make more sense here, the GNU assembler I'm using does not accept that format. I have to specify a scalar from a D register. Your assembler may be better.

 

If we only needed to calculate a matrix-by-vector multiplication (another common operation in 3D graphics,) the operation would now be complete, and the result vector can be stored to memory. However, to complete the matrix-by-matrix multiplication, we must execute three more iterations, using values y4 to yF in registers q1 to q3.


     .macro mul_col_f32 res_q, col0_d, col1_d
     vmul.f32    \res_q, q8, \col0_d[0]      @ multiply col element 0 by matrix col 0
     vmla.f32    \res_q, q9, \col0_d[1]      @ multiply-acc col element 1 by matrix col 1
     vmla.f32    \res_q, q10, \col1_d[0]  @ multiply-acc col element 2 by matrix col 2
     vmla.f32    \res_q, q11, \col1_d[1]  @ multiply-acc col element 3 by matrix col 3
     .endm


The implementation of a four-by-four floating point matrix multiply now looks like this.


     vld1.32  {d16-d19}, [r1]!            @ load first eight elements of matrix 0
     vld1.32  {d20-d23}, [r1]!            @ load second eight elements of matrix 0
     vld1.32  {d0-d3}, [r2]!              @ load first eight elements of matrix 1
     vld1.32  {d4-d7}, [r2]!              @ load second eight elements of matrix 1

     mul_col_f32 q12, d0, d1           @ matrix 0 * matrix 1 col 0
     mul_col_f32 q13, d2, d3           @ matrix 0 * matrix 1 col 1
     mul_col_f32 q14, d4, d5           @ matrix 0 * matrix 1 col 2
     mul_col_f32 q15, d6, d7           @ matrix 0 * matrix 1 col 3

     vst1.32  {d24-d27}, [r0]!            @ store first eight elements of result
     vst1.32  {d28-d31}, [r0]!            @ store second eight elements of result

 

Fixed Point


Using fixed point arithmetic for calculations is often faster than floating point - it requires less memory bandwidth to read and write values that use fewer bits, and multiplication of integer values is generally quicker than the same operations applied to floating point numbers.

 

However, when using fixed point arithmetic, you must choose the representation carefully to avoid overflow or saturation, whilst preserving the degree of precision in the results that your application requires.


Implementing a matrix multiply using fixed point values is very similar to floating point. In this example, we will use Q1.14 fixed-point format, but the operations required are similar for other formats, and may only require a change to the final shift applied to the accumulator. Here is the macro:


     .macro mul_col_s16 res_d, col_d
     vmull.s16   q12, d16, \col_d[0]   @ multiply col element 0 by matrix col 0
     vmlal.s16   q12, d17, \col_d[1]   @ multiply-acc col element 1 by matrix col 1
     vmlal.s16   q12, d18, \col_d[2]   @ multiply-acc col element 2 by matrix col 2
     vmlal.s16   q12, d19, \col_d[3]   @ multiply-acc col element 3 by matrix col 3
     vqrshrn.s32 \res_d, q12, #14            @ shift right and narrow accumulator into
                                             @  Q1.14 fixed point format, with saturation
     .endm


Comparing it to the macro used in the floating point version, you will see that the major differences are:

  • Values are now 16-bit rather than 32-bit, so we can use D registers to hold four inputs.
  • The result of multiplying two 16-bit numbers is a 32-bit number. We use VMULL and VMLAL, because they store their results in Q registers, preserving all of the bits of the result using
    double-size elements.
  • The final result must be 16-bits, but the accumulators are 32-bit. We obtain a 16-bit result using VQRSHRN, a vector, saturating, rounding, narrowing shift right. This adds the correct rounding value to each element, shifts it right, and saturates the result to the new, narrower element size.


The reduction from 32-bits to 16-bits per element also has an effect on the memory accesses; the data can be loaded and stored using fewer instructions. The code for a fixed point matrix multiply looks like this:

 


     vld1.16  {d16-d19}, [r1]       @ load sixteen elements of matrix 0
     vld1.16  {d0-d3}, [r2]         @ load sixteen elements of matrix 1

     mul_col_s16 d4, d0                      @ matrix 0 * matrix 1 col 0
     mul_col_s16 d5, d1                      @ matrix 0 * matrix 1 col 1
     mul_col_s16 d6, d2                      @ matrix 0 * matrix 1 col 2
     mul_col_s16 d7, d3                      @ matrix 0 * matrix 1 col 3

     vst1.16  {d4-d7}, [r0]         @ store sixteen elements of result

 

Scheduling


We will deal with the details of scheduling in a future post, but for now, it is worth seeing the effect of improved instruction scheduling on this code.

 

In the macro, adjacent multiply instructions write to the same register, so the NEON pipeline must wait for each multiply to complete before it can start the next instruction.

 

If we take the instructions out of the macro and rearrange them, we can separate those that are dependent using other instructions that are not dependent. These instructions can be issued whilst the others complete in the background.


In this case, we rearrange the code to space out accesses to the accumulator registers.


     vmul.f32    q12, q8, d0[0]              @ rslt col0  = (mat0 col0) * (mat1 col0 elt0)
     vmul.f32    q13, q8, d2[0]              @ rslt col1  = (mat0 col0) * (mat1 col1 elt0)
     vmul.f32    q14, q8, d4[0]              @ rslt col2  = (mat0 col0) * (mat1 col2 elt0)
     vmul.f32    q15, q8, d6[0]              @ rslt col3  = (mat0 col0) * (mat1 col3 elt0)

     vmla.f32    q12, q9, d0[1]              @ rslt col0 += (mat0 col1) * (mat1 col0 elt1)
     vmla.f32    q13, q9, d2[1]              @ rslt col1 += (mat0 col1) * (mat1 col1 elt1)
     ...
     ...


Using this version, matrix multiply performance almost doubles on a Cortex-A8 based system.

 

You can find details for instruction timings and latencies from the Technical Reference Manual for your Cortex core. With potential performance improvements like those described above, it is well worth spending some time familiarizing yourself with it.

 

Source


The code for the two functions described above can be found here: matrix_asm_sched.s