Exploring the Arm dot product instructions

Arm’s latest Cortex-A55 and Cortex-A75 CPUs, in addition to being based on DynamIQ technology, implement new instructions to calculate dot products. The instructions, signed dot product (SDOT) and unsigned dot product (UDOT), were introduced with Armv8.4-A and are an optional feature from Armv8.2-A onward. They can be included in Cortex-A55 and Cortex-A75 to improve machine learning performance. There are various flavors of SDOT and UDOT, but this article explores an example that uses UDOT to calculate the dot product of two arrays. It shows how UDOT multiplies the four 8-bit elements in each 32-bit lane of the two source registers and accumulates the sum of the products into the corresponding 32-bit lane of the destination register, as shown below.

32-bit destination register
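
As a rough illustration of the arithmetic, the behavior of a single 32-bit lane of UDOT can be modeled in portable C. This is only a sketch of the semantics described above, not how the hardware implements the instruction, and the function name udot_lane_model is hypothetical.

```c
#include <stdint.h>

/* Model of one 32-bit lane of UDOT: multiply four pairs of unsigned
 * 8-bit elements, add the four products, and accumulate into acc.
 * udot_lane_model is a hypothetical name used only for illustration. */
static uint32_t udot_lane_model(uint32_t acc, const uint8_t a[4], const uint8_t b[4])
{
    for (int i = 0; i < 4; i++) {
        acc += (uint32_t)a[i] * (uint32_t)b[i];
    }
    return acc;
}
```

For example, with a = {1, 2, 3, 4}, b = {5, 6, 7, 8} and an accumulator of 10, the model returns 10 + 5 + 12 + 21 + 32 = 80. A full UDOT on 128-bit registers performs this operation on four lanes at once.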

The article introduces the example, explains how to confirm dot product support in a CPU, reviews tool support information, and demonstrates how to run the example on Cortex-A55 Fast Models and Cycle Models.

Small programming example

Below is a simple function to compute the dot product of two arrays. For demonstration purposes, arrays of 64 bytes are used. The function is marked noinline to make it easier to find in the disassembly. The complete software is attached at the end of the article.

unsigned __attribute__((noinline)) dot_product(unsigned char *a, unsigned char *b, int size)
{
    unsigned int sum = 0;

    /* multiply corresponding elements and accumulate the products */
    for (int i = 0; i < size; i++) {
        sum += a[i] * b[i];
    }

    return sum;
}

With no target-specific options, Arm Compiler 6 compiles the dot_product() function into a loop, unrolled by two, that uses the MADD instruction to multiply and accumulate the 64 values.

        0x00001ad4:    aa1f03e9    ....    MOV      x9,xzr
        0x00001ad8:    2a1f03e8    ...*    MOV      w8,wzr
        0x00001adc:    2a1f03ea    ...*    MOV      w10,wzr
        0x00001ae0:    f000000b    ....    ADRP     x11,{pc}+0x3000 ; 0x4ae0
        0x00001ae4:    9100816b    k...    ADD      x11,x11,#0x20
        0x00001ae8:    8b09016c    l...    ADD      x12,x11,x9
        0x00001aec:    3940018d    ..@9    LDRB     w13,[x12,#0]
        0x00001af0:    3940058e    ..@9    LDRB     w14,[x12,#1]
        0x00001af4:    3941018f    ..A9    LDRB     w15,[x12,#0x40]
        0x00001af8:    3941058c    ..A9    LDRB     w12,[x12,#0x41]
        0x00001afc:    1b0d21e8    .!..    MADD     w8,w15,w13,w8
        0x00001b00:    1b0e298a    .)..    MADD     w10,w12,w14,w10
        0x00001b04:    91000929    )...    ADD      x9,x9,#2
        0x00001b08:    f101013f    ?...    CMP      x9,#0x40
        0x00001b0c:    54fffee1    ...T    B.NE     0x1ae8 ; dot_product + 20
        0x00001b10:    0b080140    @...    ADD      w0,w10,w8
        0x00001b14:    d65f03c0    .._.    RET

The same functionality can be implemented in assembly language with a loop that executes UDOT four times, each time processing 16 elements of each array. After the loop, the four 32-bit partial sums in the destination vector are added to produce the result.

.global dot_product_a55
    .type dot_product_a55, "function"
// x0 - unsigned char source pointer 1
// x1 - unsigned char source pointer 2
// x2 - vector size  - must be multiple of 16

dot_product_a55:
  ASR   x2, x2, #4       // compute loop count
  MOV   x3, xzr
  DUP   v0.2d, x3        // clear out destination vector

nextblock:
  LD1   {v1.2d}, [x0], #0x10
  LD1   {v2.2d}, [x1], #0x10
  UDOT  v0.4s, v1.16b, v2.16b
  SUB   x2, x2, #1
  CBNZ  x2, nextblock

  // add the four individual dot products
  ADDV  s0, v0.4s

  // return result in x0
  UMOV  x0, v0.d[0]
  RET


The disassembly is shown below:

        0x000000c4:    9344fc42    B.D.    ASR      x2,x2,#4
        0x000000c8:    aa1f03e3    ....    MOV      x3,xzr
        0x000000cc:    4e080c60    `..N    DUP      v0.2D,x3
        0x000000d0:    4cdf7c01    .|.L    LD1      {v1.2D},[x0],#0x10
        0x000000d4:    4cdf7c22    "|.L    LD1      {v2.2D},[x1],#0x10
        0x000000d8:    6e829420     ..n    UDOT     v0.4S,v1.16B,v2.16B
        0x000000dc:    d1000442    B...    SUB      x2,x2,#1
        0x000000e0:    b5ffff82    ....    CBNZ     x2,0xd0 ; 0xd0
        0x000000e4:    4eb1b800    ...N    ADDV     s0,v0.4S
        0x000000e8:    4e083c00    .<.N    MOV      x0,v0.D[0]
        0x000000ec:    d65f03c0    .._.    RET
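
The same loop can also be sketched in C with Arm NEON intrinsics instead of hand-written assembly. The version below is an illustration, not the code attached to the article: the function name dot_product_intrinsics is hypothetical, and it assumes the compiler defines __ARM_FEATURE_DOTPROD when dot product support is enabled (for example through -march). When that macro is absent, the sketch falls back to scalar code that keeps the same four-lane structure so it still builds and runs on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dot product of two byte arrays.
 * size must be a multiple of 16, as in the assembly version. */
static uint32_t dot_product_intrinsics(const uint8_t *a, const uint8_t *b, size_t size)
{
    assert(size % 16 == 0);
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
    uint32x4_t acc = vdupq_n_u32(0);        /* clear the four accumulators */
    for (size_t i = 0; i < size; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);    /* load 16 bytes from each array */
        uint8x16_t vb = vld1q_u8(b + i);
        acc = vdotq_u32(acc, va, vb);       /* UDOT: four 4-element dot products */
    }
    return vaddvq_u32(acc);                 /* ADDV: sum the four lanes */
#else
    /* Portable fallback that mirrors the lane structure in scalar code. */
    uint32_t lane[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < size; i += 16) {
        for (size_t j = 0; j < 16; j++) {
            lane[j / 4] += (uint32_t)a[i + j] * (uint32_t)b[i + j];
        }
    }
    return lane[0] + lane[1] + lane[2] + lane[3];
#endif
}
```

Intrinsics leave register allocation and scheduling to the compiler, so the generated code may differ from the hand-written loop above.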

Let’s look at the performance of each implementation by compiling and running on Arm Fast Models and Arm Cycle Models.

Confirming dot product support

The Cortex-A55 and Cortex-A75 have optional configuration choices to include dot product support. Before trying to use dot product instructions, it's important to make sure the CPU configuration supports them. In AArch64 state this is done by reading the ID_AA64ISAR0_EL1 register. In AArch32 state it is done by reading the ID_ISAR6 register.

The easiest way to do this is to use inline assembly to read the appropriate register into a C variable and check the correct bit. The two functions below, written for Arm Compiler 6, read the register and return a boolean value indicating dot product support.

#include <stdbool.h>

static unsigned long long read_id_aa64isar0()
{
    unsigned long long id_aa64isar0;

    __asm ("MRS %x0, ID_AA64ISAR0_EL1 \n" : "=r" (id_aa64isar0) );

    return (id_aa64isar0);
}

static bool dot_product_supported()
{
    /* bit 44 set means dot product instructions are implemented */
    if (read_id_aa64isar0() & 0x0000100000000000ULL)
        return true;
    else
        return false;
}

The register information can be found in the Cortex-A55 Technical Reference Manual. Bit 44, the lowest bit of the DP field in bits [47:44], indicates dot product support as shown in the Cortex-A55 TRM description.

AArch64 instruction Set
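
The single-bit test can also be written as a field extraction, which makes the register layout explicit. The sketch below takes already-read register values as arguments, so it can be exercised without access to the system registers. The helper names are hypothetical, and the field positions used are DP in ID_AA64ISAR0_EL1[47:44] and DP in ID_ISAR6[7:4]. Treating any nonzero field value as "supported" is an assumption based on the usual Arm ID register convention.

```c
#include <stdbool.h>
#include <stdint.h>

/* DP field positions: ID_AA64ISAR0_EL1[47:44] and ID_ISAR6[7:4]. */
#define ID_AA64ISAR0_DP_SHIFT 44
#define ID_ISAR6_DP_SHIFT     4
#define ID_DP_FIELD_MASK      0xFu

/* Extract the DP field from an ID_AA64ISAR0_EL1 value (AArch64 state). */
static unsigned dp_field_aarch64(uint64_t id_aa64isar0)
{
    return (unsigned)((id_aa64isar0 >> ID_AA64ISAR0_DP_SHIFT) & ID_DP_FIELD_MASK);
}

/* Extract the DP field from an ID_ISAR6 value (AArch32 state). */
static unsigned dp_field_aarch32(uint32_t id_isar6)
{
    return (unsigned)((id_isar6 >> ID_ISAR6_DP_SHIFT) & ID_DP_FIELD_MASK);
}

/* Assumption: a nonzero DP field means UDOT and SDOT are implemented. */
static bool dp_field_indicates_support(unsigned dp_field)
{
    return dp_field != 0;
}
```

With the register value 0x0000100000000000, which has only bit 44 set, dp_field_aarch64() returns 1 and dp_field_indicates_support() returns true.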

DS-5 compiler support

At the time of writing, the latest version of DS-5 is 5.28, which includes Arm Compiler 6.9. This version supports Cortex-A55 and the dot product instructions. To build the example with Arm Compiler 6, use -mcpu or -march with a value that enables the UDOT instruction. Any of the following armclang options will work:

  • -mcpu=cortex-a55
  • -march=armv8.4-a
  • -march=armv8.2-a+dotprod

More information about the UDOT instruction can be found in the armasm User Guide. Disassembly using fromelf also supports the dot product instructions. Sometimes fromelf may not fully decode the system registers without the --cpu argument so it’s good practice to add it.

$ fromelf --cpu=8.2-A.64.dotprod -c dot_product-A55.axf

Fast Models and DS-5 debugger support

Arm Fast Models provide fast, flexible programmer's view models of Arm IP, allowing you to develop software such as drivers, firmware, operating systems, and applications prior to silicon availability. They allow full control over the simulation, including profiling, debug, and trace. Fast Models are a good way to check the functionality of the code, debug any issues, and make sure the dot product instruction sequence works as expected.

The dot product example can be run on a system constructed using Arm Fast Models. The figure below shows a system with the Cortex-A55, memory, and a PL011 UART to print messages.

System with Cortex-A55 memory and a PL011 UART to print messages

The current version of Fast Models is 11.2, and no parameter changes are required to enable dot product support for Cortex-A55. The model does have a parameter named has_dot_product which can be used to disable dot product instructions. The default value is 2, which indicates dot product instructions are available, and setting has_dot_product=1 removes dot product instructions. For more information refer to the Fast Models Reference Manual.

DS-5 can be connected to the Fast Model simulation as described in the blog Using DS-5 with custom Fast Model systems.

The System ID registers as viewed in DS-5 are shown below with the ID_AA64ISAR0_EL1 register highlighted. Bit 44 is set to 1, indicating that the dot product instructions are supported.

System ID registers in DS-5

The disassembly window in DS-5 shows the dot product instruction:

Disassembly window in DS-5

Once the code is working with Fast Models it can be run on the Cortex-A55 Cycle Model to compare the two different dot product implementations.

Cortex-A55 Cycle Model

Arm Cycle Models are compiled directly from Arm RTL, so they retain complete functional accuracy. They can be simulated using Arm SoC Designer or any SystemC simulator. This enables users to confidently make architecture decisions, optimize performance, or develop bare metal software.

One innovative feature of Cycle Models is configuration via a web portal, called Arm IP Exchange, which allows users to specify configuration choices and then the model is compiled from RTL in the background. When the model is ready, users get an e-mail with a link to download the model.

Here is a screenshot of the configuration page from Arm IP Exchange for Cortex-A55. There is an option to include the dot product instructions. When it is set to TRUE, the model is able to execute the dot product instructions.

Configuration page from Arm IP Exchange for Cortex-A55

An equivalent Cycle Model system in SoC Designer is shown below. This can be used for a cycle accurate simulation of the dot product example to compare performance.

Equivalent Cycle Model system in SoC Designer

When the example is run on the Cortex-A55 Cycle Model, the number of cycles executed with and without the dot product instructions is printed in the terminal. Built with -Omax using Arm Compiler 6, the function without dot product takes 402 cycles and the function with dot product takes only 73 cycles, roughly 5.5 times fewer. The cycle count is obtained by reading the cycle counter register. Results will vary based on the compiler optimizations used. The complete software is attached at the bottom of the article along with the makefile to build it using Arm Compiler 6.
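
A measurement of this kind can be sketched as follows. This is not the code attached to the article. It assumes the cycle counter in question is the PMU cycle counter PMCCNTR_EL0, that startup code has already enabled it, and that the code runs at an exception level allowed to read it. The macro USE_PMU_CYCLE_COUNTER and the function names are hypothetical. Without the macro, the counter read returns 0 so the sketch still builds on a host machine.

```c
#include <stdint.h>

/* Read the cycle counter. Assumption: PMCCNTR_EL0 is enabled and readable.
 * USE_PMU_CYCLE_COUNTER is a hypothetical opt-in build macro; without it
 * this function returns 0 so the sketch compiles and runs anywhere. */
static inline uint64_t read_cycle_counter(void)
{
#if defined(__aarch64__) && defined(USE_PMU_CYCLE_COUNTER)
    uint64_t cycles;
    __asm volatile("ISB\n\tMRS %0, PMCCNTR_EL0" : "=r"(cycles));
    return cycles;
#else
    return 0;
#endif
}

/* Plain C dot product, used here as the code under measurement. */
static unsigned dot_product_scalar(const unsigned char *a, const unsigned char *b, int size)
{
    unsigned sum = 0;
    for (int i = 0; i < size; i++) {
        sum += (unsigned)a[i] * (unsigned)b[i];
    }
    return sum;
}

/* Run the function once, store its result, and return the elapsed count. */
static uint64_t measure_dot_product(const unsigned char *a, const unsigned char *b,
                                    int size, unsigned *result)
{
    uint64_t start = read_cycle_counter();
    *result = dot_product_scalar(a, b, size);
    uint64_t end = read_cycle_counter();
    return end - start;
}
```

Measuring each implementation the same way, with the same input arrays, keeps the comparison between the two cycle counts fair.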


As expected, using dot product instructions significantly improves performance. The dot product instructions are a configuration option in Cortex-A55 and Cortex-A75. Some background on how to detect that they are available, and on the support for compilation, models, and debugging, is helpful when starting to use them. Fast Models are a good way to try dot product instructions, and Cycle Models provide cycle accurate performance comparisons when experimenting with dot product instructions to optimize software.

More information on tools and models can be found on developer.arm.com.
