Arm’s latest Cortex-A55 and Cortex-A75 CPUs, in addition to being based on DynamIQ technology, implement new instructions, added in Armv8.4-A, to calculate dot products: signed dot product (SDOT) and unsigned dot product (UDOT). The instructions are optional, and can be included in Cortex-A55 and Cortex-A75 to improve machine learning performance. There are various flavors of SDOT and UDOT; this article works through an example that uses UDOT to calculate the dot product of two arrays. For each 32-bit lane, UDOT multiplies four 8-bit elements from one source register by the corresponding four 8-bit elements from the other source register and accumulates the sum of the four products into the 32-bit destination lane.
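To make the per-lane behaviour concrete, here is a minimal scalar C sketch of what UDOT computes for a single 32-bit lane. It is a model for illustration only and does not emit the instruction; the names udot_lane and udot_lane_selfcheck are hypothetical and are not part of the attached example.

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of one 32-bit UDOT lane (hypothetical helper name):
// multiply four unsigned 8-bit pairs and add the products to the
// existing 32-bit accumulator.
uint32_t udot_lane(uint32_t acc, const uint8_t a[4], const uint8_t b[4])
{
    for (int k = 0; k < 4; k++) {
        acc += (uint32_t)a[k] * (uint32_t)b[k];
    }
    return acc;
}

// Hand-computed check: 1*5 + 2*6 + 3*7 + 4*8 = 70.
void udot_lane_selfcheck(void)
{
    const uint8_t a[4] = {1, 2, 3, 4};
    const uint8_t b[4] = {5, 6, 7, 8};

    assert(udot_lane(0, a, b) == 70);   // starting from a cleared lane
    assert(udot_lane(10, a, b) == 80);  // the result accumulates
}
```

A 128-bit vector holds four such lanes, so one UDOT on 16-byte operands performs this calculation four times in parallel.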
The article introduces the example, explains how to confirm dot product support in a CPU, reviews tool support information, and demonstrates how to run the example on Cortex-A55 Fast Models and Cycle Models.
Below is a simple function to compute the dot product of two arrays. For demonstration purposes, arrays of 64 bytes are used. The function is marked noinline to make it easier to find in the disassembly. The complete software is attached at the end of the article.

unsigned __attribute__((noinline)) dot_product(unsigned char *a, unsigned char *b, int size)
{
    unsigned int sum = 0;

    for (int i = 0; i < size; i++) {
        sum += a[i] * b[i];
    }

    return sum;
}
Without any specific direction, Arm Compiler 6 compiles the dot_product() function into a loop of MADD instructions that multiply and accumulate over the 64 values.

dot_product
    0x00001ad4:    aa1f03e9    ....    MOV      x9,xzr
    0x00001ad8:    2a1f03e8    ...*    MOV      w8,wzr
    0x00001adc:    2a1f03ea    ...*    MOV      w10,wzr
    0x00001ae0:    f000000b    ....    ADRP     x11,{pc}+0x3000 ; 0x4ae0
    0x00001ae4:    9100816b    k...    ADD      x11,x11,#0x20
    0x00001ae8:    8b09016c    l...    ADD      x12,x11,x9
    0x00001aec:    3940018d    ..@9    LDRB     w13,[x12,#0]
    0x00001af0:    3940058e    ..@9    LDRB     w14,[x12,#1]
    0x00001af4:    3941018f    ..A9    LDRB     w15,[x12,#0x40]
    0x00001af8:    3941058c    ..A9    LDRB     w12,[x12,#0x41]
    0x00001afc:    1b0d21e8    .!..    MADD     w8,w15,w13,w8
    0x00001b00:    1b0e298a    .)..    MADD     w10,w12,w14,w10
    0x00001b04:    91000929    )...    ADD      x9,x9,#2
    0x00001b08:    f101013f    ?...    CMP      x9,#0x40
    0x00001b0c:    54fffee1    ...T    B.NE     0x1ae8 ; dot_product + 20
    0x00001b10:    0b080140    @...    ADD      w0,w10,w8
    0x00001b14:    d65f03c0    .._.    RET
The same functionality can be implemented in assembly language with a loop around a single UDOT instruction. Each iteration processes 16 elements of each array, so the 64-byte example executes UDOT four times. After the loop, the four 32-bit partial sums in the destination vector are added together to produce the result.

    .global dot_product_a55
    .type dot_product_a55, "function"

    // x0 - unsigned char source pointer 1
    // x1 - unsigned char source pointer 2
    // x2 - vector size - must be a nonzero multiple of 16
dot_product_a55:
    ASR     x2, x2, #4              // compute loop count
    MOV     x3, xzr
    DUP     v0.2d, x3               // clear out destination vector
nextblock:
    LD1     {v1.2d}, [x0], #0x10
    LD1     {v2.2d}, [x1], #0x10
    UDOT    v0.4s, v1.16b, v2.16b
    SUB     x2, x2, #1
    CBNZ    x2, nextblock

    // add the four individual dot products
    ADDV    s0, v0.4s

    // return results in x0
    UMOV    x0, v0.d[0]
    RET
The disassembly is shown below:
dot_product_a55
    0x000000c4:    9344fc42    B.D.    ASR      x2,x2,#4
    0x000000c8:    aa1f03e3    ....    MOV      x3,xzr
    0x000000cc:    4e080c60    `..N    DUP      v0.2D,x3
nextblock
    0x000000d0:    4cdf7c01    .|.L    LD1      {v1.2D},[x0],#0x10
    0x000000d4:    4cdf7c22    "|.L    LD1      {v2.2D},[x1],#0x10
    0x000000d8:    6e829420     ..n    UDOT     v0.4S,v1.16B,v2.16B
    0x000000dc:    d1000442    B...    SUB      x2,x2,#1
    0x000000e0:    b5ffff82    ....    CBNZ     x2,0xd0 ; 0xd0
    0x000000e4:    4eb1b800    ...N    ADDV     s0,v0.4S
    0x000000e8:    4e083c00    .<.N    MOV      x0,v0.D[0]
    0x000000ec:    d65f03c0    .._.    RET
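For readers who want to follow the data flow of the assembly routine without a Cortex-A55 to hand, the sketch below models it in portable C: four 32-bit lane accumulators stand in for v0.4s, each outer iteration consumes one 16-byte block from each array, and the final addition plays the role of ADDV. The function name dot_product_udot_model is hypothetical, and the assert on the size is my own addition to state the precondition explicitly; it is not in the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Scalar model of the UDOT loop (hypothetical name, illustration only).
// size must be a nonzero multiple of 16, as in the assembly version.
uint32_t dot_product_udot_model(const uint8_t *a, const uint8_t *b, size_t size)
{
    assert(size > 0 && size % 16 == 0);

    uint32_t lane[4] = {0, 0, 0, 0};        // models v0.4s, cleared by DUP

    for (size_t block = 0; block < size / 16; block++) {
        // One UDOT: each lane accumulates four 8-bit products.
        for (size_t l = 0; l < 4; l++) {
            for (size_t k = 0; k < 4; k++) {
                size_t i = block * 16 + l * 4 + k;
                lane[l] += (uint32_t)a[i] * (uint32_t)b[i];
            }
        }
    }

    // Models ADDV: add the four lane totals into a single result.
    return lane[0] + lane[1] + lane[2] + lane[3];
}
```

Because unsigned addition is associative, the lane-wise accumulation returns the same value as the simple dot_product() loop for the same inputs, which makes this a convenient reference when checking the assembly on a model.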
Let’s look at the performance of each implementation by compiling and running on Arm Fast Models and Arm Cycle Models.
The Cortex-A55 and Cortex-A75 have optional configuration choices to include dot product support. Before trying to use dot product instructions, it's important to make sure the CPU configuration supports them. In AArch64 state this is done by reading the ID_AA64ISAR0_EL1 register. In AArch32 state it is done by reading the ID_ISAR6 register.
The easiest way to do this is using inline assembly to read the appropriate register into a C variable and check the correct bit. For Arm Compiler 6 a function is shown below to read the register and another function to return a boolean value indicating dot product support.
#include <stdbool.h>

// Read the ID_AA64ISAR0_EL1 system register.
static unsigned long long read_id_aa64isar0(void)
{
    unsigned long long id_aa64isar0;

    __asm ("MRS %x0, ID_AA64ISAR0_EL1 \n" : "=r" (id_aa64isar0));

    return id_aa64isar0;
}

// Return true when bit 44 is set, indicating dot product support.
static bool dot_product_supported(void)
{
    return (read_id_aa64isar0() & 0x0000100000000000ULL) != 0;
}
The register information can be found in the Cortex-A55 Technical Reference Manual. Bit 44 indicates dot product support as shown in the Cortex-A55 TRM description.
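Bit 44 is the lowest bit of a 4-bit field. As I understand the architecture's ID register layout, the DP field occupies bits [47:44] of ID_AA64ISAR0_EL1 and bits [7:4] of the AArch32 ID_ISAR6 register, with a nonzero value indicating that the dot product instructions are implemented. The sketch below decodes that field from a register value that has already been read, so it can be checked on any host. All function names are hypothetical; confirm the field positions against the TRM for your core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Extract the DP field, bits [47:44], from an ID_AA64ISAR0_EL1 value.
unsigned aa64isar0_dp_field(uint64_t id_aa64isar0)
{
    return (unsigned)((id_aa64isar0 >> 44) & 0xFu);
}

// Extract the DP field, bits [7:4], from an AArch32 ID_ISAR6 value.
unsigned isar6_dp_field(uint32_t id_isar6)
{
    return (unsigned)((id_isar6 >> 4) & 0xFu);
}

// Treat any nonzero field value as "dot product instructions present".
bool dp_field_has_dot_product(unsigned dp)
{
    return dp >= 1u;
}

// Example values: only the DP field is set or cleared here.
void dp_field_example(void)
{
    assert(dp_field_has_dot_product(aa64isar0_dp_field(1ULL << 44)));
    assert(!dp_field_has_dot_product(aa64isar0_dp_field(0)));
    assert(dp_field_has_dot_product(isar6_dp_field(0x10u)));
}
```

Decoding the whole field rather than testing a single bit keeps the check aligned with how the ID registers are specified, while giving the same answer as the bit 44 test for the values discussed in this article.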
At the time of writing, the latest version of DS-5 is 5.28, which includes Arm Compiler 6.9. This version supports Cortex-A55 and the dot product instructions. To build the example with Arm Compiler 6, pass armclang a -mcpu or -march value that supports the UDOT instruction.
More information about the UDOT instruction can be found in the armasm User Guide. Disassembly using fromelf also supports the dot product instructions. Sometimes fromelf may not fully decode the system registers without the --cpu argument so it’s good practice to add it.
$ fromelf --cpu=8.2-A.64.dotprod -c dot_product-A55.axf
Arm Fast Models are fast, flexible programmer's view models of Arm IP that let you develop software such as drivers, firmware, operating systems, and applications before silicon is available. They allow full control over the simulation, including profiling, debug, and trace. Fast Models are a good way to check the functionality of the code, debug any issues, and make sure the dot product instruction sequence works as expected.
The dot product example can be run on a system constructed using Arm Fast Models. The system below contains a Cortex-A55, memory, and a PL011 UART for printing messages.
The current version of Fast Models is 11.2, and no parameter changes are required to enable dot product support for Cortex-A55. The model does have a parameter named has_dot_product which can be used to disable dot product instructions. The default value is 2, which indicates dot product instructions are available, and setting has_dot_product=1 removes dot product instructions. For more information refer to the Fast Models Reference Manual.
DS-5 can be connected to the Fast Model simulation as described in the blog Using DS-5 with custom Fast Model systems.
The System ID registers as viewed in DS-5 are shown below with the ID_AA64ISAR0_EL1 register highlighted. Bit 44 is set to a 1 indicating the dot product instructions are supported.
The disassembly window in DS-5 shows the dot product instruction:
Once the code is working with Fast Models it can be run on the Cortex-A55 Cycle Model to compare the two different dot product implementations.
Arm Cycle Models are compiled directly from Arm RTL and retain complete functional accuracy. They can be simulated using Arm SoC Designer or any SystemC simulator, which lets users confidently make architecture decisions, optimize performance, or develop bare metal software.
One innovative feature of Cycle Models is configuration via a web portal, called Arm IP Exchange, which allows users to specify configuration choices and then the model is compiled from RTL in the background. When the model is ready, users get an e-mail with a link to download the model.
Here is the screenshot of the configuration page from Arm IP Exchange for Cortex-A55. There is an option to include the dot product instructions, and when set to TRUE the ability to execute the dot product instructions is included in the model.
An equivalent Cycle Model system in SoC Designer is shown below. This can be used for a cycle accurate simulation of the dot product example to compare performance.
When the example is run on the Cortex-A55 Cycle Model, the number of cycles executed with and without the dot product instructions is printed in the terminal. Built with -Omax for Arm Compiler 6, the function without dot product takes 402 cycles and the function with dot product takes only 73 cycles, roughly 5.5 times fewer. The cycle count is obtained by reading the cycle counter register. Results will vary based on the compiler optimizations used. The complete software is attached at the bottom of the article along with the makefile to build it using Arm Compiler 6.
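To put the two cycle counts in perspective, the short C sketch below works out the speedup and the average cost per array element. The helper names speedup and cycles_per_element are hypothetical, and the figures come from the single run reported above, so treat the output as a description of this example rather than a general performance claim.

```c
#include <assert.h>

// Ratio of baseline cycles to optimized cycles.
double speedup(unsigned baseline_cycles, unsigned optimized_cycles)
{
    assert(optimized_cycles != 0);
    return (double)baseline_cycles / (double)optimized_cycles;
}

// Average number of cycles spent per array element.
double cycles_per_element(unsigned cycles, unsigned elements)
{
    assert(elements != 0);
    return (double)cycles / (double)elements;
}
```

With the measured values, speedup(402, 73) is about 5.5, and the cost per element over the 64-byte arrays drops from roughly 6.3 cycles to roughly 1.1 cycles.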
As expected, utilizing dot product instructions significantly improves performance. The dot product instructions are a configuration option in Cortex-A55 and Cortex-A75. Some background in how to detect they are available and support for compilation, models, and debugging is helpful when starting to use them. Fast Models are a good way to try dot product instructions, and Cycle Models provide cycle accurate performance comparisons when experimenting with dot product instructions to optimize software.
More information on tools and models can be found on developer.arm.com
In the article I was not specific about the Cortex-A55 and Cortex-A75 configuration options. Only the Cortex-A55 has a configuration option to include or exclude the dot product instructions. The Cortex-A75 always includes the dot product instructions. Furthermore, all cores in a cluster must have the same dot product support. This means that if the Cortex-A75 is in the cluster then the Cortex-A55 must also have dot product support.