Emulating SVE on existing Armv8-A hardware using DynamoRIO and ArmIE

The Arm Instruction Emulator (ArmIE) is a tool that converts instructions not supported by the hardware, such as those from the Scalable Vector Extension (SVE) instruction set, into native Armv8-A instructions. ArmIE enables developers to run and test SVE binaries on existing Armv8-A hardware, without resorting to simulators with high overheads. This approach trades performance accuracy (e.g., ArmIE does not provide any timing information) for faster application execution. As a result, larger, more realistic applications can be run, coupled with dynamic binary instrumentation.

Arm Instruction Emulator

Dynamic binary instrumentation support is provided through DynamoRIO integration, extending ArmIE's capabilities beyond simple emulation. Instrumentation enables the collection of dynamic characteristics and metrics from the executing application, such as memory traces and instruction counts, allowing a deeper and more insightful analysis. Given the wide range of instrumentation that can be applied and metrics that can be gathered with ArmIE and DynamoRIO, we have added the ability to instrument emulated instructions to the DynamoRIO API, so that developers can build their own DynamoRIO clients with access to emulated instructions when required. To help explain how the API's emulated-instruction instrumentation functions work, we provide four example instrumentation clients with emulation support, along with their source code. These clients are based on existing DynamoRIO clients and are as follows:

  • Instruction count client with emulated SVE (samples/inscount_emulated.cpp)
  • Instruction count client (emulation API in the code, but no emulated SVE) (samples/inscount.cpp)
  • Opcode count client (samples/opcodes_emulated.cpp)
  • Memory tracing client (samples/memtrace_simple.c)

The structure adopted by ArmIE can be seen in the following diagram. Conceptually, ArmIE consists of an emulation client (currently for SVE) and optional instrumentation clients (e.g., instruction count), which communicate with each other through the emulator API. Additional information on the clients, how ArmIE works, and how to set it up can be found in the documentation provided with the tool and on the ArmIE website.

Figure: ArmIE structure

Running and Instrumenting SVE Binaries

To generate an SVE binary, you need an SVE-capable compiler, such as the Arm HPC Compiler or GCC 8.2+, and you need to enable the SVE architecture flag (e.g. -march=armv8-a+sve). For all examples in this section, we use the Arm HPC Compiler 18.4 and ArmIE 18.4, the latest available version at the time of writing.

We are going to look at the HACCKernels mini-app, which implements HACC's particle force kernels. We change the makefile to point to the Arm HPC compiler (armclang++) and add the necessary SVE flag to it (-march=armv8-a+sve). We also remove the OpenMP flag for these examples, in order to simplify the instrumentation output analysis.

On the source code side, we make small modifications to the main.cpp file. We reduce the number of iterations (int NumIters) from 2000 to 500 and run only the 5th order kernel out of the three available kernels. This reduces the execution time of the evaluation presented here by roughly 12x. We chose the 5th order kernel because it provides a clear breakdown of the SVE impact in HACCKernels, both in instruction utilization and in memory accesses. No other changes to the code are made at this point.

Instruction Count

Important Note: ArmIE does not produce timing information, and it adds emulation and binary instrumentation overhead to the running application. Therefore, no conclusions about real-time performance should be drawn from these results.

We start with the instruction count client (inscount), choosing a vector length of 512 bits. This client counts all the dynamic instructions that are executed by the binary, separating SVE instructions from AArch64 instructions. At this point in time, there is no breakdown on the types of instructions available in the client. Additionally, in version 18.4, the inscount client prints the emulated SVE instruction opcodes (and PC) to output. This can be decoded to obtain extra information about which instructions were executed. We will focus more on this in the next section.

$ armie -msve-vector-bits=512 -i libinscount_emulated.so -- ./HACCKernels

Gravity Short-Range-Force Kernel (5th Order): 9178.27 -835.505 -167.99: 42.9214 s
205464290 instructions executed of which 167576110 were emulated instructions

From this inscount run, we can observe a very high number of emulated SVE instructions (81.56% of the total instructions), which demonstrates a good use of the vector extension.
The default mode of the inscount client counts all executed instructions, including those from shared libraries. We can enable a client flag that excludes shared libraries from the count, which raises the SVE utilization rate to 83.00%. The example below shows how to run the inscount client with this flag, together with its result. The run command differs from the previous one: it is the underlying DynamoRIO command that ArmIE uses to load and run the emulation and instrumentation clients, which can be displayed by running the ArmIE command with the -s option. Here, the "-only_from_app" string is passed as a parameter to the instrumentation client, libinscount_emulated.so, so that only instructions in the application itself are counted.

$ $ARMIE_PATH/bin64/drrun -client $ARMIE_PATH/lib64/release/libsve_512.so 0 "" -client $ARMIE_PATH/samples/bin64/libinscount_emulated.so 1 "-only_from_app" -max_bb_instrs 32 -max_trace_bbs 4 -- ./HACCKernels

Gravity Short-Range-Force Kernel (5th Order): 9178.27 -835.505 -167.99: 42.7263 s
201887951 instructions executed of which 167576110 were emulated instructions
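
The utilization figures quoted above can be reproduced from the client's summary line. The following is a minimal Python sketch and is not part of ArmIE; the function name `sve_utilization` and the regular expression are assumptions based only on the output format shown in this section.

```python
import re

# Pattern for the inscount summary line as it appears in the runs above.
SUMMARY = re.compile(
    r"(\d+) instructions executed of which (\d+) were emulated instructions"
)

def sve_utilization(output: str) -> float:
    """Return the percentage of executed instructions that were emulated SVE."""
    match = SUMMARY.search(output)
    if match is None:
        raise ValueError("no inscount summary line found")
    total, emulated = (int(group) for group in match.groups())
    return 100.0 * emulated / total

# Summary lines copied from the two runs above.
all_code = "205464290 instructions executed of which 167576110 were emulated instructions"
app_only = "201887951 instructions executed of which 167576110 were emulated instructions"

print(f"all code: {sve_utilization(all_code):.2f}%")  # 81.56%
print(f"app only: {sve_utilization(app_only):.2f}%")  # 83.00%
```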

With the inscount client, we can also quickly compare the SVE utilization between different vector lengths. The table below shows the SVE utilization for vector lengths between 128 bits and 1024 bits.

SVE utilization (no shared libs) for different vector lengths

Vector length      128-bit   256-bit   512-bit   1024-bit
SVE utilization    93.43%    89.61%    83.00%    72.49%

The main takeaway from this table is that the number of SVE instructions falls as the vectors get wider. This is expected, since wider vectors hold more data and perform more simultaneous operations, thus reducing the total number of SVE instructions.
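
As a rough illustration of this effect, consider a vector-length-agnostic loop over 32-bit floats: each pass processes one vector's worth of elements, so the number of passes (and with it the number of SVE instructions executed in the loop body) shrinks as the vector widens. The sketch below is illustrative arithmetic only; the element count of 1000 and the function name `loop_iterations` are made-up values for the example, not taken from HACCKernels.

```python
import math

def loop_iterations(num_elements: int, vector_bits: int, element_bits: int = 32) -> int:
    """Passes needed by a loop that handles up to one full vector per pass."""
    lanes = vector_bits // element_bits  # elements that fit in one vector
    return math.ceil(num_elements / lanes)

# Hypothetical workload of 1000 single-precision elements.
for bits in (128, 256, 512, 1024):
    print(f"{bits:>4}-bit vectors: {loop_iterations(1000, bits)} iterations")
```

The non-SVE instructions outside such loops do not shrink in the same way, which would be consistent with the SVE share of the total falling in the table above.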

Opcodes Count

Similarly to the inscount client, the opcodes client reports the dynamic count of the total number of instructions executed, broken down by opcode. This client is useful for understanding the 'hotness' factor of SVE instructions, and to correlate it against the application's source code. Non-SVE opcodes are decoded by DynamoRIO, resulting in the corresponding mnemonics that can be seen in the output below.

$ armie -msve-vector-bits=512 -i libopcodes_emulated.so -- ./HACCKernels

Gravity Short-Range-Force Kernel (5th Order): 9178.27 -835.505 -167.99: 85.9275 s
Opcode execution counts in AArch64 mode:
     184763 : ubfm           
     224217 : cbnz           
     236845 : and            
     253632 : ldrb           
     481172 : adrp           
     624493 : orr            
     739335 : add            
     810385 : fadd           
    1017337 : subs           
    1172879 : ldr            
    1320770 : fmadd          
    2792022 : xx             
    3127984 : str            
    4263314 : fcvt           
    5342564 : bcond          
    5833081 : fmul           
    8473704 : eor            
77 unique emulated instructions written to undecoded.txt


The unique SVE instruction opcodes are written to an output file (undecoded.txt) which can then be decoded. To facilitate this process, we include a decoder script within ArmIE (bin64/enc2instr.py) that uses the LLVM machine code (llvm-mc) binary (available in the Arm HPC compiler), to disassemble the instruction encodings. Using this script, we obtain a breakdown of the SVE instructions, with their mnemonics and accessed registers, as seen below.

Please note that the provided script is written for a generic case where a single encoding can be passed to it, and not specifically for this client. Hence, we need to remove the instruction count, present in the undecoded.txt file, when running the script, to avoid any incompatibilities. Below we extract the encodings from the generated file, run them through the script and paste back together the instruction count with the respective decoding, all in a single command line:

$ awk '{print $3}' undecoded.txt | $ARMIE_PATH/bin64/enc2instr.py -mattr=+sve | awk -F: '{print $2}' | paste undecoded.txt /dev/stdin

  4150900 : 0xa5484c9b	 ld1w	{z27.s}, p3/z, [x4, x8, lsl #2]
  4150900 : 0xa5484479	 ld1w	{z25.s}, p1/z, [x3, x8, lsl #2]
  4150900 : 0xa5484458	 ld1w	{z24.s}, p1/z, [x2, x8, lsl #2]
  4150900 : 0xa5484437	 ld1w	{z23.s}, p1/z, [x1, x8, lsl #2]
  4150900 : 0x65b9033a	 fmla	z26.s, p0/m, z25.s, z25.s
  4150900 : 0x65b8031b	 fmla	z27.s, p0/m, z24.s, z24.s
  4150900 : 0x65b68359	 fmad	z25.s, p0/m, z26.s, z22.s
  4150900 : 0x65b58358	 fmad	z24.s, p0/m, z26.s, z21.s
  4150900 : 0x65b4837a	 fmad	z26.s, p0/m, z27.s, z20.s
  4150900 : 0x65b3e35b	 fnmsb	z27.s, p0/m, z26.s, z19.s
  4150900 : 0x65b2e35b	 fnmsb	z27.s, p0/m, z26.s, z18.s
  4150900 : 0x65b1e35b	 fnmsb	z27.s, p0/m, z26.s, z17.s
  4150900 : 0x65a6635b	 fnmls	z27.s, p0/m, z26.s, z6.s
  4150900 : 0x65a58357	 fmad	z23.s, p0/m, z26.s, z5.s
  4150900 : 0x659c0bbc	 fmul	z28.s, z29.s, z28.s
  4150900 : 0x659c035a	 fadd	z26.s, z26.s, z28.s
  4150900 : 0x659b0b5a	 fmul	z26.s, z26.s, z27.s
  4150900 : 0x65970afa	 fmul	z26.s, z23.s, z23.s
  4150900 : 0x65922343	 fcmeq	p3.s, p0/z, z26.s, #0.0
  4150900 : 0x658da39d	 fsqrt	z29.s, p0/m, z28.s
  4150900 : 0x658c80fc	 fdivr	z28.s, p0/m, z28.s, z7.s
  4150900 : 0x6584035c	 fadd	z28.s, z26.s, z4.s
  4150900 : 0x65834342	 fcmge	p2.s, p0/z, z26.s, z3.s
  4150900 : 0x65820739	 fsub	z25.s, z25.s, z2.s
  4150900 : 0x65810718	 fsub	z24.s, z24.s, z1.s
  4150900 : 0x658006f7	 fsub	z23.s, z23.s, z0.s
  4150900 : 0x25a91d01	 whilelo	p1.s, x8, x9
  4150900 : 0x25834042	 orr	p2.b, p0/z, p2.b, p3.b
  4150900 : 0x25034023	 and	p3.b, p0/z, p1.b, p3.b
  4150900 : 0x25004243	 not	p3.b, p0/z, p2.b
  4150900 : 0x05b9cad9	 mov	z25.s, p2/m, z22.s
  4150900 : 0x05b8cab8	 mov	z24.s, p2/m, z21.s
  4150900 : 0x05b7c8b7	 mov	z23.s, p2/m, z5.s
  4150900 : 0x05b6c736	 mov	z22.s, p1/m, z25.s
  4150900 : 0x05b5c715	 mov	z21.s, p1/m, z24.s
  4150900 : 0x05a5c6e5	 mov	z5.s, p1/m, z23.s
  4150900 : 0x04b0e3e8	 incw	x8
  4150900 : 0x0420bf7a	 movprfx	z26, z27
  4150900 : 0x0420bf5b	 movprfx	z27, z26
  4150900 : 0x0420be1b	 movprfx	z27, z16
   166036 : 0x25351d00	 whilelo	p0.b, x8, x21
   166036 : 0x252c8808	 incp	x8, p0.b
    83018 : 0xe4084000	 st1b	{z0.b}, p0, [x0, x8]
    50000 : 0x658022c0	 faddv	s0, p0, z22.s
    50000 : 0x658022a1	 faddv	s1, p0, z21.s
    50000 : 0x658020a2	 faddv	s2, p0, z5.s
    50000 : 0x25b9ce07	 fmov	z7.s, #1.00000000
    50000 : 0x25b8c005	 mov	z5.s, #0                // =0x0
    50000 : 0x25a91fe1	 whilelo	p1.s, xzr, x9
    50000 : 0x2598e3e0	 ptrue	p0.s
    50000 : 0x05242294	 mov	z20.s, s20
    50000 : 0x05242273	 mov	z19.s, s19
    50000 : 0x05242252	 mov	z18.s, s18
    50000 : 0x05242231	 mov	z17.s, s17
    50000 : 0x05242210	 mov	z16.s, s16
    50000 : 0x052420a6	 mov	z6.s, s5
    50000 : 0x05242084	 mov	z4.s, s4
    50000 : 0x05242063	 mov	z3.s, s3
    50000 : 0x05242042	 mov	z2.s, s2
    50000 : 0x05242021	 mov	z1.s, s1
    50000 : 0x05242000	 mov	z0.s, s0
    50000 : 0x046530b6	 mov	z22.d, z5.d
    50000 : 0x046530b5	 mov	z21.d, z5.d
    41509 : 0xe4084380	 st1b	{z0.b}, p0, [x28, x8]
    41509 : 0xe40842c0	 st1b	{z0.b}, p0, [x22, x8]
    10500 : 0x253c1d00	 whilelo	p0.b, x8, x28
    10500 : 0x04285028	 addvl	x8, x8, #1
     3500 : 0xe40842e0	 st1b	{z0.b}, p0, [x23, x8]
     3500 : 0xe4084280	 st1b	{z0.b}, p0, [x20, x8]
     3500 : 0xe4084260	 st1b	{z0.b}, p0, [x19, x8]
     3500 : 0x2538c000	 mov	z0.b, #0                // =0x0
     3000 : 0x04bf5028	 rdvl	x8, #1
     2000 : 0x25351fe0	 whilelo	p0.b, xzr, x21
     1500 : 0x253c1fe0	 whilelo	p0.b, xzr, x28
      500 : 0x04bf5029	 rdvl	x9, #1
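
Once the encodings are decoded, it can be handy to aggregate the counts per mnemonic rather than per unique encoding. The following Python sketch is one possible way to do that and is not shipped with ArmIE; it assumes the `<count> : <encoding> <mnemonic> <operands>` layout shown above, and the function name `counts_by_mnemonic` is hypothetical.

```python
from collections import Counter

def counts_by_mnemonic(lines):
    """Sum execution counts per mnemonic from decoded opcode lines."""
    totals = Counter()
    for line in lines:
        count, separator, rest = line.partition(":")
        fields = rest.split()  # [encoding, mnemonic, operands...]
        if not separator or len(fields) < 2:
            continue  # skip blank or malformed lines
        totals[fields[1]] += int(count)
    return totals

# A few lines copied from the decoded output above.
sample = [
    "  4150900 : 0xa5484c9b\t ld1w\t{z27.s}, p3/z, [x4, x8, lsl #2]",
    "  4150900 : 0xa5484479\t ld1w\t{z25.s}, p1/z, [x3, x8, lsl #2]",
    "  4150900 : 0x65b9033a\t fmla\tz26.s, p0/m, z25.s, z25.s",
    "    83018 : 0xe4084000\t st1b\t{z0.b}, p0, [x0, x8]",
]

for mnemonic, total in counts_by_mnemonic(sample).most_common():
    print(f"{total:>10} {mnemonic}")
```

Feeding the full decoded listing to the same function (for example, by reading it from a file line by line) would give a per-mnemonic view of the hottest SVE instructions.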

Memory Tracing

The memory tracing client (memtrace) focuses on the dynamic memory accesses of the application, capturing information such as the accessed addresses and data sizes. It is based on the existing non-SVE DynamoRIO memtrace client, with added SVE emulation and tracing support. Running the emulated memtrace client produces two memory trace files: an SVE-only trace and a non-SVE one. To keep the two traces consistent, we include an additional field, 'Sequence Number', that records the order of each memory access using a counter shared between the emulation side and the core DynamoRIO instrumentation. The memory trace format is the following:

  • Sequence Number
  • Thread ID
  • SVE Bundle
  • isWrite (1 = write, 0 = read)
  • Data Size (Bytes)
  • Data Address
  • PC

We also add the 'SVE Bundle' field to the memory traces, which identifies SVE linear and gather/scatter vector accesses. It consists of 3 bits with the following possible combinations appearing in the resulting trace:

  • 0: Contiguous access
  • 1 or 3: Gather/Scatter bundle, first element
  • 2: Gather/Scatter bundle, another element
  • 4 or 6: Gather/Scatter bundle, last element
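
When post-processing a trace, the values listed above can be turned into readable labels with a small lookup. This is a minimal Python sketch that maps only the documented values; it does not attempt to interpret the individual bits, and the function name `describe_sve_bundle` is hypothetical.

```python
def describe_sve_bundle(value: int) -> str:
    """Map an 'SVE Bundle' trace value to the meaning listed above."""
    if value == 0:
        return "contiguous access"
    if value in (1, 3):
        return "gather/scatter bundle, first element"
    if value == 2:
        return "gather/scatter bundle, another element"
    if value in (4, 6):
        return "gather/scatter bundle, last element"
    # Any other value is not described in this article.
    raise ValueError(f"unexpected SVE Bundle value: {value}")

for value in (0, 1, 2, 4):
    print(value, "->", describe_sve_bundle(value))
```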

An important consideration when tracing SVE binaries is that the output trace can easily use up a large amount of disk space. Therefore, we support marker instructions that developers must include in their SVE code to define the start and end of the regions (multiple regions are supported) in which the memtrace client will run. In a typical scenario, these regions correspond to the main kernel loops of the application. Note that only the code inside these markers is traced; if no markers are used, nothing is traced. The markers should also be placed outside vectorizable loops, as they might hinder vectorization.

In the case of the mini-app we are using, HACCKernels, we add the marker definitions at the start of the main.cpp file and define the region of interest around the main kernel, GravityForceKernel5:

#define __START_TRACE() {asm volatile (".inst 0x2520e020");}
#define __STOP_TRACE() {asm volatile (".inst 0x2520e040");}

....

__START_TRACE();
run(GravityForceKernel5, "5th Order");
__STOP_TRACE();

After adding the region markers to the code, we can then compile it and run it with the memtrace client (again with 512-bit vectors):

$ armie -e libmemtrace_sve_512.so -i libmemtrace_simple.so -- ./HACCKernels
> Data file /home/migtai01/apps-unimplemented/HACCKernels_sve_vectorizer/memtrace.HACCKernels.03531.0000.log created
> Gravity Short-Range-Force Kernel (5th Order): 9178.27 -835.505 -167.99: 73.5272 s

$ ls
> memtrace.HACCKernels.0000.log
> sve-memtrace.HACCKernels.8114.log

Given the large size of the combined trace files (over 22M trace lines) we cannot show the entirety of the trace here, so we focus on a small snippet. For analysis purposes, it is advantageous to merge both the non-SVE and SVE trace files into a single one. This can be done with a simple script that parses the separate memory trace files and orders them into a single full trace output, based on the 'Sequence Number' trace field. Such a script is not included with ArmIE at this point in time.

To facilitate analysis of the fully merged memory trace, we use different separator characters after the first element of each trace line: a colon ' : ' for non-SVE traces, and a comma ' , ' for SVE traces. This can be seen below, in the merged memory trace snippet of HACCKernels:

Format: <sequence number>: <TID>, <isBundle>, <isWrite>, <data size>, <data address>, <PC>
....
2990: 0, 0,  0,  4, 0x401a68, 0x401678
2991: 0, 0,  0,  4, 0x401a6c, 0x401680
2992: 0, 0,  0,  4, 0x401a70, 0x401688
2993: 0, 0,  0,  4, 0x401a74, 0x401690
2994: 0, 0,  0,  4, 0x401a78, 0x401698
2995: 0, 0,  0,  4, 0x401a7c, 0x4016a0
2996, 0, 0, 0, 64, 0x44c750, 0x4016f0
2997, 0, 0, 0, 64, 0x44d110, 0x4016f4
2998, 0, 0, 0, 64, 0x44dad0, 0x4016f8
2999, 0, 0, 0, 4, 0x44e490, 0x401750
3000, 0, 0, 0, 16, 0x44e49c, 0x401750
3001, 0, 0, 0, 4, 0x44e4b0, 0x401750
....

The memory trace snippet shows an initial section of non-SVE memory accesses (traces 2990 to 2995), followed by SVE accesses (traces 2996 to 3001), as indicated by the separator character after the sequence number. Only loads appear in this snippet, but writes are present in the full memory trace. Looking at the data size field, we can also observe three full SVE vector loads (a size of 64 bytes equals the 512-bit vector length).
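
Since no merge script ships with ArmIE, the following Python sketch shows one way the two trace files could be combined by sequence number. It is an illustration under stated assumptions, not the script used for this article: it assumes every line starts with the sequence number followed by ':' or ',', and that each input is already in ascending sequence order (otherwise the inputs would need sorting first). The names `sequence_number`, `trace_lines`, and `merge_traces` are hypothetical.

```python
import heapq
import re

# Leading sequence number followed by ':' (non-SVE) or ',' (SVE).
SEQUENCE = re.compile(r"\s*(\d+)\s*[:,]")

def sequence_number(line: str) -> int:
    """Extract the sequence number at the start of a trace line."""
    match = SEQUENCE.match(line)
    if match is None:
        raise ValueError(f"not a trace line: {line!r}")
    return int(match.group(1))

def trace_lines(lines):
    """Yield stripped trace lines, skipping blank ones."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

def merge_traces(non_sve_lines, sve_lines):
    """Lazily merge two traces; assumes each input is in ascending order."""
    return heapq.merge(
        trace_lines(non_sve_lines), trace_lines(sve_lines), key=sequence_number
    )

# Lines copied from the snippet above.
non_sve = ["2994: 0, 0,  0,  4, 0x401a78, 0x401698",
           "2995: 0, 0,  0,  4, 0x401a7c, 0x4016a0"]
sve = ["2996, 0, 0, 0, 64, 0x44c750, 0x4016f0",
       "2997, 0, 0, 0, 64, 0x44d110, 0x4016f4"]

for line in merge_traces(non_sve, sve):
    print(line)
```

Because `heapq.merge` is lazy, open file objects can be passed in place of the lists, so a trace with millions of lines does not have to be held in memory at once.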

Memory traces are commonly used for different types of post-processing analysis. This can encompass a wide-range of scripts and tools, ranging from simple parsing scripts to more complex cache simulators, to mention a few. Processing memory traces falls outside the scope of ArmIE and, as such, no extra tools are included with it at this point in time. It is up to developers to integrate the generated traces into their analysis workflow and tools.

Below we present a simple scripting experiment that can be run on the fully merged memory trace. It parses all trace lines and prints related information, such as the number of linear and gather/scatter bundle accesses, the percentage of writes and reads, and the number of accesses with inactive vector lanes. We can observe that the HACCKernels mini-application does not perform a single gather/scatter operation and that load operations dominate the memory accesses.

SVE vector size: 512 bits (64B)

Total Memory References = 20486551
   -> linear SVE: 16779970 (81.91%)
   -> bundle SVE: 0 (0.00%)
   ->    non-SVE: 3706581 (18.09%)

Linear SVE accesses with at least 1 inactive lane:
   -> 1237914 (7.38% of linear SVE traces)

==============

Total Writes = 3121089  (34 unique writes - different PCs)
   -> linear SVE: 176536 (5.66%)
   -> bundle SVE: 0 (0.00%)
   ->    non-SVE: 2944553 (94.34%)

Linear SVE Writes with at least 1 inactive lane:
   -> 3376 (1.91% of linear SVE write traces)

==============

Total Loads = 17365462 (51 unique loads - different PCs)
   -> linear SVE: 16603434 (95.61%)
   -> bundle SVE: 0 (0.00%)
   ->    non-SVE: 762028 (4.39%)

Linear SVE Loads with at least 1 inactive lane:
   -> 1234538 (7.44% of linear SVE loads)

==============

Distribution of memory operations:
   -> 15.23% Writes
   -> 84.77% Loads
   
SVE Bundles Stats:
   -> 0 SVE bundle accesses
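
A reduced version of such a summary can be sketched in a few lines of Python. This is not the script that produced the figures above: it only classifies each merged trace line as linear SVE, bundle SVE, or non-SVE and counts writes and loads, and it leaves out the inactive-lane analysis because the article does not describe how that is derived. It assumes the merged format shown earlier (a ':' after the sequence number marks a non-SVE line); the function name `summarize` is hypothetical.

```python
from collections import Counter

def summarize(lines):
    """Count access kinds and read/write totals in a merged memory trace."""
    stats = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        is_sve = ":" not in line  # SVE lines use ',' after the sequence number
        fields = [field.strip() for field in line.replace(":", ",").split(",")]
        # fields: sequence, TID, bundle, isWrite, size, address, PC
        bundle, is_write = int(fields[2]), int(fields[3])
        if not is_sve:
            kind = "non-SVE"
        elif bundle == 0:
            kind = "linear SVE"
        else:
            kind = "bundle SVE"
        stats[kind] += 1
        stats["writes" if is_write else "loads"] += 1
    return stats

# Lines copied from the merged snippet shown earlier.
sample = [
    "2995: 0, 0,  0,  4, 0x401a7c, 0x4016a0",
    "2996, 0, 0, 0, 64, 0x44c750, 0x4016f0",
    "2997, 0, 0, 0, 64, 0x44d110, 0x4016f4",
]

stats = summarize(sample)
total = stats["writes"] + stats["loads"]
for kind in ("linear SVE", "bundle SVE", "non-SVE"):
    print(f"{kind:>10}: {stats[kind]} ({100.0 * stats[kind] / total:.2f}%)")
```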

Summary

In this article, we gave an overview of the Arm Instruction Emulator, from its structure to the existing emulation clients and how to use them. We also briefly discussed the types of analysis that can be carried out with the provided clients, and touched on post-processing of memory traces. As mentioned, more complex post-processing can be done on memory traces, for example cache simulation, as the existing DynamoRIO cache simulator demonstrates for a different type of non-SVE trace. Although that simulator is not yet compatible with ArmIE, it shows the type of analysis we can explore with the traces and metrics that can be gathered today.

With SVE-enabled silicon on the horizon, it is of great importance to provide developers and manufacturers with tools to run their codes in preparation for the upcoming hardware. ArmIE enables running SVE code on native 64-bit Armv8-A architectures, with smaller overheads when compared to simulators, allowing for larger and more significant workloads to be run. Furthermore, through integration with DynamoRIO, ArmIE provides dynamic binary instrumentation, with some clients already available: instruction and opcode counting, and memory tracing. On top of this, ArmIE provides an emulation API that empowers developers to build their own emulation clients, focusing on metrics important for their evaluations and methodologies.

Check out our upcoming SVE Hackathon at SC18 or a future Arm HPC workshop for more details. 
