The Arm Instruction Emulator (ArmIE) is a tool that converts instructions not supported by the hardware, such as those from the Scalable Vector Extension (SVE) instruction set, into native Armv8-A instructions. ArmIE enables developers to run and test SVE binaries on existing Armv8-A hardware without resorting to high-overhead simulators. This approach trades performance accuracy (e.g., ArmIE does not provide any timing information) for faster application execution. As a result, larger and more realistic applications can be run, coupled with dynamic binary instrumentation.
Dynamic binary instrumentation support is provided through DynamoRIO integration, extending ArmIE capabilities beyond simple emulation. Instrumentation enables the collection of dynamic characteristics and metrics from the executing application, such as memory traces and instruction counts, allowing a deeper and more insightful analysis. Given the wide range of instrumentation that can be used and metrics that can be gathered with ArmIE and DynamoRIO, we have added the ability to instrument emulated instructions to the DynamoRIO API, so that developers can build their own DynamoRIO clients with access to emulated instructions when required. To help explain how the emulated instruction instrumentation functions of the API work, we provide four example instrumentation clients with emulation support, together with their source code. These clients are based on existing DynamoRIO ones and are as follows:
The structure adopted by ArmIE can be seen in the following diagram. Conceptually, ArmIE consists of an emulation client (currently for SVE) and optional instrumentation clients (e.g., instruction count), which communicate with each other using the emulator API. Additional information on the clients, how ArmIE works, and how to set it up can be found in the documentation provided with the tool and on the ArmIE website.
To generate an SVE binary, you need an SVE-capable compiler, such as the Arm HPC Compiler or GCC 8.2+, and you must enable the SVE architecture flag (e.g., -march=armv8-a+sve). For all examples in this section, we use Arm HPC Compiler 18.4 and the latest version of ArmIE available at the time of writing, 18.4.
We are going to look at the HACCKernels mini-app, which implements HACC's particle force kernels. We change the makefile to point to the Arm HPC compiler (armclang++) and add the necessary SVE flag to it (-march=armv8-a+sve). We also remove the OpenMP flag for these examples, in order to simplify the instrumentation output analysis.
On the source code side, we make small modifications to the main.cpp file. We reduce the number of iterations (int NumIters) from 2000 to 500 and run only the 5th order kernel out of the three available kernels, in order to reduce the execution time for the evaluation presented here (a reduction of roughly 12x: 4x fewer iterations and one kernel instead of three). We chose the 5th order kernel because running it alone provides a clear breakdown of the SVE impact in HACCKernels, both in instruction utilization and in memory accesses. No other changes to the code are made at this point.
Important Note: ArmIE is not capable of producing timing information and incurs an emulation and binary instrumentation overhead on the running application. Therefore, no conclusions about real-time performance should be drawn from these results.
We start with the instruction count client (inscount), choosing a vector length of 512 bits. This client counts all the dynamic instructions that are executed by the binary, separating SVE instructions from AArch64 instructions. At this point in time, there is no breakdown on the types of instructions available in the client. Additionally, in version 18.4, the inscount client prints the emulated SVE instruction opcodes (and PC) to output. This can be decoded to obtain extra information about which instructions were executed. We will focus more on this in the next section.
$ armie -msve-vector-bits=512 -i libinscount_emulated.so -- ./HACCKernels
Gravity Short-Range-Force Kernel (5th Order): 9178.27 -835.505 -167.99: 42.9214 s
205464290 instructions executed of which 167576110 were emulated instructions
From this inscount run, we can observe a very high number of emulated SVE instructions (81.56% of the total instructions), which demonstrates good use of the vector extension. The default mode of the inscount client counts all executed instructions, including those from shared libraries. We can enable a client flag that excludes shared libraries from the count, which leads to a higher SVE utilization rate of 83.00%. The example below shows how to run the inscount client with this flag, together with its result. The run command for this case differs from the previous one: it is the DynamoRIO command that ArmIE uses to load and run the emulation and instrumentation clients. This underlying DynamoRIO command can be displayed by running the ArmIE command with the -s option. In this case, the "-only_from_app" string is passed to the instrumentation client, libinscount_emulated.so, as a parameter so that only instructions in the application itself are counted.
$ $ARM_INSTRUCTION_EMULATOR_DIR/bin64/drrun -client $ARM_INSTRUCTION_EMULATOR_DIR/lib64/release/libsve_512.so 0 "" -client $ARM_INSTRUCTION_EMULATOR_DIR/samples/bin64/libinscount_emulated.so 1 "-only_from_app" -max_bb_instrs 32 -max_trace_bbs 4 -- ./HACCKernels
Gravity Short-Range-Force Kernel (5th Order): 9178.27 -835.505 -167.99: 42.7263 s
201887951 instructions executed of which 167576110 were emulated instructions
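The utilization rates quoted above can be reproduced from the inscount summary line. The following is a minimal Python sketch, not part of ArmIE; the function name sve_utilization is hypothetical, and the sketch assumes the summary line has exactly the wording shown in the two runs above.

```python
import re

# Summary line printed by the inscount client in the runs shown above.
SUMMARY = re.compile(
    r"(\d+) instructions executed of which (\d+) were emulated instructions"
)

def sve_utilization(output: str) -> float:
    """Return the percentage of executed instructions that were emulated."""
    match = SUMMARY.search(output)
    if match is None:
        raise ValueError("no inscount summary line found")
    total, emulated = (int(group) for group in match.groups())
    return 100.0 * emulated / total

# Summary lines copied from the two runs above.
all_code = "205464290 instructions executed of which 167576110 were emulated instructions"
app_only = "201887951 instructions executed of which 167576110 were emulated instructions"

print(f"all code: {sve_utilization(all_code):.2f}%")  # 81.56%
print(f"app only: {sve_utilization(app_only):.2f}%")  # 83.00%
```

The same function could be applied to the output of runs at other vector lengths to compare utilization.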
With the inscount client, we can also quickly compare the SVE utilization between different vector lengths. The table below shows the SVE utilization for vector lengths between 128 bits and 1024 bits.
The main takeaway from this table is that the total number of SVE instructions decreases as the vectors get wider. This is expected, since wider vectors can hold more data and perform more simultaneous operations, so fewer SVE instructions are needed for the same work.
Similarly to the inscount client, the opcodes client reports the dynamic count of the total number of instructions executed, broken down by opcode. This client is useful for understanding the 'hotness' factor of SVE instructions, and to correlate it against the application's source code. Non-SVE opcodes are decoded by DynamoRIO, resulting in the corresponding mnemonics that can be seen in the output below.
$ armie -msve-vector-bits=512 -i libopcodes_emulated.so -- ./HACCKernels
Gravity Short-Range-Force Kernel (5th Order): 9178.27 -835.505 -167.99: 85.9275 s
Opcode execution counts in AArch64 mode:
184763 : ubfm
224217 : cbnz
236845 : and
253632 : ldrb
481172 : adrp
624493 : orr
739335 : add
810385 : fadd
1017337 : subs
1172879 : ldr
1320770 : fmadd
2792022 : xx
3127984 : str
4263314 : fcvt
5342564 : bcond
5833081 : fmul
8473704 : eor
77 unique emulated instructions written to undecoded.txt
Please note that the provided script is written for a generic case where a single encoding can be passed to it, and not specifically for this client. Hence, we need to remove the instruction count, present in the undecoded.txt file, when running the script, to avoid any incompatibilities. Below we extract the encodings from the generated file, run them through the script and paste back together the instruction count with the respective decoding, all in a single command line:
$ awk '{print $3}' undecoded.txt | $ARM_INSTRUCTION_EMULATOR_DIR/bin64/enc2instr.py -mattr=+sve | awk -F: '{print $2}' | paste undecoded.txt /dev/stdin
4150900 : 0xa5484c9b ld1w {z27.s}, p3/z, [x4, x8, lsl #2]
4150900 : 0xa5484479 ld1w {z25.s}, p1/z, [x3, x8, lsl #2]
4150900 : 0xa5484458 ld1w {z24.s}, p1/z, [x2, x8, lsl #2]
4150900 : 0xa5484437 ld1w {z23.s}, p1/z, [x1, x8, lsl #2]
4150900 : 0x65b9033a fmla z26.s, p0/m, z25.s, z25.s
4150900 : 0x65b8031b fmla z27.s, p0/m, z24.s, z24.s
4150900 : 0x65b68359 fmad z25.s, p0/m, z26.s, z22.s
4150900 : 0x65b58358 fmad z24.s, p0/m, z26.s, z21.s
4150900 : 0x65b4837a fmad z26.s, p0/m, z27.s, z20.s
4150900 : 0x65b3e35b fnmsb z27.s, p0/m, z26.s, z19.s
4150900 : 0x65b2e35b fnmsb z27.s, p0/m, z26.s, z18.s
4150900 : 0x65b1e35b fnmsb z27.s, p0/m, z26.s, z17.s
4150900 : 0x65a6635b fnmls z27.s, p0/m, z26.s, z6.s
4150900 : 0x65a58357 fmad z23.s, p0/m, z26.s, z5.s
4150900 : 0x659c0bbc fmul z28.s, z29.s, z28.s
4150900 : 0x659c035a fadd z26.s, z26.s, z28.s
4150900 : 0x659b0b5a fmul z26.s, z26.s, z27.s
4150900 : 0x65970afa fmul z26.s, z23.s, z23.s
4150900 : 0x65922343 fcmeq p3.s, p0/z, z26.s, #0.0
4150900 : 0x658da39d fsqrt z29.s, p0/m, z28.s
4150900 : 0x658c80fc fdivr z28.s, p0/m, z28.s, z7.s
4150900 : 0x6584035c fadd z28.s, z26.s, z4.s
4150900 : 0x65834342 fcmge p2.s, p0/z, z26.s, z3.s
4150900 : 0x65820739 fsub z25.s, z25.s, z2.s
4150900 : 0x65810718 fsub z24.s, z24.s, z1.s
4150900 : 0x658006f7 fsub z23.s, z23.s, z0.s
4150900 : 0x25a91d01 whilelo p1.s, x8, x9
4150900 : 0x25834042 orr p2.b, p0/z, p2.b, p3.b
4150900 : 0x25034023 and p3.b, p0/z, p1.b, p3.b
4150900 : 0x25004243 not p3.b, p0/z, p2.b
4150900 : 0x05b9cad9 mov z25.s, p2/m, z22.s
4150900 : 0x05b8cab8 mov z24.s, p2/m, z21.s
4150900 : 0x05b7c8b7 mov z23.s, p2/m, z5.s
4150900 : 0x05b6c736 mov z22.s, p1/m, z25.s
4150900 : 0x05b5c715 mov z21.s, p1/m, z24.s
4150900 : 0x05a5c6e5 mov z5.s, p1/m, z23.s
4150900 : 0x04b0e3e8 incw x8
4150900 : 0x0420bf7a movprfx z26, z27
4150900 : 0x0420bf5b movprfx z27, z26
4150900 : 0x0420be1b movprfx z27, z16
166036 : 0x25351d00 whilelo p0.b, x8, x21
166036 : 0x252c8808 incp x8, p0.b
83018 : 0xe4084000 st1b {z0.b}, p0, [x0, x8]
50000 : 0x658022c0 faddv s0, p0, z22.s
50000 : 0x658022a1 faddv s1, p0, z21.s
50000 : 0x658020a2 faddv s2, p0, z5.s
50000 : 0x25b9ce07 fmov z7.s, #1.00000000
50000 : 0x25b8c005 mov z5.s, #0 // =0x0
50000 : 0x25a91fe1 whilelo p1.s, xzr, x9
50000 : 0x2598e3e0 ptrue p0.s
50000 : 0x05242294 mov z20.s, s20
50000 : 0x05242273 mov z19.s, s19
50000 : 0x05242252 mov z18.s, s18
50000 : 0x05242231 mov z17.s, s17
50000 : 0x05242210 mov z16.s, s16
50000 : 0x052420a6 mov z6.s, s5
50000 : 0x05242084 mov z4.s, s4
50000 : 0x05242063 mov z3.s, s3
50000 : 0x05242042 mov z2.s, s2
50000 : 0x05242021 mov z1.s, s1
50000 : 0x05242000 mov z0.s, s0
50000 : 0x046530b6 mov z22.d, z5.d
50000 : 0x046530b5 mov z21.d, z5.d
41509 : 0xe4084380 st1b {z0.b}, p0, [x28, x8]
41509 : 0xe40842c0 st1b {z0.b}, p0, [x22, x8]
10500 : 0x253c1d00 whilelo p0.b, x8, x28
10500 : 0x04285028 addvl x8, x8, #1
3500 : 0xe40842e0 st1b {z0.b}, p0, [x23, x8]
3500 : 0xe4084280 st1b {z0.b}, p0, [x20, x8]
3500 : 0xe4084260 st1b {z0.b}, p0, [x19, x8]
3500 : 0x2538c000 mov z0.b, #0 // =0x0
3000 : 0x04bf5028 rdvl x8, #1
2000 : 0x25351fe0 whilelo p0.b, xzr, x21
1500 : 0x253c1fe0 whilelo p0.b, xzr, x28
500 : 0x04bf5029 rdvl x9, #1
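To get a quick view of which SVE mnemonics are "hottest", the decoded listing above can be aggregated per mnemonic. This is a minimal Python sketch rather than a tool shipped with ArmIE; the function name counts_by_mnemonic is hypothetical, and it assumes each line has the "count : encoding mnemonic operands" layout shown above.

```python
import re
from collections import Counter

# Matches lines such as "4150900 : 0xa5484c9b ld1w {z27.s}, p3/z, [x4, x8, lsl #2]".
LINE = re.compile(r"^\s*(\d+)\s*:\s*(0x[0-9a-fA-F]+)\s+(\S+)")

def counts_by_mnemonic(lines):
    """Sum dynamic execution counts per mnemonic, skipping unparsable lines."""
    totals = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            count, _encoding, mnemonic = match.groups()
            totals[mnemonic] += int(count)
    return totals

# A few lines copied from the decoded listing above.
sample = [
    "4150900 : 0xa5484c9b ld1w {z27.s}, p3/z, [x4, x8, lsl #2]",
    "4150900 : 0xa5484479 ld1w {z25.s}, p1/z, [x3, x8, lsl #2]",
    "4150900 : 0x65b9033a fmla z26.s, p0/m, z25.s, z25.s",
    "83018 : 0xe4084000 st1b {z0.b}, p0, [x0, x8]",
]

# Print mnemonics from most to least executed.
for mnemonic, total in counts_by_mnemonic(sample).most_common():
    print(f"{total:>10} {mnemonic}")
```

Feeding the full decoded output into the function (for example, by reading it from standard input) would give the per-mnemonic totals for the whole run.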
The memory tracing client (memtrace) focuses on the dynamic memory accesses of the application, capturing information such as the accessed addresses and data sizes. It is based on the existing non-SVE DynamoRIO memtrace client, with added SVE emulation and tracing support. Running the emulated memtrace client results in two different memory trace files: an SVE-only trace and a non-SVE one. To keep the memory traces consistent, we include an additional field, 'Sequence Number', that records the order of each memory access through a counter shared between the emulation side and the core DynamoRIO instrumentation. The memory trace format is the following:
We also add the 'SVE Bundle' field to the memory traces, which identifies SVE linear and gather/scatter vector accesses. It consists of 3 bits with the following possible combinations appearing in the resulting trace:
An important consideration when tracing SVE binaries is that the output trace can easily use up a large amount of disk space. Therefore, we support marker instructions that developers must include in their SVE code to define the start and end of the regions (multiple regions are supported) where the memtrace client will execute. In a typical scenario, these regions correspond to the main kernel loops of the application. Note that only the code inside these markers is traced; if no markers are used, no tracing is done. The markers should also be placed outside vectorizable loops, as they might hinder vectorization.
In the case of the mini-app we are using, HACCKernels, we add the marker definitions at the start of the main.cpp file and define the region of interest around the main kernel, GravityForceKernel5:
#define __START_TRACE() {asm volatile (".inst 0x2520e020");}
#define __STOP_TRACE() {asm volatile (".inst 0x2520e040");}
....
__START_TRACE();
run(GravityForceKernel5, "5th Order");
__STOP_TRACE();
After adding the region markers to the code, we can then compile it and run it with the memtrace client (again with 512-bit vectors):
$ armie -e libmemtrace_sve_512.so -i libmemtrace_simple.so -- ./HACCKernels
> Data file /home/migtai01/apps-unimplemented/HACCKernels_sve_vectorizer/memtrace.HACCKernels.03531.0000.log created
> Gravity Short-Range-Force Kernel (5th Order): 9178.27 -835.505 -167.99: 73.5272 s
$ ls
> memtrace.HACCKernels.0000.log
> sve-memtrace.HACCKernels.8114.log
Given the large size of the combined trace files (over 22M trace lines) we cannot show the entirety of the trace here, so we focus on a small snippet. For analysis purposes, it is advantageous to merge both the non-SVE and SVE trace files into a single one. This can be done with a simple script that parses the separate memory trace files and orders them into a single full trace output, based on the 'Sequence Number' trace field. Such a script is not included with ArmIE at this point in time.
To facilitate analysis of the fully merged memory trace, we use different separator characters after the first element of each trace line: a colon ' : ' separator for non-SVE traces, and a comma ' , ' separator for SVE traces. This can be seen below, in the merged memory trace snippet of HACCKernels:
Format: <sequence number>: <TID>, <isBundle>, <isWrite>, <data size>, <data address>, <PC>
....
2990: 0, 0, 0, 4, 0x401a68, 0x401678
2991: 0, 0, 0, 4, 0x401a6c, 0x401680
2992: 0, 0, 0, 4, 0x401a70, 0x401688
2993: 0, 0, 0, 4, 0x401a74, 0x401690
2994: 0, 0, 0, 4, 0x401a78, 0x401698
2995: 0, 0, 0, 4, 0x401a7c, 0x4016a0
2996, 0, 0, 0, 64, 0x44c750, 0x4016f0
2997, 0, 0, 0, 64, 0x44d110, 0x4016f4
2998, 0, 0, 0, 64, 0x44dad0, 0x4016f8
2999, 0, 0, 0, 4, 0x44e490, 0x401750
3000, 0, 0, 0, 16, 0x44e49c, 0x401750
3001, 0, 0, 0, 4, 0x44e4b0, 0x401750
....
The memory trace snippet shows an initial section with non-SVE memory accesses (traces 2990 to 2995), followed by SVE accesses (traces 2996 to 3001), as indicated by the separator character after the sequence number. Only loads are captured in this snippet, but writes are present in the full memory trace. Looking at the data size field, we can also observe three full SVE vector loads (64 bytes, matching the 512-bit vector length).
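Since no merge script ships with ArmIE, the merge step described above could be sketched in Python as follows. This is an illustrative sketch only: the function names read_trace and merge_traces are hypothetical, and the code assumes that every trace line starts with its sequence number followed by ':' or ',', and that each of the two files is already in ascending sequence order (if that does not hold, the entries would need to be sorted instead).

```python
import heapq
import re

# A trace line starts with the sequence number, followed by ':' (non-SVE)
# or ',' (SVE). Anything else (headers, blank lines) is skipped.
SEQ = re.compile(r"^\s*(\d+)\s*[:,]")

def read_trace(path):
    """Yield (sequence_number, line) for every trace line in a file."""
    with open(path) as handle:
        for line in handle:
            match = SEQ.match(line)
            if match:
                yield int(match.group(1)), line.rstrip("\n")

def merge_traces(non_sve_path, sve_path):
    """Yield trace lines from both files in sequence-number order.

    Assumes each input file is already ordered by sequence number, so the
    files can be streamed without loading millions of lines into memory.
    """
    merged = heapq.merge(
        read_trace(non_sve_path),
        read_trace(sve_path),
        key=lambda entry: entry[0],
    )
    for _sequence, line in merged:
        yield line
```

Calling merge_traces with the paths of the two trace files produced by the run above, and writing each yielded line to a new file, would give a single merged trace in the format shown in the snippet.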
Memory traces are commonly used for different types of post-processing analysis. This can encompass a wide range of scripts and tools, from simple parsing scripts to more complex cache simulators. Processing memory traces falls outside the scope of ArmIE and, as such, no extra tools are included with it at this point in time. It is up to developers to integrate the generated traces into their analysis workflow and tools.
Below we present a simple scripting experiment that can be run on the fully merged memory traces. It parses all memory traces and prints related information, such as the number of linear and gather/scatter bundle accesses, the percentage of writes and reads, and the number of accesses with inactive vector lanes. We can observe that the HACCKernels mini-application does not perform a single gather/scatter operation and that load operations dominate the memory accesses.
SVE vector size: 512 bits (64B)
Total Memory References = 20486551
-> linear SVE: 16779970 (81.91%)
-> bundle SVE: 0 (0.00%)
-> non-SVE: 3706581 (18.09%)
Linear SVE accesses with at least 1 inactive lane:
-> 1237914 (7.38% of linear SVE traces)
==============
Total Writes = 3121089 (34 unique writes - different PCs)
-> linear SVE: 176536 (5.66%)
-> bundle SVE: 0 (0.00%)
-> non-SVE: 2944553 (94.34%)
Linear SVE Writes with at least 1 inactive lane:
-> 3376 (1.91% of linear SVE write traces)
==============
Total Loads = 17365462 (51 unique loads - different PCs)
-> linear SVE: 16603434 (95.61%)
-> bundle SVE: 0 (0.00%)
-> non-SVE: 762028 (4.39%)
Linear SVE Loads with at least 1 inactive lane:
-> 1234538 (7.44% of linear SVE loads)
==============
Distribution of memory operations:
-> 15.23% Writes
-> 84.77% Loads
SVE Bundles Stats:
-> 0 SVE bundle accesses
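A reduced version of such a statistics script might look like the following Python sketch. It is not the script that produced the output above: the function name summarize is hypothetical, it only distinguishes SVE from non-SVE lines (by the separator after the sequence number) and loads from writes, and it assumes that an isWrite value of 0 marks a load and any other value marks a write. Bundle and inactive-lane statistics are left out because their encoding is not covered here.

```python
import re

# <sequence number><sep> <TID>, <isBundle>, <isWrite>, <data size>, <data address>, <PC>
# where <sep> is ':' for non-SVE lines and ',' for SVE lines.
TRACE = re.compile(r"^\s*(\d+)\s*([:,])\s*(.*)$")

def summarize(lines):
    """Count SVE/non-SVE accesses and loads/writes in a merged trace."""
    stats = {"total": 0, "sve": 0, "non_sve": 0, "writes": 0, "loads": 0}
    for line in lines:
        match = TRACE.match(line)
        if not match:
            continue  # header, ellipsis, or blank line
        fields = [field.strip() for field in match.group(3).split(",")]
        if len(fields) != 6:
            continue  # not a complete trace record
        stats["total"] += 1
        stats["sve" if match.group(2) == "," else "non_sve"] += 1
        # Assumption: isWrite == 0 means load, anything else means write.
        stats["loads" if fields[2] == "0" else "writes"] += 1
    return stats

# Lines copied from the merged trace snippet shown earlier.
snippet = [
    "2990: 0, 0, 0, 4, 0x401a68, 0x401678",
    "2995: 0, 0, 0, 4, 0x401a7c, 0x4016a0",
    "2996, 0, 0, 0, 64, 0x44c750, 0x4016f0",
    "3001, 0, 0, 0, 4, 0x44e4b0, 0x401750",
]

result = summarize(snippet)
if result["total"]:
    for key in ("sve", "non_sve", "writes", "loads"):
        share = 100.0 * result[key] / result["total"]
        print(f"{key}: {result[key]} ({share:.2f}%)")
```

Because the function accepts any iterable of lines, it can be fed directly from an open file, so the full merged trace does not need to be held in memory.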
In this article, we give an overview of the Arm Instruction Emulator, from its structure to the existing emulation clients and how to use them. We also briefly discuss the types of analysis and studies that can be carried out with the provided clients, including a first look at post-processing analysis of memory traces. As mentioned, more complex post-processing analysis can be done on memory traces, such as cache simulation, as the existing DynamoRIO cache simulator shows for a different type of non-SVE trace. Although it is not yet compatible with ArmIE, it illustrates the type of analysis we can explore with the traces and metrics that can be gathered at this point in time.
With SVE-enabled silicon on the horizon, it is of great importance to provide developers and manufacturers with tools to run their codes in preparation for the upcoming hardware. ArmIE enables running SVE code on native 64-bit Armv8-A architectures, with smaller overheads when compared to simulators, allowing for larger and more significant workloads to be run. Furthermore, through integration with DynamoRIO, ArmIE provides dynamic binary instrumentation, with some clients already available: instruction and opcode counting, and memory tracing. On top of this, ArmIE provides an emulation API that empowers developers to build their own emulation clients, focusing on metrics important for their evaluations and methodologies.
Check out our upcoming SVE Hackathon at SC18 or a future Arm HPC workshop for more details.