## Table of Contents

## Introduction

WebAssembly (Wasm) is a virtual instruction set architecture (ISA) that is designed to be a portable, efficient, and safe representation of typical end-user applications. As its name suggests, initially it was created for the Web, so it is supported by all major browsers. Nowadays it is also used in other places such as Node.js and as an alternative way to run containers with Docker. Some examples of programs that can be compiled to Wasm are games based on the Unity platform, and productivity applications such as Adobe Photoshop.

This blog post discusses some of the challenges that are faced when one is trying to implement the WebAssembly specification on the 64-bit Arm architecture.

Most of the operations defined by the Wasm specification have simple, essentially one-to-one mappings to the 64-bit instruction set in the Arm architecture (A64). Out of those that are not as straightforward, we are going to look at the vector bitmask instructions: `i8x16.bitmask`

, `i16x8.bitmask`

, `i32x4.bitmask`

, and `i64x2.bitmask`

, operating on 8-, 16-, 32-, and 64-bit elements respectively. They return a 32-bit integer value, where bit *i* in the result corresponds to the most significant bit of vector element *i*, counting from 0. Visualized for 32-bit elements:

The Wasm proposal also describes the intended use case for these instructions. It is the case where we perform some processing in a loop using vector operations, followed by scalar post-processing. The bitmask operations act as the interface between both execution phases.

## Implementation

### 8-bit elements

Let us start with the 8-bit element case, that is the `i8x16.bitmask`

operation. The closest A64 equivalent is the `BEXT`

instruction. This is one of the optional bit permute operations (also known as the `FEAT_SVE_BitPerm`

feature) that have been introduced by the second version of the Scalable Vector Extension (SVE2). However, `BEXT`

differs from `i8x16.bitmask`

in several aspects:

- Is more generic: it can collect arbitrary input bits (instead of just the most significant bits), controlled by a second argument.
- Produces its result in a vector register instead of a general-purpose one, where integer scalars usually reside while executing Wasm code.
- Does not operate on the 128-bit vector as a whole, but at most on 64-bit parts of it independently. Note that
`BEXT`

as an SVE2 instruction could work with much larger vectors, but here we are only interested in the lowest 128 bits. The rest is ignored, since WebAssembly specifies the vector size as 128 bits.

With that in mind, here is a possible implementation:

movi vt.16b, #128 bext zt.d, zs.d, zt.d fmov xt, vt.d[1] fmov xd, dt orr wd, wd, wt, lsl #8

where `Vs`

/`Zs`

is the input vector register, `Wd`

/`Xd`

- the output general-purpose register, and `Dt`

/`Qt`

/`Vt`

/`Zt`

and `Wt`

/`Xt`

- temporary vector and general-purpose registers respectively. Note that we combine the output bits from each 64-bit half of the vector register with a scalar integer operation instead of vector instructions, and the moves from SIMD & floating-point to general-purpose registers that are right before the last instruction are independent, so a processor could execute them in parallel, if it is able to, making the cost of the second move essentially negligible. In general, this is an important factor to keep in mind when designing the sequence of operations necessary to express a computation.

If the machine that we would like to run our code on does not support the bit permute operations, we could explore an alternative approach. While `BEXT`

collects the input bits into the least significant bits of the output, we can use the most significant ones instead. This is achievable with regular arithmetic operations. To make our lives easier, we should make sure that the input vector elements have at most one bit set. This can be done with an unsigned right shift, for instance. Arithmetic instructions operate on at most 64-bit values, so initially we are going to process each set of 8 8-bit input vector elements separately.

Consider element 0 from the input: After the previous transformation, it corresponds to bit 0 in the intermediate value. We must shift it into bit 56, the first bit from the most significant 8 bits. Remembering that a left shift by 1 bit corresponds to a multiplication by 2, this is equivalent to multiplying by 2^{56} (or 0x100000000000000 in hexadecimal notation). The other input bits are greater than or equal to 2^{8}, so they will result in a value that is 2^{64} or larger. It is not going to fit into our 64-bit output, and hence will be discarded.

Now consider element 1 and bit 8 respectively. We must shift the latter into bit 57, so we multiply by 2^{49} / 0x2000000000000 to obtain an intermediate value that must be added to the one from the previous step. As before, the bits associated with the rest of the elements would be discarded. However, there is one exception, bit 0, which would be shifted into bit 49. Since it is outside of the range where we are collecting our result (bits 56 to 63 inclusive), there is no problem. Furthermore, bit 49 of the intermediate value from the previous step in the calculation is 0. So, there is no risk that a carry would perturb the bits we are interested in when we perform the addition. Following a similar logic for the rest of the input vector elements, we arrive at our final multiplier, 0x102040810204080. The only thing that is left is to combine the resultant bits for both sets of input vector elements. As a side note, techniques similar to this one are frequently referred to as "SIMD within a register" (SWAR).

If we decide to implement this sequence of operations with vector instructions, we quickly discover an issue. The vector `MUL`

instruction that we expect to use does not support 64-bit vector elements. Fortunately there is an alternative, namely the `PMULL`

instruction provided by the Armv8 Cryptographic Extension (and the `FEAT_AES`

feature in particular), which does multiplication of polynomials over `{0, 1}`

(also referred to as carryless multiplication). Since we have been careful to avoid carries in our intermediate values, we could treat that instruction as just a regular multiplication. Furthermore, the final combining step can be done with the `TRN2`

instruction. That gives us the following implementation:

ldr qt, 1f ushr vt2.16b, vs.16b, #7 pmull2 vt3.1q, vt.2d, vt2.2d pmull vt.1q, vt.1d, vt2.1d trn2 vt.8b, vt.8b, vt3.8b umov wd, vt.h[3] ... 1: .byte 128, 64, 32, 16, 8, 4, 2, 1, 128, 64, 32, 16, 8, 4, 2, 1

where `Vt2`

and `Dt3`

/`Vt3`

are temporary vector registers.

Alternatively, we could implement the same approach purely with scalar integer instructions which has the advantage that it does not require any ISA extensions. We are also going to make use of a performance improvement that has been introduced with the Cortex-A77 processor (and its successors). The 64-bit multiplication instructions have significantly lower latency and higher throughput (refer to the Arm Cortex-A77 Software Optimization Guide for more details). So with a slight adjustment to the multiplier to avoid a couple of shift operations (clearing all bits except the most significant ones in the input instead), we obtain the following instruction sequence:

ldr xd, =0x2040810204081 fmov xt, vs.d[1] fmov xt2, ds and xt, xt, #0x8080808080808080 and xt2, xt2, #0x8080808080808080 mul xt, xd, xt mul xt2, xd, xt2 lsr xd, xt, #48 bfxil xd, xt2, #56, #8

where `Xt2`

is another temporary general-purpose register.

Let us return to the original approach of collecting the input bits directly into the least significant bits of the output. The latter can be viewed as a sum of powers of 2; for example, the input vector element 3 corresponds to 2^{3} or 8. So let us assume that each input vector element has a value of either 0 or 1. Then all we have to do is to multiply them by the respective power of 2 and then sum all the products. In other words, we have to perform a dot product. The Armv8.2 architecture extension does in fact provide operations like these, covered by the `FEAT_DotProd`

feature. Unfortunately the instructions calculate at most 32-bit results, so we must use a couple of other combining operations. Finally, note that when an input vector element is interpreted as a signed 8-bit integer, if its most significant bit is set, then it would be a negative number. As a result, we can alternatively transform it into either -1 or 0 with a single operation - `CMLT`

with 0 as its second (immediate) operand.

In fact, all vector comparison instructions produce results with our expected values. So we could omit that initial normalization operation if we determine that all inputs come from such instructions. The same holds for bitwise logical operations such as `ORR`

on the results of vector comparisons, but we will not make any assumptions here. After that, we must use negative powers of 2 and the `SDOT`

instruction to perform the dot product which gives us the following implementation:

ldr qt, 1f cmlt vt2.16b, vs.16b, #0 movi dt3, #0 sdot vt3.4s, vt.16b, vt2.16b addp vt.4s, vt3.4s, vt3.4s umov wt, vt.b[4] fmov wd, st orr wd, wd, wt, lsl #8 ... 1: .byte -1, -2, -4, -8, -16, -32, -64, -128, -1, -2, -4, -8, -16, -32, -64, -128

One mandatory feature that is part of the Armv8.6 architecture extension is `FEAT_I8MM`

- the 8-bit integer matrix multiplication operations. We can make the observation that the `SMMLA`

instruction is quite similar to the combination of `SDOT`

and `ADDP`

in the previous sequence, which gives us the following alternative:

ldr dt, 1f cmlt vt2.16b, vs.16b, #0 movi dt3, #0 smmla vt3.4s, vt.16b, vt2.16b fmov xd, dt3 orr xd, xd, xd, lsr #24 mov wd, wd ... 1: .byte -1, -2, -4, -8, -16, -32, -64, -128

This instruction sequence also provides a potential simplification. If we know that the consumer of the result ignores the most significant 32 bits (because it operates on `Wd`

instead of `Xd`

, for example), then we can omit the last move.

Finally, for reference here is the implementation that has been provided when the bitmask operations have been proposed for Wasm:

ldr qt, 1f sshr vt2.16b, vs.16b, #7 and vt.16b, vt.16b, vt2.16b ext vt2.16b, vt.16b, vt.16b, #8 zip1 vt.16b, vt.16b, vt2.16b addv ht, vt.8h umov wd, vt.h[0] ... 1: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128

### 16-bit elements

We can use similar tricks as in the 8-bit case to derive instruction sequences for 16-bit elements.

Here is the implementation from the Wasm proposal:

ldr qt, 1f sshr vt2.8h, vs.8h, #15 and vt.16b, vt.16b, vt2.16b addv ht, vt.8h umov wd, vt.h[0] ... 1: .2byte 1, 2, 4, 8, 16, 32, 64, 128

An approach using scalar integer multiplication:

ldr xd, =0x200040008001 fmov xt, vs.d[1] fmov xt2, ds and xt, xt, #0x8000800080008000 and xt2, xt2, #0x8000800080008000 mul xt, xd, xt mul xt2, xd, xt2 lsr xd, xt, #56 bfxil xd, xt2, #60, #4

A sequence based on the `PMULL`

instruction:

ushr vt.8h, vs.8h, #15 ldr dt2, 1f xtn vt.8b, vt.8h pmull vt.1q, vt.1d, vt2.1d umov wd, vt.b[7] ... 1: .byte 128, 64, 32, 16, 8, 4, 2, 1

An alternative that uses a dot product:

ldr qt, 1f cmlt vt2.8h, vs.8h, #0 movi dt3, #0 sdot vt3.4s, vt.16b, vt2.16b addp vt.4s, vt3.4s, vt3.4s umov wd, vt.b[4] fmov wt, st orr wd, wd, wt ... 1: .byte -1, 0, -2, 0, -4, 0, -8, 0, -16, 0, -32, 0, -64, 0, -128, 0

The corresponding implementation with a matrix multiplication:

ldr qt, 1f cmlt vt2.8h, vs.8h, #0 movi dt3, #0 smmla vt3.4s, vt.16b, vt2.16b umov wd, vt3.b[12] fmov wt, st3 orr wd, wd, wt ... 1: .byte -1, 0, -2, 0, -4, 0, -8, 0, -16, 0, -32, 0, -64, 0, -128, 0

And with the `BEXT`

instruction:

movi vt.8h, #128, lsl #8 bext zt.d, zs.d, zt.d fmov xd, vt.d[1] fmov xt, dt orr wd, wt, wd, lsl #4

### 32-bit elements

Here is the implementation from the Wasm proposal:

ldr qt, 1f sshr vt2.4s, vs.4s, #31 and vt.16b, vt.16b, vt2.16b addv st, vt.4s fmov wd, st ... 1: .4byte 1, 2, 4, 8

With fewer input vector elements to process, we can come up with a simplified implementation that uses scalar integer operations:

fmov xd, vs.d[1] fmov xt, ds and xd, xd, #0x8000000080000000 and xt, xt, #0x8000000080000000 orr xd, xd, xd, lsl #31 orr xt, xt, xt, lsl #31 lsr xd, xd, #60 bfxil xd, xt, #62, #2

An alternative that uses vector polynomial multiplication:

ushr vt.4s, vs.4s, #31 ldr dt2, 1f xtn vt.4h, vt.4s pmull vt.1q, vt.1d, vt2.1d umov wd, vt.b[7] ... 1: .2byte 0x800, 0x400, 0x200, 0x100

An instruction sequence that uses a matrix multiplication:

ldr qt, 1f cmlt vt2.4s, vs.4s, #0 movi dt3, #0 smmla vt3.4s, vt.16b, vt2.16b umov wd, vt3.b[12] fmov wt, st3 orr wd, wd, wt ... 1: .byte -1, 0, 0, 0, -2, 0, 0, 0, -4, 0, 0, 0, -8, 0, 0, 0

And another application of the `BEXT`

instruction:

movi vt.4s, #128, lsl #24 bext zt.d, zs.d, zt.d fmov xd, vt.d[1] fmov xt, dt orr wd, wt, wd, lsl #2

### 64-bit elements

There are only 2 input bits to deal with which can be done with a simple implementation that uses only scalar integer operations:

fmov xd, vs.d[1] fmov xt, ds lsr xd, xd, #62 bfxil xd, xt, #63, #1

## Evaluation

Now we have plenty of candidate instruction sequences, but how do we choose one? To this end we can wrap each instruction sequence (taking the one for 64-bit elements as an example) with an assembly function such as:

.arch armv8-a .text .align 6 .global scalar #ifndef _WIN64 .type scalar, %function #endif // _WIN64 scalar: .cfi_startproc mov x1, x0 .p2align 6 1: .rept 1024 fmov x0, v0.d[1] fmov x2, d0 lsr x0, x0, #62 bfxil x0, x2, #63, #1 #ifdef LATENCY dup v0.2d, x0 #endif // LATENCY .endr subs x1, x1, #1 b.ne 1b ret .cfi_endproc #ifndef _WIN64 .size scalar, .-scalar #endif // _WIN64

This function has the following C prototype:

uint32_t scalar(size_t n, uint64x2_t v);

where the first parameter, `n`

, specifies the number of times that the corresponding bitmask operation should execute. While the second parameter, `v`

, is just some input data. Note that we have chosen instruction sequences whose runtime is independent of the input values.

Then to complete the benchmark implementation we can just obtain the current time before and after running the function and calculate the elapsed time duration: The smaller the number, the better the candidate instruction sequence. This can be achieved with calls such as:

`clock_gettime(CLOCK_MONOTONIC, ...)`

on POSIX-like systems`QueryPerformanceCounter()`

on Windows

There are a couple of important points that are associated with this measurement method: The branch layout (such as branch density) and alignment (especially alignment of branch targets) might have a disproportionate effect on assembly microbenchmarks such as this one. So the function above enforces an alignment on 64 bytes where appropriate. It is also mostly laid out as a very long block of linear code without branches that consists of repeated copies of the instruction sequence we are looking at. Furthermore, there are at least 2 values that we might be interested in measuring. One of them is latency, that is the amount of time it takes until the result of the bitmask operation is ready for another (different) operation that depends on it. The other number is throughput - the number of independent bitmask operations that we are able to execute in parallel. Our assembly function can be used to obtain both numbers. If the `LATENCY`

macro is defined, then each replicated instruction sequence is made a part of the same very long dependency chain. In that case we can obtain a latency value by dividing the total runtime by the number of executed sequences, similarly for throughput.

We perform the evaluation on 2 processors - Cortex-A55 and Cortex-X1. Smaller and more power-efficient processors such as Cortex-A55 may sometimes have surprising performance characteristics compared to their much more powerful alternatives like Cortex-X1. Therefore, that choice provides a reasonable coverage of the spectrum of processors that are expected to execute Wasm code. Unfortunately, neither of these processors support the `FEAT_I8MM`

and `FEAT_SVE_BitPerm`

features, so we will have to exclude the corresponding instruction sequences from the evaluation.

The results are presented as a set of tables - one for each vector element size (except for 64-bit elements, in which case we have only one candidate implementation). All values in the tables are speedups relative to the instruction sequence suggested in the Wasm proposal, so higher numbers are better (and those below 100% are regressions).

### 8-bit elements

Processor | Latency | Throughput | ||||
---|---|---|---|---|---|---|

Scalar | PMULL | SDOT | Scalar | PMULL | SDOT | |

Cortex-A55 | 114.22% | 120.43% | 88.85% | 105.09% | 124.93% | 83.04% |

Cortex-X1 | 163.57% | 156.49% | 114.41% | 106.93% | 120.00% | 147.95% |

### 16-bit elements

Processor | Latency | Throughput | ||||
---|---|---|---|---|---|---|

Scalar | PMULL | SDOT | Scalar | PMULL | SDOT | |

Cortex-A55 | 88.58% | 116.67% | 71.84% | 81.71% | 137.39% | 64.42% |

Cortex-X1 | 127.14% | 107.66% | 88.93% | 64.36% | 147.73% | 90.28% |

### 32-bit elements

Processor | Latency | Throughput | ||
---|---|---|---|---|

Scalar | PMULL | Scalar | PMULL | |

Cortex-A55 | 103.23% | 108.26% | 123.63% | 125.00% |

Cortex-X1 | 117.93% | 92.34% | 79.78% | 157.78% |

## Conclusion

Given the intended use case for the bitmask instructions, we are going to prioritize minimizing the operation latency.

Our results are a typical example of the kind of compromises one has to make when targeting a broad range of processors (in terms of performance characteristics) simultaneously. Consider the 8-bit vector element case: While the best implementation for Cortex-A55 is the one using the `PMULL`

instruction, the optimal sequence for Cortex-X1 is based on scalar multiplications. Fortunately, all of those alternatives are better than our baseline. So, we are able to choose one depending on what processor we expect to encounter most often in practice.

However, the choice is not as simple in the 16- and 32-bit element cases. The reason is that the best sequence for one of the processors might lead to a regression on the other. A compromise is still possible, however. Namely the implementation using vector polynomial multiplication in the 16-bit case and the one based on scalar integer instructions for 32-bit vector elements.

## Further reading

If your application uses operations similar to the Wasm vector bitmask instructions, but not necessarily exactly the same, the previous blog Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more may provide useful information, including alternatives to the x86 Move Byte Mask (`PMOVMSKB`

) instruction.