
Implementing the WebAssembly bitmask operations on the 64-bit Arm architecture

Anton Kirilov
December 6, 2023
15 minute read time.

Tags: AArch64, NEON, SVE, SIMD and Vector Execution

Table of Contents

  • Introduction
  • Implementation
    • 8-bit elements
    • 16-bit elements
    • 32-bit elements
    • 64-bit elements
  • Evaluation
    • 8-bit elements
    • 16-bit elements
    • 32-bit elements
  • Conclusion
  • Further reading

Introduction

WebAssembly (Wasm) is a virtual instruction set architecture (ISA) that is designed to be a portable, efficient, and safe representation of typical end-user applications. As its name suggests, initially it was created for the Web, so it is supported by all major browsers. Nowadays it is also used in other places such as Node.js and as an alternative way to run containers with Docker. Some examples of programs that can be compiled to Wasm are games based on the Unity platform, and productivity applications such as Adobe Photoshop.

This blog post discusses some of the challenges that are faced when one is trying to implement the WebAssembly specification on the 64-bit Arm architecture.

Most of the operations defined by the Wasm specification have simple, essentially one-to-one mappings to the 64-bit instruction set in the Arm architecture (A64). Out of those that are not as straightforward, we are going to look at the vector bitmask instructions: i8x16.bitmask, i16x8.bitmask, i32x4.bitmask, and i64x2.bitmask, operating on 8-, 16-, 32-, and 64-bit elements respectively. They return a 32-bit integer value, where bit i in the result corresponds to the most significant bit of vector element i, counting from 0. Visualized for 32-bit elements:

[Figure: i32x4.bitmask - bit i of the 32-bit result is the most significant bit of 32-bit vector element i]
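
For reference, the semantics can be expressed in scalar C; this is a hedged sketch (the helper name and the array-based signature are ours, not part of any Wasm runtime API):

#include <stdint.h>

/* Scalar reference for i32x4.bitmask: bit i of the result is the most
   significant bit of 32-bit vector element i. */
static uint32_t i32x4_bitmask_ref(const uint32_t lanes[4]) {
    uint32_t mask = 0;
    for (int i = 0; i < 4; i++)
        mask |= (lanes[i] >> 31) << i; /* MSB of lane i becomes bit i */
    return mask;
}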

The Wasm proposal also describes the intended use case for these instructions: some processing is performed in a loop using vector operations, followed by scalar post-processing, with the bitmask operations acting as the interface between the two execution phases (sketched below).
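
As an illustration, here is a hedged C sketch of that pattern using Neon intrinsics; the function, the threshold-based predicate, and the use of __builtin_ctz (GCC/Clang) are our choices rather than anything mandated by the proposal:

#include <arm_neon.h>
#include <stdio.h>

/* Vector compare in a loop, with the bitmask bridging to scalar
   post-processing: print the indices of elements above a threshold. */
static void print_matches(const int32_t *data, size_t n, int32_t threshold) {
    for (size_t i = 0; i + 4 <= n; i += 4) {
        uint32x4_t cmp = vcgtq_s32(vld1q_s32(data + i), vdupq_n_s32(threshold));
        /* Emulate i32x4.bitmask: the MSB of each lane becomes a bit of the mask. */
        uint32_t mask = (vgetq_lane_u32(cmp, 0) >> 31)
                      | (vgetq_lane_u32(cmp, 1) >> 31) << 1
                      | (vgetq_lane_u32(cmp, 2) >> 31) << 2
                      | (vgetq_lane_u32(cmp, 3) >> 31) << 3;
        while (mask) {                          /* scalar post-processing */
            unsigned lane = (unsigned)__builtin_ctz(mask);
            printf("match at index %zu\n", i + lane);
            mask &= mask - 1;                   /* clear the lowest set bit */
        }
    }
}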

Implementation

8-bit elements

Let us start with the 8-bit element case, that is, the i8x16.bitmask operation. The closest A64 equivalent is the BEXT instruction. This is one of the optional bit permute operations (also known as the FEAT_SVE_BitPerm feature) introduced by the second version of the Scalable Vector Extension (SVE2). However, BEXT differs from i8x16.bitmask in several aspects:

  • It is more generic: it can collect arbitrary input bits (instead of just the most significant ones), controlled by a second argument.
  • It produces its result in a vector register instead of a general-purpose one, where integer scalars usually reside while executing Wasm code.
  • It does not operate on the 128-bit vector as a whole, but on at most 64-bit parts of it independently. Note that BEXT, as an SVE2 instruction, could work with much larger vectors, but here we are only interested in the lowest 128 bits; the rest is ignored, since WebAssembly specifies the vector size as 128 bits.

With that in mind, here is a possible implementation:

movi vt.16b, #128
bext zt.d, zs.d, zt.d
fmov xt, vt.d[1]
fmov xd, dt
orr  wd, wd, wt, lsl #8

where Vs/Zs is the input vector register, Wd/Xd is the output general-purpose register, and Dt/Qt/Vt/Zt and Wt/Xt are temporary vector and general-purpose registers respectively. Note that we combine the output bits from each 64-bit half of the vector register with a scalar integer operation instead of vector instructions. Furthermore, the two moves from SIMD & floating-point registers to general-purpose registers just before the last instruction are independent, so a processor that is capable of it could execute them in parallel, making the cost of the second move essentially negligible. In general, this is an important factor to keep in mind when designing the sequence of operations necessary to express a computation.

If the machine that we would like to run our code on does not support the bit permute operations, we could explore an alternative approach. While BEXT collects the input bits into the least significant bits of the output, we can use the most significant ones instead. This is achievable with regular arithmetic operations. To make our lives easier, we should make sure that the input vector elements have at most one bit set. This can be done with an unsigned right shift, for instance. Arithmetic instructions operate on at most 64-bit values, so initially we are going to process each set of 8 8-bit input vector elements separately.

Consider element 0 from the input: after the previous transformation, it corresponds to bit 0 in the intermediate value. We must shift it into bit 56, the first of the most significant 8 bits. Remembering that a left shift by 1 bit corresponds to a multiplication by 2, this is equivalent to multiplying by 2^56 (or 0x100000000000000 in hexadecimal notation). The other input bits are greater than or equal to 2^8, so they will result in a value that is 2^64 or larger. It is not going to fit into our 64-bit output, and hence will be discarded.

Now consider element 1 and bit 8 respectively. We must shift the latter into bit 57, so we multiply by 2^49 (0x2000000000000) to obtain an intermediate value that must be added to the one from the previous step. As before, the bits associated with the rest of the elements would be discarded. However, there is one exception, bit 0, which would be shifted into bit 49. Since it is outside of the range where we are collecting our result (bits 56 to 63 inclusive), there is no problem. Furthermore, bit 49 of the intermediate value from the previous step in the calculation is 0, so there is no risk that a carry would perturb the bits we are interested in when we perform the addition. Following a similar logic for the rest of the input vector elements, we arrive at our final multiplier, 0x102040810204080. The only thing that is left is to combine the resultant bits for both sets of input vector elements. As a side note, techniques similar to this one are frequently referred to as "SIMD within a register" (SWAR).
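
To make the derivation concrete, here is a hedged C sketch for one 64-bit half of the vector (the helper name is ours); it performs the normalizing shift and then gathers all 8 bits into the top byte with a single multiplication:

#include <stdint.h>

/* SWAR bitmask for 8 bytes: reduce each byte to one bit in its lowest
   position, then multiply by 0x102040810204080 to collect bits 56-63. */
static inline uint32_t swar_bitmask8_half(uint64_t half) {
    uint64_t bits = (half >> 7) & 0x0101010101010101u; /* byte MSBs -> bit 0 */
    return (uint32_t)((bits * 0x0102040810204080u) >> 56);
}

The full operation applies this to both 64-bit halves of the vector and combines the two result bytes.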

If we decide to implement this sequence of operations with vector instructions, we quickly discover an issue: the vector MUL instruction that we would expect to use does not support 64-bit vector elements. Fortunately, there is an alternative, namely the PMULL instruction provided by the Armv8 Cryptographic Extension (the FEAT_AES feature in particular), which performs multiplication of polynomials over {0, 1} (also referred to as carryless multiplication). Since we have been careful to avoid carries in our intermediate values, we can treat that instruction as a regular multiplication. Furthermore, the final combining step can be done with the TRN2 instruction. That gives us the following implementation:

  ldr    qt, 1f
  ushr   vt2.16b, vs.16b, #7
  pmull2 vt3.1q, vt.2d, vt2.2d
  pmull  vt.1q, vt.1d, vt2.1d
  trn2   vt.8b, vt.8b, vt3.8b
  umov   wd, vt.h[3]
  ...
1:
  .byte  128, 64, 32, 16, 8, 4, 2, 1, 128, 64, 32, 16, 8, 4, 2, 1

where Vt2 and Dt3/Vt3 are temporary vector registers.
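
The same computation can also be written with ACLE intrinsics; this is a hedged sketch rather than an exact instruction-for-instruction rendering of the listing above, and it assumes a compiler targeting FEAT_AES (for example, -march=armv8-a+crypto):

#include <arm_neon.h>
#include <stdint.h>

/* Carryless multiplication gathers the 8 bits of each half into byte 7
   of the corresponding 64-bit product. */
static inline uint32_t i8x16_bitmask_pmull(uint8x16_t v) {
    const poly64_t weights = (poly64_t)0x0102040810204080u; /* bytes 128..1 */
    poly64x2_t halves = vreinterpretq_p64_u8(vshrq_n_u8(v, 7));
    uint64_t lo = vgetq_lane_u64(vreinterpretq_u64_p128(
        vmull_p64(vgetq_lane_p64(halves, 0), weights)), 0);
    uint64_t hi = vgetq_lane_u64(vreinterpretq_u64_p128(
        vmull_p64(vgetq_lane_p64(halves, 1), weights)), 0);
    return (uint32_t)((lo >> 56) | ((hi >> 56) << 8));
}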

Alternatively, we could implement the same approach purely with scalar integer instructions, which has the advantage of not requiring any ISA extensions. We are also going to take advantage of a performance improvement introduced with the Cortex-A77 processor (and its successors): the 64-bit multiplication instructions have significantly lower latency and higher throughput (refer to the Arm Cortex-A77 Software Optimization Guide for more details). So, with a slight adjustment to the multiplier that avoids a couple of shift operations (clearing all bits except the most significant one of each input byte instead), we obtain the following instruction sequence:

ldr   xd, =0x2040810204081
fmov  xt, vs.d[1]
fmov  xt2, ds
and   xt, xt, #0x8080808080808080
and   xt2, xt2, #0x8080808080808080
mul   xt, xd, xt
mul   xt2, xd, xt2
lsr   xd, xt, #48
bfxil xd, xt2, #56, #8

where Xt2 is another temporary general-purpose register.

Let us return to the original approach of collecting the input bits directly into the least significant bits of the output. The latter can be viewed as a sum of powers of 2; for example, input vector element 3 corresponds to 2^3, or 8. So let us assume that each input vector element has a value of either 0 or 1. Then all we have to do is multiply each of them by the respective power of 2 and sum all the products. In other words, we have to perform a dot product. The Armv8.2 architecture extension does in fact provide operations like these, covered by the FEAT_DotProd feature. Unfortunately, the instructions calculate at most 32-bit results, so we must use a couple of other combining operations. Finally, note that when an input vector element is interpreted as a signed 8-bit integer, it is a negative number if its most significant bit is set. As a result, we can alternatively transform it into either -1 or 0 with a single operation - CMLT with 0 as its second (immediate) operand.

In fact, all vector comparison instructions produce results with our expected values, so we could omit the initial normalization operation if we determine that all inputs come from such instructions. The same holds for bitwise logical operations such as ORR on the results of vector comparisons, but we will not make any assumptions here. After that, we must use negated powers of 2 and the SDOT instruction to perform the dot product, which gives us the following implementation:

  ldr   qt, 1f
  cmlt  vt2.16b, vs.16b, #0
  movi  dt3, #0
  sdot  vt3.4s, vt.16b, vt2.16b
  addp  vt.4s, vt3.4s, vt3.4s
  umov  wt, vt.b[4]
  fmov  wd, st
  orr   wd, wd, wt, lsl #8
  ...
1:
  .byte -1, -2, -4, -8, -16, -32, -64, -128, -1, -2, -4, -8, -16, -32, -64, -128
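
For readers more comfortable with C, here is a hedged intrinsics rendering of the same idea; it assumes FEAT_DotProd (for example, -march=armv8.2-a+dotprod), and the helper name is ours:

#include <arm_neon.h>
#include <stdint.h>

static inline uint32_t i8x16_bitmask_sdot(int8x16_t v) {
    static const int8_t w[16] = { -1, -2, -4, -8, -16, -32, -64, -128,
                                  -1, -2, -4, -8, -16, -32, -64, -128 };
    int8x16_t weights = vld1q_s8(w);
    int8x16_t mask = vreinterpretq_s8_u8(vcltzq_s8(v)); /* -1 where MSB set */
    int32x4_t acc = vdotq_s32(vdupq_n_s32(0), weights, mask);
    /* Lanes 0+1 accumulate bits 0-7; lanes 2+3 accumulate bits 8-15. */
    uint32_t lo = (uint32_t)(vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1));
    uint32_t hi = (uint32_t)(vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3));
    return lo | (hi << 8);
}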

One mandatory feature that is part of the Armv8.6 architecture extension is FEAT_I8MM - the 8-bit integer matrix multiplication operations. We can observe that the SMMLA instruction is quite similar to the combination of SDOT and ADDP in the previous sequence, which gives us the following alternative:

  ldr   dt, 1f
  cmlt  vt2.16b, vs.16b, #0
  movi  dt3, #0
  smmla vt3.4s, vt.16b, vt2.16b
  fmov  xd, dt3
  orr   xd, xd, xd, lsr #24
  mov   wd, wd
  ...
1:
  .byte -1, -2, -4, -8, -16, -32, -64, -128

This instruction sequence also provides a potential simplification. If we know that the consumer of the result ignores the most significant 32 bits (because it operates on Wd instead of Xd, for example), then we can omit the last move.

Finally, for reference, here is the implementation that was provided when the bitmask operations were proposed for Wasm:

  ldr   qt, 1f 
  sshr  vt2.16b, vs.16b, #7
  and   vt.16b, vt.16b, vt2.16b
  ext   vt2.16b, vt.16b, vt.16b, #8
  zip1  vt.16b, vt.16b, vt2.16b
  addv  ht, vt.8h
  umov  wd, vt.h[0]
  ...
1:
  .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128

16-bit elements

We can use similar tricks as in the 8-bit case to derive instruction sequences for 16-bit elements.
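
For instance, the scalar SWAR trick carries over directly with a different mask and multiplier; here is a hedged C sketch for one 64-bit half (the helper name is ours):

#include <stdint.h>

/* SWAR bitmask for four 16-bit lanes: the multiplication gathers the
   lane MSBs into bits 60-63. */
static inline uint32_t swar_bitmask16_half(uint64_t half) {
    uint64_t msb = half & 0x8000800080008000u; /* MSB of each 16-bit lane */
    return (uint32_t)((msb * 0x0000200040008001u) >> 60);
}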

Here is the implementation from the Wasm proposal:

  ldr    qt, 1f
  sshr   vt2.8h, vs.8h, #15
  and    vt.16b, vt.16b, vt2.16b
  addv   ht, vt.8h
  umov   wd, vt.h[0]
  ...
1:
  .2byte 1, 2, 4, 8, 16, 32, 64, 128

An approach using scalar integer multiplication:

ldr   xd, =0x200040008001
fmov  xt, vs.d[1]
fmov  xt2, ds
and   xt, xt, #0x8000800080008000
and   xt2, xt2, #0x8000800080008000
mul   xt, xd, xt
mul   xt2, xd, xt2
lsr   xd, xt, #56
bfxil xd, xt2, #60, #4

A sequence based on the PMULL instruction:

  ushr  vt.8h, vs.8h, #15
  ldr   dt2, 1f
  xtn   vt.8b, vt.8h
  pmull vt.1q, vt.1d, vt2.1d
  umov  wd, vt.b[7]
  ...
1:
  .byte 128, 64, 32, 16, 8, 4, 2, 1

An alternative that uses a dot product:

  ldr   qt, 1f
  cmlt  vt2.8h, vs.8h, #0
  movi  dt3, #0
  sdot  vt3.4s, vt.16b, vt2.16b
  addp  vt.4s, vt3.4s, vt3.4s
  umov  wd, vt.b[4]
  fmov  wt, st
  orr   wd, wd, wt
  ...
1:
  .byte -1, 0, -2, 0, -4, 0, -8, 0, -16, 0, -32, 0, -64, 0, -128, 0

The corresponding implementation with a matrix multiplication:

  ldr   qt, 1f
  cmlt  vt2.8h, vs.8h, #0
  movi  dt3, #0
  smmla vt3.4s, vt.16b, vt2.16b
  umov  wd, vt3.b[12]
  fmov  wt, st3
  orr   wd, wd, wt
  ...
1:
  .byte -1, 0, -2, 0, -4, 0, -8, 0, -16, 0, -32, 0, -64, 0, -128, 0

And with the BEXT instruction:

movi vt.8h, #128, lsl #8
bext zt.d, zs.d, zt.d
fmov xd, vt.d[1]
fmov xt, dt
orr  wd, wt, wd, lsl #4

32-bit elements

Here is the implementation from the Wasm proposal:

  ldr    qt, 1f
  sshr   vt2.4s, vs.4s, #31
  and    vt.16b, vt.16b, vt2.16b
  addv   st, vt.4s
  fmov   wd, st
  ...
1:
  .4byte 1, 2, 4, 8

With fewer input vector elements to process, we can come up with a simplified implementation that uses scalar integer operations:

fmov  xd, vs.d[1]
fmov  xt, ds
and   xd, xd, #0x8000000080000000
and   xt, xt, #0x8000000080000000
orr   xd, xd, xd, lsl #31
orr   xt, xt, xt, lsl #31
lsr   xd, xd, #60
bfxil xd, xt, #62, #2
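
In C, the gathering step for one 64-bit half might look like this hedged sketch (the helper name is ours):

#include <stdint.h>

/* Bitmask for two 32-bit lanes: an OR with a shifted copy moves both
   lane MSBs into bits 62-63 before the final extraction. */
static inline uint32_t bitmask32_half(uint64_t half) {
    uint64_t msb = half & 0x8000000080000000u; /* MSBs of both 32-bit lanes */
    msb |= msb << 31;                          /* gather into bits 62-63 */
    return (uint32_t)(msb >> 62);
}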

An alternative that uses vector polynomial multiplication:

  ushr   vt.4s, vs.4s, #31
  ldr    dt2, 1f
  xtn    vt.4h, vt.4s
  pmull  vt.1q, vt.1d, vt2.1d
  umov   wd, vt.b[7]
  ...
1:
  .2byte 0x800, 0x400, 0x200, 0x100

An instruction sequence that uses a matrix multiplication:

  ldr   qt, 1f
  cmlt  vt2.4s, vs.4s, #0
  movi  dt3, #0
  smmla vt3.4s, vt.16b, vt2.16b
  umov  wd, vt3.b[12]
  fmov  wt, st3
  orr   wd, wd, wt
  ...
1:
  .byte -1, 0, 0, 0, -2, 0, 0, 0, -4, 0, 0, 0, -8, 0, 0, 0

And another application of the BEXT instruction:

movi vt.4s, #128, lsl #24
bext zt.d, zs.d, zt.d
fmov xd, vt.d[1]
fmov xt, dt
orr  wd, wt, wd, lsl #2

64-bit elements

There are only 2 input bits to deal with, which can be handled by a simple implementation that uses only scalar integer operations:

fmov  xd, vs.d[1]
fmov  xt, ds
lsr   xd, xd, #62
bfxil xd, xt, #63, #1
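
The C equivalent is almost a one-liner; a sketch with the two lanes passed as scalars:

#include <stdint.h>

static inline uint32_t i64x2_bitmask_ref(uint64_t lane0, uint64_t lane1) {
    return (uint32_t)((lane0 >> 63) | ((lane1 >> 63) << 1));
}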

Evaluation

Now we have plenty of candidate instruction sequences, but how do we choose one? To this end, we can wrap each instruction sequence (taking the one for 64-bit elements as an example) in an assembly function such as:

    .arch armv8-a
    .text
    .align 6
    .global scalar
#ifndef _WIN64
    .type scalar, %function
#endif // _WIN64
scalar:
    .cfi_startproc
    mov x1, x0
    .p2align 6
1:
    .rept 1024
    fmov x0, v0.d[1]
    fmov x2, d0
    lsr x0, x0, #62
    bfxil x0, x2, #63, #1
#ifdef LATENCY
    dup v0.2d, x0
#endif // LATENCY
    .endr
    subs x1, x1, #1
    b.ne 1b
    ret
    .cfi_endproc
#ifndef _WIN64
    .size scalar, .-scalar
#endif // _WIN64

This function has the following C prototype:

uint32_t scalar(size_t n, uint64x2_t v);

where the first parameter, n, specifies the number of times that the corresponding bitmask operation should execute, while the second parameter, v, is just some input data. Note that we have chosen instruction sequences whose runtime is independent of the input values.

Then, to complete the benchmark implementation, we can obtain the current time before and after running the function and calculate the elapsed duration: the smaller the number, the better the candidate instruction sequence. The timestamps can be obtained with calls such as the ones below (a minimal harness is sketched after the list):

  • clock_gettime(CLOCK_MONOTONIC, ...) on POSIX-like systems
  • QueryPerformanceCounter() on Windows
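
A minimal POSIX harness might look like the following hedged sketch; the factor of 1024 matches the .rept count in the listing above, while the outer iteration count is arbitrary:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <arm_neon.h>

uint32_t scalar(size_t n, uint64x2_t v); /* the assembly function above */

int main(void) {
    struct timespec start, end;
    const size_t n = 100000; /* outer loop iterations */
    const uint64x2_t v = vdupq_n_u64(0x8000000000000001u);

    clock_gettime(CLOCK_MONOTONIC, &start);
    scalar(n, v);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("%.3f ns per operation\n", ns / ((double)n * 1024.0));
    return 0;
}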

There are a couple of important points associated with this measurement method. First, the branch layout (such as branch density) and alignment (especially the alignment of branch targets) might have a disproportionate effect on assembly microbenchmarks such as this one. That is why the function above enforces 64-byte alignment where appropriate, and is mostly laid out as a very long block of branchless, linear code consisting of repeated copies of the instruction sequence under consideration.

Second, there are at least 2 values that we might be interested in measuring. One of them is latency, that is, the amount of time until the result of the bitmask operation is ready for another (different) operation that depends on it. The other is throughput - the number of independent bitmask operations that we are able to execute in parallel. Our assembly function can be used to obtain both numbers: if the LATENCY macro is defined, then each replicated instruction sequence becomes part of the same very long dependency chain, and dividing the total runtime by the number of executed sequences yields the latency; if it is not defined, the sequences are independent, and the same division yields a throughput figure instead.

We perform the evaluation on 2 processors: Cortex-A55 and Cortex-X1. Smaller and more power-efficient processors such as Cortex-A55 can have surprising performance characteristics compared to their much more powerful alternatives like Cortex-X1, so this choice provides reasonable coverage of the spectrum of processors that are expected to execute Wasm code. Unfortunately, neither of these processors supports the FEAT_I8MM and FEAT_SVE_BitPerm features, so we have to exclude the corresponding instruction sequences from the evaluation.

The results are presented as a set of tables - one for each vector element size (except for 64-bit elements, in which case we have only one candidate implementation). All values in the tables are speedups relative to the instruction sequence suggested in the Wasm proposal, so higher numbers are better (and those below 100% are regressions).

8-bit elements

             Latency                        Throughput
Processor    Scalar    PMULL     SDOT       Scalar    PMULL     SDOT
Cortex-A55   114.22%   120.43%   88.85%     105.09%   124.93%   83.04%
Cortex-X1    163.57%   156.49%   114.41%    106.93%   120.00%   147.95%

16-bit elements

             Latency                        Throughput
Processor    Scalar    PMULL     SDOT       Scalar    PMULL     SDOT
Cortex-A55   88.58%    116.67%   71.84%     81.71%    137.39%   64.42%
Cortex-X1    127.14%   107.66%   88.93%     64.36%    147.73%   90.28%

32-bit elements

             Latency              Throughput
Processor    Scalar    PMULL     Scalar    PMULL
Cortex-A55   103.23%   108.26%   123.63%   125.00%
Cortex-X1    117.93%   92.34%    79.78%    157.78%

Conclusion

Given the intended use case for the bitmask instructions, we are going to prioritize minimizing the operation latency.

Our results are a typical example of the kind of compromise one has to make when targeting a broad range of processors (in terms of performance characteristics) simultaneously. Consider the 8-bit vector element case: while the best implementation for Cortex-A55 is the one using the PMULL instruction, the optimal sequence for Cortex-X1 is based on scalar multiplications. Fortunately, all of those alternatives are better than our baseline, so we are able to choose one depending on what processor we expect to encounter most often in practice.

However, the choice is not as simple in the 16- and 32-bit element cases, since the best sequence for one of the processors might lead to a regression on the other. A compromise is still possible, however: the implementation using vector polynomial multiplication in the 16-bit case, and the one based on scalar integer instructions for 32-bit vector elements.

Further reading

If your application uses operations similar to the Wasm vector bitmask instructions, but not necessarily exactly the same, the previous blog Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more may provide useful information, including alternatives to the x86 Move Byte Mask (PMOVMSKB) instruction.
