When supported by the underlying hardware, vector operations can increase the number of computations performed per CPU cycle. For example, assume we want to add two vectors, each containing a sequence of four integer values. Vector hardware allows us to perform this operation (four integer additions in total) in a single CPU cycle, whereas a scalar add instruction performs only one integer addition in the same time. This mode of operation is called SIMD (Single Instruction, Multiple Data); the traditional way of execution is called SISD (Single Instruction, Single Data).
In the Arm architecture, Neon is an advanced SIMD architecture extension for the A-profile and R-profile processors. Neon registers are treated as vectors of elements of the same data type, with Neon instructions operating on multiple elements simultaneously. Multiple data types are supported, including floating-point and integer operations. The Scalable Vector Extension (SVE) is a vector extension of the A64 instruction set of the Armv8-A architecture, and Armv9-A builds on SVE with the SVE2 extension. Unlike other SIMD architectures, SVE and SVE2 do not define the size of the vector registers but constrain it to a range of possible values. Vendors can implement the extension by choosing the vector register size that best suits the workloads the CPU is targeting. The design of SVE and SVE2 guarantees that the same program can run on different implementations of the instruction set architecture [1].
SIMD in Java opens up new opportunities for developers in areas such as High Performance Computing (HPC), Artificial Intelligence (AI) frameworks, Machine Learning (ML) linear algebra-based algorithms, and financial services. These kinds of Java workloads are compute-intensive. But before JDK16, there were very limited ways of using SIMD:
The OpenJDK C2 JIT compiler supports superword auto-vectorization, but it is too weak to handle many complex code patterns. For example, of the three loops below, only the first is reliably vectorized; the strided and conditional variants usually defeat the superword algorithm.
for (int i = 0; i < LENGTH; i++)    { c[i] = a[i] + b[i]; }                      // unit stride: auto-vectorized
for (int i = 0; i < LENGTH; i += 2) { c[i] = a[i] + b[i]; }                      // stride of 2: usually not vectorized
for (int i = 0; i < LENGTH; i++)    { if (a[i] < b[i]) { c[i] = a[i] + b[i]; } } // control flow: usually not vectorized
The Java Native Interface (JNI) allows access to fast native libraries, but hand-crafted assembly code is error-prone and hard to maintain. Furthermore, calling a native method has higher overhead than a normal Java method call.
Some methods in built-in libraries, such as String.indexOf, use SIMD instructions. Nevertheless, most of these methods are tailored to specific application scenarios; that is, they are not flexible enough for developers to build their own algorithms.
Since JDK16, the Java Vector API has been available as an incubator module (jdk.incubator.vector), which must be enabled with --add-modules jdk.incubator.vector. It enables developers to express vector operations in a platform-agnostic way; these operations are then compiled to SIMD instructions at runtime by the JIT compiler.
This blog offers some insight into the Vector API. We first go over Vector API fundamentals, basic usage, and features, and then show how well AArch64 supports the Vector API.
To understand the basics of the Vector API, in this section we take a detailed look at firstExample:
public static void firstExample(int[] a, int[] b, int[] c) {
    for (int i = 0; i < a.length; i++) {
        if (a[i] > b[i]) {
            c[i] = a[i] + b[i];
        } else {
            c[i] = a[i];
        }
    }
}
This is a very common pattern: it conditionally computes the sum of elements from two arrays and puts the result into a third array. Without vectorization, the CPU executes the load, compare, and add instructions once per element, so the instruction count grows with the array length.
The same function implemented with the Vector API is shown below:
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

static void firstExampleInVectorAPI(int[] a, int[] b, int[] c) {
    int i = 0;
    int upperBound = SPECIES.loopBound(a.length);
    // Core Loop
    for (; i < upperBound; i += SPECIES.length()) {
        IntVector av = IntVector.fromArray(SPECIES, a, i);
        IntVector bv = IntVector.fromArray(SPECIES, b, i);
        VectorMask<Integer> m = av.compare(VectorOperators.GT, bv);
        av.add(bv, m).intoArray(c, i);
    }
    // Tail Loop
    for (; i < a.length; i++) {
        if (a[i] > b[i]) {
            c[i] = a[i] + b[i];
        } else {
            c[i] = a[i];
        }
    }
}
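For completeness, a small hypothetical driver (the Demo class and its contents are ours, not part of the original example) shows how the method is invoked; the incubator module must be enabled on both the javac and java command lines:

// Hypothetical driver; compile and run with:
//   javac --add-modules jdk.incubator.vector Demo.java
//   java  --add-modules jdk.incubator.vector Demo
public class Demo {
    public static void main(String[] args) {
        int[] a = new int[1024], b = new int[1024], c = new int[1024];
        java.util.Arrays.setAll(a, i -> i);
        java.util.Arrays.setAll(b, i -> 1024 - i);
        firstExampleInVectorAPI(a, b, c); // assumes the method above lives in Demo
        System.out.println(c[0] + " " + c[1023]);
    }
}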
In the following paragraphs, we explain some basic elements of the Vector API based on this example. This should give an overall view of how to use the Vector API in Java from scratch.
static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

A VectorSpecies combines an element type (here Integer) with a vector shape, that is, the bit size of the underlying vector register. SPECIES_PREFERRED selects the largest shape supported by the current platform, so the same code automatically uses 128-bit vectors on Neon and wider vectors where the hardware provides them. Figure 1 lists the possible combinations of element type and shape.
Figure 1 Combinations of element type and shape
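As a quick illustration (a sketch of ours; the printed values depend on the hardware), the chosen species can be inspected at runtime:

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesInfo {
    public static void main(String[] args) {
        VectorSpecies<Integer> s = IntVector.SPECIES_PREFERRED;
        System.out.println(s.vectorBitSize()); // vector width in bits, e.g. 128 on Neon
        System.out.println(s.length());        // lanes per vector, e.g. 4 for int
    }
}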
int upperBound = SPECIES.loopBound(a.length);

loopBound rounds the array length down to the largest multiple of the vector length, marking how far the vectorized loop can safely run. For example, with a 4-lane species and a.length == 10, loopBound returns 8, leaving 2 elements for the tail loop.
// Core Loop
for (; i < upperBound; i += SPECIES.length()) {
    IntVector av = IntVector.fromArray(SPECIES, a, i);
    IntVector bv = IntVector.fromArray(SPECIES, b, i);
    VectorMask<Integer> m = av.compare(VectorOperators.GT, bv);
    av.add(bv, m).intoArray(c, i);
}

The core loop processes SPECIES.length() elements per iteration. fromArray loads a vector from the array starting at index i, compare produces a VectorMask whose lanes record where a[i] > b[i], and the masked add only adds in the lanes selected by the mask, keeping the original value of av elsewhere. intoArray stores the result back into c.
// Tail Loop
for (; i < a.length; i++) {
    if (a[i] > b[i]) {
        c[i] = a[i] + b[i];
    } else {
        c[i] = a[i];
    }
}

The tail loop handles the remaining elements, fewer than one full vector, with ordinary scalar code.
The Vector API is implemented in pure Java, with no native keyword anywhere in its source code, so it is fully cross-platform. In this default form, each vector operation is executed element by element, which is functionally correct but not optimal.
The Vector API also defines intrinsics for the JVM: special Java methods picked out from the Vector API source code and marked with @IntrinsicCandidate. The JIT compiler prefers to emit efficient native code for these methods using hardware vector registers and SIMD instructions, rather than falling back to their Java implementations. If these methods run on a system with the relevant SIMD functionality (for example, AVX2 or Neon), they are replaced with native implementations. The intrinsics of the Vector API are defined in a generalized way to reduce code size: operations of the same kind call the same intrinsic, distinguished by an operation ID. The following code shows an intrinsic that handles binary operations such as x.add(y), x.sub(y), and x.mul(y); the last parameter is the default Java implementation. At the moment, there are more than 20 intrinsics defined for the entire Vector API.
@IntrinsicCandidate
public static
<VM extends VectorPayload,
 M extends VectorMask<E>,
 E>
VM binaryOp(int oprId,
            Class<? extends VM> vmClass, Class<? extends M> mClass,
            Class<E> eClass, int length,
            VM v1, VM v2, M m,
            BinaryOperation<VM, M> defaultImpl) {
    assert isNonCapturingLambda(defaultImpl) : defaultImpl;
    return defaultImpl.apply(v1, v2, m);
}
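To make the operation-ID mechanism concrete, here is a heavily simplified sketch of how a public call like av.add(bv) might route into this intrinsic. VECTOR_OP_ADD is a real VectorSupport constant, but the rest of the method body is illustrative only, not the actual JDK source:

// Simplified, illustrative dispatch sketch.
IntVector add(IntVector that) {
    return VectorSupport.binaryOp(
        VectorSupport.VECTOR_OP_ADD,            // operation ID selecting "add"
        IntVector.class, null, int.class, length(),
        this, that, null,                       // null mask: the unmasked case
        (v1, v2, m) -> addLanesInJava(v1, v2)); // addLanesInJava: hypothetical Java fallback
}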
The default Java implementation also serves as a fallback for intrinsification, since a vector operation sometimes cannot be compiled successfully. The most obvious reason is that the CPU architecture it runs on does not support the required instructions efficiently. For example, the compress operation compacts the lane elements of a vector as selected by a specified mask. At the moment, this operation is only intrinsified by the JIT compiler on SVE and AVX-512. On an SVE machine, it is a single instruction:
COMPACT <Zd>.<T>, <Pg>, <Zn>.<T>
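At the Java level, a hedged usage sketch of the compress operation (the array name a is ours):

// Pack the lanes of av selected by the mask towards the front of the
// result; unselected lanes are filled with zero.
IntVector av = IntVector.fromArray(SPECIES, a, 0);
VectorMask<Integer> m = av.compare(VectorOperators.GT, 0);
IntVector packed = av.compress(m);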
Vector API performance ultimately depends on the hardware and on the corresponding support in OpenJDK. In the next section, we describe how AArch64 supports the Vector API.
Figure 2 Vector API status
At this moment, AArch64 provides complete support for the Vector API. In JDK16, AArch64 supported the basic API on Neon platforms and introduced an SVE-friendly, Vector Length Agnostic (VLA) implementation; it also extended the maximum supported vector size to 2048 bits, which aligns well with Arm SVE. In JDK18, SVE's predicate feature was enabled for VectorMask so that the Vector API gets the best performance on SVE platforms. In JDK19, AArch64 added SVE2 feature detection in the JVM and started to use SVE2 in some APIs. In the recent JDK20, the code generator was fine-tuned for Arm micro-architectures such as Neoverse V1 and Neoverse N2.
AArch64 supports the mask (predicate) feature of the Vector API on both Neon and SVE machines.
On a platform like Neon, which has no predicate registers, an instance of VectorMask<E> is compiled into a vector register, just like a Vector object, and a mask-accepting operation is in general composed of the equivalent unmasked operation plus a blend. For the core loop above, the generated code shown below performs an unpredicated add first. The result, held in v18, is then adjusted by the following BSL instruction, which selects bitwise between v18 and v16 according to v17; each lane of v17 is either all ones or all zeros.
ldr  q16, [x17, #16]
ldr  q17, [x18, #16]
add  v18.4s, v16.4s, v17.4s    <-- unpredicated add
cmgt v17.4s, v16.4s, v17.4s
bsl  v17.16b, v18.16b, v16.16b <-- blend
add  x14, x10, x14
str  q17, [x14, #16]
In JDK18, we enabled the predicate feature for the Vector API on SVE. With predication, the code generated for the same loop on an SVE machine is shown below.
ldr   q16, [x17, #16]
ldr   q17, [x18, #16]
cmpgt p0.s, p7/z, z16.s, z17.s
add   z16.s, p0/m, z16.s, z17.s <-- predicated add
add   x14, x10, x14
str   q16, [x14, #16]
AArch64 code generation in the JVM targets both families of SIMD instructions, Neon and SVE/SVE2. The JVM option UseSVE was introduced in JDK16; it denotes the highest SVE instruction set version the code generator may use. So far its value can be 0, 1, or 2, corresponding to Neon, SVE, and SVE2 respectively.
By default, UseSVE is initialized to the highest supported instruction set, determined during JVM startup. Selecting a lower version is allowed; for example, passing -XX:UseSVE=0 forces Neon code generation on an SVE-capable machine. Setting a version beyond the real capability of the hardware causes a warning, and the JVM reverts the option to the default.
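If you need to check the effective value from Java, one hedged option is the HotSpot diagnostic MXBean; note that the UseSVE flag only exists on AArch64 builds of HotSpot, so getVMOption throws IllegalArgumentException elsewhere:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class SveCheck {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Throws IllegalArgumentException on builds without the UseSVE flag.
        System.out.println("UseSVE = " + bean.getVMOption("UseSVE").getValue());
    }
}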
For most vector operations, the generated code simply uses the highest supported instruction set; for example, as discussed above, the add operation generates a predicated add on SVE. But to obtain the best performance, we fine-tuned the code generator for some special APIs. Vector.lane(int i) gets the element at the given index. On an SVE machine it can be implemented with a Neon instruction whenever the target lane lies within the low 128 bits. For example, Byte512Vector.lane(7) generates the more efficient:
smov x11, v16.b[7]
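A hedged Java-side sketch of the kind of call that produces this instruction (variable names are ours):

// Read lane 7 of a 512-bit byte vector; since lane 7 sits within the low
// 128 bits, SVE hardware can extract it with a single Neon smov.
ByteVector bv = ByteVector.fromArray(ByteVector.SPECIES_512, bytes, 0);
byte e = bv.lane(7);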
In addition, we optimized the generated code for the hardware it runs on. For example, for the add reduction operation ByteVector.reduceLanes(VectorOperators.ADD), on Neoverse N2 we generate
addv b17, v16.16b
smov x12, v17.b[0]
add  w12, w12, w16, sxtb
instead of
uaddv d17, p0, z16.b
smov  x15, v17.b[0]
add   w15, w14, w15, sxtb
to get better performance. This choice follows the Arm Neoverse N2 Software Optimization Guide.
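For reference, a hedged sketch of the Java code behind this reduction (the array name is ours):

// Sum all byte lanes into a single scalar; the JIT lowers this to one of the
// reduction sequences shown above, depending on the micro-architecture.
ByteVector bv = ByteVector.fromArray(ByteVector.SPECIES_128, bytes, 0);
byte sum = bv.reduceLanes(VectorOperators.ADD);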
In JDK19, we enabled SVE2 features in the Vector API, which improves the performance of some APIs considerably. For example, the Vector API defines the lanewise COMPRESS_BITS operation, which applies compress(int i, int mask) to each lane of a vector.
Vector<E> lanewise(VectorOperators.Binary op, Vector<E> v); // called with op = VectorOperators.COMPRESS_BITS
Here, compress(int i, int mask) returns the value obtained by compressing the bits of the specified int value i in accordance with the specified bit mask (the semantics of Integer.compress).
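A hedged usage sketch (array names are ours): each lane i of the result equals Integer.compress(a[i], m[i]).

IntVector av = IntVector.fromArray(SPECIES, a, 0);
IntVector mv = IntVector.fromArray(SPECIES, m, 0);
// Lanewise bit-compress of av under the bit masks in mv.
IntVector rv = av.lanewise(VectorOperators.COMPRESS_BITS, mv);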
The default Java implementation of COMPRESS_BITS applies compress(int i, int mask) to each lane one by one. Since there is no direct Neon instruction for bit compression, this generates more than 60 instructions per lane. SVE2 reduces the final generated code from 240+ instructions (for 4 int elements on a machine with 128-bit vectors) to a single instruction, BEXT:
bext z16.b, z17.b, z17.b
The following chart shows the benchmark results we measured on a machine with 128-bit vectors, comparing compressBits with and without SVE2. The improvement is more than 70x.
Figure 3 Compress and expand bit benchmark
We evaluated the performance of the Vector API in JDK20 with the micro-benchmark suite in openjdk/panama-vector, which is based on JMH (Java Microbenchmark Harness).
The following charts show the performance ratio of Vector API code to the corresponding non-Vector API code on two SVE machines, Neoverse N2 and Neoverse V1. The non-Vector API code is written in plain Java without using the Vector API.
Figure 4 Vector API vs Non Vector API on Neoverse V1
Figure 5 Vector API vs Non Vector API on Neoverse N2
Key findings:
Netlib is a high-performance, hardware-accelerated implementation of BLAS, LAPACK, and ARPACK in Java. The project supplies three implementations of BLAS: Default Java (plain Java code), Vector API (Java code built on the Vector API), and Native (a JNI binding to a native BLAS library).
We evaluated the performance of BLAS with the benchmark set built into Netlib, comparing these three implementations on Neoverse N2 with JDK20. In general, the Vector API performs best when the data size is small, and it always beats Default Java. To explain the results in more detail, we dig into one typical case, l1.SdotBenchmark.blas, which benchmarks the multiply-accumulate operation: it computes the product of two float numbers and adds the product to an accumulator, c += a * b.
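To make the kernel concrete, here is a hedged Vector API sketch of an sdot-style loop (our illustration, not Netlib's actual code; imports from jdk.incubator.vector are omitted):

static final VectorSpecies<Float> FS = FloatVector.SPECIES_PREFERRED;

// Dot product of the first n elements of a and b.
static float sdot(int n, float[] a, float[] b) {
    FloatVector acc = FloatVector.zero(FS);
    int i = 0;
    for (; i < FS.loopBound(n); i += FS.length()) {
        FloatVector av = FloatVector.fromArray(FS, a, i);
        FloatVector bv = FloatVector.fromArray(FS, b, i);
        acc = av.fma(bv, acc);               // lanewise c += a * b
    }
    float c = acc.reduceLanes(VectorOperators.ADD);
    for (; i < n; i++) {                     // scalar tail
        c += a[i] * b[i];
    }
    return c;
}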
The performance results differ depending on the data size. The following chart shows the performance gain of the Vector API and Native implementations relative to Default Java. The X-axis is the length of the input arrays; the Y-axis is the performance normalized to Default Java (colored in gray). The Native call through JNI has the best score when the data size is large, roughly above 5000 elements. But when the data size is small, the JNI call overhead and the cost of copying data between the Java heap and native memory cannot be ignored, and JNI performs worse than Default Java. The Vector API implementation is always better than plain Java code; moreover, when the data size is not too large (below 5000), it is the best choice.
Figure 6 Benchmark of blas.sdot in three ways
The Java Vector API is a significant step towards providing a good SIMD abstraction layer for high-level application developers. On the one hand, it goes beyond the limitations of auto-vectorization, generating SIMD instructions in a more robust way. On the other hand, compared to JNI, the code is more portable and much easier to maintain. In the future, besides more useful APIs, integration with Project Valhalla is a promising avenue for further performance improvements.
[1] Scalable Vector Extensions
[2] Enabling Vectorized Engine in Apache Spark