When supported by the underlying hardware, vector operations can increase the number of computations performed per CPU cycle. For example, assume we want to add two vectors, each containing a sequence of four integer values. Vector hardware allows us to perform this operation (four integer additions in total) in a single CPU cycle, whereas a scalar add instruction performs only one integer addition in the same time. This mode of operation is called SIMD (Single Instruction, Multiple Data); the traditional way of execution is called SISD (Single Instruction, Single Data).
In the Arm architecture, Neon is an advanced SIMD architecture extension for the A-profile and R-profile processors. Neon registers are treated as vectors of elements of the same data type, with Neon instructions operating on multiple elements simultaneously. Multiple data types are supported, including floating-point and integer operations. The Scalable Vector Extension (SVE) is a vector extension of the A64 instruction set of the Armv8-A architecture, and Armv9-A builds on SVE with the SVE2 extension. Unlike other SIMD architectures, SVE and SVE2 do not define the size of the vector registers but constrain it to a range of possible values. Vendors can implement the extension by choosing the vector register size that best suits the workloads the CPU is targeting. The design of SVE and SVE2 guarantees that the same program can run on different implementations of the instruction set architecture [1].
SIMD in Java opens up new opportunities for developers in areas such as High Performance Computing (HPC), Artificial Intelligence (AI) frameworks, Machine Learning (ML) linear algebra-based algorithms, and financial services. These kinds of Java workloads are compute-intensive. But before JDK16, there were very limited ways of using SIMD:
The OpenJDK C2 JIT compiler supports superword auto-vectorization, but it is too weak to handle many complex code patterns. For example, of the three loops below, only the first is reliably vectorized; the strided and conditional variants usually defeat the superword algorithm.
for (int i = 0; i < LENGTH; i++)    { c[i] = a[i] + b[i]; }                      // unit stride: auto-vectorized
for (int i = 0; i < LENGTH; i += 2) { c[i] = a[i] + b[i]; }                      // stride of 2: usually not vectorized
for (int i = 0; i < LENGTH; i++)    { if (a[i] < b[i]) { c[i] = a[i] + b[i]; } } // control flow: usually not vectorized
The Java Native Interface (JNI) allows access to fast native libraries, but hand-crafted assembly code is error-prone and hard to maintain. Furthermore, calling a native method has higher overhead than a normal Java method call.
Some methods in built-in libraries, such as String.indexOf, use SIMD instructions. Nevertheless, most of these methods are tailored to specific application scenarios; that is, they are not flexible enough for developers to build their own algorithms.
Since JDK16, the Java Vector API has been available as an incubator module (jdk.incubator.vector), which must be enabled with --add-modules jdk.incubator.vector. It enables developers to express vector operations in a platform-agnostic way; these operations are then compiled to SIMD instructions at runtime by the JIT compiler.
This blog offers some insight into the Vector API. We first go over Vector API fundamentals, basic usage, and features, and then show how well AArch64 supports the Vector API.
To understand the basics of the Vector API, in this section we take a detailed look at firstExample:
public static void firstExample(int[] a, int[] b, int[] c) {
    for (int i = 0; i < a.length; i++) {
        if (a[i] > b[i]) {
            c[i] = a[i] + b[i];
        } else {
            c[i] = a[i];
        }
    }
}
This is a very common pattern: it conditionally computes the sum of elements from two arrays and puts the result into a third array. Without vectorization, the CPU executes the load, compare, and add instructions once per element, so the instruction count grows with the array length.
The same function implemented with the Vector API is shown below:
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

static void firstExampleInVectorAPI(int[] a, int[] b, int[] c) {
    int i = 0;
    int upperBound = SPECIES.loopBound(a.length);
    // Core Loop
    for (; i < upperBound; i += SPECIES.length()) {
        IntVector av = IntVector.fromArray(SPECIES, a, i);
        IntVector bv = IntVector.fromArray(SPECIES, b, i);
        VectorMask<Integer> m = av.compare(VectorOperators.GT, bv);
        av.add(bv, m).intoArray(c, i);
    }
    // Tail Loop
    for (; i < a.length; i++) {
        if (a[i] > b[i]) {
            c[i] = a[i] + b[i];
        } else {
            c[i] = a[i];
        }
    }
}
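For completeness, a small hypothetical driver (the Demo class and its contents are ours, not part of the original example) shows how the method is invoked; the incubator module must be enabled on both the javac and java command lines:

// Hypothetical driver; compile and run with:
//   javac --add-modules jdk.incubator.vector Demo.java
//   java  --add-modules jdk.incubator.vector Demo
public class Demo {
    public static void main(String[] args) {
        int[] a = new int[1024], b = new int[1024], c = new int[1024];
        java.util.Arrays.setAll(a, i -> i);
        java.util.Arrays.setAll(b, i -> 1024 - i);
        firstExampleInVectorAPI(a, b, c); // assumes the method above lives in Demo
        System.out.println(c[0] + " " + c[1023]);
    }
}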
In the following paragraphs, we explain some basic elements of the Vector API based on this example. This should give an overall view of how to use the Vector API in Java from scratch.
static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

A VectorSpecies combines an element type (here Integer) with a vector shape, that is, the bit size of the underlying vector register. SPECIES_PREFERRED selects the largest shape supported by the current platform, so the same code automatically uses 128-bit vectors on Neon and wider vectors where the hardware provides them. Figure 1 lists the possible combinations of element type and shape.
Figure 1 Combinations of element type and shape
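As a quick illustration (a sketch of ours; the printed values depend on the hardware), the chosen species can be inspected at runtime:

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesInfo {
    public static void main(String[] args) {
        VectorSpecies<Integer> s = IntVector.SPECIES_PREFERRED;
        System.out.println(s.vectorBitSize()); // vector width in bits, e.g. 128 on Neon
        System.out.println(s.length());        // lanes per vector, e.g. 4 for int
    }
}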
int upperBound = SPECIES.loopBound(a.length);

loopBound rounds the array length down to the largest multiple of the vector length, marking how far the vectorized loop can safely run. For example, with a 4-lane species and a.length == 10, loopBound returns 8, leaving 2 elements for the tail loop.
// Core Loop
for (; i < upperBound; i += SPECIES.length()) {
    IntVector av = IntVector.fromArray(SPECIES, a, i);
    IntVector bv = IntVector.fromArray(SPECIES, b, i);
    VectorMask<Integer> m = av.compare(VectorOperators.GT, bv);
    av.add(bv, m).intoArray(c, i);
}

The core loop processes SPECIES.length() elements per iteration. fromArray loads a vector from the array starting at index i, compare produces a VectorMask whose lanes record where a[i] > b[i], and the masked add only adds in the lanes selected by the mask, keeping the original value of av elsewhere. intoArray stores the result back into c.
// Tail Loop
for (; i < a.length; i++) {
    if (a[i] > b[i]) {
        c[i] = a[i] + b[i];
    } else {
        c[i] = a[i];
    }
}

The tail loop handles the remaining elements, fewer than one full vector, with ordinary scalar code.
The Vector API is implemented in pure Java, with no native keyword anywhere in its source code, so it is fully cross-platform. In this default form, each vector operation is executed element by element, which is functionally correct but not optimal.
The Vector API also defines intrinsics for the JVM: special Java methods picked out from the Vector API source code and marked with @IntrinsicCandidate. The JIT compiler prefers to emit efficient native code for these methods using hardware vector registers and SIMD instructions, rather than falling back to their Java implementations. If these methods run on a system with the relevant SIMD functionality (for example, AVX2 or Neon), they are replaced with native implementations. The intrinsics of the Vector API are defined in a generalized way to reduce code size: operations of the same kind call the same intrinsic, distinguished by an operation ID. The following code shows an intrinsic that handles binary operations such as x.add(y), x.sub(y), and x.mul(y); the last parameter is the default Java implementation. At the moment, there are more than 20 intrinsics defined for the entire Vector API.
@IntrinsicCandidate
public static
<VM extends VectorPayload,
 M extends VectorMask<E>,
 E>
VM binaryOp(int oprId,
            Class<? extends VM> vmClass, Class<? extends M> mClass,
            Class<E> eClass, int length,
            VM v1, VM v2, M m,
            BinaryOperation<VM, M> defaultImpl) {
    assert isNonCapturingLambda(defaultImpl) : defaultImpl;
    return defaultImpl.apply(v1, v2, m);
}
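To make the operation-ID mechanism concrete, here is a heavily simplified sketch of how a public call like av.add(bv) might route into this intrinsic. VECTOR_OP_ADD is a real VectorSupport constant, but the rest of the method body is illustrative only, not the actual JDK source:

// Simplified, illustrative dispatch sketch.
IntVector add(IntVector that) {
    return VectorSupport.binaryOp(
        VectorSupport.VECTOR_OP_ADD,            // operation ID selecting "add"
        IntVector.class, null, int.class, length(),
        this, that, null,                       // null mask: the unmasked case
        (v1, v2, m) -> addLanesInJava(v1, v2)); // addLanesInJava: hypothetical Java fallback
}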
The default Java implementation also serves as a fallback for intrinsification, since a vector operation sometimes cannot be compiled successfully. The most obvious reason is that the CPU architecture it runs on does not support the required instructions efficiently. For example, the compress operation compacts the lane elements of a vector as selected by a specified mask. At the moment, this operation is only intrinsified by the JIT compiler on SVE and AVX-512. On an SVE machine, it is a single instruction:
COMPACT <Zd>.<T>, <Pg>, <Zn>.<T>
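At the Java level, a hedged usage sketch of the compress operation (the array name a is ours):

// Pack the lanes of av selected by the mask towards the front of the
// result; unselected lanes are filled with zero.
IntVector av = IntVector.fromArray(SPECIES, a, 0);
VectorMask<Integer> m = av.compare(VectorOperators.GT, 0);
IntVector packed = av.compress(m);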
Vector API performance ultimately depends on the hardware and on the corresponding support in OpenJDK. In the next section, we describe how AArch64 supports the Vector API.
Figure 2 Vector API status
At this moment, AArch64 provides complete support for the Vector API. In JDK16, AArch64 supported the basic API on Neon platforms and introduced an SVE-friendly, Vector Length Agnostic (VLA) implementation; it also extended the maximum supported vector size to 2048 bits, which aligns well with Arm SVE. In JDK18, SVE's predicate feature was enabled for VectorMask so that the Vector API gets the best performance on SVE platforms. In JDK19, AArch64 added SVE2 feature detection in the JVM and started to use SVE2 in some APIs. In the recent JDK20, the code generator was fine-tuned for Arm micro-architectures such as Neoverse V1 and Neoverse N2.
AArch64 supports the mask (predicate) feature of the Vector API on both Neon and SVE machines.
On a platform like Neon, which has no predicate registers, an instance of VectorMask<E> is compiled into a vector register, just like a Vector object, and a mask-accepting operation is in general composed of the equivalent unmasked operation plus a blend. For the core loop above, the generated code shown below performs an unpredicated add first. The result, held in v18, is then adjusted by the following BSL instruction, which selects bitwise between v18 and v16 according to v17; each lane of v17 is either all ones or all zeros.
ldr  q16, [x17, #16]
ldr  q17, [x18, #16]
add  v18.4s, v16.4s, v17.4s    <-- unpredicated add
cmgt v17.4s, v16.4s, v17.4s
bsl  v17.16b, v18.16b, v16.16b <-- blend
add  x14, x10, x14
str  q17, [x14, #16]
In JDK18, we enabled the predicate feature for the Vector API on SVE. With predication, the code generated for the same loop on an SVE machine is shown below.
ldr   q16, [x17, #16]
ldr   q17, [x18, #16]
cmpgt p0.s, p7/z, z16.s, z17.s
add   z16.s, p0/m, z16.s, z17.s <-- predicated add
add   x14, x10, x14
str   q16, [x14, #16]
AArch64 code generation in the JVM targets both families of SIMD instructions, Neon and SVE/SVE2. The JVM option UseSVE was introduced in JDK16; it denotes the highest SVE instruction set version the code generator may use. So far its value can be 0, 1, or 2, corresponding to Neon, SVE, and SVE2 respectively.
By default, UseSVE is initialized to the highest supported instruction set, determined during JVM startup. Selecting a lower version is allowed; for example, passing -XX:UseSVE=0 forces Neon code generation on an SVE-capable machine. Setting a version beyond the real capability of the hardware causes a warning, and the JVM reverts the option to the default.
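If you need to check the effective value from Java, one hedged option is the HotSpot diagnostic MXBean; note that the UseSVE flag only exists on AArch64 builds of HotSpot, so getVMOption throws IllegalArgumentException elsewhere:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class SveCheck {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Throws IllegalArgumentException on builds without the UseSVE flag.
        System.out.println("UseSVE = " + bean.getVMOption("UseSVE").getValue());
    }
}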
For most vector operations, the generated code simply uses the highest supported instruction set; for example, as discussed above, the add operation generates a predicated add on SVE. But to obtain the best performance, we fine-tuned the code generator for some special APIs. Vector.lane(int i) gets the element at the given index. On an SVE machine it can be implemented with a Neon instruction whenever the target lane lies within the low 128 bits. For example, Byte512Vector.lane(7) generates the more efficient:
smov x11, v16.b[7]
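A hedged Java-side sketch of the kind of call that produces this instruction (variable names are ours):

// Read lane 7 of a 512-bit byte vector; since lane 7 sits within the low
// 128 bits, SVE hardware can extract it with a single Neon smov.
ByteVector bv = ByteVector.fromArray(ByteVector.SPECIES_512, bytes, 0);
byte e = bv.lane(7);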
In addition, we optimized the generated code for the hardware it runs on. For example, for the add reduction operation ByteVector.reduceLanes(VectorOperators.ADD), on Neoverse N2 we generate
addv b17, v16.16b
smov x12, v17.b[0]
add  w12, w12, w16, sxtb
instead of
uaddv d17, p0, z16.b
smov  x15, v17.b[0]
add   w15, w14, w15, sxtb
to get better performance. This choice follows the Arm Neoverse N2 Software Optimization Guide.
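For reference, a hedged sketch of the Java code behind this reduction (the array name is ours):

// Sum all byte lanes into a single scalar; the JIT lowers this to one of the
// reduction sequences shown above, depending on the micro-architecture.
ByteVector bv = ByteVector.fromArray(ByteVector.SPECIES_128, bytes, 0);
byte sum = bv.reduceLanes(VectorOperators.ADD);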
In JDK19, we enabled SVE2 features in the Vector API, which improves the performance of some APIs considerably. For example, the Vector API defines the lanewise COMPRESS_BITS operation, which applies compress(int i, int mask) to each lane of a vector.
Vector<E> lanewise(VectorOperators.Binary op, Vector<E> v); // called with op = VectorOperators.COMPRESS_BITS
Here, compress(int i, int mask) returns the value obtained by compressing the bits of the specified int value i in accordance with the specified bit mask (the semantics of Integer.compress).
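A hedged usage sketch (array names are ours): each lane i of the result equals Integer.compress(a[i], m[i]).

IntVector av = IntVector.fromArray(SPECIES, a, 0);
IntVector mv = IntVector.fromArray(SPECIES, m, 0);
// Lanewise bit-compress of av under the bit masks in mv.
IntVector rv = av.lanewise(VectorOperators.COMPRESS_BITS, mv);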
The default Java implementation of COMPRESS_BITS applies compress(int i, int mask) to each lane one by one. Since there is no direct Neon instruction for bit compression, this generates more than 60 instructions per lane. SVE2 reduces the final generated code from 240+ instructions (for 4 int elements on a machine with 128-bit vectors) to a single instruction, BEXT:
bext z16.b, z17.b, z17.b
The following chart shows the benchmark results we measured on a machine with 128-bit vectors, comparing compressBits with and without SVE2. The improvement is more than 70x.
Figure 3 Compress and expand bit benchmark
We evaluated the performance of the Vector API in JDK20 with the micro-benchmark suite in openjdk/panama-vector, which is based on JMH (Java Microbenchmark Harness).
The following charts show the performance ratio of Vector API code to the corresponding non-Vector API code on two SVE machines, Neoverse N2 and Neoverse V1. The non-Vector API code is written in plain Java without using the Vector API.
Figure 4 Vector API vs Non Vector API on Neoverse V1
Figure 5 Vector API vs Non Vector API on Neoverse N2
Key findings:
Netlib is a high-performance, hardware-accelerated implementation of BLAS, LAPACK, and ARPACK in Java. The project supplies three implementations of BLAS: Default Java (plain Java code), Vector API (Java code built on the Vector API), and Native (a JNI binding to a native BLAS library).
We evaluated the performance of BLAS with the benchmark set built into Netlib, comparing these three implementations on Neoverse N2 with JDK20. In general, the Vector API performs best when the data size is small, and it always beats Default Java. To explain the results in more detail, we dig into one typical case, l1.SdotBenchmark.blas, which benchmarks the multiply-accumulate operation: it computes the product of two float numbers and adds the product to an accumulator, c += a * b.
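To make the kernel concrete, here is a hedged Vector API sketch of an sdot-style loop (our illustration, not Netlib's actual code; imports from jdk.incubator.vector are omitted):

static final VectorSpecies<Float> FS = FloatVector.SPECIES_PREFERRED;

// Dot product of the first n elements of a and b.
static float sdot(int n, float[] a, float[] b) {
    FloatVector acc = FloatVector.zero(FS);
    int i = 0;
    for (; i < FS.loopBound(n); i += FS.length()) {
        FloatVector av = FloatVector.fromArray(FS, a, i);
        FloatVector bv = FloatVector.fromArray(FS, b, i);
        acc = av.fma(bv, acc);               // lanewise c += a * b
    }
    float c = acc.reduceLanes(VectorOperators.ADD);
    for (; i < n; i++) {                     // scalar tail
        c += a[i] * b[i];
    }
    return c;
}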
The performance results differ depending on the data size. The following chart shows the performance gain of the Vector API and Native implementations relative to Default Java. The X-axis is the length of the input arrays; the Y-axis is the performance normalized to Default Java (colored in gray). The Native call through JNI has the best score when the data size is large, roughly above 5000 elements. But when the data size is small, the JNI call overhead and the cost of copying data between the Java heap and native memory cannot be ignored, and JNI performs worse than Default Java. The Vector API implementation is always better than plain Java code; moreover, when the data size is not too large (below 5000), it is the best choice.
Figure 6 Benchmark of blas.sdot in three ways
The Java Vector API is a significant step towards providing a good SIMD abstraction layer for high-level application developers. On the one hand, it goes beyond the limitations of auto-vectorization, generating SIMD instructions in a more robust way. On the other hand, compared to JNI, the code is more portable and much easier to maintain. In the future, besides more useful APIs, integration with Project Valhalla is a promising avenue for further performance improvements.
[1] Scalable Vector Extensions
[2] Enabling Vectorized Engine in Apache Spark