Improving Java performance on OCI Ampere A1 compute instances

November 8, 2021

Introduction

Oracle Cloud Infrastructure (OCI) recently launched the Ampere A1 Compute family of Arm Neoverse N1-based VMs and bare-metal instances. These A1 instances use Ampere Altra CPUs, which were designed specifically to deliver performance, scalability, and security for cloud applications. The A1 Flex VM family supports an unmatched number of VM shapes that can be configured with 1-80 cores and 1-512GB of RAM (up to 64GB per core). Ampere A1 compute is also offered in bare-metal configurations with up to 2 sockets and 160 cores per instance.

In this blog, we investigate the performance of Java on OCI A1 instances using SPECjbb2015^[1]. We tuned SPECjbb2015 for best performance by referring to the configurations used in published SPECjbb2015 submissions. Those Java options may not suit all Java workloads, because they rely on a very large heap and other settings that are unrealistic for production. The goal here is to see the best scores we can achieve on A1 using SPECjbb2015. We compared SPECjbb2015 results across different OpenJDK versions to identify patches that improve performance. Because SPECjbb2015 is a latency-sensitive workload, we also present the impact of the Arm Large System Extensions (LSE) on performance. We built OpenJDK from source ourselves and ran all tests on:

BM.Standard.A1.160: Ampere® Altra® CPUs based on Arm Neoverse N1 cores, 3.0GHz all-core sustained max. Each core comes with its own 64KB L1 I-cache, 64KB L1 D-cache, and 1MB L2 D-cache.

SPECjbb2015 is a benchmark that has been developed from the ground up to measure Java server performance:

"The SPECjbb2015 benchmark is based on the usage model of a worldwide supermarket company with an IT infrastructure that handles a mix of point-of-sale requests, online purchases, and data-mining operations. It exercises Java 7 and higher features, using the latest data formats (XML), communication using compression, and secure messaging".

This benchmark reports two metrics for evaluating JVM performance: max-jOPS, which measures throughput, and critical-jOPS, which measures critical throughput under service-level agreements (SLAs) with response times ranging from 10 to 100 milliseconds.

Comparison of SPECjbb2015 scores between default and tuned configurations

SPECjbb2015 can be run in three modes: Composite, MultiJVM, and Distributed. In our evaluation, we used the Composite mode, which uses a single JVM to run the workload. All tests were run on Oracle-Linux-8 with 80 cores of BM.Standard.A1.160 by setting numactl --membind=0 -C 0-79. We first ran SPECjbb2015 with the out-of-the-box default settings to get the baseline performance. We then tuned the Java flags to try to get the best max-jOPS and critical-jOPS on A1.
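Before a run, it can help to confirm that the JVM sees the CPU set and heap limits you intended. The following is a minimal sketch, not part of SPECjbb2015: the class name `JvmView` and its methods are hypothetical, and it assumes a Linux JVM, where the reported processor count should normally follow the affinity mask set by numactl.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper for illustration; not part of SPECjbb2015.
class JvmView {
    // CPUs the JVM believes it may use. On Linux this normally reflects
    // the affinity mask applied by numactl or taskset.
    static int visibleCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Maximum heap size, in GiB, as seen by the running JVM.
    static double maxHeapGiB() {
        return Runtime.getRuntime().maxMemory() / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // The flags the JVM was actually started with.
        List<String> flags = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Visible CPUs : " + visibleCpus());
        System.out.printf("Max heap GiB : %.1f%n", maxHeapGiB());
        System.out.println("JVM flags    : " + flags);
    }
}
```

Launched under the same numactl command as the benchmark, this would be expected to report 80 visible CPUs; a different number suggests the binding did not take effect.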

Our initial Java flags were borrowed from two public SPECjbb2015 submissions^[2,3]. We scaled the flags to match the core count of our hardware and removed flags that do not apply to the Arm architecture, for example -XX:UseAVX. Because we only used cores from a single NUMA domain in our tests, we also removed flags that have no impact in this case, such as -XX:+UseNUMA.

In our evaluation, we used the following Java flags:

-server -Xms124g -Xmx124g -Xmn114g -XX:SurvivorRatio=20 -XX:MaxTenuringThreshold=15 -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -XX:+UseParallelGC -XX:+AlwaysPreTouch -XX:-UseAdaptiveSizePolicy -XX:-UsePerfData -XX:ParallelGCThreads=40 -XX:+UseTransparentHugePages -XX:+UseCompressedOops -XX:ObjectAlignmentInBytes=32
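The arithmetic behind these heap flags can be sketched as follows. This is an illustration under stated assumptions, not output from the JVM: it assumes the young generation is split into one eden and two survivor spaces with SurvivorRatio being eden divided by one survivor, and that compressed oops address at most 2^32 object-alignment units. The class `HeapLayout` is hypothetical, and a real JVM rounds these sizes to its own alignment, so actual values differ slightly.

```java
// Hypothetical helper for illustration; real JVM sizes are rounded differently.
class HeapLayout {
    static final long GIB = 1024L * 1024L * 1024L;

    // Young generation = eden + 2 survivors, with eden = ratio * survivor.
    static long survivorBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorBytes(youngBytes, survivorRatio);
    }

    // Theoretical upper bound for a heap addressed with 32-bit compressed
    // references that are scaled by the object alignment.
    static long maxCompressedOopsHeapBytes(int objectAlignmentBytes) {
        return (1L << 32) * objectAlignmentBytes;
    }

    public static void main(String[] args) {
        long heap = 124 * GIB;   // -Xms124g -Xmx124g
        long young = 114 * GIB;  // -Xmn114g
        int ratio = 20;          // -XX:SurvivorRatio=20
        System.out.printf("Old generation : %.1f GiB%n", (heap - young) / (double) GIB);
        System.out.printf("Eden           : %.1f GiB%n", edenBytes(young, ratio) / (double) GIB);
        System.out.printf("Each survivor  : %.1f GiB%n", survivorBytes(young, ratio) / (double) GIB);
        System.out.printf("Compressed-oops bound at 32-byte alignment: %d GiB%n",
                maxCompressedOopsHeapBytes(32) / GIB);
    }
}
```

Under these assumptions, a 124GB heap stays below the 128 GiB bound for 32-byte alignment, whereas the default 8-byte alignment would cap compressed oops at 32 GiB. That is a plausible reason for combining -XX:+UseCompressedOops with -XX:ObjectAlignmentInBytes=32, though the original submissions do not state it.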

SPECjbb2015 properties:

-Dspecjbb.comm.connect.timeouts.connect=700000 -Dspecjbb.comm.connect.timeouts.read=700000

-Dspecjbb.comm.connect.timeouts.write=700000 -Dspecjbb.customerDriver.threads.probe=80

-Dspecjbb.forkjoin.workers.Tier1=80 -Dspecjbb.forkjoin.workers.Tier2=1 -Dspecjbb.forkjoin.workers.Tier3=16

-Dspecjbb.heartbeat.period=100000 -Dspecjbb.heartbeat.threshold=1000000

OS settings:

echo always > /sys/kernel/mm/transparent_hugepage/enabled

echo always > /sys/kernel/mm/transparent_hugepage/defrag
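To check that these settings took effect, the same sysfs files can be read back; the kernel marks the active mode with square brackets, for example `[always] madvise never`. The sketch below is a hypothetical helper (`ThpCheck` is not part of any tool mentioned here) that extracts the bracketed value.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration.
class ThpCheck {
    // Returns the token enclosed in [...] or "unknown" if none is found.
    static String activeMode(String sysfsLine) {
        int open = sysfsLine.indexOf('[');
        if (open < 0) {
            return "unknown";
        }
        int close = sysfsLine.indexOf(']', open + 1);
        if (close < 0) {
            return "unknown";
        }
        return sysfsLine.substring(open + 1, close);
    }

    public static void main(String[] args) throws IOException {
        String[] names = {"enabled", "defrag"};
        for (String name : names) {
            Path file = Paths.get("/sys/kernel/mm/transparent_hugepage", name);
            if (Files.isReadable(file)) {
                String line = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
                System.out.println(name + " = " + activeMode(line));
            } else {
                // Non-Linux systems or kernels without THP support.
                System.out.println(name + " is not readable on this system");
            }
        }
    }
}
```

For the configuration above, both files would be expected to report `always`.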

Figure 1 compares the performance scores using default settings and our tuned configurations:

Figure 1. SPECjbb2015 scores using base versus tuned settings

We built OpenJDK-11.0.12-ga using GCC 10 with the option '-mno-outline-atomics'. With this build, the best max-jOPS under the tuned configuration is 122,074, a 31% uplift over the baseline score of 92,973. The critical-jOPS improves by 51%, from 47,729 to 72,171.
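The uplift percentages follow directly from the reported scores. As a small worked check (the class `Uplift` is a hypothetical name used only for this illustration):

```java
// Hypothetical helper for illustration.
class Uplift {
    // Relative improvement of 'tuned' over 'baseline', in percent.
    static double percent(double baseline, double tuned) {
        return (tuned - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores reported for OpenJDK-11.0.12-ga, baseline versus tuned.
        System.out.printf("max-jOPS uplift      : %.1f%%%n", percent(92_973, 122_074));
        System.out.printf("critical-jOPS uplift : %.1f%%%n", percent(47_729, 72_171));
    }
}
```

The two results round to 31% and 51%, respectively.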

Note that the tuned configuration is aimed at achieving higher max-jOPS and critical-jOPS. It may not suit all Java workloads, because the very large heap size and large object alignment are not realistic for production runs.

SPECjbb2015 scores over different versions of OpenJDKs

We also ran the benchmark on a range of OpenJDK versions to see how performance changes with the patches that went into each release.

Figure 2. SPECjbb scores on OpenJDK 11 to 16 using tuned configurations

Figure 2 shows the SPECjbb2015 scores for GA versions of OpenJDK 11 to 16 using our tuned configuration. Overall, the max-jOPS does not change much from JDK11 to JDK16. The lowest max-jOPS is 122,074 on jdk-11.0.12-ga and the highest is 127,957 on jdk-14-ga. The critical-jOPS improves from 72,171 to 97,810, which is a 35.5% uplift. Most of this improvement comes from patches that landed between JDK11 and JDK14.

Impact of LSE on Java performance

Arm introduced the Large System Extensions (LSE) in Armv8.1. LSE provides a set of atomic operations such as compare-and-swap (CAS) and atomic load-add (LDADD). Most of the new operations have load-acquire or store-release semantics, and they are low-cost compared to the legacy way of implementing atomic operations through pairs of load/store exclusives. In our experiments, we see that LSE benefits workloads that are highly concurrent and synchronize heavily. We evaluated the impact of LSE on Java performance by running JDK builds compiled with and without LSE. Users can build a binary with LSE by adding +lse to the GCC -march option (for example, -march=armv8-a+lse). To make binaries run on any Armv8 system (some of which may not support LSE), users can instead add the compiler flag -moutline-atomics. This flag is available from GCC 9.4 and is enabled by default from GCC 10. When GCC compiles software with this flag, it generates code that checks at run time whether LSE is supported. In our directed tests, we saw negligible performance overhead from this dynamic check.
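To make "highly concurrent with heavy synchronization" concrete, the sketch below shows the kind of access pattern involved: many threads performing atomic updates on one shared location. It is an analogy only. It is not taken from SPECjbb2015, the class `ContendedCounter` is hypothetical, and the GCC flags discussed above affect the JVM's own native code, so this Java loop is not a direct test of those flags.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of a heavily contended atomic update.
class ContendedCounter {
    // Starts 'threads' workers that each increment one shared counter
    // 'incrementsPerThread' times, then returns the final value.
    static long run(int threads, int incrementsPerThread) {
        AtomicLong shared = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    shared.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return shared.get();
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        long start = System.nanoTime();
        long total = run(threads, 1_000_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(threads + " threads, total = " + total + ", " + elapsedMs + " ms");
    }
}
```

Every increment must win ownership of the same cache line, so the cost of each atomic operation dominates as the thread count grows.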

We evaluated the impact of LSE with a more common JVM heap size '-Xmx30g -Xms2g' and used G1GC, the default GC in JDK11. Though we achieved slightly lower max-jOPS and critical-jOPS with this setting, this setting is more realistic for real-world Java applications. The following are the new Java flags:

-server -XX:+UseG1GC -Xmx30g -Xms2g -XX:NewSize=1332k -XX:MaxNewSize=18g -XX:GCTimeRatio=19 -XX:-InlineSynchronizedMethods -XX:+LoopUnswitching -XX:-OptimizeFill -XX:+UseSignalChaining

Figure 3 compares the impact of enabling LSE on two versions of OpenJDK11, that is, 11.0.8-ga and 11.0.12-ga.

Figure 3. Impact of LSE on SPECjbb2015 performance

The chart on the left shows a huge improvement from LSE instructions on both max-jOPS and critical-jOPS: with JDK-11.0.8-ga, max-jOPS and critical-jOPS improve by 77.9% and 109.9%, respectively. The chart on the right, for JDK-11.0.12-ga, shows that LSE uplifts critical-jOPS by only 5%, and max-jOPS even decreases slightly. Because there are only small changes between the two update releases, it is surprising to see such different behavior, especially on the SLA-dependent score. We investigated the changes between the two versions to isolate the patches responsible for the performance gap, and found that the patch JDK-8248214^[4] caused the difference in this case. This patch reduces false-sharing cache contention by adding padding between two volatile variables that are declared side by side. The issue was fixed in OpenJDK-11.0.9 and later versions. This finding is good evidence that LSE is important to the performance of heavily contended workloads.
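The idea of padding against false sharing can be illustrated in Java. This is not the code from JDK-8248214; it is a hypothetical sketch (`PaddedCounters`) in which two threads update two independent counters stored either in adjacent array slots or in slots 128 bytes apart. It assumes a 64-byte cache line, so a spacing of 16 longs guarantees the two counters sit on different lines.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical illustration of false sharing and padding.
class PaddedCounters {
    // Assumption: 16 longs = 128 bytes, more than one 64-byte cache line.
    static final int PADDED_STRIDE = 16;

    // Two threads each increment their own slot; the slots are 'stride'
    // longs apart. Returns the final values of both slots.
    static long[] run(int stride, int iterations) {
        AtomicLongArray slots = new AtomicLongArray(stride + 1);
        Thread first = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                slots.incrementAndGet(0);
            }
        });
        Thread second = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                slots.incrementAndGet(stride);
            }
        });
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return new long[] {slots.get(0), slots.get(stride)};
    }

    public static void main(String[] args) {
        int iterations = 20_000_000;
        for (int stride : new int[] {1, PADDED_STRIDE}) {
            long start = System.nanoTime();
            run(stride, iterations);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("stride " + stride + ": " + elapsedMs + " ms");
        }
    }
}
```

With a stride of 1 the two counters usually share a cache line, so each thread's write invalidates the other's copy even though the data is logically independent. The padded layout avoids this; how large the timing difference is depends on the hardware and JVM.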

Conclusion

In this blog, we evaluated Java performance using SPECjbb2015 on OCI Ampere A1 Compute instances. We compared SPECjbb2015 scores with and without tuned configurations; our tuning improves max-jOPS and critical-jOPS by 31% and 51%, respectively, on OpenJDK-11.0.12-ga. We tested SPECjbb2015 on a range of OpenJDK versions from 11 to 16. Although max-jOPS does not improve much on newer versions of OpenJDK, critical-jOPS gains 35% from JDK11 to JDK16. We also evaluated the impact of LSE on performance using a more common Java heap size and the JDK's default GC. Our major finding is that LSE is critical to heavily contended workloads.

References

[1] https://www.spec.org/jbb2015

[2] https://www.spec.org/jbb2015/results/res2020q2/jbb2015-20200416-00540.html

[3] https://www.spec.org/jbb2015/results/res2019q2/jbb2015-20190313-00350.html

[4] https://bugs.openjdk.java.net/browse/JDK-8248214
