Improving Java performance on Neoverse N1 systems

Shiyou Huang
July 14, 2021
6 minute read time.

In the past few years, we have seen rapid growth of the Java ecosystem on hardware built on the Arm Neoverse N1 platform. Common applications written in Java include Hadoop, Kafka, and Tomcat. Given how widely Java applications are used, Java performance on Arm is critical. In this blog, we investigate Java performance using an enterprise benchmark on an Arm Neoverse N1-based CPU, which has demonstrated significant cost, power, and performance improvements for server CPUs. We built OpenJDK from source with different compiler flags to test the impact of the Large System Extensions (LSE), the new architecture extension introduced in Armv8.1. The Java enterprise benchmark we tested provides two throughput-based metrics: maximum throughput and throughput under a service level agreement (SLA). We tuned performance by tweaking an initial set of Java runtime flags and augmenting system parameters with an automated optimization tool. All the runs in this blog are on the AWS Graviton2 m6g.16xlarge instance, which has 64 Neoverse N1 cores, 64KB of L1D cache per core, 1MB of L2 cache per core, and 32MB of shared system-level cache (SLC). The OS distribution is Ubuntu Focal. All results in this blog are relative performance numbers; higher is better.

Java flags and OS configurations

Our initial Java flags were borrowed from two public online submissions for a widely used Java enterprise benchmark [1, 2]. We scaled the flags based on the core count of our hardware and removed flags that do not apply to the Arm architecture, for example -XX:UseAVX. As the machine we used has only one NUMA domain, we also removed flags that have no impact in this case, such as -XX:+UseNUMA.

In our initial evaluation, we used the following configurations:

Java flags:

"-server -Xms124g -Xmx124g -Xmn114g -XX:SurvivorRatio=20 -XX:MaxTenuringThreshold=15 -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -XX:+UseParallelOldGC -XX:+AlwaysPreTouch -XX:-UseAdaptiveSizePolicy -XX:-UsePerfData -XX:ParallelGCThreads=32 -XX:+UseTransparentHugePages -XX:+UseCompressedOops -XX:ObjectAlignmentInBytes=32"

OS configurations:

kernel.sched_migration_cost_ns = 1000
kernel.sched_rt_runtime_us = 990000
kernel.sched_latency_ns = 24000000
kernel.sched_wakeup_granularity_ns = 4000000
kernel.sched_min_granularity_ns = 5000000
kernel.shmall = 64562836

vm.dirty_background_ratio = 15
vm.dirty_writeback_centisecs = 1500
vm.dirty_expire_centisecs = 10000
vm.dirty_ratio = 8
vm.zone_reclaim_mode = 1
vm.swappiness = 0

dev.raid.speed_limit_min = 4000
net.core.netdev_max_backlog = 2048
net.core.rmem_default = 106496
net.core.rmem_max = 4194304
net.core.somaxconn = 2048
net.core.wmem_default = 65536
net.core.wmem_max = 8388608
net.ipv4.tcp_adv_win_scale = 0
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_fin_timeout = 40
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_rmem = 4096 98304 196608
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_window_scaling = 0
net.ipv4.tcp_wmem = 4096 131072 8388608
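These parameters are typically applied with sysctl or persisted under /etc/sysctl.d. As a quick sanity check, the current value of any of them can be read back from /proc/sys, at least on kernels that still expose the scheduler knobs there (such as the 5.4 kernel shipped with Ubuntu Focal). The sketch below is illustrative only; the class name SysctlCheck and the chosen parameters are examples drawn from the list above.

import java.nio.file.Files;
import java.nio.file.Path;

public class SysctlCheck {
    public static void main(String[] args) throws Exception {
        // A sysctl name maps to a /proc/sys path by replacing '.' with '/'.
        String[] params = {"kernel.sched_wakeup_granularity_ns",
                           "vm.swappiness",
                           "net.ipv4.tcp_congestion_control"};
        for (String param : params) {
            Path path = Path.of("/proc/sys/" + param.replace('.', '/'));
            System.out.println(param + " = " + Files.readString(path).trim());
        }
    }
}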

While most of the improvement was achieved by hand-tuning and combining the borrowed configurations, the OS parameters were augmented using Concertio Optimizer Studio [3]. We changed kernel.sched_wakeup_granularity_ns back to 4000000 from 33300000, as this fixed a large regression in throughput under SLA. For the rest of the article, we refer to the throughput under SLA as the SLA score for brevity. The following chart compares the performance scores using the default settings and the configurations above:

[Chart: performance improvement from the tuned flags]

We normalized the baseline results, obtained without any tuning, to 1. As we can see, tuning is critical to the performance of Java applications: the throughput improves by 40% and the SLA score by more than 80% after the tuning configurations are applied.

Impact of LSE

Arm introduced the Large System Extensions (LSE) in Armv8.1, which provide a set of atomic instructions such as compare-and-swap (CAS) and atomic load-and-add (LDADD). Most of these instructions have load-acquire or store-release semantics and are lower-cost than the legacy approach of implementing atomic operations with pairs of load/store exclusives. In our experiments, we see that LSE benefits workloads that are highly concurrent and synchronization-heavy. We evaluated the impact of LSE on Java performance by running with a JDK compiled with and without LSE. Users can build a binary with LSE by adding +lse to the GCC -march option. To make the binary run on any Armv8 system (some of which may not support LSE), users can instead compile with the -moutline-atomics flag. This flag was introduced in GCC 9.4 and is enabled by default in GCC 10. With this flag, GCC generates code that checks at run time whether LSE is supported. In our directed tests, the performance overhead of this dynamic check was negligible.
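As an illustration of the kind of code that benefits: every call to the java.util.concurrent.atomic classes in the sketch below compiles down to an atomic read-modify-write. On an LSE-capable CPU the JIT can emit single instructions such as LDADD or CAS for them; without LSE it falls back to a load-exclusive/store-exclusive retry loop, which is more expensive under contention. This is only an illustrative micro-example, not the benchmark used in this blog; the class name, thread count, and iteration count are arbitrary.

import java.util.concurrent.atomic.AtomicLong;

public class AtomicContention {
    private static final AtomicLong counter = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        int nThreads = 64;   // for example, one thread per Neoverse N1 core on m6g.16xlarge
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1_000_000; j++) {
                    // Atomic increment: LDADD with LSE, an LDXR/STXR retry loop without it.
                    counter.getAndIncrement();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("final count = " + counter.get());
    }
}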

We evaluated the impact of LSE with a more common JVM heap configuration, '-Xmx30g -Xms2g', and used G1GC, the default GC in JDK 11. Although this configuration gives similar throughput and a slightly lower SLA score, it is more realistic for real-world Java applications. The same OS parameters as above were applied. The following are the new Java flags:

"-server -XX:+UseG1GC -Xmx30g -Xms2g -XX:NewSize=1332k -XX:MaxNewSize=18g -XX:GCTimeRatio=19 -XX:-InlineSynchronizedMethods -XX:+LoopUnswitching -XX:-OptimizeFill -XX:+UseSignalChaining"

The following charts compare the impact of enabling LSE on two versions of OpenJDK 11: 11.0.8-ga and 11.0.10-ga. Results from the JDK built without LSE are normalized and used as the baseline.

[Charts: impact of LSE with JDK 11.0.8 (left) and JDK 11.0.10 (right)]

The chart on the left shows an 11% improvement in throughput and a 45% improvement in SLA score from enabling LSE with JDK-11.0.8-ga. The chart on the right, with JDK-11.0.10-ga, shows LSE uplifting throughput by 2% and the SLA score by 7%. As there are only small changes between these two sub-versions, it is surprising that they show quite different behavior, especially in the SLA score. We investigated the changes between the two versions to isolate the patches responsible for the gap, and found that the patch JDK-8248214 caused the difference in this case. This patch reduces false-sharing cache contention by adding padding between two volatile variables that are declared side by side. The issue was fixed in OpenJDK 11.0.9 and later versions. This is good evidence that LSE is important to the performance of heavily contended workloads.
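The actual JDK-8248214 change pads fields inside HotSpot's C++ task-queue code, but the underlying idea is easy to show in Java. In the hypothetical sketch below, two frequently written volatile fields declared next to each other will usually share a 64-byte cache line, so writers on different cores keep invalidating each other's copy of that line; adding padding gives each field its own line. Note that the JVM does not guarantee field layout, so manual padding like this is only an approximation of what padded C++ structs (or the JDK-internal @Contended annotation) achieve.

// Hypothetical example of false sharing between two hot counters.
class SharedCounters {
    volatile long requests;   // written by thread A
    volatile long errors;     // written by thread B; likely on the same cache line
}

// Padded variant in the spirit of the JDK-8248214 fix: the filler longs push the
// second counter onto a different 64-byte cache line.
class PaddedCounters {
    volatile long requests;
    long p0, p1, p2, p3, p4, p5, p6;   // 56 bytes of padding
    volatile long errors;
}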

Patches

We also tested the benchmark over a variety of OpenJDK versions to search for patch candidates to back-port to OpenJDK 11 for performance enhancement. We used the large heap size in these tests to reduce run-to-run variance, as most patches improve performance by less than 5%. During the regression tests, we identified dozens of patches that potentially improve OpenJDK performance. We list two of them, JDK-8248214 and JDK-8204947, in this blog because they caused the largest uplifts in our tests. The following chart shows the performance results from JDK 11 to JDK 14.

[Chart: performance over different JDK versions]

We use the results of OpenJDK-11.0.8-ga as the baseline. Compared to the baseline scores, the highest throughput speedup is 1.25x and the highest SLA speedup is 4.41x. The huge gap between OpenJDK 11.0.8 and OpenJDK 11.0.9 is caused by false cache-line sharing, the issue fixed by the patch JDK-8248214 mentioned previously.

Ignoring the baseline results, there is nearly a 10% uplift in throughput from the lowest speedup of 1.14x to the highest of 1.25x. For the SLA score, there is a more than 35% uplift from the lowest speedup of 3.26x to the highest of 4.41x. The key patch JDK-8204947, committed between OpenJDK-12+19 and OpenJDK-12+24, accounts for this uplift. It provides a better implementation of the task terminator, taken from the Shenandoah project. As there are ups and downs among the newer JDK versions, we computed the average SLA score for the builds before OpenJDK-12+19 and for the rest, which are 3.54 and 4.25 respectively. On average, these patches therefore improve the SLA score by 20%.

Conclusion

In this blog, we compared the throughput and SLA score of an enterprise Java benchmark with and without tuned configurations on an AWS Graviton2 m6g instance. Our tuning settings improve the throughput and SLA score by 40% and 80%, respectively. We also evaluated the impact of LSE using a more common Java heap size; our major finding is that LSE is critical for heavily contended workloads. Finally, we ran regression tests over different JDK versions to identify performance-related patches that can potentially be back-ported to JDK 11. Overall, these patches uplift the throughput and SLA score by 10% and 20%, respectively.

[1] https://www.spec.org/jbb2015/results/res2020q2/jbb2015-20200416-00540.html

[2] https://www.spec.org/jbb2015/results/res2019q2/jbb2015-20190313-00350.html

[3] https://optimizer.concertio.com
