In the past few years, we have seen rapid growth of the Java ecosystem on hardware built on the Arm Neoverse N1 platform. Common applications built on Java include Hadoop, Kafka, and Tomcat. Given this broad usage, the performance of Java on Arm is critical. In this blog, we investigate Java performance using an enterprise benchmark on Arm's Neoverse N1 based CPU, which has demonstrated large cost, power, and performance improvements in server CPUs. We built OpenJDK from source with different compile flags to test the impact of LSE (Large System Extensions), the architecture extension introduced in Armv8.1. The Java enterprise benchmark we tested provides two throughput-based metrics: maximum throughput and throughput under a service level agreement (SLA). We tuned performance by tweaking an initial set of Java runtime flags and augmented the system parameters using an automated optimization tool. All runs in this blog are on the AWS Graviton2 m6g.16xlarge instance, which has 64 Neoverse N1 cores, 64KB of L1D cache per core, 1MB of L2 cache per core, and 32MB of shared SLC. The OS distribution is Ubuntu 20.04 (Focal). All results in this blog are relative performance numbers; higher is better.
Our initial Java flags were borrowed from two public online submissions for a widely used Java enterprise benchmark [1,2]. We scaled the flags to the core count of our hardware and removed flags that do not apply to the Arm architecture, for example the x86-only -XX:UseAVX. As the machine we used has only one NUMA domain, we also removed flags that have no impact in this configuration, such as -XX:+UseNUMA.
In our initial evaluation, we used the following configurations:
Java flags:
-server -Xms124g -Xmx124g -Xmn114g -XX:SurvivorRatio=20 -XX:MaxTenuringThreshold=15 -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -XX:+UseParallelOldGC -XX:+AlwaysPreTouch -XX:-UseAdaptiveSizePolicy -XX:-UsePerfData -XX:ParallelGCThreads=32 -XX:+UseTransparentHugePages -XX:+UseCompressedOops -XX:ObjectAlignmentInBytes=32
OS configurations:
While most of the improvement was achieved through hand-tuning and combining the borrowed configurations, the OS parameters were augmented with Concertio Optimizer Studio [3]. We changed the parameter kernel.sched_wakeup_granularity_ns back to 4000000 from 33300000, as this fixed a large regression in throughput under SLA. For the rest of the article, we refer to throughput under SLA as the SLA score for brevity. The following chart compares the performance scores with the default settings and with the configurations described above:
We normalized the baseline results without any tuning to 1. As we can see, tuning is critical to the performance of Java applications: throughput improves by 40% and the SLA score by more than 80% after the tuning configurations are applied.
Arm introduced the Large System Extensions (LSE) in Armv8.1, which provide a set of atomic instructions such as compare-and-swap (CAS) and atomic load-and-add (LDADD). Most of the new instructions have load-acquire or store-release semantics and are lower-cost atomic operations compared to the legacy approach of implementing atomics through pairs of load/store exclusives. In our experiments, we see that LSE benefits workloads that are highly concurrent and use heavy synchronization. We evaluated the impact of LSE on Java performance by running with a JDK compiled with and without LSE. Users can build a binary with LSE by appending +lse to the -march option in GCC. To make the binary run on any Armv8 system (some of which may not support LSE), users can instead compile with the flag -moutline-atomics. This flag was introduced in GCC 9.4 and is enabled by default in GCC 10. When GCC compiles software with this flag, it automatically generates code that checks at run time whether LSE is supported. In our directed tests, this dynamic check showed negligible performance overhead.
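To illustrate the kind of code that benefits, consider the following minimal Java sketch (a hypothetical micro-example, not part of the benchmark): many threads contending on a single shared counter. On Armv8.1 hardware, the underlying atomic read-modify-write can be implemented as a single LSE instruction such as LDADD or CAS instead of a load/store-exclusive retry loop, which is where the gains under contention come from.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical micro-example (not the benchmark itself): many threads
// hammering one shared counter. On Armv8.1 hardware the underlying
// atomic read-modify-write can be a single LSE instruction (LDADD/CAS)
// instead of a load/store-exclusive retry loop.
public class ContendedCounter {
    private static final AtomicLong counter = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        int nThreads = Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1_000_000; j++) {
                    counter.incrementAndGet();  // contended atomic increment
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("Final count: " + counter.get());
    }
}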
We evaluated the impact of LSE with a more common JVM heap size, '-Xmx30g -Xms2g', and used G1GC, the default GC in JDK 11. Though we achieved similar throughput and a slightly lower SLA score with this configuration, it is more realistic for real-world Java applications. The same OS parameters as above were applied. The following are the new Java flags:
-server -XX:+UseG1GC -Xmx30g -Xms2g -XX:NewSize=1332k -XX:MaxNewSize=18g -XX:GCTimeRatio=19 -XX:-InlineSynchronizedMethods -XX:+LoopUnswitching -XX:-OptimizeFill -XX:+UseSignalChaining
The following charts compare the impact of enabling LSE on two versions of OpenJDK 11, 11.0.8-ga and 11.0.10-ga. Results from the JDK compiled without LSE are normalized to 1 and used as the baseline.
The chart on the left shows an 11% improvement in throughput and a 45% improvement in SLA score from enabling LSE with JDK-11.0.8-ga. The chart on the right, with JDK-11.0.10-ga, shows LSE uplifting the throughput by 2% and the SLA score by 7%. As there are only small changes between the two sub-versions, it is surprising that they show such different behavior, especially in the SLA score. We investigated the changes between the two versions to isolate the patches responsible for the gap and found that the patch JDK-8248214 caused the difference in this case. This patch reduces false-sharing cache contention by adding padding between two volatile variables that are declared side by side. The issue was fixed in OpenJDK-11.0.9 and later versions. This is good evidence that LSE is important to the performance of heavily contended workloads.
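To show the kind of pattern JDK-8248214 addresses, here is a small, hypothetical Java sketch (not the actual JDK code): two volatile fields updated by different threads. Declared side by side, they are likely to share a cache line, so every write by one thread invalidates the line the other thread is using; inserting padding places them on separate lines.

// Hypothetical illustration of the false-sharing pattern; not the actual
// JDK code changed by JDK-8248214.
public class FalseSharingExample {

    // Prone to false sharing: 'a' and 'b' are typically laid out on the
    // same 64-byte cache line, so writes from one thread keep invalidating
    // the line the other thread is reading and writing.
    static class Unpadded {
        volatile long a;  // updated by thread 1
        volatile long b;  // updated by thread 2
    }

    // Filler fields sized to a 64-byte cache line push 'b' onto a
    // different line from 'a'. The JVM may reorder fields, so this manual
    // padding is only illustrative.
    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long b;
    }
}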
We also tested the benchmark across a variety of OpenJDK versions to search for patch candidates to back-port to OpenJDK 11 for better performance. The large heap size was used in these tests to reduce run-to-run variance, as most patches improve performance by less than 5%. During the regression tests, we identified dozens of patches that potentially improve the performance of OpenJDK. We list two of them, JDK-8248214 and JDK-8204947, in this blog as they caused the largest uplifts in our tests. The following chart shows the performance results from JDK 11 to JDK 14.
We use the results of OpenJDK-11.0.8-ga as the baseline. Compared to the baseline scores, the highest throughput speedup is 1.25X and the highest SLA speedup is 4.41X. The large gap between OpenJDK-11.0.8 and OpenJDK-11.0.9 is caused by the false cache sharing fixed by the previously mentioned patch JDK-8248214.
Ignoring the baseline results, there is nearly a 10% uplift in throughput from the lowest speedup of 1.14X to the highest of 1.25X. For the SLA score, there is a more than 35% uplift from the lowest speedup of 3.26X to the highest of 4.41X. The key patch, JDK-8204947, committed between OpenJDK-12+19 and OpenJDK-12+24, accounts for the uplift. This patch provides a better implementation of the task terminator from the Shenandoah project. As the scores fluctuate across the newer JDK versions, we computed the average SLA speedup for the builds before OpenJDK-12+19 and for the rest, which are 3.54 and 4.25 respectively. By this measure, these patches improve the SLA score by 20% on average.
In this blog, we compared the throughput and SLA score of an enterprise Java benchmark with and without tuned configurations on AWS Graviton2 m6g instances. Our tuning settings improve the throughput and SLA score by 40% and 80%, respectively. We also evaluated the impact of LSE on the results using a more common Java heap size; our major finding is that LSE is critical to heavily contended workloads. Finally, we ran regression tests over different JDK versions to identify performance-related patches that can potentially be back-ported to JDK 11. Overall, these patches uplift the throughput and SLA score by 10% and 20%, respectively.
[1] https://www.spec.org/jbb2015/results/res2020q2/jbb2015-20200416-00540.html
[2] https://www.spec.org/jbb2015/results/res2019q2/jbb2015-20190313-00350.html
[3] https://optimizer.concertio.com