Improving Java performance on Neoverse N1 systems

Shiyou Huang
July 14, 2021
6 minute read time.

In the past few years, we have seen rapid growth of the Java ecosystem on hardware built on the Arm Neoverse N1 platform. Common applications that run on Java include Hadoop, Kafka, and Tomcat. Given this broad usage, Java performance on Arm is critical. In this blog, we investigate Java performance using an enterprise benchmark on Arm's Neoverse N1-based CPU, which has demonstrated large cost, power, and performance improvements in server CPUs. We built OpenJDK from source with different compile flags to test the impact of Large System Extensions (LSE), the architecture extension introduced in Armv8.1. The Java enterprise benchmark we tested provides two throughput-based metrics: maximum throughput and throughput under a service level agreement (SLA). We tuned performance by tweaking an initial set of Java runtime flags and augmented system parameters using an automated optimization tool. All runs in this blog are on the AWS Graviton2 m6g.16xlarge instance, which has 64 Neoverse N1 cores, 64KB of L1D per core, 1MB of L2 per core, and 32MB of shared SLC. The OS distribution is Ubuntu Focal. All results in this blog are relative performance numbers; higher is better.

Java Flags and OS configurations

Our initial Java flags were borrowed from two public online submissions of a widely used Java-based benchmark [1, 2]. We scaled the flags to the core count of our hardware and removed flags that do not apply to the Arm architecture, for example -XX:UseAVX. As the machine we used has only one NUMA domain, we also removed flags that have no impact in this case, such as -XX:+UseNUMA.

In our initial evaluation, we used the following configurations:

Java flags:

"-server -Xms124g -Xmx124g -Xmn114g -XX:SurvivorRatio=20 -XX:MaxTenuringThreshold=15 -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -XX:+UseParallelOldGC -XX:+AlwaysPreTouch -XX:-UseAdaptiveSizePolicy -XX:-UsePerfData -XX:ParallelGCThreads=32 -XX:+UseTransparentHugePages -XX:+UseCompressedOops -XX:ObjectAlignmentInBytes=32"

OS configurations:

kernel.sched_migration_cost_ns = 1000
kernel.sched_rt_runtime_us = 990000
kernel.sched_latency_ns = 24000000
kernel.sched_wakeup_granularity_ns = 4000000
kernel.sched_min_granularity_ns = 5000000
kernel.shmall = 64562836

vm.dirty_background_ratio = 15
vm.dirty_writeback_centisecs = 1500
vm.dirty_expire_centisecs = 10000
vm.dirty_ratio = 8
vm.zone_reclaim_mode = 1
vm.swappiness = 0

dev.raid.speed_limit_min = 4000
net.core.netdev_max_backlog = 2048
net.core.rmem_default = 106496
net.core.rmem_max = 4194304
net.core.somaxconn = 2048
net.core.wmem_default = 65536
net.core.wmem_max = 8388608
net.ipv4.tcp_adv_win_scale = 0
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_fin_timeout = 40
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_rmem = 4096 98304 196608
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_window_scaling = 0
net.ipv4.tcp_wmem = 4096 131072 8388608

While most of the improvement was achieved through hand-tuning and combining the borrowed configurations, the OS parameters were further augmented through Concertio Optimizer Studio [3]. We changed the parameter kernel.sched_wakeup_granularity_ns back to 4000000 from 33300000, as this fixed a large regression in throughput under SLA. For the rest of this article, we refer to throughput under SLA as the SLA score for brevity. The following chart compares the performance scores using the default settings and our tuned configurations:

[Chart: performance improved by tuned flags]

We normalized the baseline results without any tuning configurations to 1. As the chart shows, tuning is critical to the performance of Java applications: throughput improves by 40% and the SLA score by more than 80% after the tuning configurations are applied.

Impact of LSE

Arm introduced the Large System Extensions (LSE) in Armv8.1, which provide a set of atomic instructions such as compare-and-swap (CAS) and atomic load-and-add (LDADD). Most of the new operations have load-acquire or store-release semantics and are lower-cost than the legacy approach of implementing atomics through pairs of load/store exclusives. In our experiments, we see that LSE benefits workloads that are highly concurrent and use heavy synchronization. We evaluated the impact of LSE on Java performance by running with a JDK compiled with and without LSE. Users can build a binary with LSE by adding +lse to the GCC -march option. To make the binaries run on any Armv8 system (some of which may not support LSE), users can instead compile with the flag -moutline-atomics. This flag was introduced in GCC 9.4 and is enabled by default in GCC 10. When GCC compiles software with this flag, it generates code that checks dynamically whether LSE is supported. In our directed tests, the performance overhead of this dynamic check was negligible.
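
To illustrate the kind of contended update that LSE accelerates, the sketch below hammers a single AtomicLong from many threads. This is a hypothetical micro-example of the pattern, not the benchmark we ran: on an Armv8.1 system with LSE in use, each increment can be emitted as a single LDADD, whereas without LSE it falls back to a load/store-exclusive retry loop that degrades under contention.

import java.util.concurrent.atomic.AtomicLong;

public class ContendedCounter {
    public static void main(String[] args) throws InterruptedException {
        final AtomicLong counter = new AtomicLong();
        final int threads = Runtime.getRuntime().availableProcessors();
        final long perThread = 10_000_000L;

        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (long n = 0; n < perThread; n++) {
                    // With LSE this maps to a single atomic LDADD instruction;
                    // without LSE it becomes an LDXR/STXR retry loop.
                    counter.getAndIncrement();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(counter.get() + " increments in " + ms + " ms");
    }
}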

We evaluated the impact of LSE with a more common JVM heap size, '-Xmx30g -Xms2g', and used G1GC, the default GC in JDK 11. Though this setting yields similar throughput and a slightly lower SLA score, it is more realistic for real-world Java applications. The same OS parameters as above were applied. The following are the new Java flags:

"-server -XX:+UseG1GC -Xmx30g -Xms2g -XX:NewSize=1332k -XX:MaxNewSize=18g -XX:GCTimeRatio=19 -XX:-InlineSynchronizedMethods -XX:+LoopUnswitching -XX:-OptimizeFill -XX:+UseSignalChaining"

The following charts compare the impact of enabling LSE on two versions of OpenJDK 11, namely 11.0.8-ga and 11.0.10-ga. Results from the JDK compiled without LSE are normalized and used as the baseline.

[Charts: impact of LSE with JDK 11.0.8 (left) and JDK 11.0.10 (right)]

The chart on the left shows an 11% improvement in throughput and a 45% improvement in SLA score from enabling LSE with JDK-11.0.8-ga. The chart on the right, with JDK-11.0.10-ga, shows LSE uplifting throughput by 2% and the SLA score by 7%. As there are only small changes between the two sub-versions JDK-11.0.8-ga and JDK-11.0.10-ga, it is surprising that they show such different behavior, especially in the SLA score. We investigated the changes between the two versions to isolate the patches responsible for the performance gap, and found that the patch JDK-8248214 caused the difference in this case. This patch reduces false-sharing cache contention by adding padding between two volatile variables that are declared side by side. The issue was fixed in OpenJDK-11.0.9 and later versions. This is good evidence that LSE is important to the performance of heavily contended workloads.
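
To make the false-sharing issue concrete, here is a minimal Java sketch of the padding idea; it is purely illustrative and is not the actual JDK-8248214 change, which lives in the JVM's C++ sources:

// Two hot volatile fields written by different threads. If they share a
// cache line, every write by one thread invalidates the line in the other
// thread's cache, serializing otherwise independent updates.
final class SharedCounters {
    volatile long producerCount;   // updated by thread A

    // Manual padding so the two hot fields land on different cache lines
    // (assuming 64-byte lines). This is best-effort, since HotSpot may
    // reorder fields; the JDK itself uses the internal
    // @jdk.internal.vm.annotation.Contended annotation for this purpose.
    long p1, p2, p3, p4, p5, p6, p7;

    volatile long consumerCount;   // updated by thread B
}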

Patches

We also ran the benchmark over a variety of OpenJDK versions to search for patch candidates to back-port to OpenJDK 11 for performance enhancement. The large heap size was used in these tests to reduce run-to-run variance, since most patches improve performance by less than 5%. During the regression testing, we identified dozens of patches that potentially improve the performance of OpenJDK. We highlight two of them, JDK-8248214 and JDK-8204947, in this blog as they caused the largest uplifts in our tests. The following chart shows the performance results from JDK 11 to JDK 14.

[Chart: performance over different JDK versions]

We use the results of OpenJDK-11.0.8-ga as the baseline. Compared to the baseline scores, the highest throughput speedup is 1.25X and the highest SLA speedup is 4.41X. The huge gap between OpenJDK-11.0.8 and OpenJDK-11.0.9 is caused by false cache-line sharing; this issue was fixed by the patch JDK-8248214 mentioned previously.

Ignoring the baseline results, there is nearly a 10% uplift in throughput from the lowest speedup of 1.14X to the highest of 1.25X. For the SLA score, there is a more than 35% uplift from the lowest speedup of 3.26X to the highest of 4.41X. The key patch, JDK-8204947, committed between OpenJDK-12+19 and OpenJDK-12+24, accounts for this uplift; it provides a better implementation of the task terminator taken from the Shenandoah project. As results fluctuate among the newer JDK versions, we computed the average SLA speedup of the builds before OpenJDK-12+19 and of the remaining builds, which are 3.54 and 4.25 respectively. On average, these patches therefore improve the SLA score by 20%.

Conclusion

In this blog, we compared the throughput and SLA score of our enterprise Java benchmark with and without tuned configurations on AWS Graviton2 m6g instances. Our tuning settings improve the throughput and SLA score by 40% and 80%, respectively. We also evaluated the impact of LSE on the results using a more common Java heap size; our major finding is that LSE is critical for heavily contended workloads. Finally, we ran regression tests over different JDK versions to identify performance-related patches that can potentially be back-ported to JDK 11. Overall, these patches uplift the throughput and SLA score by 10% and 20%, respectively.

[1] https://www.spec.org/jbb2015/results/res2020q2/jbb2015-20200416-00540.html

[2] https://www.spec.org/jbb2015/results/res2019q2/jbb2015-20190313-00350.html

[3] https://optimizer.concertio.com
