In recent years, demand for the Arm architecture within Tencent has grown significantly, and as Arm servers have been introduced into various Tencent product lines, the demand for Arm-optimized software has grown with it.
To help meet the growing demand for Arm technology, Tencent and Arm have been collaborating on tools, software, and other technology to make it easier for software developers to build Arm-based products and services on Tencent platforms.
Tencent’s KonaJDK has been central to this effort. KonaJDK provides a high-performance, highly stable foundation for creating Java applications for Tencent environments. This blog describes Arm’s status as a first-class platform for KonaJDK and the features that Arm developers can take advantage of in JDK17.
Garbage collection (GC) eliminates the need for programs to release memory manually, reducing the likelihood of memory-management errors. For a GC algorithm, however, cleaning up memory accurately and efficiently is a complex process, and the growth of in-memory data sets, which can range from 10 GB to 100 GB or even 1 TB, exacerbates the problem further. GC algorithms continue to evolve, and only by selecting the most appropriate one can Tencent's businesses achieve their goals.
GC algorithms such as CMS and G1 tend to see pause times grow with heap size, and can even produce minute-long pauses when a full GC is triggered on a very large heap. This has become a major obstacle to their use in latency-sensitive applications, which require a more suitable GC algorithm.
ZGC was added to the JDK to address the latency caused by GC pauses, with the goals of keeping pause times below 10 ms and reducing throughput by no more than 15% relative to G1, while supporting large and extra-large heaps. An experimental version of ZGC was introduced in JDK11 and, after continued refinement, became a production feature in JDK15.
Figure 1: ZGC performance (from "The Design of ZGC" by Per Liden).
The KonaJDK team did substantial work to complete Arm architecture support for ZGC in JDK11.
One initial difficulty the KonaJDK team encountered in supporting ZGC on the Arm architecture was adding the barrier instructions needed to ensure correctness. Because Arm uses a weakly ordered memory model, code that executes correctly on the x86 platform can fail randomly on Arm if a necessary barrier is missing. After initially completing the ZGC support code, the KonaJDK team ran ZGC stress tests and observed random JDK crashes at a rate of a small fraction of a percent of runs. The team analyzed the problem against the community code and ZGC's logic, ultimately fixed it, and the ZGC code has since run millions of consecutive times on the Arm architecture without problems.
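The nature of this class of bug is easiest to see with a small example. The sketch below is illustrative only: the actual fix was in HotSpot's Arm code generation rather than in Java source, and the class name is invented, but it shows how a missing barrier lets a reader observe a "ready" flag before the data that flag is supposed to guard.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Illustrative sketch only (not KonaJDK code): on a weakly ordered machine such
// as Arm, two plain stores may become visible to another core out of order.
// Pairing a release store with an acquire load inserts the barriers that make
// the publication safe; with plain accesses on both sides, the reader could
// occasionally see ready == true while data is still 0.
public class WeakOrderingDemo {
    static int data = 0;
    static boolean ready = false;

    static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findStaticVarHandle(WeakOrderingDemo.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;                // plain store
            READY.setRelease(true);   // release store: orders the store to 'data' before it
        });
        Thread reader = new Thread(() -> {
            while (!(boolean) READY.getAcquire()) {  // acquire load, pairs with the release store
                Thread.onSpinWait();
            }
            System.out.println("data = " + data);    // guaranteed to print 42
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }
}
```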
Like other GC algorithms, ZGC has its own best-fit scenarios. Its biggest advantage is its ability to keep pause times below 10 ms, making it ideal for applications that are sensitive to pauses. To achieve such short pauses, ZGC accepts some loss of performance and additional memory consumption: it shortens the necessary pauses by running several GC tasks concurrently with the application code, and that concurrent execution can reduce application throughput to a degree.
With continued investment from the OpenJDK community, ZGC's performance degradation has been kept relatively low. Across various types of benchmarks, ZGC can exceed G1 by about 5% to 20%, while with small heaps ZGC's performance is about 10% lower than G1's.
If an application uses an oversized heap (tens or even hundreds of GBs) and wants to avoid GC pauses of tens of seconds or even minutes, ZGC is recommended. ZGC is also recommended if the business places strict bounds on acceptable pause times.
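For reference, ZGC is enabled with standard HotSpot options: on JDK11, where it is still experimental, an application is started with -XX:+UnlockExperimentalVMOptions -XX:+UseZGC, while on JDK15 and later -XX:+UseZGC alone is enough. The maximum heap size is set as usual with -Xmx, and -Xlog:gc prints GC activity so the behavior can be verified.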
When an application needs to perform many tasks concurrently, it creates multiple threads, each responsible for executing one task. As workloads grow, creating a thread per task can consume a large amount of memory. In addition, thread switching is performed by the kernel, and when many threads exist the overhead of frequent switching adversely affects performance. The "co-routine" was developed to address this situation.
Co-routines are lightweight threads that balance deployment and execution efficiency. Co-routine switching is done in user space, which is much cheaper than thread switching done in the kernel, and co-routines need less memory than threads. To achieve better performance in high-concurrency situations, co-routines are therefore increasingly used in place of threads.
KonaFiber, Tencent's co-routine implementation, provides better switching performance while remaining compatible with the OpenJDK community's Loom API. KonaFiber is implemented on JDK8 and JDK11, supports the Arm architecture, and can meet the needs of Arm-based applications that require co-routines.
Figure 2: KonaJDK vs. Loom
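For context, the programming model that KonaFiber targets looks roughly like the sketch below, which uses the virtual-thread API from recent OpenJDK Loom builds; the exact API surface exposed by KonaFiber on JDK8 and JDK11 may differ in detail, and the class name here is invented for illustration.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

// Sketch of the Loom-style programming model (requires a recent JDK with
// virtual threads). Creating 100,000 platform threads for this workload would
// consume an enormous amount of memory; lightweight co-routines/virtual
// threads make the one-task-per-thread pattern practical.
public class FiberSketch {
    public static void main(String[] args) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            IntStream.range(0, 100_000).forEach(i ->
                executor.submit(() -> {
                    Thread.sleep(Duration.ofMillis(10)); // simulated blocking I/O
                    return i;
                }));
        } // close() waits for all submitted tasks to complete
    }
}
```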
To meet workload needs and provide better co-routine switching performance, KonaFiber uses a JKU-based StackFul scheme that creates an independent stack for each co-routine. When a co-routine is switched, beyond checking the co-routine's pin state and context, only the frame pointer and stack pointer need to be modified to complete the switch. KonaFiber's StackFul solution uses more memory and is better suited to workloads that are less sensitive to memory consumption but more sensitive to performance.
Co-routine switching performance data is shown in figures 3 and 4: figure 3 compares the number of co-routine switches per second, and figure 4 compares memory consumption.
Figure 3: Performance comparison of co-routine switching methods
Figure 4: Memory usage per co-routine method
KonaFiber's implementation focuses on code refactoring, and is continuously optimized in a number of ways:
The co-routine implementation is continuously optimized to reduce resource consumption.
GC is optimized to reduce the overhead that co-routines introduce to GC.
It is extensively tested to improve robustness and stability.
KonaFiber provides higher and more stable scheduling performance than Loom, the OpenJDK community's co-routine implementation. Figures 5 and 6 compare the number of dispatches per second between KonaFiber and Loom with different numbers of co-routines.
Figure 5: Loom co-routine scheduling performance
Figure 6: KonaFiber co-routine scheduling performance
KonaFiber is now open source in KonaJDK8 and will be open-sourced in KonaJDK11. The KonaJDK team continues to track the Loom community and to improve KonaFiber's implementation.
During a GC, several GC threads work on different tasks in parallel, but the processing time of the tasks varies, which leaves the load unevenly distributed among the GC threads. The JDK balances the load between GC threads, and reduces GC pause time, by having an idle thread look at the task queues of other GC threads: if there is a task the thread can perform, it "steals" the task and executes it. The process continues until the GC ends.
This scheme achieves automatic load balancing, but during execution multiple GC threads may try to "steal" the same task, which creates contention between them and hurts performance.
To optimize this process, in 2016 Google published a new load-balancing algorithm called Optimal Work-Stealing Threads (OWST). With OWST, when multiple GC threads want to "steal" a task, only one thread performs the stealing operation while the others go into a waiting state. The stealing thread checks the task queues of the individual GC threads, wakes waiting threads according to the number of available tasks, and then executes tasks. The algorithm effectively reduces lock contention among the GC threads and improves the efficiency of load balancing.
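The following Java sketch illustrates the idea behind OWST. It is not HotSpot's actual implementation, and the class, field, and method names are invented for illustration: only one "scout" thread scans the queues at a time, and idle peers are woken in proportion to the work the scout finds.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// A much-simplified sketch of the OWST idea. When a worker runs out of local
// work, only one "scout" thread at a time scans the other queues; the remaining
// idle workers block on a semaphore and are woken according to how many
// stealable tasks the scout found, instead of every idle thread contending for
// the same queues at once.
class OwstSketch {
    private final List<ConcurrentLinkedDeque<Runnable>> queues;
    private final int workerCount;
    private final AtomicBoolean scoutElected = new AtomicBoolean(false);
    private final Semaphore wakeups = new Semaphore(0);

    OwstSketch(List<ConcurrentLinkedDeque<Runnable>> queues, int workerCount) {
        this.queues = queues;
        this.workerCount = workerCount;
    }

    /** Called by a worker whose own queue is empty; returns a stolen task, or null if none remain. */
    Runnable stealOrWait() throws InterruptedException {
        while (true) {
            if (scoutElected.compareAndSet(false, true)) {
                try {
                    Runnable stolen = null;
                    int leftover = 0;
                    for (ConcurrentLinkedDeque<Runnable> q : queues) {
                        if (stolen == null) {
                            stolen = q.pollLast();   // steal from the opposite end to the owner
                        }
                        leftover += q.size();        // O(n) here, but this is only a sketch
                    }
                    if (stolen != null) {
                        // Wake roughly as many idle peers as there are remaining tasks.
                        wakeups.release(Math.min(leftover, workerCount));
                        return stolen;
                    }
                    // Nothing to steal anywhere: wake everyone so they can also observe
                    // that the work is done (a real GC runs a termination protocol here).
                    wakeups.release(workerCount);
                    return null;
                } finally {
                    scoutElected.set(false);
                }
            }
            // Not the scout: sleep until the scout signals that work (or termination) was found.
            wakeups.acquire();
        }
    }
}
```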
The OpenJDK community first implemented the OWST algorithm for Shenandoah GC; it was merged into the trunk in the JDK12 release and became the default parallel task terminator. To better support the LTS versions, the KonaJDK team ported OWST to JDK8 and JDK11 and completed the relevant code adaptation and testing. After validation, production-ready OWST support was added to KonaJDK8 and KonaJDK11, effectively reducing the execution time of parallel GC tasks and shortening GC pauses.
In SPECjbb2015 testing with ParallelGC, OWST improved the critical-jOPS score by approximately 8%, with little impact on max-jOPS. In addition, Tencent's internal big-data Map/Reduce and Spark SQL workloads were tested, and their performance improved by about 10%.
KonaJDK is optimized for and supported on the Arm architecture in JDK8 and JDK11, with enhanced support coming for JDK17. The KonaJDK team continuously analyzes and tests modules such as the JDK's core class libraries, runtime, memory management, and execution engine, steadily expanding the JDK's capabilities and improving its performance.
The KonaJDK team is committed to the Arm architecture and is investing in refining the technology to meet the growing demand for it.
Note: this blog has been translated from an article previously posted by Tencent. For the original article, please go here.