Java is a popular programming language in the internet domain. Java applications have some characteristics that make their performance behavior distinct from native applications. Because Java bytecode cannot be executed directly on the CPU, Java applications run on a runtime, the Java Virtual Machine (JVM). The JVM first converts bytecode into machine code, either through an interpreter or via the Just-In-Time (JIT) compiler. The machine code generated at runtime plays a critical role in the efficiency and performance of Java applications.
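JIT activity is easy to observe with a standard HotSpot diagnostic flag; the application jar name below is only a placeholder:

```
# Print each method as the JIT compiles it, including its compilation tier.
java -XX:+PrintCompilation -jar app.jar
```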
In some areas of the internet, such as e-commerce, programs must handle diverse user inputs and offer rich functionality. For example, e-commerce applications typically integrate features such as search, recommendations, shopping carts, payments, and order management.
Each feature requires a large amount of runtime code, data, and third-party libraries. As a result, Java-based e-commerce applications can be compiled at runtime into a large amount of machine code, which is stored in the so-called code cache and executed frequently.
In the HotSpot JVM, the code cache is a heap-like structure in a contiguous memory region. The code cache is divided into multiple segments by code type, and users can configure the size of each segment based on application requirements. This design reduces memory fragmentation across segments. These segments include:

- The non-nmethod segment, which holds VM-internal code such as the bytecode interpreter and compiler buffers.
- The profiled nmethod segment, which holds lightly optimized, profiled methods compiled by the C1 compiler.
- The non-profiled nmethod segment, which holds fully optimized methods compiled by the C2 compiler.
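For illustration, the segment sizes can be set with standard HotSpot flags; the sizes below are arbitrary examples rather than recommendations (the three segment sizes must add up to the reserved code cache size), and the jar name is a placeholder:

```
java -XX:+SegmentedCodeCache \
     -XX:ReservedCodeCacheSize=240m \
     -XX:NonNMethodCodeHeapSize=8m \
     -XX:ProfiledCodeHeapSize=116m \
     -XX:NonProfiledCodeHeapSize=116m \
     -jar app.jar
```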
The C2 compiler stores native code in the non-profiled nmethod segment. This segment holds frequently executed hot code, as well as code that was executed multiple times during startup but is rarely called afterward.
Modern CPUs are deeply pipelined and include multiple execution units. The Arm Neoverse CPU front-end fetches instructions from memory and decodes them into low-level hardware operations, known as micro-operations. The backend dispatches these operations and executes them out of order. A large code footprint can hurt front-end performance, causing instruction fetch delays, iTLB refills, instruction pipeline drains, and branch target buffer (BTB) entry refills.
We ran a 10x code-inflation experiment to simulate a large code cache. By increasing the memory allocated for each nmethod, we created a large block of used code cache. We used the DaCapo Java benchmark to measure the performance impact. When this experiment was run on the Neoverse N2 platform, we saw degradation in both throughput (~4-6%) and tail latency (~1-3%). The figure below shows PMU statistics collected using the DaCapo Spring test case, with the non-profiled nmethod size inflated by 10x.
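As a sketch of how such statistics can be gathered, Linux perf can count the relevant Arm PMU events; event availability depends on the platform, and the benchmark jar name is illustrative:

```
# Front-end stalls, L1 I-cache refills, iTLB walks, and branch mispredictions
# while running the DaCapo 'spring' workload.
perf stat -e stall_frontend,l1i_cache_refill,itlb_walk,br_mis_pred_retired \
    java -jar dacapo.jar spring
```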
This experiment cannot fully simulate a large-code Java application, as it only disperses the compiler-generated code across a wider range of memory addresses. Therefore, the performance data does not entirely reflect real-world scenarios. However, the PMU statistics still reveal the impact of code size on front-end performance.
This experiment does not change the executed instructions or the data being used; it only expands the spatial distribution of the code. As a result, front-end CPU resources become a bottleneck.
This performance bottleneck is closely related to the size of front-end resources: cache size, BTB size, and iTLB size. Different Neoverse CPUs have different resource sizes. Regardless of the resource size, we can reduce the impact of this bottleneck with software or configuration optimizations.
Move data out of code cache
In the code cache, each compiled method contains both code and data. The data includes the method header, relocation data, oops, JVMCI data, deoptimization data, and scope metadata. By removing as much data as possible from the code cache, we can effectively reduce its footprint. This optimization increases code density, allowing better use of CPU L1/L2 cache, iTLB, and BTB resources when invoking these functions.
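The footprint of this code and data can be observed on a running JVM with the standard jcmd tool, where `<pid>` stands for the target JVM process id:

```
# Per-segment code cache sizes and usage.
jcmd <pid> Compiler.codecache

# List compiled methods with their addresses and sizes.
jcmd <pid> Compiler.codelist
```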
We are attempting to backport several patches to OpenJDK 21 to measure their performance and PMU impact in code-inflation experiments. These patches reduce the size of the nmethod header and move most of the immutable and mutable data out of the code cache.
In our experiment, the size of non-profiled nmethods decreased by 39%, from 229MB to 149MB. In the DaCapo benchmark results, throughput and tail latency improved as front-end performance metrics were optimized. The PMU data shows that cache refill, iTLB refill, and branch-miss MPKI all decreased.
The reason is that the increased spatial locality of the code improves the efficiency of front-end resource use, which accelerates instruction fetch and decode.
In large code footprint Java applications, the wide address range of the executed code means the CPU needs more MMU and TLB resources to store the virtual-to-physical address mappings, which increases iTLB refills. Applying Transparent Huge Pages (THP) to the code cache region increases the page size, reducing the total number of page-table entries needed and thereby reducing iTLB pressure.
In OpenJDK, enabling the -XX:+UseTransparentHugePages option applies 2MB huge pages to the code cache heap when the Linux OS allows it. With this configuration, we see improvements in performance and in the iTLB refill PMU metric.
-XX:+UseTransparentHugePages
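A minimal configuration sketch, assuming a Linux system: -XX:+UseTransparentHugePages requests huge pages via madvise, so THP must be in madvise or always mode; the jar name is a placeholder:

```
# THP must report 'madvise' or 'always' for the JVM's request to take effect.
cat /sys/kernel/mm/transparent_hugepage/enabled

java -XX:+UseTransparentHugePages -jar app.jar
```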
The total size of hot code in a stable workload is usually small. Because of tiered compilation, hot code typically lives in the non-profiled nmethod segment: Tier 4 (T4) methods are JIT-compiled by the C2 compiler after intensive use, in the order in which their heavy use is detected. As a result, hot and cold code are often interleaved.
To improve CPU front-end performance, a hot-method segment in the code cache can enhance spatial locality for frequently executed code. Clustering hot methods together improves instruction fetch and decode efficiency.
To identify which methods should be placed in this hot region, profiling data must be collected. One approach is to use Java Flight Recorder (JFR) to dynamically adjust code placement at runtime. However, this approach is complex and adds performance overhead from method relocation.
Alternatively, hot methods can be predefined ahead of time. The steps involve:

- Profiling the application offline under a representative workload.
- Extracting a list of the most frequently executed methods from the profile.
- Passing that list to the JVM at startup so those methods are compiled into the hot segment.
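As a sketch of the first two steps, Java Flight Recorder can capture execution samples from which the hottest methods can be ranked; the file and jar names are placeholders:

```
# Record a 60-second profile under a representative workload.
java -XX:StartFlightRecording=duration=60s,filename=profile.jfr -jar app.jar

# Print execution samples; the most frequently sampled methods are the hot ones.
jfr print --events jdk.ExecutionSample profile.jfr
```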
Split nmethods into frequently and infrequently accessed parts and allocate them separately as described above. The newly added hot code segment can be placed between the non-nmethod and non-profiled nmethod segments to keep hot code close to both stubs and optimized code:
The hot segment has a side effect: it moves some non-profiled nmethods that were originally adjacent into different segments. In some cases, methods close in memory are also invoked consecutively. This relocation causes such consecutively invoked methods to land on different memory pages instead of sharing the same one, which increases the burden on the iTLB. As discussed earlier, enabling Transparent Huge Pages (THP) for the code cache alleviates this issue by reducing the number of page-table entries required. Therefore, enable this feature when using the hot nmethod segment.
Neoverse cores provide hardware registers to control CPU cache behavior. On the Neoverse N2, the IMP_CPUECTLR_EL1 register has several fields that can affect L2 cache usage for instruction fetch.
Setting CMC_MIN_WAYS = 0 and L2_INST_PART = 2 produced significant improvements in throughput and latency during the code-inflation experiments.
Performance testing and tuning of large code footprint Java applications on Arm Neoverse CPUs shows that large code cache sizes significantly affect CPU front-end efficiency. Code cache inflation experiments revealed notable performance degradation from increased pressure on front-end resources such as the CPU caches, TLBs, and branch-prediction units.
To address these bottlenecks, several software optimizations were proposed, including:

- Moving nmethod data out of the code cache to increase code density.
- Enabling Transparent Huge Pages (THP) for the code cache.
- Adding a hot-method segment to the code cache.
- Tuning CPU cache-control registers such as IMP_CPUECTLR_EL1.
These approaches can improve the performance of large code footprint Java applications. Some of them showed throughput and latency improvements in the DaCapo benchmark with 10x code inflation. Together, these optimizations and configurations can significantly reduce the front-end bottlenecks caused by large code caches, improving the execution efficiency of Java workloads on Neoverse CPUs.