Java is a popular programming language in the internet domain. Java applications have some characteristics that make their performance behavior distinct from native applications. Because Java bytecode cannot be executed directly on the CPU, Java applications run on a runtime, the Java Virtual Machine (JVM). The JVM first converts bytecode into machine code, either through an interpreter or via the Just-In-Time (JIT) compiler. The machine code generated at runtime plays a critical role in the efficiency and performance of Java applications.
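JIT activity is easy to observe with a standard HotSpot diagnostic flag; the application jar name below is only a placeholder:

```
# Print each method as the JIT compiles it, including its compilation tier.
java -XX:+PrintCompilation -jar app.jar
```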
In some areas of the internet, such as e-commerce, programs must handle diverse user inputs and offer rich functionality. For example, e-commerce applications typically integrate features such as search, recommendations, shopping carts, payments, and order management.
Each feature requires a large amount of runtime code, data, and third-party libraries. As a result, Java-based e-commerce applications can be compiled at runtime into a large amount of machine code, which is stored in the so-called code cache and executed frequently.
In the HotSpot JVM, the code cache is a heap-like structure in a contiguous memory region. The code cache is divided into multiple segments by code type, and users can configure the size of each segment based on application requirements. This design reduces memory fragmentation across segments. These segments include:

- The non-nmethod segment, which holds VM-internal code such as the bytecode interpreter and compiler buffers.
- The profiled nmethod segment, which holds lightly optimized, profiled methods compiled by the C1 compiler.
- The non-profiled nmethod segment, which holds fully optimized methods compiled by the C2 compiler.
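For illustration, the segment sizes can be set with standard HotSpot flags; the sizes below are arbitrary examples rather than recommendations (the three segment sizes must add up to the reserved code cache size), and the jar name is a placeholder:

```
java -XX:+SegmentedCodeCache \
     -XX:ReservedCodeCacheSize=240m \
     -XX:NonNMethodCodeHeapSize=8m \
     -XX:ProfiledCodeHeapSize=116m \
     -XX:NonProfiledCodeHeapSize=116m \
     -jar app.jar
```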
The C2 compiler stores native code in the non-profiled nmethod segment. This segment holds frequently executed hot code, as well as code that was executed multiple times during startup but is rarely called afterward.
Modern CPUs are deeply pipelined and include multiple execution units. The Arm Neoverse CPU front-end fetches instructions from memory and decodes them into low-level hardware operations, known as micro-operations. The backend dispatches these operations and executes them out of order. A large code footprint can hurt front-end performance, causing instruction fetch delays, iTLB refills, instruction pipeline drains, and branch target buffer (BTB) entry refills.
We ran a 10x code-inflation experiment to simulate a large code cache. By increasing the memory allocated for each nmethod, we created a large block of used code cache. We used the DaCapo Java benchmark to measure the performance impact. When this experiment was run on the Neoverse N2 platform, we saw degradation in both throughput (~4-6%) and tail latency (~1-3%). The figure below shows PMU statistics collected using the DaCapo Spring test case, with the non-profiled nmethod size inflated by 10x.
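As a sketch of how such statistics can be gathered, Linux perf can count the relevant Arm PMU events; event availability depends on the platform, and the benchmark jar name is illustrative:

```
# Front-end stalls, L1 I-cache refills, iTLB walks, and branch mispredictions
# while running the DaCapo 'spring' workload.
perf stat -e stall_frontend,l1i_cache_refill,itlb_walk,br_mis_pred_retired \
    java -jar dacapo.jar spring
```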
This experiment cannot fully simulate a large-code Java application, as it only disperses the compiler-generated code across a wider range of memory addresses. Therefore, the performance data does not entirely reflect real-world scenarios. However, the PMU statistics still reveal the impact of code size on front-end performance.
This experiment does not change the executed instructions or the data being used; it only expands the spatial distribution of the code. As a result, front-end CPU resources become a bottleneck.
This performance bottleneck is closely related to the size of front-end resources: cache size, BTB size, and iTLB size. Different Neoverse CPUs have different resource sizes. Regardless of the resource size, we can reduce the impact of this bottleneck with software or configuration optimizations.
Move data out of code cache
In the code cache, each compiled method contains both code and data. The data includes the method header, relocation data, oops, JVMCI data, deoptimization data, and scope metadata. By removing as much data as possible from the code cache, we can effectively reduce its footprint. This optimization increases code density, allowing better use of CPU L1/L2 cache, iTLB, and BTB resources when invoking these functions.
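The footprint of this code and data can be observed on a running JVM with the standard jcmd tool, where `<pid>` stands for the target JVM process id:

```
# Per-segment code cache sizes and usage.
jcmd <pid> Compiler.codecache

# List compiled methods with their addresses and sizes.
jcmd <pid> Compiler.codelist
```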
We are attempting to backport several patches to OpenJDK 21 to measure their performance and PMU impact in code-inflation experiments. These patches reduce the size of the nmethod header and move most of the immutable and mutable data out of the code cache.
In our experiment, the size of non-profiled nmethods decreased by 39%, from 229MB to 149MB. In the DaCapo benchmark results, throughput and tail latency improved as front-end performance metrics were optimized. The PMU data shows that cache refill, iTLB refill, and branch-miss MPKI all decreased.
The reason is that the increased spatial locality of the code improves the efficiency of front-end resource use, which accelerates instruction fetch and decode.
In large code footprint Java applications, the wide address range of the executed code means the CPU needs more MMU and TLB resources to store the virtual-to-physical address mappings, which increases iTLB refills. Applying Transparent Huge Pages (THP) to the code cache region increases the page size, reducing the total number of page-table entries needed and thereby reducing iTLB pressure.
In OpenJDK, enabling the -XX:+UseTransparentHugePages option applies 2MB huge pages to the code cache heap when the Linux OS allows it. With this configuration, we see improvements in performance and in the iTLB refill PMU metric.
-XX:+UseTransparentHugePages
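A minimal configuration sketch, assuming a Linux system: -XX:+UseTransparentHugePages requests huge pages via madvise, so THP must be in madvise or always mode; the jar name is a placeholder:

```
# THP must report 'madvise' or 'always' for the JVM's request to take effect.
cat /sys/kernel/mm/transparent_hugepage/enabled

java -XX:+UseTransparentHugePages -jar app.jar
```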
The total size of hot code in a stable workload is usually small. Because of tiered compilation, hot code typically lives in the non-profiled nmethod segment: Tier 4 (T4) methods are JIT-compiled by the C2 compiler after intensive use, in the order in which their heavy use is detected. As a result, hot and cold code are often interleaved.
To improve CPU front-end performance, a hot-method segment in the code cache can enhance spatial locality for frequently executed code. Clustering hot methods together improves instruction fetch and decode efficiency.
To identify which methods should be placed in this hot region, profiling data must be collected. One approach is to use Java Flight Recorder (JFR) to dynamically adjust code placement at runtime. However, this approach is complex and adds performance overhead from method relocation.
Alternatively, hot methods can be predefined ahead of time. The steps involve:

- Profiling the application offline under a representative workload.
- Extracting a list of the most frequently executed methods from the profile.
- Passing that list to the JVM at startup so those methods are compiled into the hot segment.
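As a sketch of the first two steps, Java Flight Recorder can capture execution samples from which the hottest methods can be ranked; the file and jar names are placeholders:

```
# Record a 60-second profile under a representative workload.
java -XX:StartFlightRecording=duration=60s,filename=profile.jfr -jar app.jar

# Print execution samples; the most frequently sampled methods are the hot ones.
jfr print --events jdk.ExecutionSample profile.jfr
```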
Split nmethods into frequently and infrequently accessed parts and allocate them separately as described above. The newly added hot code segment can be placed between the non-nmethod and non-profiled nmethod segments to keep hot code close to both stubs and optimized code:
The hot segment has a side effect: it moves some non-profiled nmethods that were originally adjacent into different segments. In some cases, methods close in memory are also invoked consecutively. This relocation causes such consecutively invoked methods to land on different memory pages instead of sharing the same one, which increases the burden on the iTLB. As discussed earlier, enabling Transparent Huge Pages (THP) for the code cache alleviates this issue by reducing the number of page-table entries required. Therefore, enable this feature when using the hot nmethod segment.
Neoverse cores provide hardware registers to control CPU cache behavior. On the Neoverse N2, the IMP_CPUECTLR_EL1 register has several fields that can affect L2 cache usage for instruction fetch.
Setting CMC_MIN_WAYS = 0 and L2_INST_PART = 2 produced significant improvements in throughput and latency during the code-inflation experiments.
Performance testing and tuning of large code footprint Java applications on Arm Neoverse CPUs shows that large code cache sizes significantly affect CPU front-end efficiency. Code cache inflation experiments revealed notable performance degradation from increased pressure on front-end resources such as the CPU caches, TLBs, and branch-prediction units.
To address these bottlenecks, several software optimizations were proposed, including:

- Moving nmethod data out of the code cache to increase code density.
- Enabling Transparent Huge Pages (THP) for the code cache.
- Adding a hot-method segment to the code cache.
- Tuning CPU cache-control registers such as IMP_CPUECTLR_EL1.
These approaches can improve the performance of large code footprint Java applications. Some of them showed throughput and latency improvements in the DaCapo benchmark with 10x code inflation. Together, these optimizations and configurations can significantly reduce the front-end bottlenecks caused by large code caches, improving the execution efficiency of Java workloads on Neoverse CPUs.