“This content was initially posted 27 September 2013 on blogs.arm.com”
Contemporary ARM® architectures (ARMv7 and the upcoming ARMv8) offer advanced CPU features such as an MMU, multi-level caches, TLBs, multiple cores and hardware coherency, which enable modern operating systems to manage resources efficiently.
One interesting performance optimization is the concept of superpages (also known as huge pages), which allow for efficient use of TLB translations so that they cover large physical regions.
This article describes the recent development of superpage support for the FreeBSD operating system on the ARM architecture.
Superpages are an optimization technique that reduces the cost of address translation between the CPU and main memory by enabling each entry in the Translation Look-aside Buffer (TLB) to map a large physical memory region into a virtual address space.
A processor core uses virtual addresses to access memory. For resident pages, a virtual address refers to a particular physical frame in main memory, so the CPU requires a valid virtual-to-physical translation for the access to succeed. To speed up this procedure, the CPU's Memory Management Unit (MMU) maintains a table of recently used translations called the Translation Look-aside Buffer (TLB). Access is immediate for pages that have been used recently and still have a valid translation stored in the TLB. Otherwise the missing translation has to be found in the page tables or, if that fails, a hardware exception has to be handled (which can be time consuming). More details about the ARM architecture and memory management are available from the ARM infocenter.
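The lookup path described above can be illustrated with a toy model. This is only a sketch: the `TinyTLB` class, its least-recently-used replacement policy and the dictionary standing in for the page tables are assumptions made for illustration, not a description of any particular ARM core.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4 KB base page, as used in the article


class TinyTLB:
    """Toy fully associative TLB with LRU replacement (hypothetical model)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # virtual page number -> physical frame number
        self.entries = OrderedDict()  # cached translations, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            if vpn not in self.page_table:
                # No valid translation: real hardware would raise an exception.
                raise LookupError(f"page fault at {vaddr:#x}")
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


page_table = {0: 7, 1: 3}
tlb = TinyTLB(capacity=1, page_table=page_table)
first = tlb.translate(0x0010)   # miss: the page table has to be consulted
second = tlb.translate(0x0020)  # hit: same page, translation is cached
third = tlb.translate(0x1004)   # miss: different page, evicts the cached entry
```

With a single-entry TLB, two accesses to the same page cost one miss, while touching a second page forces a new table walk. That is the cost superpages aim to reduce.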
In practice the TLB is limited to several dozen entries, and a single TLB entry usually covers the smallest available page size so that fine page granularity can be maintained. Smaller pages, however, mean more pages for the operating system to manage and more translation overhead. A superpage replaces 256 normal pages with a single TLB entry and thus results in fewer TLB misses.
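The effect on the amount of memory a TLB can cover (often called TLB reach) is simple arithmetic. The sketch below uses the 4 KB base page and 1 MB superpage sizes given later in the article; the 32-entry TLB is a hypothetical figure chosen only to match "several dozen entries".

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of memory addressable without a TLB miss when every entry is in use."""
    return entries * page_size


ENTRIES = 32  # assumed TLB size, for illustration only

base_reach = tlb_reach(ENTRIES, 4 * KB)   # every entry maps a 4 KB page
super_reach = tlb_reach(ENTRIES, 1 * MB)  # every entry maps a 1 MB superpage

print(f"4 KB pages: {base_reach // KB} KB covered")
print(f"1 MB pages: {super_reach // MB} MB covered")
```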
The FreeBSD operating system has a generic framework for transparent superpages, which operates autonomously without explicit user request. The Virtual Memory (VM) subsystem manages all paging operations, including superpages. The creation of large mappings depends on reservation-based memory allocation. When a virtual object is created, a contiguous chunk of physical memory is reserved for that object. Later, the base pages within the object are faulted in, populating the reservation map in the process. When the reservation is fully populated, the memory area becomes a candidate for promotion to a superpage. Otherwise the reservation will probably be dropped or preempted for other objects to use. The mechanism adapts to current system needs, as only active pages participate in the promotion.
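The reservation idea can be sketched as a small population map. The `Reservation` class and its method names below are hypothetical and are not taken from the FreeBSD sources; the sketch only shows how a fully populated map marks a promotion candidate.

```python
class Reservation:
    """Toy population map for one reserved physical chunk (hypothetical model)."""

    def __init__(self, npages=256):
        self.npages = npages
        self.populated = [False] * npages  # one flag per base page

    def fault_in(self, index):
        """Record that the base page at `index` has been faulted in."""
        self.populated[index] = True

    def population(self):
        return sum(self.populated)

    def is_promotion_candidate(self):
        """Only a fully populated reservation may be promoted."""
        return all(self.populated)


resv = Reservation()
for i in range(255):
    resv.fault_in(i)
almost_full = resv.is_promotion_candidate()  # one base page is still missing

resv.fault_in(255)
full = resv.is_promotion_candidate()         # every base page is now resident
```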
Using variable page sizes can bring problems related to the page management policy. A superpage is described by a single page table entry and therefore has only one "dirty" bit and one "referenced" bit. If one base page within a superpage were modified, the whole large page would need to be written to the backing storage, as there is no way to determine which base page really needs to be synchronized. To avoid this and other bottlenecks, the page management algorithm requires a set of rules to be satisfied before promotion occurs:
- The area covered by a superpage needs to be contiguous in both physical and virtual address space.
- All the base pages within a superpage have to have the same access attributes and state.
- A superpage is created with write permission only if all base pages have already been modified; otherwise it is created read-only and will be demoted on any write access.
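The three rules above can be expressed as a small check. This is a sketch rather than the FreeBSD pmap code: `BasePage`, `promotion_decision` and `make_run` are hypothetical names, the sizes are the 4 KB and 1 MB values used in this work, and the 1 MB alignment test is an added assumption based on ARM section mappings having to be section-aligned.

```python
from dataclasses import dataclass

PAGE_SIZE = 4 * 1024
SUPERPAGE_SIZE = 1024 * 1024
PAGES_PER_SUPERPAGE = SUPERPAGE_SIZE // PAGE_SIZE  # 256


@dataclass(frozen=True)
class BasePage:
    vaddr: int   # virtual address of the base page
    paddr: int   # physical address of the backing frame
    prot: str    # access attributes, e.g. "r" or "rw"
    dirty: bool  # True once the page has been modified


def promotion_decision(pages):
    """Return None if promotion is not allowed, else 'read-write' or 'read-only'."""
    if len(pages) != PAGES_PER_SUPERPAGE:
        return None
    first = pages[0]
    # Assumption: a 1 MB mapping must start on a 1 MB boundary.
    if first.vaddr % SUPERPAGE_SIZE or first.paddr % SUPERPAGE_SIZE:
        return None
    for i, page in enumerate(pages):
        # Rule 1: contiguous in virtual and physical space.
        if page.vaddr != first.vaddr + i * PAGE_SIZE:
            return None
        if page.paddr != first.paddr + i * PAGE_SIZE:
            return None
        # Rule 2: identical access attributes.
        if page.prot != first.prot:
            return None
    # Rule 3: writable only if every base page was already modified.
    if "w" in first.prot and all(page.dirty for page in pages):
        return "read-write"
    return "read-only"


def make_run(dirty_count):
    """Build 256 contiguous, aligned pages; the first `dirty_count` are dirty."""
    return [
        BasePage(0x40000000 + i * PAGE_SIZE, 0x08300000 + i * PAGE_SIZE,
                 "rw", i < dirty_count)
        for i in range(PAGES_PER_SUPERPAGE)
    ]
```

For example, `promotion_decision(make_run(256))` yields a writable superpage, while a run with even one clean page is promoted read-only so that the first write can trigger demotion.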
The ARM architecture-dependent portion of the superpage support has recently been developed by Semihalf in cooperation with The FreeBSD Foundation. The work covered all superpage management mechanisms, including superpage promotion, demotion, creation, removal, shared mappings management and all OS-specific aspects of this feature. One superpage size is supported in the current implementation: the 1 MB section mapping was selected to serve as a superpage, while 4 KB remains the base page. This gives a 256-entry population map per large mapping for the VM system to manage.
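How a 32-bit virtual address splits under the two mapping sizes follows directly from 4 KB being 2^12 bytes and 1 MB being 2^20 bytes. The helper names below are hypothetical; the sketch only illustrates the arithmetic behind the 256-entry population map.

```python
PAGE_SHIFT = 12     # 4 KB base page  -> 12 offset bits
SECTION_SHIFT = 20  # 1 MB section    -> 20 offset bits


def split_small_page(vaddr):
    """Return (page number, offset within the 4 KB page)."""
    return vaddr >> PAGE_SHIFT, vaddr & ((1 << PAGE_SHIFT) - 1)


def split_section(vaddr):
    """Return (section number, offset within the 1 MB section)."""
    return vaddr >> SECTION_SHIFT, vaddr & ((1 << SECTION_SHIFT) - 1)


def population_index(vaddr):
    """Index of the 4 KB base page inside its 1 MB section (0..255)."""
    return (vaddr >> PAGE_SHIFT) & ((1 << (SECTION_SHIFT - PAGE_SHIFT)) - 1)


vaddr = 0x12345678
page_no, page_off = split_small_page(vaddr)
section_no, section_off = split_section(vaddr)
index = population_index(vaddr)
```

The eight bits between the two shifts select one of the 256 base pages inside a section, which is where the population map size comes from.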
The functionality has been extensively tested using various benchmarks and techniques. It is important to note that the performance improvement depends on application behavior and usage patterns. Processes allocating large contiguous areas of memory will benefit more from superpages than processes allocating small, independent pages. The results presented below were measured on the ARMADA XP platform.
The most significant results can be observed using the Giga Updates Per Second (GUPS) benchmark. GUPS measures how frequently the system can issue updates to randomly generated memory locations. In particular, it measures both the memory latency and the bandwidth capabilities of the system. On multicore ARMv7 platforms, the measured CPU time usage and real time duration dropped by 34%. The number of updates performed in the same amount of time increased by 52%.
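The access pattern GUPS stresses can be sketched in a few lines. This is a simplified illustration and not the official benchmark: the `gups_like` function, its XOR update and the use of Python's `random` module are assumptions, and interpreter overhead makes the reported rate meaningless as a hardware measurement.

```python
import random
import time


def gups_like(table_bits, num_updates, seed=1):
    """Apply XOR updates at pseudo-random indices of a power-of-two sized table."""
    size = 1 << table_bits
    table = list(range(size))
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    start = time.perf_counter()
    for _ in range(num_updates):
        value = rng.getrandbits(64)
        table[value & (size - 1)] ^= value  # random location, read-modify-write
    elapsed = time.perf_counter() - start
    rate = num_updates / elapsed if elapsed > 0 else float("inf")
    return table, rate


table, updates_per_second = gups_like(table_bits=16, num_updates=100_000)
print(f"{updates_per_second:.0f} updates per second")
```

Because consecutive updates land on unrelated addresses, almost every access touches a different page, so the workload is dominated by TLB misses when only small pages are used.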
LMbench is a popular suite of system performance benchmarks. It includes memory testing programs and can be used, for example, to examine memory latency and bandwidth.
Using superpages, the measured memory latency dropped by 37.85%. The memory bandwidth improvement varied depending on the type of operation and ranged from 2.26% for mmap reread to 8.44% for memory write. It is worth noting that LMbench uses the STREAM benchmark to measure memory bandwidth, and STREAM uses floating point arithmetic to perform the memory operations. FreeBSD does not yet support the FPU on ARM, so the results were affected by this.
Self-hosted world build
Using superpages also reduced the duration of the self-hosted "world" build (a build of the whole user space environment) when using GCC. The time needed to build the whole set of user applications that make up the root file system decreased by 1 hour 22 minutes on the test platform (20% shorter) when superpages were enabled.
This new functionality has been tested on a mix of ARMv6 and ARMv7 platforms, in uniprocessor (UP) and symmetric multiprocessor (SMP) environments. A memory performance improvement could be observed in all cases. However, these new features do not cover all of the hardware and OS capabilities, and there is room for improvement. Adding support for an additional 64 KB page size would further increase the number of superpages created, enabling a smoother and more efficient promotion path from the 4 KB small page to the 1 MB section. In addition, a larger number of processes would be able to take advantage of superpages if the required population map were smaller.
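The benefit of an intermediate size can be seen from the population counts alone. The sketch below only does the arithmetic for the three sizes named in the text; the `pages_needed` helper is a hypothetical name introduced for illustration.

```python
KB = 1024
SMALL_PAGE = 4 * KB   # current base page
LARGE_PAGE = 64 * KB  # proposed intermediate size
SECTION = 1024 * KB   # current superpage (1 MB)


def pages_needed(from_size, to_size):
    """How many fully populated smaller pages one promotion step requires."""
    return to_size // from_size


direct = pages_needed(SMALL_PAGE, SECTION)         # 4 KB -> 1 MB in one step
first_step = pages_needed(SMALL_PAGE, LARGE_PAGE)  # 4 KB -> 64 KB
second_step = pages_needed(LARGE_PAGE, SECTION)    # 64 KB -> 1 MB

print(direct, first_step, second_step)
```

A process that touches only 64 KB of contiguous memory could already benefit after populating 16 base pages, instead of needing all 256 for a 1 MB promotion.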
The support has been integrated into FreeBSD 10.0-CURRENT, is available in the FreeBSD main SVN repository, and will be included in the upcoming FreeBSD 10.0 release.
Rafal Jaworowski is the CTO and co-founder of Semihalf, where he leads the engineering team. During 14 years of embedded systems experience he has developed software running on various ARM systems, including high-performance system-on-chip devices built around the ARMv5, v6 and v7 architecture definitions. He contributes to the FreeBSD project as a kernel source committer.