Our work on memory translation optimizations has excited me for a while, and now that it is being presented at ASPLOS 22, I can finally share my enthusiasm through this blog post. One reason I have been looking forward to sharing is that the work involves some incredibly elegant ways to improve memory translation without breaking any pre-existing hardware or software. The other, more personal reason is that it showcases the amazing close collaboration between Andreas Sandberg and me at Arm Research, and Chang Hyun Park and David Black-Schaffer from Uppsala University.
The research problem at hand was how to improve virtual-to-physical translation performance. Those of you familiar with this area know that this topic is not new; many academic studies have explored it and even proposed designs that could lead to performance improvements. However, when looking to implement such proposals in real systems, we found that many designs were impractical. This is not because the solutions were bad, or because they had a fundamental flaw, but because they lacked an intangible quality that can be the difference between a nice idea and a successful integration into a product. What these designs had in common was that, to solve the problem, they all offered in one way or another an overhaul of the existing translation mechanism. This could mean that old systems would not be able to run software written for the new designs. Alternatively, they might break certain unwritten assumptions that permeate the design of both the hardware and, more importantly, the software that we use every day.
Figure 1: A typical translation table walk
Before getting into the details of what we did, a brief explanation or refresher of address translation and translation table walks would be useful. Modern operating systems rely on virtual addressing to isolate processes and hide the complexities of the underlying hardware. This design is preferable for various reasons, but mainly because it allows applications to be isolated as they operate in different virtual address spaces. It also helps the operating system transparently manage physical memory and avoid issues such as fragmentation.
To make this work, an intermediate step is necessary: converting a Virtual Address (VA) into a Physical Address (PA) that the hardware can access. This step is called translation, and the process of converting the VA into a PA is called a translation table walk. A simple approach would be to keep all the translations in one big table with one entry per virtual page. The size of such a data structure would be proportional to the size of the complete virtual address space. Since most processes only use a fraction of the virtual address space available to them, such a table would be prohibitively expensive and impractical. Instead, most processors split the table into smaller tables arranged in a hierarchical structure called a radix tree. This makes it possible to grow the data structure as needed, making its size proportional to the occupied virtual address space. These smaller tables also make the system more flexible: each is itself the size of a page, and so can be allocated using the same mechanism as pages. On the other hand, as we can see in figure 1, the walk requires more steps to complete the translation. Current processors need to traverse 4 or 5 tables to get to the data, which can be very costly when a system experiences many accesses that need a virtual-to-physical translation. In short, an application that makes a single memory request must complete an additional four accesses just to find the address of the desired data.
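To make the walk concrete, here is a minimal sketch in C. It is my own illustration, assuming a 48-bit VA, 4 KiB pages, and 8-byte entries rather than any particular architecture's table format; `read_phys` is a hypothetical helper that reads one table entry from physical memory.

```c
#include <stdint.h>

typedef uint64_t pte_t;

/* Hypothetical helper: read one 8-byte table entry from physical memory. */
extern pte_t read_phys(uint64_t pa);

/* Walk a 4-level radix tree: each level consumes 9 bits of the VA and
 * selects one of 512 entries in a 4 KiB table. Four memory accesses
 * happen before we even know the physical address of the data. */
uint64_t walk(uint64_t root_table_pa, uint64_t va)
{
    uint64_t table_pa = root_table_pa;

    for (int level = 3; level >= 0; level--) {
        unsigned idx = (va >> (12 + 9 * level)) & 0x1ff;
        pte_t entry = read_phys(table_pa + idx * sizeof(pte_t));
        table_pa = entry & ~0xfffULL;   /* PA of the next table (or page) */
    }

    return table_pa | (va & 0xfff);     /* splice in the in-page offset */
}
```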
It is obvious that if every memory access needed an additional 4-5 accesses, the whole concept of virtual addressing would be impractical, despite all its other benefits. For this reason, processors keep some data from the translation tables in the caches. They also cache a small number of virtual-to-physical translations in a small buffer, the translation lookaside buffer (TLB). Additionally, some operating systems find regions of contiguous data and merge the pages into larger ones that span the entire range; this feature is known as Transparent Huge Pages (THP) in Linux. Huge pages can improve performance: the larger range of a huge page lets the translation table walk terminate a level early, reducing the number of accesses needed to determine the physical address of the page.
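Continuing the sketch above, a translation would first consult the TLB and only pay for the walk on a miss; `tlb_lookup` and `tlb_insert` are again hypothetical helpers, and a huge page would simply let the walk stop one level early.

```c
#include <stdint.h>

uint64_t walk(uint64_t root_table_pa, uint64_t va);  /* previous sketch */

/* Hypothetical TLB helpers: map a virtual page number to a physical one. */
extern int  tlb_lookup(uint64_t vpn, uint64_t *ppn);
extern void tlb_insert(uint64_t vpn, uint64_t ppn);

uint64_t translate(uint64_t root_table_pa, uint64_t va)
{
    uint64_t ppn;

    if (tlb_lookup(va >> 12, &ppn))          /* hit: zero table accesses */
        return (ppn << 12) | (va & 0xfff);

    uint64_t pa = walk(root_table_pa, va);   /* miss: full 4-access walk */
    tlb_insert(va >> 12, pa >> 12);
    return pa;
}
```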
Despite these optimizations, translation has become more of a problem in modern processors. This is because new features such as security extensions require additional metadata, which adds memory accesses and potentially walks. Furthermore, virtualization, where multiple operating systems run in virtual machines on the same host, introduces additional memory accesses. Each translation table access in the virtual machine needs a translation itself in the host, which creates a multiplicative effect. In the virtualized case, an application that makes a single memory request can end up requiring an additional 24 memory accesses to translate a virtual address to a physical address before it can fetch the desired data. This explosion of memory translation cost has been the motivation for our work with our academic partners at Uppsala University.
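To see where that number comes from, assume four-level tables at both stages. Each of the four guest table accesses is a guest physical address, so reading it first costs a four-access host walk: 4 × (4 + 1) = 20 accesses. The final guest physical address of the data then needs one more host walk, another 4 accesses, for a total of 24 before the single access the application actually asked for.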
The core ideas of our publication are simple:
Figure 2: Conventional and flattened translation tables
First, we merge translation tables to create huge tables, similar to transparent huge pages. This fits in nicely with current designs: just as a regular translation table fits precisely into one page, a merged table fits precisely into one huge page. Each merge removes one step from the walk, so the walk is shortened by the number of merges in the translation table structure. That adds up to a massive 64% reduction in memory accesses: very neat.
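In indexing terms, a merge simply means consuming two 9-bit fields of the VA at once. A minimal sketch, again my own illustration rather than code from the paper:

```c
#include <stdint.h>

/* Conventional 4 KiB table: 9 VA bits select one of 512 entries.
 * The bits for `level` start at VA bit 12 + 9 * level. */
static inline unsigned index_4k(uint64_t va, unsigned level)
{
    return (va >> (12 + 9 * level)) & 0x1ff;
}

/* Merged 2 MiB table: two adjacent 9-bit fields form one 18-bit index
 * into 512 * 512 = 262,144 entries. One access now does the work of
 * two. `low_level` is the lower of the two merged levels. */
static inline unsigned index_2m(uint64_t va, unsigned low_level)
{
    return (va >> (12 + 9 * low_level)) & 0x3ffff;
}
```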
The second core idea of our paper is to dynamically increase the amount of space available to translation table fragments in the data caches. This increases the probability of finding translation table fragments in the caches and reduces the number of times a walk access needs to go to memory. As caches retrieve data at a much lower latency than memory, this can further cut down the cost of a walk. The dynamic scheme is based on the observation that applications that mostly miss in the data cache would not suffer a performance penalty if their share of the cache were reduced. Such applications typically miss in the cache because their data set is too large to fit. The translation table, on the other hand, is orders of magnitude smaller and far more likely to fit. By dynamically partitioning the cache, we ensure that applications that cannot fit their data set store their translation tables in the cache instead, which leads to a performance improvement.
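As a flavour of what such a heuristic might look like, here is a minimal sketch in C. It is purely illustrative: the interface and the thresholds are invented, not the mechanism from the paper.

```c
#include <stdint.h>

/* Decide how many cache ways to set aside for translation-table data,
 * based on how the application's own data behaves in the cache.
 * Illustrative sketch only: the thresholds are made up. */
unsigned ways_for_tables(uint64_t data_misses, uint64_t data_accesses)
{
    if (data_accesses == 0)
        return 0;

    double miss_ratio = (double)data_misses / (double)data_accesses;

    if (miss_ratio > 0.75)
        return 4;   /* data set clearly does not fit: favour the tables */
    if (miss_ratio > 0.40)
        return 2;   /* borderline: give the tables a small share */
    return 0;       /* data is cache-friendly: no dedicated table ways */
}
```

All of this is great, but how does it all tie back to the original point of not overhauling the Virtual Memory System Architecture and not breaking software? Let us take a closer look at three interesting corner cases: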
“Oh no! I ran out of space. These large tables waste so much space.”
Fear not, for this is a best-effort mechanism. If a system does run out of space, we can take a page from THP (no pun intended... okay, maybe intended) and split the large translation table back into the original smaller tables. After the split, the translation table walker reverts to the old mechanism and performs a traditional four or five step walk to determine the PA. This avoids a frequent problem with many academic designs that we wanted to steer clear of: many other proposals can make translation faster, but require a much larger piece of memory to be free and contiguous to do so. That requirement is close to insurmountable, as a lot of software relies on the implicit assumption that as long as there are free pages available, the system can always service allocation requests.
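A sketch of what that split might look like follows; the helper names are hypothetical, and a real kernel would also handle locking and TLB invalidation.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t pte_t;

/* Hypothetical helpers standing in for the kernel's allocators and
 * entry encoders. */
extern pte_t *alloc_table(void);                   /* one 4 KiB table */
extern void   free_merged_table(pte_t *t);         /* one 2 MiB table */
extern pte_t  make_table_entry(pte_t *t, int large);

/* Replace one merged table (512 x 512 entries) with a conventional
 * table whose 512 entries point at 512 regular sub-tables. */
void split_merged_table(pte_t *parent_entry, pte_t *merged)
{
    pte_t *mid = alloc_table();

    for (int i = 0; i < 512; i++) {
        pte_t *sub = alloc_table();
        memcpy(sub, &merged[i * 512], 512 * sizeof(pte_t));
        mid[i] = make_table_entry(sub, /*large=*/0);
    }

    *parent_entry = make_table_entry(mid, /*large=*/0);
    free_merged_table(merged);
}
```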
“How can I ensure that my operating system continues to support old hardware while using the merged translation tables feature on newer hardware?”
Operating systems implementing support for merged tables already need to handle the case where they cannot allocate a huge page when allocating a new table. In such cases, they transparently fall back to normal pages and a traditional translation table structure. Our Linux kernel prototype implements backwards compatibility by always using this fallback path if the hardware does not support merged tables.
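The shape of that fallback path is easy to sketch (hypothetical helper names, not the actual prototype code):

```c
/* Allocate a translation table, preferring a merged (huge) table when
 * both the hardware and the allocator cooperate. Hypothetical helpers. */
extern int   hw_supports_merged_tables(void);
extern void *alloc_contiguous_2m(void);  /* may fail under fragmentation */
extern void *alloc_page_4k(void);        /* the traditional table size   */

void *alloc_translation_table(int *is_merged)
{
    if (hw_supports_merged_tables()) {
        void *t = alloc_contiguous_2m();
        if (t) {
            *is_merged = 1;  /* parent entry gets "next table large" set */
            return t;
        }
    }

    *is_merged = 0;          /* old hardware, or no 2 MiB block free */
    return alloc_page_4k();
}
```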
"My operating system uses recursive mappings. Will merged translation tables play well with that?"
This one is slightly more complicated, but the short answer is yes. Operating systems that use recursive mappings map the translation tables themselves into the VA space using a recursive (self-referencing) entry in the root table. This matters because mainstream operating systems such as Windows use this mechanism to manage translation tables. A naïve implementation of merged tables has two problems: first, it would break down when the root is a large (merged) table; second, large tables would not be recognized as large pages.
The first problem arises because following a recursive mapping in a large table advances the walk by two levels at a time. Consequently, you end up in a system-breaking situation where some of the tables are inaccessible. Luckily, this is easily avoided with a little trick. We can reserve a range of 512 entries in the large table (corresponding to one regular page of the huge table) that point to each of the sub-translation tables. Each of these entries has the “next table large” bit disabled. This way, we can reach all the next-level tables, including the root table’s own sub-tables.
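Concretely, the trick might look like the sketch below: entry j of the reserved range maps the j-th 4 KiB slice of the merged table itself, with the “next table large” bit clear, so following it advances the walk by a single level. The entry encoding and helper names are invented for illustration.

```c
#include <stdint.h>

typedef uint64_t pte_t;

extern uint64_t table_pa(pte_t *t);               /* PA of a table */
extern pte_t    make_table_entry_pa(uint64_t pa, int large);

/* Fill the reserved 512-entry range (starting at index slot * 512) of
 * a merged root table so that entry j points at the j-th 4 KiB
 * sub-table of the merged table itself, with "next table large" clear. */
void install_recursive_range(pte_t *merged, unsigned slot)
{
    uint64_t base = table_pa(merged);

    for (unsigned j = 0; j < 512; j++)
        merged[slot * 512 + j] =
            make_table_entry_pa(base + j * 4096, /*large=*/0);
}
```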
The second problem can be solved by treating a huge table entry at L2 as an alternative representation of a huge page. Overall, this means that the translation table walker returns a huge page whenever it encounters an entry with the “next table large” bit set at L2.
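In walker terms, this is a small special case; a sketch with hypothetical decoding helpers:

```c
#include <stdint.h>

typedef uint64_t pte_t;

enum walk_action { PAGE, HUGE_PAGE, NEXT_TABLE, MERGED_TABLE };

/* Hypothetical decoding helpers. */
extern int has_next_table_large_bit(pte_t e);
extern int is_table_pointer(pte_t e);

/* At L2, a "next table large" entry is returned as a huge page (the
 * alternative representation) instead of being followed as a pointer
 * to a merged table. */
enum walk_action classify_entry(pte_t entry, int level)
{
    if (has_next_table_large_bit(entry))
        return level == 2 ? HUGE_PAGE : MERGED_TABLE;
    return is_table_pointer(entry) ? NEXT_TABLE : PAGE;
}
```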
There is so much nuance in how this small change has such a significant impact that it probably needs a few more blog posts. If you want all the finer details, you can enjoy the actual paper at the following link. As a final note, I would like to thank Andreas, David, and Chang Hyun, as well as our colleagues at Arm. Through a number of amazing brainstorming sessions, we were able to overcome all the hurdles along the way. I am proud to see our work positively received and adopted, and I am very excited to do more research together in the near future.
Read the Paper