Our latest world-class embedded graphics processor, the ARM® Mali™-T604 GPU, has excellent memory bandwidth, pixel fill rates to make the mind boggle, and gigaflops of programmable shading power to spare.
We need to keep this engine fuelled with data, and since most of its data comes from memory, we have spent a lot of time and effort designing its Memory Management Unit (MMU). I'd like to show you around its headline features, and explain why a properly designed MMU is so important.
In a unified memory architecture, which most embedded graphics systems use, memory is shared between the CPU and GPU and acts as a high-bandwidth communication channel for the scene data.
The application running on the CPU, with the co-operation of the driver stack, will have allocated memory and set up all the data required to render a scene. A good application will prepare data in a format that the hardware can read directly, so that the driver stack has low overhead. (The chances of this happening are greatly increased if the GPU can read multiple different types of data, but that's a topic for another time.)
Now we need to make sure that this scene data is actually available to the GPU.
In the past, especially with non-unified architectures, it was common to have only a subset of the actual memory addressable by the GPU, often as some kind of memory "window". This is less prevalent now, as GPUs and other peripherals usually have access to the whole address space. But with the advent of 64-bit addressing in the mobile space, we need to plan ahead to avoid this kind of annoyance.
All modern CPUs have a memory manager built in, allowing the operating system to give each process an apparently contiguous virtual address space, even when the actual physical memory is badly fragmented. If the GPU addresses memory physically, this fragmentation becomes visible to it, and we have to ensure that any memory shared with the GPU is contiguous both physically and virtually.
In both of the above cases, you can use a physically-addressed GPU and some software workarounds, but having an MMU on the GPU itself is a much better solution to the memory sharing problem. The software workaround would be to use a special memory allocator to create a suitable shared area and copy the data into it. Apart from the obvious cost of the copy, we also need to do the allocation, perhaps wrestling with the OS if it doesn't support this directly, and keep tabs on when this memory can be freed.
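To make those costs concrete, here is a minimal sketch of the copy-based workaround. The allocator name is hypothetical, standing in for whatever contiguous allocator the OS provides, and is stubbed with malloc purely so the sketch compiles:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical GPU-visible allocator -- a real system would use a
 * dedicated carveout or CMA-style pool; malloc is only a stand-in
 * so this sketch compiles. */
static void *alloc_gpu_visible(size_t size)
{
    return malloc(size);
}

/* The copy-based workaround: allocate a GPU-visible region, copy the
 * scene data in, and remember to free it once the GPU is done. The
 * memcpy, the allocation, and the lifetime tracking are exactly the
 * overheads a GPU-side MMU removes. */
void *share_with_gpu(const void *scene_data, size_t size)
{
    void *shared = alloc_gpu_visible(size);
    if (shared)
        memcpy(shared, scene_data, size);
    return shared;   /* caller must track when this can be freed */
}
```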
The GPU's MMU can instead map memory from the CPU's address space into the GPU's address space very easily. If it uses the same layout of page tables as the CPU, this information can be shared or copied very quickly.
The Mali-T604 MMU uses the same page table layout as ARM CPUs which support large address spaces, such as the ARM Cortex™-A15 MPCore™ processor. This reduces the overhead of building page tables for the GPU, and is a format already familiar to developers. It also supports large page sizes, which further simplifies memory mapping from the CPU, as we do not have to break up large pages; this reduces page management overhead and makes better use of TLB resources on the GPU.
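As a rough illustration of why sharing the layout helps, here is how a 48-bit virtual address decomposes under the 4KB-granule format used by large-address-space ARM cores: four levels of 9-bit table indices plus a 12-bit page offset, with block entries at the higher levels providing the large page sizes. This is a sketch of the general format, not the driver's actual table-walking code:

```c
#include <stdint.h>
#include <stdio.h>

/* Break a 48-bit virtual address into 4KB-granule table indices:
 * four levels of 9 bits each, then a 12-bit offset within the page.
 * A block entry at level 2 maps a whole 2MB region, and at level 1
 * a 1GB region, which is why big CPU-side mappings do not have to
 * be shattered into 4KB entries for the GPU. */
int main(void)
{
    uint64_t va = 0x0000123456789ABCULL;   /* an arbitrary example address */

    unsigned l0  = (va >> 39) & 0x1FF;     /* bits 47:39 */
    unsigned l1  = (va >> 30) & 0x1FF;     /* bits 38:30 */
    unsigned l2  = (va >> 21) & 0x1FF;     /* bits 29:21 */
    unsigned l3  = (va >> 12) & 0x1FF;     /* bits 20:12 */
    unsigned off =  va        & 0xFFF;     /* bits 11:0  */

    printf("L0=%u L1=%u L2=%u L3=%u offset=0x%03X\n", l0, l1, l2, l3, off);
    return 0;
}
```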
Another benefit the MMU brings is that we can set up regions of memory with the same virtual address on both the CPU and GPU. Then we no longer need to manually translate addresses as they are passed to the GPU, saving the driver a translation step for every pointer it hands across.
This also makes it much easier to manipulate these structures on the CPU side, allowing us to use more complex data structures to transfer more work to the GPU itself. Without this, we would either have to make a copy, or keep track of all the pointers on both sides of the CPU-GPU divide as we update the data structures.
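As a sketch of what this buys, consider a linked structure built on the CPU and consumed directly by the GPU. The allocator and submit call below are hypothetical stand-ins, not the real driver API; the point is that the embedded pointers need no translation:

```c
#include <stdlib.h>

/* Hypothetical stand-ins for a shared-virtual-memory allocator and a
 * job-submission call; these are not the real driver API, and error
 * handling is elided to keep the sketch short. */
static void *svm_alloc(size_t size) { return malloc(size); }
static void  gpu_submit(void *job)  { (void)job; /* hand off to GPU */ }

struct node {
    struct node *next;       /* an ordinary CPU pointer */
    float        payload[4];
};

/* With identical virtual addresses on both sides, the GPU can walk
 * head->next exactly as the CPU would: no pointer patching, no
 * flattening into offsets, and no copy of the structure. */
void share_list(void)
{
    struct node *head = svm_alloc(sizeof *head);
    head->next        = svm_alloc(sizeof *head);
    head->next->next  = NULL;

    gpu_submit(head);
}
```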
Once we have an MMU, of course we also gain all the traditional benefits of inter-process protection, fast context switching, and fault handling. We can even start to use techniques on the GPU that have long been traditional on the CPU-side, such as demand paging.
There are other benefits too. You gain a degree of control over your memory that is simply not possible with software-based schemes. It is possible to map memory with different caching characteristics for different data structures. This matters when data that is read only once should not be allowed to pollute valuable cache space on the GPU.
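A driver might expose this along the following lines. The cache-policy enum and mapping call are illustrative assumptions, not the actual Mali interface, and the mapping call is stubbed so the sketch compiles:

```c
#include <stddef.h>

/* Illustrative per-mapping cache policies -- assumed names. */
enum gpu_cache_policy {
    GPU_CACHED,      /* normal cacheable: data the GPU re-reads */
    GPU_UNCACHED     /* streaming data read once; kept out of cache */
};

/* Hypothetical mapping call: a real driver would set the page-table
 * attributes for this range here. */
static void gpu_map(void *ptr, size_t size, enum gpu_cache_policy pol)
{
    (void)ptr; (void)size; (void)pol;
}

void setup_frame(void *vertex_stream, size_t vsize,
                 void *shared_state,  size_t ssize)
{
    /* A one-shot vertex stream should not evict useful cache lines. */
    gpu_map(vertex_stream, vsize, GPU_UNCACHED);

    /* State that many threads re-read benefits from caching. */
    gpu_map(shared_state, ssize, GPU_CACHED);
}
```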
Often, a mobile platform has only limited free virtual address space on the CPU, even when plenty of physical memory is available. As mobile 3D assets get bigger, it makes sense to allow the CPU to unmap certain classes of memory from its own address space, while still letting the GPU access it. Texture memory is the obvious example: most textures are placed in memory once by the CPU and then only read by the GPU. If the MMU uses separate page tables from the CPU, this is easy to arrange.
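In code, the upload-then-unmap pattern might look like the following; the buffer calls are hypothetical names for whatever the driver exposes, stubbed here so the sketch compiles:

```c
#include <string.h>

static char backing[1 << 20];   /* stand-in for the buffer's pages */

/* Hypothetical driver entry points. A real driver would mmap the
 * buffer into the CPU's address space and later munmap it, leaving
 * the GPU's own page tables untouched. */
static void *gpu_buffer_map_cpu(int handle, size_t size)
{
    (void)handle; (void)size;
    return backing;
}

static void gpu_buffer_unmap_cpu(int handle)
{
    (void)handle;
}

/* Write the texels once, then give the CPU its virtual range back.
 * The GPU's separate page tables still map the pages, so the texture
 * stays renderable while the CPU address footprint shrinks. */
void upload_texture(int handle, const void *texels, size_t size)
{
    void *dst = gpu_buffer_map_cpu(handle, size);
    memcpy(dst, texels, size);
    gpu_buffer_unmap_cpu(handle);
}
```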
The Mali-T604 MMU also allows pages to be marked as shared, allowing us to exploit coherency with the CPU. The drivers take advantage of all of these features to optimise both performance and CPU address footprint.
On a CPU, the threading model is quite coarse, with each thread running for thousands of cycles before context switching to another thread. Inter-process context switches require us to change MMU settings to ensure appropriate protection, and this usually involves changing the page tables and flushing any cached translation data.
On the GPU, hundreds of threads will be in flight at any one time; the 4-core Mali-T604 has 1024 live threads, for example. Since switching between threads occurs at a very fine-grained level, using the CPU strategy of switching page tables would be very expensive. By equipping the GPU's MMU with the ability to access multiple address spaces at one time, we can eliminate the context-switch bottleneck in the majority of cases.
Although we need to protect each process by giving it rights to access only its own memory, we can still mix threads from different applications in the same GPU at the same time. Each thread is associated with one virtual address space, corresponding to the process, and each address space has its own page table. In this way, the GPU can effectively switch context between every instruction of every thread, if necessary.
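A minimal sketch of the idea, with assumed structure and field names: each thread carries an address-space ID, and translation simply indexes the matching page-table root, so nothing needs to be swapped or flushed between threads:

```c
#include <stdint.h>

#define NUM_ADDRESS_SPACES 4   /* four concurrently resident, as described */

struct address_space {
    uint64_t pgtable_root;     /* base of this process's page tables */
};

struct gpu_thread {
    uint8_t as_id;             /* which resident address space this thread uses */
    /* ... program counter, registers, etc. ... */
};

static struct address_space as_slots[NUM_ADDRESS_SPACES];

/* Every access translates through the issuing thread's own address
 * space, so threads from different processes interleave instruction
 * by instruction with no page-table swap or TLB flush between them. */
static uint64_t translation_root(const struct gpu_thread *t)
{
    return as_slots[t->as_id].pgtable_root;
}
```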
The Mali-T604 MMU implements four independent address spaces, so that threads from four processes can be running on the GPU at the same time without needing to context switch.
As assets get bigger, and platforms become more capable, the limitations of 32-bit addressing start to become apparent, and a larger address space becomes a necessity. As high-end embedded devices start to use memories beyond 4GB in size, having a similarly capable MMU on the GPU prevents us having a recurrence of the "special GPU-addressable memory window" problem.
Do you remember the memory map from the original IBM PC? Most peripherals could only access the so-called "low" memory area, which was therefore full of special-purpose memory regions, fragmenting the available memory and making programming so much more tortuous than it needed to be.
With a 64-bit MMU, the GPU's memory can be allocated anywhere in the CPU's 64-bit address space and we neatly avoid falling back into that particular quagmire.
The Mali-T604 MMU implements 64-bit virtual addresses, with 48 bits actually mappable, and up to 48 bits (2^48 bytes, or 256 TB) of physical address space. This matches typical large-memory CPU implementations like the Cortex-A15, and is a good compromise between page table size and actual RAM size. A terabyte should be enough main memory for embedded devices, at least for now. Storing and interpreting the full 64-bit address allows us to expand a long way further in the future without changing the architecture.
A capable memory management unit is fast becoming an essential part of a modern embedded graphics processor unit. It solves some common problems and also creates some exciting new opportunities for optimising how graphics drivers are written.
In conclusion, then, a fully-featured 64-bit memory management unit, like that in the Mali-T604, is the "must have" accessory for next-generation GPUs.