Our latest world-class embedded graphics processor, the ARM® Mali™-T604 GPU, has excellent memory bandwidth, pixel fill rates to make the mind boggle, and gigaflops of programmable shading power to spare.
We need to keep this engine fuelled with data, and since most of its data comes from memory, we have spent a lot of time and effort designing its Memory Management Unit (MMU). I'd like to show you around its headline features, and explain why a properly designed MMU is so important.
In a unified memory architecture, which most embedded graphics systems use, memory is shared between the CPU and GPU and acts as a high-bandwidth communication channel for the scene data.
The application running on the CPU, with the co-operation of the driver stack, will have allocated memory and set up all the data required to render a scene. A good application will prepare data in a format that the hardware can read directly, so that the driver stack has low overhead. (The chances of this happening are greatly increased if the GPU can read multiple different types of data, but that's a topic for another time.)
Now we need to make sure that this scene data is actually available to the GPU.
In the past, especially with non-unified architectures, it was common to have only a subset of the actual memory addressable by the GPU, often as some kind of memory "window". This is less prevalent now, as GPUs and other peripherals usually have access to the whole address space. But with the advent of 64-bit addressing in the mobile space, we need to plan ahead to avoid this kind of annoyance.
All modern CPUs have a memory manager built in, allowing the operating system to give each process an apparently contiguous virtual address space, even when the actual physical memory is badly fragmented. If the GPU addresses memory physically, this fragmentation becomes visible to it, and we have to ensure that any memory shared with the GPU is contiguous both physically and virtually.
In both of the above cases, you can use a physically-addressed GPU and some software workarounds, but having an MMU on the GPU itself is a much better solution to the memory sharing problem. The software workaround would be to use a special memory allocator to create a suitable shared area and copy the data into it. Apart from the obvious cost of the copy, we also need to do the allocation, perhaps wrestling with the OS if it doesn't support this directly, and keep tabs on when this memory can be freed.
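To make those costs concrete, here is a minimal sketch of the copy-based workaround. The allocator name is hypothetical, standing in for whatever contiguous allocator the OS provides, and is stubbed with malloc purely so the sketch compiles:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical GPU-visible allocator -- a real system would use a
 * dedicated carveout or CMA-style pool; malloc is only a stand-in
 * so this sketch compiles. */
static void *alloc_gpu_visible(size_t size)
{
    return malloc(size);
}

/* The copy-based workaround: allocate a GPU-visible region, copy the
 * scene data in, and remember to free it once the GPU is done. The
 * memcpy, the allocation, and the lifetime tracking are exactly the
 * overheads a GPU-side MMU removes. */
void *share_with_gpu(const void *scene_data, size_t size)
{
    void *shared = alloc_gpu_visible(size);
    if (shared)
        memcpy(shared, scene_data, size);
    return shared;   /* caller must track when this can be freed */
}
```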
The GPU's MMU can instead map memory from the CPU's address space into the GPU's address space very easily. If it uses the same layout of page tables as the CPU, this information can be shared or copied very quickly.
The Mali-T604 MMU uses the same page table layout as ARM CPUs which support large address spaces, such as the ARM Cortex™-A15 MPCore™ processor. This reduces the overhead of building page tables for the GPU, and is a format already familiar to developers. It also supports large page sizes, which further simplifies memory mapping from the CPU, as we do not have to break up large pages; this reduces page management overhead and makes better use of TLB resources on the GPU.
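As a rough illustration of why sharing the layout helps, here is how a 48-bit virtual address decomposes under the 4KB-granule format used by large-address-space ARM cores: four levels of 9-bit table indices plus a 12-bit page offset, with block entries at the higher levels providing the large page sizes. This is a sketch of the general format, not the driver's actual table-walking code:

```c
#include <stdint.h>
#include <stdio.h>

/* Break a 48-bit virtual address into 4KB-granule table indices:
 * four levels of 9 bits each, then a 12-bit offset within the page.
 * A block entry at level 2 maps a whole 2MB region, and at level 1
 * a 1GB region, which is why big CPU-side mappings do not have to
 * be shattered into 4KB entries for the GPU. */
int main(void)
{
    uint64_t va = 0x0000123456789ABCULL;   /* an arbitrary example address */

    unsigned l0  = (va >> 39) & 0x1FF;     /* bits 47:39 */
    unsigned l1  = (va >> 30) & 0x1FF;     /* bits 38:30 */
    unsigned l2  = (va >> 21) & 0x1FF;     /* bits 29:21 */
    unsigned l3  = (va >> 12) & 0x1FF;     /* bits 20:12 */
    unsigned off =  va        & 0xFFF;     /* bits 11:0  */

    printf("L0=%u L1=%u L2=%u L3=%u offset=0x%03X\n", l0, l1, l2, l3, off);
    return 0;
}
```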
Another benefit the MMU brings is that we can set up regions of memory with the same virtual address on both the CPU and GPU. Then we no longer need to manually translate addresses as they are passed to the GPU, saving the driver a translation step for every pointer it hands across.
This also makes it much easier to manipulate these structures on the CPU side, allowing us to use more complex data structures to transfer more work to the GPU itself. Without this, we would either have to make a copy, or keep track of all the pointers on both sides of the CPU-GPU divide as we update the data structures.
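As a sketch of what this buys, consider a linked structure built on the CPU and consumed directly by the GPU. The allocator and submit call below are hypothetical stand-ins, not the real driver API; the point is that the embedded pointers need no translation:

```c
#include <stdlib.h>

/* Hypothetical stand-ins for a shared-virtual-memory allocator and a
 * job-submission call; these are not the real driver API, and error
 * handling is elided to keep the sketch short. */
static void *svm_alloc(size_t size) { return malloc(size); }
static void  gpu_submit(void *job)  { (void)job; /* hand off to GPU */ }

struct node {
    struct node *next;       /* an ordinary CPU pointer */
    float        payload[4];
};

/* With identical virtual addresses on both sides, the GPU can walk
 * head->next exactly as the CPU would: no pointer patching, no
 * flattening into offsets, and no copy of the structure. */
void share_list(void)
{
    struct node *head = svm_alloc(sizeof *head);
    head->next        = svm_alloc(sizeof *head);
    head->next->next  = NULL;

    gpu_submit(head);
}
```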
Once we have an MMU, of course we also gain all the traditional benefits of inter-process protection, fast context switching, and fault handling. We can even start to use techniques on the GPU that have long been traditional on the CPU-side, such as demand paging.
There are other benefits too. You gain a degree of control over your memory that is simply not possible with software-based schemes. It is possible to map memory with different caching characteristics for different data structures. This matters when data that is read only once should not be allowed to pollute valuable cache space on the GPU.
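A driver might expose this along the following lines. The cache-policy enum and mapping call are illustrative assumptions, not the actual Mali interface, and the mapping call is stubbed so the sketch compiles:

```c
#include <stddef.h>

/* Illustrative per-mapping cache policies -- assumed names. */
enum gpu_cache_policy {
    GPU_CACHED,      /* normal cacheable: data the GPU re-reads */
    GPU_UNCACHED     /* streaming data read once; kept out of cache */
};

/* Hypothetical mapping call: a real driver would set the page-table
 * attributes for this range here. */
static void gpu_map(void *ptr, size_t size, enum gpu_cache_policy pol)
{
    (void)ptr; (void)size; (void)pol;
}

void setup_frame(void *vertex_stream, size_t vsize,
                 void *shared_state,  size_t ssize)
{
    /* A one-shot vertex stream should not evict useful cache lines. */
    gpu_map(vertex_stream, vsize, GPU_UNCACHED);

    /* State that many threads re-read benefits from caching. */
    gpu_map(shared_state, ssize, GPU_CACHED);
}
```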
Often, a mobile platform has only limited free virtual address space on the CPU, even when plenty of physical memory is available. As mobile 3D assets get bigger, it makes sense to allow the CPU to unmap certain classes of memory from its own address space, while still letting the GPU access it. Texture memory is the obvious example: most textures are placed in memory once by the CPU and then only read by the GPU. If the MMU uses separate page tables from the CPU, this is easy to arrange.
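In code, the upload-then-unmap pattern might look like the following; the buffer calls are hypothetical names for whatever the driver exposes, stubbed here so the sketch compiles:

```c
#include <string.h>

static char backing[1 << 20];   /* stand-in for the buffer's pages */

/* Hypothetical driver entry points. A real driver would mmap the
 * buffer into the CPU's address space and later munmap it, leaving
 * the GPU's own page tables untouched. */
static void *gpu_buffer_map_cpu(int handle, size_t size)
{
    (void)handle; (void)size;
    return backing;
}

static void gpu_buffer_unmap_cpu(int handle)
{
    (void)handle;
}

/* Write the texels once, then give the CPU its virtual range back.
 * The GPU's separate page tables still map the pages, so the texture
 * stays renderable while the CPU address footprint shrinks. */
void upload_texture(int handle, const void *texels, size_t size)
{
    void *dst = gpu_buffer_map_cpu(handle, size);
    memcpy(dst, texels, size);
    gpu_buffer_unmap_cpu(handle);
}
```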
The Mali-T604 MMU also allows pages to be marked as shared, allowing us to exploit coherency with the CPU. The drivers take advantage of all of these features to optimise both performance and CPU address footprint.
On a CPU, the threading model is quite coarse, with each thread running for thousands of cycles before context switching to another thread. Inter-process context switches require us to change MMU settings to ensure appropriate protection, and this usually involves changing the page tables and flushing any cached translation data.
On the GPU, hundreds of threads will be in flight at any one time; the 4-core Mali-T604 has 1024 live threads, for example. Since switching between threads occurs at a very fine-grained level, using the CPU strategy of switching page tables would be very expensive. By equipping the GPU's MMU with the ability to access multiple address spaces at one time, we can eliminate the context-switch bottleneck in the majority of cases.
Although we need to protect each process by giving it rights to access only its own memory, we can still mix threads from different applications in the same GPU at the same time. Each thread is associated with one virtual address space, corresponding to the process, and each address space has its own page table. In this way, the GPU can effectively switch context between every instruction of every thread, if necessary.
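A minimal sketch of the idea, with assumed structure and field names: each thread carries an address-space ID, and translation simply indexes the matching page-table root, so nothing needs to be swapped or flushed between threads:

```c
#include <stdint.h>

#define NUM_ADDRESS_SPACES 4   /* four concurrently resident, as described */

struct address_space {
    uint64_t pgtable_root;     /* base of this process's page tables */
};

struct gpu_thread {
    uint8_t as_id;             /* which resident address space this thread uses */
    /* ... program counter, registers, etc. ... */
};

static struct address_space as_slots[NUM_ADDRESS_SPACES];

/* Every access translates through the issuing thread's own address
 * space, so threads from different processes interleave instruction
 * by instruction with no page-table swap or TLB flush between them. */
static uint64_t translation_root(const struct gpu_thread *t)
{
    return as_slots[t->as_id].pgtable_root;
}
```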
The Mali-T604 MMU implements four independent address spaces, so that threads from four processes can be running on the GPU at the same time without needing to context switch.
As assets get bigger, and platforms become more capable, the limitations of 32-bit addressing start to become apparent, and a larger address space becomes a necessity. As high-end embedded devices start to use memories beyond 4GB in size, having a similarly capable MMU on the GPU prevents us having a recurrence of the "special GPU-addressable memory window" problem.
Do you remember the memory map from the original IBM PC? Most peripherals could only access the so-called "low" memory area, which was therefore full of special-purpose memory regions, fragmenting the available memory and making programming so much more tortuous than it needed to be.
With a 64-bit MMU, the GPU's memory can be allocated anywhere in the CPU's 64-bit address space and we neatly avoid falling back into that particular quagmire.
The Mali-T604 MMU implements 64-bit virtual addresses, with 48 bits actually mappable, and up to 48 bits (2^48 bytes, or 256 TB) of physical address space. This matches typical large-memory CPU implementations like the Cortex-A15, and is a good compromise between page table size and actual RAM size. A terabyte should be enough main memory for embedded devices, at least for now. Storing and interpreting the full 64-bit address allows us to expand a long way further in the future without changing the architecture.
A capable memory management unit is fast becoming an essential part of a modern embedded graphics processor unit. It solves some common problems and also creates some exciting new opportunities for optimising how graphics drivers are written.
In conclusion, then, a fully-featured 64-bit memory management unit, like that in the Mali-T604, is the "must have" accessory for next-generation GPUs.