Vulkan Mobile Best Practice: Picking the Most Efficient Load/Store Operations

October 15, 2019


When you set up your render passes in Vulkan you have to set load and store operations. They let you specify what should be done with your images at the render pass boundaries: discard or clear the contents, or keep them in memory.

It may be surprising, especially if you have a background in desktop graphics, that you must specify these operations upfront. They even look like duplicates, since you can still clear the screen at the beginning of your command buffer.

We need to specify these operations upfront because it allows for some special optimizations in tile-based GPUs (virtually all mobile GPUs). A load operation is the only chance to clear the screen efficiently on mobile! To understand why this seemingly minor topic matters so much, we need to start with a brief overview of the unique challenges of mobile GPUs.

Tile-based rendering

The explanation that follows is a bit of an oversimplification, but it will make sure we are on the same page for what follows. You can learn more about tile-based rendering in the Mali GPU: An Abstract Machine blog post by Pete Harris.

The main challenge for mobile GPUs is that memory bandwidth is at a premium – reading or writing a full-screen image has a significant cost. The idea behind tile-based rendering is to minimize the number of main memory accesses by rendering a small area of the screen at a time (a “tile”) using a fast tile-local memory. Results are written back to main memory only once the whole tile has been rendered.

For example, this explains why post-processing tends to be expensive on mobile: a post-processing fragment shader will typically need to access neighboring pixels, including across tiles. This forces the GPU to flush the initial image to main memory and then read it back for post-processing. This is a full-screen image write/read, which is expensive as previously mentioned.

Load operations

How do these concepts affect your Vulkan app? You may have to rethink your view of images and pay more attention to the cost of some operations.

Figure 1: Load operations

Figure 1 shows the available load operations in Vulkan. From the point of view of a desktop GPU, you may think of them this way:

LOAD: well, the image is already in memory, so we just use it;
CLEAR: that is extra work for the GPU, as it must clear the image before drawing;
DON’T CARE: I could just clear the screen myself with vkCmdClearColorImage.

This makes sense for desktop, but the picture is very different on mobile.

Let us start with LOAD. When processing a tile, the corresponding part of the image is copied to tile-local memory so it can be processed. Over all the tiles this adds up to a full-screen image load, and that’s not exactly optimal.

Can we do better? Let us look at CLEAR. Now the GPU would have to clear the whole image and then load it into tile-local memory. That seems like overkill – and it is indeed! Mobile GPUs can perform a nice optimization here: they don’t touch the stored image at all, and they just start a tile from a cleared state, which is free. This saves a full-screen image load from main memory, meaning that CLEAR will be significantly more efficient than LOAD on mobile.

Regarding DON’T CARE, that’s the same as CLEAR for Mali GPUs, so feel free to use it if you prefer. But if you are going to clear the screen at any point, please make sure you do it with a CLEAR load operation!

Think about it: what happens when you call vkCmdClearColorImage? You are specifically asking for the image to be cleared – and the GPU must comply. This results in a full-screen image write, which again is terrible for memory bandwidth.

The biggest takeaway from this section: LOAD_OP_CLEAR is free, vkCmdClearColorImage is very expensive. Use the former, so you can use the GPU budget for rendering great graphics instead of clearing the screen!

Store operations

Most of what we said about LOAD will apply to STORE as well, with minor adjustments.

Figure 2: Store operations

As usual, on desktop STORE may seem like an obvious choice: we are writing the image regardless, so we might as well store it.

Tile-based rendering has a different view of the world once again. Let us get one thing out of the way first: if you need an image in any further render pass, you will need to store it. There is no way around that.

But let us say that we have a simple forward renderer, in which we are not going to re-use the depth buffer. If we set the store operation to STORE for the depth buffer, the GPU will have to explicitly copy it to main memory, tile by tile. If we set it to DON’T CARE, it will just be discarded at the end of tile processing, saving a full-screen write.

You might have noticed that if we clear the depth buffer, the GPU does not need to load the image from main memory, and if we also set the store operation to DON’T CARE, the GPU does not need to store it to main memory either. So why do we need memory for a depth buffer at all? Well, Vulkan has a tool for that. We still need to create the depth buffer, because the GPU must know its description so it can be processed in tile memory. On the other hand, the GPU does not need to allocate memory for it. We just need to do two things:

Allocate memory from a memory type with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, which tells the GPU not to allocate that memory unless it is really needed;
Create the depth buffer image with a usage flag of VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT.

With these two parameters set, no memory will be allocated for the buffer and of course no memory bandwidth will be spent to load and store it.

Don’t break render passes

This is a corollary of the previous sections: do not break render passes unless you really need to do so! More specifically, if you only need per-pixel accesses you can get away with just a single render pass, thus keeping data in tile-local memory.

UI is a good example: you can have a separate render pass for the UI and it might not make a difference on desktop, but on mobile it is definitely better if you keep it in the same render pass as your scene.

Another, more complex example is deferred rendering. Since the lighting pass does not need access to neighboring pixels, you can run both passes of deferred rendering as Vulkan subpasses of a single render pass. Data does not need to be written back to main memory between subpasses, as long as the subpass interface fits in tile-local memory.

The sample

Our Vulkan Best Practice for Mobile Developers project on GitHub has a sample on load and store operations, which lets you compare all possible combinations and see their effects on bandwidth. You can check out the tutorial for the sample here.

Figure 3: loadOp = LOAD, storeOp = STORE

Figure 4: loadOp = CLEAR, storeOp = STORE

Figure 5: loadOp = CLEAR, storeOp = DON’T CARE

As you can see from the pictures above, using the correct load and store operations has a significant impact on read and write bandwidth, respectively.

We can even compute the difference: as we have seen before, the cost of a LOAD is that of reading a full-screen color image. The size of a full-screen image is width × height × bytes per pixel, and if we multiply it by the frame rate, we get the impact on bandwidth.

In this case we get 4 bytes (32 bits) per pixel at 2220 x 1080 and 61.7 fps, resulting in a theoretical difference of 591 MB/s.
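The estimate can be reproduced with a few lines of C. The function names are hypothetical, and the sketch assumes 1 MB = 10^6 bytes and an uncompressed image.

```c
#include <stdint.h>

/* Size in bytes of one uncompressed full-screen image. */
static uint64_t image_size_bytes(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Bandwidth in MB/s (1 MB = 10^6 bytes) spent moving one such image
 * to or from main memory once per frame. */
static double bandwidth_mb_per_s(uint64_t image_bytes, double fps)
{
    return (double)image_bytes * fps / 1.0e6;
}
```

For the figures quoted above, `image_size_bytes(2220, 1080, 4)` gives 9,590,400 bytes, and `bandwidth_mb_per_s` at 61.7 fps gives roughly 591.7 MB/s.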

Results might be muddied by framebuffer compression, but in the pictures above it is disabled. As you can see, the measured difference is 645 MB/s, which is close to the estimated one.

Overall, we were able to get up to 12% savings in external read cycles and almost 50% savings in external write cycles.

We would encourage you to check out the project on the Vulkan Mobile Best Practice GitHub page and try the sample for yourself! The sample gives you on-screen controls to switch between the different load and store operations, and it shows the performance impact of each combination through real-time hardware counters on the display. You are also warmly invited to contribute to the project by providing feedback and fixes and creating additional samples.

Vulkan Best Practices: https://github.com/khronosGroup/Vulkan-samples
