Vulkan Mobile Best Practice: Picking the Most Efficient Load/Store Operations

Attilio Provenzano
October 15, 2019
7 minute read time.

When you set up your render passes in Vulkan you have to set load and store operations. They let you specify what should be done with your images at the render pass boundaries – discard/clear the contents or keep them in memory.

It may be surprising, especially if you have a background in desktop graphics, that you must specify these operations upfront. They can even look redundant, since you can still clear the screen at the beginning of your command buffer.

The reason we need to specify these operations upfront is that it enables some important optimizations on tile-based GPUs (which means virtually all mobile GPUs). A load operation is the only way to clear the screen efficiently on mobile! To understand why this seemingly minor topic matters so much, we need to start with a brief overview of the unique challenges of mobile GPUs.

Tile-based rendering

The explanation below is a bit of an oversimplification, but it will make sure we are on the same page for the rest of this post. You can learn more about tile-based rendering in the Mali GPU: An Abstract Machine blog post by Pete Harris.

The main challenge for mobile GPUs is that memory bandwidth is at a premium – reading or writing a full-screen image has a significant cost. The idea behind tile-based rendering is to minimize the number of main memory accesses by rendering a small area of the screen at a time (a “tile”) using fast tile-local memory. Only when the whole tile has been rendered are the results written back to main memory.

This explains, for example, why post-processing tends to be expensive on mobile: a post-processing fragment shader typically needs access to neighboring pixels, including pixels from other tiles. This forces the GPU to flush the initial image to main memory and then read it back for post-processing – a full-screen image write plus read, which is expensive, as mentioned above.

Load operations

How do these concepts affect your Vulkan app? You may have to rethink your view of images and pay more attention to the cost of some operations.

Figure 1: Load operations

Figure 1 shows the available load operations in Vulkan. From the point of view of a desktop GPU, you may think of them this way:

  • LOAD: well, the image is already in memory, so we just use it;
  • CLEAR: that is extra work for the GPU, as it must clear the image before drawing;
  • DON’T CARE: I could just clear the screen myself with vkCmdClearColorImage.

This makes sense for desktop, but the picture is very different on mobile.

Let us start with LOAD. When processing a tile, the corresponding part of the image is copied into tile-local memory so it can be processed. Over all the tiles this adds up to a full-screen image load, which is not exactly optimal.

Can we do better? Let us look at CLEAR. At first glance the GPU would have to clear the whole image in main memory and then load it into tile-local memory. That seems like overkill – and it would be! Mobile GPUs can perform a nice optimization here: they don’t touch the stored image at all and simply start each tile from a cleared state, which is free. This saves a full-screen image load from main memory, meaning that CLEAR is significantly more efficient than LOAD on mobile.

Regarding DON’T CARE, that’s the same as CLEAR for Mali GPUs, so feel free to use it if you prefer. But if you are going to clear the screen at any point, please make sure you do it with a CLEAR load operation!

Think about it: what happens when you call vkCmdClearColorImage? You are specifically asking for the image to be cleared – and the GPU must comply. This results in a full screen image write, which again is terrible for memory bandwidth.

The biggest takeaway from this section: LOAD_OP_CLEAR is free, vkCmdClearColorImage is very expensive. Use the former, so you can use the GPU budget for rendering great graphics instead of clearing the screen!
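
To make this concrete, here is a minimal sketch (not taken from the sample; the format, handles, and render area are illustrative) of a color attachment whose clear is expressed entirely through the load operation and the clear value passed to vkCmdBeginRenderPass – no vkCmdClearColorImage needed:

VkAttachmentDescription color_attachment = {0};
color_attachment.format         = VK_FORMAT_B8G8R8A8_UNORM;       /* illustrative swapchain format */
color_attachment.samples        = VK_SAMPLE_COUNT_1_BIT;
color_attachment.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;    /* "free" clear on tile-based GPUs */
color_attachment.storeOp        = VK_ATTACHMENT_STORE_OP_STORE;   /* the image is presented, so keep it */
color_attachment.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
color_attachment.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
color_attachment.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
color_attachment.finalLayout    = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

/* The clear color is supplied when the render pass begins, not via a separate clear command. */
VkClearValue clear_value = {{{0.0f, 0.0f, 0.0f, 1.0f}}};

VkRenderPassBeginInfo begin_info = {0};
begin_info.sType           = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
begin_info.renderPass      = render_pass;   /* created from the attachment description above */
begin_info.framebuffer     = framebuffer;
begin_info.renderArea      = render_area;   /* full framebuffer extent */
begin_info.clearValueCount = 1;
begin_info.pClearValues    = &clear_value;

vkCmdBeginRenderPass(cmd, &begin_info, VK_SUBPASS_CONTENTS_INLINE);

On a tile-based GPU this clear costs nothing: each tile simply starts from the cleared state.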

Store operations

Most of what we said about LOAD will apply to STORE as well, with minor adjustments.

Figure 2: Store operations

As usual, STORE may seem like the obvious choice on desktop: we are writing the image regardless, so we might as well store it.

Tile-based rendering has a different view of the world once again. Let us get one thing out of the way first: if you need an image in any later render pass, you will have to store it – there is no way around that.

But let us say that we have a simple forward renderer, in which we are not going to re-use the depth buffer. If we set the store operation to STORE for the depth buffer, the GPU will have to explicitly copy it to the main memory, tile by tile. If we set it to DON’T CARE it will just be discarded at the end of tile processing, saving a full screen write.
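
As a sketch, the depth attachment in that forward renderer could be described like this (the depth format is an illustrative choice):

VkAttachmentDescription depth_attachment = {0};
depth_attachment.format         = VK_FORMAT_D32_SFLOAT;              /* illustrative depth format */
depth_attachment.samples        = VK_SAMPLE_COUNT_1_BIT;
depth_attachment.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;       /* no load from main memory */
depth_attachment.storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE;  /* no write back to main memory */
depth_attachment.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
depth_attachment.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
depth_attachment.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
depth_attachment.finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;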

You might have noticed that if we clear the depth buffer, the GPU does not need to load it from main memory, and if we set its store operation to DON’T CARE, the GPU does not have to write it back either. So why do we need to allocate main memory for the depth buffer at all? Vulkan has a tool for exactly this case. We still need to create the depth image, because the GPU needs its description so the attachment can be processed in tile memory; however, it never needs physical backing in main memory. We just need to do two things:

  • Allocate the image’s memory from a memory type with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, which tells the driver not to back it with physical memory unless it is really needed;
  • Create the depth image with a usage flag of VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT.

With these two parameters set, no physical memory will be allocated for the depth buffer, and of course no memory bandwidth will be spent loading or storing it.
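
Here is a rough sketch of how the two settings fit together when creating such a depth image (find_memory_type is a hypothetical helper that picks a memory type from reqs.memoryTypeBits with the requested property; error handling is omitted):

VkImageCreateInfo image_info = {0};
image_info.sType       = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
image_info.imageType   = VK_IMAGE_TYPE_2D;
image_info.format      = VK_FORMAT_D32_SFLOAT;
image_info.extent      = (VkExtent3D){swapchain_width, swapchain_height, 1};
image_info.mipLevels   = 1;
image_info.arrayLayers = 1;
image_info.samples     = VK_SAMPLE_COUNT_1_BIT;
image_info.tiling      = VK_IMAGE_TILING_OPTIMAL;
/* TRANSIENT_ATTACHMENT: the contents only ever live in tile-local memory. */
image_info.usage       = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT |
                         VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT;

VkImage depth_image;
vkCreateImage(device, &image_info, NULL, &depth_image);

VkMemoryRequirements reqs;
vkGetImageMemoryRequirements(device, depth_image, &reqs);

VkMemoryAllocateInfo alloc_info = {0};
alloc_info.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
alloc_info.allocationSize  = reqs.size;
/* find_memory_type() is a hypothetical helper: it picks a type from
 * reqs.memoryTypeBits that has VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT. */
alloc_info.memoryTypeIndex = find_memory_type(physical_device, reqs.memoryTypeBits,
                                              VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT);

VkDeviceMemory depth_memory;
vkAllocateMemory(device, &alloc_info, NULL, &depth_memory);
vkBindImageMemory(device, depth_image, depth_memory, 0);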

Don’t break render passes

This is a corollary of the previous sections: do not break render passes unless you really need to! More specifically, if you only need per-pixel accesses, you can get away with a single render pass and keep your data in tile-local memory.

UI rendering is a good example: a separate render pass for the UI might not make a difference on desktop, but on mobile it is definitely better to keep it in the same render pass as your scene.

A more complex example is deferred rendering. Since the lighting pass only needs the G-buffer data for the pixel it is shading, you can run both passes of deferred rendering as Vulkan subpasses of a single render pass. Data does not need to be written back to main memory between subpasses, as long as the subpass interface fits in tile-local memory.
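
As a condensed sketch of the idea, two subpasses can share a single G-buffer attachment, with the lighting subpass consuming it as an input attachment. Attachment indices, formats, and layouts are illustrative, and the surrounding VkRenderPassCreateInfo setup is omitted:

/* Subpass 0: the geometry pass writes the G-buffer attachment (index 1). */
VkAttachmentReference gbuffer_write = {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription geometry_pass = {0};
geometry_pass.pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
geometry_pass.colorAttachmentCount = 1;
geometry_pass.pColorAttachments    = &gbuffer_write;

/* Subpass 1: the lighting pass reads the same attachment as an input attachment
 * and writes the final color (index 0). If the G-buffer attachment is transient
 * and its store op is DONT_CARE, its data never leaves tile-local memory. */
VkAttachmentReference gbuffer_read  = {1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL};
VkAttachmentReference swapchain_out = {0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription lighting_pass = {0};
lighting_pass.pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
lighting_pass.inputAttachmentCount = 1;
lighting_pass.pInputAttachments    = &gbuffer_read;
lighting_pass.colorAttachmentCount = 1;
lighting_pass.pColorAttachments    = &swapchain_out;

/* A per-region dependency keeps ordering between the two subpasses without
 * forcing the intermediate data out to main memory. */
VkSubpassDependency dependency = {0};
dependency.srcSubpass      = 0;
dependency.dstSubpass      = 1;
dependency.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependency.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
dependency.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dependency.dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
dependency.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;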

The sample

Our Vulkan Best Practice for Mobile Developers project on GitHub has a sample on load and store operations, which lets you compare all possible combinations and see their effect on bandwidth. You can check out the tutorial for the sample here.

Figure 3: loadOp = LOAD, storeOp = STORE

Figure 4: loadOp = CLEAR, storeOp = STORE

Figure 5: loadOp = CLEAR, storeOp = DON’T CARE

As you can see from the figures above, choosing the right load operation has a significant impact on read bandwidth, and the right store operation on write bandwidth.

We can even estimate the difference: as we have seen, the cost of a LOAD is that of reading a full-screen color image. We can compute the size of a full-screen image as width × height × bytes per pixel, and if we multiply it by the framerate, we get the impact on bandwidth.

In this case we have 4 bytes (32 bits) per pixel at 2220 x 1080 and 61.7 fps, resulting in a theoretical difference of about 591 MB/s.
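
Spelled out, that back-of-the-envelope estimate is:

2220 × 1080 pixels × 4 bytes ≈ 9.6 MB per frame
9.6 MB per frame × 61.7 frames per second ≈ 591 MB/s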

Results might be muddied by framebuffer compression, but in the pictures above it is disabled. As you can see, the measured difference is 645 MB/s, which is close to the estimated one.

Overall, we were able to get up to 12% savings in external read cycles and almost 50% savings in external write cycles.

We encourage you to check out the Vulkan Mobile Best Practice project on GitHub and try the sample for yourself! The sample code gives developers on-screen control to demonstrate multiple ways of using the feature, and it shows the performance impact of the different approaches through real-time hardware counters on the display. You are also warmly invited to contribute to the project by providing feedback and fixes and by creating additional samples.

Vulkan Best Practices 

Comments

nickyang, over 1 year ago:
What is main memory? Is it CPU or system memory, or GPU main memory (global memory)?