Memory access ordering part 3: Memory access ordering in the Arm Architecture

September 11, 2013

10 minute read time.

In my previous posts, I have introduced the concept of memory access ordering and discussed barriers and their implementation in the Linux kernel. I chose to do it in this order because I wanted to start by communicating the underlying concepts before I went into detail about what the Arm architecture does about memory ordering. This post goes into the juicy bits of what this actually means and how this is handled in the Arm architecture.

Two separate concepts are relevant to memory access ordering in the Arm architecture - memory types and shareability domains. These progressively made their explicit entry into the Arm architecture in versions 6 and 7, implemented by the Arm11 and Cortex family of processors respectively.

Enter the abstract

When describing many of the concepts mentioned in this post, the Arm Architecture Reference Manual makes frequent use of the words/phrases observer/observers and is observed to or must observe. In practice, this refers to Master bus interfaces and how the devices controlling those interfaces, as well as the interconnect, must handle transactions. Only a Master interface can observe a transaction. Since all bus transactions are initiated by a Master, the ordering of accesses arriving at Slave interfaces can be inferred from the Master ordering rules. Note that transaction ordering does not refer simply to the order in which transactions leave a Master interface - they can often be reordered in the memory system, and can be observed in different order by different Masters where not explicitly ordered.

Memory types

Before the Armv6 architecture, not much explicit was defined about the out-of-ordering of memory accesses - the Sequential Execution Model was assumed to apply to all instructions. Processors that implemented caches and write buffers could mark regions of memory as being cacheable or bufferable without greater side effects than the obvious ones.

However, for modern processors that implement multiple cores, out-of-order execution or simply permits certain accesses to be buffered and others to happen synchronously, it is vital that certain rules are defined for what constraints apply to:

the order in which the memory accesses of one core relate to the surrounding instructions
the order in which the memory accesses of one core can be observed by other cores within a multi-core processor
the order in which the memory accesses of a processor can be observed by other Masters on the system bus

I will not go into full detail of the different memory types in this post (that is enough information for its own post), but I will give a quick overview of the points relevant for this post. Memory types (and their additional attributes such as cache policy) are configured in the translation tables the operating system sets up for the MMU.

Normal memory

Normal memory is effectively for all of your data and executable code. This memory type permits speculative reads, merging of accesses and (if interrupted by an exception) repeating of reads without side effects. Accesses to Normal memory can always be buffered, and in most situations they are also cached - but they can be configured to be uncached. There is no implicit ordering of Normal memory accesses, beyond pure address dependencies and control dependencies. When not explicitly restricted, the only limit to how out-of-order non-dependent accesses can be is the processor's ability to hold multiple live transactions.

Device memory and Strongly-ordered memory

The Device and Strongly-ordered memory types are used with memory mapped peripherals or other control registers. For the purposes of this post, Device and Strongly-ordered memory are quite similar, and with the Armv7-A Large Physical Address Extension (LPAE), this becomes even more true since processors implementing the LPAE treat Device and Strongly-ordered memory regions identically. Armv7-A processors that do not implement the LPAE can set device memory to be Shareable or Non-shareable.

Accesses to these types of memory must happen exactly the number of times that executing the program suggests they should. Two writes to the same location must be performed as two writes, and two reads from the same location must both take place. This is obviously important when you are accessing peripheral control registers. There is however no guarantee about ordering between memory accesses to different devices, or usually between accesses of different memory types.

Barriers

Barriers were introduced progressively into the Arm architecture.

Some Armv5 processors, such as the Arm926EJ-S, implemented a Drain Write Buffer cp15 operation, which halted execution until any buffered writes had drained into the external memory system.
With the introduction of the Armv6 memory model, this operation was redefined in more architectural terms and became the Data Synchronization Barrier. Armv6 also introduced the new Data Memory Barrier and Flush Prefetch Buffer cp15 operations.
Armv7 evolved the memory model somewhat, extending the meaning of the barriers - and the Flush Prefetch Buffer operation was renamed the Instruction Synchronization Barrier.
Armv7 also allocated dedicated instruction encodings for the barrier operations. Use of the cp15 operations is now deprecated and software targeting Armv7 or later should use the DMB, DSB and ISB mnemonics.
And finally, Armv7 extended the Shareability concept to cover both Inner-shareable and Outer-shareable domains (see below). This together with AMBA4 ACE gives us barriers that propagate into the memory system.

So, what are these barriers then, and what do they do?

Instruction Synchronization Barrier (ISB)

The Instruction Synchronization Barrier ensures that any subsequent instructions are fetched anew from cache in order that privilege and access is checked with the current MMU configuration. It is used to ensure any previously executed context changing operations (including cp15 operations) will have completed by the time the ISB completed.

Access type and domain are not really relevant for this barrier. It is not used in any of the Linux memory barrier primitives, but appears here and there in memory management, cache control and context switching code.

Data Memory Barrier (DMB)

The basic functionality of a DMB is as follows:

It prevents reordering of data accesses instructions across itself. All data accesses by this processor/core before the DMB will be visible to all other masters within the specified shareability domain before any of the data accesses after it. It also ensures that any explicit preceding data (or unified) cache maintenance operations have completed before any subsequent data accesses are executed.

The DMB instruction takes two optional parameters: an operation type (stores only - 'ST' - or loads and stores) and a domain. The default operation type is loads and stores and the default domain is System. So, in effect DMB is shorthand for DMB SY. All possible combinations of types and domains are legal operations on any processor, even if it does not implement the specific functionality described, and can be substituted internally for any stronger barrier.

In the Linux kernel, the DMB instruction is used for the smp_*mb() macros.

Data Synchronization Barrier (DSB)

The Data Synchronization Barrier enforces the same ordering as the Data Memory Barrier, but it also blocks execution of any further instructions until synchronization is complete. It also waits until all cache and branch predictor maintenance operations have completed for the specified shareability domain. If the access type is load and store then it also waits for any TLB maintenance operations to complete.

In the Linux kernel, the DSB instruction is used for the *mb() macros.

Shareability domains

The ordering of memory accesses in the Arm architecture takes place within what is called a Shareability domain. Shareability domains define "zones" within the bus topology within which memory accesses are to be kept consistent (taking place in a predictable way) and potentially coherent (with hardware support). Outside of this domain, observers might not see the same order of memory accesses as inside it.

The following table shows the different shareability options available in an Armv7-A system:

Domain	Abbreviation	Description
Non-shareable	NSH	A domain consisting only of the local agent. Accesses that never need to be syncronized with other cores, processors or devices. Not normally used in SMP systems.
Inner Shareable	ISH	A domain (potentially) shared by multiple agents, but usually not all agents in the system. A system can have multiple Inner Shareable domains. An operation that affects one Inner Shareable domain does not affect other Inner Shareable domains in the system.
Outer Shareable	OSH	A domain almost certainly shared by multiple agents, and quite likely consisting of several Inner Shareable domains. An operation that affects an Outer Shareable domain also implicitly affects all Inner Shareable domains within it (but does not otherwise behave as an Inner Shareable operation). For processors such as the Cortex-A15 MPCore that implement the LPAE, all Device memory accesses are considered Outer Shareable. For other processors, the shareability attribute can be explicitly set (to shareable or non-shareable). Devices within an Outer Shareable domain will normally be complex enough to have a concept of memory management and cache coherency (for example Mali graphics accelerators), although they might not be fully integrated in it.
Full system	SY	An operation on the full system affects all agents in the system; all Non-shareable regions, all Inner Shareable regions and all Outer Shareable regions. Simple peripherals such as UARTs, and several more complex ones, are not normally necessary to have in a restricted shareability domain.

Armv6 architecture does not support a separate outer shareable domain.

Shareability is effectively assigned to each memory transaction in the system, based on:

Memory attributes for the region accessed (determined by MMU translation tables)
Core configuration (can differ between cores within one multi-core processor)
Implementation of interconnect
Integration between interconnect and the masters connected to it

But there are also specific operations (instructions or cp15 configuration options) that can be performed with a domain defining their scope.

The diagram below shows an example system and one way the shareability domains can be implemented. Here we have each individual execution unit within the MPCore clusters possessing their own internal Non-shareable domain. The two MPCore clusters have been configured to make up one inner-shareable domain. There is an outer-shareable domain holding graphics and video accelerators and graphics output as well as memory. And finally, anything not in this subsystem is simply part of the System domain.

Barrier domains and levels

Unless any additional parameters are specified, the barriers apply to the System domain. All barrier instructions can take a domain specifier, and it is architecturally defined that any unsupported specifier will be treated as if it specified System. The Data barriers can additionally take a separate parameter (ST for "store") to indicate that this barrier should only affect store accesses, and that loads can be freely reordered around them.

For example, on a processor that does not distinguish between shareability domains DMB ISHST will execute simply as if it was DMB SY, whereas on one that does, it will be a lot more efficient.

Example:

DMB OSST

External caches

With AMBA4 ACE

The Cortex-A15 implements the AXI Coherency Extension, which makes barriers propagate through the interconnect. This makes it easy to maintain ordering within the ACE-aware portion of the system.

Before AMBA4 ACE

Before AMBA4 ACE, there was still a chance that bufferable memory transactions could be reordered in the external memory system even after barrier instructions had ensured that they left the Master interface in the correct order. On the Arm Versatile Express development platform with the 4x Cortex-A9 module, accesses can be reordered in the level 2 cache controller. For example, a write to a DMA descriptor can be overtaken by a subsequent write to a control register to initiate the DMA transaction. This is resolved by the *mb() macros expanding to perform an explicit "outer sync" with the PL310 when CONFIG_Arm_DMA_MEM_BUFFERABLE is set in the kernel configuration file.

Summary

Memory access ordering is a complex topic, but hopefully this 3-part series has provided a useful introduction. For complete and proper information of the memory model of the Arm architecture and the ordering requirements (and tools) for the AMBA interconnect, please see the resources listed below.

References

Other blogs in this series:

1944.zip

mj over 10 years ago

leiflindholm
First a very good article, this cleared my doubts which were raised after reading the ARM arch ref manual.
There are a few points which I need some help in understanding
(a) Inner / Outer is always wrt to a MPCore ?
If this is true then sharability is not a phenomena between 2 cores in same cluster (MPCore)
(b) The concept of Inner and Outer in Inner and Outer write through is same as Inner and outer shareable ?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Stephen Tether over 10 years ago

This is a nice, readable explanation of some difficult concepts. I like the diagram of shareability domains in particular. Firefox 3.6.x seems to have a problem with the table, though. The entries in the first two columns are horizontally aligned with the centers of the entries in the third. There are no horizontal rules separating entries which makes it difficult to see where they begin and end.Do you happen to know what the shareability domains are for the Xilinx Zynq system-on-a-chip? I suspect that the arrangement is very like the one in your diagram, with the two ARM cores forming the Inner domain. I haven't been able to get confirmation from Xilinx, though.Does it do any harm to use the IS variants of cache operations even when the system is not in SMP mode?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Leif Lindholm over 10 years ago

Thank you for your question.

Normal memory is effectively for all of your data and executable
code. This memory type permits speculative reads, merging of accesses
and (if interrupted by an exception) repeating of writes without side
effects.

Firstly, seeing that sentence again after many months, it jumps out of me that it is incorrect. It is meant to say "repeating of reads without side effects". Apologies for that mistake, which has now been corrected above. That does not however answer your question.

What confuses me is that it was my impression that interrupts could not occur during the execution of an instruction. In other words half of an "add" instruction (for example) can't elapse and then an interrupt come along and stop the processor from finishing the full "add" instruction. I think I even remember reading that instructions like the "load/store multiple" can't be interrupted meaning one instruction could go on for quite a while before the processor could be interrupted.

Could you help me understand what it is you meant that writes could be interrupted and subsequently repeated?

What you say was certainly true for many historical ARM processors, but these days when they multi-issue, speculate and execute out-of-order the world works somewhat differently. Even the simple statement "during the execution of an instruction" becomes quite convoluted when you may have 10 or more instructions "in flight" simultaneously - meaning they have been issued, but their results are not yet committed for use by all subsequent instructions.

To give a programmer's-view explanation that would make the processor designers cringe: the core needs to "pick" an instruction which is the one where the interrupt "happened". The results of instructions up to and including that one in the execution stream must be visible in the process context stored by the exception handler. The core must flush visible results of any instructions beyond that point. Indeed, this is what can cause the repeating of reads: rather than postpone taking an interrupt until a read operation has completed, the core "picks" the instruction before the access as the one where the interrupt "happened". It can then go off to the interrupt handler while the read continues in the background (discarding the data), and reissue the transaction when the handler returns upon completion.

This is not to say that all modern processors will do this, only that they are permitted to, and that system programmers must be aware of this so that they do not map any memory regions with read-sensitive peripherals as "Normal", which could lead to interesting things like certain data entities "disappearing" from FIFOs when a read operation was repeated.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Trevor Woerner over 10 years ago

Hi,

In this article you wrote:

Normal memory is effectively for all of your data and executable
code. This memory type permits speculative reads, merging of accesses
and (if interrupted by an exception) repeating of writes without side
effects.

What confuses me is that it was my impression that interrupts could not occur during the execution of an instruction. In other words half of an "add" instruction (for example) can't elapse and then an interrupt come along and stop the processor from finishing the full "add" instruction. I think I even remember reading that instructions like the "load/store multiple" can't be interrupted meaning one instruction could go on for quite a while before the processor could be interrupted.

Could you help me understand what it is you meant that writes could be interrupted and subsequently repeated?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Architectures and Processors blog

Part 2: Arm Scalable Matrix Extension (SME) Instructions

Zenon Xiu (修志龙）

This blog is the second half of a two-part blog for SME Instructions. See link to Part 1 in the note at the top of this blog post.
- June 24, 2024
Part 1: Arm Scalable Matrix Extension (SME) Introduction

Zenon Xiu (修志龙）

This blog series provides an introduction to the Arm Scalable Matrix Extension (SME) including SVE and SVE2.
- May 23, 2024
MPAM-Style cache partitioning with ATP-Engine and gem5

Hristo Belchev

Upstream gem5 and ATP-Engine MPAM-style cache partitioning are discussed, with experiments for the feature being proposed and analyzed.
- April 24, 2024

AI and ML blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded blog

Graphics, Gaming, and VR blog

High Performance Computing (HPC) blog

Infrastructure Solutions blog

Internet of Things (IoT) blog

Operating Systems blog

SoC Design and Simulation blog

Tools, Software and IDEs blog