With multithreaded applications, there are situations where it is unavoidable or desirable to wait for other threads. Implementing such wait instruction sequences correctly is important for both multithreaded scalability and power efficiency. Scalability is measured both in terms of aggregated throughput and fairness. Fairness is when all contending threads get an equal share of the contended resource. Fairness is an important but often overlooked characteristic and with poor fairness, some threads could starve, potentially forever, and thus contribute to high tail latencies.
This blog post describes different user space delay and wait implementations for the Armv8+ architecture and recommends best practices for the purpose of improving throughput and fair access to shared resources.
There are two common situations where a thread needs to or ought to wait:
Waiting in user space ‘spin-waiting’ may be preferred over invoking the operating system which requires system calls. If a thread spends more time waiting than processing, it can be questioned whether the thread should be present at all. If long and/or unpredictable waiting periods are acceptable, consider using relevant OS mechanisms for synchronization as this allows the OS to manage the processor in a more efficient way.
When waiting for other threads, the primary objective is to avoid memory contention. We define memory contention as multiple threads accessing the same cache line and at least one access is a write. Writes do not have to be explicit, for example software and hardware prefetching or other processor operations for example cache management operations can have the same practical effect. Memory contention hurts multithreaded scalability due to the use of MESI-based cache coherency protocols. With MESI, a write requires that processor to get the corresponding cache line in exclusive state which causes the invalidation of that line from all other processors that hold a copy. Another consideration is the behavior of reads following a write. Other processors reading the line will cause the line either to migrate or lose exclusive state. When spin-polling, processors are continuously reading a memory location and reacquiring a copy of the corresponding line if necessary. This can lead to inefficiencies when the owner comes to, for example, release a lock which requires another write to the cache line and thus exclusive ownership of the line. Together, these properties expose software to bottlenecks in and communication latencies of the memory system. The execution time overhead may grow linearly with the number of processors contending for the cache line. Write/write contention may also cause CAS (atomic compare and swap) operations to fail, wasting cache coherence bandwidth without making progress.
Reducing demand to the shared memory location decreases memory contention and allows for other threads to make quicker progress. This improves aggregated throughput and fairness and enables greater system wide progress. Reducing such demand is also known as backoff.
The chart below shows the behavior of a spinlock as the number of threads contending on it increases. From the chart we can see that using a backoff approach to reduce demand when waiting scales much better with high thread counts. Both throughput and fairness improve. The backoff is implemented by the thread waiting for several cycles between accesses to the contended spinlock.
Since the waiting processor is not doing any useful work, we would also like to decrease power consumption while waiting. Avoid keeping the processor pipeline and memory system busy when no progress is being made. Ideally, the processor should enter a low power state with clocks disabled.
Waiting is usually implemented using an instruction sequence which causes the processor to delay or stall execution. Such instruction sequences should ideally be independent of microarchitectural characteristics for example instruction latencies and processor clock frequency. The implementation will also affect power consumption of the delay function, the lower power consumption while waiting, the better.
The Arm architecture provides a user space accessible counter-timer which is incremented at a fixed but machine-specific rate. Software can (spin) wait until the counter-timer reaches some desired value. To make the waiting more power efficient, Armv8.7-A introduced the WFET (FEAT_WFxT) instruction, which allows the processor to enter a low power state for a set time, or until an event is received. Without this feature, we can instead use the ISB instruction to decrease processor activity and thus power consumption between checks of the counter-timer. Note that we do not depend on the latency of the ISB instruction.
C code implementing an Armv8.0-A compatible delay function using the counter-timer is shown below.
static inline unsigned long counter_read(void) { unsigned long cnt; //Need __volatile below since CNTVCT_EL0 does not return a constant value __asm __volatile("mrs %0,cntvct_el0" : "=r" (cnt)); return cnt; } static inline unsigned long counter_freq(void) { unsigned long freq; __asm("mrs %0,cntfrq_el0" : "=r" (freq)); return freq; } void nano_delay(unsigned long delay_ns) { //Prevent speculation of subsequent counter/timer reads and memory accesses __asm __volatile("isb"); if (delay_ns > 18) { //Adjust for overhead of initial ISB delay_ns -= 18; //unsigned long delay_ticks = delay_ns * counter_freq() / 1000000000UL; //With the below simplifications and adjustments, //we are usually within 2% of the correct value unsigned long delay_ticks = (delay_ns + delay_ns / 16) * counter_freq() >> 30; if (delay_ticks != 0) { unsigned long start = counter_read(); if (delay_ticks > 40) { //Delay using ISB for all but the last ticks do { __asm __volatile("isb" ::: "memory"); } //Subtract and compare to handle counter roll-over while (counter_read() - start < delay_ticks - 40); } //Spin in empty loop for the last 1..40 ticks to increase accuracy do { } //Subtract and compare to handle counter roll-over while (counter_read() - start < delay_ticks); } } }
Compiling this code, for example using GCC, generates the following machine code:
0000000000000000 <nano_delay>: 0: d5033fdf isb 4: f100481f cmp x0, #0x12 8: 54000289 b.ls 58 <nano_delay+0x58> c: d1004800 sub x0, x0, #0x12 10: d53be001 mrs x1, cntfrq_el0 14: 8b401000 add x0, x0, x0, lsr #4 18: 9b017c00 mul x0, x0, x1 1c: d35efc00 lsr x0, x0, #30 20: b40001c0 cbz x0, 58 <nano_delay+0x58> 24: d53be042 mrs x2, cntvct_el0 28: f100a01f cmp x0, #0x28 2c: 540000e9 b.ls 48 <nano_delay+0x48> 30: d100a003 sub x3, x0, #0x28 34: d5033fdf isb 38: d53be041 mrs x1, cntvct_el0 3c: cb020021 sub x1, x1, x2 40: eb03003f cmp x1, x3 44: 54ffff83 b.cc 34 <nano_delay+0x34> 48: d53be041 mrs x1, cntvct_el0 4c: cb020021 sub x1, x1, x2 50: eb00003f cmp x1, x0 54: 54ffffa3 b.cc 48 <nano_delay+0x48> 58: d65f03c0 ret
Potential issues are that the initial ISB instruction causes some undefined delay (approximatively 30-40 cycles) and that the counter-timer frequency may be too low to provide proper accuracy. We try to compensate for this by subtracting some nanoseconds from the requested delay time and spin-wait in an empty loop at end of the waiting period. The above implementation has the following behaviours when running on Graviton 2/3/4.
Later versions of the Arm architecture include a number of new features which improve the situation and simplify the implementation. From Armv8.6-A, the counter-timer must have the resolution of 1GHz (but actual update rate could be lower due to counter scaling). When FEAT_SB (Armv8.5-A), FEAT_ECV (Armv8.6-A) and FEAT_WFxT (Armv8.7-A) are supported, the following implementation can instead be used:
//Self-synchronising read of CNTVCTSS_EL0, no need for any speculation barrier static inline unsigned long counter_read_ss(void) { unsigned long cnt; //Need __volatile below since CNTVCTSS_EL0 does not return a constant value __asm __volatile("mrs %0,cntvctss_el0" : "=r" (cnt)); //Requires FEAT_ECV (Armv8.6) return cnt; } void nano_delay_v87(unsigned long delay_ns) { unsigned long delay_ticks = delay_ns; //From Armv8.6, CNTFRQ_EL0 always return 1GHz (1000000000) if (delay_ticks != 0) { unsigned long start = counter_read_ss(); unsigned long end = start + delay_ticks; do { __asm __volatile("wfet %0" : : "r" (end) : "memory"); //Requires FEAT_WFxT (Armv8.7) //Spurious wakeups may occur } while (counter_read_ss() - start < delay_ticks);//Handle counter roll-over } //Prevent speculation of subsequent memory accesses __asm __volatile("sb" ::: "memory"); //Requires FEAT_SB (Armv8.5) }
The following machine code is generated:
0000000000000000 <nano_delay_v87>: 0: b4000100 cbz x0, 20 <nano_delay_v87+0x20> 4: d53be0c2 mrs x2, cntvctss_el0 8: 8b020003 add x3, x0, x2 c: d5031003 wfet x3 10: d53be0c1 mrs x1, cntvctss_el0 14: cb020021 sub x1, x1, x2 18: eb00003f cmp x1, x0 1c: 54ffff83 b.cc c <nano_delay_v87+0xc> 20: d50330ff sb 24: d65f03c0 ret
With these functions, high accuracy delay can be achieved. The actual resolution still depends on the actual update rate of the counter-timer.
Many different approaches to implementing delay and backoff functionality have been attempted with varying results. Some are listed below with explanations why they are not recommended. Existing code may use the x86 PAUSE instruction to create the delay used for backoff. When porting the code to Arm, many of the approaches below have been used, often with poor results.
Yes, firmware executing in highest exception level is required to initialise CNTFRQ_EL0 with the correct value.
"//From Armv8.6, CNTFRQ_EL0 always return 1GHz (1000000000)"
ARM DDI 0487:
"This register is provided so that software can discover the frequency of the system counter. It must be programmedwith this value as part of system initialization...
On a Warm reset, this field resets to an architecturally UNKNOWN value."