Cortex-M33 Exclusive Monitors: Behavior of LDREX/STREX on Internal SRAM vs. Peripherals for Atomic RMW

Hello everyone,

I am working on an ARM Cortex-M33 device and implementing a custom atomic Read-Modify-Write (RMW) bit manipulation routine, since the device has no native bit-banding support.

When compiling my code with optimization turned off (-O0), the compiler injects standard stack store (STR) instructions between __LDREXW and __STREXW. This causes STREX to continuously return 1 (fail), resulting in an infinite retry loop.

/* Atomic Bit Set using Exclusive Load/Store Primitives */
void bb_set_bit(volatile uint32_t *addr, uint32_t bit)
{
    uint32_t value;
    uint32_t status;

    do {
        // 1. Read the memory location and set an exclusive lock monitor
        value = __LDREXW(addr);

        // 2. Modify the bit locally
        value |= (1UL << bit);

        // 3. Attempt to store back conditionally.
        //    Returns 0 if successful, 1 if interrupted/failed.
        status = __STREXW(value, addr);
    } while (status != 0); // If status is 1, loop back and retry automatically
}
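For comparison, here is a minimal sketch of an alternative I am considering: letting the compiler emit the whole exclusive sequence itself through the GCC/Clang __atomic_fetch_or builtin. As far as I understand, on a Cortex-M33 target this is expanded inline into a single LDREX/STREX retry loop, so no compiler-generated stack stores should land between the two instructions, but I have not verified the generated code at -O0. The name bb_set_bit_builtin and the bit-range assertion are my own additions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sets one bit atomically using the GCC/Clang __atomic_fetch_or builtin.
 * Assumption: the toolchain lowers this to an LDREX/STREX loop on the target. */
static inline void bb_set_bit_builtin(volatile uint32_t *addr, uint32_t bit)
{
    assert(bit < 32u); /* shifting a 32-bit value by 32 or more is undefined */

    /* Relaxed ordering: only the atomicity of the OR itself is requested. */
    (void)__atomic_fetch_or(addr, UINT32_C(1) << bit, __ATOMIC_RELAXED);
}
```

Relaxed ordering is used here only because the original routine has no barriers either; a stronger memory order may be needed depending on how the word is shared.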


To better understand the underlying hardware behavior of the Cortex-M33 core, I would appreciate clarification on the following architectural rules:


- Local vs. Global Exclusive Monitors: How does the internal Cortex-M33 Local Exclusive Monitor handle intermediate standard data stores (STR) to the stack if they happen between an LDREX and STREX sequence targeting a completely different SRAM address? Does any standard write instruction automatically clear or invalidate the local exclusive state?


- Peripheral Register Access: Can LDREX/STREX primitives be reliably used to achieve atomic access on hardware peripheral register regions (e.g., GPIO or Timer configurations), or do these regions completely lack the necessary hardware monitor support, causing STREX to inherently fail?


- Best Practices for Atomic Peripheral Access: If the exclusive monitors are restricted to Normal/SRAM memory space, what is the architecturally recommended strategy on a Cortex-M33 core to implement safe, atomic bit-toggling on peripherals without globally disabling interrupts (via PRIMASK) or using heavy RTOS mutexes?
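Related to the second and third questions, this is a defensive sketch I could use in the meantime so that a region without monitor support shows up as a reported failure rather than a hang. It bounds the number of attempts using the GCC/Clang __atomic_compare_exchange_n builtin in its weak form. I am assuming the weak form maps to a single LDREX/STREX attempt on the Cortex-M33 and that an unsupported region makes STREX fail rather than fault; both assumptions are unverified. The helper name bb_set_bit_bounded and the max_tries parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: tries to set one bit, giving up after max_tries
 * attempts. Returns true if the update was stored, false otherwise. */
static inline bool bb_set_bit_bounded(volatile uint32_t *addr, uint32_t bit,
                                      uint32_t max_tries)
{
    assert(bit < 32u); /* shifting a 32-bit value by 32 or more is undefined */

    const uint32_t mask = UINT32_C(1) << bit;
    uint32_t expected = __atomic_load_n(addr, __ATOMIC_RELAXED);

    for (uint32_t i = 0; i < max_tries; ++i) {
        /* Weak compare-exchange: may fail spuriously. On failure the builtin
         * refreshes 'expected' with the value currently in memory. */
        if (__atomic_compare_exchange_n(addr, &expected, expected | mask,
                                        true, __ATOMIC_RELAXED,
                                        __ATOMIC_RELAXED)) {
            return true;
        }
    }
    return false; /* caller can log or fall back instead of spinning forever */
}
```

A call such as bb_set_bit_bounded(reg, 3u, 8u) returning false would then indicate that the exclusive store never succeeded within eight attempts.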


Could you please help me find a proper solution for this so that I can implement it in my project?