After doing some research on LDREX and STREX, it appears that the exclusivity address range for these instructions on the M3, M4, and M7 is the entire memory space. Hence you can only use LDREX/STREX with one address at a time. Does this not limit you to one mutex (or at most 32, if you bitmap them)?
Thus it does not seem to be a very practical solution for an RTOS, or am I missing something?
Instead of spinning in the lock loop, a semaphore API can often ask the OS to context switch to other tasks when the lock cannot be acquired. The OS can put the waiting task in a wait queue and come back to it later, when another task releases the semaphore (this requires the semaphore release API to inform the OS kernel of the change).
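To make the idea concrete, here is a minimal sketch in portable C11 rather than a real RTOS implementation. It assumes the compiler lowers the compare-and-swap to an LDREX/STREX retry loop, which GCC and Clang typically do for Cortex-M3/M4/M7 targets. The names `sketch_mutex_t`, `sketch_lock`, and `os_yield_stub` are made up for illustration. A real kernel would block the task in a wait queue instead of calling a stub. Each lock is just its own word in memory, so the sketch is not limited to a single mutex.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = free, 1 = taken. One word per lock. */
typedef struct {
    atomic_int locked;
} sketch_mutex_t;

/* Hypothetical stand-in for an RTOS "yield / block me" call.
 * Here it only counts how often the caller gave up the CPU. */
static int yield_count = 0;
static void os_yield_stub(void)
{
    yield_count++;
}

/* Single attempt: atomically change 0 -> 1. On ARMv7-M this
 * compare-and-swap is assumed to compile to LDREX/STREX. */
static bool sketch_try_lock(sketch_mutex_t *m)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &m->locked, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Try up to max_attempts times, yielding to the (stubbed) OS after
 * every failed attempt instead of busy-waiting. */
static bool sketch_lock(sketch_mutex_t *m, int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        if (sketch_try_lock(m)) {
            return true;
        }
        os_yield_stub();
    }
    return false;
}

/* Release the lock. A real kernel would also wake a waiting task here. */
static void sketch_unlock(sketch_mutex_t *m)
{
    atomic_store_explicit(&m->locked, 0, memory_order_release);
}
```

With two `sketch_mutex_t` objects, both can be held at the same time. The monitor only has to cover the few instructions between one LDREX and its matching STREX, not the whole period for which a lock is held.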