No LDREX/STREX-based implementations of __cxa_guard_acquire/release/abort in ARM code?

This is essentially the question I asked in CMSIS_5/issues/1393, where I did not get a satisfying answer.

My initial question in that issue:

From C++ ABI for the ARM architecture, ARM IHI 0041D:

3.2.3.1 Guard variables
To support the potential use of initialization guard variables as semaphores that are the target of ARM SWP and LDREX/STREX synchronizing instructions we define a static initialization guard variable to be a 4-byte aligned, 4-byte word with the following inline access protocol.

#define INITIALIZED 1

// inline guard test...
if ((obj_guard & INITIALIZED) != INITIALIZED) {
    // TST obj_guard, #1; BNE already_initialized
    if (__cxa_guard_acquire(&obj_guard)) {
        ...
    }
}

Usually, a guard variable should be allocated in the same data section as the object whose construction it guards.

3.2.3.2 One-time construction API

extern "C" int __cxa_guard_acquire(int *guard_object);

If the guarded object has not yet been initialized, this function returns 1. Otherwise it returns 0.
If it returns 1, a semaphore might have been claimed and associated with guard_object, and either
__cxa_guard_release or __cxa_guard_abort must be called with the same argument to release the semaphore.

extern "C" void __cxa_guard_release(int *guard_object);

This function is called on completing the initialization of the guarded object. It sets the least significant bit of
guard_object (allowing subsequent inline checks to succeed) and releases any semaphore associated with it.

extern "C" void __cxa_guard_abort(int *guard_object);

This function is called if any part of the initialization of the guarded object terminates by throwing an exception. It
releases any semaphore associated with guard_object.
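
Put together, the quoted protocol suggests that a function-local static expands, conceptually, to something like the sketch below. This is not actual compiler output: `Widget`, `get_widget`, and the `demo_guard_*` functions are hypothetical names, and the three `demo_guard_*` functions are deliberately single-threaded stand-ins for the ABI functions so that the example is self-contained.

```cpp
#include <cassert>
#include <new>
#include <stdexcept>

// Hypothetical single-threaded stand-ins for the three ABI functions.
static int  demo_guard_acquire(int* g) { return (*g & 1) ? 0 : 1; }
static void demo_guard_release(int* g) { *g |= 1; }   // sets the LSB
static void demo_guard_abort(int*)     {}             // nothing to release here

// Hypothetical guarded type whose constructor can fail on demand.
struct Widget {
    int value;
    explicit Widget(bool fail) : value(42) {
        if (fail) throw std::runtime_error("constructor failed");
    }
};

static int widget_guard = 0;                          // the 4-byte guard word
alignas(Widget) static unsigned char widget_storage[sizeof(Widget)];
static Widget* widget_ptr = nullptr;
static int construct_count = 0;

// Roughly what `static Widget w(fail); return w;` turns into.
Widget& get_widget(bool fail) {
    if ((widget_guard & 1) != 1) {                    // inline guard test
        if (demo_guard_acquire(&widget_guard)) {
            try {
                widget_ptr = new (widget_storage) Widget(fail);
                ++construct_count;
            } catch (...) {
                demo_guard_abort(&widget_guard);      // initialization threw
                throw;
            }
            demo_guard_release(&widget_guard);        // initialization done
        }
    }
    assert(widget_ptr != nullptr);
    return *widget_ptr;
}
```

If the constructor throws, the guard stays clear, so the next call attempts the initialization again; once it succeeds, only the inline test on bit 0 runs.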

Is my interpretation correct that the compiler-generated code (the code that calls the __cxa_guard_xxx functions) only ever accesses the least significant bit of obj_guard, and that, since the remaining bits are unused, the obj_guard variable itself can be used for the semaphore implementation?

I certainly hope that is the case. Otherwise an implementation would have to "manually" allocate semaphore memory alongside every static variable, which would be quite cumbersome, or use some kind of recursive mutex that also handles the case where the OS has not started yet, and not everyone has that kind of luxury. But if it is the case, how come the following search of the CMSIS_5 repository returns no results in source code?

$ git grep __cxa_guard_acquire

I mean, if a possibly trivial implementation of the three functions can be based on the ABI documentation, why would ARM not provide it themselves? Or is it provided in some other repository? I wrote such an implementation myself, but seeing naive, non-thread-safe implementations all over the Internet (which systematically break the semantics of C++ static object initialization!) makes me wonder whether we are missing a great opportunity to improve many embedded C++ applications with little effort.
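
To make the idea concrete, here is a minimal sketch of what such an implementation could look like, assuming the upper bits of the guard word really are free for the runtime to use. It is not taken from any ARM or CMSIS source. The `*_sketch` names, the choice of bit 1 as an "in progress" marker, and the empty wait hook are all hypothetical. It uses the GCC/Clang `__atomic` builtins rather than hand-written assembly; on cores that have exclusive-access instructions, the compare-exchange should be lowered to an LDREX/STREX loop.

```cpp
#include <cassert>

// Hypothetical encoding of the 4-byte guard word. Bit 0 is the flag the ABI
// defines; using bit 1 as an "in progress" marker is this sketch's own choice.
constexpr int kInitialized = 1;
constexpr int kInProgress  = 2;

// Hypothetical wait hook: a real port might yield to the scheduler or execute
// WFE here. It is empty so that the sketch builds on any host.
static inline void guard_wait_sketch() {}

// Sketch of __cxa_guard_acquire: returns 1 if the caller must run the
// initializer, 0 if the object is already initialized.
int guard_acquire_sketch(int* guard) {
    for (;;) {
        int seen = __atomic_load_n(guard, __ATOMIC_ACQUIRE);
        if (seen & kInitialized) {
            return 0;                     // someone finished the initialization
        }
        if (seen == 0) {
            int expected = 0;
            // Atomically claim the guard: 0 -> in progress.
            if (__atomic_compare_exchange_n(guard, &expected, kInProgress,
                                            false, __ATOMIC_ACQ_REL,
                                            __ATOMIC_ACQUIRE)) {
                return 1;                 // this caller owns the initialization
            }
            continue;                     // lost the race, re-read the guard
        }
        guard_wait_sketch();              // another context is initializing
    }
}

// Sketch of __cxa_guard_release: publish "initialized" and drop the claim.
void guard_release_sketch(int* guard) {
    assert(__atomic_load_n(guard, __ATOMIC_RELAXED) == kInProgress);
    __atomic_store_n(guard, kInitialized, __ATOMIC_RELEASE);
}

// Sketch of __cxa_guard_abort: drop the claim so a later call can retry.
void guard_abort_sketch(int* guard) {
    __atomic_store_n(guard, 0, __ATOMIC_RELEASE);
}
```

Note that a plain spin like this can deadlock if the waiter runs at a higher priority than the context doing the initialization (for example an ISR preempting a thread), so the wait hook is where a real port would need the most care.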

What am I missing?