This is essentially the same question I asked in CMSIS_5/issues/1393, where I did not get a satisfying answer.
My initial question in that issue:
From C++ ABI for the ARM architecture, ARM IHI 0041D:
3.2.3.1 Guard variables
To support the potential use of initialization guard variables as semaphores that are the target of ARM SWP and LDREX/STREX synchronizing instructions we define a static initialization guard variable to be a 4-byte aligned, 4-byte word with the following inline access protocol.
#define INITIALIZED 1
// inline guard test...
if ((obj_guard & INITIALIZED) != INITIALIZED) {
    // TST obj_guard, #1; BNE already_initialized
    if (__cxa_guard_acquire(&obj_guard)) {
        ...
    }
Usually, a guard variable should be allocated in the same data section as the object whose construction it guards.
3.2.3.2 One-time construction API
extern "C" int __cxa_guard_acquire(int *guard_object);
If the guarded object has not yet been initialized, this function returns 1. Otherwise it returns 0. If it returns 1, a semaphore might have been claimed and associated with guard_object, and either __cxa_guard_release or __cxa_guard_abort must be called with the same argument to release the semaphore.
extern "C" void __cxa_guard_release(int *guard_object);
This function is called on completing the initialization of the guarded object. It sets the least significant bit of guard_object (allowing subsequent inline checks to succeed) and releases any semaphore associated with it.
extern "C" void __cxa_guard_abort(int *guard_object);
This function is called if any part of the initialization of the guarded object terminates by throwing an exception. It releases any semaphore associated with guard_object.
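To make the call sequence in the quoted protocol concrete, here is a rough sketch of what a compiler might emit for a function-local static. It is an illustration only, not actual compiler output: the namespace, the Widget type and the guard_acquire/guard_release/guard_abort names are hypothetical, and the three functions are deliberately single-threaded stand-ins whose only purpose is to make the sequence runnable on a host.

```cpp
#include <cassert>
#include <cstdint>
#include <new>
#include <stdexcept>

namespace protocol_sketch {  // hypothetical names throughout

// Single-threaded stand-ins for the three ABI functions (NOT thread-safe).
inline int  guard_acquire(std::uint32_t* g) { return (*g & 1u) ? 0 : 1; }
inline void guard_release(std::uint32_t* g) { *g |= 1u; }  // sets bit 0
inline void guard_abort(std::uint32_t*)     {}             // nothing to release here

struct Widget {
    explicit Widget(bool fail) { if (fail) throw std::runtime_error("ctor failed"); }
    int value = 42;
};

// Roughly the expansion of:  static Widget w(fail); return w;
inline Widget& instance(bool fail) {
    static std::uint32_t obj_guard = 0;  // 4-byte, 4-byte aligned guard word
    alignas(Widget) static unsigned char storage[sizeof(Widget)];
    static Widget* obj = nullptr;

    if ((obj_guard & 1u) != 1u) {               // inline guard test
        if (guard_acquire(&obj_guard)) {
            try {
                obj = new (storage) Widget(fail);
            } catch (...) {
                guard_abort(&obj_guard);        // bit 0 stays clear, a later call retries
                throw;
            }
            guard_release(&obj_guard);          // later inline tests now succeed
        }
    }
    return *obj;
}

inline void self_check() {
    bool threw = false;
    try { instance(true); } catch (const std::runtime_error&) { threw = true; }
    assert(threw);                                // failed construction propagates
    assert(instance(false).value == 42);          // the next call constructs the object
    assert(&instance(true) == &instance(false));  // constructor is not run again
}

}  // namespace protocol_sketch
```

Only bit 0 of the guard word is read by the inline test; everything else about the word is left to the three functions.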
Is my interpretation correct that only one bit of the obj_guard variable is accessed at all by the code that provides it (the code that invokes the __cxa_guard_xxx functions), and that because the rest is unused, the obj_guard variable itself could be used for the semaphore implementation?
I certainly hope that is the case. An alternative implementation would have to "manually" allocate semaphore memory alongside every static variable, which would be quite cumbersome, or would have to use some kind of recursive mutex that handles the case when the OS is not started, and not everyone has that kind of luxury. But if my interpretation is correct, how come the following search on the CMSIS_5 repository gives no result in source code?
$ git grep __cxa_guard_acquire
I mean, if there is a possibly trivial implementation of the three functions based on the ABI documentation, why would ARM themselves not provide it? Or is it provided in some other repository? I made such an implementation myself, but seeing naive non-thread-safe implementations all over the Internet (which systematically break C++ static object creation semantics!) really makes me wonder whether we are missing a great opportunity to improve many embedded C++ applications with a small effort.
What am I missing?
Alain Mosnier said: just would like ARM to confirm what they meant in the ABI specification, and the best and simplest way for them to do that is to provide an implementation in CMSIS
I still do not understand why CMSIS should be the one to provide an implementation. And it is not as if there are no implementations in production. The open-source implementations in GCC and LLVM are available. Even with the closed-source Arm C++ compiler, the disassembly it generates can be examined to derive the implementation.
Alain Mosnier said: My own implementation is not in terms of LDREX/STREX, but in terms of GCC's atomic built-ins, which I am pretty sure are quite straightforwardly based on LDREX/STREX for ARM.
The ABI isn't forcing anyone to utilize or to not utilize ldrex/strex. But the ABI is indeed forcing the size and the alignment of the guard variable, to facilitate an option of using ldrex/strex on the guard variable itself if an implementation so chooses.
Arm's C++ ABI for the 32-bit architecture changes the alignment and the size of the guard variable so that the ldrex/strex and swp instructions can be used on the guard variable itself to implement a mutex. I believe the variable is kept 4 bytes in size and 4-byte aligned to allow the SWP instruction to work, since SWP can only operate on a 4-byte or a 1-byte variable (naturally aligned, in both cases). A 1-byte variable would be insufficient, since the Itanium C++ ABI itself reserves 1 byte for storing status information.
Since the Itanium C++ ABI reserves the first byte, any compliant implementation of the ABI has the freedom to implement the mutex within the 3 higher order bytes of the guard variable. One is not even forced to work with ldrexb/swpb - ldrex/swp on the full word can also be utilized instead, by taking care of testing and setting a value of 0x100, for instance, which tests and sets the LSB of the second byte of the guard variable.
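The full-word variant with 0x100 as the lock value can be sketched as follows. This is a minimal sketch under stated assumptions, not GCC's or Arm's implementation: the namespace and function names are hypothetical, it relies on the GCC/Clang __atomic built-ins (which typically lower to ldrex/strex on Arm targets that have them), it takes the guard as a 32-bit unsigned word rather than the ABI's int pointer, and a waiting thread simply spins.

```cpp
#include <cassert>
#include <cstdint>

namespace guard_sketch {  // hypothetical names, not part of any ABI or library

constexpr std::uint32_t kInitialized = 0x001;  // bit 0: status bit read by the inline test
constexpr std::uint32_t kLocked      = 0x100;  // LSB of the second byte: used here as the lock

// Returns 1 if the caller must run the initializer, 0 if the object is ready.
inline int acquire(std::uint32_t* guard) {
    std::uint32_t expected = 0;
    // Try to move the word from 0 (idle) to kLocked in one atomic step.
    while (!__atomic_compare_exchange_n(guard, &expected, kLocked, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) {
        if (expected & kInitialized) return 0;  // another thread finished the job
        expected = 0;                           // locked elsewhere: spin and retry
    }
    return 1;
}

// Publishes the object: sets bit 0 and drops the lock bit in a single store.
inline void release(std::uint32_t* guard) {
    __atomic_store_n(guard, kInitialized, __ATOMIC_RELEASE);
}

// Initialization threw: drop the lock and leave bit 0 clear so a later call retries.
inline void abort_init(std::uint32_t* guard) {
    __atomic_store_n(guard, 0u, __ATOMIC_RELEASE);
}

inline void self_check() {
    std::uint32_t guard = 0;
    assert(acquire(&guard) == 1);   // first caller wins the lock
    assert(guard == kLocked);
    abort_init(&guard);             // simulated constructor failure
    assert(guard == 0);
    assert(acquire(&guard) == 1);   // retry is allowed after an abort
    release(&guard);
    assert(guard == kInitialized);
    assert(acquire(&guard) == 0);   // already initialized
}

}  // namespace guard_sketch
```

The spin loop is the weak point: it only makes progress if the thread holding the lock can still get CPU time.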
An implementation of those functions can be found in the binary code generated by any compliant C++ compiler, or in the compiler's own source code. Here is GCC's implementation.
Below is from a sample compiled with gcc-arm-10.3-2021.07-x86_64-arm-none-linux-gnueabihf toolchain. And it does indeed utilize the LSB of the 2nd byte of the guard variable as a lock.
; r0 has the address of the guard variable
00010570 <__cxa_guard_acquire>:
. . .
10584: 4606       mov r6, r0
. . .
1058a: f44f 7380  mov.w r3, #256 ; 0x100
1058e: f3bf 8f5b  dmb ish
10592: e856 2f00  ldrex r2, [r6]
10596: 2a00       cmp r2, #0
10598: d103       bne.n 105a2 <__cxa_guard_acquire+0x32>
1059a: e846 3100  strex r1, r3, [r6]
. . .
; Call futex_time32 with operation FUTEX_WAIT
105f8: 462b       mov r3, r5
105fa: 2200       movs r2, #0; FUTEX_WAIT
105fc: 4631       mov r1, r6
105fe: 20f0       movs r0, #240 ; 0xf0
10600: 9400       str r4, [sp, #0]
10602: f019 fbb5  bl 29d70 <syscall>
. . .
The above usage complies with both the Arm C++ specification of reserving the LSB of the first byte for status, and also with the Itanium C++ ABI which reserves one full byte for status.
The question about whether only a single bit or a full byte is reserved for status is not clarified in the Arm32 C++ ABI, but is indeed clarified in the Arm64 C++ ABI. The difference does not matter though, since the Itanium C++ ABI reserves one full byte for status. GCC builds one of its mutexes in the LSB of the second byte of the guard variable, in both Arm32 and Arm64 compilations.
If anyone reads this, I can mention why I accepted the answer above. It is mostly because the linked GCC code made me think some more and realize a few things. In my (Cortex-M) situation, the compiler is not aware of the OS. But ARM is right when they say that one cannot make a reliable implementation of the __cxa_guard functions without OS support. Mutual exclusion is possible with an atomic compare-and-swap alone, but waiting is not: some threads may have to wait for object creation to finish while all threads compete for the CPU, so without some OS support a waiting thread can deadlock the system. Since an arm-none-eabi toolchain is typically not OS-aware, this has to be an issue for quite a few embedded C++ developers, unless of course one forbids static object creation once the OS is started, which might in fact be a good idea. Otherwise, the __cxa_guard functions have to involve the OS somehow.
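One way to picture "involving the OS" is a wait hook that the acquire path calls whenever it finds the guard locked. The sketch below is only an assumption about how such a port could look: wait_hook and all other names are hypothetical, the GCC/Clang __atomic built-ins are assumed, and the demo hook merely simulates the initializing thread finishing so that the example runs single-threaded on a host.

```cpp
#include <cassert>
#include <cstdint>

namespace os_guard_sketch {  // hypothetical names throughout

constexpr std::uint32_t kInitialized = 0x001;  // bit 0: status bit
constexpr std::uint32_t kLocked      = 0x100;  // lock bit in the second byte

// Hypothetical hook: an RTOS port would point this at a blocking wait or delay.
// It stays null before the scheduler is started.
static void (*wait_hook)() = nullptr;

// Returns 1 if the caller must initialize, 0 if the object is ready.
static int acquire(std::uint32_t* guard) {
    std::uint32_t expected = 0;
    while (!__atomic_compare_exchange_n(guard, &expected, kLocked, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) {
        if (expected & kInitialized) return 0;
        if (wait_hook) wait_hook();  // let the initializing thread run
        expected = 0;
    }
    return 1;
}

// Host-side demonstration: pretend another thread holds the lock and
// completes initialization while this thread is waiting in the hook.
static std::uint32_t demo_guard = kLocked;
static int demo_waits = 0;
static void demo_hook() {
    ++demo_waits;
    __atomic_store_n(&demo_guard, kInitialized, __ATOMIC_RELEASE);
}

static void self_check() {
    wait_hook = demo_hook;
    assert(acquire(&demo_guard) == 0);  // waited, then saw the object as ready
    assert(demo_waits == 1);
    wait_hook = nullptr;
}

}  // namespace os_guard_sketch
```

Note that a plain yield is typically not enough under a fixed-priority scheduler: if the waiting thread has a higher priority than the initializing one, yielding hands the CPU straight back to the waiter. A real hook would need to block on something the release and abort paths can signal, much as the GCC code above does with a futex on Linux.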