No LDREX/STREX-based implementations of __cxa_guard_acquire/release/abort in ARM code?

This is basically the same question I asked in CMSIS_5/issues/1393, where I did not get a satisfying answer.

My initial question in that issue:

From C++ ABI for the ARM architecture, ARM IHI 0041D:

3.2.3.1 Guard variables
To support the potential use of initialization guard variables as semaphores that are the target of ARM SWP and
LDREX/STREX synchronizing instructions we define a static initialization guard variable to be a 4-byte aligned,
4-byte word with the following inline access protocol.

#define INITIALIZED 1

// inline guard test...
if ((obj_guard & INITIALIZED)!= INITIALIZED) {
    // TST obj_guard, #1; BNE already_initialized
    if (__cxa_guard_acquire(&obj_guard)) {
    ...
}

Usually, a guard variable should be allocated in the same data section as the object whose construction it guards.

3.2.3.2 One-time construction API

extern "C" int __cxa_guard_acquire(int *guard_object);

If the guarded object has not yet been initialized, this function returns 1. Otherwise it returns 0.
If it returns 1, a semaphore might have been claimed and associated with guard_object, and either
__cxa_guard_release or __cxa_guard_abort must be called with the same argument to release the semaphore.

extern "C" void __cxa_guard_release(int *guard_object);

This function is called on completing the initialization of the guarded object. It sets the least significant bit of
guard_object (allowing subsequent inline checks to succeed) and releases any semaphore associated with it.

extern "C" void __cxa_guard_abort(int *guard_object);

This function is called if any part of the initialization of the guarded object terminates by throwing an exception. It
releases any semaphore associated with guard_object.
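
To make the protocol quoted above concrete, here is a minimal sketch of what a compiler might conceptually emit for a function-local static. Everything in it is an assumption for illustration: `Widget`, `instance` and the `demo_guard_*` functions are hypothetical names, and the three stand-in functions are deliberately single-threaded placeholders so the example runs on its own. They are not a usable implementation of the `__cxa_guard_*` API.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// Single-threaded stand-ins (hypothetical), NOT thread-safe. A real runtime
// provides __cxa_guard_acquire/release/abort with the semantics quoted above.
static int  demo_guard_acquire(std::uint32_t* g) { return (*g & 1u) ? 0 : 1; }
static void demo_guard_release(std::uint32_t* g) { *g |= 1u; }  // set bit 0
static void demo_guard_abort(std::uint32_t*)     {}             // nothing to undo here

struct Widget {  // hypothetical guarded type
    int value;
    explicit Widget(int v) : value(v) {}
};

// Roughly what might be generated for:
//   Widget& instance() { static Widget w(42); return w; }
Widget& instance() {
    alignas(4) static std::uint32_t obj_guard = 0;                  // 4-byte aligned guard word
    alignas(Widget) static unsigned char storage[sizeof(Widget)];   // raw storage for w
    static Widget* obj = nullptr;

    if ((obj_guard & 1u) != 1u) {                 // inline guard test on bit 0
        if (demo_guard_acquire(&obj_guard)) {     // 1 => this caller must initialize
            try {
                obj = ::new (static_cast<void*>(storage)) Widget(42);
            } catch (...) {
                demo_guard_abort(&obj_guard);     // initialization failed, retry later
                throw;
            }
            demo_guard_release(&obj_guard);       // marks the object as initialized
        }
    }
    assert(obj != nullptr);
    return *obj;
}
```

The point of the sketch is only the call pattern: the inline test touches bit 0, and every successful acquire is paired with exactly one release or abort.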

Is my interpretation correct that only one bit of the obj_guard variable is ever accessed by the code that provides it (the code that invokes the __cxa_guard_xxx functions), and that, because the remaining bits are unused, the obj_guard variable itself could be used for the semaphore implementation?

I certainly hope so. The alternative would be to "manually" allocate semaphore memory alongside every static variable, which would be quite cumbersome, or to use some kind of recursive mutex that also handles the case where the OS has not been started yet, and not everyone has that luxury. If my interpretation is correct, how come the following search of the CMSIS_5 repository gives no result in source code?

$ git grep __cxa_guard_acquire

I mean, if a possibly trivial implementation of the three functions can be based on the ABI documentation, why would ARM themselves not provide it? Or is it provided in some other repository? I made such an implementation myself, but seeing naive non-thread-safe implementations all over the Internet (which systematically break C++ static object creation semantics!) really makes me wonder whether we are missing a great opportunity to improve many embedded C++ applications with little effort.

What am I missing?

  • just would like ARM to confirm what they meant in the ABI specification, and the best and simplest way for them to do that is to provide an implementation in CMSIS

    I still do not understand why CMSIS should be the one to provide an implementation. And it is not as if there are no implementations in production. The open-source implementations of GCC and LLVM are available. Even with the closed-source Arm C++ compiler, the disassembly it generates can be examined to derive the implementation.

    My own implementation is not in terms of LDREX/STREX, but in terms of GCC's atomic built-ins, which I am pretty sure are quite straightforwardly based on LDREX/STREX for ARM.

    The ABI isn't forcing anyone to utilize or to not utilize ldrex/strex. But the ABI is indeed forcing the size and the alignment of the guard variable, to facilitate an option of using ldrex/strex on the guard variable itself if an implementation so chooses.


    Arm's C++ ABI for the 32-bit architecture changes the alignment and the size of the guard variable so that ldrex/strex and swp instructions can be used with the same guard variable to implement a mutex. I believe that the reason the variable is kept 4 bytes in size and 4-byte aligned is to allow the SWP instruction to work, since SWP can only work with a 4-byte or a 1-byte variable (naturally aligned, in both cases). Keeping the variable 1 byte in size would be insufficient, since the Itanium C++ ABI itself reserves 1 byte for storing status information.

    Since the Itanium C++ ABI reserves the first byte, any compliant implementation of the ABI has the freedom to implement the mutex within the 3 higher order bytes of the guard variable. One is not even forced to work with ldrexb/swpb - ldrex/swp on the full word can also be utilized instead, by taking care of testing and setting a value of 0x100, for instance, which tests and sets the LSB of the second byte of the guard variable.
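
    As a rough illustration of the full-word approach described above, here is a minimal sketch that keeps the status in bit 0 and a lock in bit 8 (0x100) of the same guard word. It is not GCC's implementation: the `sketch::` names are hypothetical (a real runtime must export `__cxa_guard_acquire`, `__cxa_guard_release` and `__cxa_guard_abort` with C linkage), the guard is typed as `std::uint32_t` instead of the ABI's `int`, waiting is a plain busy-wait, and recursive initialization from the same thread is not detected. It relies on the GCC/Clang `__atomic` built-ins, which on ARMv7 are typically lowered to LDREX/STREX loops.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {  // hypothetical namespace, for illustration only

constexpr std::uint32_t kInitialized = 0x001;  // bit 0: status bit tested inline
constexpr std::uint32_t kLocked      = 0x100;  // LSB of the second byte: lock bit

// Returns 1 if the caller must run the initializer, 0 if it is already done.
inline int guard_acquire(std::uint32_t* guard) {
    for (;;) {
        std::uint32_t expected = 0;  // neither initialized nor locked
        if (__atomic_compare_exchange_n(guard, &expected, kLocked,
                                        /*weak=*/false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) {
            return 1;  // we took the lock; caller initializes the object
        }
        if (expected & kInitialized) {
            return 0;  // someone else finished the initialization
        }
        // Lock held by another thread: busy-wait and retry. A real port would
        // yield or block here instead of spinning.
    }
}

// Called after successful initialization: set bit 0 and drop the lock.
inline void guard_release(std::uint32_t* guard) {
    assert(__atomic_load_n(guard, __ATOMIC_RELAXED) == kLocked);
    __atomic_store_n(guard, kInitialized, __ATOMIC_RELEASE);
}

// Called when the initializer threw: drop the lock, leave bit 0 clear.
inline void guard_abort(std::uint32_t* guard) {
    assert(__atomic_load_n(guard, __ATOMIC_RELAXED) == kLocked);
    __atomic_store_n(guard, 0u, __ATOMIC_RELEASE);
}

}  // namespace sketch
```

    Note that spinning is only reasonable when the lock holder can make progress; on a single core with strict priority preemption a higher-priority waiter could spin forever, which is one reason production implementations block instead.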

    An implementation of those functions can be found in the binary code generated by any compliant C++ compiler, or in the compiler's own source code. Here is GCC's implementation.

    Below is from a sample compiled with gcc-arm-10.3-2021.07-x86_64-arm-none-linux-gnueabihf toolchain. And it does indeed utilize the LSB of the 2nd byte of the guard variable as a lock.

    ; r0 has the address of the guard variable
    00010570 <__cxa_guard_acquire>:
    . . .
       10584:	4606      	mov	r6, r0
    . . .
       1058a:	f44f 7380 	mov.w	r3, #256	; 0x100
       1058e:	f3bf 8f5b 	dmb	ish
       10592:	e856 2f00 	ldrex	r2, [r6]
       10596:	2a00      	cmp	r2, #0
       10598:	d103      	bne.n	105a2 <__cxa_guard_acquire+0x32>
       1059a:	e846 3100 	strex	r1, r3, [r6]
    . . .
    ; Call futex_time32 with operation FUTEX_WAIT
       105f8:	462b      	mov	r3, r5
       105fa:	2200      	movs	r2, #0	; FUTEX_WAIT
       105fc:	4631      	mov	r1, r6
       105fe:	20f0      	movs	r0, #240	; 0xf0
       10600:	9400      	str	r4, [sp, #0]
       10602:	f019 fbb5 	bl	29d70 <syscall>
    . . .

    The above usage complies with both the Arm C++ specification of reserving the LSB of the first byte for status, and also with the Itanium C++ ABI which reserves one full byte for status.

    The question about whether only a single bit or a full byte is reserved for status is not clarified in the Arm32 C++ ABI, but is indeed clarified in the Arm64 C++ ABI. The difference does not matter though, since the Itanium C++ ABI reserves one full byte for status. GCC builds one of its mutexes in the LSB of the second byte of the guard variable, in both Arm32 and Arm64 compilations.
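
    For readers unfamiliar with the futex call in the listing above, here is a minimal, Linux-only sketch of what such a wait looks like from C++. The helper name `futex_wait` is hypothetical, and passing a null timeout is my assumption for the example, not something read from the disassembly. Syscall number 240 (0xf0) in the listing is the futex syscall on 32-bit Arm Linux; the sketch uses `SYS_futex` so that it also builds on other architectures.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <linux/futex.h>   // FUTEX_WAIT
#include <sys/syscall.h>   // SYS_futex
#include <unistd.h>        // syscall()

// Hypothetical helper: sleep while *addr still equals `expected`.
// Returns 0 when woken by a FUTEX_WAKE on the same address, or -1 with errno
// set (EAGAIN if *addr no longer equals `expected` at the time of the call).
inline long futex_wait(std::uint32_t* addr, std::uint32_t expected) {
    assert(addr != nullptr);
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected,
                   /*timeout=*/nullptr);  // assumption: wait without a timeout
}
```

    The kernel compares the word at the address with the expected value before sleeping, so a waiter that lost the race with the releasing thread returns immediately instead of missing the wake-up.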
