ARM GCC - Some calls to memcpy result in an exception

I've been using the ARM GCC release aarch64-none-elf-gcc-11.2.1 for some time in a large baremetal project. The project has successfully used libc functions (malloc/memcpy) many times without issue, linking with these options:

-L $AARCH64_GCC_PATH/aarch64-none-elf/lib -lc -lnosys -lg

I recently saw an exception due to an unaligned access during memcpy despite compiling with -mstrict-align.

After isolating the issue and creating a unit test, I believe I've found a bug. Please ignore the addresses in the objdump and the memcpy call; I made them up for this test.

The problem occurs when performing a memcpy on Device-type memory where size = 0x8 + 0x4*n, where n is any natural number.

An exception is raised even though care is taken to keep the src/dst pointers aligned. The instruction at 6009c in the objdump of the aarch64 memcpy below is ldur    x7, [x4, #-8]. For a copy of size 0xc, this performs an LDUR into a 64-bit x* register from an address that is only 32-bit aligned (ending in 0x4), which results in a Data Abort.
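To make the arithmetic concrete, here is a minimal sketch of where that trailing load lands. The helper names tail_load_addr and tail_load_is_aligned are hypothetical and only model the address computation (x4 = src + size, then a load from x4 - 8) in the 8 to 16 byte path. They are not part of newlib.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: address read by `ldur x7, [x4, #-8]`,
 * where x4 = src + size (see `add x4, x1, x2` at 60044). */
static uintptr_t tail_load_addr(uintptr_t src, size_t size)
{
    return src + size - 8;
}

/* True when that 64-bit load is naturally (8-byte) aligned. */
static bool tail_load_is_aligned(uintptr_t src, size_t size)
{
    return (tail_load_addr(src, size) & 7u) == 0;
}
```

With an 8-byte aligned source such as 0x1000 and size 0xc, the load address is 0x1004, which is 4-byte aligned but not 8-byte aligned. With size 0x8 or 0x10 the same load is aligned.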

//unit test
#include <stdlib.h>
#include <string.h>
volatile int bssTest;

void swap(int a, int b) {

0000000000060040 <memcpy>:
   60040:	f9800020 	prfm	pldl1keep, [x1]
   60044:	8b020024 	add	x4, x1, x2
   60048:	8b020005 	add	x5, x0, x2
   6004c:	f100405f 	cmp	x2, #0x10
   60050:	54000209 	b.ls	60090 <memcpy+0x50>  // b.plast
   60054:	f101805f 	cmp	x2, #0x60
   60058:	54000648 	b.hi	60120 <memcpy+0xe0>  // b.pmore
   6005c:	d1000449 	sub	x9, x2, #0x1
   60060:	a9401c26 	ldp	x6, x7, [x1]
   60064:	37300469 	tbnz	w9, #6, 600f0 <memcpy+0xb0>
   60068:	a97f348c 	ldp	x12, x13, [x4, #-16]
   6006c:	362800a9 	tbz	w9, #5, 60080 <memcpy+0x40>
   60070:	a9412428 	ldp	x8, x9, [x1, #16]
   60074:	a97e2c8a 	ldp	x10, x11, [x4, #-32]
   60078:	a9012408 	stp	x8, x9, [x0, #16]
   6007c:	a93e2caa 	stp	x10, x11, [x5, #-32]
   60080:	a9001c06 	stp	x6, x7, [x0]
   60084:	a93f34ac 	stp	x12, x13, [x5, #-16]
   60088:	d65f03c0 	ret
   6008c:	d503201f 	nop
   60090:	f100205f 	cmp	x2, #0x8
   60094:	540000e3 	b.cc	600b0 <memcpy+0x70>  // b.lo, b.ul, b.last
   60098:	f9400026 	ldr	x6, [x1]
   6009c:	f85f8087 	ldur	x7, [x4, #-8]
   600a0:	f9000006 	str	x6, [x0]
   600a4:	f81f80a7 	stur	x7, [x5, #-8]
   600a8:	d65f03c0 	ret
   600ac:	d503201f 	nop
   600b0:	361000c2 	tbz	w2, #2, 600c8 <memcpy+0x88>
   600b4:	b9400026 	ldr	w6, [x1]
   600b8:	b85fc087 	ldur	w7, [x4, #-4]
   600bc:	b9000006 	str	w6, [x0]
   600c0:	b81fc0a7 	stur	w7, [x5, #-4]
   600c4:	d65f03c0 	ret
   600c8:	b4000102 	cbz	x2, 600e8 <memcpy+0xa8>
   600cc:	d341fc49 	lsr	x9, x2, #1
   600d0:	39400026 	ldrb	w6, [x1]
   600d4:	385ff087 	ldurb	w7, [x4, #-1]
   600d8:	38696828 	ldrb	w8, [x1, x9]
   600dc:	39000006 	strb	w6, [x0]
   600e0:	38296808 	strb	w8, [x0, x9]
   600e4:	381ff0a7 	sturb	w7, [x5, #-1]
   600e8:	d65f03c0 	ret
   600ec:	d503201f 	nop
   600f0:	a9412428 	ldp	x8, x9, [x1, #16]
   600f4:	a9422c2a 	ldp	x10, x11, [x1, #32]
   600f8:	a943342c 	ldp	x12, x13, [x1, #48]
   600fc:	a97e0881 	ldp	x1, x2, [x4, #-32]
   60100:	a97f0c84 	ldp	x4, x3, [x4, #-16]
   60104:	a9001c06 	stp	x6, x7, [x0]
   60108:	a9012408 	stp	x8, x9, [x0, #16]
   6010c:	a9022c0a 	stp	x10, x11, [x0, #32]
   60110:	a903340c 	stp	x12, x13, [x0, #48]
   60114:	a93e08a1 	stp	x1, x2, [x5, #-32]
   60118:	a93f0ca4 	stp	x4, x3, [x5, #-16]
   6011c:	d65f03c0 	ret
   60120:	92400c09 	and	x9, x0, #0xf
   60124:	927cec03 	and	x3, x0, #0xfffffffffffffff0
   60128:	a940342c 	ldp	x12, x13, [x1]
   6012c:	cb090021 	sub	x1, x1, x9
   60130:	8b090042 	add	x2, x2, x9
   60134:	a9411c26 	ldp	x6, x7, [x1, #16]
   60138:	a900340c 	stp	x12, x13, [x0]
   6013c:	a9422428 	ldp	x8, x9, [x1, #32]
   60140:	a9432c2a 	ldp	x10, x11, [x1, #48]
   60144:	a9c4342c 	ldp	x12, x13, [x1, #64]!
   60148:	f1024042 	subs	x2, x2, #0x90
   6014c:	54000169 	b.ls	60178 <memcpy+0x138>  // b.plast
   60150:	a9011c66 	stp	x6, x7, [x3, #16]
   60154:	a9411c26 	ldp	x6, x7, [x1, #16]
   60158:	a9022468 	stp	x8, x9, [x3, #32]
   6015c:	a9422428 	ldp	x8, x9, [x1, #32]
   60160:	a9032c6a 	stp	x10, x11, [x3, #48]
   60164:	a9432c2a 	ldp	x10, x11, [x1, #48]
   60168:	a984346c 	stp	x12, x13, [x3, #64]!
   6016c:	a9c4342c 	ldp	x12, x13, [x1, #64]!
   60170:	f1010042 	subs	x2, x2, #0x40
   60174:	54fffee8 	b.hi	60150 <memcpy+0x110>  // b.pmore
   60178:	a97c0881 	ldp	x1, x2, [x4, #-64]
   6017c:	a9011c66 	stp	x6, x7, [x3, #16]
   60180:	a97d1c86 	ldp	x6, x7, [x4, #-48]
   60184:	a9022468 	stp	x8, x9, [x3, #32]
   60188:	a97e2488 	ldp	x8, x9, [x4, #-32]
   6018c:	a9032c6a 	stp	x10, x11, [x3, #48]
   60190:	a97f2c8a 	ldp	x10, x11, [x4, #-16]
   60194:	a904346c 	stp	x12, x13, [x3, #64]
   60198:	a93c08a1 	stp	x1, x2, [x5, #-64]
   6019c:	a93d1ca6 	stp	x6, x7, [x5, #-48]
   601a0:	a93e24a8 	stp	x8, x9, [x5, #-32]
   601a4:	a93f2caa 	stp	x10, x11, [x5, #-16]
   601a8:	d65f03c0 	ret
   601ac:	00000000 	udf	#0

While I understand that care must be taken when using stdlib functions in a baremetal application, the nature of our codebase makes it very difficult to ensure that every call to memcpy has a size that is 64-bit aligned. Shouldn't newlib/the compiler ensure that memcpy uses 32-bit w registers for any 32-bit aligned memcpy anyway, especially with -mstrict-align?

What are my options for an immediate fix in the meantime? I suppose I could try to override the definition of memcpy, but what source should I base the replacement implementation on in that case?
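For illustration, the kind of replacement I have in mind is a minimal sketch like the one below. The name device_memcpy is hypothetical (it is not a newlib function), and the sketch assumes that naturally aligned 32-bit and 8-bit accesses are acceptable for the memory in question. It copies 32-bit words only when both pointers are 4-byte aligned and falls back to bytes otherwise, so it never issues an access wider than the alignment it has checked. The volatile qualifiers are intended to keep the compiler from merging the accesses or turning the loops back into a memcpy call; I have not verified that on every optimization level.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical replacement: uses only naturally aligned accesses.
 * Assumption: word and byte accesses are valid for the target memory. */
void *device_memcpy(void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = (volatile uint8_t *)dst;
    const volatile uint8_t *s = (const volatile uint8_t *)src;

    /* Word copy only when both pointers are 4-byte aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3u) == 0) {
        volatile uint32_t *dw = (volatile uint32_t *)d;
        const volatile uint32_t *sw = (const volatile uint32_t *)s;

        for (; n >= 4; n -= 4) {
            *dw++ = *sw++;
        }
        d = (volatile uint8_t *)dw;
        s = (const volatile uint8_t *)sw;
    }

    /* Remaining bytes, or the whole copy if the pointers were unaligned. */
    for (; n > 0; n--) {
        *d++ = *s++;
    }
    return dst;
}
```

This trades speed for predictability: a 0xc byte copy between aligned buffers becomes three 32-bit loads and stores, with no 64-bit access at an address ending in 0x4. Calling it explicitly for Device memory, rather than overriding memcpy globally, would leave the optimized newlib routine in place for Normal memory.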

Any help on this is appreciated, thanks.