Non-deterministic behaviour of memcpy using NEON registers

Hi All,

I am currently implementing a faster memcpy function that uses the NEON registers q0 and q1.

The memcpy function runs on an SoC with a Cortex-A53 core.

The old, slower memcpy function copies with natural 64-bit alignment (after first copying enough bytes at the start of the buffer to reach that alignment).
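For reference, the number of leading bytes needed to reach natural alignment can be computed as in the sketch below. The helper name bytes_to_alignment is hypothetical and is not taken from the actual memcpy; it only illustrates the idea, assuming the alignment is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of leading bytes to copy one at a time
 * so that p becomes aligned to `align` (a power of two, e.g. 8 for
 * 64-bit alignment). Returns 0 if p is already aligned. */
static size_t bytes_to_alignment(const void *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    return (size_t)((align - (addr & (align - 1))) & (align - 1));
}
```

With align = 8, an address ending in 0x3 needs 5 leading bytes and an already aligned address needs none.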

However, the AXI interconnect of the SoC can handle 128-bit accesses, so memcpy can be made faster.

Therefore, when the source address is 32-byte (256-bit) aligned, I augment the memcpy function with the following copy loop:

// copy (32byte) wise
// test for 32byte alignment
if ((psrc  & double_qword_mask)==0){
	// 256 bits aligned
	while (len>double_qword_mask){
		asm volatile ("ldp q0, q1, [%0]"::"r"(psrc));
		asm volatile ("stp q0, q1, [%0]"::"r"(pdst));
		pdst += double_qword_length;
		psrc += double_qword_length;
		len -= double_qword_length;
	}
}

The rest of the memcpy function stays the same.

This made the memcpy function about 70% faster.

However, this memcpy function sporadically makes errors when copying data.

I run checks after each memcpy to see how the copy errors occur, and the mismatch was always 16 or 32 bytes long, which points to the accesses through the NEON registers.
The problem is very sporadic and happens only when the memcpy function is compiled with optimisation level -O3 (highest optimisation, gcc compiler).
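The check after each copy can be sketched roughly as follows. This is only an illustration of the kind of comparison described above; the helper name first_mismatch and its signature are hypothetical, not the exact code in use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: find the first run of differing bytes.
 * Returns 1 and fills *offset and *length if src and dst differ,
 * returns 0 (leaving both untouched) if the buffers are identical. */
static int first_mismatch(const uint8_t *src, const uint8_t *dst, size_t len,
                          size_t *offset, size_t *length)
{
    size_t i = 0;
    while (i < len && src[i] == dst[i])
        i++;                        /* skip the matching prefix */
    if (i == len)
        return 0;                   /* no difference found */

    size_t j = i;
    while (j < len && src[j] != dst[j])
        j++;                        /* extend over the differing run */

    *offset = i;
    *length = j - i;
    return 1;
}
```

A reported length of 16 or 32 at an offset that is a multiple of 32 would correspond to one or both of q0 and q1 holding wrong data for a single loop iteration.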

When the memcpy function is compiled with optimisation level -O0 (no optimisation, gcc compiler), these errors do not occur, but the function also runs slower.
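Because the failure depends on the optimisation level, one thing that may be worth ruling out is the inline assembly itself: the two asm statements declare no clobbers, so gcc is not told that q0, q1, or memory are modified, and it is free to place code that uses those registers between the load and the store. Below is a minimal sketch of a variant to test, assuming this is the cause (which is not confirmed). The function name copy32_blocks and the pointer types are hypothetical, and the non-AArch64 branch exists only so that the sketch builds on other targets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy len bytes in 32-byte blocks.
 * Any tail shorter than 32 bytes is left to the caller. */
static void copy32_blocks(uint8_t *dst, const uint8_t *src, size_t len)
{
    while (len >= 32) {
#if defined(__aarch64__)
        /* Load and store in ONE asm statement, so the compiler cannot
         * schedule code that uses q0/q1 between them. The clobber list
         * tells gcc that v0, v1 and memory are modified. */
        __asm__ volatile(
            "ldp q0, q1, [%[s]]\n\t"
            "stp q0, q1, [%[d]]"
            :
            : [s] "r"(src), [d] "r"(dst)
            : "v0", "v1", "memory");
#else
        /* Fallback for non-AArch64 builds of this sketch. */
        memcpy(dst, src, 32);
#endif
        dst += 32;
        src += 32;
        len -= 32;
    }
}
```

If the errors disappear with this variant at -O3, that would point to the compiler reusing q0/q1 or reordering memory accesses around the original asm statements, rather than to the interconnect.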

The source and destination of memcpy are both in uncached RAM.

Do you have any idea where this problem comes from?

Kind Regards,

Rijad