I have a C function that copies eight 32-bit words from a source pointer to a destination pointer:
    static inline void PktProcWrite8( uint32_t* p_src,    // Source address of data
                                      uint32_t* p_dest )  // Destination address
    {
    #ifndef __cplusplus
        register
    #endif
        uint32_t r0, r1, r2, r3, r4, r5, r6, r7;  // Use 'register' hint to encourage C compiler to use STM instruction

        {
            r0 = p_src[0];
            r1 = p_src[1];
            r2 = p_src[2];
            r3 = p_src[3];
            r4 = p_src[4];
            r5 = p_src[5];
            r6 = p_src[6];
            r7 = p_src[7];

            p_dest[0] = r0;
            p_dest[1] = r1;
            p_dest[2] = r2;
            p_dest[3] = r3;
            p_dest[4] = r4;
            p_dest[5] = r5;
            p_dest[6] = r6;
            p_dest[7] = r7;
        }
    }
The generated assembler is:
    PktProcWrite8_asm:
            .fnstart
            .cfi_sections .debug_frame
            .cfi_startproc
    @ %bb.0:
            .save   {r4, r5, r6, lr}
            push    {r4, r5, r6, lr}
            .cfi_def_cfa_offset 16
            .cfi_offset lr, -4
            .cfi_offset r6, -8
            .cfi_offset r5, -12
            .cfi_offset r4, -16
            ldm.w   r0, {r2, r3, r12, lr}
            add.w   r6, r0, #16
            ldm     r6, {r4, r5, r6}
            ldr     r0, [r0, #28]
            stm.w   r1, {r2, r3, r12, lr}
            add.w   r2, r1, #16
            stm     r2!, {r4, r5, r6}
            str     r0, [r1, #28]
            pop     {r4, r5, r6, pc}
    .Lfunc_end0:
It is important for us to maximize the use of burst writes. The above assembler does a burst write of 4 words, followed by a burst of 3 words, followed by a single word.
Is there any reason why we could not modify the assembler to use a single burst of 8 words or, less efficiently, two bursts of 4 words?
The target is Cortex-M4 and we are using armclang.
Hi David,
I'm not doing a great job answering this...
I wrote
> I forgot about the inline component... might be best to stay away from using the lr. In the below I've used r8 instead.
However, while it works in the above trivial case, it does not work in general. Because the function is inlined, the compiler may be keeping live values in registers at the point of the copy. Whatever values were in r2 and r3 before the copy function is called will be lost, which likely means that the enclosing foo() function will break.
Use the above with extreme caution.
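One way to address the r2/r3 problem is to tell the compiler exactly which registers the copy overwrites, so that it saves or avoids them itself. The following is only a minimal sketch, assuming the GNU-style extended inline assembly syntax that armclang accepts. The function name PktProcWrite8Burst and the choice of scratch registers are hypothetical (lr is avoided, as suggested above), and the portable loop in the #else branch is there only so the code also builds on a non-Arm host. It has not been verified against your build or bus fabric.

```c
#include <stdint.h>

/* Hypothetical variant of PktProcWrite8: one 8-word LDM followed by one
 * 8-word STM. Every scratch register is named in the clobber list, so the
 * compiler preserves any live values it holds there, even after inlining. */
static inline void PktProcWrite8Burst(const uint32_t *p_src, uint32_t *p_dest)
{
#if defined(__arm__) && defined(__thumb2__)
    __asm volatile (
        "ldm %[src], {r2, r3, r4, r5, r6, r8, r10, r12}\n\t"
        "stm %[dst], {r2, r3, r4, r5, r6, r8, r10, r12}\n\t"
        : /* no register outputs; the store is covered by "memory" */
        : [src] "r" (p_src), [dst] "r" (p_dest)
        : "r2", "r3", "r4", "r5", "r6", "r8", "r10", "r12", "memory"
    );
#else
    /* Portable fallback for non-Arm builds (assumption: host-side testing only). */
    for (int i = 0; i < 8; ++i) {
        p_dest[i] = p_src[i];
    }
#endif
}
```

Because the clobbered registers are declared, the compiler will not place p_src or p_dest in any of them, and it will spill and restore values around the asm block where needed. That costs some extra push/pop traffic in the caller, so it is worth checking the disassembly of the real call site before relying on it.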