How to optimize an assembler copy function?

I have a C function that copies eight 32-bit words from a source pointer to a destination pointer:

static inline void PktProcWrite8( uint32_t* p_src,     // Source address of data
                                  uint32_t* p_dest )   // Destination address
{
#ifndef __cplusplus
    register   // Use 'register' hint to encourage C compiler to use STM instruction
#endif
    uint32_t r0, r1, r2, r3, r4, r5, r6, r7;

    // Read all eight words first...
    r0 = p_src[0];
    r1 = p_src[1];
    r2 = p_src[2];
    r3 = p_src[3];
    r4 = p_src[4];
    r5 = p_src[5];
    r6 = p_src[6];
    r7 = p_src[7];

    // ...then write them out in order
    p_dest[0] = r0;
    p_dest[1] = r1;
    p_dest[2] = r2;
    p_dest[3] = r3;
    p_dest[4] = r4;
    p_dest[5] = r5;
    p_dest[6] = r6;
    p_dest[7] = r7;
}

The generated assembly is:

	.cfi_sections .debug_frame
@ %bb.0:
	.save	{r4, r5, r6, lr}
	push	{r4, r5, r6, lr}
	.cfi_def_cfa_offset 16
	.cfi_offset lr, -4
	.cfi_offset r6, -8
	.cfi_offset r5, -12
	.cfi_offset r4, -16
	ldm.w	r0, {r2, r3, r12, lr}
	add.w	r6, r0, #16
	ldm	r6, {r4, r5, r6}
	ldr	r0, [r0, #28]
	stm.w	r1, {r2, r3, r12, lr}
	add.w	r2, r1, #16
	stm	r2!, {r4, r5, r6}
	str	r0, [r1, #28]
	pop	{r4, r5, r6, pc}

It is important for us to maximize the use of burst writes. The generated code does a burst write of 4 words (the first STM), followed by a burst of 3 words (the second STM), followed by a single-word write (the STR).

Is there any reason why we could not modify the assembler to use a single burst of 8 words or, less efficiently, two bursts of 4 words?

The target is Cortex-M4 and we are using armclang.
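
For illustration, this is roughly the two-burst version we have in mind. It is only a sketch, assuming the GNU-style inline assembly syntax that armclang accepts. The name PktProcWrite8Asm, the choice of r2-r5 as scratch registers, and the plain C fallback for non-Arm builds are our own assumptions for the example.

```c
#include <stdint.h>

// Hypothetical variant of PktProcWrite8 that forces two 4-word LDM/STM pairs.
static inline void PktProcWrite8Asm( uint32_t* p_src,     // Source address of data
                                     uint32_t* p_dest )   // Destination address
{
#if defined(__arm__)
    // r2-r5 are listed as clobbers, so the compiler keeps p_src and p_dest
    // out of them; the base register of a write-back LDM/STM must not
    // appear in its own register list.
    __asm volatile (
        "ldmia %[src]!, {r2, r3, r4, r5}\n\t"   // load words 0..3, advance src
        "stmia %[dst]!, {r2, r3, r4, r5}\n\t"   // store words 0..3, advance dst
        "ldmia %[src]!, {r2, r3, r4, r5}\n\t"   // load words 4..7
        "stmia %[dst]!, {r2, r3, r4, r5}\n\t"   // store words 4..7
        : [src] "+r" (p_src), [dst] "+r" (p_dest)
        :
        : "r2", "r3", "r4", "r5", "memory" );
#else
    // Fallback so the sketch also builds on a non-Arm host (assumption).
    for ( int i = 0; i < 8; i++ )
    {
        p_dest[i] = p_src[i];
    }
#endif
}
```

The write-back forms advance both pointers by 16 bytes, so the second LDMIA/STMIA pair picks up words 4 to 7. We have not confirmed that each STM actually appears as one burst on our bus.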