How to optimize an assembler copy function?

I have a C function that copies eight 32-bit words from a source pointer to a destination pointer:

static inline void PktProcWrite8( uint32_t* p_src,     // Source address of data
                                  uint32_t* p_dest )   // Destination address 
{
#ifndef __cplusplus    
    register 
#endif    
    uint32_t r0, r1, r2, r3, r4, r5, r6, r7;   // Use 'register' hint to encourage C compiler to use STM instruction
    {
        r0 = p_src[0];
        r1 = p_src[1];
        r2 = p_src[2];
        r3 = p_src[3];
        r4 = p_src[4];
        r5 = p_src[5];
        r6 = p_src[6];
        r7 = p_src[7];

        p_dest[0] = r0;
        p_dest[1] = r1;
        p_dest[2] = r2;
        p_dest[3] = r3;
        p_dest[4] = r4;
        p_dest[5] = r5;
        p_dest[6] = r6;
        p_dest[7] = r7;
    }
}

The generated assembler is:

PktProcWrite8_asm:
	.fnstart
	.cfi_sections .debug_frame
	.cfi_startproc
@ %bb.0:
	.save	{r4, r5, r6, lr}
	push	{r4, r5, r6, lr}
	.cfi_def_cfa_offset 16
	.cfi_offset lr, -4
	.cfi_offset r6, -8
	.cfi_offset r5, -12
	.cfi_offset r4, -16
	ldm.w	r0, {r2, r3, r12, lr}
	add.w	r6, r0, #16
	ldm	r6, {r4, r5, r6}
	ldr	r0, [r0, #28]
	stm.w	r1, {r2, r3, r12, lr}
	add.w	r2, r1, #16
	stm	r2!, {r4, r5, r6}
	str	r0, [r1, #28]
	pop	{r4, r5, r6, pc}
.Lfunc_end0:

It is important for us to maximize the use of burst writes. The generated code above does a burst write of 4 words (stm.w), followed by a burst of 3 words (stm), followed by a single-word store (str).

Is there any reason why we could not modify the assembler to use a single burst of 8 words or, less efficiently, two bursts of 4 words?

The target is Cortex-M4 and we are using armclang.

  • Hi David,

    I redacted my original reply - my code had a silly error in it.

    However, the root cause is the same: the AAPCS specifies that r0-r3, r12 (and lr) are the only registers a function may corrupt without preserving them. With r0 and r1 holding your pointers, that leaves only r2, r3 and r12, plus lr once it has been saved on the stack.

    The following is smaller, but I am not sure that it is any faster, because the push and pop add extra memory accesses of their own.

            0x00000000:    b5f0        ..      PUSH     {r4-r7,lr}
            0x00000002:    e89050fc    ...P    LDM      r0,{r2-r7,r12,lr}
            0x00000006:    e88150fc    ...P    STM      r1,{r2-r7,r12,lr}
            0x0000000a:    e8bd40f0    ...@    POP      {r4-r7,lr}

    Ronan
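
For illustration only, here is a minimal sketch of how the 8-word LDM/STM pair above could be written from C using GNU-style inline assembly, which armclang accepts. The function name copy8_burst, the choice of scratch registers and the __thumb2__ guard are assumptions, not something from this thread; r7 is left out of the list because some compilers reserve it as the Thumb frame pointer. On any other target the function falls back to a plain loop so the code still builds and runs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): copy eight 32-bit words. */
static inline void copy8_burst(const uint32_t *src, uint32_t *dst)
{
    /* LDM/STM require word-aligned addresses. */
    assert(((uintptr_t)src & 3u) == 0 && ((uintptr_t)dst & 3u) == 0);

#if defined(__arm__) && defined(__thumb2__)
    /* One 8-register load followed by one 8-register store.
     * Registers are transferred in ascending order, so word order is kept.
     * The clobber list tells the compiler which registers are overwritten,
     * so it saves and restores any of them that hold live values. */
    __asm__ volatile (
        "ldm %[s], {r2-r6, r8, r12, lr}\n\t"
        "stm %[d], {r2-r6, r8, r12, lr}\n\t"
        :
        : [s] "r" (src), [d] "r" (dst)
        : "r2", "r3", "r4", "r5", "r6", "r8", "r12", "lr", "memory");
#else
    /* Portable fallback for non-Thumb-2 builds (for example host testing). */
    for (int i = 0; i < 8; i++) {
        dst[i] = src[i];
    }
#endif
}
```

Naming the registers in the clobber list does not remove the save/restore cost mentioned above; it only lets the compiler decide where to pay it once the function is inlined. Whether the 8-word STM reaches memory as one uninterrupted burst may also depend on interrupts and the bus, so it would need measuring on the actual hardware.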
