How to optimize an assembler copy function?

I have a C function that copies eight 32-bit words from a source pointer to a destination pointer:

static inline void PktProcWrite8( uint32_t* p_src,   // Source address of data
                                  uint32_t* p_dest ) // Destination address
{
#ifndef __cplusplus
    register
#endif
    uint32_t r0, r1, r2, r3, r4, r5, r6, r7; // Use 'register' hint to encourage C compiler to use STM instruction
    {
        r0 = p_src[0];
        r1 = p_src[1];
        r2 = p_src[2];
        r3 = p_src[3];
        r4 = p_src[4];
        r5 = p_src[5];
        r6 = p_src[6];
        r7 = p_src[7];
        p_dest[0] = r0;
        p_dest[1] = r1;
        p_dest[2] = r2;
        p_dest[3] = r3;
        p_dest[4] = r4;
        p_dest[5] = r5;
        p_dest[6] = r6;
        p_dest[7] = r7;
    }
}

The generated assembler is:

PktProcWrite8_asm:
        .fnstart
        .cfi_sections .debug_frame
        .cfi_startproc
@ %bb.0:
        .save   {r4, r5, r6, lr}
        push    {r4, r5, r6, lr}
        .cfi_def_cfa_offset 16
        .cfi_offset lr, -4
        .cfi_offset r6, -8
        .cfi_offset r5, -12
        .cfi_offset r4, -16
        ldm.w   r0, {r2, r3, r12, lr}
        add.w   r6, r0, #16
        ldm     r6, {r4, r5, r6}
        ldr     r0, [r0, #28]
        stm.w   r1, {r2, r3, r12, lr}
        add.w   r2, r1, #16
        stm     r2!, {r4, r5, r6}
        str     r0, [r1, #28]
        pop     {r4, r5, r6, pc}
        @ ...

It is important for us to maximize the use of burst writes. The generated code above does a burst write of 4 words, followed by a burst of 3 words, followed by a single-word store.

Is there any reason why we could not modify the assembler to use a single burst of 8 words or, less efficiently, two bursts of 4 words?

The target is Cortex-M4 and we are using armclang.
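
To make the question concrete, here is a rough sketch of the single 8-word variant I have in mind. It assumes GNU-style inline assembly, which armclang accepts. The name PktProcWrite8_burst and the register list are my own hypothetical choices, not something the compiler produced: r7, r9 and r11 are left out because some build configurations reserve them, and the clobber list makes the compiler save whatever it needs. The plain C loop is only there so the function still builds on a non-Arm host. I have not confirmed that the single STM actually shows up as one 8-beat burst on the bus; that depends on the memory system, and on Cortex-M4 an LDM or STM can be interrupted partway through and resumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-LDM / single-STM copy of eight 32-bit words.
 * Both pointers must be word aligned: LDM/STM do not support
 * unaligned addresses on ARMv7-M. */
static inline void PktProcWrite8_burst(const uint32_t *p_src, uint32_t *p_dest)
{
    assert(((uintptr_t)p_src & 3u) == 0u);
    assert(((uintptr_t)p_dest & 3u) == 0u);

#if defined(__arm__)
    /* Eight scratch registers in ascending order, so word n of the
     * source ends up in word n of the destination. The clobber list
     * keeps the compiler from placing either pointer in the list. */
    __asm volatile(
        "ldm %[src], {r2, r3, r4, r5, r6, r8, r12, lr}\n\t"
        "stm %[dst], {r2, r3, r4, r5, r6, r8, r12, lr}\n\t"
        :
        : [src] "r"(p_src), [dst] "r"(p_dest)
        : "r2", "r3", "r4", "r5", "r6", "r8", "r12", "lr", "memory");
#else
    /* Host fallback (assumption: used only for testing off-target). */
    for (int i = 0; i < 8; ++i) {
        p_dest[i] = p_src[i];
    }
#endif
}
```

The cost compared with the compiler's version is that more callee-saved registers (r4, r5, r6, r8) have to be pushed and popped around the copy, so this probably only pays off if the burst behaviour on the destination bus matters more than the extra stack traffic.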