Why does the cycle count not increase linearly when the same instruction is repeated on the STM32F4?

Hi everyone,

I am trying to understand why the cycle count does not increase linearly when I run the same instruction x times on the STM32F405RGT6.

While playing around with the STM32F4 to get a better understanding of how instructions map to cycle counts, I stumbled across this problem. From the documentation I know that a NOP and an ADD should each take one cycle. Therefore I would expect x NOPs to take x cycles, and likewise for ADD. What I found is somewhat different.

To produce these results, I wrote a quick script that generates 30 assembly functions from the template below. The cycle count is read from the cycle counter (DWT_CYCCNT) at memory address 0xE0001004 on the STM32F405RGT6. I can load arbitrary values into r3 and r4 for the ADD operation and check the result in r5; r6 holds the number of cycles my instructions took. I checked the final ELF file with objdump to verify that no instructions were removed, rearranged, or altered by the compiler.

# Replace XXXX with the number of instructions
.global asm_add_XXXX
.type asm_add_XXXX, %function
.align 2
asm_add_XXXX:
push {r3-r12}

# Read the inputs into registers
ldr r3, [r0, #0]
ldr r4, [r0, #4]
# r5, r6 = output registers (result, cycle count)
mov r5, #0
mov r6, #0

# Load the address of the cycle counter (DWT_CYCCNT) into r9
ldr r9, =0xE0001004
.align 4
# Save current cycle count in r7
ldr r7, [r9, #0]

######################################
### Start of assembly code ###
######################################

# Insert add instruction x times
add r5, r3, r4

######################################
### End of assembly code #####
######################################

# Save current cycle count in r8
ldr r8, [r9, #0]
# Calculate cycles in r6 = r8 - r7
sub r6, r8, r7

# Write back output
str r5, [r2, #0]
str r6, [r2, #4]

pop {r3-r12}
bx lr
# Emit the literal pool here, after the return, for the ldr r9, =... pseudo-instruction
.LTORG
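
The generator script itself is not shown here. A minimal Python sketch of how such a script could look follows. The names `TEMPLATE`, `INSTRUCTION`, `make_function`, and `make_all` are hypothetical, and the template is abbreviated to the parts that change per function; the prologue, the two cycle counter reads, and the epilogue are assumed to be the same as in the listing above.

```python
# Hypothetical generator sketch: emits one assembly function per repeat count.
# TEMPLATE is an abbreviated stand-in for the full listing; only the function
# name and the repeated instruction body are filled in.
TEMPLATE = """\
.global asm_add_{n}
.type asm_add_{n}, %function
.align 2
asm_add_{n}:
# ... prologue and first cycle counter read, as in the full listing ...
{body}
# ... second cycle counter read, write-back and epilogue, as in the full listing ...
"""

# The instruction under test (assumed to be the ADD from the listing).
INSTRUCTION = "add r5, r3, r4"


def make_function(n: int, instruction: str = INSTRUCTION) -> str:
    """Return assembly text for a function that repeats `instruction` n times."""
    body = "\n".join([instruction] * n)
    return TEMPLATE.format(n=n, body=body)


def make_all(max_n: int = 30) -> str:
    """Concatenate the functions for n = 1 .. max_n into one source string."""
    return "\n".join(make_function(n) for n in range(1, max_n + 1))
```

Calling `make_all()` returns the text for 30 functions, `asm_add_1` through `asm_add_30`, which can be written to a `.s` file and assembled together with the rest of the project.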

Can someone please explain why the cycle count is increasing non-linearly? Thanks in advance for any insights.