Hi everyone,
I am trying to understand why the cycle count does not increase linearly when I run the same instruction x times on the STM32F405RGT6.
While playing around with the STM32F4 to get a better understanding of how instructions map to cycle counts, I stumbled across this problem. From the documentation I know that a NOP and an ADD should each take one cycle. Therefore I would expect x NOPs to take x cycles, and the same for ADD. What I found is somewhat different.
To produce these results I wrote a quick script that creates 30 assembly functions. I can read the cycle count from memory address 0xE0001004 (DWT_CYCCNT) on the STM32F405RGT6. I can also load any values into r3 and r4 for the ADD operation and check the result in r5. In r6 I get the number of cycles my instructions took. I checked the final ELF file with objdump to verify that no instructions were removed, rearranged, or altered by the compiler.
# Replace XXXX by #instructions
.global asm_add_XXXX
.type asm_add_XXXX, %function
.align 2
asm_add_XXXX:
    push {r3-r12}

# Reading inputs to register
    ldr r3, [r0, #0]
    ldr r4, [r0, #4]

# r5, r6 = outputs register
    mov r5, #0
    mov r6, #0

# Cycle count address to register aka. DWT_CYCCNT
    ldr r9, =0xE0001004

.align 4
# Save current cycle count in r7
    ldr r7, [r9, #0]

######################################
### Start of assembly code         ###
######################################

# Insert add instruction x times
    add r5, r3, r4

######################################
### End of assembly code           ###
######################################

# Save current cycle count in r8
    ldr r8, [r9, #0]

# Calculate cycles in r6 = r8 - r7
    sub r6, r8, r7

# Write back output
    str r5, [r2, #0]
    str r6, [r2, #4]

    pop {r3-r12}
    bx lr

# Avoid literal pools due to fake ldr
.LTORG
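For anyone who wants to reproduce the setup, here is a minimal sketch of how such a generator script could look. It is not the original script: the names `TEMPLATE`, `INSTRUCTIONS`, `make_function` and `make_file` are made up for illustration, and the `asm_nop_N` naming is an assumption by analogy with `asm_add_XXXX`.

```python
# Hypothetical generator for the measurement functions. The instruction
# sequence mirrors the posted template; everything else is an assumption.
TEMPLATE = """\
.global asm_{op}_{n}
.type asm_{op}_{n}, %function
.align 2
asm_{op}_{n}:
    push {{r3-r12}}
    ldr r3, [r0, #0]
    ldr r4, [r0, #4]
    mov r5, #0
    mov r6, #0
    ldr r9, =0xE0001004
.align 4
    ldr r7, [r9, #0]
{body}
    ldr r8, [r9, #0]
    sub r6, r8, r7
    str r5, [r2, #0]
    str r6, [r2, #4]
    pop {{r3-r12}}
    bx lr
.LTORG
"""

# Instruction under test, keyed by the name used in the function label.
INSTRUCTIONS = {"add": "add r5, r3, r4", "nop": "nop"}


def make_function(op, n):
    """Return the assembly text for one function repeating `op` n times."""
    body = "\n".join("    " + INSTRUCTIONS[op] for _ in range(n))
    return TEMPLATE.format(op=op, n=n, body=body)


def make_file(op, max_n=30):
    """Return the text of one .s file with functions for n = 1..max_n."""
    return "\n".join(make_function(op, n) for n in range(1, max_n + 1))
```

Writing `make_file("add")` and `make_file("nop")` to two `.s` files would give the 30 variants per instruction; the objdump check is still needed afterwards.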
Can someone please explain why the cycle count is increasing non-linearly? Thanks in advance for any insights.
Are you sure you don't have a perhaps-hidden board_init() function or something similar that is upping the clock rate and turning on wait states? (You could check by reading FLASH_ACR...)
I checked FLASH_ACR using a JTAG debugger and found that there really is a hidden section in board_init() which sets the number of flash wait states to 5. That is a small victory.
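In case it helps others doing the same check, here is a small sketch for decoding a raw FLASH_ACR value read over the debugger. The bit positions are taken from my reading of the STM32F405 reference manual (LATENCY in bits 2:0, PRFTEN bit 8, ICEN bit 9, DCEN bit 10), so please verify them against the manual for your part. The function name and the example value are made up.

```python
def decode_flash_acr(value):
    """Split a raw FLASH_ACR word into the fields relevant for fetch timing.

    Bit positions are assumed from the STM32F405 reference manual.
    """
    return {
        "latency": value & 0x7,               # bits 2:0, flash wait states
        "prefetch": bool(value & (1 << 8)),   # PRFTEN
        "icache": bool(value & (1 << 9)),     # ICEN
        "dcache": bool(value & (1 << 10)),    # DCEN
    }


# Example with a hypothetical register value, not one read from my board.
fields = decode_flash_acr(0x00000705)
```

Besides the latency, the prefetch and cache enable bits are worth noting, since they also influence how instruction fetches from flash are timed.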
Sadly, I still cannot make sense of the plot. Instead of plotting the absolute cycle count, I plotted the number of extra cycles it took when adding one more instruction. In a perfect world, every NOP I add should take exactly 1 extra cycle.
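To be explicit about what "extra cycles" means here: it is the difference between consecutive absolute measurements. A minimal sketch, with a made-up helper name and made-up numbers (not my real data):

```python
def marginal_cycles(totals):
    """Given totals[i] = cycles measured with i+1 instructions,
    return the extra cycles caused by each additional instruction."""
    return [b - a for a, b in zip(totals, totals[1:])]


# Hypothetical measurements for 1..6 instructions, for illustration only.
example_totals = [1, 2, 3, 4, 10, 11]
example_deltas = marginal_cycles(example_totals)
```

With ideal one-cycle instructions every entry of the result would be 1; any other value marks an instruction that cost more (or less) than expected.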
As you can see, there is an increase in cycles every 8 NOP instructions (16-bit encoding) and every 4 ADD instructions (32-bit encoding). This makes sense, as the flash is read 128 bits at a time. With 5 wait states, I would therefore expect 6 cycles whenever a new flash line has to be loaded.
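My expectation can be written down as a very naive model. This is only a sketch of the reasoning above, not a description of what the hardware really does: it ignores prefetch, the instruction cache, and the pipeline. All names are made up, and `start_offset` (how many bytes into a 128-bit line the first measured instruction sits) is an assumption that depends on what follows the `.align 4` in the template.

```python
LINE_BYTES = 16  # one 128-bit flash line


def starts_new_line(k, size, start_offset=0):
    """True if the k-th measured instruction (0-based) begins exactly
    on a 16-byte line boundary."""
    return (start_offset + k * size) % LINE_BYTES == 0


def naive_marginal_cycles(n, size, wait_states=5, start_offset=0):
    """Predicted cost of each of n instructions of `size` bytes:
    1 cycle normally, wait_states + 1 when a new flash line begins."""
    return [
        wait_states + 1 if starts_new_line(k, size, start_offset) else 1
        for k in range(n)
    ]


# 16-bit NOPs fill a line after 8 instructions, 32-bit ADDs after 4.
nop_costs = naive_marginal_cycles(16, size=2)
add_costs = naive_marginal_cycles(8, size=4)
```

The model reproduces the period of 8 for NOP and 4 for ADD, but by construction it can never produce a 0-cycle instruction, so it cannot be the whole story.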
This is true for the ADD instruction, but not for the NOP instruction. Also, there are cases where an additional NOP takes 0 cycles, which should not be possible. With the explanations given so far, I can predict the cycle count for 4 or more consecutive ADD instructions. Any idea what causes the behaviour for fewer than 4 ADDs?