Hi everyone,
I am trying to understand why the cycle count is not linearly increasing if I run the same instruction x times on the STM32F405RGT6.
When playing around with the STM32F4 to get a better understanding of how instructions map to cycle counts, I stumbled across this problem. From the documentation I know that a NOP and an ADD should each take one cycle. Therefore I would expect x NOPs to take x cycles, and the same for ADD. What I found is somewhat different.
To produce these results I wrote a quick script that creates 30 assembly code functions. I can read the cycle count from the memory address 0xE0001004 on the STM32F405RGT6. I can also input any values into r3 and r4 for the ADD operation and check the result in r5. In r6 I get the number of cycles my instructions took. I checked the final elf file with objdump to verify that no operations were removed/rearranged/altered by the compiler.
# Replace XXXX by #instructions
.global asm_add_XXXX
.type asm_add_XXXX, %function
.align 2
asm_add_XXXX:
    push {r3-r12}
# Reading inputs to register
    ldr r3, [r0, #0]
    ldr r4, [r0, #4]
# r5, r6 = outputs register
    mov r5, #0
    mov r6, #0
# Cycle count address to register aka. DWT_CYCCNT
    ldr r9, =0xE0001004
.align 4
# Save current cycle count in r7
    ldr r7, [r9, #0]
######################################
### Start of assembly code         ###
######################################
# Insert add instruction x times
    add r5, r3, r4
######################################
### End of assembly code           ###
######################################
# Save current cycle count in r8
    ldr r8, [r9, #0]
# Calculate cycles in r6 = r8 - r7
    sub r6, r8, r7
# Write back output
    str r5, [r2, #0]
    str r6, [r2, #4]
    pop {r3-r12}
    bx lr
# Avoid literal pools due to fake ldr
.LTORG
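The generator script itself is not shown in the post. A minimal sketch of how such a script might look, assuming Python and a trimmed copy of the template above (comments dropped); the names TEMPLATE, make_function and make_all are hypothetical:

```python
# Hypothetical generator, not the original script from the post.
# The template is a comment-free copy of the listing above; doubled
# braces are literal braces for str.format().
TEMPLATE = """\
.global asm_{name}_{n}
.type asm_{name}_{n}, %function
.align 2
asm_{name}_{n}:
    push {{r3-r12}}
    ldr r3, [r0, #0]
    ldr r4, [r0, #4]
    mov r5, #0
    mov r6, #0
    ldr r9, =0xE0001004
.align 4
    ldr r7, [r9, #0]
{body}
    ldr r8, [r9, #0]
    sub r6, r8, r7
    str r5, [r2, #0]
    str r6, [r2, #4]
    pop {{r3-r12}}
    bx lr
.LTORG
"""


def make_function(name: str, instruction: str, n: int) -> str:
    """Return one function whose measured region repeats `instruction` n times."""
    body = "\n".join(f"    {instruction}" for _ in range(n))
    return TEMPLATE.format(name=name, n=n, body=body)


def make_all(name: str, instruction: str, max_n: int = 30) -> str:
    """Concatenate the functions for 1..max_n repetitions into one source text."""
    return "\n".join(make_function(name, instruction, n) for n in range(1, max_n + 1))


if __name__ == "__main__":
    print(make_function("add", "add r5, r3, r4", 3))
```

Calling make_all("nop", "nop") would give the NOP variants in the same way.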
Can someone please explain why the cycle count is increasing non-linearly? Thanks in advance for any insights.
According to the STM32F4 documentation, FLASH_ACR is set to 0 wait states after reset. Therefore a flash memory read should take 1 CPU cycle.
If I understand the documentation correctly, each flash memory read fetches 128 bits at a time, meaning 8 NOPs or 4 ADDs, which seems to correspond to the behaviour in my plots. The only difference is that the jumps in my plots, which look like flash memory reads, are 4 cycles for the NOP and 6 cycles for the ADD. There must be more to it than just the flash memory read, because that should only take 1 cycle.
Is there a way I can circumvent the trouble with the flash memory? You mentioned TCM previously. Is that a good alternative to have a more predictable behaviour or does that come with other costs?
I checked, and this chip has CCM, which seems to be something like TCM. Since it is limited in size, you need to find out the bottleneck code.
Are you sure you don't have a perhaps-hidden board_init() function or something that is upping the clock rate and turning on wait states? (You could check by READING FLASH_ACR...)
I checked the FLASH_ACR using a JTAG debugger and found out that there really is a hidden section in the board_init() which changes the WAIT_CYCLES to 5. That is a small victory.
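For anyone reading the register back through a debugger, a small helper to decode the raw value may be handy. This is only a sketch: the bit positions are as I recall them from the STM32F405 reference manual (LATENCY in bits 2:0, which is the field called WAIT_CYCLES above, PRFTEN in bit 8, ICEN in bit 9, DCEN in bit 10) and should be checked against the manual. The function name and the example value 0x705 are made up, not a reading from the board.

```python
# Assumed register layout for FLASH_ACR on the STM32F405; verify against
# the reference manual before relying on it.
FLASH_ACR_ADDR = 0x40023C00  # address to read with the debugger


def decode_flash_acr(value: int) -> dict:
    """Split a raw FLASH_ACR value into the fields relevant for fetch timing."""
    return {
        "latency": value & 0x7,                     # wait states, bits 2:0
        "prefetch_enabled": bool(value & (1 << 8)),  # PRFTEN
        "icache_enabled": bool(value & (1 << 9)),    # ICEN
        "dcache_enabled": bool(value & (1 << 10)),   # DCEN
    }


if __name__ == "__main__":
    # 0x705 is an illustrative value: 5 wait states with prefetch and both caches on.
    print(decode_flash_acr(0x705))
```

The prefetch and cache bits are worth noting next to the wait states, since they also influence how long an instruction fetch from flash takes.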
Sadly, I still cannot make sense of the plot. Instead of plotting the absolute cycle count, I plotted the number of extra cycles each additional instruction took. In a perfect world, every NOP I add should take exactly 1 extra cycle.
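The "extra cycles per added instruction" view is just the difference between consecutive measurements. A minimal sketch, assuming the totals are stored in a list ordered by instruction count; the function name is hypothetical and the numbers in the example are made up for illustration, not measured values:

```python
def marginal_cycles(totals):
    """Given totals[i] = cycles measured for i instructions (i = 0, 1, 2, ...),
    return the extra cycles caused by each additional instruction."""
    return [later - earlier for earlier, later in zip(totals, totals[1:])]


if __name__ == "__main__":
    # Illustrative numbers only: four cheap instructions, then one expensive one.
    made_up_totals = [1, 2, 3, 4, 10, 11]
    print(marginal_cycles(made_up_totals))  # [1, 1, 1, 6, 1]
```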
As you can see, every 8 NOP instructions (16-bit instructions) and every 4 ADD instructions (32-bit instructions) there is an increase in the cycles needed. This makes sense, as the flash loads 128 bits at a time. Therefore I would expect the board to take 6 cycles whenever a load happens.
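That expectation can be written down as a naive model to compare against the measurements. The sketch below only encodes the reasoning in this post (1 cycle per instruction, wait states + 1 cycles whenever an instruction starts a new 128-bit flash line); it deliberately ignores prefetch, caching and pipeline effects, so it is not a description of what the hardware really does. The names and the start_offset parameter are my own assumptions, and instructions are assumed not to straddle a line.

```python
FLASH_LINE_BYTES = 16  # 128 bits per flash read


def expected_marginal(n_instr, instr_bytes, wait_states, start_offset=0):
    """Naive per-instruction cost: a fetch penalty when an instruction begins a
    new flash line, 1 cycle otherwise.

    start_offset is the byte offset of the first measured instruction within
    its flash line (assumed to be a multiple of instr_bytes).
    """
    fetch_cost = wait_states + 1
    costs = []
    for i in range(n_instr):
        addr = start_offset + i * instr_bytes
        starts_new_line = addr % FLASH_LINE_BYTES == 0
        costs.append(fetch_cost if starts_new_line else 1)
    return costs


if __name__ == "__main__":
    # 32-bit ADDs, 5 wait states: a 6-cycle step every 4 instructions.
    print(expected_marginal(8, 4, 5))  # [6, 1, 1, 1, 6, 1, 1, 1]
    # 16-bit NOPs starting 4 bytes into a line: first step after 6 NOPs.
    print(expected_marginal(9, 2, 5, start_offset=4))  # [1, 1, 1, 1, 1, 1, 6, 1, 1]
```

A non-zero start_offset would apply if, for example, another instruction sits between the `.align 4` and the first measured instruction, which shifts where the first line boundary falls.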
This is true for the ADD instruction, but not for the NOP instruction. Also, there are times where another NOP takes 0 cycles, which should not be possible. With the given explanations I can predict the cycle count for 4 or more consecutive ADD instructions. Any idea what the reason is for the behaviour with fewer than 4 ADDs?