Why is the cycle count for running the same instruction multiple times not linearly increasing on the STM32F4?

Hi everyone,

I am trying to understand why the cycle count does not increase linearly when I run the same instruction x times on the STM32F405RGT6.

While experimenting with the STM32F4 to get a better understanding of instruction cycle counts, I stumbled across this problem. From the documentation I know that a NOP and an ADD should each take one cycle. I would therefore expect x NOPs to take x cycles, and the same for ADD. What I found is somewhat different.

To produce these results I wrote a quick script that creates 30 assembly functions. I read the cycle count from memory address 0xE0001004 (DWT_CYCCNT) on the STM32F405RGT6. I can load any values into r3 and r4 for the ADD operation and check the result in r5; r6 holds the number of cycles my instructions took. I checked the final ELF file with objdump to verify that no operations were removed, rearranged, or altered by the compiler.

# Replace XXXX by #instructions
.global asm_add_XXXX
.type asm_add_XXXX, %function
.align 2
asm_add_XXXX:
push {r3-r12}

# Reading inputs to register
ldr r3, [r0, #0]
ldr r4, [r0, #4]
# r5, r6 = outputs register
mov r5, #0
mov r6, #0

# Load the address of the cycle counter (DWT_CYCCNT) into r9
ldr r9, =0xE0001004
# Pad to a 16-byte boundary (.align takes a power of two on ARM GNU as)
.align 4
# Save current cycle count in r7
ldr r7, [r9, #0]

######################################
### Start of assembly code ###
######################################

# Insert add instruction x times
add r5, r3, r4

######################################
### End of assembly code #####
######################################

# Save current cycle count in r8
ldr r8, [r9, #0]
# Calculate cycles in r6 = r8 - r7
sub r6, r8, r7

# Write back output
str r5, [r2, #0]
str r6, [r2, #4]

pop {r3-r12}
bx lr
# Emit the literal pool for the "ldr r9, =0xE0001004" pseudo-instruction here, after the return
.LTORG
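
The generator script itself is not shown in the post. A minimal sketch of what it might look like in Python, assuming the template above is held in a string and only the measured body varies. The names `TEMPLATE`, `INSTRUCTIONS`, `make_function` and `make_file` are hypothetical, and the comments of the original template are dropped for brevity.

```python
# Hypothetical generator; names and layout are assumptions, not the poster's script.
TEMPLATE = """\
.global asm_{op}_{n}
.type asm_{op}_{n}, %function
.align 2
asm_{op}_{n}:
    push {{r3-r12}}
    ldr r3, [r0, #0]
    ldr r4, [r0, #4]
    mov r5, #0
    mov r6, #0
    ldr r9, =0xE0001004
    .align 4
    ldr r7, [r9, #0]
{body}
    ldr r8, [r9, #0]
    sub r6, r8, r7
    str r5, [r2, #0]
    str r6, [r2, #4]
    pop {{r3-r12}}
    bx lr
    .ltorg
"""

# Instruction under test, keyed by the name used in the function symbol.
INSTRUCTIONS = {"add": "add r5, r3, r4", "nop": "nop"}


def make_function(op, n):
    """Return one assembly function that repeats the instruction n times."""
    body = "\n".join("    " + INSTRUCTIONS[op] for _ in range(n))
    return TEMPLATE.format(op=op, n=n, body=body)


def make_file(op, max_n=30):
    """Return the source for functions with 1..max_n repetitions."""
    return "\n".join(make_function(op, n) for n in range(1, max_n + 1))


if __name__ == "__main__":
    print(make_function("add", 2))
```

Writing `make_file("add")` and `make_file("nop")` to two `.s` files would give the 30 functions per instruction described above.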

Can someone please explain why the cycle count is increasing non-linearly? Thanks in advance for any insights.

42Bastian Schick over 3 years ago in reply to PatrickG +1 verified

See "Flash access control register (FLASH_ACR)"

0 PatrickG over 3 years ago in reply to 42Bastian Schick

According to the STM32F4 documentation, FLASH_ACR is set to 0 wait states after reset. A flash memory read should therefore take 1 CPU cycle.

If I understand the documentation correctly, each flash read fetches 128 bits at a time, that is 8 NOPs or 4 ADDs, which seems to match the behaviour in my plots. The difference is that the jumps in my plots, which look like the flash reads, are 4 cycles for the NOP and 6 cycles for the ADD. There must be more to it than the flash read alone, because that should take only 1 cycle.

Is there a way to work around the flash access cost? You mentioned TCM previously. Is that a good alternative for more predictable behaviour, or does it come with other costs?

0 42Bastian Schick over 3 years ago in reply to PatrickG

I checked, and this chip has CCM, which seems to be something like TCM. Since it is limited in size, you need to identify the bottleneck code first.

0 WestfW over 3 years ago in reply to PatrickG

Are you sure you don't have a perhaps-hidden board_init() function or something similar that is raising the clock rate and turning on wait states?
(You could check by reading FLASH_ACR.)
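
Once the raw register value has been read over a debugger, it can be decoded on the host. A small sketch, assuming the STM32F405 field layout (LATENCY in bits 2:0, PRFTEN in bit 8, ICEN in bit 9, DCEN in bit 10); the layout should be checked against the reference manual for the exact part. The function name `decode_flash_acr` and the value `0x705` are illustrative, not taken from this thread.

```python
def decode_flash_acr(value):
    """Split a raw FLASH_ACR value into its main fields (assumed F405 layout)."""
    return {
        "latency": value & 0x7,              # wait states, bits 2:0
        "prefetch": bool(value & (1 << 8)),  # PRFTEN
        "icache": bool(value & (1 << 9)),    # ICEN
        "dcache": bool(value & (1 << 10)),   # DCEN
    }


# 0x705 is an example value only, not the one read in this thread.
print(decode_flash_acr(0x705))
# {'latency': 5, 'prefetch': True, 'icache': True, 'dcache': True}
```

A non-zero latency, or enabled prefetch and caches, would mean that startup code has changed the reset configuration.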

0 PatrickG over 3 years ago in reply to WestfW

I checked FLASH_ACR using a JTAG debugger and found that there really is a hidden section in board_init() which changes the number of wait states to 5. That is a small victory.

Sadly, I still cannot make sense of the plot. Instead of plotting the absolute cycle count, I plotted the number of extra cycles each additional instruction costs. In a perfect world, every NOP I add should cost exactly 1 extra cycle.
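
The step from absolute counts to extra cycles per added instruction is a simple difference of neighbouring measurements. A minimal sketch, where `extra_cycles` is a hypothetical helper and the numbers are placeholders for illustration, not the measured values:

```python
def extra_cycles(totals):
    """totals[i] is the cycle count measured with i + 1 instructions.

    Returns the cost of each additional instruction as the difference
    between consecutive measurements.
    """
    return [later - earlier for earlier, later in zip(totals, totals[1:])]


# Placeholder numbers for illustration only, not measurements from the plots.
example_totals = [3, 4, 5, 6, 12, 13]
print(extra_cycles(example_totals))  # [1, 1, 1, 6, 1]
```

In this representation a perfectly linear series shows up as a constant list of ones, and any fetch penalty appears as an isolated larger value.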

As you can see, every 8 NOP instructions (16-bit encoding) and every 4 ADD instructions (32-bit encoding) there is an increase in the cycles needed. This makes sense, as the flash delivers 128 bits at a time. I would therefore expect 6 cycles (5 wait states plus 1) whenever such a flash read happens.

This holds for the ADD instruction, but not for the NOP instruction. Also, there are points where an additional NOP costs 0 cycles, which should not be possible. With the explanations given so far I can predict the cycle count for 4 or more consecutive ADD instructions. Any idea what causes the behaviour for fewer than 4 ADDs?
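
The expectation described here can be written down as a naive model: an instruction that starts a new 128-bit flash line costs wait states plus 1 cycles, and every other instruction costs 1. This is only a sketch of that expectation. It deliberately ignores prefetch, the ART accelerator cache and pipeline effects, so it is not a description of what the hardware actually does, and `naive_extra_cycles` and `start_offset` are hypothetical names.

```python
def naive_extra_cycles(n_instr, instr_bytes, wait_states,
                       start_offset=0, line_bytes=16):
    """Expected cost of each instruction under the naive flash-line model.

    start_offset is the byte offset of the first measured instruction
    within its 128-bit (16-byte) flash line; it is an assumption that
    depends on how the code is laid out.
    """
    costs = []
    for k in range(n_instr):
        addr = start_offset + k * instr_bytes
        starts_new_line = addr % line_bytes == 0
        costs.append(wait_states + 1 if starts_new_line else 1)
    return costs


# Five 32-bit ADDs with 5 wait states, starting on a line boundary.
print(naive_extra_cycles(5, 4, 5))  # [6, 1, 1, 1, 6]
```

Comparing this list with the measured extra cycles makes the deviations explicit: the 4-cycle steps for NOP, the 0-cycle entries, and the behaviour for fewer than 4 ADDs are exactly the points this simple model cannot reproduce.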