Measuring Cortex-M4 instruction clock cycle counts

I'm trying to find a reliable method for measuring instruction clock cycles on the STM32F429 MCU, which incorporates a Cortex-M4 processor. Part of the challenge is that although the core itself has no cache, ST added its own proprietary ART Accelerator between the flash memory and the CPU. It provides a 1024-byte instruction cache and an instruction prefetch that together allow the CPU to run at 180 MHz with 0 wait states, even though reloading a cache line from flash costs 5 wait states.

My main program is written in C. It calls an assembly language function that contains the code I'm trying to time. I'm using the DWT cycle counter, which is driven directly by the CPU clock. To eliminate the effect of the cache, I'm using the following approach, which repeats the execution until the cycle count is stable. I do this twice: (1) to measure the overhead cycles required to read the DWT counter and to call and return from a function containing only a BX LR, and (2) to measure the cycle count of the code within TargetFunction (not counting the BL or BX LR instructions that do the call and return).

// Measure overhead cycles
overhead = 0 ;
do
{
save = overhead ;
start = ReadDWTCounter() ;
DummyFunction() ; // <------ This function contains nothing but a BX LR instruction
stop = ReadDWTCounter() ;
overhead = stop - start ;
} while (overhead != save) ;

// Measure function cycles
difference = 0 ;
do
{
save = difference ;
start = ReadDWTCounter() ;
TargetFunction() ; // <--------- This is the function containing the code I want to measure
stop = ReadDWTCounter() ;
difference = stop - start ;
} while (difference != save) ;

// Remove overhead cycles
cycles = difference - overhead ;
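
For reference, ReadDWTCounter is not shown above. The following is only a minimal sketch of how the counter might be enabled, assuming the standard ARMv7-M register layout (DEMCR at 0xE000EDFC, DWT_CTRL at 0xE0001000, DWT_CYCCNT at 0xE0001004). The names DwtRegs, EnableDWTCounter and ElapsedCycles are hypothetical and not part of the original code. The registers are passed in as pointers so that the logic can also be exercised off-target.

```c
#include <stdint.h>

/* Assumed register view: per the ARMv7-M architecture, DWT_CTRL sits at
   0xE0001000 with DWT_CYCCNT immediately after it; DEMCR is at 0xE000EDFC. */
typedef struct {
    volatile uint32_t CTRL;    /* bit 0 = CYCCNTENA */
    volatile uint32_t CYCCNT;  /* free-running cycle counter */
} DwtRegs;

#define DEMCR_TRCENA   (1u << 24)  /* enables the DWT block */
#define DWT_CYCCNTENA  (1u << 0)   /* starts the cycle counter */

/* Hypothetical helper: enable tracing, zero the counter, start counting. */
static void EnableDWTCounter(volatile uint32_t *demcr, DwtRegs *dwt)
{
    *demcr      |= DEMCR_TRCENA;
    dwt->CYCCNT  = 0;
    dwt->CTRL   |= DWT_CYCCNTENA;
}

/* Hypothetical helper: unsigned subtraction is modulo 2^32, so the result
   stays correct even if the counter wrapped once between the two reads. */
static uint32_t ElapsedCycles(uint32_t start, uint32_t stop)
{
    return stop - start;
}
```

On the target, the call would presumably look like EnableDWTCounter((volatile uint32_t *) 0xE000EDFC, (DwtRegs *) 0xE0001000), and ReadDWTCounter would simply return the CYCCNT field. The same modulo-2^32 arithmetic is what makes stop - start in the loops above safe across a counter wrap.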

As expected, the loops each run for only two iterations, where the first iteration loads the code into cache and the second executes from cache with zero wait states. This seems to give very good and repeatable results, except that the final value of cycles is one greater than I would expect.
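
The stop condition can also be exercised off-target. The following is a host-side sketch only: the counter and the target are simulated, the cycle costs (6 warm, 5 extra when cold) are made up for illustration, and the names SimReadCounter, SimTarget and MeasureUntilStable are hypothetical. It also adds an iteration cap, which the loops above do not have, so that a measurement that never settles cannot hang the program.

```c
#include <stdint.h>

/* Simulated counter state, standing in for the DWT cycle counter. */
static uint32_t sim_now;
static int      sim_cache_warm;

static uint32_t SimReadCounter(void) { return sim_now; }

/* Simulated target: costs extra cycles on the first (cold) call only.
   The values 6 and 5 are invented for illustration. */
static void SimTarget(void)
{
    sim_now += 6u + (sim_cache_warm ? 0u : 5u);
    sim_cache_warm = 1;
}

/* Repeat the measurement until two consecutive results agree, but give up
   after max_iters passes. Returns the last result; *iters gets the count. */
static uint32_t MeasureUntilStable(unsigned max_iters, unsigned *iters)
{
    uint32_t save, result = 0;
    unsigned n = 0;
    do {
        save = result;
        uint32_t start = SimReadCounter();
        SimTarget();
        uint32_t stop = SimReadCounter();
        result = stop - start;   /* modulo-2^32, safe across one wrap */
        n++;
    } while (result != save && n < max_iters);
    *iters = n;
    return result;
}
```

Because the loop exits only when two consecutive passes agree, this simulation needs three passes whenever the cold pass differs from the warm ones (cold, warm, warm). A two-pass exit would mean the first and second results already matched.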

For example, if the code I'm timing is a single 16-bit ADD instruction (inside TargetFunction), the measured cycle count should be 1 clock cycle, but I get 2. If I time two 16-bit ADD instructions, the measured cycle count should be 2 clock cycles, but I get 3, and so on.

Can anyone explain the extra cycle?

Thanks!
Dan

Dan Lewis over 6 years ago in reply to Dan Lewis

I looked at the compiler's assembly output for the C code, but I didn't see anything suspicious. As a sanity check, I then implemented each of the two loops in assembly for comparison. The number of cycles measured for overhead remained the same, but the number of cycles measured for difference decreased by 1, so the value of cycles now seems correct. It's very strange, since both loops are identical except for the function being timed.

Here's my assembly version of the loop that calculates difference, followed by the compiler's version (with comments added by me). The compiled loop that computes overhead is identical except for some of the compiler's register choices.

.global GetDifference
.thumb_func
.align
GetDifference:
PUSH {R4-R6,LR}
MOVS R4,0 // difference = 0
L2: MOVS R5,R4 // save = difference
BL ReadDWTCounter
MOVS R6,R0 // R6 = start
BL TargetFunction
BL ReadDWTCounter // R0 = stop
SUBS R4,R0,R6 // difference = stop - start
CMP R4,R5
BNE L2
MOVS R0,R4 // Return difference
POP {R4-R6,PC}

Here's the compiled output:

movs r1, #0
str r1, [r7] // r7 preloaded with address of difference
.L3:
str r1, [r6] // save = difference (r6 preloaded with address of save)
bl ReadDWTCounter
str r0, [r4] // keep start in a temporary
bl TargetFunction
bl ReadDWTCounter
ldr r1, [r4] // r1 <-- start
ldr r3, [r6] // r3 <-- save
subs r1, r0, r1 // r1 <-- difference = stop - start
cmp r1, r3 // difference == save?
str r1, [r7] // store difference back in memory
bne .L3

One last observation: As I mentioned above, all the C variables are declared static. However, if I declare the variable start as a register variable, then both compiled loops return values that are 1 cycle greater than the values returned by the assembly versions of the loops, so that the value of cycles is the same for both versions!