I'm trying to find a reliable method for measuring instruction clock cycles on the STM32F429 MCU, which incorporates a Cortex-M4 processor. Part of the challenge is that although the core CPU has no cache, ST added their own proprietary ART Accelerator between the flash memory and the CPU. It provides a 1024-byte instruction cache and an instruction prefetch that allow the CPU to run at 180 MHz with 0 wait states, even though reloading a cache line from flash costs 5 wait states.
My main program is written in C. It calls an assembly language function that contains the code I'm trying to time. I'm using the DWT cycle counter that is driven directly by the CPU clock. To eliminate the effect of the cache, I'm using the following approach that repeats the execution until the cycle count is stable. I do this twice - (1) to account for the overhead cycles required to read the DWT counter and for the cycles required to simply call and return from a function containing only a BX LR, and (2) to measure the cycle count of the code within TargetFunction (not counting the BL or BX LR instructions that do the call and return).
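For completeness, the counter setup looks roughly like this - a minimal sketch using raw ARMv7-M debug register addresses rather than the CMSIS struct names; the author's actual ReadDWTCounter may of course differ:

```c
#include <stdint.h>

/* Minimal DWT cycle-counter setup (standard ARMv7-M register addresses).
 * Sketch only - a real project would typically use the CMSIS
 * DWT/CoreDebug definitions instead of raw pointers. */
#define DEMCR       (*(volatile uint32_t *)0xE000EDFC)  /* Debug Exception and Monitor Control */
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000)
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004)

void InitDWTCounter(void)
{
    DEMCR      |= (1u << 24);   /* TRCENA: enable the DWT unit  */
    DWT_CYCCNT  = 0;
    DWT_CTRL   |= 1u;           /* CYCCNTENA: start the counter */
}

uint32_t ReadDWTCounter(void)
{
    return DWT_CYCCNT;
}
```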
// Measure overhead cycles
overhead = 0 ;
do {
    save = overhead ;
    start = ReadDWTCounter() ;
    DummyFunction() ;           // <------ This function contains nothing but a BX LR instruction
    stop = ReadDWTCounter() ;
    overhead = stop - start ;
} while (overhead != save) ;

// Measure function cycles
difference = 0 ;
do {
    save = difference ;
    start = ReadDWTCounter() ;
    TargetFunction() ;          // <--------- This is the function containing the code I want to measure
    stop = ReadDWTCounter() ;
    difference = stop - start ;
} while (difference != save) ;
// Remove overhead cycles cycles = difference - overhead ;
As expected, the loops each run for only two iterations, where the first iteration loads the code into cache and the second executes from cache with zero wait states. This seems to give very good and repeatable results, except that the final value of cycles is one greater than I would expect.
For example, if the code I'm timing is a single 16-bit ADD instruction (inside TargetFunction), the measured cycle count should be 1 clock cycle, but I get 2. If I time two 16-bit ADD instructions, the measured cycle count should be 2 clock cycles, but I get 3, and so on.
Can anyone explain the extra cycle?
Thanks!
Dan
Did you try running the test w/o ART?
Or this (from Armv7-M manual) is the reason:
"In particular, the architecture does not define the point in a pipeline where a particular instruction increments a performance counter, relative to the point where software can read the performance counter. Therefore, pipelining can add some imprecision. "
To rule out a few more possibilities on top of that, you may want to:
- check (in the disassembly of the C program) if an additional instruction got covered when calculating the value for 'difference'.
- add 2-3 dummy, simple instructions as the first instructions in both the Dummy and the Target functions. My (theoretical) guess is that a branch prediction scheme which knows that a BX LR is the target of a BL can skip the BX LR. This is similar to a compiler removing an empty function call. It would also imply that testing with an empty/flushed/disabled BTB can provide results unaffected by branch prediction.
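For instance, the padded dummy might look like this (GNU assembly sketch; the function name is taken from the original post, and the choice of NOPs as padding is just one option):

```asm
        .global DummyFunction
        .thumb_func
DummyFunction:
        NOP                 @ dummy instructions so that BX LR is no
        NOP                 @ longer the immediate target of the BL
        BX      LR
```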
Edit: The steps, possibly along these lines:
1. BL is decoded, the BTB is checked. There is a hit for this PC, where the BTB entry has the predicted address, the instruction at that address (if BTB supports branch folding, it could contain the instruction), and a bit which says that the instruction is a BX LR.
Simultaneously, the fetch stage fetches the instruction (ir) following the BL instruction. This is also the instruction at the return address. The prediction, assumed to be correct, implies that decode need not change the stream that fetch is fetching.
2. BL proceeds to the execute stage to calculate the actual target address, and update the LR. The instruction ir proceeds to the decode stage at the same time. And the instruction following ir (ir + 1) is fetched.
At the end of BL's execute stage, the prediction (predicted target address == actual target address) is known to be correct, and the pipeline does not need to be flushed. BX LR has effectively disappeared.
The above is an attempt to come up with a plausible situation where the variable "overhead" has one less than the expected value. You can also try to see whether the variable 'difference' is one higher, or the variable 'overhead' is one lower, to indicate which of the two calculations is off by one.
I second the hint about the dummy instructions.
I didn't want to turn off the ART Accelerator since that would cause the Flash memory to insert 5 wait states on every access, thus confusing the situation even more. I had seen the quote from the ARMv7-M manual before, and you may be right - it's hard to know.
Thanks. The C main program is compiled with -O3 optimization, so I would be surprised if there were extra instructions covered during the computation of the difference - but I will certainly check. I will also try adding some NOP's at the beginning of the two functions to see if that makes a difference. However, I should explain that this is for a sophomore class I teach on ARM assembly, and I'm trying to provide the students with cycle counts that they can measure and that are consistent with those published by ARM. It's also a motivator for them; I provide the C program and ask them to write the assembly language functions that are being timed. When they see that another student's implementation of the same function has lower cycle counts, it encourages them to try to find a more efficient solution. If the cycle counts don't seem accurate, then they won't have quite the same reaction. :-)
So first they learn that manuals and real life don't match 1:1 ;-)
Anyway, I'd say it is a good learning effect to see that a function which should be quicker in theory might not be in practice.
And, if teaching is the goal, slow down the clock, disable ART and reduce Flash wait-states.
Good point about learning! :-) I had considered slowing down the clock so that no wait states would be needed and then turning off the ART Accelerator, but I'm so close to making it work at full speed, that I hate to give up. :-)
I forgot to mention in my original post that all of the variables are declared static. Allowing them to be allocated on the stack (auto) changes the numbers. I haven't had a chance to explore why this happens, but I found it curious and something on my to do list. :-)
I added two NOP's at the beginning of both functions (DummyFunction and TargetFunction). It simply increased the overhead cycle count by 2 and that of difference by 2, so that the value of cycles remained the same. Now to look at the assembly code generated by the compiler.
Dan
I looked at the assembly output of the compiler for the C code, but I didn't see anything suspicious. But as a sanity check, I then implemented each of the two loops in assembly for a comparison. The number of cycles required to calculate the value of overhead remained the same, but the number of cycles required to calculate the value of difference decreased by 1 so that the value of cycles now seems correct. It's very strange, since both loops are identical except for the function being timed. Here's my assembly version of the loop to calculate difference, followed by the compiler's version (with comments added by me): The compiled loop that computes overhead is identical except that some of the register choices made by the compiler are different.
.global GetDifference
.thumb_func
.align
GetDifference:
        PUSH    {R4-R6,LR}
        MOVS    R4,0            // difference = 0
L2:     MOVS    R5,R4           // save = difference
        BL      ReadDWTCounter
        MOVS    R6,R0           // R6 = start
        BL      TargetFunction
        BL      ReadDWTCounter  // R0 = stop
        SUBS    R4,R0,R6        // difference = stop - start
        CMP     R4,R5
        BNE     L2
        MOVS    R0,R4           // Return difference
        POP     {R4-R6,PC}
Here's the compiled output:
        movs    r1, #0
        str     r1, [r7]        // r7 preloaded with address of difference
.L3:    str     r1, [r6]        // save = difference (r6 preloaded with address of save)
        bl      ReadDWTCounter
        str     r0, [r4]        // keep start in a temporary
        bl      TargetFunction
        bl      ReadDWTCounter
        ldr     r1, [r4]        // r1 <-- start
        ldr     r3, [r6]        // r3 <-- save
        subs    r1, r0, r1      // r1 <-- difference = stop - start
        cmp     r1, r3          // difference == save?
        str     r1, [r7]        // store difference back in memory
        bne     .L3
One last observation: As I mentioned above, all the C variables are declared static. However, if I declare variable start to be a register variable, then both compiled loops return values that are 1 cycle greater than the values returned by the assembly version of the loops, so that the value of cycles is the same for both versions!
Do you also get an extra cycle if TargetFunction is a single NOP?
I wonder if there is a register dependency from the ADD to the reading of the cycle counter.
Another idea to try: inline the test code instead of using a subroutine.
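A sketch of that, assuming GCC extended asm on a Cortex-M target (the operand names and the ADD chosen as the code under test are purely illustrative):

```c
uint32_t start, stop, a = 0;

start = ReadDWTCounter();
/* Code under test, inlined - no BL/BX LR inside the measured window. */
__asm volatile ("adds %0, %0, #1" : "+r"(a) : : "cc");
stop = ReadDWTCounter();
```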
I think I may have found something. When I looked at the assembly output of the compiler, I noticed that the value returned by the first call to ReadDWTCounter was saved in variable start using an STR instruction - i.e., it was being stored in memory, not kept in a register. That made me wonder if there was an extra cycle due to the STR. I then coded assembly language versions of both loops by hand, keeping start in a register. When I ran that version, the cycle count came out correct.
With that hint, I went back to my C code and changed the storage class of variable start to register. That gave me the same result as my assembly language versions of the loops. It also no longer makes any difference whether the other variables are static or auto.
However, these changes should have affected both loops the same way, so even though the value of cycles now seems correct, I am still at a loss to explain it. :-)
Given that the CPU is pipelined, when we expect the CPU to consume one cycle per instruction, we are looking at the throughput of the pipeline, not the individual latency of an instruction. The throughput is sensitive to fetch and load/store delays, data hazards, branches, and variable-delay instructions. The DWT framework seems to support measuring a few such properties, in addition to the plain number of cycles.
Let r10 contain the address of the counter.
Run (possibly over multiple iterations, as in your original code):

    LDR r0, [r10]
    ADD r1, r1, r2
    LDR r3, [r10]

Then diff0 = r3 - r0 provides a base-line cycle count on this device.
Now run the same sequence with one more ADD inserted:

    LDR r0, [r10]
    ADD r1, r1, r2
    ADD r3, r3, r4
    LDR r5, [r10]

Then diff1 = r5 - r0 is expected to be 1 larger than diff0, since the extra ADD does not disrupt the flow of the pipeline. Inserting further ADD instructions that do not cause stalling hazards with their predecessors or successors (and that are not interrupted by asynchronous behaviour such as exceptions or interrupts) should continue to increment the difference by 1.
One can work forward from here to arrive at a stable configuration which makes sense and which can be ported to C.
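One possible C port of the base-line run, as a sketch (GCC extended asm, Cortex-M only; 0xE0001004 is the standard ARMv7-M DWT CYCCNT address, and the function name is invented):

```c
#include <stdint.h>

static inline uint32_t BaselineDiff(void)
{
    volatile uint32_t *cyccnt = (volatile uint32_t *)0xE0001004; /* DWT CYCCNT */
    uint32_t t0, t1, a = 0, b = 0;

    __asm volatile (
        "ldr  %0, [%4]\n\t"     /* t0 = CYCCNT            */
        "add  %2, %2, %3\n\t"   /* instruction under test */
        "ldr  %1, [%4]\n\t"     /* t1 = CYCCNT            */
        : "=&r"(t0), "=&r"(t1), "+r"(a)
        : "r"(b), "r"(cyccnt)
        : );
    return t1 - t0;             /* diff0 */
}
```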
I totally agree, except that the assembly language must be implemented as a function called from a C main program. The resulting function call and return obviously disrupt the pipeline and require instruction fetch from different regions of memory than the function being measured. I believe there is something about how this affects the cycle counts that I have yet to understand.