I'm trying to find a reliable method for measuring instruction clock cycles on the STM32F429 MCU that incorporates a Cortex-M4 processor. Part of the challenge is that although the core CPU has no cache, ST added their own proprietary ART Accelerator between the flash memory and the CPU. It provides an instruction cache of 1024 bytes and an instruction prefetch that allow the CPU to run at 180 MHz with 0 wait states, even though there is a 5 clock wait state to reload a cache line from flash.
My main program is written in C. It calls an assembly language function that contains the code I'm trying to time. I'm using the DWT cycle counter that is driven directly by the CPU clock. To eliminate the effect of the cache, I'm using the following approach that repeats the execution until the cycle count is stable. I do this twice - (1) to account for the overhead cycles required to read the DWT counter and for the cycles required to simply call and return from a function containing only a BX LR, and (2) to measure the cycle count of the code within TargetFunction (not counting the BL or BX LR instructions that do the call and return).
// Measure overhead cycles
overhead = 0 ;
do
    {
    save = overhead ;
    start = ReadDWTCounter() ;
    DummyFunction() ;       // <------ This function contains nothing but a BX LR instruction
    stop = ReadDWTCounter() ;
    overhead = stop - start ;
    } while (overhead != save) ;

// Measure function cycles
difference = 0 ;
do
    {
    save = difference ;
    start = ReadDWTCounter() ;
    TargetFunction() ;      // <--------- This is the function containing the code I want to measure
    stop = ReadDWTCounter() ;
    difference = stop - start ;
    } while (difference != save) ;

// Remove overhead cycles
cycles = difference - overhead ;
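For anyone reproducing this, the post does not show how the DWT counter is enabled or read, so here is a minimal sketch of one way it might be done. The register addresses and bit positions are the architectural ARMv7-M ones (DEMCR, DWT_CTRL, DWT_CYCCNT). The struct and function names (CycleCounterRegs, EnableCycleCounter, ReadCycleCounter, ElapsedCycles) are made up for illustration and are not the poster's ReadDWTCounter. The registers are passed in as pointers only so the logic can also be exercised off-target.

```c
#include <stdint.h>

/* Architectural ARMv7-M addresses (Cortex-M3/M4). */
#define DEMCR_ADDR          0xE000EDFCu  /* Debug Exception and Monitor Control Register */
#define DWT_CTRL_ADDR       0xE0001000u  /* DWT Control Register */
#define DWT_CYCCNT_ADDR     0xE0001004u  /* DWT Cycle Count Register */

#define DEMCR_TRCENA        (1u << 24)   /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)    /* enables the cycle counter */

/* Hypothetical helper type: pointers to the three registers involved.
 * On the target these would be initialised from the addresses above, e.g.
 *   { (volatile uint32_t *)DEMCR_ADDR,
 *     (volatile uint32_t *)DWT_CTRL_ADDR,
 *     (volatile uint32_t *)DWT_CYCCNT_ADDR }                              */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *ctrl;
    volatile uint32_t *cyccnt;
} CycleCounterRegs;

/* Turn on the DWT unit, clear the counter, then start it counting. */
static void EnableCycleCounter(const CycleCounterRegs *r)
{
    *r->demcr |= DEMCR_TRCENA;
    *r->cyccnt = 0;
    *r->ctrl  |= DWT_CTRL_CYCCNTENA;
}

/* Return the current cycle count. */
static inline uint32_t ReadCycleCounter(const CycleCounterRegs *r)
{
    return *r->cyccnt;
}

/* Unsigned subtraction gives the right answer even if the 32-bit
 * counter wrapped once between the two reads. */
static inline uint32_t ElapsedCycles(uint32_t start, uint32_t stop)
{
    return stop - start;
}
```

With the CMSIS headers the same registers are normally reached as CoreDebug->DEMCR, DWT->CTRL and DWT->CYCCNT. Note that the read itself compiles to a load whose cost is part of what the overhead loop is meant to cancel out.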
As expected, the loops each run for only two iterations, where the first iteration loads the code into cache and the second executes from cache with zero wait states. This seems to give very good and repeatable results, except that the final value of cycles is one greater than I would expect.
For example, if the code I'm timing is a single 16-bit ADD instruction (inside TargetFunction), the measured cycle count should be 1 clock cycle, but I get 2. If I try to time two 16-bit ADD instructions, the measured cycle count should be 2 clock cycles, but I get 3, and so on.
Can anyone explain the extra cycle?
Thanks!
Dan
To exhaust a few more options on top, you may want to:
- check (in the disassembly of the C program) if an additional instruction got covered when calculating the value for 'difference'.
- add 2-3 simple dummy instructions as the first instructions in both the Dummy and the Target functions. My (theoretical) guess is that a branch prediction scheme that recognizes a BX LR as the target of a BL can skip the BX LR. This is similar to a compiler removing a call to an empty function. It would also imply that testing with an empty/flushed/disabled BTB can give results unaffected by branch prediction.
Edit: The steps could be along these lines:
1. BL is decoded, the BTB is checked. There is a hit for this PC, where the BTB entry has the predicted address, the instruction at that address (if BTB supports branch folding, it could contain the instruction), and a bit which says that the instruction is a BX LR.
Simultaneously, the fetch stage fetches the instruction (ir) following the BL instruction. This is also the instruction at the return address. The prediction, assumed to be correct, implies that decode need not change the stream that fetch is fetching.
2. BL proceeds to the execute stage to calculate the actual target address, and update the LR. The instruction ir proceeds to the decode stage at the same time. And the instruction following ir (ir + 1) is fetched.
At the end of BL's execute stage, the prediction (predicted target address == actual target address) is known to be correct, and the pipeline does not need to be flushed. BX LR has effectively disappeared.
The above is an attempt to come up with a plausible situation where the variable 'overhead' is one less than the expected value. You can also check whether the variable 'difference' is one higher or the variable 'overhead' is one lower, to see which of the two measurements is off by one.
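One way to organize that check is to take two target measurements with different numbers of ADD instructions and separate the per-instruction cost from the constant part. The sketch below is only arithmetic on already-measured values. The names (OffsetReport, AnalyzeOffset) are hypothetical, and it assumes the timed instructions all cost the same number of cycles and that the two instruction counts differ.

```c
#include <stdint.h>

/* Hypothetical report type for illustration. */
typedef struct {
    int32_t cycles_per_instr; /* slope: cycles added per extra timed instruction */
    int32_t call_cost;        /* cycles left in 'difference' once the timed instructions are removed */
    int32_t offset;           /* call_cost minus the measured 'overhead' */
} OffsetReport;

/* overhead : stable count measured around DummyFunction
 * n1, diff1: instruction count and stable 'difference' for the first target variant
 * n2, diff2: the same for a second variant; n2 must differ from n1 */
static OffsetReport AnalyzeOffset(uint32_t overhead,
                                  uint32_t n1, uint32_t diff1,
                                  uint32_t n2, uint32_t diff2)
{
    OffsetReport r;
    r.cycles_per_instr = ((int32_t)diff2 - (int32_t)diff1)
                       / ((int32_t)n2 - (int32_t)n1);
    r.call_cost = (int32_t)diff1 - r.cycles_per_instr * (int32_t)n1;
    r.offset    = r.call_cost - (int32_t)overhead;
    return r;
}
```

As an example, suppose overhead came out as 10 (a made-up value; the post does not give it). The reported results of 2 cycles for one ADD and 3 cycles for two ADDs then correspond to difference values of 12 and 13. AnalyzeOffset(10, 1, 12, 2, 13) gives cycles_per_instr = 1, call_cost = 11 and offset = 1. So each ADD really does cost one cycle, and the extra cycle is a constant that sits in the call/return path rather than in the timed instructions. This alone cannot say whether the dummy call is one cycle cheaper or the target call one cycle dearer, which is where padding both functions with the same dummy instructions should help.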
I second the suggestion about the dummy instructions.