Measuring Cortex-M4 instruction clock cycle counts

I'm trying to find a reliable method for measuring instruction clock cycles on the STM32F429 MCU, which incorporates a Cortex-M4 processor. Part of the challenge is that although the core itself has no cache, ST added their proprietary ART Accelerator between the flash memory and the CPU. It provides a 1024-byte instruction cache and instruction prefetch that allow the CPU to run at 180 MHz with zero wait states, even though reloading a cache line from flash costs 5 wait states.
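
For context, the ART cache, prefetch, and flash wait-state settings live in the FLASH ACR register. A minimal sketch of the typical configuration at 180 MHz (CMSIS register names; ConfigureFlashART is only an illustrative name, and the clock setup itself is omitted):

#include "stm32f4xx.h"  // CMSIS device header

// Sketch: typical ART / flash configuration for 180 MHz operation,
// normally done as part of the clock setup.
static void ConfigureFlashART(void)
    {
    FLASH->ACR = FLASH_ACR_ICEN          // ART instruction cache
               | FLASH_ACR_PRFTEN        // instruction prefetch
               | FLASH_ACR_LATENCY_5WS ; // 5 wait states for flash reads at 180 MHz
    }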

My main program is written in C. It calls an assembly language function that contains the code I'm trying to time. I'm using the DWT cycle counter, which is driven directly by the CPU clock. To eliminate the effect of the cache, I use the following approach, which repeats the execution until the cycle count is stable. I do this twice: (1) to account for the overhead cycles required to read the DWT counter plus the cycles required to simply call and return from a function containing only a BX LR, and (2) to measure the cycle count of the code within TargetFunction (not counting the BL or BX LR instructions that perform the call and return).

// Measure overhead cycles
overhead = 0 ;
do
    {  
    save = overhead ;
    start = ReadDWTCounter() ; 
    DummyFunction() ; // <------ This function contains nothing but a BX LR instruction
    stop = ReadDWTCounter() ;
    overhead = stop - start ;
    } while (overhead != save) ;

// Measure function cycles
difference = 0 ;
do
    {
    save = difference ;
    start = ReadDWTCounter() ;
    TargetFunction() ; // <--------- This is the function containing the code I want to measure
    stop = ReadDWTCounter() ;
    difference = stop - start ;
    } while (difference != save) ;

// Remove overhead cycles
cycles = difference - overhead ;
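
For anyone unfamiliar with the DWT, the counter is enabled once at startup and then simply read. A minimal sketch using the CMSIS register names (EnableDWTCounter is only an illustrative name, not my exact code):

#include "stm32f4xx.h"  // CMSIS device header (provides CoreDebug and DWT)

// One-time setup: enable trace and start the cycle counter.
static void EnableDWTCounter(void)
    {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk ;  // enable the DWT/ITM blocks
    DWT->CYCCNT = 0 ;                                 // reset the counter
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk ;             // count CPU clock cycles
    }

// Read the free-running cycle counter.
static inline uint32_t ReadDWTCounter(void)
    {
    return DWT->CYCCNT ;
    }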

As expected, the loops each run for only two iterations, where the first iteration loads the code into cache and the second executes from cache with zero wait states. This seems to give very good and repeatable results, except that the final value of cycles is one greater than I would expect.

For example, if the code I'm timing is a single 16-bit ADD instruction (inside TargetFunction), the measured cycle count should be 1 clock cycle, but I get 2. If I try to time two 16-bit ADD instructions, the measured cycle count should be 2 clock cycles, but I get 3, and so on.
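
To make that concrete, the two functions amount to something like the following, shown here as naked C functions with GCC inline assembly purely for illustration; the real functions are in a separate assembly file.

// Illustrative equivalents of the two functions (the real ones are plain assembly).
__attribute__((naked)) void DummyFunction(void)
    {
    __asm volatile ("bx lr") ;
    }

__attribute__((naked)) void TargetFunction(void)
    {
    __asm volatile
        (
        "adds r0, r0, r1 \n"   // the single 16-bit ADD being timed
        "bx   lr         \n"
        ) ;
    }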

Can anyone explain the extra cycle?

Thanks!
Dan

Parents
  • Given that the CPU is pipelined, when we expect the CPU to consume one cycle per instruction, we are looking at the throughput of the pipeline, not the latency of an individual instruction. The throughput is sensitive to fetch and load/store delays, data hazards, branches, and variable-latency instructions. The DWT framework appears to support measuring a few such properties in addition to the plain cycle count.

    Let r10 contain the address of the counter.

    Run (possibly multiple iterations as performed in your original code)

        LDR r0, [r10]

        ADD r1, r1, r2

        LDR r3, [r10]

    Then diff0 = r3 - r0 provides a baseline cycle count on this device.

    Now run,

        LDR r0, [r10]

        ADD r1, r1, r2

        ADD r3, r3, r4

        LDR r5, [r10]

    Then diff1 = r5 - r0 is expected to be 1 larger than diff0, since the extra ADD does not disrupt the flow of the pipeline. Inserting further ADD instructions that cause no stalling hazards with their predecessors or successors (and that are not interrupted by asynchronous events such as exceptions or interrupts) should keep incrementing the difference by 1.

    One can work forward from here to arrive at a stable configuration which makes sense and which can be ported to C; a sketch of such a port follows below.
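
    As a sketch of such a port (assuming GCC extended inline assembly and the CMSIS definition of DWT; MeasureOneAdd is only an illustrative name, and the registers and encodings the toolchain picks are not guaranteed):

        #include "stm32f4xx.h"   // CMSIS device header (provides DWT)

        // Back-to-back reads of DWT->CYCCNT around the instruction under test,
        // with no function call inside the timed window.
        static inline uint32_t MeasureOneAdd(void)
        {
            uint32_t t0, t1;
            uint32_t a = 1, b = 2;

            __asm volatile (
                "ldr  %[t0], [%[cyc]]  \n"   // first timestamp
                "adds %[a], %[a], %[b] \n"   // the ADD under test
                "ldr  %[t1], [%[cyc]]  \n"   // second timestamp
                : [t0] "=&r" (t0), [t1] "=&r" (t1), [a] "+l" (a)
                : [b] "l" (b), [cyc] "r" (&DWT->CYCCNT)
            );

            return t1 - t0;   // corresponds to diff0 above
        }

    Running this a couple of times, as in the original do/while loops, lets the instruction cache warm up before the value is trusted. The DWT also provides the CPICNT, LSUCNT, FOLDCNT and EXCCNT profiling counters, which can help break down where any extra cycles come from.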

Children
  • I totally agree, except that the assembly language code must be implemented as a function called from a C main program. The resulting function call and return obviously disrupt the pipeline and require instruction fetches from different regions of memory than the function being measured. I believe there is something about how this affects the cycle counts that I have yet to understand.