Load / Store timings with different cache settings

Hello,

I am timing load and store instructions in a bare-metal program by stepping through execution with OpenOCD and reading the PMU cycle counter, which is set to single-cycle granularity. The program runs on a single core of the Cortex-A9 in a Xilinx Zynq-7000 (I have a Zybo board).
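For reference, the counter setup described above can be sketched as follows. This is a minimal sketch, not the exact code used in the post: it assumes privileged (bare-metal) execution on an ARMv7-A core, and the macro `USE_PMU_ASM` and the function names are hypothetical. The bit positions are those of the ARMv7-A PMCR and PMCNTENSET registers; clearing PMCR.D is what gives single-cycle granularity instead of one tick per 64 cycles.

```c
#include <assert.h>
#include <stdint.h>

/* PMCR (Performance Monitors Control Register) bits, ARMv7-A. */
#define PMCR_E (1u << 0) /* enable all counters */
#define PMCR_C (1u << 2) /* reset the cycle counter to zero */
#define PMCR_D (1u << 3) /* 1 = cycle counter ticks once every 64 cycles */

/* PMCNTENSET bit that enables the cycle counter (PMCCNTR). */
#define PMCNTENSET_C (1u << 31)

/* Derive the PMCR value to write from the value just read:
 * counters enabled, cycle counter reset, divide-by-64 switched off. */
static uint32_t pmcr_single_cycle(uint32_t current)
{
    uint32_t v = current;
    v |= PMCR_E | PMCR_C;
    v &= ~PMCR_D;
    assert((v & PMCR_D) == 0u); /* counter must tick every cycle */
    return v;
}

#if defined(USE_PMU_ASM) /* hypothetical switch: define only for ARMv7-A targets */
/* Program the PMU through CP15. Must run in a privileged mode. */
static void pmu_start_cycle_counter(void)
{
    uint32_t pmcr;
    uint32_t enable = PMCNTENSET_C;

    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));   /* read PMCR */
    pmcr = pmcr_single_cycle(pmcr);
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr));  /* write PMCR */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(enable)); /* write PMCNTENSET */
    __asm__ volatile("isb" : : : "memory");
}
#endif
```

Keeping the bit arithmetic in `pmcr_single_cycle` separate from the CP15 accesses lets the register value be checked on a host machine before it is written on the target.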

I have tried several different cache configurations, and am now trying to make sense of the results.

First: All caches enabled

The histogram shows that the vast majority of LDR and STR instructions took between 10 and 15 cycles (14 cycles, according to the raw data). I see this and think: okay, an L1 cache hit takes about 14 cycles.

Then I ran with the L1 caches disabled (so only the L2 cache is enabled):

Now a bunch of accesses have shifted to the right, taking ~35 cycles. Maybe that is how long an L2 hit takes? But why are there still so many ~14-cycle accesses? (From the raw data, these are both load and store instructions.)

This seems weird to me, so I dug through the docs and tried turning off features like prefetching and branch speculation... but that doesn't get rid of the 14-cycle accesses.

Any ideas as to what is going on? Is my "14 cycles = an L1 hit" assumption incorrect? Are there other cache options I should try turning off? Is my method of getting each instruction's cycle count flawed? 14 cycles seems like a lot to me, but I am assuming this is because stepping through the program clears the pipeline on every step (instructions like ADD also take ~14 cycles).
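If the ~14 cycles seen on ADD really is a fixed per-step cost, one way to read the raw data is to treat it as a floor and look only at what each LDR/STR sample adds on top of it. The sketch below illustrates that arithmetic under that assumption; the function names and the `slack` tolerance are hypothetical, and it does not by itself explain why some accesses stay at the floor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest sample in a set, e.g. the per-step floor taken from ADD timings. */
static uint32_t min_cycles(const uint32_t *samples, size_t n)
{
    assert(n > 0);
    uint32_t m = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < m) {
            m = samples[i];
        }
    }
    return m;
}

/* Cycles a sample spends above the floor (clamped at zero). */
static uint32_t net_cycles(uint32_t sample, uint32_t floor_cycles)
{
    return sample > floor_cycles ? sample - floor_cycles : 0u;
}

/* Number of samples that exceed the floor by more than `slack` cycles. */
static size_t count_above_floor(const uint32_t *samples, size_t n,
                                uint32_t floor_cycles, uint32_t slack)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (net_cycles(samples[i], floor_cycles) > slack) {
            count++;
        }
    }
    return count;
}
```

With the figures quoted above, a ~35-cycle access measured against a ~14-cycle floor leaves about 21 cycles attributable to the access itself, while a ~14-cycle access leaves roughly none.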

Thank you in advance for your help!

  • I have some concerns about your approach.  I've not used OpenOCD before and don't know how it handles stepping, but if it's similar to other debuggers, the process of stepping could be quite invasive, which would disrupt the timing of the instruction being stepped (compared with running it without stepping).

    The other concern is that the PMU isn't really designed to measure the timing of a single instruction in isolation, if for no other reason than that the time needed to take the measurements would end up being a significant portion of the measured time (if you did it from software).  Measuring a single instruction also misses its interactions with the instructions around it.  It's more common to measure blocks rather than individual instructions.
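Block measurement from software could be sketched roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: the cycle counter is assumed to be already enabled, `USE_PMCCNTR` is a hypothetical build switch for the ARMv7-A target, and `fake_read_cycles` is a hypothetical stand-in reader for dry runs on a host machine. The idea is to time many loads as one block, subtract the cost of the two counter reads, and divide by the number of loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Any function that returns the current value of a free-running cycle counter. */
typedef uint32_t (*cycle_reader_fn)(void);

#if defined(USE_PMCCNTR) /* hypothetical switch: define only for ARMv7-A targets */
/* Read PMCCNTR through CP15 (privileged mode, counter already enabled). */
static uint32_t read_pmccntr(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
    return v;
}
#endif

/* Hypothetical stand-in reader: advances 14 "cycles" per call. */
static uint32_t fake_now;
static uint32_t fake_read_cycles(void)
{
    fake_now += 14u;
    return fake_now;
}

/* Unsigned subtraction stays correct across one 32-bit counter wrap. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Cost of an empty measurement: two back-to-back counter reads. */
static uint32_t measure_overhead(cycle_reader_fn read_cycles)
{
    uint32_t t0 = read_cycles();
    uint32_t t1 = read_cycles();
    return elapsed_cycles(t0, t1);
}

/* Time n volatile loads as one block and return the average cycles per
 * load with the measurement overhead removed. The average still includes
 * the loop's own increment and branch. */
static double avg_cycles_per_load(cycle_reader_fn read_cycles,
                                  const volatile uint32_t *buf, size_t n,
                                  uint32_t overhead)
{
    assert(n > 0);
    uint32_t t0 = read_cycles();
    for (size_t i = 0; i < n; i++) {
        uint32_t sink = buf[i]; /* volatile access forces a real load */
        (void)sink;
    }
    uint32_t t1 = read_cycles();
    uint32_t total = elapsed_cycles(t0, t1);
    uint32_t net = total > overhead ? total - overhead : 0u;
    return (double)net / (double)n;
}
```

On the target, `read_pmccntr` would be passed as the reader. The larger the block, the smaller the share of the result that comes from the measurement itself.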
