Hello,
I am timing load and store instructions for a bare-metal program by stepping through execution using OpenOCD, reading the PMU cycle counter with single-cycle granularity. I am running the program on a single core of a Cortex-A9 on a Xilinx Zynq-7000 (I have a Zybo board).
I have tried several different cache configurations, and am now trying to make sense of the results.
First: All caches enabled
This histogram shows that the vast majority of LDR and STR instructions took 10 to 15 cycles (14 cycles in the raw data). I see this and think: okay, it takes about 14 cycles for an L1 cache hit.
Then I ran with the L1 caches disabled (so only the L2-cache):
Now a bunch of accesses have shifted to the right, taking ~35 cycles. Maybe that is how long an L2 hit takes? But why are there still so many ~14 cycle accesses? (From the raw data, these are both load and store instructions.)
This seems weird to me, so I dig through the docs and try to turn off features like pre-fetching and branch speculation... but that doesn't get rid of the 14 cycle accesses.
Any ideas as to what is going on? Is my "14 cycles = an L1 hit" assumption incorrect? Are there other cache options I should try turning off? Is my method of getting the instruction's cycle count flawed? 14 cycles seems like a lot to me, but I am assuming that this is a result of stepping through the program clearing the pipeline (instructions like ADD also take ~14 cycles).
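To make the binning I'm describing concrete, here is a small host-side sketch that buckets the per-instruction cycle counts into the rough latency classes above. The thresholds are assumptions taken from my own measurements, not documented Cortex-A9 figures:

```python
# Hypothetical helper: bucket stepped-instruction cycle counts into
# rough latency classes. Thresholds are guesses from the histograms
# in this post, not documented latencies.
from collections import Counter

def classify(cycles, l1_max=20, l2_max=60):
    """Return a rough label for one stepped-instruction cycle count."""
    if cycles <= l1_max:
        return "~14 cycles (L1 hit?)"
    if cycles <= l2_max:
        return "~35 cycles (L2 hit?)"
    return "slower (miss to DRAM?)"

def histogram(samples):
    """Count how many samples fall into each latency class."""
    return Counter(classify(c) for c in samples)

print(histogram([14, 14, 15, 35, 36, 120]))
```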
Thank you in advance for your help!
I have some concerns about your approach. I've not used OpenOCD before and don't know how it handles stepping; however, if it's similar to other debuggers, the process of stepping could be quite invasive, which would disrupt the timing of the instruction being stepped (compared to not stepping it).
The other concern is that the PMU isn't really designed to measure the timing of a single instruction in isolation, if for no other reason than that the time needed to take the measurements would be a significant portion of the measured time (if you did it from software). It also misses the interaction of an instruction with what's around it. It's more common to measure blocks rather than individual instructions.
Thanks for the response! Since I posted, I've confirmed that stepping the processor clears the pipeline, so the minimum cycle count for an instruction is 14 cycles when measured this way (I measured a short block as a whole vs. each instruction in the block individually). So, you are right about the invasive part!
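The check I did can be sketched like this: run a short block free-running and record the total cycles, then step the same block and sum the per-instruction counts; the difference per instruction estimates the overhead that stepping adds. The numbers below are illustrative, not my real measurements:

```python
# Sketch of the block-vs-stepped comparison described above.
def stepping_overhead(block_cycles, stepped_cycles):
    """Average extra cycles per instruction introduced by stepping.

    block_cycles:   total cycles for the block, free-running
    stepped_cycles: list of per-instruction counts when single-stepped
    """
    n = len(stepped_cycles)
    return (sum(stepped_cycles) - block_cycles) / n

# e.g. a 5-instruction block: 10 cycles free-running,
# 14 cycles per instruction when stepped
print(stepping_overhead(10, [14] * 5))  # 12.0 cycles of overhead/step
```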
So it seems like both an L1 hit and an L1 miss / L2 hit are under the 14-cycle floor that I can measure this way. I have tried to find the expected latencies in the Xilinx docs and have asked on their forums to confirm this, but don't have an answer yet.
I am reading the PMU from a second computer using OpenOCD, so I am not concerned about how much wall-clock time it takes to get the cycle counts: the device I'm testing is halted while I do this.
My goal is to be able to know the cache state at any point in a program's execution. I thought I could do this by executing the program one instruction at a time and watching the cycle counter on load instructions to see which ones result in a hit vs. a miss. It looks like this method will not work for the L1 caches (due to the pipeline-clearing issue), but I'm hoping it will still work for the L2 cache since the latencies are higher. I'm also recording the store instructions and can predict what will be in the cache based on previous loads and stores, but I really need to check this prediction against the hardware.
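The prediction side of this can be sketched as a small cache model fed with the recorded load/store addresses, whose hit/miss predictions you then check against the measured cycle counts. The geometry below is my reading of the Zynq-7000's L2 (512 KiB, 8-way, 32-byte lines), and LRU replacement is a simplifying assumption; the L2C-310's actual replacement policy (round-robin/pseudo-random) should be checked against the TRM before relying on this:

```python
# Minimal set-associative cache model (hedged sketch, see caveats above).
from collections import OrderedDict

class CacheModel:
    def __init__(self, size=512 * 1024, ways=8, line=32):
        self.line = line
        self.ways = ways
        self.sets = size // (ways * line)
        # one LRU-ordered dict of tags per set
        self.state = [OrderedDict() for _ in range(self.sets)]

    def access(self, addr):
        """Record a load/store; return True on a predicted hit."""
        line_addr = addr // self.line
        idx = line_addr % self.sets
        tag = line_addr // self.sets
        s = self.state[idx]
        if tag in s:
            s.move_to_end(tag)       # refresh LRU position
            return True
        if len(s) >= self.ways:      # evict least-recently-used way
            s.popitem(last=False)
        s[tag] = False               # value unused; keys are the tags
        return False

c = CacheModel()
print(c.access(0x1000))  # False: cold miss
print(c.access(0x1004))  # True: same 32-byte line
```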
Is there a better way? I don't see how to do it while measuring blocks instead of individual instruction cycle counts. I'm looking into the Level 2 Cache Controller (L2C-310) event counters. The PMU also has some counters related to the L1 cache, but I haven't had a chance to dig in yet. Is this a reasonable route?
Thanks!