Hello,
I am timing load and store instructions in a bare-metal program by stepping through execution with OpenOCD and reading the PMU cycle counter, which gives single-cycle granularity. I am running the program on a single core of the Cortex-A9 on a Xilinx Zynq-7000 (I have a Zybo board).
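For reference, the counter setup I'm describing is essentially the standard ARMv7-A PMU sequence (a minimal sketch, not my exact code; PMCR, PMCNTENSET, and PMCCNTR are the architectural CP15 registers):

```c
#include <stdint.h>

/* Enable the PMU cycle counter with single-cycle granularity. */
static inline void pmu_enable_ccnt(void)
{
    uint32_t pmcr;
    __asm__ volatile ("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr)); /* read PMCR */
    pmcr |= 1u << 0;        /* E: enable all counters */
    pmcr &= ~(1u << 3);     /* D=0: count every cycle, not every 64th */
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));
    /* PMCNTENSET.C: turn on the cycle counter itself */
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31));
}

/* Read the current cycle count (PMCCNTR). */
static inline uint32_t pmu_read_ccnt(void)
{
    uint32_t ccnt;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(ccnt));
    return ccnt;
}
```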
I have tried several different cache configurations, and am now trying to make sense of the results.
First: All caches enabled
The histogram shows the vast majority of LDR and STR instructions taking 10 to 15 cycles (14 cycles in the raw data). I see this and think: okay, an L1 cache hit takes about 14 cycles.
Then I ran with the L1 caches disabled (so only the L2 cache is active):
Now a bunch of accesses have shifted to the right, taking ~35 cycles. Maybe that is how long an L2 hit takes? But then why are there still so many ~14-cycle accesses? (From the raw data, these are both load and store instructions.)
This seemed weird to me, so I dug through the docs and tried turning off features like prefetching and branch prediction... but that didn't get rid of the 14-cycle accesses.
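For concreteness, this is the kind of toggling I mean (a sketch of the approach, not my exact code; SCTLR.Z is architectural ARMv7-A, while the ACTLR bit positions are my reading of the Cortex-A9 TRM, so treat those as assumptions):

```c
/* Turn off branch prediction and the Cortex-A9 prefetch engines.
 * SCTLR.Z is defined by ARMv7-A; the ACTLR bits below are what I
 * understand the A9 TRM to say, so please correct me if wrong. */
static inline void disable_prediction_and_prefetch(void)
{
    uint32_t sctlr, actlr;

    __asm__ volatile ("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    sctlr &= ~(1u << 11);               /* Z: program flow (branch) prediction */
    __asm__ volatile ("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr));

    __asm__ volatile ("mrc p15, 0, %0, c1, c0, 1" : "=r"(actlr));
    actlr &= ~((1u << 2) | (1u << 1));  /* L1 D-side prefetch, L2 prefetch hint */
    __asm__ volatile ("mcr p15, 0, %0, c1, c0, 1" :: "r"(actlr));

    __asm__ volatile ("isb" ::: "memory"); /* ensure the changes take effect */
}
```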
Any ideas as to what is going on? Is my assumption that 14 cycles = an L1 hit incorrect? Are there other cache options I should try turning off? Is my method of getting per-instruction cycle counts flawed? 14 cycles seems like a lot to me, but I am assuming it is overhead from single-stepping flushing the pipeline (simple instructions like ADD also take ~14 cycles).
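One sanity check I'm considering, in case the stepping itself is the problem: time a single load in-target, without the debugger, by bracketing it with back-to-back PMCCNTR reads and subtracting the read overhead. This is only a sketch (it reuses the pmu_read_ccnt() helper from the sketch above, and on a core like the A9 you would arguably want ISBs around the load to serialize it, which in turn changes what is measured):

```c
/* Rough cycle count for one LDR, measured without debugger stepping. */
static uint32_t time_one_ldr(volatile uint32_t *src)
{
    uint32_t empty_start, empty_end, start, end, val;

    /* Overhead of the measurement itself: two back-to-back counter reads. */
    empty_start = pmu_read_ccnt();
    empty_end   = pmu_read_ccnt();

    /* The same pair of reads, now with one load in between. */
    start = pmu_read_ccnt();
    __asm__ volatile ("ldr %0, [%1]" : "=r"(val) : "r"(src));
    end = pmu_read_ccnt();

    /* Load cost = measured span minus the empty-span overhead. */
    return (end - start) - (empty_end - empty_start);
}
```

If numbers from this look nothing like the stepped measurements, that would suggest the ~14-cycle floor is stepping/pipeline-flush overhead rather than a real L1 hit latency.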
Thank you in advance for your help!