
Load / Store timings with different cache settings

Hello,

I am timing load and store instructions for a bare-metal program by stepping through execution with OpenOCD and reading the PMU cycle counter, which has single-cycle granularity. The program runs on a single core of the Cortex-A9 on a Xilinx Zynq-7000 (I have a Zybo board).
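For concreteness, this is the kind of PMU setup I am relying on (a minimal sketch in C with GCC inline assembly, assuming the standard ARMv7-A CP15 encodings and code running at PL1; whether the registers are poked from code or from the debugger, the encodings are the same):

    #include <stdint.h>

    /* Enable the PMU and the cycle counter (PMCCNTR). */
    static inline void pmu_init(void)
    {
        uint32_t v;
        __asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v));        /* PMCR */
        v |= (1u << 0) | (1u << 2);  /* E = enable counters, C = reset cycle counter */
        __asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v));
        __asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31)); /* PMCNTENSET: bit 31 = cycle counter */
    }

    /* Read the current cycle count. */
    static inline uint32_t pmu_cycles(void)
    {
        uint32_t v;
        __asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));        /* PMCCNTR */
        return v;
    }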

I have tried several different cache configurations, and am now trying to make sense of the results.

First: All caches enabled

This histogram shows that the vast majority of LDR and STR instructions took 10 to 15 cycles (14 cycles in the raw data). I see this and think: okay, an L1 cache hit takes about 14 cycles.
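As a cross-check on the stepping-based numbers, I could also bracket a single load in software with two counter reads (using pmu_cycles() from the sketch above); the delta then includes the counter-read overhead, so it is an upper bound rather than an exact latency. A sketch, where buf is just a hypothetical word-aligned test buffer:

    static volatile uint32_t buf[1024];   /* hypothetical test buffer */

    static uint32_t time_one_load(volatile uint32_t *p)
    {
        uint32_t t0, t1, tmp;
        __asm volatile("isb");        /* serialize so the reads bracket the load */
        t0 = pmu_cycles();
        tmp = *p;                     /* the LDR under test */
        t1 = pmu_cycles();
        __asm volatile("isb");
        (void)tmp;
        return t1 - t0;               /* includes the counter-read overhead */
    }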

Then I ran with the L1 caches disabled (so only the L2 cache):

Now a bunch of accesses have shifted to the right, taking ~35 cycles. Maybe that is how long an L2 hit takes? But why are there still so many ~14-cycle accesses? (From the raw data, these are both load and store instructions.)
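For context, the way I disabled the L1 caches is roughly the following (a sketch assuming the ARMv7 SCTLR encoding: bit 2 = C, data/unified cache enable; bit 12 = I, instruction cache enable; a real sequence also has to clean and invalidate the D-cache before clearing C, which I have omitted here):

    static inline void l1_caches_disable(void)
    {
        uint32_t sctlr;
        __asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));  /* SCTLR */
        sctlr &= ~((1u << 2) | (1u << 12));  /* clear C (D-cache) and I (I-cache) */
        __asm volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr));
        __asm volatile("isb");
        /* NOTE: clean/invalidate of the D-cache before clearing C is omitted. */
    }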

Those leftover ~14-cycle accesses seemed weird to me, so I dug through the docs and tried turning off features like prefetching and branch speculation... but that didn't get rid of them.
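Concretely, this is the kind of sequence I tried (a sketch; the bit positions are my reading of the Cortex-A9 TRM — ACTLR bit 1 = L2 prefetch hint enable, bit 2 = L1 prefetch enable, and SCTLR bit 11 = Z, program flow prediction — so please double-check them for your silicon revision):

    static inline void disable_prefetch_and_branch_prediction(void)
    {
        uint32_t v;
        /* ACTLR (c1, c0, 1): clear the L1/L2 prefetch enables. */
        __asm volatile("mrc p15, 0, %0, c1, c0, 1" : "=r"(v));
        v &= ~((1u << 1) | (1u << 2));
        __asm volatile("mcr p15, 0, %0, c1, c0, 1" :: "r"(v));
        /* SCTLR (c1, c0, 0): clear Z to disable branch prediction. */
        __asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
        v &= ~(1u << 11);
        __asm volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(v));
        __asm volatile("isb");
    }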

Any ideas as to what is going on? Is my assumption that 14 cycles = an L1 hit incorrect? Are there other cache options I should try turning off? Is my method of getting an instruction's cycle count flawed? 14 cycles seems like a lot to me, but I am assuming that is a side effect of single-stepping flushing the pipeline (instructions like ADD also take ~14 cycles).
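One calibration I could do is to time a register-only instruction the same way and subtract it as a fixed baseline from the load/store numbers; if the ~14 cycles is mostly single-step/pipeline-refill overhead, the baseline should absorb it. A sketch using the software-bracketing approach from above:

    static uint32_t measurement_baseline(void)
    {
        uint32_t t0, t1, x = 0;
        t0 = pmu_cycles();
        __asm volatile("add %0, %0, #0" : "+r"(x));  /* a trivial ADD */
        t1 = pmu_cycles();
        return t1 - t0;   /* fixed overhead to subtract from other timings */
    }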

Thank you in advance for your help!
