
Load / Store timings with different cache settings

Hello,

I am timing load and store instructions for a bare-metal program by stepping through execution with OpenOCD and reading the PMU cycle counter, which has single-cycle granularity. The program runs on a single core of the Cortex-A9 on a Xilinx Zynq-7000 (I have a Zybo board).
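For concreteness, this is the kind of PMU setup I am relying on (a minimal sketch in C with GCC inline assembly, assuming the standard ARMv7-A CP15 encodings and code running at PL1; whether the registers are poked from code or from the debugger, the encodings are the same):

    #include <stdint.h>

    /* Enable the PMU and the cycle counter (PMCCNTR). */
    static inline void pmu_init(void)
    {
        uint32_t v;
        __asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v));        /* PMCR */
        v |= (1u << 0) | (1u << 2);  /* E = enable counters, C = reset cycle counter */
        __asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v));
        __asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31)); /* PMCNTENSET: bit 31 = cycle counter */
    }

    /* Read the current cycle count. */
    static inline uint32_t pmu_cycles(void)
    {
        uint32_t v;
        __asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));        /* PMCCNTR */
        return v;
    }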

I have tried several different cache configurations, and am now trying to make sense of the results.

First: All caches enabled

This histogram shows that the vast majority of LDR and STR instructions took 10 to 15 cycles (14 cycles in the raw data). I see this and think: okay, an L1 cache hit takes about 14 cycles.
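As a cross-check on the stepping-based numbers, I could also bracket a single load in software with two counter reads (using pmu_cycles() from the sketch above); the delta then includes the counter-read overhead, so it is an upper bound rather than an exact latency. A sketch, where buf is just a hypothetical word-aligned test buffer:

    static volatile uint32_t buf[1024];   /* hypothetical test buffer */

    static uint32_t time_one_load(volatile uint32_t *p)
    {
        uint32_t t0, t1, tmp;
        __asm volatile("isb");        /* serialize so the reads bracket the load */
        t0 = pmu_cycles();
        tmp = *p;                     /* the LDR under test */
        t1 = pmu_cycles();
        __asm volatile("isb");
        (void)tmp;
        return t1 - t0;               /* includes the counter-read overhead */
    }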

Then I ran with the L1 caches disabled (so only the L2 cache):

Now a bunch of accesses have shifted to the right, taking ~35 cycles. Maybe that is how long an L2 hit takes? But why are there still so many ~14-cycle accesses? (From the raw data, these are both load and store instructions.)
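For context, the way I disabled the L1 caches is roughly the following (a sketch assuming the ARMv7 SCTLR encoding: bit 2 = C, data/unified cache enable; bit 12 = I, instruction cache enable; a real sequence also has to clean and invalidate the D-cache before clearing C, which I have omitted here):

    static inline void l1_caches_disable(void)
    {
        uint32_t sctlr;
        __asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));  /* SCTLR */
        sctlr &= ~((1u << 2) | (1u << 12));  /* clear C (D-cache) and I (I-cache) */
        __asm volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr));
        __asm volatile("isb");
        /* NOTE: clean/invalidate of the D-cache before clearing C is omitted. */
    }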

Those leftover ~14-cycle accesses seemed weird to me, so I dug through the docs and tried turning off features like prefetching and branch speculation... but that didn't get rid of them.
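Concretely, this is the kind of sequence I tried (a sketch; the bit positions are my reading of the Cortex-A9 TRM — ACTLR bit 1 = L2 prefetch hint enable, bit 2 = L1 prefetch enable, and SCTLR bit 11 = Z, program flow prediction — so please double-check them for your silicon revision):

    static inline void disable_prefetch_and_branch_prediction(void)
    {
        uint32_t v;
        /* ACTLR (c1, c0, 1): clear the L1/L2 prefetch enables. */
        __asm volatile("mrc p15, 0, %0, c1, c0, 1" : "=r"(v));
        v &= ~((1u << 1) | (1u << 2));
        __asm volatile("mcr p15, 0, %0, c1, c0, 1" :: "r"(v));
        /* SCTLR (c1, c0, 0): clear Z to disable branch prediction. */
        __asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
        v &= ~(1u << 11);
        __asm volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(v));
        __asm volatile("isb");
    }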

Any ideas as to what is going on? Is my assumption that 14 cycles = an L1 hit incorrect? Are there other cache options I should try turning off? Is my method of getting an instruction's cycle count flawed? 14 cycles seems like a lot to me, but I am assuming that is a side effect of single-stepping flushing the pipeline (instructions like ADD also take ~14 cycles).
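One calibration I could do is to time a register-only instruction the same way and subtract it as a fixed baseline from the load/store numbers; if the ~14 cycles is mostly single-step/pipeline-refill overhead, the baseline should absorb it. A sketch using the software-bracketing approach from above:

    static uint32_t measurement_baseline(void)
    {
        uint32_t t0, t1, x = 0;
        t0 = pmu_cycles();
        __asm volatile("add %0, %0, #0" : "+r"(x));  /* a trivial ADD */
        t1 = pmu_cycles();
        return t1 - t0;   /* fixed overhead to subtract from other timings */
    }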

Thank you in advance for your help!
