Cortex-A9/A15 L1 d-cache architecture

Note: This was originally posted on 21st March 2012 at http://forums.arm.com

Dear friends,

I'm a PhD candidate at the Complutense University of Madrid. I'm doing research on memory allocation over the memory hierarchy, and I've built a trace-based simulator for memory hierarchies (it's slightly different from existing ones such as Dinero, so I had to build it anew).

I'm using this simulator to compare the performance of different allocation policies over different memory hierarchies, including comparisons between hardware-managed caches and software-managed memories. For the cache-based systems, I'm using the Cortex-A9 and Cortex-A15 cache configurations as a basis. However, I have a question about them and I would like to get as much information as possible before proceeding. Please note that I'm not trying to compare the ARM solutions to anything else, but rather to compare software methods for taking advantage of the available memory hierarchies.

My problem is that I know that the L1 data cache is configured as a 32 KB block, 2-way associative, with 64-byte lines, but I can't find any reference to the number of banks into which it's organized. My question is: "Is the cache organized into 8 banks?" That would make sense, as a memory access from the processor would then read just the 64 bits that contain the word. But it's also possible that the cache is configured into 16 banks, so the processor reads 32 bits instead. Or it could even be that the cache is divided into fewer banks and the processor uses some method for internal storage of data just read from the cache...
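
The bank counts in the question follow from dividing the line size by the access width. A minimal sketch of that arithmetic is below. The function name `banks_for_access_width` is hypothetical, and the sketch assumes one bank per access-width slice of a line, which is a modeling assumption and not a documented ARM design:

```python
def banks_for_access_width(line_bytes: int, access_bits: int) -> int:
    """Number of banks if each bank supplies one access-width slice of a line.

    Assumption: banks are interleaved at access-width granularity.
    """
    access_bytes = access_bits // 8
    if access_bytes <= 0 or line_bytes % access_bytes != 0:
        raise ValueError("line size must be a multiple of the access width")
    return line_bytes // access_bytes


# 64-byte line with the two access widths mentioned in the question.
print(banks_for_access_width(64, 64))  # 8
print(banks_for_access_width(64, 32))  # 16
```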

Also, I'm using the energy consumption values calculated by CACTI 5.3. However, I would really appreciate it if anyone could tell me whether it's possible to get the actual numbers for any ARM parts (I mean, given a manufacturer and a feature size). That way, I could obtain more precise results.

Thank you in advance for your help!

Miguel

Miguel Peón Quirós over 12 years ago

Note: This was originally posted on 22nd March 2012 at http://forums.arm.com

Thanks for your comments!

[...] I have a correction: on Cortex-A9 the L1 caches are 4-way set associative and cache lines are only 32 bytes large. [...]

Thanks! I just made a mistake when writing the message ;-)

[...]From what I understand banking is useful to allow multiple accesses (to separate banks) per cycle, with the number of banks decreasing the number of collisions and usually correlating with the read size. [..] This line in the TRM seems to suggest the entire cache line is accessed, with a buffer to prevent accessing the same cache line consecutively [...]

Indeed, banking is even more important for reducing the energy consumption of the L1 data cache! That was the whole point of my question. Accessing the whole cache line for every word consumes more energy than activating only the bank that needs to be read. The other well-known option is line buffering, where a small memory the size of one line is put in front of the data array so that, if accesses are consecutive, the array is not accessed again. The line you point out from the technical reference manual could mean that they use line buffering, or that the tag comparison is avoided and the data array is accessed directly. However, in the academic literature it is usually accepted that banking has a higher impact on the energy consumption of data caches, because line buffering depends on a high sequentiality of accesses (thus, it's more useful for instruction caches).
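
As a rough illustration of the trade-off described above, the sketch below compares the relative read energy of three organizations over a byte-address trace: reading the full line every time, reading only one bank, and putting a line buffer in front of an unbanked array. The function name `read_energy` and the unit costs `e_bank` and `e_buffer` are invented for illustration; they are not CACTI outputs or ARM figures, and the model ignores tag lookups and cache misses.

```python
def read_energy(trace, line_bytes=32, word_bytes=8, e_bank=10, e_buffer=1):
    """Relative read energy of a byte-address trace under three organizations.

    e_bank and e_buffer are made-up unit costs, not measured values.
    """
    banks = line_bytes // word_bytes
    totals = {"full_line": 0, "banked": 0, "line_buffered": 0}
    buffered_line = None  # line currently held in the line buffer
    for addr in trace:
        line = addr // line_bytes
        totals["full_line"] += banks * e_bank  # every bank is activated
        totals["banked"] += e_bank             # only the bank holding the word
        totals["line_buffered"] += e_buffer    # the buffer is always probed
        if line != buffered_line:              # buffer miss: read the whole line
            totals["line_buffered"] += banks * e_bank
            buffered_line = line
    return totals


sequential = [0, 8, 16, 24]  # four words of one line
strided = [0, 32, 64, 96]    # one word from each of four lines
print(read_energy(sequential))
print(read_energy(strided))
```

Under these assumed costs the line buffer comes close to banking on the sequential trace but costs slightly more than a plain full-line read on the strided one, which is the dependence on sequentiality mentioned above.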

[...] The mention of banking for L2 and not L1 cache seems conspicuous if there's banking on both. It's possible something else is used for L1 parallelism. [...]

The really surprising point is that, in the literature, it is usually acknowledged that reducing the energy consumption of the L1 data cache is much more important than reducing that of the L2, so the logical step would have been to apply banking at least to the L1...

[...] Maybe tags are duplicated instead of banked. The cache RAM itself could be read + write ported. [...]

Yes, I agree, if the goal were performance, but ARM processors have usually been focused on energy efficiency...

[...] Unfortunately, I doubt you'll get an official explanation. [...]

Well, I hope I can get this one. After all, making this information public shouldn't be so critical. Other processors include this information in their datasheets... For instance, Intel does include it for all the current architectures...

In any case, I really appreciate your help and the chance to discuss these issues with someone else. Thanks a lot for your comments!

Miguel
Gilead Kutnick over 12 years ago

Note: This was originally posted on 21st March 2012 at http://forums.arm.com

Hi Miguel,

This isn't a direct answer to your question but I have a correction: on Cortex-A9 the L1 caches are 4-way set associative and cache lines are only 32 bytes large. Getting that correct will probably have a larger impact on your modeling. Your specifications are correct for Cortex-A15.
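
For a trace-driven simulator, the two configurations translate into set counts and address-bit splits as sketched below. The helper name `cache_geometry` is hypothetical, the 32 KB size is the one stated in this thread, and power-of-two sizes are assumed:

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> dict:
    """Set count and address-bit split for a set-associative cache.

    Assumes size, associativity and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line_bytes)
        "index_bits": sets.bit_length() - 1,         # log2(sets)
    }


# Parameters as given in this thread, both with a 32 KB L1 data cache.
print(cache_geometry(32 * 1024, ways=4, line_bytes=32))  # Cortex-A9
print(cache_geometry(32 * 1024, ways=2, line_bytes=64))  # Cortex-A15
```

Both cases come out at 256 sets, but the offset field is 5 bits in the first and 6 bits in the second, so the same address trace maps to sets differently.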

As for bank organization, I don't really know for sure, but I suspect that Cortex-A9 isn't banked. From what I understand, banking is useful to allow multiple accesses (to separate banks) per cycle, with a higher number of banks decreasing the number of collisions and the bank width usually correlating with the read size. On Cortex-A9 the dcache interface is 64 bits wide, so if it were banked I'd expect there to be 4 banks. This line in the TRM seems to suggest the entire cache line is accessed, with a buffer to prevent accessing the same cache line consecutively:

"To reduce power consumption, the number of full cache reads is reduced by taking advantage of the sequential nature of many cache operations. If a cache read is sequential to the previous cache read, and the read is within the same cache line, only the data RAM set that was previously read is accessed."

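One way to read the quoted sentence in a trace-driven model is sketched below: a read counts as "reduced" when it immediately follows the previous read in address order and stays in the same line, and as "full" otherwise. The function name `classify_reads`, the 8-byte access width, and the interpretation of "sequential" as "previous address plus one access width" are assumptions, not details taken from the TRM; every read is also assumed to hit.

```python
def classify_reads(trace, line_bytes=32, access_bytes=8):
    """Count (full, reduced) reads in a byte-address trace.

    Assumption: a read is 'reduced' only if it directly follows the previous
    read by one access width and lies in the same cache line.
    """
    full = reduced = 0
    prev = None
    for addr in trace:
        sequential = prev is not None and addr == prev + access_bytes
        same_line = prev is not None and addr // line_bytes == prev // line_bytes
        if sequential and same_line:
            reduced += 1
        else:
            full += 1
        prev = addr
    return full, reduced


# Six consecutive 8-byte reads spanning two 32-byte lines.
print(classify_reads([0, 8, 16, 24, 32, 40]))  # (2, 4)
```
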
Cortex-A15 does allow a load and a store simultaneously in the same cycle. No information is given on banking for L1 dcache, but it's noteworthy that banking information IS given for its L2 cache, which specifies 4 banks for tags and 4 banks for data. This is similar to the banking described for the L2 cache on Cortex-A8. The mention of banking for L2 and not L1 cache seems conspicuous if there's banking on both. It's possible something else is used for L1 parallelism. Maybe tags are duplicated instead of banked. The cache RAM itself could be read + write ported.
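
To make the collision argument concrete, here is a small sketch of how a hypothetical word-interleaved banked L1 would decide whether a load and a store issued in the same cycle collide. Nothing here reflects a documented Cortex-A15 design; `bank_of`, `conflicts`, the 8-byte bank width, and the bank counts are all illustrative assumptions.

```python
def bank_of(addr: int, bank_bytes: int = 8, n_banks: int = 4) -> int:
    """Bank index under simple low-order interleaving (an assumed scheme)."""
    return (addr // bank_bytes) % n_banks


def conflicts(load_addr: int, store_addr: int,
              bank_bytes: int = 8, n_banks: int = 4) -> bool:
    """True if both accesses map to the same bank and so could not overlap."""
    return (bank_of(load_addr, bank_bytes, n_banks)
            == bank_of(store_addr, bank_bytes, n_banks))


print(conflicts(0, 8))               # False: adjacent words, different banks
print(conflicts(0, 32))              # True: both map to bank 0 with 4 banks
print(conflicts(0, 32, n_banks=8))   # False: more banks remove this collision
```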

Unfortunately, I doubt you'll get an official explanation.