
Cortex-A9/A15 L1 d-cache architecture

Note: This was originally posted on 21st March 2012 at http://forums.arm.com

Dear friends,

I'm a PhD candidate at the Complutense University of Madrid. I'm doing research on memory allocation across the memory hierarchy, and I've built a trace-based simulator for memory hierarchies (it's slightly different from existing ones such as Dinero, so I had to build it anew).

I'm using this simulator to compare the performance of different allocation policies over different memory hierarchies, including comparisons between hardware-managed caches and software-managed memories. For the cache-based systems, I'm using the Cortex-A9 and Cortex-A15 cache configurations as a basis. However, I have a question about them and would like to get as much information as possible before proceeding. Please note that I'm not trying to compare the ARM solutions to anything else, but rather software methods for taking advantage of the available memory hierarchies.
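As a rough illustration of what I mean by a trace-driven model, here is a minimal, generic set-associative LRU sketch in Python (it is not my actual simulator, and all parameters and names here are made up just for the example):

# Minimal sketch of a trace-driven set-associative cache model with LRU
# replacement. Purely illustrative; not the simulator described above.

from collections import OrderedDict

class SetAssociativeCache:
    def __init__(self, size_bytes, ways, line_size):
        self.ways = ways
        self.line_size = line_size
        self.num_sets = size_bytes // (ways * line_size)
        # One ordered dict per set: tag -> None, ordered by recency of use.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = self.misses = 0

    def access(self, address):
        line = address // self.line_size
        index, tag = line % self.num_sets, line // self.num_sets
        lines_in_set = self.sets[index]
        if tag in lines_in_set:            # hit: refresh LRU order
            lines_in_set.move_to_end(tag)
            self.hits += 1
        else:                              # miss: evict the LRU line if the set is full
            if len(lines_in_set) >= self.ways:
                lines_in_set.popitem(last=False)
            lines_in_set[tag] = None
            self.misses += 1

# Replay a tiny (hypothetical) address trace through a 32 KB, 2-way,
# 64-byte-line configuration like the one discussed below.
cache = SetAssociativeCache(32 * 1024, ways=2, line_size=64)
for addr in (0x1000, 0x1004, 0x9000, 0x1040, 0x1000):
    cache.access(addr)
print(cache.hits, cache.misses)   # 2 hits, 3 misses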

My problem is that I know the L1 data cache is configured as a 32 KB block, 2-way associative, with 64-byte lines, but I can't find any reference to the number of banks into which it's organized. My question is: "Is the cache organized into 8 banks?" That would make sense, because then a memory access from the processor would read just the 64 bits that contain the word. But it's also possible that the cache is configured into 16 banks, so the processor reads 32 bits instead. Or it could even be that the cache is divided into fewer banks and the processor uses some method for internal storage of data just read from the cache...
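To make the alternatives concrete, here is a small Python sketch of how the bank count maps to the width read per access; only the 32 KB / 2-way / 64-byte figures come from the configuration above, and the bank counts are hypothetical:

# How the (hypothetical) bank count determines the read width per access,
# for the L1 D-cache geometry quoted above (32 KB, 2-way, 64-byte lines).

CACHE_SIZE = 32 * 1024   # bytes
WAYS       = 2
LINE_SIZE  = 64          # bytes

sets = CACHE_SIZE // (WAYS * LINE_SIZE)   # 256 sets -> 8 index bits

for banks in (4, 8, 16):
    bank_width_bytes = LINE_SIZE // banks   # bytes read from one bank per access
    print(f"{banks:2d} banks -> {bank_width_bytes * 8:3d}-bit read per bank "
          f"({sets} sets per way)")

With 8 banks each access would read 64 bits, and with 16 banks 32 bits, which are exactly the two organizations I'm asking about.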

Also, I'm using the energy consumption values calculated by CACTI 5.3. However, I would really appreciate it if anyone could tell me whether it's possible to get the actual numbers for any ARM parts (I mean, for a given manufacturer and feature size). That way, I could produce more precise results.
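Roughly speaking, such numbers would feed a simple energy-per-access accounting over the trace; here is a minimal sketch with placeholder values (they are not figures for any real ARM part or process node):

# Combining per-access energies (e.g., from CACTI) with the hit/miss counts
# produced by the simulator. All energy values are placeholders.

E_L1_READ   = 0.010e-9   # J per L1 read                  (placeholder)
E_L1_WRITE  = 0.012e-9   # J per L1 write                 (placeholder)
E_L2_ACCESS = 0.150e-9   # J per L2 access on an L1 miss  (placeholder)

def trace_energy(l1_reads, l1_writes, l1_misses):
    """Total dynamic energy for one trace, ignoring leakage and DRAM."""
    return (l1_reads  * E_L1_READ +
            l1_writes * E_L1_WRITE +
            l1_misses * E_L2_ACCESS)

print(trace_energy(l1_reads=1_000_000, l1_writes=400_000, l1_misses=20_000))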

Thank you in advance for your help!

Miguel
Reply
  • Note: This was originally posted on 22nd March 2012 at http://forums.arm.com





    Thanks for your comments!

    [...] I have a correction: on Cortex-A9 the L1 caches are 4-way set associative and cache lines are only 32 bytes large. [...]

    Thanks! I just made a mistake when writing the message ;-)

    [...]From what I understand banking is useful to allow multiple accesses (to separate banks) per cycle, with the number of banks decreasing the number of collisions and usually correlating with the read size. [..] This line in the TRM seems to suggest the entire cache line is accessed, with a buffer to prevent accessing the same cache line consecutively [...]

    Indeed, banking is even more important as a means to reduce the energy consumption of the L1 data cache! That was precisely the source of my doubts: accessing the whole cache line for every word consumes more energy than activating only the bank that contains the word to be read. The other well-known option is line buffering, where a small memory the size of a line is placed in front of the data array so that, if accesses are consecutive, the array is not accessed again. The line you point out from the technical reference could mean that they use line buffering, or that the tag comparison is avoided and the data array is accessed directly. However, in the academic literature it is usually accepted that banking has a higher impact on the energy consumption of data caches, because line buffering depends on a high degree of sequentiality in the accesses (and is thus more useful for instruction caches).
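    As a rough illustration of that trade-off, here is a small Python sketch comparing per-access banking against line buffering on a sequential and a strided pattern (the energy figures and the bank count are made-up placeholders, not a description of any real ARM cache):

# Banking vs. line buffering, with placeholder energies in picojoules.

LINE_SIZE   = 64                    # bytes per cache line
BANKS       = 8                     # hypothetical number of data-array banks
E_FULL_LINE = 8.0                   # pJ to read the whole line from the data array
E_ONE_BANK  = E_FULL_LINE / BANKS   # pJ to activate only the bank holding the word
E_BUFFER    = 0.2                   # pJ for a hit in a line-sized buffer

def line_buffer_energy(addresses):
    """Full-line read on each new line, cheap buffer hit while staying on it."""
    energy, current_line = 0.0, None
    for addr in addresses:
        line = addr // LINE_SIZE
        if line != current_line:
            energy, current_line = energy + E_FULL_LINE, line
        else:
            energy += E_BUFFER
    return energy

sequential = list(range(0, 1024, 4))                   # word-by-word walk over 1 KB
strided    = [(i * 193) % 1024 for i in range(256)]    # every access lands on a new line

for name, trace in (("sequential", sequential), ("strided", strided)):
    banked = len(trace) * E_ONE_BANK
    print(f"{name:10s}  banking: {banked:7.1f} pJ   "
          f"line buffer: {line_buffer_energy(trace):7.1f} pJ")

    With these placeholder numbers the line buffer only wins while accesses stay within a line, which is exactly why it suits instruction fetch better than data accesses.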

    [...] The mention of banking for L2 and not L1 cache seems conspicuous if there's banking on both. It's possible something else is used for L1 parallelism. [...]

    The really surprising point is that, in the literature, it is usually acknowledged that reducing the energy consumption of the L1 data cache is much more important than reducing that of the L2, so the logical step would have been to apply banking at least to the L1...

    [...] Maybe tags are duplicated instead of banked. The cache RAM itself could be read + write ported. [...]

    Yes, I would agree if the goal were performance, but ARM processors have usually been focused on energy efficiency...

    [...] Unfortunately, I doubt you'll get an official explanation. [...]

    Well, I hope I can get an answer to this one. After all, it shouldn't be so critical to make this information public. Other processors include this information in their datasheets... For instance, Intel does include it for all of its current architectures...

    In any case, I really appreciate your help and the chance to discuss these issues with someone else. Thanks a lot for your comments!

    Miguel

