I'm thinking about using a Cortex-A7 "bare-metal" where I don't need much memory, so I'd like to avoid using external memory.
The CPU boots from an external 4 MB SPI NOR flash chip.
It has 512 KBytes of L2 cache and 32 KBytes of internal SRAM that is just used during initial boot since it's so slow.
Using MMU and L2 cache configuration, is there a way to fill the whole L2 cache with code/data?
Since the internal SRAM is 16 times smaller than the L2 cache, it might be tricky.
Could the following work?
1. The CPU boots initial code from SPI flash (something like 8-16 KB, let's say at 0x00000000, where the SRAM is located).
2. The MMU is configured so that this boot-loader code/data is never cached.
Then,
3. The CPU loads one 16 KB block from SPI flash and writes it to a fixed address in internal SRAM (0x00004000).
4. The CPU reads 16 KB of data from increasing addresses:
for the 1st block: 0x80000000-0x80003fff
for the 2nd block: 0x80004000-0x80007fff
... and so on ... with the MMU/L2 cache configured so that those addresses always map to 0x00004000-0x00007fff, where the block is located (the question is here: can this be done?).
5. Those reads provoke L2 cache misses, which fill 16 KB of the L2 cache with the block data.
6. Repeat steps 3-4-5 32 times to fill the whole 512 KB of L2 cache.
7. Configure the MMU and L1 caches (or maybe that must also be done in the previous steps?).
8. Jump to the program entry point (so, somewhere between 0x80000000 and 0x80000000 + 512 KB). A rough sketch of the whole sequence is below.
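To make that concrete, here is roughly the loop I have in mind. It's only a sketch: spi_read_block() is a placeholder for whatever the SPI NOR driver ends up looking like, and the aliasing in step 4 (all 32 blocks mapping onto the same SRAM window) is exactly the part I don't know the MMU/L2 can do.

#include <stdint.h>

#define SRAM_BLOCK  ((volatile uint32_t *)0x00004000u)  /* 16 KB staging buffer in internal SRAM */
#define BLOCK_SIZE  (16u * 1024u)
#define L2_SIZE     (512u * 1024u)
#define IMAGE_BASE  0x80000000u                          /* virtual address the image will run from */

/* hypothetical SPI NOR driver call */
extern void spi_read_block(uint32_t flash_offset, void *dst, uint32_t len);

static void prefill_l2(void)
{
    for (uint32_t block = 0; block < L2_SIZE / BLOCK_SIZE; block++) {
        /* step 3: copy one 16 KB block from SPI flash into the SRAM window */
        spi_read_block(block * BLOCK_SIZE, (void *)SRAM_BLOCK, BLOCK_SIZE);

        /* steps 4-5: read it back through the cacheable alias so every line
         * misses in L2 and gets allocated there with the 0x8000xxxx tag */
        volatile uint32_t *alias = (volatile uint32_t *)(IMAGE_BASE + block * BLOCK_SIZE);
        for (uint32_t i = 0; i < BLOCK_SIZE / 4u; i++)
            (void)alias[i];
    }
}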
Hi Matt,
Yes, the key is to have an area whose size equals the L2 size where the core can do reads/writes without locking up, even if those writes go nowhere and the read data are meaningless.
I can see two ways of doing this that have worked for me on A7 & A8.
The first is to init the DRAM controller as if a chip were physically installed on the board, even though there isn't one.
The second is lazier (it doesn't need any initialization) but SoC-dependent: find an area where it's possible to do those reads/writes without locking. Yeah, that one is quite bad, I admit, but it also worked on the two SoCs I have here.
Then the 1 MB section that contains this area is marked as cacheable (and it's the only one), and the MMU is enabled; a sketch of that table setup is below.
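Roughly what my setup looks like, using the ARMv7 short-descriptor format. Treat it as a sketch rather than a reference: the physical address of the "garbage" area, the write-back/write-allocate attributes and the reset state (caches/TLB already invalidated) are all assumptions from my own boards.

#include <stdint.h>

/* 4096-entry first-level table, must be 16 KB aligned */
static uint32_t l1_table[4096] __attribute__((aligned(16384)));

#define SECTION   0x2u                                   /* descriptor type = 1 MB section      */
#define AP_RW     (0x3u << 10)                           /* full read/write access              */
#define WBWA      ((1u << 12) | (1u << 3) | (1u << 2))   /* TEX=001 C=1 B=1: write-back, write-allocate */
#define AREA_VA   0x80000000u                            /* the single cacheable 1 MB section   */
#define AREA_PA   0x80000000u                            /* wherever the "garbage" area lives on your SoC */

static void mmu_setup(void)
{
    /* flat, strongly-ordered mapping for everything (SRAM, peripherals, flash)... */
    for (uint32_t i = 0; i < 4096u; i++)
        l1_table[i] = (i << 20) | AP_RW | SECTION;

    /* ...except the one section that is allowed to allocate into L1/L2 */
    l1_table[AREA_VA >> 20] = AREA_PA | AP_RW | WBWA | SECTION;

    asm volatile(
        "mcr p15, 0, %2, c2, c0, 2 \n"   /* TTBCR = 0 (use TTBR0 only)  */
        "mcr p15, 0, %0, c2, c0, 0 \n"   /* TTBR0 = table base          */
        "mcr p15, 0, %1, c3, c0, 0 \n"   /* DACR: domain 0 = client     */
        "mcr p15, 0, %2, c8, c7, 0 \n"   /* TLBIALL                     */
        "dsb                       \n"
        "isb                       \n"
        "mrc p15, 0, r0, c1, c0, 0 \n"   /* SCTLR                       */
        "orr r0, r0, #(1 << 0)     \n"   /* M: MMU on                   */
        "orr r0, r0, #(1 << 2)     \n"   /* C: data caches on           */
        "orr r0, r0, #(1 << 12)    \n"   /* I: I-cache on               */
        "mcr p15, 0, r0, c1, c0, 0 \n"
        "isb"
        : : "r"(l1_table), "r"(1u), "r"(0u) : "r0", "memory");
}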
Consecutive reads covering the L2 size are then performed from this area, and at this point L2 is fully populated with meaningless garbage from consecutive "area" addresses.
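The prefill itself is trivial, something like this (64-byte line size and 512 KB L2 assumed; one read per line is enough to allocate it):

#include <stdint.h>

#define L2_SIZE    (512u * 1024u)
#define LINE_SIZE  64u   /* A7/A8 cache line length */

static void fill_l2(volatile uint8_t *area)
{
    /* touch one byte per cache line so every L2 line gets allocated
     * with whatever the bus returns for this area */
    for (uint32_t off = 0; off < L2_SIZE; off += LINE_SIZE)
        (void)area[off];
}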
Any further writes to this area will update L1 for sure and possibly L2, but shouldn't have to update L3. Even if the core did want to write back a new L2 word to L3, it would not lock, so we wouldn't have a problem there.
Any further read will be performed from L1 if the data is there, or from L2 (bringing the data back into L1), but should not have to read from L3 since that data is already all in L2, and that's where the whole 'unsafe' thing comes in...
In practice it works for me and code execution proceeds normally, but I would not guarantee this method is problem-free on the A7. On the A8, L2 can be locked down, so I'd be less worried.
Note that I don't know what L1 does, or what complex pseudo-random algorithm it may run when it's time to evict data or take speculative action, but the point is that, at any time, all L1 I-cache & D-cache words are somewhere in L2.
Actually, code that executes solely from L2 and reads/writes its data there behaves the same as if it were located in DRAM or SRAM, except that DMA can't be performed to the area backed by L2.
Peripherals and SRAM are located in non-cacheable areas, so they are not a problem. The only thing left is the boot-loader, which must perform the init mentioned above and write the code & data to the L2 area.
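If it helps, the tail end of my boot-loader is roughly the following. copy_from_spi() and IMAGE_SIZE are placeholders, and the maintenance loop assumes the point of unification on the A7 is the integrated L2, so cleaning the D-cache to PoU pushes the fresh code into L2 without any external write.

#include <stdint.h>

#define AREA_VA     0x80000000u
#define IMAGE_SIZE  (128u * 1024u)

/* hypothetical: copies the program image out of the boot flash */
extern void copy_from_spi(uint32_t src, void *dst, uint32_t len);

static void boot_from_l2(void)
{
    /* overwrite the garbage in the cacheable area with the real code/data */
    copy_from_spi(0, (void *)AREA_VA, IMAGE_SIZE);

    /* make the new code visible to instruction fetches: clean D-cache lines
     * to the PoU, then invalidate the I-cache */
    for (uint32_t a = AREA_VA; a < AREA_VA + IMAGE_SIZE; a += 64u)
        asm volatile("mcr p15, 0, %0, c7, c11, 1" : : "r"(a));      /* DCCMVAU */
    asm volatile("dsb \n"
                 "mcr p15, 0, %0, c7, c5, 0 \n"                     /* ICIALLU */
                 "dsb \n"
                 "isb" : : "r"(0u) : "memory");

    ((void (*)(void))AREA_VA)();   /* jump to the program entry point */
}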
Regards,
L.
You're very lucky that your DRAM controller returns garbage data if there's no DRAM connected. The more common scenario is that it locks the bus by never allowing a transaction to finish (because it's waiting on data from RAM that doesn't exist, and so can never arrive).
The DRAM controller blindly assumes the DRAM chip honors its read/write requests according to the various clocks it delivers relative to all the timing information it's been configured with.
It will not wait on the DRAM IC, so it's not a problem if the chip is not there.
Hi 0xffff,
Understood, but again, if you do move to a core with lockdown features you might not get the same DRAM controller, and therefore not such an easy way to 'pollute' the cache with read-allocated garbage in order to write over it and then lock it.
Any progress?
I am currently working with an Allwinner A13 (ARM).
I am looking at how to avoid the external DDR3 memory, because the program size is less than 128 KB: boot from SD card with a loader that feeds the program into the L2 cache.
As per the multiple replies higher up the thread, this isn't possible in a reliable manner unless the CPU supports cache lockdown, which the Cortex-A cores do not.
HTH,
Pete