I'm thinking about using a Cortex-A7 bare-metal where I don't need much memory, so I'd like to avoid using external memory.
The CPU boots from an external 4 MB SPI NOR flash chip.
It has 512 KB of L2 cache and 32 KB of internal SRAM that is only used during initial boot since it's so slow.
Using MMU and L2 cache configurations, I am wondering if there is a way to fill the whole L2 cache with code/data.
Since the internal SRAM is 16 times smaller than the L2 cache, it might be tricky.
Could the following work?
1. CPU boots initial code from SPI flash (like 8-16 KB, let's say at 0x00000000 where the SRAM is located)
2. First, MMU is configured so that this bootloader code/data is never cached.
Then,
3. CPU loads one 16 KB block from SPI flash and writes it at a fixed address in internal SRAM (0x00004000)
4. CPU reads 16KB of data from increasing addresses:
for 1st block : 0x80000000-0x80003fff
for 2nd block: 0x80004000-0x80007fff
... and so on ... with MMU/L2 cache configured so that those addresses always map to 0x00004000 - 0x00007fff where the block is located (the question is here: can this be done?)
5. Those reads provoke L2 cache misses, which fill 16 KB of L2 cache with the block data.
6. Repeat steps 3-4-5 32 times to fill the whole 512 KB of L2 cache.
7. Configure the MMU and L1 caches (or maybe that must also be done in the previous steps?)
8. Jump to the program entry point (so, somewhere between 0x80000000 and 0x80000000 + 512 KB).
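As a sketch, the loop in steps 3-6 above might look like the following host-compilable C. The addresses and sizes are the ones from the steps; `spi_read_block`, `remap_to_sram`, and `read_every_line` are hypothetical stand-ins for the SPI driver, the MMU window remap, and the cache-priming reads, not real APIs:

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE  (16 * 1024)             /* one SRAM staging window  */
#define L2_SIZE     (512 * 1024)            /* whole L2 cache           */
#define NUM_BLOCKS  (L2_SIZE / BLOCK_SIZE)  /* 32 iterations            */
#define VIRT_BASE   0x80000000u             /* cacheable target area    */
#define SRAM_WINDOW 0x00004000u             /* fixed SRAM block address */

/* Virtual address of block n in the cacheable target region. */
static uint32_t block_virt(uint32_t n) { return VIRT_BASE + n * BLOCK_SIZE; }

/* Hypothetical stand-ins for the real bare-metal operations. */
static void spi_read_block(uint32_t flash_off, uint32_t sram_dst) { (void)flash_off; (void)sram_dst; }
static void remap_to_sram(uint32_t virt) { (void)virt; }
static void read_every_line(uint32_t virt, uint32_t len) { (void)virt; (void)len; }

static void fill_l2_from_flash(void) {
    for (uint32_t n = 0; n < NUM_BLOCKS; n++) {
        spi_read_block(n * BLOCK_SIZE, SRAM_WINDOW);  /* step 3: stage in SRAM   */
        remap_to_sram(block_virt(n));                 /* step 4: alias to SRAM   */
        read_every_line(block_virt(n), BLOCK_SIZE);   /* step 5: miss into L2    */
    }
}
```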
This isn't how the L2 in the Cortex-A7 was intended to be used.
The first problem is that the architecture allows cache lines to be speculatively filled and evicted, meaning that there is no guarantee that a given line will stay in the cache. The processor might attempt to write it back to memory - which in this case doesn't exist.
Cache line locking would fix this, but cache lockdown is not supported on the Cortex-A7 (or any of the other recent Cortex-A processors). You can reduce the possibility of eviction by only mapping as much cacheable memory as you have cache space. However, that doesn't actually guarantee you won't get evictions, it just makes them unlikely.
Hi Martin, thanks for your answer,
Normally I wouldn't play much with the cache/MMU; I just configure the MMU at startup, then let it do its job, and it works well for me.
The only cache operations I do are invalidations after some DMA transfers to cacheable areas.
I can see that the functioning of the cache & MMU is quite complex (the ARMv7-A Architecture Reference Manual is a "mammoth" book!!!), but with such complexity I hope there is a trick/loophole to achieve what I want.
With external memory, I am able to fill the whole L2 cache. For that, I read (and print) the content of 512 KB of cacheable external memory (the size of my L2 cache), then do a DMA transfer to overwrite this memory area with new data, and re-read and re-print those 512 KB. It still shows the old data, proving that it was all put in L2 cache (the L1 DCache is only 32 KB). If I try to do this with more than 512 KB, I start to get random inconsistencies in blocks of 64 bytes, which makes sense.
I haven't tried to execute code yet, but it seems logical that the L1 ICache & DCache would get filled with data grabbed entirely from the L2 cache, since it is all there, without the need to read from external memory.
I'm not too worried if the CPU attempts to write back data to external memory; I could configure the memory controller as if the RAM chip were there, so writes will not block anything but will go nowhere.
However, any read would obviously return garbage.
The problem is, since I will not have external memory and only have 64 KB of SRAM, I can't figure out how I could fill the whole 512 KB L2.
I've tried using the MMU and 2-level tables (for one of the 4096 1 MB section descriptors, I used one 256-entry table of 4 KB page descriptors).
I mapped those 256 4 KB entries to the same 4 KB cacheable SRAM area.
My hope was to use DMA transfers to the 4 KB SRAM area and invalidate the 4 KB of DCache corresponding to the current virtual address, in order to provoke an update of the 4 KB L2 cache area, which it would get from SRAM.
But that does not work... I feel there must be a way, but I haven't found it yet.
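For reference, the aliasing table described above could be built like this, assuming the ARMv7-A short-descriptor format with simple attribute choices (AP = full access, TEX = 0, C = B = 1 for write-back cacheable); the 4 KB SRAM frame address is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_FRAME 0x00004000u  /* the single 4 KB SRAM page (assumed) */

/* ARMv7-A short-descriptor, second-level "small page" entry:
 * bits[31:12] page base address, AP[1:0]=0b11 (full access, bits 5:4),
 * C=1 B=1 (write-back cacheable, TEX=0), bit[1]=1 marks a small page. */
static uint32_t small_page_entry(uint32_t phys_base) {
    return (phys_base & 0xFFFFF000u)
         | (0x3u << 4)   /* AP[1:0] = 0b11 */
         | (1u << 3)     /* C */
         | (1u << 2)     /* B */
         | (1u << 1);    /* small page */
}

/* Second-level table for one 1 MB section: every 4 KB virtual page
 * aliases the same physical SRAM frame. */
static uint32_t l2_table[256];

static void build_alias_table(void) {
    for (int i = 0; i < 256; i++)
        l2_table[i] = small_page_entry(SRAM_FRAME);
}
```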
I have some vague memory of U-Boot wanting to do something like this to avoid some problem they perceived in accessing DRAM early on. Sounds like a lot of unnecessary work to me, but if anyone has a solution with the current cores, I'd guess they would.
Actually, what I wrote here made me think of a trivial solution, and I successfully populated the whole 512 KB L2 cache with data.
I'm not too worried if the CPU attempted to write back data to ext memory, i could configure the memory controller as if the RAM chip was there
Now, the Cortex-A Series Programmer's Guide says: when the core executes a store instruction, a cache lookup on the address(es) to be written is performed. For a cache hit on a write, there are two choices (write-through & write-back).
But the important thing is, both of them will update the cache.
So, even if there is no physical DRAM chip on board, as long as the DRAM controller is initialized, writing to DRAM locations can populate L2 through this mechanism.
To test that this works, I modified the DRAM init code so that reading/writing to DRAM is allowed but produces garbage.
Then I just write 512 KB of valid data to that "broken" DRAM; when re-reading, it is all good, while if I have the DCache disabled it is all garbage.
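The write-then-verify pass can be sketched as plain C. On the real target, `area` would point to the cacheable alias of the (absent) DRAM and any mismatch would indicate a line that fell out of L2 and was re-fetched as garbage; on a host with a real buffer it trivially passes:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define AREA_SIZE (512 * 1024)  /* size of the L2 cache */

/* Arbitrary recognizable pattern, just not all-zero. */
static uint32_t fill_pattern(uint32_t word_index) {
    return 0xA5000000u ^ word_index;
}

/* Write the pattern over the whole area, then re-read and count
 * mismatches.  Zero mismatches means every line stayed cached. */
static size_t check_area(volatile uint32_t *area) {
    size_t words = AREA_SIZE / sizeof(uint32_t);
    size_t mismatches = 0;
    for (size_t i = 0; i < words; i++)
        area[i] = fill_pattern((uint32_t)i);
    for (size_t i = 0; i < words; i++)
        if (area[i] != fill_pattern((uint32_t)i))
            mismatches++;
    return mismatches;
}
```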
Now it's time to try to execute code from L2 to see if that works...
As Martin has already said, this isn't going to work. Data will be randomly evicted (if dirty) or discarded (if clean), and when next needed it will be reloaded from external memory, which will return garbage. What you are trying is architecturally unsafe, so according to the spec it can't work reliably, and it may be prone to somewhat unpredictable failures.
What you are basically describing is a means to lock the caches (commonly called cache lockdown) which forces the cache to hold on to data (and not write to external memory). The Cortex-A family caches do not support this feature, although some ARM cores in the past have done.
HTH, Pete
Hi 0xffff,
Normally I wouldn't play much with the cache/MMU; I just configure the MMU at startup, then let it do its job, and it works well for me. The only cache operations I do are invalidations after some DMA transfers to cacheable areas. I can see that the functioning of the cache & MMU is quite complex (the ARMv7-A Architecture Reference Manual is a "mammoth" book!!!), but with such complexity I hope there is a trick/loophole to achieve what I want.
This is where the architecture will bite back - as Martin stated, the core is allowed to speculatively fill and evict cache lines. This means that the core can do what it likes, within the architectural bounds, to cache entries.
Older cores used to come with a feature called cache lockdown, but that basically doesn't exist anymore in ARM-implemented cores; it is an optional architectural feature that has been deemed less than necessary. It isn't there to "use cache as RAM" but to make sure that something always stays in the cache, and those aren't the same thing by any means.
The idea of using the cache as a kind of "we don't have a lot of RAM yet" buffer is more easily solved by using that 32 KiB of SRAM. You run the risk - if the DRAM controller handles writes to an uninitialized bank in a less than favorable way - of those cache entries being evicted to make room for other entries. If the DRAM controller receives such a write request when a cache clean operation is performed to evict an entry, then you may simply lock up your DRAM controller and therefore the rest of the system.
You can't make any guarantees even with cache lockdown that entries won't be "cleaned" out to main memory at any time, just that they will never be invalidated.
If your goal is to prevent writes to the DRAM space before initializing the controller, there is no loophole or trick you can pull, either from an architectural or a Cortex-A7 implementation point of view. To be architecturally compliant you MUST mark that DRAM region as faulting (or No Access, XN) in the translation tables to prevent speculative accesses to it.
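A minimal sketch of that, assuming the short-descriptor format and a hypothetical 1 GB DRAM window at 0x80000000: a first-level entry of zero (bits[1:0] == 0b00) is a translation fault, so any access to the region, including a speculative one, aborts instead of reaching the uninitialized controller.

```c
#include <assert.h>
#include <stdint.h>

#define DRAM_BASE_MB (0x80000000u >> 20)  /* first 1 MB section of DRAM (assumed) */
#define DRAM_SIZE_MB 1024u                /* 1 GB of DRAM address space (assumed) */

/* First-level short-descriptor translation table: one word per 1 MB. */
static uint32_t ttb[4096];

/* An entry with bits[1:0] == 0b00 generates a translation fault on
 * any access, which is what keeps speculation away from the region. */
static void fault_dram_region(void) {
    for (uint32_t mb = DRAM_BASE_MB; mb < DRAM_BASE_MB + DRAM_SIZE_MB; mb++)
        ttb[mb] = 0x00000000u;
}
```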
32 KiB of SRAM is more than enough space for code that initializes the DRAM controller, maps it properly before and after, and executes further code. It may be "slow" in your view, but slow is a lot better than "locks the system up". In every case.
Ta,
Matt
The idea of using the cache as a kind of "we don't have a lot of RAM yet" buffer is more easily solved by using that 32KiB of SRAM
In my case it's more like "using the cache because I don't have external RAM at all!".
I'm testing whether there's a way to use that Cortex-A7 CPU "as is" (running my application bare-metal) with no external DRAM chip installed on the PCB. Now, that might sound crazy and wasteful, but that's another topic.
I know I can safely use it with as much RAM as there is internal SRAM (like 64 KB in my case), but if I could use 512 KB, that's much better.
Like you say, when external memory is present, a small and "slow" SRAM is enough to configure the DRAM, then load a second, bigger bootloader into DRAM and execute it from there; I wouldn't need to diverge from that scheme.
This is where the architecture will bite back - as Martin stated, the core is allowed to speculatively fill and evict cache lines.
I hope I'm not bothering you guys with this question; I'm just trying to understand.
You have been very clear that it is unsafe and not architecturally compliant to try doing this. So, I've been warned.
I expect that L1 ICache/DCache lines get evicted. But if all the code & data have always been in L2 in the first place, I don't understand why the system would ever "speculatively" evict data from L2. And to replace it with what?
I would understand, if the code wanted to get data located in a cacheable area that is not in L1 & L2, that some line would have to be evicted in order to free a spot for the new data.
And I'm sure there is some very clever algorithm (using some pseudo-random scheme?) that chooses which line to evict. In that case, there would be a read from external memory, which would return invalid data.
But I accept this situation, since it would mean I made an error in my code.
Right now, I'm able to put all my code+data in L2, and code executes correctly from there, and I'm trying to make it fail.
For the same reasons it would evict a line from L1 - to make room. The replacement policy means that if a line is speculatively fetched and the cache decides, for any reason, that a line in a particular way should be evicted to make room for it, it will evict it. Even if you may be able to predict WHICH lines will be evicted at any time (pseudo-random or LRU), you have no real control over speculative accesses, especially in a high-performance environment, so you don't know WHEN it will happen, and you will have a hard time preventing it. Caches are technically able, per the architecture, to invalidate and clean out lines at any time they deem appropriate, even if no explicit or speculative load or store in the instruction stream provoked it (as a result of another processor or coherent device causing a coherency transaction, for example).
The only way to stop lines being allocated into the data cache is to disable the data caches. This doesn't stop it from evicting lines, though. It also doesn't architecturally guarantee to stop the instruction cache being filled on instruction fetch (even if the I bit is disabled). Since your L2 is unified, an instruction fetch which might end up being cached (even though the caches are "off") might end up evicting a data line in L2. Depending on the coherency protocol in use, this may ensure the line is not in L1 either..
There are a couple of ways you could be almost sure that speculative accesses will not happen: mark XN any region you know contains no executable code. Mark faulting or No Access any other region you won't access. Mark your translation tables as non-cacheable in the TTBCR/TCR. Disable the caches. Disable the branch predictor (cross your fingers it can be architecturally disabled). If none of those work, disable the MMU. If you thought running from slow SRAM was bad, try doing it with all the caches turned off and a strongly-ordered memory model..
I think the solution for an environment where you have a small application and don't need external DRAM is easy: pick an SoC with a suitable amount of internal SRAM. There are cores and a couple of architecture profiles specifically designed to work this way (Cortex-R or Cortex-M with a TCM or two). Some vendors have special ways of turning the L2 into SRAM (anything with a PL310, for example, or any similar non-architectural, external L2 cache), which exists for exactly the reasons you want (early boot code that needs a bit more time and space than it takes to bootstrap DRAM, and a flexible environment where you don't need determinism but do need memory).
I'm starting to get a little confused about what you actually want to do here anyway - is the only goal to not access SRAM because it's "slow?"
Hi Matt,
Thanks for all your detailed explanations, and although it's been working flawlessly so far on the A7, I see I'm "really playing with fire" with what I'm doing.
It's hard for me not to be excited about it, but I'd better not think about this too much, because it sounds quite unsafe.
The motivation is that, for some hardware products that don't require much RAM for code/data, the electronic board could be designed without any DRAM chip on it.
The same libraries, especially DSP stuff like NEON code that I develop for bigger products that do have DRAM chip(s) on board (maybe with a different Cortex-A CPU), or for phones/tablets, could be reused in this smaller DRAM-less device.
Then the goal is to have as much memory as possible for code & data. Nowadays, SRAM sizes are much smaller than L2, hence my desire to use L2.
Regarding speed, the fact that with this scheme all code & data would always reside in L2 is definitely a nice "side effect", but it was not my first motivation.
Actually, SRAM would still be needed for DMA transfers, and the core would have to manually copy data to/from SRAM, which is "slow" but sometimes still better than directly accessing the peripherals without DMA.
For some applications that could certainly be a "no go", but in my case (audio) it's just fine.
By the way, I might very well use an A8 instead of an A7, especially since NEON performance is much better.
And on the A8 it does look like I'll be able to lock down all the ways of L2, achieving the exact same thing, but in a safe way this time.
How do you actually manage to do this, given that the SRAM is so small? Lockdown works best when you are simply attempting to make sure that lines are not evicted from the cache. However, it is less useful - to the point of being useless - for making memory appear to "exist" outside the region you are able to cache. Without lockdown it would be relatively difficult to achieve.. the issue with your plan of reading each block with the CPU to stimulate read misses and fetch the data is that you actually need as much RAM as you have cache.
There do exist several SoC options with a rather large amount of internal SRAM. Given your application, one would suggest, again, that you size your chosen SoC appropriately. This will mean that if in the future you need to move to another core entirely, your code will still run regardless of any microarchitectural games you play with the caches. If you're willing to go for a completely different core, then you may as well find one with enough SRAM.
Yes, the key is to have an area of size = L2 size where the core can do reads/writes without locking up, even if those writes go nowhere and the read data is meaningless.
I can see two ways of doing this that have worked for me on the A7 & A8.
The first is to init the DRAM controller as if there were a chip physically installed on the board, but there is not.
The second is lazier (it doesn't need any initialization) and is SoC-dependent: find an area where it's possible to do those reads/writes without locking up. Yeah, that one is quite bad, I admit, but it also worked on the two SoCs I have here.
Then the 1 MB section that contains this area is marked as cacheable (and it's the only one), and the MMU is enabled.
Consecutive reads of L2 size are performed from this area, and at this point L2 is fully populated with meaningless garbage from consecutive "area" addresses.
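The priming pass could look like the following sketch, touching one word per 64-byte line (the line size implied by the inconsistencies mentioned earlier in the thread); `window` would be the cacheable alias of the fake memory area, here stood in for by an ordinary host buffer:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define L2_SIZE   (512 * 1024)  /* whole L2 cache              */
#define LINE_SIZE 64            /* assumed 64-byte cache lines */

/* Touch one byte per cache line across an L2-sized window so every
 * line gets allocated; the values read are garbage and are only
 * XORed into a sink to keep the reads from being optimized away. */
static uint32_t prime_l2(volatile uint8_t *window) {
    uint32_t sink = 0;
    for (uint32_t off = 0; off < L2_SIZE; off += LINE_SIZE)
        sink ^= window[off];  /* read miss -> line allocated into L2 */
    return sink;
}
```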
Any further writes to this area will update L1 for sure, and possibly L2, but shouldn't have to update L3. Even if the core did want to write a new L2 word back to L3, it would not lock up, so we wouldn't have a problem here.
Any further read will be performed from L1 if the data is there, or from L2 (bringing the data back to L1), but should not have to read from L3, since that data is already all in L2 - and that's where the whole "unsafe" thing comes in...
The current facts show me it's working, and code execution works normally, but I would not guarantee this method is problem-free on the A7. On the A8, L2 can be locked down, so I'd be less worried.
Note that I don't know what L1 does, or the complex pseudo-random algorithm it may perform when it's time to evict data or take speculative action, but the point is that, at any time, all L1 ICache & DCache words are somewhere in L2.
Actually, the code that is executed solely from L2 and reads/writes data from there is the same as if it were located in DRAM or SRAM, except DMA can't be performed to the area corresponding to L2.
Peripherals and SRAM are located in non-cacheable areas, so they are not a problem. The only special thing is the bootloader, which must perform the init mentioned above and write code & data to the L2 area.
Regards,
L.
You're very lucky that your DRAM controller returns garbage data if there's no DRAM connected.. the more common scenario is that it locks the bus by never allowing a transaction to finish (because it's waiting on data from RAM that doesn't exist and can never arrive).
The DRAM controller blindly assumes the DRAM chip honors its read/write requests according to the various clocks it delivers, relative to all the timing information it's been configured with.
It will not wait on the DRAM IC, so it's not a problem if the chip is not there.
Understood, but again, if you do move to a core with lockdown features, you might not get the same DRAM controller, and therefore a less easy way to "pollute" the cache with read-allocated garbage in order to write over it and then lock it..
Any progress?
I am currently working with an Allwinner A13.
I am looking at how to avoid external DDR3 memory, because the program size is less than 128 KB: boot from an SD card with a loader that feeds the program into the L2 cache.