I'm thinking about using a Cortex-A7 in a "bare-metal" design where I don't need much memory, so I'd like to avoid using external memory.
The CPU boots from an external 4MBytes SPI NOR FLASH chip.
It has 512 KBytes of L2 cache and 32 KBytes of internal SRAM that is just used during initial boot since it's so slow.
Using MMU and L2 cache configurations, I am wondering if there is a way to fill the whole L2 cache with code/data ?
Since the internal SRAM is 16 times smaller than the L2 cache, it might be tricky.
Could the following work ?
1. CPU boots initial code from SPI FLASH (something like 8-16KB, let's say @ 0x00000000 where SRAM is located)
2. First, MMU is configured so that this bootloader code/data is never cached.
Then,
3. CPU loads one block of 16KB from SPI FLASH, and writes it at a fixed address in internal SRAM ( 0x00004000 )
4. CPU reads 16KB of data from increasing addresses:
for 1st block : 0x80000000-0x80003fff
for 2nd block: 0x80004000-0x80007fff
... and so on ... with MMU/L2 cache configured so that those addresses always map to 0x00004000 - 0x00007fff where the block is located ( the question is here, can this be done? )
5. Those reads provoke L2 cache-misses which fills the 16KB of L2 cache with the block data.
6. Repeat 3-4-5 steps 32 times to fill the whole 512KB of L2 cache
Configure MMU L1 caches (or maybe that must be also done in previous steps?)
Jump to program entry point (so, somewhere between 0x80000000 and 0x80000000 + 512KB).
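The block arithmetic behind steps 3 to 6 can be sketched as follows. This is only an illustration of the address bookkeeping: the constants come from the numbers in this post, and the function names are made up for the example. It says nothing about whether the aliasing actually produces distinct L2 lines; if the L2 is tagged by physical address, every window would hit the same 16KB of lines.

```c
#include <stdint.h>

#define L2_SIZE      (512u * 1024u)  /* L2 cache size quoted in the post */
#define BLOCK_SIZE   (16u * 1024u)   /* one block staged in internal SRAM */
#define SRAM_STAGE   0x00004000u     /* physical staging address in SRAM */
#define WINDOW_BASE  0x80000000u     /* virtual address of the first window */

/* Number of 16KB blocks needed to cover the whole L2. */
uint32_t block_count(void)
{
    return L2_SIZE / BLOCK_SIZE;
}

/* Virtual base address of the window that block n would be read through. */
uint32_t window_base(uint32_t n)
{
    return WINDOW_BASE + n * BLOCK_SIZE;
}

/* Physical SRAM address that the plan wants a window address to resolve to. */
uint32_t aliased_phys(uint32_t va)
{
    return SRAM_STAGE + ((va - WINDOW_BASE) % BLOCK_SIZE);
}
```

With these numbers, block 1 uses the window starting at 0x80004000 and the last block ends at 0x8007ffff, matching the ranges listed above.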
Hi 0xffff,
Normally I wouldn't play much with cache/MMU, I just configure the MMU at startup, then let it do its job and it works well for me. The only cache operation I do is invalidation after some DMA transfers to cacheable areas. I can see that the functioning of cache & MMU is quite complex (the ARMv7-A architecture reference manual is a "mammoth" book !!!), but with such complexity I hope there is a trick/loophole to achieve what I want.
This is where the architecture will bite back - as Martin stated, the core is allowed to speculatively fill and evict cache lines. This means that the core can do what it likes, within the architectural bounds, to cache entries.
Older cores used to come with a feature called cache lockdown, but that basically doesn't exist anymore in ARM-implemented cores, it is an optional architectural feature that has been deemed less than necessary. It isn't there to "use cache as RAM" but to make sure that something always stays in the cache, and they aren't the same thing by any means.
The idea of using the cache as a kind of "we don't have a lot of RAM yet" buffer is more easily solved by using that 32KiB of SRAM. You run the risk - if the DRAM controller handles writes to an uninitialized bank in a less than favorable way - of those cache entries being evicted to make room for other entries. If the DRAM controller receives that write request as a cache clean operation is performed to evict the entry, then you may simply lock your DRAM controller and therefore the rest of the system.
You can't make any guarantees even with cache lockdown that entries won't be "cleaned" out to main memory at any time, just that they will never be invalidated.
If your goal is to prevent writes to DRAM space before initializing the controller, there is no loophole or trick you can pull either from an architectural or a Cortex-A7 implementation point of view. To be architecturally compliant you MUST mark that DRAM region as faulting (or No Access, XN) in the translation tables to prevent speculative accesses to it.
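As a rough illustration of "mark that region as faulting", here is a minimal sketch that builds an ARMv7-A short-descriptor first-level table in which every 1MB section faults by default and only explicitly mapped sections are accessible. The helper names are hypothetical, domain 0 is assumed (so the DACR would need to set it to client), and on a real target the table must be 16KB-aligned and its address written to TTBR0; none of that is shown here.

```c
#include <stdint.h>

#define NUM_SECTIONS  4096u          /* 4GB / 1MB sections */
#define DESC_FAULT    0x0u           /* bits[1:0] = 00: any access faults */
#define DESC_SECTION  0x2u           /* bits[1:0] = 10: 1MB section */
#define SEC_B         (1u << 2)      /* bufferable */
#define SEC_C         (1u << 3)      /* cacheable */
#define SEC_XN        (1u << 4)      /* execute never */
#define SEC_AP_RW     (3u << 10)     /* AP[1:0] = 11, AP[2] = 0: full access */
#define SEC_TEX(x)    ((uint32_t)(x) << 12)

/* Normal memory, write-back, write-allocate: TEX=001, C=1, B=1. */
#define SEC_NORMAL_WBWA (SEC_TEX(1) | SEC_C | SEC_B)

/* Start from a table where every section is a translation fault. */
void table_init_fault(uint32_t *ttb)
{
    for (uint32_t i = 0; i < NUM_SECTIONS; i++)
        ttb[i] = DESC_FAULT;
}

/* Map one 1MB section; attrs is an OR of the SEC_* flags above. */
void map_section(uint32_t *ttb, uint32_t va, uint32_t pa, uint32_t attrs)
{
    ttb[va >> 20] = (pa & 0xFFF00000u) | attrs | DESC_SECTION;
}
```

For example, `map_section(ttb, 0x00000000u, 0x00000000u, SEC_AP_RW)` would leave the SRAM section strongly-ordered and executable, while the DRAM range stays at `DESC_FAULT` until the controller has been initialized.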
32KiB of SRAM is more than enough space to write code to initialize the DRAM controller, mapping it properly before and after, and execute further code. It may be "slow" in your view, but slow is a lot better than "locks the system up." In every case.
Ta,
Matt
The idea of using the cache as a kind of "we don't have a lot of RAM yet" buffer is more easily solved by using that 32KiB of SRAM
In my case it's more like "using the cache because I don't have external RAM at all !".
I'm testing if there's a way to use that Cortex-A7 CPU "as is" (running my application in bare-metal), with no external DRAM chip installed on the PCB. Now, that might sound crazy and wasteful, but that's another topic
I know I can safely use it with as much RAM as there is internal SRAM (like 64 KB in my case), but if I could use 512 KB, that's much better.
Like you say, when external memory is present, a small and 'slow' SRAM is enough to configure DRAM, then load a second, bigger boot-loader to DRAM and execute it from there, i wouldn't need to diverge from that scheme.
As Martin has already said, this isn't going to work. Data will be randomly evicted (if dirty) or discarded (if clean), and when next needed it will attempt to reload from external memory which will return garbage. What you are trying is architecturally unsafe, so according to the spec can't work reliably, and may be prone to somewhat unpredictable failures.
This is where the architecture will bite back - as Martin stated, the core is allowed to speculatively fill and evict cache lines.
I hope i'm not bothering you guys with that question, i'm just trying to understand.
You have been very clear that it is unsafe and not architecturally compliant to try doing this. So, I've been warned
I expect that L1 icache/dcache lines get evicted. But if all the code&data have always been in L2 in the first place, I don't understand why the system would ever 'speculatively' evict data from L2 ? and to replace it with what ?
I would understand, if the code wanted to get data located in a cacheable area that is not in L1&L2, that some line would have to be evicted in order to free a spot for the new data.
And i'm sure there is some very clever algorithm (using some pseudo-random scheme?) that chooses what line to evict. And in that case, there would be a read from external memory, which would return invalid data.
But I accept this situation since it would mean I made an error in my code.
Right now, i'm able to put all my code+data in L2 and code executes correctly from there, and I'm trying to make it fail.
For the same reasons it would evict a line from L1 - to make room. The replacement policy means if a line is speculatively fetched and the cache decides for any reason that a line in a particular way should be evicted to make room for it, it will evict it. You may be able to predict WHICH lines will be evicted at any time (pseudo-random or LRU), but you have no real control over speculative accesses, especially in a performance environment, so you don't know WHEN it will happen, and will have a hard time preventing it. Caches are technically able, per the architecture, to invalidate and clean out lines at any time they deem appropriate, even if no explicit or speculative load or store happened to provoke it in the instruction stream (as a result of another processor or coherent device causing a coherency transaction, for example).
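To make the "evict to make room" point concrete, here is a toy model of a set-associative cache with round-robin replacement. It is deliberately tiny (8KB, 8 ways, 64-byte lines) and is not a model of the Cortex-A7 L2 or its real replacement policy; all names are invented for the example. The point is only that once the cache is full, a single fill from an address outside the resident window has to displace one of the resident lines.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64u
#define NUM_WAYS  8u
#define NUM_SETS  16u   /* toy size: 16 sets * 8 ways * 64 bytes = 8KB */

typedef struct {
    uint32_t tag[NUM_SETS][NUM_WAYS];
    bool     valid[NUM_SETS][NUM_WAYS];
    uint32_t next_victim[NUM_SETS];   /* round-robin pointer per set */
} toy_cache;

void cache_reset(toy_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* True if the line containing physical address pa is resident. */
bool cache_holds(const toy_cache *c, uint32_t pa)
{
    uint32_t line = pa / LINE_SIZE;
    uint32_t set = line % NUM_SETS;
    uint32_t tag = line / NUM_SETS;
    for (uint32_t w = 0; w < NUM_WAYS; w++)
        if (c->valid[set][w] && c->tag[set][w] == tag)
            return true;
    return false;
}

/* Demand or speculative access: returns true on a hit. On a miss the line
 * is filled into the round-robin victim way, replacing whatever was there. */
bool cache_access(toy_cache *c, uint32_t pa)
{
    if (cache_holds(c, pa))
        return true;
    uint32_t line = pa / LINE_SIZE;
    uint32_t set = line % NUM_SETS;
    uint32_t w = c->next_victim[set];
    c->tag[set][w] = line / NUM_SETS;
    c->valid[set][w] = true;
    c->next_victim[set] = (w + 1u) % NUM_WAYS;
    return false;
}
```

In this model, filling the cache with an 8KB window at 0x80000000 and then touching 0x90000000 once displaces the line at 0x80000000; the next access to that line would go to the backing memory.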
The only way to stop lines being allocated into the data cache is to disable the data caches. This doesn't stop it from evicting lines, though. It also doesn't architecturally guarantee to stop the instruction cache being filled on instruction fetch (even if the I bit is disabled). Since your L2 is unified, an instruction fetch which might end up being cached (even though caches are 'off') might end up evicting a data line in L2. Depending on the coherency protocol in use, this may make sure the line is not in L1 either..
There are a couple ways you could be almost sure that speculative accesses would not happen - XN any region you know there is not executable code. Mark faulting or No Access any other region you won't access. Mark your translation tables as non-cacheable in the TTBCR/TCR. Disable the caches. Disable the branch predictor (cross your fingers it can be architecturally disabled). If none of those work, disable the MMU. If you thought running from slow SRAM was bad, try doing it with all the caches turned off and a strongly-ordered memory model..
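For the "disable the caches / branch predictor" items, a minimal sketch of the SCTLR bit manipulation is shown below. The bit positions are the ARMv7-A ones (C = bit 2, Z = bit 11, I = bit 12); the function name is made up, and the value is computed on a plain integer so the logic can be checked on a host. On a target the value would be read and written with the CP15 SCTLR accesses (`mrc`/`mcr p15, 0, <Rt>, c1, c0, 0`), with the appropriate cache maintenance and barriers around it, which this sketch does not attempt. Whether clearing Z has any effect is implementation-dependent.

```c
#include <stdint.h>

#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data and unified cache enable */
#define SCTLR_Z (1u << 11)  /* branch prediction enable (may be ignored) */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Return the SCTLR value with caches and branch prediction turned off,
 * leaving every other bit (including the MMU enable) unchanged. */
uint32_t sctlr_caches_off(uint32_t sctlr)
{
    return sctlr & ~(SCTLR_C | SCTLR_I | SCTLR_Z);
}
```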
I think the solution for an environment where you have a small application and don't need external DRAM is easy: pick an SoC where you have suitable amounts of internal SRAM. There are cores and a couple of architecture profiles specifically designed to work this way (Cortex-R or Cortex-M with a TCM or two). Some vendors have special ways of making the L2 into SRAM (for anything with a PL310 for example, or any similar non-architectural, external L2 cache) which is for exactly the reasons you want (early boot code which needs a bit more time and space than to bootstrap DRAM, and a flexible environment where you don't need determinism but don't need memory).
I'm starting to get a little confused about what you actually want to do here anyway - is the only goal to not access SRAM because it's "slow?"
Hi Matt,
Thanks for all your detailed explanations. Although it's been working flawlessly so far on the A7, I see I'm "really playing with fire" with what I'm doing.
It's hard for me to not be excited about it, but I better not think about this too much, because it sounds quite unsafe
The motivation is, for some hardware products that don't require much RAM for code/data, the electronic board could be designed without any DRAM chip on it.
The same libraries, especially DSP stuff like NEON code that I develop for bigger products that do have DRAM chip(s) on-board (and maybe with a different Cortex-A CPU), or on phones/tablets, could be reused in this smaller DRAM-less device.
Then, the goal is to have as much memory as possible for code&data. Nowadays, SRAM size is much smaller than L2, hence my desire to use L2.
Regarding speed, the fact that with this system all code&data would always reside in L2 is definitely a nice "side-effect", but it was not my first motivation.
Actually, SRAM would still be needed for DMA transfers, and the core would have to manually copy data to/from SRAM, which is 'slow', but sometimes still better than directly accessing the peripherals without DMA.
For some applications, that could certainly be a "no go" but in my case (audio) it's just fine.
By the way, i might very well use an A8 instead of an A7, especially since NEON performance is much better.
And for the A8 it does look like I'll be able to lock all ways of L2, achieving the exact same thing, but in a safe way this time
How do you actually manage to do this on the basis that the SRAM is so small? Lockdown works best when you are simply attempting to make sure that lines are not evicted from the cache. However it is less useful - to the point of being useless - for making memory appear to 'exist' outside the region you are able to cache. Without lockdown it would be relatively difficult to achieve.. the issue with your plan of reading each block with the CPU to stimulate read-misses and fetch the data is that you actually need as much RAM as you have cache.
There do exist several SoC options with a rather large amount of internal SRAM. Given your application, one would suggest, again, that you size your chosen SoC appropriately. This will mean that in the future if you need to move to another core entirely, that your code will still run regardless of any microarchitectural games you play with the caches. If you're willing to go for a completely different core then you may as well go find one with enough SRAM.
Yes, the key is to have an area of size = L2 size where the core can do reads/writes without locking, even if those writes go nowhere and the read data are meaningless.
I can see two ways of doing this that have worked for me on A7 & A8.
The first is to init the DRAM controller as if there was a chip physically installed on board, but there is not.
The second is lazier (doesn't need any initialization) and is SoC-dependent. It's to find an area where it's possible to do those reads/writes without locking. Yeah, that one is quite bad I admit, but it also worked on two SoCs I have here.
Then, the 1MB section that contains this area is marked as cacheable (and it's the only one), and MMU is enabled.
Consecutive reads of L2 size are performed from this area, and at this point L2 is fully populated with meaningless garbage from consecutive "area" addresses.
Any further writes to this area will update L1 for sure and possibly L2, but shouldn't have to update L3. Even if the core did want to write back a new L2 word to L3, it would not lock so we wouldn't have a problem here.
Any further read will be performed from L1 if data is there, or from L2 (and bringing back data to L1), but should not have to read from L3 since those data are all already all in L2, and that's where the whole 'unsafe' thing comes...
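The "consecutive reads of L2 size" step could look roughly like the sketch below: one read per cache line across the region. The function name is hypothetical and the line size is a parameter because it is an assumption here (64 bytes is what I'd expect for these cores, but check the TRM). Note that it only issues the reads; nothing in it can guarantee that the lines stay resident afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one word from every cache line in [base, base + bytes).
 * The pointer is volatile so the compiler cannot drop the loads.
 * Returns the number of lines touched. */
size_t touch_lines(const volatile uint32_t *base, size_t bytes,
                   size_t line_size)
{
    size_t lines = 0;
    for (size_t off = 0; off < bytes; off += line_size) {
        (void)base[off / sizeof(uint32_t)];  /* the load is the point */
        lines++;
    }
    return lines;
}
```

For a 512KB region and 64-byte lines this performs 8192 reads.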
Current facts show me it's working, and code execution works normally. But I would not guarantee this method is problem-free on A7. On the A8, L2 can be locked down, so i'd be less worried
Note that i don't know what L1 does and the complex pseudo-random algorithm it may perform when it's time to evict data or take speculative action, but the point is, at any time, all L1 I-Cache & D-Cache words are somewhere in L2.
Actually, the code that is solely executed from L2 and read/write data from there is the same as if this code was located in DRAM or SRAM, except DMA can't be performed to the area corresponding to L2.
Peripherals and SRAM are located in non-cacheable areas, so they are not a problem. The only thing is boot-loader, that must perform the init mentioned above, and write code&data to the L2 area.
Regards,
L.
You're very lucky that your DRAM controller returns garbage data if there's no DRAM connected.. the more common scenario is that it locks the bus by never allowing a transaction to finish (because it's waiting on data from RAM that doesn't exist, and that can never arrive).
The DRAM controller blindly assumes the DRAM chip honors its read/write requests according to the various clocks it delivers relative to all the timing information it's been configured with.
It will not wait on the DRAM IC, so it's not a problem if the chip is not there.
Understood, but again if you do move to a core with lockdown features you might not get the same controller for the DRAM and therefore less easy a way to 'pollute' the cache with read-allocated garbage in order to write over it, then lock it..
Any progress?
I am currently working with an Allwinner A13 (ARM).
I am looking at how to avoid external DDR3 memory, because the program size is less than 128K: boot from an SD card with a loader that feeds the program into the L2 cache.
As per the multiple replies higher up the thread, this isn't possible in a manner which is reliable unless the CPU supports cache lockdown which the Cortex-A cores do not.
HTH,
Pete