Hi All!
I am working with a Xilinx Zynq 7000 SoC which uses the Cortex A9 as a CPU.
I've observed a problem wherein a section of memory marked strongly-ordered and non-cacheable (0xc02) in the MMU table gets corrupted by what appears to be L1 evictions back to DDR.
Setup: Linux master on CPU0 with FreeRTOS on CPU1. During the boot process, the region from 512MB to 1GB is marked 0xc02 in the translation table and aliased back to the lower region (0 to 512MB). This has the effect of allowing accesses to the same physical memory with different region attributes. The Linux CPU owns the L2 cache and its controller, the L2 cache is disabled on CPU1.
Thus, a pointer offset by 0x20000000 from the original value returned by malloc should be treated as uncacheable, and all accesses through it should go directly to memory. I am using a buffer of 1024 integers, allocated by malloc and then offset to make all accesses uncacheable.
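In code, the aliasing looks roughly like this (simplified sketch; the macro and variable names are mine):

#include <stdint.h>
#include <stdlib.h>

#define UNCACHED_OFFSET 0x20000000u   /* 512MB: distance between the aliases */
#define BUF_WORDS       1024u

void example(void)
{
    /* malloc returns a pointer in the cacheable mapping... */
    int *cached = malloc(BUF_WORDS * sizeof(int));

    /* ...and adding the offset gives the same physical memory through
       the strongly-ordered, non-cacheable alias. */
    volatile int *uncached =
        (volatile int *)((uintptr_t)cached + UNCACHED_OFFSET);
    (void)uncached;
}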
Issue: After performing a memcpy to the uncached buffer, the contents match the source exactly. However, after a short amount of time, the uncached buffer drifts from the source (which remains unchanged throughout). When the buffer is instead marked as cacheable, this corruption does not occur, which leads me to believe that stale data is being evicted from the L1 cache and overwriting the new clean data that was placed in DDR.
I have tried disabling, flushing, and invalidating the cache (both before and after the memcpy), but none of these helped. The buffer is not aligned to the L1 cache line size, which could explain corruption of the first and last entries from accesses through the cached pointers on either side, but the corruption is spread randomly throughout the buffer in chunks of 8 entries (8*4 = 32 bytes, the L1 line size). Additionally, I've tried disabling the prefetch bits in the ACTLR. Looking at the disassembly of memcpy, though, it does not issue any PLD instructions to the destination, only to the source.
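For reference, the flavor of maintenance I attempted looks roughly like this (a bare-metal CP15 sketch, names mine; DCCIMVAC cleans and invalidates one L1 data cache line by virtual address):

#include <stddef.h>
#include <stdint.h>

#define L1_LINE 32u   /* Cortex-A9 data cache line size */

/* Clean and invalidate every L1 data cache line covering [addr, addr+len),
   then drain the operations with a DSB. */
static void l1_clean_inv_range(uintptr_t addr, size_t len)
{
    uintptr_t end = addr + len;
    for (addr &= ~(uintptr_t)(L1_LINE - 1); addr < end; addr += L1_LINE)
        __asm__ volatile("mcr p15, 0, %0, c7, c14, 1" :: "r"(addr) : "memory");
    __asm__ volatile("dsb" ::: "memory");
}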
What else could be the cause of this, and what else could I try to fix the issue of not being able to write to an uncached region?
Thanks!!!
Hi Pete, dedowes,
The Linux CPU owns the L2 cache and its controller, the L2 cache is disabled on CPU1.
It's a single cache shared by both cores. It's either on or off - it can't be on for one, and off for the other.
As an interesting side-note, while you can't "disable" the L2 cache on a per-CPU basis, the CPU SCTLR.C bit will affect whether the CPU can generate what the L2 cache (or, more specifically, the L2 memory system) sees as cacheable transactions. This is somewhat architecturally defined, but it is probably an awful idea to run CPU1 with its L1 disabled just to prevent allocation into L2. Note that disabling caches does NOT prevent lookups or hits in caches, nor does it technically prevent a Device or Strongly-ordered memory access from being looked up in a cache (that seems counter-intuitive, but it is architecturally acceptable).
It's possible, if that L2 cache is an L2C-310 and it has been synthesized with the appropriate option to enable "Lockdown by Master ID", to configure the L2 cache to lock all ways as unavailable for allocation to a particular CPU, effectively 'disabling' L2 for that CPU. Any transaction will still have to pass through the L2C-310 on its way to L3, though; there is no short-circuit.
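If the option is present, the programming would look roughly like this (untested sketch; the base address is the Zynq-7000's L2C-310 as I recall it, and the offsets are the lockdown-by-master registers from the L2C-310 TRM - check both against your documentation):

#include <stdint.h>

#define L2C310_BASE          0xF8F02000u   /* Zynq-7000 PL310, check your TRM */
#define L2C310_D_LOCKDOWN(n) (L2C310_BASE + 0x900u + 8u * (n))
#define L2C310_I_LOCKDOWN(n) (L2C310_BASE + 0x904u + 8u * (n))

/* Lock all 8 ways against data and instruction allocation for one AXI
   master ID - effectively 'disabling' L2 allocation for that CPU. */
static void l2_lock_all_ways_for_master(unsigned master)
{
    *(volatile uint32_t *)L2C310_D_LOCKDOWN(master) = 0xFFu;
    *(volatile uint32_t *)L2C310_I_LOCKDOWN(master) = 0xFFu;
}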
~
What's really wrong here, though, is exactly as Pete says: you can't map two virtual addresses with two sets of conflicting attributes to the same physical address. From your description of the problem, it isn't so much the L2 cache that is causing the issue here, nor who "owns" it, but flouting the rules of memory coherency, which will bite you on any processor architecture (not just ARM). Even if a single CPU and the OS running on it are the only things generating accesses to that physical memory location, you still have to abide by the rules of the memory model: there are multiple observers involved (not just CPUs, but also the MMU and the Instruction- and Data-side logic).
If you've got cacheable memory then by any definition you HAVE to deal with cache coherency. Simply mapping it as strongly-ordered somewhere else never removes the requirement to maintain the caches for the cacheable alias - not only before and after using the cacheable alias, but before and after using the strongly-ordered alias too.
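Concretely, reusing the hypothetical l1_clean_inv_range helper sketched earlier in the thread (buffer names as in those sketches), the minimum shape of a correct sequence is something like this - and the L2C-310 needs the equivalent maintenance by physical address, so this is a sketch of the ordering, not a complete recipe:

/* Evict the cacheable alias before touching the strongly-ordered one,
   so no dirty line can later be written back on top of the new data. */
l1_clean_inv_range((uintptr_t)cached, BUF_WORDS * sizeof(int));
memcpy((void *)uncached, src, BUF_WORDS * sizeof(int));
/* ...and again before the cacheable alias is next used, so no stale
   line can satisfy a read. */
l1_clean_inv_range((uintptr_t)cached, BUF_WORDS * sizeof(int));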
Unfortunately, you're detailing the symptoms of a problem but you never really described the original intent - is this 512MB strongly-ordered alias an attempt by FreeRTOS to read the memory that Linux is using? Or is it a buffer owned solely by FreeRTOS as an alias of its own (not-shared-with-Linux) cacheable memory?
Ta,
Matt
As an interesting side-note, while you can't "disable" the L2 cache on a per-CPU basis, the CPU SCTLR.C bit will affect whether the CPU can generate what the L2 cache (or, more specifically, the L2 memory system) sees as cacheable transactions.
Interesting - thanks Matt - I didn't know about that one.
Cheers,
Pete
I figured I would cross-check this, since I had a very small doubt about it... and it turns out that the Cortex-A9 isn't quite as nice about it. The Cortex-A9 TRM r4p1 states that when SCTLR.C=0 all accesses to Cacheable regions are treated as Normal Non-Cacheable without lookup in L1, but then it goes on to say:
ARUSER[4:0] and AWUSER[4:0] directly reflect the value of the Inner attributes and Shared attribute as defined in the corresponding page descriptor. They do not reflect how the Cortex-A9 processor interprets them, and whether the access was treated as Cacheable or not
...that may interfere with things a little, since even though it will never allocate into L1, it will still present Shareability and Cacheability attributes externally, to the L2C-310. I am sure there are some cute use cases for not using L1 but wanting to use L2... but as hard as I look, I actually can't find anything architecturally that would prevent a processor from presenting the intended attributes rather than the ones it actually used (in fact, preventing that would not be terribly friendly to designs with a system cache). The case of requiring a processor not to output any cacheability or shareability attributes on the bus is well handled by marking translation tables correctly, or by not turning on the MMU in the first place.
So, I was dead wrong. Oops
Thanks for checking
Hey Matt and Pete,
Thanks so much for your help. I took a few days off from the issue and will be getting back to it later this week. The intent of what I was doing is to alias two virtual memory regions to the same physical memory region: basically allowing 0-256 MB virtual and 512-768 MB virtual to both map to 0-256 MB physical. This is all contained within the RTOS and is not meant to allow the RTOS to see Linux's memory or vice versa.

I am doing a lot of work with DMA, so a number of the buffers I have to allocate must never be cached. This was the driving force behind the aliasing decision. The buffer I am having issues with is actually not involved with the DMA; I set it to non-cacheable sort of as a test and noticed this issue. I have two quick fixes to solve this temporarily, one being simply to make the buffer cacheable again, which makes the issue go away. But I wanted to understand the underlying problem, as it could be showing up elsewhere in the code without my knowledge.
One odd thing that I noticed was that if I set up a non-blocking delay of anything longer than 100 ms right before I allocated this buffer, the corruption problem never occurred once it was allocated and populated. This led me to suspect possible latency issues or clashes with the other threads running, but I am doubtful. Careful monitoring of the executing thread during the delay didn't yield anything promising.
Could you explain to me what the ARUSER and the AWUSER bits are doing in what you mentioned above? As far as I understand it, they would allow me to ignore the L1 cache, not the L2 cache, which is the opposite of the assumption I have been working on.
I will start by making sure I do not have any coherency issues, eliminating the aliasing and ensuring that I perform proper maintenance operations on the L2 cache from the RTOS. Do you think it would be possible to instead work at the physical and not virtual level when setting memory access attributes? That way I would get the non-cacheable region I need without aliasing.
Thank you guys so much for your help!
- Dan
One odd thing that I noticed was that if I set up a non-blocking delay of anything longer than 100 ms right before I allocated this buffer, the corruption problem never occurred once it was allocated and populated.
Yes, that sounds a pretty typical manifestation of a coherency failure with conflicting attributes.
If you end up with conflicting page table attributes then you are really at the mercy of what is wedged in TLB / uTLBs around the system.
Inserting a wait makes it more likely that the conflicting TLB entries and cache lines have been evicted, so you are more likely to end up with only one of the attribute sets active at any point in time. But really you are just playing with statistics at this point, and there are no guarantees it would work. You would be very surprised at how long things can lurk in the main TLB...
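As a rule, any time translation table entries change, you need explicit TLB maintenance before the new attributes can be relied upon. A bare-metal sketch (the blunt whole-TLB version; TLBIMVA exists for per-address invalidation):

/* Invalidate the entire unified TLB (TLBIALL), then synchronize. */
static inline void tlb_invalidate_all(void)
{
    __asm__ volatile("mcr p15, 0, %0, c8, c7, 0" :: "r"(0) : "memory");
    __asm__ volatile("dsb" ::: "memory");
    __asm__ volatile("isb" ::: "memory");
}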
Could you explain to me what the ARUSER and the AWUSER bits are doing in what you mentioned above?
These signals basically propagate the shareability and cacheability information out of the core and down the memory hierarchy. The Cortex-A9 L2 cache is a block (PL310) external to the CPU core, and these signals effectively carry the essential translation table attributes to the L2 so it does the right thing.
Do you think it would be possible to instead work at the physical and not virtual level when setting memory access attributes?
No; it's a virtual memory architecture, so everything operates on virtual addresses. Any consistency management for aliased mappings has to be enforced by the various software pieces that are running; the hardware can't do this itself.
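What you can do is drop the alias entirely and give the buffer's pages non-cacheable attributes in the one and only mapping. A rough sketch for an ARMv7 short-descriptor 1MB section entry (assuming TEX remap is disabled and domain 0 is already configured; the function and names are mine, not any real API):

#include <stdint.h>

/* Map one 1MB section at 'va' to 'pa' as Normal, Non-cacheable:
   TEX=0b001, C=0, B=0, with AP=0b11 for full read/write access. */
static void map_section_noncacheable(volatile uint32_t *ttb,
                                     uint32_t va, uint32_t pa)
{
    ttb[va >> 20] = (pa & 0xFFF00000u)
                  | (1u << 12)   /* TEX[0]=1 -> Normal memory     */
                  | (3u << 10)   /* AP[1:0]=0b11: full access     */
                  | 0x2u;        /* bits[1:0]=0b10: section entry */
    /* The old entry may still sit in the TLBs: TLBIMVA/TLBIALL plus
       DSB and ISB are required before the new attributes apply. */
}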