I would like to find the optimal method to initialize the DRAM ECC of my Xilinx Zynq-7000 SoC. The Zynq-7000 SoC comprises a dual-core Cortex-A9 CPU with L1 data and L1 instruction caches. The L1 data cache is write-back, write-allocate, and apparently this cannot be changed.
I have investigated three solutions to initialize the DRAM ECC over the entire 4 GB range:
1) A for loop of 32-bit store operations from the start address of the DRAM to its end address: the AXI address is cast to a pointer to a volatile unsigned integer and dereferenced for writing;
2) memset: the destination pointer is the start address of the DRAM, and the length is the size of the DRAM (4 GB);
3) A DMA transfer from a fixed source address to the addresses within the DRAM range: the Zynq SoC has a PCAP DMA engine between the PS and the PL, which can be used for this purpose. The source address is configured not to increment (unlike the destination address).
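For reference, options (1) and (2) can be sketched roughly as follows. This is only a minimal illustration: the function names `ecc_init_store_loop` and `ecc_init_memset` are hypothetical, and the base address and size are passed in as parameters so the same code can be tried on an ordinary host buffer instead of the real DRAM range.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Option (1): one explicit 32-bit store per word. The volatile qualifier
 * prevents the compiler from dropping or merging the stores.
 * (Hypothetical helper name; base/size are supplied by the caller.) */
static void ecc_init_store_loop(uintptr_t base, size_t size_bytes,
                                uint32_t pattern)
{
    volatile uint32_t *p = (volatile uint32_t *)base;
    size_t words = size_bytes / sizeof(uint32_t);

    for (size_t i = 0; i < words; i++) {
        p[i] = pattern;
    }
}

/* Option (2): let the C library fill the range. The width of each store
 * is whatever the memset implementation chooses. */
static void ecc_init_memset(uintptr_t base, size_t size_bytes)
{
    memset((void *)base, 0, size_bytes);
}
```

On the target, `base` and `size_bytes` would be the DRAM start address and the size of the region to initialize.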
Of course the DMA transfer (3) is much faster than CPU-initiated stores (1 and 2), because each AXI transaction it generates to the DDR controller has a burst length of 16, while each CPU-initiated store generates an AXI transaction with a burst length of 1. (1) is faster than (2) because it guarantees an access granularity of 32 bits per AXI transaction.
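To put rough numbers on the burst-length argument, the small sketch below counts the AXI write transactions needed to cover a region for a given burst length. The helper name `axi_write_transactions` is hypothetical, and the 32-bit (4-byte) beat width for the DMA bursts is an assumption made for illustration only.

```c
#include <stdint.h>

/* Number of AXI write transactions needed to cover size_bytes when each
 * transaction carries burst_len beats of beat_bytes each (rounded up).
 * Hypothetical helper, for back-of-the-envelope comparison only. */
static uint64_t axi_write_transactions(uint64_t size_bytes,
                                       uint32_t beat_bytes,
                                       uint32_t burst_len)
{
    uint64_t bytes_per_txn = (uint64_t)beat_bytes * burst_len;
    return (size_bytes + bytes_per_txn - 1) / bytes_per_txn;
}
```

Under these assumptions, a 4 GB range takes 1073741824 transactions at burst length 1, 134217728 at burst length 8, and 67108864 at burst length 16. The transaction count is only one factor in the real throughput.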
However, since CPU-initiated stores can use the caches, I considered enabling the L1 data cache and the L2 cache for the following reasons:
- stores can be coalesced into a 32-byte cache line before being written back to the DDR controller as a single AXI transaction of burst length 8 (8 x 32 bits = 32 bytes);
- CPU stores that access the DDR controller directly have a burst length of 1 only.
However, in the specific case of DRAM ECC initialization, all accesses are sequential and never hit in the caches. In other words, they always miss in the L1 data cache, which is configured as write-allocate. As a result, before the store is written into the cache, a load (line fill) is first issued to the back-end DDR DRAM. Because that DRAM address contains uninitialized data, the ECC check results in an uncorrectable ECC error and raises an exception. And this occurs for every new cache line!
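The access pattern described above can be illustrated with a toy model. This is a sketch only: the direct-mapped geometry, the structure `toy_cache`, and the helpers `toy_store32` and `toy_sweep` are all hypothetical, and the model ignores any hardware optimization the real Cortex-A9 or L2 controller may apply. It simply counts how many line-fill reads a sequential sweep of 32-bit stores triggers under a plain write-allocate policy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32u   /* cache line size, as stated in the post */
#define NUM_LINES  1024u /* toy direct-mapped geometry (assumption) */

struct toy_cache {
    bool     valid[NUM_LINES];
    uint32_t tag[NUM_LINES];   /* stores the full line number */
    uint64_t linefill_reads;   /* reads sent to DRAM on store misses */
};

/* Model one 32-bit store under write-allocate: a miss first fetches the
 * whole line from memory, then the store lands in the cache. */
static void toy_store32(struct toy_cache *c, uint32_t addr)
{
    uint32_t line  = addr / LINE_BYTES;
    uint32_t index = line % NUM_LINES;

    if (!c->valid[index] || c->tag[index] != line) {
        c->linefill_reads++;          /* this read would hit uninitialized ECC */
        c->valid[index] = true;
        c->tag[index]   = line;
    }
}

/* Sequential sweep of 32-bit stores; returns the total line-fill reads. */
static uint64_t toy_sweep(struct toy_cache *c, uint32_t base,
                          uint32_t size_bytes)
{
    for (uint32_t off = 0; off < size_bytes; off += 4u) {
        toy_store32(c, base + off);
    }
    return c->linefill_reads;
}
```

In this model a sweep over N bytes produces N / 32 line fills, that is, one read of uninitialized DRAM per new cache line, which matches the behaviour described in the post.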
Therefore, as I understand it, it is impossible to initialize DRAM ECC with caches enabled, especially the L1 data cache. My conclusion was that the DMA transfer is still the optimal solution for DRAM ECC initialization on an SoC with a Cortex-A9 CPU.
I would appreciate it if anyone could tell me whether I missed something. During the tests with the L1 data cache enabled, the CPU stalled, but I did not get any Cache Management Operation (CMO) Data Abort exception. So I am still unsure whether the AXI read transactions issued by the cache controller are returned to the CPU as SLVERR, or only to the cache controller, which is the actual master.
Thanks in advance for your help and the discussion.
No matter what you do, remember that the first 1MB cannot be accessed by DMA! AFAIK, the FSBL uses DMA, so why reinvent the wheel?
In my case the first 1MB is On-Chip Memory (OCM), and it can be accessed by PCAP DMA. I never remap the first 1MB to DRAM.
Besides, I use OCM as the source of the DMA, so that the read latency is lower.
I know ps7_init of the FSBL uses PCAP DMA. But I have custom boot software, and I don't want all of the DDR to be initialized while executing from OCM. OCM is SRAM and is sensitive to SEUs in a space environment. Thus I initialize one part of the DDR while executing the FSBL from OCM, and the other part while executing the SSBL from DDR. So my purpose was to demonstrate that I can reuse this method, and that it is the most efficient and robust one.
IIRC the whole DRAM needs to be initialized for ECC to work correctly, but I may be wrong.
If the DRAM's first MB is never mapped, it cannot be accessed by either software or DMA. By mapping I mean routing the AXI transactions for addresses within the first MB to either OCM or DRAM, which is configured through a dedicated register in the Xilinx Zynq PS by the BootROM and potentially the FSBL.
Besides, I guess you suspect that write/read transactions are issued to the DRAM during DDRC initialization for training purposes (write leveling, read leveling...). I am not sure about DDR3, but for DDR4 controllers, write leveling always precedes read leveling. So no uninitialized DRAM region is read during DDRC training.