Hello,
I have been working on developing comprehensive Data Abort and Prefetch Abort handlers for our dual-core ARM Cortex-A9 CPU.
Among the exceptions covered by these handlers, I am now trying to develop the part related to DDR ECC uncorrectable errors. Such an error actually results in an AXI slave error on the AXI read bus, which in turn triggers a Data Abort, and in particular an asynchronous external memory abort. In this case the Fault Status bits of the Data Fault Status Register (DFSR) are equal to 0x16 (in the short-descriptor format).
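For context, this is roughly how the handler classifies that abort; the helper name is just for illustration, and FS is rebuilt from DFSR[10] and DFSR[3:0] as defined for the short-descriptor format:

```c
#include <stdint.h>

/* Rough sketch of the DFSR check in the Data Abort handler.
 * In the short-descriptor format FS = {DFSR[10], DFSR[3:0]}, and
 * 0x16 (0b10110) is the asynchronous external abort encoding. */
static inline int is_async_external_abort(void)
{
    uint32_t dfsr;
    __asm__ volatile("mrc p15, 0, %0, c5, c0, 0" : "=r"(dfsr)); /* read DFSR */
    uint32_t fs = ((dfsr >> 6) & 0x10u) | (dfsr & 0xFu);        /* FS[4] | FS[3:0] */
    return fs == 0x16u;
}
```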
When such an exception occurs, I would like to further check whether the DDR ECC uncorrectable error is due to 2 stuck bits, 1 stuck bit, or only 2 flipped bits. The mitigation won't be the same in the 3 cases.
For this purpose I have to comply with the following procedure (a minimal C sketch follows the list):
1. Disable DDR ECC and single-bit error scrubbing;
2. Write all 1's;
3. Read back and check whether any bits are stuck at 0; if so, count them as stuck bits;
4. Write all 0's;
5. Read back and check whether any bits are stuck at 1; if so, count them as stuck bits and add them to the stuck-at-0 count;
6. Re-enable DDR ECC and single-bit error scrubbing.
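Here is a minimal sketch of that sequence for a single 32-bit word, assuming ecc_disable()/ecc_enable() are hypothetical wrappers around the DDRC register writes and fault_addr is the (already translated) AXI address of the failing word:

```c
#include <stdint.h>

extern void ecc_disable(void);   /* hypothetical wrapper: ECC + scrubbing off */
extern void ecc_enable(void);    /* hypothetical wrapper: ECC + scrubbing on  */

/* Returns the number of stuck bits in the failing 32-bit word.
 * A result of 0 means the uncorrectable error was only flipped bits. */
static unsigned count_stuck_bits(volatile uint32_t *fault_addr)
{
    unsigned stuck = 0;

    ecc_disable();                                   /* step 1 */

    *fault_addr = 0xFFFFFFFFu;                       /* step 2: write all 1's   */
    __asm__ volatile("dsb" ::: "memory");
    stuck += __builtin_popcount(~*fault_addr);       /* step 3: bits stuck at 0 */

    *fault_addr = 0x00000000u;                       /* step 4: write all 0's   */
    __asm__ volatile("dsb" ::: "memory");
    stuck += __builtin_popcount(*fault_addr);        /* step 5: bits stuck at 1 */

    ecc_enable();                                    /* step 6 */

    return stuck;
}
```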
When I perform this sequence inside the Data Abort handler, in Abort mode, I end up with an MMU translation fault before I have finished it.
So I am a bit confused. I don't see why my MMU table has been corrupted...
1) My bare-metal application runs in CPU System mode. The Data Abort handler runs in CPU Abort mode. The general-purpose registers {r0-r3, r12, lr} are saved and restored when switching from one mode to the other. But the same MMU table is used, right?
2) My MMU table address is set in TTBR0. I don't use TTBR1. From my understanding, TTBR1 is used for pages shared across processes, such as the OS kernel. Should I use TTBR1 for the page table instead of TTBR0?
3) All caches are disabled during my application's execution. So even though I left TTBR0's IRGN and RGN bits set to write-back write-allocate, the MMU table should never be evicted and written back to DDR from any cache... in other words, there should be no write transactions to the MMU table section in DDR (see the TTBR0 sketch just below)...
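In case it is relevant to point 3, here is a small sketch of forcing the TTBR0 table-walk attributes to non-cacheable so they match a caches-off configuration; the bit positions assume the Cortex-A9 TTBR0 layout with the Multiprocessing Extensions, so please check against the TRM before relying on it:

```c
#include <stdint.h>

/* Clear the TTBR0 walk attributes (IRGN[0]=bit0, S=bit1, RGN=bits[4:3],
 * NOS=bit5, IRGN[1]=bit6) so page-table walks are treated as non-cacheable. */
static inline void ttbr0_set_walk_noncacheable(void)
{
    uint32_t ttbr0;
    __asm__ volatile("mrc p15, 0, %0, c2, c0, 0" : "=r"(ttbr0));  /* read TTBR0  */
    ttbr0 &= ~0x7Fu;                                              /* clear attrs */
    __asm__ volatile("mcr p15, 0, %0, c2, c0, 0" :: "r"(ttbr0));  /* write back  */
    __asm__ volatile("isb" ::: "memory");
}
```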
I would really appreciate it if anyone has a hint on this issue.
Thank you.
Florian
I found out why I had an MMU translation error: I provided the DRAM physical address for the stuck-bit check, so the untranslated address hits an MMU page with the NO_ACCESS attribute (AP[1:0] = 0x0).
Actually, the DDR controller only reports the physical address of the DDR ECC uncorrectable error, not the logical AXI address.
As a result I developed a translation function to convert the reported physical address into a logical AXI address.
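A sketch of what I mean by that translation, under the purely hypothetical assumption of a flat linear mapping with a fixed offset (the real conversion depends on the DDRC address decoding and on the MMU page tables, so the constants below are placeholders):

```c
#include <stdint.h>

#define DDR_PHYS_BASE  0x00000000u   /* placeholder: base reported by the DDRC  */
#define DDR_AXI_BASE   0x00100000u   /* placeholder: AXI base mapped by the MMU */

/* Convert a DDRC-reported physical address into the AXI/logical address
 * used by the CPU, assuming a simple linear offset between the two. */
static uintptr_t dramps_PhysToAxi(uintptr_t phys_addr)
{
    return (phys_addr - DDR_PHYS_BASE) + DDR_AXI_BASE;
}
```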
However, I get another exception right after re-enabling DDR ECC, and before reading out the corrupted cacheline:
The error address is in the last section of the bare metal program...
Even when I do nothing between disabling and re-enabling ECC, I still get this exception...
I have narrowed down the issue to one thing:
I have developed a low-level driver for the DDR controller as dramps.c and dramps.h files.
Within these files I have declared and defined the dramps_DisableECC and dramps_EnableECC functions, which themselves use the static inline function Xil_Out32 to write to the DDRC registers.
In the main.c file of my bare-metal test I call the dramps_DisableECC and dramps_EnableECC functions. They are plain public void functions.
When I call dramps_EnableECC and dramps_DisableECC in main(), I end up with additional DDR ECC uncorrectable errors caught either by the Data Abort handler or the Prefetch Abort handler...
But if I use the Xil_Out32 function directly within main.c, I don't get these extra aborts...
So I came to the conclusion that I needed to define the dramps_DisableECC and dramps_EnableECC functions as static inline functions... My guess is that calling non-inline functions results in a jump to a different MMU section, thus requiring the MMU to load a new page table entry and maybe evict another one...
I still want to keep a separate low level driver with dramps.c and dramps.h files.
How would you suggest I implement that?
Thanks a lot for any help.
Should dramps_DisableECC and dramps_EnableECC be defined in the header file dramps.h, so that they can be declared static inline?
That's the only approach I'm aware of that remains compatible with a shareable low-level driver philosophy (a sketch of such a header follows)...
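Something along these lines is what I had in mind for dramps.h; the register offset and the values written are placeholders, only the overall shape (header-only static inline definitions on top of Xil_Out32) is the point:

```c
#ifndef DRAMPS_H
#define DRAMPS_H

#include <stdint.h>
#include "xil_io.h"                       /* Xil_Out32 */

#define DRAMPS_DDRC_BASE   0xF8006000u    /* placeholder DDRC base address   */
#define DRAMPS_ECC_CTRL    0x0C4u         /* placeholder ECC control offset  */

/* Defined static inline in the header so every caller gets an in-place
 * copy: no branch to another code section between disable and re-enable. */
static inline void dramps_DisableECC(void)
{
    Xil_Out32(DRAMPS_DDRC_BASE + DRAMPS_ECC_CTRL, 0x0u);  /* ECC + scrub off */
}

static inline void dramps_EnableECC(void)
{
    Xil_Out32(DRAMPS_DDRC_BASE + DRAMPS_ECC_CTRL, 0x3u);  /* ECC + scrub on  */
}

#endif /* DRAMPS_H */
```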
This did not seem to help, so I am just working with the static inline Xil_Out32 functions directly... I also implemented error injection for MMU page table entries and instructions using this flow, but one must make sure that no extra read or write to DDR occurs between disabling and re-enabling ECC, except for the injection write itself.
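For completeness, the injection window I use looks roughly like this; inject_addr/inject_val are whatever PTE or instruction word is being corrupted (hypothetical names), and the key point is that the write below is the only DDR access inside the window:

```c
#include <stdint.h>
#include "dramps.h"   /* dramps_DisableECC / dramps_EnableECC */

/* Write one corrupted word while ECC is off, so its stored ECC no longer
 * matches; the next read of that location after re-enabling ECC aborts. */
static inline void dramps_InjectError(volatile uint32_t *inject_addr,
                                      uint32_t inject_val)
{
    dramps_DisableECC();                      /* ECC + scrubbing off            */
    *inject_addr = inject_val;                /* only DDR access in the window  */
    __asm__ volatile("dsb" ::: "memory");
    dramps_EnableECC();                       /* ECC + scrubbing back on        */
}
```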