In a recent design I used a processor with a Cortex-A5 core (the SAMA5D27 from Microchip). There is one critical task that needs to be performed in real time. Could you please give me a hint on how to configure the processor for that?
My system runs a bare-metal application (custom firmware), not any "big" OS.
The critical task iterates in a loop, and each iteration should last no more than 1 µs. I believe this is possible, but my best time so far is almost 8 µs. As far as I can see, the reason is very slow access to the peripherals on the APB. From the processor's specification I expected five or six system clock cycles per access, which would be fine. But the accesses can take ten to twenty times longer.
While the critical task executes, the processor doesn't have to deal with anything else. Just this one thing; everything else can wait. The critical task starts on a user command, and the preparation for it can take as long as necessary.
I think the reason for the problem is that the system tries to be too versatile. During the task I only need access to peripherals on the APB (the GPIO controller and four SPI controllers) and to DDR memory. (And probably to SRAM, where the application code resides, but maybe it is already in the I-cache?) There is no need to access anything in parallel with something else.
Each iteration reads one 32-bit value from DDR. Then it only accesses the GPIOs and SPIs. After it finishes all the transfers, it stores four 32-bit values in DDR.
At first I ran the firmware with only the I-cache enabled. With the MMU and D-cache also enabled, it seems to run a bit faster. The system also has an L2 cache, which is enabled.
I am a hardware designer and so far I have no programming experience with caches and the MMU. Do I need those things here? Is it possible that they are somehow "misconfigured" from my point of view and impede the operation? Which features would make such a task run faster? Is it possible to configure the AHB and APB so that the accesses I need are really quick?
Please, any hints will be appreciated.
Running deterministic code with caches and an MMU isn't an easy task.
For example, make sure the data you read or write is aligned to a cache line (likely 32 bytes).
You can also use PLD to load the next cache line while handling the current one.
Setting the cache to write-through helps to avoid stalls when a cache line needs to be evicted.
Is the code running in an interrupt? Is anything running in the foreground?
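A minimal sketch of the alignment and prefetch advice in C, assuming a 32-byte line and a GCC or Clang toolchain, where `__builtin_prefetch` is normally lowered to `PLD` on ARMv7-A. The names `out_record_t`, `do_transfers` and `run_critical_loop` are hypothetical, and `do_transfers` only stands in for the real GPIO/SPI work:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* assumed L1 line size; verify for your core */

/* One output record: four 32-bit results. The alignas pads the struct
 * to 32 bytes, so each record owns one cache line and never straddles two. */
typedef struct {
    alignas(CACHE_LINE) uint32_t result[4];
} out_record_t;

/* Hypothetical placeholder for the GPIO/SPI transfers of one iteration. */
static void do_transfers(uint32_t cmd, uint32_t result[4])
{
    for (int i = 0; i < 4; i++)
        result[i] = cmd + (uint32_t)i;
}

void run_critical_loop(const uint32_t *in, out_record_t *out, size_t n)
{
    /* The output buffer must start on a cache-line boundary. */
    assert(((uintptr_t)out % CACHE_LINE) == 0);

    for (size_t i = 0; i < n; i++) {
        /* Once per input line (8 words), ask for the next line so it
         * arrives while the slow peripheral accesses are in progress. */
        if ((i % 8u) == 0 && i + 8u < n)
            __builtin_prefetch(&in[i + 8u], 0, 3);

        do_transfers(in[i], out[i].result);
    }
}
```

The prefetch is only a hint; whether it hides the DDR latency on a given setup has to be measured.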
The code runs as ordinary code. No interrupts, no other tasks at that time.
I can switch the MMU and caches off. I would prefer to have the configuration of the system as simple and as predictable as possible.
Naively, I would expect the MMU and D-cache to have no work to do here, because there is really nothing to cache. Only the first value read from DDR could be "prefetched" at a convenient moment. All the other values are "volatile" and are known only after reading from the peripherals. But the MMU and D-cache seem to help a little bit.
You need the MMU to make sure the peripherals are mapped as Device memory.
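As a rough illustration of what "mapped as Device" means, here is a sketch in C that builds ARMv7-A short-descriptor first-level entries (1 MB sections) for a flat, identity-mapped table. It assumes TEX remap is disabled (SCTLR.TRE = 0) and that domain 0 is set to client in DACR. The helper names are hypothetical, and the addresses in the usage note are assumptions to be checked against the SAMA5D2 memory map:

```c
#include <stdint.h>

/* Bit fields of a first-level section descriptor (short-descriptor format). */
#define SECTION_TYPE   0x2u          /* bits [1:0] = 0b10: section */
#define SECTION_B      (1u << 2)     /* bufferable */
#define SECTION_C      (1u << 3)     /* cacheable */
#define SECTION_XN     (1u << 4)     /* execute never */
#define SECTION_AP_RW  (3u << 10)    /* AP[1:0] = 0b11, AP[2] = 0: full access */

typedef enum {
    MEM_DEVICE,     /* TEX=000 C=0 B=1: shareable Device */
    MEM_NORMAL_WT,  /* TEX=000 C=1 B=0: Normal, write-through */
    MEM_NORMAL_WB   /* TEX=000 C=1 B=1: Normal, write-back, no write-allocate */
} mem_type_t;

/* Build one section entry mapping a 1 MB region onto itself (domain 0). */
uint32_t section_entry(uint32_t phys_addr, mem_type_t type)
{
    uint32_t e = (phys_addr & 0xFFF00000u) | SECTION_AP_RW | SECTION_TYPE;

    switch (type) {
    case MEM_DEVICE:    e |= SECTION_B | SECTION_XN; break;
    case MEM_NORMAL_WT: e |= SECTION_C;              break;
    case MEM_NORMAL_WB: e |= SECTION_C | SECTION_B;  break;
    }
    return e;
}

/* Fill size_mb consecutive 1 MB entries of a 4096-entry table,
 * starting at the entry that covers address base. */
void map_range(uint32_t *table, uint32_t base, uint32_t size_mb,
               mem_type_t type)
{
    uint32_t first = base >> 20;

    for (uint32_t i = 0; i < size_mb && first + i < 4096u; i++)
        table[first + i] = section_entry((first + i) << 20, type);
}
```

A possible use: map everything as Device first with `map_range(table, 0, 4096, MEM_DEVICE)`, then override the DDR range, for example `map_range(table, 0x20000000u, 512, MEM_NORMAL_WT)` if DDR sits at that address, and map the region holding the code as Normal memory without XN. The table itself must be 16 KB aligned before its address is written to TTBR0.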
If you do not use an operating system:
- turn off the MMU at boot and use a flat mapping;
- configure the clocks to the desired speeds or to the maximum frequencies, or test different frequencies;
- try to use xDMA for some of the memory transfers, if possible;
- you can configure the caches if you wish; test both with the caches enabled and without them;
- load your code into internal SRAM (if I am not mistaken, internal SRAM runs at the processor's clock speed);
- don't forget about the TrustZone support embedded in the Bus Matrix; it can cause unpredictable interrupts because of protected memory zones.