Hi,
I am using an iMX 8X, which has one cluster of four Cortex-A35 cores, with DDR3L (DDR3-1866) and ECC enabled.
I measured MEMSET, MEMCPY and MEMREAD functions to estimate the DDR bandwidth, with one Cortex-A35 core running. Here are the best results I have:
- MEMSET: 6079 MB/s
- MEMCPY: 2081 MB/s
- MEMREAD: 2880 MB/s
The functions are based on NEON instructions with memory prefetch instructions (except MEMSET, which has no prefetch instructions), and the caches and MMU are enabled.
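For readers who want to reproduce this kind of measurement, here is a minimal sketch of a timing harness. It is not the NEON code used for the figures above: it swaps in the standard `memset` and `memcpy`, times them with `clock()` (CPU time, which is close to wall time for a single-threaded loop), and the function names `memset_mb_s` and `memcpy_mb_s` are hypothetical. The buffer should be much larger than the last-level cache so that the loop actually reaches DDR.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Calling through volatile function pointers keeps the compiler from
 * eliding or merging the memory operations being timed. */
static void *(*volatile memset_fn)(void *, int, size_t) = memset;
static void *(*volatile memcpy_fn)(void *, const void *, size_t) = memcpy;

/* Convert "bytes moved reps times between t0 and t1" into MB/s (1 MB = 1e6 bytes). */
static double mb_per_s(size_t bytes, int reps, clock_t t0, clock_t t1)
{
    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0; /* run too short for the clock resolution */
    return ((double)bytes * reps) / seconds / 1e6;
}

/* Returns the memset rate in MB/s, or -1.0 on bad size or allocation failure. */
double memset_mb_s(size_t bytes, int reps)
{
    unsigned char *buf = malloc(bytes);
    if (buf == NULL || bytes == 0) {
        free(buf);
        return -1.0;
    }
    memset_fn(buf, 0xA5, bytes); /* warm-up: make sure the pages are mapped */
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        memset_fn(buf, i & 0xFF, bytes);
    clock_t t1 = clock();
    free(buf);
    return mb_per_s(bytes, reps, t0, t1);
}

/* Returns the memcpy rate in MB/s (bytes copied, not total bus traffic). */
double memcpy_mb_s(size_t bytes, int reps)
{
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL || bytes == 0) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset_fn(src, 0x5A, bytes); /* warm-up for both buffers */
    memset_fn(dst, 0x00, bytes);
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        memcpy_fn(dst, src, bytes);
    clock_t t1 = clock();
    free(src);
    free(dst);
    return mb_per_s(bytes, reps, t0, t1);
}
```

A call such as `memset_mb_s(64u << 20, 16)` would time sixteen passes over a 64 MiB buffer; the sizes are only an illustration and should be adapted to the cache sizes of the target.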
The idea here is to configure the core or the cluster components to get as close as possible to the theoretical bandwidth, which is 7464 MB/s (DDR3-1866, 32-bit bus), in order to speed up code execution from DDR for a normal application running on one Cortex-A35 core.
As the MEMSET measured bandwidth seems acceptable (81% of theoretical bandwidth), it would be surprising if read accesses were not optimizable.
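The arithmetic behind these figures can be sketched as follows. This is only an illustration: the helper names `ddr_peak_mb_s` and `percent_of_peak` are hypothetical, and the peak is the raw bus figure (transfer rate times bus width in bytes), ignoring refresh, ECC and controller overhead.

```c
#include <stdint.h>

/* Raw peak bandwidth in MB/s: mega-transfers per second times bytes per transfer. */
uint32_t ddr_peak_mb_s(uint32_t mega_transfers_per_s, uint32_t bus_width_bits)
{
    return mega_transfers_per_s * (bus_width_bits / 8u);
}

/* Measured bandwidth as an integer percentage of the peak, rounded down. */
uint32_t percent_of_peak(uint32_t measured_mb_s, uint32_t peak_mb_s)
{
    if (peak_mb_s == 0u)
        return 0u;
    return (uint32_t)(((uint64_t)measured_mb_s * 100u) / peak_mb_s);
}
```

With the numbers from this post, `ddr_peak_mb_s(1866, 32)` gives 7464 MB/s, and `percent_of_peak` gives 81 for MEMSET (6079 MB/s), 38 for MEMREAD (2880 MB/s) and 27 for MEMCPY (2081 MB/s). If the MEMCPY figure counts bytes copied, each byte is both read and written, so the traffic on the bus is roughly twice that number.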
Given the read access latency (13 cycles) and write access latency (9 cycles) of the DDR chip used, I would have expected a difference between the MEMSET and MEMREAD functions, but not this large. With the MMU and caches enabled, I would expect the controller to perform continuous accesses to the DDR, which minimizes the impact of the read and write latencies.
I have already posted some questions about the DDR controller of the iMX 8X on the NXP forum. I also tried different settings in the Cortex-A35 to optimize the read accesses, but I can't get any significant improvement:
In some discussions on the iMX forums, I also found that using 4 cores instead of 1 increases the available bandwidth by 10%-20%.
Whether or not the caches and MMU are used directly affects the memory test results (because cache lines are filled in the background). I am convinced that there are still things I need to understand to configure the core correctly, but I can't find what.
Does anyone have information on:
Thanks,
Gael
Hello Gael,
As I said, I only referenced the early write response in the context of the Cortex-A9 MPCore, which includes an external SCU and L2 cache controller. This optimization is provided by the SCU, which sends an early AXI response as soon as it has buffered the write data.
In the case of the Cortex-A35, there might be other options, as you mentioned.
Are you using a single-core Cortex-A35 CPU or an MPCore? In which SoC?
I also could not find a clear answer in publicly available documents. I had this value in mind from previous projects.
For further analysis of performance metrics, you could use the Performance Monitor Unit (PMU) to count load/store instructions, data cache accesses, data reads/writes, data cache misses, etc. This is also more accurate than using the timers for bandwidth measurements.
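As a rough illustration of the PMU approach, here is a minimal sketch. It assumes bare-metal code running in AArch64 state at EL1 or higher, a GCC- or Clang-style toolchain, and that event counter 0 is free. The helper names (`pmu_start`, `pmu_cycles`, `pmu_event0`, `bandwidth_mb_s`) are hypothetical, and the CPU clock passed as `cpu_hz` has to come from your own clock configuration. The event numbers are the ARMv8 architectural common events; check the Cortex-A35 TRM for what the core actually implements.

```c
#include <stdint.h>

/* ARMv8 architectural PMU common event numbers. */
#define PMU_EVT_L1D_CACHE_REFILL 0x03u
#define PMU_EVT_L1D_CACHE        0x04u
#define PMU_EVT_BUS_ACCESS       0x19u

#if defined(__aarch64__)
/* Program event counter 0 with `event`, then reset and start it together
 * with the cycle counter. Assumes execution at EL1 or higher. */
static inline void pmu_start(uint32_t event)
{
    uint64_t pmcr;
    __asm__ volatile("msr pmevtyper0_el0, %0" :: "r"((uint64_t)event));
    __asm__ volatile("msr pmccfiltr_el0, xzr");       /* no cycle-count filtering */
    __asm__ volatile("mrs %0, pmcr_el0" : "=r"(pmcr));
    pmcr |= 0x7u;                                     /* E=enable, P and C=reset counters */
    __asm__ volatile("msr pmcr_el0, %0" :: "r"(pmcr));
    /* bit 31 enables the cycle counter, bit 0 enables event counter 0 */
    __asm__ volatile("msr pmcntenset_el0, %0" :: "r"((1ull << 31) | 1ull));
    __asm__ volatile("isb");
}

static inline uint64_t pmu_cycles(void)
{
    uint64_t v;
    __asm__ volatile("isb\n\tmrs %0, pmccntr_el0" : "=r"(v));
    return v;
}

static inline uint64_t pmu_event0(void)
{
    uint64_t v;
    __asm__ volatile("isb\n\tmrs %0, pmevcntr0_el0" : "=r"(v));
    return v;
}
#endif

/* Convert bytes moved in `cycles` CPU cycles into MB/s (1 MB = 1e6 bytes).
 * bytes * cpu_hz must fit in 64 bits, which holds for buffers of a few GB
 * at GHz-range clocks. */
uint64_t bandwidth_mb_s(uint64_t bytes, uint64_t cycles, uint64_t cpu_hz)
{
    if (cycles == 0u)
        return 0u;
    return bytes * cpu_hz / cycles / 1000000u;
}
```

A typical use would be to call `pmu_start(PMU_EVT_L1D_CACHE_REFILL)`, run the memory test, then read `pmu_cycles()` and `pmu_event0()` and pass the cycle count to `bandwidth_mb_s` along with the number of bytes moved. Comparing refill and bus-access counts between the MEMSET and MEMREAD runs may help show where the read path stalls.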