Hi,
I am using an i.MX 8X, which has one cluster of 4 Cortex-A35 cores, with DDR3L (DDR3-1866) and ECC enabled.
I performed some measurements with MEMCPY and MEMSET functions to estimate the DDR bandwidth, with one Cortex-A35 core running. Here are the best results I have:
- MEMSET: 6079 MB/s
- MEMCPY: 2081 MB/s
- MEMREAD: 2880 MB/s
The functions are based on NEON instructions with memory prefetch instructions (except MEMSET, which has none), and the caches and MMU are active.
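For reference, here is a minimal sketch of how such a MB/s figure can be obtained, using the Armv8 generic counter and plain memset() as a stand-in for the hand-written NEON routines (the 64 MB buffer size is illustrative, and the counter is assumed to be accessible at the current exception level):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: time a bulk operation over a buffer much larger than the
 * caches with the Armv8 generic counter, then divide bytes by seconds. */
static inline uint64_t read_cntvct(void)
{
    uint64_t v;
    __asm__ volatile("isb; mrs %0, cntvct_el0" : "=r"(v));
    return v;
}

int main(void)
{
    const size_t size = 64u << 20;          /* 64 MB, much larger than L1+L2 */
    uint8_t *buf = malloc(size);
    if (!buf)
        return 1;

    uint64_t freq;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(freq));

    uint64_t t0 = read_cntvct();
    memset(buf, 0, size);                   /* stand-in for the NEON MEMSET */
    uint64_t t1 = read_cntvct();

    double secs = (double)(t1 - t0) / (double)freq;
    printf("MEMSET: %.0f MB/s\n", (double)size / secs / 1e6);
    free(buf);
    return 0;
}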
The idea here is to configure the core or the cluster components to get as close as possible to the theoretical bandwidth, which is 7464 MB/s (DDR3-1866, 32-bit bus), in order to speed up code execution from DDR for a normal application running on one Cortex-A35 core.
As the measured MEMSET bandwidth seems acceptable (81% of the theoretical bandwidth), it would be surprising if the read accesses could not be optimized.
Given the read latency of the DDR chip used (13 cycles) and its write latency (9 cycles), I would have expected a difference between the MEMSET and MEMREAD functions, but not one this large, especially because, with the MMU and caches active, I would expect the controller to perform continuous accesses to the DDR, where the impact of the read and write latencies is minimized.
I have already posted some questions about the DDR controller of the iMX 8X on the NXP forum, and I also tried different settings in the Cortex-A35 to optimize the read accesses, but I can't get significant improvements.
In some discussions on the iMX forums, I also found that using 4 cores instead of 1 enhances the available bandwidth by 10%-20%.
Whether or not the caches and MMU are used directly impacts the memory test results (because cache lines are filled in the background), and I am convinced that there are still things I need to understand to configure the core correctly, but I can't find what.
Does anyone have information on this?
Thanks,
Gael
Hello Gael,
What value did you use for memset? Only zero?
Since you use cacheable memory attributes for the range you set, the performance depends on the size of the data (whether or not it fits in the L1 and/or L2 caches).
If you want to evaluate the DDR3 bandwidth, you may disable the caches or mark the DDR pages as non-cacheable in the MMU translation table.
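For illustration, a bare-metal-style sketch of what such a non-cacheable mapping can look like at the translation-table level (the MAIR slot assignment, the 2 MB block granularity and the helper name are assumptions for the example, not your actual setup):

#include <stdint.h>

/* MAIR_EL1 layout used in this sketch: slot 0 = Normal Write-Back cacheable,
 * slot 1 = Normal Non-cacheable (0xFF and 0x44 are the architectural
 * encodings for these two attribute types). */
#define MAIR_ATTR0_NORMAL_WB  0xFFull
#define MAIR_ATTR1_NORMAL_NC  0x44ull
#define MAIR_EL1_VALUE        (MAIR_ATTR0_NORMAL_WB | (MAIR_ATTR1_NORMAL_NC << 8))

/* Build a level-2 block descriptor (4 KB granule, 2 MB block) for a DDR
 * region: bits[1:0] = 0b01 (block), AttrIndx in bits[4:2], AF set so the
 * first access does not fault, Inner Shareable. */
static uint64_t ddr_block_desc(uint64_t phys_addr, unsigned attr_index)
{
    return (phys_addr & ~0x1FFFFFull)       /* 2 MB-aligned output address */
         | (1ull << 10)                     /* AF: access flag             */
         | (3ull << 8)                      /* SH: inner shareable         */
         | ((uint64_t)attr_index << 2)      /* AttrIndx into MAIR_EL1      */
         | 0x1ull;                          /* valid block descriptor      */
}

/* To observe raw DDR behaviour, map the benchmark buffer with attr_index 1
 * (Normal Non-cacheable) instead of 0 (Write-Back), write the descriptor
 * into the table, then TLBI VMALLE1 + DSB + ISB before rerunning the test. */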
Depending on whether you have a multi-core SoC and a Snoop Control Unit (SCU), you may also need to check whether there are memory access optimizations such as Early Write Response or Full Line of Zero Write.
The read bandwidth can be better utilized with a DMA engine. With CPU load instructions, you can only have four pending loads at a time.
Regards,
Florian
Hello Florian,
Thanks for your answer.
Yes, for memset the value used is 0 in these tests. I am also aware of the Data Cache Zero instruction that bypasses the L1 and L2 caches, but I didn't look into optimizations for writes since the write performance appeared acceptable. Moreover, according to CPUACTLR.RADIS and L1RADIS, I understand that writes to consecutive addresses bypass the caches after a specific number of accesses.
The functions I coded use NEON instructions (a loop performing 4 ld1 and/or 4 st1 instructions, i.e. 128 bits x 4 = 64 bytes per iteration).
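For reference, the loop has roughly the following shape (simplified sketch; register choice, prefetch distance and alignment handling shown here are illustrative, not the exact code used for the measurements):

#include <stddef.h>
#include <stdint.h>

/* Copy 64 bytes per iteration with four 128-bit LD1s and four 128-bit ST1s,
 * prefetching the source a few cache lines ahead. Assumes len is a multiple
 * of 64 and the buffers are suitably aligned. */
static void neon_copy64(void *dst, const void *src, size_t len)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

    for (size_t n = len / 64; n; n--) {
        __asm__ volatile(
            "prfm pldl1keep, [%1, #256]   \n"   /* prefetch ~4 lines ahead */
            "ld1  {v0.16b}, [%1], #16     \n"
            "ld1  {v1.16b}, [%1], #16     \n"
            "ld1  {v2.16b}, [%1], #16     \n"
            "ld1  {v3.16b}, [%1], #16     \n"
            "st1  {v0.16b}, [%0], #16     \n"
            "st1  {v1.16b}, [%0], #16     \n"
            "st1  {v2.16b}, [%0], #16     \n"
            "st1  {v3.16b}, [%0], #16     \n"
            : "+r"(d), "+r"(s)
            :
            : "v0", "v1", "v2", "v3", "memory");
    }
}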
I don't understand why the performance depends on the data size: activating the MMU (which defines this memory as Normal cacheable) should group the accesses together and increase the performance, right?
I also tried disabling the caches and/or the MMU, and the results are much worse. Here are the numbers I have (the numbers measured with caches and MMU active are not the ones in the original post because the code was changed after that):
MMU and caches disabled:
MMU disabled, caches enabled (no difference):
MMU enabled, caches disabled:
MMU enabled, caches enabled:
When you mention "Early write response" and "Full line of Zero write", I guess these would be optimizations for writes, right? I checked the Cortex-A35 Technical Reference Manual: the first one does not show up, and I think the second one would be the Data Cache Zero by Virtual Address instruction, is that it?
About pending loads, where is this described in the Cortex-A35 TRM? I only found that, on the L1 data side, reads are 128 bits wide compared with 256 bits for writes to the L2 memory system, but in the L2 memory system the master memory interface is only 128 bits. I also found the "Support for eight outstanding linefill requests" in the L1 memory system description, and I tried changing the default value of CPUACTLR.L1PCTL, but this had no effect.
Cache linefills are done in bursts of 32 or 64 bytes (I am not sure about the cache-line size in the A35). This certainly contributes to the large difference.
About data size: if the data fits into the cache, then you are going to measure the pure cache performance.
So if the cache is 32 KB, do a memcpy() of at least 64 KB.
As for memset(): Use a value != 0.
1. Cacheability considerations for memory system throughput:
For non-cacheable memory regions, the merge write buffer on the data path to memory will indeed combine requests to the same cache line. For cacheable memory regions, it may instead issue individual write transactions to the cache system. Write merging works well for sequential accesses such as memset.
The Cortex-A35 is based on the Armv8-A architecture and supports both the AArch64 and AArch32 instruction sets. For optimal performance I guess you use AArch64 instructions; the cache-line size is 64 bytes.
As Bastian said, the maximum merged write burst is therefore 64 bytes, with 64-byte alignment.
The overall performance for cacheable accesses depends on the data size, or more precisely on the cache hit rate. If the accesses always hit in the cache, the latency is that of the cache; if they miss, there is the additional latency of accessing the back-end memory.
2. Effect of disabling MMU:
Disabling the MMU will actually make all your memory accesses strongly ordered, without any pipelining, as only one access is issued at a time, which explains the minimal performance. This is not the same as having the MMU enabled with memory attributes set to Normal but with the caches disabled.
3. L2 cache system
The "Early write response" and "Full line of Zero write" optimizations are performed by the L2 cache controller and/or the SCU, and operate on both L2 cache and back end memory. I never worked with Cortex A35 CPU, but it seems that as you mentioned that the Data Cache Zero by MVA can clear one cacheline in the L1 cache, and the rest of the cache system up to back end memory for the same cacheline if set properly (the Point of Coherence shall be set to L3).
4. Number of allowed pending loads
I am not familiar with coding NEON instructions, but if you issue a VLDM, I guess it is handed to the load/store unit, which is limited in the number of load and store requests it can have pending; these are issued by the Load Store Unit (LSU), or what Arm seems to call the Data Cache Unit (DCU) of the L1 cache system. Maybe someone else can elaborate on that?
Sorry, I didn't give all the information regarding the system and cluster caches. Here is the information to complete the description:
The memory tests are based on a 64 MB block. The MEMCPY test uses different source and destination blocks.
The memory test functions use 128-bit load and store instructions.
For the MEMSET function I did indeed use 0, loaded into a register which is provided to the stp instruction. What would be different with another value?
I also checked Early Write Response: according to chapter B2.7.2 of the Armv8-A Architecture Reference Manual, it applies to Device memory accesses, but I can't find anything about it for Normal memory, except in the Cortex-A9 documentation where a setting in the Auxiliary Control Register is mentioned (and it is not found in the Cortex-A35 TRM).
For the number of pending loads, I couldn't find any specific information in the Cortex-A35 TRM.
If anyone from Arm could give more details about the DCU and SCU (capacities, behaviour), or even hints on why reads are so much slower than writes, it would be appreciated.
As I said, I just referenced Early Write Response in the context of the Cortex-A9 MPCore, which includes an external SCU and L2 cache controller. This optimization is provided by the SCU, which sends an early AXI response as soon as it has buffered the write data.
For the Cortex-A35 there might be other options, as you mentioned.
Are you using a single-core Cortex-A35, or an MPCore? In which SoC?
I also could not find a clear answer in publicly available documents. I had this value in mind from previous projects.
For further analysis of the performance metrics, you could use the Performance Monitor Unit (PMU) to count load/store instructions, data cache accesses, data reads/writes, data cache misses, etc. This is also more accurate than using the timers for bandwidth measurements.
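As an illustration, a bare-metal-style sketch of such a measurement, assuming the PMU system registers are accessible at the current exception level and using the architectural L1D_CACHE_REFILL event (run_memcpy_test() is just a placeholder for your benchmark):

#include <stdint.h>

#define EVT_L1D_CACHE_REFILL  0x03u   /* architectural PMU event number */

/* Program event counter 0 for L1D refills, enable it together with the
 * cycle counter, and reset both. */
static inline void pmu_start(void)
{
    uint64_t pmcr;
    __asm__ volatile("msr pmevtyper0_el0, %0" :: "r"((uint64_t)EVT_L1D_CACHE_REFILL));
    __asm__ volatile("msr pmccfiltr_el0, xzr");                  /* count cycles at EL0/EL1 */
    __asm__ volatile("msr pmcntenset_el0, %0" :: "r"((1ull << 31) | 1ull));
    __asm__ volatile("mrs %0, pmcr_el0" : "=r"(pmcr));
    __asm__ volatile("msr pmcr_el0, %0" :: "r"(pmcr | 0x7));     /* E + reset event/cycle */
    __asm__ volatile("isb");
}

/* Read back the cycle counter and event counter 0. */
static inline void pmu_read(uint64_t *cycles, uint64_t *l1d_refills)
{
    __asm__ volatile("isb");
    __asm__ volatile("mrs %0, pmccntr_el0"   : "=r"(*cycles));
    __asm__ volatile("mrs %0, pmevcntr0_el0" : "=r"(*l1d_refills));
}

/* Usage sketch: pmu_start(); run_memcpy_test(); pmu_read(&c, &m);
 * refills x 64 bytes then gives the traffic actually seen beyond the L1. */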
Your access to the information in this Cortex-A Series Programmer’s Guide is conditional upon your acceptance that you will not use or permit others to use the information for the purposes of determining whether implementations of the information herein infringe any third party patents.
I am not sure how I should understand your answer, Curtisi. I am just looking for information on the functional behaviour of the Cortex-A35 in order to improve the usable DDR bandwidth, if possible.