I evaluated the performance overhead of memcpy (memory copy) in both NS.EL1 (a kernel module) and EL3 (Arm Trusted Firmware):
1. I allocated two contiguous physical memory buffers A and B via Linux's CMA allocator (specifically, via cma_alloc()).
2. In NS.EL1's kernel module, I directly use memcpy(A, B, sizeof(A)) for memory copy.
3. In EL3 BL31, I use memcpy(phys_addr(A), phys_addr(B), sizeof(A)) for memory copy. Note that I initialized the EL3 page table (flat region map) during bl31_setup and pass the two buffers' physical addresses directly to memcpy, so no page fault is triggered.
4. To evaluate the overhead, I read the PMU counter register pmccntr: asm volatile("mrs %0, pmccntr_el0" : "=r" (r));
The cycle measurement looks like this:
```
start = getCycle();
memcpy(A, B, ...);
end = getCycle();
cycle = end - start;
```
I performed the evaluation on a Juno R2 board and varied the memory buffer size from 4KB to 64MB. During the evaluation, only one CPU core was enabled. This is a short summary of the results:
Counterintuitively, I find that memcpy() in EL3 is 10x slower than in NS.EL1 kernel module. Are there any possible explanations? Is this due to different cache & data coherence models in EL1 and EL3?
If you are using Linux in NS.EL1, the memory copy routine is a highly optimised one, imported from https://github.com/ARM-software/optimized-routines. I am not sure the same is true of the software you are running in EL3.
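One way to check whether the copy routine alone could account for a gap like this is to time a plain byte-at-a-time loop against the C library's memcpy on the same buffer. The following is only a minimal sketch under stated assumptions: `bytewise_memcpy`, `now_ns`, `time_copy` and `compare_copies` are hypothetical names, the byte loop is an assumed stand-in for an unoptimised firmware routine (I have not checked what your EL3 image actually uses), and it times with `clock_gettime` from user space rather than reading `pmccntr_el0`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime under strict -std modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical stand-in for an unoptimised copy: one byte per iteration.
 * The volatile destination keeps the compiler from replacing the loop
 * with a call to the optimised library memcpy. */
static void *bytewise_memcpy(void *dst, const void *src, size_t len)
{
    volatile unsigned char *d = dst;
    const unsigned char *s = src;

    while (len--)
        *d++ = *s++;
    return dst;
}

/* Monotonic wall-clock time in nanoseconds (not a PMU cycle count). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Same start/copy/end pattern as in the question, with a pluggable routine. */
static uint64_t time_copy(copy_fn fn, void *dst, const void *src, size_t len)
{
    uint64_t start = now_ns();

    fn(dst, src, len);
    return now_ns() - start;
}

/* Times both routines on a len-byte buffer and verifies each copy.
 * Returns 0 on success, -1 if allocation or verification fails. */
static int compare_copies(size_t len, uint64_t *bytewise_ns, uint64_t *libc_ns)
{
    unsigned char *src = malloc(len);
    unsigned char *dst = malloc(len);
    int rc = -1;

    if (src != NULL && dst != NULL) {
        for (size_t i = 0; i < len; i++)
            src[i] = (unsigned char)(i * 31u);

        *bytewise_ns = time_copy(bytewise_memcpy, dst, src, len);
        int ok = memcmp(dst, src, len) == 0;

        memset(dst, 0, len); /* clear so the second copy is checked independently */
        *libc_ns = time_copy(memcpy, dst, src, len);
        ok = ok && memcmp(dst, src, len) == 0;

        rc = ok ? 0 : -1;
    }
    free(src);
    free(dst);
    return rc;
}
```

Calling `compare_copies` with sizes from 4KB to 64MB would give a rough host-side picture of how much the routine itself matters. The sketch deliberately says nothing about which side wins on your board, since that also depends on how the memory is mapped in each exception level.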