Why is the overhead of memcpy() in EL3 higher than in NS.EL1?

I measured the cost of memcpy() (memory copy) in both NS.EL1 (a Linux kernel module) and EL3 (Arm Trusted Firmware BL31):

1. I allocated two physically contiguous memory buffers, A and B, via Linux's CMA allocator (specifically, via cma_alloc()).

2. In the NS.EL1 kernel module, I call memcpy(A, B, sizeof(A)) directly.

3. In EL3 BL31, I call memcpy(phys_addr(A), phys_addr(B), sizeof(A)). Note that I set up the EL3 page table (a flat region map) during bl31_setup and pass the two buffers' physical addresses directly, so no page fault is triggered.

4. To measure the overhead, I read the PMU cycle counter register pmccntr_el0: asm volatile("mrs %0, pmccntr_el0" : "=r" (r));

The cycle measurement looks like this:

```
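start = getCycle();      /* read pmccntr_el0 before the copy */
memcpy(A, B, ...);
end = getCycle();        /* read pmccntr_el0 after the copy */
cycle = end - start;     /* elapsed cycles for this copy */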

```
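A minimal sketch of how this measurement could be written out in C. It is not the exact code used on the board: `get_cycle()` and `time_memcpy()` are illustrative names, and `USE_PMCCNTR` is a hypothetical build flag meant only for AArch64 code that is permitted to read pmccntr_el0 (a kernel module or EL3). Without that flag the sketch falls back to a monotonic clock in nanoseconds, so it can be tried in user space, where the numbers are not cycle counts.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std=c99 */
#endif
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Read a monotonically increasing counter.
 * USE_PMCCNTR is a hypothetical flag: define it only for AArch64 code
 * that is allowed to read PMCCNTR_EL0 (kernel module or EL3). */
static inline uint64_t get_cycle(void)
{
#if defined(__aarch64__) && defined(USE_PMCCNTR)
    uint64_t r;
    /* The isb keeps the counter read from executing ahead of the barrier. */
    __asm__ volatile("isb\n\tmrs %0, pmccntr_el0" : "=r"(r) : : "memory");
    return r;
#else
    /* Portable fallback: nanoseconds from a monotonic clock, not cycles. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Time a single memcpy of `size` bytes and return the elapsed ticks. */
static uint64_t time_memcpy(void *dst, const void *src, size_t size)
{
    uint64_t start = get_cycle();
    memcpy(dst, src, size);
    uint64_t end = get_cycle();
    return end - start;
}
```

Calling `time_memcpy(A, B, size)` once per buffer size gives one sample per row of the results; repeating each size several times and keeping the minimum would likely reduce noise from the first, cold run.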
I performed the evaluation on a Juno R2 board and varied the buffer size from 4KB to 64MB. During the evaluation, I enabled only one CPU core. This is a short summary of the results:

| Memcpy size | Time in EL1 kernel module | Time in EL3 BL31 (Arm Trusted Firmware) |
|---|---|---|
| 4KB | 1,324 cycles / 0.0015 ms | 20,785 cycles / 0.02 ms |
| 64KB | 22,412 cycles / 0.026 ms | 328,951 cycles / 0.39 ms |
| 1MB | 549,383 cycles / 0.66 ms | 5,446,983 cycles / 6.5 ms |
| 64MB | 38,262,713 cycles / 45.91 ms | 348,783,503 cycles / 418.5 ms |
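To make the gap easier to compare across sizes, the cycle counts above can be turned into a slowdown ratio and a bytes-per-cycle figure. The following is a small sketch of that arithmetic; the struct and function names are made up for illustration, and the only inputs are the cycle counts already listed. Working the numbers by hand, the ratio comes out at roughly 9x to 16x, and the EL3 throughput stays close to 0.2 bytes per cycle at every size while the EL1 throughput drops as the buffer grows.

```c
#include <stdint.h>
#include <stdio.h>

/* One row of the measurements (names are illustrative). */
struct sample {
    const char *label;
    uint64_t bytes;
    uint64_t el1_cycles;
    uint64_t el3_cycles;
};

/* Cycle counts copied from the results above. */
static const struct sample samples[] = {
    { "4KB",  4096ull,     1324ull,     20785ull     },
    { "64KB", 65536ull,    22412ull,    328951ull    },
    { "1MB",  1048576ull,  549383ull,   5446983ull   },
    { "64MB", 67108864ull, 38262713ull, 348783503ull },
};

/* How many times slower the EL3 copy is than the EL1 copy. */
static double slowdown(const struct sample *s)
{
    return (double)s->el3_cycles / (double)s->el1_cycles;
}

/* Copy throughput expressed as bytes copied per counted cycle. */
static double bytes_per_cycle(uint64_t bytes, uint64_t cycles)
{
    return (double)bytes / (double)cycles;
}

/* Print one line per buffer size. */
static void print_summary(void)
{
    size_t n = sizeof(samples) / sizeof(samples[0]);
    for (size_t i = 0; i < n; i++) {
        const struct sample *s = &samples[i];
        printf("%-5s slowdown %6.2fx  EL1 %.3f B/cycle  EL3 %.3f B/cycle\n",
               s->label, slowdown(s),
               bytes_per_cycle(s->bytes, s->el1_cycles),
               bytes_per_cycle(s->bytes, s->el3_cycles));
    }
}
```

Calling `print_summary()` from a `main()` prints one line per row of the results.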

Counterintuitively, memcpy() in EL3 is roughly 10x slower than in the NS.EL1 kernel module (about 9x to 16x across the sizes above). Are there any possible explanations? Is this due to different cache and data coherence models in EL1 and EL3?