I evaluated the performance overhead of memcpy (memory copy) in both NS.EL1 (a kernel module) and EL3 (Arm Trusted Firmware):
1. I allocated two contiguous physical memory buffers A and B via Linux's CMA allocator (specifically, via cma_alloc()).
2. In NS.EL1's kernel module, I directly use memcpy(A, B, sizeof(A)) for memory copy.
3. In EL3 BL31, I use memcpy(phys_addr(A), phys_addr(B), sizeof(A)) for memory copy. Note that I initialized EL3's page table (a flat region map) during bl31_setup and pass the two buffers' physical addresses directly to memcpy, so no page fault is triggered.
4. To evaluate the overhead, I read the PMU counter register pmccntr: asm volatile("mrs %0, pmccntr_el0" : "=r" (r));
The cycle measurement looks like this:
```
start = getCycle();
memcpy(A, B, ...);
end = getCycle();
cycle = end - start;
```
I performed the evaluation on a Juno R2 board and varied the memory buffer size from 4KB to 64MB. During the evaluation, I enabled only one CPU core. This is a short summary of the results:
Counterintuitively, I find that memcpy() in EL3 is 10x slower than in the NS.EL1 kernel module. Are there any possible explanations? Is this due to different cache and data coherence models in EL1 and EL3?
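
For anyone who wants to reproduce the setup, here is a minimal sketch of the measurement loop described above. It is not my exact code. `USE_PMCCNTR` is a hypothetical build flag that selects the `pmccntr_el0` read from step 4; without it the sketch falls back to `clock_gettime()` so it also builds in user space. The `isb` barriers, the repeated runs, and the `malloc()` buffers (standing in for the CMA buffers A and B) are assumptions added for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Read a monotonically increasing counter.
 * With USE_PMCCNTR (hypothetical flag) on AArch64 at EL1/EL3, read the PMU
 * cycle counter as in step 4. The isb before and after the mrs is an
 * assumption: it is commonly added so the counter read is not reordered
 * with the code being timed. Otherwise return nanoseconds, not cycles. */
static inline uint64_t get_cycle(void)
{
#if defined(__aarch64__) && defined(USE_PMCCNTR)
    uint64_t r;
    __asm__ volatile("isb\n\tmrs %0, pmccntr_el0\n\tisb" : "=r"(r) : : "memory");
    return r;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Copy `size` bytes from src to dst `runs` times and return the smallest
 * counter delta, which reduces the effect of cold caches on the first run. */
uint64_t measure_memcpy(void *dst, const void *src, size_t size, int runs)
{
    assert(dst && src && size > 0 && runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = get_cycle();
        memcpy(dst, src, size);
        /* Compiler barrier so the copy is not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
        uint64_t end = get_cycle();
        if (end - start < best)
            best = end - start;
    }
    return best;
}

/* Sweep buffer sizes from 4KB to 64MB in powers of two, one result per size.
 * Returns the number of sizes measured, or -1 if allocation fails. */
int sweep(uint64_t *out, int max_out)
{
    const size_t max_size = (size_t)64 << 20;
    unsigned char *a = malloc(max_size);
    unsigned char *b = malloc(max_size);
    if (!a || !b) {
        free(a);
        free(b);
        return -1;
    }
    memset(a, 0, max_size);      /* touch both buffers so pages are mapped */
    memset(b, 0xA5, max_size);
    int n = 0;
    for (size_t size = 4096; size <= max_size && n < max_out; size <<= 1)
        out[n++] = measure_memcpy(a, b, size, 3);
    free(a);
    free(b);
    return n;
}
```

Calling `sweep()` with an array of at least 15 entries gives one value per size from 4KB to 64MB. Running the same loop at both exception levels, with the same barriers and the same number of runs, should at least rule out measurement noise as the cause of the gap.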