I evaluated the performance overhead of memcpy (memory copy) in both NS.EL1 (a Linux kernel module) and EL3 (Arm Trusted Firmware, BL31):
1. I allocated two physically contiguous memory buffers, A and B, via Linux's CMA allocator (specifically, via cma_alloc()). In EL1 I obtain the physical addresses of A and B with A.phys_addr = virt_to_phys(A.virt_addr) and B.phys_addr = virt_to_phys(B.virt_addr); a sketch of this step follows.
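Roughly, the allocation step looks like this (a minimal sketch, not my exact code; it assumes cma_alloc() is reachable from the module and that cma_area is a valid struct cma handle, e.g. obtained via dev_get_cma_area() on a suitable device):

#include <linux/cma.h>
#include <linux/io.h>
#include <linux/mm.h>

struct buf {
    struct page *pages;
    void        *virt_addr;
    phys_addr_t  phys_addr;
};

static int alloc_buf(struct cma *cma_area, struct buf *b, size_t size)
{
    /* Allocate size bytes of physically contiguous memory from CMA. */
    b->pages = cma_alloc(cma_area, size >> PAGE_SHIFT, 0 /* align order */, false);
    if (!b->pages)
        return -ENOMEM;
    b->virt_addr = page_address(b->pages);      /* linear-map (lowmem) VA */
    b->phys_addr = virt_to_phys(b->virt_addr);  /* same as page_to_phys(b->pages) */
    return 0;
}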
2. In the NS.EL1 kernel module, I copy directly with memcpy(A.virt_addr, B.virt_addr, buf_size), i.e. passing the two buffers' virtual addresses (buf_size is the buffer size in bytes).
3. In EL3 (BL31), I register an SMC handler and copy with memcpy(A.phys_addr, B.phys_addr, buf_size), passing the two buffers' physical addresses. Note that I initialized EL3's page tables with a flat (identity) mapping of the buffer region during bl31_setup, so dereferencing the physical addresses directly triggers no translation fault. A sketch of the handler follows.
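The EL3 side looks roughly like this (a minimal sketch against TF-A's runtime-service framework and xlat tables library; the service name, the OEM function range, and BUF_BASE/BUF_SIZE are my placeholders, not values from my actual code):

#include <common/runtime_svc.h>
#include <lib/xlat_tables/xlat_tables_v2.h>
#include <string.h>

/* In bl31_setup: identity-map the buffer region (flat mapping) before
 * enabling the EL3 MMU, so the PAs can be dereferenced directly. */
mmap_add_region(BUF_BASE, BUF_BASE, BUF_SIZE, MT_MEMORY | MT_RW | MT_NS);

/* Runtime-service handler: x1 = dest PA, x2 = src PA, x3 = size. */
static uintptr_t memcpy_smc_handler(uint32_t smc_fid,
                                    u_register_t x1, u_register_t x2,
                                    u_register_t x3, u_register_t x4,
                                    void *cookie, void *handle,
                                    u_register_t flags)
{
    memcpy((void *)x1, (void *)x2, (size_t)x3);
    SMC_RET1(handle, 0);
}

static int32_t memcpy_svc_setup(void)
{
    return 0;
}

DECLARE_RT_SVC(memcpy_svc, OEN_OEM_START, OEN_OEM_END,
               SMC_TYPE_FAST, memcpy_svc_setup, memcpy_smc_handler);

On the Linux side the call is issued with arm_smccc_smc(fid, A.phys_addr, B.phys_addr, buf_size, 0, 0, 0, 0, &res), where fid is whatever fast-call function ID the service claims.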
4. To measure the overhead, I implemented a function get_cycle() that reads the PMU cycle counter register PMCCNTR_EL0: asm volatile("mrs %0, pmccntr_el0" : "=r" (r));
So the measurement looks like this:

start = get_cycle();
memcpy(A, B, ...);
end = get_cycle();
cycle = end - start;
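For completeness, get_cycle() is essentially the following (a minimal sketch; the isb barriers and the one-time counter enable via PMCR_EL0/PMCNTENSET_EL0 are my additions, on the assumption that the cycle counter must be explicitly enabled before use):

#include <stdint.h> /* in the kernel, <linux/types.h> and u64 instead */

/* Read the PMU cycle counter; the isb instructions keep the read from
 * being reordered around the code being measured. */
static inline uint64_t get_cycle(void)
{
    uint64_t r;
    asm volatile("isb\n\tmrs %0, pmccntr_el0\n\tisb" : "=r" (r) :: "memory");
    return r;
}

/* One-time setup: set PMCR_EL0.E (global enable) and PMCNTENSET_EL0.C
 * (cycle counter enable); works at any EL where the PMU is accessible. */
static inline void enable_cycle_counter(void)
{
    uint64_t pmcr;
    asm volatile("mrs %0, pmcr_el0" : "=r" (pmcr));
    asm volatile("msr pmcr_el0, %0" :: "r" (pmcr | 1UL));      /* E bit */
    asm volatile("msr pmcntenset_el0, %0" :: "r" (1UL << 31)); /* C bit */
    asm volatile("isb");
}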
NOTE: in both EL1 and EL3, the cycle counts are taken immediately before and after the memcpy.
I performed the evaluation on a Juno R2 board, with buffer sizes ranging from 4 KB to 64 MB. Only one CPU core was enabled during the evaluation. Here is a short summary of the results:
Counterintuitively, I found that memcpy() in EL3 is about 10x slower than in the NS.EL1 kernel module. Are there any possible explanations? Is this due to different cache and data-coherence behavior in EL1 and EL3?