I evaluated the performance overhead of memcpy() (memory copy) in both NS.EL1 (a Linux kernel module) and EL3 (Arm Trusted Firmware BL31):
1. I allocated two contiguous physical memory buffers, A and B, via Linux's CMA allocator (specifically, via cma_alloc()). In EL1 I obtained their physical addresses: A.phys_addr = virt_to_phys(A.virt_addr) and B.phys_addr = virt_to_phys(B.virt_addr) (see the first sketch after the NOTE below).
2. In the NS.EL1 kernel module, I directly call memcpy(A.virt_addr, B.virt_addr, buf_size), i.e. I pass the two buffers' virtual addresses.
3. In EL3 (BL31), I register an SMC handler and call memcpy(A.phys_addr, B.phys_addr, buf_size), i.e. I pass the two buffers' physical addresses. Note that I set up a flat (identity) mapping for this region in EL3's page tables during bl31_setup, so dereferencing the physical addresses directly does not trigger any translation fault (see the second sketch after the NOTE below).
4. To measure the overhead, I implemented a function get_cycle() that reads the PMU cycle counter register PMCCNTR_EL0: asm volatile("mrs %0, pmccntr_el0" : "=r" (r)); (see the third sketch after the NOTE below).
The cycle measurement then looks like this:
start = get_cycle();
memcpy(A, B, ...);
end = get_cycle();
cycles = end - start;
NOTE: in both EL1 and EL3, the cycle counter is read immediately before and after the memcpy().
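
For reference, here is a minimal sketch of the EL1 side (steps 1 and 2). It is not my exact module code: the cma_alloc() argument list differs between kernel versions, and dev_get_cma_area(), BUF_SIZE and the variable names are only for illustration.

#include <linux/cma.h>
#include <linux/dma-map-ops.h>    /* dev_get_cma_area(); older kernels: <linux/dma-contiguous.h> */
#include <linux/mm.h>             /* page_address(), PAGE_SHIFT */
#include <linux/io.h>             /* virt_to_phys() */
#include <linux/errno.h>

#define BUF_SIZE (64UL << 20)                       /* buffer size, varied from 4KB to 64MB */

static struct page *pages_a, *pages_b;
static void *virt_a, *virt_b;
static phys_addr_t phys_a, phys_b;

static int alloc_bufs(void)
{
    struct cma *cma = dev_get_cma_area(NULL);       /* default CMA area */
    size_t nr_pages = BUF_SIZE >> PAGE_SHIFT;

    pages_a = cma_alloc(cma, nr_pages, 0, false);   /* signature varies across kernel versions */
    pages_b = cma_alloc(cma, nr_pages, 0, false);
    if (!pages_a || !pages_b)
        return -ENOMEM;

    virt_a = page_address(pages_a);                 /* linear-map virtual addresses used in EL1 */
    virt_b = page_address(pages_b);
    phys_a = virt_to_phys(virt_a);                  /* physical addresses passed down to EL3 */
    phys_b = virt_to_phys(virt_b);
    return 0;
}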
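
And a corresponding sketch of the EL3 side (step 3), using TF-A's xlat-tables and runtime-service frameworks. The region base/size, the SiP function ID, the memory attributes and the include paths (which changed across TF-A releases) are assumptions for illustration, not my exact code.

#include <string.h>
#include <common/runtime_svc.h>                 /* DECLARE_RT_SVC, SMC_RET1; path differs in older TF-A */
#include <lib/xlat_tables/xlat_tables_v2.h>     /* mmap_add_region(), MT_* attributes */

/* Done once during bl31_setup, before the MMU is enabled: flat-map (VA == PA) the
 * CMA region so the physical addresses passed from EL1 are directly usable at EL3.
 * Base and size below are placeholders.
 *
 *   mmap_add_region(0x880000000ULL, 0x880000000ULL, 128U << 20,
 *                   MT_MEMORY | MT_RW | MT_NS);
 */

#define SMC_FID_MEMCPY_TEST  0xC2000010U        /* placeholder fast SiP function ID */

static uintptr_t memcpy_test_handler(uint32_t smc_fid,
                                     u_register_t x1,   /* destination physical address */
                                     u_register_t x2,   /* source physical address */
                                     u_register_t x3,   /* size in bytes */
                                     u_register_t x4,
                                     void *cookie, void *handle,
                                     u_register_t flags)
{
    uint64_t start, end;

    if (smc_fid != SMC_FID_MEMCPY_TEST)
        SMC_RET1(handle, SMC_UNK);

    /* NOTE: whether the cycle counter counts in Secure state can additionally
     * depend on MDCR_EL3/PMCR_EL0 configuration. */
    asm volatile("mrs %0, pmccntr_el0" : "=r"(start));
    memcpy((void *)x1, (void *)x2, (size_t)x3);
    asm volatile("mrs %0, pmccntr_el0" : "=r"(end));

    SMC_RET1(handle, end - start);              /* return the measured cycles to EL1 */
}

DECLARE_RT_SVC(memcpy_test, OEN_SIP_START, OEN_SIP_END,
               SMC_TYPE_FAST, NULL, memcpy_test_handler);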
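
Finally, get_cycle() and the one-time counter enable look roughly like this (step 4). This is the generic PMUv3 sequence rather than my exact code; whether PMCCNTR_EL0 is accessible and actually counts at a given exception level can also depend on MDCR_EL2/MDCR_EL3 settings.

#include <stdint.h>

/* One-time setup: enable the PMU cycle counter before measuring. */
static inline void pmu_enable_cycle_counter(void)
{
    uint64_t v;

    asm volatile("mrs %0, pmcr_el0" : "=r"(v));
    asm volatile("msr pmcr_el0, %0" :: "r"(v | 1));            /* PMCR_EL0.E: enable counters */
    asm volatile("msr pmcntenset_el0, %0" :: "r"(1UL << 31));  /* bit 31: enable PMCCNTR_EL0 */
    asm volatile("isb");
}

static inline uint64_t get_cycle(void)
{
    uint64_t r;

    asm volatile("isb");                          /* order the read w.r.t. preceding instructions */
    asm volatile("mrs %0, pmccntr_el0" : "=r"(r));
    return r;
}

The same start/end pattern from the list above is then used at both EL1 and EL3, as described in the NOTE.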
I performed the evaluation on a Juno R2 board, varying the buffer size from 4KB to 64MB. During the evaluation, only one CPU core was enabled. A short summary of the results:
Counterintuitively, I find that memcpy() in EL3 is 10x slower than in the NS.EL1 kernel module. Are there any possible explanations? Is this due to different cache & data coherence models in EL1 and EL3?