Why is the overhead of memcpy() in EL3 higher than in NS.EL1 (Linux kernel module)?

I measured the cost of memcpy (memory copy) in both NS.EL1 (a kernel module) and EL3 (Arm Trusted Firmware):

1. I allocated two physically contiguous memory buffers, A and B, via Linux's CMA allocator (specifically, via cma_alloc()). In EL1, I obtain the physical addresses of A and B: A.phys_addr = virt_to_phys(A.virt_addr), B.phys_addr = virt_to_phys(B.virt_addr).

2. In the NS.EL1 kernel module, I call memcpy(A.virt_addr, B.virt_addr, sizeof(A)) directly, passing the two buffers' virtual addresses.

3. In EL3 BL31, I register an SMC handler and call memcpy(A.phys_addr, B.phys_addr, sizeof(A)), passing the two buffers' physical addresses. Note that I initialize the EL3 page table (a flat region map) during bl31_setup, so passing the physical addresses directly does not trigger a page fault.

4. To measure the overhead, I implemented a function get_cycle() that reads the PMU cycle counter register PMCCNTR_EL0: asm volatile("mrs %0, pmccntr_el0" : "=r" (r));

The cycle measurement then looks like this:

    start = get_cycle();
    memcpy(A, B, ...);
    end = get_cycle();
    cycle = end - start;

NOTE: in both EL1 and EL3, the counter is read immediately before and immediately after the memcpy.
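For reference, the measurement described above can be sketched as a small, hosted-C harness. This is only an illustration, not the exact code I ran: USE_PMCCNTR is a hypothetical build flag, time_memcpy() is a name made up for this sketch, and the clock() fallback (which returns clock ticks, not CPU cycles) exists only so the sketch compiles where PMCCNTR_EL0 is not accessible. In a kernel module or in BL31 the hosted headers and the fallback branch would not apply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Read a monotonically increasing counter.
 * USE_PMCCNTR is a hypothetical flag: define it only where PMCCNTR_EL0
 * is enabled and readable at the current exception level. */
static inline uint64_t get_cycle(void)
{
#if defined(__aarch64__) && defined(USE_PMCCNTR)
    uint64_t r;
    /* isb stops the counter read from being executed ahead of earlier instructions */
    __asm__ volatile("isb\n\tmrs %0, pmccntr_el0" : "=r"(r) : : "memory");
    return r;
#else
    /* Fallback for other builds: processor-time ticks, not CPU cycles */
    return (uint64_t)clock();
#endif
}

/* Time a single memcpy of len bytes and return the elapsed counter ticks. */
static uint64_t time_memcpy(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    uint64_t start = get_cycle();
    memcpy(dst, src, len);
    uint64_t end = get_cycle();
    return end - start;
}
```

Calling time_memcpy(A, B, len) once per buffer size gives one sample per size. Repeating each size several times and reporting the minimum or median would make cold-cache effects on the first copy easier to spot.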


I performed the evaluation on a Juno R2 board with buffer sizes from 4KB to 64MB. Only one CPU core was enabled during the evaluation. Here is a short summary of the results:

Memcpy size | Time in EL1 kernel module | Time in EL3 BL31 (Arm Trusted Firmware)
4KB | 1,324 cycles / 0.0015 ms | 20,785 cycles / 0.02 ms
64KB | 22,412 cycles / 0.026 ms | 328,951 cycles / 0.39 ms
1MB | 549,383 cycles / 0.66 ms | 5,446,983 cycles / 6.5 ms
64MB | 38,262,713 cycles / 45.91 ms | 348,783,503 cycles / 418.5 ms
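To make the gap easier to compare across sizes, here is a small sketch that turns the cycle counts above into a slowdown factor and a bytes-per-cycle figure. The function and array names are made up for this illustration; the numbers are copied from the results above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cycle counts for 4KB, 64KB, 1MB and 64MB, copied from the results above. */
static const uint64_t el1_cycles[] = {1324, 22412, 549383, 38262713};
static const uint64_t el3_cycles[] = {20785, 328951, 5446983, 348783503};

/* How many times slower the EL3 copy is than the EL1 copy. */
static double slowdown(uint64_t el1, uint64_t el3)
{
    assert(el1 > 0);
    return (double)el3 / (double)el1;
}

/* Copy throughput in bytes per counted cycle. */
static double bytes_per_cycle(size_t bytes, uint64_t cycles)
{
    assert(cycles > 0);
    return (double)bytes / (double)cycles;
}
```

With these numbers the slowdown works out to roughly 15.7x at 4KB, 14.7x at 64KB, 9.9x at 1MB and 9.1x at 64MB. For the 4KB case, bytes_per_cycle() gives about 3.1 bytes per cycle in EL1 against about 0.2 in EL3.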

Counterintuitively, I find that memcpy() in EL3 is roughly 10x slower than in the NS.EL1 kernel module (about 9x to 16x across the buffer sizes above). Are there any possible explanations? Is this due to different cache and data coherence models in EL1 and EL3?
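One thing I could check on the EL3 side is whether the MMU and caches are globally enabled when the SMC handler runs. Below is a minimal sketch that decodes the M, C and I bits of an SCTLR_ELx value. The struct and function names are made up for this illustration, and RUNNING_AT_EL3 is a hypothetical build flag guarding the register read, which is only valid at EL3. This covers only the global enable bits; the memory attributes used for the buffers in the flat mapping also affect cacheability and are not checked here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SCTLR_M_BIT (1ull << 0)   /* stage 1 MMU enable */
#define SCTLR_C_BIT (1ull << 2)   /* data cacheability enable */
#define SCTLR_I_BIT (1ull << 12)  /* instruction cacheability enable */

/* Decoded view of the three enable bits (illustrative type). */
struct sctlr_state {
    bool mmu_on;
    bool dcache_on;
    bool icache_on;
};

/* Decode the global enable bits from a raw SCTLR_ELx value. */
static void decode_sctlr(uint64_t sctlr, struct sctlr_state *out)
{
    assert(out != NULL);
    out->mmu_on    = (sctlr & SCTLR_M_BIT) != 0;
    out->dcache_on = (sctlr & SCTLR_C_BIT) != 0;
    out->icache_on = (sctlr & SCTLR_I_BIT) != 0;
}

#if defined(__aarch64__) && defined(RUNNING_AT_EL3) /* hypothetical flag */
/* Read SCTLR_EL3; this instruction is only accessible from EL3. */
static inline uint64_t read_sctlr_el3(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, sctlr_el3" : "=r"(v));
    return v;
}
#endif
```

Inside the SMC handler, decode_sctlr(read_sctlr_el3(), &state) would show whether data caching is globally enabled at the point where the memcpy runs.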