Why is the overhead of memcpy() in EL3 higher than in NS.EL1?

I measured the cost of memcpy (memory copy) in both NS.EL1 (a Linux kernel module) and EL3 (Arm Trusted Firmware BL31). The setup was as follows:

1. I allocated two physically contiguous memory buffers, A and B, with Linux's CMA allocator (specifically, cma_alloc()).

2. In NS.EL1's kernel module, I directly use memcpy(A, B, sizeof(A)) for memory copy.

3. In EL3 BL31, I use memcpy(phys_addr(A), phys_addr(B), sizeof(A)) for memory copy. Note that I set up the EL3 page table (a flat region map) during bl31_setup and pass the physical addresses of the two buffers directly to memcpy, so no page fault is triggered.

4. To evaluate the overhead, I read the PMU counter register pmccntr: asm volatile("mrs %0, pmccntr_el0" : "=r" (r));

The cycle measurement looks like this:

start = getCycle();
memcpy(A, B, ...);
end = getCycle();
cycle = end - start;
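
For reference, the measurement pattern above can be sketched as a small harness. This is only a sketch under assumptions, not the exact code I ran: get_cycle() and time_memcpy() are illustrative names, the warm-up and best-of-N logic is an addition, and the PMCCNTR_EL0 path is only compiled when USE_PMCCNTR is defined (it requires the cycle counter to be enabled and readable at the current exception level). Without that define it falls back to clock_gettime() so it can be tried on any host. The inline assembly assumes GCC or Clang.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Read a monotonically increasing counter.
 * With USE_PMCCNTR on AArch64 this reads PMCCNTR_EL0 as in the question;
 * the ISB keeps the counter read from being reordered around the code
 * being measured. Otherwise it returns nanoseconds from clock_gettime(),
 * which is a stand-in and NOT a cycle count. */
static inline uint64_t get_cycle(void)
{
#if defined(USE_PMCCNTR) && defined(__aarch64__)
    uint64_t r;
    __asm__ volatile("isb\n\tmrs %0, pmccntr_el0" : "=r"(r) : : "memory");
    return r;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Time memcpy(dst, src, len): run `warmup` untimed iterations first,
 * then return the smallest tick count seen over `reps` timed iterations. */
static uint64_t time_memcpy(void *dst, const void *src, size_t len,
                            int warmup, int reps)
{
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < warmup + reps; i++) {
        uint64_t start = get_cycle();
        memcpy(dst, src, len);
        /* Compiler barrier so the copy is not optimized away or moved. */
        __asm__ volatile("" : : "r"(dst) : "memory");
        uint64_t end = get_cycle();

        if (i >= warmup && end - start < best)
            best = end - start;
    }
    return best;
}
```

Calling time_memcpy(A, B, size, 2, 5) at each exception level would give a best-of-five figure after two warm-up copies, which helps separate cold-start effects from the steady-state copy cost.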

I performed the evaluation on a Juno R2 board with buffer sizes from 4KB to 64MB. Only one CPU core was enabled during the evaluation. Here is a short summary of the results:

Memcpy size | Time in EL1 kernel module     | Time in EL3 BL31 (Arm Trusted Firmware)
4KB         | 1,324 cycles / 0.0015 ms      | 20,785 cycles / 0.02 ms
64KB        | 22,412 cycles / 0.026 ms      | 328,951 cycles / 0.39 ms
1MB         | 549,383 cycles / 0.66 ms      | 5,446,983 cycles / 6.5 ms
64MB        | 38,262,713 cycles / 45.91 ms  | 348,783,503 cycles / 418.5 ms
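
To make the gap easier to compare across sizes, the cycle counts above can be turned into a slowdown factor and a throughput figure. The following is a minimal sketch that only does arithmetic on the reported numbers; the struct and function names are made up for illustration, and sizes are taken as powers of two (4KB = 4096 bytes).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One row of the results table: buffer size and measured cycle counts. */
struct sample {
    const char *label;
    uint64_t bytes;
    uint64_t el1_cycles;
    uint64_t el3_cycles;
};

static const struct sample samples[] = {
    { "4KB",  4096ull,             1324ull,     20785ull },
    { "64KB", 64ull * 1024,        22412ull,    328951ull },
    { "1MB",  1024ull * 1024,      549383ull,   5446983ull },
    { "64MB", 64ull * 1024 * 1024, 38262713ull, 348783503ull },
};

/* How many times slower the EL3 copy is than the EL1 copy. */
static double slowdown(const struct sample *s)
{
    return (double)s->el3_cycles / (double)s->el1_cycles;
}

/* Average copy throughput in bytes per cycle. */
static double bytes_per_cycle(uint64_t bytes, uint64_t cycles)
{
    return (double)bytes / (double)cycles;
}

/* Print one line per buffer size. */
static void print_summary(void)
{
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        const struct sample *s = &samples[i];
        printf("%-5s slowdown %.1fx  EL1 %.2f B/cycle  EL3 %.2f B/cycle\n",
               s->label, slowdown(s),
               bytes_per_cycle(s->bytes, s->el1_cycles),
               bytes_per_cycle(s->bytes, s->el3_cycles));
    }
}
```

Calling print_summary() from main() gives a per-size view. On these numbers the EL3 copy works out to roughly 0.2 bytes per cycle at every size, while the EL1 copy ranges from about 3.1 bytes per cycle at 4KB down to about 1.75 at 64MB.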

Counterintuitively, memcpy() in EL3 is roughly 10x slower than in the NS.EL1 kernel module (between about 9x and 16x, depending on the buffer size). Are there any possible explanations? Could this be due to different cache and data coherence models in EL1 and EL3?