Cortex-M4 cycle count

I'm trying to copy an array from source to destination (both are zero-wait-state memories). The compiler-generated code is inefficient, so I tried the two methods below. How many cycles will each of them take?

By my count, the 1st one takes 4 cycles to move each word from source to destination, which gives an overall latency of 16 cycles for the four words. The 2nd one takes 10 cycles. Is that consistent with the ARM pipeline? Can you provide the pipeline diagram so that I can analyse this at my end?

1st One

__asm(" LDR R2, [R1, #0]");
__asm(" STR R2, [R0, #0]");
__asm(" LDR R2, [R1, #4]");
__asm(" STR R2, [R0, #4]");
__asm(" LDR R2, [R1, #8]");
__asm(" STR R2, [R0, #8]");
__asm(" LDR R2, [R1, #12]");
__asm(" STR R2, [R0, #12]");

2nd One

__asm(" LDM R1!, {R2-R5}");
__asm(" STM R0!, {R2-R5}");
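
For reference, the arithmetic behind the two figures above can be written out as a small sketch. It assumes a simple cost model: 2 cycles for each LDR or STR, and 1 + N cycles for an LDM or STM of N registers, with zero wait states and no stalls or overlap between neighbouring instructions. The function and constant names are hypothetical and only illustrate the estimate; they are not a measurement.

```c
/* Assumed per-instruction costs (zero wait states, no stalls, no overlap). */
enum {
    LDR_CYCLES = 2,        /* single-word load, assumed 2 cycles  */
    STR_CYCLES = 2,        /* single-word store, assumed 2 cycles */
    MULTI_BASE_CYCLES = 1  /* LDM/STM assumed to cost 1 + N cycles */
};

/* 1st one: one LDR/STR pair per word copied. */
static unsigned cycles_single_pairs(unsigned words)
{
    return words * (LDR_CYCLES + STR_CYCLES);
}

/* 2nd one: one LDM plus one STM, each moving `words` registers. */
static unsigned cycles_multiple(unsigned words)
{
    return 2u * (MULTI_BASE_CYCLES + words);
}
```

Under these assumptions, cycles_single_pairs(4) gives 16 and cycles_multiple(4) gives 10, which matches the counts in the question. Any pipelining of adjacent loads and stores would lower the first figure.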