If I use three buffers of 4096 bytes each, the cycle count increases to around 140 cycles. I have given the code here; r1, r2 and r3 hold the addresses of buffer1, buffer2 and buffer3 respectively. However, if the buffer size is not a multiple of 4096, the cycle count is normal.
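To make the address relationship concrete, here is a minimal sketch of the arithmetic. It assumes the three buffers are laid out back to back in memory, and the base address, the 4 KiB granule and the helper names `low_bits` and `all_alias` are made up for illustration. It only shows that with a 4096-byte size all three buffers start at the same offset within a 4 KiB region, while another size gives different offsets; whether that is what causes the extra cycles is an assumption I have not verified.

```python
PAGE_SIZE = 4096  # assumed granule (4 KiB); the relevant value depends on the CPU


def low_bits(base, buf_size, count=3, granule=PAGE_SIZE):
    """Start address of each back-to-back buffer, modulo `granule`."""
    return [(base + i * buf_size) % granule for i in range(count)]


def all_alias(base, buf_size, count=3, granule=PAGE_SIZE):
    """True when every buffer starts at the same offset within `granule`."""
    return len(set(low_bits(base, buf_size, count, granule))) == 1


if __name__ == "__main__":
    base = 0x20000100  # hypothetical address of buffer1 (what r1 would hold)
    for size in (4096, 4000):
        offsets = [hex(x) for x in low_bits(base, size)]
        print(size, offsets, all_alias(base, size))
```

If the real addresses in r1, r2 and r3 differ by exact multiples of 4096, the same check applies to them directly.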