Good morning, I'm studying ARM assembly for the Cortex-A series. While reading the ARM documentation I found this paper (Cortex-A8, fast memcpy examples). My attention went to the PLD instruction, which preloads data into the cache. I have read about it in the ARM manuals, but I still don't understand why the offset is chosen this way:
    WordCopyPLD
          PLD   [r1, #0x100]
          MOV   r12, #16
    WordCopyPLD1
          LDR   r3, [r1], #4
          STR   r3, [r0], #4
          SUBS  r12, r12, #1
          BNE   WordCopyPLD1
          SUBS  r2, r2, #0x40
          BNE   WordCopyPLD
Why is the offset in this case 256 bytes (0x100) ahead? If I read a word from the memory pointed to by R1 16 times, I would expect the distance ahead to be 4*16 = 64 bytes. Why 256?
The same question with this example:
    NEONCopyPLD
          PLD   [r1, #0xC0]
          VLDM  r1!, {d0-d7}
          VSTM  r0!, {d0-d7}
          SUBS  r2, r2, #0x40
          BGE   NEONCopyPLD
Why 192 bytes (0xC0) ahead, if each Dn register is 8 bytes and I load 8 registers per iteration?
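To make my arithmetic explicit, here is a small Python sketch of how I am counting. It only compares the bytes consumed per loop iteration with the PLD offsets from the two listings; the function and variable names are my own and do not come from the ARM document.

```python
# Sketch: compare bytes copied per loop iteration with the PLD offset.
# All names below are illustrative, not taken from the ARM document.

def bytes_per_iteration(transfers: int, bytes_per_transfer: int) -> int:
    """Bytes consumed from the source by one pass of the outer loop."""
    return transfers * bytes_per_transfer

def iterations_ahead(pld_offset: int, per_iteration: int) -> float:
    """How many outer-loop iterations ahead of r1 the PLD address lies."""
    return pld_offset / per_iteration

word_loop = bytes_per_iteration(16, 4)  # 16 LDRs of one 4-byte word each
neon_loop = bytes_per_iteration(8, 8)   # one VLDM of d0-d7, 8 bytes each

print(word_loop, 0x100, iterations_ahead(0x100, word_loop))
print(neon_loop, 0xC0, iterations_ahead(0xC0, neon_loop))
```

So both loops consume 64 bytes per iteration, yet the PLD addresses are several iterations ahead of the current read pointer rather than exactly one, and that is the part I don't understand.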
Thank you for any answer.