Good morning, I'm studying ARM assembly for the Cortex-A series. While reading the ARM documentation I found this paper (Cortex-A8, fast memcpy examples). My attention went to the PLD instruction, which preloads data into the cache. I have read about it in the ARM manuals, but I still don't understand why the offset is chosen this way:
WordCopyPLD
    PLD [r1, #0x100]
    MOV r12, #16
WordCopyPLD1
    LDR r3, [r1], #4
    STR r3, [r0], #4
    SUBS r12, r12, #1
    BNE WordCopyPLD1
    SUBS r2, r2, #0x40
    BNE WordCopyPLD
Why is the offset in this case 256 bytes (0x100) ahead? If I read a word from the memory pointed to by r1 16 times, I would expect to be only 4*16 = 64 bytes ahead. Why 256?
The same question with this example:
NEONCopyPLD
    PLD [r1, #0xC0]
    VLDM r1!,{d0-d7}
    VSTM r0!,{d0-d7}
    SUBS r2,r2,#0x40
    BGE NEONCopyPLD
Why 192 bytes ahead, if each Dn register is 8 bytes and I load 8 registers per iteration?
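To make the arithmetic in both examples explicit, here is a small C sketch. The function names are made up for illustration, and the 64-byte cache line size is an assumption based on the Cortex-A8. It only shows that both loops consume 64 bytes per pass, while 0x100 (256 bytes) and 0xC0 (192 bytes) point four and three lines ahead of r1. It does not explain why those distances were chosen.

```c
/* Assumption: 64-byte cache lines, as on the Cortex-A8. */
#define CACHE_LINE_BYTES 64u

/* Bytes consumed by one pass of the outer loop:
 * number of transfers times the size of each transfer. */
unsigned bytes_per_iteration(unsigned transfers, unsigned bytes_each)
{
    return transfers * bytes_each;
}

/* Number of whole cache lines between r1 and the PLD target address. */
unsigned lines_ahead(unsigned pld_offset)
{
    return pld_offset / CACHE_LINE_BYTES;
}
```

For example, bytes_per_iteration(16, 4) and bytes_per_iteration(8, 8) both give 64, while lines_ahead(0x100) gives 4 and lines_ahead(0xC0) gives 3.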
Thank you for any answer.
Hi, I am speculating. 1st case: when the cache fills the current line (triggered by the LDR), it also does a speculative access to the next cache line (64 bytes), so the PLD tells it to do the same for the line after next.
2nd case: I have no idea.
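As a rough illustration of the general idea of prefetching ahead of the loads, here is a C sketch of the first loop. It assumes a GCC or Clang compiler (for __builtin_prefetch); the function name and constants are hypothetical, and the 256-byte distance is simply copied from the PLD offset in the question. The prefetch is only a hint, so the copy is correct with or without it.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES    64u   /* bytes copied per outer pass, as in the examples */
#define PREFETCH_AHEAD 256u  /* 0x100, the offset used in WordCopyPLD */

/* Copies n bytes in 64-byte blocks; any remainder below 64 bytes is ignored.
 * Sketch only, not the routine from the ARM paper. */
void word_copy_prefetch(uint32_t *dst, const uint32_t *src, size_t n)
{
    while (n >= BLOCK_BYTES) {
        /* Hint: request the data 256 bytes past the current read position.
         * Integer arithmetic is used so that no out-of-range pointer is
         * formed near the end of the buffer. */
        __builtin_prefetch((const void *)((uintptr_t)src + PREFETCH_AHEAD));

        /* Inner loop: 16 word transfers = 64 bytes. */
        for (int i = 0; i < 16; i++)
            *dst++ = *src++;

        n -= BLOCK_BYTES;
    }
}
```

Changing PREFETCH_AHEAD changes only how far ahead the hint points, which is why the distance can be tuned independently of the 64 bytes copied per pass.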