Hello,
I have a question about whether the following sequence of instructions involving post-indexed LDRs could be reordered on, say, a Cortex-A8.
To simplify, let's assume r0 = 0xC and a cache line size of 16 bytes:
ldr r1, [r0], #4 /* 1 */
ldr r2, [r0], #4 /* 2 */
ldr r3, [r0], #4 /* 3 */
ldr r4, [r0], #4 /* 4 */
/* At this point, r0 = 0x1C */
Now, will the above instructions always be executed in the order 1-2-3-4 (because r0 is updated by each one), or is there a chance that they could execute as 2-3-4-1, etc.?
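For reference, the addresses involved can be worked out with a small C sketch. It only models the address arithmetic under the assumptions stated above (r0 = 0xC, 16-byte lines). The function names and LINE_SIZE are made up for illustration, and it says nothing about how the core actually schedules the loads.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 16u  /* cache line size assumed in the question */

/* Address accessed by the n-th post-indexed load (n = 0..3): every
   earlier load has already advanced the base register by 4 bytes. */
uint32_t load_address(uint32_t base, unsigned n)
{
    return base + 4u * n;
}

/* Index of the cache line that contains addr. */
uint32_t cache_line(uint32_t addr)
{
    return addr / LINE_SIZE;
}

/* Value of the base register after `count` post-indexed word loads. */
uint32_t final_base(uint32_t base, unsigned count)
{
    return base + 4u * count;
}
```

With a base of 0xC, load (1) reads 0xC in line 0, while loads (2) to (4) read 0x10, 0x14 and 0x18, all in line 1, and the base ends at 0x1C.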
Thanks.
It depends on why you have an abort. If the abort is due to a fault from the MMU, it will always be synchronous. So if load (1) faults in the MMU, DFAR will contain the address accessed by (1), and the preferred return address pointed to by LR_abt will be instruction (1) (i.e. the address of instruction (1) plus whatever offset the ARM Architecture Reference Manual requires). Note that the processor might already have executed the other instructions if the memory is Normal memory; in effect, it speculates these loads.
If the abort has come from the external system, it depends on the processor implementation whether this is taken synchronously or asynchronously. My recollection is that Cortex-A8 would take it asynchronously, but this should be confirmed from the Technical Reference Manual. (It might depend on the memory type.) If it is asynchronous then DFAR doesn't contain a valid address, and LR_abt will point to whatever instruction was interrupted to take the asynchronous Abort.
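One way to tell the two cases apart in a data abort handler is to decode the fault status from DFSR. The C sketch below assumes the ARMv7-A short-descriptor format, where the status field is DFSR[10] followed by DFSR[3:0]. The function names are hypothetical, only the two external-abort encodings are covered, and the values should be checked against the Technical Reference Manual for the processor in question. Reading DFSR itself (a CP15 access) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fault status encodings, ARMv7-A short-descriptor format (assumed). */
enum {
    FS_SYNC_EXTERNAL_ABORT  = 0x08, /* 0b01000 */
    FS_ASYNC_EXTERNAL_ABORT = 0x16  /* 0b10110 */
};

/* Build FS[4:0] from DFSR: bit 10 becomes FS[4], bits 3:0 are FS[3:0]. */
uint32_t dfsr_fault_status(uint32_t dfsr)
{
    return ((dfsr >> 6) & 0x10u) | (dfsr & 0x0Fu);
}

/* True if the abort was a synchronous external abort (DFAR is valid). */
bool is_sync_external_abort(uint32_t dfsr)
{
    return dfsr_fault_status(dfsr) == FS_SYNC_EXTERNAL_ABORT;
}

/* True if the abort was an asynchronous external abort; in that case
   DFAR does not hold a meaningful address. */
bool is_async_external_abort(uint32_t dfsr)
{
    return dfsr_fault_status(dfsr) == FS_ASYNC_EXTERNAL_ABORT;
}
```

A handler would typically check for the asynchronous case first and only look at DFAR and the faulting instruction when the abort is synchronous.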
On the original instruction sequence, you are correct that there is a register dependency between these instructions that might hinder out-of-order issue. Rewriting it with immediate-offset loads (no base register writeback) followed by a single add might give better performance on some processors:
ldr r1, [r0] /* 1 */
ldr r2, [r0, #4] /* 2 */
ldr r3, [r0, #8] /* 3 */
ldr r4, [r0, #12] /* 4 */
add r0, r0, #16
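But as with any optimization, you should benchmark this. On simpler processors, the additional add instruction will make it slower; you might be able to fold the add into one of the loads, for example by making load (1) post-indexed and issuing it after the others: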
ldr r1, [r0], #16 /* 1 */
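Also, for this example, ldrd or ldmia would be a better option on some processors. Either the ldrd pair or the single ldmia below replaces all four loads. (As written, with odd-numbered first registers, the ldrd lines assemble only as Thumb-2; in ARM state ldrd requires an even-numbered first register.)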
ldrd r3, r4, [r0, #8] /* 3, 4 */
ldrd r1, r2, [r0], #16 /* 1, 2 */
ldmia r0!, {r1-r4}
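As a sanity check that these rewrites leave the registers in the same state as the original sequence, here is a minimal C model. It is only a sketch: regs_t, toy_mem and the function names are invented for illustration, and it captures the architectural end result only, not timing, issue order or aborts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t r[5]; } regs_t;            /* models r0..r4 */
typedef void (*variant_fn)(regs_t *, const uint32_t *);

/* Read one word from a toy memory; addr is a word-aligned byte address. */
static uint32_t load_word(const uint32_t *mem, uint32_t addr)
{
    return mem[addr / 4u];
}

/* Original: four times "ldr rN, [r0], #4". */
void post_indexed(regs_t *s, const uint32_t *mem)
{
    for (int n = 1; n <= 4; n++) {
        s->r[n] = load_word(mem, s->r[0]);
        s->r[0] += 4u;
    }
}

/* Offset loads plus "add r0, r0, #16"
   (ldmia r0!, {r1-r4} has the same architectural effect). */
void offset_then_add(regs_t *s, const uint32_t *mem)
{
    for (int n = 1; n <= 4; n++)
        s->r[n] = load_word(mem, s->r[0] + 4u * (uint32_t)(n - 1));
    s->r[0] += 16u;
}

/* "ldrd r3, r4, [r0, #8]" then "ldrd r1, r2, [r0], #16". */
void ldrd_pair(regs_t *s, const uint32_t *mem)
{
    s->r[3] = load_word(mem, s->r[0] + 8u);
    s->r[4] = load_word(mem, s->r[0] + 12u);
    s->r[1] = load_word(mem, s->r[0]);
    s->r[2] = load_word(mem, s->r[0] + 4u);
    s->r[0] += 16u;
}

/* Run one variant from the given base and return register `reg` (0..4). */
uint32_t reg_after(variant_fn variant, uint32_t base, int reg)
{
    /* Hypothetical memory contents: word i holds 0xA0 + i. */
    static const uint32_t toy_mem[8] =
        { 0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7 };
    regs_t s;

    assert(base % 4u == 0u && base + 16u <= sizeof toy_mem);
    assert(reg >= 0 && reg <= 4);
    memset(&s, 0, sizeof s);
    s.r[0] = base;
    variant(&s, toy_mem);
    return s.r[reg];
}
```

Starting from a base of 0xC, all three variants end with r0 = 0x1C and r1 to r4 holding the words at 0xC, 0x10, 0x14 and 0x18. Which one is fastest still has to be measured on the target.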
That's enough from me!