Reordering between multiple loads

Hello,

I would like to know whether the following sequence of post-indexed LDR instructions could be reordered on, say, a Cortex-A8:

To simplify, let's assume r0 = 0xC and a cache line size of 16 bytes.

ldr     r1, [r0], #4     /* 1 */

ldr     r2, [r0], #4     /* 2 */

ldr     r3, [r0], #4     /* 3 */

ldr     r4, [r0], #4     /* 4 */

/* At this point, r0 = 0x1C */

Will these instructions always be executed in the order 1-2-3-4 (because r0 is updated by each one), or is there a chance that they could execute as, say, 2-3-4-1?

Thanks.

Parents
  • Thanks. So in this case, if the memory access for instruction 1 aborts while the instruction currently being executed is 4 (in the example above), is it possible that it would still be a (kind of) imprecise abort?

    - That is, the (out-of-order, Normal) memory access for instruction 1 aborts

    - But we are at instruction 4

    - Is DFAR guaranteed to hold the address accessed by instruction 1?

    - Is LR (in ABT mode) guaranteed to always hold the PC value corresponding to instruction 1?

    Thanks.

Children
  • It would depend on what kind of an abort it was.

    For MMU based faults (translation fault, permission faults, access flag faults...) these are synchronous with the instruction that caused them.  And as mentioned, instructions must appear to be executed in order.

    For the instruction sequence you gave, imagine that the starting value of r0 was 0x3FFFFFF8.  That is, the first two instructions access one page (page A) and the next two access the following page (page B).

    Let's say page A is marked as Fault and page B as Normal.  The first LDR will trigger a synchronous fault.  The processor _might_ have speculatively already performed the two loads from page B, but when we take the exception the state will be consistent with none of the later instructions having executed.   
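    To make the page split concrete, here is a small host-side C sketch of the address arithmetic. It is only an illustration: the 4KB page size and the helper names (page_number, load_address, print_load_pages) are my assumptions, not something stated in the thread.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u /* assumption: 4KB small pages */

/* Page number that contains a given address. */
static uint32_t page_number(uint32_t addr) { return addr >> PAGE_SHIFT; }

/* Address used by the n-th load (n = 0..3) of the post-indexed sequence:
 * each "ldr rX, [r0], #4" reads from r0 and then advances r0 by 4. */
static uint32_t load_address(uint32_t base, unsigned n) { return base + 4u * n; }

/* Print which page each of the four loads touches. */
static void print_load_pages(uint32_t base) {
    for (unsigned n = 0; n < 4; n++) {
        uint32_t addr = load_address(base, n);
        printf("load %u: address 0x%08lX, page 0x%05lX\n", n + 1,
               (unsigned long)addr, (unsigned long)page_number(addr));
    }
}
```

    Calling print_load_pages(0x3FFFFFF8) shows loads 1 and 2 in one page and loads 3 and 4 in the next, which is the page A / page B split described above.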

  • Thanks a lot, that clarifies!

  • It depends on why you have an abort. If the abort is due to a fault from the MMU, it will always be synchronous. So if load (1) faults in the MMU, DFAR will contain the address accessed by (1), and the preferred return address derived from LR_abt will be instruction (1) (LR_abt itself holds the address of instruction (1) plus the offset defined by the ARM Architecture Reference Manual, which is 8 for a Data Abort). Note that the processor might already have performed the accesses for the other instructions if the memory is Normal memory; in effect it speculates these loads.

    If the abort has come from the external system, it depends on the processor implementation whether this is taken synchronously or asynchronously. My recollection is that Cortex-A8 would take it asynchronously, but this should be confirmed from the Technical Reference Manual. (It might depend on the memory type.) If it is asynchronous then DFAR doesn't contain a valid address, and LR_abt will point to whatever instruction was interrupted to take the asynchronous Abort.
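    As a rough illustration of how an abort handler might tell the two cases apart, here is a host-side C sketch that decodes a DFSR value and recovers the faulting instruction address from LR_abt. It assumes the ARMv7-A short-descriptor DFSR layout (FS[3:0] in bits [3:0], FS[4] in bit [10]) and the architectural asynchronous fault status codes; the function and macro names are mine, and the exact encodings for Cortex-A8 should be checked against its Technical Reference Manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fault status codes (short-descriptor format, assumed here). */
#define FS_ASYNC_EXTERNAL_ABORT 0x16u /* 0b10110 */
#define FS_ASYNC_PARITY_ERROR   0x18u /* 0b11000 */

/* Assemble the 5-bit fault status: FS[4] is DFSR bit 10, FS[3:0] is bits 3:0. */
static uint32_t dfsr_fault_status(uint32_t dfsr) {
    return ((dfsr >> 6) & 0x10u) | (dfsr & 0xFu);
}

/* True if the DFSR reports an asynchronous (imprecise) abort. */
static bool abort_is_asynchronous(uint32_t dfsr) {
    uint32_t fs = dfsr_fault_status(dfsr);
    return fs == FS_ASYNC_EXTERNAL_ABORT || fs == FS_ASYNC_PARITY_ERROR;
}

/* For a synchronous Data Abort, LR_abt = address of the aborted instruction + 8.
 * For an asynchronous abort this value does not identify the faulting load. */
static uint32_t faulting_instruction(uint32_t lr_abt) {
    return lr_abt - 8u;
}

/* DFAR is only meaningful when the abort was synchronous. */
static bool dfar_is_valid(uint32_t dfsr) {
    return !abort_is_asynchronous(dfsr);
}
```

    On the target, the DFSR and DFAR values would be read with MRC p15, 0, <Rt>, c5, c0, 0 and MRC p15, 0, <Rt>, c6, c0, 0 respectively; the sketch only models the decision made afterwards.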

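    On the original instruction sequence, you are correct that there is a register dependency between these instructions (each load's writeback to r0 feeds the next load's address) that might hinder out-of-order issue. Rewriting using immediate-offset loads without writeback, followed by a single update of r0, might give better performance on some processors: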

    ldr     r1, [r0]     /* 1 */

    ldr     r2, [r0, #4]     /* 2 */

    ldr     r3, [r0, #8]     /* 3 */

    ldr     r4, [r0, #12]     /* 4 */

    add   r0, r0, #16

    /* At this point, r0 = 0x1C */

    But as with any optimization you should benchmark this. On simpler processors, the additional ADD instruction will make it go slower; you might be able to fold the base update into one of the loads by making it post-indexed and placing it last:

    ldr     r2, [r0, #4]     /* 2 */

    ldr     r3, [r0, #8]     /* 3 */

    ldr     r4, [r0, #12]     /* 4 */

    ldr     r1, [r0], #16     /* 1 */

    /* At this point, r0 = 0x1C */

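    Plus, for this example, using ldrd or ldmia would be a better option for some processors. (Note that in ARM state LDRD requires an even-numbered first register paired with the next register, so the register pairs shown below are only encodable in Thumb-2; for ARM state you would need to choose different destination registers.)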

    ldrd    r3, r4, [r0, #8]     /* 3, 4 */

    ldrd    r1, r2, [r0], #16     /* 1, 2 */

    /* At this point, r0 = 0x1C */

    ldmia r0!, {r1-r4}

    /* At this point, r0 = 0x1C */
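    If you want to convince yourself that the rewrites are architecturally equivalent to the original sequence, a toy C model of the two addressing modes is enough. This is only a sketch: the register file, the eight-word memory image and every function name here are made up for illustration, and it says nothing about timing.

```c
#include <stdint.h>
#include <string.h>

/* Toy machine state: registers r0..r4 only. */
typedef struct { uint32_t r[5]; } regs_t;

/* Made-up memory image: eight words covering addresses 0x00..0x1F. */
static const uint32_t mem_words[8] = {
    0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7
};

static uint32_t load_word(uint32_t addr) { return mem_words[addr / 4u]; }

/* ldr rt, [rn], #imm : load from rn, then rn += imm (post-indexed). */
static void ldr_post(regs_t *s, int rt, int rn, uint32_t imm) {
    s->r[rt] = load_word(s->r[rn]);
    s->r[rn] += imm;
}

/* ldr rt, [rn, #imm] : load from rn + imm, rn unchanged (offset addressing). */
static void ldr_offset(regs_t *s, int rt, int rn, uint32_t imm) {
    s->r[rt] = load_word(s->r[rn] + imm);
}

/* Original sequence: four post-indexed loads. */
static regs_t run_original(uint32_t base) {
    regs_t s = { { base, 0, 0, 0, 0 } };
    ldr_post(&s, 1, 0, 4);
    ldr_post(&s, 2, 0, 4);
    ldr_post(&s, 3, 0, 4);
    ldr_post(&s, 4, 0, 4);
    return s;
}

/* Offset loads followed by "add r0, r0, #16". */
static regs_t run_offset_then_add(uint32_t base) {
    regs_t s = { { base, 0, 0, 0, 0 } };
    ldr_offset(&s, 1, 0, 0);
    ldr_offset(&s, 2, 0, 4);
    ldr_offset(&s, 3, 0, 8);
    ldr_offset(&s, 4, 0, 12);
    s.r[0] += 16;
    return s;
}

/* Offset loads with the base update folded into a final post-indexed load. */
static regs_t run_folded(uint32_t base) {
    regs_t s = { { base, 0, 0, 0, 0 } };
    ldr_offset(&s, 2, 0, 4);
    ldr_offset(&s, 3, 0, 8);
    ldr_offset(&s, 4, 0, 12);
    ldr_post(&s, 1, 0, 16);
    return s;
}

/* Nonzero if two final states match exactly. */
static int same_state(regs_t a, regs_t b) {
    return memcmp(a.r, b.r, sizeof a.r) == 0;
}
```

    With base = 0xC, all three variants end with the same r1-r4 and r0 = 0x1C, which matches the comments in the listings above. Whether any of them is actually faster is still something only a benchmark on the real core can tell you.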

    That's enough from me!