Reordering between multiple loads

Hello,

I would like to know whether the following sequence of post-indexed LDR instructions could be reordered on, say, a Cortex-A8:

To simplify, let's assume r0 = 0xC and a cache line size of 16 bytes:

ldr     r1, [r0], #4     /* 1 */
ldr     r2, [r0], #4     /* 2 */
ldr     r3, [r0], #4     /* 3 */
ldr     r4, [r0], #4     /* 4 */
/* At this point, r0 = 0x1C */

Will these instructions always execute in the order 1-2-3-4 (because r0 is updated by each one), or is there a chance they could execute in another order, such as 2-3-4-1?

Thanks.

  • It depends on why you have an abort. If the abort is due to a fault from the MMU, it will always be synchronous. So if load (1) faults in the MMU, DFAR will contain the address accessed by (1), and the preferred return address derived from LR_abt will be instruction (1) (LR_abt itself holds the address of instruction (1) plus whatever offset the ARM Architecture Reference Manual specifies). Note that the processor might already have executed the other loads if the memory is Normal memory; in effect it speculates these loads.

    If the abort has come from the external system, it depends on the processor implementation whether this is taken synchronously or asynchronously. My recollection is that Cortex-A8 would take it asynchronously, but this should be confirmed from the Technical Reference Manual. (It might depend on the memory type.) If it is asynchronous then DFAR doesn't contain a valid address, and LR_abt will point to whatever instruction was interrupted to take the asynchronous Abort.

    On the original instruction sequence, you are correct that there is a register dependency between these instructions (each load needs the r0 written back by the previous one) that might hinder out-of-order issue. Rewriting using immediate-offset addressing, with no base writeback, might give better performance on some processors:

    ldr     r1, [r0]          /* 1 */
    ldr     r2, [r0, #4]      /* 2 */
    ldr     r3, [r0, #8]      /* 3 */
    ldr     r4, [r0, #12]     /* 4 */
    add     r0, r0, #16
    /* At this point, r0 = 0x1C */

    But as with any optimization, you should benchmark this. On simpler processors, the extra add instruction will make it slower; you might be able to fold the base update into one of the loads by making that load post-indexed and placing it last:

    ldr     r2, [r0, #4]      /* 2 */
    ldr     r3, [r0, #8]      /* 3 */
    ldr     r4, [r0, #12]     /* 4 */
    ldr     r1, [r0], #16     /* 1 */
    /* At this point, r0 = 0x1C */

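    Also, for this example, using ldrd or ldmia would be a better option on some processors. Note that in ARM (A32) state ldrd requires an even-numbered first register followed by the next consecutive register, so the r3, r4 and r1, r2 pairs below only assemble as Thumb-2; in ARM state you would need pairs such as r2, r3.

    ldrd    r3, r4, [r0, #8]      /* 3, 4 */
    ldrd    r1, r2, [r0], #16     /* 1, 2 */
    /* At this point, r0 = 0x1C */

    Or, as a single instruction:

    ldmia   r0!, {r1-r4}
    /* At this point, r0 = 0x1C */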

    That's enough from me!
