Reordering between multiple loads

Hemant over 10 years ago

Hello,

I have a question about whether the following sequence of post-indexed LDR instructions could be reordered on, say, a Cortex-A8.

To simplify, let's assume r0 = 0xC and a cache line size of 16 bytes:

ldr r1, [r0], #4 /* 1 */

ldr r2, [r0], #4 /* 2 */

ldr r3, [r0], #4 /* 3 */

ldr r4, [r0], #4 /* 4 */

/* At this point, r0 = 0x1C */

Will the instructions above always execute in the order 1-2-3-4 (because r0 is updated by each one), or could they execute as, say, 2-3-4-1?

Thanks.

Martin Weidmann over 10 years ago

It's going to depend on how the target address region is defined.
If it is defined as Device (or Strongly Ordered), then the accesses would be observed in order.
If it is defined as Normal, then the accesses could be observed out of order (that is, you could see 1-2-3-4 or 4-3-2-1 or 4-1-3-2, etc.). It's also possible that the accesses might be merged together into a smaller number of accesses. At a memory system level, if this is cacheable memory, all you are likely to see is a cache line fill for each line touched (with r0 = 0xC and 16-byte lines, the four loads span two lines).
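To make the addresses in the question concrete, here is a minimal C model (not ARM code) of the four post-indexed loads. It only tracks the address arithmetic; the helper names `post_indexed_address` and `cache_line_base` are made up for illustration, and the 16-byte line size is the simplification from the question, not the real Cortex-A8 value.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 16u /* assumption taken from the question */

/* Models "ldr rX, [r0], #4": the load uses the current r0,
 * then r0 is incremented by 4 (post-indexed write-back). */
uint32_t post_indexed_address(uint32_t *r0)
{
    uint32_t addr = *r0;
    *r0 += 4u;
    return addr;
}

/* Base address of the cache line that contains addr. */
uint32_t cache_line_base(uint32_t addr)
{
    return addr & ~(CACHE_LINE_BYTES - 1u);
}
```

Starting from r0 = 0xC, four calls to `post_indexed_address` yield 0xC, 0x10, 0x14 and 0x18 and leave r0 at 0x1C. Under the 16-byte assumption, the first address falls in the line at 0x00 and the other three in the line at 0x10.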

Hemant over 10 years ago in reply to Martin Weidmann

So does this mean that using post-indexed loads (with r0 updated by every instruction) places no constraint on out-of-order execution of the loads? That is, isn't each load treated as depending on the r0 value produced by the previous instruction?
Thanks.

Martin Weidmann over 10 years ago in reply to Hemant

There are two different, but related, concepts here: the ordering of instructions and the ordering of memory accesses (or perhaps more accurately, the order they are observed in).
Take Cortex-A7: it has an in-order pipeline. That means it would execute your instruction sequence in order. However, the fact that the instructions executed in order doesn't mean that the memory accesses won't be re-ordered. Memory accesses caused by those loads could still be re-ordered (and/or merged, and/or speculated) if the target address was marked as Normal.
Instruction ordering is mostly invisible to software, as the architecture requires an implementation to behave "as if" the instructions executed in order. Memory access order often doesn't matter to software, but sometimes does. That is where the barrier instructions (DMB, DSB, ISB) and the load-acquire/store-release instructions (LDAR, STLR) come in.
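As a rough illustration of a case where memory access order does matter, here is a minimal C11 sketch of a flag-protected payload. The names `payload`, `ready`, `publish` and `try_consume` are hypothetical. The release store and acquire load are what a compiler would typically lower to DMB-based sequences on ARMv7 or to STLR/LDAR on ARMv8, though the exact code generated depends on the toolchain and target.

```c
#include <stdatomic.h>
#include <stdint.h>

static uint32_t payload;  /* ordinary data in Normal memory */
static atomic_int ready;  /* flag that publishes the payload */

/* Producer side: write the data, then set the flag with release
 * ordering so the payload write cannot be observed after the flag. */
void publish(uint32_t value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer side: read the flag with acquire ordering so the payload
 * read cannot be performed before it. Returns 1 and fills *out if the
 * payload has been published, otherwise returns 0. */
int try_consume(uint32_t *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return 0;
    *out = payload;
    return 1;
}
```

In real use `publish` and `try_consume` would run on different cores. Without the release/acquire pair, the consumer could see the flag set and still read a stale payload.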

Hemant over 10 years ago in reply to Martin Weidmann

Thanks. So in this case, if the memory access for instruction 1 aborts while the instruction currently being executed is 4 (in the example above), could this still be a (kind of) imprecise abort?
- That is, the (out-of-order, Normal) memory access for instruction 1 aborts
- But we are at instruction 4
- Is DFAR guaranteed to hold the address accessed by instruction 1?
- Is LR (in ABT mode) guaranteed to always hold the PC value for instruction 1?
Thanks.

Martin Weidmann over 10 years ago in reply to Hemant

It would depend on what kind of an abort it was.
MMU-based faults (translation faults, permission faults, access flag faults...) are synchronous with the instruction that caused them. And as mentioned, instructions must appear to be executed in order.
For the instruction sequence you gave, imagine that the starting value of r0 was 0x3FFFFFF8. That is, the first two instructions access one page (page A) and the next two access the following page (page B).
Let's say page A is marked as Fault and page B as Normal. The first LDR will trigger a synchronous fault. The processor _might_ have speculatively already performed the two loads from page B, but when we take the exception the state will be consistent with none of the later instructions having executed.
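The page split in that example can be checked with a few lines of C. This is only an address-arithmetic sketch: the 4 KB page size (`PAGE_SHIFT` of 12) is an assumption, since the page size is not stated above, and the helper names are made up for illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u /* assumption: 4 KB small pages */

/* Page number that contains addr. */
uint32_t page_number(uint32_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* For `count` consecutive word loads starting at `base`, return the
 * 0-based index of the first load that lands in a different page
 * from the first load, or -1 if all loads stay in the same page. */
int first_load_in_next_page(uint32_t base, int count)
{
    for (int i = 1; i < count; i++) {
        if (page_number(base + 4u * (uint32_t)i) != page_number(base))
            return i;
    }
    return -1;
}
```

With base 0x3FFFFFF8 and four loads, the function returns 2: loads 1 and 2 (addresses 0x3FFFFFF8 and 0x3FFFFFFC) are in page A, and loads 3 and 4 (0x40000000 and 0x40000004) are in page B.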

Hemant over 10 years ago in reply to Martin Weidmann

Thanks a lot, that clarifies!

Michael Williams over 10 years ago in reply to Hemant

It depends why you have an abort. If the abort is due to a fault from the MMU, it will always be synchronous. So if load (1) faults in the MMU, DFAR will contain the address accessed by (1), and LR_abt will identify instruction (1) as the preferred return address (i.e. it holds the address of instruction (1) plus whatever offset the ARM Architecture Reference Manual specifies). Note that the processor might have executed the other instructions if the memory is Normal memory; in effect, it speculates these loads.
If the abort has come from the external system, it depends on the processor implementation whether this is taken synchronously or asynchronously. My recollection is that Cortex-A8 would take it asynchronously, but this should be confirmed from the Technical Reference Manual. (It might depend on the memory type.) If it is asynchronous then DFAR doesn't contain a valid address, and LR_abt will point to whatever instruction was interrupted to take the asynchronous Abort.
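For the synchronous case, the LR_abt offset can be written down as a one-line helper. This is a sketch under the assumption of an ARMv7-A Data Abort, where the architecture sets LR_abt to the address of the aborted instruction plus 8 in both ARM and Thumb state; the function name is hypothetical, and the offset should be checked against the Architecture Reference Manual for the core in use.

```c
#include <stdint.h>

/* Assumption: ARMv7-A Data Abort, LR_abt = aborted instruction + 8. */
#define DATA_ABORT_LR_OFFSET 8u

/* Address of the instruction that caused a synchronous Data Abort.
 * Not meaningful for an asynchronous abort, where LR_abt only
 * identifies the instruction that happened to be interrupted. */
uint32_t aborted_instruction_address(uint32_t lr_abt)
{
    return lr_abt - DATA_ABORT_LR_OFFSET;
}
```

For example, if load (1) sits at 0x8000 and faults in the MMU, LR_abt would hold 0x8008 and the helper returns 0x8000.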
On the original instruction sequence, you are correct that there is a register dependency between these instructions that might hinder out-of-order issue. Rewriting it using loads with immediate offsets (no base write-back) might give better performance on some processors:
ldr     r1, [r0]     /* 1 */
ldr     r2, [r0, #4]     /* 2 */
ldr     r3, [r0, #8]     /* 3 */
ldr     r4, [r0, #12]     /* 4 */
add   r0, r0, #16
/* At this point, r0 = 0x1C */
But as with any optimization, you should benchmark this. On simpler processors, the additional instruction will make it go slower; you might be able to fold the add into the last load by making it post-indexed:
ldr     r2, [r0, #4]     /* 2 */
ldr     r3, [r0, #8]     /* 3 */
ldr     r4, [r0, #12]     /* 4 */
ldr     r1, [r0], #16     /* 1 */
/* At this point, r0 = 0x1C */
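Plus, for this example, using ldrd or ldmia would be a better option on some processors. (In ARM state, ldrd needs an even-numbered first register with the next register as the second, so the odd-first pairs below only assemble as Thumb-2.)
ldrd    r3, r4, [r0, #8]     /* 3, 4 */
ldrd    r1, r2, [r0], #16     /* 1, 2 */
/* At this point, r0 = 0x1C */
or:
ldmia r0!, {r1-r4}
/* At this point, r0 = 0x1C */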
That's enough from me!