Hi,
We are experimenting with the following loop (it runs 4096 iterations) and we measure CPI = 0.66. With 9 instructions per iteration, that means the loop initiation interval (II) is about 6 machine cycles. We are trying hard to understand why the II is ~6 rather than ~5. Could you advise us whether our manual simulation is correct?
.LBB1_1:                      @ %for.body
                              @ =>This Inner Loop Header: Depth=1
    ldr   r6, [r5]            I1
    mov   r3, r2              I2
    ldr   r4, [r2]            I3
    ldr   r7, [r3,            I4
    adds  r1, r1,             I5
    str   r7, [r2]            I6
    mla   r0, r4, r6, r0      I7
    mov   r2, r3              I8
    bne   .LBB1_1             I9
                              I10
                              I11
                              I12
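As a sanity check on how we get from the measured CPI to the II, here is a minimal Python sketch of the arithmetic. It only assumes that one iteration executes the 9 instructions I1 to I9 from the listing above; the constant and function names are ours, chosen for illustration.

```python
# Relate the measured CPI to cycles per loop iteration (the II).
INSTRUCTIONS_PER_ITERATION = 9   # I1..I9 in the listing above
MEASURED_CPI = 0.66              # value we measured
ITERATIONS = 4096                # loop trip count


def cycles_per_iteration(cpi, instructions):
    """II = CPI * instructions executed per iteration."""
    return cpi * instructions


def cpi_for_ii(ii, instructions):
    """CPI we would expect to measure for a given II."""
    return ii / instructions


measured_ii = cycles_per_iteration(MEASURED_CPI, INSTRUCTIONS_PER_ITERATION)
total_cycles = measured_ii * ITERATIONS

print(f"measured II  ~= {measured_ii:.2f} cycles per iteration")
print(f"total cycles ~= {total_cycles:.0f}")
print(f"CPI if II=5  ~= {cpi_for_ii(5, INSTRUCTIONS_PER_ITERATION):.3f}")
print(f"CPI if II=6  ~= {cpi_for_ii(6, INSTRUCTIONS_PER_ITERATION):.3f}")
```

So an II of 5 would show up as a CPI of roughly 0.556, while our measured 0.66 sits very close to the 0.667 that an II of 6 predicts.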
To simplify the question, please assume that the loop is executing in steady state (the BTB, BPB, and caches are warmed up). We also ignore the decode stages to bring out our question more clearly.
Since the Cortex-A8 is a dual-issue, in-order execute/commit processor, we simulate by fetching two instructions at a time. Please let us know whether the following trace is correct. Thanks,
time 1: fetch I1 and I2
time 2: fetch I3 and I4 - issue I1 in pipe0 and I2 in pipe1
time 3: fetch I5 and I6 - issue I3 in pipe0 (structural hazard)
time 4: fetch I7 and I8 - issue I4 in pipe0 and I5 in pipe1
time 5: fetch I9 and I10 (next sequential addr) - issue I6 in pipe0
time 6: fetch I11 and I12 - issue I7 in pipe0 and I8 in pipe1
since I9 (branch) is predicted as taken, discard I10, I11, and I12
time 7: fetch I1 and I2 - issue I9 in pipe0
time 8: fetch I3 and I4 - issue I1 in pipe0 and I2 in pipe1
...
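
To make the hand trace easier to check, here is a small Python sketch that simply transcribes the issue groups above and derives the II from them. It is not a model of the real Cortex-A8 issue rules: the table is copied from our trace, and the helper names are ours.

```python
import math

# Issue groups per cycle, transcribed from the hand trace above
# ("time N: ... issue ..."). Time 8 is the first group of the next iteration.
ISSUE_TRACE = {
    2: ["I1", "I2"],
    3: ["I3"],          # issued alone (structural hazard in our trace)
    4: ["I4", "I5"],
    5: ["I6"],
    6: ["I7", "I8"],
    7: ["I9"],
    8: ["I1", "I2"],
}

INSTRUCTIONS_PER_ITERATION = 9  # I1..I9


def initiation_interval(trace, first="I1"):
    """Cycles between two successive issues of `first` in the trace."""
    starts = sorted(t for t, group in trace.items() if first in group)
    return starts[1] - starts[0]


def single_issue_cycles(trace, start, ii):
    """Cycles of one iteration in which only one instruction issues."""
    return [t for t in range(start, start + ii) if len(trace[t]) == 1]


ii = initiation_interval(ISSUE_TRACE)
singles = single_issue_cycles(ISSUE_TRACE, 2, ii)
# Best case for a dual-issue machine: every cycle but one issues a pair.
lower_bound = math.ceil(INSTRUCTIONS_PER_ITERATION / 2)

print(f"II from the trace: {ii} cycles")
print(f"single-issue cycles: {singles}")
print(f"dual-issue lower bound: {lower_bound} cycles")
```

Read this way, our trace has three single-issue cycles (I3, I6, and I9), which gives an II of 6. An II of 5 would require all but one of the 9 instructions to issue as pairs.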