Hi,
We are experimenting with the following loop (it runs 4096 iterations) and we measure CPI = 0.66; in other words, the loop initiation interval (II) is about 6 machine cycles. We are trying hard to work out why the II is ~6 rather than ~5. With that in mind, could you advise us whether our manual simulation is correct?
.LBB1_1:                        @ %for.body
                                @ =>This Inner Loop Header: Depth=1
    ldr   r6, [r5]              I1
    mov   r3, r2                I2
    ldr   r4, [r2]              I3
    ldr   r7, [r3,              I4
    adds  r1, r1,               I5
    str   r7, [r2]              I6
    mla   r0, r4, r6, r0        I7
    mov   r2, r3                I8
    bne   .LBB1_1               I9
                                I10
                                I11
                                I12
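For reference, the II figure quoted above follows directly from the measured CPI. The short Python sketch below only restates that arithmetic; the variable names are ours, and the instruction count assumes the loop body is exactly I1 to I9.

```python
# Convert the measured CPI into cycles per loop iteration (the II).
INSTRUCTIONS_PER_ITERATION = 9   # I1..I9; I10..I12 are never executed
measured_cpi = 0.66              # figure reported by the tool

measured_ii = measured_cpi * INSTRUCTIONS_PER_ITERATION
cpi_at_ii_5 = 5 / INSTRUCTIONS_PER_ITERATION  # CPI we would see if II were 5

print(f"measured II ~= {measured_ii:.2f} cycles")
print(f"CPI at II = 5 would be {cpi_at_ii_5:.3f}")
```

So the gap we are asking about is roughly one cycle per iteration.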
To simplify the question, please assume that the loop is executing in a steady state (the BTB, BPB, and caches are warmed up). I have also ignored the decode stages to bring out the question more clearly.
Since the Cortex-A8 is a dual-issue, in-order execute/commit processor, we simulate it by fetching two instructions at a time. Please let us know whether this is correct. Thanks,
time 1: fetch I1 and I2
time 2: fetch I3 and I4 - issue I1 in pipe0 and I2 in pipe1
time 3: fetch I5 and I6 - issue I3 in pipe0 (structural hazard)
time 4: fetch I7 and I8 - issue I4 in pipe0 and I5 in pipe1
time 5: fetch I9 and I10 (next sequential addr) - issue I6 in pipe0
time 6: fetch I11 and I12 - issue I7 in pipe0 and I8 in pipe1
since I9 (branch) is predicted as taken, discard I10, I11, and I12
time 7: fetch I1 and I2 - issue I9 in pipe0
time 8: fetch I3 and I4 - issue I1 in pipe0 and I2 in pipe1
...
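One way to sanity-check the hand trace is to tabulate its issue slots and derive the II from them. The Python sketch below does only that: the cycle numbers are copied from the trace above, and the helper name `initiation_interval` is made up for illustration.

```python
# Issue slots copied from the hand trace: cycle -> instructions issued.
issue_trace = {
    2: ["I1", "I2"],
    3: ["I3"],          # issued alone (structural hazard with I4)
    4: ["I4", "I5"],
    5: ["I6"],
    6: ["I7", "I8"],
    7: ["I9"],
    8: ["I1", "I2"],    # first issue slot of the next iteration
}

def initiation_interval(trace, first="I1"):
    """Cycles between two consecutive issues of the loop's first instruction."""
    starts = sorted(c for c, insts in trace.items() if first in insts)
    return starts[1] - starts[0]

ii = initiation_interval(issue_trace)
start = min(issue_trace)
cycles = range(start, start + ii)

# Instructions issued within one iteration, and the cycles that used one pipe.
issued = sum(len(issue_trace[c]) for c in cycles)
single_issue = [c for c in cycles if len(issue_trace[c]) == 1]

print(f"II = {ii} cycles, {issued} instructions, CPI = {ii / issued:.3f}")
print("single-issue cycles:", single_issue)
```

Read this way, the trace as written gives an II of 6 (CPI of about 0.667), with times 3, 5 and 7 each issuing a single instruction, so those three slots are the ones worth examining.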
Thanks for your question.
I suspect that your model is a little too simplistic. Chapter 16 of the Cortex-A8 Technical Reference Manual describes restrictions on which combinations of instructions can be dual-issued, and I think you will need to factor these into your simulation.
Out of interest, what is the purpose of your simulation? It looks very interesting.
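As a rough illustration of what factoring in such restrictions might look like, here is a minimal Python sketch of a greedy in-order pairing model with pluggable rules. Everything in it is an assumption: the instruction classes are read off the mnemonics in the loop, the first rule is the structural hazard the hand trace already applies to I3 and I4, and `mul_only_as_older` is a hypothetical rule included purely to show the mechanism, so it should be checked against the TRM rather than taken as fact. The model also ignores fetch and does not pair across the branch.

```python
# Instruction classes for I1..I9, read off the mnemonics in the loop.
LOOP = [("I1", "mem"), ("I2", "alu"), ("I3", "mem"), ("I4", "mem"),
        ("I5", "alu"), ("I6", "mem"), ("I7", "mul"), ("I8", "alu"),
        ("I9", "branch")]

def one_mem_op_per_cycle(older, younger):
    # The structural hazard the hand trace applies to I3/I4.
    return not (older == "mem" and younger == "mem")

def mul_only_as_older(older, younger):
    # HYPOTHETICAL rule, for illustration only; verify against the TRM.
    return younger != "mul"

def cycles_per_iteration(loop, rules):
    """Greedy in-order issue: pair the next two instructions if all rules allow it."""
    cycles, i = 0, 0
    while i < len(loop):
        can_pair = (i + 1 < len(loop)
                    and all(rule(loop[i][1], loop[i + 1][1]) for rule in rules))
        i += 2 if can_pair else 1
        cycles += 1
    return cycles

base = cycles_per_iteration(LOOP, [one_mem_op_per_cycle])
restricted = cycles_per_iteration(LOOP, [one_mem_op_per_cycle, mul_only_as_older])
print(base, restricted)
```

With only the load/store rule this toy model needs 5 cycles per iteration; adding the one hypothetical rule pushes it to 6. That shows how a single extra pairing restriction can account for a one-cycle difference, not that this particular rule is the cause.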
Regards
Chris
Hi Chris,
Thank you so much for your kind response. We are currently implementing modulo scheduling as a compiler optimization at the machine-code level (using the LLVM-VPO compiler). The objective is to close the performance gap between the Cortex-A8 and the Cortex-A9. In order to model the architectural restrictions within our compiler, we have been trying to understand the cycle counts reported by the Carbon bare metal processor tool. I think we are getting better :) :)