
Cortex-A8 : instruction fetch for dual-issue


Hi,

We ran the following loop (4096 iterations) and measured CPI = 0.66; in other words, the loop initiation interval (II) is about 6 machine cycles. We are trying hard to work out why the II is ~6 rather than ~5. Could you advise us whether our manual simulation is correct?

 

.LBB1_1:                                @ %for.body
                                        @ =>This Inner Loop Header: Depth=1
        ldr     r6, [r5]                @ I1
        mov     r3, r2                  @ I2
        ldr     r4, [r2]                @ I3
        ldr     r7, [r3,                @ I4
        adds    r1, r1,                 @ I5
        str     r7, [r2]                @ I6
        mla     r0, r4, r6, r0          @ I7
        mov     r2, r3                  @ I8
        bne     .LBB1_1                 @ I9
                                        @ I10
                                        @ I11
                                        @ I12
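
As a quick check of the arithmetic behind the question, the following Python sketch converts between CPI and II for the nine instructions I1 to I9 of the loop body. The function names are illustrative only, and the calculation assumes that the 4096 loop iterations dominate the measurement so that any code outside the loop can be neglected.

```python
# Convert between cycles-per-instruction (CPI) and the loop initiation
# interval (II) for a loop body with a fixed instruction count.

LOOP_INSTRUCTIONS = 9  # I1..I9 in the loop body


def ii_from_cpi(cpi, n_instructions=LOOP_INSTRUCTIONS):
    """Cycles per loop iteration implied by a measured CPI."""
    return cpi * n_instructions


def cpi_from_ii(ii, n_instructions=LOOP_INSTRUCTIONS):
    """CPI implied by a given initiation interval."""
    return ii / n_instructions


measured_ii = ii_from_cpi(0.66)  # cycles per iteration at the measured CPI
target_cpi = cpi_from_ii(5)      # CPI that an II of 5 cycles would give

print(f"II at CPI 0.66: {measured_ii:.2f} cycles")
print(f"CPI at II 5:    {target_cpi:.3f}")
```

A measured CPI of 0.66 corresponds to roughly 5.94 cycles per iteration, whereas an II of 5 would show up as a CPI of about 0.556.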

To simplify the question, please assume that the loop is executing in steady state (the BTB, BPB, and caches are warmed up). I have also ignored the decode stages to bring out the question more clearly.

Since the Cortex-A8 is a dual-issue, in-order execute/commit processor, we simulate it by fetching two instructions at a time. Please let us know whether this is correct. Thanks,

time 1:  fetch I1 and I2
time 2:  fetch I3 and I4                            -  issue I1 in pipe0 and I2 in pipe1
time 3:  fetch I5 and I6                            -  issue I3 in pipe0 (structural hazard)
time 4:  fetch I7 and I8                            -  issue I4 in pipe0 and I5 in pipe1
time 5:  fetch I9 and I10 (next sequential addr)    -  issue I6 in pipe0
time 6:  fetch I11 and I12                          -  issue I7 in pipe0 and I8 in pipe1
         (since I9, the branch, is predicted as taken, discard I10, I11, and I12)
time 7:  fetch I1 and I2                            -  issue I9 in pipe0
time 8:  fetch I3 and I4                            -  issue I1 in pipe0 and I2 in pipe1

...
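
The hand schedule can also be treated as data. The Python sketch below transcribes the issue column from the timeline (the dictionary layout and function names are made up for illustration) and computes the steady-state II as the distance between two consecutive issues of I1.

```python
# Issue slots transcribed from the hand simulation: time -> instructions
# issued in that cycle (pipe0 first, then pipe1 if used).
issue_schedule = {
    2: ["I1", "I2"],
    3: ["I3"],
    4: ["I4", "I5"],
    5: ["I6"],
    6: ["I7", "I8"],
    7: ["I9"],
    8: ["I1", "I2"],  # first issue group of the next iteration
}


def initiation_interval(schedule, first="I1"):
    """Cycles between two consecutive issues of the loop's first instruction."""
    times = sorted(t for t, group in schedule.items() if first in group)
    return times[1] - times[0]


def instructions_per_iteration(schedule, first="I1"):
    """Count instructions issued from one issue of `first` up to the next."""
    times = sorted(t for t, group in schedule.items() if first in group)
    start, end = times[0], times[1]
    return sum(len(g) for t, g in schedule.items() if start <= t < end)


ii = initiation_interval(issue_schedule)
n = instructions_per_iteration(issue_schedule)
print(f"II = {ii} cycles, {n} instructions, CPI = {ii / n:.2f}")
```

Read this way, the timeline as written issues I1 at time 2 and again at time 8, so the hand simulation itself yields an II of 6 cycles for 9 instructions.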

  • Hello.

    Unfortunately, I cannot answer your question. I'm just curious: how did you measure the CPI?

  • Hi,

    Thanks for your question.

    I suspect that your model is a little too simplistic. Chapter 16 of the Cortex-A8 Technical Reference Manual describes restrictions on which combinations of instructions can be dual-issued. I think you will need to factor these into your simulation.

    Out of interest, what is the purpose of your simulation? It looks very interesting.

    Regards

    Chris
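
One way to factor pairing restrictions into such a model is to make the pairing rule a separate predicate. The Python sketch below is a minimal greedy in-order model, not the actual Cortex-A8 rule set: the only restriction it encodes is the assumption that two memory operations cannot issue in the same cycle (the structural hazard noted at time 3 in the question). It models instruction classes only, ignores data dependencies and latencies, and all names in it are hypothetical.

```python
# Greedy in-order dual-issue model with a pluggable pairing rule.
# ASSUMPTION: the only restriction modelled is "at most one load/store per
# cycle"; the real Cortex-A8 rules are in the TRM and are not reproduced here.

# (label, class) for I1..I9; only the instruction class is modelled.
LOOP = [
    ("I1", "mem"),     # ldr
    ("I2", "alu"),     # mov
    ("I3", "mem"),     # ldr
    ("I4", "mem"),     # ldr
    ("I5", "alu"),     # adds
    ("I6", "mem"),     # str
    ("I7", "mul"),     # mla
    ("I8", "alu"),     # mov
    ("I9", "branch"),  # bne
]


def can_pair(older, younger):
    """Hypothetical rule: two memory operations cannot issue together."""
    return not (older[1] == "mem" and younger[1] == "mem")


def schedule(instructions, pair_rule=can_pair):
    """Group instructions in program order into issue cycles of one or two."""
    groups = []
    i = 0
    while i < len(instructions):
        group = [instructions[i]]
        has_next = i + 1 < len(instructions)
        if has_next and pair_rule(instructions[i], instructions[i + 1]):
            group.append(instructions[i + 1])
        groups.append([label for label, _ in group])
        i += len(group)
    return groups


groups = schedule(LOOP)
for cycle, group in enumerate(groups, start=1):
    print(cycle, group)
print("cycles per iteration:", len(groups))
```

With only this one rule the loop packs into five issue cycles, which suggests that the sixth measured cycle comes from a restriction or latency this sketch leaves out. Further rules can be tried by passing a different `pair_rule`.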

  • Hi Chris,

    Thank you so much for your kind response. We are currently implementing modulo scheduling as a compiler optimization at the machine-code level (using the LLVM-VPO compiler). The objective is to close the performance gap between the Cortex-A8 and the Cortex-A9. To model the architectural restrictions within our compiler, we have been trying to understand the cycle counts reported by the Carbon bare-metal processor tool. I think we are getting better :) :)