Pipeline Stage Read and Write

Note: This was originally posted on 5th March 2011 at http://forums.arm.com

I'm still trying to understand the cycle table of the Cortex-A8.

Most of the tests I've made suggest this:
- source registers are needed at the beginning of the stage
- destination registers are released at the end of the stage.

With those rules, the results seem quite good.

For example, this code takes 3 cycles:

    add   r5, r5, #1
    mov   r6, r5

because ADD releases r5 at the end of stage 2, while MOV needs it at the beginning of stage 1.

That works, and it matches the real execution timing.

But I have a problem with the MLA shortcuts:


    mul   r4, r5, r4
    mla   r0, r6, r7, r4


The MUL should release r4 at the end of stage 5 (of the second cycle of the MUL).
The MLA needs r4 at the beginning of stage 4 (due to the MLA shortcut).

So the code should take 5 cycles, but in fact it takes only 4 cycles.

Is it possible that r4 is only needed at the beginning of stage 4 of the second cycle of the MLA?
Or maybe the forwarding is done at the end of stage 4, which I suppose is the same thing as the beginning of stage 5.

That could explain the missing cycle.
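
To check the arithmetic above, here is a minimal sketch of the two rules (sources needed at the beginning of a stage, results released at the end of a stage, counted from the last issue cycle for multi-cycle instructions). The `Instr` class and both helper functions are hypothetical names invented for this illustration. The stage numbers and the 2-cycle issue of MUL and MLA are the ones assumed in this post, not values checked against the TRM, and the model ignores dual issue entirely.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical timing record; all numbers come from the post, not the TRM."""
    name: str
    issue_cycles: int      # cycles the instruction spends issuing
    result_stage: int = 0  # result released at the END of this E stage (last issue cycle)
    source_stage: int = 1  # dependent source needed at the BEGINNING of this E stage
    source_cycle: int = 1  # which issue cycle of the consumer reads the source


def earliest_issue(producer: Instr, producer_issue: int, consumer: Instr) -> int:
    """Earliest 0-based cycle at which the dependent consumer can issue."""
    # Stage Ek of the producer's last issue cycle runs at cycle last + k - 1,
    # so a result released at the end of Ek is usable from cycle last + k.
    last = producer_issue + producer.issue_cycles - 1
    ready = last + producer.result_stage
    # The consumer reads at the start of stage Ej of its source_cycle-th cycle.
    read_offset = (consumer.source_cycle - 1) + (consumer.source_stage - 1)
    # The consumer can never issue before the producer has finished issuing.
    return max(ready - read_offset, producer_issue + producer.issue_cycles)


def total_cycles(producer: Instr, consumer: Instr) -> int:
    """Issue cycles used by the pair, with the producer issuing at cycle 0."""
    return earliest_issue(producer, 0, consumer) + consumer.issue_cycles


add = Instr("add", issue_cycles=1, result_stage=2)
mov = Instr("mov", issue_cycles=1, source_stage=1)
mul = Instr("mul", issue_cycles=2, result_stage=5)

# Three readings of "MLA needs r4 in stage 4".
mla_e4_first = Instr("mla", issue_cycles=2, source_stage=4, source_cycle=1)
mla_e4_second = Instr("mla", issue_cycles=2, source_stage=4, source_cycle=2)
mla_e5_first = Instr("mla", issue_cycles=2, source_stage=5, source_cycle=1)

print(total_cycles(add, mov))            # 3
print(total_cycles(mul, mla_e4_first))   # 5
print(total_cycles(mul, mla_e4_second))  # 4
print(total_cycles(mul, mla_e5_first))   # 4
```

Under these assumptions, both alternative readings (E4 of the second MLA cycle, or forwarding at the end of E4) give the measured 4 cycles, so this model alone cannot tell them apart.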
  • Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

    A multiply that is followed by a MAC with a dependency on the accumulator, Rn register, triggers a special accumulator
    forwarding. This enables both instructions to issue back-to-back because Rn is required as a source in E4. If this accumulator
    forwarding is not used, Rn is required in E2.
  • Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

    Sorry, but what do you mean by the word "shortcut"?

    And how do you know that MUL followed by MLA takes 4 cycles?
    I mean, how do you test it?
  • Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

    mul     r4, r5, r4
    mla     r0, r6, r7, r4


    We start in cycle 1.
    The MUL instruction blocks r4 until E5, so r4 will be available in cycle 1+6=7.
    But the MLA needs r4 only in E4, so here we win 3 cycles (during these 3 cycles the two instructions execute in "parallel"),
    so the MLA can start only in cycle 7-3=4.
    But on your site http://pulsar.webshaker.net/, the MLA starts to execute in cycle 3. Can you please explain why?
  • Note: This was originally posted on 6th March 2011 at http://forums.arm.com

    Hello

    I think that the first justification
      r4 is only needed at the beginning of stage 4 of the second cycle of the MLA
    is what causes the execution to take 4 cycles.

    But could you please tell me where in the TRM you found that the first multiply has its result ready at E5 of the second cycle? OK, it's common sense, but I remember reading that detail once and I can't find it again. I'd like to reread the part of the documentation that covers this; surely the answer is hidden somewhere in there. Thanks.
  • Note: This was originally posted on 23rd March 2011 at http://forums.arm.com


    Sorry, but what do you mean by the word "shortcut"?

    And how do you know that MUL followed by MLA takes 4 cycles?
    I mean, how do you test it?


    Oops, RUBO!
    I didn't see that you had already explained shortcuts to Vahag! Sorry.

    For a real benchmark, the best solution I've found is to put the instructions into a loop and look at the real time taken ;)

    There is a cycle counter register on the Cortex-A8, but I never succeeded in using it on my Linux distribution!
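
The loop method boils down to a bit of arithmetic: time the loop with and without the instructions under test, then convert the difference to cycles. A minimal sketch, assuming you know the core clock and that it stays fixed during the run. The function name is hypothetical, and the clock frequency and timings below are made-up inputs chosen to give a round number, not measurements from this thread.

```python
def cycles_per_iteration(elapsed_s: float, baseline_s: float,
                         iterations: int, clock_hz: float) -> float:
    """Estimate the cycles per loop iteration spent in the code under test.

    elapsed_s  : wall time of the loop containing the instructions under test
    baseline_s : wall time of the same loop with those instructions removed
    iterations : number of loop iterations in both runs
    clock_hz   : core clock frequency (assumed constant during the run)
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    # Subtract the loop overhead, then convert seconds to cycles.
    extra_s = elapsed_s - baseline_s
    return extra_s * clock_hz / iterations


# Hypothetical inputs: a 500 MHz clock and one million iterations.
estimate = cycles_per_iteration(elapsed_s=0.012, baseline_s=0.004,
                                iterations=1_000_000, clock_hz=500e6)
print(round(estimate, 2))
```

Running many iterations matters because timer resolution and loop overhead would swamp a single pass, and the baseline run has to use the same loop structure for the subtraction to mean anything.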
  • Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

    Yes.
    That's what is written, but...

    What Vahag says is still correct
    if you just apply the stage dependency rules.

    The MLA should not take 2 cycles!!!
  • Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

    According to Ben AVISON
    http://www.avison.me.uk/ben/programming/cortex-a8.html

    or to Hilbert-space
    http://hilbert-space.de/?p=66


    the MUL cycle table in the documentation seems to be wrong.

    I've run my own tests.
    For SMULBB, it seems clear that there is an error in the documentation.

    For shortcuts (shortcut = special forwarding),
    you're right. Your cycle count is correct.
    But once again, in real timing


    mul r4, r5, r4
    mla r0, r6, r7, r4

    takes only 4 cycles (the counter program says 4 cycles).
    I have no good explanation for that. I don't think this is an error in the documentation. I think we don't really understand the forwarding process.
  • Note: This was originally posted on 7th March 2011 at http://forums.arm.com

    You can find information here:

    http://infocenter.ar...i/Babcagee.html


    Destination available is always given with respect to the last cycle in a data processing multi-cycle instruction. This rule does not apply to load/store multiple instructions.
