Pipeline Stage Read and Write

Note: This was originally posted on 5th March 2011 at http://forums.arm.com

I'm still trying to understand the cycle table of the Cortex-A8.

Most of the tests I've made assume this:
- source registers are needed at the beginning of the stage
- destination registers are released at the end of the stage.

With those rules, the results seem quite good.

For example, this code takes 3 cycles:


    add   r5, r5, #1
    mov   r6, r5

because ADD releases r5 at the end of stage 2, while MOV needs it at the beginning of stage 1.

That works, and it matches the real execution timing.
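
The two rules above can be written down as a small model. This is only a sketch in Python, not something taken from the TRM: the function name `issue_cycle`, the numbering of cycles from 1, and the convention that "the code takes N cycles" means the last instruction issues in cycle N are all assumptions made for illustration.

```python
def issue_cycle(producer_issue, result_stage, source_stage):
    """Earliest cycle in which a dependent instruction can issue.

    Assumptions (illustrative, not from the TRM):
    - stage En of an instruction issued in cycle c occupies cycle c + n - 1
    - a result released at the end of a stage is usable from the next cycle
    - the consumer cannot issue before the cycle after the producer
    """
    ready = producer_issue + result_stage   # first cycle the value is usable
    read_offset = source_stage - 1          # cycles between issue and the read
    return max(producer_issue + 1, ready - read_offset)


# add r5, r5, #1 : issued in cycle 1, releases r5 at the end of stage 2
# mov r6, r5     : needs r5 at the beginning of stage 1
mov_issue = issue_cycle(producer_issue=1, result_stage=2, source_stage=1)
print(mov_issue)
```

Under these assumptions the MOV issues in cycle 3, which is the 3-cycle figure quoted above.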

But I have a problem with the MLA shortcuts:


    mul   r4, r5, r4
    mla   r0, r6, r7, r4

The MUL should release r4 at the end of stage 5 (of the second cycle of the MUL).
The MLA needs r4 at the beginning of stage 4 (due to the MLA shortcut).

So the code should take 5 cycles, but in fact it takes only 4 cycles.

Is it possible that r4 is only needed at the beginning of stage 4 of the second cycle of the MLA?
Or maybe the forwarding is done at the end of stage 4, so I could treat it as the same thing as the beginning of stage 5.

Either could explain the missing cycle.
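
To compare the documented reading with the two hypotheses, the same kind of model can be extended to multi-cycle instructions. Again this is a hedged sketch: `total_cycles` and its parameters are hypothetical names, MUL and MLA are each assumed to occupy 2 issue cycles (as implied by "second cycle" above), and the cycle numbering is an assumption rather than anything stated in the TRM.

```python
def total_cycles(prod_cycles, result_stage, cons_cycles, source_stage, source_iter):
    """Cycles for a producer issued in cycle 1 followed by one dependent consumer.

    prod_cycles, cons_cycles: issue cycles each instruction occupies
    result_stage: stage of the producer's LAST cycle at whose end the result is released
    source_stage: stage at whose beginning the consumer reads the register
    source_iter:  which of the consumer's cycles (1-based) performs the read
    """
    last_prod_issue = prod_cycles                 # the producer's last cycle starts here
    ready = last_prod_issue + result_stage        # first cycle the value is usable
    read_offset = (source_iter - 1) + (source_stage - 1)
    cons_issue = max(prod_cycles + 1, ready - read_offset)
    return cons_issue + cons_cycles - 1


# mul r4, r5, r4 then mla r0, r6, r7, r4
documented = total_cycles(2, 5, 2, source_stage=4, source_iter=1)
read_in_second_cycle = total_cycles(2, 5, 2, source_stage=4, source_iter=2)
read_in_stage_5 = total_cycles(2, 5, 2, source_stage=5, source_iter=1)
print(documented, read_in_second_cycle, read_in_stage_5)
```

With these assumptions the documented reading gives 5 cycles, while both hypotheses give the measured 4, so the model alone cannot tell them apart.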

Etienne SOBOLE over 12 years ago

Note: This was originally posted on 7th March 2011 at http://forums.arm.com

You can find information here:

http://infocenter.ar...i/Babcagee.html

Destination available is always given with respect to the last cycle in a data processing multi-cycle instruction. This rule does not apply to load/store multiple instructions.

Etienne SOBOLE over 12 years ago

Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

According to Ben AVISON
http://www.avison.me.uk/ben/programming/cortex-a8.html

or to Hilbert-space
http://hilbert-space.de/?p=66

The MUL cycle table in the documentation seems to be wrong.

I've made my own tests.
For SMULBB, it seems clear that there is an error in the documentation.

For shortcuts (shortcut = special forwarding):
You're right. Your cycle count is correct.
But once again, in real timing

    mul   r4, r5, r4
    mla   r0, r6, r7, r4

takes only 4 cycles (the counter program says 4 cycles).
I have no good explanation for that. I do not think this is an error in the documentation. I think we don't really understand the forwarding process.

Etienne SOBOLE over 12 years ago

Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

Yes.
That's written, but...

What Vahag says is still correct!
If you just apply the stage dependency rules,

the MLA should not take 2 cycles!

Etienne SOBOLE over 12 years ago

Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

Sorry, but what do you mean by the word "shortcut"?

And how do you know that mul then mla takes 4 cycles?
I mean, how do you test it?

Oops, RUBO!
I didn't see that you had given the explanation of shortcuts to Vahag! Sorry.

For a real benchmark, the best solution I found is to put the instructions into a loop and measure the real time taken.
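
Turning such a loop timing into a cycle count is simple arithmetic. The sketch below is only an illustration: the function name, the idea of subtracting an empty-loop baseline, and every number in the example (clock frequency, times, iteration count) are made-up assumptions, not measurements from this thread.

```python
def cycles_per_iteration(elapsed_s, baseline_s, clock_hz, iterations):
    """Estimate the cycles spent per iteration on the instructions under test.

    elapsed_s:  wall-clock time of the loop containing the instructions
    baseline_s: wall-clock time of the same loop with the instructions removed
    clock_hz:   core clock frequency (must be fixed, no frequency scaling)
    iterations: number of loop iterations in both runs
    """
    return (elapsed_s - baseline_s) * clock_hz / iterations


# Hypothetical figures: 800 MHz core, 100 million iterations,
# 1.0 s with mul+mla in the loop body, 0.5 s for the empty loop.
estimate = cycles_per_iteration(1.0, 0.5, 800e6, 100_000_000)
print(estimate)
```

Subtracting the baseline is only approximate on a dual-issue core, since the loop instructions may pair differently once the tested instructions are present.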

There is a cycle counter register on the Cortex-A8, but I never succeeded in using it on my Linux distribution!

MayaDirect Marketing Safieddine over 12 years ago

Note: This was originally posted on 6th March 2011 at http://forums.arm.com

Hello,

I think that the first explanation,
"r4 is only needed at the beginning of stage 4 of the second cycle of the MLA",
is what causes the execution to take 4 cycles.

But could you please tell me where in the TRM you found that the first multiply has its result ready at E5 of the second cycle? It's common sense, and I remember reading that detail once, but I'm unable to find it again. I'd like to reread that part of the documentation; surely the answer is hidden somewhere in it. Thanks.

barney vardanyan over 12 years ago

Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

    mul   r4, r5, r4
    mla   r0, r6, r7, r4

We are in cycle 1.
The mul instruction blocks r4 until E5, so r4 will be available in cycle 1+6=7,
but mla needs r4 only in E4, so here we win 3 cycles (during these 3 cycles they execute in "parallel"),
so mla can use r4 only in cycle 7-3=4.
But on your site http://pulsar.webshaker.net/, the mla starts to execute in cycle 3. Can you please explain why?

barney vardanyan over 12 years ago

Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

Sorry, but what do you mean by the word "shortcut"?

And how do you know that mul then mla takes 4 cycles?
I mean, how do you test it?

Ruben Buchatskiy over 12 years ago

Note: This was originally posted on 23rd March 2011 at http://forums.arm.com

A multiply that is followed by a MAC with a dependency on the accumulator, Rn register, triggers a special accumulator
forwarding. This enables both instructions to issue back-to-back because Rn is required as a source in E4. If this accumulator
forwarding is not used, Rn is required in E2.