Cycle penalty before using a register as a pointer

Note: This was originally posted on 28th January 2011 at http://forums.arm.com

Hi.

It seems that you can't use a modified register as a load address directly in the next cycle (on the Cortex-A8).

For example:

ADD   r0, r0, #16
LDR   r1,[r0]

will not execute in 2 cycles but in 3 cycles.
I've looked through the ARM documentation for where this penalty cycle is explained, but I can't find it!

If I simply use the Cortex-A8 cycle table:
ADD writes its result in E2,
while LDR needs r0 in E1.

So if I just apply those rules, the two instructions should execute in 2 cycles.

So can anybody tell me where this pipeline-dependent latency is explained (or even just mentioned)?

Thanks

Etienne SOBOLE over 12 years ago

Note: This was originally posted on 28th January 2011 at http://forums.arm.com

Yes isogen...
You're right, in the end it takes 2 cycles.

First of all, I found an example here:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Babhefaj.html
The 5th example, "Data source hazard", suggests that there is a penalty cycle.

After that, I found the same information elsewhere on the Internet.

Finally, I ran a small test, which turned out to be wrong.

But in fact you're right, there is no such latency!
Sorry.

I think that, apart from memory delay cycles, it should be possible to count cycles correctly (in most cases).
Etienne SOBOLE over 12 years ago

Note: This was originally posted on 28th January 2011 at http://forums.arm.com

Ok.

I'm back, and this time I'm a little more sure of what I'm saying...

This code takes 3 cycles:
LDR   r0, [r8]
ADD   r1, r1, r0

LDR gives its result in E3.
ADD needs r0 in E2.

So this code should take 2 cycles.

LDR   r0, [r8]
MOV   r1, r0

takes 4 cycles, while MOV needs r0 in E1!
It should take 3 cycles!

Why?
I can't find any case where code like this takes the expected number of cycles.
The results match better if I assume that LDR gives its result in E4!
Etienne SOBOLE over 12 years ago

Note: This was originally posted on 31st January 2011 at http://forums.arm.com

Well, to benchmark my code I'm using this loop:

.loop:
    @ bench code
    smuad   r9, r9, r9
    mov     r1, #0
    smuad   r10, r10, r10
    smuad   r11, r11, r11
    smuad   r12, r12, r12
    mov     r5, #0
    subs    r0, r0, #1
    bgt     .loop

This code takes 5 cycles. I replace "@ bench code" with the code I want to benchmark.
With this test protocol,

ldr   r1, [r8]
add   r2, r1, r1
takes 3 cycles.

mov   r8, r7
ldr   r1, [r8]
add   r2, r1, r1
takes 4 cycles.

So I conclude that modifying a register before using it as a pointer costs 1 cycle,
while there is still 1 unexplained cycle before the add!

I've tried to simulate the 4 functional units of the Cortex, but so far I haven't managed to reproduce the real results!
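
One way to read the protocol above is that the cost of a snippet is the measured cost per loop iteration minus the 5-cycle cost of the empty loop. A minimal sketch of that arithmetic follows. It assumes a total cycle count obtained from some cycle counter read before and after the loop; the function name and the sample counts are hypothetical, and only the 5-cycle baseline comes from the post.

```python
def snippet_cycles(total_cycles: int, iterations: int, baseline_per_iter: float = 5.0) -> float:
    """Cycles per iteration attributable to the benchmarked snippet.

    Assumption: the snippet's cost simply adds to the empty loop's cost,
    which ignores any dual-issue pairing between the snippet and the
    filler instructions of the loop.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return total_cycles / iterations - baseline_per_iter


# Hypothetical counter reading: 8,000,000 cycles for 1,000,000 iterations.
print(snippet_cycles(8_000_000, 1_000_000))
```

Because the Cortex-A8 can issue two instructions per cycle, the subtraction is only an approximation: a snippet that pairs with the filler instructions may appear cheaper than it is in isolation.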
MayaDirect Marketing Safieddine over 12 years ago

Note: This was originally posted on 30th January 2011 at http://forums.arm.com

In the 5th example in the TRM,
ADD   r0, r0, #16
LDR   r1, [r0]
I assumed that the load needs the register at the beginning of the cycle for address calculation, whereas the add produces the value of the register (r0) at the end of the cycle. So forwarding is not possible within the same cycle because of the limited cycle time, and an extra cycle is therefore required, during which the forwarding takes place.

So am I right, or should we consider the real latency of the ALU and the address generation unit, and whether they can complete one after the other in the same cycle?
But are you sure that you got 2 cycles when you tested again?
Because if you follow what I said, the same logic applies to the other two pieces of code.
I think this is where the extra cycle comes from!
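
The reading above (a result exists only at the end of its stage, an operand is needed at the start of its stage) can be sketched as a small model. The stage numbers are the ones quoted in this thread and have not been re-checked against the TRM; the function names and the "1 + distance" accounting are assumptions made for illustration, not ARM's definition.

```python
def min_issue_distance(result_stage: int, needed_stage: int) -> int:
    """Minimum cycles between issuing a producer and its dependent consumer.

    Assumption: the producer's value is available at the END of
    result_stage, and the consumer needs it at the START of needed_stage,
    so it cannot be forwarded within the same cycle.
    """
    return max(1, result_stage - needed_stage + 1)


def pair_cycles(result_stage: int, needed_stage: int) -> int:
    """Cycles for a two-instruction dependent pair under the same assumption."""
    # One cycle to issue the producer, then the distance to the consumer.
    return 1 + min_issue_distance(result_stage, needed_stage)


# (producer result stage, consumer needed stage), as quoted in the thread.
cases = {
    "ADD r0, r0, #16 ; LDR r1, [r0]": (2, 1),
    "LDR r0, [r8]    ; ADD r1, r1, r0": (3, 2),
    "LDR r0, [r8]    ; MOV r1, r0": (3, 1),
}

for text, (result_stage, needed_stage) in cases.items():
    print(f"{text}: {pair_cycles(result_stage, needed_stage)} cycles")
```

Under these assumptions the model gives 3, 3 and 4 cycles for the three sequences, which are the figures first reported for them in this thread. That agreement does not prove the model; the real pipeline has more constraints than a single end-of-stage rule.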
Peter Harris over 12 years ago

Note: This was originally posted on 28th January 2011 at http://forums.arm.com

How are you measuring the 3 cycles?

Also, be aware that ARM makes it very clear in the manual that the timing tables for the Cortex-A8 are only approximations. From the TRM:

This chapter provides the information to estimate how much execution time particular code sequences require. The complexity of the processor makes it impossible to guarantee precise timing information with hand calculations.