Cycle penalty before using a register as a pointer

Note: This was originally posted on 28th January 2011 at http://forums.arm.com

Hi.

It seems that you can't use a just-modified register as a load address in the very next cycle (on the Cortex-A8).


For example

ADD   r0, r0, #16
LDR   r1,[r0]


does not execute in 2 cycles but in 3.
I have searched the ARM documentation for where this penalty cycle is explained, but I cannot find it.

If I simply use the Cortex-A8 cycle timing table:
ADD writes its result in E2,
while LDR needs r0 in E1.

So if I just apply those rules, the two instructions should execute in 2 cycles.

So, can anybody tell me where this pipeline-dependent latency is explained (or even just mentioned)?

Thanks.
  • Note: This was originally posted on 28th January 2011 at http://forums.arm.com

    Yes, isogen, you're right: in the end it takes 2 cycles.

    First of all, I found an example here:
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Babhefaj.html
    The 5th example, "Data source hazard", suggests that there is a penalty cycle.

    After that, I found this information elsewhere on the Internet too.

    Finally, I ran a small test, which turned out to be wrong.

    But in fact you're right, there is no such latency!
    Sorry.

    I think that, except for memory delay cycles, it should be possible to count cycles correctly (in most cases).
  • Note: This was originally posted on 28th January 2011 at http://forums.arm.com

    OK.

    I'm back, and this time I'm a little more sure of my point...

    This code takes 3 cycles:

    LDR   r0, [r8]
    ADD   r1, r1, r0


    LDR gives its result in E3.
    ADD needs r0 in E2.

    So this code should take 2 cycles.




    LDR   r0, [r8]
    MOV   r1, r0


    takes 4 cycles, although MOV needs r0 in E1!
    It should take 3 cycles!

    Why?
    I cannot find a case where these sequences take the expected number of cycles.
    I get more accurate results if I assume LDR gives its result in E4!
  • Note: This was originally posted on 31st January 2011 at http://forums.arm.com

    Well, to benchmark my code I'm using this loop:


    .loop:
    @ bench code
    smuad   r9, r9, r9
    mov   r1, #0
    smuad   r10, r10, r10
    smuad   r11, r11, r11
    smuad   r12, r12, r12
    mov   r5, #0
    subs   r0, r0, #1
    bgt   .loop


    This loop takes 5 cycles. I replace the "@ bench code" line with the code I want to benchmark.
    With this test protocol,


    ldr   r1,[r8]
    add   r2, r1, r1

    takes 3 cycles.


    mov   r8, r7
    ldr   r1, [r8]
    add   r2, r1, r1

    takes 4 cycles.

    So I conclude that modifying a register just before using it as a pointer costs 1 cycle,
    while there is still 1 cycle unaccounted for before the add!

    I've tried to simulate the 4 functional units of the Cortex, but so far I have not managed to reproduce the real results!
  • Note: This was originally posted on 30th January 2011 at http://forums.arm.com

    In the 5th example in the TRM,

    ADD   r0, r0, #16
    LDR   r1,[r0]

    I assumed that the load needs the register at the beginning of the cycle for address calculation, whereas the add produces the value of the register (r0) at the end of the cycle. So forwarding cannot happen within the same cycle because of the limited cycle time, and an extra cycle is therefore required, during which forwarding takes place.

    So am I right, or should we consider the real latency of the ALU and the address generation unit, and whether they can complete sequentially in the same cycle?
    But are you sure that you got 2 cycles when you tested again?
    Because if you accept what I said, the same logic applies to the other two pieces of code.
    I think this is where the extra cycle comes from!
  • Note: This was originally posted on 28th January 2011 at http://forums.arm.com

    How are you measuring the 3 cycles?

    ... also be aware that ARM make it very clear in the manual that the timing tables for Cortex-A8 are only approximations. From the TRM:


    This chapter provides the information to estimate how much execution time particular code sequences require. The complexity of the processor makes it impossible to guarantee precise timing information with hand calculations.
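
The "end of cycle / start of cycle" hypothesis raised in the replies above can be tried out with a small hand-calculation helper. This is only a sketch under stated assumptions, not ARM's documented pipeline model: the function names and the end_of_stage flag are hypothetical, the stage numbers are the ones quoted in the posts (ADD result in E2, LDR result in E3, LDR address and MOV source in E1, ADD source in E2), and the cycle count of a pair is taken to be the issue cycle of the second instruction plus one. Dual issue and memory latency are ignored.

```python
# Sketch of the forwarding hypothesis discussed in the thread (an assumption,
# not a statement of how the Cortex-A8 is actually implemented).

def earliest_issue(producer_issue, result_stage, source_stage, end_of_stage=True):
    """Earliest issue cycle of an instruction that depends on the producer.

    result_stage: E-stage in which the producer delivers its result.
    source_stage: E-stage in which the consumer needs that register.
    end_of_stage: if True, assume the result only exists at the END of
                  result_stage while the source is needed at the START of
                  source_stage, which costs one extra cycle.
    """
    extra = 1 if end_of_stage else 0
    dependent = producer_issue + (result_stage - source_stage) + extra
    # A dependent instruction can never issue in the same cycle as its producer.
    return max(producer_issue + 1, dependent)


def pair_cycles(result_stage, source_stage, end_of_stage=True):
    """Issue cycles used by a producer/consumer pair, producer issued at cycle 0."""
    return earliest_issue(0, result_stage, source_stage, end_of_stage) + 1


# (result stage of first instruction, source stage of second instruction),
# using the stage numbers quoted in the posts.
PAIRS = {
    "ADD r0 -> LDR [r0]": (2, 1),
    "LDR r0 -> ADD r0": (3, 2),
    "LDR r0 -> MOV r0": (3, 1),
}

for name, (res, src) in PAIRS.items():
    naive = pair_cycles(res, src, end_of_stage=False)
    hypo = pair_cycles(res, src, end_of_stage=True)
    print(f"{name}: table only = {naive}, with end-of-stage rule = {hypo}")
```

With end_of_stage=False the helper gives the 2, 2 and 3 cycles expected from reading the table alone; with end_of_stage=True it gives the 3, 3 and 4 cycles reported from the measurements. That agreement is suggestive only, since the TRM states that its timing tables are estimates.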