Cycle penalty before using a register as a pointer

Note: This was originally posted on 28th January 2011 at http://forums.arm.com

Hi.

It seems that you can't use a modified register as a load address directly in the next cycle (on the Cortex-A8).


For example

ADD   r0, r0, #16
LDR   r1,[r0]


will not execute in 2 cycles but in 3 cycles.
I've been looking through the ARM documentation for where this penalty cycle is explained, but I can't find it!

If I simply use the Cortex-A8 cycle table:
ADD writes its result in E2,
while LDR needs r0 in E1.

So if I just apply those rules, the two instructions should execute in 2 cycles.

So, can anybody tell me where this pipeline-dependent latency is explained (or even just mentioned)?

Thanks
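
The stage arithmetic in the question can be sketched in a few lines of Python. This is only an illustrative model, not something taken from the ARM documentation: it assumes that a result listed in a stage becomes available at the END of that stage, that a source listed in a stage is needed at the START of that stage, and that a dependent instruction can never issue in the same cycle as its producer. The names `STAGE`, `issue_gap` and `pair_cycles` are hypothetical helpers introduced for this example. Under those assumptions the ADD/LDR pair comes out at 3 cycles, which matches the measurement rather than the 2 cycles expected above.

```python
# Hypothetical model of the stage arithmetic; stage positions are assumptions.
STAGE = {"E1": 1, "E2": 2, "E3": 3, "E4": 4, "E5": 5}


def issue_gap(result_stage, source_stage):
    """Minimum number of cycles between issuing a producer and its consumer.

    Assumption: the result is ready at the END of result_stage and the
    consumer reads its source at the START of source_stage.
    """
    gap = STAGE[result_stage] - STAGE[source_stage] + 1
    # Assumption: a dependent consumer issues at least one cycle later.
    return max(gap, 1)


def pair_cycles(result_stage, source_stage):
    """Issue cycles occupied by a dependent producer/consumer pair."""
    return 1 + issue_gap(result_stage, source_stage)


# ADD produces r0 in E2, LDR needs r0 as its address in E1.
print(pair_cycles("E2", "E1"))  # prints 3
```

If this reading of the table is right, the extra cycle would simply be the interlock between the E2 result and the E1 address source, and placing one independent instruction between the ADD and the LDR should hide it.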
  • Note: This was originally posted on 31st January 2011 at http://forums.arm.com

    Well, to benchmark my code I'm using this loop:


    .loop:
    @ bench code
    smuad   r9, r9, r9
    mov     r1, #0
    smuad   r10, r10, r10
    smuad   r11, r11, r11
    smuad   r12, r12, r12
    mov     r5, #0
    subs    r0, r0, #1
    bgt     .loop


    This loop takes 5 cycles. I replace "@ bench code" with the code I want to benchmark.
    With this test protocol,


    ldr   r1,[r8]
    add   r2, r1, r1

    takes 3 cycles.


    mov   r8, r7
    ldr   r1, [r8]
    add   r2, r1, r1

    takes 4 cycles.

    So I conclude that modifying a register just before using it as a pointer costs 1 cycle,
    while there is also a missing cycle before the add!

    I've tried to simulate the 4 functional units of the Cortex, but so far I haven't managed to reproduce the real results.