Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, and sorry for my bad English.

I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16).
However, I have no idea how to do this using that documentation.
  • Note: This was originally posted on 9th August 2011 at http://forums.arm.com


    Hmm.

    It was so strange that I ran the test myself.
    I do not get exactly the same results as you, but the problem is still there.

    That's really strange!

    If you replace VMLA.F32 with VMUL.F32 or VMLA.U32, the problem goes away.

    So I assume that the VMLA.F32 shortcut is not applied when there is another instruction between the MUL and the MLA.
    It seems that this problem only affects the floating-point MLA!

    That's strange.

    What is even stranger is why the first code takes so long, when it should take 9 cycles (if we don't use VMLA.F32)!

    Finally, I tried changing the values of the address registers:

    add   r2, r1, #16          @ r2 = r1 + 16 bytes
    add   r3, r2, #16          @ r3 = r1 + 32 bytes
    add   r4, r3, #16          @ r4 = r1 + 48 bytes
    b    .loop1
    .align 4
    .loop1:

    @ eight NEON load / multiply pairs, cycling through r1..r4
    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14

    subs   r0, r0, #1          @ r0 = loop counter
    bgt   .loop1


    This code (the NEON part only) now takes 10 cycles. It should take only 8 cycles.
    I assume that there is a conflict in the NEON memory access when you use the same address register.
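One way to read the 8-cycle figure, as a rough sketch rather than a statement of how the hardware is specified: assume that each `vld1.32` can issue in the same cycle as one `vmul.f32`, and that every instruction occupies a single issue cycle. The helper names below (`ideal_neon_cycles`, `stall_cycles`) are hypothetical, and the model ignores result latencies and memory effects entirely.

```python
def ideal_neon_cycles(load_store_ops: int, data_proc_ops: int) -> int:
    """Lower bound on issue cycles for one loop iteration.

    Assumption (not taken from the thread): one load/store instruction and
    one data-processing instruction can issue together each cycle, and each
    instruction needs exactly one issue cycle.
    """
    return max(load_store_ops, data_proc_ops)


def stall_cycles(measured: int, ideal: int) -> int:
    """Cycles per iteration that the ideal pairing does not explain."""
    return measured - ideal


# Loop body from the post: 8 vld1.32 (load) and 8 vmul.f32 (data processing).
ideal = ideal_neon_cycles(load_store_ops=8, data_proc_ops=8)
extra = stall_cycles(measured=10, ideal=ideal)
print(ideal, extra)  # 8 2
```

Under these assumptions the measured 10 cycles leave 2 cycles per iteration unexplained, which is the gap the post attributes to the repeated address registers.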

    So:
    1 - Don't put an instruction between the MUL and the MLA when you use floating-point operations.
    2 - Don't read the same data repeatedly with NEON. (You should never have to do that anyway. You only did it because you were writing a benchmark; in real life this case never happens.)

    NEON is not fully detailed in the documentation. There are a lot of tricks that you will have to find by testing.
    I did not know about the two you found!

    Etienne.


      Hi Etienne,

      Thank you very much for your response. The behavior is strange indeed, and it seems not to be documented anywhere. Do you think I should approach the ARM people to find out whether this is an anomaly, and whether the behavior is documented? Please give me your suggestion on this.

      Regards,

      Anil M S
