Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi) sorry for bad English

I need to count latency for two instruction, and all I have is the arm cortex A 8 documantation(charter 16) !
but I have no idea how can do this work using that documantation(
  • Note: This was originally posted on 17th March 2011 at http://forums.arm.com

    I need something like this


    http://pulsar.webshaker.net/ccc/result.php?lng=fr
  • Note: This was originally posted on 17th March 2011 at http://forums.arm.com

    thnaks a lot)))
  • Note: This was originally posted on 18th March 2011 at http://forums.arm.com

    I just need algoritm, wich count latence beetwen two instructions

    Is there any idea???(
  • Note: This was originally posted on 14th April 2011 at http://forums.arm.com

    Yeah my beagle board has 500MHz frequncy.
    I measur time  with linux time command)

    and what about caches, how  can I turn it on)?


    and wich compiler you use? gcc or other?




  • Note: This was originally posted on 12th April 2011 at http://forums.arm.com

    mmmm

    10 NOP instructions on my beagle takes 2.508 s
  • Note: This was originally posted on 21st March 2011 at http://forums.arm.com

    And when for example, executing
    mul r1, r2 , r3
    mul command can issue only in pipeline 0, but does it blok pipeline 1 too??
  • Note: This was originally posted on 30th March 2011 at http://forums.arm.com

    anytime Etienne! though I quit ARMing :(
  • Note: This was originally posted on 12th August 2011 at http://forums.arm.com


    Hum.
    In fact, You may be right !
    I've tryed to copy 2, 3 and 4 times the NEON code into the loop

    normally, 8 couples of instruction should take 8 cycles but it takes 10
    16 couples takes 20
    24 couples takes 38
    32 couples takes 48


    The time increase strangely.

    I've replaced the vld1 by vtrn. the timing are
    8 couples takes 8 cyles

    16 couples takes 19 cyles


    24 couples takes 32 cyles
    ..

    So now. I'm think you're right. There is a bottleneck due to bandwith.
    And it's more visible with vld1 than with vtrn because when the vld1 is pushed to the NEON queue, the VALUE of the address register is pushed too.




    We are not agree :) that's could arrives ;)




    Just replace



    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14


    by



    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r1:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r1:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r1:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r1:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r1:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r1:128]
    vmul.f32 d7,d15,d14


    ...and you'll have your proof.

    What you'll not have is a logical explanation ;)

    During a moment, I thaught that maybe the address register could not be used directly on the next cycle, but if you replace the ADD by MOV to just duplicated register value, the problem is the same !


      Hi Etienne,                    

      I also noticed that the cycle count decreases by using the different registers. In the process I encountered one more strange behavior. If I use three buffers with 4096 bytes size, the cycle count increases to around 140 cycles. I have given the code  here, r1,r2,r3 has the address of buffer1,buffer2 and buffer3 respectively.

      

    Int buffer1[1024];

    Int buffer2[1024];

    Int buffer3[1024];

     

      ASM Code:

      

    [indent] vld1.32 {d16,d17},[r1:128]

      vmul.f32 d0,d15,d14

      vld1.32 {d18,d19},[r2:128]

      vmul.f32 d1,d15,d14

      vld1.32 {d20,d21},[r3:128]

      vmul.f32 d2,d15,d14

      vld1.32 {d22,d23},[r1:128]

      vmul.f32 d3,d15,d14

    vld1.32 {d24,d25},[r2:128]

      vmul.f32 d4,d15,d14

      vld1.32 {d26,d27},[r3:128]

      vmul.f32 d5,d15,d14

      vld1.32 {d28,d29},[r1:128]

      vmul.f32 d6,d15,d14

      vld1.32 {d30,d31},[r2:128]

      vmul.f32 d7,d15,d14

     

    [/indent]  However, if the buffer size is not a multiple of 4096, cycle count is normal.

      

      Regards,

      Anil M S
  • Note: This was originally posted on 16th August 2011 at http://forums.arm.com


    140 cycles for each iteration of the loop ?



    Yeah, 140 cycles per single iteration.

    Regards,
    Anil M S
  • Note: This was originally posted on 9th August 2011 at http://forums.arm.com


    hum.

    It was so strange that I've made the test.
    I do not find exactly the same result as yours but the problem il still there.

    That's really strange !!!

    if you replace VMLA.F32 by VMUL.F32 or VMLA.U32 the problem is solved.

    So I assume that the shortcut of the vmla.f32 is not applied if there is another instruction between the mul and the mla.
    It seem's that this problem is only true for float MLA !

    That's strange.

    What is more strange is why the first code take so many time while it should take 9 cycles (if we don't use vmla.f32)  !

    I've tried to change the value of the adress register
    Finally  I changed the address register value.

    add   r2, r1, #16
    add   r3, r2, #16
    add   r4, r3, #16
    b    .loop1
    .align 4
    .loop1:

    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14

    subs   r0, r0, #1
    bgt   .loop1


    This code (the NEON part only) take now 10 cycles. It should take only 8 cycles.
    I assume that there is a conflict into the memory file of NEON when you use the same address register.

    So.
    1 - don't put instruction between MUL and MAL when you use float opérations.
    2 - don't read the same data with NEON (in you never have to do that. You've made thins because you try a bench. In real life this case never happend).

    NEON is not fully detailled in the documentation. There is a lot of hint you'll have to found by testing.
    I do not know the both you found !

    Etienne.


      Hi Etienne,

      Thank you very much for your response. The behavior is strange indeed, and seems it is not documented anywhere. Do you think I should approach ARM people to find whether this is an anomaly. And also find whether the behavior is documented? Please give your suggestion on this.

      Regards,

      Anil M S

  • Note: This was originally posted on 2nd August 2011 at http://forums.arm.com

    Hi all,
    I am doing some profiling analysis on Cortex A8 processor using the Beagle Board-xM. I found a strange behavior with the following piece of code. The code takes 46 cycles. But looking at the code we can see that there is no dependency among each other, so ideally it should have taken only 9 cycles.

    Code:
    [indent][indent]/* 46 cycles. */
    vld1.32 {d16,d17},[r1:128];
    vmla.f32 d0,d15,d14;
    vld1.32 {d18,d19},[r1:128];
    vmla.f32 d1,d15,d14;
    vld1.32 {d20,d21},[r1:128];
    vmla.f32 d2,d15,d14;
    vld1.32 {d22,d23},[r1:128];
    vmla.f32 d3,d15,d14;
    vld1.32 {d24,d25},[r1:128];
    vmla.f32 d4,d15,d14;
    vld1.32 {d26,d27},[r1:128];
    vmla.f32 d5,d15,d14;
    vld1.32 {d28,d29},[r1:128];
    vmla.f32 d6,d15,d14;
    vld1.32 {d30,d31},[r1:128];
    vmla.f32 d7,d15,d14;
    vld1.32 {d12,d13},[r1:128];
    vmla.f32 d8,d15,d14;

    [/indent][/indent]However, if I seperate the vmla and vld then the behavior is as expected, i.e the following codes take 9 and 11 cycles respectively.

    [indent][indent]/*  9 cycles. */
    vmla.f32 d0,d15,d14;
    vmla.f32 d1,d15,d14;
    vmla.f32 d2,d15,d14;
    vmla.f32 d3,d15,d14;
    vmla.f32 d4,d15,d14;
    vmla.f32 d5,d15,d14;
    vmla.f32 d6,d15,d14;
    vmla.f32 d7,d15,d14;
    vmla.f32 d8,d15,d14;

    /* 11 cycles. */
    vld1.32 {d16,d17},[r1:128];
    vld1.32 {d18,d19},[r1:128];
    vld1.32 {d20,d21},[r1:128];
    vld1.32 {d22,d23},[r1:128];
    vld1.32 {d24,d25},[r1:128];
    vld1.32 {d26,d27},[r1:128];
    vld1.32 {d28,d29},[r1:128];
    vld1.32 {d30,d31},[r1:128];
    vld1.32 {d12,d13},[r1:128];

    [/indent][/indent]Can some one please let me know whether I am missing something here or my understanding is wrong.

    Thanks,
    Anil M S
  • Note: This was originally posted on 9th August 2011 at http://forums.arm.com


    What is your test procedure?
    You have made a loop executed 1000 times (for example) and you have found 46.000 cycles for the first example
    and (11 + 9) * 1000 = 20.000 cycles for the second?


    Hi Etienne,
    That's true. I have a loop executed 1000 times and I am getting 46,000 cycles for the first example and 21 for the second example.
    I have given the whole function for your reference. r0 has the loop count and r1 the input buffer pointer.

    First example:

    [indent][indent].text;
    .align 4;
    .global vmlaq_vld_f32_interleaved;
    .type vmlaq_vld_f32_interleaved,%function;

    vmlaq_vld_f32_interleaved:

    core_loop_beg6:
    vld1.32 {d16,d17},[r1:128];
    vmla.f32 d0,d15,d14;
    vld1.32 {d18,d19},[r1:128];
    vmla.f32 d1,d15,d14;
    vld1.32 {d20,d21},[r1:128];
    vmla.f32 d2,d15,d14;
    vld1.32 {d22,d23},[r1:128];
    vmla.f32 d3,d15,d14;
    vld1.32 {d24,d25},[r1:128];
    vmla.f32 d4,d15,d14;
    vld1.32 {d26,d27},[r1:128];
    vmla.f32 d5,d15,d14;
    vld1.32 {d28,d29},[r1:128];
    vmla.f32 d6,d15,d14;
    vld1.32 {d30,d31},[r1:128];
    vmla.f32 d7,d15,d14;
    vld1.32 {d10,d11},[r1:128];
    vmla.f32 d8,d15,d14;
    subs r0,r0,#1;
    bgt core_loop_beg6;
    core_loop_end6:
      BX   lr;
    [/indent][/indent]
    Second example.
    [indent][indent].text;
    .align 4;
    .global vld1_aligned;
    .type vld1_aligned,%function;

    vld1_aligned:

    core_loop_beg:

    vmla.f32 d0,d15,d14;
    vmla.f32 d1,d15,d14;
    vmla.f32 d2,d15,d14;
    vmla.f32 d3,d15,d14;
    vmla.f32 d4,d15,d14;
    vmla.f32 d5,d15,d14;
    vmla.f32 d6,d15,d14;
    vmla.f32 d7,d15,d14;
    vmla.f32 d8,d15,d14;

    vld1.32 {d16,d17},[r1:128];
    vld1.32 {d18,d19},[r1:128];
    vld1.32 {d20,d21},[r1:128];
    vld1.32 {d22,d23},[r1:128];
    vld1.32 {d24,d25},[r1:128];
    vld1.32 {d26,d27},[r1:128];
    vld1.32 {d28,d29},[r1:128];
    vld1.32 {d30,d31},[r1:128];
    vld1.32 {d12,d13},[r1:128];

    subs r0,r0,#1;
    bgt core_loop_beg;
    core_loop_end:
      BX   lr;
    [/indent][/indent]Regards,
    Anil M S
  • Note: This was originally posted on 6th April 2011 at http://forums.arm.com

    hello,everyone.
  • Note: This was originally posted on 12th April 2011 at http://forums.arm.com

    Yes, I already understand that))))))))))))))))))))))))))
  • Note: This was originally posted on 11th April 2011 at http://forums.arm.com

    I have made tests, and get very strange results.

    the test is folowing

    I have some cycle and nop intsructions in it.

    first I have only one nop in cycle, it's takes  0.570 s on my board
    then I add second nop instruction, and it's also takes 0.57 s. and it's true, becaues they executed  in thy same cycle(pipeline 0 and pipeline 1)
    then I continue to add nop instructions.
    So when I got to 5-6 nop pair, I realize that they can't execute parraler
    and after  every nop instruction that I add increase the time of execution

    So why after 5 nop, they don't executed parraler, IS there any idea?

    here is the test file)
More questions in this forum
There are no posts to show. This could be because there are no posts in this forum or due to a filter.