Cortex-A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi! Sorry for my bad English.

I need to work out the latency between two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16).
But I have no idea how to do this using that documentation.
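A minimal sketch of the usual reasoning, written in C for illustration. It assumes a simplified in-order model with full result forwarding: if the first instruction's result becomes available at the end of pipeline stage R, and the second instruction needs that register at the start of stage S, the second can issue no sooner than R - S + 1 cycles after the first. R and S are whatever chapter 16 lists for the two instructions. The function names and the stage numbers in the example are hypothetical, and the model only makes sense when both numbers come from the same pipeline's stage numbering.

```c
#include <assert.h>

/* Simplified in-order model (an assumption, not wording from the TRM):
 * the producer's result is forwarded at the end of stage
 * producer_result_stage, and the consumer needs that register at the
 * start of stage consumer_source_stage. Both stage numbers must come from
 * the same pipeline's numbering. Returns the minimum number of cycles
 * between the two issue slots; 1 means back-to-back issue. */
static int min_issue_distance(int producer_result_stage,
                              int consumer_source_stage)
{
    assert(producer_result_stage > 0 && consumer_source_stage > 0);
    int distance = producer_result_stage - consumer_source_stage + 1;
    /* In this model a dependent instruction never issues in the same
     * cycle as its producer. */
    return distance > 1 ? distance : 1;
}

/* Cycles lost if the consumer is placed directly after the producer with
 * no independent instructions in between to hide the latency. */
static int stall_cycles(int producer_result_stage, int consumer_source_stage)
{
    return min_issue_distance(producer_result_stage, consumer_source_stage) - 1;
}
```

For example, with hypothetical stages R = 6 and S = 2, `min_issue_distance(6, 2)` returns 5, so placing the dependent instruction directly after the producer costs `stall_cycles(6, 2)` = 4 cycles. Independent instructions scheduled in between can hide some or all of that stall.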
  • Note: This was originally posted on 8th August 2011 at http://forums.arm.com


    Hi all,
    I am doing some profiling analysis on a Cortex-A8 processor using a BeagleBoard-xM, and I found some strange behavior with the following piece of code. It takes 46 cycles, but none of its instructions depends on another, so ideally it should take only 9 cycles.

    Code:
    /* 46 cycles. */
    vld1.32 {d16,d17},[r1:128];
    vmla.f32 d0,d15,d14;
    vld1.32 {d18,d19},[r1:128];
    vmla.f32 d1,d15,d14;
    vld1.32 {d20,d21},[r1:128];
    vmla.f32 d2,d15,d14;
    vld1.32 {d22,d23},[r1:128];
    vmla.f32 d3,d15,d14;
    vld1.32 {d24,d25},[r1:128];
    vmla.f32 d4,d15,d14;
    vld1.32 {d26,d27},[r1:128];
    vmla.f32 d5,d15,d14;
    vld1.32 {d28,d29},[r1:128];
    vmla.f32 d6,d15,d14;
    vld1.32 {d30,d31},[r1:128];
    vmla.f32 d7,d15,d14;
    vld1.32 {d12,d13},[r1:128];
    vmla.f32 d8,d15,d14;

    However, if I separate the vmla and vld1 instructions, the behavior is as expected: the following blocks take 9 and 11 cycles respectively.

    /* 9 cycles. */
    vmla.f32 d0,d15,d14;
    vmla.f32 d1,d15,d14;
    vmla.f32 d2,d15,d14;
    vmla.f32 d3,d15,d14;
    vmla.f32 d4,d15,d14;
    vmla.f32 d5,d15,d14;
    vmla.f32 d6,d15,d14;
    vmla.f32 d7,d15,d14;
    vmla.f32 d8,d15,d14;

    /* 11 cycles. */
    vld1.32 {d16,d17},[r1:128];
    vld1.32 {d18,d19},[r1:128];
    vld1.32 {d20,d21},[r1:128];
    vld1.32 {d22,d23},[r1:128];
    vld1.32 {d24,d25},[r1:128];
    vld1.32 {d26,d27},[r1:128];
    vld1.32 {d28,d29},[r1:128];
    vld1.32 {d30,d31},[r1:128];
    vld1.32 {d12,d13},[r1:128];

    Can someone please tell me whether I am missing something here or whether my understanding is wrong?

    Thanks,
    Anil M S


    What is your test procedure?
    Did you run a loop 1000 times (for example) and measure 46,000 cycles for the first example
    and (11 + 9) * 1000 = 20,000 cycles for the second?
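The arithmetic behind this question can be sketched in C. This is a minimal, hypothetical harness helper, not the poster's actual procedure. It assumes the ARMv7 PMU cycle counter (PMCCNTR) has already been enabled and made readable from user mode (for example by a kernel module), and it subtracts the cost of an otherwise identical empty loop so that loop overhead is not charged to the code under test. The function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Read the ARMv7 PMU cycle counter (PMCCNTR). Assumptions: the kernel has
 * enabled the counter and granted user-mode access via PMUSERENR (without
 * that, the MRC raises an undefined-instruction fault, SIGILL on Linux),
 * and the PMCR.D divide-by-64 option is off. Returns 0 on other targets. */
static inline uint32_t read_cycle_counter(void)
{
#if defined(__arm__) && defined(__ARM_ARCH_7A__)
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
#else
    return 0; /* placeholder: no ARMv7 cycle counter on this target */
#endif
}

/* Average cycles per loop pass after subtracting the cycles measured for
 * an otherwise identical empty loop, rounded to the nearest cycle. */
static uint32_t cycles_per_iteration(uint32_t start, uint32_t end,
                                     uint32_t empty_loop_cycles,
                                     uint32_t iterations)
{
    assert(iterations > 0);
    uint32_t elapsed = end - start; /* unsigned subtraction survives one wrap */
    uint32_t body = elapsed > empty_loop_cycles ? elapsed - empty_loop_cycles : 0;
    return (body + iterations / 2) / iterations;
}
```

With the figures from the question, `cycles_per_iteration(0, 46000, 0, 1000)` gives 46 cycles per pass, while 20,000 cycles over 1000 passes gives 20. Using unsigned 32-bit subtraction keeps the elapsed count correct if the counter wraps once during the measurement.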