Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, sorry for my bad English.

I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16).
But I have no idea how to do this using that documentation.
  • Note: This was originally posted on 12th August 2011 at http://forums.arm.com


    Hmm.
    In fact, you may be right!
    I tried copying the NEON code into the loop 2, 3 and 4 times.

    Normally, 8 pairs of instructions should take 8 cycles, but they take 10.
    16 pairs take 20
    24 pairs take 38
    32 pairs take 48

    The time increases strangely.

    I replaced the vld1 with vtrn. The timings are:
    8 pairs take 8 cycles
    16 pairs take 19 cycles
    24 pairs take 32 cycles
    ..

    So now I think you're right. There is a bottleneck due to bandwidth.
    And it's more visible with vld1 than with vtrn because when the vld1 is pushed to the NEON queue, the VALUE of the address register is pushed too.
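To make the "strange" growth easier to see, the measurements above can be turned into cycles per pair and into extra cycles over the expected one cycle per pair. This is only a small sketch of that arithmetic in C; the helper names `cycles_per_pair` and `extra_cycles` are made up for illustration and are not part of any ARM tool.

```c
#include <assert.h>

/* Average number of cycles spent per instruction pair
   (one vld1 or vtrn plus one vmul), from a measured total. */
static double cycles_per_pair(int pairs, int cycles)
{
    assert(pairs > 0);
    return (double)cycles / (double)pairs;
}

/* Cycles above the expectation quoted in the post,
   which is one cycle per pair. */
static int extra_cycles(int pairs, int cycles)
{
    return cycles - pairs;
}
```

Feeding in the vld1 figures (8/10, 16/20, 24/38, 32/48) and the vtrn figures (8/8, 16/19, 24/32) shows that the cost per pair is not constant in either series, which is the behaviour the post is pointing at.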




    We don't agree :) That can happen ;)




    Just replace



    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14


    with



    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r1:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r1:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r1:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r1:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r1:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r1:128]
    vmul.f32 d7,d15,d14


    ...and you'll have your proof.

    What you won't have is a logical explanation ;)

    For a moment, I thought that maybe the address register could not be used directly on the next cycle, but if you replace the ADD with a MOV that just duplicates the register value, the problem is the same!


      Hi Etienne,

      I also noticed that the cycle count decreases when different registers are used. In the process I came across one more strange behavior: if I use three buffers of 4096 bytes each, the cycle count increases to around 140 cycles. The code is given here; r1, r2 and r3 hold the addresses of buffer1, buffer2 and buffer3 respectively.

      

    int buffer1[1024];
    int buffer2[1024];
    int buffer3[1024];

     

      ASM Code:

    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r1:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r2:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r3:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r1:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r2:128]
    vmul.f32 d7,d15,d14

      However, if the buffer size is not a multiple of 4096, the cycle count is normal.
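The thread does not establish why the 4096-byte size matters. One thing that can be checked with plain arithmetic is this: if the three buffers happen to be laid out back to back (an assumption, the post does not show the memory layout), each one starts exactly 4096 bytes after the previous one, so r1, r2 and r3 all have the same low 12 address bits. With any size that is not a multiple of 4096 the low bits differ. Whether this is what slows the loop down is not confirmed here. The sketch below only illustrates the address arithmetic; `offset_in_4k` and `buffers_share_offset` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size at which the slowdown was reported in the post. */
#define BLOCK_4K 4096u

/* Offset of an address inside a 4096-byte block (its low 12 bits). */
static unsigned offset_in_4k(uintptr_t addr)
{
    return (unsigned)(addr & (BLOCK_4K - 1u));
}

/* Returns 1 if `count` buffers of `size` bytes, placed back to back
   starting at `base`, all begin at the same offset inside a
   4096-byte block, and 0 otherwise.
   The back-to-back layout is an assumption made for illustration. */
static int buffers_share_offset(uintptr_t base, size_t size, unsigned count)
{
    assert(count > 0);
    unsigned first = offset_in_4k(base);
    for (unsigned i = 1; i < count; i++) {
        if (offset_in_4k(base + i * size) != first)
            return 0;
    }
    return 1;
}
```

For example, calling `buffers_share_offset(base, 4096, 3)` describes the three `int[1024]` buffers from the post, while a size such as 4096 + 64 describes the same buffers with some padding between them. Printing `offset_in_4k` for the real values of r1, r2 and r3 on the target would show whether the assumption about the layout actually holds.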

      

      Regards,

      Anil M S