Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, and sorry for my bad English.

I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16).
But I have no idea how to do this using that documentation.
  • Note: This was originally posted on 11th August 2011 at http://forums.arm.com


    Okay, I don't have a kind of epilogue at the end of my loop. This greatly changes the problem, and is more similar to my addition of nops. I don't think the NEON part is really taking 10 cycles, it's just that some of those cycles are overlapping with the non-NEON part.


    Hmm.
    In fact, you may be right!
    I tried copying the NEON code into the loop 2, 3 and 4 times.

    Normally, 8 pairs of instructions should take 8 cycles, but they take 10.
    16 pairs take 20.
    24 pairs take 38.
    32 pairs take 48.


    The time increases strangely.

    I then replaced the vld1 with vtrn. The timings are:
    8 pairs take 8 cycles
    16 pairs take 19 cycles
    24 pairs take 32 cycles
    ..
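
To make the trend in these numbers easier to see, here is a small sketch that tabulates the cycle counts reported above. It assumes, as the post does, that the ideal is one cycle per dual-issued pair. The names `vld1_cycles`, `vtrn_cycles` and `summarize` are made up for this illustration.

```python
# Cycle counts reported in the post, keyed by the number of instruction pairs.
vld1_cycles = {8: 10, 16: 20, 24: 38, 32: 48}
vtrn_cycles = {8: 8, 16: 19, 24: 32}


def summarize(measured):
    """Return (pairs, cycles, cycles_per_pair, extra_cycles) rows.

    Assumption: the ideal is one cycle per pair, so the extra cycles
    are simply the measured cycles minus the number of pairs.
    """
    rows = []
    for pairs, cycles in sorted(measured.items()):
        rows.append((pairs, cycles, cycles / pairs, cycles - pairs))
    return rows


for name, data in (("vld1", vld1_cycles), ("vtrn", vtrn_cycles)):
    print(name)
    for pairs, cycles, per_pair, extra in summarize(data):
        print(f"  {pairs:2d} pairs: {cycles:2d} cycles, "
              f"{per_pair:.2f} per pair, {extra:+d} vs ideal")
```

With vld1 the cost per pair is already above 1 at 8 pairs and grows with the block size. With vtrn it starts at exactly 1 and only degrades once the block gets longer.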

    So now I think you're right. There is a bottleneck due to bandwidth.
    And it's more visible with vld1 than with vtrn because when the vld1 is pushed to the NEON queue, the VALUE of the address register is pushed too.


    This is conventional wisdom, but on NEON I would suggest the exact opposite. Loads make their data available in N1 and many NEON instructions need their data available in N2. In this case it's even possible to load and use the result of the load in the same cycle. If the memory is stalled for some other reason like a cache miss I don't think having the instruction ahead will help you.



    We don't agree :) That can happen ;)


    I'd still like to see some evidence that this causes problems and what exactly.



    Just replace



    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14


    with



    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r1:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r1:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r1:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r1:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r1:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r1:128]
    vmul.f32 d7,d15,d14


    ...and you'll have your proof.

    What you won't have is a logical explanation ;)

    For a moment, I thought that maybe the address register could not be used directly on the next cycle, but if you replace the ADD with a MOV that just duplicates the register value, the problem is the same!