Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, sorry for my bad English.

I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16),
but I have no idea how to do that using this documentation.
  • Note: This was originally posted on 11th August 2011 at http://forums.arm.com

    I'm using this code because I'm sure that the ARM code at the end takes exactly 5 cycles and leaves 2 bubbles in the pipeline for the branch.
    Remember this post ;) http://pulsar.websha...h-instructions/

    I have a BeagleBoard-xM (DM3730), but I don't believe the processor is the problem. Try the code I gave and tell me if you get 15 cycles (10 for the NEON part and 5 for the ARM part).


    Okay, I don't have a kind of epilogue at the end of my loop. This greatly changes the problem, and is more similar to my addition of nops. I don't think the NEON part is really taking 10 cycles; some of those cycles are just overlapping with the non-NEON part.

    I do not understand point 5 or how you get 9 cycles!!!


    9 cycles with no pairing at all, just 9 fmla.f32 in a row, exactly like what Anil posted. I was just using it to start with, and showing that pairing any instructions added several cycles of stalling for every pair.

    - try to load (if possible) a long time before using the data (there are enough registers to load the data of the next iteration during the previous one).
    - try to write as soon as possible (that is, as soon as the registers are available for VSAVE).


    This is conventional wisdom, but on NEON I would suggest the exact opposite. Loads make their data available in N1 and many NEON instructions need their data available in N2. In this case it's even possible to load and use the result of the load in the same cycle. If the memory is stalled for some other reason like a cache miss I don't think having the instruction ahead will help you.

    Likewise, stores need their data in N1 and normal instructions don't make their results available until at least N3. So you need a few cycles after the operation before storing. That said, software pipelining can still help you avoid this and other latencies.
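
    A rough way to turn the stage numbers above into cycle counts, as a sketch only. It assumes a simplified model in which a source is needed at the start of its stage and a result is ready at the end of its stage; check that convention against the timing tables in chapter 16 before relying on it. The function name min_issue_distance is made up for illustration, and the model ignores dual-issue rules, cache misses and every other source of stalls.

```python
# Simplified model (an assumption, not taken from this thread):
# a source is needed at the START of its stage, a result is ready at the END of its stage.

def min_issue_distance(result_stage: int, source_stage: int) -> int:
    """Minimum number of cycles between issuing a producer and its dependent consumer.

    0 means both could issue in the same cycle, 1 means back to back.
    """
    return max(0, result_stage - source_stage + 1)


# Stage numbers are the ones quoted in the two paragraphs above.
load_then_use = min_issue_distance(result_stage=1, source_stage=2)  # load (N1) -> consumer needing N2
op_then_store = min_issue_distance(result_stage=3, source_stage=1)  # result in N3 -> store needing N1
op_then_op = min_issue_distance(result_stage=3, source_stage=2)     # result in N3 -> consumer needing N2

print(load_then_use, op_then_store, op_then_op)
```

    Under this model a load and its consumer need no separation, while a store has to sit three cycles behind the instruction that produces its data, which is the gap software pipelining would fill with independent work.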

    - and now ;) don't read the same memory block with consecutive VLOADs


    I'd still like to see some evidence that this causes problems and what exactly.