
Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, sorry for my bad English.

I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16),
but I have no idea how to do this using that documentation.
  • Note: This was originally posted on 17th March 2011 at http://forums.arm.com

    Cool ;)

    You want to understand how to read the documentation and how the E stages work?
    I asked the same question a few months ago!

    This is a good opportunity for me to thank isogen74, sim, mayesta and other people for their help.

    The post that really helped me understand the pipeline stages is this one:
    http://forums.arm.co...8-out-of-order/

    I hope it will help you too.

    Etienne
  • Note: This was originally posted on 12th April 2011 at http://forums.arm.com

    Hum.

    I think it's strange, but it's possible!
    It's clear that a NOP is replaced by a MOV r0, r0.

    But on my BeagleBoard-xM it seems that I can execute 2 NOPs in the same cycle.
    Maybe you have another revision of the Cortex-A8.

    That could be an answer.

    The documentation says:

    Assembling the NOP mnemonic as UAL will not change the functionality of the code, but will change:
    • the instruction encoding selected
    • the architecture variants on which the resulting binary will execute successfully, because the NOP instruction was introduced in ARMv6K and ARMv6T2.


    I'm not sure I really understand what it means.

    I have no other explanation for the difference between our results.

    Can you confirm that with 10 NOPs your program does not take 1 second?
  • Note: This was originally posted on 12th April 2011 at http://forums.arm.com

    Interesting !!!

    So!!! Don't use NOP anymore ;)
  • Note: This was originally posted on 12th April 2011 at http://forums.arm.com


    mmmm

    10 NOP instructions on my Beagle take 2.508 s


    Does your BeagleBoard run at 500 MHz?
  • Note: This was originally posted on 31st March 2011 at http://forums.arm.com


    anytime Etienne! though I quit ARMing :(


    What a strange idea ;)
    See you soon !!!
  • Note: This was originally posted on 11th April 2011 at http://forums.arm.com

    The problem is due to your branch.

    You can't simply expect that, if the branch is in the cache, it will take 1 cycle...
    A branch is more complex than it seems.

    Try this test procedure. It will make it easier to understand the time taken by your program.


    movw r0, #0x0500                   @ r0 = 0x04F60500: the loop repeats 83232000 times
    movt r0, #0x04F6
    .loop:
    nop                                @ your NOPs go here
    nop
    nop
    nop
    nop
    nop
    nop
    nop

    smuad   r1, r1, r1                 @ the code from here to bgt is known to take 5 cycles
    nop
    nop
    smuad   r2, r2, r2
    nop
    subs   r0, r0, #1                  @ decrement the loop counter and set the flags
    smuad   r3, r3, r3
    bgt   .loop                        @ loop while r0 > 0

    bx lr


    If you don't put in any NOPs (I mean your NOPs! Don't remove the NOPs after the smuad instructions),
    the program should take 0.52 s. This is logical, because your BeagleBoard runs at 800 MHz and 5 * 83232000 ~= 416M cycles.
    Every time you add 2 NOPs, your program will take ~0.10 s longer.

    You would get more readable results if you repeated your loop 80,000,000 times.
    In that case use


    movw r0, #0xB400
    movt r0, #0x04C4


    instead of

    movw r0, #0x0500
    movt r0, #0x04F6
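
The arithmetic in this post can be checked with a minimal Python sketch. The function names are made up for illustration; the 800 MHz clock and the 5-cycle loop tail are the figures quoted above, and the idea that a pair of NOPs costs exactly one extra cycle is the post's own observation rather than something this sketch proves.

```python
def movw_movt(low16, high16):
    """Value left in a register by `movw rX, #low16` followed by `movt rX, #high16`."""
    return ((high16 & 0xFFFF) << 16) | (low16 & 0xFFFF)


def loop_seconds(iterations, cycles_per_iteration, clock_hz):
    """Expected run time of a loop, ignoring cache misses and branch mispredictions."""
    return iterations * cycles_per_iteration / clock_hz


CLOCK_HZ = 800_000_000  # clock frequency quoted in the post
TAIL_CYCLES = 5         # cycles of the smuad/subs/bgt tail, as stated in the post

iterations = movw_movt(0x0500, 0x04F6)
print(iterations)                                         # loop count loaded into r0
print(loop_seconds(iterations, TAIL_CYCLES, CLOCK_HZ))    # run time with no extra NOPs
print(loop_seconds(iterations, 1, CLOCK_HZ))              # cost of one extra cycle per iteration
print(movw_movt(0xB400, 0x04C4))                          # the "round" loop count
```

With these inputs the loop count comes out as 83,232,000, the base run time as about 0.52 s, and each additional cycle per iteration as about 0.10 s, which matches the figures in the post. The second constant pair gives exactly 80,000,000 iterations.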
  • Note: This was originally posted on 18th March 2011 at http://forums.arm.com

    In a pipelined processor there is a difference between
    - the number of cycles needed to execute an instruction (from beginning to end)
    - the number of cycles during which the pipeline is blocked.

    The mul takes 6 cycles to execute once the instruction enters the pipeline,
    but the pipeline is blocked for only 2 cycles.

    When you execute this code

    mul r0, r1, r2
    mul r3, r4, r5


    During cycles 2, 3, 5, 6 the ARM will execute both muls.


    Explaining this with mul is not the best choice, because the mul blocks the pipeline for 2 cycles!
    What you need to understand is that the ARM can, most of the time, start a new instruction every cycle, but that instruction can take more than one cycle to execute.
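
The difference between "cycles to execute" and "cycles the pipeline is blocked" can be sketched as a tiny Python model. This is a deliberately simplified assumption: independent instructions of one kind, a fixed issue interval and a fixed latency, with no dual issue and no dependencies. It is not a description of the real Cortex-A8 scheduler and does not try to reproduce the exact per-cycle overlap listed above. The names are illustrative only.

```python
def total_cycles(count, issue_interval, latency):
    """Cycles needed to finish `count` independent instructions of the same kind.

    issue_interval: cycles the pipeline is blocked before the next one can start
    latency:        cycles from the start of an instruction to its result
    """
    if count == 0:
        return 0
    # The last instruction starts after (count - 1) issue intervals,
    # then needs its full latency to complete.
    return (count - 1) * issue_interval + latency


MUL_ISSUE = 2    # "the pipeline is blocked for only 2 cycles" (from the post)
MUL_LATENCY = 6  # "the mul takes 6 cycles to execute" (from the post)

print(total_cycles(2, MUL_ISSUE, MUL_LATENCY))  # two overlapped muls
print(2 * MUL_LATENCY)                          # the same two muls with no overlap at all
```

Under these assumptions two back-to-back muls finish in 8 cycles instead of 12, which is the point of the post: the second instruction starts long before the first one has finished.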
  • Note: This was originally posted on 4th May 2011 at http://forums.arm.com


    However, in cycle 2, it has a nop instruction. Why does the "nop" occur?



    You can forget the NOP. I add a NOP to the submitted code in order to know exactly the time taken by the last instruction.


    About NEON instructions: how can I know whether 2 NEON instructions can be dual-issued? Do they follow the rule you mentioned?


    Hum. As far as I know, you can trust the cycle counter.
    But it can make mistakes (I have sometimes said wrong things in the past).
    For NEON dual issue, I applied these rules:
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/BABHBCCB.html

    Afterwards I run some tests, and when I don't agree with the given information I change the rules.
    When I can't find an explanation for a technical part, I try to find a model that seems to work. Sometimes I'm wrong.

    For the cycle counter, the results are starting to be quite correct (except for VFP instructions).
    Whenever possible, I check that the cycle counter's result is correct by real testing.

    That's all I can tell you!
  • Note: This was originally posted on 5th May 2011 at http://forums.arm.com

    I read in the specs that NEON load/store instructions can be dual-issued with SIMD data-processing instructions. So I tried the code below on your website:
    vld1.32 {d0}, [r0]
    vadd d1, d2, d3



    Hum.
    That's exactly a case where, for the moment, I don't agree with the documentation.
    Read this post:
    http://pulsar.websha...bowels-of-neon/

    I've run new tests!!!
    For me, the cycle counts in the documentation are "quite" correct, but you can't have dual issue if the load/store instruction takes more than 1 cycle.

    I'm not sure about that yet; I'm currently running additional tests...

    The current cycle counter version does not handle dual-issued instructions with memory access.
    It's written here :)
    http://pulsar.webshaker.net/2011/04/12/program-to-count-the-cycles-of-the-a8-cortex-v0-6/
    or here (in French)
    http://pulsar.webshaker.net/2011/04/12/programme-pour-compter-les-cycles-du-cortex-a8-v0-6/

    ...but the next version does. The new version should be online in a few days!
  • Note: This was originally posted on 28th April 2011 at http://forums.arm.com


    Is it right that if the next instruction uses Rd as an operand, it has to wait until after cycle #16 to start execution? If so, I think it is wasteful, because if there is no dependency, the next instruction may start execution at cycle #13 or #14.

    Is my thought right?

    Dung!


    Yes that's it... :)


    About branches:
    I don't know anything about the first stages of the ARM pipeline.
    I don't know what you want to do.
    But I think there is no way to know, just from the source code, whether a (conditional) branch will be mispredicted or not.

    I assume that a B instruction is always correctly predicted.
    For a conditional branch it's a lottery.

    If you find information somewhere about how branches are predicted, I'm very interested.

    I remember having tried something like this one day:


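       mov r0, #1
       mov r10, #10000                 @ loop counter

    .loop:
       nop
       nop
       rsbs r0, r0, #1                 @ r0 = 1 - r0: toggles between 0 and 1, Z set on every other pass
       beq .else                       @ so this branch alternates taken / not taken


       subs r10, r10, #1
       beq .exit
       nop
       nop
       b .loop

    .else:

       subs r10, r10, #1
       beq .exit

       nop
       nop
       b .loop

    .exit: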


    I thought the branches to .else would always be mispredicted, but that was not the case.
    It would be very useful to know the prediction algorithm (but I assume it must be quite secret ;) )!!!
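
One possible reason an alternating branch is not always mispredicted can be illustrated with a toy Python simulation. This is only a sketch of two textbook predictors (a single 2-bit saturating counter, and a table of such counters indexed by recent branch history). It makes no claim about what the Cortex-A8 actually implements; the function names, table sizes and starting states are all assumptions.

```python
def alternating_outcomes(n):
    """Outcomes of `beq .else` in the test loop: taken, not taken, taken, ..."""
    return [i % 2 == 0 for i in range(n)]


def step(counter, taken):
    """Update a 2-bit saturating counter (0..3); values >= 2 predict 'taken'."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


def misses_single_counter(outcomes, counter=1):
    """Mispredictions with one 2-bit counter and no history."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = step(counter, taken)
    return misses


def misses_with_history(outcomes, history_bits=1):
    """Mispredictions with one 2-bit counter per recent-history pattern."""
    table = [1] * (1 << history_bits)
    mask = (1 << history_bits) - 1
    history = 0
    misses = 0
    for taken in outcomes:
        if (table[history] >= 2) != taken:
            misses += 1
        table[history] = step(table[history], taken)
        history = ((history << 1) | int(taken)) & mask
    return misses


outcomes = alternating_outcomes(10000)
print(misses_single_counter(outcomes))  # history-free counter
print(misses_with_history(outcomes))    # history-indexed counters
```

In this model the history-free counter started in the weakly-not-taken state mispredicts every single branch, while the history-indexed version mispredicts only the first one, because "last time taken" reliably implies "this time not taken". A predictor that tracks any branch history would therefore handle the alternating test well, which would be consistent with the observation above.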
  • Note: This was originally posted on 28th April 2011 at http://forums.arm.com


    My purpose, I think, is just simple. I want to develop a tool to count the number of cycles to execute a short source code.
    I don't have a board or a Cortex-A8, I am just a man of theory :((



    This is quite a hard job to do if you don't have hardware to check with!!!

    Buy a BeagleBoard... http://www.watterott.com/en/BeagleBoard-xM
    It's not very expensive!!!

  • Note: This was originally posted on 15th April 2011 at http://forums.arm.com


    Try setting up the timing function inside your program binary and measure a relatively large block of instructions so that the measurement overhead is small relative to the measurement.


    You're right, isogen, but with a loop repeated 80,000,000 times
    there is no real problem counting cycles, even with the time command!
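
Going from a wall-clock measurement back to cycles per iteration is simple arithmetic, sketched here in Python. The function names are illustrative, and the 10 ms overhead in the example is a purely hypothetical figure chosen to show the effect, not a measurement from this thread.

```python
def cycles_per_iteration(seconds, clock_hz, iterations):
    """Average cycles spent per loop iteration for a measured run time."""
    return seconds * clock_hz / iterations


def overhead_in_cycles(overhead_seconds, clock_hz, iterations):
    """Error per iteration caused by a fixed measurement overhead."""
    return overhead_seconds * clock_hz / iterations


CLOCK_HZ = 800_000_000   # clock frequency quoted earlier in the thread
ITERATIONS = 80_000_000  # loop count suggested earlier in the thread

# A run measured at 1.0 s corresponds to 10 cycles per iteration.
print(cycles_per_iteration(1.0, CLOCK_HZ, ITERATIONS))

# A hypothetical 10 ms of process start-up overhead shifts the result
# by only a tenth of a cycle per iteration.
print(overhead_in_cycles(0.010, CLOCK_HZ, ITERATIONS))
```

With that many iterations, a fixed overhead of a few milliseconds changes the per-iteration figure by well under one cycle, which is why the time command is good enough here.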


  • Note: This was originally posted on 27th April 2011 at http://forums.arm.com


    I used the above link to check the cycles of some ARM instructions. However, I am confused about the pipeline column.
    For example, there are "n0, n1, 0, 1" that happen in 1 cycle. They seem to be stages of the pipeline. However, the Cortex-A8 has a 13-stage pipeline and no stage has names like these. Also, 1 stage takes 1 cycle, right?

    Please give me some explanations.


    Hum. You're starting with very complex questions!!!



    First, I do not understand what you say about "static scheduling scoreboard, replay and pending queue".
    And I do not really understand what ARM calls a "data hazard" ;(

    What I can say is that if you apply the stage rules described in the ARM documentation to count cycles, you'll get a "quite" correct result.

    Beyond that, there are a lot of special cases (not always documented) that can improve the quality of the counting process:
    shortcuts (or fast forwarding), for example.



    Branch mispredict penalty: you can't handle this kind of stall cycle, because you can't know when the ARM will mispredict a branch. It's the same problem with memory reads that miss the cache!
    So you can just assume that in most cases you don't have those stall cycles, and ignore them.



    As for 0 / 1 / n0 / n1: these are not stages of the pipeline.
    They are the names of the 2 ARM pipelines (0 and 1) and the 2 NEON pipelines (n0 / n1).

    The Cortex "can start" 4 instructions in the same cycle.
    Don't believe you'll be able to execute 4 instructions every cycle! That's wrong!
    But in some cases, in some cycles, the Cortex can start 4 instructions (2 ARM and 2 NEON) in the same cycle.

    Note: I'm not talking about VFP, because VFP and NEON interactions are another problem!



    About the cycle counter:
    I do not handle the 13 pipeline stages. I handle instructions when they enter a functional unit.
    The cycle counter is not so complex (in fact the decode steps are not useful for counting cycles, I guess).

    All that stuff is not very easy to understand.
    To start, forget NEON and its 2 pipelines (n0 and n1).
    Do some tests if you have a Cortex.

    Etienne

  • Note: This was originally posted on 16th May 2011 at http://forums.arm.com

    It's all described here:
    http://pulsar.webshaker.net/2011/05/15/program-to-count-the-cycles-of-the-a8-cortex-v0-7/

    a.1-0 1c

    means:
    a: it's an ARM instruction (as opposed to a NEON or VFP instruction)
    1: running cycle
    0: pipeline 0
    1c: the instruction takes 1 cycle to execute.
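
The fields of that notation can be pulled apart with a short Python sketch. Only the single example `a.1-0 1c` comes from this thread; the regular expression, the field names and the assumption that other entries follow the same `kind.cycle-pipeline Nc` shape are illustrative guesses, not the tool's documented output format.

```python
import re

# kind "." running-cycle "-" pipeline, whitespace, duration followed by "c"
SLOT_PATTERN = re.compile(
    r"^(?P<kind>[a-z]+)\.(?P<cycle>\d+)-(?P<pipe>\w+)\s+(?P<duration>\d+)c$"
)


def parse_slot(text):
    """Split a cycle-counter entry such as 'a.1-0 1c' into its fields."""
    match = SLOT_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {text!r}")
    return {
        "kind": match["kind"],              # 'a' = ARM instruction
        "cycle": int(match["cycle"]),       # running cycle
        "pipeline": match["pipe"],          # kept as text, e.g. '0'
        "duration": int(match["duration"]), # cycles the instruction takes
    }


print(parse_slot("a.1-0 1c"))
```

The pipeline is kept as a string on purpose, so that names such as n0 and n1 mentioned earlier in the thread would not need a different field type if they appear in the same position.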
  • Note: This was originally posted on 25th May 2011 at http://forums.arm.com


    Dear Webshaker,
    I am thinking about how to test the cycle count module for the Cortex-A8.
    I think that, for each instruction, I have to combine it with every other instruction to see how they work together.
    However, I have a problem. Because the number of ARM instructions is so big, the number of test cases is big too.

    Do you have another idea for testing?


    I'll write a post explaining how the cycle counter works and how you can write your own cycle counter in a few days (weeks)...
    That will be much simpler than trying to explain part by part how the program works!!!

    But your solution is not a good one... too much work!!!

    Etienne