Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, sorry for my bad English.

I need to calculate the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16).
But I have no idea how to do this using that documentation.
  • Note: This was originally posted on 12th April 2011 at http://forums.arm.com

    I tried your test. Every time I add a nop, the time increases, so two nop instructions can't be executed as a pair.
    I analyzed it with gdb: gcc generates the wrong opcode for nop. It emits mov r0, r0, so a dependency appears between the two nops.

    But in ARMv7, nop has its own opcode, 0xE320F000, so I edited the binary file and replaced the opcode that gcc generates with this one, and the two nops executed in parallel.
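
    The binary edit described here can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used in the post: it assumes a raw, little-endian, ARM-state code section, and the helper name `patch_nops` and the sample bytes are made up. It blindly rewrites every aligned word equal to the encoding of mov r0, r0 (0xE1A00000), so it would also rewrite data words that happen to have that value.

```python
import struct

MOV_R0_R0 = 0xE1A00000  # encoding of "mov r0, r0", the old-style no-op
NOP_ARMV7 = 0xE320F000  # dedicated ARMv7 NOP encoding quoted in the post


def patch_nops(code: bytes) -> bytes:
    """Replace every aligned 'mov r0, r0' word with the ARMv7 NOP."""
    if len(code) % 4:
        raise ValueError("ARM-state code must be a multiple of 4 bytes")
    count = len(code) // 4
    words = struct.unpack("<%dI" % count, code)
    patched = [NOP_ARMV7 if w == MOV_R0_R0 else w for w in words]
    return struct.pack("<%dI" % count, *patched)


# Hypothetical fragment: mov r0, r0 ; mov r0, r0 ; bx lr
sample = struct.pack("<3I", MOV_R0_R0, MOV_R0_R0, 0xE12FFF1E)
patched = patch_nops(sample)
```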

    So on your site two nops go in parallel, but gcc doesn't think so.

    What do you think about this?
  • Note: This was originally posted on 18th March 2011 at http://forums.arm.com

    Thanks a lot for the explanation.

    But I still have one question: what do you mean when you say cycle?

    I don't understand. For example, the mul instruction takes 2 cycles, so how can it block Rd for 6 cycles?
    Do you mean that in this case

    mul r5, r1, r2
    mov r3, r5


    mov will wait 6 cycles for r5?
    And these two instructions together will take 7 cycles?
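
    The difference between issue cycles and result latency in this question can be worked through with a small calculation. It is a sketch under stated assumptions, not a statement about the hardware: the stage numbers (result in E5, source required in E1) follow the reading in this post and have not been checked against the TRM tables, and the function name `earliest_start` is invented for the example.

```python
def earliest_start(prod_start, prod_issue_cycles, result_stage, src_stage):
    """Earliest issue cycle of a consumer that reads the producer's result.

    Convention (an assumption): stage E<k> of an operation issued at cycle c
    runs during cycle c + k - 1, a result produced in E<k> is usable from
    cycle c + k, and a source required in E<k> is read at cycle c + k - 1.
    """
    last_issue = prod_start + prod_issue_cycles - 1
    usable_from = last_issue + result_stage
    data_ready_start = usable_from - (src_stage - 1)
    pipeline_free = prod_start + prod_issue_cycles
    return max(pipeline_free, data_ready_start)


# mul r5, r1, r2 issued at cycle 0 (2 issue cycles, result assumed in E5),
# followed by mov r3, r5 (r5 assumed to be required in E1).
mov_start = earliest_start(prod_start=0, prod_issue_cycles=2,
                           result_stage=5, src_stage=1)
print(mov_start)
```

    With these placeholder values the function returns 6: the mul occupies the issue slot for only 2 cycles, but its result is not usable until later, so the dependent mov waits. Issue cycles and result latency are two separate figures.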
  • Note: This was originally posted on 18th March 2011 at http://forums.arm.com

    Thanks a lot!
    You helped me very much.
  • Note: This was originally posted on 21st March 2011 at http://forums.arm.com

    a
  • Note: This was originally posted on 2nd August 2011 at http://forums.arm.com

    Hi Etienne

    I have checked some floating-point instructions, listed below:
    VADD, VSUB, VABD, VMUL, VCEQ, VCGE, VCGT, VCAGE, VCAGT, VMAX, VMIN

    In the specs, they all require their source registers at the N2 stage.
    However, in your database (Excel file), they require their source registers at the N1 stage.

    Why is there this difference?
  • Note: This was originally posted on 29th June 2011 at http://forums.arm.com


    n + 4

    Thank you very much for your information.
    I tested the two instructions below:
    smlal r0, r1, r3, r4
    smlal r0, r1, r3, r4
    smlal takes 3 cycles, and the destination register is available in E5. So the first instruction releases r0 and r1 at cycle 3 + 5 = 8.
    Because the second instruction requires r1 in E1, it must start after cycle 8.
    However, the counter module says it starts at cycle 6 (http://pulsar.webshaker.net/ccc/sample-4830f428)
    Please explain this case to me.
  • Note: This was originally posted on 8th July 2011 at http://forums.arm.com

    Hi Etienne. Have a good day

    So far, I found some instructions that your cycle count module can't analyze. I don't know why.
    Please check and give me some explanations:

    vbic.i16 d0, #1 ; 0x0001
    vbic.i32 q2, #1 ; 0x00000001

    vmov.i16 q0, #1 ; 0x0001
    vmov.i16 d0, #1 ; 0x0001

    vmvn.i16 q1, #1 ; 0x0001
    vmvn.i16 d1, #1 ; 0x0001

    vorr.i16 q0, #1 ; 0x0001
    vorr.i16 d0, #1 ; 0x0001

    Dung
  • Note: This was originally posted on 12th July 2011 at http://forums.arm.com


    It was missing a callback that checks the validity of the immediate value.
    I've added it now.
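
    The kind of check such a callback might perform can be sketched as follows. This is a guess at the idea, not the cycle counter's actual code: it covers only VBIC and VORR with an immediate, where the value must be a single byte placed at a byte-aligned position inside a 16-bit or 32-bit element. VMOV and VMVN accept additional immediate forms that this sketch does not handle, and the function name is invented.

```python
def is_valid_vbic_vorr_imm(size_bits: int, imm: int) -> bool:
    """True if imm is one byte at a byte-aligned position of the element."""
    if size_bits not in (16, 32):
        return False  # VBIC/VORR immediates exist only for .i16 and .i32
    if not 0 <= imm < (1 << size_bits):
        return False  # value does not fit in the element
    for shift in range(0, size_bits, 8):
        # Valid if every bit outside this one byte lane is zero.
        if (imm & ~(0xFF << shift)) == 0:
            return True
    return False


print(is_valid_vbic_vorr_imm(16, 0x0001))  # the immediate from "vbic.i16 d0, #1"
```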

    I am impressed by how fast you modified your code.
    I wish I were as good as you.
    I have just found some instructions that your module reports as unrecognized. Please check them:

    vqdmulh.s16 d0, d1, d2[0]

    vqrdmulh.s16 d0, d1, d2

    vqrdmulh.s16 d0, d1, d2[0]

    vqshlu.s32 q1, q2, #1

    vrecpe.u32 d1, d0
    vrecpe.u32 q1, q0

    vrsqrte.u32 d1, d0
    vrsqrte.u32 q1, q0

    vshll.s16 d2, q0, #1
    vshll.u16 d2, q0, #1

    vpmax.s16 d0, d1, d2
    vpmin.s16 d2, d1, d0

    vqdmulh.s16 d0, d1, d2
  • Note: This was originally posted on 23rd June 2011 at http://forums.arm.com

    I have some questions related to this file: http://pulsar.websha...x-A8-cycle.xlsx
    1. Why are some lines darker than others? What does it mean?
        For example, the line that contains "MUL::MLA" is darker.
    2. What does "::" mean? For example, "MUL::MLA" or "MUL::SMLAWB"
  • Note: This was originally posted on 27th June 2011 at http://forums.arm.com

    I have 2 questions related to load-multiple instructions. For example: ldm r1!, {r2, r3}.
    This instruction takes 2 cycles, so it is broken down into 2 single-cycle operations.
    1. Because write-back is enabled, r1 is written in the E2 stage. However, in which single-cycle operation is r1 written? The first or the second?
    2. When are r2 and r3 available to other instructions? I cannot find any scheduling information in the specs.
    I assume this instruction is similar to LDR, which means r2 and r3 are available in E3. Because r3 is written by the second single-cycle operation, it is available in E3 of that second operation.

    If my assumption is correct, the 2 instructions below should produce a 2-cycle penalty:
    ldm r1, {r2, r3}
    ldr r4, [r3, #1]
    However, the result is different in http://pulsar.websha...sult.php?lng=fr
    Please explain this to me.
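
    The assumption made in this post can be written out as arithmetic. The sketch below encodes only that assumption (one register per single-cycle operation, each value available in E3 of its own operation, the base register of the following ldr required in E1). It does not describe how the Cortex-A8 or the cycle counter really treats ldm, and the function names are invented.

```python
def ldm_usable_from(start, regs, result_stage=3):
    """Cycle from which each loaded register can be read.

    Assumes one register per single-cycle operation, in list order, and that
    a value loaded in stage E<result_stage> of an operation issued at cycle c
    is usable from cycle c + result_stage.
    """
    return {reg: start + i + result_stage for i, reg in enumerate(regs)}


def stall_for_base(start, regs, base, src_stage=1):
    """Stall cycles of a following load that needs `base` in E<src_stage>."""
    ready = ldm_usable_from(start, regs)[base]
    no_hazard_start = start + len(regs)  # issue slot free after the ldm
    return max(0, ready - (src_stage - 1) - no_hazard_start)


# ldm r1, {r2, r3} issued at cycle 0, then ldr r4, [r3, #1]
print(ldm_usable_from(0, ["r2", "r3"]))
print(stall_for_base(0, ["r2", "r3"], "r3"))
```

    Under these assumptions the second call returns 2, the penalty expected in the post. A different result from the site would point to a different rule for ldm, not necessarily to an arithmetic mistake.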
  • Note: This was originally posted on 27th April 2011 at http://forums.arm.com


    Hum !!!
    You "just need" that ;)

    I can't give you the source code of the cycle counter, but I can explain how it works.
    There are two parts:
    - the general case
    - the specific cases (register restrictions, shortcuts, ...)

    You are at cycle #10

    1 - Before starting an instruction, the ARM checks that all the registers will be available when the instruction needs them.
    For example:
    you want to execute a MUL Rd, Rm, Rs
    Rm must be available at cycle #11 (#10 + 1, see the MUL cycle table http://infocenter.ar...ch16s02s03.html)
    If at least 1 register is not available, then the ARM does not start the instruction and you have a stall cycle.
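
    The general rule described above (do not start an instruction until every source register will be available when needed, otherwise stall) can be sketched as a tiny in-order, single-issue scheduler. This is not the cycle counter's source code: the `Instr` class, the `schedule` function and all stage numbers in the example are hypothetical, dual issue and the special cases are ignored, and the cycle numbering may be shifted by one relative to the example in the post. Only the differences between cycles matter.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dst: str            # destination register
    srcs: dict          # source register -> stage in which it is required
    result_stage: int   # stage at whose end dst becomes available
    issue_cycles: int = 1


def schedule(program):
    """Return [(name, start_cycle, stall_cycles)] for in-order, single-issue code."""
    usable_from = {}  # register -> first cycle its new value can be read
    cycle = 0
    out = []
    for ins in program:
        start = cycle
        for reg, stage in ins.srcs.items():
            # A source required in stage E<stage> is read at start + stage - 1.
            start = max(start, usable_from.get(reg, 0) - (stage - 1))
        out.append((ins.name, start, start - cycle))
        last_issue = start + ins.issue_cycles - 1
        usable_from[ins.dst] = last_issue + ins.result_stage
        cycle = start + ins.issue_cycles
    return out


# Placeholder timings: a load whose result is ready in E3, then a MUL that
# needs both multiplicands in E1.
program = [
    Instr("ldr r1, [r4]", "r1", {"r4": 1}, result_stage=3),
    Instr("mul r0, r1, r2", "r0", {"r1": 1, "r2": 1}, result_stage=5, issue_cycles=2),
]
for row in schedule(program):
    print(row)
```

    With these placeholder timings the mul is reported as starting at cycle 3 after 2 stall cycles, because r1 is not yet available when it would otherwise start at cycle 1.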


    As far as I know, the Cortex-A8 implements forwarding hardware, a static scheduling scoreboard, replay, and a pending queue. They help to avoid any kind of data hazard between instructions. So is it correct that the kind of latency you described above is not counted?

    In my opinion, to calculate the number of executed cycles, we only have to care about penalty or stall cycles, I mean the branch-taken penalty, the replay penalty, and the branch mispredict penalty. Why didn't you mention the branch penalty in your explanation?

    I am new to ARM, so please forgive any silly misunderstanding.
  • Note: This was originally posted on 27th April 2011 at http://forums.arm.com


    I need something like this
    http://pulsar.websha...sult.php?lng=fr


    I used the link above to check the cycles of some ARM instructions. However, I am confused about the pipeline column.
    For example, there are "no, n1, 0, 1" that happen in 1 cycle. They seem to be pipeline stages. However, the Cortex-A8 has a 13-stage pipeline and none of its stages has a name like these. Also, 1 stage takes 1 cycle, right?

    Please give me some explanations.
  • Note: This was originally posted on 4th May 2011 at http://forums.arm.com


    Do you need to handle them in a cycle counter ???

    Yes, I do.
    I checked the 2 instructions below, which can be dual issued, on your website: http://pulsar.webshaker.net
    mov r1, r2
    mov r3, r4
    However, in cycle 2 there is a nop instruction. Why does the "nop" occur?
    About NEON instructions: how can I know whether 2 NEON instructions can be dual issued? Do they follow the rule you mentioned?
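
    One part of any dual-issue decision is a register hazard check between the two candidate instructions. The sketch below shows only that part, in a deliberately conservative form. It is not the Cortex-A8's real pairing rule, which also depends on the instruction types and on the stages at which registers are needed, and the `can_pair` helper is hypothetical.

```python
def can_pair(first, second):
    """Conservative hazard check; first and second are (dst, srcs) tuples."""
    dst1, _ = first
    dst2, srcs2 = second
    raw = dst1 in srcs2   # second reads the register that first writes
    waw = dst1 == dst2    # both write the same register
    return not (raw or waw)


# mov r1, r2 ; mov r3, r4 -> no shared registers
print(can_pair(("r1", {"r2"}), ("r3", {"r4"})))
# mov r0, r0 ; mov r0, r0 -> the second depends on the first
print(can_pair(("r0", {"r0"}), ("r0", {"r0"})))
```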
  • Note: This was originally posted on 5th May 2011 at http://forums.arm.com

    Thank you very much for sharing your experience. You have helped me a lot.
    I read in the specs that NEON load/store instructions can be dual issued with SIMD data-processing instructions. So I tried the code below on your website:
    vld1.32 {d0}, [r0]
    vadd d1, d2, d3
    mov r1, r2
    However, VADD is in a separate cycle (cycle 4). Am I wrong?
  • Note: This was originally posted on 28th April 2011 at http://forums.arm.com


    Hum. You start with very complex questions!!!

    First, I do not understand what you say about "static scheduling scoreboard, replay and pending queue".
    I also do not really understand what ARM calls a "data hazard" ;(

    What I can say is that if you apply the stage rules described in the ARM documentation to count cycles, you'll get a "quite" correct result.

    Beyond that, there are a lot of special cases (not always documented) that can improve the quality of the counting process,
    shortcuts (or fast forwarding) for example.



    Thank you very much, Etienne. I am sorry for my unclear questions.

    The "static scheduling scoreboard, replay and pending queue" are parts of the Cortex-A8 pipeline. You can refer to this document: here

    Is it true that you ignore shortcuts (forwarding, or bypassing) in your method of counting cycles? If so, an instruction whose operand is produced by the previous instruction may have to wait 1 cycle (or more).


    Branch mispredict penalty: you can't handle this kind of stall cycle because you can't know when the ARM will mispredict a branch. It's the same problem with memory reads that miss the cache!
    So you can just assume that in most cases you don't have those stall cycles, and ignore them.



    I found a description of the mispredict penalty in the Cortex-A8 documentation. It happens when the target address predicted by "program flow prediction" is different from the target address generated in execution (E5). I think we can trap the instructions that cause a mispredict penalty. Please refer to here
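
    If a misprediction rate is known from measurement, the penalty can be folded into a static count as an expected value. The sketch below is just that arithmetic, not a feature of the cycle counter. The function name and every number in the example are hypothetical, and the penalty per mispredict is left as a parameter to be taken from the documentation.

```python
def expected_cycles(static_cycles, branches, mispredict_rate, penalty_cycles):
    """Static cycle count plus the average cost of mispredicted branches."""
    if not 0.0 <= mispredict_rate <= 1.0:
        raise ValueError("mispredict_rate must be between 0 and 1")
    return static_cycles + branches * mispredict_rate * penalty_cycles


# Hypothetical numbers: 1000 statically counted cycles, 100 executed branches,
# 5% of them mispredicted, 13 cycles lost per mispredict (placeholder value).
print(expected_cycles(1000, 100, 0.05, 13))
```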


    The Cortex "can start" 4 instructions in the same cycle.
    Don't believe you'll be able to execute 4 instructions in every cycle! That's wrong!
    But in some cases, in some cycles, the Cortex can start 4 instructions (2 ARM and 2 NEON) in the same cycle.



    Yes, I still can't picture that. In particular, in the IF stage the pipeline fetches 4 instructions in the same cycle. How does it handle the case where one of these 4 instructions is a branch? If you have any related material, please let me know.


    I do not handle the 13 pipeline stages. I handle instructions when they enter a functional unit.
    The cycle counter is not so complex (in fact, the decode steps are not useful for counting cycles, I guess).



    What do you mean by "functional unit"? I agree that the ID step is not useful. But in the IF step, if a branch is taken, it causes a 1-cycle penalty. I think we need to care about this case, right?

    Dung!