Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, and sorry for my bad English.

I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16),
but I have no idea how to do this using that documentation.
  • Note: This was originally posted on 10th May 2011 at http://forums.arm.com


    To do this check, I intend to create a database like the one below:
    Instruction   Cycle   Rd   Rm   Rn
    MUL           2       E5   E1   E1
    ...
    When my tool reads an instruction, it looks it up in this database to get the cycle at which each register is available. However, it seems to be a lot of work for me at the moment :((
    Do you have any other ideas?
    Can you share your implementation ideas with me?
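
    The lookup table described above could be sketched in Python roughly as follows. This is only an illustration: the names `Timing`, `TIMINGS` and `lookup` are hypothetical, and only the MUL row comes from the example in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    cycles: int  # number of issue cycles the instruction takes
    rd: str      # stage listed for the destination register Rd
    rm: str      # stage listed for the source register Rm
    rn: str      # stage listed for the source register Rn


# Only the MUL row is taken from the post; a real table needs one row
# per instruction form.
TIMINGS = {
    "mul": Timing(cycles=2, rd="E5", rm="E1", rn="E1"),
}


def lookup(mnemonic: str) -> Timing:
    """Return the timing row for a mnemonic, ignoring case."""
    return TIMINGS[mnemonic.lower()]
```

    A tool would call `lookup("MUL")` for each parsed instruction and read the stages from the returned row.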



    I can give you my database, that could help you.
    Download it here.

    http://pulsar.webshaker.net/ccc/cortex-A8-cycle.xlsx
    (it might have errors !!!)


    The algorithm is quite simple in fact (for the generic case). I even explain it in this post.

    You'll have to work a little bit ;)
  • Note: This was originally posted on 10th May 2011 at http://forums.arm.com


    Thank you very much. Your database will help me save a lot of time. If I find any mistake, I will inform you immediately.
    I hope your next version will be online soon.


    Thanks!

    Be careful: there are no LDM and STM instruction rules.
    Those instructions were too numerous to be described manually. The rules for these instructions are built when the table is loaded.
    They are marked in the table by nbcycle = -1.
  • Note: This was originally posted on 11th May 2011 at http://forums.arm.com


    I am confused.
    From the specs, ADD needs its source registers at E2, and its destination register is available at E2 too. So these 2 instructions can be dual-issued:
    add r1, r2, r3
    add r4, r5, r1
    because the second ADD requires r1 at E2 and the first ADD makes r1 available at E2 too.
    If ADD needed its source registers at E1, I would agree that the 2 instructions above can't be dual-issued.
    One explanation, I think, is that ADD needs its source registers at the beginning of E2 and makes its destination register available at the end of E2. However, why don't the specs say that the destination register is available at E3?
    I know I'm wrong, but I can't explain why.




    You're absolutely right!
    It's a strange convention used by ARM.
    There must be a good reason for it, but I do not know it.
    - Maybe it would be confusing to say that a MUL result cycle is 7 while the functional unit has only 6 stages.
    - Maybe something can happen between the end of one cycle and the beginning of the next. I am thinking of shortcuts: MUL shortcuts are one cycle faster than indicated (or understood) in the documentation. It's possible that the forwarding is done before the beginning of the cycle, which could explain this difference.

    But in the end, this is not really a problem once you understand how to read the cycle table.
  • Note: This was originally posted on 29th April 2011 at http://forums.arm.com

    For the dual-issue rules, everything is here:
    http://infocenter.ar...k/Babhefaj.html

    For the functional units:
    Once an instruction has been decoded, it is sent to a specific functional unit (called an "fu").
    Those "fu" are linked to pipelines.

    On the ARM, you have 2 pipelines and 4 fu:
    ALU0
    MUL0
    ALU1
    LS (load/store)

    ALU0 and MUL0 are linked to pipeline 0.
    ALU1 is linked to pipeline 1.

    There is no MUL1. That's why you can't execute a MUL operation in pipeline 1.

    The LS fu is linked to both pipeline 0 and pipeline 1. That's why you can execute only one memory access per cycle, but this access can be done in pipeline 0 or pipeline 1.

    On ARM you can execute only 1 MUL per cycle and 1 LDR/STR per cycle. But the reason these instructions can't be dual-issued is not the same!

    Do you need to handle them in a cycle counter?
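
    As a rough illustration of the pairing constraints described above, a minimal Python sketch might look like this. It is not the real issue logic: it ignores register hazards and multi-cycle instructions, and the function name and the "alu" / "mul" / "ls" labels are hypothetical.

```python
def can_dual_issue(first_fu: str, second_fu: str) -> bool:
    """Check functional-unit constraints for a candidate pair.

    `first_fu` is the unit needed by the instruction that would go to
    pipeline 0, `second_fu` the unit needed by the one that would go to
    pipeline 1. Labels: "alu", "mul", "ls" (hypothetical names).
    """
    if second_fu == "mul":
        # There is no MUL1, so a MUL cannot run in pipeline 1.
        return False
    if first_fu == "ls" and second_fu == "ls":
        # A single LS unit is shared by both pipelines.
        return False
    return True
```

    For example, `can_dual_issue("mul", "alu")` is true, while `can_dual_issue("alu", "mul")` and `can_dual_issue("ls", "ls")` are false, for the two different reasons given in the post.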

    Remark: what I say next is not certain. These are only speculations (but they seem to be true).
    Suppose you have this code:

    LDRD r0, r1, [r5]!
    LDR r3, [r6]!


    LDRD takes 2 cycles (and starts on cycle 1 in pipeline 0).
    Because it is a multi-cycle instruction, only its last cycle can be dual-issued.
    LDR can be executed in pipeline 1.

    So, if you just apply the ARM rules described in the link I gave you earlier, the LDR should execute in cycle 2 in pipeline 1.
    For me, this is not possible because the LS unit is in use (it is in use for 2 cycles). So the LDR will execute in cycle 3 in pipeline 0.

    This behavior seems to be correct, but there are not many cases where stall cycles are due to an fu conflict.

    In fact, I think the rule should be:
    "Multi-cycle instructions must issue in pipeline 0 and can only dual issue in their last iteration if it does not use the same functional unit."
  • Note: This was originally posted on 23rd June 2011 at http://forums.arm.com


    I have some questions related to this file: http://pulsar.websha...x-A8-cycle.xlsx
    1. Why are some lines darker than others? What does it mean?
        For example, the line that contains "MUL::MLA" is darker.
    2. What does "::" mean? For example, "MUL::MLA" or "MUL::SMLAWB".


    The line colors are just for me; the color is not used for the parsing.
    MUL::MLA defines a specific shortcut rule.
    The first MUL is not an instruction but a shortcut name, which you will find in column W of the MUL line.

    That just means that:

    if the MLA instruction is executed just after an instruction that has MUL in column W then
      use this rule (MUL::MLA)
    else
      use the standard MLA rule
    endif ;)
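
    The selection logic in the pseudocode above could be written in Python along these lines. This is a sketch only: `select_rule` and the contents of the rule table are hypothetical placeholders, not the actual data in the spreadsheet.

```python
def select_rule(rules, prev_shortcut, mnemonic):
    """Pick the shortcut rule if one applies, else the standard rule.

    rules:         dict mapping "MLA" or "MUL::MLA" style keys to rule data
    prev_shortcut: shortcut name of the previous instruction (the value
                   the post says is stored in column W), or None
    mnemonic:      mnemonic of the instruction being scheduled
    """
    if prev_shortcut is not None:
        key = f"{prev_shortcut}::{mnemonic}"
        if key in rules:
            return rules[key]
    return rules[mnemonic]


# Placeholder rule data, for illustration only.
example_rules = {"MLA": "standard MLA rule", "MUL::MLA": "MUL::MLA shortcut rule"}
```

    With this table, `select_rule(example_rules, "MUL", "MLA")` returns the shortcut rule, and `select_rule(example_rules, None, "MLA")` falls back to the standard one.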

    Etienne
  • Note: This was originally posted on 28th June 2011 at http://forums.arm.com


    I have 2 questions related to load-multiple instructions. For example: ldm r1!, {r2, r3}.
    This instruction takes 2 cycles, so it is broken down into 2 single-cycle operations.
    1. Because write-back is enabled, r1 is written in the E2 stage. However, in which single-cycle operation is r1 written? The first or the second?


    I've used stage 2 for the pointer register of an LDM/STM operation.

    2. When are r2 and r3 available for other instructions? I cannot find any scheduling information in the specs.
    I assume this instruction is similar to LDR, which means r2 and r3 are available in E3. Because r3 is written by the second single-cycle operation, it is available in E3 of the second single-cycle operation.


    You're right.
    The last 2 registers are available at stage 3,
    the 2 registers before them are available at stage 2,
    and all earlier registers are available at stage 1.

    I'm not sure these are the real values, but they are the values I used in the cycle counter.
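
    The stage assignment described above can be sketched as a small Python function. The function name is hypothetical, and the stage values are the ones the author says he uses in his cycle counter, not confirmed hardware timings.

```python
def ldm_register_stages(registers):
    """Map each register of an LDM list to its availability stage.

    Counting from the end of the list: the last two registers get
    stage 3, the two before them stage 2, and any earlier ones stage 1.
    """
    stages = {}
    count = len(registers)
    for index, reg in enumerate(registers):
        from_end = count - 1 - index
        if from_end < 2:
            stages[reg] = 3
        elif from_end < 4:
            stages[reg] = 2
        else:
            stages[reg] = 1
    return stages
```

    For `ldm r1, {r2, r3}` this gives stage 3 for both r2 and r3.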


    If my assumption is correct, the 2 instructions below should produce a 2-cycle penalty:
    ldm r1, {r2, r3}
    ldr r4, [r3, #1]
    However, the result is different at http://pulsar.websha...sult.php?lng=fr
    Please explain.


    Use the permalink when you give an example!
    http://pulsar.websha...sample-23918fa0


    Hmm. I think you're right.
    There is a problem in the cycle counter.
    I'll check...


    [Update 14.34]: I've patched the register stages in the cycle counter. Thanks!
  • Note: This was originally posted on 28th June 2011 at http://forums.arm.com


    I am sorry, but I am still confused. For example: ldm r1, {r2, r3}
    Assume that this instruction starts at cycle n.
    If this instruction took only 1 cycle, r2 and r3 would be available at cycle n + 3.
    However, this instruction takes 2 cycles, so when are r2 and r3 available? (n + 3) or (n + 4)?


    n + 4
  • Note: This was originally posted on 14th June 2011 at http://forums.arm.com

    Hi Dung.

    For MRS and MSR: there are a lot of instructions for which I have not found real cycle timings and which I do not have time to test.
    In this case the rules are in the file only for parsing purposes...

    Take the latest version (but keep the previous one, because I've changed a lot of things).
    For example, I removed all the STM and LDM rules. There are too many cases. Now I build these rules automatically in the cycle counter.

    dstCond is the cycle (stage) for the destination register when the instruction is conditional. That's the case for
    MOVEQ r0, #5
    In this case r0 is written at stage 2, while without a condition
    MOV r0, #5
    r0 is written at stage 1.

    In a conditional instruction, the destination register must be read.
    cc-dst1 and cc-dst2 are the stages where the destination registers are read for conditional instructions.

    Etienne
  • Note: This was originally posted on 17th June 2011 at http://forums.arm.com


    Now I understand how hard it is to find cycle timings for all instructions.


    Ben Avison has done very useful work on that:
    http://www.avison.me.uk/ben/programming/cortex-a8.html


    For instructions that are not found, you treat them as unrecognized, right?


    Not exactly.
    I added rules to the cycle counter, but the cycle information could be wrong.



    I tried some instructions such as SETEND, BLKP, SMI, and SMC, and your cycle count module said they were unrecognized.


    I did not add these instructions because I did not know them.
    And nobody has used them since v0.7.
    If somebody puts them into the cycle counter, these instructions will be written to the unrecognized-instruction log file, and I'll add them.


    How can I get the latest version? Is it here: http://pulsar.websha...x-A8-cycle.xlsx
    I found that some instructions have been updated. For example, SUBS pc, lr, #imm isn't in "cortex-A8-cycle.xlsx", but it is available at http://pulsar.websha...ult.php?lng=fr.


    Yes, the latest version of the rules is always at the same place.
    As for SUBS:
    I do not write out every possible instruction. There are 2^32 possible instructions on the ARM.

    I use regular expressions.
    For example, SUB is defined in the sheet "Instruction", line 26:
    - it takes 1 cycle
    - it can be executed in both pipeline 0 and pipeline 1 (a&B)
    - it uses the ALU functional unit of the pipeline
    - type: the type of the instruction is 'data'. This information is used only for instructions that can run only once per cycle (LDR, for example)
    - flag: defines whether the instruction can be conditional and whether you can use the S bit to set the flag register
    - callback: used to make additional checks after the regexp check... for SUB, imm8plus checks that the immediate value is valid
    - wait: a special field to indicate a global lock operation like VMOV r0, d0
    - sform: used for short-form mnemonics (when a register can be omitted)
    - and then all the src and dst stages...

    Finally, this line is transformed automatically into a regexp:


    /^\s*(and|eor|sub|rsb|add|adc|sbc|rcsc|orr|bic)(al|eq|ne|cs|cc|mi|pl|vs|vc|hi|ls|ge|lt|gt|le|lo|hs)?(s)?()
    (\s+(r\d|r[1][012345]|sb|sl|fp|ip|sp|pc|lr)\s*,
    \s*(r\d|r[1][012345]|sb|sl|fp|ip|sp|pc|lr)\s*,
    \s*([^;@,\[\]:]*)\s*)?(?:\s(@.*|\/\/.*))?$/iU
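
    For readers who want to experiment with this kind of rule, here is a simplified Python adaptation of the pattern above. It is a sketch, not the author's code: the original appears to use PCRE-style `/iU` flags, while this version uses `re.IGNORECASE` and a lazy quantifier, shortens the mnemonic list, makes the operands mandatory, and introduces the hypothetical names `REG`, `COND` and `PATTERN`.

```python
import re

# Register names and condition codes, copied from the pattern above.
REG = r"(?:r1[0-5]|r\d|sb|sl|fp|ip|sp|pc|lr)"
COND = r"(?:al|eq|ne|cs|cc|mi|pl|vs|vc|hi|ls|ge|lt|gt|le|lo|hs)"

# Groups: 1 mnemonic, 2 condition, 3 S bit, 4 Rd, 5 Rn, 6 second operand.
PATTERN = re.compile(
    rf"^\s*(and|eor|sub|rsb|add|adc|sbc|orr|bic)({COND})?(s)?"
    rf"\s+({REG})\s*,\s*({REG})\s*,\s*([^;@,\[\]:]+?)\s*(?:(?:@|//).*)?$",
    re.IGNORECASE,
)

match = PATTERN.match("ADDEQS r10, sp, r3 @ comment")
if match:
    mnemonic, cond, s_bit, rd, rn, operand2 = match.groups()
```

    As in the original, the condition code is matched before the optional `s`, and a trailing `@` or `//` comment is ignored.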



    I can't understand why there are so many cases. I guess you count the registers and get the number of cycles from the formula in the specs.
    Please explain if you can.


    That's what I do now.
    I'm using the callback to count the number of registers to load (or store) and then assign the correct number of cycles to the instruction.
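
    A register-counting callback of this kind might look like the Python sketch below. The function names are hypothetical, and the conversion from register count to cycle count is deliberately left out, since the thread does not give the formula.

```python
import re

# Standard ARM aliases for the high core registers.
ALIASES = {"sb": 9, "sl": 10, "fp": 11, "ip": 12, "sp": 13, "lr": 14, "pc": 15}


def reg_number(name: str) -> int:
    """Convert a core register name such as 'r3' or 'lr' to its number."""
    name = name.strip().lower()
    if name in ALIASES:
        return ALIASES[name]
    match = re.fullmatch(r"r(\d|1[0-5])", name)
    if match is None:
        raise ValueError(f"not a core register: {name!r}")
    return int(match.group(1))


def count_registers(reglist: str) -> int:
    """Count the registers in a list such as '{r0-r3, lr}'."""
    inner = reglist.strip().strip("{}")
    regs = set()
    for item in inner.split(","):
        if "-" in item:
            low, high = (reg_number(part) for part in item.split("-"))
            regs.update(range(low, high + 1))
        else:
            regs.add(reg_number(item))
    return len(regs)
```

    The count returned for the register list of an LDM or STM can then be fed into whatever cycle formula the specs give.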

    About v0.8:
    The engine is now finished (i.e. I do not know what else I could add).
    Now, I'll:
    - add missing rules.
    - improve callback quality to detect wrong instructions.
    - maybe make a faster engine by removing the functional unit handling (it seems possible to build the same engine without managing functional units).
    - and test.

    Etienne

    I've corrected the link to Ben Avison's website.
  • Note: This was originally posted on 11th May 2011 at http://forums.arm.com


    How do you handle this situation (my example)?
    I guess that when you know the available stage of a register is E2, you treat it as below:
    - If the register is a source, you know it is available at E2
    - If the register is a destination, you know it is available at E3
    Is my guess right?


    That's quite easy in fact.
    The current cycle is 1.
    The current pipeline is 0.

    To know whether your first ADD can be executed, you must check that:
    r2 is available at cycle 1 + 2 (current cycle + stage)

    r3 is available at cycle 1 + 2 (current cycle + stage)

    Then it's OK to execute.
    You mark r1 as locked until 1 + 2 + 1 (current cycle + stage + 1 cycle).
    So r1 is locked until cycle 4.


    Now, when you try to execute the second ADD:
    r1 must be available at cycle 1 + 2 (current cycle + stage) = 3,
    but it's locked until cycle 4... so you have to wait.
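
    The bookkeeping described above can be sketched in Python as follows. This is a minimal illustration under one assumption that the post does not state explicitly: "locked until cycle N" is taken to mean the register can be used from cycle N onward. The function names are hypothetical.

```python
def earliest_issue(cycle, src_stages, locked_until):
    """Return the first cycle >= `cycle` at which all sources are ready.

    src_stages:   dict mapping each source register to the stage at
                  which the instruction reads it
    locked_until: dict mapping registers to the cycle until which they
                  are locked (registers not present are free)
    """
    while any(locked_until.get(reg, 0) > cycle + stage
              for reg, stage in src_stages.items()):
        cycle += 1  # stall one cycle and check again
    return cycle


def lock_destination(cycle, dst, dst_stage, locked_until):
    """Lock `dst` until issue cycle + result stage + 1 cycle."""
    locked_until[dst] = cycle + dst_stage + 1


# The two ADDs discussed earlier in the thread, with stage 2 for sources
# and destination:  add r1, r2, r3  then  add r4, r5, r1
locks = {}
first = earliest_issue(1, {"r2": 2, "r3": 2}, locks)
lock_destination(first, "r1", 2, locks)
second = earliest_issue(1, {"r5": 2, "r1": 2}, locks)
```

    Under that assumption the first ADD issues in cycle 1, r1 is locked until cycle 4, and the second ADD has to wait until cycle 2 instead of being dual-issued.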
  • Note: This was originally posted on 8th August 2011 at http://forums.arm.com


    Hi Etienne,

    I have checked some floating-point instructions, listed below:
    VADD, VSUB, VABD, VMUL, VCEQ, VCGE, VCGT, VCAGE, VCAGT, VMAX, VMIN

    In the specs, they all require their source registers at the N2 stage.
    However, in your database (Excel file), they require their source registers at the N1 stage.

    Why is there this difference?


    Yes.
    There was a mistake in the Excel file.
  • Note: This was originally posted on 8th August 2011 at http://forums.arm.com


    Hi all,
    I am doing some profiling analysis on the Cortex-A8 processor using the BeagleBoard-xM. I found some strange behavior with the following piece of code. The code takes 46 cycles. But looking at the code, we can see that the instructions do not depend on each other, so ideally it should take only 9 cycles.

    Code:
    /* 46 cycles. */
    vld1.32 {d16,d17},[r1:128];
    vmla.f32 d0,d15,d14;
    vld1.32 {d18,d19},[r1:128];
    vmla.f32 d1,d15,d14;
    vld1.32 {d20,d21},[r1:128];
    vmla.f32 d2,d15,d14;
    vld1.32 {d22,d23},[r1:128];
    vmla.f32 d3,d15,d14;
    vld1.32 {d24,d25},[r1:128];
    vmla.f32 d4,d15,d14;
    vld1.32 {d26,d27},[r1:128];
    vmla.f32 d5,d15,d14;
    vld1.32 {d28,d29},[r1:128];
    vmla.f32 d6,d15,d14;
    vld1.32 {d30,d31},[r1:128];
    vmla.f32 d7,d15,d14;
    vld1.32 {d12,d13},[r1:128];
    vmla.f32 d8,d15,d14;

    However, if I separate the vmla and vld instructions, the behavior is as expected, i.e. the following code sequences take 9 and 11 cycles respectively.

    /*  9 cycles. */
    vmla.f32 d0,d15,d14;
    vmla.f32 d1,d15,d14;
    vmla.f32 d2,d15,d14;
    vmla.f32 d3,d15,d14;
    vmla.f32 d4,d15,d14;
    vmla.f32 d5,d15,d14;
    vmla.f32 d6,d15,d14;
    vmla.f32 d7,d15,d14;
    vmla.f32 d8,d15,d14;

    /* 11 cycles. */
    vld1.32 {d16,d17},[r1:128];
    vld1.32 {d18,d19},[r1:128];
    vld1.32 {d20,d21},[r1:128];
    vld1.32 {d22,d23},[r1:128];
    vld1.32 {d24,d25},[r1:128];
    vld1.32 {d26,d27},[r1:128];
    vld1.32 {d28,d29},[r1:128];
    vld1.32 {d30,d31},[r1:128];
    vld1.32 {d12,d13},[r1:128];

    Can someone please let me know whether I am missing something here or whether my understanding is wrong?

    Thanks,
    Anil M S


    What is your test procedure?
    Did you make a loop that executes 1000 times (for example) and find 46,000 cycles for the first example
    and (11 + 9) * 1000 = 20,000 cycles for the second?
  • Note: This was originally posted on 9th August 2011 at http://forums.arm.com

    Hmm.

    It was so strange that I ran the test myself.
    I do not get exactly the same results as you, but the problem is still there.

    That's really strange!

    If you replace VMLA.F32 with VMUL.F32 or VMLA.U32, the problem is solved.

    So I assume that the vmla.f32 shortcut is not applied if there is another instruction between the mul and the mla.
    It seems that this problem occurs only for float MLA!

    That's strange.

    What is even stranger is why the first code takes so much time when it should take 9 cycles (if we don't use vmla.f32)!

    Finally, I changed the value of the address register:

    add   r2, r1, #16
    add   r3, r2, #16
    add   r4, r3, #16
    b    .loop1
    .align 4
    .loop1:

    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14

    subs   r0, r0, #1
    bgt   .loop1


    This code (the NEON part only) now takes 10 cycles. It should take only 8 cycles.
    I assume that there is a conflict in the NEON memory file when you use the same address register.

    So:
    1 - don't put an instruction between MUL and MLA when you use float operations.
    2 - don't read the same data repeatedly with NEON (in practice you never have to do that; you did it here only because you were writing a benchmark. In real life this case never happens).

    NEON is not fully detailed in the documentation. There are a lot of things you'll have to find out by testing.
    I did not know about the two you found!

    Etienne.
  • Note: This was originally posted on 29th June 2011 at http://forums.arm.com


    smlal r0, r1, r3, r4
    smlal r0, r1, r3, r4
    smlal takes 3 cycles, and the destination register is available in E5. So the first instruction releases r0 and r1 at cycle 3 + 5 = 8.


    You're right.
    There was an error in my cycle table (in the accumulator access stage).

    Be careful! Your explanation was not complete.
    r0 is available 8 cycles later, but it could have been that the next instruction only needs r0 at stage 3.
    The second smlal needs r0 at stage 1, so your remark is correct.

    Just replace the second smlal with an add and you could believe that smlal is faster:
    http://pulsar.webshaker.net/ccc/sample-4de8ce82

    ...

    The correction is done.
    The Excel file has been updated, and the cycle counter too.

    Thank you.
    Your help is very useful.
  • Note: This was originally posted on 11th July 2011 at http://forums.arm.com


    Hi Etienne. Have a good day

    So far, I found some instructions that your cycle count module can't analyze. I don't know why.
    Please check and give me some explanations:

    vbic.i16 d0, #1 ; 0x0001
    vbic.i32 q2, #1 ; 0x00000001

    vmov.i16 q0, #1 ; 0x0001
    vmov.i16 d0, #1 ; 0x0001

    vmvn.i16 q1, #1 ; 0x0001
    vmvn.i16 d1, #1 ; 0x0001

    vorr.i16 q0, #1 ; 0x0001
    vorr.i16 d0, #1 ; 0x0001

    Dung


    Hi Dung.

    You're right.
    A callback that checks the validity of the immediate value was missing.
    I've added it now.

    Thanks!

    Etienne