
Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, sorry for my bad English.

I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16)!
But I have no idea how to do this using that documentation.
  • Note: This was originally posted on 13th July 2011 at http://forums.arm.com


    I am impressed by how fast you modified your code.
    I wish I were as good as you.
    I have just found some instructions that your module reports as unrecognized. Please check them:



    This is not so complex.
    Most of the time, rules are missing or wrong in the Excel file...
    so an update just takes a few minutes...


    vqdmulh.s16 d0, d1, d2[0]
    vqrdmulh.s16 d0, d1, d2
    vqrdmulh.s16 d0, d1, d2[0]
    vqshlu.s32 q1, q2, #1
    vrecpe.u32 d1, d0
    vrecpe.u32 q1, q0
    vrsqrte.u32 d1, d0
    vrsqrte.u32 q1, q0
    vpmax.s16 d0, d1, d2
    vpmin.s16 d2, d1, d0
    vqdmulh.s16 d0, d1, d2


    vshll.s16 d2, q0, #1
    vshll.u16 d2, q0, #1




    Well,
    you were right about the first 11 rules.
    They were missing; I've added them.

    For VSHLL:
    the registers are not in the right positions. VSHLL takes a quadword as its destination register and a doubleword as its source.
    So
    vshll.u16 d2, q0, #1
    is not a valid instruction.

    Thanks for the report!

    Etienne
  • Note: This was originally posted on 18th March 2011 at http://forums.arm.com

    Hmm!!!
    You "just need" that ;)

    I can't give you the source code of the cycle counter, but I can explain how it works.
    There are two parts:
    - the general case
    - the specific cases (register restrictions, shortcuts, ...)

    For the general case:
    It's quite easy:

    You are at cycle #10

    1 - The ARM checks, before starting an instruction, that all the registers will be available when the instruction needs them.
    For example:
    you want to execute a MUL Rd, Rm, Rs.
    Rm must be available at cycle #11 (#10 + 1; see the MUL cycle table http://infocenter.ar...ch16s02s03.html)
    If at least one register is not available, the ARM does not start the instruction and you get a stall cycle.

    2 - The ARM starts executing the instruction and locks the destination registers (to prevent any other instruction from using the same registers as sources).
    For example, with our previous MUL:
    Rd is locked until cycle #16 (#10 + Rd: E5 + 1, because the MUL takes 2 cycles, and destination stages are always given for the last cycle of a multi-cycle instruction).

    3 - Free the register...
    This is quite complex to explain, but sometimes a register can be locked by more than one instruction.
    For example:
    MUL r0, r1, r2
    MUL r0, r1, r2
    The first MUL will lock r0 until cycle #16. The second MUL will start at cycle #12 and lock r0 until cycle #18.
    So during cycles #13, #14 and #15, two instructions have locked the register r0!
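    The general case above can be sketched as a tiny scoreboard simulator. This is a hypothetical toy, not Etienne's actual tool: the offsets (sources needed at issue+1, destination locked for 6 cycles, 2 issue cycles per MUL) are just the example figures from this post.

```python
# Toy scoreboard cycle counter for the "general case" described above.
# Hypothetical sketch: the offsets (+1 for sources, +6 for the result
# lock, 2 issue cycles per MUL) are the example numbers from this post,
# not authoritative Cortex-A8 timings.

def count_cycles(program, src_offset=1, lock_offset=6, issue_cycles=2):
    ready = {}                      # register -> cycle its value unlocks
    cycle = 0
    for rd, rm, rs in program:
        # 1 - stall until every source register will be available
        #     at the cycle the instruction actually needs it
        while max(ready.get(rm, 0), ready.get(rs, 0)) > cycle + src_offset:
            cycle += 1              # stall cycle
        # 2 - issue, and lock the destination register
        ready[rd] = cycle + lock_offset
        cycle += issue_cycles       # a MUL occupies issue for 2 cycles
    return cycle, ready

# Two back-to-back MULs to the same destination, as in the example:
# counting from 0 instead of #10, the first locks r0 until 6 (= #16)
# and the second starts at 2 (= #12) and locks r0 until 8 (= #18).
end, ready = count_cycles([("r0", "r1", "r2"), ("r0", "r1", "r2")])
```

    Adding a dependent instruction such as MUL r3, r0, r1 after the pair makes the stall loop in step 1 kick in, since r0 stays locked until cycle 8.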


    For the specific cases:
    This is less fun!!! There are a lot of cases you will have to handle.
    For example, you can't execute 2 instructions using the same destination register.
    I remember having posted a message with this example:

    mov r0, #5
    add r0, r0, r2


    MOV will lock r0 until E1.
    ADD does not need r0 before E2.
    So there is no good reason (there must be one, but I do not know it) not to execute both instructions in the same cycle.
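    The restriction can be written down as a toy check. This is a hypothetical sketch, not the real A8 issue logic; the stage numbers (MOV result in E1, ADD needing r0 at E2) come straight from the post.

```python
# Toy dual-issue check for the specific case above: two instructions
# that write the same destination register are never paired, even when
# the stage timing (MOV result in E1, ADD source needed at E2) would
# make pairing data-safe. Hypothetical sketch, not real A8 issue logic.

def can_dual_issue(dest_a, result_stage_a, dest_b, source_need_stage_b):
    if dest_a == dest_b:
        return False                # same destination: never dual issued
    # otherwise the pair is data-safe if the first result is ready
    # no later than the stage where the second needs its sources
    return result_stage_a <= source_need_stage_b

# mov r0, #5  (result in E1)  /  add r0, r0, r2  (needs r0 at E2)
refused = not can_dual_issue("r0", 1, "r0", 2)
```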

    Finally:
    the program is not as complex to write as it might seem.

    But!!!
    First of all, you must be sure that you understand:
    - how the pipeline stages work
    - what exactly a pipeline is

    Well!!!
    You now know what to do for the next 3 months ;)
  • Note: This was originally posted on 11th August 2011 at http://forums.arm.com


    Okay, I don't have a kind of epilogue at the end of my loop. This greatly changes the problem, and is more similar to my addition of nops. I don't think the NEON part is really taking 10 cycles, it's just that some of those cycles are overlapping with the non-NEON part.


    Hmm.
    In fact, you may be right!
    I've tried copying the NEON code 2, 3 and 4 times into the loop.

    Normally, 8 instruction pairs should take 8 cycles, but they take 10;
    16 pairs take 20;
    24 pairs take 38;
    32 pairs take 48.

    The time increases strangely.
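    The anomaly is easiest to see as a per-pair cost; a quick check on the numbers quoted above (no new data here):

```python
# Cycles per instruction pair from the measurements quoted above.
# With perfect dual issue, each pair should cost 1.0 cycle.
measurements = [(8, 10), (16, 20), (24, 38), (32, 48)]
per_pair = [cycles / pairs for pairs, cycles in measurements]
# [1.25, 1.25, ~1.58, 1.5]: the cost per pair jumps once the loop
# body grows past 16 pairs, instead of staying flat
```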

    I've replaced the vld1 with vtrn; the timings are:
    8 pairs take 8 cycles,
    16 pairs take 19 cycles,
    24 pairs take 32 cycles,
    ...

    So now I think you're right. There is a bottleneck due to bandwidth.
    And it's more visible with vld1 than with vtrn, because when the vld1 is pushed into the NEON queue, the VALUE of the address register is pushed too.


    This is conventional wisdom, but on NEON I would suggest the exact opposite. Loads make their data available in N1 and many NEON instructions need their data available in N2. In this case it's even possible to load and use the result of the load in the same cycle. If the memory is stalled for some other reason like a cache miss I don't think having the instruction ahead will help you.



    We don't agree :) that can happen ;)


    I'd still like to see some evidence that this causes problems and what exactly.



    Just replace



    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14


    by



    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r1:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r1:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r1:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r1:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r1:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r1:128]
    vmul.f32 d7,d15,d14


    ...and you'll have your proof.

    What you won't have is a logical explanation ;)

    For a moment, I thought that maybe the address register could not be used directly on the next cycle, but if you replace the ADD with a MOV that just duplicates the register value, the problem is the same!
  • Note: This was originally posted on 15th August 2011 at http://forums.arm.com


    If I use three buffers of 4096 bytes each, the cycle count increases to around 140 cycles. I have given the code here; r1, r2 and r3 hold the addresses of buffer1, buffer2 and buffer3 respectively.


    However, if the buffer size is not a multiple of 4096, cycle count is normal.


    140 cycles for each iteration of the loop?
  • Note: This was originally posted on 10th August 2011 at http://forums.arm.com

    Well, Anil.
    I've tried several times to ask ARM for information.
    They are very nice, but in the end they never answered me.

    The problem you found is very specific, and the documentation is clear:


    This chapter provides the information to estimate how much execution time particular
    code sequences require. The complexity of the processor makes it impossible to
    guarantee precise timing information with hand calculations.


    I think that nobody knows every special case of how NEON works (except maybe the guy who designed it).
    You can try ;)

    Exo,
    I thought the NEON queue was 16 entries deep, and the memory access queue was 8 or 12 entries deep!

    I will try your load example, but as far as I know, you can only pair a multi-cycle instruction on its first cycle OR its last cycle.
    The doc says:


    There are also similar restrictions to the ARM integer pipeline in terms of dual issue
    pairing with multi-cycle instructions. The NEON engine can potentially dual issue on
    both the first and last cycle of a multi-cycle instruction, but not on any of the
    intermediate cycles.


    Reading that, it seems that you are right, but I think I ran tests on that, and I don't remember ever successfully pairing with one instruction twice.
    I'll check again!
  • Note: This was originally posted on 11th August 2011 at http://forums.arm.com


    I thought I remembered issuing on both first and last cycle but I'm having trouble doing it now too. I'm also having trouble getting the loop you mentioned earlier down to 10 cycles. It looks like it's taking at least 12. The entire loop is taking 14 - since there is stalling, it's difficult to tell how much, if any, is overlapping the 2 cycles of integer loop overhead. You would think that at least one cycle would be overlapped since it's purely a fetch cycle.


    As I said, 10 cycles is the NEON code time.
    The full code is this one:

    movw   r1, #:lower16:coef
    movt   r1, #:upper16:coef

    add   r2, r1, #16
    add   r3, r2, #16
    add   r4, r3, #16
    b    .loop1
    .align 4
    .loop1:

    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14

    smuad   r10, r10, r10
    nop
    nop
    smuad   r11, r11, r11
    nop
    subs   r0, r0, #1
    smuad   r12, r12, r12
    bgt   .loop1


    I'm using this code because I'm sure that the ARM code at the end takes exactly 5 cycles and leaves 2 bubbles in the pipeline for the branch.
    Remember this post ;) http://pulsar.webshaker.net/2011/04/17/focus-on-branch-instructions/


    The number of cycles stays the same for me regardless of whether I load to different registers or use different base registers with the same arrangement as in your example. Maybe we're using different versions of the Cortex-A8? I'm using an OMAP3530, how about you?

    I have a BeagleBoard-xM (DM3730). But the processor is not the problem (I believe). Try the code I gave and tell me if you get 15 cycles (10 for the NEON part and 5 for the ARM part).


    Here are some interesting things I've observed:

    1) If I add one or two pairs of nops in the middle I get the same speed (14 cycles for the loop). If I add a third pair the speed goes down to 13 cycles. With the fourth pair it goes back up to 14 cycles, and with every pair after that it adds 2 cycles. So, with 3 nop pairs I get no stalls in the NEON code, because there are 12 pairs of instructions (+1 cycle for fetch stall).

    2) If I change three or more of the vld1s to independent vext.8 I get 10 cycles, or full pairing. Same with vmovn, vswp, vrev16, vzip, and vuzp. So the bottleneck is not dual-issue, it's loads and stores.

    3) If I change to 64-bit loads instead of 128-bit I still get 14 cycles for the loop. So I don't think it's a bandwidth limitation.

    4) If I change to 64-bit or 128-bit store I get 21 cycles for the loop. However, here if I store to separate 16-byte addresses in a 64-byte block I get something like 15.5 cycles (this is with a cache-line aligned destination). This is probably due to coalescing filling a whole cache line in the write buffer, where otherwise the cache line has to be loaded. I tried "warming" the buffer by memcpying it to itself to make sure it was in L1 cache, but that didn't make a difference.

    5) If I change the vmul.f32s to vmla.f32 things get bad. If I start at a baseline of no-pairing I get the expected 9 cycles. Then pairing a single vmovn turns it into 12. And from there every new pair adds 4 cycles. I get the same cycles with vrecps.f32, and presumably will with the other chained pipeline instructions.

    So I guess the lessons are to not do too many loads/stores in a row, and that chained pipeline instructions hate being dual issued with anything for some reason. We should do some more testing to see if there are any other instructions that cause a big penalty over dual-issue like this.


    I do not understand point 5 and how you get 9 cycles!!!

    I think the NEON load process is very complex and it is not easy to really understand it.
    I've stopped trying to understand it, because the test context is never the same as the real-time context.

    The best we can do for the moment is to follow some guidelines:
    - don't use the same buffer for reading and writing (when possible)
    - don't update data with the ARM while you're reading it with NEON (in a loop, I mean); in general, avoid accessing the same data with both ARM and NEON load and store operations (you can do that only if both the ARM and NEON are doing LOADs)
    - try to load data (if possible) long before using it (there are enough registers to load the next iteration's data during the previous one)
    - try to write as soon as possible (that is, as soon as the registers are available for VSAVE)
    - use alignment if possible
    - and now ;) don't read the same memory block with consecutive VLOADs

    It could be useful to understand the real NEON LOAD and STORE process, but I think you could search your whole life without understanding it!

    Etienne.
  • Note: This was originally posted on 21st March 2011 at http://forums.arm.com

    Multi-cycle instructions lock both pipelines.
    You can only have dual issue on the last cycle.

    You'll have more information here
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Babhefaj.html
  • Note: This was originally posted on 15th April 2011 at http://forums.arm.com

    If you are using time to measure the elapsed time of a very short program, then the odds are that most of the time is spent loading the program from storage and setting up the memory map.

    Try setting up the timing function inside your program binary and measure a relatively large block of instructions, so that the measurement overheads are small relative to the measurement.

    Iso
  • Note: This was originally posted on 12th April 2011 at http://forums.arm.com


    10 NOP instructions on my beagle take 2.508 s


    Two questions ...

    (1) Have you turned on the processor caches?
    (2) How are you measuring the time taken?

    Assuming perfect execution, 10 NOP instructions should take around 5 cycles, so about 0.01 us... so, not meaning to be rude, but something is wrong with your setup =)
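    The 5-cycle estimate translates to wall-clock time like this (the 500 MHz clock is an assumption for an early BeagleBoard, not something stated in the thread):

```python
# Back-of-envelope check of the 0.01 us figure above.
# Assumption: ~500 MHz core clock (typical early BeagleBoard OMAP3).
clock_hz = 500e6
nops = 10
cycles = nops / 2               # NOPs dual issue, 2 per cycle
seconds = cycles / clock_hz     # 5 / 500e6 = 1e-8 s = 0.01 us
```

    Against the measured 2.508 s, that is eight orders of magnitude off, which is why the measurement method (not the CPU) is the suspect.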
  • Note: This was originally posted on 9th August 2011 at http://forums.arm.com

    Hi Anil, webshaker..

    I have personally found something very strange in NEON that might be related to what you're describing. It seems that if you dual-issue too many instructions in a row, you start seeing stalls. I haven't attempted to formally understand this; all I know is that if you start with a loop with a few dual-issued instructions (and then the loop ends with some ARM code that takes two cycles) it works as expected. Then as you add more pairs, eventually each one starts adding more than 1 cycle. At its worst it seemed to give even lower performance than if they were all single-issued, like what Anil was seeing in his first example. At this point I've actually been able to improve performance by adding nops between the pairs!

    Note that this doesn't happen if you pair NEON and ARM code. You can seemingly do that as much as you want without penalty.

    The only possible explanation that comes to mind is that the NEON queue could be bottlenecking its throughput. The queue is 12 instructions long, so you can fill it up in 6 cycles. You will note that the pipeline stage where NEON instructions are dispatched is more than 6 before the stage where NEON begins. This is even worse if instructions are not removed from the queue until later in the NEON pipeline. So if the queue is filled while there are still more NEON instructions to be issued it will have to stall until the NEON unit consumes the instructions. Normally this shouldn't be a problem because once the stall happens they'd reach equilibrium, where the NEON consumes two old instructions at the same rate that the dispatch adds two new instructions to the queue. But there could be something that's causing the stall to be more serious than this and costing a lot more cycles.
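    That equilibrium argument can be checked with a minimal queue model. Everything here is hypothetical: the 12-entry depth comes from the paragraph above, but the drain delay and drain-before-push ordering are guesses, and the real A8 behavior is exactly what is in question.

```python
# Minimal model of the queue hypothesis above. Dispatch pushes up to
# 2 NEON instructions per cycle into a fixed-depth queue; the NEON unit
# starts draining (also 2 per cycle) only after `delay` cycles of
# pipeline latency. The 12-entry depth is from the text; the delay and
# ordering are illustrative guesses, not measured A8 parameters.

def dispatch_stalls(n_instructions, depth=12, delay=8, rate=2):
    queue = issued = stalls = cycle = 0
    while issued < n_instructions:
        if cycle >= delay:
            queue -= min(rate, queue)          # NEON consumes from the queue
        if depth - queue >= rate:
            pushed = min(rate, n_instructions - issued)
            queue += pushed                    # dispatch adds a pair
            issued += pushed
        else:
            stalls += 1                        # dispatch blocked: queue full
        cycle += 1
    return stalls
```

    With these numbers the stall is only a short transient before dispatch and drain reach equilibrium; the measured slowdowns are much worse, which is exactly the open question.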

    If the instructions are in fact not removed until the end of the pipeline then vmla.f32 would exacerbate things because it effectively adds a bunch of stages to the NEON pipeline.

    One thing to try is instead of doing something like this:

    vld1.32 {q0}, [r0:128]
    vmla.f32 q8, q8, q9

    vld1.32 {q1}, [r0:128]
    vmla.f32 q10, q10, q11

    You could try doing this:

    vld1.32 {q0,q1}, [r0:128]
    vmla.f32 q8, q8, q9

    vmla.f32 q10, q10, q11

    Because multi-cycle instructions can pair on both the first and last cycle, this should work the same. But if the queue is really a bottleneck, this may relieve pressure on it, assuming that the multi-cycle load doesn't get turned into more than one entry in the queue.

    For what it's worth, I haven't had worse problems reading the same data over and over again, so I don't think that's contributing to it. This is actually useful in the real world: because loads finish in N1 and most instructions need their inputs in N2 you can actually dual issue a load with an instruction that uses the result in the same cycle. So you can use a load to prepare a constant if you don't have enough registers for it, or if you need to prepare the destination of a vmla/vmls. Note that unlike with loads there isn't a way to pair a 128-bit move with a normal instruction; you can fake a 64-bit one, though.
  • Note: This was originally posted on 10th August 2011 at http://forums.arm.com

    Yeah, I may be misremembering the queue length.. I'll have to check again later today when I have access to the description.

    I thought I remembered issuing on both first and last cycle but I'm having trouble doing it now too. I'm also having trouble getting the loop you mentioned earlier down to 10 cycles. It looks like it's taking at least 12. The entire loop is taking 14 - since there is stalling, it's difficult to tell how much, if any, is overlapping the 2 cycles of integer loop overhead. You would think that at least one cycle would be overlapped since it's purely a fetch cycle.

    The number of cycles stays the same for me regardless of whether I load to different registers or use different base registers with the same arrangement as in your example. Maybe we're using different versions of the Cortex-A8? I'm using an OMAP3530, how about you?

    Here are some interesting things I've observed:

    1) If I add one or two pairs of nops in the middle I get the same speed (14 cycles for the loop). If I add a third pair the speed goes down to 13 cycles. With the fourth pair it goes back up to 14 cycles, and with every pair after that it adds 2 cycles. So, with 3 nop pairs I get no stalls in the NEON code, because there are 12 pairs of instructions (+1 cycle for fetch stall).

    2) If I change three or more of the vld1s to independent vext.8 I get 10 cycles, or full pairing. Same with vmovn, vswp, vrev16, vzip, and vuzp. So the bottleneck is not dual-issue, it's loads and stores.

    3) If I change to 64-bit loads instead of 128-bit I still get 14 cycles for the loop. So I don't think it's a bandwidth limitation.

    4) If I change to 64-bit or 128-bit store I get 21 cycles for the loop. However, here if I store to separate 16-byte addresses in a 64-byte block I get something like 15.5 cycles (this is with a cache-line aligned destination). This is probably due to coalescing filling a whole cache line in the write buffer, where otherwise the cache line has to be loaded. I tried "warming" the buffer by memcpying it to itself to make sure it was in L1 cache, but that didn't make a difference.

    5) If I change the vmul.f32s to vmla.f32 things get bad. If I start at a baseline of no-pairing I get the expected 9 cycles. Then pairing a single vmovn turns it into 12. And from there every new pair adds 4 cycles. I get the same cycles with vrecps.f32, and presumably will with the other chained pipeline instructions.

    So I guess the lessons are to not do too many loads/stores in a row, and that chained pipeline instructions hate being dual issued with anything for some reason. We should do some more testing to see if there are any other instructions that cause a big penalty over dual-issue like this.
  • Note: This was originally posted on 11th August 2011 at http://forums.arm.com

    I'm using this code because I'm sure that the ARM code at the end takes exactly 5 cycles and leaves 2 bubbles in the pipeline for the branch.
    Remember this post ;) http://pulsar.websha...h-instructions/

    I have a BeagleBoard-xM (DM3730). But the processor is not the problem (I believe). Try the code I gave and tell me if you get 15 cycles (10 for the NEON part and 5 for the ARM part).


    Okay, I don't have a kind of epilogue at the end of my loop. This greatly changes the problem, and is more similar to my addition of nops. I don't think the NEON part is really taking 10 cycles, it's just that some of those cycles are overlapping with the non-NEON part.

    I do not understand point 5 and how you get 9 cycles!!!


    9 cycles with no pairing at all, just 9 vmla.f32 in a row, exactly like what Anil posted. I was just using it as a starting point, and showing that pairing any instructions added several cycles of stalling for every pair.

    - try to load data (if possible) long before using it (there are enough registers to load the next iteration's data during the previous one)
    - try to write as soon as possible (that is, as soon as the registers are available for VSAVE)


    This is conventional wisdom, but on NEON I would suggest the exact opposite. Loads make their data available in N1 and many NEON instructions need their data available in N2. In this case it's even possible to load and use the result of the load in the same cycle. If the memory is stalled for some other reason like a cache miss I don't think having the instruction ahead will help you.

    Likewise, stores need their data in N1 and normal instructions don't make their results available until at least N3. So you need a few cycles after the operation before storing. That said, software pipelining can still help you avoid this and other latencies.
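    The stage argument in the last two paragraphs can be condensed into a toy rule. The N1/N2/N3 stage numbers are the ones given above; the rule itself is a simplification of the full pairing logic, and vmul stands in for a "normal" NEON instruction.

```python
# Toy pairing rule from the stage numbers above: a dependent pair can
# issue in the same cycle only if the producer's result stage is no
# later than the stage where the consumer needs its operand.
# Simplified sketch; vmul stands in for a generic NEON instruction.
produce_stage = {"vld1": 1, "vmul": 3}   # result ready at end of Nx
need_stage    = {"vmul": 2, "vst1": 1}   # operand needed at start of Nx

def can_use_same_cycle(producer, consumer):
    return produce_stage[producer] <= need_stage[consumer]
```

    This captures both claims at once: a load can feed a consumer issued in the same cycle (N1 <= N2), while a store of a freshly computed result cannot (N3 > N1) and needs a few cycles of separation, which software pipelining provides.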

    - and now ;) don't read the same memory block with consecutive VLOADs


    I'd still like to see some evidence that this causes problems and what exactly.