Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, sorry for my bad English.

I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16)!
But I have no idea how to do this using that documentation.

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Cool

You want to understand how to read the documentation and how the E stage works?
I asked the same question a few months ago!

This is a good opportunity for me to thank isogen74, sim, mayesta and the other people for their help.

The post that really helped me understand the pipeline stages is this one:
http://forums.arm.co...8-out-of-order/

I hope it will help you too.

Etienne

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 12th April 2011 at http://forums.arm.com

Hmm.

I think it's strange, but it's possible!
It's clear that a NOP is replaced by a MOV r0, r0.

But on my BeagleBoard-xM it seems that I can execute 2 NOPs in the same cycle.
Maybe you have a different revision of the Cortex-A8.

That could be an answer.

The documentation says:
Assembling the NOP mnemonic as UAL will not change the functionality of the code, but will change:
- the instruction encoding selected
- the architecture variants on which the resulting binary will execute successfully, because the NOP instruction was introduced in ARMv6K and ARMv6T2.

I'm not sure I really understand what it means.

I have no other explanation for the difference between our results.

Can you confirm that with 10 NOPs your program does not take 1 second?

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 12th April 2011 at http://forums.arm.com

Interesting!!!

So, don't use NOP anymore!

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 12th April 2011 at http://forums.arm.com

Hmm.

10 NOP instructions on my Beagle take 2.508 s.

Does your BeagleBoard run at 500 MHz?

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 31st March 2011 at http://forums.arm.com

anytime Etienne! though I quit ARMing

What a strange idea
See you soon !!!

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 11th April 2011 at http://forums.arm.com

The problem is due to your branch.

You can't simply expect that, if the branch is in the cache, it will take 1 cycle...
A branch is more complex than it seems.

Try this test procedure. It will make it easier to understand the time taken by your program.
movw r0, #0x0500 @ you repeat your loop 83232000 times
movt r0, #0x04F6
.loop:
nop @ here are your nops
nop
nop
nop
nop
nop
nop
nop
smuad r1, r1, r1 @ you can be sure the ending code takes 5 cycles.
nop
nop
smuad r2, r2, r2
nop
subs r0, r0, #1
smuad r3, r3, r3
bgt .loop
bx lr

If you don't put in any of your own NOPs (I mean your NOPs! Don't remove the NOPs after the smuad instructions),
the program should take 0.52 s (this is logical, because your BeagleBoard runs at 800 MHz and 5 * 83232000 ~= 416M cycles).
Every time you add 2 NOPs, your program will take ~= 0.10 s more.

You could get more readable results if you repeat your loop 80,000,000 times.
In that case use

movw r0, #0xB400
movt r0, #0x04C4

instead of

movw r0, #0x0500
movt r0, #0x04F6
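
The arithmetic in this post can be checked with a short Python sketch. It only restates the numbers given above (an 800 MHz clock, a 5-cycle loop tail, and one extra cycle per pair of NOPs); the function names are made up for illustration, and the result is an estimate, not a measurement.

```python
def movw_movt(low16, high16):
    """Value left in a register by movw (low half) followed by movt (high half)."""
    return (high16 << 16) | low16

CLOCK_HZ = 800_000_000      # clock frequency quoted in the post
CYCLES_PER_ITERATION = 5    # cost of the loop tail, as stated in the post

def loop_seconds(iterations, extra_cycles=0):
    """Estimated run time, assuming every iteration costs a fixed number of cycles."""
    return iterations * (CYCLES_PER_ITERATION + extra_cycles) / CLOCK_HZ

iterations = movw_movt(0x0500, 0x04F6)
print(iterations)                                   # 83232000
print(round(loop_seconds(iterations), 2))           # 0.52
# One extra cycle per iteration, i.e. two NOPs issued together (assumption from the post):
print(round(loop_seconds(iterations, 1) - loop_seconds(iterations), 2))  # 0.1
print(movw_movt(0xB400, 0x04C4))                    # 80000000
```

With 80,000,000 iterations each extra cycle per iteration adds exactly 0.1 s at 800 MHz, which is why that count gives more readable timings.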

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 18th March 2011 at http://forums.arm.com

In a pipelined processor there is a difference between
- the number of cycles needed to execute an instruction (from beginning to end)
- the number of cycles during which the pipeline is blocked.

The mul takes 6 cycles to execute once the instruction enters the pipeline,
but the pipeline is blocked for only 2 cycles.

When you execute this code
mul r0, r1, r2
mul r3, r4, r5

during cycles 2, 3, 5, 6 the ARM will be executing both muls.

Explaining this with mul is not a good choice, because the mul blocks the pipeline for 2 cycles!
What you need to understand is that the ARM can start a new instruction on most cycles, but that instruction can take more than one cycle to execute.
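
The difference between latency and blocking time can be illustrated with a toy Python model. It assumes a single in-order pipeline and independent instructions, and uses the figures from this post (6 cycles of latency, 2 cycles during which the pipeline is blocked). The `schedule` function is hypothetical; it is not the Cortex-A8 scheduling logic, and its cycle numbering need not match the numbering used above.

```python
def schedule(instructions):
    """Return (name, first_cycle, last_cycle) for each instruction.

    instructions: list of (name, issue_cycles, latency) tuples.
    Toy model: the next instruction may start once the previous one
    has held the pipeline for `issue_cycles`, even if it has not finished.
    """
    timeline = []
    next_issue = 1  # cycles are numbered from 1 in this sketch
    for name, issue_cycles, latency in instructions:
        start = next_issue
        end = start + latency - 1
        timeline.append((name, start, end))
        next_issue = start + issue_cycles
    return timeline

MUL = ("mul", 2, 6)  # 2 blocking cycles, 6 cycles of latency (figures from the post)
for name, start, end in schedule([MUL, MUL]):
    print(f"{name}: cycles {start}..{end}")
```

In this model the two muls finish after 8 cycles in total instead of 12, because the second one starts while the first is still executing.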

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 4th May 2011 at http://forums.arm.com

However, in cycle 2, it has a nop instruction. Why does "nop" occur?

You can forget the NOP. I add a NOP to the code provided in order to know exactly the time taken by the last instruction.

About NEON instructions, how do I know whether 2 NEON instructions can be dual-issued? Do they follow the rule you mentioned?

Hmm. As far as I know, you can trust the cycle counter.
But it can contain mistakes (I have sometimes said wrong things in the past).
For NEON dual issue, I applied these rules:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/BABHBCCB.html

After that I run some tests, and when I don't agree with the given information I change the rules.
When I can't find an explanation for a technical detail, I try to find a model that seems to work. Sometimes I'm wrong.

As for the cycle counter, the results are starting to be quite correct (except for VFP instructions).
Whenever possible, I check the cycle counter's result against real tests.

That's all I can tell you!

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 5th May 2011 at http://forums.arm.com

I read in the specs that NEON load/store instructions can be dual-issued with SIMD data-processing instructions. So I tried the code below on your website:
vld1.32 {d0}, [r0]
vadd d1, d2, d3

Hmm.
That's exactly a case where, for the moment, I don't agree with the documentation.
Read this post:
http://pulsar.websha...bowels-of-neon/

I've run new tests!!!
For me, the cycle counts in the documentation are "quite" correct, but you can't get a dual issue if the load/store instruction takes more than 1 cycle.

I'm not sure about that yet; I'm currently running additional tests...

The current cycle counter version does not handle dual-issued instructions with memory access.
It's written here:
http://pulsar.webshaker.net/2011/04/12/program-to-count-the-cycles-of-the-a8-cortex-v0-6/
or here (in French):
http://pulsar.webshaker.net/2011/04/12/programme-pour-compter-les-cycles-du-cortex-a8-v0-6/

...but the next version does. The new version should be online in a few days!

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 28th April 2011 at http://forums.arm.com

Is it right that if the next instruction uses Rd as an operand, it has to wait until after cycle #16 to start execution? If so, I think it is wasteful, because if there is no dependency, the next instruction may start execution at cycle #13 or #14.

Is my thought right?

Dung!

Yes, that's it...

For branches:
I do not know anything about the first stages of the ARM pipeline.
I don't know what you want to do.
But I think there is no way to know, just from the source code, whether a (conditional) branch will be mispredicted or not.

I assume that a B instruction is always correctly predicted.
For a conditional branch it's a lottery.

If you find information somewhere about how branches are predicted, I'm very interested.

I remember having tried something like this one day:

mov r0, #1
mov r10, #10000
.loop:
nop
nop
rsbs r0, r0, #1
beq .else
subs r10, r10, #1
beq .exit
nop
nop
b .loop
.else:
subs r10, r10, #1
beq .exit
nop
nop
b .loop
.exit:

I thought the branches to .else would always be mispredicted, but that was not the case.
It would be very useful to know the prediction algorithm (but I assume it must be quite secret)!!!
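
One possible explanation for this result can be sketched in Python. In the test loop, `rsbs r0, r0, #1` toggles r0 between 0 and 1, so `beq .else` is taken on every other iteration. The toy predictor below is purely illustrative and is NOT the Cortex-A8 algorithm: it uses 2-bit saturating counters, optionally indexed by the most recent branch outcomes. A single counter mispredicts the alternating pattern every time, while one bit of history is enough to learn it.

```python
def count_mispredictions(outcomes, history_bits):
    """Mispredictions of a toy predictor on a sequence of taken/not-taken outcomes.

    The predictor is a table of 2-bit saturating counters (0..3, predict taken
    when >= 2), indexed by the last `history_bits` outcomes. With 0 bits there
    is a single counter. This is a hypothetical model for illustration only.
    """
    mask = (1 << history_bits) - 1
    counters = [1] * (1 << history_bits)  # all start at "weakly not taken"
    history = 0
    misses = 0
    for taken in outcomes:
        index = history & mask
        predicted_taken = counters[index] >= 2
        if predicted_taken != taken:
            misses += 1
        if taken:
            counters[index] = min(3, counters[index] + 1)
        else:
            counters[index] = max(0, counters[index] - 1)
        history = ((history << 1) | int(taken)) & mask
    return misses

# beq .else in the loop above: taken, not taken, taken, ... for 10000 iterations
pattern = [i % 2 == 0 for i in range(10000)]
print(count_mispredictions(pattern, history_bits=0))  # 10000
print(count_mispredictions(pattern, history_bits=1))  # 1
```

So an alternating branch is only a worst case for predictors that keep no history; any history-based scheme would be expected to handle it well after a short warm-up.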

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 28th April 2011 at http://forums.arm.com

My purpose, I think, is simple. I want to develop a tool to count the number of cycles needed to execute a short piece of source code.
I don't have a board or a Cortex-A8, I am just a man of theory (

This is quite a hard job to do if you don't have hardware to check against!!!

Buy a BeagleBoard... http://www.watterott.com/en/BeagleBoard-xM
It is not very expensive!!!

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 15th April 2011 at http://forums.arm.com

Try setting up the timing function inside your program binary and measure a relatively large block of instructions so that the measurement overheads are small relative to the measurement.

You're right, isogen, but with a loop repeated 80,000,000 times
there is no real problem counting cycles, even with the time command!
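
A quick Python sketch of why a long loop tolerates measurement overhead. The 800 MHz clock and the 80,000,000 iterations come from this thread; the measured time and the 10 ms of start-up overhead are hypothetical values chosen only to show the size of the effect.

```python
def cycles_per_iteration(elapsed_seconds, clock_hz, iterations):
    """Convert a wall-clock measurement into an average cycle count per loop iteration."""
    return elapsed_seconds * clock_hz / iterations

CLOCK_HZ = 800_000_000   # clock frequency mentioned in the thread
ITERATIONS = 80_000_000  # loop count suggested in the thread

measured = 0.50          # hypothetical output of the time command, in seconds
overhead = 0.01          # hypothetical process start-up overhead, in seconds

print(cycles_per_iteration(measured, CLOCK_HZ, ITERATIONS))
print(cycles_per_iteration(measured + overhead, CLOCK_HZ, ITERATIONS))
```

Under these assumptions, 10 ms of overhead shifts the estimate by about a tenth of a cycle per iteration, which is small enough to still tell a 5-cycle loop from a 6-cycle one.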

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 27th April 2011 at http://forums.arm.com

I used the above link to check the cycles of some ARM instructions. However, I am confused about the pipeline column.
For example, there are "n0, n1, 0, 1" that happen in 1 cycle. They seem to be stages of the pipeline. However, the Cortex-A8 has a 13-stage pipeline and no stage has names like these. Also, 1 stage takes 1 cycle, right?

Please give me some explanations.

Hmm. You start with very complex questions!!!

First, I do not understand what you say about "static scheduling scoreboard, replay and pending queue".
And I do not really understand what ARM calls a "data hazard" ;(

What I can say is that if you apply the stage rules described in the ARM documentation to count cycles, you'll get a "quite" correct result.

Beyond that, there are a lot of special cases (not always documented) that can improve the quality of the counting process:
shortcuts (or fast forwarding), for example.

Branch mispredict penalty: you can't handle this kind of stall cycle, because you can't know when the ARM will mispredict a branch. It's the same problem with memory reads that miss the cache!
So you can only assume that in most cases you don't get those stall cycles, and ignore them.

For the 0 / 1 / n0 / n1: these are not stages of the pipeline.
They are the names of the 2 ARM pipelines (0 and 1) and the 2 NEON pipelines (n0 / n1).

The Cortex "can start" 4 instructions in the same cycle.
Don't believe you'll be able to execute 4 instructions every cycle! That's wrong!
But in some cases, in some cycles, the Cortex can start 4 instructions (2 ARM and 2 NEON) in the same cycle.

Note: I'm not talking about VFP, because the VFP and NEON interaction is another problem!

About the cycle counter:
I do not handle the 13 pipeline stages. I handle instructions when they enter a functional unit.
The cycle counter is not that complex (in fact the decode steps are not useful for counting cycles, I guess).

All that stuff is not very easy to understand.
To start, forget NEON and its 2 pipelines (n0 and n1).
Do some tests if you have a Cortex.
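
The idea of counting cycles at the point where instructions enter a pipeline, ignoring the earlier stages, can be sketched in Python. This is a deliberately simplified model and not the cycle counter discussed in this thread: it assumes two ARM pipelines, single-cycle instructions, and one made-up pairing rule (the second instruction of a pair may not read or write the register written by the first). The real Cortex-A8 pairing rules are more detailed.

```python
def count_cycles(instructions):
    """Count issue cycles for a list of (dest, sources) tuples.

    dest is a register name or None; sources is a tuple of register names.
    Toy rule: two consecutive instructions share a cycle unless the second
    one depends on, or overwrites, the destination of the first.
    """
    cycles = 0
    i = 0
    while i < len(instructions):
        cycles += 1                      # pipeline 0 takes the next instruction
        dest0, _ = instructions[i]
        i += 1
        if i < len(instructions):        # try to fill pipeline 1 in the same cycle
            dest1, sources1 = instructions[i]
            independent = dest0 is None or (dest0 not in sources1 and dest0 != dest1)
            if independent:
                i += 1
    return cycles

NOP = (None, ())
print(count_cycles([NOP] * 8))                                    # 4
print(count_cycles([("r0", ("r1", "r2")), ("r3", ("r4", "r5"))]))  # 1
print(count_cycles([("r0", ("r1",)), ("r2", ("r0",))]))            # 2
```

A more faithful counter would add per-instruction latencies, the NEON pipelines and the documented special cases on top of this skeleton.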

Etienne

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 16th May 2011 at http://forums.arm.com

Everything is described here:
http://pulsar.webshaker.net/2011/05/15/program-to-count-the-cycles-of-the-a8-cortex-v0-7/

a.1-0 1c

Meaning:
a: it's an ARM instruction (as opposed to a NEON or VFP instruction)
1: running cycle
0: pipeline 0
1c: the instruction takes 1 cycle to execute.
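
For anyone post-processing the tool's output, the notation above could be split into its fields with a small Python helper. The pattern is inferred from the single example `a.1-0 1c` in this post, so it is an assumption; the field names and the `parse_slot` function are hypothetical, and other prefixes or pipeline names may need a different pattern.

```python
import re

# Assumed layout: <unit>.<cycle>-<pipeline> <duration>c, e.g. "a.1-0 1c"
SLOT = re.compile(r"^(?P<unit>[a-z]+)\.(?P<cycle>\d+)-(?P<pipe>\w+)\s+(?P<duration>\d+)c$")

def parse_slot(text):
    """Split one annotation such as 'a.1-0 1c' into named fields."""
    match = SLOT.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised slot: {text!r}")
    return {
        "unit": match["unit"],            # 'a' means an ARM instruction
        "cycle": int(match["cycle"]),     # cycle in which the instruction runs
        "pipeline": match["pipe"],        # pipeline name, e.g. '0'
        "duration": int(match["duration"]),  # cycles taken to execute
    }

print(parse_slot("a.1-0 1c"))
# {'unit': 'a', 'cycle': 1, 'pipeline': '0', 'duration': 1}
```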

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 25th May 2011 at http://forums.arm.com

Dear Webshaker,
I am thinking about how to test the cycle count module for the Cortex-A8.
I think that, for each instruction, I have to combine it with every other instruction to see how they work together.
However, I have a problem: the number of ARM instructions is so large that the number of test cases is large too.

Do you have any other ideas for testing?

I'll write a post to explain how the cycle counter works and how you can write your own cycle counter in a few days (weeks)...
That will be much simpler than trying to explain part by part how the program works!!!

But your solution is not a good solution... too much work!!!

Etienne