This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi) sorry for bad English

I need to count latency for two instruction, and all I have is the arm cortex A 8 documantation(charter 16) !
but I have no idea how can do this work using that documantation(

0 barney vardanyan over 10 years ago

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

I need something like this

http://pulsar.webshaker.net/ccc/result.php?lng=fr
Cancel
Up 0 Down

Cancel
0 barney vardanyan over 10 years ago

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

thnaks a lot)))
Cancel
Up 0 Down

Cancel
0 barney vardanyan over 10 years ago

Note: This was originally posted on 18th March 2011 at http://forums.arm.com

I just need algoritm, wich count latence beetwen two instructions

Is there any idea???(
Cancel
Up 0 Down

Cancel
0 barney vardanyan over 10 years ago

Note: This was originally posted on 14th April 2011 at http://forums.arm.com

Yeah my beagle board has 500MHz frequncy.
I measur time with linux time command)

and what about caches, how can I turn it on)?

and wich compiler you use? gcc or other?
Cancel
Up 0 Down

Cancel
0 barney vardanyan over 10 years ago

Note: This was originally posted on 12th April 2011 at http://forums.arm.com

mmmm

10 NOP instructions on my beagle takes 2.508 s
Cancel
Up 0 Down

Cancel
0 barney vardanyan over 10 years ago

Note: This was originally posted on 21st March 2011 at http://forums.arm.com

And when for example, executing
mul r1, r2 , r3
mul command can issue only in pipeline 0, but does it blok pipeline 1 too??
Cancel
Up 0 Down

Cancel
0 MayaDirect Marketing Safieddine over 10 years ago

Note: This was originally posted on 30th March 2011 at http://forums.arm.com

anytime Etienne! though I quit ARMing
Cancel
Up 0 Down

Cancel
0 Anil M Sripadarao over 10 years ago

Note: This was originally posted on 12th August 2011 at http://forums.arm.com

Hum.
In fact, You may be right !
I've tryed to copy 2, 3 and 4 times the NEON code into the loop

normally, 8 couples of instruction should take 8 cycles but it takes 10
16 couples takes 20
24 couples takes 38
32 couples takes 48

The time increase strangely.

I've replaced the vld1 by vtrn. the timing are
8 couples takes 8 cyles

16 couples takes 19 cyles

24 couples takes 32 cyles
..

So now. I'm think you're right. There is a bottleneck due to bandwith.
And it's more visible with vld1 than with vtrn because when the vld1 is pushed to the NEON queue, the VALUE of the address register is pushed too.

We are not agree that's could arrives

Just replace

vld1.32 {d16,d17},[r1:128] vmul.f32 d0,d15,d14 vld1.32 {d18,d19},[r2:128] vmul.f32 d1,d15,d14 vld1.32 {d20,d21},[r3:128] vmul.f32 d2,d15,d14 vld1.32 {d22,d23},[r4:128] vmul.f32 d3,d15,d14 vld1.32 {d24,d25},[r1:128] vmul.f32 d4,d15,d14 vld1.32 {d26,d27},[r2:128] vmul.f32 d5,d15,d14 vld1.32 {d28,d29},[r3:128] vmul.f32 d6,d15,d14 vld1.32 {d30,d31},[r4:128] vmul.f32 d7,d15,d14

by

vld1.32 {d16,d17},[r1:128] vmul.f32 d0,d15,d14 vld1.32 {d18,d19},[r1:128] vmul.f32 d1,d15,d14 vld1.32 {d20,d21},[r1:128] vmul.f32 d2,d15,d14 vld1.32 {d22,d23},[r1:128] vmul.f32 d3,d15,d14 vld1.32 {d24,d25},[r1:128] vmul.f32 d4,d15,d14 vld1.32 {d26,d27},[r1:128] vmul.f32 d5,d15,d14 vld1.32 {d28,d29},[r1:128] vmul.f32 d6,d15,d14 vld1.32 {d30,d31},[r1:128] vmul.f32 d7,d15,d14

...and you'll have your proof.

What you'll not have is a logical explanation

During a moment, I thaught that maybe the address register could not be used directly on the next cycle, but if you replace the ADD by MOV to just duplicated register value, the problem is the same !

Hi Etienne,

I also noticed that the cycle count decreases by using the different registers. In the process I encountered one more strange behavior. If I use three buffers with 4096 bytes size, the cycle count increases to around 140 cycles. I have given the code here, r1,r2,r3 has the address of buffer1,buffer2 and buffer3 respectively.



Int buffer1[1024];

Int buffer2[1024];

Int buffer3[1024];

ASM Code:



[indent] vld1.32 {d16,d17},[r1:128]

vmul.f32 d0,d15,d14

vld1.32 {d18,d19},[r2:128]

vmul.f32 d1,d15,d14

vld1.32 {d20,d21},[r3:128]

vmul.f32 d2,d15,d14

vld1.32 {d22,d23},[r1:128]

vmul.f32 d3,d15,d14

vld1.32 {d24,d25},[r2:128]

vmul.f32 d4,d15,d14

vld1.32 {d26,d27},[r3:128]

vmul.f32 d5,d15,d14

vld1.32 {d28,d29},[r1:128]

vmul.f32 d6,d15,d14

vld1.32 {d30,d31},[r2:128]

vmul.f32 d7,d15,d14

[/indent] However, if the buffer size is not a multiple of 4096, cycle count is normal.



Regards,

Anil M S
Cancel
Up 0 Down

Cancel
0 Anil M Sripadarao over 10 years ago

Note: This was originally posted on 16th August 2011 at http://forums.arm.com

140 cycles for each iteration of the loop ?

Yeah, 140 cycles per single iteration.

Regards,
Anil M S
Cancel
Up 0 Down

Cancel
0 Anil M Sripadarao over 10 years ago

Note: This was originally posted on 9th August 2011 at http://forums.arm.com

hum.

It was so strange that I've made the test.
I do not find exactly the same result as yours but the problem il still there.

That's really strange !!!

if you replace VMLA.F32 by VMUL.F32 or VMLA.U32 the problem is solved.

So I assume that the shortcut of the vmla.f32 is not applied if there is another instruction between the mul and the mla.
It seem's that this problem is only true for float MLA !

That's strange.

What is more strange is why the first code take so many time while it should take 9 cycles (if we don't use vmla.f32) !

I've tried to change the value of the adress register
Finally I changed the address register value.
add r2, r1, #16 add r3, r2, #16 add r4, r3, #16 b .loop1 .align 4 .loop1: vld1.32 {d16,d17},[r1:128] vmul.f32 d0,d15,d14 vld1.32 {d18,d19},[r2:128] vmul.f32 d1,d15,d14 vld1.32 {d20,d21},[r3:128] vmul.f32 d2,d15,d14 vld1.32 {d22,d23},[r4:128] vmul.f32 d3,d15,d14 vld1.32 {d24,d25},[r1:128] vmul.f32 d4,d15,d14 vld1.32 {d26,d27},[r2:128] vmul.f32 d5,d15,d14 vld1.32 {d28,d29},[r3:128] vmul.f32 d6,d15,d14 vld1.32 {d30,d31},[r4:128] vmul.f32 d7,d15,d14 subs r0, r0, #1 bgt .loop1

This code (the NEON part only) take now 10 cycles. It should take only 8 cycles.
I assume that there is a conflict into the memory file of NEON when you use the same address register.

So.
1 - don't put instruction between MUL and MAL when you use float opérations.
2 - don't read the same data with NEON (in you never have to do that. You've made thins because you try a bench. In real life this case never happend).

NEON is not fully detailled in the documentation. There is a lot of hint you'll have to found by testing.
I do not know the both you found !

Etienne.

Hi Etienne,

Thank you very much for your response. The behavior is strange indeed, and seems it is not documented anywhere. Do you think I should approach ARM people to find whether this is an anomaly. And also find whether the behavior is documented? Please give your suggestion on this.

Regards,

Anil M S
Cancel
Up 0 Down

Cancel
0 Anil M Sripadarao over 10 years ago

Note: This was originally posted on 2nd August 2011 at http://forums.arm.com

Hi all,
I am doing some profiling analysis on Cortex A8 processor using the Beagle Board-xM. I found a strange behavior with the following piece of code. The code takes 46 cycles. But looking at the code we can see that there is no dependency among each other, so ideally it should have taken only 9 cycles.

Code:
[indent][indent]/* 46 cycles. */
vld1.32 {d16,d17},[r1:128];
vmla.f32 d0,d15,d14;
vld1.32 {d18,d19},[r1:128];
vmla.f32 d1,d15,d14;
vld1.32 {d20,d21},[r1:128];
vmla.f32 d2,d15,d14;
vld1.32 {d22,d23},[r1:128];
vmla.f32 d3,d15,d14;
vld1.32 {d24,d25},[r1:128];
vmla.f32 d4,d15,d14;
vld1.32 {d26,d27},[r1:128];
vmla.f32 d5,d15,d14;
vld1.32 {d28,d29},[r1:128];
vmla.f32 d6,d15,d14;
vld1.32 {d30,d31},[r1:128];
vmla.f32 d7,d15,d14;
vld1.32 {d12,d13},[r1:128];
vmla.f32 d8,d15,d14;

[/indent][/indent]However, if I seperate the vmla and vld then the behavior is as expected, i.e the following codes take 9 and 11 cycles respectively.

[indent][indent]/* 9 cycles. */
vmla.f32 d0,d15,d14;
vmla.f32 d1,d15,d14;
vmla.f32 d2,d15,d14;
vmla.f32 d3,d15,d14;
vmla.f32 d4,d15,d14;
vmla.f32 d5,d15,d14;
vmla.f32 d6,d15,d14;
vmla.f32 d7,d15,d14;
vmla.f32 d8,d15,d14;

/* 11 cycles. */
vld1.32 {d16,d17},[r1:128];
vld1.32 {d18,d19},[r1:128];
vld1.32 {d20,d21},[r1:128];
vld1.32 {d22,d23},[r1:128];
vld1.32 {d24,d25},[r1:128];
vld1.32 {d26,d27},[r1:128];
vld1.32 {d28,d29},[r1:128];
vld1.32 {d30,d31},[r1:128];
vld1.32 {d12,d13},[r1:128];

[/indent][/indent]Can some one please let me know whether I am missing something here or my understanding is wrong.

Thanks,
Anil M S
Cancel
Up 0 Down

Cancel
0 Anil M Sripadarao over 10 years ago

Note: This was originally posted on 9th August 2011 at http://forums.arm.com

What is your test procedure?
You have made a loop executed 1000 times (for example) and you have found 46.000 cycles for the first example
and (11 + 9) * 1000 = 20.000 cycles for the second?

Hi Etienne,
That's true. I have a loop executed 1000 times and I am getting 46,000 cycles for the first example and 21 for the second example.
I have given the whole function for your reference. r0 has the loop count and r1 the input buffer pointer.

First example:

[indent][indent].text;
.align 4;
.global vmlaq_vld_f32_interleaved;
.type vmlaq_vld_f32_interleaved,%function;

vmlaq_vld_f32_interleaved:

core_loop_beg6:
vld1.32 {d16,d17},[r1:128];
vmla.f32 d0,d15,d14;
vld1.32 {d18,d19},[r1:128];
vmla.f32 d1,d15,d14;
vld1.32 {d20,d21},[r1:128];
vmla.f32 d2,d15,d14;
vld1.32 {d22,d23},[r1:128];
vmla.f32 d3,d15,d14;
vld1.32 {d24,d25},[r1:128];
vmla.f32 d4,d15,d14;
vld1.32 {d26,d27},[r1:128];
vmla.f32 d5,d15,d14;
vld1.32 {d28,d29},[r1:128];
vmla.f32 d6,d15,d14;
vld1.32 {d30,d31},[r1:128];
vmla.f32 d7,d15,d14;
vld1.32 {d10,d11},[r1:128];
vmla.f32 d8,d15,d14;
subs r0,r0,#1;
bgt core_loop_beg6;
core_loop_end6:
BX lr;
[/indent][/indent]
Second example.
[indent][indent].text;
.align 4;
.global vld1_aligned;
.type vld1_aligned,%function;

vld1_aligned:

core_loop_beg:

vmla.f32 d0,d15,d14;
vmla.f32 d1,d15,d14;
vmla.f32 d2,d15,d14;
vmla.f32 d3,d15,d14;
vmla.f32 d4,d15,d14;
vmla.f32 d5,d15,d14;
vmla.f32 d6,d15,d14;
vmla.f32 d7,d15,d14;
vmla.f32 d8,d15,d14;

vld1.32 {d16,d17},[r1:128];
vld1.32 {d18,d19},[r1:128];
vld1.32 {d20,d21},[r1:128];
vld1.32 {d22,d23},[r1:128];
vld1.32 {d24,d25},[r1:128];
vld1.32 {d26,d27},[r1:128];
vld1.32 {d28,d29},[r1:128];
vld1.32 {d30,d31},[r1:128];
vld1.32 {d12,d13},[r1:128];

subs r0,r0,#1;
bgt core_loop_beg;
core_loop_end:
BX lr;
[/indent][/indent]Regards,
Anil M S
Cancel
Up 0 Down

Cancel
0 庆振丁 over 10 years ago

Note: This was originally posted on 6th April 2011 at http://forums.arm.com

hello,everyone.
Cancel
Up 0 Down

Cancel
0 barney vardanyan over 10 years ago

Note: This was originally posted on 12th April 2011 at http://forums.arm.com

Yes, I already understand that))))))))))))))))))))))))))
Cancel
Up 0 Down

Cancel
0 barney vardanyan over 10 years ago

Note: This was originally posted on 11th April 2011 at http://forums.arm.com

I have made tests, and get very strange results.

the test is folowing

I have some cycle and nop intsructions in it.

first I have only one nop in cycle, it's takes 0.570 s on my board
then I add second nop instruction, and it's also takes 0.57 s. and it's true, becaues they executed in thy same cycle(pipeline 0 and pipeline 1)
then I continue to add nop instructions.
So when I got to 5-6 nop pair, I realize that they can't execute parraler
and after every nop instruction that I add increase the time of execution

So why after 5 nop, they don't executed parraler, IS there any idea?

here is the test file)
Cancel
Up 0 Down

Cancel