Cortex A8 Instruction Cycle Timing

Note: This was originally posted on 17th March 2011 at http://forums.arm.com

Hi, sorry for my bad English.

I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16)!
But I have no idea how to do this using that documentation.

0 Etienne SOBOLE over 12 years ago

Note: This was originally posted on 11th August 2011 at http://forums.arm.com

Okay, I don't have a kind of epilogue at the end of my loop. This greatly changes the problem, and is more similar to my addition of nops. I don't think the NEON part is really taking 10 cycles, it's just that some of those cycles are overlapping with the non-NEON part.

Hmm.
In fact, you may be right!
I've tried copying the NEON code into the loop 2, 3 and 4 times.

Normally, 8 pairs of instructions should take 8 cycles, but they take 10.
16 pairs take 20 cycles
24 pairs take 38 cycles
32 pairs take 48 cycles

The time increases strangely.

I've replaced the vld1 with vtrn. The timings are:
8 pairs take 8 cycles
16 pairs take 19 cycles
24 pairs take 32 cycles
..
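
To make the trend easier to see, the measurements quoted above can be expressed as cycles per instruction pair, where the ideal would be 1.0. This is only a small arithmetic sketch in Python; the dictionary and function names are illustrative, and the numbers are exactly the ones reported in this post.

```python
# Measured cycle counts from this post, keyed by number of instruction pairs.
vld1_cycles = {8: 10, 16: 20, 24: 38, 32: 48}
vtrn_cycles = {8: 8, 16: 19, 24: 32}


def cycles_per_pair(measurements):
    """Return the average cycles per pair for each measured loop size."""
    return {pairs: cycles / pairs for pairs, cycles in measurements.items()}


for name, data in (("vld1", vld1_cycles), ("vtrn", vtrn_cycles)):
    for pairs, ratio in cycles_per_pair(data).items():
        print(f"{name}: {pairs:2d} pairs -> {ratio:.2f} cycles/pair")
```

In both cases the cost per pair grows as more pairs are added instead of staying constant, which is the "strange" increase described above.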

So now I think you're right. There is a bottleneck due to bandwidth.
And it's more visible with vld1 than with vtrn because when the vld1 is pushed to the NEON queue, the VALUE of the address register is pushed too.

This is conventional wisdom, but on NEON I would suggest the exact opposite. Loads make their data available in N1 and many NEON instructions need their data available in N2. In this case it's even possible to load and use the result of the load in the same cycle. If the memory is stalled for some other reason like a cache miss I don't think having the instruction ahead will help you.

We don't agree on whether that could happen.

I'd still like to see some evidence that this causes problems and what exactly.

Just replace

vld1.32 {d16,d17},[r1:128]
vmul.f32 d0,d15,d14
vld1.32 {d18,d19},[r2:128]
vmul.f32 d1,d15,d14
vld1.32 {d20,d21},[r3:128]
vmul.f32 d2,d15,d14
vld1.32 {d22,d23},[r4:128]
vmul.f32 d3,d15,d14
vld1.32 {d24,d25},[r1:128]
vmul.f32 d4,d15,d14
vld1.32 {d26,d27},[r2:128]
vmul.f32 d5,d15,d14
vld1.32 {d28,d29},[r3:128]
vmul.f32 d6,d15,d14
vld1.32 {d30,d31},[r4:128]
vmul.f32 d7,d15,d14

with

vld1.32 {d16,d17},[r1:128]
vmul.f32 d0,d15,d14
vld1.32 {d18,d19},[r1:128]
vmul.f32 d1,d15,d14
vld1.32 {d20,d21},[r1:128]
vmul.f32 d2,d15,d14
vld1.32 {d22,d23},[r1:128]
vmul.f32 d3,d15,d14
vld1.32 {d24,d25},[r1:128]
vmul.f32 d4,d15,d14
vld1.32 {d26,d27},[r1:128]
vmul.f32 d5,d15,d14
vld1.32 {d28,d29},[r1:128]
vmul.f32 d6,d15,d14
vld1.32 {d30,d31},[r1:128]
vmul.f32 d7,d15,d14

...and you'll have your proof.
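
For anyone who wants to reproduce the comparison with more than 8 pairs, the two sequences can be generated instead of typed by hand. The following Python sketch is only a convenience; the function name `neon_pairs` and its parameters are hypothetical, and it simply prints the same vld1/vmul pattern shown above, either rotating through r1 to r4 or always using r1.

```python
def neon_pairs(address_regs, copies=1):
    """Build the vld1/vmul test sequence as a list of assembly lines.

    address_regs: address registers to cycle through for the loads.
    copies: how many times the 8-pair block is repeated.
    """
    lines = []
    for _ in range(copies):
        for i in range(8):
            lo = 16 + 2 * i  # loads fill d16..d31, two D registers at a time
            addr = address_regs[i % len(address_regs)]
            lines.append(f"vld1.32 {{d{lo},d{lo + 1}}},[{addr}:128]")
            lines.append(f"vmul.f32 d{i},d15,d14")
    return lines


rotating = neon_pairs(["r1", "r2", "r3", "r4"])  # first variant above
single = neon_pairs(["r1"])                      # second variant above

print("\n".join(rotating))
print()
print("\n".join(single))
```

Passing `copies=2`, `3` or `4` repeats the block, which matches the 16, 24 and 32 pair experiments; the output still has to be pasted into the timing loop and measured on the real hardware.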

What you won't have is a logical explanation.

For a moment, I thought that maybe the address register could not be used directly on the next cycle, but if you replace the ADD with a MOV that just duplicates the register value, the problem is the same!