Cortex A8 Instruction Cycle Timing
barney vardanyan
over 12 years ago
Note: This was originally posted on 17th March 2011 at
http://forums.arm.com
Hi, and sorry for my bad English.
I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (Chapter 16, the cycle timing tables).
But I have no idea how to do this using that documentation.
Gilead Kutnick
over 12 years ago
Note: This was originally posted on 9th August 2011 at
http://forums.arm.com
Hi Anil, webshaker..
I have personally found something very strange in NEON that might be related to what you're describing: if you dual-issue too many instructions in a row, you start seeing stalls. I haven't attempted to characterize this formally. All I know is that if you start with a loop containing a few dual-issued instructions (and end the loop with some ARM code that takes two cycles), it works as expected. Then, as you add more pairs, each pair eventually starts adding more than one cycle. At its worst it seemed to give even lower performance than if the instructions were all single-issued, like what Anil was seeing in his first example. At that point I was actually able to improve performance by adding nops between the pairs!
Note that this doesn't happen if you pair NEON and ARM code. You can seemingly do that as much as you want without penalty.
The only possible explanation that comes to mind is that the NEON queue could be bottlenecking its throughput. The queue is 12 instructions long, so you can fill it up in 6 cycles. You will note that the pipeline stage where NEON instructions are dispatched is more than 6 before the stage where NEON begins. This is even worse if instructions are not removed from the queue until later in the NEON pipeline. So if the queue is filled while there are still more NEON instructions to be issued it will have to stall until the NEON unit consumes the instructions. Normally this shouldn't be a problem because once the stall happens they'd reach equilibrium, where the NEON consumes two old instructions at the same rate that the dispatch adds two new instructions to the queue. But there could be something that's causing the stall to be more serious than this and costing a lot more cycles.
If the instructions are in fact not removed until the end of the pipeline then vmla.f32 would exacerbate things because it effectively adds a bunch of stages to the NEON pipeline.
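To make that hypothesis concrete, here is a toy simulation of the queue behaviour described above. This is only a sketch under assumed numbers: the 12-entry queue depth is documented, but `retire_latency` (how many cycles a dispatched pair keeps its queue slots occupied) is a made-up parameter, since we don't know at which pipeline stage entries are actually freed.

```python
# Toy model of the NEON queue-fill hypothesis. QUEUE_DEPTH is the
# documented 12-entry queue; retire_latency is an assumption.

def simulate(n_pairs, queue_depth=12, retire_latency=9):
    """Cycles needed to dispatch n_pairs of dual-issued NEON instructions."""
    retire = []        # cycle at which each in-flight pair frees its 2 slots
    occupancy = 0      # queue entries currently occupied
    dispatched = 0
    cycle = 0
    while dispatched < n_pairs:
        cycle += 1
        # free the slots of pairs whose retire time has arrived
        while retire and retire[0] <= cycle:
            retire.pop(0)
            occupancy -= 2
        # dispatch a new pair only if the queue has room for both halves
        if occupancy + 2 <= queue_depth:
            occupancy += 2
            dispatched += 1
            retire.append(cycle + retire_latency)
        # otherwise the front end stalls this cycle
    return cycle

print(simulate(10, retire_latency=1))  # slots freed quickly: 10 cycles, no stall
print(simulate(10, retire_latency=9))  # queue fills after 6 cycles: 13 cycles
```

Note that in this model the stall is a one-time cost: after the queue fills, dispatch and retirement settle into the equilibrium described above, with one pair dispatched per cycle. Whatever was happening in the real measurements was apparently worse than that, which is what makes it strange.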
One thing to try is instead of doing something like this:
vld1.32 {q0}, [r0, :128]    @ two separate 128-bit aligned loads
vmla.f32 q8, q8, q9
vld1.32 {q1}, [r0, :128]
vmla.f32 q10, q10, q11
You could try doing this:
vld1.32 {q0, q1}, [r0, :128]    @ one multi-cycle load replaces the two loads
vmla.f32 q8, q8, q9
vmla.f32 q10, q10, q11
Because multi-cycle instructions can pair on both their first and last cycle, this should perform the same. But if the queue really is a bottleneck, this may relieve pressure on it, assuming that the multi-cycle load doesn't get turned into more than one entry in the queue.
For what it's worth, I haven't had worse problems reading the same data over and over again, so I don't think that's contributing to it. This is actually useful in the real world: because loads finish in N1 and most instructions need their inputs in N2 you can actually dual issue a load with an instruction that uses the result in the same cycle. So you can use a load to prepare a constant if you don't have enough registers for it, or if you need to prepare the destination of a vmla/vmls. Note that unlike with loads there isn't a way to pair a 128-bit move with a normal instruction; you can fake a 64-bit one, though.
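As a hypothetical sketch of that trick (the register choices and the address in r2 are made up for illustration), a load can prepare the accumulator for the vmla it is paired with:

```armasm
vld1.32 {q8}, [r2, :128]    @ load result is available in N1...
vmla.f32 q8, q0, q1         @ ...inputs are read in N2, so under the timing
                            @ described above this pair can issue together
```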