Cortex A8 Instruction Cycle Timing
barney vardanyan
over 7 years ago
Note: This was originally posted on 17th March 2011 at
http://forums.arm.com
Hi, and sorry for my bad English.
I need to work out the latency of two instructions, and all I have is the ARM Cortex-A8 documentation (chapter 16).
But I have no idea how to do this using that documentation.
Gilead Kutnick
over 7 years ago
Note: This was originally posted on 9th August 2011 at
http://forums.arm.com
Hi Anil, webshaker..
I have personally found something very strange in NEON that might be related to what you're describing. It seems that if you dual-issue too many instructions in a row, you start seeing stalls. I haven't attempted to understand this formally; all I know is that if you start with a loop containing a few dual-issued instructions (and the loop ends with some ARM code that takes two cycles), it works as expected. Then, as you add more pairs, each added pair eventually starts costing more than 1 cycle. At its worst it seemed to give even lower performance than if they were all single-issued, like what Anil was seeing in his first example. At that point I've actually been able to improve performance by adding nops between the pairs!
Note that this doesn't happen if you pair NEON and ARM code. You can seemingly do that as much as you want without penalty.
The only possible explanation that comes to mind is that the NEON queue could be bottlenecking its throughput. The queue is 12 instructions long, so at two instructions per cycle you can fill it in 6 cycles. Note that the pipeline stage where NEON instructions are dispatched is more than 6 stages before the stage where NEON begins. This is even worse if instructions are not removed from the queue until later in the NEON pipeline. So if the queue fills while there are still more NEON instructions to be issued, dispatch has to stall until the NEON unit consumes some instructions. Normally this shouldn't be a problem, because once the stall happens the two sides would reach equilibrium, with NEON consuming two old instructions at the same rate that dispatch adds two new ones to the queue. But something could be making the stall more serious than this and costing a lot more cycles.
If the instructions are in fact not removed until the end of the pipeline, then vmla.f32 would exacerbate things, because it effectively adds a bunch of stages to the NEON pipeline.
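To make the queue arithmetic concrete, here is a rough toy model in Python. It is only a sketch of the reasoning in this post, not a description of the real Cortex-A8 pipeline. The 12-entry depth and the two-per-cycle issue rate come from the post; the function name, the `start_delay` of 8 cycles (standing in for "more than 6"), and the idea that NEON drains a fixed number of entries per cycle once it starts are all assumptions made for illustration.

```python
def simulate_neon_queue(num_instructions, queue_depth=12, issue_width=2,
                        drain_width=2, start_delay=8):
    """Toy model of a dispatch side feeding a fixed-depth NEON queue.

    Returns (cycles_to_dispatch_everything, dispatch_stall_cycles).
    All parameters except queue_depth=12 and issue_width=2 are assumed values.
    """
    if min(queue_depth, issue_width, drain_width) <= 0:
        raise ValueError("queue_depth, issue_width and drain_width must be positive")

    queued = 0                    # entries currently sitting in the queue
    remaining = num_instructions  # instructions not yet dispatched
    cycle = 0
    stall_cycles = 0

    while remaining > 0:
        # NEON side: starts consuming only after the assumed pipeline delay.
        if cycle >= start_delay:
            queued -= min(drain_width, queued)

        # Dispatch side: push as many as it wants, limited by free queue slots.
        wanted = min(issue_width, remaining)
        pushed = min(wanted, queue_depth - queued)
        if pushed < wanted:
            stall_cycles += 1     # dispatch could not issue at full width

        queued += pushed
        remaining -= pushed
        cycle += 1

    return cycle, stall_cycles


if __name__ == "__main__":
    # 12 instructions fit in the queue: 6 cycles, no stalls.
    print(simulate_neon_queue(12))   # (6, 0)
    # 40 instructions: a short one-time stall, then equilibrium at 2 per cycle.
    print(simulate_neon_queue(40))   # (22, 2)
```

In this model the stall is a small one-time cost, which matches the "equilibrium" case described above. Lowering `drain_width` to 1 makes dispatch stall on almost every cycle, which is one way to picture "something making the stall more serious", though the model says nothing about what that something would be on real hardware.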
One thing to try is instead of doing something like this:
vld1.32 {q0}, [r0, :128]
vmla.f32 q8, q8, q9
vld1.32 {q1}, [r0, :128]
vmla.f32 q10, q10, q11
You could try doing this:
vld1.32 {q0, q1}, [r0, :128]
vmla.f32 q8, q8, q9
vmla.f32 q10, q10, q11
Because multi-cycle instructions can pair on both the first and last cycle, this should perform the same. But if the queue really is a bottleneck, this may relieve pressure on it, assuming the multi-cycle load doesn't get turned into more than one queue entry.
For what it's worth, I haven't had worse problems reading the same data over and over again, so I don't think that's contributing to it. This is actually useful in the real world: because loads finish in N1 and most instructions need their inputs in N2, you can dual-issue a load with an instruction that uses the result in the same cycle. So you can use a load to prepare a constant if you don't have enough registers for it, or to prepare the destination of a vmla/vmls. Note that, unlike with loads, there isn't a way to pair a 128-bit move with a normal instruction; you can fake a 64-bit one, though.