NEON pipeline stages in instruction timing
Kun Feng
over 7 years ago
Note: This was originally posted on 3rd April 2012 at http://forums.arm.com
I'm trying to understand instruction timing on the Cortex-A8/A9 in more detail.
In the A8 TRM, timing is described with stage names such as E1 or N2, which I take to mean the "Execution 1" stage of the ARM pipeline and the "Execution 2" stage of the NEON pipeline. Is that right?
I assume some cycles are spent fetching and decoding before execution starts. How many cycles do fetch and decode take? Are they the same for ARM and NEON instructions?
I found a figure of the A8 pipeline after googling.
Is it an accurate description of the A8 pipeline?
Assuming it is, NEON instructions are decoded after the ARM pipeline. Does that mean NEON instructions have to pass through the entire ARM pipeline before they are decoded? When does dual issue happen: after decoding, before entering the pipeline? Why do NEON instructions need to be decoded twice? Isn't that a waste of time and die area?
To sum up: how do I calculate the total number of cycles a NEON instruction takes, from fetch to write-back, taking dual issue into account?
Thank you so much.
Kun Feng
over 7 years ago
Note: This was originally posted on 4th April 2012 at http://forums.arm.com
Thank you so much for such a careful answer.
I have a few more questions:
A two-cycle NEON instruction such as VMUL.F32 Qd, Qn, Dm[x] computes Qd over two cycles, the low half first and then the high half.
What does the VMLA instruction do after finishing the first cycle: does it return to the instruction queue, or to the beginning of the pipeline? I'm confused here. It seems it should return to the instruction queue, because the last cycle of a multi-cycle data-processing instruction can be dual-issued with a load/store instruction. But it also seems it should stay in the pipeline, because the result must be generated in the N9 stage. Which one is right?
A similar question:
According to the TRM, instructions such as VMLA and VMLS start execution on the FP multiply pipeline, and the result is then forwarded to the FP add pipeline.
Does "forwarded" mean a shortcut that skips the instruction queue?
A final question:
What happens with the two-cycle multiply-accumulate instruction VMLA.F32 Qd, Qn, Dm[x]?
Thanks
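For readers following the questions above, here is a minimal Python sketch of the data flow the post describes: a Q-register by-scalar operation handled in two passes, the low D half first and then the high D half. It models only which lanes are written in each pass. It says nothing about issue queues, forwarding paths, or N-stage timing on the Cortex-A8. The function name `by_scalar_two_pass` is hypothetical, Python floats are double precision rather than F32, and showing unwritten VMUL lanes as 0.0 is an assumption made for readability.

```python
from typing import List, Optional


def by_scalar_two_pass(qn: List[float], scalar: float,
                       acc: Optional[List[float]] = None) -> List[List[float]]:
    """Model Qd after each of two passes of a by-scalar operation.

    acc is None  -> behaves like VMUL.F32 Qd, Qn, Dm[x]  (Qd = Qn * scalar)
    acc is given -> behaves like VMLA.F32 Qd, Qn, Dm[x]  (Qd = Qd + Qn * scalar)

    Returns a snapshot of the four Qd lanes after pass 0 (low half)
    and after pass 1 (high half).
    """
    if len(qn) != 4 or (acc is not None and len(acc) != 4):
        raise ValueError("a Q register holds four F32 lanes")

    # Assumption: for the VMUL case, lanes not yet written are shown as 0.0.
    qd = list(acc) if acc is not None else [0.0] * 4
    snapshots = []
    for half in (0, 1):                      # 0 = low D register, 1 = high D register
        for lane in (2 * half, 2 * half + 1):
            product = qn[lane] * scalar      # multiply step
            if acc is not None:
                qd[lane] = qd[lane] + product  # separate add step, as in VMLA
            else:
                qd[lane] = product
        snapshots.append(list(qd))
    return snapshots


if __name__ == "__main__":
    print(by_scalar_two_pass([1.0, 2.0, 3.0, 4.0], 2.0))
    print(by_scalar_two_pass([1.0, 2.0, 3.0, 4.0], 2.0, acc=[10.0] * 4))
```

The multiply and the add are kept as separate steps to mirror the TRM statement quoted in the post that VMLA starts on the FP multiply pipeline and its result is then forwarded to the FP add pipeline.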