Neon instruction timing/latency
Leo Barnes
over 7 years ago
Note: This was originally posted on 7th July 2010 at http://forums.arm.com
Hello!
I am having trouble deciphering the tables in the Cortex-A8 technical reference manual that contain the NEON advanced SIMD instruction timings. There is no explanation anywhere of what the different N values mean. I suspect that they are different steps in the pipeline, but since I have not yet been able to find any info on the NEON pipeline, they don't tell me anything.
What I would really like to see is the information that was available in the ARM1136 reference manual, specifically which registers are needed as early/late registers, result latency and so on. It is probably possible to use the supplied N values to get something similar, but I haven't managed it yet.
There is clearly some latency in the NEON instructions, since I can gain quite a bit of performance by rearranging the instructions. I would like to do this in a more scientific manner, where I can determine beforehand whether rearranging would gain anything. At the moment I simply try to place instructions that depend on each other as far apart as possible.
Best regards,
//Leo
Peter Harris
over 7 years ago
Note: This was originally posted on 7th July 2010 at http://forums.arm.com
Most of the new cores use this "consume in N{X}" and "produce in N{Y}" syntax - the pipelines are now too complex for the simpler early and late register timing model used in ARM9 and ARM11 cores. I agree it is a bit of a pain, but it's not too bad once you get used to it.
As you suggest, the {X} and {Y} numbers are the pipeline stages in which registers are consumed or results are produced. You don't actually need the pipeline details to use the tables; you can infer everything you need from the pipeline stage numbers. The important facts:
All pipeline stages take one cycle.
You can dual issue some instructions, so to optimally fill an N-cycle interlock gap you need 2N instructions of suitable pairings.
Moving data from NEON to ARM registers in Cortex-A8 is expensive, so NEON in Cortex-A8 is best used for large blocks of work with little ARM pipeline interaction [http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch16s05s02.html]. (Cortex-A9 is much better at this.)
Using the tables:
If you have an instruction which consumes a register in N1 and produces a result in N3, then the result value is not available to the next instruction until N4 - effectively 3 cycles (4 - 1) after the initial instruction was issued. You would need to fill the gap between the two dependent instructions with 6 other ARM or NEON instructions (6 rather than 3 because we dual issue).
If you have an instruction which consumes in N2 and produces in N5 (result ready in N6), followed by a dependent instruction which consumes in N1, then you have a 5-cycle latency: four cycles for the first instruction's latency (6 - 2), plus one cycle because the second instruction consumes a cycle earlier in the pipeline than the first (2 - 1).
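The arithmetic in those two examples can be written down as a small Python sketch. This is only a restatement of the rule of thumb described above, not a model of the real Cortex-A8 pipeline; the function names `result_latency` and `filler_instructions` are made up for illustration, and the stage numbers would have to be read from the TRM tables for each instruction.

```python
def result_latency(producer_consume, producer_produce, dependent_consume):
    """Cycles between a producer and a dependent instruction, following
    the rule of thumb in the post. Arguments are N-stage numbers
    (e.g. 2 for N2) taken from the timing tables."""
    # The result is usable one stage after the "produce" stage.
    ready = producer_produce + 1
    # Latency of the producer itself (e.g. 6 - 2 = 4).
    own_latency = ready - producer_consume
    # Extra cycles if the dependent reads its source earlier in the
    # pipeline than the producer reads its own (e.g. 2 - 1 = 1).
    stage_skew = producer_consume - dependent_consume
    return max(own_latency + stage_skew, 0)


def filler_instructions(latency_cycles, issue_width=2):
    """Independent instructions needed to fill the gap, assuming every
    cycle can be filled with a full dual-issue pair (optimistic)."""
    return latency_cycles * issue_width


# First example: consume N1, produce N3, dependent consumes in N1.
print(result_latency(1, 3, 1), filler_instructions(result_latency(1, 3, 1)))
# Second example: consume N2, produce N5, dependent consumes in N1.
print(result_latency(2, 5, 1))
```

The first call reproduces the 3-cycle gap and 6 filler instructions, the second the 5-cycle latency. The filler count is an upper bound on what you can usefully schedule, since it assumes every filler pair is a valid dual-issue pairing.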
Worth noting:
Only certain pairs of instructions can be dual issued, and it is alignment sensitive because the core is in-order. For example, you may have a sequence ABCD where AB and CD can be dual issued as pairs, but you happen to actually execute xA, then B, then CD. In this case x (the previous instruction) and A hit the pipeline together and were a valid dual-issue pair, so they were issued together. BC is not a valid dual-issue pair, so only B is issued, and finally CD are issued. Your sequence ABCD looks like it might take 2 cycles but actually took 3.
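A toy Python sketch of that alignment effect follows. It assumes a greedy in-order issue rule (pair the next two instructions if they form a valid pair, otherwise issue one), which is a simplification of whatever the real core does; `issue_groups` and the pairing set are hypothetical and only mirror the ABCD example above.

```python
def issue_groups(sequence, valid_pairs):
    """Split an in-order instruction sequence into per-cycle issue groups.

    sequence    -- list of instruction labels, in program order
    valid_pairs -- set of (first, second) label tuples that may dual issue
    """
    groups = []
    i = 0
    while i < len(sequence):
        # Try to pair the next instruction with the one after it.
        if i + 1 < len(sequence) and (sequence[i], sequence[i + 1]) in valid_pairs:
            groups.append(sequence[i:i + 2])
            i += 2
        else:
            # No valid pairing: this instruction issues on its own.
            groups.append(sequence[i:i + 1])
            i += 1
    return groups


# Hypothetical pairing rules for the example: xA, AB and CD may pair, BC may not.
pairs = {("x", "A"), ("A", "B"), ("C", "D")}

aligned = issue_groups(["A", "B", "C", "D"], pairs)
misaligned = issue_groups(["x", "A", "B", "C", "D"], pairs)
print(len(aligned), aligned)
print(len(misaligned), misaligned)
```

With the preceding instruction x stealing A as its partner, the same four instructions spread over three issue cycles instead of two, which is why it can be worth checking how a hot loop lines up rather than just counting valid pairs.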
Worth noting:
One of the biggest killers on modern cores, where the core clock frequency is much higher than the memory clock, is not really pipeline cycles for arithmetic instructions but memory latency when a data load misses. If a data load misses in the L2 cache, it can take _hundreds_ of cycles to fetch that data from external memory; if you miss in the TLB (MMU cache) as well as in the L2 data cache, it can take a couple of multiples of that. If you know what your data set is going to be, then issue PLD instructions, or even just manually touch the data with an LDR (even if you don't use it that time around and reload it later), as early as possible to maximize the chance that it is in cache when you actually need it.