Cortex A8 Instruction Cycle Timing
barney vardanyan
over 7 years ago
Note: This was originally posted on 17th March 2011 at http://forums.arm.com
Hi, sorry for my bad English.
I need to count the latency of two instructions, and all I have is the Arm Cortex-A8 documentation (chapter 16).
But I have no idea how to do this using that documentation.
Gilead Kutnick
over 7 years ago
Note: This was originally posted on 10th August 2011 at http://forums.arm.com
Yeah, I may be misremembering the queue length. I'll have to check again later today when I have access to the description.
I thought I remembered issuing on both the first and last cycle, but I'm having trouble doing it now too. I'm also having trouble getting the loop you mentioned earlier down to 10 cycles. It looks like it's taking at least 12. The entire loop is taking 14; since there is stalling, it's difficult to tell how much, if any, is overlapping the 2 cycles of integer loop overhead. You would think that at least one cycle would be overlapped, since it's purely a fetch cycle.
The number of cycles stays the same for me regardless of whether I load to different registers or use different base registers, with the same arrangement as in your example. Maybe we're using different versions of Cortex-A8? I'm using an OMAP3530, how about you?
Here are some interesting things I've observed:
1) If I add one or two pairs of nops in the middle I get the same speed (14 cycles for the loop). If I add a third pair the loop drops to 13 cycles. With the fourth pair it goes back up to 14 cycles, and every pair after that adds 2 cycles. So with 3 nop pairs I get no stalls in the NEON code, because there are 12 pairs of instructions (+1 cycle for fetch stall).
2) If I change three or more of the vld1s to independent vext.8 I get 10 cycles, or full pairing. Same with vmovn, vswp, vrev16, vzip, and vuzp. So the bottleneck is not dual-issue, it's loads and stores.
3) If I change to 64-bit loads instead of 128-bit I still get 14 cycles for the loop. So I don't think it's a bandwidth limitation.
4) If I change to 64-bit or 128-bit stores I get 21 cycles for the loop. However, if I store to separate 16-byte addresses in a 64-byte block I get something like 15.5 cycles (this is with a cache-line-aligned destination). This is probably due to coalescing filling a whole cache line in the write buffer, where otherwise the cache line has to be loaded. I tried "warming" the buffer by memcpying it to itself to make sure it was in L1 cache, but that didn't make a difference.
5) If I change the vmul.f32s to vmla.f32, things get bad. If I start at a baseline of no pairing I get the expected 9 cycles. Then pairing a single vmovn turns it into 12, and from there every new pair adds 4 cycles. I get the same cycles with vrecps.f32, and presumably will with the other chained pipeline instructions.
So I guess the lessons are to not do too many loads/stores in a row, and that chained pipeline instructions hate being dual-issued with anything for some reason. We should do some more testing to see if there are any other instructions that cause a big penalty when dual-issued like this.
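For anyone who wants to compare their own numbers against these, here is a rough Python sketch that only restates the cycle counts from observations 1 and 5 above. The function names are made up for illustration. The issue bound of 9 pairs plus one fetch-stall cycle is an assumption inferred from the "12 pairs (+1 cycle for fetch stall)" figure at 3 nop pairs; it is not a pipeline model and is not taken from the Cortex-A8 documentation.

```python
# Reported cycles per loop iteration (OMAP3530, Cortex-A8), restated as code.
# All names here are hypothetical; the numbers come from the post above.

BASE_PAIRS = 9          # assumption: 12 pairs at 3 nop pairs implies 9 base pairs
FETCH_STALL_CYCLES = 1  # the "+1 cycle for fetch stall" from observation 1


def nop_pair_cycles(nop_pairs: int) -> int:
    """Observation 1: measured loop cycles after inserting nop pairs."""
    if nop_pairs < 0:
        raise ValueError("nop_pairs must be non-negative")
    if nop_pairs <= 2:
        return 14
    if nop_pairs == 3:
        return 13
    # From the fourth pair on, each extra pair was reported to add 2 cycles.
    return 14 + 2 * (nop_pairs - 4)


def issue_bound(nop_pairs: int) -> int:
    """Assumed stall-free bound: one cycle per pair plus the fetch stall."""
    return BASE_PAIRS + nop_pairs + FETCH_STALL_CYCLES


def stall_estimate(nop_pairs: int) -> int:
    """Measured cycles minus the assumed stall-free bound."""
    return nop_pair_cycles(nop_pairs) - issue_bound(nop_pairs)


def vmla_pair_cycles(paired: int) -> int:
    """Observation 5: reported cycles with vmla.f32 and `paired` dual-issued partners."""
    if paired < 0:
        raise ValueError("paired must be non-negative")
    if paired == 0:
        return 9
    # The first pairing costs 12 cycles; each further pair adds 4.
    return 12 + 4 * (paired - 1)


if __name__ == "__main__":
    for n in range(6):
        print(n, nop_pair_cycles(n), issue_bound(n), stall_estimate(n))
```

Under that assumption the estimate comes out to zero stall cycles at 3 nop pairs, which matches the "no stalls in the NEON code" remark; the other rows are only as good as the assumed bound.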