instruction cycle timings for LDR1, STR1 on cortex-a8
chandrakala reddy
over 7 years ago
Note: This was originally posted on 13th March 2012 at http://forums.arm.com
Hi,
I am new to the BeagleBoard and the Cortex-A8. I have written a small piece of code to understand Cortex-A8 instruction cycle timings. The code runs in a loop of 10,000 iterations and behaves differently with different instruction combinations. The code and the cycle counts I measured are below.
With only loads, the code takes 10 cycles instead of the 6 I expected. There should be no cache issues in this case, as the same memory is loaded into each 'q' register.
VLD1.S32 {rq0},[r11@128]
VLD1.S32 {rq1},[r11@128]
VLD1.S32 {rq3},[r11@128]
VLD1.S32 {rq5},[r11@128]
VLD1.S32 {rq6},[r11@128]
VLD1.S32 {rq7},[r11@128]
The code below takes 13 cycles instead of 6. The only difference is that the code above uses loads and this code uses stores.
VST1.S32 {rq0},[r12@128]
VST1.S32 {rq1},[r12@128]
VST1.S32 {rq3},[r12@128]
VST1.S32 {rq5},[r12@128]
VST1.S32 {rq6},[r12@128]
VST1.S32 {rq7},[r12@128]
A combination of loads and stores works fine: it takes 12 cycles, which is expected. But when I change the register r12 to r11 in the store operations, the code takes 32 cycles. Why does using r11 for both the loads and the stores cost more cycles? The listing below is the 12-cycle version, with the stores going through r12.
VLD1.S32 {rq0},[r11@128]
VLD1.S32 {rq1},[r11@128]
VLD1.S32 {rq3},[r11@128]
VLD1.S32 {rq5},[r11@128]
VLD1.S32 {rq6},[r11@128]
VLD1.S32 {rq7},[r11@128]
VST1.S32 {rq0},[r12@128]
VST1.S32 {rq1},[r12@128]
VST1.S32 {rq3},[r12@128]
VST1.S32 {rq5},[r12@128]
VST1.S32 {rq6},[r12@128]
VST1.S32 {rq7},[r12@128]
Why is this happening? Why do the different combinations behave differently? Can anyone please explain?
Thanks in advance,
Chandrakala
chandrakala reddy
over 7 years ago
Note: This was originally posted on 14th March 2012 at http://forums.arm.com
Can anyone please help me with this.
-Chandrakala
Gilead Kutnick
over 7 years ago
Note: This was originally posted on 15th March 2012 at http://forums.arm.com
I can't give a very thorough answer, but in my tests I've found that the Cortex-A8 can't sustain one 128-bit load or one 128-bit store per cycle, every cycle, over a long run. There could be a bottleneck somewhere, such as the NEON load queue or the write buffer for stores. If you mix them with other instructions you can sustain one per cycle for a while, which is what you seem to be achieving with the mix of loads and stores. I can't give exact numbers, but my rough heuristic is to have at least one non-load/store instruction for every two loads/stores, although you're probably better off with more than that.
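To put the figures from the question on a common scale, here is a small Python sketch that converts each reported cycle count into cycles per load/store. The counts are the ones quoted in the question; treating each as the cost of one pass through the sequence is an assumption, and the names (`MEASUREMENTS`, `cycles_per_access`, `summarize`) are made up for illustration.

```python
# Cycle counts as reported in the question: (label, instruction count, cycles).
# Assumption: each cycle count is for one pass through the listed sequence.
MEASUREMENTS = [
    ("6 loads", 6, 10),
    ("6 stores", 6, 13),
    ("6 loads + 6 stores, r11 and r12", 12, 12),
    ("6 loads + 6 stores, both r11", 12, 32),
]


def cycles_per_access(instructions, cycles):
    """Average cycles spent per load/store instruction."""
    return cycles / instructions


def summarize(measurements):
    """Map each label to its cycles per load/store, rounded to 2 places."""
    return {
        label: round(cycles_per_access(count, cycles), 2)
        for label, count, cycles in measurements
    }


if __name__ == "__main__":
    for label, cpa in summarize(MEASUREMENTS).items():
        print(f"{label}: {cpa} cycles per load/store")
```

On these numbers, only the mixed sequence with separate address registers reaches one access per cycle; the load-only, store-only and same-address sequences all average well above that.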
I've also heard that you can get slower throughput using the same register for loads. You might be able to do better if you try different registers.
As for your second case, the problem is that NEON (unlike the integer core) doesn't have store-to-load forwarding. Since you're performing these operations in a loop, the loads from the address in r11 come immediately after the stores to that same address. Before a load can happen, the write buffer holding the new value has to be emptied, which causes a big stall. You can also see this sort of stall if you perform an unaligned load immediately after an unaligned store, at the address immediately after the store address. This is because unaligned accesses get split into multiple aligned accesses, which in this case partially overlap.
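The overlap in the unaligned case can be illustrated with a short Python sketch. It only models the reasoning above, not the hardware: the 16-byte granule, the 16-byte access size, the reading of "immediately after" as the store address plus the access size, and the function names are all assumptions chosen for illustration.

```python
GRANULE = 16  # assumed size in bytes of one aligned access (hypothetical)


def aligned_chunks(addr, size, granule=GRANULE):
    """Return the aligned chunk base addresses touched by [addr, addr + size)."""
    first = addr - addr % granule
    last_byte = addr + size - 1
    last = last_byte - last_byte % granule
    return set(range(first, last + granule, granule))


def shares_chunk(store_addr, load_addr, size=16, granule=GRANULE):
    """True if the store and the load touch at least one common aligned chunk."""
    store_chunks = aligned_chunks(store_addr, size, granule)
    load_chunks = aligned_chunks(load_addr, size, granule)
    return bool(store_chunks & load_chunks)


if __name__ == "__main__":
    # Unaligned 16-byte store at 0x1004, then an unaligned load right after it.
    print(shares_chunk(0x1004, 0x1014))
    # Aligned 16-byte store at 0x1000, then an aligned load right after it.
    print(shares_chunk(0x1000, 0x1010))
```

Under these assumptions the unaligned store at 0x1004 and the load at 0x1014 both touch the aligned chunk at 0x1010, even though their byte ranges do not overlap, while the aligned pair touches separate chunks.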