Cortex-R4: does "dual-issued pairs" really improve performance?
Christophe Beausoleil
over 12 years ago
Note: This was originally posted on 1st August 2011 at http://forums.arm.com
Hello,
Could someone help me explain this behavior?
I use a sequence of 4096 instructions (target is TMS570/Cortex-R4F):
movs r0, #1
str r0, [r8, #0]
movs r1, #2
str r1, [r8, #4]
movs r2, #3
str r2, [r8, #8]
...
When "dual-issue" mode is enabled (bits 28-31 of the Auxiliary Control Register and bits 18-20 of the Secondary Auxiliary Control Register are cleared), this code (plus a few instructions bordering it) executes in 5162 clock cycles.
When "dual-issue" mode is disabled (the same bits are set), this code executes in 4146 clock cycles!
I observe this phenomenon in both ARM and Thumb-2 modes.
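For reference, the bit fields mentioned above can be expressed as masks. The following is a minimal host-side sketch in Python, not target code: it only computes the masks for bits 28-31 and bits 18-20 and applies them to example 32-bit values. The helper names are hypothetical, and on the target the two registers would be accessed through CP15 as described in the Cortex-R4 Technical Reference Manual.

```python
WORD_MASK = 0xFFFFFFFF  # registers are 32 bits wide


def field_mask(low_bit: int, high_bit: int) -> int:
    """Return a mask with bits low_bit..high_bit (inclusive) set."""
    width = high_bit - low_bit + 1
    return ((1 << width) - 1) << low_bit


# Bit ranges as quoted in the post (hypothetical constant names).
ACTLR_DUAL_ISSUE_MASK = field_mask(28, 31)   # Auxiliary Control Register
SACTLR_DUAL_ISSUE_MASK = field_mask(18, 20)  # Secondary Auxiliary Control Register


def enable_dual_issue(actlr: int, sactlr: int) -> tuple[int, int]:
    """Per the post, dual issue is enabled when these bits are cleared."""
    return (actlr & ~ACTLR_DUAL_ISSUE_MASK & WORD_MASK,
            sactlr & ~SACTLR_DUAL_ISSUE_MASK & WORD_MASK)


def disable_dual_issue(actlr: int, sactlr: int) -> tuple[int, int]:
    """Per the post, dual issue is disabled when these bits are set."""
    return ((actlr | ACTLR_DUAL_ISSUE_MASK) & WORD_MASK,
            (sactlr | SACTLR_DUAL_ISSUE_MASK) & WORD_MASK)


print(f"{ACTLR_DUAL_ISSUE_MASK:#010x}")   # 0xf0000000
print(f"{SACTLR_DUAL_ISSUE_MASK:#010x}")  # 0x001c0000
```

Only the listed bits are touched, so any other settings in either register are preserved by a read-modify-write sequence built on these masks.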
So when "dual-issue" mode is enabled, it seems that one pipeline stage is sometimes (once out of 4) waiting for dual words, thus introducing extra wait states, in order to process them in pairs, but I can't find any description of this.
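The one-in-four estimate can be checked against the two measurements with simple arithmetic. This is a small Python sketch using only the figures quoted in the post. The variable names are illustrative, and since both totals include a few bordering instructions, the per-instruction values are approximate.

```python
INSTRUCTIONS = 4096           # length of the movs/str sequence
CYCLES_DUAL_ISSUE_ON = 5162   # measured with dual issue enabled
CYCLES_DUAL_ISSUE_OFF = 4146  # measured with dual issue disabled

# Average cycles per instruction in each configuration.
cpi_on = CYCLES_DUAL_ISSUE_ON / INSTRUCTIONS
cpi_off = CYCLES_DUAL_ISSUE_OFF / INSTRUCTIONS

# Extra cycles observed when dual issue is enabled.
extra_cycles = CYCLES_DUAL_ISSUE_ON - CYCLES_DUAL_ISSUE_OFF
extra_per_instruction = extra_cycles / INSTRUCTIONS
instructions_per_extra_cycle = INSTRUCTIONS / extra_cycles

print(f"CPI, dual issue on : {cpi_on:.3f}")    # 1.260
print(f"CPI, dual issue off: {cpi_off:.3f}")   # 1.012
print(f"extra cycles       : {extra_cycles}")  # 1016
print(f"extra per instr    : {extra_per_instruction:.3f}")         # 0.248
print(f"instrs per extra   : {instructions_per_extra_cycle:.2f}")  # 4.03
```

With dual issue disabled the sequence runs at close to one cycle per instruction, while with it enabled there is roughly one additional cycle for every four instructions, which matches the observation.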
Could someone please help me understand this? It is quite important for me because I have to produce highly deterministic real-time software, and this kind of feature is hard to model.
Thanks for any help.
Best regards
Christophe
Chris Turner
over 12 years ago
Note: This was originally posted on 27th January 2012 at http://forums.arm.com
Yes, see what they say about it. In closing, let me mention that predicting precise cycle counts for processors like Cortex-R4 is not an exact science because there are heuristics in the branch prediction and behaviours in store buffers and the like that may cause slight variations. However, you should find that real-time performance remains adequately deterministic thanks to this processor's fast interrupt entry mode in the pipeline, reduction of interrupt entry dependency on queued memory transactions, availability of TCM to store critical code and data without dependency on the main L1/L2 memory system and external bus, and the absence of any MMU that would trigger TLB misses, page table walks etc.
With best regards, Chris