This question is a kind of survey:
Hi folks,
I know this subject, or almost the same one, has already been discussed, but I couldn't find an appropriate answer.
First, a short reminder:
The Cortex-M3 instruction set offers three ways to load a 32-bit literal (address or constant) into a register:
1/ Using a literal-pool and PC-relative load:
LDR Rx, [PC, #offset_to_constant]
2/ Using a MOVW/MOVT pair, to load the constant in two steps:
MOVW Rx, #least_significant_halfword
MOVT Rx, #most_significant_halfword
3/ Using the 'Flexible Second Operand' (but it is out of scope for my question):
if the constant:
- can be obtained by shifting an 8-bit value left (e.g: '000001FE'h = 'FF'h << 1)
- has the format '00XY00XY'h
- has the format 'XY00XY00'h
- has the format 'XYXYXYXY'h
MOV.W Rx, #constant
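As a quick cross-check of the four cases above, here is a small Python sketch of my own (a simplified reading of these cases, not the full ARMv7-M modified-immediate rule) that tests whether a 32-bit constant matches one of them:

```python
def fits_flexible_operand(c):
    """Check the four encodable cases listed above (simplified;
    the exact Thumb-2 rules are in the ARMv7-M ARM)."""
    c &= 0xFFFFFFFF
    # case 1: an 8-bit value shifted left
    for s in range(25):
        if (c >> s) <= 0xFF and ((c >> s) << s) == c:
            return True
    lo = c & 0xFF
    mid = (c >> 8) & 0xFF
    if c == (lo | (lo << 16)):                           # '00XY00XY'h
        return True
    if c == ((mid << 8) | (mid << 24)):                  # 'XY00XY00'h
        return True
    if c == (lo | (lo << 8) | (lo << 16) | (lo << 24)):  # 'XYXYXYXY'h
        return True
    return False

print(fits_flexible_operand(0x000001FE))  # → True  ('FF'h << 1)
print(fits_flexible_operand(0x12345678))  # → False (needs MOVW/MOVT or a literal)
```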
Based on these elements, I made the following analysis:
[A - instruction timing]
From the instruction timing, we have the following results:
# literal-pool version :
code size : 6 or 8 bytes, depending on the relative offset to the constant (6 if offset % 4 == 0 and offset is in [0, 1020])
speed : 2 cycles for the LDR (or 1 cycle in some cases)
# MOVW/MOVT version :
code size : 8 bytes
speed : 2 cycles
[B - Cache]
From a cache-usage point of view, more precisely with a unified code/data cache (in my case):
As the MOVW and MOVT instructions are contiguous, the principle of locality is respected.
For the 'literal-pool' version, the instruction and the constant pool are separated by more than a cache line, so the principle of locality is not respected; this could induce misses for subsequent data accesses in the system memory space.
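To make the locality argument concrete, here is a toy Python sketch (hypothetical addresses and an assumed 32-byte cache line) that counts how many cache lines each variant touches:

```python
LINE_SIZE = 32  # assumed cache-line size in bytes

def cache_lines(addresses):
    """Return the set of cache-line indices touched by the accesses."""
    return {addr // LINE_SIZE for addr in addresses}

ldr_addr  = 0x08000100          # hypothetical address of the LDR
pool_addr = ldr_addr + 0x200    # literal pool placed 512 bytes away

# literal-pool version: one instruction fetch + one data fetch, two lines
lp_lines = cache_lines([ldr_addr, pool_addr])
# MOVW/MOVT version: two contiguous instruction fetches, one line here
mm_lines = cache_lines([ldr_addr, ldr_addr + 4])
print(len(lp_lines), len(mm_lines))  # → 2 1
```

With these example numbers the literal-pool variant occupies a second cache line for the constant, which is exactly the locality cost described above.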
[Conclusions]
# 'literal-pool' needs the same number of cycles as 'MOVW/MOVT' : 2 cycles vs 2 cycles
# 'literal-pool' takes up less room in the prefetch-unit buffer than 'MOVW/MOVT' : 16 % vs 66 %
# 'literal-pool' needs fewer instruction fetches on the I-Code bus than 'MOVW/MOVT' : 1 fetch vs 2 fetches
# 'literal-pool' needs more data fetches on the D-Code bus than 'MOVW/MOVT' : 1 fetch vs 0 fetches
# 'literal-pool' doesn't respect the principle of locality for cache programming, whereas 'MOVW/MOVT' does
Note: I know that the compiler seems to favor the literal-pool version because of the code-size gain when the same constant is used in several places.
After this short analysis (assuming it is correct), I'm no longer sure which strategy gives the best execution speed.
Question 1:
Does anyone have any feedback on this topic?
Question 2 (subsidiary):
Does anyone know if these elements are taken into account by development tools like Keil nowadays?
Thanks.
I think it won't matter much on the Cortex-M3, except for the case where you can use a literal load (LDR) and save a clock cycle.
-But you will need to test and verify this on the particular device you're building code for; it may be different on another Cortex-M3 implementation!
I know that on the LPC175x / LPC176x, it's possible to pipeline the LDR instructions (16-bit instructions are usually the best candidates, due to possible misalignment of 32-bit instructions).
If you start on a Cortex-M3 and later upgrade your hardware with a drop-in replacement Cortex-M4 or Cortex-M7, your results will most likely differ from your initial tests.
There's another benefit if you use the literal pool: addresses are easier to relocate, in case you're building code that needs a feature like that.
These days, though, we're clever enough to write code loaders that can locate addresses split into multiple bitfields and add a value to them.
You'll probably not feel much difference between loading from a nearby literal pool and loading from one located far away, except if your CPU is working full-time on tight-timing data movement and has a lot of DMA copying data at the same time.
Joseph Yiu might be able to go into detail on when and why the cache would miss and when and why this would insert stalls (and most likely provide a much better answer than mine).
Hi guys,
The analysis you guys have done so far already covers most of the important points.
A few extra things:
For cache-based designs (note: Cortex-M0/M0+/M3/M4 don't have an internal cache, but some MCU vendors add a system-level cache):
If the specific fragment of code is executed frequently, the literal data can be held in the data cache and you will likely get a cache hit, so both methods give you the same performance on Cortex-M3/M4 (2 cycles each). If the fragment is rarely used, you might get cache misses on both the instructions and the data. In that case, using MOVW/MOVT might be faster.
Note:
If you have lots of DMA operations in the background, then the performance impact will more likely depend on the bus system design.
The choice between literal data and MOVW/MOVT probably doesn't matter, as both methods require access to the program code memory space.
If you have both an instruction cache and a data cache, then you are likely to get cache hits most of the time, leaving the memory bus free for the DMA operations.
Hope this helps.
regards,
Joseph
Thank you, Joseph, this really shed some light in the corners!
While reading your reply, I started thinking about conditional execution. Here it might be beneficial to have a single LDRcc instead of MOVW+MOVT: execution will still take the same number of clock cycles, but with LDRcc you'll be able to squeeze in another conditional instruction. That might save an extra clock cycle or two, depending on the task.
Yes, well spotted. An IT instruction block only allows 4 instructions, so using LDR lets you add more conditional operations in the same block.
Thanks all for your interesting feedback.
(Sorry for this late answer, I was a bit busy these last weeks.)
In fact, my MCU vendor put in place a unified N-way set-associative cache.
Therefore, both data and instruction accesses go through this single cache.
In this specific case, I'm not sure what is the best strategy.
jyiu and jensbauer, do you have any tips / information / opinions for this kind of configuration: a unified cache?
My fear is that we would need to juggle the positions of the code and of the data it manipulates.
Regards,
Rémi.
Hmm, I think I'm on thin ice here.
N-way - does this mean that there are in fact "multiple cache entries" ?
associative - does this mean that a cache is associated with an address range ?
I imagine it means that the cache is 'intelligent' and the least probable cache entries are the ones being recycled.
-It's only a guess, you'll probably have to ask your MCU vendor about the details on this.
One coding-style which is very likely to be a success, is to ask yourself: "What would the CPU like to do the most".
Example: A developer back in the early 90's had two choices for fetching the high-byte of a 16-bit value.
1: He could use LSR
2: He could store the 16-bit value in memory and read it as a byte.
Both options would use the exact same number of clock cycles.
He chose option 1, the LSR instruction. This was a clever choice, because a few years later a new processor became available and a new computer was added to the computer family he wrote the program for. The memory access took the same number of clock cycles, but the new processor had better instruction caching, which meant the LSR instruction executed much faster.
Thus the difference was significant, since this operation was running in a loop.
Remember that the CPU loves being lazy and it hates accessing external resources.
If you need to load a lot of immediate data values (eg. in loops), then consider having a 32-bit register holding common values and another 32-bit register holding a bitmask.
Loading a 16-bit value could be done this way...
We want the value 0x00321800 in r3 and the value 0x00765400 in r2
We also want the value 0x07654000 in r4
r7 holds the mask: 0x00ffff00
r6 holds the data: 0x87654321
and r3,r7,r6,ror#20
and r2,r7,r6,ror#4
and r4,r6,r7,ror#28
So in the first two instructions we rotate the data; in the last instruction, we rotate the mask.
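Out of curiosity, those three AND-with-rotation results can be checked with a quick Python model of the 32-bit ROR performed by the barrel shifter (register values are the ones from the example above):

```python
def ror32(v, n):
    """32-bit rotate right, as done by the ARM barrel shifter."""
    n &= 31
    return ((v >> n) | (v << (32 - n))) & 0xFFFFFFFF

r7 = 0x00FFFF00  # the mask
r6 = 0x87654321  # the data

r3 = r7 & ror32(r6, 20)  # and r3,r7,r6,ror#20
r2 = r7 & ror32(r6, 4)   # and r2,r7,r6,ror#4
r4 = r6 & ror32(r7, 28)  # and r4,r6,r7,ror#28
print(hex(r3), hex(r2), hex(r4))  # → 0x321800 0x765400 0x7654000
```

The three values come out exactly as stated: 0x00321800, 0x00765400 and 0x07654000.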
... I had to do this once in some code, which required a lot of constant values and a tight timing.
I couldn't afford the overhead of a loop; the code had to be completely unrolled.
In addition I had to squeeze as many pre-loaded values into registers as I could.
This solution changed the task from being 'impossible' to just barely become possible; if it had required one more clock-cycle anywhere, that would break everything.
For systems with cache, in general I would go for the literal load, because you get smaller code size => which usually means fewer cache misses => better performance.
But ideally you should benchmark your code to see what works best. The cache hit rate can be very application- and compiler-specific, and with either approach the compiler could potentially generate a code sequence that doesn't match the cache very well.
Thank you both for your quick answers.
I think I have enough tips to make some implementation choices. Now I need to do some benchmarking.
It is always interesting to ask the opinion of experts, to avoid losing too much time if a wrong direction has been taken.
@Joseph
I will keep in mind your tips regarding the different kinds of Cortex-M3 systems: with a prefetch buffer, with separate code and data caches, and with a unified cache.
Thank you for your answer.
Regarding the cache, it is exactly a 4-way set associative cache.
It means there are 4 cache lines for a given range. Before saying too many wrong things, I prefer to give you a link to an article that I read to understand this topic, "What every programmer should know about memory":
http://lwn.net/Articles/252125/
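For what it's worth, here is a toy Python model of the set lookup in a 4-way set-associative cache (the line size and set count are made-up example parameters, not your vendor's actual geometry):

```python
LINE_SIZE = 32   # assumed line size in bytes
NUM_SETS  = 64   # assumed number of sets (64 sets * 4 ways * 32 B = 8 KB)
WAYS      = 4

def set_index(addr):
    """Which set an address maps to: line number modulo the set count."""
    return (addr // LINE_SIZE) % NUM_SETS

# Addresses exactly NUM_SETS lines apart map to the same set, so up to
# WAYS of them can live in the cache at once; a fifth one evicts another.
stride = LINE_SIZE * NUM_SETS
conflicting = [0x1000 + i * stride for i in range(5)]
print({set_index(a) for a in conflicting})  # → {0}
```

So "4-way" means each address range competes for only 4 slots, which is why the relative placement of the code and its literal data can matter in a unified cache.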