Cortex M3 - Literal-pool vs MOVW-MOVT when cache is present

This question is a kind of survey:

Hi folks,

I know this subject, or something close to it, has already been discussed, but I haven't found a satisfactory answer.

First, a short reminder

The Cortex-M3 instruction set offers three ways to load a 32-bit literal (address or constant) into a register:

1/ Using a literal-pool and PC-relative load:

   LDR Rx, [PC, #offset_to_constant]   

2/ Using a MOVW/MOVT pair, to load the constant in two steps:

   MOVW Rx, #least_significant_halfword

   MOVT Rx, #most_significant_halfword

3/ Using the 'Flexible Second Operand' (out of scope for my question):

   if the constant:

   - can be obtained by shifting an 8-bit value left (e.g. '000001FE'h = 'FF'h<<1)

   - has the format '00XY00XY'h

   - has the format 'XY00XY00'h

   - has the format 'XYXYXYXY'h

   MOV.W Rx, #constant
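
To make the three cases above concrete, here is a minimal C sketch that checks whether a constant fits the single-instruction MOV.W form and, if not, splits it into the two halfwords a MOVW/MOVT pair would carry. The function names are hypothetical and only serve the illustration; the shifted case assumes the 8-bit value does not wrap around the top of the word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v matches one of the single-instruction immediate formats listed
 * above: a byte pattern, or an 8-bit value shifted left without wrapping. */
bool is_modified_imm(uint32_t v)
{
    uint32_t b0 = v & 0xFFu;          /* XY for 00XY00XY and XYXYXYXY */
    uint32_t b1 = (v >> 8) & 0xFFu;   /* XY for XY00XY00 */

    if (v <= 0xFFu)                    return true;   /* 000000XY */
    if (v == ((b0 << 16) | b0))        return true;   /* 00XY00XY */
    if (v == ((b1 << 24) | (b1 << 8))) return true;   /* XY00XY00 */
    if (v == b0 * 0x01010101u)         return true;   /* XYXYXYXY */

    /* Shifted form: v is non-zero here, so dropping the trailing zero bits
     * terminates; what remains must fit in 8 bits. */
    while ((v & 1u) == 0u)
        v >>= 1;
    return v <= 0xFFu;
}

/* Split v into the halfwords used by MOVW (low) and MOVT (high). */
void split_for_movw_movt(uint32_t v, uint16_t *lo, uint16_t *hi)
{
    assert(lo && hi);
    *lo = (uint16_t)(v & 0xFFFFu);   /* MOVW Rx, #least_significant_halfword */
    *hi = (uint16_t)(v >> 16);       /* MOVT Rx, #most_significant_halfword  */
}
```

With this sketch, '000001FE'h is accepted by is_modified_imm(), while a value such as '12345678'h is rejected and would need either the literal pool or the MOVW/MOVT pair.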

Based on these elements, I made the following analysis:

[A - instruction timing]

From the instruction timing, we have the following results:

# literal-pool version :

   code size : 6 or 8 bytes depending on the relative offset to the constant (6 if offset%4=0 && offset in [0, 1020] && Rx in R0-R7)

   speed     : 2 cycles for the LDR (or 1 cycle in some cases)

# MOVW/MOVT version :

   code size : 8 bytes

   speed     : 2 cycles

[B - Cache]

From a cache usage point of view, more precisely for a unified code/data cache (my case):

As the MOVW and MOVT instructions are contiguous, the principle of locality is respected.

For the 'literal-pool' version, when the instruction and its constant are separated by more bytes than a cache line holds, the principle of locality is not respected, which could cause misses on subsequent data accesses in the system memory space.
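
The locality argument can be checked on concrete addresses. The sketch below computes the address read by a 16-bit PC-relative LDR (in Thumb state the PC reads as the instruction address plus 4, word-aligned before the offset is added) and tests whether it shares a cache line with the instruction. The function names and the line size passed in are assumptions for illustration; the real line size depends on the cache the chip vendor fitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Address read by a 16-bit "LDR Rx, [PC, #offset]" located at insn_addr. */
uint32_t literal_address(uint32_t insn_addr, uint32_t offset)
{
    assert(offset % 4u == 0u && offset <= 1020u);  /* 16-bit encoding range */
    return ((insn_addr + 4u) & ~3u) + offset;
}

/* True if both addresses fall in the same cache line.
 * line_size is an assumed, implementation-specific power of two. */
bool same_cache_line(uint32_t a, uint32_t b, uint32_t line_size)
{
    assert(line_size != 0u && (line_size & (line_size - 1u)) == 0u);
    return (a / line_size) == (b / line_size);
}
```

For example, with an assumed 16-byte line, an LDR at 0x08000102 with offset 8 reads 0x0800010C, which is in the same line, whereas offset 1020 reads 0x08000500, which is not. Note that a MOVW/MOVT pair is only guaranteed contiguous, not guaranteed to sit inside one line: an 8-byte pair starting at 0x0800010C would straddle two 16-byte lines.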


To summarize:

# 'literal-pool' needs the same number of cycles               as   'MOVW/MOVT' :  2 cycles vs  2 cycles

# 'literal-pool' takes up less room in pre-fetch unit buffer   than 'MOVW/MOVT' : 16 %      vs 66 %

# 'literal-pool' needs fewer instruction fetches on I-Code bus than 'MOVW/MOVT' :  1 fetch  vs  2 fetches

# 'literal-pool' needs more data fetches on D-Code bus         than 'MOVW/MOVT' :  1 fetch  vs  0 fetches

# 'literal-pool' respects the principle of locality less       than 'MOVW/MOVT' :

  - 'literal-pool' doesn't respect the principle of locality for cache purposes

  - 'MOVW/MOVT'    respects the principle of locality for cache purposes

Note: I know that the compiler seems to favor the literal-pool version because of the code-size gain when the same constant is used in several places.
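
The code-size gain mentioned in this note can be put into numbers with a small C sketch. It assumes that every use of the constant can reach the same pool entry, so the 4-byte literal is stored only once; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to load the same 32-bit constant at `uses` places through the
 * literal pool. ldr_bytes is 2 (16-bit LDR) or 4 (32-bit LDR.W); the 4-byte
 * literal is counted once, assuming all uses share one pool entry. */
uint32_t literal_pool_bytes(uint32_t uses, uint32_t ldr_bytes)
{
    assert(ldr_bytes == 2u || ldr_bytes == 4u);
    return (uses == 0u) ? 0u : uses * ldr_bytes + 4u;
}

/* Each use needs its own MOVW (4 bytes) and MOVT (4 bytes). */
uint32_t movw_movt_bytes(uint32_t uses)
{
    return uses * 8u;
}
```

For a single use this gives the 6 or 8 bytes versus 8 bytes from section A; for three uses with the 16-bit LDR it gives 10 bytes versus 24 bytes.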

After this short analysis, assuming it is correct, I am no longer sure which strategy gives the best execution speed.

Question 1:

Does anyone have any feedback on this topic ?

Question 2 (subsidiary):

Does anyone know if those elements are taken into account by development tools like Keil nowadays?


  • I think it won't matter much on the Cortex-M3, except in the case where you can use literal loading (LDR) and save a clock cycle.

    But you will need to test and verify this on the particular device you're building code for; it may be different on another Cortex-M3 implementation!

    I know that on the LPC175x / LPC176x, it's possible to pipeline the LDR instructions (16-bit instructions are usually the best candidates, due to possible misalignment of 32-bit instructions).

    If you start on a Cortex-M3 and later upgrade your hardware using a drop-in replacement Cortex-M4 or Cortex-M7, then your result will most likely differ from your initial tests.

    There's another benefit if you use the literal pool: addresses are easier to relocate, in case you're building code that needs a feature like that.

    These days, though, we're clever enough to write code loaders that can locate addresses split across multiple bit-fields and add a value to them.

    You'll probably not feel much difference between loading from a nearby literal pool and loading from one located far away, unless your CPU is working full-time on tightly timed data movement while a lot of DMA is copying data at the same time.

    Joseph Yiu might be able to go into detail on when and why the cache would miss and when and why this would insert stalls (and most likely provide a much better answer than mine).
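
To illustrate the remark above about loaders patching addresses that are split across bit-fields: in the 32-bit Thumb-2 MOVW and MOVT encodings, the 16-bit immediate is scattered over two halfwords as imm4:i:imm3:imm8. The C sketch below reads and rewrites that field to add a load offset to the address held by a MOVW/MOVT pair. The function names are hypothetical, and a real loader would also need to know where such pairs are located (for instance from relocation records), which is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* MOVW/MOVT layout: hw1 = 11110 i 10 .... imm4, hw2 = 0 imm3 Rd imm8. */

/* Extract the 16-bit immediate imm4:i:imm3:imm8 from a halfword pair. */
uint16_t movx_get_imm16(uint16_t hw1, uint16_t hw2)
{
    uint32_t imm4 = hw1 & 0x000Fu;        /* hw1[3:0]   */
    uint32_t i    = (hw1 >> 10) & 0x1u;   /* hw1[10]    */
    uint32_t imm3 = (hw2 >> 12) & 0x7u;   /* hw2[14:12] */
    uint32_t imm8 = hw2 & 0x00FFu;        /* hw2[7:0]   */
    return (uint16_t)((imm4 << 12) | (i << 11) | (imm3 << 8) | imm8);
}

/* Write a new 16-bit immediate, leaving the opcode bits and Rd untouched. */
void movx_set_imm16(uint16_t *hw1, uint16_t *hw2, uint16_t imm16)
{
    assert(hw1 && hw2);
    *hw1 = (uint16_t)((*hw1 & ~0x040Fu)
                      | ((imm16 >> 12) & 0xFu)
                      | (((imm16 >> 11) & 0x1u) << 10));
    *hw2 = (uint16_t)((*hw2 & ~0x70FFu)
                      | (((imm16 >> 8) & 0x7u) << 12)
                      | (imm16 & 0xFFu));
}

/* Add `delta` to the 32-bit address loaded by a MOVW/MOVT pair. */
void relocate_movw_movt(uint16_t movw[2], uint16_t movt[2], uint32_t delta)
{
    uint32_t addr = ((uint32_t)movx_get_imm16(movt[0], movt[1]) << 16)
                  | movx_get_imm16(movw[0], movw[1]);
    addr += delta;
    movx_set_imm16(&movw[0], &movw[1], (uint16_t)(addr & 0xFFFFu));
    movx_set_imm16(&movt[0], &movt[1], (uint16_t)(addr >> 16));
}
```

By comparison, relocating a literal-pool entry is a plain 32-bit add on one word, which is the simplicity the reply refers to.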
