Cortex-M3 pipelining of consecutive LDR instructions to different memory regions?

Hi all,

Recently I did some measurements with the SysTick timer to count consumed clock cycles (for performance reasons).

I wrote a simple function in assembly, which gets called from a C file. Before and after the call I read the value of the SysTick timer to determine the cycles needed for loading the parameter value into register r0, the call, and all the assembly code in the function.
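
For anyone repeating this kind of measurement, here is a minimal sketch in C of how the two timer readings can be turned into a cycle count. It relies on SysTick being a 24-bit down-counter that reloads after reaching 0. The helper name systick_elapsed and its parameters are hypothetical (not from any vendor header), and the result is only valid if the counter wrapped at most once between the two reads.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter: it counts from the reload value
   down to 0 and then starts again from the reload value. */
#define SYSTICK_MAX 0x00FFFFFFu

/* Hypothetical helper: cycles elapsed between two reads of the SysTick
   current-value register. 'before' is the first read, 'after' the second,
   'reload' is the programmed reload value.
   Assumption: the counter wrapped at most once between the reads. */
static uint32_t systick_elapsed(uint32_t before, uint32_t after, uint32_t reload)
{
    assert(reload <= SYSTICK_MAX);
    assert(before <= reload && after <= reload);

    if (before >= after) {
        /* No wrap: the counter simply counted down. */
        return before - after;
    }
    /* One wrap: 'before' steps down to 0, one step to reload,
       then (reload - after) steps down to 'after'. */
    return before + (reload + 1u) - after;
}
```

The reads themselves also cost cycles, so it probably makes sense to measure an empty sequence (two back-to-back reads) once and subtract that overhead from each result.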

I expected two consecutive (simple) LDR instructions to get pipelined, but judging by the clock cycles they don't.

Am I right in assuming that loads from different memory regions (here the SysTick timer and the stack) never get pipelined? And a slightly different question: do loads get pipelined when they cross boundaries of the "minimum memory part sizes" (AHB-Lite) within the same memory region?

Thanks in advance,

Alex

Top replies

Sadanand Gulwadi over 11 years ago +1 verified

Hi Alex, You do have a point that needs clarification. Here is what I imagine can be done (a slight chance you have already seen and tried this). In the Load/Store Timing section of the documentation,...

0 Jens Bauer over 11 years ago

Just an additional note: in case you one day move to Cortex-M4 (or someone using Cortex-M4 is passing by here), there is an additional factor.
If a load/store instruction uses a 32-bit opcode, make sure this opcode is aligned on a 32-bit boundary (i.e. its address is divisible by 4).
If not, the instruction might not always pipeline optimally.
This does not seem to apply to the Cortex-M3, though.
From the Cortex-M4 instruction timing documentation:

Neighboring load and store single instructions can pipeline their address and data phases but in some cases such as 32-bit opcodes aligned on odd halfword boundaries they might not pipeline optimally.

If all of them can be 16-bit, add the .n suffix to all the load/store instructions and you shouldn't have any problems there.
Otherwise, you may have to use .align 2 before you start your subroutine / load-block and add the .w suffix to every 16-bit load/store instruction that can't be paired with another 16-bit instruction.
You would also have to make sure that no instruction gets inserted before the block in a way that leaves all of its instructions misaligned.
        .thumb_func
myFunction:
        lsrs        r0,r3,#16
        movs        r3,#10
        ...
        ...
        .align      2               /* align on a 32-bit boundary; this may insert one NOP instruction */
        ldr.w       r12,[r7,#0]
        ldr.w       r1,[r7,#4]
        ldr.w       r14,[r7,#8]
        ldr.n       r3,[r7,#12]
        ldr.n       r4,[r7,#16]
        str.n       r1,[r3]
In the above example, we can pair the loading of r3 and r4, because the two neighbouring LDR instructions do not use any of the high 8 registers, so both opcodes can be 16-bit.
But even though ldr r1,[r7,#4] could be 16-bit, it sits between two 32-bit wide opcodes, so we need to force it to be a 32-bit opcode; otherwise the 32-bit opcode that follows it would end up misaligned.
Note: The .align directive automatically fills with NOP instructions if it is used in a section containing executable code.
Just use .align 2 there, which aligns the location counter to a (1 << 2) = 4-byte boundary; .align 4 would align the location counter to a (1 << 4) = 16-byte boundary.
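
To double-check a layout like the one above without an assembler listing, the address arithmetic can be sketched in C. This is only an illustration under assumptions: align_up mirrors the power-of-two behaviour of .align described above, and count_misaligned_wide (a hypothetical helper, as is the sizes array) counts 32-bit opcodes that start on an odd halfword boundary, which is the case the quoted Cortex-M4 documentation warns about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr up to a (1 << log2_bytes)-byte boundary,
   i.e. what ".align log2_bytes" does to the location counter. */
static uint32_t align_up(uint32_t addr, unsigned log2_bytes)
{
    assert(log2_bytes < 32u);
    uint32_t mask = (1u << log2_bytes) - 1u;
    return (addr + mask) & ~mask;
}

/* Hypothetical helper: count the 32-bit opcodes that start on an odd
   halfword boundary (address % 4 == 2).
   sizes[i] is the size in bytes (2 or 4) of the i-th opcode,
   start is the address of the first opcode. */
static size_t count_misaligned_wide(uint32_t start, const uint8_t *sizes, size_t n)
{
    size_t count = 0;
    uint32_t addr = start;

    for (size_t i = 0; i < n; i++) {
        if (sizes[i] == 4u && (addr & 3u) == 2u) {
            count++;
        }
        addr += sizes[i];
    }
    return count;
}
```

For the block above (three .w loads followed by three .n instructions, starting on a 4-byte boundary) the sizes would be {4, 4, 4, 2, 2, 2} and no wide opcode is misaligned. If the second load were left as a 16-bit opcode, the sizes would be {4, 2, 4, 2, 2, 2} and the third load would start on an odd halfword boundary.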