Cortex-M3 pipelining of consecutive LDR instructions to different memory regions?

Hi all,

Recently I took some measurements with the SysTick timer to count consumed clock cycles (for performance reasons).

I wrote a simple function in assembly, which gets called from a C file. Before and after the call I read the value of the SysTick timer to determine the cycles needed for loading the parameter value into register r0, the call, and all the assembly code in the function.

Although two consecutive (simple) LDR instructions can be pipelined, the measured clock cycles suggest that they are not.

Am I right in assuming that loads from different memory regions (here, the SysTick timer and the stack) are never pipelined? And a slightly different question: are loads pipelined when they cross "minimum memory part size" boundaries (AHB-Lite) within the same memory region?

Thanks in advance,

Alex

  • Hi Alex,

    You do have a point that needs clarification. Here is what I imagine can be done (a slight chance you have already seen and tried this). In the Load/Store Timing section of the documentation, possible scenarios involving an LDR instruction are explained as follows:

    • LDR [any] are pipelined when possible. This means that if the next instruction is an LDR or STR, and the destination of the first LDR is not used to compute the address for the next instruction, then one cycle is removed from the cost of the next instruction. So, an LDR might be followed by an STR, so that the STR writes out what the LDR loaded. More multiple LDRs can be pipelined together. Some optimized examples are:
      • LDR R0,[R1]; LDR R1,[R2] - normally three cycles total
      • LDR R0,[R1,R2]; STR R0,[R3,#20] - normally three cycles total
      • LDR R0,[R1,R2]; STR R1,[R3,R2] - normally three cycles total
      • LDR R0,[R1,R5]; LDR R1,[R2]; LDR R2,[R3,#4] - normally four cycles total.

    Given the above explanation, I would recommend taking a look at the entire piece of your code at the assembly level to check whether the destination of the first LDR is not being used to compute the address of the next LDR. Hope this helps.
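
    One way to apply this check by hand is to encode the rule as a small cost model. The sketch below is plain C meant to run on a host, not on the target; `MemOp` and `sequence_cycles` are hypothetical names, and the model covers only the quoted rule (a nominal 2 cycles per access, minus one when the preceding LDR's destination is not used to form the address). It ignores wait states, bus contention, and memory type.

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical description of one single-register LDR or STR. */
    typedef struct {
        bool is_load;  /* true for LDR, false for STR */
        int  rt;       /* register that is loaded or stored */
        int  base;     /* base address register */
        int  index;    /* index register, or -1 for an immediate offset */
    } MemOp;

    /* Simplified model of the quoted rule only: each access nominally costs
     * 2 cycles, and one cycle is removed when the previous instruction was an
     * LDR whose destination is not used to form this instruction's address. */
    int sequence_cycles(const MemOp *ops, size_t n)
    {
        int total = 0;
        for (size_t i = 0; i < n; i++) {
            int cost = 2;
            if (i > 0 && ops[i - 1].is_load) {
                int dest = ops[i - 1].rt;
                if (dest != ops[i].base && dest != ops[i].index)
                    cost = 1;
            }
            total += cost;
        }
        return total;
    }
    ```

    Fed the documentation's examples, this model gives 3 cycles for `LDR R0,[R1]; LDR R1,[R2]` and 4 cycles for the three-load sequence, while a dependent pair such as `LDR R0,[R1]; LDR R2,[R0]` comes out at 4.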

    Regards,

    Sadanand Gulwadi

    ARM University Program Manager, Bangalore

  • Also a thing that's worth having in mind:

    Run your code from the SRAM, not the flash memory.

    As Sadanand requested: Please post a snippet containing the instructions that do not get pipelined, and we'll probably be able to find out why they aren't pipelined, and perhaps we can then provide a solution.

  • Hi Sadanand, jensbauer,

    thank you for responding to my question.

    Thanks for pointing out the load/store timing information from the TRM; it seems familiar.

    I'll try executing from SRAM in the next few days.

    The two LDR instructions in question (at addresses 8000832 and 8000834) are shown in the disassembly (objdump) below.

    I'm using an STM32L152RE microcontroller. RAM starts at 0x20000000 with a size of 0x00014000; ROM is located at 0x08000000 with a size of 0x00080000.

    R6 contains the address of SCS_BASE (0xE000E000); the variable at [sp,#4] is of type volatile int.

    The execution of the code takes 9 cycles according to the subtraction of the SysTick values.

    8000832:    69b7          ldr    r7, [r6, #24]   ;load SysTick->VAL

    8000834:    9801          ldr    r0, [sp, #4]    ;parameter for function

    8000836:    f7ff fc85     bl    8000144 <singleinstruction_test>

    800083a:    69b0          ldr    r0, [r6, #24]   ;load SysTick->VAL

    08000144 <singleinstruction_test>:

    8000144:    f100 0000     add.w    r0, r0, #0

    8000148:    4770          bx    lr

    Concerning clock frequency and flash memory: I've used the built-in 16 MHz oscillator directly, which, according to the controller's manual, doesn't need wait states for flash access and also applies to the AHB-Lite bus. (This may sound a bit odd given the 'performance reasons' I stated earlier, but I think it's okay for examining the loads. Please correct me if I'm wrong.)
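
    For reference, the 9-cycle figure comes from subtracting two SysTick reads. SysTick VAL is a 24-bit down-counter, so the subtraction is first minus second, and masking to 24 bits keeps the result valid across a reload. The helper below is a minimal host-side sketch with a hypothetical name, `systick_elapsed`; it assumes the reload value is the 24-bit maximum (0xFFFFFF) and that the counter wrapped at most once between the reads.

    ```c
    #include <stdint.h>

    /* SysTick VAL is a 24-bit down-counter, so the earlier read holds the
     * larger value unless the counter reloaded in between. */
    #define SYSTICK_MASK 0x00FFFFFFu

    /* Ticks elapsed between two reads. Assumes a reload value of 0xFFFFFF
     * and at most one wrap between the two reads. */
    uint32_t systick_elapsed(uint32_t first, uint32_t second)
    {
        return (first - second) & SYSTICK_MASK;
    }
    ```

    With a smaller reload value the wrap case would need the reload value plus one added back instead of relying on the 24-bit mask.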

  • Hi Alex,

    Thanks for the disassembly; it makes things clearer. Firstly, since I assume you are subtracting the first SysTick value from the second, I am not sure the resulting cycle count includes the time taken by the first SysTick LDR instruction. Also, I suspect the BL and BX instructions (between the parameter load and the second SysTick load) add a good deal of complexity to the cycle computation. Could you please try two LDRs placed between two SysTick loads, without a BL or BX instruction following them? I hope that makes sense.

    Regards,

    Sadanand

  • Hi Sadanand,

    sorry for the late reply, the last days were quite busy, I'll post my results ASAP.

    Regards,

    Alex

  • Alright, I finally had the time to try your suggested actions.


    > Firstly, since I assume you are subtracting the first SysTick value from the second, I am not sure the cycle count obtained after this subtraction includes the time taken by the first SysTick LDR instruction.

    I agree that the cycle count of the first SysTick LDR instruction isn't included, but I'm aware of that and, in my case, it is the desired behaviour anyway.

    I wrote the assembly below with R0 initially containing the address of the systick->value reg, R2 and R3 initially containing the addresses to 2 different variables on the stack.

        LDR R1, [R0]     ;read systick val reg

        LDR R2, [R2]     ;read variable 1

        LDR R3, [R3]     ;read variable 2

        LDR R0, [R0]     ;read systick val reg

       

        SUB R0, R1, R0     ;subtract second read value from first read value

    The result of the SUB instruction is 5. In my opinion this can only happen if the two loads of the stack variables are pipelined with each other while the loads of the SysTick value register are not pipelined with the variable accesses. That brings me back to my assumption that loads to the PPB memory region and the System memory region don't get pipelined. What do you think?

    Regards,

    Alex

  • I think there might be some traps in the mentioned code.

    I agree; some pipelining must have happened, otherwise all instructions would take two clock cycles, and that would result in a final value of 6 clock cycles.

    But since you're getting 5 clock cycles, it might be more 'accurate' to do the following:

        LDR R1, [R0]     ;read systick val reg

        SUB R4,R5,R6  ;dummy instruction to flush the pipeline

    ;    LDR R2, [R2]     ;read variable 1

    ;    LDR R3, [R3]     ;read variable 2

        SUB R4,R5,R6  ;dummy instruction to flush the pipeline

        LDR R0, [R0]     ;read systick val reg

      

        SUB R0, R1, R0     ;subtract second read value from first read value

    Now you'll be able to disable the two LDR instructions in the centre, then measure it, enable one of them, measure this one, enable both and measure both.

    I wish I could give you a definitive answer, though.

  • Hi Alex,

    Thanks for the experimentation. Much as I understand what you observe, I do feel there is a difference between the CPU clock (HCLK) and the SysTick clock that is causing the number of cycles taken to show up as 5. The Reference Manual for the STM32L152xx family (http://www.st.com/web/en/resource/technical/document/reference_manual/CD00240193.pdf) may help to sort this out.

    Regards,

    Sadanand

  • Generally speaking: Would it be (more) accurate to set up a timer that follows CCLK and read the Timer-Counter value before beginning and after ending the test, as long as all interrupts are disabled ?

  • I believe so. The reference manual mentions that the SysTick clock is set to 4 MHz or max HCLK/8, where HCLK can range from 2 to 32 MHz depending on user configuration.

  • Hi all,

    thank you for your responses.

    Concerning jensbauer's suggested measurements with disabling the 2 loads:

    The suggested code shows a cycle count of 4 with both loads disabled, 6 with just one load disabled, and 7 with both loads in the middle enabled, which are the expected and reasonable results.

    Concerning Sadanand's post about the clock source for the SysTick timer:

    Thanks for pointing that out. My first thought was that I might have overlooked it. A look at the controller's Programming Manual and at my initialisation code reminded me that I had already taken care of it. Because the SysTick Control Register allows selecting the clock source between AHB/8 and the processor clock (AHB), I guess it shouldn't make a difference whether I use the SysTick timer or a dedicated timer/counter (in my case the processor clock and the AHB clock are the same, with no prescaler).
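
    To make the clock-source point concrete: CLKSOURCE is bit 2 of the SysTick Control and Status Register, where 1 selects the processor clock and 0 the external reference. The host-side sketch below converts a tick delta into processor cycles; `ticks_to_core_cycles` is a hypothetical helper, and the divider of 8 for the external reference is taken from the AHB/8 option mentioned above, so treat it as an assumption about this particular part.

    ```c
    #include <stdint.h>

    /* Bit 2 of SysTick CTRL: 1 = processor clock, 0 = external reference. */
    #define SYSTICK_CTRL_CLKSOURCE (1u << 2)

    /* Convert a SysTick tick delta into processor cycles. The factor of 8 for
     * the external reference follows the AHB/8 option named in the thread. */
    uint32_t ticks_to_core_cycles(uint32_t ticks, uint32_t ctrl)
    {
        return (ctrl & SYSTICK_CTRL_CLKSOURCE) ? ticks : ticks * 8u;
    }
    ```

    With CLKSOURCE set, a delta of 5 ticks is 5 processor cycles; with it cleared, the same delta would correspond to 40 cycles and single cycles could not be resolved at all.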

    For now, I have to stop investigating why these loads behave this way, for lack of time, and settle for the knowledge that the loads don't get pipelined.

    Anyway, I really appreciate your help, thank you very much.

    Regards,

    Alex

  • > Concerning jensbauer's suggested measurements with disabling the 2 loads:

    > The suggested code shows a cycle count of 4 with the 2 loads disabled, 6 cycles with just one load disabled and 7 when executing the loads in the middle, which show expected and reasonable results.

    I suggested the dummy instructions, in order to get accurate measurements; in other words to "synchronize", so you do not get your results disturbed by the reading of the cycle counter.

    If 4 cycles are being used with no loads enabled and 6 cycles are being used with 1 load enabled, that suggests the first load takes 2 clock cycles, correct ?

    If 6 cycles are being used with one load enabled and 7 cycles are being used with 2 loads enabled, that suggests the second load takes 1 clock cycle, correct ?

    If the above is true, then I believe the second instruction is being pipelined as expected.

    -Or do I misinterpret the results ?
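
    The arithmetic above can be written out as a tiny helper that turns a series of measurements into the extra cycles each added load costs. This is only a host-side sketch; `marginal_costs` is a hypothetical name, and `totals[k]` is assumed to be the measured count with k of the optional loads enabled.

    ```c
    #include <stddef.h>

    /* totals[k] is the measured cycle count with k optional loads enabled.
     * marginal[k - 1] receives the extra cycles added by the k-th load, so
     * `marginal` must have room for n - 1 entries. */
    void marginal_costs(const int *totals, size_t n, int *marginal)
    {
        for (size_t k = 1; k < n; k++)
            marginal[k - 1] = totals[k] - totals[k - 1];
    }
    ```

    For the measurements 4, 6 and 7 this yields 2 and 1: two cycles for the first load and one for the second.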

    If you need to measure if the reading of the cycle-counter affects the pipelining, you could make a duplicate load:

        LDR R1,[R0]     ;read systick val reg

        SUB R4,R5,R6    ;dummy instruction to flush the pipeline

    ;   LDR R4,[R0]     ;dummy read of systick val reg

    ;   LDR R2,[R2]     ;read variable 1

    ;   LDR R3,[R3]     ;read variable 2

        SUB R4,R5,R6    ;dummy instruction to flush the pipeline

        LDR R0,[R0]     ;read systick val reg

        SUB R0,R1,R0    ;subtract second read value from first read value

    If enabling all 3 loads, I would expect the result to be...

    • 8 if an instruction can be pipelined after reading the systick value.
    • 9 if an instruction cannot be pipelined after reading the systick value.

    Note: Remember that the first load in a sequence is never pipelined, so the first load will always use 2 clock cycles...
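
    The two predictions can be reproduced with a small cycle model. It is a sketch under assumed costs, not a statement about the core's internals: 1 cycle per ALU instruction, 2 per unpipelined load, and 1 for a load that directly follows another load it may pipeline with. The names `Op` and `window_cycles` are hypothetical, and the flag `systick_pipelines` stands for the open question of whether a SysTick read can pipeline with a neighbouring load.

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    typedef enum { OP_ALU, OP_LOAD_RAM, OP_LOAD_SYSTICK } Op;

    /* Cycles counted between two SysTick samples. `ops` lists the instructions
     * after the opening SysTick read, up to and including the closing one. */
    int window_cycles(const Op *ops, size_t n, bool systick_pipelines)
    {
        int total = 0;
        Op prev = OP_LOAD_SYSTICK;  /* the opening SysTick read */
        for (size_t i = 0; i < n; i++) {
            Op cur = ops[i];
            int cost = (cur == OP_ALU) ? 1 : 2;
            bool both_loads = (prev != OP_ALU) && (cur != OP_ALU);
            bool touches_systick =
                (prev == OP_LOAD_SYSTICK) || (cur == OP_LOAD_SYSTICK);
            if (both_loads && (systick_pipelines || !touches_systick))
                cost = 1;  /* pipelined behind the previous load */
            total += cost;
            prev = cur;
        }
        return total;
    }
    ```

    With the flag cleared, the model gives 4, 6 and 7 for zero, one and two enabled loads, matching the measurements above, and 9 with all three loads enabled; with the flag set, the three-load case gives 8.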

  • Sorry for the delay on this...

    The SysTick timer is in a Strongly Ordered memory space, so transfers to SysTick cannot be pipelined with other memory accesses. For your instruction sequence, only the two reads of variables 1 and 2 can be pipelined.
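
    As an illustration of which addresses this affects, the sketch below classifies an address using the default ARMv7-M memory map and applies the rule stated above. `classify` and `loads_may_pipeline` are hypothetical names, and the second function is a deliberate simplification: it only excludes the Private Peripheral Bus (where SysTick lives) and says nothing about Device regions or register dependencies.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum {
        REGION_CODE, REGION_SRAM, REGION_PERIPHERAL, REGION_EXT_RAM,
        REGION_EXT_DEVICE, REGION_PPB, REGION_VENDOR
    } Region;

    /* Default ARMv7-M memory map, split at its fixed region boundaries. */
    Region classify(uint32_t addr)
    {
        if (addr < 0x20000000u) return REGION_CODE;
        if (addr < 0x40000000u) return REGION_SRAM;
        if (addr < 0x60000000u) return REGION_PERIPHERAL;
        if (addr < 0xA0000000u) return REGION_EXT_RAM;
        if (addr < 0xE0000000u) return REGION_EXT_DEVICE;
        if (addr < 0xE0100000u) return REGION_PPB;  /* Strongly Ordered */
        return REGION_VENDOR;
    }

    /* Simplified rule from this reply: two neighbouring loads can pipeline
     * only if neither one targets the Strongly Ordered PPB space. */
    bool loads_may_pipeline(uint32_t addr_a, uint32_t addr_b)
    {
        return classify(addr_a) != REGION_PPB && classify(addr_b) != REGION_PPB;
    }
    ```

    SysTick->VAL at 0xE000E018 (SCS_BASE plus 24) falls in the PPB range, while a stack address such as 0x20013FF0 on this part falls in SRAM, so a pair made of one of each is reported as not pipelinable.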

    regards,

    Joseph

  • Just an additional note: In case you one day move to Cortex-M4 (or someone using Cortex-M4 is passing by here), there is an additional factor.

    If a load/store instruction uses a 32-bit opcode, make sure this opcode is aligned on a 32-bit boundary (eg. the address is divisible by 4).

    If not, the instruction might not always pipeline optimally.

    -This does not seem to apply to the Cortex-M3, though.

    From the Cortex-M4 instruction timing documentation:

    > Neighboring load and store single instructions can pipeline their address and data phases but in some cases such as 32-bit opcodes aligned on odd halfword boundaries they might not pipeline optimally.

    If all of them can be 16-bit, add the .n suffix to all the load/store instructions and you shouldn't have any problems there.

    Otherwise, you may have to .align 2 before you start your subroutine / load-block and add the .w suffix to every load/store instruction that could be 16-bit but can't be paired with another 16-bit instruction.

    You also have to make sure that no instruction is later inserted before the block, since that would misalign all of the instructions in it.

            .thumb_func

    myFunction:

            lsrs        r0,r3,#16

            movs        r3,#10

            ...

            ...

            .align      2               /* align on a 32-bit boundary; this may insert one NOP instruction */

            ldr.w       r12,[r7,#0]

            ldr.w       r1,[r7,#4]

            ldr.w       r14,[r7,#8]

            ldr.n       r3,[r7,#12]

            ldr.n       r4,[r7,#16]

            str.n       r1,[r3]

    In the above example, we can pair the loading of r3 and r4, because the two neighbouring LDR instructions do not use any of the high 8 registers, thus the opcodes can be 16-bit.

    But even though ldr r1,[r7,#4] can be 16-bit, it's just between two 32-bit wide opcodes, so we'll need to force it to be a 32-bit opcode so the addresses won't be misaligned.

    Note: The .align directive actually automatically fills using nop instructions if used in a section containing executable code.

    Just use .align 2 in there, which will align the location counter to a (1 << 2) byte boundary; .align 4 will align the location counter to a (1 << 4) byte boundary.
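
    To check a block's layout before assembling it, the opcode widths can be walked by hand or with a few lines of host-side C. This is only a sketch: `count_misaligned_wide` is a hypothetical helper, and it assumes the instructions are laid out back to back from `start` with no literal pools or padding in between.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* widths[i] is the opcode size in bytes (2 or 4) of the i-th instruction,
     * laid out back to back from `start`. Returns how many 4-byte opcodes do
     * not begin on a 4-byte boundary. */
    size_t count_misaligned_wide(uint32_t start, const int *widths, size_t n)
    {
        size_t misaligned = 0;
        uint32_t addr = start;
        for (size_t i = 0; i < n; i++) {
            if (widths[i] == 4 && (addr & 3u) != 0)
                misaligned++;
            addr += (uint32_t)widths[i];
        }
        return misaligned;
    }
    ```

    For the block above (three .w opcodes followed by three .n opcodes, starting on a 4-byte boundary) the count is 0. If the load of r1 were left as a 16-bit opcode, the following ldr.w would start on an odd halfword boundary and the count would be 1.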