Instruction timings - ARM Cortex-M3

I am using the following two assembly sequences on an ARM Cortex-M3: the first reads a memory-mapped I/O register and saves each value to RAM, and the second reads the same register into multiple registers. I want to know exactly how many CPU cycles each takes to complete, in other words, how fast I am reading the register.

1) Read and save to memory: can back-to-back LDR/STR pairs be tightly pipelined (with the address phase of one instruction overlapping the data phase of the previous one), in which case the following would take only 9 cycles?

     486:   781a      ldrb        r2, [r3, #0]
     488:   7002      strb        r2, [r0, #0]
     48a:   781a      ldrb        r2, [r3, #0]
     48c:   7042      strb        r2, [r0, #1]
     48e:   781a      ldrb        r2, [r3, #0]
     490:   7082      strb        r2, [r0, #2]
     492:   781a      ldrb        r2, [r3, #0]
     494:   70c2      strb        r2, [r0, #3]

2) Read into multiple registers: I am assuming these four instructions take 5 cycles in total.

     486:   781a      ldrb        r2, [r3, #0]
     488:   781c      ldrb        r4, [r3, #0]
     48a:   781d      ldrb        r5, [r3, #0]
     48c:   781e      ldrb        r6, [r3, #0]

I appreciate any insight you can provide.

Thanks,

  • Hi Jens,

    For best performance, pipelining LDR and STR instructions back to back generally works well on Cortex-M3/M4. (This does not apply to Cortex-M0, M0+, or M7.)

    This reduces each subsequent LDR/STR instruction to 1 cycle (assuming zero wait states and no unaligned or bit-band transfers).

    If you insert other operations between the LDR/STR instructions, each LDR takes 2 cycles (an STR can still take 1 cycle because of the write buffer).

    Ideally, use the 16-bit LDR/STR encodings for this (they also reduce code size).

    If you need the 32-bit encodings, try to make sure that the pipelined LDR/STR instructions are aligned to 32-bit addresses.

    Regards,

    Joseph
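The rule in the reply above can be sketched as a small cycle estimate. This is only a simplified model, not a cycle-accurate simulation: it assumes zero wait states, aligned accesses, no bit-band transfers, and immediate-offset addressing. The names `estimate_cycles`, `LOADS`, and `STORES` are hypothetical and exist only for this illustration. Python is used purely so the arithmetic can be run on a host machine.

```python
# Simplified model of the timing rule described in the reply.
# Assumptions: zero wait states, aligned accesses, no bit-band transfers.

LOADS = {"ldr", "ldrb", "ldrh"}
STORES = {"str", "strb", "strh"}


def estimate_cycles(mnemonics):
    """Return an estimated cycle count for a straight-line instruction list."""
    total = 0
    prev_was_mem = False  # was the previous instruction a load or store?
    for op in mnemonics:
        op = op.lower()
        if op in LOADS:
            # A load costs 2 cycles unless it directly follows another
            # load/store, in which case it is assumed to pipeline to 1.
            total += 1 if prev_was_mem else 2
            prev_was_mem = True
        elif op in STORES:
            # A store is counted as 1 cycle because of the write buffer.
            total += 1
            prev_was_mem = True
        else:
            # Anything else is treated as a 1-cycle integer operation.
            total += 1
            prev_was_mem = False
    return total


copy_sequence = ["ldrb", "strb"] * 4  # sequence 1 from the question
read_sequence = ["ldrb"] * 4          # sequence 2 from the question

print(estimate_cycles(copy_sequence))  # 9
print(estimate_cycles(read_sequence))  # 5
```

Under these assumptions the estimate agrees with the 9 and 5 cycles guessed in the question. Real hardware can differ if the I/O peripheral or its bus inserts wait states.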

  • Thank you for the detailed reply; this sounds great, because often I do the following ...

         .rept 200
         ldr rX,[rS,#imm]
         str rX,[rT,#imm]
         @ (optionally integer instructions here)
         .endr

    ...so that integer instructions (e.g. ADD, SUB, AND, ORR, EOR, shifts) come right after an STR and before the next LDR, never directly after an LDR.

    To make my question clearer:

    A:

         .rept 20
              .rept 10
                   ldr rX,[rS,#imm]
                   str rX,[rT,#imm]
              .endr
              .rept 10
                   and rX,rX,#imm
              .endr
         .endr

    B:

         .rept 20
              .rept 10
                   ldr rX,[rS,#imm]
                   str rX,[rT,#imm]
                   and rX,rX,#imm
              .endr
         .endr

    As I understand it, example B would not suffer from bubbles in the pipeline: nothing gets pipelined after an STR, so it does not matter which instruction we place after the STR. Is that correct?

    Would bubbles appear in the pipeline in example A (because of the long run of LDR/STR pairs), or would the two examples be equally efficient?
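To make the comparison concrete, here is a rough estimate of examples A and B under two readings of the rule. It is a sketch only and does not settle what the hardware really does. The helper names `rept` and `estimate_cycles` and the flag `pipeline_after_store` are hypothetical. With `pipeline_after_store=True`, an LDR that directly follows an STR is assumed to take 1 cycle (the reading in the earlier reply). With `False`, nothing pipelines after an STR (the reading in this post). Both cases assume zero wait states, aligned accesses, and no extra bubbles in long runs.

```python
def rept(count, body):
    """Mimic the assembler's .rept directive: repeat a list of mnemonics."""
    return list(body) * count


def estimate_cycles(ops, pipeline_after_store=True):
    """Estimate cycles for a list containing 'ldr', 'str' and ALU mnemonics."""
    total = 0
    prev = None  # mnemonic of the previous instruction
    for op in ops:
        if op == "ldr":
            follows_ldr = prev == "ldr"
            follows_str = prev == "str" and pipeline_after_store
            # A pipelined load is counted as 1 cycle, otherwise 2.
            total += 1 if (follows_ldr or follows_str) else 2
        else:
            # STR (write buffer) and integer operations are counted as 1 cycle.
            total += 1
        prev = op
    return total


example_a = rept(20, rept(10, ["ldr", "str"]) + rept(10, ["and"]))
example_b = rept(20, rept(10, ["ldr", "str", "and"]))

for name, ops in (("A", example_a), ("B", example_b)):
    print(name,
          estimate_cycles(ops, pipeline_after_store=True),
          estimate_cycles(ops, pipeline_after_store=False))
# prints:
# A 620 800
# B 800 800
```

If an LDR can pipeline after an STR, this estimate puts A ahead, because B pays the full 2-cycle LDR after every AND. If nothing pipelines after an STR, the two come out equal. Measuring on the target (for example with a cycle counter) would be needed to tell which reading applies.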