Instruction timings - arm cortex m3

I am using the following assembly sections on an ARM Cortex-M3 to read a memory-mapped I/O register and save it to RAM, and to read the same I/O register into multiple registers. I want to know exactly how many CPU cycles each would take to complete; in other words, how fast am I reading the register?

1) Read and save to memory: can LDR-STR-LDR-STR be tightly pipelined (with the address phase of one instruction overlapping the data phase of the previous instruction), in which case the following would take only 9 cycles?

     486:   781a      ldrb        r2, [r3, #0]
     488:   7002      strb        r2, [r0, #0]
     48a:   781a      ldrb        r2, [r3, #0]
     48c:   7042      strb        r2, [r0, #1]
     48e:   781a      ldrb        r2, [r3, #0]
     490:   7082      strb        r2, [r0, #2]
     492:   781a      ldrb        r2, [r3, #0]
     494:   70c2      strb        r2, [r0, #3]

2) Read to multiple registers: I am assuming these four instructions take 5 cycles.

     486:   781a      ldrb        r2, [r3, #0]
     488:   781c      ldrb        r4, [r3, #0]
     48a:   781d      ldrb        r5, [r3, #0]
     48c:   781e      ldrb        r6, [r3, #0]

I appreciate any insight you can provide.

Thanks,

Parents
  • I understand your confusion, because I do not know the exact cause myself!

    SysTick might not be very accurate; please see this post: cortex-M3 pipelining of consecutive LDR instructions.
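
    If SysTick is used anyway, part of the inaccuracy comes from the reads of the counter themselves. A minimal sketch of the arithmetic in C, assuming the counter value is read once before and once after the sequence, that the reload value is 0xFFFFFF, and that the counter wraps at most once (the function names are hypothetical, and the overhead value has to be measured on the actual device by timing an empty sequence):

```c
#include <assert.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter, so the elapsed tick count is
 * (start - end) modulo 2^24. This sketch assumes a reload value of
 * 0xFFFFFF and at most one wrap between the two reads. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & 0x00FFFFFFu;
}

/* Subtract the cost of the measurement itself. `overhead` is the elapsed
 * count obtained by timing an empty sequence with the same read code. */
uint32_t net_cycles(uint32_t start, uint32_t end, uint32_t overhead)
{
    uint32_t elapsed = systick_elapsed(start, end);
    assert(elapsed >= overhead); /* otherwise the calibration is suspect */
    return elapsed - overhead;
}
```

    The result is only as good as the calibration run, so it helps to repeat both measurements several times and check that they are stable.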

    Another thing that could affect the execution timing is whether you run your code from RAM or Flash memory.

    If you have the ART accelerator, you don't need to worry, but if not, you might want to execute the code from internal SRAM.

    Now, back to the ldrb/strb pipeline ... It may help to experiment a little:

    The first thing to try is to use .align 2 and add the .w suffix to all ldrb and strb instructions, just in case it matters.

    Assuming it didn't change anything, try the following:

    Instead of pointing r3 and r0 to GPIO space, try pointing both to SRAM; preferably a different SRAM block than the one you're executing code from (in case you execute code from SRAM).

    If the results differ, then the GPIO registers may be causing the delay (but there's no guarantee that this is the case, because some devices also have a cache or buffer in front of memory, which would help when loading from SRAM).

    Regarding your first question, whether or not ldrb:strb:ldrb:strb can be tightly pipelined: I think it cannot.

    As I understand it, nothing can be pipelined after STR.

    The first LDR instruction takes 2 clock cycles; the next LDR instruction should take only one clock cycle (if it's pipelined).

    STR rS,[rB,#imm] should always take 1 clock cycle.

    Thus I would expect LDR:STR to take 3 clock cycles, not 4.

    Note that the examples in the manual always end with a STR instruction.
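
    To make that reasoning concrete, here is a small C sketch that only does the arithmetic for the rules stated above. It is not a model of the real core: wait states on the peripheral bus, instruction fetch stalls and bus contention are ignored, and the function name and the 'L'/'S' string encoding are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Estimate cycles for a sequence of 'L' (LDR) and 'S' (STR) instructions
 * using only the rules given in this reply:
 *   - an LDR costs 2 cycles, or 1 if it directly follows another LDR;
 *   - an STR with an immediate offset costs 1 cycle;
 *   - nothing is pipelined after an STR, so an LDR after an STR costs 2.
 * Wait states, fetch stalls and bus contention are NOT modelled. */
unsigned estimate_cycles(const char *seq)
{
    unsigned cycles = 0;
    char prev = '\0';

    assert(seq != NULL);
    for (size_t i = 0; seq[i] != '\0'; i++) {
        char op = seq[i];
        assert(op == 'L' || op == 'S');
        if (op == 'L')
            cycles += (prev == 'L') ? 1u : 2u;
        else
            cycles += 1u;
        prev = op;
    }
    return cycles;
}
```

    Under these assumptions estimate_cycles("LSLSLSLS") gives 12 for the first sequence in the question rather than 9, and estimate_cycles("LLLL") gives 5 for the second. Only a measurement on the device can confirm either number.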

    Another test to try is the following:

        ldrb    r2,[r3,#0]
        strb    r1,[r0,#0]
        ldrb    r1,[r3,#0]
        strb    r2,[r0,#1]
        ldrb    r2,[r3,#0]
        strb    r1,[r0,#2]
        ldrb    r1,[r3,#0]
        strb    r2,[r0,#3]

    That way you're not writing into a register right before reading it. According to the manual this should not make any difference, but it would be good to get it confirmed.

    If you're reading only a single bit, then you'll have the option of saving the result in a register by using the BFI instruction.

    Normally when doing this, it's best to read only the low bit(s).

    Thus if reading only 2 bits from the port each time, you might be able to get away with ...

        .set    pos,30
        .rept   16
        ldr     r2,[r3,#0]
        bfi     r1,r2,#pos,#2
        .set    pos,pos-2
        .endr

    This should take at most 3 clock cycles per bit pair. Thus the fewer bits you sample each time, the longer you can keep sampling, by spreading the storage over several registers.

    Unfortunately, if you need to take a lot of samples, you'll run out of registers.
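
    For checking the bit layout off-target, the packing done by the .rept block can be modelled in C. This is only a host-side sketch: bfi() imitates the BFI instruction, and pack_2bit_samples() is a hypothetical helper that takes the 16 port reads as an array instead of reading the port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of BFI Rd, Rn, #lsb, #width: copy the low `width` bits of src
 * into dst starting at bit `lsb`, leaving the other bits of dst as-is. */
uint32_t bfi(uint32_t dst, uint32_t src, unsigned lsb, unsigned width)
{
    assert(width >= 1 && lsb < 32 && lsb + width <= 32);
    uint32_t field = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    uint32_t mask = field << lsb;
    return (dst & ~mask) | ((src << lsb) & mask);
}

/* Mirror of the .rept loop: 16 reads, 2 bits each. The first sample
 * lands in bits 31:30 and the last one in bits 1:0. */
uint32_t pack_2bit_samples(const uint32_t samples[16])
{
    uint32_t packed = 0;

    for (size_t i = 0; i < 16; i++) {
        unsigned pos = 30u - 2u * (unsigned)i;
        packed = bfi(packed, samples[i], pos, 2);
    }
    return packed;
}
```

    One 32-bit register holds 16 two-bit samples this way, so the number of free registers sets the limit on how many samples fit before something has to be stored to memory.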
