I am using the following assembly sequences on an ARM Cortex-M3: one reads a memory-mapped I/O register and saves the result to RAM, the other reads the same I/O register into multiple registers. I want to know exactly how many CPU cycles each would take to complete - in other words, how fast I am reading the register.
1) read and save to memory: can LDR-STR-LDR-STR be tightly pipelined (with the address phase of one instruction overlapping the data phase of the previous instruction), in which case the following would take only 9 cycles?
486: 781a ldrb r2, [r3, #0]
488: 7002 strb r2, [r0, #0]
48a: 781a ldrb r2, [r3, #0]
48c: 7042 strb r2, [r0, #1]
48e: 781a ldrb r2, [r3, #0]
490: 7082 strb r2, [r0, #2]
492: 781a ldrb r2, [r3, #0]
494: 70c2 strb r2, [r0, #3]
2) read to multiple registers: I am assuming these instructions take 5 cycles.
48a: 781a ldrb r4, [r3, #0]
48e: 781a ldrb r5, [r3, #0]
492: 781a ldrb r6, [r3, #0]
I appreciate any insight you can provide.
Thanks,
This may depend on more than one thing. I think jyiu might be able to give you a more complete answer than I can provide.
I think the I/O timing may depend on the vendor's implementation.
As far as I remember, the instruction alignment is important.
If you use any 32-bit load or store instructions (eg. ldrb.w or strb.w instead of ldrb.n or strb.n), then make sure the instructions are aligned on a 4-byte boundary.
If an instruction is not aligned on a 4-byte boundary, I think you will not get the expected results.
Thus ...
If you're using only low registers (r0-r7), then you can use 16-bit instructions. If you're using a high register (r8-r15), then you need to use ldrb.w / strb.w instead.
The assembler automatically selects the necessary instruction size, but you may explicitly add the .w or .n suffix.
To make sure your instructions are aligned on a 4-byte boundary, you can use ...
.align 2
... when using the GNU Assembler. Note 2 does not mean 2-byte alignment, it means (1 << 2) byte alignment.
Thus I recommend that you use ldrb.w / strb.w exclusively if any of your load or store instructions involve high registers, because aligning the instructions will insert a NOP, which usually costs you 1 clock cycle, so your timing will be affected.
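For example, a minimal sketch of what I mean, in GNU assembler syntax (the registers and offsets are just placeholders, following the listing above):
.syntax unified
.align 2                 /* (1 << 2) = 4-byte boundary */
ldrb.w r2, [r3, #0]      /* 32-bit encodings, explicitly requested with .w */
strb.w r2, [r0, #0]
ldrb.w r2, [r3, #0]
strb.w r2, [r0, #1]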
Other things you should know: some devices have limits on their GPIO speeds. Some devices have high-speed GPIO pins which follow the CPU speed, in which case you have nothing to worry about. Some devices, such as STM32 parts, let you configure the GPIO pin speed (you can choose between low, mid, high and very high speeds).
Also, if you're doing bit-banging, make sure no interrupts can disturb you while you're reading/writing - but you probably know that already.
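For example, something along these lines around the timed sequence (a sketch; it assumes nothing latency-critical is pending while interrupts are masked):
cpsid i              /* set PRIMASK - no interrupts during the timed sequence */
ldrb r2,[r3,#0]
strb r2,[r0,#0]
ldrb r2,[r3,#0]
strb r2,[r0,#1]
cpsie i              /* clear PRIMASK again */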
(If you have a dual-core configuration, then one core might also affect the timing of the other core; I believe this is due to memory read/write access; but as you're using a Cortex-M3, you're likely not using a dual core configuration).
Thank you for giving a very detailed explanation. I knew there was way too little background information in my original post. Thank you for pointing out the aspects of the system that would be relevant in this computation. Actually my system:
1) has no interrupts enabled
2) has fast GPIO, which I am reading from, operating at the CPU clock
3) uses only r0-r7 in these ldr/str instructions, hence all are 16-bit Thumb instructions.
I did some experiments since my post; I repeated the 1st set of instructions in my post, the load/store pairs (all of which are 16-bit instructions), 32 times and took some measurements. I wanted to confirm that the Address/Data phases are being pipelined as stated in the ARM Cortex-M3 Technical Reference Manual (which says "LDR R0,[R1,R2]; STR R0,[R3,#20] - normally three cycles total" and "Neighboring load and store single instructions can pipeline their address and data phases. This enables these instructions to complete in a single execution cycle.").
However, the ldrb/strb pair executed 32 times took 128 cycles as opposed to 64 (when measured using SysTick) - that is, 2 cycles per instruction. I even switched to using multiple registers (r0-r7) for the ldrb/strb pairs, instead of reusing the same register, just in case using the same register was causing stalls (though that did not seem likely, since the register r2 used in ldrb/strb was not used in computing the address).
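Roughly, the measurement looks like this (a sketch of the idea, not the exact code I used; it assumes SysTick is clocked from the core clock and the reload value is large enough that the counter does not wrap during the measurement):
ldr  r1,=0xE000E018   /* SYST_CVR - SysTick current value register (down-counter) */
ldr  r4,[r1]          /* first snapshot */
ldrb r2,[r3,#0]
strb r2,[r0,#0]
ldrb r2,[r3,#0]
strb r2,[r0,#1]
ldr  r5,[r1]          /* second snapshot */
subs r4,r4,r5         /* elapsed count, including the measurement overhead itself */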
Also, on another note, when I took the same measurements on the 2nd set of instructions, the strb-strb-strb-..., surprisingly it took only one cycle per 'strb' instruction, thus confirming that the Address and Data phases of 'strb-strb-strb' are getting pipelined.
I am now confused why the latter behaves as expected/stated in the manual but the former doesn't.
Thanks again ,
Hello,
I would like to confirm the 2nd set is 'strb-strb-strb-'.
Isn't it 'ldrb-ldrb-ldrb'?
If that is right, I guess the reason for the unexpected behavior would be the destination (SRAM) latency.
Because the source latency is 1 cycle thanks to the Fast GPIO, a series of Fast GPIO accesses can be executed at one per cycle.
I would like to propose an experiment in which the destination of the 1st case is also located in the Fast GPIO.
In this case, I guess the behavior of 'ldrb-strb' would match your expectation.
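For example, something like this (a sketch; FASTGPIO_OUT is a placeholder symbol for a device-specific Fast GPIO data register address):
ldr  r0,=FASTGPIO_OUT   /* destination: another Fast GPIO register instead of SRAM */
ldrb r2,[r3,#0]         /* r3 still points at the Fast GPIO source as before */
strb r2,[r0,#0]
ldrb r2,[r3,#0]
strb r2,[r0,#1]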
Best regards,
Yasuhiko Koumoto,
I understand your confusion, because I do not know the exact cause myself!
SysTick might not be very accurate; please see this post: cortex-M3 pipelining of consecutive LDR instructions.
Another thing that could affect the execution timing, is whether you run your code from RAM or Flash memory.
If you have the ART accelerator, you don't need to worry, but if not, you might want to execute the code from internal SRAM.
Now, back to the ldrb/strb pipeline ... It may help to experiment a little:
First thing to try is to use .align 2, then add the suffix .w to all ldrb and strb instructions. -Just in case it matters.
Assuming it didn't change anything, try the following:
Instead of pointing r3 and r0 to GPIO space, try pointing both to SRAM; preferably a different SRAM block than the one you're executing code from (in case you execute code from SRAM).
If the results differ, then the GPIO registers may be causing the delay (but there's no guarantee that this is the case, because there's also a data cache, which would help when loading from SRAM).
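For example (a sketch; the addresses are device-specific - here I use the two AHB SRAM blocks of an LPC17xx purely as an illustration):
ldr  r3,=0x2007C000   /* AHB SRAM0 - source buffer */
ldr  r0,=0x20080000   /* AHB SRAM1 - destination buffer, a different block */
ldrb r2,[r3,#0]
strb r2,[r0,#0]
ldrb r2,[r3,#0]
strb r2,[r0,#1]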
Regarding your first question, whether or not ldrb:strb:ldrb:strb can be tightly pipelined: I think it cannot.
As I understand it, nothing can be pipelined after STR.
The first LDR instruction takes 2 clock cycles; the next LDR instruction should take only one cycle (if it's pipelined).
STR rS,[rB,#imm] should always take 1 clock cycle.
Thus I would expect LDR:STR to take 3 clock cycles, not 4.
Note that the examples in the manual always end with a STR instruction.
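So for a single pair, my expectation from reading the TRM would be (the cycle counts are my interpretation, not a measurement):
ldrb r2,[r3,#0]   /* 2 cycles - first load in the sequence */
strb r2,[r0,#0]   /* 1 cycle - its address/data phase pipelines with the load */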
Another test to try, is the following:
ldrb r2,[r3,#0]
strb r1,[r0,#0]
ldrb r1,[r3,#0]
strb r2,[r0,#1]
strb r1,[r0,#2]
strb r2,[r0,#3]
-Thus you're not writing into a register right before reading it. This should not make any difference according to the manual, but it might be good to get it confirmed.
If you're reading only a single bit, then you'll have the option of saving the result in a register by using the BFI instruction.
Normally when doing this, it's best to read only the low bit(s).
Thus if reading only 2 bits from the port each time, you might be able to get away with ...
.set pos,30
.rept 16
ldr r2,[r3,#0]        /* read the port */
bfi r1,r2,#pos,#2     /* insert the low 2 bits of r2 into r1 at bit position 'pos' */
.set pos,pos-2        /* next pair goes 2 bits lower */
.endr
This should take a maximum of 3 clock cycles per bit pair. Thus the fewer bits you're sampling each time, the longer you can sample, spreading the storage over several registers.
Unfortunately, if you need to take a lot of samples, you'll run out of registers.
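A rough sketch of what I mean by spreading over several registers (r1 and r4 are arbitrary destination registers; the 2-bit field is just the example from above):
.set pos,30
.rept 16
ldr r2,[r3,#0]
bfi r1,r2,#pos,#2     /* first 16 samples packed into r1 */
.set pos,pos-2
.endr
.set pos,30
.rept 16
ldr r2,[r3,#0]
bfi r4,r2,#pos,#2     /* next 16 samples packed into r4 */
.set pos,pos-2
.endr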
I had forgotten that you are using a Cortex-M3.
Cortex-M3 does not have the Fast GPIO.
I wonder why the 2nd case took only 5 cycles.
Is it your assumption?
What platform do you use?
I experimented with my FRDM-K20D50M board.
Below are the results.
1) GPIO to SRAM
1-1) ldrb-strb x8 => 22 cycles
1-2) ldrb x8 (from GPIO) => 16 cycles
1-3) strb x8 (to GPIO) => 14 cycles
2) SRAM to SRAM
2-1) ldrb-strb x8 => 20 cycles
2-2) ldrb x8 (from SRAM) => 18 cycles
2-3) strb x8 (to SRAM) => 20 cycles
3) GPIO to GPIO
3-1) ldrb-strb x8 => 14 cycles
3-2) ldrb x8 (from GPIO) => 16 cycles
3-3) strb x8 (to GPIO) => 14 cycles
4) Flash to SRAM
4-1) ldrb-strb x8 => 22 cycles
4-2) ldrb x8 (from Flash) => 18 cycles
4-3) strb x8 (to Flash) => HardFault
In all cases, the execution of one instruction takes about 2 cycles.
The store buffer effect is sometimes seen.
I would like to know a scenario in which the AHB bus acts pipelined.
Yasuhiko Koumoto.
This is very interesting.
I'm particularly amazed that the STRB takes longer when writing to the SRAM.
Did you execute the code from Flash memory or from SRAM ?
Hello jensbauer,
the code was executed from Flash memory.
The processor is Freescale Kinetis K20 (of which core is Cortex-M4).
I think this would be a Kinetis-specific phenomenon.
The GPIO is directly connected to the crossbar switch but the SRAM is connected to the crossbar switch via SRAM controller.
So, I think the delay came from the overhead of the SRAM controller.
Hello jensbauer,
I revised my results after seeing your comments.
I measured NOP cycles and found there was a 7-cycle overhead in the results (i.e. NOPx16 = 23 cycles, NOPx32 = 39 cycles).
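The overhead was estimated with blocks roughly of this form between the two timer reads (a sketch; since 23 - 16 = 7 and 39 - 32 = 7, the fixed measurement overhead is 7 cycles):
.rept 16              /* and again with .rept 32 */
nop
.endr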
Also I re-measured the executions.
              ldrb:strb x8   ldrb x8 (from src)   strb x8 (to src)
SRAM to SRAM:      22               22                  22
GPIO to SRAM:      23               16                  17
SRAM to GPIO:      23               22                  22
GPIO to GPIO:      17               16                  17
Below are the results after compensating for the 7-cycle overhead.
              ldrb:strb x8   ldrb x8 (from src)   strb x8 (to src)
SRAM to SRAM:      15               15                  15
GPIO to SRAM:      16                9                  10
SRAM to GPIO:      16               15                  15
GPIO to GPIO:      10                9                  10
From this, in Kinetis, SRAM access will take 2 cycles (i.e. cannot be pipelined).
GPIO access will be pipelined.
From this thread, I have learned much. Thank you.
It's good to see that the results improved.
I learned something from this exercise as well. I learned that it is important to align instructions on 4-byte boundaries, when timing is critical.
-Perhaps that's why I had problems understanding the cycle timing earlier.
Timing instructions is not always straight-forward. One has to take precautions and compensate for a number of things.
Of course, my test is only designed for non-cached instructions, and it also assumes that execution begins at the first instruction of the measured block.
To measure an instruction like LDR between two LDR instructions, one would of course first measure two LDR instructions, then measure a block containing 3 LDR instructions and finally subtract the two results.
I'm still puzzled why I got 8 clock cycles for the ldrb/strb pairs and not 9. I've re-checked my timer, and it's not running at half speed; indeed, when I wait for 100000000 counts, I get a one second LED-blink, otherwise I'd get half a second.
I bet Joseph would be interested in seeing our results.
(I previously posted incorrect results, due to buggy code.)
These are my new LPC1768 (Cortex-M3) test results:
ldrb.n / strb.n / ldrb.w / strb.w on 4-byte boundary addresses:
                 ldrb:strb   ldrb   strb
SRAM0 to SRAM1:      8         4      4
GPIO0 to SRAM1:      8         4      4
SRAM0 to GPIO2:      8         4      4
GPIO0 to GPIO2:      8         4      4
If the first ldrb/strb is not on a 4-byte boundary address:
                 ldrb:strb   ldrb   strb
SRAM0 to SRAM1:      9         5      5
GPIO0 to SRAM1:      9         5      5
SRAM0 to GPIO2:      9         5      5
GPIO0 to GPIO2:      9         5      5
If using ldrb.w/strb.w and the instructions are not on 4-byte boundary addresses:
                 ldrb:strb   ldrb   strb
SRAM0 to SRAM1:     13         7      6
GPIO0 to SRAM1:     13         7      6
SRAM0 to GPIO2:     13         7      6
GPIO0 to GPIO2:     13         7      6
It seems I get the same results when executing the code from local SRAM, as I do from Flash memory.
Local SRAM is at address 0x10000000.
AHB SRAM0 is at address 0x2007c000.
AHB SRAM1 is at address 0x20080000.
Timer0, set to the CPU frequency, is used for measuring.
But I do not really believe my own test-results, because where did the initial clock cycle for LDR go ?
...I just ran a few extra checks: 16 nop.n or nop.w takes 16 clock cycles, 16 mla takes 32 clock cycles, so I guess the above results are correct, somehow.
The test-code basically looks like this:
push {r4-r7,lr}
// nop.n
ldr.w r5,[r2,#0] /* snapshot timer counter */
nop.w /* flush pipeline */
ldr.w r6,[r2,#0] /* snapshot timer counter */
ldrb r3,[r0,#0]
strb r3,[r1,#0]
strb r3,[r1,#1]
strb r3,[r1,#2]
strb r3,[r1,#3]
nop /* flush pipeline, so we're sure our LDR won't get pipelined */
ldr r7,[r2,#0] /* snapshot timer counter */
nop
sub r0,r7,r6
sub r1,r6,r5
sub r0,r0,r1
pop {r4-r7,pc}
Hi Jens,
Sorry for not being very active in here recently. I'm having a crazy busy time at the moment.
It seems you are running the processor fairly fast (100MHz?), so the flash wait states and the cache nature of the ART in STM32 can certainly impact the timing. And of course the behaviour of the bus infrastructure (e.g. bus bridges, bus matrix, etc.) can also affect the timing. Reading data from flash can have worse timing because the data is not necessarily cached in the instruction buffer, even if you have enabled the ART flash access accelerator.
Inside the Cortex-M3/M4, if you are executing successive unaligned LDR.W and STR.W instructions, yes, there are some pipeline "bubbles" that can affect the timing. One or two successive LDR.W / STR.W do not have such an effect, but having more in the sequence can lead to extra cycles (sorry, I can't remember the details off the top of my head). 16-bit instructions don't have this effect.
Making the branch target aligned to a 32-bit boundary can also help, especially if the branch target is a 32-bit instruction.
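For example (a sketch; the loop body is arbitrary), forcing the branch target onto a 4-byte boundary:
.align 2              /* place the loop label on a 4-byte boundary */
loop:
ldr.w r3,[r0],#4      /* 32-bit instruction at the branch target */
str.w r3,[r1],#4
subs r2,r2,#1
bne loop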
regards,
Joseph
Thank you, Joseph, this definitely helps a lot in understanding what to do and how to do it.
As I understand it, it sounds like it's a good idea to use 16-bit instructions (and align them on a 32-bit boundary if one can swap two instructions).
I was very much surprised with the LPC1768 - perhaps my measuring results weren't all wrong after all.
Reading your reply, it appears that it's good to place some integer instructions (eg. add, sub, cmp, and, orr, eor, shifts, etc.) right after STR-type instructions, because then there will not be pipeline bubbles, since the LDR/STR block will be short and thus won't be "exhausted" - or am I getting this part wrong ?
For best performance: in general, pipelined LDR and STR sequences are good for Cortex-M3/M4 (not applicable to Cortex-M0, M0+, M7).
This reduces the subsequent LDR/STR instructions to 1 cycle (assuming 0 wait states and no unaligned/bit-band transfers).
If you insert operations between LDR/STR, then each LDR would be 2 cycles (STR could still be 1 cycle because of the write buffer).
Ideally, use 16-bit LDR/STR for this (also for code size benefit).
If you need to use 32-bit versions, then try to make sure that the pipelining LDR/STR instructions are aligned to 32-bit addresses.
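As a rough illustration of the two patterns (cycle counts assume 0 wait states and 16-bit encodings, per the above):
/* back-to-back: address/data phases pipeline */
ldrb r2,[r3,#0]     /* 2 cycles (first load) */
strb r2,[r0,#0]     /* 1 cycle */
ldrb r2,[r3,#1]     /* 1 cycle */
strb r2,[r0,#1]     /* 1 cycle */
/* an operation between the pair breaks the pipelining */
ldrb r2,[r3,#0]     /* 2 cycles */
adds r2,r2,#1       /* 1 cycle - ALU op between load and store */
strb r2,[r0,#0]     /* 1 cycle (write buffer) */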
regards.
Thank you for the detailed reply; this sounds great, because often I do the following ...
.rept 200
ldr rX,[rS,#imm]
str rX,[rT,#imm]
/* (optionally integer instructions here) */
.endr
...so that integer instructions (eg. ADD, SUB, AND, ORR, EOR, shifts, etc.) will be right after STR and before LDR; never after LDR.
But in order to make my question more clear:
A:
.rept 20
ldr rX,[rS,#imm]
str rX,[rT,#imm]
.endr
B:
.rept 10
ldr rX,[rS,#imm]
str rX,[rT,#imm]
and rX,rX,#imm
.endr
I understand it as: example B would not suffer from bubbles in the pipeline, since nothing gets pipelined after a STR, thus it does not matter which instruction we place after STR - correct ?
-Would bubbles appear in the pipeline in example A (because of the long list of LDR/STR), or would the two be equally efficient ?