I am using the following assembly sequences on an ARM Cortex-M3: one reads a memory-mapped I/O register and saves the value to RAM, and the other reads the same I/O register into multiple registers. I want to know exactly how many CPU cycles each takes to complete, or in other words, how fast I am reading the register.
1) Read and save to memory: can the LDR/STR, LDR/STR sequence below be tightly pipelined (with the address phase of one instruction overlapping the data phase of the previous one), in which case it would take only 9 cycles?
486: 781a ldrb r2, [r3, #0]
488: 7002 strb r2, [r0, #0]
48a: 781a ldrb r2, [r3, #0]
48c: 7042 strb r2, [r0, #1]
48e: 781a ldrb r2, [r3, #0]
490: 7082 strb r2, [r0, #2]
492: 781a ldrb r2, [r3, #0]
494: 70c2 strb r2, [r0, #3]
2) Read into multiple registers: I am assuming these instructions take 5 cycles.
48a: 781c ldrb r4, [r3, #0]
48c: 781d ldrb r5, [r3, #0]
48e: 781e ldrb r6, [r3, #0]
I appreciate any insight you can provide.
Thanks,
Hello jensbauer,
I revised my results after seeing your comments.
I measured the NOP cycle counts and found a 7-cycle overhead in the results (i.e. NOPx16 = 23 cycles, NOPx32 = 39 cycles).
Also I re-measured the executions.
              ldrb/strb x8   ldrb x8 (from src)   strb x8 (to src)
SRAM to SRAM:      22               22                  22
GPIO to SRAM:      23               16                  17
SRAM to GPIO:      23               22                  22
GPIO to GPIO:      17               16                  17
Below are the same results after subtracting the 7-cycle overhead.
              ldrb/strb x8   ldrb x8 (from src)   strb x8 (to src)
SRAM to SRAM:      15               15                  15
GPIO to SRAM:      16                9                  10
SRAM to GPIO:      16               15                  15
GPIO to GPIO:      10                9                  10
From this, on Kinetis, an SRAM access takes 2 cycles (i.e. it cannot be pipelined), while GPIO accesses can be pipelined.
From this thread, I have learned much. Thank you.
Best regards,
Yasuhiko Koumoto.
It's good to see that the results improved.
I learned something from this exercise as well. I learned that it is important to align instructions on 4-byte boundaries, when timing is critical.
-Perhaps that's why I had problems understanding the cycle timing earlier.
Timing instructions is not always straight-forward. One has to take precautions and compensate for a number of things.
Of course, my test is only designed for non-cached instructions, and it also assumes that execution begins at the first instruction of the measured block.
To measure an instruction like LDR between two LDR instructions, one would of course first measure two LDR instructions, then measure a block containing 3 LDR instructions and finally subtract the two results.
I'm still puzzled about why I got 8 clock cycles for the ldrb/strb pairs and not 9. I've re-checked my timer, and it's not running at half speed; indeed, when I wait for 100000000 counts, I get a one-second LED blink, whereas otherwise I'd get half a second.
I bet Joseph would be interested in seeing our results.
Hi Jens,
Sorry for not being very active here recently; I'm having a crazy busy time at the moment.
It seems you are running the processor fairly fast (100 MHz?), so the flash wait states and the cache-like nature of the ART in the STM32 can certainly impact the timing. And of course the behaviour of the bus infrastructure (e.g. bus bridges, bus matrix, etc.) could also affect the timing. Reading data from flash can have worse timing because the data is not necessarily cached in the instruction buffer, even if you have enabled the ART flash access accelerator.
Inside the Cortex-M3/M4, if you are executing successive unaligned LDR.W and STR.W instructions, yes, there are some pipeline "bubbles" that can affect the timing. One or two successive LDR.W / STR.W instructions do not have this effect, but having more in the sequence can lead to extra cycles (sorry, I can't remember the details off the top of my head). 16-bit instructions don't have this effect.
Making the branch target 32-bit aligned can also help, especially if the branch target is a 32-bit instruction.
regards,
Joseph
Thank you, Joseph, this definitely helps a lot in understanding what to do and how to do it.
As I understand it, it sounds like it's a good idea to use 16-bit instructions (and align them on a 32-bit boundary if one can swap two instructions).
I was very much surprised with the LPC1768 - perhaps my measuring results weren't all wrong after all.
When reading your reply, it appears that it's good to place some integer instructions (e.g. add, sub, cmp, and, orr, eor, shifts, etc.) right after STR-type instructions, because then there will not be pipeline bubbles: the LDR/STR block will be short, so it won't be "exhausted". Or am I getting this part wrong?
For best performance on Cortex-M3/M4, back-to-back (pipelined) LDR and STR instructions are generally good (this does not apply to Cortex-M0, M0+, or M7).
Pipelining reduces each subsequent LDR/STR instruction to 1 cycle (assuming 0 wait states and no unaligned or bit-band transfers).
If you insert operations between the LDRs/STRs, then each LDR takes 2 cycles (an STR can still be 1 cycle because of the write buffer).
Ideally, use the 16-bit LDR/STR encodings for this (also for the code-size benefit).
If you need to use the 32-bit versions, try to make sure that the pipelined LDR/STR instructions are aligned to 32-bit addresses.
regards.
Thank you for the detailed reply; this sounds great, because often I do the following ...
.rept 200
ldr rX,[rS,#imm]
str rX,[rT,#imm]
(optionally integer instructions here)
.endr
...so that integer instructions (e.g. ADD, SUB, AND, ORR, EOR, shifts, etc.) will be right after STR and before LDR; never after LDR.
But in order to make my question more clear:
A:
.rept 20
ldr rX,[rS,#imm]
str rX,[rT,#imm]
.endr
B:
.rept 10
ldr rX,[rS,#imm]
str rX,[rT,#imm]
and rX,rX,#imm
.endr
I understand it as: example B would not suffer from pipeline bubbles, since nothing gets pipelined after a STR, thus it does not matter which instruction we place after the STR. Correct?
-Would bubbles appear in the pipeline in example A (because of the long list of LDR/STR instructions), or would the two be equally efficient?