Hello everyone,
I am currently working on a Cortex-M0 microcontroller (LPC1114). I have looked through all the instruction descriptions I could find, but none of them explains why some instructions take two cycles to execute.
For example, ANDS and MOVS take only one cycle to execute, but why do LDR and STR need two cycles?
Thanks so much for your answer.
So there will be a pipeline stall for LDR and STR to allow the bus to fetch the data?
For example, assume I have a PC-relative LDR instruction such as LDR R4, =0xFEDCBA98.
At cycle 0, this instruction is fetched together with another instruction (assume it is a NOP).
At cycle 1, the PC-relative LDR instruction is decoded.
At cycle 2, the PC-relative LDR instruction is executed, and the final address of the data is calculated as PC + offset.
At cycle 3, the processor has a pipeline stall cycle to let the bus write the value back to the R4 register.
The picture above is a timing diagram.
Assume 0xFEDCBA98 is stored at address 0x074. Cycle 2 is the execution cycle of the LDR R4, =0xFEDCBA98 instruction. Thus, PC + offset is ready at time t2 (some delay after clock edge 1). Then the data will be available on the bus at time t2 and is written back to the register at clock edge 3.
Cycle 3 is the pipeline stall cycle.
Please correct me if I am wrong.
Thanks so much!
Sorry, I have no idea about pipeline etc.
Cortex-M0 is based on a simple 3-stage pipeline design: fetch, decode, execute.
The data access starts in the execute stage. Due to the pipelined nature of the AHB bus protocol (an address phase followed by a data phase), the data transfer happens 1 cycle after the address is issued in the execute stage. As a result, the pipeline is stalled for 1 cycle. Hence single load and store instructions take 2 cycles on Cortex-M0.
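The stall described above can be pictured with a small sketch. This is only an illustrative model, not a description of the actual hardware: it lists which instruction occupies the execute stage in each cycle, using the cycle counts quoted in this thread (1 for ANDS and MOVS, 2 for LDR and STR on Cortex-M0). The names EXECUTE_CYCLES_M0 and execute_timeline are made up for this example.

```python
# Illustrative model only: execute-stage occupancy on a Cortex-M0,
# using the cycle counts quoted in this thread.
EXECUTE_CYCLES_M0 = {"ANDS": 1, "MOVS": 1, "LDR": 2, "STR": 2}


def execute_timeline(program):
    """Return the instruction occupying the execute stage in each cycle."""
    timeline = []
    for mnemonic in program:
        # A 2-cycle instruction holds the execute stage for one extra cycle
        # (the stall while the AHB data phase completes).
        timeline.extend([mnemonic] * EXECUTE_CYCLES_M0[mnemonic])
    return timeline


if __name__ == "__main__":
    for cycle, mnemonic in enumerate(execute_timeline(["MOVS", "LDR", "ANDS"])):
        print(cycle, mnemonic)
```

In this model the LDR appears in two consecutive execute cycles, so the ANDS behind it starts executing one cycle later than it would after a single-cycle instruction.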
Cortex-M3 is also based on the AHB bus protocol, but there the store operation can hide the stall cycle using a write buffer (i.e. the instruction is treated as completed even though the transaction on the bus interface has not completed yet).
regards,
Joseph
Thanks so much for the explanation!
Hello Joseph Yiu,
But what about LDR on CM4 (and I think it's the same on CM3): why does it take only 2 cycles, according to both the CM4 technical reference manual and the hardware cycle counter when I checked, even for this type of LDR:
LDR r0, [r1, r2] ?
Please correct me if I am wrong, but doesn't it need an extra cycle to calculate r1 + r2? In the older ARM7TDMI spec, LDR takes 3 cycles, one of which is spent calculating the address. I tried to reason out how the extra cycle could be hidden, but I cannot find a complete explanation. One thing I thought of: CM3 and CM4 are Harvard, so perhaps the memory access and the instruction fetch can happen at the same time. But isn't a cycle still needed to update the register with the memory contents before the next instruction?
For the case of STR taking only one cycle on CM4, the manual says: "STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing."
So the spec is "cheating" somewhat: there is a last cycle, but at least it happens while the next instruction is executing.
There is nothing more detailed in the CM4 manual about how LDR is implemented, however.
A clarification of how and why LDR takes only 2 cycles here would be very much appreciated (ignoring the case where the transfer is to/from the PC, which takes longer).
Hi,
In Cortex-M3 and Cortex-M4, a single LDR takes two cycles. This is documented in:
http://infocenter.arm.com/help/topic/com.arm.doc.100166_0001_00_en/ric1417175925887.html
The Cortex-M3 and Cortex-M4 have a three-stage pipeline with a simple fetch-decode-execute arrangement.
The LDR's address cycle is the first cycle of execute, and the read data is available in the next cycle, hence a single LDR takes two cycles. For multiple load/store instructions or back-to-back loads, the processor can detect that the next operation is also a data memory access and generate the address for it while the pipeline is waiting for the first data. Therefore, if a multiple load reads N data items, it takes N+1 cycles.
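The N+1 figure can be checked with a small sketch of the overlap. This is a simplified model assuming zero wait states and nothing else competing for the bus; load_phases is a made-up helper name, not anything from the Arm documentation. Each row shows which load is in its address phase and which is in its data phase in a given cycle.

```python
def load_phases(n_loads):
    """Model back-to-back loads on a pipelined bus (zero wait states assumed).

    Returns one (cycle, load_in_address_phase, load_in_data_phase) tuple per
    cycle, with None where a phase is idle. Loads are numbered from 0.
    """
    rows = []
    for cycle in range(n_loads + 1):
        # Load k issues its address in cycle k ...
        addr = cycle if cycle < n_loads else None
        # ... and its data arrives in cycle k + 1, overlapping the next address.
        data = cycle - 1 if cycle >= 1 else None
        rows.append((cycle, addr, data))
    return rows


if __name__ == "__main__":
    for row in load_phases(3):
        print(row)
```

With a single load the model gives 2 cycles (address, then data); with N loads only the first address phase is not overlapped, which gives N+1.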
For stores, the address is also output in the execute stage, but the processor does not need to wait until the write is completed in the next cycle because there is a write buffer at the bus interface. Hence stores take only one cycle.
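To put the figures from this thread side by side, here is a rough sketch that adds up execute cycles for a short sequence on each core. It assumes isolated (not back-to-back) loads, zero wait states, and that ANDS and MOVS are single-cycle on Cortex-M3/M4 as well, which is not stated in this thread. CYCLES and sequence_cycles are names invented for the example.

```python
# Execute cycles per instruction, taken from the figures quoted in this thread.
# Assumption (not from the thread): ANDS/MOVS are also 1 cycle on Cortex-M3/M4.
CYCLES = {
    "Cortex-M0": {"ANDS": 1, "MOVS": 1, "LDR": 2, "STR": 2},
    "Cortex-M3/M4": {"ANDS": 1, "MOVS": 1, "LDR": 2, "STR": 1},
}


def sequence_cycles(core, program):
    """Sum the execute cycles of a sequence with no back-to-back loads."""
    return sum(CYCLES[core][mnemonic] for mnemonic in program)


if __name__ == "__main__":
    program = ["LDR", "ANDS", "STR"]
    for core in CYCLES:
        print(core, sequence_cycles(core, program))
```

For the LDR, ANDS, STR sequence this gives 5 cycles on Cortex-M0 and 4 on Cortex-M3/M4; the one-cycle difference is the store whose data phase is absorbed by the write buffer.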
The address generation unit uses a combinatorial path to handle the address generation (Rn+Rm) or (Rn+offset) and outputs the result to the bus immediately without registering it. As a result it doesn't take an extra cycle, but it means the timing constraint on the bus interface is tight.
Please note that if you measure timing with the DWT cycle counter while single stepping, the enabling and disabling of that counter is not guaranteed to line up with the execution cycles at halting and unhalting.
I am reading this excellent, detailed answer only now, after my first detailed ARM course.
Joseph Yiu: is there a new or recent book which covers ARMv7, all the Cortex-M processors, and the ARM bus protocols in great detail? I mean not multiple ARM manuals, but a single book consolidating it all.
For bus system design, Arm have this book available:
https://www.arm.com/resources/education/books/soc-reference-book
This only covers AHB and APB, but not AXI (there is too much information for a single book to cover everything).