Hello,
What is the latency for the LDR instruction when the result is used for integer arithmetic operations (for example DSP MAC instructions)?
Also, can 64-bit loads (LDRD) be paired with another instruction? Can I do for example a 64-bit load and an integer MAC at the same time?
I hope ARM will add the latencies and more detailed pairing information to the reference manual soon.
Antti
Hello Antti,
As I don't know the details of the Cortex-M7 micro-architecture, I measured the latencies on a real chip.
LDR r1,[r0] 0.900 cycles
----------------
LDR r1,[r0] 1.000 cycles
MLA r3,r2,r2,r1
----------------
LDR r1,[r0] 2.000 cycles
MLA r3,r1,r1,r2
----------------
LDR r1,[r0] 0.999 cycles
MLA r3,r2,r2,r2
The results show that with no register dependency, LDR and MLA are executed concurrently, and with a register dependency some wait cycles occur. Even then, if the LDR result is the addend of the MLA, the wait cycles are hidden.
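LDRD r2,r3,[r0] 0.999 cycles
----------------
LDRD r2,r3,[r0] 1.900 cycles
MLA r1,r0,r0,r2
----------------
LDRD r2,r3,[r0] 1.900 cycles
MLA r1,r0,r0,r3
----------------
LDRD r2,r3,[r0] 2.900 cycles
MLA r1,r2,r2,r0
----------------
LDRD r2,r3,[r0] 2.900 cycles
MLA r1,r3,r3,r0
----------------
LDRD r2,r3,[r0] 1.900 cycles
MLA r1,r0,r0,r0
The results show that LDRD and MLA cannot be executed concurrently, and that the operand order strongly affects their latency.
As for adding the latencies to the reference manual, I think it would be too difficult, because each instruction would have many variations depending on the conditions.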
Best regards,
Yasuhiko Koumoto.
Thanks for the help, Yasuhiko!
yasuhikokoumoto wrote:
The results show that with no register dependency, LDR and MLA are executed concurrently, and with a register dependency some wait cycles occur. Even then, if the LDR result is the addend of the MLA, the wait cycles are hidden.
This is good to know. It looks like the result of the load is forwarded after the second load pipeline stage to the MAC unit. This explains why the LDR and MLA can pair if the LDR result is used for the accumulation part.
yasuhikokoumoto wrote:
LDRD r2,r3,[r0] 2.900 cycles
MLA r1,r2,r2,r0
----------------
LDRD r2,r3,[r0] 2.900 cycles
MLA r1,r3,r3,r0
----------------
LDRD r2,r3,[r0] 1.900 cycles
MLA r1,r0,r0,r0
The results show that LDRD and MLA cannot be executed concurrently, and that the operand order strongly affects their latency.
This is strange. There seems to be an unexplained one cycle extra latency if the LDRD result is used in the MLA. Could you test the same case using two separate LDR instructions instead of a single LDRD?
I hope ARM will add the latencies and more detailed pairing information to the reference manual soon.
yasuhikokoumoto wrote:
I think it would be too difficult because there would be many variations of each instruction according to the conditions.
I believe 90% of use cases could be explained with relatively simple clarifications about how the load and ALU units work and which common cases cause extra stalls.
I tried and the results are shown below.
LDR r2,[r0] 0.999 cycles
LDR r3,[r0,#4]
----------------
LDR r2,[r0] 1.499 cycles
LDR r3,[r0,#4]
MLA r1,r0,r0,r2
----------------
LDR r2,[r0] 1.900 cycles
LDR r3,[r0,#4]
MLA r1,r0,r0,r3
----------------
LDR r2,[r0] 2.000 cycles
LDR r3,[r0,#4]
MLA r1,r2,r2,r0
----------------
LDR r2,[r0] 2.900 cycles
LDR r3,[r0,#4]
MLA r1,r3,r3,r0
----------------
LDR r2,[r0] 1.499 cycles
LDR r3,[r0,#4]
MLA r1,r0,r0,r0
From the results, LDRD and two LDRs appear to be handled identically, but the issue rate of two instructions per cycle prevents the 3rd instruction from issuing in the same cycle.
I think that some instructions are split into several micro-operations, as on Intel processors.
For example, LDRD would be converted to two 'load's, and MLA would be converted to a 'mult' and an 'add'.
In the 1st cycle, a 'load' and a 'load' would be issued, and
in the 2nd cycle, a 'mult' and an 'add' would be issued, if there is no register dependency.
I think that if we drew a pipeline chart at the micro-operation level, the measured cycle counts could be accounted for.
Best regards,
Yasuhiko Koumoto.
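The micro-operation guess above can be sketched as a tiny host-side model. This is only an illustration of the hypothesis, not a description of the real Cortex-M7 pipeline: the micro-op table, the `issue_chart` helper and the choice of Python are assumptions made for this sketch, and register dependencies are ignored entirely, so it says nothing about the 2.900-cycle cases.
```python
# Toy model of the micro-op hypothesis from this thread.
# ASSUMPTION: the expansion table below is the guess from the post,
# not something documented by ARM.
MICRO_OPS = {
    "LDR": ["load"],
    "LDRD": ["load", "load"],
    "MLA": ["mult", "add"],
}

ISSUE_WIDTH = 2  # micro-ops issued per cycle in this model


def issue_chart(instructions):
    """Pack micro-ops in program order, ISSUE_WIDTH per cycle.

    Register dependencies are deliberately ignored, so this only
    covers the "no register dependency" case.
    """
    uops = [u for ins in instructions for u in MICRO_OPS[ins]]
    return [uops[i:i + ISSUE_WIDTH] for i in range(0, len(uops), ISSUE_WIDTH)]


if __name__ == "__main__":
    # LDRD followed by an independent MLA
    for cycle, slots in enumerate(issue_chart(["LDRD", "MLA"]), start=1):
        print(f"cycle {cycle}: {' + '.join(slots)}")
```
Under these assumptions the LDRD takes both issue slots of the first cycle and the MLA both slots of the second, so the pair needs two issue cycles, which is in the same range as the 1.900 cycles measured for LDRD followed by an independent MLA. A chart that also tracked result forwarding would be needed to test the guess against the dependent cases.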