Architectures and Processors forum

Cortex-M7 load instruction latency and pairing

Antti over 10 years ago

Hello,

What is the latency for the LDR instruction when the result is used for integer arithmetic operations (for example DSP MAC instructions)?

Also, can 64-bit loads (LDRD) be paired with another instruction? Can I do for example a 64-bit load and an integer MAC at the same time?

I hope ARM will add the latencies and more detailed pairing information to the reference manual soon.

Antti

+1 Yasuhiko Koumoto over 10 years ago in reply to Antti
Hello Antti,
This is strange. There seems to be an unexplained one cycle extra latency if the LDRD result is used in the MLA. Could you test the same case using two separate LDR instructions instead of a single LDRD?
I tried and the results are shown below.
LDR r2,[r0]          0.999
LDR r3,[r0,#4]
----------------
LDR r2,[r0]          1.499
LDR r3,[r0,#4]
MLA r1,r0,r0,r2
----------------
LDR r2,[r0]          1.900
LDR r3,[r0,#4]
MLA r1,r0,r0,r3
----------------
LDR r2,[r0]          2.000
LDR r3,[r0,#4]
MLA r1,r2,r2,r0
----------------
LDR r2,[r0]          2.900
LDR r3,[r0,#4]
MLA r1,r3,r3,r0
----------------
LDR r2,[r0]          1.499
LDR r3,[r0,#4]
MLA r1,r0,r0,r0
From these results, an LDRD and two LDRs appear to behave identically, but the issue rate of two instructions per cycle prevents a third instruction from issuing in the same cycle.
I think some instructions are split into several micro-operations, as on Intel processors.
For example, LDRD would be converted to two 'load's and MLA to a 'mult' and an 'add'. Then:
at the 1st cycle, a 'load' and a 'load' would be issued, and
at the 2nd cycle, a 'mult' and an 'add' would be issued, provided there is no register dependency.
I think that if we drew a pipeline chart at the micro-operation level, the measured latencies could be explained.
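To make the "issue rate of two" point concrete, here is a rough Python sketch that compares each measured figure above with the fewest cycles needed just to issue the instructions two at a time. The post does not state the unit of the figures; treating them as cycles per pass through the sequence is an assumption, and the names `CASES`, `issue_bound` and `summarize` are only illustrative.

```python
# Figures copied from the measurements above.
# ASSUMPTION: each figure is cycles per pass through the listed sequence.
ISSUE_WIDTH = 2  # "issue rate of two" mentioned in the post

CASES = [
    # (sequence, instruction count, measured figure)
    ("LDR, LDR", 2, 0.999),
    ("LDR, LDR, MLA r1,r0,r0,r2 (r2 as accumulator)", 3, 1.499),
    ("LDR, LDR, MLA r1,r0,r0,r3 (r3 as accumulator)", 3, 1.900),
    ("LDR, LDR, MLA r1,r2,r2,r0 (r2 as multiplicand)", 3, 2.000),
    ("LDR, LDR, MLA r1,r3,r3,r0 (r3 as multiplicand)", 3, 2.900),
    ("LDR, LDR, MLA r1,r0,r0,r0 (no load dependency)", 3, 1.499),
]


def issue_bound(n_instructions, width=ISSUE_WIDTH):
    """Average cycles needed to issue n instructions at `width` per cycle."""
    return n_instructions / width


def summarize(cases=CASES):
    """Return (sequence, bound, measured, extra) rows; extra = measured - bound."""
    rows = []
    for name, count, measured in cases:
        bound = issue_bound(count)
        rows.append((name, bound, measured, round(measured - bound, 3)))
    return rows


if __name__ == "__main__":
    for name, bound, measured, extra in summarize():
        print(f"{name:<48} bound={bound:.3f} measured={measured:.3f} extra={extra:+.3f}")
```

Under that reading, the two-LDR case and the MLA cases that either do not use a loaded register or use r2 only as the accumulator sit on the issue bound (within 0.001). The remaining cases exceed it by about 0.4, 0.5 and 1.4, which is the part that the issue rate alone does not account for.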
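As a starting point for such a chart, the following toy Python sketch expands each instruction into the micro-operations guessed above and packs them in order, two per cycle. The expansion table is the hypothesis from this post, not documented Cortex-M7 behaviour. The single 'load' for LDR and the names `MICRO_OPS`, `expand`, `pack` and `chart` are assumptions made for illustration. Register dependencies and result latencies are deliberately not modelled.

```python
# HYPOTHETICAL expansion table, following the guess in the post.
MICRO_OPS = {
    "LDR": ["load"],           # assumed: a single load
    "LDRD": ["load", "load"],  # post's guess: two loads
    "MLA": ["mult", "add"],    # post's guess: multiply, then add
}
ISSUE_WIDTH = 2  # at most two micro-operations issued per cycle


def expand(instructions):
    """Flatten a list of mnemonics into their micro-operations, in order."""
    uops = []
    for mnemonic in instructions:
        uops.extend(MICRO_OPS[mnemonic])
    return uops


def pack(instructions, width=ISSUE_WIDTH):
    """Group micro-operations in order, `width` per cycle, ignoring dependencies."""
    uops = expand(instructions)
    return [uops[i:i + width] for i in range(0, len(uops), width)]


def chart(instructions):
    """Render one line per cycle listing the micro-operations issued in it."""
    lines = []
    for cycle, slot in enumerate(pack(instructions), start=1):
        lines.append(f"cycle {cycle}: " + ", ".join(slot))
    return "\n".join(lines)


if __name__ == "__main__":
    print(chart(["LDRD", "MLA"]))
```

For `["LDRD", "MLA"]` this prints `cycle 1: load, load` followed by `cycle 2: mult, add`, matching the two cycles described above. It is not meant to reproduce the measured figures. A fuller chart would also have to hold a micro-operation back until its source registers are ready.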
Best regards,
Yasuhiko Koumoto.