Cortex-M7 load instruction latency and pairing

Hello,

What is the latency for the LDR instruction when the result is used for integer arithmetic operations (for example DSP MAC instructions)?

Also, can 64-bit loads (LDRD) be paired with another instruction? Can I do for example a 64-bit load and an integer MAC at the same time?

I hope ARM will add the latencies and more detailed pairing information to the reference manual soon.

Antti

  • Hello Antti,

    This is strange. There seems to be an unexplained one cycle extra latency if the LDRD result is used in the MLA. Could you test the same case using two separate LDR instructions instead of a single LDRD?

    I tried and the results are shown below.

    LDR r2,[r0]    0.999
    LDR r3,[r0,#4]
    ----------------
    LDR r2,[r0]    1.499
    LDR r3,[r0,#4]
    MLA r1,r0,r0,r2
    ----------------
    LDR r2,[r0]    1.900
    LDR r3,[r0,#4]
    MLA r1,r0,r0,r3
    ----------------
    LDR r2,[r0]    2.000
    LDR r3,[r0,#4]
    MLA r1,r2,r2,r0
    ----------------
    LDR r2,[r0]    2.900
    LDR r3,[r0,#4]
    MLA r1,r3,r3,r0
    ----------------
    LDR r2,[r0]    1.499
    LDR r3,[r0,#4]
    MLA r1,r0,r0,r0
    

    From these results, an LDRD and two LDRs appear to behave identically, but the issue limit of two instructions per cycle prevents the third instruction from issuing in the same cycle.

    I think some instructions are split into several micro-operations, as on Intel processors.

    For example, LDRD would be converted into two 'load's, and MLA would be converted into a 'mult' and an 'add'. Then:

    in the 1st cycle, a 'load' and a 'load' would be issued, and

    in the 2nd cycle, a 'mult' and an 'add' would be issued, if there is no register dependency.

    I think that if we drew a pipeline chart at the micro-operation level, the measured latencies could be explained.
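
    The micro-operation idea above can be sketched as a toy in-order issue model in Python. Everything in it is an assumption made for illustration, not documented Cortex-M7 behaviour: the names `MicroOp`, `split` and `schedule` are hypothetical, the load-to-use latency `LOAD_LAT = 2` is a guess, the 'mult' result is assumed to be forwarded to its paired 'add' in the same cycle, and address offsets are ignored. It does not try to reproduce the measured figures above; it only shows how such a pipeline chart could be generated.

```python
from dataclasses import dataclass

LOAD_LAT = 2  # assumed load-to-use latency in cycles (a guess, not from ARM)


@dataclass(frozen=True)
class MicroOp:
    name: str
    dst: tuple = ()
    src: tuple = ()
    latency: int = 1


def split(instr):
    """Split one instruction tuple into hypothetical micro-operations."""
    op, *r = instr
    if op == "LDR":   # ("LDR", rt, rn); the address offset is ignored
        return [MicroOp("load", (r[0],), (r[1],), LOAD_LAT)]
    if op == "LDRD":  # ("LDRD", rt, rt2, rn) -> two loads
        return [MicroOp("load", (r[0],), (r[2],), LOAD_LAT),
                MicroOp("load", (r[1],), (r[2],), LOAD_LAT)]
    if op == "MLA":   # ("MLA", rd, rn, rm, ra): rd = rn * rm + ra
        # Latency 0 models the assumption that the product is forwarded
        # to the paired 'add' within the same cycle.
        return [MicroOp("mult", ("tmp",), (r[1], r[2]), 0),
                MicroOp("add", (r[0],), ("tmp", r[3]), 1)]
    raise ValueError(f"unsupported instruction: {op}")


def schedule(program, width=2):
    """Issue micro-ops in program order, at most `width` per cycle.

    Returns a list of (cycle, micro-op name) pairs, cycles starting at 1.
    """
    ready = {}            # register -> first cycle its value can be used
    issued = []
    cycle, slots = 1, 0
    for instr in program:
        for u in split(instr):
            if slots == width:          # both issue slots already taken
                cycle, slots = cycle + 1, 0
            earliest = max([ready.get(s, 0) for s in u.src], default=0)
            if earliest > cycle:        # stall until the operands are ready
                cycle, slots = earliest, 0
            issued.append((cycle, u.name))
            slots += 1
            for d in u.dst:
                ready[d] = cycle + u.latency
    return issued


if __name__ == "__main__":
    # Last test case from the table: the MLA does not use the loaded values.
    program = [("LDR", "r2", "r0"), ("LDR", "r3", "r0"),
               ("MLA", "r1", "r0", "r0", "r0")]
    for cycle, name in schedule(program):
        print(cycle, name)
```

    With no register dependency, this model issues the two 'load's in cycle 1 and the 'mult' and 'add' in cycle 2, as described above. Changing the MLA operands to r2 or r3 makes the model stall, but the size of that stall depends entirely on the assumed latencies.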

    Best regards,
    Yasuhiko Koumoto.
