Cortex-M7 load instruction latency and pairing

Hello,

What is the latency for the LDR instruction when the result is used for integer arithmetic operations (for example DSP MAC instructions)?

Also, can 64-bit loads (LDRD) be paired with another instruction? For example, can a 64-bit load and an integer MAC be issued in the same cycle?

I hope ARM will add the latencies and more detailed pairing information to the reference manual soon.

Antti

  • Thanks for the help, Yasuhiko!

    yasuhikokoumoto wrote:



    The results were that if there was no register dependency, LDR and MLA were executed concurrently, and if there was a register dependency, some wait cycles occurred. Even in that case, if the LDR result was the addend of the MLA, the wait cycles were hidden.

    This is good to know. It looks like the result of the load is forwarded after the second load pipeline stage to the MAC unit. This explains why the LDR and MLA can pair if the LDR result is used for the accumulation part.

    LDRD r2,r3,[r0]  2.900 cycles
    MLA  r1,r2,r2,r0
    ----------------
    LDRD r2,r3,[r0]  2.900 cycles
    MLA  r1,r3,r3,r0
    ----------------
    LDRD r2,r3,[r0]  1.900 cycles
    MLA  r1,r0,r0,r0

    The results were that LDRD and MLA could not be executed concurrently, and the operand order strongly affected their latencies.

    This is strange. There seems to be an unexplained one cycle extra latency if the LDRD result is used in the MLA. Could you test the same case using two separate LDR instructions instead of a single LDRD?

    I hope ARM will add the latencies and more detailed pairing information to the reference manual soon.

    I think it would be too difficult because there would be many variations of each instruction according to the conditions.

    I believe 90% of use cases could be explained with relatively simple clarifications about how the load and ALU units work and which common cases cause extra stalls.

    Antti

  • Hello Antti,

    This is strange. There seems to be an unexplained one cycle extra latency if the LDRD result is used in the MLA. Could you test the same case using two separate LDR instructions instead of a single LDRD?

    I tried and the results are shown below.

    LDR r2,[r0]    0.999
    LDR r3,[r0,#4]
    ----------------
    LDR r2,[r0]    1.499
    LDR r3,[r0,#4]
    MLA r1,r0,r0,r2
    ----------------
    LDR r2,[r0]    1.900
    LDR r3,[r0,#4]
    MLA r1,r0,r0,r3
    ----------------
    LDR r2,[r0]    2.000
    LDR r3,[r0,#4]
    MLA r1,r2,r2,r0
    ----------------
    LDR r2,[r0]    2.900
    LDR r3,[r0,#4]
    MLA r1,r3,r3,r0
    ----------------
    LDR r2,[r0]    1.499
    LDR r3,[r0,#4]
    MLA r1,r0,r0,r0
    

    From the results, LDRD and two LDRs appear to behave identically, but with an issue rate of two instructions per cycle, the pair of loads prevents the 3rd instruction from issuing in the same cycle.
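
    The cost of each dependency is easier to see as a difference from the dependency-free sequence. The following Python sketch only re-tabulates the two-LDR figures listed above; the dictionary layout, the `extra_cycles` helper and the rounding to half cycles are illustrative choices, not part of the measurements.

```python
# Cycle counts reported above for "LDR r2,[r0] / LDR r3,[r0,#4]" followed by
# "MLA r1,<operands>". Keys are the MLA operands Rn,Rm,Ra (Ra is the addend).
measured = {
    "r0,r0,r2": 1.499,  # first load feeds the addend
    "r0,r0,r3": 1.900,  # second load feeds the addend
    "r2,r2,r0": 2.000,  # first load feeds the multiplicands
    "r3,r3,r0": 2.900,  # second load feeds the multiplicands
    "r0,r0,r0": 1.499,  # MLA does not depend on either load
}

# Reference point: the MLA that is independent of both loads.
baseline = measured["r0,r0,r0"]


def extra_cycles(operands):
    """Cycles beyond the dependency-free case, rounded to the nearest half cycle."""
    return round((measured[operands] - baseline) * 2) / 2


for ops in ("r0,r0,r2", "r0,r0,r3", "r2,r2,r0", "r3,r3,r0"):
    print(f"MLA r1,{ops}: +{extra_cycles(ops):.1f} cycles")
```

    Read this way, using the first loaded register as the addend costs nothing extra, while using the second loaded register as a multiplicand is the most expensive case at about 1.5 extra cycles.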

    I think that some instructions are split into several micro-operations, as in Intel processors.

    For example, LDRD would be converted into two 'load's, and MLA would be converted into a 'mult' and an 'add'. Then,

    in the 1st cycle, a 'load' and a 'load' would be issued, and

    in the 2nd cycle, a 'mult' and an 'add' would be issued if there is no register dependency.

    I think that if we drew a pipeline chart at the micro-operation level, the measured cycle counts could be explained.
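
    As a rough illustration of such a chart, here is a toy in-order, dual-issue micro-operation model in Python. Everything in it is an assumption made for the sketch: the split of LDRD and MLA into micro-operations, the `read_stage` and `exec_cycles` numbers, and the names `schedule` and `ldrd_mla`. It is not documented Cortex-M7 behaviour and does not try to reproduce the measured figures; it only shows how an addend that is read one stage later than the multiplicands could hide the load latency.

```python
# Toy in-order, dual-issue micro-op model. All stage and latency numbers are
# illustrative assumptions, not documented Cortex-M7 behaviour.

def schedule(uops, width=2):
    """Return [(name, issue_cycle)] for micro-ops issued in program order.

    Each micro-op is (name, dst, srcs, read_stage, exec_cycles): it reads its
    sources `read_stage` cycles after issue, and its result can be consumed
    `exec_cycles` cycles after that.
    """
    ready = {}           # register -> first cycle its value can be read
    cycle, used = 0, 0   # current issue cycle, issue slots already used in it
    issued = []
    for name, dst, srcs, read_stage, exec_cycles in uops:
        if used == width:                      # issue slots exhausted
            cycle, used = cycle + 1, 0
        # Earliest issue cycle at which every source arrives in time.
        need = max((ready.get(r, 0) - read_stage for r in srcs), default=0)
        if need > cycle:                       # in-order stall
            cycle, used = need, 0
        issued.append((name, cycle))
        used += 1
        ready[dst] = cycle + read_stage + exec_cycles
    return issued


def ldrd_mla(rn, rm, ra):
    """Hypothetical micro-ops for: LDRD r2,r3,[r0] ; MLA r1,rn,rm,ra."""
    return [
        ("load r2", "r2", ("r0",), 0, 2),
        ("load r3", "r3", ("r0",), 0, 2),
        ("mult", "tmp", (rn, rm), 0, 1),       # multiplicands read at issue
        ("add", "r1", ("tmp", ra), 1, 1),      # addend read one stage later
    ]


for operands in (("r0", "r0", "r0"), ("r0", "r0", "r2"), ("r2", "r2", "r0")):
    print(operands, schedule(ldrd_mla(*operands)))
```

    In this toy model the two loads issue in cycle 0 and the 'mult' and 'add' issue together in cycle 1, both without a dependency and when the loaded r2 is the addend. When r2 is a multiplicand, the 'mult' and 'add' slip to cycle 2.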

    Best regards,
    Yasuhiko Koumoto.