ASIMD multiply-accumulate instruction

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines |
|---|---|---|---|---|
| ASIMD FP multiply accumulate, Q-form | VMLA, VMLS, VFMA, | 9(4) | 1 | F0/F1 |

ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar μops, allowing a typical sequence of floating-point multiply-accumulate μops to issue one every four cycles (accumulate latency shown in parentheses).

(1) In the description above, what is the meaning of "late-forwarding"?

(2) What is the meaning of "allowing a typical sequence of floating-point multiply-accumulate μops to issue one every four cycles"?

  • I believe that to issue an FMA operation you don't need all three input operands to be ready. It can start as soon as the multiply operands are ready and pick up the accumulate operand later, once the multiply part has finished (late-forwarding).

    Consider a chain of FMA operations in which each FMA uses the previous FMA's result as its accumulate operand.

    An FMA takes 9 cycles, so without late-forwarding the next FMA in the chain would have to wait 9 cycles before it could issue. With late-forwarding it can issue 4 cycles after the previous one, because by the time it needs the accumulate operand, the previous FMA has finished.
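To put numbers on the chain described above, here is a rough timing sketch in Python. It is only a back-of-the-envelope model, not a simulation of the real pipeline: it assumes every FMA in the chain depends on the previous one only through its accumulate operand, that the multiply operands are always ready, and that nothing else competes for the F0/F1 pipes. The function names `issue_cycles` and `chain_cycles` are made up for illustration; the 9 and 4 come from the "9(4)" latency entry in the table.

```python
def issue_cycles(n, step):
    """Cycle at which each of n back-to-back dependent FMAs can issue,
    assuming one FMA can issue every `step` cycles."""
    return [i * step for i in range(n)]


def chain_cycles(n, full_latency=9, acc_latency=4, late_forwarding=True):
    """Cycles until the last FMA of an n-long accumulate chain finishes.

    Assumed model: with late-forwarding the next FMA may issue
    `acc_latency` cycles after the previous one; without it, the next
    FMA waits for the full result (`full_latency` cycles).
    """
    if n < 1:
        raise ValueError("chain must contain at least one FMA")
    step = acc_latency if late_forwarding else full_latency
    last_issue = issue_cycles(n, step)[-1]
    # The last FMA still needs the full latency to produce its result.
    return last_issue + full_latency


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(n, chain_cycles(n, late_forwarding=False), chain_cycles(n))
```

Under these assumptions a chain of 8 dependent FMAs takes 7 × 4 + 9 = 37 cycles with late-forwarding, against 7 × 9 + 9 = 72 cycles if each FMA had to wait for the full 9-cycle result.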