Exec Latency
Execution Throughput
Utilized Pipelines
ASIMD FP multiply accumulate, Q-form
VMLA,VMLS,VFMA,
9(4)
ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar μops, allowing a typical sequence of floating-point multiply-accumulate μops to issue one every four cycles
(accumulate latency shown in parentheses).
(1) In the description above, what does "late-forwarding" mean?
(2) What does "allowing a typical sequence of floating-point multiply-accumulate μops to issue one every four cycles" mean?
I believe that to issue an FMA operation, not all three input operands need to be ready. It can start as soon as the multiply operands are ready, and pick up the accumulate operand later, once the multiply part has finished (that is the late-forwarding).
Consider a chain of FMA operations in which each FMA uses the previous FMA's result as its accumulate operand.
The full FMA latency is 9 cycles, so without late-forwarding the next FMA in the chain would have to wait 9 cycles before it could issue. With late-forwarding, it can issue just 4 cycles after the previous one: by the time it needs the accumulate operand, the previous FMA has finished.
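To make the arithmetic concrete, here is a rough sketch of the timing for a chain of dependent FMAs. It is only a toy model, not a description of the real pipeline: it assumes the multiply operands are always ready, that nothing else stalls the pipe, and that the 9 and 4 are the "9(4)" from the table. The function names `issue_cycles` and `chain_cycles` are made up for illustration.

```python
FULL_LATENCY = 9  # issue-to-result latency, the "9" in "9(4)"
ACC_LATENCY = 4   # accumulate-to-accumulate latency, the "(4)" in "9(4)"


def issue_cycles(n_fma, late_forwarding=True):
    """Cycle at which each FMA of a dependent chain can issue.

    Each FMA uses the previous FMA's result as its accumulate operand.
    Assumes multiply operands are always ready and no other stalls occur.
    """
    step = ACC_LATENCY if late_forwarding else FULL_LATENCY
    return [i * step for i in range(n_fma)]


def chain_cycles(n_fma, late_forwarding=True):
    """Cycle at which the last FMA of the chain produces its result."""
    if n_fma <= 0:
        return 0
    # The last FMA still takes the full latency after it issues.
    return issue_cycles(n_fma, late_forwarding)[-1] + FULL_LATENCY


if __name__ == "__main__":
    for forwarding in (True, False):
        print("late forwarding:", forwarding,
              "issue at", issue_cycles(4, forwarding),
              "done at", chain_cycles(4, forwarding))
```

Under these assumptions, a chain of 4 dependent FMAs finishes at cycle 3 * 4 + 9 = 21 with late-forwarding, versus 3 * 9 + 9 = 36 if each FMA had to wait for the full 9-cycle latency of the previous one.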