Hello all,
I have a question about which pipeline an adds with a shifted operand uses on Neoverse N1, specifically 'adds x3, x4, x5, lsl #32'.
According to the Neoverse N1 Software Optimization Guide (https://developer.arm.com/documentation/pjdoc466751330-9707/latest/), this instruction is supposed to use the M pipeline, but my experiments on Graviton 2 (which is Neoverse N1) suggest that it does not.
To check this, I wrote code that saturates the M pipeline:
.rep 1000
    mul x7, x8, x9
.endr
I observe an average clock count of 3.117, which seems consistent with the guide's description of MADD (mul is an alias of madd), so that looks good.
If I add 'adds x3, x4, x5, lsl #32' as shown here, the cycle count should increase, because the M pipeline is already saturated:
.rep 1000
    mul x7, x8, x9
    adds x3, x4, x5, lsl #32
.endr
However, the observed clock count is still 3.13, which implies that `mul` and `adds` can run in parallel.
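To make the reasoning above explicit, here is a minimal Python sketch of the occupancy argument. It is only a rough model, not a description of the real scheduler: each pipeline is treated as a single unit, and the busiest pipeline sets the cycle count per iteration. The names `cycles_per_iteration`, `MUL_BUSY` and `ADDS_BUSY` are hypothetical, and the two occupancy values are assumptions (the first is rounded from the 3.117 measured above, the second is a guess), not figures taken from the optimization guide.

```python
def cycles_per_iteration(uops):
    """Lower bound on cycles per iteration from pipeline occupancy alone.

    uops: list of (pipeline_name, busy_cycles) pairs, one per instruction.
    Each pipeline is modeled as a single unit; the busiest one sets the bound.
    """
    busy = {}
    for pipe, cycles in uops:
        busy[pipe] = busy.get(pipe, 0) + cycles
    return max(busy.values())


MUL_BUSY = 3   # assumption: rounded from the ~3.117 observed for mul alone
ADDS_BUSY = 1  # assumption: adds occupies its pipeline for one cycle

mul_only = cycles_per_iteration([("M", MUL_BUSY)])
adds_on_m = cycles_per_iteration([("M", MUL_BUSY), ("M", ADDS_BUSY)])
adds_elsewhere = cycles_per_iteration([("M", MUL_BUSY), ("I", ADDS_BUSY)])

# Ratio of the two measurements quoted in the post.
observed_ratio = 3.13 / 3.117

print("predicted ratio if adds shares M:", adds_on_m / mul_only)
print("predicted ratio if adds runs elsewhere:", adds_elsewhere / mul_only)
print("observed ratio:", round(observed_ratio, 4))
```

Under these assumptions, sharing the M pipeline would raise the count by roughly a third, while the measured ratio stays within about half a percent of 1, which is what the model predicts when `adds` issues to a different pipeline.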
We suspect that the adds instruction is actually using the 'I' pipeline. The following experiment indirectly shows this:
.rep 1000
    add x1, x0, x0; add x1, x0, x0; add x1, x0, x0 // Three adds: saturates I pipeline
.endr

.rep 1000
    add x1, x0, x0; add x1, x0, x0; add x1, x0, x0
    adds x2, x3, x4, lsl #32
.endr