Hello all,
I have a question about which pipeline Neoverse N1 uses for adds with a shifted register operand, specifically 'adds x3, x4, x5, lsl #32'.
According to the Neoverse N1 Software Optimization Guide (https://developer.arm.com/documentation/pjdoc466751330-9707/latest/), the instruction is supposed to use the M pipeline, but some experiments on Graviton 2 (which is Neoverse N1) suggest that it does not.
To check this, I wrote code that saturates the M pipeline, like this:
.rep 1000
    mul x7, x8, x9
.endr
I observe an average clock count of 3.117, which is consistent with the guide's description of MADD (mul is an alias of madd).
If I add 'adds x3, x4, x5, lsl #32' like this, the cycle count should increase, because the M pipeline is already saturated:
.rep 1000
    mul x7, x8, x9
    adds x3, x4, x5, lsl #32
.endr
However, the observed clock count is still 3.13, which implies that `mul` and `adds` can run in parallel.
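To make the reasoning behind this comparison explicit, here is a minimal throughput-bound sketch in Python. It is only a model under stated assumptions: the pipeline counts in `PIPES`, the one-slot-per-instruction costs, and the names `cycles_per_iteration`, `adds_on_m` and `adds_elsewhere` are all hypothetical and are not taken from the optimization guide. Dependencies and front-end limits are ignored.

```python
from fractions import Fraction

# Hypothetical pipeline counts, chosen only for illustration: one multiply-capable
# "M" pipeline and two other integer pipelines. Not taken from the optimization guide.
PIPES = {"M": 1, "other": 2}


def cycles_per_iteration(loop_body, pipes=PIPES):
    """Throughput-bound estimate: the most heavily loaded pipeline class sets the pace.

    loop_body is a list of (pipeline_class, issue_slots) pairs, one per instruction.
    Dependencies and front-end limits are ignored (assumption).
    """
    load = {}
    for pipe_class, slots in loop_body:
        load[pipe_class] = load.get(pipe_class, 0) + slots
    return max(Fraction(total, pipes[name]) for name, total in load.items())


# Assumption: each instruction occupies its pipeline for one issue slot.
mul_only = [("M", 1)]
adds_on_m = [("M", 1), ("M", 1)]           # hypothesis A: adds-with-shift uses M
adds_elsewhere = [("M", 1), ("other", 1)]  # hypothesis B: adds-with-shift avoids M

base = cycles_per_iteration(mul_only)
ratio_a = cycles_per_iteration(adds_on_m) / base
ratio_b = cycles_per_iteration(adds_elsewhere) / base
print(f"slowdown if adds uses M: {float(ratio_a):.2f}x")
print(f"slowdown if adds avoids M: {float(ratio_b):.2f}x")
```

Under these assumptions the model predicts a clear slowdown if both instructions compete for M and none if they do not. The measured ratio, 3.13 / 3.117 (about 1.004), is much closer to the second case.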
We suspect that the adds instruction actually uses the 'I' pipeline. The following experiment shows this indirectly:
.rep 1000
    add x1, x0, x0; add x1, x0, x0; add x1, x0, x0 // Three adds: saturates I pipeline
.endr

.rep 1000
    add x1, x0, x0; add x1, x0, x0; add x1, x0, x0
    adds x2, x3, x4, lsl #32
.endr
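All of these experiments follow the same pattern: saturate one pipeline, add the instruction under test, and compare the clock counts. A minimal Python sketch of that comparison is below. The function names and the 5% `tolerance` are hypothetical choices for illustration, not part of any measurement tool, and a real threshold would depend on the noise of the measurement.

```python
def relative_slowdown(baseline, with_extra):
    """Fractional increase in clock count after adding the instruction under test."""
    if baseline <= 0:
        raise ValueError("baseline clock count must be positive")
    return (with_extra - baseline) / baseline


def contends_for_saturated_pipeline(baseline, with_extra, tolerance=0.05):
    """True if the extra instruction slowed the loop by more than `tolerance`.

    A slowdown suggests the instruction shares the saturated pipeline;
    no slowdown suggests it issues somewhere else. The 5% default is an
    arbitrary assumption.
    """
    return relative_slowdown(baseline, with_extra) > tolerance


# Clock counts from the mul-only and mul+adds runs reported above.
print(contends_for_saturated_pipeline(3.117, 3.13))
```

With the mul numbers above (3.117 versus 3.13) this reports no contention, which matches the suspicion that the adds is not issued to the M pipeline.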