The pipeline of add with lsl >4 in Neoverse N1

Hello all,

I have a question about which pipeline adds with a shifted operand uses on Neoverse N1, specifically 'adds x3, x4, x5, lsl #32'.

According to the Neoverse N1 Software Optimization Guide (https://developer.arm.com/documentation/pjdoc466751330-9707/latest/), the instruction is supposed to use the M pipeline.

[Image: adds-lsl-32-is-M]

However, my experiments on Graviton 2 (which is based on Neoverse N1) suggest that it does not.

To check this, I wrote code that saturates the M pipeline:

.rep 1000
    mul x7, x8, x9
.endr

I observe an average of 3.117 cycles per repetition, which is consistent with the description of MADD in the guide.

If I add ‘adds x3, x4, x5, lsl #32’ as shown here, the cycle count should increase if it really uses the M pipeline, because that pipeline is already saturated:

.rep 1000
    mul x7, x8, x9
    adds x3, x4, x5, lsl #32
.endr

However, the observed cycle count is still 3.13 per repetition, which implies that `mul` and `adds` can run in parallel.
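
The expectation behind this experiment can be sketched as a simple resource-bound estimate. This is a minimal model, not a description of the real scheduler: it assumes each instruction holds one pipeline for a fixed number of cycles and that a repetition can finish no faster than its busiest pipeline allows. The 3-cycle occupancy for `mul` is taken from the measurement above, the 1-cycle occupancy for `adds` is an assumption, and the names `bound`, `MUL_CYCLES` and `ADDS_CYCLES` are hypothetical.

```python
# Back-of-envelope throughput bound: cycles per repetition is at least the
# total occupancy of the busiest pipeline. All occupancy numbers are assumptions.

def bound(occupancy):
    """occupancy maps a pipeline name to the cycles each instruction holds it."""
    return max(sum(cycles) for cycles in occupancy.values())

MUL_CYCLES = 3   # assumed, from the ~3.117 cycles measured for mul alone
ADDS_CYCLES = 1  # assumed occupancy of adds with lsl #32

# Hypothesis A: adds shares the M pipeline with mul.
adds_on_m = bound({"M": [MUL_CYCLES, ADDS_CYCLES]})

# Hypothesis B: adds issues on some other pipeline.
adds_elsewhere = bound({"M": [MUL_CYCLES], "other": [ADDS_CYCLES]})

measured = 3.13  # cycles per repetition reported for mul + adds
closest = min(("A", adds_on_m), ("B", adds_elsewhere),
              key=lambda h: abs(h[1] - measured))

print(f"A (adds on M):      {adds_on_m} cycles per repetition")
print(f"B (adds elsewhere): {adds_elsewhere} cycles per repetition")
print(f"measured {measured} is closest to hypothesis {closest[0]}")
```

Under these assumptions the measured value sits much closer to the second case, in which `adds` does not compete with `mul` for the M pipeline.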

 

We suspect that the adds instruction actually uses the ‘I’ pipeline. The following experiment supports this indirectly:

.rep 1000
    add x1, x0, x0; add x1, x0, x0; add x1, x0, x0 // Three adds: saturates I pipeline
.endr

  • Average cycle count is 1.14

.rep 1000
    add x1, x0, x0; add x1, x0, x0; add x1, x0, x0
    adds x2, x3, x4, lsl #32
.endr

  • Average cycle count is 1.51, an increase!
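
As a rough check on these two numbers, the sketch below compares the measured slowdown with what one would expect if the fourth instruction competes for the same issue slots as the three `add`s. It assumes that three pipelines accept these instructions (as the "Three adds" comment implies) and that each instruction occupies a pipeline for one cycle; neither assumption is confirmed by the measurements alone, and `PIPES` and `ideal_cycles` are hypothetical names.

```python
from fractions import Fraction

PIPES = 3  # assumed number of pipelines that accept a plain add

def ideal_cycles(n_instructions, pipes=PIPES):
    """Cycles per repetition if instructions spread evenly, one cycle each."""
    return Fraction(n_instructions, pipes)

# Four instructions sharing three pipelines versus three instructions.
expected_ratio = ideal_cycles(4) / ideal_cycles(3)

# Measured cycle counts quoted in the post.
measured_ratio = 1.51 / 1.14

print(f"expected slowdown: {float(expected_ratio):.3f}")
print(f"measured slowdown: {measured_ratio:.3f}")
```

The measured slowdown comes out close to the 4/3 ratio of this idealized model, which is consistent with the `adds` competing for the same pipelines as the plain `add`s.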