Why might loop unrolling lower runtime when using two cores (X1 or A76) but not with one core?

I am benchmarking an FMA application running on one or two Cortex-X1 cores or one or two Cortex-A76 cores (inside a Pixel 6 phone). Loop unrolling improves performance by ~10%, but only when I use two cores and not just one.

Consider the following code:
```
    for (size_t x = 0; x < 4; x++) {
        size_t row = start_row + x * 8;
        for (size_t y = 0; y < 4; y++) {
            size_t col = start_col + y * 8;
            fma_f32_8x8(bptr + col, aptr + K * row, M, N, K,
                        cptr + col + N * row);
        }
    }
```

When I unroll the inner loop, the entire application runs about 10% faster, but only when it runs on two cores. When I test it on a single core (either X1 or A76), the runtime barely changes. Why might that be the case?

At first, I suspected frontend stalls, but I could not see a relationship between the frontend stalls reported by simpleperf (Android's perf wrapper) and whether or not I unrolled the loop. Branch misses decrease with loop unrolling, but they already decrease when I use one core, so they do not seem to explain the difference. Do the two cores share a resource that might become saturated when I use two cores but don't unroll the loop?
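To make "unrolling the inner loop" concrete, here is a minimal sketch of the two loop forms side by side. It is not the real code: the scalar body of `fma_f32_8x8` is a hypothetical stand-in for the actual kernel (assumed here to compute an 8x8 tile of C += A * B with row-major A, B, and C), and the wrapper names `tile_rolled` and `tile_unrolled` are placeholders introduced only for this illustration.

```c
#include <stddef.h>

/* Hypothetical scalar stand-in for the real kernel (assumption):
   updates one 8x8 tile of C with A (8 x K, row stride K) times
   B (K x 8, row stride N). M is unused in this sketch. */
static void fma_f32_8x8(const float *b, const float *a,
                        size_t M, size_t N, size_t K, float *c) {
    (void)M;
    for (size_t i = 0; i < 8; i++) {
        for (size_t j = 0; j < 8; j++) {
            float acc = c[i * N + j];
            for (size_t k = 0; k < K; k++)
                acc += a[i * K + k] * b[k * N + j];
            c[i * N + j] = acc;
        }
    }
}

/* Original form: both loops rolled, 4 x 4 tiles of 8 x 8. */
static void tile_rolled(const float *aptr, const float *bptr, float *cptr,
                        size_t M, size_t N, size_t K,
                        size_t start_row, size_t start_col) {
    for (size_t x = 0; x < 4; x++) {
        size_t row = start_row + x * 8;
        for (size_t y = 0; y < 4; y++) {
            size_t col = start_col + y * 8;
            fma_f32_8x8(bptr + col, aptr + K * row, M, N, K,
                        cptr + col + N * row);
        }
    }
}

/* Inner loop over y unrolled by hand into four kernel calls. */
static void tile_unrolled(const float *aptr, const float *bptr, float *cptr,
                          size_t M, size_t N, size_t K,
                          size_t start_row, size_t start_col) {
    for (size_t x = 0; x < 4; x++) {
        size_t row = start_row + x * 8;
        const float *a = aptr + K * row;
        const float *b = bptr + start_col;
        float *c = cptr + start_col + N * row;
        fma_f32_8x8(b + 0,  a, M, N, K, c + 0);
        fma_f32_8x8(b + 8,  a, M, N, K, c + 8);
        fma_f32_8x8(b + 16, a, M, N, K, c + 16);
        fma_f32_8x8(b + 24, a, M, N, K, c + 24);
    }
}
```

Both forms issue the same sixteen kernel calls in the same order; only the loop control around the calls differs.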


Does anybody know why loop unrolling in the FMA application might be particularly beneficial when using two cores and how I could verify this?