I am benchmarking an FMA application running on one or two Cortex-X1 cores or one or two Cortex-A76 cores (inside a Pixel 6 phone). Loop unrolling improves performance by ~10%, but only when I use two cores and not just one. Consider the following code:

```
for (size_t x = 0; x < 4; x++) {
    size_t row = start_row + x * 8;
    for (size_t y = 0; y < 4; y++) {
        size_t col = start_col + y * 8;
        fma_f32_8x8(bptr + col, aptr + K * row, M, N, K, cptr + col + N * row);
    }
}
```

When I unroll the inner loop (sketched at the end of this question), the entire application runs about 10% faster, but only when the application runs on two cores. When I test the application on a single core (either X1 or A76), the runtime barely changes. Why might that be the case?

At first, I suspected frontend stalls, but I could not see a relationship between frontend stalls in simpleperf (Android's perf wrapper) and whether or not I unrolled the loop. Branch misses improve with loop unrolling, but they improve on a single core as well, so that does not explain the two-core effect. Do the two cores share a resource that might become saturated when I use two cores but don't unroll the loop?

Does anybody know why loop unrolling in the FMA application might be particularly beneficial when using two cores, and how I could verify this?
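For reference, this is roughly what the unrolled version looks like, with the four inner-loop iterations expanded by hand. It is a sketch using the same variables as the snippet above; the actual unrolled code in the application may differ in detail:

```
/* Sketch: inner loop over y fully unrolled by hand (assumed form).
 * col takes the values start_col, start_col + 8, + 16, + 24,
 * matching y = 0..3 in the original loop. */
for (size_t x = 0; x < 4; x++) {
    size_t row = start_row + x * 8;
    size_t col = start_col;
    fma_f32_8x8(bptr + col,      aptr + K * row, M, N, K, cptr + col      + N * row);
    fma_f32_8x8(bptr + col + 8,  aptr + K * row, M, N, K, cptr + col + 8  + N * row);
    fma_f32_8x8(bptr + col + 16, aptr + K * row, M, N, K, cptr + col + 16 + N * row);
    fma_f32_8x8(bptr + col + 24, aptr + K * row, M, N, K, cptr + col + 24 + N * row);
}
```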
I did more benchmarking and realized that the effect I saw was a fluke and not statistically significant. This question could be deleted.
Thank you for coming back and sharing an update to your question.