I am benchmarking an FMA application running on one or two Cortex-X1 cores or one or two Cortex-A76 cores (inside a Pixel 6 phone). Loop unrolling improves performance by ~10%, but only when I use two cores and not just one. Consider the following code:

```
for (size_t x = 0; x < 4; x++) {
    size_t row = start_row + x * 8;
    for (size_t y = 0; y < 4; y++) {
        size_t col = start_col + y * 8;
        fma_f32_8x8(bptr + col, aptr + K * row, M, N, K, cptr + col + N * row);
    }
}
```

When I unroll the inner loop, the entire application runs about 10% faster, but only when it runs on two cores. When I test it on a single core (either X1 or A76), the runtime barely changes. Why might that be the case?

At first, I suspected frontend stalls, but I could not see a relationship between the frontend stalls reported by simpleperf (Android's perf wrapper) and whether or not I unrolled the loop. Branch misses do improve with loop unrolling, but they already improve when I use only one core, so they do not explain the two-core difference.

Do the two cores share a resource that might become saturated when I use two cores but don't unroll the loop? Does anybody know why loop unrolling in this FMA application might be particularly beneficial when using two cores, and how I could verify this?
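For readers who want to reproduce the comparison, here is a minimal sketch of what unrolling the inner loop by hand could look like. It is not the actual benchmark code: the body of `fma_f32_8x8` is a hypothetical scalar stand-in (the real kernel and its exact signature are not shown above), the row-major layout of A (M x K), B (K x N) and C (M x N) is an assumption inferred from the pointer arithmetic, and the names `tile_rolled`, `tile_unrolled` and `unrolled_matches_rolled` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real 8x8 kernel: C_tile += A_tile * B_tile.
   Assumes row-major A (M x K), B (K x N), C (M x N); M is unused here. */
static void fma_f32_8x8(const float *b, const float *a,
                        size_t M, size_t N, size_t K, float *c) {
    (void)M;
    for (size_t i = 0; i < 8; i++)
        for (size_t k = 0; k < K; k++)
            for (size_t j = 0; j < 8; j++)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
}

/* The loop nest from the question, unchanged. */
static void tile_rolled(const float *aptr, const float *bptr, float *cptr,
                        size_t M, size_t N, size_t K,
                        size_t start_row, size_t start_col) {
    for (size_t x = 0; x < 4; x++) {
        size_t row = start_row + x * 8;
        for (size_t y = 0; y < 4; y++) {
            size_t col = start_col + y * 8;
            fma_f32_8x8(bptr + col, aptr + K * row, M, N, K,
                        cptr + col + N * row);
        }
    }
}

/* Same work with the inner y-loop written out as four explicit calls. */
static void tile_unrolled(const float *aptr, const float *bptr, float *cptr,
                          size_t M, size_t N, size_t K,
                          size_t start_row, size_t start_col) {
    for (size_t x = 0; x < 4; x++) {
        size_t row = start_row + x * 8;
        const float *a = aptr + K * row;
        const float *b = bptr + start_col;
        float *c = cptr + start_col + N * row;
        fma_f32_8x8(b +  0, a, M, N, K, c +  0);
        fma_f32_8x8(b +  8, a, M, N, K, c +  8);
        fma_f32_8x8(b + 16, a, M, N, K, c + 16);
        fma_f32_8x8(b + 24, a, M, N, K, c + 24);
    }
}

/* Runs both variants on the same small input; returns 1 if C is identical. */
static int unrolled_matches_rolled(void) {
    enum { M = 32, N = 32, K = 4 };
    static float a[M * K], b[K * N], c1[M * N], c2[M * N];
    assert(M % 32 == 0 && N % 32 == 0); /* one 32x32 tile must fit */
    for (size_t i = 0; i < (size_t)(M * K); i++) a[i] = (float)(i % 7);
    for (size_t i = 0; i < (size_t)(K * N); i++) b[i] = (float)(i % 5);
    for (size_t i = 0; i < (size_t)(M * N); i++) c1[i] = c2[i] = 0.0f;
    tile_rolled(a, b, c1, M, N, K, 0, 0);
    tile_unrolled(a, b, c2, M, N, K, 0, 0);
    for (size_t i = 0; i < (size_t)(M * N); i++)
        if (c1[i] != c2[i]) return 0;
    return 1;
}
```

Calling `unrolled_matches_rolled()` only confirms that the two variants compute the same result. It says nothing about performance, and an optimizing compiler may unroll the rolled version on its own, so comparing the generated assembly of both variants is probably worthwhile before timing them.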