In Part 2, after collecting the performance statistics from the PMU, the relatively high L2 data cache miss rate deserves more analysis. This part describes how to do hot-spot analysis, and how to apply and validate an optimization, using Perf.
To do hot-spot analysis, you must collect more detailed information. You can use the command perf record to sample PMU events into a perf.data file, and then use the command perf report to do the hot-spot analysis on the recorded samples.
The basic usage of perf record is as follows:
# Specify PMU event and sampling period
$ perf record -e <event> -c <count> -- <command to run the application>

# Specify PMU event and average sampling rate
$ perf record -e <event> -F <freq> -- <command to run the application>

# By default, cycles is selected with average sampling rate set to 4000 samples/sec
$ perf record -- <command to run the application>
When using the command perf record, you can select the PMU events to sample with the -e option. You can also specify the sampling period with the -c option, or the average sampling rate with the -F option.
By default, perf record uses the event cycles with the sampling rate set to 4000 samples per second. For Armv8-A CPUs, the event cycles is mapped to the cycle counter.
The sampling period refers to the number of occurrences of the PMU event between two samples. For example, if you set -c 1000, a sample is recorded every 1000 occurrences of the selected PMU event.
The sampling rate refers to the average number of samples recorded per second. For example, if you set -F 1000, Perf records around 1000 samples per second. The kernel achieves this by dynamically adjusting the sampling period.
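The relationship between the two settings can be worked through with rough arithmetic. The sketch below only illustrates the idea; the event rate used is a made-up figure for this sketch, not a measurement:

#include <stdio.h>

/* Minimal sketch of the period/rate relationship described above.
 * The event rate below is a hypothetical figure for illustration only. */
int main(void)
{
    double events_per_sec = 4000000.0; /* hypothetical PMU event rate */
    long   period         = 1000;      /* value passed via -c */
    double target_rate    = 1000.0;    /* value passed via -F */

    /* samples/sec is roughly the event occurrences per second divided
     * by the sampling period */
    printf("approx. samples per second: %.0f\n", events_per_sec / period);

    /* conversely, for a target rate the kernel keeps adjusting the period
     * towards roughly: event rate / target rate */
    printf("approx. period chosen by the kernel: %.0f\n",
           events_per_sec / target_rate);
    return 0;
}

In other words, -c fixes the period and lets the sample rate vary with the workload, while -F fixes the rate and lets the kernel vary the period.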
When using perf record to sample PMU events for hot-spot analysis, consider the following factors:
For Armv8-A CPUs, sampling is triggered by a PMU overflow interrupt. This means that setting -c to a small number, or -F to a large number, introduces more interrupts while the application runs. Frequent interrupts can cause significant overhead.
Also, the interrupt-based recording method used in Perf introduces skid. This means that some of the sampled information, for example the instruction pointer, may not correspond to the instruction where the counter actually overflowed. Therefore, carefully choose a sampling rate or period that is appropriate for your application.
The performance characteristics given by perf stat show that not all the events need to be sampled with perf record. You can choose a few of them to record and do hot-spot analysis. In this example, you can sample only l2d_cache_refill.
An example of sampling PMU events and doing hot-spot analysis on the Juno r2 platform is as follows:
$ taskset -c 0-3 perf record -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.316 MB perf.data (8265 samples) ]
First, use the command perf record to sample the PMU event l2d_cache_refill. After the file perf.data is generated, use the command perf report for further hot-spot analysis. This command provides an interactive interface. A basic procedure for hot-spot analysis is as follows:
1. Locate the function with the most overhead
Use the command perf report as follows. By default, it reads the file named perf.data in the current directory.
$ perf report
In this example, we selected only one PMU event when running perf record, so perf report shows the recorded information directly, as follows. If multiple PMU events were recorded by perf record, perf report would show a pop-up menu for you to choose which PMU event to analyze.
Figure 1: Overhead of PMU event samples per function
The above figure shows that the functions compute_squares and array_assign take 49.95% and 49.83%, respectively, of the overall samples of the PMU event l2d_cache_refill. The [.] marker means that the samples were taken in user space.
2. Source analysis with Perf annotate
We can take a step further and analyze the source code. Keep in mind that the interrupt-based recording introduces skid, so Perf cannot pinpoint the exact instructions where the PMU events were counted; treat the annotated results as an approximation. In the perf report interface, select the function array_assign to open a menu of operations, as shown in the following figure.
Figure 2: Menu for choosing operations for PMU event samples
We can locate the source code of the function array_assign by selecting the option Annotate array_assign, as shown in the following figure.
Figure 3: Hot-spot analysis through selected PMU event samples
In the above figure, the left side of the vertical separator shows the percentage of PMU event samples attributed to the instruction on the right side. Based on the percentage data and an understanding of how the cache works, we can speculate that the high L2 data cache miss rate is mainly caused by traversing the 2D array by column.
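A rough calculation supports this speculation. The sketch below uses the array dimensions from the example application (a COL_LINE of 64 and 8-byte long elements) and assumes a 64-byte cache line, which is typical of the Cortex cores on the Juno r2 board:

#include <stdio.h>

/* Sketch of the access stride for the 2D array in this example.
 * The 64-byte cache line size is an assumption. */
#define COL_LINE 64   /* columns in the example application's array */

int main(void)
{
    long row_bytes = COL_LINE * sizeof(long); /* 64 * 8 = 512 bytes per row */
    long line_size = 64;                      /* assumed cache line size */

    /* Traversing by column jumps a whole row between consecutive accesses,
     * so every access touches a different cache line. */
    printf("column traversal stride: %ld bytes (%ld cache lines apart)\n",
           row_bytes, row_bytes / line_size);

    /* Traversing by row touches consecutive elements, so one fetched line
     * serves several accesses before the next line is needed. */
    printf("row traversal stride:    %zu bytes\n", sizeof(long));
    return 0;
}

With a 512-byte stride, every access in column order lands on a different cache line, whereas row order reuses each fetched 64-byte line for eight consecutive elements.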
Perf also provides the perf diff command to compare two recordings, which is useful for validating an optimization. The basic usage of perf diff is as follows:
# Specify perf.data path to compare
$ perf diff <path of old perf.data file> <path of new perf.data file>

# By default, compare the perf.data and perf.data.old in the current directory
$ perf diff
Based on the collected information, one possible optimization direction for the example application is as follows:
#define COL_LINE 64
#define ROW_LINE 512000

long array[ROW_LINE][COL_LINE];

void compute_squares()
{
    int i, j;
    /* Traverse by column */
    for (i = 0; i < COL_LINE; i++) {
        for (j = 0; j < ROW_LINE; j++) {
            array[j][i] = array[j][i] * array[j][i];
        }
    }
}

void array_assign()
{
    int i, j;
    /* [M] Traverse by row */
    for (i = 0; i < ROW_LINE; i++) {
        for (j = 0; j < COL_LINE; j++) {
            array[i][j] = i + j;
        }
    }
}

int main()
{
    array_assign();
    compute_squares();
    return 0;
}
We modify array_assign to traverse the 2D array by row instead of by column. This cache-friendly traversal is expected to reduce L2 data cache misses and improve the performance of the application.
Compile the modified example application and run perf stat again to get key performance metrics as follows:
$ gcc -O0 -g 2D_array.c -o 2D_array
$ taskset -c 0-3 perf stat -e armv8_pmuv3_0/cpu_cycles/,armv8_pmuv3_0/inst_retired/,armv8_pmuv3_0/l1d_cache_refill/,armv8_pmuv3_0/l1d_cache/,armv8_pmuv3_0/l2d_cache_refill/,armv8_pmuv3_0/l2d_cache/,armv8_pmuv3_0/mem_access/ -- 2D_array

 Performance counter stats for '2D_array':

     5,387,230,517      armv8_pmuv3_0/cpu_cycles/
     1,599,412,481      armv8_pmuv3_0/inst_retired/
        34,576,836      armv8_pmuv3_0/l1d_cache_refill/
       697,922,349      armv8_pmuv3_0/l1d_cache/
        37,191,464      armv8_pmuv3_0/l2d_cache_refill/
        85,987,007      armv8_pmuv3_0/l2d_cache/
       665,282,669      armv8_pmuv3_0/mem_access/

       7.699647601 seconds time elapsed

       7.575061000 seconds user
       0.124050000 seconds sys
After running perf stat, you can manually compare the PMU event values and related metrics with those collected before the optimization. The result is as follows:
The results in the above table show that the optimization causes only a small decrease in the L2 cache miss rate. However, you can observe a significant decrease in the L1 data cache miss rate. This leads to fewer L2 cache accesses and a faster execution time.
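As a cross-check for the manual comparison, the derived metrics for the optimized run can be recomputed from the raw counts in the perf stat output above. The sketch below assumes the miss rates are defined as cache refills divided by total cache accesses, the usual derivation from these PMU events:

#include <stdio.h>

/* Sketch: recompute the derived metrics for the optimized run from the
 * raw perf stat counts shown above. Miss rates are assumed to be
 * refills divided by total cache accesses. */
int main(void)
{
    double cpu_cycles       = 5387230517.0;
    double inst_retired     = 1599412481.0;
    double l1d_cache_refill = 34576836.0;
    double l1d_cache        = 697922349.0;
    double l2d_cache_refill = 37191464.0;
    double l2d_cache        = 85987007.0;

    printf("IPC:           %.2f\n", inst_retired / cpu_cycles);
    printf("L1D miss rate: %.2f%%\n", 100.0 * l1d_cache_refill / l1d_cache);
    printf("L2D miss rate: %.2f%%\n", 100.0 * l2d_cache_refill / l2d_cache);
    return 0;
}

The same arithmetic applied to the counts from Part 2 gives the pre-optimization values for the comparison.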
We may need to confirm whether the code modification is responsible for the improvement. Run perf record again with the same parameters, and then use perf diff to compare the effect of the optimization. A reduction in L2 data cache misses in the function array_assign is expected.
$ taskset -c 0-3 ./perf record -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.301 MB perf.data (7690 samples) ]

$ perf diff
# Event 'armv8_pmuv3_0/l2d_cache_refill/'
#
# Baseline  Delta Abs  Shared Object      Symbol
# ........  .........  .................  ......................
#
    49.83%    -38.83%  2D_array           [.] array_assign
    49.95%    +38.29%  2D_array           [.] compute_squares
     0.22%     +0.54%  [unknown]          [k] 0xffff800008326210
     0.01%     +0.01%  ld-2.31.so         [.] 0x000000000000bab0
     0.00%     +0.00%  [kernel.kallsyms]  [k] 0x00008000083292e0
The perf record run for the optimized application is executed in the same directory as the previous one. In this case, Perf renames the existing perf.data to perf.data.old and stores the latest samples in perf.data.
The event sampling results in perf.data.old are treated as the Baseline, and the Delta Abs column shows the change caused by the optimization. Consider the PMU event l2d_cache_refill as an example: before the optimization, 49.83% of the sampled L2 data cache refill events came from the function array_assign. After the optimization, that share drops by 38.83 percentage points, to about 11%. Now, we can be confident that the optimization direction is effective.