Poor cache utilization can have a significant negative impact on performance, and improving it typically involves little or no trade-off. Unfortunately, detecting poor cache utilization is often difficult and requires considerable developer time. In this guide I will demonstrate how to use Streamline to identify areas of inefficiency and drive cache optimization.
I have used the Juno ARM Development Platform for this guide; however, the counters I use (or their equivalents) should be available on all ARM Cortex-A class processors, so the steps should be easily repeatable. Even without a platform to test on, the methodology should provide an insight into using Streamline to guide optimization.
This guide assumes a basic level of knowledge of Streamline. Introductory information and getting started guides can be found in DS-5’s documentation, along with other tutorials.
Start by installing gator on the target. This is beyond the scope of this guide; see the readme in <DS-5 installation dir>/arm/gator/ for detailed information. Once installed, launch the gator daemon. I successfully used both the user-space and kernel-space versions of gator. The user-space version is sufficient in most cases; the kernel-space version is only required in some circumstances, a point I expand on later.
Compile the attached cache-test application. It is sufficiently simple that it could be compiled on the device (if a compiler were available) or cross-compiled otherwise.
Open up the Streamline Data view in DS-5. Configure the Streamline connection using the Capture & Analysis Options to use the gator version running on the target. The other default configuration options should be sufficient, although you may optionally add the application binary to the Program Images section at the bottom for function-level profile information, or, if the binary contains debug symbols, source-code-level profile information.
Adjust the Counter configuration to collect events from “Cache: Data access”, “Cache: Data refill”, “Cache: L2 data access” and “Cache: L2 data refill”.
In our case we are also collecting “Cache: Data TLB refill”, which will provide an additional measurement to analyze caching performance, as well as “Clock: Cycle” and “Instruction: Executed” which will provide an insight into how execution is progressing. We are also collecting from the energy measurement counters provided on the Juno development platform.
The counters listed above are specific to our particular platform – the Juno development board. This has a big.LITTLE arrangement of 2x Cortex-A57s and 4x Cortex-A53s; we will be running our program on one of the Cortex-A57 cores.
The ARM Performance Monitors extension is an optional, non-invasive debug component available on most Cortex-A-class cores. Streamline reads the Performance Monitor Unit (PMU) counters provided by this extension to generate its profiling information. Each of the processor counters observed within Streamline corresponds to a PMU event. Not all events described by the PMU architecture are implemented in each core; however, a core set of events must be implemented, including the “Cache: Data access” and “Cache: Data refill” events shown above (in PMUv2 and PMUv3). These two events should therefore be available on all Cortex-A-class cores which implement the architecture. For more detailed information on the Performance Monitors Extension see the relevant section of the ARM Architecture Reference Manual for ARMv7 (Chapter C12) or ARMv8 (Chapter D5) as appropriate.
The “Cache: L2 data access” and “Cache: L2 data refill” counters are also common (but not mandated) on cores with an integrated L2 cache controller. Some cores, however, have separate L2 cache controllers, for example the CoreLink Level 2 Cache Controller L2C-310. In this case the counters are limited to what the controller provides and what Streamline supports. For the L2C-310, equivalent counters are available and supported in Streamline, but they are only readable using kernel-space gator (user-space gator can still read all others). Ultimately the L1 cache counters give a good view of what is going on, so if you are unable to read counters from the L2 cache (for whatever reason) you can still follow the steps in this guide to perform cache optimization. It might just be slightly harder to see the full path of data through the cache system.
Most cores also provide additional PMU events (which will vary by core) to monitor cache usage and these can provide further information.
The “Cache: Data access” counter (PMU event number 0x04) measures all memory-read or -write operations which access the L1 data cache. All L1 data cache accesses (with the exception of cache maintenance instructions) are counted, whether they resulted in a hit or a miss.
The “Cache: Data refill” counter (PMU event number 0x03) measures all memory-read or -write operations which cause a refill of the L1 data cache from: another L1 data cache, an L2 cache, any further levels of cache or main memory – in other words L1 data accesses which result in a miss. As above this does not count cache maintenance instructions, nor does it count accesses that are satisfied by refilling data from a previous miss.
The “Cache: L2 data access” and “Cache: L2 data refill” counters (PMU event numbers 0x16 and 0x17 respectively) measure the same things as their L1 counterparts, but for the L2 data cache.
More detailed information on any of these events can be found in the Performance Monitors Extension chapter of the relevant ARM Architecture Reference Manual as linked above.
After you have configured the target, press the Start capture button. Once capturing has started run the cache-test application on the target (as “./cache-test”). Depending on the performance of your target this will take a few seconds to run and will output several messages before returning to the command prompt. When this happens, press the Stop capture and analyze button. After a brief pause the analyzed data will be displayed.
You should now be presented with a chart looking similar to the image below:
Filter the view down to the cache-test application by clicking on the “[cache-test #<proc-id>]” entry in the process list below the charts. If there are multiple processes of interest, hold down the Ctrl key to select several of them. Depending on how long the capture session lasted and how long the program ran, there may be considerable empty space around the region of interest. To zoom in, change the Timeline display resolution using the dropdown to the left of the Time index display above the charts (set to 100ms in the example above).
The results are currently somewhat difficult to interpret because all Cache measurements are plotted on the same chart but have different ranges. Split the “Cache: Data access” and “Cache: L2 data access” measurements into a separate chart as follows:
Having separated these two series the chart should now look similar to the image below:
Next we will produce some custom data series to provide additional information about the performance of the caches:
This is a very simple example but it is possible to combine any number of expressions and standard mathematical syntax to manipulate or create new series in this way, as documented in the Streamline User Guide (Section 6.21).
This will result in a chart that looks similar to the image below:
In our case the clock frequency figure (133 MHz) is misleading as it is the average of 6 cores, 5 of which are powered down.
Having reorganized the captured data we are now in a position to analyze what happened.
The program appears to be split into three main phases. The first 200 ms has a relatively low level of cache activity, followed by a further 100 ms phase with:
This suggests a lot of data is being processed but the caches are being well utilized. The relatively high L2 data refill ratio would normally be a cause for concern. Combined with a low L1 refill ratio, however, it suggests that the L2 cache is simply not being accessed that frequently, something which is confirmed by the low number of L2 cache accesses (4.7 M) vs. a high number of L1 cache accesses (50.2 M). The L2 cache will always perform at least some refills when operating on new data since it must fetch this data from main memory.
There is then a subsequent 2200 ms phase with:
This hints at a similar level of data consumption (based on the fact that the L2 cache has a similar number of refills, meaning the actual volume of data collected from main memory was similar), but much poorer cache utilization (based on the high L1 data cache refill ratio).
This is the sort of pattern to watch out for when profiling applications with Streamline, as it often means that cache utilization can be improved. As the L1 data cache refill ratio is high while the L2 data refill ratio is low, the program appears to be thrashing the L1 cache. Were the L2 data refill ratio also high, the program would be thrashing the L2 cache; in that situation, however, it may be that the program is consuming unique data, in which case there is very little that can be done. In situations where the same data is being operated on multiple times (as is common), this access pattern can often be significantly improved.
In our case the cache-test application sums the rows of a large 2-dimensional matrix twice. The first time it accesses each cell in Row-Major order – the order the data is stored in the underlying array:
for (y = 0; y < iterations; y++)
    for (x = 0; x < iterations; x++)
        sum_1d[y] += src_2d[(y * iterations) + x];
Whereas the second time it accesses each cell in Column-Major order:
for (x = 0; x < iterations; x++)
    for (y = 0; y < iterations; y++)
        sum_1d[y] += src_2d[(y * iterations) + x];
This means the cache is unable to take advantage of the array’s spatial locality, something which is hinted at by the significant jump from a negligible number of L1 data TLB refills to 26.9 million. The TLB (Translation Lookaside Buffer) is a small cache of the page table: the Cortex-A57’s L1 data TLB is a 32-entry fully-associative cache. A large number of misses in the TLB (i.e. the result of performing un-cached address translations) can be indicative of frequent non-contiguous memory accesses spanning numerous pages – as is observed in our case.
The cache-test program operates on a 5000x5000 matrix of int32s, or 95.4 MiB of data. The Cortex-A57 uses a 64-byte cache line length, so at least 1.56 M cache line refills are needed to retrieve all the data. This explains the virtually equal L1 and L2 data cache refills (1.57 M each) in the row-major phase, where the data is accessed in order, and why they must be this high even in the best case.
In this simple case we can improve the cache utilization by switching around the inner and outer loops of the function, thus achieving a significant performance improvement (in our case a 22x speed increase) at no additional cost.
In real-world examples, where it may not be as easy to locate the exact area of inefficiency, Streamline’s source code view can be used to help pinpoint the issue. To use this it will be necessary to load the application’s binary, either as described earlier or after capture by right-clicking the report in the Streamline Data view, selecting Analyze... and adding the binary. If the binary contains debug symbols, source-code-level debug information will be available (in the Code tab); otherwise only function-level information will be available (in the Functions tab, and also from the Timeline Samples HUD). Function-level information will still provide a good clue as to where to look, however. Provided debug symbols are available, the code view can easily be used to give a view similar to the one below by clicking through the offending functions in the Functions tab.
The annotations on the left of the source code line show the number of occasions that line was being executed when the sample was taken and that percentage relative to the rest of the function. Using the Timeline Sample HUD we can identify the “yx_loop” function as being responsible for the majority of the samples from our code (1617) throughout the second phase (which we identified as having poor cache utilization). Clicking through this function in the Sample HUD or the Functions tab, we can see 1584 samples on the line within the nested for-loop – suggesting this loop needs a second look. In our case this is a particularly simple function consisting only of this loop, but if it were more complex it would offer a much greater insight into the exact spot the offending function was spending most of its time.
I have attached the source to the simple cache-test example. It is currently in the process of being added to the examples bundled with DS-5, so it will be included with future product versions. I will update this blog post when that happens.
Feel free to post any comments or questions below and I will respond as soon as possible.
Hello Jonathan,
how do you know that 81.5M L1 data cache accesses of yx_loop function?
-Best Regards!
Dean