Streamline, the profiling tool in DS-5, is a powerful way to analyse system behaviour, with proven results at both the macro (system-level interactions) and micro (CPU code hotspots) levels.
Streamline is also an excellent way to explore a system. Very few of us are lucky enough (or capable!) to own an entire system these days, especially in the Cortex-A space. Our contribution is usually a small part of a much wider software stack, with code from colleagues and contractors running on a third-party OS alongside multiple other processes.
We tend to carry a mental model of how the system works: what order things happen in, which parts are potential bottlenecks, and which parts need to be worried about. This is an essential abstraction that lets us get some work done, but it can end up misleading us. If system behaviour doesn't contradict the model too obviously, or performance is acceptable, we may never stop to ask whether the model is valid, especially when measuring it would be time consuming and not obviously contribute to an on-time, on-budget delivery.
Recently I ran into exactly this kind of scenario when I was given the opportunity to see some preliminary results from early Streamline support for a new piece of IP, in this case a hardware accelerator. We fired up an accelerator test case and captured some data for visualisation in the Timeline view, including the new counters and measures of accelerator activity.
Of course we also captured other data, including CPU activity, at the same time. Looking at the CPU activity chart, which wobbled along at about 25%, I asked myself whether this was expected given that we were supposed to be offloading all the real work to the accelerator. But performance was good, nobody was worried, and it was easy to move on. (In the interests of full disclosure, I should note that some function names have been changed and some images have been modified to avoid revealing who I have been working with.)
The 'What the .... ?!?' moment
Later, however, it transpired that we were running on a quad-core CPU, and then I knew I was on to something.
Let's take a brief detour. A Streamline report opens initially with a collapsed, average activity chart for all cores in an SMP system. You can expand this to show separate averages for the big and LITTLE clusters (if relevant) and then all cores individually. If a single core in a quad-core system runs at 100% activity, it appears as a constant baseline of 25% in the average view. So 25% average activity quickly becomes a red flag to Streamline users, as does 50% on a dual-core system, and so on.
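The arithmetic is simple enough to sketch in a few lines of C (purely illustrative; this is not how Streamline itself computes the chart):

```c
#include <stdio.h>

/* Illustration only: the collapsed Timeline chart is the mean of the
 * per-core activity figures, so one saturated core in a quad-core
 * system shows up as a 25% baseline. */
static double collapsed_average(const double *core_load, int num_cores)
{
    double sum = 0.0;
    for (int i = 0; i < num_cores; i++)
        sum += core_load[i];
    return sum / num_cores;
}

int main(void)
{
    double quad[4] = { 100.0, 0.0, 0.0, 0.0 }; /* one core pinned at 100% */
    printf("average = %.0f%%\n", collapsed_average(quad, 4)); /* prints 25% */
    return 0;
}
```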
So, returning to our accelerator: as suspected, upon expanding the average we could see one core running at close to 100%, something that immediately clashed with our mental model. Now people were intrigued and suddenly the hunt was on for a reason. Fortunately, Streamline generally makes this kind of investigation pretty quick. First we isolated an active portion of the test case using the caliper tool. This restricts the other Streamline views, like the Call Paths and Functions tabs, to only the selected portion of the timeline.
The Call Paths view shows a breakdown of where time is spent inside the test application and driver. To get this extra detail you need to add the relevant ELF files and re-process the report (there is no need to re-run the test case and capture new data). Now we have our suspect: copy_from_subsystem(), accounting for 50.3% of the process workload (the 9.78% figure in the Total column is the same load expressed as a percentage of all four cores).
Looking at the code, we can see it is a loop that copies data from an accelerator-specific block of memory to the destination buffer in the user-space code that invoked the accelerator. It copies 4 bytes at a time, which is almost certainly sub-optimal, and when we drilled down into the port_mem_read32() function we discovered that every call involved some per-page calculations to translate from accelerator memory to kernel memory. Instead of being done once per 4 KB page with the result re-used, the translation was performed for every word, i.e. 1024 times more often than required.
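The real driver source isn't mine to share, so here is a minimal sketch of the pattern, assuming a flat accelerator address space. Only copy_from_subsystem() and port_mem_read32() are real names from the report; accel_to_kernel() is a hypothetical stand-in for the driver's per-page translation. The fix is simply to hoist the translation out of the inner loop so it runs once per 4 KB page instead of once per word:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Hypothetical stand-in for the driver's translation from an
 * accelerator-space address to a kernel-space pointer; in the real
 * driver this involved non-trivial per-page arithmetic. Here the
 * "accelerator memory" is just a flat buffer so the example is
 * self-contained. */
static uint32_t accel_mem[8 * PAGE_SIZE / 4];
static uint32_t *accel_to_kernel(uint32_t accel_addr)
{
    return &accel_mem[accel_addr / 4];
}

/* The pattern we found: a full translation for every 32-bit word,
 * i.e. 1024 translations per 4 KB page. */
static void copy_from_subsystem_slow(uint32_t accel_addr, uint32_t *dst,
                                     size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] = *accel_to_kernel(accel_addr + 4 * i); /* cf. port_mem_read32() */
}

/* The fix: translate once per page and re-use the result for all the
 * words that share that page. */
static void copy_from_subsystem_fast(uint32_t accel_addr, uint32_t *dst,
                                     size_t words)
{
    size_t i = 0;
    while (i < words) {
        uint32_t addr = accel_addr + 4 * i;
        uint32_t *src = accel_to_kernel(addr);              /* once per page */
        size_t run = (PAGE_SIZE - (addr % PAGE_SIZE)) / 4;  /* words to page end */
        if (run > words - i)
            run = words - i;
        for (size_t j = 0; j < run; j++)
            dst[i + j] = src[j];
        i += run;
    }
}
```

A further step would be to copy each page-sized run in bulk (e.g. with memcpy()), but hoisting the translation alone removes the 1024x overhead described above.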
Nobody was particularly at fault here: a piece of bare-metal test code from initial hardware bring-up, which probably only ever needed to access a few words of memory and didn't need to be optimal, had survived various development iterations and made its way into an early version of the Linux driver. Performance was reasonably good, so nobody had any reason to look any deeper. Only when confronted by a quick Streamline report that contradicted our expectations did we stop to examine things in more detail.
I've no doubt that this issue would eventually have been found and fixed with or without Streamline, but at what cost, and after how many other design decisions had been made based on the performance of the existing code base? Alternatively, suppose a customer had found this when bringing up their first silicon: what kind of impression would they have formed, and how would that have affected their future attitude to the product?
And if the issue is not so much about raw performance as about energy efficiency, then I'm not so sure it would have been caught; it tends to be more difficult to formulate expectations about battery life, especially if the inefficient task is only active for a small part of overall up-time. Without any expectations to challenge, it's even more important to make some actual measurements. To help here, Streamline can collect energy consumption data either from on-board sensors, via shunt resistors and the Energy Probe (http://ds.arm.com/ds-5/optimize/), or from certain third-party DAQs. So, for example, do you actually know whether the code to turn off the power-hungry peripherals in your design works? How long does it take? With Streamline you now have a way to see the drop in energy consumption correlated with the execution of your code.
Streamline is a quick way to get a much better understanding of system behaviour and to improve your mental model. This can lead to immediate gains from easy optimisations or fixes (like the one described), medium-term improvements from a better understanding of how the various components of a system work together (e.g. additional buffering in a graphical pipeline), and longer-term improvements that can even influence the design of future products.
I hope my anecdote also illustrates how all developers can make regular use of a tool like Streamline to avoid misconceptions about system behaviour and to keep day-to-day track of the less tangible aspects of a design.
To optimise you must first understand. Streamline is your friend on both counts.
Even this analysis turned out to be a little flawed. The copying of data from the hardware to user space was sub-optimal, but it wasn't the majority of the issue; after fixing the copying, the load went down, but only to eighty-something percent. This was another example of an assumption breaking down: I didn't have a good picture of what CPU load should look like for this kind of copy operation, so I was happy to have found something suboptimal and assumed it was the only problem. Our current understanding is that the driver is spending a lot of CPU cycles waiting for the accelerator hardware, because it is instantiated in an FPGA and thus runs comparatively slowly. That sounds like the last piece of the puzzle, but nobody has actually proven it yet, so perhaps we're guilty of having made yet another assumption. Looks like I'd better take my own advice and go do some more investigation...!