
Software Development Tools


Last quarter we started to blog about our work on GNU GCC and LLVM because we believe that sharing information is key to cooperation in the open source community. We want to continue these updates by sharing our achievements from last quarter and our plans for the future. We will be at GNU Tools Cauldron 2015 on 7th/8th/9th August in Prague (Czech Republic): please come and talk with us, as this is a great occasion to meet us in person and discuss open source contributions.


The following notes include partial information on what we’ve been working on in the last quarter and what we plan to do in the next one: for details please refer to the slides or get in touch with us.



The last quarter was particularly important, with the release of the new major version, GCC 5.1! Thanks to the ARM engineers and everyone else who helped get this important milestone release smoothly out of the door. For the Cortex-R and Cortex-M profiles, ARM released GCC 4.9 for ARM Embedded Processors: you can find the release notes on the Launchpad website.


In terms of development, the majority of the effort went into improving ABI compliance and some performance tuning. As mentioned in the previous update, we added ARMv8.1 support to binutils, enabled GCC native tuning (-mcpu=native) and worked on ABI compliance for both the Cortex-A and Cortex-R/M toolchains.


What’s next?

For the next quarter, the plan is to complete what's left of ARMv8.1 support and to work on various optimizations, such as enhancing GCC loop invariants (PR65477, PR62173, PR62178), improving the cost model for Cortex-A53 and Cortex-A57, and improving CSEL code generation for AArch64.

Further work will improve the selection of FP divide and multiply on Cortex-M and add support for all AArch64 memory models.



Even though it is relatively new, LLVM is quickly gaining popularity, and ARM is committed to supporting its development in the community. The commercial toolchain we offer to our customers, ARM Compiler 6, is in fact based on open source Clang.


In the last quarter we worked on different aspects of the compiler, from adding support for the ARMv8.1 architecture to improving the usability of the command line interface: in collaboration with Linaro we improved the architecture and core name parsing, which now has cleaner code and is more usable than before.

In terms of performance, we’ve been working on several optimizations (alignment of global variables, minimization of stack usage (details in section LLVM lifetime markers), new float2int pass, PBQP register allocator, etc.) but we also set up a new Cortex-A53 performance tracking bot: read more about this in the section below.


What’s next?

In terms of future plans, we will remain focused on performance improvements across all the cores and on optimizing accesses to global variables in loops. We also plan to further improve the LNT WebUI to make it easier to detect performance changes tracked by the running bots.


LLVM lifetime markers

In the last quarter's update we mentioned the need to reduce stack usage, which is particularly important for the Android Open Source Project. Lifetime markers identify when a particular stack slot becomes alive or dead along each control flow path: these markers are ignored by most optimization passes, but they are important for reducing stack usage.


ARM engineers removed the previous 32-byte minimum size limitation for a marker, unveiling a few issues (primitive types use 1-, 2-, 4- and 8-byte stack slots) but contributing to an overall reduction in stack usage.
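As a rough illustration of what the markers enable (the function and buffer names here are our own, invented for this sketch, not taken from the AOSP code):

```cpp
#include <cstring>

// Sketch: two 64-byte buffers with disjoint lifetimes. Lifetime
// markers tell the optimizer when each slot becomes dead, so it
// may overlay both buffers in a single 64-byte stack slot instead
// of reserving 128 bytes for the frame.
int sum_twice(const char *s) {
    int total = 0;
    {
        char first[64];                     // alive only in this block
        std::strncpy(first, s, 63);
        first[63] = '\0';
        for (const char *p = first; *p; ++p) total += *p;
    }
    {
        char second[64];                    // may reuse first's slot
        std::strncpy(second, s, 63);
        second[63] = '\0';
        for (const char *p = second; *p; ++p) total += *p;
    }
    return total;
}
```

Whether the slots are actually shared is up to the optimizer; the markers merely make the reuse legal.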


LLVM public performance tracking bot

Developing compilers is a tough job! Each patch can affect not only the correctness of code generation but also the performance of the generated code. Tracking performance can be really tricky, especially considering the number of devices and architectures LLVM supports. For those reasons, ARM committed to helping the community by adding a public Cortex-A53 tracking bot: the script executes a few benchmarks on LLVM top-of-trunk every 6 hours and publishes the results at http://llvm.org/perf


There are still a few improvements that could be made on the system but we feel this is going in the right direction and we hope the community will make good use of it!


For more details please refer to the full presentation given by Matthew Gretton-Dann available on YouTube and his slides (attached to this blog post).

We would like to hear from you about what you are doing in the open source community, to share ideas and cooperate for the good of the whole ecosystem. See you at GNU Cauldron in August!



DAC 2015, Fast Models 9.3

Posted by robkaye Jun 23, 2015

Earlier this month I attended DAC in San Francisco. We had a demo of Fast Models, some partner presentations and a poster session. I came away from the conference with the impression that while the technical conference remains vibrant, the exhibition portion is declining in importance. I first took part in the 1980s, and since then we have seen the birth of the Internet. In those far-off days large delegations from all parts of the world would attend to find out the latest product information and get updates from the EDA vendors. Who can forget some of the creative ways that some of them promoted their products? Nowadays that information is largely available online and through various social media (like this one), decreasing the value of visiting the trade show: it may be convenient and efficient, but it's certainly a lot less fun.


A new demo involving Fast Models was shown by Aldec:



Aldec's demo platform for their Hybrid Virtual Prototype with Fast Models.

Hybrid platforms like this are becoming very popular when there is a need to connect a high-performance simulator, representing the processor or processor subsystem, to a more detailed model of other parts of the system. This could be for many reasons, which we have discussed in an earlier blog.

Immediately prior to DAC we released Fast Models version 9.3. We have moved to a quarterly release cycle (from half-yearly) that better serves the needs of ARM's IP roadmap. In this release we introduced support for new Cache Coherent Network models (CCN-502, CCN-512) and Mali Display Processors (Mali-DP500 and Mali-DP550). We also continued to advance the capabilities of the models: the two areas we are currently focused on are Timing Annotation and Checkpointing (Save and Restore).

Timing Annotation extends the use of the Virtual Prototype to early, high-level performance estimation. The functionality provides a mechanism for the user to insert estimated timings at key points in the Virtual Prototype to improve the correlation of the reported cycle counts with what will be achieved in hardware. The aim is to do this with minimal impact on the throughput of the model. We are adding Timing Annotation in stages: in this release the focus has been on the integrated cache models. Of course, the results are very heavily dependent on the quality of the annotated values.


We also introduced a new type of system in the example Virtual Prototypes supplied with Fast Models. Previously we have delivered Fixed Virtual Prototypes (FVP) and Exported Virtual Subsystems (EVS), the former being standalone platforms, the latter being functionally equivalent examples that integrate with SystemC. The third category, which also works with SystemC, is called an SVP, or SystemC Virtual Prototype. The evolution from the EVS is that in the SVP, models are individually instantiated into SystemC rather than forming a monolithic subsystem. This gives the platform developer much more flexibility.


The second half of 2015 will see the continued evolution of the Fast Model functionality and a burgeoning library of models. 

Hopefully I'll be seeing some of you at the ARM TechCon in November where we'll be going into more detail on these capabilities.

ARM FAE Ronan Synnott explains the DS-5 Development Studio at the 52nd DAC in the Moscone Center. DS-5 contains the compilers, debuggers and Streamline analyzer that assist with every stage of SoC development. To find out more please visit http://www.ds.arm.com



Do you have any questions? Please put them in the comments section below.

Often we hear that embedded software engineers avoid using C++ for fear of a potential performance hit or code size explosion. Even though some features of C++ can have a significant impact on performance and code size, it would be a mistake to exclude the language completely because of this.


In this article I want to show a few additions in the C++11 and C++14 standards which can improve the readability of your code without affecting performance, and which can therefore be used even with the smallest Cortex-M0+ core.


ARM Compiler 5 supports C++11, whereas ARM Compiler 6 supports both C++11 and the more recent C++14 (refer to the documentation for details). If no standard is specified, ARM Compiler 6 assumes C++03, so you need to use the command line option --std=c++11 or --std=c++14 to use the newer standards. If you want to enforce conformance to a specific standard you can use the command line option --pedantic-errors: armclang will then generate an error if you use extensions or features outside the selected standard.
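Putting those options together, a strict C++14 compile might look like this (the target triple follows the form used later in this post; the file names are illustrative):

```shell
# Select the C++14 standard and reject anything outside it
armclang -target armv8a-arm-none-eabi --std=c++14 --pedantic-errors -c source.cpp -o source.o
```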



constexpr

The constexpr keyword was introduced in C++11, and C++14 removed a few constraints, making the functionality even more powerful. When a function is declared constexpr, the compiler knows that its result can be evaluated at compile time and can use it accordingly.

Let’s assume we want to create a static array based on the number of bits set in a word; with C++03 we would have written something similar to the following code:

const int my_word = 0xFEF1; // bit mask

int *my_array;

int number_of_bits(int word) {
    int count = 0;
    while (word) {
        count += word & 0x1;
        word >>= 1;
    }
    return count;
}

my_array = (int*)malloc(sizeof(int) * number_of_bits(my_word));


With C++14 it is possible to calculate this in a function whose result is available at compile time. The code can be transformed as follows:

const int my_word = 0xFEF1; // bit mask

constexpr int number_of_bits(int word) {
    int count = 0;
    while (word) {
        count += word & 0x1;
        word >>= 1;
    }
    return count;
}

int my_array[number_of_bits(my_word)];

Because the function is evaluated at compile time, the compiler can instantiate the array statically, saving the call to malloc() at run time: readability and performance are improved at the same time!
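A handy side effect is that constexpr results can also feed static_assert, so assumptions are checked entirely at compile time. A minimal sketch reusing the function above (the expected count, 12, is simply the number of set bits in 0xFEF1):

```cpp
constexpr int number_of_bits(int word) {
    int count = 0;
    while (word) {          // loops inside constexpr require C++14
        count += word & 0x1;
        word >>= 1;
    }
    return count;
}

// Verified by the compiler, with zero run-time cost:
// 0xFEF1 = 1111 1110 1111 0001 has 12 bits set.
static_assert(number_of_bits(0xFEF1) == 12, "unexpected bit count");
```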

Binary literals

Often in our applications, we need to use bit masks or perform bit operations: How many times did we write code similar to the following?


if (x & 0x20) { // 0010 0000


What does 0x20 mean in this code? An expert programmer will recognize a check of whether the sixth least-significant bit of x is set, but it can get trickier with more complex bit masks. C++14 introduces binary literals, making the specification of bit masks much clearer:


if (x & 0b0010'0000) {


As you can see from the example, not only can we specify the bit mask directly, but we can also use ' as a digit separator to enhance readability even further. The generated assembly code is the same, but the source is easier to understand.
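Since the two notations denote the same constant, the equivalence can itself be checked at compile time:

```cpp
// Only the spelling changes between the two literals,
// so there is no run-time cost whatsoever.
static_assert(0b0010'0000 == 0x20, "literals must match");
static_assert(0b1111'1110'1111'0001 == 0xFEF1, "literals must match");
```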


Range-based for loop

Most modern languages like Python and C# support range-based loops; this doesn’t add more power to the language but it improves readability of the resulting code.

This functionality has been added to C++11 and it’s now possible to use range-based loops directly in your existing code.

Let’s take a look at an example:


int my_array[] = {1, 2, 3, 4, 5};

int sum_array(void) {
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        sum += my_array[i];
    }
    return sum;
}


This can be re-written as:

int my_array[] = {1, 2, 3, 4, 5};

int sum_array(void) {
    int sum = 0;
    for (auto value : my_array) {
        sum += value;
    }
    return sum;
}

The code reads better now, and we have also removed the size of the array from the for loop, which was a potential source of bugs (we would need to update it if we added a new element, for example).

The range-based for loop works with any type that has begin() and end() functions defined, so we can apply the same technique to std::vector:

int sum_array(std::vector<int> array) {
    int sum = 0;
    for (auto &value : array) {
        sum += value;
    }
    return sum;
}

In this case the improvements in terms of readability are even better and, as a result, the code is easier to understand and maintain.
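The same mechanism extends to your own types: anything exposing begin() and end() works in a range-based for loop. A minimal sketch (the Buffer type and its contents are invented for illustration):

```cpp
// Minimal container: begin()/end() returning pointers are all
// the range-based for loop needs to iterate over the elements.
struct Buffer {
    int data[4] = {10, 20, 30, 40};
    int *begin() { return data; }
    int *end()   { return data + 4; }
};

int sum_buffer(Buffer &b) {
    int sum = 0;
    for (auto value : b) {
        sum += value;
    }
    return sum;   // 100 for the values above
}
```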

Null pointer constant

Since the early days of C, we have used NULL to check the validity of a pointer. This leads to confusion in C++, because NULL is equivalent to 0.

Let’s assume we have two functions with the same name and different arguments:

void Log_value(int value);   // first function
void Log_value(char *value); // second function

In C++, the following code has an unexpected effect from the developer's point of view.

Log_value(NULL); // will call the first function

In fact, by using NULL we expect the second function to be called, but because NULL is equal to 0, the first function is called instead.

In C++11 the keyword nullptr was introduced; it should be used instead of NULL so that we can easily avoid this ambiguity:

Log_value(nullptr); // will call the second function


In this case, the second function is correctly called with an explicit null pointer value.



We have seen a few features of C++11 and C++14 which can be used without worrying about performance and which enhance the readability of your code. This article covers just a few of them; you can find more information on Wikipedia (C++11 and C++14) and in the C++11 and C++14 standards.

I hope you found this information useful and that you can soon start using some of these features in your code base. As mentioned at the beginning, ARM Compiler 6 supports C++11 and C++14. If you don't yet have DS-5, download a free 30-day evaluation of the Ultimate Edition to get started.


Feel free to post any questions or comments below.





Poor cache utilization can have a big negative impact on performance, and improving utilization typically involves little or no trade-off. Unfortunately, detecting poor cache utilization is often difficult and requires considerable developer time. In this guide I will demonstrate using Streamline to drive cache optimization and identify areas of inefficiency.


I have used the Juno ARM Development Platform for the purposes of this guide, however the counters I use (or equivalents) should be available on all ARM Cortex-A class processors so it should be easily repeatable. Even without a platform to test on, the methodology I use should provide an insight into using Streamline to help guide optimization.


This guide assumes a basic level of knowledge of Streamline. Introductory information and getting started guides can be found in DS-5’s documentation or, along with other tutorials, on the website.



Setting up Streamline

Start by installing gator on the target. This is beyond the scope of this guide; see the readme in <DS-5 installation dir>/arm/gator/ for detailed information. Once installed, launch the gator daemon. I successfully used both user-space and kernel-space versions of gator. The user-space version is sufficient in most cases, the kernel-space version is only required in some circumstances – I expand on this point later.


Compile the attached cache-test application. It is sufficiently simple that it could be compiled on the device (if a compiler were available) or cross-compiled otherwise.



Configuring DS-5

Open up the Streamline Data view in DS-5. Configure the Streamline connection using the Capture & analysis options () to use the gator version running on the target. The other default configuration options should be sufficient, although you may optionally add the application binary to the Program Images section at the bottom for function-level profile information, or, if the binary contains debug symbols, source-code-level profile information.



Adjust the Counter configuration () to collect events from:

  • Cortex-A57
    • Cache: Data access
    • Cache: Data refill
    • Cache: L2 data access
    • Cache: L2 data refill



In our case we are also collecting “Cache: Data TLB refill”, which will provide an additional measurement to analyze caching performance, as well as “Clock: Cycle” and “Instruction: Executed” which will provide an insight into how execution is progressing. We are also collecting from the energy measurement counters provided on the Juno development platform.


Further Information on the Target Counters

The counters listed above are specific to our particular platform – the Juno development board. This has a big.LITTLE arrangement of 2x Cortex-A57s and 4x Cortex-A53s; we will be running our program on one of the Cortex-A57 cores.


The ARM Performance Monitors extension is an optional, non-invasive debug component available on most Cortex-A-class cores. Streamline reads the Performance Monitor Unit (PMU) architecture provided by this extension to generate its profiling information. Each of the processor counters observed within Streamline corresponds to a PMU event. Not all events described by the PMU architecture are implemented in each core, however a core set of events must be implemented, including the “Cache: Data access” and “Cache: Data refill” events shown above (in PMUv2 and PMUv3). Thus these two events should be available on all Cortex-A-class cores which implement the architecture. For more detailed information on the Performance Monitors Extension see the relevant section of the ARM Architecture Reference Manual for ARMv7 (Chapter C12) or ARMv8 (Chapter D5) as appropriate.


The “Cache: L2 data access” and “Cache: L2 data refill” counters are also common (but not mandated) on cores with an integrated L2 cache controller; however, some cores have separate L2 cache controllers, for example the CoreLink Level 2 Cache Controller L2C-310. In this case the counters will be limited to what is available from the controller and whether Streamline supports it. In the case of the L2C-310, equivalent counters are available and it is supported in Streamline, although the counters are only readable using kernel-space gator (user-space gator can still read all others). Ultimately the L1 cache counters give a good view of what’s going on, so if you are unable to read counters from the L2 cache (for whatever reason) it is still possible to follow the steps in this guide to help perform cache optimization; it might just be slightly harder to see the full path of data through the cache system.


Most cores also provide additional PMU events (which will vary by core) to monitor cache usage and these can provide further information.


The Chosen Counters

The “Cache: Data access” counter (PMU event number 0x04) measures all memory-read or -write operations which access the L1 data cache. All L1 data cache accesses (with the exception of cache maintenance instructions) are counted, whether they resulted in a hit or a miss.


The “Cache: Data refill” counter (PMU event number 0x03) measures all memory-read or -write operations which cause a refill of the L1 data cache from: another L1 data cache, an L2 cache, any further levels of cache or main memory – in other words L1 data accesses which result in a miss. As above this does not count cache maintenance instructions, nor does it count accesses that are satisfied by refilling data from a previous miss.


The “Cache: L2 data access” and “Cache: L2 data refill” counters (PMU event numbers 0x16 and 0x17 respectively) count in the same way as their L1 counterparts, but for the L2 data cache.


More detailed information on any of these events can be found in the Performance Monitors Extension chapter of the relevant ARM Architecture Reference Manual as linked above.



Capturing Data

After you have configured the target, press the Start capture button (). Once capturing has started run the cache-test application on the target (as “./cache-test”). Depending on the performance of your target this will take a few seconds to run and will output several messages before returning to the command prompt. When this happens, press the Stop capture and analyze button (). After a brief pause the analyzed data will be displayed.



Reformatting the Captured Data

You should now be presented with a chart looking similar to the image below:



Filter this by just the cache-test application by clicking on the “[cache-test #<proc-id>]” entry in the process list below the charts. In the case of multiple processes of interest, the Ctrl key can be held down to select several. Having done this, depending on how long the capture session lasted and how long the program ran, there may be considerable empty space around it. Change the Timeline display resolution using the dropdown to the left of the Time index display above the charts (set to 100ms in the example above) to zoom in.


The results currently are somewhat difficult to interpret as all Cache measurements are plotted on the same chart but have different ranges. Split the “Cache: Data access” and “Cache: L2 Data access” measurements into a separate chart as follows:

  1. Click on the Charts Snippet menu () above the process list.
  2. Select Add Blank Chart. Enter “Cache Accesses” as the new chart’s Title and drag it above the “Cache” chart.
  3. On the “Cache” chart, open the Configuration Panel ().
  4. Amend the “Cache” chart’s title to “Cache Refills”.
  5. Using the handle (), drag the “Data access” and “L2 data access” series to the newly created “Cache Accesses” chart.
  6. Remove the blank “Required” series in the “Cache Accesses” chart ().
  7. Change the plotting method of both charts from Stacked to Overlay (using the drop-down box at the top left of the Configuration Panel), allowing the relationship between the values to be more apparent.
    In Overlay mode the series are plotted from the top of the list, down – i.e. the series at the bottom is plotted last, in front of all others. As a result some series may need rearranging to improve their visibility in Overlay mode (although colors are slightly transparent so no data is completely hidden).
  8. Optionally rename the series as appropriate – e.g. “Data access” may be more sensibly named “L1 data access” to complement the “L2 data access” series.
  9. Optionally change the colors of the series to improve their contrast.
  10. Close the Configuration Panel by pressing the button again ().


Having separated these two series the chart should now look similar to the image below:


Next we will produce some custom data series to provide additional information about the performance of the caches:

  1. Click on the Charts Snippet menu () above the process list.
  2. Select Add Blank Chart. Enter “Cache Refill Ratios” as the new chart’s Title and drag it below the “Cache Refills” chart.
  3. Enter “L1 data ratio” as the new series’ Name. Set the Expression to be “$CacheDataRefill / $CacheDataAccess”. As this result is a percentage (the ratio of L1 data cache refills to accesses – i.e. the miss rate), tick the Percentage checkbox.
  4. Add another series to the new “Cache Refill Ratios” chart () and repeat the process for the L2 cache, setting the Expression to be “$CacheL2DataRefill / $CacheL2DataAccess”.
    The expression will differ if using a separate L2 cache controller. Pressing Ctrl + Space in the Expression window will list all available variables.
    In our case the 0x04/0x03 and 0x16/0x17 counter pairs are explicitly listed in the ARMv8 ARM Architecture Reference Manual as being associated in this way. Some care should be taken when using a separate cache controller that this assumption still holds.
  5. Change the plotting method of the chart from Stacked to Overlay.
  6. Optionally change the colors of the series to improve their contrast.


This is a very simple example but it is possible to combine any number of expressions and standard mathematical syntax to manipulate or create new series in this way, as documented in the Streamline User Guide (Section 6.21).


This will result in a chart that looks similar to the image below:


In our case the clock frequency figure (133 MHz) is misleading as it is the average of 6 cores, 5 of which are powered down.



Understanding the Captured Data

Having reorganized the captured data we are now in a position to analyze what happened.


The program appears to be split into three main phases. The first 200 ms has a relatively low level of cache activity, followed by a further 100 ms phase with:

  • A large number of L1 data cache accesses (50.2 M).
  • A virtually equal number of L1 and L2 data cache refills (1.57 M each).
  • A negligible number of L1 data TLB refills (26 K).
  • A low L1 data cache refill ratio (3.1%), although a relatively high L2 data cache refill ratio (33.2%).


This suggests a lot of data is being processed but the caches are being well utilized. The relatively high L2 data refill ratio would be a cause for concern, however with a low L1 refill ratio it suggests that the L2 cache is simply not being accessed that frequently – something which is confirmed by the low number of L2 cache accesses (4.7 M) vs. a high number of L1 cache accesses (50.2 M). The L2 cache will always perform at least some refills when operating on new data since it must fetch this data from main memory.


There is then a subsequent 2200 ms phase with:

  • A slightly larger number of L1 data cache accesses (81.5 M over the period), but a significantly reduced rate of L1 data cache accesses (37 M accesses per second compared to 502 M accesses per second in the first phase).
  • A significantly increased number of L1 data cache refills (26.9 M).
  • A similar number of L2 data cache refills (2.1 M).
  • A vastly increased number of L1 data TLB refills (24.9 M).
  • A much higher L1 data cache refill ratio (33.0%) and a much lower L2 data cache refill ratio (2.03%).


This hints at a similar level of data consumption (based on the fact that the L2 cache has a similar number of refills, meaning the actual volume of data collected from main memory was similar), but much poorer cache utilization (based on the high L1 data cache refill ratio).


This is the sort of pattern to watch out for when profiling applications with Streamline as it often means that cache utilization can be improved. As the L1 data cache refill ratio is high while the L2 data refill ratio is low the program appears to be thrashing the L1 cache. Were the L2 data refill ratio also high the program would be thrashing the L2 cache, however in this case it may be that the program is consuming unique data – in which case there is very little that can be done. However in situations where the same data is being operated on multiple times (as is common) this access pattern can often be significantly improved.


In our case the cache-test application sums the rows of a large 2-dimensional matrix twice. The first time it accesses each cell in Row-Major order – the order the data is stored in the underlying array:

for (y = 0; y < iterations; y++)
    for (x = 0; x < iterations; x++)
        sum_1d[y] += src_2d[(y * iterations) + x];


Whereas the second time it accesses each cell in Column-Major order:

for (x = 0; x < iterations; x++)
    for (y = 0; y < iterations; y++)
        sum_1d[y] += src_2d[(y * iterations) + x];


This means the cache is unable to take advantage of the array’s spatial locality, something which is hinted at by the jump in L1 data TLB refills from a negligible number to 24.9 million. The TLB (Translation Lookaside Buffer) is a small cache of the page table: the Cortex-A57’s L1 data TLB is a 32-entry fully-associative cache. A large number of misses in the TLB (i.e. the result of performing un-cached address translations) can be indicative of frequent non-contiguous memory accesses spanning numerous pages, as is observed in our case.

The cache-test program operates on a 5000x5000 matrix of int32s, or 95.4 MB of data. The Cortex-A57 uses a 64-byte cache line length, giving a minimum of 1.56 M cache line refills to completely retrieve all the data. This explains the virtually equal L1 and L2 data cache refill counts (1.57 M each) in phase 1, where the data is accessed in order, and explains why they must be this high even in the best case.
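The arithmetic behind those figures is straightforward; a quick sketch to reproduce them (the function names are ours, for illustration only):

```cpp
// Reproduce the figures above: int32 = 4 bytes per cell,
// 64-byte cache lines on the Cortex-A57.
long long matrix_bytes(int n)   { return 1LL * n * n * 4; }      // 5000 -> 100,000,000 bytes (~95.4 MB)
long long min_line_fills(int n) { return matrix_bytes(n) / 64; } // 5000 -> 1,562,500 (~1.56 M)
```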



Fixing the Issue

In this simple case we can improve the cache utilization by switching around the inner and outer loops of the function, thus achieving a significant performance improvement (in our case a 22x speed increase) at no additional cost.
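In code terms the fix is simply a return to the row-major traversal from phase 1. Wrapped into a self-contained sketch (the function name and types are ours, not from the attached source):

```cpp
#include <vector>

// Row-major traversal: the inner loop walks contiguous memory,
// so each fetched 64-byte cache line is fully consumed before
// it can be evicted.
std::vector<long long> sum_rows(const std::vector<int> &src_2d, int n) {
    std::vector<long long> sum_1d(n, 0);
    for (int y = 0; y < n; y++)          // outer loop: rows
        for (int x = 0; x < n; x++)      // inner loop: contiguous cells
            sum_1d[y] += src_2d[(long long)y * n + x];
    return sum_1d;
}
```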


In real-world examples, where it may not be as easy to locate the exact area of inefficiency, Streamline’s source code view can be used to help pinpoint the issue. To use this it will be necessary to load the application’s binary, either as described earlier or after capture by right-clicking the report in the Streamline Data view, selecting Analyze... and adding the binary. If the binary contains debug symbols, source-code-level debug information will be available (in the Code tab); otherwise only function-level information will be available (in the Functions tab, and also from the Timeline Samples HUD ()). Function-level information will still provide a good clue as to where to look, however. Provided debug symbols are available, the code view can easily give a view similar to the one below by clicking through the offending functions in the Functions tab.


The annotations on the left of the source code line show the number of occasions that line was being executed when the sample was taken and that percentage relative to the rest of the function. Using the Timeline Sample HUD () we can identify the “yx_loop” function as being responsible for the majority of the samples from our code (1617) throughout the second phase (which we identified as having poor cache utilization). Clicking through this function in the Sample HUD or the Functions tab, we can see 1584 samples on the line within the nested for-loop – suggesting this loop needs a second look. In our case this is a particularly simple function consisting only of this loop, but if it were more complex it would offer a much greater insight into the exact spot the offending function was spending most of its time.




I have attached the source to the simple cache-test example. It is currently in the process of being added to the examples bundled with DS-5, so it will be included with future product versions. I will update this blog post when that happens.


Feel free to post any comments or questions below and I will respond as soon as possible.

Usually when you create a bare-metal image you specify the location in memory where the code and data will reside, and provide an entry point address where execution starts.

But what if you don't want to specify a fixed memory location at build time?

Security has become a crucial aspect of applications. One common attack to gain privilege on a system is through buffer overflows: this anomaly could potentially lead to the execution of malicious code, jeopardizing the security of the entire system through code injection.

Different techniques are used to make a hacker's life harder, including randomizing the address space layout (ASLR). This technique is widely used in several high-level Operating Systems like Android, iOS, Linux and Windows.

With ARM Compiler 6 you can extend this protection to bare-metal applications by creating Position Independent Executables (PIE), also known as Position Independent Code (PIC). A PIE is an executable that does not use fixed addresses to access memory. Rather, it can be executed at any suitably aligned address and the code automatically recalculates the required addresses.

ARM Compiler 6 provides the -fbare-metal-pie (armclang) and --bare_metal_pie (armlink) options to let you create a bare-metal PIE:

armclang -fbare-metal-pie -target armv8a-arm-none-eabi source.c 
armlink --bare_metal_pie source.o

Note: armclang automatically passes the --bare_metal_pie option to armlink when you compile with -fbare-metal-pie.

Note: Bare-metal PIE is currently only supported for 32-bit targets.


Worked Example Part 1: Creating a PIE

Let's take a look at how this works in practice.

This example creates a very simple "Hello World" program in DS-5, uses ARM Compiler 6 to create a PIE, then uses the DS-5 Debugger and the AEMv8-A model to run the executable at an arbitrary position in memory.


Step 1: Create a "Hello World" C project in DS-5 Debugger

  1. Create a new C project in DS-5 called PIEdemo (Click File > New > Other... to start the New Project wizard), using Project type: Empty Project and Toolchain: ARM Compiler 6 (DS-5 built in).
  2. Add a new source file pie.c to the new project (right-click the project, then click New > Source File) with the following content:

    #include <stdio.h>

    const char *myString = "Hello World\n";

    int main(void)
    {
        printf("%s", myString);
        return 0;
    }

Step 2: Compile the source code to create a PIE

  1. Edit the project properties (right-click the project, then click Properties) and navigate to the ARM Compiler toolchain settings (C/C++ Build > Settings).
  2. Add the following command-line options:

    • ARM C Compiler 6 > Target > Target: armv8a-arm-none-eabi (this compiles for AArch32)
    • ARM C Compiler 6 > Miscellaneous > Other flags: -fbare-metal-pie -mfpu=none
    • ARM Linker 6 > Miscellaneous > Other flags: --bare_metal_pie
  3. Build the project (right-click the project, then click Build Project).


Step 3: Create a debug configuration for the AEMv8-A model

  1. Create a new debug configuration (right-click in the Debug Control tab, then click Debug Configurations..., then click the New Launch Configuration button).
  2. On the Connection tab:
    1. Select the VE_AEMv8x1 > Bare Metal Debug > Debug AEMv8-A target.
    2. Add the model parameter: -C cluster.cpu0.CONFIG64=0. This puts the model in AArch32 state, rather than the default AArch64 state.


  3. On the Debugger tab, select Run control: connect only.

    We want to load the image manually so that we can specify the load address.

Step 4: Run the PIE on the AEMv8-A model

  1. Double-click the debug configuration to connect to the AEMv8-A model target.
  2. Load the PIE by running the following command on the Commands tab:

    loadfile PIEdemo/Debug/PIEdemo.axf 0x80000044

    This loads the PIE at the arbitrary address 0x80000044, performs all necessary address relocations, and automatically sets the entry point:


    Note: You can choose any address, but it must be suitably aligned and at a valid location in the AEMv8-A memory map. For more information about the AEMv8-A memory map, see "AEMv8-A Base Platform - memory map" in the Fast Models Reference Manual.

    Note: You can ignore the TAB180 error for the purposes of this tutorial. For more information, see ARM Compiler 6: Bare-metal Hello World C using the ARMv8 model | ARM DS-5 Development Studio.

  3. Execute the PIE by running the following command on the Commands tab:


    Check the Target Console tab to see the program output:


How Does It Work?

Position independent code uses PC-relative addressing modes where possible and otherwise accesses global data indirectly, via the Global Offset Table (GOT). When code needs to access global data it uses the GOT as follows:

  • Evaluate the GOT base address using a PC-relative addressing mode.
  • Get the address of the data item in the GOT by adding an offset index to the GOT base address.
  • Look up the contents of that GOT entry to obtain the actual address of the data item.
  • Access the actual address of the data item.

We'll see this process in action later.

At link time, the linker does the following:

  • Creates the executable as if it will run at address 0x00000000.
  • Generates a Dynamic Relocation Table (DRT), which is a list of addresses that need updating, specified as 4-byte offsets from the table entry.
  • Creates a .preinit_array section, which will update relocated addresses (more about this later…).
  • Converts function calls to direct calls.
  • Generates the Image$$StartOfFirstExecRegion symbol.


At execution time:

  • The entry code calls __arm_preinit_.
  • __arm_preinit_ processes functions in the .preinit_array section, calling __arm_relocate_pie.
  • __arm_relocate_pie uses Image$$StartOfFirstExecRegion (evaluated using a PC-relative addressing mode) to find the actual base address in memory where the image has been loaded, then processes each entry in the DRT adding the base address offset to each address entry in the GOT and initialized pointers in the data area.



Worked Example Part 2: Stepping through PIE execution with DS-5 Debugger

Our example from earlier contains the global string "Hello World". Let's see how relocation is used in the PIE to access this data regardless of where the image is loaded.

In the Project Explorer view, double-click on the .axf executable to see the sections it contains:


We can see that the GOT is located at address 0x00000EE0 in the original image.

Now load the image to address 0x80000044 by running the following command on the Commands tab:

loadfile PIEdemo/Debug/PIEdemo.axf 0x80000044

Use the Disassembly view to view address 0x80000F24 (0x80000044 + 0x00000EE0). We can see that the GOT has been loaded, but it still contains unrelocated addresses:


Now, set a breakpoint on main() and run the executable. This executes the setup code, including __arm_relocate_pie which relocates the addresses in the GOT. Run the following commands on the Commands tab:

b main

Look at the GOT again, and note that the addresses have been relocated:


Now we'll see how the code uses the GOT to access the "Hello World" string.

Step to the next source instruction by running the following command on the Commands tab:


Jump to address $pc in the Disassembly view to view the code in main():


The code to print "Hello World" starts at 0x800000E4 and does the following:

  1. Load R1 with the GOT offset for our string (0xC), obtained by a PC-relative data lookup from address 0x80000118.
  2. Load R2 with the PC-relative offset of the GOT table (0xE30).
  3. Update R2 with the actual base address of the GOT table, PC + 0xE30 (0x800000F4 + 0xE30 = 0x80000F24).
  4. Load R1 with the contents of address R1 + R2 (that is, address 0x80000F24 + 0xC = 0x80000F30). The contents of this GOT entry are 0x80000F68, which is the address of the pointer to the "Hello World" string.
  5. Load R1 with the target address of the pointer, copy it to R0, and call puts.

You can single-step through the code and use the Registers view to see this working in DS-5 Debugger.


Further Reading

On 8th May, ARM-approved training specialists Doulos Embedded are hosting free webinars on effective application debugging for embedded Linux systems. Learn how to maximize your use of embedded Linux by addressing the important issue of application debugging, including examples using DS-5 Development Studio.


For Europe and Asia, register to attend on 8th May, 10am-11am BST (11am-12pm CEST, 2.30pm-3.30pm IST). Or for North America, register to attend at 10am-11am PDT (1pm-2pm EDT, 6pm-7pm BST).



See the full details »





ARM has always been committed to working with the ecosystem and cooperating with partners to get the best out of our cores. One important aspect of this cooperation is sharing what we have done in open source and what we plan to do in the near future.


GNU Toolchain

In the first quarter of 2015 we focused on getting GCC 5 ready for release, plus some work on both A-Profile and R/M-Profile processors.


In particular, for Cortex-A processors we made the instruction scheduling model more accurate and set a number of additional compiler tuning parameters, which will lead to performance improvements on Cortex-A57. We also added support for the new Cortex-A72 and performed initial performance tuning.


On the Cortex-R/M side we moved Thumb-1 prologue/epilogue generation into the RTL representation, allowing the compiler to perform further tuning on function call/return sequences.


Additional work has been done, along with the community, on improving NEON® intrinsics, refining string routines in glibc and implementing __aeabi_memclr / __aeabi_memset / __aeabi_memmove in Newlib.


What’s next?

For the second quarter of 2015, we plan to complete what we started at the beginning of the year. First of all, we will continue supporting and helping with the release of GCC 5: this is an important milestone and we want to make sure ARM supports it well. We will continue to work on adding support for the ARMv8.1-A architecture in GCC and improving performance for Cortex-A53 and Cortex-A57. For example, we noticed that GCC generates branches when compiling if/then statements where a conditional select could be used instead; compiler engineers are exploring this optimisation opportunity, which could give a significant performance boost.


LLVM Update

The activity on LLVM has been focused on improving both AArch32 and AArch64 code generation: we added basic Cortex-A72 support and continue to advance the performance of code generated for the ARMv8 architecture, for example by improving unrolling heuristics.


Initially our efforts were mainly directed at ARMv8, but we are now making big advances on ARMv7-A and ARMv7-M as well (see the MC-Hammer section below).


Supporting cores is not our only concern: the software ecosystem is important for us and in the last quarter we’ve been fixing stack re-alignment issues with the Android Open Source Project when built with LLVM.


What’s next?

During the next three months we will extend support for the ARMv8.1-A architecture and continue to work on performance optimisations. Some of the areas we will target are vectorisation, inlining, loop unrolling and floating-point transformations. We are also discussing autovectorizer support for strided accesses, to maximise the use of structure loads and stores.


We will continue to support the Android Open Source Project (AOSP). In particular we will focus on stack usage: LLVM is not performing as well as it could in determining when local variables are no longer used ("lifetime markers"), causing an unnecessary increase in stack usage.



Richard Barton presented MC-Hammer, a tool we’ve been using to verify the correctness of LLVM-MC against our proprietary reference implementation, at Euro-LLVM 2012 (the presentation and slides are available on the LLVM website: http://llvm.org/devmtg/2012-04-12/).


In 2012 we estimated that, at the time, ~10% of all ARM instructions for Cortex-A8 were incorrectly encoded and 18% were incorrectly assembled when using LLVM. Over the past three years we have gradually fixed corner-case bugs, and we are now confident that the v7-A and v7-M variants of the ARM architecture are handled correctly, as is AArch64. This is a great result: it means this functionality in LLVM-MC can be trusted and built upon.


We participated in EuroLLVM 2015 on 13th and 14th April in London (UK) and we will be at GNU Tools Cauldron 2015 on 7th/8th/9th August in Prague (Czech Republic): please come and talk with us! For more details, refer to the full presentation given by Matthew Gretton-Dann (available on YouTube, with slides attached to this blog) or get in touch with us if you need further information. We would like to hear what you are doing in this space, and perhaps work together towards a shared goal.

Paul Black

Debug over power-down

Posted by Paul Black Apr 13, 2015

Efficiency is at the heart of the ARM® architecture. ARM cores are commonly used in products where power consumption and battery life are critical considerations, but this can pose additional challenges for a debugger. For DS-5 v5.21, DSTREAM support for debug over power-down has been enhanced, giving improved “out of the box” debugger stability for power-critical ARM implementations.


Designed for efficiency

Many ARM® Cortex® cores are designed to be powered down when they are not needed. Alongside the in-built efficiency of the ARM architecture, this enables very effective power management. Individual cores or entire clusters of cores (particularly in ARM® big.LITTLE™ implementations) can power down when they are not needed, providing substantial reductions in power consumption. However, when a core powers down its memory-mapped debug registers may also power down, and that causes problems for a debugger.


Cores have two independent power domains. The main domain serves the core itself, while a smaller domain supports only a small number of critical debug registers. If the main core domain is powered down but the debug domain is kept powered, power consumption is still substantially reduced, yet a debugger can still read information about the core’s power state. The debugger can then make an informed decision about which operations are currently possible for that core and which other registers are available, and it can correctly display the core’s current power state and debug availability.


Challenges for a debugger

However, control of the power domains is implementation specific, and it is not uncommon to find that a core’s debug domain also becomes powered down. Sometimes the two domains are linked, so that when the core powers down the critical debug registers also become unavailable. Sometimes the debug power domains are linked to the overall power domain for a cluster of cores; in such implementations, debug domains remain available until all the cores in a cluster have powered down, and access to one core’s debug registers may be affected by the power state of a different core in that cluster.


When critical core debug registers become unavailable through power-down, the debugger cannot determine the power state of the core. It will be unable to determine which operations are currently available for that core, and may attempt to access core registers that are not currently accessible. The result is likely to be a loss of control of the core, and in some cases one of the SoC’s internal buses may become locked by a hung transaction.


In most SoCs, there is a power control block containing registers which show the current state of various power domains. Therefore it is possible to hand-craft support for individual SoCs, enabling the debugger to read and then interpret information from the SoC’s power control block. Unfortunately, there are two drawbacks with this method:


Firstly, because the debugger has to read information from the SoC’s power control block and interpret it before accessing core debug registers, there’s a risk of race conditions. The longer the delay between the information being read and being interpreted and acted upon, the greater the risk that the core’s power state will change. This means that the additional debugger functionality has to be implemented as close to the SoC as possible – in DSTREAM. Making changes to the DSTREAM firmware is a specialist task that can only be carried out by the ARM UK debugger engineering teams; it cannot be scripted by an ARM FAE or a debugger user. Obviously, the greater round-trip delays of lower-cost “JTAG wiggler” probes (where processing is carried out on the host instead of inside the probe) may prohibit this kind of functionality altogether.


Secondly, the ARM UK engineering teams will need information about the power control block in the SoC – how to access the registers and how to interpret their contents. This information may not be easily available, and it may be considered confidential – and this can introduce implementation delays.


Enhancements for DS-5

For DS-5 v5.21, DSTREAM implements a number of enhancements for debug over power-down. Provided that the registers in a core’s debug power domain have tie-off values (this means that debugger accesses during power-down fail cleanly, and result in an error rather than a hung transaction on the debug bus), DSTREAM now performs additional interpretation of register access failures. DSTREAM is able to intelligently interpret some register access failures as a core power-down, thereby keeping debug control of the core and correctly displaying the core’s power state in the DS-5 display. This removes the need for the ARM UK engineering teams to hand-craft support for individual SoCs, and the need to collect difficult or confidential information.


In DS-5 v5.21, this enhanced power-down support is implemented for ARMv7 cores. Enhanced support for ARMv8 cores is partly implemented, and will be completed in future releases of DS-5.

In DS-5 v5.21 the Snapshot Viewer has been improved, with additional functionality and better ease of use. This release extends support to include the ARMv8 architecture, all core registers plus peripheral SFRs, multiple memory spaces, and Linux kernel and device driver debug. The viewer is now data-driven, removing the need to hand-script new platform configurations.


Offline debug and trace

Sometimes, it’s not possible to use a conventional debug and trace session (through a JTAG debug and trace unit) to investigate a problem. There are a number of scenarios where JTAG debug can’t be used:

  • The problem that you need to investigate was seen on production hardware, that doesn’t have debug connectors. At best, this means that you have to re-create the problem on development hardware with full JTAG debug and trace capabilities
  • Some problems can’t be replicated with a debugger connected. The problems may be rare and occur infrequently, or you might only be able to re-create the problems out in the field, under particular customer use-cases, where it’s impractical to connect a debugger
  • You might have restricted access to development hardware, this is particularly true during SoC bring-up and with first revisions of development hardware. Although it might be easy to re-create the problem, scheduling time for investigation is more challenging
  • You need to investigate a crash, hang, or lock-up. Under these situations JTAG debug may be compromised, and information available to a debugger may be limited. If you reset the target to restore debug capabilities, you will almost certainly lose useful information about the cause of the crash


DS-5 Snapshot Viewer

In these situations you need a debugger that can help you analyse any information that could be dumped from the target (including register values, memory contents, and trace data) without connecting to the target itself. The ARM® DS-5 Debugger provides a Snapshot Viewer which takes information from dump files rather than from a target via a JTAG debug session. The Snapshot Viewer provides the same register, memory, disassembly, variable, and trace views as a JTAG debug and trace session, up to the limits of the information contained in the dump files. No target or JTAG unit is needed, and symbol information (including source code linking) can be used as normal. Since you’re not connected to a target, it’s not possible to change the target state (including run-control operations) – you are debugging a snapshot of the target’s state.


The ARM DS-5 Snapshot Viewer is often used with the CoreSight™ Access Library. This is an open-source library that runs under a Linux kernel and provides a high-level API for controlling CoreSight debug and trace components. The library can be used to configure CoreSight trace when no debugger can be connected, and then store the captured trace, together with memory and register contents, somewhere it can be recovered for offline analysis in a debugger (on an SD card, perhaps). A number of example configurations for common platforms ship with DS-5, and it’s easy to adapt and extend these samples to support your own target hardware.


Enhancements for DS-5 v5.21

In DS-5 v5.20, Snapshot Viewer was a proof of concept which gave only initial functionality. Because of the large amount of interest in the DS-5 Snapshot Viewer, and the large number of customer use-cases, in DS-5 v5.21 the Snapshot Viewer has been substantially enhanced. In DS-5 v5.21 the Snapshot Viewer offers a much wider range of functionality:

  • Support has been extended to include ARMv8 architecture cores (ARM® Cortex®-A53, Cortex®-A57 and Cortex®-A72). Previously, only ARMv7 architecture cores were supported
  • In DS-5 v5.20, core register support was limited to critical registers only (R0-R15). Register support has been extended in DS-5 v5.21 to include all core registers, plus peripheral SFRs
  • In previous releases of DS-5, Snapshot Viewer only provided support for a single memory space. This meant that Secure, Non-Secure, and Physical (for example) memory addresses could not be distinguished in Snapshot Viewer, and only one value could be displayed for each memory address. In DS-5 v5.21, Snapshot Viewer support has been extended to cover multiple memory spaces
  • Support has been added for Linux Kernel and Device Driver debug operations. Note that the memory dumps used by Snapshot Viewer need to contain the necessary memory contents – such as kernel data structures
  • In DS-5 v5.20, a new DS-5 platform configuration was needed for each target platform. A small number of sample platform configurations were shipped with DS-5, and although these could be used as a basis for a new platform configuration, virtually every new platform would require some hand scripting. The scripting API is simple and easy to use, but it’s not an area where many users have extensive experience – and this could make it difficult to add support for new platforms. In DS-5 v5.21, the necessity for new platform configurations (and therefore, all hand scripting) has been removed. A single Snapshot Viewer platform configuration is shipped with DS-5, and it is capable of using a snapshot from any target. Information such as number and type of core is now taken from the snapshot dump files
  • The CoreSight Access Library examples have been extended in line with the extensions to Snapshot Viewer. A new example configuration has been added for the ARM Juno reference platform (Cortex-A57 and Cortex-A53), and the existing examples have been extended to use the new Snapshot Viewer dump file formats (adding information about core type and number)


If you have any questions, feel free to post them below. Also get in touch if you use the Snapshot Viewer, we’d love to hear feedback on how it works for you.


We have just released a new version of ARM DS-5 Development Studio, and here I am presenting the new features of Streamline performance analyzer. In this release we have focused on supporting new ARM semiconductor IP, adding new features for Linux and Android profiling, and enhancing the user experience, helping our partners to get the best out of the tool and their device.


New ARM semiconductor IP support

In order to help ARM's partners reduce their time to market in terms of system bring-up and software support, we aim to enable support for ARM's new IP ahead of silicon availability. In 2015 we have introduced support for the Cortex-A72 CPU, from version 5.20.2, and the Mali-T800 GPU series, which add to the extensive list of supported application and graphics processors.


Custom Activity Map

Since version 5.19, the OpenCL™ timeline works with user space gator. The same interface, called ‘Custom Activity Map’ (CAM), is now open and generic, ready to be used with other APIs or other implementations of OpenCL. Any system or driver that relies on running jobs, tracking their complex dependencies and managing limited resources, can benefit from this kind of visualization. The new "CAM" macros allow you to define and visualize a complex dependency chain of jobs. You can define a custom activity map (CAM) view, add tracks to it and submit jobs with single or multiple dependencies to be visualized in Streamline.


OpenCL mode in the Timeline view


Ftrace support

Since version 5.20, gator (the Streamline agent) has support for reading generic files based on entries in events XML files. This feature allows you to add counters to extract data held in files, for example, /dev, /sys or /proc file entries. Streamline also supports reading ftrace data, and in version 5.21 we have added stock ftrace counters. This means that the following ftrace counters can be visualized in Streamline by default:

  • Kmem: Number of bytes allocated in the kernel using kmalloc
  • Ext4: Number of bytes written to an ext4 filesystem
  • F2FS: Number of bytes written to an f2fs filesystem
  • Power: Clock rate state
  • Block Completed: Number of block IO operations completed by device driver
  • Block Issued: Number of block IO operations issued to device driver
  • Power: Number of times cpu_idle is entered or exited.

You can use a similar technique to add counters for ftrace data. Any other ftrace counter can be added to Streamline by providing an events.xml file including regex filters, e.g.:

<event counter="ftrace_trace_marker_numbers"
       regex="^tracing_mark_write: ([0-9]+)\s$"/>


You can test this by executing:

$ echo 42 > /sys/kernel/debug/tracing/trace_marker

To make things much easier and faster, you can now append events to gator via the -E option, without rebuilding it. For more information, check out the Streamline User Guide.

New color schemes and user experience improvements

In this release we have also improved the user experience, providing additional color schemes, and refreshing the UI, including:

  • New UI for the processes area (heat map)
  • Reordering of threads, groups, channels, OpenCL lines
  • Allow user to select source file instead of using the path prefix substitution
  • Introduced the incident counter class.



New color schemes in Streamline version 5.21



Our internal developers and many customers are already using and benefiting from these new features. DS-5 5.21 is available to download now, so check it out today. Stay tuned to see what our engineering teams are building for the next version.

You may have seen the announcements of Texas Instruments' new and exciting MSP432P4x MCUs based on the ARM Cortex-M4 core. Keil MDK Version 5 offers out-of-the-box support for these devices with TI's MSP432 Device Family Pack. Learn how to use the Pack to develop, program and debug applications using µVision on our Cortex-M Learning Platform. Refer to the news on keil.com for more information.


We received hundreds of project proposals, and already shipped out more than 200 boards to participants.

The discussions in the contest forum are gaining steam, with the technical questions rolling in.


Now with one more week to go to register your project, I have great news to share:


Würth Elektronik, one of the world’s leading manufacturers of electronic and electromechanical components, is offering free components to all contest participants. By sponsoring the contest, Würth Elektronik enables participants to design the most efficient boards and present their innovative solutions. Accepted entrants can request power and filter inductors, wireless charging coils, capacitors, LEDs and connectors. Check out www.we-online.com for details of their portfolio.


We look forward to receiving your final entries by the 1st of April.



After the huge success of the XMC™ Developer Days last year, Infineon is running the event again. There will be two one-day training events: one in Milano on 16 April 2015 and one in Munich on 12 May 2015. To register for the event (required, but free of charge), visit www.infineon.com/xmcdeveloperday. Participants will receive an XMC4500 Relax Kit and an XMC1200 Boot Kit for free, and will be able to try the hardware in technical training sessions using various development tools, for example Keil MDK Version 5. The interface between MDK and the all-new DAVE will also be explained. Each participant will receive a time-limited license for MDK-Professional.


ARM will participate in both events. Milano will be supported by our Italian distributor Tecnologix and Munich will be supported by the local German ARM team.


See you in Milano and Munich!

