
Software Development Tools


Usually when you create a bare-metal image you specify the location in memory where the code and data will reside, and provide an entry point address where execution starts.

But what if you don't want to specify a fixed memory location at build time?

Security has become a crucial aspect of applications. One common way to gain privileges on a system is through buffer overflows: this anomaly can lead to the execution of malicious code, jeopardizing the security of the entire system through code injection.

Different techniques are used to make an attacker's life harder, including Address Space Layout Randomization (ASLR). This technique is widely used in high-level operating systems such as Android, iOS, Linux, and Windows.

With ARM Compiler 6 you can extend this protection to bare-metal applications by creating Position Independent Executables (PIE), also known as Position Independent Code (PIC). A PIE is an executable that does not use fixed addresses to access memory. Rather, it can be executed at any suitably aligned address and the code automatically recalculates the required addresses.

ARM Compiler 6 provides the -fbare-metal-pie (armclang) and --bare_metal_pie (armlink) options to let you create a bare-metal PIE:

armclang -fbare-metal-pie -target armv8a-arm-none-eabi source.c 
armlink --bare_metal_pie source.o

Note: armclang automatically passes the --bare_metal_pie option to armlink when you compile with -fbare-metal-pie.

Note: Bare-metal PIE is currently only supported for 32-bit targets.

 

Worked Example Part 1: Creating a PIE

Let's take a look at how this works in practice.

This example creates a very simple "Hello World" program in DS-5, uses ARM Compiler 6 to create a PIE, then uses the DS-5 Debugger and the AEMv8-A model to run the executable at an arbitrary position in memory.

 

Step 1: Create a "Hello World" C project in DS-5 Debugger

  1. Create a new C project in DS-5 called PIEdemo (Click File > New > Other... to start the New Project wizard), using Project type: Empty Project and Toolchain: ARM Compiler 6 (DS-5 built in).
  2. Add a new source file pie.c to the new project (right-click the project, then click New > Source File) with the following content:

    #include <stdio.h> 
    const char *myString = "Hello World\n";
    int main()
    {
        puts(myString);
        return 0;
    }

Step 2: Compile the source code to create a PIE

  1. Edit the project properties (right-click the project, then click Properties) and navigate to the ARM Compiler toolchain settings (C/C++ Build > Settings).
  2. Add the following command-line options:

    • ARM C Compiler 6 > Target > Target: armv8a-arm-none-eabi (this compiles for AArch32)
    • ARM C Compiler 6 > Miscellaneous > Other flags: -fbare-metal-pie -mfpu=none
    • ARM Linker 6 > Miscellaneous > Other flags: --bare_metal_pie
  3. Build the project (right-click the project, then click Build Project).

 

Step 3: Create a debug configuration for the AEMv8-A model

  1. Create a new debug configuration (right-click in the Debug Control tab, then click Debug Configurations..., then click the New Launch Configuration button).
  2. On the Connection tab:
    1. Select the VE_AEMv8x1 > Bare Metal Debug > Debug AEMv8-A target.
    2. Add the model parameter: -C cluster.cpu0.CONFIG64=0. This puts the model in AArch32 state, rather than the default AArch64 state.

      DebugConfiguration.png

  3. On the Debugger tab, select Run control: connect only.

    We want to load the image manually so that we can specify the load address.

Step 4: Run the PIE on the AEMv8-A model

  1. Double-click the debug configuration to connect to the AEMv8-A model target.
  2. Load the PIE by running the following command on the Commands tab:

    loadfile PIEdemo/Debug/PIEdemo.axf 0x80000044

    This loads the PIE at the arbitrary address 0x80000044, performs all necessary address relocations, and automatically sets the entry point:

    loadfile_command.png

    Note: You can choose any address, but it must be suitably aligned and at a valid location in the AEMv8-A memory map. For more information about the AEMv8-A memory map, see AEMv8-A Base Platform memory map in the Fast Models Reference Manual.

    Note: You can ignore the TAB180 error for the purposes of this tutorial. For more information, see ARM Compiler 6: Bare-metal Hello World C using the ARMv8 model | ARM DS-5 Development Studio.

  3. Execute the PIE by running the following command on the Commands tab:

    run

    Check the Target Console tab to see the program output:

    target_console.png

How Does It Work?

Position independent code uses PC-relative addressing modes where possible and otherwise accesses global data indirectly, via the Global Offset Table (GOT). When code needs to access global data it uses the GOT as follows:

  • Evaluate the GOT base address using a PC-relative addressing mode.
  • Get the address of the data item in the GOT by adding an offset index to the GOT base address.
  • Look up the contents of that GOT entry to obtain the actual address of the data item.
  • Access the actual address of the data item.
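As a rough illustration, the four steps above can be simulated in plain C++ (the names `hello_data`, `got`, and `load_via_got` are invented for this sketch; real PIE code performs these steps in a few assembly instructions):

```cpp
#include <cstdint>
#include <cstddef>

// A simulation of GOT-based data access (illustrative only).
static int hello_data = 42;          // a global data item
static std::uintptr_t got[1];        // stand-in for the Global Offset Table

// Performed once at startup: fill the GOT entry with the item's
// actual run-time address.
inline void relocate_got() {
    got[0] = reinterpret_cast<std::uintptr_t>(&hello_data);
}

// The four steps: (1) find the GOT base, (2) add the offset index,
// (3) read the GOT entry to get the item's address, (4) access it.
inline int load_via_got(std::size_t offset_index) {
    std::uintptr_t got_base = reinterpret_cast<std::uintptr_t>(&got[0]);     // step 1
    std::uintptr_t entry = got_base + offset_index * sizeof(std::uintptr_t); // step 2
    std::uintptr_t item_addr = *reinterpret_cast<std::uintptr_t *>(entry);   // step 3
    return *reinterpret_cast<int *>(item_addr);                              // step 4
}
```

The indirection through the table is what makes the code position independent: only the GOT entries need patching when the image moves, not every instruction that touches the data.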

We'll see this process in action later.

At link time, the linker does the following:

  • Creates the executable as if it will run at address 0x00000000.
  • Generates a Dynamic Relocation Table (DRT), which is a list of addresses that need updating, specified as 4-byte offsets from the table entry.
  • Creates a .preinit_array section, which will update relocated addresses (more about this later…).
  • Converts function calls to direct calls.
  • Generates the Image$$StartOfFirstExecRegion symbol.

smallImageBeforeLoading.png

At execution time:

  • The entry code calls __arm_preinit_.
  • __arm_preinit_ processes functions in the .preinit_array section, calling __arm_relocate_pie.
  • __arm_relocate_pie uses Image$$StartOfFirstExecRegion (evaluated using a PC-relative addressing mode) to find the actual base address in memory where the image has been loaded, then processes each entry in the DRT, adding the base address offset to each address entry in the GOT and to initialized pointers in the data area.
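The relocation pass can be sketched in C++ terms like this (a simplification with invented names and an invented table layout; the real __arm_relocate_pie walks linker-generated tables directly):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Sketch of the DRT-driven fix-up: every word the linker flagged
// (GOT entries and initialized pointers) gets the load offset added.
struct LoadedImage {
    std::uintptr_t load_base;          // actual base address after loading
    std::vector<std::uintptr_t> words; // words needing relocation, as loaded
};

inline void relocate_pie(LoadedImage &img, const std::vector<std::size_t> &drt) {
    for (std::size_t idx : drt) {
        // The linker emitted addresses as if the image ran at 0x0,
        // so the load base is exactly the offset to add.
        img.words[idx] += img.load_base;
    }
}
```

With a load base of 0x80000044, a word linked as 0xEE0 becomes 0x80000F24 after the pass, matching the GOT address seen later in this example.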

smallImageAfterLoading.png

 

Worked Example Part 2: Stepping through PIE execution with DS-5 Debugger

Our example from earlier contains the global string "Hello World". Let's see how relocation is used in the PIE to access this data regardless of where the image is loaded.

In the Project Explorer view, double-click on the .axf executable to see the sections it contains:

ElfSections.png

We can see that the GOT is located at address 0x00000EE0 in the original image.

Now load the image to address 0x80000044 by running the following command on the Commands tab:

loadfile PIEdemo/Debug/PIEdemo.axf 0x80000044

Use the Disassembly view to view address 0x80000F24 (0x80000044 + 0x00000EE0). We can see that the GOT has been loaded, but it still contains unrelocated addresses:

GOTpre.png

Now, set a breakpoint on main() and run the executable. This executes the setup code, including __arm_relocate_pie which relocates the addresses in the GOT. Run the following commands on the Commands tab:

b main
run

Look at the GOT again, and note that the addresses have been relocated:

GOTpost.png

Now we'll see how the code uses the GOT to access the "Hello World" string.

Step to the next source instruction by running the following command on the Commands tab:

next

Jump to address $pc in the Disassembly view to view the code in main():

CodeDisassembly.png

The code to print "Hello World" starts at 0x800000E4 and does the following:

  1. Load R1 with the GOT offset for our string (0xC), obtained by a PC-relative data lookup from address 0x80000118.
  2. Load R2 with the PC-relative offset of the GOT table (0xE30).
  3. Update R2 with the actual base address of the GOT table, PC + 0xE30 (0x800000F4 + 0xE30 = 0x80000F24).
  4. Load R1 with the contents of address R1 + R2 (that is, address 0x80000F24 + 0xC = 0x80000F30). The contents of this GOT entry are 0x80000F68, which is the address of the pointer to the "Hello World" string.
  5. Load R1 with the target address of the pointer, copy it to R0, and call puts.
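The address arithmetic in these steps can be checked directly (the constants below are taken from the worked example above):

```cpp
#include <cstdint>

// Values from the worked example: the arbitrary load address, the
// GOT's offset within the original image, and our string's GOT index.
constexpr std::uintptr_t kLoadAddress = 0x80000044;
constexpr std::uintptr_t kGotOffset   = 0xEE0; // GOT location in the original image
constexpr std::uintptr_t kStringIndex = 0xC;   // offset of our string's GOT entry

constexpr std::uintptr_t kGotBase     = kLoadAddress + kGotOffset; // 0x80000F24
constexpr std::uintptr_t kStringEntry = kGotBase + kStringIndex;   // 0x80000F30

static_assert(kGotBase == 0x80000F24, "GOT base after relocation");
static_assert(kStringEntry == 0x80000F30, "GOT entry holding the string pointer");
```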

You can single-step through the code and use the Registers view to see this working in DS-5 Debugger.

 

Further Reading

On 8th May, ARM-approved training specialists Doulos Embedded are hosting free webinars on effective application debugging for embedded Linux systems. Learn how to maximize your use of embedded Linux by addressing the important issue of application debugging, including examples using DS-5 Development Studio.

 

For Europe and Asia, register to attend on 8th May, 10am-11am BST (11am-12pm CEST, 2.30pm-3.30pm IST). Or for North America, register to attend at 10am-11am PDT (1pm-2pm EDT, 6pm-7pm BST).

 

 

See the full details »

 

 

 


ARM has always been committed to working with the ecosystem and cooperating with partners to get the best out of our cores. One important aspect of this cooperation is sharing what we have done in open source and what we plan to do in the near future.

 

GNU Toolchain

In the first quarter of 2015 we focused on getting GCC 5 ready for release, plus some work on both A-Profile and R/M-Profile processors.

 

In particular, for Cortex-A processors, we made the instruction scheduling model more accurate and set a number of additional compiler tuning parameters, which will lead to performance improvements on Cortex-A57. We also added support for the new Cortex-A72 and performed initial tuning for performance.

 

On the Cortex-R/M side, we created Thumb-1 prologues/epilogues in RTL representation to allow the compiler to perform further tuning on function call/return sequences.

 

Additional work has been done, along with the community, on improving NEON® intrinsics, refining string routines in glibc, and implementing aeabi_memclr / aeabi_memset / aeabi_memmove in Newlib.

 

What’s next?

For the second quarter of 2015, we plan to complete what we started at the beginning of the year: first of all, we are going to continue supporting and helping with the release of GCC 5. This is an important milestone and we want to make sure ARM supports it well. We will continue to work on adding support for the ARMv8.1-A architecture in GCC and improving performance for Cortex-A53 and Cortex-A57. For example, we noticed that GCC generates branches when compiling code with if/then statements where a conditional select could be used instead: compiler engineers are exploring this optimisation opportunity, which could potentially give a significant performance boost.

 

LLVM Update

The activity on LLVM has been focused on improving both AArch32 and AArch64 code generation: we added basic Cortex-A72 support and continue to advance the performance of the code generated for the ARMv8 architecture, for example by improving unrolling heuristics.

 

Initially our efforts were mainly directed at ARMv8, but we are now gradually making big advancements on ARMv7-A and ARMv7-M as well (see the MC-Hammer section below).

 

Supporting cores is not our only concern: the software ecosystem is important to us, and in the last quarter we have been fixing stack re-alignment issues in the Android Open Source Project when built with LLVM.

 

What’s next?

During the next three months we will extend the support for ARMv8.1-A architecture and we will continue to work on performance optimisations. Some of the areas we will target are vectorisation, inlining, loop unrolling and floating point transformations. We are also discussing the support for strided accesses of the autovectorizer to maximise the usage of structure load and stores.

 

We will continue to support the Android Open Source Project (AOSP). In particular we will focus on stack size usage: LLVM is not performing as well as it could in determining when local variables are no longer used ("lifetime markers"), causing an unnecessary increase in stack usage.

 

MC-Hammer

Richard Barton presented MC-Hammer, a tool we have been using to verify the correctness of LLVM-MC against our proprietary reference implementation, at Euro-LLVM 2012 (the presentation and slides are available on the LLVM website: http://llvm.org/devmtg/2012-04-12/).

 

In 2012 we estimated that, at the time, ~10% of all ARM instructions for Cortex-A8 were incorrectly encoded and 18% of instructions were incorrectly assembled when using LLVM. Over the past three years we have gradually fixed corner-case bugs, and we are now confident that the v7-A and v7-M variants of the ARM architecture are correct, as well as AArch64. This is a great result: it means that this functionality in LLVM-MC can be trusted and built upon.

 

We participated in EuroLLVM 2015 on 13th and 14th April in London (UK), and we will be at GNU Tools Cauldron 2015 on 7th-9th August in Prague (Czech Republic): please come and talk with us! For more details, refer to the full presentation given by Matthew Gretton-Dann, available on YouTube with his slides attached to this blog, or get in contact with us if you need further information. We would like to hear what you are doing in this space and perhaps work together towards a shared goal.

Paul Black

Debug over power-down

Posted by Paul Black Apr 13, 2015

Efficiency is at the heart of the ARM® architecture. ARM cores are commonly used in products where power consumption and battery life are critical considerations, but this can pose additional challenges for a debugger. For DS-5 v5.21, DSTREAM support for debug over power-down has been enhanced, giving improved “out of the box” debugger stability for power-critical ARM implementations.

 

Designed for efficiency

Many ARM® Cortex® cores are designed to be powered down when they are not needed. Alongside the in-built efficiency of the ARM architecture, this enables very effective power management. Individual cores or entire clusters of cores (particularly in ARM® big.LITTLE™ implementations) can power down when they are not needed, providing substantial reductions in power consumption. However, when a core powers down its memory-mapped debug registers may also power down, and that causes problems for a debugger.

 

Cores have two independent power domains. The main domain serves the core itself, but there is a smaller power domain that supports only a small number of critical debug registers. If the main core domain is powered down but the debug domain is kept powered up, power consumption is still substantially reduced but a debugger can still read information about the core’s power state. The debugger can then make an informed decision about which operations are currently possible for that core, and which other registers are available. The debugger can correctly display the core’s current power state and debug availability.

 

Challenges for a debugger

However, control of the power domains is implementation specific, and it is not uncommon to find that a core’s debug domain becomes powered down. Sometimes the two domains are linked – so that when the core powers down, the critical debug registers also become unavailable. Sometimes the debug power domains are linked to the overall power domain for a cluster of cores. In this implementation, debug domains remain available until all the cores in a cluster have been powered down. In this case, access to one core’s debug registers may be affected by the power state of a different core in that cluster.

 

When critical core debug registers become unavailable through power-down, the debugger is unable to determine the power state of the core. The debugger will be unable to determine which operations are currently available for that core, and may attempt to access core registers that are not currently accessible. The result is likely to be a loss of control of the core, but in some cases one of the SoC’s internal busses may become locked by a hung transaction.

 

In most SoCs, there is a power control block containing registers which show the current state of various power domains. Therefore it is possible to hand-craft support for individual SoCs, enabling the debugger to read and then interpret information from the SoC’s power control block. Unfortunately, there are two drawbacks with this method:

 

Firstly, because the debugger has to read information from the SoC's power control block and interpret it before accessing core debug registers, there is a risk of race conditions. The longer the delay between the information being read and it being interpreted and acted upon, the greater the risk that the core's power state will change. This means that the additional debugger functionality has to be implemented as close to the SoC as possible: in DSTREAM. Making changes to the DSTREAM firmware is a specialist task that can only be carried out by the ARM UK debugger engineering teams; it cannot be scripted by an ARM FAE or a debugger user. Obviously, the greater round-trip delays of lower-cost "JTAG wiggler" probes (where processing is carried out on the host instead of inside the probe) may prohibit this kind of functionality.

 

Secondly, the ARM UK engineering teams need information about the power control block in the SoC: how to access the registers and how to interpret their contents. This information may not be easily available, and it may be considered confidential, which can introduce implementation delays.

 

Enhancements for DS-5

For DS-5 v5.21, DSTREAM implements a number of enhancements for debug over power-down. Provided that the registers in a core’s debug power domain have tie-off values (this means that debugger accesses during power-down fail cleanly, and result in an error rather than a hung transaction on the debug bus), DSTREAM now performs additional interpretation of register access failures. DSTREAM is able to intelligently interpret some register access failures as a core power-down, thereby keeping debug control of the core and correctly displaying the core’s power state in the DS-5 display. This removes the need for the ARM UK engineering teams to hand-craft support for individual SoCs, and the need to collect difficult or confidential information.

 

In DS-5 v5.21, this enhanced power-down support is implemented for ARMv7 cores. Enhanced support for ARMv8 cores is partly implemented, and will be completed in future releases of DS-5.

In DS-5 v5.21 the Snapshot Viewer has been improved, with additional functionality and better ease of use. This release extends support to include the ARMv8 architecture, all core registers plus peripheral SFRs, multiple memory spaces, and Linux kernel or device driver debug. The viewer is now data-driven, removing the need to hand-script new platform configurations.

 

Offline debug and trace

Sometimes, it’s not possible to use a conventional debug and trace session (through a JTAG debug and trace unit) to investigate a problem. There are a number of scenarios where JTAG debug can’t be used:

  • The problem that you need to investigate was seen on production hardware that doesn't have debug connectors. At best, this means that you have to re-create the problem on development hardware with full JTAG debug and trace capabilities.
  • Some problems can't be replicated with a debugger connected. The problems may be rare and occur infrequently, or you might only be able to re-create them out in the field, under particular customer use-cases, where it's impractical to connect a debugger.
  • You might have restricted access to development hardware; this is particularly true during SoC bring-up and with first revisions of development hardware. Although it might be easy to re-create the problem, scheduling time for investigation is more challenging.
  • You need to investigate a crash, hang, or lock-up. In these situations JTAG debug may be compromised, and the information available to a debugger may be limited. If you reset the target to restore debug capabilities, you will almost certainly lose useful information about the cause of the crash.

 

DS-5 Snapshot Viewer

In these situations you need a debugger that can help you analyse any information that can be dumped from the target (including register values, memory contents, and trace data) without the need to connect to the target itself. The ARM® DS-5 Debugger provides a Snapshot Viewer, which takes information from dump files rather than from a target via a JTAG debug session. The Snapshot Viewer provides the same register, memory, disassembly, variable, and trace views as you would see in a JTAG debug and trace session, up to the limits of the information contained in the dump files. No target or JTAG unit is needed, and symbol information (including source code linking) can be used as normal. Since you are not connected to a target, it is not possible to change the target state (including run-control operations): you are debugging a snapshot of the target's state.

 

The ARM DS-5 Snapshot Viewer is often used with the CoreSight™ Access Library. This is an open-source library that can run under a Linux kernel and provides a high-level API for controlling CoreSight debug and trace components. The library can be used to configure CoreSight trace when no debugger can be connected, and then store the captured trace, with memory and register contents, somewhere it can be recovered for off-line analysis in a debugger (on an SD card, perhaps). A number of example configurations for common platforms are shipped with DS-5, and it is easy to adapt and extend these sample configurations to support your own target hardware.

 

Enhancements for DS-5 v5.21

In DS-5 v5.20, Snapshot Viewer was a proof of concept which gave only initial functionality. Because of the large amount of interest in the DS-5 Snapshot Viewer, and the large number of customer use-cases, in DS-5 v5.21 the Snapshot Viewer has been substantially enhanced. In DS-5 v5.21 the Snapshot Viewer offers a much wider range of functionality:

  • Support has been extended to include ARMv8 architecture cores (ARM® Cortex®-A53, ARM® Cortex®-A57, ARM® Cortex®-A72). Previously, only ARMv7 architecture cores were supported
  • In DS-5 v5.20, core register support was limited to critical registers only (R0-R15). Register support has been extended in DS-5 v5.21 to include all core registers, plus peripheral SFRs
  • In previous releases of DS-5, Snapshot Viewer only provided support for a single memory space. This meant that Secure, Non-Secure, and Physical (for example) memory addresses could not be distinguished in Snapshot Viewer, and only one value could be displayed for each memory address. In DS-5 v5.21, Snapshot Viewer support has been extended to cover multiple memory spaces
  • Support has been added for Linux Kernel and Device Driver debug operations. Note that the memory dumps used by Snapshot Viewer need to contain the necessary memory contents – such as kernel data structures
  • In DS-5 v5.20, a new DS-5 platform configuration was needed for each target platform. A small number of sample platform configurations were shipped with DS-5, and although these could be used as a basis for a new platform configuration, virtually every new platform would require some hand scripting. The scripting API is simple and easy to use, but it’s not an area where many users have extensive experience – and this could make it difficult to add support for new platforms. In DS-5 v5.21, the necessity for new platform configurations (and therefore, all hand scripting) has been removed. A single Snapshot Viewer platform configuration is shipped with DS-5, and it is capable of using a snapshot from any target. Information such as number and type of core is now taken from the snapshot dump files
  • The CoreSight Access Library examples have been extended in line with the extensions to Snapshot Viewer. A new example configuration has been added for the ARM Juno reference platform (Cortex-A57 and Cortex-A53), and the existing examples have been extended to use the new Snapshot Viewer dump file formats (adding information about core type and number)

 

If you have any questions, feel free to post them below. Also get in touch if you use the Snapshot Viewer, we’d love to hear feedback on how it works for you.


We have just released a new version of ARM DS-5 Development Studio, and here I present the new features of the Streamline performance analyzer. In this release we have focused on supporting new ARM semiconductor IP, adding new features for Linux and Android profiling, and enhancing the user experience, helping our partners get the best out of the tool and their device.

 

New ARM semiconductor IP support

In order to help ARM's partners reduce their time to market in terms of system bring-up and software support, we aim to enable support for ARM's new IP ahead of silicon availability. In 2015 we have introduced support for the Cortex-A72 CPU, from version 5.20.2, and the Mali-T800 GPU series, which add to the extensive list of supported application and graphics processors.

 

Custom Activity Map

Since version 5.19, the OpenCL™ timeline works with user space gator. The same interface, called ‘Custom Activity Map’ (CAM), is now open and generic, ready to be used with other APIs or other implementations of OpenCL. Any system or driver that relies on running jobs, tracking their complex dependencies and managing limited resources, can benefit from this kind of visualization. The new "CAM" macros allow you to define and visualize a complex dependency chain of jobs. You can define a custom activity map (CAM) view, add tracks to it and submit jobs with single or multiple dependencies to be visualized in Streamline.


pic15.jpg

OpenCL mode in the Timeline view

 

Ftrace support

Since version 5.20, gator (the Streamline agent) has support for reading generic files based on entries in events XML files. This feature allows you to add counters to extract data held in files, for example, /dev, /sys or /proc file entries. Streamline also supports reading ftrace data, and in version 5.21 we have added stock ftrace counters. This means that the following ftrace counters can be visualized in Streamline by default:

  • Kmem: Number of bytes allocated in the kernel using kmalloc
  • Ext4: Number of bytes written to an ext4 filesystem
  • F2FS: Number of bytes written to an f2fs filesystem
  • Power: Clock rate state
  • Block Completed: Number of block IO operations completed by device driver
  • Block Issued: Number of block IO operations issued to device driver
  • Power: Number of times cpu_idle is entered or exited.

You can use a similar technique to add counters for other ftrace data: any ftrace counter can be added to Streamline by providing an events.xml file including regex filters, e.g.:


<event counter="ftrace_trace_marker_numbers"
       title="ftrace"
       name="trace_marker"
       regex="^tracing_mark_write: ([0-9]+)\s$"
       class="absolute"/>


 

You can test this by executing:


$ echo 42 > /sys/kernel/debug/tracing/trace_marker



To make things much easier and faster, you can now append events to gator via the -E option, without rebuilding it. For more information, check out the Streamline User Guide.


New color schemes and user experience improvements

In this release we have also improved the user experience, providing additional color schemes, and refreshing the UI, including:

  • A new UI for the processes area (heat map)
  • Reordering of threads, groups, channels, and OpenCL lines
  • The ability to select the source file directly, instead of using path prefix substitution
  • A new "incident" counter class.

 

streamline-colors.png

New color schemes in Streamline version 5.21

 

Availability

Our internal developers and many customers are already using and benefiting from these new features. DS-5 5.21 is available to download now, so check it out today. Stay tuned to see what our engineering teams are building for the next version.

You may have seen the announcements of Texas Instruments' new and exciting MSP432P4x MCUs based on the ARM Cortex-M4 core. Keil MDK Version 5 offers out-of-the-box support for these devices with TI's MSP432 Device Family Pack. Learn how to use the Pack to develop, program, and debug applications using µVision on our Cortex-M Learning Platform. Refer to the news on keil.com for more information.

Embedded

We received hundreds of project proposals, and already shipped out more than 200 boards to participants.

The discussions in the contest forum are gaining steam, with the technical questions rolling in.

 

Now with one more week to go to register your project, I have great news to share:

 

Würth Elektronik, one of the world's leading manufacturers of electronic and electromechanical components, is offering free components to all contest participants. By sponsoring the contest, Würth Elektronik enables participants to design the most efficient boards and present their innovative solutions. Accepted entrants can request power and filter inductors, wireless charging coils, capacitors, LEDs, and connectors. Check out www.we-online.com for details about their portfolio.

 

We are looking forward to receiving the final entries by 1st April.

Hello,

 

After the huge success of the XMC™ Developer Days last year, Infineon is going to run this event again. There will be two one-day training events: one in Milano on 16 April 2015 and one in Munich on 12 May 2015. To register for the event (required, but free of charge), please visit www.infineon.com/xmcdeveloperday. Participants will receive an XMC4500 Relax Kit and an XMC1200 Boot Kit for free. You will be able to try the hardware in technical training sessions using various development tools, for example Keil MDK Version 5. The interface between MDK and the all-new DAVE will also be explained. Each participant will receive a time-limited license for MDK-Professional.

 

ARM will participate in both events. Milano will be supported by our Italian distributor Tecnologix and Munich will be supported by the local German ARM team.

 

See you in Milano and Munich!

Embedded

The main function of the compiler is to translate source code to machine code but, during the development of a project, we inevitably make some mistakes.

An important aspect of a good compiler is the ability to generate clear and precise error and warning messages. Clear error messages help you quickly identify mistakes in the code, whereas warning messages flag potential issues the compiler found in the code that might be worth investigating.

ARM Compiler 6 is based on the well-known Clang compiler, which provides more precise and accurate warning/error messages compared to other toolchains. Let's have a look at some examples:

 

 

Assignment in condition

 

Let's write some example code:

#include <stdio.h>

int main() {
    int a = 0, b = 0;
    if (a = b) {
        printf("Right\n");
    } else {
        printf("Wrong\n");
    }
    return 0;
}
 

We made a mistake in the code by using '=' instead of '=='. The code above is legitimate C; it is the error in logic that will trip you up. Let's see how different compilers help the developer identify this logic error:

ARM Compiler 5

"main.cpp", line 6: Warning:  #1293-D: assignment in condition

      if (a = b) {

          ^

ARM Compiler 6

main.cpp:6:11: warning: using the result of an assignment as a condition without parentheses [-Wparentheses]

    if (a = b) {

               ~~^~~

main.cpp:6:11: note: place parentheses around the assignment to silence this warning

    if (a = b) {

                  ^

       (     )

main.cpp:6:11: note: use '==' to turn this assignment into an equality comparison

    if (a = b) {

          ^

          ==

GCC 4.9

main.cpp: In function ‘int main()’:

main.cpp:6:14: warning: suggest parentheses around assignment used as truth value [-Wparentheses]

As you can see from the different outputs, ARM Compiler 6 not only indicates what it thinks is wrong in the code, but it also suggests two different ways to resolve the issue. The warning messages quote the entire line and highlight the specific part which requires attention from the user.

 

Templates

Templates are a great feature of C++, but they are sometimes a source of headaches when troubleshooting problems. Let’s have a look at another example, this time involving C++ templates:

 

int sum_vector(const std::vector<int> &input) {
       std::vector<float>::const_iterator i = input.begin();
       int sum = 0;
       for(; i != input.end(); ++i) {
              sum += *i;
       }
       return sum;
}

 

The root of the error is that the template type used for the const_iterator is float whereas the vector requires int. How do different compilers identify this error?

ARM Compiler 5

"main.cpp", line 5: Error:  #144: a value of type "const int *" cannot be used to initialize an entity of type "const float *"

      std::vector<float>::const_iterator i = input.begin();

                                             ^

ARM Compiler 6

main.cpp:5:40: error: cannot initialize a variable of type 'std::vector<float>::const_iterator' (aka 'const float *') with an rvalue of type 'const_iterator' (aka 'const int *')

'std::vector<float>::const_iterator i = input.begin();

                                    ^   ~~~~~~~~~~~~~

GCC 4.9

main.cpp: In function ‘int sum_vector(const std::vector<int>&)’:

main.cpp:5:56: error: conversion from ‘std::vector<int>::const_iterator {aka __gnu_cxx::__normal_iterator<const int*, std::vector<int> >}’ to non-scalar type ‘std::vector<float>::const_iterator {aka __gnu_cxx::__normal_iterator<const float*, std::vector<float> >}’ requested

main.cpp:7:26: error: no match for ‘operator!=’ in ‘i != (& input)->std::vector<_Tp, _Alloc>::end<int, std::allocator<int> >()’

main.cpp:7:26: note: candidates are:

In file included from /usr/include/x86_64-linux-gnu/c++/4.7/./bits/c++allocator.h:34:0,

                 from /usr/include/c++/4.7/bits/allocator.h:48,

(additional 98 lines following)

The error messages produced by ARM Compiler 5 and ARM Compiler 6 are self-explanatory and give the user a clear suggestion on how to fix the issue: the mismatching template types are clearly indicated in the error message and the caret points the user to the precise point within the code. GCC, on the other hand, generates about 100 lines of hard-to-understand error messages.

 

 

Automatic Macro Expansion

Another very useful feature of diagnostic messages in ARM Compiler 6 is the automatic macro expansion. Consider the following code:

 

#define LOG(PREFIX, MESSAGE) fprintf(stderr, "%s: %s", PREFIX, MESSAGE)

#define LOG_WARNING(MESSAGE) LOG("Warning", MESSAGE)

…
       LOG_WARNING(123);
…

 

LOG_WARNING has been called with an integer as an argument even though, by expanding the two macros, you can see that the fprintf function expects a string. When the macros are close together in the code it’s easy to spot these errors: what if they are defined in different parts of the source code, or worse, in some external library?

ARM Compiler 6 automatically expands each macro involved so that you can easily see what’s wrong in the call chain as shown below:

ARM Compiler 6

main.cpp:8:14: warning: format specifies type 'char *' but the argument has type 'int' [-Wformat]

        LOG_WARNING(123);

        ~~~~~~~~~~~~^~~

main.cpp:5:45: note: expanded from macro 'LOG_WARNING'

#define LOG_WARNING(MESSAGE) LOG("Warning", MESSAGE)

                                            ^

main.cpp:3:64: note: expanded from macro 'LOG'

#define LOG(PREFIX, MESSAGE) fprintf(stderr, "%s: %s", PREFIX, MESSAGE)

^

 

Summary

In summary, we have seen how ARM Compiler 6 gives precise and detailed warning and error messages, simplifying the important task of troubleshooting coding errors. The quality of these messages not only helps the developer spot potential bugs in the final product, but also helps them quickly understand and fix the bugs, thanks to sensible, easy-to-decipher suggestions.

I hope that you found this blog post useful and are looking forward to using ARM Compiler 6! If you don’t yet have DS-5, download a free 30-day evaluation.

Feel free to post any questions or comments below.

 

Ciao,

Stefano

Recently I have been preparing one of the demos for ARM's booth at Embedded World 2015 (which took place last week, 24th - 26th February) to showcase ARM Development Studio 5’s (DS-5) debug and trace functionality. This demo uses Freescale's SABRE board based on their newly announced i.MX 6SoloX applications processor, containing an ARM Cortex-A9 and an ARM Cortex-M4. The image below shows the board with an attached LCD screen, connected to an ARM DSTREAM - ARM’s high-performance debug and trace unit.

ds5-imx6sx-photo.jpg

 

The i.MX 6SoloX is a multimedia-focused processor aimed at a wide variety of applications - achieved by combining the high performance of a Cortex-A9 and low power of a Cortex-M4 to maximise power efficiency.

 

Thanks to the co-operation between ARM and Freescale I was able to bring up the board very quickly, getting both Linux and Android booting on the board within a day. This is especially impressive given the board was pre-production at that point.

 

Once I had Linux successfully booting I then went about getting the DS-5 Debugger connected to it using the DSTREAM via the board's JTAG-20 connector. DS-5 Debugger support for the board was also added (via the configuration database) during the board's pre-production stages, allowing full DS-5 support from release. This makes connecting the debugger as simple as selecting the platform in the Debug Configuration editor, choosing the appropriate connection to use and configuring it. This also enables a smooth, rapid transition from receiving the board to debugging or profiling a target application from the product's launch.

 

As the board has both a Cortex-A9 and a Cortex-M4, it was a good candidate to demonstrate DS-5’s multicore debugging. These cores use Asymmetric Multiprocessing (AMP), as opposed to Symmetric Multiprocessing (SMP), meaning they run completely independently (rather than under a single Operating System). I used DS-5 to connect and debug Linux on the Cortex-A9, while simultaneously using a separate connection to load a small Freescale MQX RTOS based example onto the Cortex-M4.

 

 

DS-5 Functionality

When debugging, we have access to DS-5's full suite of tools including, among others:

 

Debug

DS-5 allows multicore debug of targets (in both SMP and AMP configurations) using hardware and software breakpoints, with tools to provide a wide range of additional functionality from MMU page table views to custom peripheral register maps.

ds5-imx6sx-debug.png

 

Instruction Trace

Collecting trace from the target using the DSTREAM allows non-intrusive instruction-level trace which can be used to help with: debug, especially when halting the core is undesirable; post-crash analysis; and profiling the target.

ds5-imx6sx-trace.png

 

RTOS OS Awareness

DS-5 offers additional information views (e.g. memory use, running tasks etc.) for a number of the most popular RTOSes (pictured below is one of the views for MQX).

ds5-imx6sx-mqx-awareness.png

 

Linux OS Awareness

Specialised connections are available to Linux targets to allow debug and trace of the Linux Kernel and Kernel Modules as well as visualisations of the threads, processes and resources.

 

 

Linux Application debug

Applications may be debugged by downloading the target application and a compatible gdbserver to the target for debugging using the DS-5 interface (concurrently with bare-metal or OS-level debug and/or trace as desired).

ds5-imx6sx-app-debug.png

 

Streamline

Streamline allows visualisation of the target’s performance for profiling and optimizing target code. Requiring no additional hardware or debug connections, Streamline operates over TCP/IP (or optionally an ADB connection in the case of Android) and can profile many aspects of a CPU/GPU/interconnect in high detail at varying resolutions, including power analysis (via external or on-chip energy probes). The image below shows a Streamline trace from Linux on the i.MX 6SoloX under heavy load.

ds5-imx6sx-streamline.png

 

 

Summary

Thanks to a stable pre-production platform and Freescale's assistance in adding the new board to the debug configuration database, the whole bring-up experience was seamless. For information on creating debug configurations for new platforms, see the "New Platform Bring-Up with DS-5" post. ARM encourages platform owners to submit any debug configurations for their platforms back to the DS-5 team so we can include them by default in subsequent releases - thus allowing the same, fluid out-of-box experience for end users.

February has been an exciting period for ARM: from the announcement of new products for the Premium Mobile segment to the release of the new DS-5 v5.20.2.

DS-5 v5.20.2 includes ARM Compiler 6.01, the latest LLVM based ARM Compiler. The main highlights of ARM Compiler 6.01 are:

  • Support for the latest ARM CPUs, including Cortex-A72
  • Extended support for the Cortex family of processors
  • Support for bare-metal Position Independent Executables (PIE)
  • Support for link time optimization (LTO)

 

Support for new CPUs

This release brings the most advanced compilation technology in ARM Compiler 6.01 to the new ARMv8-A Cortex-A72 (-mcpu=cortex-a72) processor. With support for the Cortex-A72, ARM Compiler continues to provide early support for new cores, enabling our customers to start developing as soon as possible and reduce time-to-market.

 

Extended support for the Cortex family of processors

The new release of ARM Compiler adds support for ARMv7 Cortex-A class processors, enabling customers to take advantage of the new compiler for a wider range of products.

ARM Compiler 6.01 brings the advanced code generation used to build ARMv8 code to the well-established 32-bit world, speeding up adoption of the new compiler for companies not yet using ARMv8.

There’s more: ARM Compiler 6.01 adds initial (alpha quality) support for both Cortex-M and Cortex-R families to let engineers familiarise themselves with the new features and start the evaluation as soon as possible.

For more details see the release notes for ARM Compiler 6.01.

 

Bare-metal PIC support

Security has become a crucial aspect of applications, especially when connected to the network or available on the internet. One of the most common attacks to gain privilege on a system is through buffer overflows: this anomaly could potentially lead to the execution of malicious code, jeopardizing the security of the entire system through code injection.

Different techniques have been created to make a hacker’s life harder; one of the most commonly used to reduce the risk of attack is to randomize the address space layout (ASLR). This technique is widely used in several high-level Operating Systems like Android, iOS, Linux and Windows. With ARM Compiler 6.01 it’s possible to extend this protection to bare-metal applications.

ARM Compiler 6.01 allows the creation of bare-metal Position Independent Executables (PIE), which can be loaded anywhere in memory; the code automatically recalculates the required addresses. More details are on Infocenter about the compiler command-line option -fbare-metal-pie and the linker command-line option --fpic.

 

Link Time Optimization

ARM Compiler is able to optimise the code generated from each source file to get the best performance out of ARM processors. But what about optimising across modules? ARM Compiler 6.01 introduces initial support for Link Time Optimization (LTO), which extends the ability of the compiler and linker to perform optimizations by looking at the whole program rather than a single compilation unit, giving an extra performance boost!

To enable Link Time Optimization in ARM Compiler 6.01, take a look at the documentation on Infocenter about the linker command-line option --lto.

 

I hope that you found this information useful and are ready to use ARM Compiler 6.01! If you don’t yet have DS-5, download a free 30-day evaluation.

Feel free to post any questions or comments below.

 

Ciao,

Stefano

robkaye

4 Years On - Fast Models 9.2

Posted by robkaye Feb 23, 2015

At the end of last week (20/Feb/2015) we released Fast Models 9.2. This year we are moving to a quarterly release cycle. The more frequent releases support the accelerating rate at which we are developing and deploying new Fast Models, and the speed at which partners pick them up and put them to use. The main focus for Fast Models 9.2 was the release of the Cortex-A72 and CCI-500 models, following the announcement of the IP earlier in the month. Lead partners had already been developing virtual prototypes with these models for several months. We also included critical fixes for partners and completed the next stage of performance improvement work. For the latter, this release cycle our emphasis has been on how the models behave when used in SystemC simulations with many Fast Model components: something that is important to many of our partners.

 

It's been just over 4 years since I joined the Fast Models team. I've just been working on a summary of how the solution has evolved in that time for an internal conference, and thought it would be interesting to look at how things have moved ahead in those four years.

 

Firstly, we have seen rapid growth in usage: more and more partners are leveraging virtual prototypes as part of their SoC development process. We are also seeing the models used in many more ways. Early software development remains front and center in our thoughts, but we have seen increasing use of the models in software-driven validation of the hardware, in performance estimation and in device compliance validation.

 

In 2011 we were working on the first models for ARMv8 cores. That year we introduced models for four new cores at either beta or release status. In 2015 it will be close to treble that number. On the System IP front it's the same story: approximately three times as many models will be rolled out this year compared to 2011. Fast Models for media IP (GPU, video and display processors) were just a concept on a roadmap in 2011, but this year we have several in the works, along with a range of platform models that combine the media models with CPUs and system IP. These platforms are aligned with the availability of IP and software stacks from sister teams in ARM to provide partners with a complete solution.

 

The underlying tools that support these models must move forward with the model deliveries. I've already mentioned the burgeoning use cases; combine that with the increasing complexity of the platforms being designed and the advance of host workstation operating systems. To support this we have a comprehensive roadmap of feature support (such as checkpointing and timing annotation) to complement continuous improvements in performance, quality and OS/tools support.

 

It's definitely an exciting and challenging part of the ARM story to be involved with.  I'm looking forward to 2015 and beyond with great anticipation.

 

Spectrum.png

Two ends of the spectrum: Virtual Prototypes for an ARMv8 big.LITTLE mobile platform and a Cortex-M7 MCU.

Performance and power optimization are critical considerations for new Linux and Android™ products. This blog explores the most widely used performance and power profiling methodologies, and their application to the different stages in the product design.

 

The need for efficiency

In the highly competitive market for smartphones, tablets and mobile Internet devices, the success of new products depends strongly on high performance, responsive software and long battery life.

 

In the PC era it was acceptable to achieve high performance by clocking the hardware at faster frequencies. However, this does not work in a world in which users expect to always stay connected. The only way to deliver high performance while keeping a long battery life is to make the product more efficient.

 

On the hardware side the need for efficiency has pushed the use of lower silicon geometries and SoC integration. On the software side performance analysis needs to become an integral part of the design flow.

 

Processor instruction trace

Most Linux-capable ARM® processor-based chipsets include either a CoreSight Embedded Trace Macrocell (ETM) or a Program Trace Macrocell (PTM).

 

The ETM and PTM generate a compressed trace of every instruction executed by the processor, which is stored on an on-chip Embedded Trace Buffer (ETB) or an external trace port analyzer. Software debuggers can import this trace to reconstruct a list of instructions and create a profiling report. For example, DS-5 Development Studio Debugger can collect 4GB of instruction trace via the ARM DSTREAM target connection unit and display a time-based function heat map.

 

Instruction trace generation, collection and display.PNG

Figure 1: Instruction trace generation, collection and display

 

Instruction trace is potentially very useful for performance analysis, as it is 100% non-intrusive and provides information at the finest possible granularity. For instance, with instruction trace you can measure accurately the time lag between two instructions. Unfortunately, trace has some practical limitations.

 

The first limitation is commercial. The number of processors on a single SoC is growing and they are clocked at increasingly high frequencies, which results in higher bandwidth requirements on the CoreSight trace system and wider, more expensive, off-chip trace ports. The only sustainable solution for systems running at full speed is to trace to an internal buffer, which limits the capture to less than 1ms. This is not enough to generate profiling data for a full software task such as a phone call.

 

The second limitation is practical. Linux and Android are complex multi-layered systems, and it is difficult to find events of interest in an instruction trace stream. Trace search utilities help in this area, but navigating 4GB of compressed data is still very time-consuming.

 

The third limitation is technical. The debugger needs to know which application is running on the target and at which address it is loaded in order to decompress the trace stream. Today’s devices do not have the infrastructure to synchronize the trace stream with kernel context-switch information, which means that it is not possible to capture and decompress non-intrusively a full trace stream through context switches.

 

Sample-based profiling

For performance analysis over long periods of time sample-based analysis offers a very good compromise of low intrusiveness, low price and accuracy. A popular Linux sample-based profiling tool is perf.

 

Sample-based tools make use of a timer interrupt to stop the processor at regular intervals and capture the current value of the program counter in order to generate profiling reports. For example, perf can use this information to display the processor time spent on each process, thread, function or line of source code. This enables developers to easily spot hot areas of code.

 

At a slightly higher level of intrusiveness, sample-based profilers can also unwind the call stack at every sample to generate a call-path report. This report shows how much time the processor has spent on each call path, enabling different optimizations such as manual function inlining.

 

Sample-based profilers do not require a JTAG debug probe or a trace port analyzer, and are therefore much lower cost than instruction trace-based profilers. On the downside they cause a target slow-down of between 5 and 10% depending on how much information is captured on every sample.

 

It is important to note that sample-based profilers do not deliver “perfect data” but “statistically relevant data”, as the profiler works on samples instead of on every single instruction. Because of this, profiling data for hot functions is very accurate, but profiling data for the rest of the code is not accurate. This is not normally an issue, as developers are mostly interested in the hot code.

 

A final limitation of sample-based profilers is related to the analysis of short, critical sequences of code. The profiler will tell you how much processor time is spent on that code. However, only instruction trace can provide the detail on the sequence in which instructions are executed and how much time each instruction requires.

 

Logging and kernel traces

Logging or annotation is a traditional way to analyze the performance of a system. In its simplest form, logging relies on the developer adding print statements in different places in the code, each with a timestamp. The resulting log file shows how long each piece of code took to execute.

 

This methodology is simple and cheap. Its major drawback is that in order to measure a different part of the code you need to instrument it and rebuild it. Depending on the size of the application this can be very time consuming. For example, many companies only rebuild their software stacks overnight.

 

The Linux kernel provides the infrastructure for a more advanced form of logging called “tracing”. Tracing is used to automatically record a high number of system-level events such as IRQs, system calls, scheduling, and even application-specific events. Lately, the kernel has been extended to also provide access to the processor’s performance counters, which contain hardware-related information such as cache usage or the number of instructions executed by the processor.

 

Kernel trace enables you to analyze performance in two ways. First, you can use it to check whether some events are happening more often than expected. For example, it can be used to detect that an application is making the same system call several times when only one is required. Second, it can be used to measure the latency between two events and compare it with your expectations or previous runs.

 

Since kernel trace is implemented in a fairly non-intrusive way, it is very widely used by the Linux community, using tools such as perf, ftrace or LTTng. A new Linux development will enable events to be “printed” to a CoreSight Instrumentation Trace Macrocell (ITM) or System Trace Macrocell (STM) in order to reduce intrusiveness further and provide a better synchronization of events with instruction trace.

 

Combining sampling with kernel trace

Open source tools such as perf and commercial tools such as the ARM DS-5 Streamline performance analyzer combine the functionality of a sample-based profiler with kernel trace data and processor performance counters, providing high-level visibility of how applications make use of the kernel and system-level resources.

 

For example, Streamline can display processor and kernel counters over time, synchronized to threads, processes and the samples collected, all in a single timeline view. This information can be used to quickly spot which application is thrashing the cache memories or creating a burst in network usage.

 

Streamline Timeline View.png

Figure 2: Streamline Timeline View

 

Instrumentation-based profiling

Instrumentation completes the picture of performance analysis methodologies. Instrumented software can log every function entry and exit - or potentially every instruction - to generate profiling or code coverage reports. This is achieved by instrumenting, or automatically modifying, the software itself.

 

The advantage of instrumentation over sample-based profiling is that it gives information about every function call instead of only a sample of them. Its disadvantage is that it is very intrusive, and may cause substantial slow-down.

 

Using the right tool for the job

All of the techniques described so far may apply to all stages of a typical software design cycle. However, some are more appropriate than others at each stage.

 

 

                     Low Cost   Low Intrusiveness   Accuracy   Granularity   System Visibility
Logging              •••        •••                 •••••      ••            -
Kernel trace         •••••      ••••                •••••      •••           •••
Instruction trace    -          •••••               •••••      •••••         -
Sample-based         •••••      •••                 •••        ••            ••••
Instrumentation      •••••      -                   •••••      ••••          -

 

Table 1: Comparison of methodologies

 

Instruction trace is mostly useful for kernel and driver development, but has limited use for Linux application and Android native development, and virtually no use for Android Java application development.

 

Performance improvements in kernel space are often in time-critical code handling the interaction between kernel, threads and peripherals. Improving this code requires the high accuracy and granularity, and low intrusiveness of instruction trace.

 

Kernel developers also have enough control over the whole system to act on the findings. For example, they can slow down the processors to transmit trace over a narrow trace port, or they can hand-craft the complete software stack for a fast peripheral. However, as you move into application space, developers do not need the accuracy and granularity of instruction trace, as the performance increase achieved by software tweaks can easily be lost to random kernel and driver behaviour totally outside of their control.

 

In the application space, engineering efficiency and system visibility are much more useful than perfect profiling information. The developer needs to find quickly which bits of code to optimize, and measure accurately the time between events, but can accept a 5% slow-down in the code.

 

System visibility is extremely important in both kernel and application space, as it enables developers to quickly find and kill the elephant in the room. Example system-related performance issues include misuse of cache memories, processors and peripherals not being turned off, inefficient access to the file system or deadlocks between threads or applications. Solving a system-related issue has the potential to increase the total performance of the system ten times more than spending days or weeks writing optimal code for an application in isolation. Because of this, analysis tools combining sample-based profiling and kernel trace will continue to dominate Linux performance analysis, especially at application level.

 

Instrumentation-based profiling is the weakest performance analysis technique because of its high level of intrusiveness. Optimizing Android Java applications has better chances of success by using manual logging than open-source tools.

 

High-performance Android systems

Most Android applications are developed at Java level in order to achieve platform portability. Unfortunately, the performance of the Java code has a random component, as it is affected by the JIT compiler. This makes both performance analysis and optimization difficult.

 

In any case, the only way to guarantee that an Android application will be fast and power-efficient is to write it - or at least parts of it - in native C/C++ code.  Research shows that native applications run between 5 and 20 times faster than equivalent Java applications. In fact, most popular Android apps for gaming, video or audio are written in C/C++.

 

For Android native development on ARM processor-based systems Android provides the Native Development Kit (NDK). ARM offers DS-5 as its professional software toolchain for both Linux and Android native development.

 

By Javier Orensanz, Director of Product Management - Tools at ARM
