
Software Development Tools




The GNU ARM Eclipse project includes a set of open source Eclipse plug-ins and tools to create/build/debug/manage ARM (32-bit) and AArch64 (64-bit) applications and static/shared libraries, using the latest GNU ARM GCC toolchains.


New look


Starting with September 2015, the GNU ARM Eclipse web site has a completely new look:


Apart from the new appearance (definitely cool!), the main functional change is the addition of a right sidebar, which facilitates access to the project documentation.


The new site no longer uses WordPress; instead, it is entirely static and was generated with Jekyll.


New project home on GitHub


With GitHub gaining more and more traction, the GNU ARM Eclipse project was migrated from SourceForge to GitHub.




The migration of the repositories was easy: each project was pushed into its own repository.


The current project repositories are:



Binary files as Releases


The migration of the binary files was a bit more complicated and, due to current GitHub limitations, is incomplete. The main problem was posed by the two Eclipse update sites, which require a certain folder structure; since GitHub currently does not support adding folders to releases, the Eclipse update sites will remain hosted on SourceForge (at http://gnuarmeclipse.sourceforge.net/updates).


Except for the Eclipse update sites, all future binary files will be published as GitHub Releases, attached to their respective project repositories.


The archive of past releases was also migrated from SourceForge to GitHub.


Issue trackers


The SourceForge trackers were replaced by the GitHub Issues trackers, one for each project.


It is planned to preserve the content of the old SourceForge trackers, even though they are now locked and new tickets can no longer be created there.


Notifications via watched projects


For those interested in receiving notifications, the recommended way is to subscribe to the GitHub projects by clicking the Watch button and selecting Watching.


In addition to the gnuarmeclipse/plug-ins project, it is also recommended to subscribe to the gnuarmeclipse/gnuarmeclipse.github.io project, to receive notifications for new Web posts.


More info


For more details about the GNU ARM Eclipse project, please refer to the project site http://gnuarmeclipse.github.io/.

ARM Compiler 6's main focus has always been bare-metal applications running on ARM processors. Even though ARM Compiler does not officially support building Linux applications, the high compatibility between armclang and GCC now makes it much easier to build them. In this blog I will explain how to set up ARM Compiler 6 to build a Linux Hello World from scratch.


This tutorial covers building and debugging a basic Hello World C program running on Linaro Linux on an ARMv8 model, using ARM DS-5 Development Studio. In particular, it shows how to:

  • Download and set up GCC
  • Write a simple “Hello World” application in ARM DS-5 Development Studio
  • Build the application using ARM Compiler 6
  • Set up a debug session in ARM DS-5 Development Studio
  • Run it on a model of an ARMv8 system

To complete this tutorial, you'll need DS-5 Ultimate Edition: Download the 30-day trial »

Included in DS-5 Ultimate Edition is the ARMv8-A Fixed Virtual Platform (FVP) model, giving you a platform to develop code on in advance of hardware availability.

Download Linaro GCC and Linaro image

If you do not already have Linux running on ARMv8, you can download a ready-to-use Linaro image from the Linaro website: http://releases.linaro.org/latest/openembedded/aarch64/.
You need to download the kernel binary img.axf and the file system image vexpress64-openembedded_lamp-armv8-gcc-4.9_*.img.gz (make sure you download the lamp image, because the minimal image does not include gdbserver, which is necessary to debug the application from DS-5).


Even if it seems counterintuitive, GCC is necessary in order to build Linux applications with ARM Compiler 6: the reason is that ARM Compiler 6 does not include the Linux libraries, so it needs to use glibc from GCC.


For our example, we will use the Linaro toolchain for Cortex-A, which can also be downloaded from the Linaro website http://www.linaro.org/downloads/.

Download Linaro-toolchain-binaries 4.9 (AArch64 little-endian) and extract it locally.



Add the new toolchain to DS-5

DS-5 includes three default toolchains but it’s also possible to add new ones as explained by Ronan Synnott in his blog post: Improved support for multiple ARM Compilers in DS-5 5.20 and beyond.

Open DS-5 settings by clicking on the menu Window and then Preferences. On the left hand side you can find a list of categories: select Toolchains under DS-5.


The list of available toolchains is shown in the list on the right hand side of the window. Proceed to add the downloaded GCC toolchain by clicking on the Add… button. Select the bin path of the toolchain you want to add and click on the Next > button.


DS-5 should automatically detect the type of toolchain selected and other information such as the version and the binaries. Click Finish to complete the procedure and keep the default values (suggested). By clicking Next > you can amend some of the information DS-5 has already filled in.


Create a new project

Create a new project in DS-5 by clicking File → New → Project. Select C Project under the C/C++ category and click Next.


DS-5 shows the available toolchains in a list. Give the project a name, select the GCC toolchain added in the previous section (make sure you select the AArch64 one, not the DS-5 built-in one) and click the Finish button.


In order to use ARM Compiler 6 we need to change the project build settings to use armclang as the compiler, while leaving GCC for all the other tools. In particular, we want to make sure the GCC linker is used instead of armlink.


Right click on the project and select Properties from the menu. In the C/C++ Build section we need to change the compiler in the Tool Chain Editor. Click on Select tools: a window appears with the list of all available tools on the left hand side and the tools used by the project on the right hand side. Select ARM C Compiler 6 from the list on the left: DS-5 automatically highlights its counterpart among the currently used tools (GCC C Compiler), and by clicking the Replace button we replace it with ARM Compiler 6.


The Select tools window should now show the following Used tools:


Once completed, click OK and go to the Settings section of C/C++ Build.


In this section we need to configure armclang to compile for the ARMv8 target. Because armclang is not in the PATH when the project uses GCC, we need to specify the full path in the Command textbox, as shown below (for example "C:\Program Files\DS-5\sw\ARMCompiler6.00u2\bin\armclang").


In the Target page it is necessary to specify aarch64-linux-gnu.


In Include paths, add the full path of the include directory of the ARM Compiler 6 installation (for example C:\Program Files\DS-5\sw\ARMCompiler6.00u2\include).


Finally, we need to add a few extra compiler options in the Miscellaneous section; specifically, we need to indicate the root path of the GCC compiler with the option --gcc-toolchain, and the path to the libc libraries with --sysroot. For example:


--gcc-toolchain="$PATH_TO_GCC_COMPILER$" --sysroot="$PATH_TO_GCC_COMPILER$\aarch64-linux-gnu\libc"


You can now press OK to save the new settings.
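For reference, the settings above boil down to a single command line. The following is a sketch of what DS-5 effectively invokes; the paths are placeholders (assumptions), so adjust them to your installation:

```shell
GCC_TOOLCHAIN=/opt/gcc-linaro-aarch64-linux-gnu   # where Linaro GCC was extracted
ARMCLANG=armclang                                 # full path on Windows, e.g. under DS-5\sw\ARMCompiler6.00u2\bin

# armclang targets AArch64 Linux and borrows glibc from the GCC toolchain
CMD="$ARMCLANG --target=aarch64-linux-gnu --gcc-toolchain=$GCC_TOOLCHAIN --sysroot=$GCC_TOOLCHAIN/aarch64-linux-gnu/libc -c hello.c -o hello.o"
echo "$CMD"
```

Running the equivalent command outside DS-5 is a handy way to check that the option paths are correct before debugging the IDE settings.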


Building the project

Now that the project has been set up, we need to write the Hello World code. Right click on the project and select New → Source File. Choose a name for the new file and click Finish.

A new source editor window should open in DS-5 to edit the file. For this tutorial we will just add the following code:


#include <stdio.h>

int main() {
       printf("Hello v8 World!\n");
       return 0;
}
Save the file and build the project by selecting Build Project from the project menu.


The project should build without any errors. If not, check the output of the build in the Console tab and verify that all the settings have been correctly passed to the compiler/linker.


Start the ARMv8 model within DS-5

Our hello world application is ready, but we still don't have an environment in which to test it. DS-5 Ultimate Edition includes several platform models of an ARMv8 processor that we can use to boot Linux and debug our application. Again, take a look at Ronan Synnott's blog post for more details: Booting Linux on the ARMv8-A model provided with DS-5 Ultimate Edition.


We can start the model directly from DS-5 by creating a new DS-5 Debugger configuration in Debug Configurations. Create a new Debug configuration and select AEMv8x4 under the ARM RTSM list (typing AEMv8 in Filter platforms will help with the selection).


Paste the following parameters in the Model parameters text box:

-a "[LINARO_PATH]\\img.axf"
 --parameter motherboard.mmc.p_mmc_file="[LINARO_PATH]\\vexpress64-openembedded_lamp-armv8-gcc-4.9_20150123-708.img"
 --parameter motherboard.mmc.card_type=eMMC
 --parameter motherboard.smsc_91c111.enabled=true
 --parameter motherboard.hostbridge.userNetworking=true
 --parameter motherboard.hostbridge.userNetPorts="5555=5555,8080=8080,22=22"


Where [LINARO_PATH] is the path where you saved the kernel image and the Linux image downloaded from the Linaro website previously. The last parameter userNetPorts is important later to allow the connection of the debugger to the gdbserver port opened on the model.
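Under the hood, this is the classic gdbserver remote-debug flow. The following sketch shows the manual equivalent of what DS-5 automates; the binary name hello and the /home/root location are assumptions for illustration:

```shell
STEPS='
# On the target (e.g. via a Remote Systems terminal):
gdbserver :5555 /home/root/hello

# The model forwards host port 5555 to the target via the
# userNetPorts="5555=5555,..." mapping, so the host debugger
# simply connects to localhost:5555.
'
echo "$STEPS"
```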


In the Debugger tab make sure the radio button Connect only is selected. You can now Apply the modifications and click on Debug to start the model.


Once loaded, press the Continue button (green arrow) to run the model and boot Linux.


Debug via gdbserver

Once Linux has finished booting (it shows the command line prompt), it is possible to access the file system and the processes running on the model through a Remote System connection in DS-5. To create a new connection, select the Remote Systems tab in the DS-5 Debug perspective and click the new connection button as indicated in the image below:

Select Linux as System type and press Next. The model is running locally, so we can specify LOCALHOST as the hostname. Give a name to the connection and an optional description, then click Finish to complete the creation of the connection.


The new connection should appear in the list, giving you access to files and processes. If DS-5 asks for login details, use root as the username and leave the password empty (or use the one you specified, if you changed it in the Linaro image running on the ARMv8 model).


We now have access to the Linux system running on the model, and you should be able to browse the file system and view the running processes directly from the Remote Systems view.


Now that we have established a successful connection, we can create the debug configuration for our Hello World and run the application on the model.


Open the Debug Configurations dialog again and create a new connection this time selecting Linux Application Debug – Application Debug – Connection via AArch64 gdbserver – Download and debug application.


Make sure you set the port to 5555, as specified in the list of parameters when launching the model.


Switch to the Files tab and select the binary built in the previous step. Set /home/root for both Target download directory and Target working directory. In the Debugger tab make sure the radio button Debug from symbol is selected with main as symbol.


If all the settings are correct, the Debug button should be enabled and you can start a debug session simply by clicking on it. The debugger will connect to the target, upload the binary and stop at the beginning of the main function as we specified. The Debug Control view should appear similar to the following:


Press the green Continue arrow to run the program past the breakpoint in the main function. The application should terminate successfully, and the App Console tab should show the console output of the printf call: "Hello v8 World!".


Congratulations! You've just built a Linux application with ARM Compiler 6 and run it on an ARMv8 model!


In summary, in this tutorial we used DS-5 to create a Linux application built with ARM Compiler 6, and we debugged the application on an ARMv8 Fixed Virtual Platform Fast Model. The advanced code generation technology available in ARM Compiler 6 can be used to build Linux applications running on the latest ARM IP.


Did you find this blog useful? Do you think this would be a valuable supported feature? We would like to hear from you so please don't hesitate to comment or send an email (stefano[dot]cadario[at]arm[dot]com) to discuss this!



We've just released ARM DS-5 Development Studio v5.22, and we have made Streamline more powerful and user-friendly. In this blog I will highlight the major changes in the latest version. For a more detailed list of enhancements and fixes, please see the changelog.


Android trace events alongside an extensive list of standard system events


Android supports trace events, which are written to a system trace buffer. The Systrace tool, provided by Android, collects and visualizes these events. In the DS-5 v5.22 release, we have enhanced Streamline to support Android trace events: we can now see performance counters and charts such as CPU and GPU activity alongside the standard Android trace events.


Figure 1 Streamline showing Android trace events


For example, in the above capture, you can inspect the frame by looking at various Android Surfaceflinger events like onDraw and eglSwapBuffers.


Profile Mali-T400 Series GPUs without having kernel source


Streamline requires an agent called gator to be installed and running on the ARM Linux target. Gator can operate in two modes:

(a) kernel space gator – using a kernel module called gator.ko.
(b) user space gator – without the kernel module.

As user space gator is restricted to user space APIs, it does not support all the features that kernel space gator supports. However, user space gator is easier to use, because you do not need the target's Linux kernel source to build the kernel module. Given this ease of use, we are working on enhancing the features supported by user space gator. With this release, we are happy to announce that user space gator now supports the Mali-T400 series of GPUs. Note that you will need a recent version of the Mali DDK, which exports system events to user space. Going forward, you can expect us to add support for more Mali graphics processors.


Automatic fetch of symbol and other information from files on the target


Streamline needs symbol information to correlate the captured events with the code being run. In the past, this image information had to be provided manually, which can be tricky if the image is available only on the target and not on the host. The v5.22 release introduces automatic image transfer from the target to handle this situation.


Figure 2 New textbox to select processes for automatic fetching of images from the target


This is best shown with an example. In my case, I want to run the dhrystone executable on my Nexus 9 and see the function profile. As a first step, I run the program via adb and start the Streamline session. During the session, I can now see a new box at the bottom, as shown in the picture above. Here, I can type a pattern ("dhr" in my case) to select the list of processes. Streamline automatically fetches the symbol information for the selected processes from the target. In my case, Streamline shows the function profile for dhrystone, as seen in the picture below, without my having to provide the image manually.


Figure 3 Streamline showing function profile for the dhrystone process




Streamline snippet during the live capture


The Streamline snippet is now available during live capture. As you might recall, snippets are a powerful feature that lets users track complex counters derived from a combination of more basic counters. For example, as seen in the picture below, you can track Clocks Per Instruction (CPI) using the $ClockCycles and $InstructionExecuted counters.


Figure 4 CPI snippet




DS-5 v5.22 comes with an enhanced Streamline, with useful features such as support for Android trace events, automatic symbol loading from the target, and profiling of Mali-T400 series GPUs with user space gator, amongst others. You can get all these features and more by downloading DS-5 v5.22 from here. Sign up to the DS-5 newsletter and get updates, blogs and tutorials delivered to your inbox.

Several ARM partners, such as Clarinox, Micrium, Oryx-Embedded, wolfSSL and YOGITECH, are using Software Packs to deliver middleware. This simplifies the installation, usage, and project maintenance of software components. We have created a new Partner Pack website that gives you an overview of the currently available Packs, covering a wide range of use cases:

  • Functional safety
  • Real-time operating systems
  • Security/encryption
  • TCP/IP networking and
  • Wireless stacks

Use Pack Installer to install one of these Packs automatically in µVision:


You may know that there is a team at the University of Szeged who are keen to move the web forward, especially on embedded systems. Several months ago an interesting question was put to us, which sounded simple but was hard to answer right away: how can one build a functional web browser?


If you are interested, check out my colleague's post at our blog site.

Any comments, feedback or even contributions are welcome! (Comments can be left either here or on our blog.)

It seems that just yesterday we released ARM Compiler 6.01 and it’s already time for a new major release of the most advanced compiler from ARM.

Let’s see the major highlights for this release:

  • Update of C++ libraries
  • Performance improvements
  • Enhanced support for ARMv7-M, ARMv6-M cores


Update of C++ libraries

Previous versions of ARM Compiler included only the Rogue Wave C++ libraries, which have not been updated beyond the C++03 standard. In ARM Compiler 6.02, we move closer to the leading edge by incorporating libc++ from the LLVM project, which has passed our extensive internal validation suites.

The new libraries support the C++11 and C++14 standards, and in conjunction with the LLVM clang front-end, ARM Compiler 6.02 is the most modern and advanced toolchain for developing software for your ARM-based device. Have a look at some of the advantages of the new C++ standards in my recent blog post on C++11/14 features.

If you want to use the old libraries, you can still do so with the --stdlib=legacy_cpplib command line option.

Performance improvements

Performance is an important aspect of a toolchain and benchmarks are a convenient way (although not perfect) to evaluate the quality of the optimizations performed by the compiler.

During the last months, ARM engineers worked on identifying and implementing optimization opportunities in the LLVM backend for ARM. The results are shown in the following graph.


As you can see, the improvements between ARM Compiler 6.01 and ARM Compiler 6.02 are significant and show that we are working in the right direction. Even though your code base differs from a synthetic benchmark, you may well see a boost too: give it a try!

Enhanced support for ARMv7-M and ARMv6-M

clang is often used to build high performance code for Cortex-A cores and plays a fundamental role in this area. Embedded ARM microcontrollers have been less of a focus for the LLVM community, and ARM is now filling the gaps by making ARM Compiler 6 a toolchain able to build efficient code across the full range of ARM processors, from the smallest Cortex-M0+ to the latest Cortex-A72 64-bit processor.

ARM engineers have focused on Cortex-M processors, and we are now confident enough to change the support level for the Cortex-M family from alpha to beta: this means that the code generated for the ARMv7-M and ARMv6-M architectures has reached a good quality level and has been sufficiently tested by ARM (there is still work to do, hence the beta label). We expect to complete support for ARMv7-M and ARMv6-M in the next release of ARM Compiler, at the end of this year.

If you want to know all the changes in this release of the compiler, take a look at the release notes on the ARM Infocenter.

This version of the compiler will be included in the next version of DS-5 (5.22), but if you can't wait, you can get the standalone version from ds.arm.com and add it to DS-5 (if you have DS-5.20 or later) as shown in this tutorial.

As always, feel free to post any comment or question here or send me an email.

Any feedback is welcome and it helps us to continue delivering the most advanced toolchain for ARM from ARM.



Dear Friends,



Here are a few lines of GCC assembly code to make your Cortex-M4 interrupts fully
reentrant. Please read the notes from Sippey before proceeding to the implementation details
on this page.


NOTE1: The code uses a large amount of stack (32 or 136 bytes per reentrant call, depending
on whether floating point operations are used), so be careful with excessive use of re-entrancy and remember to size the stack appropriately. When you use this code within MATLAB/Simulink, you need at least 136 extra bytes per sampling rate in the Simulink schematic.


NOTE2: This code is inspired by and optimized from the work of other authors, who know
ARM assembly and the Cortex architecture better than I do.



NOTE3: The re-entrant code assumes that at interrupt exit the processor returns to task
space (whether on PSP or MSP). Hence, to avoid corrupting the stack, the preemption function
should only be called from the lowest-priority interrupt in the program.



Function description:

RIPCrun:

    - First pushes a dummy stack frame (only 32 bytes) onto the stack and returns from the interrupt.

    - The return address programmed into the dummy frame points back into the same function, so
      that the rest of the code executes in process/thread mode (instead of running at the
      interrupt priority).

    - Once back in thread mode, the code calls the desired function. This is a normal
      function call (i.e. the stack is saved again by the usual processor mechanism).

    - On return, it triggers a software interrupt (SVC) to restore the stack.




The SVC handler then:

    - Determines which SVC code was called.

    - If it was any other code, the traditional interrupt handler is executed.

    - Otherwise it calls RIPCrestoreSP, which cleans up the original interrupt stack.



NOTE: Why do we restore the stack in the SVC handler instead of in RIPCrun? The Cortex CPU
supports two threading models, using one or two different stacks (PSP/MSP) depending on the mode.
Hence the original stack frame is saved on a stack that depends on the threading model. The
SVC call ensures that the processor recovers the stack appropriately.



MAJOR differences from Sippey's code:

    1st: defines are used to decide which priority levels and callback procedures to use;

    2nd: everything is implemented using GCC inline assembly;

    3rd: naked "C" functions are used to limit the overhead of the function calls;

    4th: the RIPCrun function locally encodes the return address
        (ADDW R0, PC, #16 ; skip 8 instructions from here)
        to simplify the code.



* Reentrant Interrupt Procedure Call (RIPC)
* ARM-GCC code to implement REENTRANT interrupt procedures.
* Source of inspiration:
*     - "The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors" 3rd ed.
*     - Sippey code for KEIL: Sippey (sippey@gmail.com)
* CORTEX M4 Register Structure
*   - CPU 16 Register (R0-R15) + PSR
*   - FPU 32 Register (S0-S31) + FPSCR
*      - R0-R3, R12, LR, and PSR are called “caller saved registers.”
*   - R4-R11                  are called “callee-saved registers.”
*   -  S0-S15 + FPSCR are “caller saved registers.”
*   -  S16-S31        are “callee-saved registers.”
* Typical Calling Layout
*         R0/R1 is Return Result Value if any
*         R0-R3 are parameter value (with the above exception)
*         R12 is a scratch register
*         R13 used to store SP
*         R14 link register (return address)
*         R15 is Program Counter
* Stack Structure (growing from TOP to LOW memory)
*     BEWARE for efficiency Stack is manipulated aligned to 8 bytes always
*            in case of ODD number of registers it gets padded with white space
* Cases:      NOFPU               FPU
*         32  (pad align 8)      (PAd align 8)    if PADDING present xPSR bit9 == 1
*            28  xPSR           96    FPSCR
*            24  ReturnAddr     92    S15
*            20  LR             88    S14
*            16  R12            84    S13
*            12  R3             80    S12
*             8  R2             76    S11
*             4  R1             72    S10
*             0  R0*            68    S9                NO FP Stack pointer here
*          ==============       64    S8
*             8 REGs            60    S7
*                               56    S6
*      (total 8x4=32bytes)      52    S5
*                               48    S4
*                               44    S3
*                               40    S2
*                               36    S1
*                               32    S0
*                               28    xPSR
*                               24    ReturnAddr
*                               20    LR
*                               16    R12
*                               12    R3
*                                8    R2
*                                4    R1
*                                0    R0*               FP Stack pointer here
*                           ====================
*                            8+17 = 25 REGS PADDED to 26   (Total 26*4=104bytes)
*         The return address is the stacked PC
*         While Stacked LR was previous return address
*         BX LR is return from subroutine
*         if LR start with 0xFxxxxxxxx then it is interpreted as Return from Interrupt (Exception Return)
*         Possible Exception return values are:
*         if FPU was used before interrupt call
*         0xFFFFFFE1 Return to another exception using MSP (Master)
*         0xFFFFFFE9 Return to thread using MSP (Master)  stack pointer
*         0xFFFFFFED Return to thread using PSP (process) stack pointer
*         if FPU was not used before CALL
*         0xFFFFFFF1 Return to another exception using MSP (Master)
*         0xFFFFFFF9 Return to thread using MSP (Master)  stack pointer
*         0xFFFFFFFD Return to thread using PSP (process) stack pointer

#include <misc.h>
#include <stm32f4xx.h>

// Lazy using strings to pass parameter to Assembly code
#define SVC_CALL_NUMBER       "0"     // SVC_CALL_NUMBER being used
#define PRI_LEVEL_LOCK        "240"   // Level 15 for STM32F4

static void RIPCrun( void (*fcn)(void) ) __attribute__ (( naked, used ));
static void RIPCrestoreSP( void ) __attribute__ (( naked,used ));

/* This is the NEW default handler for the standard SVC; override it
 * if required, as usual on CM4. */
__attribute__(( weak, used )) void SVC_Orig_Handler()
{
    while(1); // No other default service! Catch or return?
}

/* \brief RIPCrun makes the interrupt reentrant. It pushes a dummy
 * stack frame, loads a fake return address depending on the FPU and call type,
 * and returns. The target address is given as a parameter.
 * Usage example:
 *     void SysTickHandler()
 *     {
 *            // NON-REENTRANT CODE BEFORE
 *         RIPCrun(reentrant_Handler); // Call to reentrant code
 *     }
 * To avoid undesired preemption, the call is made in two stages:
 * first we call/return to a stub that in turn calls the desired
 * handler.
 * Note that the interrupt being made reentrant should have the lowest
 * priority. */
static void RIPCrun( void (*fcn)(void) )
{                                               // R0 at entry contains the jumping address
    __asm volatile(
#ifdef __FPU_USED
            " TST LR, #0x10                  \n" /* Test bit 4 to check usage of FPU registers */
            " IT EQ                          \n"
            " VMOVEQ.F32 S0, S0              \n" /* Mark FPU used for lazy stacking operation */
#endif
            " MRS  R1, xPSR                  \n" // Save the current xPSR
            " PUSH {R1, LR}                  \n" /* Push xPSR and LR on the stack */
            " SUB  SP, #0x20                 \n" /* Reserve an additional 8 words for a complete dummy stack frame */
            " STR  R0, [SP]                  \n" // Pass R0 to the callee on return
            " ADDW R0, PC, #16               \n" // RIPCservice (skip 8 instructions from here)
            " STR  R0, [SP, #24]             \n" // Handler launcher in thread mode (temporary return address)
            " MOV  R0, #0x01000000           \n" // Generate a fresh new xPSR
            " STR  R0, [SP, #28]             \n" // and store it (xPSR) at the proper offset
            " MOV  R0, #0xFFFFFFF9           \n" // Create a return value for ISR return to MSP, no FP (8-word frame)
            " MOV  LR, R0                    \n" // and place it in LR to emulate a standard ISR return
            " BX   LR                        \n" // The return here will use our dummy stack frame

            // RIPCService
            /* Now we have exited the interrupt and enter immediately here (SP+24 points to this address).
             * On return, the R0 register is populated from the dummy stack frame with the parameter passed
             * to RIPCrun (the original R0), and we jump there immediately.
             * Note this procedure call is handled on the MSP stack, whatever the original
             * THREAD stack (PSP or MSP) was. */
            " BLX  R0                        \n" // RIPCService: call the desired function
            " MOVS R0, #" PRI_LEVEL_LOCK "   \n" // Raise the PRIORITY level to
            " MSR  BASEPRI, R0               \n" // block further triggers of our base interrupt
            " ISB                            \n" // ISB required to wait for the BASEPRI effect (avoid further preemption)
            " SVC  #" SVC_CALL_NUMBER "      \n" // Replace here with the desired syscall number
//          " BL   RIPCerror                 \n" // SVC will reset the stack; we should not return here
    );
    while(1); // We should never get here, otherwise the stack was messed up!
}

/* \brief The control logic is the following:
 *         if (the SVC number matches SVC_CALL_NUMBER)
 *                  RIPCrestoreSP();
 *         else
 *                  SVC_Orig_Handler();
 * This handler and the RIPCrestoreSP function restore the stack and hence should be protected against
 * further reentrant interrupts of the same kind, otherwise the stack can be corrupted.
 * The SVC handler always executes with the MSP stack, but the original SVC service number can be stacked on
 * MSP or PSP. Hence the initial test serves to properly extract the SVC number. */
__attribute__(( naked )) void SVC_Handler()
{
    __asm volatile(
            " TST    LR, #0x04               \n" /* Test EXC_RETURN bit 2 (MSP or PSP?) */
            " ITE    EQ                      \n" // if 0
            " MRSEQ  R0, MSP                 \n" // get SP from MSP
            " MRSNE  R0, PSP                 \n" // else use PSP
            " LDR    R1, [R0, #24]           \n" // This is the offset of the stacked PC
            " LDRB.W R0, [R1, #-2]           \n" // Check the SVC calling service
            " CMP    R0, #" SVC_CALL_NUMBER "\n" // Replace here with the desired syscall number
            " BEQ    RIPCrestoreSP           \n" // use our modified SVC handler
            " B      SVC_Orig_Handler        \n" // else jump to the original handler
    );
    while(1); // We should never get here, otherwise the stack was messed up!
}

/* \brief This function is called after the SVC handler has properly identified that we are
 * returning from a reentrant interrupt.
 *   -  We restore BASEPRI to avoid nesting of SVC_Handler (which would produce a fault).
 *   -  We remove the stack frame produced by the SVC_Handler call.
 *   -  We recover the xPSR and LR stored originally by RIPCrun.
 *   -  We return from this SVC using the stack frame pushed for RIPCrun.
 * DOUBT: Why trigger lazy stacking here? Does it copy values into a dummy stack frame which
 * is trashed a couple of instructions later? */
static void RIPCrestoreSP( void )
{
    __asm volatile(
            " MOVS R0, #0           \n" /* Use the lowest priority level */
            " MSR  BASEPRI, R0      \n" // to re-enable the interrupts
            " ISB                   \n" // Ensure synchronization
#ifdef __FPU_USED
            " TST LR, #0x10         \n" /* Test bit 4 to check usage of FPU registers */
            " IT  EQ                \n"
            " VMOVEQ.F32 S0, S0     \n" /* Mark FPU use for the lazy stacking operation */
            " TST LR, #0x10         \n" /* Test bit 4 to check usage of FPU registers */
            " ITE EQ                \n"
            " ADDEQ SP, SP, #104    \n" // Restore the stack properly (FPU frame)
            " ADDNE SP, SP, #32     \n" // (basic frame)
#else
            " ADD  SP, SP, #32      \n" // Restore the stack properly (basic frame)
#endif
            " POP {R0, R1}          \n" /* Pop PSR and LR from the stack */
            " MSR APSR_nzcvq, R0    \n" // Should be xPSR ??
            " BX   R1               \n" // Finally jump to R1
    );
    while(1); // We should never get here, otherwise the stack was messed up!
}

#define TEST_REENT
#ifdef  TEST_REENT

#define NESTLEVEL 20

static int pass = 1;
float NPI[NESTLEVEL];
unsigned int stackIN[NESTLEVEL];
unsigned int stackOUT[NESTLEVEL];
unsigned int nesting = 0;

/**
 * \brief Executes some FP operations. Marks the stack at entrance and exit and
 * waits in the middle for a number of nested recursions.
 * Note that the stack consumption is about 72 bytes per non-FP reentrance
 * and 144 bytes per FP reentrance. This is due to the double procedure
 * call that is set up at each interrupt (i.e. the original stack frame
 * is preserved until the end + one procedure call goes through the BLX).
 * We have 8 more local bytes on the stack,
 * which makes 32 + 8 + 32 (two complete frames + 8 bytes for the temporary PSR & LR),
 * or 104 + 32 + 8 = 144 in the case of an FP call stack.
 * The byte overhead w.r.t. the standard mechanism is hence 40 bytes.
 * Beware to have a large enough stack for reentrancy.
 */
void ReentTickTest()
{
    register unsigned int *stackref;
    int a=0, lev;

    __asm__ ("mov %0, sp" : "=g" (stackref) : );
    NPI[lev] = 3.1415926535f*lev;

    // Wait for reentrancy
    __asm__ ("mov %0, sp" : "=g" (stackref) : );
    if (lev==0) pass=2;
}

void SysTick_Handler()
{
}

int main(void)
{
    float jj, kk;
    jj = 3.14f;
    kk = jj*2;

    // The chosen IRQn should be the lowest in the system so that we are
    // sure that when this interrupt is exited we will return to thread
    // mode with a well-known stack recovery mechanism.
    // The alternative is to disable the interrupt in the code, but this
    // violates the rule of max 12 cycles for interrupt latency, which is
    // one of the best features of Cortex-M.

    for (;;) {
        if (pass==2) break;
        nesting = 0;
    }
}


In late summer (mid August/September) we will run a series of webinars that will show you the advantages of the ARM Cortex-M7 processor family and the silicon implementations that are available today. This webinar series is as follows:

They are hosted by my colleagues Johannes Bauer and Matthias Hertel, who have lots of experience in the embedded space. While the first webinar will introduce the Cortex-M7 architecture and its advantages, the other two webinars will concentrate on the devices available from our silicon partners Atmel and STMicroelectronics. They will contain live demos on how to connect to the hardware and how to create your first applications with MDK Version 5.


Embedded ARM Processors

On July 28th, 2015 (8 am PST / 5 pm CEST) I will be holding a webinar on how to create your own Software Pack.


Software Packs offer a great way to distribute software components in a well-defined manner within engineering groups. In this webinar, a Software Pack is created based on the Jansson C library used for encoding, decoding and manipulating JSON data. I will show how to pack this software component together with an example project so that it can be shared with your fellow engineers. Also, I will discuss how to distribute a Pack to a wider audience.


For registration, please visit Creating a Software Pack to Share with Developers



Last quarter we started to blog about our work on GNU GCC and LLVM because we think that sharing information is the key to cooperation in the Open Source community. We want to continue with the updates by sharing our achievements of the last quarter and our plans for the future. We will be at GNU Tools Cauldron 2015 on 7th-9th August in Prague (Czech Republic): please come and talk with us; this is a great occasion to meet us in person and discuss open source contributions.


The following notes include partial information on what we’ve been working on in the last quarter and what we plan to do in the next one: for details please refer to the slides or get in touch with us.



The last quarter was particularly important for the release of the new major version, GCC 5.1! Thanks to the ARM engineers and to everyone who helped get this important milestone release out of the door smoothly. For the Cortex-R and Cortex-M profiles, ARM released GCC 4.9 for ARM Embedded Processors: you can find the release notes on the Launchpad website.


In terms of development, the majority of the effort went into improving the ABI compliance and some performance tuning. As revealed in the previous update, we added ARMv8.1 support to binutils, enabled GCC native tune (-mcpu=native) and worked on ABI compliance for both Cortex-A and Cortex-R/M toolchains.


What’s next?

For the next quarter, the plan is to complete what's left of ARMv8.1 support and to work on various optimizations such as enhancing GCC loop invariants (PR65477, PR62173, PR62178), improving the cost model for Cortex-A53 and Cortex-A57, and improving CSEL code generation for AArch64.

Further work will improve the selection of FP divide & multiply on Cortex-M and add support for all memory models of AArch64.



Even if relatively new, LLVM is quickly gaining popularity, and ARM is committed to supporting the community development. The commercial toolchain we offer to our customers, ARM Compiler 6, is in fact based on the open source Clang.


In the last quarter we worked on different aspects of the compiler, from adding support for the ARMv8.1 architecture to improving the usability of the command line interface: in collaboration with Linaro we improved the architecture and core-name parsing, which is now cleaner and more usable than before.

In terms of performance, we’ve been working on several optimizations (alignment of global variables, minimization of stack usage (details in section LLVM lifetime markers), new float2int pass, PBQP register allocator, etc.) but we also set up a new Cortex-A53 performance tracking bot: read more about this in the section below.


What’s next?

In terms of future plans, we will still be focused on performance improvements across all the cores and on optimizing accesses to global variables in loops. We also plan to further improve the LNT WebUI to make it easier to detect performance changes tracked by the running bots.


LLVM lifetime markers

In the last quarter's update we mentioned the necessity of reducing stack usage, which is particularly important for the Android Open Source Project. Lifetime markers are used to identify when a particular stack slot becomes live or dead along all control flow paths: these markers are generally ignored by most optimization passes, but they are important for reducing stack usage.


ARM engineers removed the previous 32-byte minimum size limitation for a marker, uncovering a few issues (primitive types use 1-, 2-, 4- and 8-byte stack slots) but contributing to an overall reduction of stack usage.


LLVM public performance tracking bot

Development of compilers is a tough job! Each patch can affect not only the correctness of the code generation but also the performance of the generated code. Tracking performance can be really tricky, especially considering the number of devices and architectures LLVM supports. For those reasons, ARM committed to helping the community by adding a public Cortex-A53 tracking bot: the script executes a few benchmarks on the LLVM top-of-trunk every 6 hours and publishes the results at http://llvm.org/perf


There are still a few improvements that could be made on the system but we feel this is going in the right direction and we hope the community will make good use of it!


For more details please refer to the full presentation given by Matthew Gretton-Dann available on YouTube and his slides (attached to this blog post).

We would like to hear from you on what you are doing in the open source community, share ideas and cooperate for the good of the whole ecosystem. See you at GNU Cauldron in August!



DAC 2015, Fast Models 9.3

Posted by robkaye Jun 23, 2015

Earlier this month I attended DAC in San Francisco. We had a demo of Fast Models, some partner presentations and a poster session. I came away from the conference with the impression that while the technical conference remains vibrant, the exhibition portion is declining in importance. I first took part in the 1980s, but since then we have seen the birth of the Internet. In those far-off days we used to see large delegations from all parts of the world attend to find out the latest product information and get updates from the EDA vendors. Who can forget some of the creative ways that some of these promoted their products? Nowadays that information is largely available online and through the various social media (like this one), decreasing the value of visiting the trade show: it may be convenient and efficient, but it's certainly a lot less fun.


A new demo involving Fast Models was shown by Aldec:


Aldec - 1.png

Aldec's demo platform for their Hybrid Virtual Prototype with Fast Models.

Hybrid platforms like this are becoming very popular when there is a need to connect a high-performance simulator to represent the processor or processor subsystem with a more detailed model of other parts of the system.  This could be for many reasons, which we have discussed in an earlier blog.

Immediately prior to DAC we released Fast Models version 9.3. We have moved to a quarterly release cycle (from half-yearly) that better serves the needs of ARM's IP roadmap. In this release we introduced support for new Cache Coherent Network models (CCN-502, CCN-512) and Mali Display Processors (Mali-DP500 and Mali-DP550). We also continued to advance the capabilities of the models: the two areas that we are currently focused on are Timing Annotation and Checkpointing (Save and Restore).

Timing Annotation extends the use of the Virtual Prototype in early, high-level performance estimation. The functionality provides a mechanism for the user to insert estimated timings at key points in the Virtual Prototype to improve the correlation of the reported cycle counts with what will be achieved in hardware. The aim is to do this with minimal impact on the throughput of the model. We are adding Timing Annotation in stages: in this release the focus has been on the integrated cache models. Of course, the results are heavily dependent on the quality of the annotated values.


We also introduced a new type of system in the example Virtual Prototypes supplied with Fast Models. Previously we have delivered Fixed Virtual Prototypes (FVPs) and Exported Virtual Subsystems (EVSs), the former being standalone platforms, the latter being functionally equivalent examples that integrate with SystemC. The third category, which also works with SystemC, is called an SVP, or SystemC Virtual Prototype. The evolution from the EVS is that in the SVP, models are individually instantiated into SystemC rather than forming a monolithic subsystem. This gives the platform developer much more flexibility.


The second half of 2015 will see the continued evolution of the Fast Model functionality and a burgeoning library of models. 

Hopefully I'll be seeing some of you at the ARM TechCon in November where we'll be going into more detail on these capabilities.

ARM FAE Ronan Synnott explains DS-5 Development Studio at the 52nd DAC in the Moscone Center. DS-5 contains compilers, debuggers and the Streamline analyzer, which assist with every stage of SoC development. To find out more please visit http://www.ds.arm.com



Do you have any questions? Please put them in the comment section below

Often we hear embedded software engineers avoid using C++ because they fear a potential performance hit or code size explosion. Even though some features of C++ can have a significant impact on performance and code size, it would be a mistake to exclude the language completely because of this.


In this article I want to show a few additions to the C++11 and C++14 standards which can improve the readability of your code without affecting performance, and thus can be used even with the smallest Cortex-M0+ core.


ARM Compiler 5 supports C++11, whereas ARM Compiler 6 supports both C++11 and the most recent C++14 (refer to the documentation for details). If not specified, ARM Compiler 6 assumes C++03 as the standard, so you need to use the command line options --std=c++11 or --std=c++14 to use the newer standards. If you want to enforce conformance to a specific standard you can use the command line option --pedantic-errors: armclang will then generate an error if you use extensions or features outside that standard.



The constexpr keyword was introduced in C++11, but C++14 removed a few constraints, making this functionality even more powerful. When a function is declared constexpr, the compiler knows that the result of that function can be evaluated at compile time and can use it accordingly.

Let’s assume we want to create a static array whose size is the number of bits set in a word; with C++03 we would have written something similar to the following code:

const int my_word = 0xFEF1; // bit mask

int *my_array;

int number_of_bits(int word) {
    int count = 0;
    while (word) {
        count += word & 0x1;
        word >>= 1;
    }
    return count;
}

my_array = (int*)malloc(sizeof(int)*number_of_bits(my_word));


With C++14 it is possible to calculate this in a function whose result is available at compile time. The code can be transformed as follows:

const int my_word = 0xFEF1; // bit mask

constexpr int number_of_bits(int word) {
    int count = 0;
    while (word) {
        count += word & 0x1;
        word >>= 1;
    }
    return count;
}

int my_array[number_of_bits(my_word)];

Because the function is evaluated at compile time, the compiler can instantiate the array with a fixed size, saving the call to malloc() at run-time: readability and performance have been improved at the same time!

Binary literals

Often in our applications we need to use bit masks or perform bit operations: how many times have we written code similar to the following?


if (x & 0x20) { // 0010 0000


What does 0x20 mean in this code? For an expert programmer this is clearly checking whether the sixth least-significant bit of x is set, but it can get trickier with more complex bit masks. In C++14 it is possible to define binary literals, making the specification of bit masks much clearer:


if (x & 0b0010'0000) {


As you can see from the example, not only can we specify the bit mask directly, but we can also use ' as a digit separator to enhance readability even further. The generated assembly code is the same but the source is easier to understand.


Range-based for loop

Most modern languages like Python and C# support range-based loops; this doesn’t add more power to the language but it improves readability of the resulting code.

This functionality has been added to C++11 and it’s now possible to use range-based loops directly in your existing code.

Let’s take a look at an example:


int my_array[] = {1, 2, 3, 4, 5};
int sum_array(void) {
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        sum += my_array[i];
    }
    return sum;
}


This can be rewritten as:

int my_array[] = {1, 2, 3, 4, 5};
int sum_array(void) {
    int sum = 0;
    for (auto value : my_array) {
        sum += value;
    }
    return sum;
}

The code reads better now and we also removed the size of the array from the for loop, which is a potential source of bugs (we would need to update it if we added a new element, for example).

The range-based for loop works with any type that has begin() and end() functions defined, so we can apply the same technique to std::vector:

int sum_array(std::vector<int> array) {
    int sum = 0;
    for (auto &value : array) {
        sum += value;
    }
    return sum;
}

In this case the improvements in terms of readability are even better and, as a result, the code is easier to understand and maintain.

Null pointer constant

Since the beginning of the C standard, we have used NULL to check the validity of a pointer. This led to confusion in C++ because NULL is equivalent to 0.

Let’s assume we have two functions with the same name and different arguments:

void Log_value(int value); // first function
void Log_value(char *value); // second function

In C++, the following code has an unexpected effect from the developer's point of view.

Log_value(NULL); // will call the first function

In fact, by using NULL we expect the second function to be called, but because NULL is equal to 0, the first function is called instead.

In C++11 the keyword nullptr has been introduced and should be used instead of NULL, so that we can easily avoid this ambiguity:

Log_value(nullptr); // will call the second function


In this case, the second function is correctly called with an explicit null pointer value.



We have seen a few features of C++11 and C++14 which can be used without worrying about performance and which enhance the readability of your code. This article covers just a few of them; you can find more information on Wikipedia (C++11 - Wikipedia, the free encyclopedia and C++14 - Wikipedia, the free encyclopedia) and in the C++11 and C++14 standards.

I hope you found this information useful and that you can soon start to use some of these features in your code base. As mentioned at the beginning, ARM Compiler 6 supports C++11 and C++14. If you still don’t have DS-5, download a free 30-day evaluation of Ultimate Edition to get started.


Feel free to post any questions or comments below.





Poor cache utilization can have a big negative impact on performance, and improving utilization typically has very little or no trade-off. Unfortunately, detecting poor cache utilization is often difficult and requires considerable developer time. In this guide I will demonstrate using Streamline to drive cache optimization and identify areas of inefficiency.


I have used the Juno ARM Development Platform for the purposes of this guide, however the counters I use (or equivalents) should be available on all ARM Cortex-A class processors so it should be easily repeatable. Even without a platform to test on, the methodology I use should provide an insight into using Streamline to help guide optimization.


This guide assumes a basic level of knowledge of Streamline. Introductory information and getting started guides can be found in DS-5’s documentation or, along with other tutorials, on the website.



Setting up Streamline

Start by installing gator on the target. This is beyond the scope of this guide; see the readme in <DS-5 installation dir>/arm/gator/ for detailed information. Once installed, launch the gator daemon. I successfully used both user-space and kernel-space versions of gator. The user-space version is sufficient in most cases, the kernel-space version is only required in some circumstances – I expand on this point later.


Compile the attached cache-test application. It is sufficiently simple that it could be compiled on the device (if a compiler were available) or cross-compiled otherwise.



Configuring DS-5

Open up the Streamline Data view in DS-5. Configure the Streamline connection using the Capture & analysis options () to use the gator version running on the target. The other default configuration options should be sufficient, although you may optionally add the application binary to the Program Images section at the bottom for function-level profile information, or, if the binary contains debug symbols, source-code-level profile information.



Adjust the Counter configuration () to collect events from:

  • Cortex-A57
    • Cache: Data access
    • Cache: Data refill
    • Cache: L2 data access
    • Cache: L2 data refill



In our case we are also collecting “Cache: Data TLB refill”, which will provide an additional measurement to analyze caching performance, as well as “Clock: Cycle” and “Instruction: Executed” which will provide an insight into how execution is progressing. We are also collecting from the energy measurement counters provided on the Juno development platform.


Further Information on the Target Counters

The counters listed above are specific to our particular platform – the Juno development board. This has a big.LITTLE arrangement of 2x Cortex-A57s and 4x Cortex-A53s; we will be running our program on one of the Cortex-A57 cores.


The ARM Performance Monitors extension is an optional, non-invasive debug component available on most Cortex-A-class cores. Streamline reads the Performance Monitor Unit (PMU) architecture provided by this extension to generate its profiling information. Each of the processor counters observed within Streamline corresponds to a PMU event. Not all events described by the PMU architecture are implemented in each core, however a core set of events must be implemented, including the “Cache: Data access” and “Cache: Data refill” events shown above (in PMUv2 and PMUv3). Thus these two events should be available on all Cortex-A-class cores which implement the architecture. For more detailed information on the Performance Monitors Extension see the relevant section of the ARM Architecture Reference Manual for ARMv7 (Chapter C12) or ARMv8 (Chapter D5) as appropriate.


The “Cache: L2 data access” and “Cache: L2 data refill” counters are also common (but not mandated) on cores with an integrated L2 cache controller, however some cores have separate L2 cache controllers – for example the CoreLink Level 2 Cache Controller L2C-310. In this case the counters will be limited to what is available from the controller and whether Streamline supports it. In the case of the L2C-310, equivalent counters are available and it is supported in Streamline, however the counters are only readable using kernel-space gator (user-space gator can still read all others). Ultimately the L1 cache counters give a good view of what’s going on so if you are unable to read counters from the L2 cache (for whatever reason) it is still possible to follow the steps in this guide to help perform cache-optimization, it might just be slightly harder to see the full path of data through the cache system.


Most cores also provide additional PMU events (which will vary by core) to monitor cache usage and these can provide further information.


The Chosen Counters

The “Cache: Data access” counter (PMU event number 0x04) measures all memory-read or -write operations which access the L1 data cache. All L1 data cache accesses (with the exception of cache maintenance instructions) are counted, whether they resulted in a hit or a miss.


The “Cache: Data refill” counter (PMU event number 0x03) measures all memory-read or -write operations which cause a refill of the L1 data cache from: another L1 data cache, an L2 cache, any further levels of cache or main memory – in other words L1 data accesses which result in a miss. As above this does not count cache maintenance instructions, nor does it count accesses that are satisfied by refilling data from a previous miss.


The “Cache: L2 data access” and “Cache: L2 data refill” counters (representing PMU event numbers 0x16 and 0x17 respectively) measure as their L1 counterparts, except on the L2 data cache.


More detailed information on any of these events can be found in the Performance Monitors Extension chapter of the relevant ARM Architecture Reference Manual as linked above.



Capturing Data

After you have configured the target, press the Start capture button (). Once capturing has started run the cache-test application on the target (as “./cache-test”). Depending on the performance of your target this will take a few seconds to run and will output several messages before returning to the command prompt. When this happens, press the Stop capture and analyze button (). After a brief pause the analyzed data will be displayed.



Reformatting the Captured Data

You should now be presented with a chart looking similar to the image below:



Filter this by just the cache-test application by clicking on the “[cache-test #<proc-id>]” entry in the process list below the charts. In the case of multiple processes-of-interest the Ctrl key can be held down to select multiple processes. Having done this, depending on how long the capture session lasted and how long the program ran there may be considerable space around it. Change the Timeline display resolution using the dropdown to the left of the Time index display above the charts (set to 100ms in the example above) to zoom in.


The results currently are somewhat difficult to interpret as all Cache measurements are plotted on the same chart but have different ranges. Split the “Cache: Data access” and “Cache: L2 Data access” measurements into a separate chart as follows:

  1. Click on the Charts Snippet menu () above the process list.
  2. Select Add Blank Chart. Enter “Cache Accesses” as the new chart’s Title and drag it above the “Cache” chart.
  3. On the “Cache” chart, open the Configuration Panel ().
  4. Amend the “Cache” chart’s title to “Cache Refills”.
  5. Using the handle (), drag the “Data access” and “L2 data access” series to the newly created “Cache Accesses” chart.
  6. Remove the blank “Required” series in the “Cache Accesses” chart ().
  7. Change the plotting method of both charts from Stacked to Overlay (using the drop-down box at the top left of the Configuration Panel), allowing the relationship between the values to be more apparent.
    In Overlay mode the series are plotted from the top of the list, down – i.e. the series at the bottom is plotted last, in front of all others. As a result some series may need rearranging to improve their visibility in Overlay mode (although colors are slightly transparent so no data is completely hidden).
  8. Optionally rename the series as appropriate – e.g. “Data access” may be more sensibly named “L1 data access” to complement the “L2 data access” series.
  9. Optionally change the colors of the series to improve their contrast.
  10. Close the Configuration Panel by pressing the button again ().


Having separated these two series the chart should now look similar to the image below:


Next we will produce some custom data series to provide additional information about the performance of the caches:

  1. Click on the Charts Snippet menu () above the process list.
  2. Select Add Blank Chart. Enter “Cache Refill Ratios” as the new chart’s Title and drag it below the “Cache Refills” chart.
  3. Enter “L1 data ratio” as the new series’ Name. Set the Expression to be “$CacheDataRefill / $CacheDataAccess”. As this result is a percentage (the ratio of L1 data cache refills to accesses – i.e. the miss rate), tick the Percentage checkbox.
  4. Add another series to the new “Cache Refill Ratios” chart () and repeat the process for the L2 cache, setting the Expression to be “$CacheL2DataRefill / $CacheL2DataAccess”.
    The expression will differ if using a separate L2 cache controller. Pressing Ctrl + Space in the Expression window will list all available variables.
    In our case the 0x04/0x03 and 0x16/0x17 counter pairs are explicitly listed in the ARMv8 ARM Architecture Reference Manual as being associated in this way. Some care should be taken when using a separate cache controller that this assumption still holds.
  5. Change the plotting method of the chart from Stacked to Overlay.
  6. Optionally change the colors of the series to improve their contrast.


This is a very simple example but it is possible to combine any number of expressions and standard mathematical syntax to manipulate or create new series in this way, as documented in the Streamline User Guide (Section 6.21).


This will result in a chart that looks similar to the image below:


In our case the clock frequency figure (133 MHz) is misleading as it is the average of 6 cores, 5 of which are powered down.



Understanding the Captured Data

Having reorganized the captured data we are now in a position to analyze what happened.


The program appears to be split into three main phases. The first 200 ms has a relatively low level of cache activity, followed by a further 100 ms phase with:

  • A large number of L1 data cache accesses (50.2 M).
  • A virtually equal number of L1 and L2 data cache refills (1.57 M each).
  • A negligible number of L1 data TLB refills (26 K).
  • A low L1 data cache refill ratio (3.1%), although a relatively high L2 data cache refill ratio (33.2%).


This suggests a lot of data is being processed but the caches are being well utilized. The relatively high L2 data refill ratio would be a cause for concern, however with a low L1 refill ratio it suggests that the L2 cache is simply not being accessed that frequently – something which is confirmed by the low number of L2 cache accesses (4.7 M) vs. a high number of L1 cache accesses (50.2 M). The L2 cache will always perform at least some refills when operating on new data since it must fetch this data from main memory.


There is then a subsequent 2200 ms phase with:

  • A slightly larger number of L1 data cache accesses (81.5 M over the period), but a significantly reduced rate of L1 data cache accesses (37 M accesses per second compared to 502 M accesses per second in the first phase).
  • A significantly increased number of L1 data cache refills (26.9 M).
  • A similar number of L2 data cache refills (2.1 M).
  • A vastly increased number of L1 data TLB refills (24.9 M).
  • A much higher L1 data cache refill ratio (33.0%) and a much lower L2 data cache refill ratio (2.03%).


This hints at a similar level of data consumption (based on the fact that the L2 cache has a similar number of refills, meaning the actual volume of data collected from main memory was similar), but much poorer cache utilization (based on the high L1 data cache refill ratio).


This is the sort of pattern to watch out for when profiling applications with Streamline as it often means that cache utilization can be improved. As the L1 data cache refill ratio is high while the L2 data refill ratio is low the program appears to be thrashing the L1 cache. Were the L2 data refill ratio also high the program would be thrashing the L2 cache, however in this case it may be that the program is consuming unique data – in which case there is very little that can be done. However in situations where the same data is being operated on multiple times (as is common) this access pattern can often be significantly improved.


In our case the cache-test application sums the rows of a large 2-dimensional matrix twice. The first time it accesses each cell in Row-Major order – the order the data is stored in the underlying array:

for (y = 0; y < iterations; y++)
  for (x = 0; x < iterations; x++)
    sum_1d[y] += src_2d[(y * iterations) + x];


Whereas the second time it accesses each cell in Column-Major order:

for (x = 0; x < iterations; x++)
  for (y = 0; y < iterations; y++)
    sum_1d[y] += src_2d[(y * iterations) + x];


This means the cache is unable to take advantage of the array’s spatial locality, something which is hinted at by the significant jump from a negligible number of L1 data TLB refills to 24.9 million. The TLB (Translation Lookaside Buffer) is a small cache of the page table; the Cortex-A57’s L1 data TLB is a 32-entry fully-associative cache. A large number of misses in the TLB (i.e. the result of performing uncached address translations) can be indicative of frequent non-contiguous memory accesses spanning numerous pages – as is observed in our case.

The cache-test program operates on a 5000x5000 matrix of int32s – or 95.4 MB of data. The Cortex-A57 uses a 64-byte cache line, giving a minimum of 1.56 M cache line fills to completely retrieve all the data. This explains the virtually equal L1 and L2 data cache refill counts (1.57 M each) in phase 1, where the data is being accessed in order, and explains why they must be this high even in the best case.



Fixing the Issue

In this simple case we can improve the cache utilization by switching around the inner and outer loops of the function, thus achieving a significant performance improvement (in our case a 22x speed increase) at no additional cost.


In real-world examples, where it may not be as easy to locate the exact area of inefficiency, Streamline’s source code view can be used to help pinpoint the issue. To use this it is necessary to load the application’s binary, either as described earlier or after capture by right-clicking the report in the Streamline Data view, selecting Analyze... and adding the binary. If the binary contains debug symbols, source-code-level information will be available in the Code tab; otherwise only function-level information will be available (in the Functions tab, and also from the Timeline Samples HUD). Function-level information will still provide a good clue as to where to look, however. Provided debug symbols are available, clicking through the offending functions in the Functions tab gives a view similar to the one below.


The annotations to the left of each source line show the number of occasions that line was being executed when a sample was taken, and that count as a percentage of the samples for the function. Using the Timeline Samples HUD we can identify the “yx_loop” function as being responsible for the majority of the samples from our code (1617) throughout the second phase (which we identified as having poor cache utilization). Clicking through this function in the Samples HUD or the Functions tab, we can see 1584 samples on the line within the nested for-loop – suggesting this loop deserves a second look. In our case this is a particularly simple function consisting only of this loop, but in a more complex function this approach would offer much greater insight into the exact spot where most of the time is being spent.




I have attached the source to the simple cache-test example. It is currently in the process of being added to the examples bundled with DS-5, so it will be included with future product versions. I will update this blog post when that happens.


Feel free to post any comments or questions below and I will respond as soon as possible.

Usually when you create a bare-metal image you specify the location in memory where the code and data will reside, and provide an entry point address where execution starts.

But what if you don't want to specify a fixed memory location at build time?

Security has become a crucial aspect of applications. One common way to gain privileges on a system is through buffer overflows: such an anomaly can lead to the execution of malicious code, jeopardizing the security of the entire system through code injection.

Various techniques are used to make an attacker's life harder, including address space layout randomization (ASLR). This technique is widely used by operating systems such as Android, iOS, Linux and Windows.

With ARM Compiler 6 you can extend this protection to bare-metal applications by creating Position Independent Executables (PIE), also known as Position Independent Code (PIC). A PIE is an executable that does not use fixed addresses to access memory. Rather, it can be executed at any suitably aligned address and the code automatically recalculates the required addresses.

ARM Compiler 6 provides the -fbare-metal-pie (armclang) and --bare_metal_pie (armlink) options to let you create a bare-metal PIE:

armclang -fbare-metal-pie -target armv8a-arm-none-eabi source.c 
armlink --bare_metal_pie source.o

Note: armclang automatically passes the --bare_metal_pie option to armlink when you compile with -fbare-metal-pie.

Note: Bare-metal PIE is currently only supported for 32-bit targets.


Worked Example Part 1: Creating a PIE

Let's take a look at how this works in practice.

This example creates a very simple "Hello World" program in DS-5, uses ARM Compiler 6 to create a PIE, then uses the DS-5 Debugger and the AEMv8-A model to run the executable at an arbitrary position in memory.


Step 1: Create a "Hello World" C project in DS-5 Debugger

  1. Create a new C project in DS-5 called PIEdemo (Click File > New > Other... to start the New Project wizard), using Project type: Empty Project and Toolchain: ARM Compiler 6 (DS-5 built in).
  2. Add a new source file pie.c to the new project (right-click the project, then click New > Source File) with the following content:

    #include <stdio.h>

    const char *myString = "Hello World\n";

    int main()
    {
        printf("%s", myString);
        return 0;
    }

Step 2: Compile the source code to create a PIE

  1. Edit the project properties (right-click the project, then click Properties) and navigate to the ARM Compiler toolchain settings (C/C++ Build > Settings).
  2. Add the following command-line options:

    • ARM C Compiler 6 > Target > Target: armv8a-arm-none-eabi (this compiles for AArch32)
    • ARM C Compiler 6 > Miscellaneous > Other flags: -fbare-metal-pie -mfpu=none
    • ARM Linker 6 > Miscellaneous > Other flags: --bare_metal_pie
  3. Build the project (right-click the project, then click Build Project).


Step 3: Create a debug configuration for the AEMv8-A model

  1. Create a new debug configuration (right-click in the Debug Control tab, then click Debug Configurations..., then click the New Launch Configuration button).
  2. On the Connection tab:
    1. Select the VE_AEMv8x1 > Bare Metal Debug > Debug AEMv8-A target.
    2. Add the model parameter: -C cluster.cpu0.CONFIG64=0. This puts the model in AArch32 state, rather than the default AArch64 state.


  3. On the Debugger tab, select Run control: connect only.

    We want to load the image manually so that we can specify the load address.

Step 4: Run the PIE on the AEMv8-A model

  1. Double-click the debug configuration to connect to the AEMv8-A model target.
  2. Load the PIE by running the following command on the Commands tab:

    loadfile PIEdemo/Debug/PIEdemo.axf 0x80000044

    This loads the PIE at the arbitrary address 0x80000044, performs all necessary address relocations, and automatically sets the entry point:


    Note: You can choose any address, but it must be suitably aligned and at a valid location in the AEMv8-A memory map. For more information, see AEMv8-A Base Platform memory map in the Fast Models Reference Manual.

    Note: You can ignore the TAB180 error for the purposes of this tutorial. For more information, see ARM Compiler 6: Bare-metal Hello World C using the ARMv8 model | ARM DS-5 Development Studio.

  3. Execute the PIE by running the following command on the Commands tab:


    Check the Target Console tab to see the program output:


How Does It Work?

Position independent code uses PC-relative addressing modes where possible and otherwise accesses global data indirectly, via the Global Offset Table (GOT). When code needs to access global data it uses the GOT as follows:

  • Evaluate the GOT base address using a PC-relative addressing mode.
  • Get the address of the data item in the GOT by adding an offset index to the GOT base address.
  • Look up the contents of that GOT entry to obtain the actual address of the data item.
  • Access the actual address of the data item.

We'll see this process in action later.

At link time, the linker does the following:

  • Creates the executable as if it will run at address 0x00000000.
  • Generates a Dynamic Relocation Table (DRT), which is a list of addresses that need updating, specified as 4-byte offsets from the table entry.
  • Creates a .preinit_array section, which will update relocated addresses (more about this later…).
  • Converts function calls to direct calls.
  • Generates the Image$$StartOfFirstExecRegion symbol.


At execution time:

  • The entry code calls __arm_preinit_.
  • __arm_preinit_ processes functions in the .preinit_array section, calling __arm_relocate_pie.
  • __arm_relocate_pie uses Image$$StartOfFirstExecRegion (evaluated using a PC-relative addressing mode) to find the actual base address at which the image has been loaded, then processes each entry in the DRT, adding the base address offset to each address entry in the GOT and to each initialized pointer in the data area.



Worked Example Part 2: Stepping through PIE execution with DS-5 Debugger

Our example from earlier contains the global string "Hello world". Let's see how relocation is used in the PIE to access this data regardless of where the image is loaded.

In the Project Explorer view, double-click on the .axf executable to see the sections it contains:


We can see that the GOT is located at address 0x00000EE0 in the original image.

Now load the image to address 0x80000044 by running the following command on the Commands tab:

loadfile PIEdemo/Debug/PIEdemo.axf 0x80000044

Use the Disassembly view to view address 0x80000F24 (0x80000044 + 0x00000EE0). We can see that the GOT has been loaded, but it still contains unrelocated addresses:


Now, set a breakpoint on main() and run the executable. This executes the setup code, including __arm_relocate_pie which relocates the addresses in the GOT. Run the following commands on the Commands tab:

b main

Look at the GOT again, and note that the addresses have been relocated:


Now we'll see how the code uses the GOT to access the "Hello World" string.

Step to the next source instruction by running the following command on the Commands tab:


Jump to address $pc in the Disassembly view to view the code in main():


The code to print "Hello World" starts at 0x800000E4 and does the following:

  1. Load R1 with the GOT offset for our string (0xC), obtained by a PC-relative data lookup from address 0x80000118.
  2. Load R2 with the PC-relative offset of the GOT table (0xE30).
  3. Update R2 with the actual base address of the GOT table, PC + 0xE30 (0x800000F4 + 0xE30 = 0x80000F24).
  4. Load R1 with the contents of address R1 + R2 (that is, address 0x80000F24 + 0xC = 0x80000F30). The contents of this GOT entry, 0x80000F68, is the address of the pointer to the "Hello World" string.
  5. Load R1 with the target address of the pointer, copy it to R0, and call puts.

You can single-step through the code and use the Registers view to see this working in DS-5 Debugger.


Further Reading
