
Software Development Tools


We've just released ARM DS-5 Development Studio v5.22, and we have made Streamline more powerful and user-friendly. In this blog, I will highlight the major changes in the latest version. For a more detailed list of enhancements and fixes, please see the changelog.

 

Android trace events alongside an extensive list of standard system events

 

Android supports trace events, and these events are written to a system trace buffer. We can use the Systrace tool, provided by Android, to collect and visualize these events. In the DS-5 v5.22 release, we have enhanced Streamline to support Android trace events. We can now see performance counters and charts like CPU and GPU activity alongside standard Android trace events.

streamline-android-trace-events.png

Figure 1 Streamline showing Android trace events

 

For example, in the above capture, you can inspect a frame by looking at various Android SurfaceFlinger events like onDraw and eglSwapBuffers.
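
Applications can also write their own events into the same trace buffer, and these appear in Streamline alongside the system events. As a minimal sketch (assuming an NDK recent enough to provide <android/trace.h>, i.e. API level 23 or later; Java code can use android.os.Trace in the same way, and the function name here is hypothetical):

#include <android/trace.h>  // NDK tracing API; link with -landroid

static void draw_frame(void)
{
    ATrace_beginSection("MyApp_drawFrame"); // custom event, shown next to SurfaceFlinger events
    /* ... rendering work ... */
    ATrace_endSection();
}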

 

Profile Mali-T400 series GPUs without the kernel source

 

Streamline requires an agent called gator to be installed and running on the ARM Linux target. Gator can operate in two modes:

(a) kernel space gator – using a kernel module called gator.ko.
(b) user space gator – without the kernel module.

As user space gator is restricted to using user space APIs, it does not support all the features that kernel space gator supports. However, user space gator is easier to use, as you do not need the target's Linux kernel source to build the kernel module. Given this ease of use, we are working towards enhancing the features supported by user space gator. With this release, we are happy to announce that user space gator now supports the Mali-T400 series of GPUs. Note that you will need a recent version of the Mali DDK, which exports system events to user space. Going forward, you can expect us to add support for more Mali graphics processors.

 

Automatic fetch of symbol and other information from files on the target

 

Streamline needs symbol information to correlate the captured events with the code being run. In the past, we had to provide this image information manually. This can be tricky if the image is available only on the target and not on the host. In the v5.22 release, we have introduced an automatic image transfer feature to handle this situation.

streamline-automatic-image-transfer.png

Figure 2 New textbox for selecting processes whose images are automatically fetched from the target

 

This is best shown with an example. In my case, I want to run the dhrystone executable on my Nexus 9 and see the function profile. As a first step, I run the program via adb and start the Streamline session. During the session, I can now see a new box at the bottom, as seen in the picture above. Here, I can type a pattern (“dhr” in my case) to select the list of processes. Streamline will automatically fetch symbol information for the selected processes from the target. In my case, Streamline shows the function profile for dhrystone, as seen in the picture below, without my having to provide the image manually.

streamline-drystone-function-profile.png

Figure 3 Streamline showing function profile for the dhrystone process

 

Streamline snippets during live capture

 

Streamline snippets are now available during live capture. As you might recall, snippets are a powerful feature that lets users track complex counters derived from a combination of more basic counters. For example, as seen in the picture below, you can track clocks per instruction (CPI) using the $ClockCycles and $InstructionExecuted counters.
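
As a minimal sketch (assuming the counter variables are named as shown in the Figure 4 snippet), the CPI expression is simply the ratio of the two counters:

$ClockCycles / $InstructionExecuted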

streamline-cpi-snippet.png

Figure 4 CPI snippet

 

Conclusion

 

DS-5 v5.22 comes with an enhanced Streamline, with useful features like support for Android trace events, automatic symbol loading from the target, and profiling of Mali-T400 series GPUs with user-space gator, amongst others. You can get all these features and more by downloading DS-5 v5.22 from here. Sign up to the DS-5 newsletter to get updates, blogs and tutorials delivered to your inbox.

Several ARM partners such as Clarinox, Micrium, Oryx-Embedded, wolfSSL and YOGITECH are using Software Packs to deliver middleware. This simplifies installation, usage, and project maintenance of software components. We have created a new Partner Pack website that gives you an overview of the currently available Packs, covering a wide range of use cases:

  • Functional safety
  • Real-time operating systems
  • Security/encryption
  • TCP/IP networking and
  • Wireless stacks

Use Pack Installer to install one of these Packs automatically in µVision:


You may know that there is a team at the University of Szeged who are keen to move the web forward, especially on embedded systems. Several months ago an interesting question was raised to us that sounded simple but was hard to answer right away. The question was: how can one build a functional web browser?

 

If you are interested, check out my colleague's post at our blog site.

Any comments, feedback or even contributions are welcome! (Comments can be left either here or on our blog.)

It seems like just yesterday that we released ARM Compiler 6.01, and it's already time for a new major release of the most advanced compiler from ARM.


Let’s see the major highlights for this release:

  • Update of C++ libraries
  • Performance improvements
  • Enhanced support for ARMv7-M and ARMv6-M cores

 

Update of C++ libraries

Previous versions of ARM Compiler included only the Rogue Wave C++ libraries, which haven't been updated beyond the C++03 standard. In ARM Compiler 6.02, we are moving closer to the leading edge by incorporating libc++ from the LLVM project, which has passed our extensive internal validation suites.


The new libraries support the C++11 and C++14 standards and, in conjunction with the LLVM clang front-end, make ARM Compiler 6.02 the most modern and advanced toolchain for developing software for your ARM-based device. Read about some of the advantages of the new C++ standards in my recent blog post on C++11/14 features.


If you want to use the old libraries, you can still do so by using the --stdlib=legacy_cpplib command line option.
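
As a sketch of a typical invocation (this assumes the option is passed to armlink at link time; check the ARM Compiler 6 documentation for your version):

armclang -target armv8a-arm-none-eabi -c main.cpp
armlink --stdlib=legacy_cpplib main.o -o main.axf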


Performance improvements

Performance is an important aspect of a toolchain, and benchmarks are a convenient (although not perfect) way to evaluate the quality of the optimizations performed by the compiler.

Over the last few months, ARM engineers have worked on identifying and implementing optimization opportunities in the LLVM backend for ARM. The results are shown in the following graph.

dhry-coremark-m0m7.png

As you can see, the improvements between ARM Compiler 6.01 and ARM Compiler 6.02 are significant and show we are working in the right direction. Even though your code base differs from a synthetic benchmark, you may see a boost as well: give it a try!


Enhanced support for ARMv7-M and ARMv6-M

clang is often used to build high-performance code for Cortex-A cores and plays a fundamental role in this area. Embedded ARM microcontrollers have been less of a focus for the LLVM community, and ARM is now filling the gaps by making ARM Compiler 6 a toolchain able to build efficient code across the whole range of ARM processors, from the smallest Cortex-M0+ to the latest Cortex-A72 64-bit processor.


ARM engineers have focused on Cortex-M processors, and we are now confident enough to change the support level for the Cortex-M family from alpha to beta: this means that the code generated for the ARMv7-M and ARMv6-M architectures has reached a good quality level and has been sufficiently tested by ARM (there is still work to do, hence the beta label). We expect to complete support for ARMv7-M and ARMv6-M in the next release of ARM Compiler at the end of this year.


If you want to know all the changes in this release of the compiler, take a look at the release notes on the ARM Infocenter.


This version of the compiler will be included in the next version of DS-5 (5.22), but if you can't wait, you can get the standalone version from ds.arm.com and add it to DS-5 (if you have DS-5.20 or greater) as shown in this tutorial.


As always, feel free to post any comment or question here or send me an email.

Any feedback is welcome and it helps us to continue delivering the most advanced toolchain for ARM from ARM.

 

Ciao,
Stefano

Dear Friends,

 

 

here are a few lines of GCC assembly code to make your Cortex-M4 interrupts fully
reentrant. Please read the notes from Sippey before proceeding to the implementation
details on this page.

 

NOTE1: The code uses a large amount of stack (32 or 136 bytes per reentrant call, depending
on whether floating point is used), so be careful with excessive use of re-entrancy and remember to size the stack appropriately. When you use this code within MATLAB/Simulink, you need at least 136 extra bytes for each sampling rate in the Simulink schematic.

 

NOTE2: This code is inspired by and optimized from the work of other authors, who know ARM assembly and the Cortex architecture better than I do.

 

 

NOTE3: The re-entrant code assumes that at interrupt exit the processor returns to task
space (whether on PSP or MSP). Hence, to avoid messing up the stack, the preemption
function should only be called by the lowest-priority interrupt in the program.

 

 

Function description:

 

 

RIPrun( FUNCTION ):

    - First pushes a dummy stack frame (only 32 bytes) onto the stack and returns from the interrupt.

    - The return address programmed into the dummy frame points back into the same function, so
      that the rest of the code executes as if in thread mode (instead of at the
      interrupt's priority).

    - Once back in thread mode, the code calls FUNCTION. This is a normal
      function call (i.e. the stack is saved again by the processor mechanism).

    - At return it generates a software-triggered SVC interrupt to restore the stack.

 

 

SVC_HANDLER

    - Determines which SVC code was called.

    - For any other code, the traditional interrupt handler is executed.

    - Otherwise we call RIPrestore, which cleans up the original interrupt stack.

 

 

NOTE: Why do we restore the stack in the SVC instead of in RIPrun? The Cortex CPU supports
two threading models, using one or two different stacks (PSP/MSP) depending on the mode.
Hence the original stack is saved on a stack that depends on the threading model. The
SVC call ensures that the processor recovers the stack appropriately.

 

 

MAJOR differences from Sippey's code:

    1st: use of defines to decide which priority levels and callback procedures to use;

    2nd: all implementation is done using GCC inline assembly;

    3rd: use of naked "C" functions to limit the overhead of the function call;

    4th: the RIPrun function locally encodes the return address
        (ADDW  R0, PC, 16 ; skip 8 instructions from here)
        to simplify the code.

 

 

/**
* Reentrant Interrupt Procedure Call (RIPC)
*
*
* ARM-GCC code to implement REENTRANT interrupt procedures.
* Source of inspiration:
*     - "The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors" 3rd ed.
*     - Sippey code for KEIL: Sippey (sippey@gmail.com)
*
*
*
*
* ESSENTIAL INFORMATION
*
* CORTEX M4 Register Structure
*   - CPU 16 Register (R0-R15) + PSR
*   - FPU 32 Register (S0-S31) + FPSCR
*
* Calling CONVENTIONS:
*   - R0-R3, R12, LR, and PSR are called “caller-saved registers.”
*   - R4-R11                  are called “callee-saved registers.”
*
*   -  S0-S15 + FPSCR are “caller saved registers.”
*   -  S16-S31        are “callee-saved registers.”
*
* Typical Calling Layout
*         R0/R1 is Return Result Value if any
*         R0-R3 are parameter value (with the above exception)
*         R12 is a scratch register
*         R13 used to store SP
*         R14 link register (return address)
*         R15 is Program Counter
*
*
* Stack Structure (growing from TOP to LOW memory)
*     BEWARE for efficiency Stack is manipulated aligned to 8 bytes always
*            in case of ODD number of registers it gets padded with white space
*
* Cases:      NOFPU               FPU
*           PREVIOUS TOP       PREVIOUS TOP      LAST STACKED ITEM
*         32  (pad align 8)      (pad align 8)    if PADDING present xPSR bit9 == 1
*            28  xPSR           96    FPSCR
*            24  ReturnAddr     92    S15
*            20  LR             88    S14
*            16  R12            84    S13
*            12  R3             80    S12
*             8  R2             76    S11
*             4  R1             72    S10
*             0  R0*            68    S9                NO FP Stack pointer here
*          ==============       64    S8
*             8 REGs            60    S7
*                               56    S6
*      (total 8x4=32bytes)      52    S5
*                               48    S4
*                               44    S3
*                               40    S2
*                               36    S1
*                               32    S0
*                               28    xPSR
*                               24    ReturnAddr
*                               20    LR
*                               16    R12
*                               12    R3
*                                8    R2
*                                4    R1
*                                0    R0*               FP Stack pointer here
*                           ====================
*                            8+17 = 25 REGS PADDED to 26   (Total 26*4=104bytes)
*
*         The return address is the stacked PC
*         While Stacked LR was previous return address
*
*         BX LR is return from subroutine
*         if LR starts with 0xFxxxxxxx then it is interpreted as Return from Interrupt (Exception Return)
*         Possible Exception return values are:
*
*         if FPU was used before interrupt call
*         0xFFFFFFE1 Return to another exception using MSP (Master)
*         0xFFFFFFE9 Return to thread using MSP (Master)  stack pointer
*         0xFFFFFFED Return to thread using PSP (process) stack pointer
*
*         if FPU was not used before CALL
*         0xFFFFFFF1 Return to another exception using MSP (Master)
*         0xFFFFFFF9 Return to thread using MSP (Master)  stack pointer
*         0xFFFFFFFD Return to thread using PSP (process) stack pointer
*
*/

#include <misc.h>
#include <stm32f4xx.h>

// Lazy using strings to pass parameter to Assembly code
#define SVC_CALL_NUMBER       "0"     // SVC_CALL_NUMBER being used
#define PRI_LEVEL_LOCK        "240"   // Level 15 for STM32F4

static void RIPCrun( void (*fcn)(void) ) __attribute__ (( naked, used ));
static void RIPCrestoreSP( void ) __attribute__ (( naked,used ));

/**
* This is NEW default handler for standard SVC if used override if
* required as usual in CM4
*/
__attribute__(( weak,used )) void SVC_Orig_Handler()
{
    while(1); // No other default service! Catch or return?
}


/**
* \brief RIPCrun makes the interrupt reentrant. It pushes a dummy
* stack, loads a fake return address depending on the FPU and call type
* and returns. The return address is given as param.
* Usage example
*     void SysTickHandler()
*     {
*            // NON REENT CODE BEFORE
*         RIPCrun(reentrant_Handler); // Call to reentrant code
*     }
*
* To avoid undesired preempt. The call is made in two stages,
* first we call/return to RIPstub that on its own calls desired
* Handler
*
* Note that the interrupt being made reentrant should have the lowest
* priority.
*/
static void RIPCrun( void (*fcn)(void) )
{
                                                // R0 at entry contains the jumping address
    __asm volatile(
#ifdef __FPU_USED
            " TST LR, #0x10                  \n" /* Test bit 4 to check usage of FPU register */
            " IT EQ                          \n"
            " VMOVEQ.F32 S0, S0              \n" /* Mark FPU used for Lazy stacking operation  */
#endif
            " MRS  R1, xPSR                  \n" // Should be xPSR ??
            " PUSH {R1, LR}                  \n" /* Push PSR and LR on the stack*/
            " SUB  SP, #0x20                 \n" /* Reserve additional 8 words for a complete dummy stack return*/
            " STR  R0, [SP]                  \n" // Pass the R0 to Callee in return
            " ADDW  R0, PC,16                \n" // RIPCservice  (SKIP 8 Instruction from here)
            " STR  R0, [SP, #24]             \n" // Handler Launcher in thread (Temp return addr)
            " MOV  R0, #0x01000000           \n" // Generate a fresh new PSR
            " STR  R0, [SP, #28]             \n" // and store it (PSR) in proper offset
            " MOV  R0, #0xFFFFFFF9           \n" // Create a return value for ISR return to MSP no FP (8 Word frame)
            " MOV  LR, R0                      \n" // and place it to LR to emulate standard ISR return
            " BX   LR                         \n" // The return here will use our dummy stack

            // RIPCService
            /**
             * No we exited the interrupt and enter immediately here (SP+24 to this address).
             * At return the R0 register will be populated from the dummy stack with the parameter passed
             * to the RIPrun (ex R0) and we will jump there immediately.
             * Note this procedure call will be handled on the MSP stack, whatever the original
             * THREAD stack was (PSP or MSP).
             */
            " BLX  R0                        \n" // RIPService Call function desired
            " MOVS  R0, #" PRI_LEVEL_LOCK "  \n" // Rearrange PRIORITY level to
            " MSR  BASEPRI, R0                  \n" // Block further trigger on our base interrupt
            " ISB                            \n" // ISB required to wait for BASEPRI effect (avoid further preemption)
            " SVC  #" SVC_CALL_NUMBER "      \n" // Replace here with desired syscall number
//            " BL   RIPCerror                 \n" // SVC will reset stack, we should not return here
            );
    while(1); // We should never get here, otherwise stack was messed up!
}



/**
* \brief Control logic is the following
*         if (GET_SVC_NUMBER == SVC_CALL_NUMBER)
*                  RIPsvc();
*         else
*                  SVC_Orig_handler();
* This handler and the RIPCsvc function are restoring the stack and hence should be protected against
* further reentrant interrupt of the same kind otherwise the stack can be messed up.
* The SVC handler always executes with MSP stack, but the original SVC service number can be stored in
* MSP or PSP. Hence the initial test serves to properly extract the SVC number.
*/
__attribute__(( naked )) void SVC_Handler()
{
    __asm volatile(
            " TST    LR, #0x04               \n" /* Test EXC return bit 2 (MSP or PSP?)*/
            " ITE    EQ                      \n" // if 0
            " MRSEQ  R0, MSP                 \n" // Get SP from MSP
            " MRSNE  R0, PSP                 \n" // else use PSP
            " LDR    R1, [R0,#24]             \n" // This is offset of stacked PC
            " LDRB.W R0, [R1, #-2]           \n" // Check SVC calling service
            " CMP    R0, #" SVC_CALL_NUMBER "\n" // Replace here with desired syscall number
            " BEQ    RIPCrestoreSP           \n" // use our modified SVC handler
            " B      SVC_Orig_Handler        \n" // else jump to the original handler
            );
    while(1); // We should never get here, otherwise stack was messed up!
}

/**
* \brief this function is called after the SVC handler properly identified we are
* returning from a reentrant interrupt.
*
* OPERATIONS:
*   -  We restore BASEPRI set to avoid nesting of SVC_handler (which produces a fault).
*   -  We remove the stack provided by the SVC_Handler call.
*   -  We recover PSR and LR as for the original storage in the RIPCrun
*   -  We return this SVC using the stack pushed for the RIPCrun.
*
* DOUBT: Why trigger lazy stacking here? Does it copy values into a dummy stack which
*  is trashed a couple of instructions later?
*/
static void RIPCrestoreSP( void )
{
    __asm volatile(
            " MOVS R0, #0             \n" /* Use the lowest priority level*/
            " MSR  BASEPRI, R0      \n" // to re-enable the interrupts
            " ISB                   \n" // Ensure synchronization
#ifdef __FPU_USED
            " TST LR, #0x10         \n" /* Test bit 4 to check usage of FPU register */
            " IT EQ                 \n"
            " VMOVEQ.F32 S0, S0     \n" /* Mark FPU use for Lazy stacking operation  */
#endif
            " TST LR, #0x10         \n" /* Test bit 4 to check usage of FPU register */
            " ITE EQ                 \n"
            " ADDEQ SP, SP, #104     \n" // Restore stack properly
            " ADDNE SP, SP, #32     \n"
            " POP {R0, R1}          \n" /* Pop the saved PSR and LR */
            " MSR APSR_nzcvq,R0     \n" // Should be xPSR ??
            " BX   R1                \n" // Finally jump to R1
            );
    while(1); // We should never get here, otherwise stack was messed up!
}

#define TEST_REENT
#ifdef  TEST_REENT

#define NESTLEVEL 20

static int pass = 1;
float NPI[20];
unsigned int stackIN[NESTLEVEL];
unsigned int stackOUT[NESTLEVEL];
unsigned int nesting = 0;

/**
* \brief executes some FP operation. Marks stack at entrance and exit and
* waits in the middle for a number of nested recursion.
*
* Note that stack consumption is about 72 bytes per non-FP reentrant call
* and 144 bytes per FP reentrant call. This is due to the double procedure
* call set up at each interrupt (i.e. the original stack frame is
* preserved until the end, plus one procedure call goes through the BLX).
*
* We also have 8 local bytes on the stack.
*
* Which makes 32+8+32 (two complete stack frames + 8 bytes for the temporary PSR & LR),
*
* or 104 + 32 + 8 = 144 in the case of an FP call stack.
*
* The overhead w.r.t. the standard mechanism is hence 40 bytes.
*
* Beware: make sure the stack is large enough for the reentrancy.
*/
void ReentTickTest()
{
    register unsigned int *stackref;
    int a=0,lev;

    lev=nesting++;
    __asm__ ("mov %0, sp" : "=g" (stackref) : );
    stackIN[lev]=(uint32_t)stackref;
    NPI[lev] = 3.1415926535f*lev;

    while((a<6)&&(pass==1))
    {
        // Wait for Rentrancy
        a=nesting;
    }
    pass=0;
    __asm__ ("mov %0, sp" : "=g" (stackref) : );
    stackOUT[lev]=(uint32_t)stackref;
    if (lev==0) pass=2;
}




void SysTick_Handler()
{
    RIPCrun(ReentTickTest);
}


int main(void)
{
    float jj,kk;
    jj = 3.14;
    kk = jj*2;
    jj=kk;

    // The chosen IRQn should be the lowest priority in the system so that we
    // are sure that when this interrupt exits we will return to thread
    // mode with a well-known stack recovery mechanism.
    //
    // The alternative is to disable the interrupt in the code, but this
    // violates the rule of MAX 12 cycles for interrupt latency, which is
    // one of the best features of Cortex
    NVIC_SetPriority(SysTick_IRQn,15);
    SysTick_Config(16000);

    for (;;)
    {
        while(1)
        {
            if (pass==2) break;
        }
        pass=1;
        nesting = 0;
    }
}

#endif

In late summer (mid-August/September) we will run a series of webinars that will show you the advantages of the ARM Cortex-M7 processor family and the silicon implementations that are available today. The webinar series is as follows:

They are hosted by my colleagues Johannes Bauer and Matthias Hertel, who have lots of experience in the embedded space. While the first webinar will introduce the Cortex-M7 architecture and its advantages, the other two webinars will concentrate on the devices available from our silicon partners Atmel and STMicroelectronics. They will contain live demos of how to connect to the hardware and how to create your first applications with MDK Version 5.

 


On July 28th, 2015 (8 am PST / 5 pm CEST) I will be holding a webinar on how to create your own Software Pack.

 

Software Packs offer a great way to distribute software components in a well-defined manner within engineering groups. In this webinar, a Software Pack is created based on the Jansson C library, which is used for encoding, decoding and manipulating JSON data. I will show how to pack this software component together with an example project so that it can be shared with your fellow engineers. I will also discuss how to distribute a Pack to a wider audience.
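
To give a flavour of the component being packed, here is a minimal sketch of encoding a JSON object with Jansson (illustrative only; see the Jansson documentation for the full API):

#include <jansson.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    json_t *root = json_object();                 /* build {"sensor": "temp", "value": 23} */
    json_object_set_new(root, "sensor", json_string("temp"));
    json_object_set_new(root, "value", json_integer(23));

    char *text = json_dumps(root, JSON_COMPACT);  /* encode to a string */
    printf("%s\n", text);

    free(text);                                   /* json_dumps() returns malloc'd memory */
    json_decref(root);                            /* release the object tree */
    return 0;
}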

 

For registration, please visit Creating a Software Pack to Share with Developers

 

@embedded

Last quarter we started to blog about our work on GNU GCC and LLVM, because we think that sharing information is the key factor in cooperating with the open source community; we want to continue with the updates by sharing our achievements of the last quarter and our plans for the future. We will be at GNU Tools Cauldron 2015 on 7-9 August in Prague (Czech Republic): please come and talk with us. It is a great occasion to meet us in person and discuss open source contributions.

 

The following notes include partial information on what we’ve been working on in the last quarter and what we plan to do in the next one: for details please refer to the slides or get in touch with us.

 

GNU GCC

The last quarter was particularly important, with the release of the new major version, GCC 5.1! Thanks to the ARM engineers and to everyone who helped get this important milestone release smoothly out of the door. On the Cortex-R and Cortex-M profiles, ARM released GCC 4.9 for ARM Embedded Processors: you can find the release notes on the Launchpad website.

 

In terms of development, the majority of the effort went into improving ABI compliance and some performance tuning. As mentioned in the previous update, we added ARMv8.1 support to binutils, enabled GCC native tuning (-mcpu=native) and worked on ABI compliance for both the Cortex-A and Cortex-R/M toolchains.

 

What’s next?

For the next quarter, the plan is to complete what's left of ARMv8.1 support and to work on various optimizations, such as enhancing GCC loop invariants (PR65477, PR62173, PR62178), improving the cost model for Cortex-A53 and Cortex-A57, and improving CSEL code generation for AArch64.

Further improvements will be made to the selection of FP divide & multiply on Cortex-M, and support will be added for all AArch64 memory models.

 

LLVM

Even though it is relatively new, LLVM is quickly gaining popularity, and ARM is committed to supporting the community's development. The commercial toolchain we offer to our customers, ARM Compiler 6, is in fact based on the open source clang.

 

In the last quarter we worked on different aspects of the compiler, from adding support for the ARMv8.1 architecture to improving the usability of the command line interface: in collaboration with Linaro we improved architecture and core-name parsing, which is now cleaner and more usable than before.

In terms of performance, we've been working on several optimizations (alignment of global variables, minimization of stack usage (details in the LLVM lifetime markers section below), a new float2int pass, the PBQP register allocator, etc.), and we also set up a new Cortex-A53 performance tracking bot: read more about this in the section below.

 

What’s next?

In terms of future plans, we will remain focused on performance improvements across all the cores and on optimizing accesses to global variables in loops. We also plan to further improve the LNT WebUI to make it easier to detect performance changes tracked by the running bots.

 

LLVM lifetime markers

In the last quarterly update we mentioned the need to reduce stack usage, which is particularly important for the Android Open Source Project. Lifetime markers are used to identify when a particular stack slot becomes alive or dead along all control flow paths: these markers are generally ignored by most optimization passes, but they are important for reducing stack usage.

llvm_lifetime_markers.png

ARM engineers removed the previous limitation of a 32-byte minimum size for a marker, unveiling a few issues (primitive types use 1-, 2-, 4- or 8-byte stack slots) but contributing to an overall reduction in stack usage.
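
To illustrate why this matters, consider a hypothetical C function with two large locals in disjoint scopes (use() is a stand-in for any call that keeps the buffers alive). Lifetime markers let the compiler prove the two slots are never live at the same time, so it can overlap them:

extern void use(char *buf);

void process(int cond)
{
    if (cond) {
        char a[64];   /* lifetime starts and ends inside this block */
        use(a);
    } else {
        char b[64];   /* lifetime disjoint from a */
        use(b);
    }
    /* with lifetime markers, a and b can share a single 64-byte stack slot */
}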

 

LLVM public performance tracking bot

Development of compilers is a tough job! Each patch can affect not only the correctness of the code generation but also the performance of the generated code. Tracking performance can be really tricky, especially considering the number of devices and architectures LLVM supports. For these reasons, ARM committed to helping the community by adding a public Cortex-A53 tracking bot: the script executes a few benchmarks on LLVM top-of-trunk every 6 hours and publishes the results at http://llvm.org/perf

llvm_perf_report.png

There are still a few improvements that could be made to the system, but we feel this is going in the right direction and we hope the community will make good use of it!

 

For more details please refer to the full presentation given by Matthew Gretton-Dann available on YouTube and his slides (attached to this blog post).

We would like to hear what you are doing in the open source community, share ideas and cooperate for the good of the whole ecosystem. See you at GNU Cauldron in August!


Ciao,
Stefano

robkaye

DAC 2015, Fast Models 9.3

Posted by robkaye Jun 23, 2015

Earlier this month I attended DAC in San Francisco. We had a demo of Fast Models, some partner presentations and a poster session. I came away from the conference with the impression that while the technical conference remains vibrant, the exhibition portion is declining in importance. I first took part in the 1980s, and since then we have seen the birth of the Internet. In those far-off days we used to see large delegations from all parts of the world attend to find out the latest product information and get updates from the EDA vendors. Who can forget some of the creative ways that some of these promoted their products? Nowadays that information is largely available online and through various social media (like this one), decreasing the value of visiting the trade show: it may be convenient and efficient, but it's certainly a lot less fun.

 

A new demo involving Fast Models was shown by Aldec:

 

Aldec - 1.png

Aldec's demo platform for their Hybrid Virtual Prototype with Fast Models.


Hybrid platforms like this are becoming very popular when there is a need to connect a high-performance simulator representing the processor or processor subsystem to a more detailed model of other parts of the system. This could be for many reasons, which we have discussed in an earlier blog.


Immediately prior to DAC we released Fast Models version 9.3. We have moved to a quarterly release cycle (from half-yearly), which better serves the needs of ARM's IP roadmap. In this release we introduced support for new Cache Coherent Network models (CCN-502, CCN-512) and Mali Display Processors (Mali-DP500 and Mali-DP550). We also continued to advance the capabilities of the models: the two areas we are currently focused on are Timing Annotation and Checkpointing (Save and Restore).


Timing Annotation extends the use of the Virtual Prototype to early, high-level performance estimation. The functionality provides a mechanism for the user to insert estimated timings at key points in the Virtual Prototype to improve the correlation of the reported cycle counts with what will be achieved in hardware. The aim is to do this with minimal impact on the throughput of the model. We are adding Timing Annotation in stages: in this release the focus has been on the integrated cache models. Of course, the results are heavily dependent on the quality of the annotated values.

 

We also introduced a new type of system in the example Virtual Prototypes supplied with Fast Models. Previously we have delivered Fixed Virtual Prototypes (FVP) and Exported Virtual Subsystems (EVS), the former being standalone platforms and the latter being functionally equivalent examples that integrate with SystemC. The third category, which also works with SystemC, is called an SVP, or SystemC Virtual Prototype. The evolution from the EVS is that in the SVP, models are individually instantiated into SystemC rather than forming a monolithic subsystem. This gives the platform developer much more flexibility.

 

The second half of 2015 will see the continued evolution of Fast Models functionality and a burgeoning library of models.


Hopefully I'll be seeing some of you at the ARM TechCon in November where we'll be going into more detail on these capabilities.


ARM FAE Ronan Synnott explains DS-5 Development Studio at the 52nd DAC in the Moscone Center. DS-5 contains compilers, debuggers and the Streamline performance analyzer, which assist with every stage of SoC development. To find out more, please visit http://www.ds.arm.com



 

 

Do you have any questions? Please put them in the comment section below



We often hear of embedded software engineers avoiding the use of C++ for fear of a potential performance hit or code size explosion. Even though some features of C++ can have a significant impact on performance and code size, it would be a mistake to exclude the language completely because of this.

 

In this article I want to show a few additions in the C++11 and C++14 standards which can improve the readability of your code but won't affect performance, and thus can be used even with the smallest Cortex-M0+ core.

 

ARM Compiler 5 supports C++11, whereas ARM Compiler 6 supports both C++11 and the most recent C++14 (refer to the documentation for details). If not specified otherwise, ARM Compiler 6 assumes the C++03 standard, so you need to use the command line option --std=c++11 or --std=c++14 to use the newer standards. If you want to enforce conformance to a specific standard, you can use the command line option --pedantic-errors: armclang will then generate an error if you use extensions or features outside that standard.
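
For example, a sketch of compiling a single file for C++14 with strict conformance checking (the target triple matches the one used elsewhere on this blog; adjust it for your core):

armclang -target armv8a-arm-none-eabi --std=c++14 --pedantic-errors -c source.cpp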

 

Constexpr

The constexpr keyword was introduced in C++11, but C++14 removed a few constraints, making this functionality even more powerful. When a function is declared constexpr, the compiler knows that the result of that function can be evaluated at compile time and can use it accordingly.

Let's assume we want to create a static array sized by the number of bits set in a word; with C++03 we would have written something similar to the following code:


#include <cstdlib> // for malloc()

const int my_word = 0xFEF1; // bit mask

int *my_array;

int number_of_bits(int word) {
    int count = 0;
    while (word) {
        count += word & 0x1;
        word >>= 1;
    }
    return count;
}
...

my_array = (int*)malloc(sizeof(int) * number_of_bits(my_word));
...
 

With C++14 is possible to calculate this in a function and the result is available at compile time. The code can be transformed as:

const int my_word = 0xFEF1; // bit mask

constexpr int number_of_bits(int word) {
    int count = 0;
    while (word) {
        count += word & 0x1;
        word >>= 1;
    }
    return count;
}


int my_array[number_of_bits(my_word)];

Because the function is evaluated at compile time, the compiler can size and allocate the array statically, saving the call to malloc() at run time: readability and performance have been improved at the same time!
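
One simple way to confirm the compile-time evaluation (a small sketch using static_assert; 0xFEF1 has 12 bits set) is:

static_assert(number_of_bits(my_word) == 12, "evaluated at compile time");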


Binary literals

Often in our applications we need to use bit masks or perform bit operations: how many times have we written code similar to the following?

 

if (x & 0x20) { // 0010 0000
 ...
}
 

What does 0x20 mean in this code? For an expert programmer it is clearly checking whether the sixth least significant bit of x is set, but it can get trickier with more complex bit masks. C++14 makes this clearer: in the latest version of the standard it is possible to define binary literals, making the specification of bit masks much more readable:

 

if (x & 0b0010'0000) {
...
}

 

As you can see from the example, not only can we specify the bit mask directly, but we can also use ' as a digit separator to enhance readability even further. The generated assembly code is the same, but the source is easier to understand.

 

Range-based for loop

Most modern languages like Python and C# support range-based loops; this doesn't add more power to the language, but it improves the readability of the resulting code.

This functionality was added in C++11, so it's now possible to use range-based loops directly in your existing code.

Let’s take a look at an example:

 

int my_array[] = {1, 2, 3, 4, 5};
int sum_array(void) {
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        sum += my_array[i];
    }
    return sum;
}

 

This can be re-written as:

int my_array[] = {1, 2, 3, 4, 5};
int sum_array(void) {
    int sum = 0;
    for (auto value : my_array) {
        sum += value;
    }
    return sum;
}

The code reads better now, and we have also removed the size of the array from the for loop, which was a potential source of bugs (we would need to update it if we added a new element, for example).


The range-based for loop works with any type that has begin() and end() functions defined, so we can apply the same technique to std::vector:


int sum_array(std::vector<int> array) {
    int sum = 0;
    for (auto &value : array) {
        sum += value;
    }
    return sum;
}

In this case the improvements in readability are even greater and, as a result, the code is easier to understand and maintain.


Null pointer constant

Since the beginning of the C standard, we have used NULL to check the validity of a pointer. This led to confusion in C++ because NULL is equivalent to 0.

Let’s assume we have two functions with the same name and different arguments:


void Log_value(int value);   // first function
void Log_value(char *value); // second function

In C++, the following code has an unexpected effect from the developer's point of view.


Log_value(NULL); // will call the first function

In fact, by using NULL we expect the second function to be called but, because NULL is equal to 0, the first function is called instead.


In C++11 the keyword nullptr has been introduced and should be used instead of NULL, so that we can easily avoid this ambiguity:


Log_value(nullptr); // will call the second function

 

In this case, the second function is correctly called with an explicit null pointer value.

 

Summary

We have seen a few features of C++11 and C++14 which can be used without worrying about performance and which can enhance the readability of your code. This article covers just a few of them; you can find more information on Wikipedia (C++11 - Wikipedia, the free encyclopedia and C++14 - Wikipedia, the free encyclopedia) and on Standard C++11 and Standard C++14.

I hope you found this information useful and that you can soon start to use some of these features in your code base. As mentioned at the beginning, ARM Compiler 6 supports C++11 and C++14. If you don't have DS-5 yet, download a free 30-day evaluation of the Ultimate Edition to get started.

 

Feel free to post any questions or comments below.

 

Ciao,

Stefano

Introduction

Poor cache utilization can have a big negative impact on performance, and improving utilization typically has very little or no trade-off. Unfortunately, detecting poor cache utilization is often difficult and requires considerable developer time. In this guide I will demonstrate using Streamline to drive cache optimization and identify areas of inefficiency.

 

I have used the Juno ARM Development Platform for the purposes of this guide; however, the counters I use (or their equivalents) should be available on all ARM Cortex-A class processors, so the steps should be easily repeatable. Even without a platform to test on, the methodology should provide an insight into using Streamline to help guide optimization.

 

This guide assumes a basic level of knowledge of Streamline. Introductory information and getting started guides can be found in DS-5’s documentation or, along with other tutorials, on the website.

 

 

Setting up Streamline

Start by installing gator on the target. This is beyond the scope of this guide; see the readme in <DS-5 installation dir>/arm/gator/ for detailed information. Once installed, launch the gator daemon. I successfully used both the user-space and kernel-space versions of gator. The user-space version is sufficient in most cases; the kernel-space version is only required in some circumstances – I expand on this point later.
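
For example, on a typical Linux target launching the daemon is just (the path is illustrative; run as root, or adjust permissions, so gatord can access the performance counters):

cd /path/to/gator && ./gatord &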

 

Compile the attached cache-test application. It is sufficiently simple that it could be compiled on the device (if a compiler were available) or cross-compiled otherwise.

 

 

Configuring DS-5

Open the Streamline Data view in DS-5. Configure the Streamline connection using the Capture & analysis options () to use the gator instance running on the target. The other default configuration options should be sufficient, although you may optionally add the application binary to the Program Images section at the bottom for function-level profile information or, if the binary contains debug symbols, source-code-level profile information.

 

 

Adjust the Counter configuration () to collect events from:

  • Cortex-A57
    • Cache: Data access
    • Cache: Data refill
    • Cache: L2 data access
    • Cache: L2 data refill

 

 

In our case we are also collecting “Cache: Data TLB refill”, which will provide an additional measurement to analyze caching performance, as well as “Clock: Cycle” and “Instruction: Executed” which will provide an insight into how execution is progressing. We are also collecting from the energy measurement counters provided on the Juno development platform.

 

Further Information on the Target Counters

The counters listed above are specific to our particular platform – the Juno development board. This has a big.LITTLE arrangement of 2x Cortex-A57s and 4x Cortex-A53s; we will be running our program on one of the Cortex-A57 cores.

 

The ARM Performance Monitors extension is an optional, non-invasive debug component available on most Cortex-A-class cores. Streamline reads the Performance Monitor Unit (PMU) counters provided by this extension to generate its profiling information. Each of the processor counters observed within Streamline corresponds to a PMU event. Not all events described by the PMU architecture are implemented in each core; however, a core set of events must be implemented, including the “Cache: Data access” and “Cache: Data refill” events shown above (in PMUv2 and PMUv3). Thus these two events should be available on all Cortex-A-class cores which implement the architecture. For more detailed information on the Performance Monitors extension, see the relevant section of the ARM Architecture Reference Manual for ARMv7 (Chapter C12) or ARMv8 (Chapter D5) as appropriate.

 

The “Cache: L2 data access” and “Cache: L2 data refill” counters are also common (but not mandated) on cores with an integrated L2 cache controller; however, some cores have separate L2 cache controllers – for example the CoreLink Level 2 Cache Controller L2C-310. In this case the counters will be limited to what is available from the controller and whether Streamline supports it. In the case of the L2C-310, equivalent counters are available and it is supported in Streamline, but the counters are only readable using kernel-space gator (user-space gator can still read all the others). Ultimately the L1 cache counters give a good view of what's going on, so if you are unable to read counters from the L2 cache (for whatever reason) it is still possible to follow the steps in this guide to help perform cache optimization; it might just be slightly harder to see the full path of data through the cache system.

 

Most cores also provide additional PMU events (which will vary by core) to monitor cache usage and these can provide further information.

 

The Chosen Counters

The “Cache: Data access” counter (PMU event number 0x04) measures all memory-read or -write operations which access the L1 data cache. All L1 data cache accesses (with the exception of cache maintenance instructions) are counted, whether they resulted in a hit or a miss.

 

The “Cache: Data refill” counter (PMU event number 0x03) measures all memory-read or -write operations which cause a refill of the L1 data cache from: another L1 data cache, an L2 cache, any further levels of cache or main memory – in other words L1 data accesses which result in a miss. As above this does not count cache maintenance instructions, nor does it count accesses that are satisfied by refilling data from a previous miss.

 

The “Cache: L2 data access” and “Cache: L2 data refill” counters (representing PMU event numbers 0x16 and 0x17 respectively) measure as their L1 counterparts, except on the L2 data cache.

 

More detailed information on any of these events can be found in the Performance Monitors Extension chapter of the relevant ARM Architecture Reference Manual as linked above.

 

 

Capturing Data

After you have configured the target, press the Start capture button (). Once capturing has started, run the cache-test application on the target (as “./cache-test”). Depending on the performance of your target, this will take a few seconds to run and will output several messages before returning to the command prompt. When this happens, press the Stop capture and analyze button (). After a brief pause the analyzed data will be displayed.

 

 

Reformatting the Captured Data

You should now be presented with a chart looking similar to the image below:

streamline-screenshot-1.png

 

Filter this to just the cache-test application by clicking on the “[cache-test #<proc-id>]” entry in the process list below the charts. In the case of multiple processes of interest, the Ctrl key can be held down to select several. Having done this, depending on how long the capture session lasted and how long the program ran, there may be considerable space around it. Change the Timeline display resolution using the dropdown to the left of the Time index display above the charts (set to 100 ms in the example above) to zoom in.

 

The results are currently somewhat difficult to interpret, as all cache measurements are plotted on the same chart but have different ranges. Split the “Cache: Data access” and “Cache: L2 data access” measurements into a separate chart as follows:

  1. Click on the Charts Snippet menu () above the process list.
  2. Select Add Blank Chart. Enter “Cache Accesses” as the new chart’s Title and drag it above the “Cache” chart.
  3. On the “Cache” chart, open the Configuration Panel ().
  4. Amend the “Cache” chart’s title to “Cache Refills”.
  5. Using the handle (), drag the “Data access” and “L2 data access” series to the newly created “Cache Accesses” chart.
  6. Remove the blank “Required” series in the “Cache Accesses” chart ().
  7. Change the plotting method of both charts from Stacked to Overlay (using the drop-down box at the top left of the Configuration Panel), allowing the relationship between the values to be more apparent.
    In Overlay mode the series are plotted from the top of the list, down – i.e. the series at the bottom is plotted last, in front of all others. As a result some series may need rearranging to improve their visibility in Overlay mode (although colors are slightly transparent so no data is completely hidden).
  8. Optionally rename the series as appropriate – e.g. “Data access” may be more sensibly named “L1 data access” to complement the “L2 data access” series.
  9. Optionally change the colors of the series to improve their contrast.
  10. Close the Configuration Panel by pressing the button again ().

 

Having separated these two series the chart should now look similar to the image below:

 

Next we will produce some custom data series to provide additional information about the performance of the caches:

  1. Click on the Charts Snippet menu () above the process list.
  2. Select Add Blank Chart. Enter “Cache Refill Ratios” as the new chart’s Title and drag it below the “Cache Refill” chart.
  3. Enter “L1 data ratio” as the new series’ Name. Set the Expression to be “$CacheDataRefill / $CacheDataAccess”. As this result is a percentage (the ratio of L1 data cache refills to accesses – i.e. the miss rate), tick the Percentage checkbox.
  4. Add another series to the new “Cache Refill Ratios” chart () and repeat the process for the L2 cache, setting the Expression to be “$CacheL2DataRefill / $CacheL2DataAccess”.
    The expression will differ if using a separate L2 cache controller. Pressing Ctrl + Space in the Expression window will list all available variables.
    In our case the 0x04/0x03 and 0x16/0x17 counter pairs are explicitly listed in the ARMv8 ARM Architecture Reference Manual as being associated in this way. Some care should be taken when using a separate cache controller that this assumption still holds.
  5. Change the plotting method of the chart from Stacked to Overlay.
  6. Optionally change the colors of the series to improve their contrast.

 

This is a very simple example but it is possible to combine any number of expressions and standard mathematical syntax to manipulate or create new series in this way, as documented in the Streamline User Guide (Section 6.21).

 

This will result in a chart that looks similar to the image below:

 

In our case the clock frequency figure (133 MHz) is misleading as it is the average of 6 cores, 5 of which are powered down.

 

 

Understanding the Captured Data

Having reorganized the captured data we are now in a position to analyze what happened.

 

The program appears to be split into three main phases. The first 200 ms has a relatively low level of cache activity, followed by a further 100 ms phase with:

  • A large number of L1 data cache accesses (50.2 M).
  • A virtually equal number of L1 and L2 data cache refills (1.57 M each).
  • A negligible number of L1 data TLB refills (26 K).
  • A low L1 data cache refill ratio (3.1%), although a relatively high L2 data cache refill ratio (33.2%).

 

This suggests a lot of data is being processed but the caches are being well utilized. The relatively high L2 data refill ratio would be a cause for concern, however with a low L1 refill ratio it suggests that the L2 cache is simply not being accessed that frequently – something which is confirmed by the low number of L2 cache accesses (4.7 M) vs. a high number of L1 cache accesses (50.2 M). The L2 cache will always perform at least some refills when operating on new data since it must fetch this data from main memory.

 

There is then a subsequent 2200 ms phase with:

  • A slightly larger number of L1 data cache accesses (81.5 M over the period), but a significantly reduced rate of L1 data cache accesses (37 M accesses per second compared to 502 M accesses per second in the first phase).
  • A significantly increased number of L1 data cache refills (26.9 M).
  • A similar number of L2 data cache refills (2.1 M).
  • A vastly increased number of L1 data TLB refills (24.9 M).
  • A much higher L1 data cache refill ratio (33.0%) and a much lower L2 data cache refill ratio (2.03%).

 

This hints at a similar level of data consumption (based on the fact that the L2 cache has a similar number of refills, meaning the actual volume of data collected from main memory was similar), but much poorer cache utilization (based on the high L1 data cache refill ratio).

 

This is the sort of pattern to watch out for when profiling applications with Streamline as it often means that cache utilization can be improved. As the L1 data cache refill ratio is high while the L2 data refill ratio is low the program appears to be thrashing the L1 cache. Were the L2 data refill ratio also high the program would be thrashing the L2 cache, however in this case it may be that the program is consuming unique data – in which case there is very little that can be done. However in situations where the same data is being operated on multiple times (as is common) this access pattern can often be significantly improved.

 

In our case the cache-test application sums the rows of a large 2-dimensional matrix twice. The first time it accesses each cell in Row-Major order – the order the data is stored in the underlying array:

for (y = 0; y < iterations; y++)
    for (x = 0; x < iterations; x++)
        sum_1d[y] += src_2d[(y * iterations) + x];

 

Whereas the second time it accesses each cell in Column-Major order:

for (x = 0; x < iterations; x++)
    for (y = 0; y < iterations; y++)
        sum_1d[y] += src_2d[(y * iterations) + x];

 

This means the cache is unable to take advantage of the array's spatial locality, something hinted at by the significant jump in L1 data TLB refills from a negligible number to 24.9 million. The TLB (Translation Lookaside Buffer) is a small cache of the page table: the Cortex-A57's L1 data TLB is a 32-entry fully-associative cache. A large number of misses in the TLB (i.e. the result of performing un-cached address translations) can be indicative of frequent non-contiguous memory accesses spanning numerous pages – as is observed in our case.

The cache-test program operates on a 5000x5000 matrix of int32s – or 95.4 MB of data. The Cortex-A57 uses a 64-byte cache line, giving a minimum of 1.56 M cache line fills to completely retrieve all the data. This explains the virtually equal L1 and L2 data cache refill counts (1.57 M each) in phase 1, where the data is accessed in order, and explains why they must be this high even in the best case.
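
For reference, the arithmetic behind those figures:

5000 x 5000 cells x 4 bytes = 100,000,000 bytes ≈ 95.4 MB
100,000,000 bytes / 64 bytes per cache line ≈ 1.56 M line fills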

 

 

Fixing the Issue

In this simple case we can improve cache utilization by switching the inner and outer loops of the function, as shown below, achieving a significant performance improvement (in our case a 22x speed increase) at no additional cost.
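
Concretely, the fixed version of the second loop simply restores the row-major traversal of the first:

for (y = 0; y < iterations; y++)
    for (x = 0; x < iterations; x++)
        sum_1d[y] += src_2d[(y * iterations) + x];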

 

In real-world examples, where it may not be as easy to locate the exact area of inefficiency, Streamline's source code view can be used to help pinpoint the issue. To use this it is necessary to load the application's binary, either as described earlier or after capture by right-clicking the report in the Streamline Data view, selecting Analyze... and adding the binary. If the binary contains debug symbols, source-code-level debug information will be available (in the Code tab); otherwise only function-level information will be available (in the Functions tab, and also from the Timeline Samples HUD ()). Function-level information will still provide a good clue as to where to look, however. Provided debug symbols are available, the code view can easily be used to give a view similar to the one below by clicking through the offending functions in the Functions tab.

 

The annotations on the left of each source code line show the number of occasions that line was being executed when a sample was taken, and that count as a percentage of the rest of the function. Using the Timeline Samples HUD () we can identify the “yx_loop” function as being responsible for the majority of the samples from our code (1617) throughout the second phase (which we identified as having poor cache utilization). Clicking through this function in the Samples HUD or the Functions tab, we can see 1584 samples on the line within the nested for-loop – suggesting this loop needs a second look. In our case this is a particularly simple function consisting only of this loop, but if it were more complex this would offer a much greater insight into exactly where the offending function was spending most of its time.

 

 

Summary

I have attached the source for the simple cache-test example. It is currently in the process of being added to the examples bundled with DS-5, so it will be included with future product versions. I will update this blog post when that happens.

 

Feel free to post any comments or questions below and I will respond as soon as possible.

Usually when you create a bare-metal image you specify the location in memory where the code and data will reside, and provide an entry point address where execution starts.

But what if you don't want to specify a fixed memory location at build time?

Security has become a crucial aspect of applications. One common attack used to gain privileges on a system is the buffer overflow: this anomaly can potentially lead to the execution of malicious code, jeopardizing the security of the entire system through code injection.

Different techniques are used to make a hacker's life harder, including address space layout randomization (ASLR). This technique is widely used in several high-level operating systems, including Android, iOS, Linux and Windows.

With ARM Compiler 6 you can extend this protection to bare-metal applications by creating Position Independent Executables (PIE), also known as Position Independent Code (PIC). A PIE is an executable that does not use fixed addresses to access memory. Rather, it can be executed at any suitably aligned address and the code automatically recalculates the required addresses.

ARM Compiler 6 provides the -fbare-metal-pie (armclang) and --bare_metal_pie (armlink) options to let you create a bare-metal PIE:

armclang -fbare-metal-pie -target armv8a-arm-none-eabi source.c 
armlink --bare_metal_pie source.o

Note: armclang automatically passes the --bare_metal_pie option to armlink when you compile with -fbare-metal-pie.

Note: Bare-metal PIE is currently only supported for 32-bit targets.

 

Worked Example Part 1: Creating a PIE

Let's take a look at how this works in practice.

This example creates a very simple "Hello World" program in DS-5, uses ARM Compiler 6 to create a PIE, then uses the DS-5 Debugger and the AEMv8-A model to run the executable at an arbitrary position in memory.

 

Step 1: Create a "Hello World" C project in DS-5 Debugger

  1. Create a new C project in DS-5 called PIEdemo (Click File > New > Other... to start the New Project wizard), using Project type: Empty Project and Toolchain: ARM Compiler 6 (DS-5 built in).
  2. Add a new source file pie.c to the new project (right-click the project, then click New > Source File) with the following content:

    #include <stdio.h> 
    const char *myString = "Hello World\n";
    int main()
    {
        puts(myString);
        return 0;
    }

Step 2: Compile the source code to create a PIE

  1. Edit the project properties (right-click the project, then click Properties) and navigate to the ARM Compiler toolchain settings (C/C++ Build > Settings).
  2. Add the following command-line options:

    • ARM C Compiler 6 > Target > Target: armv8a-arm-none-eabi (this compiles for AArch32)
    • ARM C Compiler 6 > Miscellaneous > Other flags: -fbare-metal-pie -mfpu=none
    • ARM Linker 6 > Miscellaneous > Other flags: --bare_metal_pie
  3. Build the project (right-click the project, then click Build Project).

 

Step 3: Create a debug configuration for the AEMv8-A model

  1. Create a new debug configuration (right-click in the Debug Control tab, then click Debug Configurations..., then click the New Launch Configuration button).
  2. On the Connection tab:
    1. Select the VE_AEMv8x1 > Bare Metal Debug > Debug AEMv8-A target.
    2. Add the model parameter: -C cluster.cpu0.CONFIG64=0. This puts the model in AArch32 state, rather than the default AArch64 state.

      DebugConfiguration.png

  3. On the Debugger tab, select Run control: connect only.

    We want to load the image manually so that we can specify the load address.

Step 4: Run the PIE on the AEMv8-A model

  1. Double-click the debug configuration to connect to the AEMv8-A model target.
  2. Load the PIE by running the following command on the Commands tab:

    loadfile PIEdemo/Debug/PIEdemo.axf 0x80000044

    This loads the PIE at the arbitrary address 0x80000044, performs all necessary address relocations, and automatically sets the entry point:

    loadfile_command.png

    Note: You can choose any address, but it must be suitably aligned and at a valid location in the AEMv8-A memory map. For more information about the AEMv8-A memory map, see AEMv8-A Base Platform memory map in the Fast Models Reference Manual.

    Note: You can ignore the TAB180 error for the purposes of this tutorial. For more information, see ARM Compiler 6: Bare-metal Hello World C using the ARMv8 model | ARM DS-5 Development Studio.

  3. Execute the PIE by running the following command on the Commands tab:

    run

    Check the Target Console tab to see the program output:

    target_console.png

How Does It Work?

Position independent code uses PC-relative addressing modes where possible and otherwise accesses global data indirectly, via the Global Offset Table (GOT). When code needs to access global data it uses the GOT as follows:

  • Evaluate the GOT base address using a PC-relative addressing mode.
  • Get the address of the data item in the GOT by adding an offset index to the GOT base address.
  • Look up the contents of that GOT entry to obtain the actual address of the data item.
  • Access the actual address of the data item.
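
As a toy model of these steps in C (the names and table layout here are purely illustrative, not the compiler's actual mechanism):

    #include <stdio.h>

    static const char *myString = "Hello World\n";

    /* A stand-in "GOT": a table holding the addresses of global data
       items. In a real PIE its base is found via PC-relative addressing
       and its entries are patched at load time. */
    static void *got[] = { (void *)&myString };

    int main(void)
    {
        unsigned index = 0;                            /* offset index into the GOT */
        const char **var = (const char **)got[index];  /* entry holds the variable's
                                                          actual address            */
        fputs(*var, stdout);                           /* access the data item      */
        return 0;
    }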

We'll see this process in action later.

At link time, the linker does the following:

  • Creates the executable as if it will run at address 0x00000000.
  • Generates a Dynamic Relocation Table (DRT), which is a list of addresses that need updating, specified as 4-byte offsets from the table entry.
  • Creates a .preinit_array section, which will update relocated addresses (more about this later…).
  • Converts function calls to direct calls.
  • Generates the Image$$StartOfFirstExecRegion symbol.

smallImageBeforeLoading.png

At execution time:

  • The entry code calls __arm_preinit_.
  • __arm_preinit_ processes functions in the .preinit_array section, calling __arm_relocate_pie.
  • __arm_relocate_pie uses Image$$StartOfFirstExecRegion (evaluated using a PC-relative addressing mode) to find the actual base address in memory where the image has been loaded, then processes each entry in the DRT, adding the base address offset to each address entry in the GOT and to initialized pointers in the data area.

smallImageAfterLoading.png
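
The relocation step can be pictured with a simplified, self-contained simulation of what __arm_relocate_pie does (the names and values below are illustrative; the real routine is part of the ARM Compiler library):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* The image is linked as if at address 0, so each patched word
       simply gains the load address. (The real DRT stores 4-byte
       offsets; ptrdiff_t keeps this sketch portable.) */
    static uint32_t got_entry = 0x00000F24;    /* link-time address */

    int main(void)
    {
        /* A one-entry DRT: each entry holds the offset from itself to
           the word it patches (a GOT entry or initialized pointer). */
        ptrdiff_t drt[1];
        drt[0] = (char *)&got_entry - (char *)&drt[0];

        uint32_t load_base = 0x80000044;       /* where the image landed */

        /* The relocation loop, in the spirit of __arm_relocate_pie. */
        for (size_t i = 0; i < 1; i++) {
            uint32_t *target = (uint32_t *)((char *)&drt[i] + drt[i]);
            *target += load_base;
        }

        printf("GOT entry after relocation: 0x%08X\n", (unsigned)got_entry);
        /* 0x00000F24 + 0x80000044 = 0x80000F68 */
        return 0;
    }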

 

Worked Example Part 2: Stepping through PIE execution with DS-5 Debugger

Our example from earlier contains the global string "Hello World". Let's see how relocation is used in the PIE to access this data regardless of where the image is loaded.

In the Project Explorer view, double-click on the .axf executable to see the sections it contains:

ElfSections.png

We can see that the GOT is located at address 0x00000EE0 in the original image.

Now load the image to address 0x80000044 by running the following command on the Commands tab:

loadfile PIEdemo/Debug/PIEdemo.axf 0x80000044

Use the Disassembly view to view address 0x80000F24 (0x80000044 + 0x00000EE0). We can see that the GOT has been loaded, but it still contains unrelocated addresses:

GOTpre.png

Now, set a breakpoint on main() and run the executable. This executes the setup code, including __arm_relocate_pie which relocates the addresses in the GOT. Run the following commands on the Commands tab:

b main
run

Look at the GOT again, and note that the addresses have been relocated:

GOTpost.png

Now we'll see how the code uses the GOT to access the "Hello World" string.

Step to the next source instruction by running the following command on the Commands tab:

next

Jump to address $pc in the Disassembly view to view the code in main():

CodeDisassembly.png

The code to print "Hello World" starts at 0x800000E4 and does the following:

  1. Load R1 with the GOT offset for our string (0xC), obtained by a PC-relative data lookup from address 0x80000118.
  2. Load R2 with the PC-relative offset of the GOT table (0xE30).
  3. Update R2 with the actual base address of the GOT table, PC + 0xE30 (0x800000F4 + 0xE30 = 0x80000F24).
  4. Load R1 with the contents of address R1 + R2 (that is, address 0x80000F24 + 0xC = 0x80000F30). This GOT entry contains 0x80000F68, which is the address of the pointer to the "Hello World" string.
  5. Load R1 with the target address of the pointer, copy it to R0, and call puts.

You can single-step through the code and use the Registers view to see this working in DS-5 Debugger.

 

Further Reading

On 8th May, ARM-approved training specialists Doulos Embedded are hosting free webinars on effective application debugging for embedded Linux systems. Learn how to get the most out of embedded Linux by addressing the important issue of application debugging, including examples using DS-5 Development Studio.

 

For Europe and Asia, register to attend on 8th May, 10am-11am BST (11am-12pm CEST, 2.30pm-3.30pm IST). For North America, register to attend at 10am-11am PDT (1pm-2pm EDT, 6pm-7pm BST).

 

 

See the full details »

 

 

 

Tux.png

ARM has always been committed to working with the ecosystem and cooperating with partners to get the best out of our cores. One important aspect of this cooperation is sharing what we have done in open source and what we plan to do in the near future.

 

GNU Toolchain

In the first quarter of 2015 we focused on getting GCC 5 ready for release, plus some work on both A-Profile and R/M-Profile processors.

 

In particular, for Cortex-A processors, we made the instruction scheduling model more accurate and set a number of additional compiler tuning parameters which will lead to performance improvements on Cortex-A57. We also added support for the new Cortex-A72 and performed initial tuning for performance.

 

On the Cortex-R/M class we implemented Thumb-1 prologues/epilogues in RTL representation, allowing the compiler to further tune function call/return sequences.

 

Additional work has been done with the community on improving NEON® intrinsics, refining string routines in glibc, and implementing __aeabi_memclr / __aeabi_memset / __aeabi_memmove in Newlib.

 

What’s next?

For the second quarter of 2015, we plan to complete what we started at the beginning of the year: first of all, we will continue supporting and helping with the release of GCC 5, an important milestone that we want to make sure ARM supports well. We will continue to work on adding support for the ARMv8.1-A architecture in GCC and on improving performance for Cortex-A53 and Cortex-A57. For example, we noticed that GCC generates branches when compiling code with if/then statements where a conditional select could be used instead: compiler engineers are exploring this optimisation opportunity, which could potentially give a significant performance boost.
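
As a generic illustration (not one of the specific cases under investigation), a simple conditional like the one below can be compiled either to a compare-and-branch sequence or to a single conditional-select instruction, avoiding any branch:

    /* A compiler with good if-conversion can emit a conditional select
       here, e.g. "cmp w0, w1; csel w0, w0, w1, gt" on AArch64, rather
       than a branch that may be mispredicted. */
    int max(int a, int b)
    {
        return (a > b) ? a : b;
    }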

 

LLVM Update

The activity on LLVM has been focused on improving both AArch32 and AArch64 code generation: we added basic support for Cortex-A72 and continue to improve the performance of code generated for the ARMv8 architecture, for example by refining unrolling heuristics.

 

Initially our efforts were mainly directed at ARMv8, but we are now gradually making big advancements on ARMv7-A and ARMv7-M as well (see the MC-Hammer section below).

 

Supporting cores is not our only concern: the software ecosystem is important to us, and in the last quarter we have been fixing stack re-alignment issues in the Android Open Source Project when it is built with LLVM.

 

What’s next?

During the next three months we will extend support for the ARMv8.1-A architecture and continue to work on performance optimisations. Some of the areas we will target are vectorisation, inlining, loop unrolling and floating-point transformations. We are also discussing autovectorizer support for strided accesses, to maximise the use of structure load and store instructions.
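
As an illustration of the kind of strided access involved (a generic example, not a specific workload from this effort), de-interleaving packed RGB data reads each component with a stride of 3, which maps naturally onto structure loads:

    #include <stddef.h>
    #include <stdint.h>

    /* Each output array reads the input with a stride of 3 bytes. An
       autovectorizer that understands strided accesses can use a
       structure load (e.g. VLD3 on NEON) to de-interleave many pixels
       per iteration instead of issuing scalar loads. */
    void split_rgb(const uint8_t *rgb, uint8_t *r, uint8_t *g,
                   uint8_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            r[i] = rgb[3 * i + 0];
            g[i] = rgb[3 * i + 1];
            b[i] = rgb[3 * i + 2];
        }
    }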

 

We will continue to support the Android Open Source Project (AOSP). In particular we will focus on stack size usage: LLVM is not performing as well as it could in determining when local variables are no longer used (“lifetime markers”), causing an unnecessary increase in stack usage.
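
A generic illustration of the problem (not an actual AOSP case): the two buffers below are never live at the same time, so a compiler that emits and honours lifetime markers can let them share a single stack slot, while one that does not may reserve frame space for both.

    #include <stdio.h>

    /* buf_a and buf_b have disjoint lifetimes; with lifetime tracking
       the frame needs roughly 4 KB for this function, without it
       roughly 8 KB. */
    void report(int which)
    {
        if (which == 0) {
            char buf_a[4096];
            snprintf(buf_a, sizeof buf_a, "took path A");
            puts(buf_a);
        } else {
            char buf_b[4096];
            snprintf(buf_b, sizeof buf_b, "took path B");
            puts(buf_b);
        }
    }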

 

MC-Hammer

At Euro-LLVM 2012, Richard Barton presented MC-Hammer, a tool we have been using to verify the correctness of LLVM-MC against our proprietary reference implementation (the presentation and slides are available on the LLVM website: http://llvm.org/devmtg/2012-04-12/).

 

In 2012 we estimated that, at the time, ~10% of all ARM instructions for Cortex-A8 were incorrectly encoded and 18% were incorrectly assembled when using LLVM. Over the past three years we have gradually fixed corner-case bugs, and we are now confident that the v7-A and v7-M variants of the ARM architecture are correct, as well as AArch64. This is a great result, and it means that this functionality in LLVM-MC can be trusted and built upon.

 

We participated in EuroLLVM 2015 on 13th and 14th April in London (UK), and we will be at GNU Tools Cauldron 2015 on 7th-9th August in Prague (Czech Republic): please come and talk to us! For more details, see the full presentation given by Matthew Gretton-Dann, available on YouTube with his slides attached to this blog, or get in contact with us if you need further information. We would like to hear about what you are doing in this space, and perhaps work together to achieve a shared goal.
