
ARM and Keil Tools


Micrium uC/OS goes ARMv8-A


In case you've missed it: Micrium announced yesterday support for the ARM Cortex-A50 series of 64-bit processors in both the uC/OS II and uC/OS III RTOSs. As a key embedded operating system partner within the ARM ecosystem, we are very happy to see Micrium achieve this milestone, supported by ARM tools. You can read the full story here: Micrium Paves the Way with ARMv8 RTOS Support | Micrium


Get a head start on your software development


The tools you need to accelerate your ARMv8 project, including simulator models, are available now from ARM. Check out ARM DS-5 Ultimate Edition.

ARM® Compiler 6 is the next-generation C/C++ compilation toolchain from ARM, based on Clang and the LLVM compiler framework. Version 6.00 of the toolchain provides architectural support for v8 of the ARM architecture and alpha support for v7-A. It can be used in conjunction with ARM DS-5 Development Studio to build and debug ARMv8 executable code. In this blog post, we shall look at the scalability benefits that LLVM brings when solving code generation problems for complex modern microarchitectures and for product designs with demanding performance and functionality requirements. We shall also explore how the collaborative open source development process helps in responding to the challenge of ever shorter design cycles, by making tools development more agile and efficient.


The power of modular design, optimizer and IR

The compiler in the ARM Compiler 6 toolchain is armclang, based on Clang, a C/C++ front end for the LLVM code-generation framework. LLVM is designed as a set of reusable libraries with well-defined interfaces. In comparison, armcc, the compiler in ARM Compiler 5, is composed of modules with less well-defined interfaces and separation, which makes its parts less reusable across a larger code generation problem space. armclang strictly adheres to the three-phase LLVM design, with a front-end parser and syntax checker, a mid-end optimizer, and back-end code generators that produce native machine code. The three phases have a clear separation in terms of their intended function, and this aspect of LLVM makes it reusable and flexible.


LLVM IR, or Intermediate Representation, is the glue that connects the three phases. LLVM IR is the only interface to the optimizer and is designed as a first-class language with well-defined semantics. It was designed from the ground up with support for common compiler optimizations in mind. The optimizer itself is designed as a set of passes that apply transformations to the input IR, producing IR that feeds into the code generator, which can then produce efficient machine code. The library-based design of the optimizer allows toolchain designers to select the passes and optimizations that are most relevant for a given application domain and produce efficient code with minimum effort.
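As a concrete (if simplified) illustration of what one such pass does, here is the effect of common sub-expression elimination, expressed in plain C. The functions below are my own sketch, not toolchain code, and in reality the optimizer performs this transformation on the IR rather than on the source:

```c
#include <stdint.h>

/* Before the pass: the sub-expression (a * b) is written twice. */
uint32_t before_cse(uint32_t a, uint32_t b)
{
    return (a * b) + (a * b) / 2;
}

/* After a common sub-expression elimination pass, the compiler
 * effectively rewrites the function to evaluate (a * b) once and
 * reuse the result: */
uint32_t after_cse(uint32_t a, uint32_t b)
{
    uint32_t t = a * b;   /* single evaluation */
    return t + t / 2;
}
```

Because each pass consumes and produces the same IR, a toolchain designer can chain exactly the passes that pay off for a given application domain.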


The LLVM framework also makes it easier to add new targets, e.g. by using target description (.td) files to concisely describe a large part of the target architecture in a domain-specific language. Partners also have the option of adding their own backends for custom DSPs or GPUs as plugins to the toolchain. The code generator itself is based on several built-in passes for common problems like instruction selection, register allocation and scheduling, so adding new code generators from scratch is relatively easy. The expressiveness of the target description syntax is being continuously improved, so that it becomes easier still to add targets in the future. The modular design and robust LLVM IR format lend themselves well to specialized code generation challenges, such as the security-related extensions sometimes found on embedded microcontrollers.


ARM Compiler 6 comes with optimized libraries and armlink, an industrial-strength linker that has been developed over the years as part of the ARM Compiler toolchain and nicely complements the benefits accrued from LLVM as detailed above. For example, we expect to introduce link-time optimization in a future version of the product, bringing together the benefits of LLVM technology for optimization (leveraging LLVM's support for the bitcode format) and the time-tested robustness of armlink. When introduced, this would enable optimization across library boundaries, which was not possible with older versions of ARM Compiler. By applying ARM’s best-in-class embedded libraries, ARM Compiler 6 generates highly optimized library functions tuned specifically for the ARM architecture, improving both performance and power consumption.


The power of open source – Agile innovation in a collaborative environment

LLVM is licensed under the University of Illinois/NCSA Open Source License, which means:

  1. you can freely distribute LLVM
  2. commercial products can be derived from LLVM (which ARM Compiler 6 is an example of) with few restrictions
  3. there is no requirement to make derived works open source

LLVM development happens in the open, with contributions committed to the codebase on a regular basis. Each contribution is evaluated on its merits by a rigorous development process that involves community-led code reviews, the addition of new tests for the contribution, and qualification by a 100% pass rate on existing regression tests before it is accepted into the codebase. The support for ARMv8 Advanced SIMD introduced in the LLVM AArch64 backend is a good example of this process in action: ARM and Qualcomm collaborated in the community to deliver a well-validated piece of functionality to upstream LLVM. The AArch64 backend itself is validated and kept free of regressions by means of a community-hosted buildbot, which tests whether Clang can correctly cross-compile itself and runs the LLVM regression tests using the cross-compiled Clang, targeting an AArch64 Linux platform.

This means ARM partners bringing products to market can focus on true differentiating factors and spend less software development effort on the correctness and quality of common code generation. This development model cuts waste and prevents fragmentation of code generation efforts across the partnership. ARM is at the heart of a rich ecosystem of silicon partners, tools vendors, EDA, OEM and software development partners, and combining the strength of the partnership with that of the open development model helps speed up innovation across the segments in which the partnership operates.


We'd like to hear from you!

We have seen how ARM Compiler 6 makes use of the modular design of LLVM and how this can help to solve code generation problems for product designs based on the ARM architecture. This is the first version of the product in which we are switching from proprietary compiler technology (whose origins date back to the initial days of the ARM architecture itself) to LLVM and what an exciting transition this has been! With the successful transition, I believe we are on a firm foundation to meet the code generation challenges posed by superscalar architectures with multiple pipeline stages, heterogeneous multi-processors, and exacting power efficiency requirements. We would love to hear from partners who wish to collaborate on LLVM technology and/or on ARM Compiler 6. ARM Compiler 6 is supported initially in DS-5 Ultimate Edition. If you are interested in evaluating ARM Compiler 6, you can request an evaluation by going to DS-5 Ultimate Edition.

ARM Compiler 6 is now available, bringing to you a modern, extensible compiler architecture for the next generation of ARM processors. Version 6 of the ARM Compiler adopts the Clang and LLVM compiler framework, which is swiftly gaining momentum as the compiler of choice for advanced code generation. By working with the open source Clang/LLVM, ARM is able to work in cooperation with our partners, accelerating feature creation and code generation efficiency targeting the ARM architecture. Tuning, testing and implementation are all much faster with open source LLVM.


Why is this important to ARM?


ARM has gained success based on partnership. Using an open source framework for the next-generation ARM Compiler, we have opened the door for better collaboration with regard to code generators; a critical component for improving performance and power consumption on ARM processors. ARM has actively contributed to many open source communities for years, but the ARM Compiler was developed alongside the ARM architecture and has always been proprietary. ARM Compiler 6 marks the start of a new generation, channelling open source contributions into an integrated, validated and fully supported commercial product, enabling partners and end users to take advantage of the velocity of open source development and the efficiency of Clang/LLVM.


What’s special about Clang and LLVM?


The flexible and modern Clang and LLVM infrastructure provides a solid foundation for ARM’s code generation tools. Clang is a C/C++ compiler front end based on a modular architecture with well-defined interfaces for applying complementary tools such as code analyzers and code generators. Clang also offers improved diagnostic capabilities, leading to higher quality code and shorter development cycles.


LLVM is an extensible compiler framework which is well suited for advanced code generation techniques such as link-time code generation. LLVM’s modular framework makes it easier to develop and test new optimizations, leading to better performing code and lower power consumption.


To learn more about Clang & LLVM technology, read Vinod's blog.


What’s special about ARM Compiler 6?


Building on Clang & LLVM, ARM Compiler 6 really does provide the best of both worlds. It delivers efficient code size and performance, and comes as an integrated and validated toolchain that works straight out of the box. Benefits include:


  • Tight integration: ARM Compiler 6 is more than just a compiler; it is a full code generation toolchain consisting of compiler, linker, assembler, and libraries. Its integration in the ARM DS-5 Development Studio Ultimate Edition provides a full C/C++ software development environment.
  • Optimized for ARM: Highly optimized libraries provide superior performance and code size for embedded applications, maximizing software performance and reducing costs.
  • Stable and robust: Developed and maintained by ARM experts, ARM Compiler 6 has undergone extensive testing on ARMv8 targets to ensure that it is stable, mature and efficient.
  • Professionally supported and maintained: ARM Compiler 6 and DS-5 are actively supported, validated, documented and maintained by ARM’s globally distributed technical experts, ensuring rapid issue resolution and faster time to market.


Will migrating to ARM Compiler 6 be easy?


Yes, to ensure as smooth a transition as possible, we have put together a comprehensive migration guide which is included within the DS-5 Ultimate Edition installation.


Try DS-5 Ultimate Edition now


To get everything you need to develop for the ARMv8 architecture, request a free 30-day trial of ARM DS-5 Ultimate Edition. Or learn more about ARM Compiler 6 and DS-5 Ultimate Edition.

Supporting ARMv8 early adopters since 2011


Developed alongside our latest architecture, ARMv8, ARM tools and virtual platforms have enabled leading chipmakers and ecosystem partners to accelerate SoC bring-up and software development for over two years. Now, however, as 64-bit ARM processors start to hit the shelves, it is time to make these tools available to the wider ARM community. This is ARM DS-5 Development Studio Ultimate Edition (UE).


Learn more about DS-5 Ultimate »


Don't wait for silicon, start writing ARMv8 code now


DS-5 Ultimate packs all the tools you need to jump-start your ARMv8 software development program, even if there is no silicon on your desk yet. In addition to code generation and debug tools, the suite includes a quad-core ARMv8-A fixed virtual platform that provides architecturally compliant simulation for you to run and test your code on your own PC. The model can be used for bare-metal/embedded software development as well as for running Linux images made available from Linaro for the ARMv8 VE FVP.


As you would expect, DS-5 UE can also be used to connect to your custom design or any of the many other platforms available in its configuration database, including the ARM Versatile Express SMMs for ARM Cortex-A57 and Cortex-A53 processors.


What's in the box


DS-5 Ultimate Edition is the premium member of the DS-5 family, and as such includes all features available in DS-5 Professional plus:

  • New ARM Compiler 6 for efficient code generation, initially targeting ARMv8-A architecture
  • Quad-core ARMv8-A (VE) fixed virtual platform enabling pre-silicon software development
  • ARMv8-A capable multicore, multi-cluster debugger
  • Support for Cortex-A57 and Cortex-A53 cores in Streamline Performance Analyzer
  • Future support for ARMv8-R


Learn more about DS-5 Ultimate »

Request your DS-5 Ultimate Edition evaluation


DS-5 Ultimate Edition will become available in the next few days, so go ahead and request your fully featured 30-day evaluation license now at http://ds.arm.com/ds-5-ultimate-edition/request-trial/

I had a busy week at the Game Developers Conference in San Francisco.


This year at our GDC booth, we were demoing the Streamline feature of DS-5 and how it can be used in conjunction with Mali Graphics Debugger to locate and analyse hot spots, and ultimately improve your application. My colleague Lorenzo Dal Col and I gave a talk on some of the techniques you can use with these tools. A brief overview is available below:



In our booth lecture theater, I gave a short overview of how to use Streamline for power analysis with any of the energy metering solutions that it supports. If you were unable to attend, you can replay it again below:



Note that all our booth lectures, as well as a variety of other videos from the event, are available at ARM's YouTube page.



I can't imagine anyone reading this posting hasn't already read about the Apple "goto fail" bug in SSL. My reaction was one of incredulity; I really couldn't believe this code could have got into the wild on so many levels.


First we've got to consider the testing (or lack thereof) of this codebase. The side effect of the bug was that all SSL certificates passed, even malformed ones. This implies positive testing (i.e. demonstrating that it works) but no negative testing (i.e. trying a malformed SSL certificate), or perhaps no dynamic SSL certificate testing at all?
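To illustrate the positive/negative distinction with a deliberately toy example (the checker below is my own invention and bears no resemblance to a real SSL stack):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical certificate checker: here it just demands an exact
 * expected string; a real SSL verifier is vastly more involved. */
static bool cert_is_valid(const char *cert)
{
    return cert != NULL && strcmp(cert, "VALID-CERT") == 0;
}

/* Positive test: a well-formed certificate must pass. */
static bool positive_test(void)
{
    return cert_is_valid("VALID-CERT");
}

/* Negative tests: malformed inputs must FAIL. These are exactly the
 * cases that would have caught a verifier which accepts everything. */
static bool negative_tests(void)
{
    return !cert_is_valid("") &&
           !cert_is_valid("VALID-CER") &&
           !cert_is_valid(NULL);
}
```

A test suite containing only positive_test() would have sailed straight past the goto fail bug; it takes the negative cases to expose it.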


What I haven't established* is whether the bug came about through code removal (e.g. there was another 'if' statement before the second goto) or whether, due to trial-and-error programming, the extra goto got added (with other code) and then didn't get removed in a clean-up. There are, of course, some schools of thought that believe it was deliberately put in as part of PRISM!


Then you have to query the regression testing; did they never test malformed SSL certificates (I can't believe that; mind you, I didn't believe Lance was doping!), or did they use a regression subset for this release which happened to miss this bug? Regression testing vs. product release is always a massive pressure. Automation of regression testing through continuous integration is key, but even so, for very large codebases it is simplistic to say "rerun all tests"; we live in a world of compromises.


Next, if we actually analyse the code, then I can imagine the MISRA-C group jumping around saying "look, look, if only they’d followed MISRA-C this couldn't have happened" (yes Chris, it's you I'm envisaging) and of course they're correct. This code breaks a number of MISRA rules, but most notably:

15.6 (Required) The body of an iteration-statement or selection-statement shall be a compound-statement

This boils down to all if-statements having to use a block structure, so the code would go from (ignoring the glaring coding error of the two gotos):

       if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
              goto fail;
       if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
              goto fail;
              goto fail;
       if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
              goto fail;


to:

       if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0) {
              goto fail;
       }
       if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) {
              goto fail;
              goto fail;
       }
       if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0) {
              goto fail;
       }

This would then stop the unconditional goto from being executed, but it would cause a further rule violation:

Rule 2.1 (Required) A project shall not contain unreachable code

Nevertheless, what might surprise you (unless you follow MISRA-C closely) is that the use of the goto statement IS allowed:

Rule 15.2 (Required) The goto statement shall jump to a label declared in the same function.

but discouraged:

Rule 15.1 (Advisory) The goto statement should not be used.

In addition, I would expect any static analysis tool to flag this error, indicating that even rudimentary static analysis is not being applied to this codebase. But that's not really what struck me when I first saw the code. My reaction was

"They must have ignored compiler warnings, as any compiler worth its salt would warn about unreachable code"

Now I do most of my work here at Feabhas using either the ARM/Keil compiler (armcc) or the IAR compiler (iccarm), and I am very used to seeing this warning, as it is common to have infinite loops in multi-tasking code. Sure enough, given the following code:

int calculate_F(void);

int simpleTest(int p)
{
   int ret_val = 1;
   if (p & 0x000F)
      goto out;
   if (p & 0x00F0)
      goto out;
      goto out;       /* unconditional - mirroring the Apple bug */
   if (p & 0x0F00)
      goto out;
   ret_val = calculate_F();
out:
   return ret_val;
}
As expected, the ARM compiler reports a warning regarding unreachable code.

goto fail

So how could this happen? To my utter amazement, compiling this with GCC and -Wall (all warnings) doesn’t report any warnings. Apparently neither does Clang (the default compiler on OS X) with -Wall, though it does if you specify -Wunreachable-code (which, bizarrely, isn’t part of -Wall!).


So what’s my takeaway from this? I’ve always advocated the use of Static Analysis tools as an integral part of the build cycle (i.e. not trying to apply it retrospectively to 100,000’s of lines of code) rather than relying on the compiler to generate appropriate warnings. And if you weren’t convinced before, this is just another, now well documented, example of why you should.


* Please let me know if it has

Our latest application note explains how to create projects in DAVE3 (http://www.infineon.com/dave) and how to work with these projects in MDK Version 5. Here's the link to the app note: Application Note 258: Using DAVE3 with MDK Version 5. If you want to see this live in action, please visit ARM's stand at embedded world in Nuremberg: Hall 4, Stand 350. See you there!


Keil MDK is the most comprehensive software development environment for Cortex-M processor based microcontrollers. MDK Version 5 is now split into the MDK Core and Software Packs which makes new device support and middleware updates independent from the toolchain. Download the "Getting Started" user's guide and learn how to create applications for ARM Cortex-M microcontrollers. The book starts with the installation of MDK and describes the software components along with the complete workflow from starting a project to debugging on hardware.

It is available on the MDK Version 5 web page: Keil MDK-ARM Version 5


Hi, I would like to share some exciting new prices on ARM Versatile Express development boards, making them more affordable for users.


For the mobile segment, the Versatile Express family is the only range of development boards that offers a vendor-neutral solution with the ability to expand the design with native AXI interfaces. This means that you can validate your IP in FPGA with ARM development chips before your SoC is available. Building a development system with ARM Versatile Express boards minimizes project setup time and allows the developer to concentrate on the task in hand: testing and validating the product IP and software, rather than designing and debugging the development system.


The products listed below are now more affordable, with a 50% price reduction*.


Product                              Order code        New price

CoreTile Express A15x2_A7x3 (TC2)    V2P-CA15-0314A    $3000

CoreTile Express A9x4                V2P-CA9-0301A     $3000

CoreTile Express A5x2                V2P-CA5-0305A     $3000

Motherboard Express uATX             V2M-P1-0303A      $3000


*Versatile Express LogicTiles and Cortex-M Prototyping System prices are unchanged



Further information on the Versatile Express family is available here http://www.arm.com/products/tools/development-boards/versatile-express/index.php

Streamline, the profiling tool in DS-5, is a powerful tool for analysing system behaviour with proven results at both macro (system level interactions) and micro (CPU code hotspots) levels.


Mental models


Streamline is also an excellent way to explore a system. Very few of us are lucky enough (or capable!) to own an entire system these days, especially in the Cortex-A space. Our contribution is usually a small part of a much wider software stack, with code from colleagues and contractors running on a 3rd party OS alongside multiple other processes.


We tend to have a model in our head of how the system works: what order things happen in, which parts are potential bottlenecks, and which parts we need to worry about. This is an essential abstraction that lets us get some work done, but it can end up misleading us. If system behaviour doesn’t contradict the model too obviously, or performance is acceptable, we may never stop to ask whether the model is valid, especially as measuring it may be time-consuming and not obviously contribute to an on-time, on-budget delivery.


Recently I ran into exactly this kind of scenario when I was given the opportunity to see some preliminary results from early Streamline support of a new piece of IP, in this case a hardware accelerator. We fired up an accelerator test-case and captured some data for visualisation on the Timeline view, including the new counters and measures of accelerator activity.




Of course we also captured other data, including CPU activity, at the same time. Looking at the CPU activity chart, which wobbled along at about 25%, I asked myself whether this was expected, given that we were supposed to be offloading all the real work to the accelerator. But performance was good, nobody was worried, and it was easy to move on. At this stage, in the interests of full disclosure, I should note that some function names have been changed and some images have been modified to avoid revealing who I have been working with.

The 'What the .... ?!?' moment


Later, however, it transpired we were running on a quad-core CPU and then I knew I was on to something.


Let’s take a brief detour. A Streamline report opens initially with a collapsed, average activity for all cores in an SMP system. You can expand this to show separate averages for big and LITTLE clusters (if relevant) and then all cores individually. If a single core in a quad-core system runs at 100% activity this appears as a constant baseline of 25% in the average view. So 25% average activity quickly becomes a red-flag to Streamline users, likewise for 50% on a dual-core system and so on.
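The arithmetic behind that red flag is simple enough to sketch (illustrative code of my own, not anything from Streamline):

```c
/* Collapsed-view average across N cores: a single core pegged at
 * 100% in a quad-core system appears as a constant 25% baseline,
 * 50% in a dual-core system, and so on. */
static double average_activity(const double *per_core, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += per_core[i];
    return sum / n;
}
```

Which is why 25% on a quad-core chart should make you reach for the expand button before moving on.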




So returning to our accelerator: as suspected, upon expanding the average we could see one core running at close to 100%, something that immediately clashed with our mental model. Now people were intrigued and suddenly the hunt was on for a reason. Fortunately Streamline generally makes this kind of investigation pretty quick. First we isolated an active portion of the test-case using the caliper tool. This restricts the other Streamline views, like the Call Paths and Functions tabs, to only the selected portion of the timeline.




The Call Paths view shows a breakdown of where time is spent inside the test application and driver. To get this extra detail you need to add the relevant ELF files and re-process the report (there is no need to re-run the test case and capture new data). Now we have our suspect - copy_from_subsystem() accounting for 50.3% of the process workload (the 9.78% figure in the Total column is the load presented to all four cores).




Looking at the code, we can see it is a loop that copies data from an accelerator-specific block of memory to the destination buffer in the user-space code that invoked the accelerator. This code copies 4 bytes at a time and is almost certainly sub-optimal, and when we drilled down into the port_mem_read32() function we discovered that every call involved some per-page calculations to translate from accelerator memory to kernel memory. Instead of doing these once per 4 KB page and re-using the result, they were being done for every word, i.e. 1024 times more often than required.
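To show the shape of the problem and the fix, here is a hand-written stand-in (not the actual driver code; translate_page() and all the names are mine) that hoists the per-page translation out of the inner word loop:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE       4096u
#define WORDS_PER_PAGE  (PAGE_SIZE / 4u)

/* Stand-in for the per-page address translation the driver was
 * doing; counts calls so the difference is measurable. */
static size_t translations;

static const uint32_t *translate_page(const uint32_t *src, size_t word)
{
    translations++;
    return src + (word / WORDS_PER_PAGE) * WORDS_PER_PAGE;
}

/* Slow version: translates once per 32-bit word. */
static void copy_slow(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        const uint32_t *page = translate_page(src, i);
        dst[i] = page[i % WORDS_PER_PAGE];
    }
}

/* Fixed version: translates once per 4 KB page and reuses the
 * result - 1024x fewer translations for large copies. */
static void copy_fast(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i += WORDS_PER_PAGE) {
        const uint32_t *page = translate_page(src, i);
        size_t chunk = (nwords - i < WORDS_PER_PAGE)
                           ? nwords - i : WORDS_PER_PAGE;
        for (size_t j = 0; j < chunk; j++)
            dst[i + j] = page[j];
    }
}
```

Both loops copy the same data; only the number of translations changes, which is exactly the kind of difference that shows up as a core pegged at 100%.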




Nobody was particularly at fault here; a piece of bare metal test code from initial hardware bring up that probably only needed to access a few words of memory and didn’t need to be optimal had survived various development iterations and made its way into an early version of the Linux driver. Performance was reasonably good so nobody had any reason to look any deeper. Only when confronted by a quick Streamline report that contradicted our expectations did we stop to examine things in more detail.


I’ve no doubt that this issue would eventually have been found and fixed with or without Streamline, but at what cost, and after how many other design decisions had been made based on the performance of the existing code base? Alternatively suppose a customer had found this when bringing up their first silicon; what kind of impression would they have formed and how would that have affected their future attitude to the product?


And if the issue is not so much about raw performance but about energy efficiency, then I’m not so sure it would have been caught; it tends to be more difficult to formulate expectations about battery life, especially if the inefficient task is only active for a small part of overall up-time. Without any expectations to challenge, it's even more important to make some actual measurements. To help here, Streamline can collect energy consumption data from on-board sensors, from shunt resistors via the Energy Probe, or from certain third-party DAQs. So, for example, do you actually know whether the code that turns off the power-hungry peripherals in your design works? How long does it take? With Streamline you now have a way to see the drop in energy consumption correlated with the execution of your code.



Streamline is a quick way to get a much better understanding of system behaviour and improve your mental model. This can lead to immediate gains due to easy optimisations or fixes (like the one described), medium term improvements from a better understanding of how the various components of a system work together (e.g. additional buffering in a graphical pipeline) and longer term improvements that even influence the design of future products.


In addition, hopefully my anecdote illustrates how all developers can make regular use of a tool like Streamline to avoid misconceptions about system behaviour and keep day-to-day track of the less tangible aspects of a design.


To optimise you must first understand. Streamline is your friend on both counts.




Even this analysis turned out to be a little flawed. The copying of data from the hardware to user-space was sub-optimal but it wasn’t the majority of the issue; after fixing the copying issue the load went down, but only to 80-something percent. Another example of an assumption breaking down because I don’t happen to have a good picture of what CPU load should look like when doing this kind of copy operation. Instead I was happy to have found something suboptimal and assumed it was the only problem. Our current understanding is that the driver is spending a lot of CPU cycles waiting for the accelerator hardware because it is instantiated in an FPGA and thus runs comparatively slowly. This sounds like the last piece of the puzzle, but nobody has actually proven this yet, and so perhaps we're guilty of having made another assumption. Looks like I'd better take my own advice and go do some more investigation...!

Gain a deeper understanding of the ARM® Compiler and the optimization techniques it uses. By understanding how to control these optimizations, your code can benefit from speed increases, size savings and lower redundancy. These tutorials have been put together by Chris Walsh, inspired by some of the more frequently asked questions and most popular sections of ARM Compiler documentation. Follow them through easily by downloading a 30-day evaluation of DS-5 Development Studio.


1. Building "Hello World" Using the ARM Compiler


There's no shame in starting from the basics, so our first tutorial covers how to set up DS-5 Development Studio to select the ARM Compiler and choose the right optimization level from the C/C++ build properties. If this is the first time you have used DS-5, then learning where all the build options are will be vital for more complex projects. Read it here »

Hello World ARM C Compiler Tutorial


2. ARM Compiler Optimization


This tutorial explains all the different kinds of optimizations that the ARM Compiler carries out, including automatic vectorization for NEON™, tail call optimization and tail recursion, common subexpression elimination, cross jump elimination and table-driven peephole optimization. Knowing how to control compiler optimization can help you compile for speed or for code size, and affects the visibility of your C code when debugging. Read it here »
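As a quick taste of one of these, tail call optimization relies on the recursive call being the very last thing a function does. This illustrative snippet is my own, not taken from the tutorial:

```c
/* Tail-recursive: the recursive call is the final action, so a
 * compiler can replace the call with a branch back to the top of
 * the function - effectively a loop, with no stack growth. */
static unsigned long fact_tail(unsigned n, unsigned long acc)
{
    if (n <= 1)
        return acc;
    return fact_tail(n - 1, acc * n);   /* tail call */
}

/* NOT tail-recursive: the multiply happens after the call returns,
 * so each recursion level needs its own stack frame. */
static unsigned long fact_plain(unsigned n)
{
    if (n <= 1)
        return 1;
    return n * fact_plain(n - 1);
}
```

Whether the optimization actually fires depends on the compiler and optimization level, which is exactly why it pays to know how to inspect and control it.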


3. Beyond "Hello World": Advanced Compiler Features


Learn about some of the more advanced features of the ARM Compiler toolchain, such as compiling mixed C and assembly source files. Knowing how the ARM linker (armlink) combines object code from ARM compiler (armcc) and ARM assembler (armasm) will help you to understand how the final .axf file is put together. The example in the tutorial gives you snippets of assembly language and C to illustrate this. Also covered is linker feedback, where the compiler and linker collaborate to remove unused code, providing diagnostic .txt files so you can see which functions have been removed as you go. Read it here »


Advanced ARM Compiler features tutorial


4. Accessing Memory Mapped Peripherals


In most ARM embedded systems, peripherals are located at specific addresses in memory. It is often convenient to map a C variable onto each register of a memory-mapped peripheral, and then use a pointer to that variable to read and write the register. In your code, you must consider not only the size and address of the register, but also its alignment in memory. Read it here »
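The pattern the tutorial describes looks roughly like this. Note that the register name and the fixed address in the comment are hypothetical, and I point the pointer at an ordinary variable so the sketch builds and runs on a host PC:

```c
#include <stdint.h>

/* On real hardware the register sits at a fixed address, e.g.
 * (hypothetical address):
 *   #define UART0_DR (*(volatile uint32_t *)0x4000C000u)
 * For this host-buildable sketch we apply the same cast to an
 * ordinary variable instead. */
static uint32_t fake_uart_dr_storage;
#define UART0_DR (*(volatile uint32_t *)&fake_uart_dr_storage)

static void uart_write(uint32_t v)
{
    UART0_DR = v;       /* volatile: the store cannot be optimized away */
}

static uint32_t uart_read(void)
{
    return UART0_DR;    /* volatile: re-read from the register each time */
}
```

The volatile qualifier is the crucial part: without it the compiler is free to cache, reorder or eliminate accesses that, on a peripheral, have side effects.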


5. Targeting Processors, Floating-Point Units and NEON


Understand when to choose between architecture and processor during compilation, either to take full advantage of processor-specific features or to maximize compatibility. Every CPU target also has an associated implicit floating-point unit, which can be overridden at compile time if you want; this tutorial explains how and why you might do this. Finally, it covers enabling NEON and the various methods of creating code that uses NEON instructions. Read it here »

Targeting Processors, FPUs and NEON


6. Using Inline Assembly to Improve Code Efficiency


ARM Compiler provides an inline assembler that enables you to write optimized assembly language routines and to access features of the target processor not available from C or C++. This tutorial explains how to use the __asm keyword to incorporate inline ARM syntax assembly code into a function. Also covered is the use of named register values to access registers of an ARM architecture-based processor. Read it here »


For more DS-5 Development Studio tutorials, visit http://ds.arm.com/developer-resources/tutorials/

MDK-ARM Version 5 uses the new concept of Software packs for providing support for microcontroller devices and development boards. Software Packs can also contain software components such as drivers and middleware, including example projects and code templates.

The following types of Software Packs can be distinguished:

Software Pack Variants.png

  • Device Family Pack (DFP): generated by a silicon supplier or tool vendor; provides support to create software applications for a specific target microcontroller.
  • CMSIS Pack: provided by ARM® and includes support for CMSIS-Core, DSP, and RTOS.
  • Middleware Pack: created by a silicon supplier, tool vendor or a third party; reduces development time by giving access to popular software components (such as software stacks, special hardware libraries, etc). There are various Middleware Packs already available at www.keil.com/dd2/pack.
  • Board Support Pack (BSP): published by a board vendor to support the peripheral hardware mounted on the board.
  • In-house Software Pack: developed by the tool user for internal or external distribution of software components.


A complete set of application notes is now available to explain the basics behind each Software Pack and how to write and publish your own Pack successfully:


In addition, the MDK Version 5 website contains videos about:

  • MDK Version 5 Overview
  • Getting started with MDK Version 5
  • Software Packs, Peripheral Drivers, and Run-Time Environment
  • Product Lifecycle Management with Software Packs

The MDK-Professional Middleware website describes the Middleware Software Pack in more detail.


The new Device Database lists all available DFPs, whereas the Pack website shows all Software Packs.


If you want to know more about MDK-ARM Version 5 in general, please visit www2.keil.com/mdk5.


Hi, I'm very excited to announce the release of a new product we have been working on for the last few months, called the Cortex-M Prototyping System, part of the Versatile Express family of products. This development board is targeted at the evaluation and prototyping of Cortex-M based designs. It comes with fixed encrypted images of all the Cortex-M series of ARM processors (M0, M0+, M1, M3 & M4) and an application note for a Cortex-M0 design start in an example subsystem which is user modifiable. It is fully supported in ARM DS-5 and Keil MDK with example projects for each. It is an ideal platform for evaluating the different Cortex-M processors and for FPGA prototyping to allow early device driver development.


It offers

  • ~150K Logic Elements for user prototyping
  • 8MB of single cycle SRAM, 16MB of PSRAM
  • IO expansion
  • Wide range of debug connectors
  • Encryption support
  • Peripherals including Ethernet, audio, VGA, UART, SPI & touch screen.

Further information is available here. The board is available now (order code V2M-MPS2-0318A) at an affordable price of $995.


FPGA 'Adaptive' Debugging - what it is:


When you use ARM DS-5 Altera Edition with Altera Cyclone V SoC devices, the debugger understands your programmable FPGA logic design.  Does something not work?  Do you need the hardware to run faster? Simply recompile the FPGA portion of the design (or nicely ask your hardware designer to re-code and recompile), and the debugger automatically *adapts* and enables you to interact with the new 'hardware'. Pretty powerful stuff. 


Learn more about how you can gain Insight into your system and be more productive with ARM DS-5 Altera Edition and Cyclone V SoC.


1 Introduction

        ARM, the world's leading CPU/GPU architecture licensing company with more than 95% market share in mobile device CPU architectures, cares deeply about how it can help improve the performance of mobile internet applications on ARM-powered devices and thereby build a stronger ARM software ecosystem. ARM does provide performance analysis tools and open-source optimization projects to help app developers, but very few developers know about them.

       The Cocos2d-x game engine is one of the most popular game engines in the world, with more than 25% of the global game engine market and 70% of the market in China. Nearly all of the most popular mobile games in China are developed on this engine (e.g. Fish joy, I'm MT, big head), and most importantly, all of these games run on ARM-powered devices.

       After some investigation and in-depth discussion with the cocos2d-x founders, ARM set up this project to analyze the performance of the engine with ARM's DS-5 Streamline tool. After about three months, we found several hotspots in cocos2d-x and have so far helped optimize most of them, improving performance by about 30-70% on the same benchmark cases. Most importantly, the code patches we submitted have been accepted and integrated into the newly released cocos2d-x engine.

       This article shows the detailed steps ARM followed to profile the cocos2d-x engine, and how developers can use the DS-5 Streamline performance analysis tool to analyze their own mobile applications and improve app performance.

2 Preparation

  1. DS-5 Streamline tool

Download the archive from the ARM site: http://www.arm.com/products/tools/software-tools/ds-5/index.php

  2. A proper build environment

Prepare the build environment according to your Android source tree or the instructions described here:


  3. Android SDK and Platform tools


These tools contain the adb command, which we will use to connect the device to the host.

  4. Android NDK


This is required to compile Cocos2d-x for the Android platform.

  5. Cocos2d-x game engine

You can get the source of cocos2d-x from the following two places:

3 DS-5 installation and target preparation

ARM Streamline Performance Analyzer is a system-wide visualizer and profiler for ARM-powered targets running Linux and Android. It builds on system tracepoints, hardware and software performance counters, sample-based profiling, and user annotations to offer a powerful and flexible system analysis environment for software optimization.

3.1 Download and Install DS-5

Please install the DS-5 tools according to the instructions here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0482k/index.html

3.2 Target device preparation

To use ARM DS-5 Streamline, you need to prepare a target device with the DS-5 gator driver and daemon enabled. You can enable any smartphone you want according to:

Alternatively, buy a device on which our partners have already enabled DS-5 and use it directly.




For application developers, we suggest buying such a device directly, since it is hard to obtain the Linux kernel source and driver knowledge needed to build the DS-5 gator driver yourself.


For this project, we used a Spreadtrum sample device with a single-core ARM Cortex-A5 CPU and a single-core Mali-300 GPU, and we enabled the DS-5 gator ourselves. After the gator driver and daemon compiled successfully, we pushed them to the target and started gator with the following adb commands:

#adb push gator.ko /system/bin/

#adb push gatord /system/bin/

#adb shell

#chmod 777 /system/bin/gatord

#gatord &

4 Build and install the target profiling applications

In this project, we used two main applications for profiling: the official Cocos2d-x benchmark and the "Fishjoy2" game.

4.1 Build the benchmark application of Cocos2d-x

The benchmark app, named TestCpp, is stored in the Cocos2d-x source under the "samples" directory. It is the official test suite developed by the cocos2d-x team, and we used its performance-related test cases.


To build the TestCpp application of Cocos2d-x for the Android platform, please follow the instructions in the README.md file under the "samples/Cpp/TestCpp/proj.android" directory, or refer to: https://github.com/cocos2d/cocos2d-x/tree/develop/samples/Cpp/TestCpp/proj.android

For convenience, we wrote the bash script below to build it; feel free to adapt it.



# put this script in the root directory of cocos2d-x source, and execute it.

# then run it like this: ./build.sh

parent=$(cd $(dirname $0); pwd)

export ANDROID_SERIAL=19761202

export NDK_ROOT=/usr/local/adt-bundle-linux/android-ndk-r8e/

export API_ID="android-17"

android update project -p $parent/cocos2dx/platform/android/java/ -t "${API_ID}"

cd $parent/samples/Cpp/TestCpp/proj.android/

android update project -p . -t "${API_ID}"


./build_native.sh

if [ $? -ne 0 ]; then

    echo "failed to run ./build_native.sh"

    exit 1

fi


ant debug install


4.2 Build the Fishjoy2 application

For confidentiality reasons, we could not get the Fishjoy2 source code, so the Fishjoy2 team built it for us and provided the APK and .so files with debug info.


To make sure call-stack unwinding in Streamline's "Call Graph" view works properly during profiling, we suggest adding the "-fno-omit-frame-pointer" option when compiling your application; otherwise it is hard for Streamline to recover the call stack. For a Cocos2d-x application, add the following two lines to the file cocos2dx/Android.mk:

LOCAL_CFLAGS += -fno-omit-frame-pointer

LOCAL_EXPORT_CFLAGS += -fno-omit-frame-pointer

5 Start Streamline profiling

5.1 Connect DS-5 Streamline to the target

To profile an Android device with Streamline, you need to connect the Android target device to the host. Either use an Ethernet connection, or connect over a USB cable and forward the port with the command below:

#adb forward tcp:8080 tcp:8080

5.2 Configure Streamline

Start the DS-5 tool on your PC and open the "Streamline Data" view as shown below:


Click the Capture Options button (the gear icon) to open the configuration window, and set the configuration as follows:


  1. Connection
    • Address: localhost
  2. Capture
    • Sample rate: Normal
    • Buffer mode: Streaming
    • Duration: Unlimited (leave it blank)
    • Call Stack Unwinding: checked
  3. Energy Capture: leave as default
  4. Analysis
    • Process debug Information: checked
    • High Resolution Timeline: checked
  5. Program images
    • Click the first icon to add all necessary symbol files that include the debug information for the library files; normally they are under directories like this:

    • Add the symbol files of the TestCpp application.


5.3 Select the CPU/GPU counters you want to profile

Open the counter configuration tab and select the counters you would like to check and show in the Streamline analysis report; the left side lists the available counters, and the right side shows those already selected.


5.4 Collect and Check the profiling data

Click the Start Capture button to collect the Streamline data. A timer shows how long data has been collected; normally about 10 s is enough for profiling and analysis. Click the Stop button when you want to stop collecting.


After you click the Stop button, the Streamline analyzer starts automatically, and the following Timeline view opens once the analysis completes. All the counters you selected are shown in the Timeline view.


Click the Functions view to see the CPU usage percentage of each function. Normally, check the functions with the highest CPU usage to see whether they work as designed or hide potential performance issues.


Refer to the following link for more detailed information on how to use Streamline:



If you see a .so file in the Location column, it means you need to add that symbol file to "Program images" as described in section 5.2.

6 Profiling Stories

6.1 Profiling Story 1-- PerformanceTest NodeChildren B test case

· Run the test case

Start the TestCpp application on the test device and run the test case PerformanceTest->PerformanceNodeChildrenTest->B Iterate SpriteSheet, then click the + button to increase the node count to 15,000; the FPS is about 11.


· Collect profiling data and analysis it

Collect profiling data for about 10 s. From the Timeline view we can see that the CPU is busy, but since this case mainly iterates over an array in a near-infinite loop, high CPU usage is expected.


Then from the Functions view, we can see that the hotspot is the memcpy function, which takes about 50% of CPU time.


For this memcpy hotspot we checked:

  1. the memcpy method itself
  2. the functions that call memcpy

In the Streamline "Call Graph" view, we found that it is the updateQuad method of the CCTextureAtlas class that calls memcpy continuously.


Finding solutions:

  1. The memcpy method itself

After checking the memcpy implementation, we found it is already optimized with NEON instructions and differs little from other implementations (e.g. the Google Android and Linaro versions). There is no further optimization opportunity in memcpy itself, so we turned to its callers.

  2. The functions that call memcpy

Digging into the source of the updateQuad method:


We found an "=" statement that assigns the large struct ccV3F_C4B_T2F, which is 96 bytes. Given how the Android toolchain works, this struct assignment calls the memcpy function at runtime.

After investigating the source and some discussion with the cocos2d-x engine team, we believed it possible to reference the quad elements directly in the code that calls this updateQuad method.

For example, changing the following code:

_textureAtlas->updateQuad(&_quad, _atlasIndex);

to:

quad = &((_textureAtlas->getQuads())[_atlasIndex]);

quad->bl.colors = _quad.bl.colors;

The code patches for this solution are:



· Optimization result

  1. CPU time in Functions tab (54.62% -> 9.10%)

The CPU time of the memcpy function reduced from 54.62% to 9.10% after optimization.


  2. FPS on screen (11.3 fps -> 17.2 fps)

The FPS increased from 11.3 to 17.2, a performance improvement of roughly 50% for this specific case.


6.2 Profiling Story 2-- PerformanceTest Sprite A(1) case

· Run the test case

Start the TestCpp application on the test device and run the test case: PerformanceTest->PerformanceSpriteTest->A(1) position, click the + button to increase the nodes to 500.


· Collect profiling data and analysis it

Collect profiling data for about 10 s; the Timeline view shows that, so far, the CPU is not too busy.


But from the Functions view, the idle process (sc8810_idle) takes about 73.43% of CPU time.


From experience, this means the system actually is busy: the CPU is waiting for something to complete, such as I/O. In this case the main "I/O" is the GPU, so we need to check GPU status with Streamline. This requires the Mali support module for the gator driver.

For this GPU hotspot, experience suggests checking two things:

  1. Instruction failed texture-miss count

Open the Counter configuration window, add the two counters below to the collection list, save, and recapture the Streamline data:

  • Mali GPU Fragment Processor 0: Instruction completed count
  • Mali GPU Fragment Processor 0: Instruction failed texture-miss count


Then we can see that the failed texture-miss count is about 8,030,551, meaning many instructions failed to load the texture during fragment shading.


  2. The overdraw factor

Open the Counter configuration window, add the two additional hardware counters below, and recapture the Streamline data:

  • Mali GPU Fragment Processor 0: Active Clock Cycles
  • Mali GPU Fragment Processor 0: Fragment passed z/stencil


Then we can see that the fragments passed z/stencil count is about 8,573,446.


With the overdraw formula below, overdraw is about 22.3, which is far too high; the overdraw factor for a typical application should be around 3.

overdraw = "Fragments passed z/stencil count" / "Device resolution"

                = 8573446 / (800 * 480)

                = 22.3

Find the solutions:

  1. Instruction failed texture-miss count

The cache of the Mali-300 in the device we used is only 8 KB, which is likely the main cause of the huge number of texture misses. From GPU experience, compressing textures helps reduce these misses. Unfortunately, the cocos2d-x engine did not support compressed textures; after some technical discussion between ARM's GPU experts and the cocos2d-x developer team, the ETC1 format is now supported in the latest engine.

To test the performance impact of texture compression, we converted the .png file to ETC1 format and changed the code below from:

sprite = Sprite::create("Images/grossinis_sister1.png");

to:

sprite = Sprite::create("Images/grossinis_sister1.pkm");

Note 1:

ARM provides a tool named "Mali GPU Texture Compression Tool" to help convert PNG files to ETC1 format; you can download it from: http://malideveloper.arm.com/develop-for-mali/mali-gpu-texture-compression-tool/

With this tool, you can convert a PNG file to a .pkm file in ETC1 format with one simple command: "./etcpack grossinis_sister1.png ./ -c etc1". For more information on installing and using the Mali GPU Texture Compression Tool, refer to: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0503e/index.html

Note 2:

Cocos2d-x does not yet support an alpha channel with the ETC1 format; see the following link for how to handle alpha channels with ETC1: http://malideveloper.arm.com/develop-for-mali/sample-code/etcv1-texture-compression-and-alpha-channels/

  2. The overdraw factor

Normally, the object drawing order has a large impact on overdraw: back-to-front is the worst case and front-to-back the best. After checking with the Cocos2d-x team, we were told that all objects have the same Z-order, which unfortunately produces the worst-case back-to-front behaviour and hence the highest overdraw. That is why the Streamline report shows the fragment shader costing so much and the fragment GPU being so busy.

The typical way to reduce the overdraw factor is to have the app draw its objects from front to back instead of back to front, by Z-sorting on the CPU side before submitting geometry to the GPU.

The Cocos2d-x team agrees with ARM's proposal but has not yet adopted it, since it might require a big architectural modification and they need to evaluate the side effects. If your profiling report shows the same overdraw issue, please try ARM's proposal above.

· Optimization result

  1. Instruction failed texture-miss count (8,030,551 -> 3,081,109, 61.6%)

The failed texture-miss count reduced from 8,030,551 to 3,081,109 after switching to the ETC1 format.



  2. FPS on screen (9.3 fps -> 12.0 fps, 29%)

The FPS increased from 9.3 to 12.0, a performance improvement of about 30% with ETC1 support.


6.3 Profiling Story 3-- FishJoy2(Start Game)

· Run the test case

First, make sure the device is connected to the internet via Wi-Fi, then start the Fishjoy2 app.


· Collect profiling data and analysis it

Start the Streamline capture by clicking the "Start Capture" button, then click the START button to begin playing the game; stop the capture once the scene selection window is displayed.

In the Timeline view, drag the two blue icons on the time ruler to cover the data for the start operation only. We can see that the START operation takes about 3.5 s (2.2 -> 5.7), during which the CPU is busy and the GPU is idle.


In the Functions view, we can see that pthread_mutex_unlock and pthread_mutex_lock together take 17.22% of CPU time (9.51% + 7.71%).


Find the solutions:

After talking with the FishJoy2 team, they confirmed that pthread operations were not expected to take so much CPU time; they found a defect in the source code and fixed it.

· Optimization result

  1. Start time (3.5 s -> 2.5 s, 28.6%)

After getting the updated APK and recapturing the Streamline report, you will see the start operation time reduced from 3.5 s to 2.5 s (2.1 -> 4.6).


  2. CPU time (17.22% -> 12.55%, 27.1%)

The Functions view shows that the CPU occupancy of the pthread operations reduced from 17.22% to 12.55% (7.18% + 5.37%).


6.4 Profiling Story 4-- FishJoy2(Quick click to play the game)

· Run the test case

Start the FishJoy2 application and play the game for about one minute.


· Collect profiling data and analysis it

Capture Streamline data for about 30 s; the Timeline view shows that the fragment GPU is very busy.


The Functions view shows the idle process taking the highest CPU time. We can also see many floating-point helper calls taking significant CPU time, e.g. __addsf3/__mulsf3/__eqsf2.


Find the solutions:

For the idle process and the high GPU usage, we already know this is the same problem we met in Profiling Story 2.

The many floating-point helper calls taking high CPU time were abnormal, since ARM has long optimized this kind of functionality. After some discussion with the Fishjoy2 team, we found that the game was compiled for the armeabi ABI, not armeabi-v7a. We suggested the Fishjoy2 team recompile the APK with the armeabi-v7a option enabled, as shown below:


$ cat samples/Cpp/TestCpp/proj.android/jni/Application.mk

APP_STL := gnustl_static



APP_ABI := armeabi-v7a



· Optimization result

After compiling the game with the armeabi-v7a ABI, the floating-point helper functions disappeared from the high-CPU-time list.


7 Conclusion

The cocos2d-x profiling project demonstrates that ARM Streamline is a very powerful tool for helping application developers analyze performance, find application hotspots, and then optimize their applications. The project's output so far is very positive: it not only found code-logic hotspots in the cocos2d-x game engine, but also revealed some potential limitations in its design architecture.

The cocos2d-x team thanked ARM on their official social accounts (Sina Weibo/Twitter/Facebook) for all our effort, especially the submitted code patches, which they believe will benefit the whole cocos2d-x community. Meanwhile, cocos2d-x engineers have started using DS-5 Streamline to profile the latest engine themselves.

Finally, we would like to share that some key Chinese mobile internet companies, such as UCWeb, Tencent, and Alibaba, are now starting to use ARM DS-5 Streamline for performance analysis themselves.

8 Official Thanks from cocos2d-x

