
ARM Processors


As described in my last article, AArch64 performs stack pointer alignment checks in hardware. In particular, whenever the stack pointer is used as the base register in an address operand, it must have 16-byte alignment.


The alignment checks can be very inconvenient in code generators where it is not feasible to determine how much stack space a function will require. Many JIT compilers fall into this category; they tend to rely on being able to push individual values to the stack.


The Problem


For conventional C and C++ compilers, the stack pointer alignment restrictions in AAPCS64 don't seem to cause much trouble¹. Many C functions start with a prologue that allocates the stack space required for the whole function. This space is then accessed as needed during the function. This is possible because the C compiler can determine in advance the stack space that will be required. Special handling will be required for variable-length arrays and alloca, but these are special cases that aren't often seen in real code.
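

A typical prologue and epilogue of this kind (an illustrative sketch rather than the output of any particular compiler) might look like this:


// Illustrative C-style prologue and epilogue: allocate once, free once.
  stp   x29, x30, [sp, #-32]!   // Allocate 32 bytes; save the frame pointer and link register.
  mov   x29, sp                 // Establish the new frame pointer.
  //    ...                     // The body uses [sp, #16] and [sp, #24] for locals.
  ldp   x29, x30, [sp], #32     // Restore the saved registers and free the space.
  ret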


JIT compilers (and other time-constrained code generators) cannot usually do this because it is expensive to analyse the code to extract this information. Also, many simple compilers are based around a stack machine, and assume that there is an efficient push implementation for an arbitrary number of registers. This is easy to manage in AArch32 because the basic data type is usually a single 4-byte register, and these can be pushed individually (between function calls) without violating any sp alignment rules. However, for AArch64, the required stack pointer alignment is two x or four w registers, and it is not possible to push individual registers.


// Broken AArch64 implementation of `push {x1}; push {x0};`.
  str   x1, [sp, #-8]!  // This works, but leaves `sp` with only 8-byte alignment ...
  str   x0, [sp, #-8]!  // ... so the second `str` will fail.


The most appropriate method of implementing push and pop operations will depend on the nature of the engine you are using. I considered a number of possible solutions for use in the AArch64 port of the Google V8 JavaScript engine. I will present each idea along with its advantages and disadvantages.


Calculate stack sizes in advance.


If the required analysis is possible, it can result in fast generated code and efficient use of stack memory, so I've included this as a kind of benchmark, even though it might not be possible for many JIT compilers. The generated code will typically look something like this:


sub   sp, sp, #(8 * 14)       // Allocate space for the whole block.
str   x0, [sp, #(8 * 11)]     // Write to slot 11.
ldr   x0, [sp, #(8 * 11)]     // Read from slot 11.
add   sp, sp, #(8 * 14)       // Free the space at the end of the block.


Indexed addressing modes can sometimes be used to combine some of the operations. For example:


str   x0, [sp, #-(8 * 14)]!   // Allocate space and write to slot 0 in one step.


Depending on the design of your compiler (and your source language), it might be possible to calculate stack usage for individual basic blocks, even if function-level analysis isn't feasible. You'll have a separate allocation instruction (sub) for each block, but this is still cheaper than some other approaches that I'll describe in this article.
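

For example, a sketch of what per-block allocation might look like (the labels and slot counts here are made up for illustration):


// Hypothetical basic block that needs at most four 8-byte slots.
block_a:
  sub   sp, sp, #(8 * 4)        // Allocate this block's slots on entry.
  str   x0, [sp, #(8 * 2)]      // Write to slot 2.
  ldr   x0, [sp, #(8 * 2)]      // Read from slot 2.
  add   sp, sp, #(8 * 4)        // Free the slots before leaving the block.
  b     block_b                 // `block_b` is a placeholder label.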




Use 16-byte stack slots.


Let's start with a simple, quick-and-dirty approach.


It sounds wasteful – and in most cases it is – but the simplest way to handle the stack pointer can be to push each value to a 16-byte slot. This doubles the stack usage (for pointer-sized values), and it effectively reduces the available memory bandwidth. It is also awkward to implement multiple-register operations using this scheme, since each register requires a separate instruction.


In general, I don't consider this approach to be appropriate. However, it does have one significant advantage, which is that it is very simple; there might be situations where this simplicity is worth the cost.


str   x0, [sp, #-16]!         // push {x0}
ldr   x0, [sp], #16           // pop {x0}
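

Under this scheme, pushing several registers still costs one store per register; a sketch (not from the original article) of pushing two registers:


  str   x1, [sp, #-16]!         // push {x1} into its own 16-byte slot ...
  str   x0, [sp, #-16]!         // ... and push {x0} into another.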


Use a register other than sp as the stack pointer.


This mechanism is simple in principle: if the alignment restrictions of sp are inconvenient, just use another register as your stack pointer. General-purpose registers have no special alignment restrictions. Interfaces with PCS-compliant code (such as the C or C++ parts of the virtual machine) need to synchronise sp and the replacement stack pointer, but this is usually simple and quite cheap.
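

As a rough sketch of what that synchronisation might look like (the register choice and function name are illustrative only):


// Hypothetical hand-off to PCS-compliant code while x28 is the working stack pointer.
  and   x16, x28, #0xfffffffffffffff0  // Round down to 16-byte alignment ...
  mov   sp, x16                        // ... and expose the result as the architectural sp.
  bl    some_c_function                // `some_c_function` is a placeholder name.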


There is a notable complication: memory below the architectural sp (but in the stack area) cannot be safely accessed. Notably, this area is used by signal handlers, which execute asynchronously (like interrupts). If we just copy sp to some other register and start using it as a (descending) stack pointer, our special stack area will eventually be corrupted.


Separate stack area.



One way to use a separate register for the stack is to have a completely separate area of memory allocated for generated code to use as a stack. The two stacks would grow and shrink independently, and the procedure-call standard would apply only to the architectural stack. You must ensure that you allocate enough memory, but on most platforms you can allocate a large range of contiguous virtual addresses without actually reserving physical memory. (This is how Linux creates the normal process stack, for example.)


There aren't very many complications with this technique. Generated code must be careful around entry and exit points, but not significantly more than usual. The biggest complication in most situations will be integration with other components. For example, in a virtual machine where a garbage collector needs to scan and update the stack, it also needs to be aware of the special stack area.
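

A minimal sketch of the generated-code side, assuming the runtime has already allocated the region and passed its highest address in x0:


  mov   x28, x0                 // x28 becomes the stack pointer for generated code.
  str   x1, [x28, #-8]!         // push {x1} on the separate stack.
  ldr   x1, [x28], #8           // pop {x1}
  // The architectural sp is never touched here, so the PCS rules still hold for it.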


Reserve stack space in advance.


If the application can reliably predict a maximum stack space for a given function, the entry point can simply move sp down temporarily to accommodate this space. It is often easier to determine the maximum stack space required than it is to determine precisely how much stack is needed.


  // Using x28 as a replacement stack pointer.
  sub   sp, x28, #max_stack_space
  str   x0, [x28, #-8]!   // push {x0}
  ldr   x0, [x28], #8     // pop {x0}


Note that sp doesn't need to be kept 16-byte aligned in the example above because it isn't used to access memory.


Sadly, although finding an upper limit on the required stack space is easier than calculating the usage exactly, it still often requires analysis that isn't easily available, so this is definitely not a drop-in solution.



Shadow sp.


Another solution is to update sp just before every push. sp won't necessarily be 16-byte aligned, but since it is never used to access memory, it doesn't matter. This method is what the Google V8 JavaScript engine uses, and it's also what VIXL's MacroAssembler uses if you tell it to use a different stack pointer.


sub   sp, x28, #8           // preparation
str   x0, [x28, #-8]!       // push {x0}
ldr   x0, [x28], #8         // pop {x0}


In general, there is no need to unwind the architectural sp on pop instructions, since it is harmless to leave it where it is.


With some care, the preparation step for several pushes can be combined in order to minimise the code-size overhead. (If you take this far enough, it starts to look quite similar to the "reserve stack space in advance" proposal above.)


// Several pushes can share a single preparation step.
sub   sp, x28, #32          // preparation
stp   x3, x2, [x28, #-16]!  // push {x2}; push {x3};
stp   x1, x0, [x28, #-16]!  // push {x0}; push {x1};


Aside from the wasteful 16-byte-per-slot mechanism, this shadow-sp design is probably the simplest drop-in solution available; push and pop macros can be written to hide the alignment restrictions for ad-hoc usage, and no additional analysis is required to get it to work. It performs well, since most processors can execute the sub and str at the same time. The only significant cost to be aware of is the code size overhead, especially where you have many small pushes.
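

For example, a pair of GNU assembler macros along these lines could wrap the pattern (the names and the choice of x28 are mine, not VIXL's):


.macro push_one reg
  sub   sp, x28, #8             // Keep the architectural sp at or below the shadow stack pointer.
  str   \reg, [x28, #-8]!
.endm

.macro pop_one reg
  ldr   \reg, [x28], #8         // No need to move sp back up.
.endm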


In Conclusion


None of these ideas will work well in every context, so the best choice really depends on the constraints that you have to work within. However, hopefully I've explained a few of the practical problems that you're likely to face, and given a bit of inspiration.



¹ I've never actually worked on a C compiler, but the stack allocation behaviour of typical compilers is clear from disassembly.


ARM TrustZone CryptoCell

Posted by wangyong Nov 21, 2015


CryptoCell is a range of security sub-systems and hardware components that provide platform level security as well as hardware support for security acceleration and offloading.

CryptoCell’s architecture level protection provides tools and building blocks for a wide range of applications including: content protection, IoT security, encryption and provisioning.

The CryptoCell digital security subsystem serves as an infrastructure for security-related use cases running on the SoC and comprises hardware, firmware and SoC-external tools.

CryptoCell includes efficient hardware cryptographic engines, RNG, root of trust/key management, secure boot, secure debug and lifecycle management.

The CryptoCell-300 series of products is usually coupled with ARM Cortex®-M CPUs, while the CryptoCell-700 series is integrated with Cortex-A application processors.

CryptoCell enables SoC architects to trade off area, power, performance and robustness in a very flexible manner. Designs can be optimized to achieve the security vs. cost “sweet spot” appropriate to the target market.

CryptoCell Product Highlights

  • CryptoCell is an embedded security platform suitable for a wide range of SoC markets including automotive, mobile, IoT and deeply embedded. It is compatible with processors that have the TrustZone architectural extensions, but can also be used where these are absent (such as with Cortex-R processors).
  • CryptoCell offers an outstanding level of security, while addressing challenging requirements for increased system complexity, high performance, low power consumption and small footprint.
  • CryptoCell's multi-layered hardware and software architecture combines hardware accelerators and root-of-trust control hardware with a rich layer of security software and off-chip tools.
  • The CryptoCell architecture is modular and flexible by design, allowing the security solution to be tailored to meet market requirements (all security services offered by TrustZone CryptoCell can be included or excluded from the final package of hardware and software delivered to customers).
  • CryptoCell can be configured to address different platform level security requirements as well as specific protocol related requirements (e.g. IPsec, HomeKit).

The CryptoCell-700 series and CryptoCell-300 series address different platform needs: CryptoCell-300 series is usually coupled with Cortex-M CPUs for environments that require a small footprint (e.g. IoT) and CryptoCell-700 series is usually coupled with Cortex-A CPUs for performance intensive use cases (e.g. mobile).

The following diagram (Fig 1.) illustrates the different components in the TrustZone CryptoCell subsystem.


Figure 1. TrustZone CryptoCell High Level Block Diagram



Addressing key security requirements

Digital devices face a wide range of possible threats, and CryptoCell addresses the different security requirements coming from different stakeholders. Standards bodies and commercial organizations, such as Microsoft, Google, Apple, DTLA, DCP LLC, OMTP, CMLA and others, define different attack vectors as pertinent:

  • Software attacks
  • Inter-chip signal probing
  • Board level software-based debug and test attacks
  • Physical interface attacks
  • Memory or any other non-SoC element replacement attacks
  • Off-line modification of the contents of non-volatile storage (e.g., Flash, EPROM)

To enable SOC vendors to address these attack vectors, CryptoCell offers protection of key device assets. Key device assets usually include:

  • Software code images (system, application, etc.).
  • Secret data, such as device keys and personal/corporate data.
  • Protected content, such as DRM audio and video files/stream.

TrustZone CryptoCell helps meet these security requirements and provides the necessary tools and building blocks to mitigate such attacks.

Security Certification and Compliance

Security certification standards such as FIPS 140-2, Common Criteria and GlobalPlatform TEE certification are all targeted at verifying the security of complete products.

TrustZone CryptoCell provides the tools and building blocks necessary to comply with these standards.

TrustZone CryptoCell provides the security infrastructure to comply with the robustness rules published by many standardization bodies and commercial organizations such as: Microsoft, Apple, Google, CMLA, DTLA, 4C, DCP LLC, Netflix and IETF.

Commercial deployment and market traction

CryptoCell is commercially deployed within chipsets covering many different verticals and markets such as mobile, IoT, home entertainment and automotive.

source: TrustZone - ARM

When reading assembly-level code for any of the AArch32 or AArch64 instruction sets, you may have noticed that the stack pointer has various alignment and usage restrictions. These restrictions are part of the procedure-call standard – the set of common rules that allow functions to call one another. However, some of the rules also apply even if you aren't actually handling function calls. The stack is shared between parts of an application, any libraries that it uses, and signal handlers, so it is important that these components agree on how the stack should behave.


If you're just writing C code, the compiler will sort this all out for you, but you'll need to understand the rules if you're dealing with any assembly code that needs to interact with the stack.


This article assumes that your platform uses ARM's AAPCS (for AArch32) or AAPCS64 (for AArch64). This is the case on Linux and Android, but other systems may define their own standards.


Shared Stack-Usage Rules



For both AArch32 and AArch64:


  • The stack is full-descending, meaning that sp – the stack pointer – points to the most recently pushed object on the stack, and it grows downwards, towards lower addresses.
  • sp must point to a valid address in the memory allocated for the stack.
    • Formally, sp must lie in the range stack_limit < sp <= stack_base, though the values of stack_limit and stack_base are often inaccessible.
  • The memory below sp (but above stack_limit) must not be accessed by your code.
    • In practice, signal handlers use this memory, so it can be corrupted unexpectedly and without warning.
  • At public interfaces, the alignment of sp must be two times the pointer size.
    • For AArch32 that's 8 bytes¹, and for AArch64 it's 16 bytes.
    • A "public interface" is typically a function that is visible to some other, separately-compiled code. The exact definition depends upon the language and the toolchain, and is out of scope of this article. It's reasonable to assume that any C or C++ functions that you interact with using assembly are treated as public interfaces.


Rules Specific to AArch32


For AArch32 (ARM or Thumb), sp must be at least 4-byte aligned at all times. As long as you only push and pop whole registers, this restriction will never be broken.
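

For example (an illustrative ARM/Thumb snippet):


  push  {r0, r1}                // sp stays 8-byte aligned.
  pop   {r0, r1}
  push  {r2}                    // sp is now only 4-byte aligned, which is fine
                                // anywhere except at a public interface.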


Rules Specific to AArch64


For AArch64, sp must be 16-byte aligned whenever it is used to access memory. This is enforced by AArch64 hardware.


  • This means that it is difficult to implement a generic push or pop operation for AArch64. There are no push or pop aliases like there are for ARM and Thumb.
  • The hardware checks can be disabled by privileged code, but they're enabled in at least Linux and Android.


C compilers will typically reserve stack space at the start of the function, then leave sp alone until the end, so the restriction is not as awkward as it first seems. However, you must be aware of it when handling assembly code, and it can be tricky for simple compilers (such as stack-based JIT compilers).


Note that unlike AArch32, arbitrarily-aligned values can be stored in sp, as long as the previously-described rules are followed for memory accesses and public interfaces. This is useful for allocating variable-length arrays of small values, for example:


// Allocate a variable-length array of bytes on the stack.
  sub sp, sp, x0                    // x0 holds the length.
  and sp, sp, #0xfffffffffffffff0   // Align sp.
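

Since sp then holds a data-dependent value, the usual way to free such an allocation (a sketch I've added, not part of the original example) is to restore sp from a copy saved before it was modified, such as the frame pointer:


  mov   x29, sp                     // Save the incoming stack pointer.
  sub   sp, sp, x0                  // x0 holds the length.
  and   sp, sp, #0xfffffffffffffff0 // Align sp.
  //    ... use the array ...
  mov   sp, x29                     // Free the variable-length allocation in one step.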


Push and Pop on AArch64


The alignment check on memory accesses means that AArch64 cannot have general-purpose push- or pop-like operations.


For example:


// Broken AArch64 implementation of `push {x1}; push {x0};`.
  str   x1, [sp, #-8]!  // This works, but leaves `sp` with 8-byte alignment ...
  str   x0, [sp, #-8]!  // ... so the second `str` will fail.


In this particular case, the stores could be combined:


// AArch64 implementation of `push {x0, x1}`.
  stp   x0, x1, [sp, #-16]!


However, in a simple compiler, it is not always easy to combine instructions in that way.


If you're handling w registers, the problem will be even more apparent: these have to be pushed in sets of four to maintain stack pointer alignment, and since this isn't possible in a single instruction, the code can become difficult to follow. This is what VIXL generates, for example:


// AArch64 implementation of `push {w0, w1, w2, w3}`.
  stp   w0, w1, [sp, #-16]!   // Allocate four words and store w0 and w1 at the lower addresses.
  stp   w2, w3, [sp, #8]      // Store w2 and w3 at the upper addresses.
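

A matching pop sequence (my sketch, not necessarily what VIXL emits) reverses the two steps:


// AArch64 implementation of `pop {w0, w1, w2, w3}`.
  ldp   w2, w3, [sp, #8]      // Load w2 and w3 from the upper addresses.
  ldp   w0, w1, [sp], #16     // Load w0 and w1, then free all four words.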




If you're dealing with hand-written AArch64 assembly code, you'll have to be aware of these patterns.


Many JIT compilers have a tricky situation, though: such compilers are built around a simple stack machine, and expect to be able to push and pop in an ad-hoc fashion. Managing this on AArch64 requires an inventive approach, and I'll describe a few possibilities in a follow-up article.



¹ Some time ago I was told that the 8-byte alignment restriction exists to allow the use of instructions such as ldrexd and strexd, which require an 8-byte-aligned address. Without a guarantee that a function will be entered with proper alignment, these instructions would be awkward to use on stack variables. There may also be other reasons, but I don't know what they are, and AAPCS doesn't document them.

If you’re ready to step your Arduino game up from 8-bit MCUs, the newly-unveiled SparkFun SAM D21 Dev Breakout is a great way to start. The Arduino-sized breakout for the Atmel | SMART ATSAMD21G18 — a 32-bit ARM Cortex-M0+ processor with 256KB of Flash, 32KB SRAM and an operating speed of up to 48MHz — provides you with an Arduino hardware option that solves the problems of low storage limits and dynamic memory stack overflows that have plagued the previous iterations of the Arduino family. Even better, the SparkFun SAM D21 Dev Breakout is fully supported in the Arduino IDE and libraries for the Arduino Zero.


The SparkFun SAM D21 Dev Breakout is equipped with a USB interface for programming and power, an RTC crystal, and a 600mA 3.3V regulator. By utilizing the Pro R3's extra PCB real estate, SparkFun has been able to leave room for a few extra GPIO pins and an integrated LiPo charger. To power this board, simply plug it into a USB port on your computer via its micro-B port.

Not near a USB port? Don’t fret, the SparkFun SAM D21 Dev Breakout is also equipped with a LiPo battery connector and an unpopulated supply input where you can solder on your own PTH barrel jack. If you’ve used any Arduino before, this pinout shouldn’t surprise you – the layout meets the Arduino 1.0 footprint standard, including a separate SPI header and additional I2C header.


One of the most unique features of the SAM D21 is SERCOM — a set of six configurable serial interfaces that can each be turned into a UART, I2C master, I2C slave, SPI master, or SPI slave. Each SERCOM provides a lot of flexibility: the ports can be multiplexed, giving you a choice of which task each pin is assigned.

SparkFun has made a SAM D21 Mini/Dev Breakout Hookup Guide available online, which includes step-by-step instructions on how to connect your board as well as a few circuit examples to test out. Intrigued? Head over to its official page here to get yours!

This blog post originally appeared on Atmel Bits & Pieces.


Austin Convention Center
500 East Cesar Chavez Street, Austin, TX
ARM Booth #2015 – Exhibit Hall 2


The Radisson Hotel
111 East Cesar Chavez Street, Austin, TX
ARM Training & Meetings – Riverside South, Second Floor



November 16-19, 2015



ARM will have a featured presence throughout the conference in a variety of areas including:

  • Exhibition Booth #2015 – November 16-19, Convention Center
    • SC Exhibition Welcome Reception: November 17, 7:00-9:00 p.m.
    • ARM in-booth Happy Hour: November 18, 4:30-6:00 p.m.
  • ARM Training and Meetings – Radisson Hotel, Riverside South, second floor
    • ARM on HPC Presentations – November 17, on-going sessions
    • OCP HPC User Group – November 18, 8:30-10:00 a.m.
    • Gem5 Workshop – November 18, 1:00-5:00 p.m.
  • Job & Opportunity Fair: November 18, 10:00 a.m. to 3:00 p.m., Convention Center
  • Panel: Future of Memory Technology for Exascale and Beyond III, November 17, 3:30-5:00 p.m., Room 16AB
  • BoF: Taking on Exascale Challenges: Key Lessons and International Collaboration Opportunities Delivered by European Cutting-Edge HPC Initiatives – November 19, 3:30-5:00 p.m., Room 13A



ARM’s presence at SC15 will provide a closer look at our innovative technology and partner ecosystem collaboration for server platforms. The exhibit booth will feature ARM and its partners demonstrating diverse silicon platforms for HPC including:

  • E4 Company ARKA Series featuring ARMv8 + GPU +1B
  • SoftIron HPC solutions featuring AMD Opteron A1100 + ARMv8
  • AtGames AMAAS.NET featuring ARM + Silver Lining Unified Fabric Architecture
  • ARM Performance Libraries for HPC   


View the event in the calendar here

During ARM TechCon this week, Synopsys announced the availability of our VC Verification IP for the new ARM AMBA 5 Advanced High-performance Bus (AHB5) interconnect. In addition, we announced extended system-level verification capabilities in our VIP.


The AHB5 protocol is an update to the widely adopted AMBA 3 AHB specification. It extends the TrustZone security foundation from the processor to the entire system for embedded designs. AHB5 supports the newly announced ARMv8-M architecture which drives security into the hardware layer to ensure developers have a fast and efficient way of protecting any embedded or Internet of Things (IoT) device.

AHB5 can enable high-performance multi-master systems with support for exclusive transfers and additional memory attributes for seamless cache integration.  It adds multiple logical interfaces for a single slave interface so you can address multiple peripherals over one bus.

The new AHB5 protocol also enables closer alignment with the AMBA 4 AXI protocol, enabling easier integration of AXI and AHB5 systems. AHB5 also adds support for secure/non-secure signaling so peripherals can keep state correctly, and adds support for user signals.

Synopsys System Environment for AMBA AHB


The existing Synopsys system environment for AHB, which is part of the Synopsys system environment for AMBA interconnect, supports AMBA AHB2 and AHB3-Lite. Now, we have extended support for AHB5 protocol. Users simply have to change a few configuration attributes and required signal connections for that configuration.


In addition, Synopsys VC VIP offers advanced system-level capabilities for the ARM AMBA 5 CHI and AMBA 4 ACE protocols. The AMBA 5 CHI is an architecture for system scalability in enterprise SoCs, while AMBA 4 ACE is used for full coherency between processors. The expanded capabilities of Synopsys VIP include system level test-suites, a system monitor, protocol-aware debug and performance analysis. With the growth of cache-coherent designs, checkers and performance analysis are required. The system-level capabilities of Synopsys VIP enable SoC teams to further accelerate time to first test and improve overall verification productivity.


Synopsys VIP features SystemVerilog source code test-suites, which include system-level coverage for accelerated verification closure. The VIP now also offers performance measurement metrics for in-depth analysis of throughput, latency and bottlenecks across cache coherent ports. Synopsys VIP also features system monitors, which interact with other VIP to ensure cache coherency across the system, accurate protocol behavior and data integrity.


To learn more about VIP support for ARM Cache Coherent Interconnects, register for our webinar on November 18th: A Holistic Approach to Verification: Synopsys VIP for ARM AMBA Cache Coherent Interconnects.

ARM introduces TrustZone for ARMv8-M to bring mobile style security to microcontrollers and provides a new family of security subsystems: TrustZone CryptoCell

I’m at Techcon in Santa Clara this week and will be giving a talk on designing trustworthy devices later today.  On the exhibition floor are 3 flavours of TrustZone technology: two of them are brand new and I would like to spend this blog introducing them to you.

Security is a major theme of the show and two of the biggest announcements for me are the extension of TrustZone security to microcontrollers and the addition of a new, deeper layer of security with the introduction of TrustZone CryptoCell.

The number of devices that connect to the internet in your home is set to soar. Strategy Analytics forecasts that by 2020 there will be over 30 billion connected gadgets.  Today’s smart connected devices are based on applications processors such as ARM’s Cortex-A family.  They run major OS such as Linux and are protected by multiple layers of hardware based security, including TrustZone technology.  However, much of the future growth will come from simpler, lower cost microcontroller based devices.  Traditionally microcontrollers have had little built-in security but now that a modern Cortex-M based chip can run sophisticated internet protocols such as Transport Layer Security (TLS, formerly known as SSL) they also require a hardware based security architecture.

Our growing reliance on technology requires that security be deeply integrated at multiple levels to protect users, services and devices. The Internet of Things revolution will need to be built on trustworthy devices. Service providers need to trust the data, and this in turn means that the end points need to be secure from malicious attack. Consumers who buy a connected gadget will expect it not to be hacked. So the pressure will be on OEMs to deliver secure devices that provide an appropriate level of security robustness for the assets they are protecting.


TrustZone today

Security on applications processors has been maturing over the last ten years, originally driven by the needs of smartphones and more recently enterprise platforms. TrustZone technology is used on billions of devices to provide the hardware isolation for a Trusted Execution Environment (TEE). A TEE provides a secure enclave to protect sensitive code and data with the security promises of integrity and confidentiality; for example, a malicious application should not be able to read the private keys stored on the device. The TEE is designed to protect against scalable software attacks and, if someone has stolen your device, against common hardware attacks sometimes referred to as “shack attacks” (attacks from a knowledgeable attacker with access to normal electronics-enthusiast equipment).

The TrustZone based TEE provides a “Secure World” where the security boundary is small enough to offer a route to certification and provable security. It is typically used for securing cryptographic keys, credentials and other secure assets. TrustZone offers a number of system security features not available to the hypervisor: it can support secure debug, offer secure bus transactions and take secure interrupts directly into the Trusted World (useful for trusted input). There is an argument to restrict the amount of security functionality in the trusted world to limit the attack surface and make certification a practical proposition.


The TrustZone security extensions work by providing the processor with an additional ‘secure state’ that allows secure application code and data to be isolated from normal operations.  This partitioning enables a protected execution environment where trusted code can run and have access to secure hardware resources such as memory or peripherals. Conventionally, the Trusted World is used with its own dedicated secure operating system and a trusted boot flow to form a TEE that works together with the conventional operating system, such as Linux® or Android™, to provide secure services. 



A TrustZone based Trusted Execution Environment has become a popular security building block of modern applications processors.


Anyway, on to the new stuff…


The IoT Challenge:


The success of the Internet of Things depends on consumers and services being protected: this requires security to be designed into the hardware and firmware of chips from the outset rather than being bolted on later. Now that even low-cost ARM microcontrollers are capable of running internet protocols, it is clear that hardware-based security needs to be present in the tiniest IoT platforms. The challenge then is how we bring high quality security solutions to all platforms and price points and help ensure the success of the IoT.



Two New TrustZone Technologies


Getting chip based security right is difficult and requires an unbroken chain of well engineered hardware and software that works together.   By providing specialized engineering in the form of TrustZone technology, ARM enables this chain to be established so that all platforms can benefit from high quality security solutions.  


To enable layers of hardware-based security across all devices, ARM is expanding and deepening its security technology with the announcement that ARM TrustZone technology will be included in new ARMv8-M microcontrollers, and that TrustZone CryptoCell security subsystems will be available to work with any ARM processor:


TrustZone technology is now available for microcontrollers, as a security extension for the ARMv8-M architecture. It brings many of the familiar concepts, such as a secure and a normal world, a system-wide approach extending beyond the processor, and secure interrupts. Since microcontrollers often require deterministic interrupt responses and fast context switching, hardware optimizations have been added to make switching between the two worlds quick and energy efficient. TrustZone for ARMv8-M expands integrated hardware security to low-cost, resource-constrained Internet of Things devices.



TrustZone for ARMv8-M brings familiar security architecture to microcontrollers.


ARM TrustZone CryptoCell is a family of security processors that provides a security sub-system and trust anchor. It provides a hardware-based multi-layer approach to protect the most valuable assets and acts as a co-processor, speeding up complex algorithms. In a typical system CryptoCell manages keys and critical processes such as secure boot. As the product name suggests, this family is derived from the recent acquisition of Sansa Security. TrustZone CryptoCell can be used both on applications processors and on more resource-constrained microcontrollers.



TrustZone CryptoCell acts as a security subsystem and root of trust



Technology Model


The picture below shows multiple layers of hardware based security for an applications processor, including the new TrustZone CryptoCell subsystem providing enhanced security functions close to the root of trust.

The chain of trust starts with some immutable hardware, e.g. hardware unique keys, ROM code and secure hardware resources. TrustZone CryptoCell interfaces to this root of trust, performs secure boot and provides a set of trusted functions such as crypto and key management. From there:

  1. The authenticated trusted boot starts at the highest level of privilege – Secure EL3. This includes setting up trusted peripherals and establishing a secure runtime (called BL3-1 in the ARM Trusted Firmware implementation).
  2. The Trusted OS is started, which establishes the trusted services.
  3. Normal world boot is started. If a hypervisor is present at EL2, it might be integrity checked by the TrustZone based TEE. The hypervisor might have multiple VMs for separating large chunks of code.
  4. Next, the guest OS boots – this might also be integrity checked by the TrustZone based TEE – and finally apps are enabled.



A well designed chip or platform can use this chain of trust to be “secure by default” and can connect to cloud based services via encrypted links using standard internet protocols (such as TLS – Transport Layer Security) that consumers will be familiar with from the padlock symbol used to secure their online banking. With these low level hardware and software security features in place the system becomes trustable by higher level services.



The Outlook:

We have now expanded and deepened TrustZone technology to cover all ARM based platforms. We have taken years of experience providing the security foundations for mobile and brought it to the smallest platforms for future IoT devices. Consumers, service providers, OEMs and silicon vendors will benefit from these technology advances through devices and services they can trust. We hope that you can build on these security foundations to enable a new era of trustworthy IoT devices.

The amount of data we generate is growing exponentially. Gartner has predicted that in 2015 global mobile data traffic will total 52 million terabytes, an increase of 59 percent on 2014. It’s a staggering number that is driven not just by the continuation of the mobile revolution, but by the next wave of connected devices all around us, constantly recording, analyzing and sending data across the network. All of these connected devices mean that the network has grown phenomenally fast.



However, a network is only as strong as its weakest access point. We have seen many high-profile examples this year of how connected devices can be hacked. WIRED had a standout example, where two people gained control of a Jeep Grand Cherokee remotely through the internet connection port, and were able to do whatever they wanted with the car, including steering and engine control. It shows that the integrity of an entire system can be compromised if a hacker or someone with malevolent intent gains control of any access point. All of this adds up to a situation where, for the next generation of connected devices to be successful, they need to have an increased level of security from a system perspective.





The most likely points of attack in a system are the smallest microcontrollers that gather and process data at the endpoint, due to their wide proliferation. They are also by far the most plentiful type of connected device: there are billions of them in the world, many of which use the ARM® architecture. ARMv8-M is the latest architectural specification from ARM for Cortex-M processors, and it brings significant improvements in security provisions, lower latency and increased scalability.


Alongside ARMv8-M is a new AMBA® specification, AMBA 5 AHB5, which is an open interface protocol for embedded SoCs. It is an extension of the previous generation AHB and AHB-Lite AMBA specifications for embedded devices, and it is available to download for free under licence.


The integration of TrustZone™ technology with both architectural specifications means that ARM’s security solution is now available for embedded designers across the entire system, fortifying the security of microcontrollers and embedded SoCs. Security at a hardware level makes it easier to ensure the safety of our data.



Together, ARMv8-M and AHB5 enable enhanced connected, intelligent and secure devices, extending the market for embedded devices. Some of the key areas of improvement in the architectural spec are security provisions, lower latency and increased scalability.




AMBA 5 AHB5 Extends Security to the System


The AHB5 specification extends security from the processor to the entire system. AHB5 complements the ARMv8-M architecture to extend the TrustZone security foundation from the processor to the system, enabling trust within an SoC. Together, ARMv8-M and AHB5 offer designers a standard on which they can create secure systems through the provision of secure and non-secure transactions.




AMBA 5 AHB5 Brings New Features



Extended memory types: AHB5 has additional memory types to support more complex systems. The AHB-Lite spec is the most widely-used open interface protocol for low-latency embedded designs; AHB5 enhances and extends this specification to address the security needs of the next generation of embedded SoCs. It enhances support for more complex systems, as well as easing integration of Cortex®-A and Cortex-M based systems in an SoC.


Secure transfers: Borrowing from the TrustZone methodology, the interface indicates whether a transaction is considered Secure or Non-Secure based on the source and identification protocols. Secure transactions can only be generated by Secure software or, in cases of testing, an authorised debugger. The integration of software into the specification at the grassroots, architectural level means that it is far easier to build a system that is tightly secure.


Exclusive transfers: Support semaphore-type operations.



Updated features

Multiple slave select: Single slave interface provides multiple logical interfaces and offers area efficiency.

Single-copy and multi-copy atomicity: Guarantees that writes to the same location are observed in the same order by all agents, enabling scaling to multiple cores.

User signalling: Allows for user extensions and consistency with the AXI specification.




In order to get the whole system to work, the hardware system architecture requires:

  • An on-chip bus protocol and bus infrastructure components that support a sideband signal indicating the transfer type (Secure or Non-secure).
  • Various bus infrastructure components to support partitioning of memory spaces in memory components, and to block non-secure accesses to secure memories.
  • Optional peripheral access management to decide which peripherals are accessible by secure transfers only and which are accessible by non-secure transfers. In some systems such security management could be hard-wired.




Embedded designers looking to develop on the ARMv8-M architecture already have some of the important parts of the SoC design puzzle solved. AMBA 5 AHB5 is a new interface specification that is available to download now for free. It builds upon the most widely used open protocol for embedded designs and enhances the scalability and security of the SoCs that will power the connected technology all around us.


What do you think about the latest announcement from ARM? What impact do you think it will have on the future of system design?

If you are coming to Techcon there's a lot of activity around ARM TrustZone technology.

There will be 3 TrustZone demos on Wed and Thursday – that's strange, there's normally only one on the ARM stand.

If you want to find out why, you can come along to these tech talks:

1.  I'm doing a Security update / overview at 13:30 Ballroom H

2. Asaf Shen is doing a presentation on Wed at 10:30 in Ballroom F

3. Simon Crake is doing a talk on Wed at 15:30 in Ballroom E


I'll try and do a longer blog later today.

Hope some of you can make it!

A report from Anandtech, "HiSilicon Announces New Kirin 950 SoC", introduces the Kirin 950.

Semihalf is happy to present the first blog post in a planned series covering interesting facts about porting FreeBSD to the ARMv8 architecture.


We're glad to announce that the Cavium ThunderX system has become the first ARMv8 hardware that runs FreeBSD.



Motivations and goals

FreeBSD is undoubtedly the most popular BSD operating system available.


Unlike Linux, it is released under a permissive BSD license and as a complete system distribution (this includes not only kernel but also base root file system and development tools).


One of the main areas of FreeBSD deployment is the server market. It is therefore important for the FreeBSD community (developers and users) to keep up with the growing interest in ARM-based servers. This idea was the motivation for Semihalf to pick up the gauntlet and bring FreeBSD to one of the most exciting ARMv8 platforms out there.


The newly introduced support was based on initial foundational work submitted by Andrew Turner and Robin Randhawa, with emulation as the primary target, and is joint work by the Semihalf team, Andrew Turner, ARM Ltd., Cavium and The FreeBSD Foundation.


About the hardware platform

The Cavium ThunderX is currently the most advanced implementation of the ARMv8 architecture. A single chip incorporates all features that are crucial for modern server applications:

  • SMP scalability up to 48 cores per socket (2-socket configuration gives as many as 96 cores)
  • fast DDR3 ECC memory controllers with up to 128GB per socket
  • top-performing network interfaces, configurable to support 1/10/20/40GbE over fiber or copper links
  • variety of fast IO interfaces: PCIe 3.0, SATA 3.0


Semihalf contribution

Semihalf's focus and responsibility was making FreeBSD work on ThunderX, the best performing chip in the FreeBSD/ARM64 world. From the beginning, our goal was to create user-accessible support that could be taken from the FreeBSD-HEAD branch and used on an actual ARM server. The work done includes:


  • Stabilizing the machine-dependent kernel parts to work on actual hardware. Previously, all development was done in virtualized environments such as QEMU or ARM Fast Models. Interacting with the hardware allowed us to find a full variety of very nasty bugs within the ARMv8 base system.
  • Providing support for “extreme” SMP (Symmetric Multi-Processing) on ARM FreeBSD. ThunderX is the first machine that offers up to 96 CPU cores to the operating system. Previous ARM devices did not exceed 8 cores, so pushing it to its limit turned out to be a very interesting task.
  • Implementing or enhancing existing drivers for basic subsystems. The most interesting are:
    • GICv3 and ITS (Interrupt Translation Service) – providing FreeBSD with support for a completely new approach to interrupt controllers and message-signalled interrupts.
    • PCIe – support for Cavium's implementation of the PCIe controller.
    • Virtualized Networking Interface (VNIC).


The platform offers ultra-fast, virtualized networking interfaces (1Gbps, 10Gbps, 20Gbps, 40Gbps) with a rich set of functionality, including SR-IOV with up to 128 virtual functions.


Future of FreeBSD on ARMv8

The ARMv8 architecture and the ThunderX system are intended to become a Tier-1 platform for FreeBSD. This means that FreeBSD will maintain ongoing support for ARM64 and will provide application packages, etc. in the manner known from the other Tier-1 platforms (i386, amd64).


All integrations target FreeBSD 11-STABLE, which is going to be released in 2016. By that time all work done by Semihalf will be fully integrated into the FreeBSD tree.


About Semihalf

Semihalf creates software for advanced solutions in the areas of platform infrastructure (operating systems, bootloaders), virtualization, networking and storage.  We make software tightly coupled with the underlying hardware to achieve maximum system capacity.


Technologies developed by Semihalf power a wide range of products, from consumer electronics to cloud data center elements and carrier-grade networking gear.


The team

Zbigniew Bodek

Dominik Ermel

Wojciech Macek

Michał Stanek

When we look at what’s happening in consumer electronic devices, there is a clear evolution path. The smartphone has become the central computing device for most people and it is now being augmented by wearable devices. At the higher end, tablets continue to replace laptop purchases, and the emergence of a premium tablet class and convertibles makes the proposition even more compelling. Across all form factors there is a common thread: devices have responded to the consumer need to be constantly on, adaptable to their surroundings, and comfortable with multi-tasking across many applications.



The increase in pixel count on our screens and in our visual content means that our digital lives are in ultra high definition, and moving to 4K. The implication is that system bandwidth must stay at least one step ahead in order to avoid bottlenecks, as users have grown accustomed to seamless computing where every command is carried out instantaneously. In reality, it doesn’t matter how powerful a CPU or GPU becomes: if there is not enough memory bandwidth in the system, performance will feel sluggish. It is clear then that next-generation SoCs require a holistic view to optimize performance.




The answer lies in the system

As the mobile market is maturing, SoC designers have come to realise that one of the key routes to optimizing performance is through the system. It is through system performance that the next generation of SoCs will differentiate.


At ARM we have recognized that and have been focusing on developing System IP that gets the most out of silicon. Our IP – Cortex® processors, Mali™ GPUs and CoreLink™ System IP – is designed, validated and optimized together to ensure the best performance per watt. We work closely with our partners to ensure the mobile devices of tomorrow can deliver experiences that continue to amaze consumers.





User experience is dependent on these IP blocks

System-optimized IP enables greater SoC differentiation

  • Reduced CPU latency
  • Efficient utilization of interconnect and memory bandwidth
  • Quality of Service (QoS) guarantees
  • Faster design cycle




CoreLink System IP Drives Innovation for Mobile Devices

ARM has launched two new System IP products that provide the foundation for next-generation SoCs, enabling new computing possibilities through increased system performance, improved power savings and better system integration. The CoreLink CCI-550 Cache Coherent Interconnect is a best-in-class AMBA interconnect for ARMv8-A systems, and the CoreLink DMC-500 Dynamic Memory Controller is a low-power, performance-optimized mobile memory controller with LPDDR4/3 support.

CoreLink CCI-550

CoreLink CCI-550 is the latest product in the market-leading ARM CoreLink Cache Coherent Interconnect family. Previous generation interconnects have been used in many millions of devices across multiple market segments, from mobile applications to smart TVs, automotive infotainment and cost-effective networking.

CoreLink CCI-550 delivers improvements in three key areas:

More bandwidth, less latency:

  • Greater than 60% more peak system bandwidth compared to CCI-500. CoreLink CCI-550 is built and optimized for applications that require high bandwidth throughput to provide fluid, responsive applications and user interfaces, acceleration for apps including video and photo editing, and improved multi-tasking and multi-windowing.
  • QoS enhancements can reduce latency within the CCI by up to 20%.
  • 2x snoop hit bandwidth that extends efficiency across the system.

Advanced power efficiency:

  • Enables a fully coherent GPU, which simplifies software and increases performance. Hardware coherency enables shared virtual memory and removes the need to copy data and the time-consuming software-managed cache maintenance.
  • An integrated snoop filter can save hundreds of milliwatts of memory system power.

Extensive configurability:

  • 1 to 6 ACE ports, so it can be optimized for a wide range of applications, from premium tablets at the high end down to smaller or cost-sensitive designs.
  • Memory interfaces scalable from 1 to 6, supporting high-performance tablet requirements with 4K internal and external screens, and bandwidths exceeding 50GB/s.


CoreLink CCI-550 enables a fully coherent GPU. Fully coherent memory systems can unleash the heterogeneous computing power of the CPU and GPU simultaneously. It’s an exciting new area for mobile computing and holds the potential for many applications that would benefit enormously from the extra processing power a GPU can provide.


Example use cases of a fully coherent GPU

CoreLink DMC-500

The CoreLink DMC-500 offers the lowest latency, supporting LPDDR4/3 memories at up to LPDDR4-4267 transfer speeds. The CoreLink DMC-500, along with the CoreLink CCI-550, provides the best end-to-end performance from CPU to memory at the lowest power, while ensuring that important system-level functions such as coherency, QoS and TrustZone security are fully supported. It offers leading performance in the following ways:

Highly Optimized, efficient memory access

  • 27% increase in memory bandwidth utilization
  • Latest LPDDR4/3 memory support up to LPDDR4-4267
  • Low power design and operating modes


End to end quality of service

  • 25% reduction in average CPU latency
  • Complete solution with CoreLink interconnect


Integrated solution

  • TrustZone™ security and media protection for DRM content
  • Supports industry standard DFI 4.0 PHY interface
  • Integrated memory scheduling and memory controller enables highest utilisation


Increasing memory bandwidth and reducing latency will bring features like immersive mobile gaming, 4K content and screen display, and 120fps video playback into the reckoning for next generation devices.
CoreLink DMC-500 extends the performance and low power leadership of ARM systems for the advanced LPDDR4/3 memories:

  • Single DFI 4.0 memory interface supporting x16 LPDDR4 up to LPDDR4-4267 and x32 LPDDR3 up to LPDDR3-2133, with dual-DMC channel support for x32 LPDDR4
  • Support for clock gating, dynamic frequency change and memory low-power modes for optimized power consumption




System-Optimized IP Enables Seamless Computing


QoS is a way of prioritizing traffic dynamically across system masters. Masters can be categorized into three broad classes:

  1. Latency-sensitive masters - benefit most from lowest latency response from memory, for example CPU
  2. Greedy masters (or bulk transfer) - capable of submitting many requests to memory but no firm deadline
  3. Real-time masters - firm deadline by which response must be received, for example display controller



CoreLink CCI-550 and CoreLink DMC-500 have been designed with system-wide QoS that has been validated to work with CoreLink NIC-450, Cortex-A53 and Cortex-A72 processors and the Mali GPU.

QoS gives the system the flexibility to prioritize tasks in a way that optimizes the performance of the Cortex processors and Mali GPU. In benchmark tests, QoS enhancements have shown up to a 25% CPU latency reduction across the chip, which directly translates to faster performance.






The TrustZone Secure Media Path provides end-to-end protection for Ultra-HD content from Mali to memory. Together with the CoreLink DMC-500, which contains an integrated TrustZone controller, it ensures minimal latency, making the possibility of watching Netflix in 4K quality on your mobile device a distinct reality. And who could say no to that?



Users have come to expect more from their mobile devices as these have become their primary computing devices. The challenge for the semiconductor industry is to keep up with the demand, and there is a realization that system performance is what really counts.


CoreLink System IP will be fundamental in the performance and functionality increases of next-generation SoCs for mobile devices, and I look forward to seeing the next jump in mobile user experience enabled by these.




Further information:

Press Release

CoreLink CCI-550

CoreLink DMC-500

Extended System Coherency - Part 1 - Cache Coherency Fundamentals

GPU Coherency uni paper

Hi all, I can finally share with you some of the interesting work we have been doing around Cortex-M0 DesignStart and FPGA. ARM has re-launched the Cortex-M DesignStart program, making it more easily accessible and even more affordable. We have worked with our colleagues in the CPU design team to improve our FPGA support for Cortex-M0 substantially. Using our existing Cortex-M Prototyping System (MPS2) and an interesting feature from Altera called partial reconfiguration (PR), we can provide the user with a fully debuggable Cortex-M0 CPU alongside a user area which you can edit and modify with your own IP, without a CPU licence. Using the PR feature we have created a 'CPU partition' with full debug features, to which you can connect your own IP in a 'user partition'. We have provided CMSDK peripherals and an example design in the user area from which you can start. The 'CPU partition' includes debug but is fixed and encrypted.




You can use the Altera Quartus tool chain to resynthesize your design and provide your own customisable target. The platform comes with mbed drivers for all the peripherals, such as SPI, GPIO, UART etc. We have created an application note describing how this all works. We have also created an FPGA test bench, so you can simulate your design at the target level (e.g. FPGA). The simulation test bench requires you to download the obfuscated Cortex-M0 code from the DesignStart page above, but it is quite useful for debugging any of your IP issues ahead of running software at speed.




In addition to this, we have also launched a new version of MPS2 called MPS2+, which has double the FPGA capacity of the MPS2 board. Other than the increase in FPGA capacity the products are the same, so designs targeted at MPS2 now also have MPS2+ support. Best of all, the price remains the same: double the FPGA capacity at no extra cost. Check out the MPS2 page for further information on the platform.

Moving beyond the hype of the potential for the IoT, there are some real engineering challenges that must be overcome in order to satisfy the intoxicating market predictions of IoT uptake across industrial and commercial markets. We are addressing some of these in the IoT track at ARM TechCon next month. ABI Research estimates that the volume of data captured by IoT-connected devices exceeded 200 exabytes in 2014, and its annual total is forecast to grow seven-fold by the decade’s end, surpassing 1,600 exabytes—or 1.6 zettabytes—in 2020. So, how do we successfully create the devices that will handle all of this data?


I recently spent some time asking Qian Yu, technical marketing manager at ARM, Christian Légaré, CTO and EVP of Micrium, and Mike Anderson, CTO and Chief Scientist for The PTR Group, Inc., about the challenges of IoT design in advance of ARM TechCon (Légaré and Anderson are members of the ARM TechCon Technical Program Committee). You can see their advice on how to pick a platform, how to balance power with functionality, and some of the latest products and techniques that make it all possible in this article.

Bosch Sensortec has just unveiled a compact 9-axis motion sensor, which incorporates an accelerometer, a gyroscope and a magnetometer along with an Atmel | SMART SAM D20 ARM Cortex-M0+ core.


The BMF055 is the perfect match for those looking to develop advanced application-specific sensor fusion algorithms, add sophisticated motion sensing capabilities, and replace multiple discrete components with a single package. Boasting a tiny 5.2mm x 3.8mm x 1.1mm footprint, the latest board from Bosch Sensortec’s Application-Specific Sensor Node (ASSN) family easily integrates with a wide range of projects from robotics and drones, to gaming and navigation, to augmented reality and human interface devices for the IoT — all of which require a customized SiP solution.



On top of that, Bosch Sensortec provides an additional SDK featuring a precompiled BSX Lite fusion library with integration guidelines and API source files for individual sensors, as well as example projects as a plugin for Atmel Studio. Intrigued? Head over to BMF055’s page here.
