
Today at Hot Chips in Cupertino, I had the opportunity to present the latest update to our ARMv8-A architecture, known as the Scalable Vector Extension or SVE. Before going into the technical details, key points about ARMv8-A SVE are:

 

  • ARM is significantly extending the vector processing capabilities associated with AArch64 (64-bit) execution in the ARM architecture, now and into the future, enabling implementation choices for vector lengths that scale from 128 to 2048 bits.

  • High Performance Scientific Compute provides an excellent focus for the introduction of this technology and its associated ecosystem development.

  • SVE features will enable advanced vectorizing compilers to extract more fine-grain parallelism from existing code and so reduce software deployment effort.

 

I’ll first provide some historical context. ARMv7 Advanced SIMD (aka the ARM NEON instructions) is ~12 years old, a technology originally intended to accelerate media processing tasks on the main processor. It operated on well-conditioned data in memory with fixed-point and single-precision floating-point elements in sixteen 128-bit vector registers.  With the move to AArch64, NEON gained full IEEE double-precision float, 64-bit integer operations, and grew the register file to thirty-two 128-bit vector registers. These evolutionary changes made NEON a better compiler target for general-purpose compute.  SVE is a complementary extension that does not replace NEON, and was developed specifically for vectorization of HPC scientific workloads.

 

Immense amounts of data are being collected today in areas such as meteorology, geology, astronomy, quantum physics, fluid dynamics, and pharmaceutical research.  Exascale computing (the execution of a billion billion floating point operations, or exaFLOPs, per second) is the target that many HPC systems aspire to over the next 5-10 years. In addition, advances in data analytics and areas such as computer vision and machine learning are already driving demand for greater parallelization of program execution, now and into the future.

 

Over the years, considerable research has gone into determining how best to extract more data level parallelism from general-purpose programming languages such as C, C++ and Fortran. This has resulted in the inclusion of vectorization features such as gather load & scatter store, per-lane predication, and of course longer vectors.

 

A key choice to make is the most appropriate vector length, where many factors may influence the decision:

 

  • Current implementation technology and associated power, performance and area tradeoffs.

  • The specific application program characteristics.

  • The market, which is HPC today; in common with general trends in computer architecture evolution, a growing need for longer vectors is expected in other markets in the future.

 

Rather than specifying a specific vector length, SVE allows CPU designers to choose the most appropriate vector length for their application and market, from 128 bits up to 2048 bits per vector register.  SVE also supports a vector-length agnostic (VLA) programming model that can adapt to the available vector length.  Adoption of the VLA paradigm allows you to compile or hand-code your program for SVE once, and then run it at different implementation performance points, while avoiding the need to recompile or rewrite it when longer vectors appear in the future.  This reduces deployment costs over the lifetime of the architecture; a program just works and executes wider and faster.
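To make the VLA idea concrete, here is a minimal sketch of a DAXPY loop written with SVE C intrinsics; the intrinsic names follow the SVE ACLE and are shown purely for illustration, not as part of this announcement. Note that nothing in the code names a vector length: svcntd() returns however many doubles one hardware vector holds, and the predicate from svwhilelt handles the loop tail.

#include <stdint.h>
#include <arm_sve.h>

/* y[i] += a * x[i], vector-length agnostic: the same code runs unchanged on
   any SVE implementation from 128 to 2048 bits. */
void daxpy(double *y, const double *x, double a, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntd()) {     /* svcntd(): doubles per vector */
        svbool_t pg = svwhilelt_b64(i, n);          /* predicate: lanes still in range */
        svfloat64_t vx = svld1(pg, &x[i]);          /* predicated loads */
        svfloat64_t vy = svld1(pg, &y[i]);
        vy = svmla_m(pg, vy, vx, svdup_f64(a));     /* vy += vx * a, active lanes only */
        svst1(pg, &y[i], vy);                       /* predicated store */
    }
}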

 

Scientific workloads, mentioned earlier, have traditionally been carefully written to exploit as much data-level parallelism as possible with careful use of OpenMP pragmas and other source code annotations.  It’s therefore relatively straightforward for a compiler to vectorize such code and make good use of a wider vector unit. Supercomputers are also built with the wide, high-bandwidth memory systems necessary to feed a longer vector unit.

 

However, while HPC is a natural fit for SVE's longer vectors, SVE also offers an opportunity to improve vectorizing compilers, which will be of general benefit over the longer term as other systems scale to support increased data-level parallelism.

 

It is worth noting at this point that Amdahl’s law tells us the theoretical limit of a task’s speedup is governed by the amount of unparallelizable code. If you succeed in vectorizing 10% of your execution and make that code run 4 times faster (e.g. a 256-bit vector allows 4x64b parallel operations), then you've reduced 1000 cycles down to 925 cycles, providing a limited speedup for the power and area cost of the extra gates. Even if you could vectorize 50% of your execution infinitely (unlikely!) you've still only doubled the overall performance. You need to be able to vectorize much more of your program to realize the potential gains from longer vectors.
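Formally, if a fraction p of execution is vectorized and sped up by a factor s, Amdahl's law gives

\[ \text{speedup} = \frac{1}{(1-p) + p/s} \]

so with p = 0.10 and s = 4 the overall speedup is 1/(0.90 + 0.025) ≈ 1.08, which is the 1000-to-925-cycle example above; with p = 0.5 and s → ∞ the limit is 1/(1 − 0.5) = 2.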

 

So SVE also introduces novel features that begin to tackle some of the barriers to compiler vectorization. The general philosophy of SVE is to make it easier for a compiler to opportunistically vectorize code where it would not normally be possible or cost effective to do so.

 

What are the new features and the benefits of SVE compared to NEON?

 

  • Scalable vector length (VL): increased parallelism while allowing implementation choice of VL.
  • VL-agnostic (VLA) programming: supports a write-once, run-anywhere paradigm for scalable vector code.
  • Gather-load and scatter-store: enables vectorization of complex data structures with non-linear access patterns.
  • Per-lane predication: enables vectorization of complex, nested control code containing side effects, and avoids separate loop heads and tails (particularly for VLA).
  • Predicate-driven loop control and management: reduces vectorization overhead relative to scalar code.
  • Vector partitioning and software-managed speculation: permits vectorization of uncounted loops with data-dependent exits.
  • Extended integer and floating-point horizontal reductions: allows vectorization of more types of reducible loop-carried dependencies.
  • Scalarized intra-vector sub-loops: supports vectorization of loops containing complex loop-carried dependencies.
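As a concrete, invented example of the kind of loop targeted by the vector partitioning and software-managed speculation entry, consider an uncounted loop with a data-dependent exit:

/* The trip count is unknown and reading beyond the terminating element may
   fault, so a fixed-width SIMD unit cannot safely load whole vectors ahead.
   SVE's first-faulting loads and predicate-driven vector partitioning allow
   a compiler to vectorize this loop anyway. */
long count_until_zero(const int *p)
{
    long n = 0;
    while (p[n] != 0)
        n++;
    return n;
}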

 

SVE is targeted at the A64 instruction set only, as a performance enhancement associated with 64-bit computing (known as AArch64 execution in the ARM architecture). A64 is a fixed-length instruction set, where all instructions are encoded in 32 bits. Currently 75% of the A64 encoding space is already allocated, making it a precious resource.  SVE occupies just a quarter of the remaining 25%, in other words one sixteenth of the total A64 encoding space.

 

The variable length aspect of SVE is managed through predication, meaning that it does not require any encoding space. Care was taken with respect to predicated execution to constrain that aspect of the encoding space.  Load and store instructions are assigned half of the allocated SVE instruction space, limited by careful consideration of addressing modes. Nearly a quarter of this space remains unallocated and available for future expansion.

 

In summary, SVE opens a new chapter for the ARM architecture in terms of the scale and opportunity for increasing levels of vector processing on ARM processor cores. It is early days for SVE tools and software, and it will take time for SVE compilers and the rest of the SVE software ecosystem to mature. HPC is the current focus and catalyst for this compiler work, and creates development momentum in areas such as Linux distributions and optimized libraries for SVE, as well as in ARM and third party tools and software.

 

We are already engaging with key members of the ARM partnership, and will now broaden that engagement across the open-source community and wider ARM ecosystem to support development of SVE and the HPC market, enabling a path to efficient Exascale computing.

 

Stay tuned for more information

 

Following on from the announcement and the details provided, initial engagement with the open-source community will start with the upstreaming and review of tools support and associated standards.  General specification availability is expected in late 2016/early 2017.

 

Nigel Stephens is Lead ISA Architect and ARM Fellow

Today we have exciting news: ARM and Intel Custom Foundry have announced an agreement to accelerate the development and implementation of ARM SoCs on Intel’s 10nm process. Specifically, we are making ARM’s Artisan® Physical IP available on the process as part of an ongoing collaboration.

 

I’m excited about our collaboration with Intel Custom Foundry for several reasons including:

  • It benefits our partners by expanding the ARM ecosystem to offer more manufacturing choices for premium mobile and consumer SoCs.
  • Intel Custom Foundry will give its customers access to world-class physical IP and ARM implementation solutions.
  • All the major foundries now offer Artisan platforms, further confirming it as the industry standard for physical IP.

 

Today’s announcement represents what we expect to be a long-term, mutually beneficial partnership with Intel Custom Foundry.

 

One of the strengths and differentiators of the Artisan platform is the availability of ARM core-optimized IP—what we call ARM POP™ technology. The value of POP technology for an ARM core on the Intel 10nm process is tremendous, as it will allow for quicker knowledge transfer, enabling customers to lower their risk in implementing the most advanced ARM cores on Intel’s leading-edge process technology. Additionally, POP technology enables silicon partners to accelerate the implementation and tape-outs of their ARM-based designs. The initial POP IP will be for two future advanced ARM Cortex-A processor cores designed for mobile computing applications in either ARM big.LITTLE™ or stand-alone configurations.

 

Today at the Intel Developer Forum (IDF), I had the pleasure of joining Intel Senior Fellow Mark Bohr and Intel Custom Foundry Vice President Zane Ball for their Technical Insights session to announce our collaboration.  We discussed how the partnership will accelerate design enablement for future devices in the premium mobile market, including smartphones and tablets. Read more about Zane’s perspective on our collaboration.

 

Ecosystem enablement

You probably glanced at the headline and thought “ARM and Intel collaborating…what?” Despite press stories, Intel and ARM have worked together for years to help enable the ecosystem, and this is just the latest milestone in that long-standing relationship. I see it as a natural evolution of the design ecosystem: ARM is a leader in processor and physical design, and Intel Custom Foundry is a leading integrated device manufacturer. This combination is a win-win for customers.  It reinforces a tenet held throughout ARM’s 25-year history: to continuously enable choice and innovation inside the ARM ecosystem.

 

This agreement provides access to another key manufacturing source and expands the EDA and IP ecosystem to ensure interoperability and a shorter on-ramp for early leading-edge process technology.

 

I’ve enjoyed broad experience in this industry, working in semiconductors, EDA and now IP. I love the relentless competition but I also am wowed by moments of cooperation that redefine the industry landscape. This agreement is one example of that and will deliver immense value to the design ecosystem and ultimately to our partners. ARM is committed to Intel’s success as a world-class custom foundry at 10nm. We stand behind our mutual customers when they make that choice.

 

Let me know your thoughts in the comments section below!

 


Power management is important, and it has become increasingly complex. We have recently created an application note; see the details below. Hopefully you will find it useful.

Purpose

  • Provides high-level considerations for power management of a big.LITTLE system and helps you avoid some potential issues in your big.LITTLE design.

Intended Audience

  • It is written for hardware System on Chip (SoC) designers implementing power-down and power-up sequences for ARM processors.
  • It assumes that you have SoC design experience and are familiar with ARM products.

Scope

This application note focuses on the following processors and highlights important issues when powering up or powering down processor cores and clusters on an SoC.

  • Cortex®-A7.
  • Cortex®-A15.
  • Cortex®-A17.
  • Cortex®-A53.
  • Cortex®-A57.
  • Cortex®-A72.
  • Cortex®-A73.

Outline

This application note is organized into the following chapters:

  • Chapter 1 Introduction

Read this chapter for information about the purpose of the application note.

  • Chapter 2 Power-down and power-up considerations

Read this chapter for high-level considerations for power-down and power-up.

  • Chapter 3 Potential SoC integration issues

Read this chapter for potential issues when implementing power management for processor cores or clusters on a typical SoC.

  • Chapter 4 Hardware considerations

Read this chapter for general advice from the hardware perspective when implementing power-down and power-up sequences for big.LITTLE systems.
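As a flavour of what Chapters 2 and 4 cover, here is a minimal illustrative sketch of the commonly documented bare-metal power-down steps for a Cortex-A core. The code is not taken from the application note, and the helper functions are assumptions to be implemented per the relevant processor TRM.

/* Illustrative only: a typical core power-down sequence. */
extern void disable_dcache(void);             /* clear SCTLR.C                   */
extern void clean_invalidate_l1_dcache(void); /* by set/way, flush dirty lines   */
extern void exit_coherency(void);             /* clear CPUECTLR.SMPEN            */

void core_power_down(void)
{
    disable_dcache();               /* stop new cache lines being allocated    */
    clean_invalidate_l1_dcache();   /* push dirty data out to L2/DRAM          */
    exit_coherency();               /* leave the cluster's coherency domain    */
    __asm__ volatile("dsb sy");     /* ensure all maintenance has completed    */
    __asm__ volatile("wfi");        /* quiesce; power controller removes power */
}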

Your feedback

If you have any feedback about this document, please feel free to contact me. My email address is Roy.Hu@arm.com.

See the attachment for details about the application note. Thanks.

SemiWiki recently published a book on FPGA-based prototyping titled “PROTOTYPICAL: The Emergence of FPGA-Based Prototyping for SoC Design.” Among other things, the book explores ARM’s role in FPGA prototyping technology.  Below is an excerpt from the book.  If you want to read the entire book, you can download it from the S2C web site at http://www.s2cinc.com/resource-library/prototyping-book

 

“Developing for ARM Architecture

Since ARM introduced its Cortex strategy, with A cores for application processors, R cores for real-time processors, and M cores for microcontrollers, designers have been able to choose price/performance points – and migrate software between them. How do designers, who are often doing co-validation of SoC designs with production software, prototype with these cores?

 

Some teams elect to use ARM’s hard macro IP offering, with optimized implementations of cores. ARM has a mixed prototyping solution with their CoreTile Express and LogicTile Express products. CoreTile Express versions are available for the Cortex-A5, Cortex-A7, Cortex-A9, and Cortex-A15 MPCore processors, based on a dedicated chip with the hardened core and test features. The LogicTile Express comes in versions with a single Xilinx Virtex-5, dual Virtex-6, or a single Virtex-7 FPGA, allowing loose coupling of peripheral IP.

 

Others try to attack the challenge entirely in software. Cycle-accurate and instruction-accurate models of ARM IP exist, which can be run in a simulator testbench along with other IP. With growing designs comes growing simulation complexity, and with complexity come drastic increases in execution time or required compute resources. Simulation supports test vectors well, but is not very good at supporting production software testing – a large operating system can take practically forever to boot in a simulated environment.

Full-scale hardware emulation has the advantage of accommodating very large designs, but at substantial cost. ARM has increased its large design prototyping efforts with the Juno SoC for ARMv8-A, betting on enabling designers with a production software-ready environment with a relatively inexpensive development board.

 

However, as we have seen, SoC design is rarely about just the processor core; other IP must be integrated and verified. Without a complete pass at the full chip design with the actual software, too much is left to chance in committing to silicon. While useful, these other platforms do not provide a cost-effective end-to-end solution for development and debug with distributed teams. Exploration capability in a prototyping environment is also extremely valuable: changing out design elements in a search for better performance, lower power consumption, third-party IP evaluation, or other tradeoffs.

 

The traditional knock on FPGA-based prototyping has been a lack of capacity and the hazards of partitioning, which introduces uncertainty and potential faults. With bigger FPGAs and synthesizable RTL versions of ARM core IP, many of the ARM core offerings now fit in a single FPGA without partitioning. Larger members of the ARM Cortex-A core family have been successfully partitioned across several large FPGAs without extensive effort and adverse timing effects, running at speeds significantly higher than simulation but without the cost of full-scale hardware emulation.

 

A hybrid solution has emerged in programmable SoCs, typified by the Xilinx Zynq family. The Zynq UltraScale+ MPSoC has a quad-core ARM Cortex-A53 with a dual-core ARM Cortex-R5 and an ARM Mali-400MP GPU, plus a large complement of programmable logic and a full suite of I/O. If that is a similar configuration to the payload of the SoC under design, it may be extremely useful to jumpstart efforts and add peripheral IP as needed. If not, mimicking the target SoC design may be difficult.

 

True FPGA-based prototyping platforms offer a combination of flexibility, allowing any ARM core plus peripheral IP payload, and debug capability. Advanced FPGA synthesis tools provide platform-aware partitioning, automating much of the process, and are able to deal with RTL and packaged IP such as encrypted blocks. Debug features such as deep trace and multi-FPGA visibility and correlation speed the process of finding issues.

 

The latest FPGA-based prototyping technology adds co-simulation, using a chip-level interconnect such as AXI to download and control joint operations between a host-based simulator and the hardware-based logic execution. This considerably increases the speed of a traditional simulation and allows use of a variety of host-based verification tools. Using co-simulation allows faster turnaround and more extensive exploration of designs, with greater certainty in the implementation running in hardware.

 

Integration rollup is also an advantage of scalable FPGA-based prototyping systems. Smaller units can reside on the desk of a software engineer or IP block designer, allowing dedicated and thorough investigation. Larger units can support integration of multiple blocks or the entire SoC design. With the same synthesis, debug, and visualization tools, artifacts are reused from the lower level designs, speeding testing of the integrated solution and shortening the time-to-success.

 

Another consideration in ARM design is not all cores are stock. In many cases, hardware IP is designed using an architectural license, customized to fit specific needs. In these cases, FPGA-based prototyping is ideal to quickly experiment and modify designs, which may undergo many iterations. Turnaround time becomes very important and is a large productivity advantage for FPGA-based prototyping.”

The ISC16 event took place last week in Frankfurt, Germany. ISC stands for International Supercomputing Conference, and while ARM is known for its energy-efficient mobile CPU cores, we are beginning to make some waves in the arena of the world’s largest computers.

 

ARMv8-A brings out the real strength of Fujitsu’s microarchitecture

To kick off the week, our partner Fujitsu unveiled their plan for the next-generation “Post-K” supercomputer to be based on ARMv8-A technology.  It turns out ARM Research has been working hard for several years on a number of technical advantages that will give ARM partners an edge in the HPC market, and Fujitsu has taken note.   At ISC16, both Fujitsu and RIKEN, operator of the “K” computer, currently Japan’s fastest supercomputer, presented their plans to collaborate on the ARM-based Post-K supercomputer.  The significance of this announcement can’t be overstated, as this strategic project is seen as Japan’s stepping stone to the Exascale tier of supercomputing. Exascale requires roughly 10x the computing power of today’s fastest computers, yet must function within similar power envelopes. It is a lofty goal.

 

More will be divulged by ARM on its HPC technology at Hot Chips this August, but in the meantime, here are a few links to recent articles covering the Fujitsu announcement and others relating to ARM in HPC. The Next Platform article does a particularly good job of highlighting why Fujitsu and RIKEN see value in the ARMv8-A architecture and ARM server ecosystem:

 

 

Designed for HPC Applications

The last two articles linked above are interesting in that they seem to imply that ARM is delving into HPC based on “mobile chips”.  This certainly isn’t the case.  ARM and its partners are taking advantage of the architectural flexibility the ARM business model provides them. Fujitsu and others are designing CPUs from the ground up with HPC codes and end-user supercomputer applications fully in mind, while still benefiting from the energy efficiency of the ARMv8-A architecture.  Fujitsu’s own “Post-K” microarchitecture and their collaboration with RIKEN and ARM is a great example of this.  We expect more to come from other ARM partners in the future, so stay tuned.

SAN FRANCISCO--In the decades since the open source software movement emerged, it’s always seemed to pick up momentum, never abating.

This year is no exception as we roll into Red Hat Summit, June 27-30, in San Francisco.

ARM and its ecosystem partners will be at the Moscone Center outlining how server, networking and storage applications are deploying today and how optimized ARM technology platforms provide scalable, power-efficient processing.

For a sneak peek at one of the interesting trends in open source, check out Jeff Underhill’s post on Ceph’s embrace of ARM (Ceph extends tentacles to embrace ARM in Jewel release - Visit us at Red Hat Summit (booth #129) to find out more!).

Then join us in booth #129 with our partners Cavium, Linaro, Penguin, SoftIron, AppliedMicro and Western Digital to get the latest insights on open source software and hardware design.

Don’t miss the Thursday panel (3:30-4:30 p.m.) “Building an ARM ecosystem for the enterprise: Through the thorns to the stars,” moderated by Red Hat’s Jon Masters and featuring Underhill, Yan Fisher (Red Hat), Mark Orvek (Linaro), and Larry Wikelius (Cavium).

 

Related stories:

Ceph extends tentacles to embrace ARM in Jewel release - Visit us at Red Hat Summit (booth #129) to find out more!

The amount of data that consumers are producing is increasing at a phenomenal rate, and shows no signs of slowing down anytime soon. Cisco estimated last year that global mobile data traffic would multiply tenfold between 2014 and 2019, up to a rate of 24.3 exabytes per month. In order to support this continuing evolution, the infrastructure backbone of the cloud needs to stay ahead of the curve.

 

 

Cloud and server infrastructure needs to stay ahead of predicted usage trends

 

This requires large-volume deployments of servers in the “cloud provider” space. Large Internet companies are building out datacenters at an unprecedented scale to manage all of this data. There is an insatiable appetite for more compute, with the caveat that it must be delivered at the highest compute density within given server constraints to minimize Total Cost of Ownership (TCO). Datacenters are replacing servers on a shorter cycle, as well as evaluating new installations more often, because workload demands are constantly changing. There is a huge opportunity for server SoC vendors to innovate, with some aspects being critical to successfully building a server SoC:

 

  • Time-to-market
  • Performance/Watt
  • Higher levels of integration in costly advanced technology nodes

 

 

 

The explosion in computing applications brings opportunity for tailored server SoCs

 

To help more ecosystem partners enter the market, ARM has designed processors and System IP blocks (e.g. CoreLink™ Cache Coherent Network, Memory Controllers) that can meet the performance and reliability expectations of the mainstream server market. This helps our partners to develop SoCs for their target applications, which in turn enables OEMs and datacenter providers to get the right performance within budget. ARM has now taken this a step further in enabling our silicon partners to deliver a mainstream server SoC by developing and delivering a server subsystem.

 

 

What is a Server Subsystem?

 

A server subsystem is a collection of ARM IP (processors and System IP) that has been architected and integrated together, along with all the glue logic necessary to lay down the foundation for a server-class SoC. The subsystem configuration specifically targets mainstream requirements, which cover roughly 80% of the server market. The subsystem allows a partner to go quickly from Register Transfer Level (RTL), the design abstraction used to describe digital circuits, to silicon. Even with ARM delivering the soft IP blocks to our partners, it can still take many months to architect the SoC and integrate all the IP correctly while meeting power and performance targets. In addition, the design then needs to be fully verified and validated prior to tape-out.

The ARM subsystem helps short-circuit this process by delivering verified and validated top-level RTL that already integrates the various ARM IP blocks in a mainstream server configuration. (Find out more about ARM’s system validation process in the whitepaper System Validation at ARM: Enabling our Partners to Build Better Systems.) This can save our partners up to a year of effort. The silicon partner can take our subsystem, then add the necessary input/output logic (e.g. PCIe, SATA, and USB) along with any of their own differentiating IP to complete the SoC design. By providing partners with the subsystem, ARM significantly reduces the effort of integrating, verifying and validating the IP together for this configuration, reducing overall development time and allowing silicon partners to focus resources on differentiation.

 

Accelerating the Path to Server SoCs

 

So, how does this server subsystem help our partners build a competitive server SoC with faster time to market? ARM has architected the server subsystem to provide enough CPU compute to allow partners to efficiently manage the majority of server workloads. The subsystem consists of up to 48 cores: 12 Cortex®-A72 processor clusters, each with four CPU cores, attached to the CoreLink CCN-512 Cache Coherent Network, along with four server-class memory controllers (CoreLink DMC-520). Other ARM System IP is integrated to perform specialized tasks within the subsystem for the kinds of use cases expected: the CoreLink NIC-450 Network Interconnect provides a low-power, low-latency interconnect for the rest of the SoC, serving peripheral inputs such as PCIe, while the CoreLink GIC-500 Generic Interrupt Controller performs the critical tasks of interrupt management, prioritization and routing, supporting virtualization and boosting processor efficiency. The real value of the subsystem lies in the fact that all of the IP has been pre-integrated and pre-validated with ARM engineering “know-how” of our IP to ensure predictable performance with much less engineering resource or time required. By taking a holistic view of system performance, the integration teams were able to make the whole subsystem greater than the sum of its parts.

 

The picture below shows a high-level view of the subsystem.


 

 

So what about Power, Performance, and Area?

 

In addition to the above, ARM provides a system analysis report along with the pre-configured and optimized RTL. The system analysis report gives the silicon partner the data we collected on the performance, power, and area of the subsystem. It includes industry-standard benchmark emulation results such as SPEC CPU2006, STREAM, and LMBench. Based on early analysis, we expect this subsystem to scale to the performance levels needed to win mainstream server deployments in large datacenters.

These benchmarks are key data points that an end customer, buying a hardware platform based on an SoC that leverages the subsystem, uses to decide which platform to buy and deploy in their datacenter. It is critical that our silicon partners have a good understanding of performance expectations well before they have actual silicon they can test. The investment to develop server SoCs is high, and reducing the likelihood of additional spins is key to time-to-market. In addition to the performance results, ARM also analyzes the power draw of the subsystem and includes this in the report. ARM’s physical design team also does preliminary floorplanning and some timing-constraint analysis for the target process technology. In effect, this helps our partners understand die size and cost implications, which ultimately ensures their design will meet customers’ expectations.

 

 

Reference Software and Documentation

 

In addition to giving our partners a head start on hardware design and understanding PPA (Performance, Power, and Area) targets, the subsystem also comes with a reference software package. The subsystem has been built to adhere to industry server standards (e.g. UEFI, ACPI). The reference software includes ARM Trusted Firmware and UEFI source code ported to the subsystem, ACPI tables populated for the subsystem, and any patches needed to run the latest Linux kernel, along with release documentation and a guide on how to use the software. In addition, a Fixed Virtual Platform (FVP) of the subsystem is available. The FVP is a fast, accurate platform built on Fast Models that supports our partners’ software development activities. The software developed for the subsystem is ready to run on the FVP. Delivering this reference software stack along with the optimized RTL allows silicon partners to more rapidly develop the software needed to boot an OS as soon as silicon arrives. On the hardware side, the subsystem also includes a technical reference manual that describes the various pieces of the subsystem in detail, an implementation guide, and an integration manual. All of this documentation is delivered along with the RTL to help our partners quickly understand the subsystem. This is critical in enabling SoC designers to get up to speed fast and devote as much time and resource as possible to differentiating their design through proprietary IP, customized software, or a mixture of both.

 

ARM’s Server Subsystem Provides a Fast Start to SoC Development

 

As I mentioned previously, the ever-increasing data-processing requirements driven by the continued electronics revolution have big implications for datacenters. Mainstream server SoCs are becoming increasingly complex every year. In addition, companies are replacing their server platforms at an unprecedented rate. This requires our silicon partners to deliver more capable SoCs faster. ARM has been enabling our partners with ARM processors and System IP that can be leveraged to deliver server SoCs. The ARM subsystem now takes this enabling activity a step further by giving our partners a fast start to their SoC development. By providing a pre-integrated, pre-verified foundation, it lowers the barriers for the ARM ecosystem to enter the changing server market and develop optimized SoCs for their target applications. For more information, please contact me directly via the comments below or private message, and I’ll make sure to get back to you.

Functional safety for Silicon IP used to be a niche activity, limited to an elite circle of chip and system developers in automotive, industrial, aerospace and similar markets. However, over the last few years that’s changed significantly. There’s now a more tangible vision towards self-driving cars, with increasingly adventurous Advanced Driver Assistance Systems (ADAS) to capture people’s interest, along with media-rich in-vehicle infotainment. Moreover, the emergence of drones in all shapes and sizes and the growing ubiquity of the industrial Internet of Things are also proliferating the requirement for functional safety, all of which is relevant to ARM®.

 

Much like any technology market surrounded by ‘buzz’, these burgeoning applications require semiconductors to make them happen, and the fast pace of product innovation has attracted huge interest from ARM’s partners. In the IP community, ARM leads the way with a broad portfolio of IP from the ARM Cortex®-M0+ to the mighty Cortex-A72 and beyond. With a heritage in secure compute platforms and functional safety, ARM is well placed to enable the success of its silicon partners.

 

 

What’s functional safety all about?

 

In a nutshell, functional safety is what the name says: it is about ensuring that products operate safely, and continue to do so even when they go wrong. ISO 26262, the standard for automotive electronics, defines functional safety as:

 

“the absence of unreasonable risk due to hazards caused by malfunctioning behaviour of electrical / electronic systems”.

 

 

Standards for other markets such as IEC 61508 for electrical and electronic systems and DO-254 for airborne electronic hardware have their own definitions, although more importantly they also set their own expectations for engineering developments. Hence it’s important to identify the target markets before starting development and ensure suitable processes are followed – attempts to ‘retrofit’ development processes can be costly and ineffective so best avoided. Figure 1 illustrates a variety of standards applicable to Silicon IP.

 

Figure 1. Standards for functional safety of silicon IP

 

 

In practice, functionally safe means a system that is demonstrably safe to a skilled third-party assessor, behaving predictably in the event of a fault. It must fail safe, which could mean retaining full functionality or degrading gracefully, such as reduced functionality or a clean shutdown followed by a reset and restart. It is important to realize that not all faults lead to hazardous events immediately. For example, a fault in a car’s power steering might lead to an incorrect, sudden steering action. However, since the electronic and mechanical designs have natural timing delays, faults can often be tolerated for a specific amount of time. In ISO 26262 this time is known as the fault tolerant time interval, and it depends on the potential hazardous event and the system design.

 

 

What’s at fault?

 

Failures can be systematic, such as those due to human error in specifications and design, or due to the tools used. One way to reduce these errors is to have rigorous quality processes that include a range of plans, reviews and measured assessments. Being able to manage and track requirements is also important, as is good planning and qualification of the tools to be used. ARM provides ARM Compiler 5, certified by TÜV SÜD, to enable safety-related development without further compiler qualification.

 

Another class of failure is random hardware faults; they could be permanent faults, such as a short or a broken via, as illustrated by Figure 2. Alternatively, they could be soft errors caused by exposure to natural radiation. Such faults can be detected by countermeasures designed into the hardware and software; system-level approaches are also important. For example, Logic Built-In Self-Test can be applied at startup or shutdown in order to distinguish between soft and permanent faults. Error logging and reporting is also an essential part of any functionally safe system, although it’s important to remember that faults can occur in the safety infrastructure too.

 

 

Figure 2. Classes of fault

 

 

Selection of countermeasures is the part of the process I enjoy the most; it relates strongly to my background as a platform and system architect, and often starts with a concept-level Failure Modes and Effects Analysis (FMEA). Available countermeasures include diverse checkers, selective hardware and software redundancy, full lock-step replication as available for the Cortex-R5, and the ‘old chestnut’ of error-correcting codes, which we use to protect the memories of many ARM products.

 

 

Get the measure of functional safety

 

Faults that build up over time without effect are called latent faults, and ISO 26262 proposes that a system designated ASIL D, its highest Automotive Safety Integrity Level, should be able to detect at least 90% of all latent faults. As identified by Table 1, it also proposes a target of 99% diagnostic coverage of all single-point failures and a probabilistic metric for random hardware failures of ≤10⁻⁸ per hour.

 

 

Table 1. ISO 26262 proposed metrics

[Table image: proposed metrics]

 

These metrics are often seen as a normative requirement, although in practice they are a proposal, and developers can justify their own target metrics because the objective is to enable safe products, not add bullet points to a product datasheet.
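As a rough, invented illustration of how the coverage and rate targets interact: a block with a total random-hardware failure rate of 100 FIT (1 FIT = \(10^{-9}\) failures per hour) and 99% diagnostic coverage leaves a residual rate of

\[ \lambda_{\mathrm{residual}} = (1 - \mathrm{DC})\,\lambda_{\mathrm{total}} = 0.01 \times 100\ \text{FIT} = 1\ \text{FIT} = 10^{-9}\ \text{per hour}, \]

comfortably inside the proposed \(10^{-8}\) per hour metric.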

 

A question I often ask myself in respect of semi-autonomous driving is whether it’s safer to meet the standard’s proposed metrics for ASIL D with 10,000 DMIPS of processing or have 100,000 DMIPS with reduced diagnostic coverage and enable ‘smarter’ algorithms with better judgement? The answer is application specific, although in many cases a more capable performant system could save more lives than a more resilient system with basic functionality, so long as its failure modes are not wildly non-deterministic.

 

Irrespective of the diagnostic coverage achieved, it’s essential to follow suitable processes when targeting functionally safe applications – and this is where the standards really help. Even if you’re not targeting safety, more rigorous processes can improve overall quality.

 

 

Get it delivered

 

When developing for functional safety, an essential part of the product is the supporting documentation, which needs to include a safety manual to outline the product’s safety case, covering aspects such as the assumptions of use, an explanation of its fault detection and control capabilities, and the development process followed.

 

Safety cases are hierarchical in use: the case for an IP is needed by chip developers to form part of their own safety case, which in turn enables their customers, and so forth. Most licensable silicon IP will be developed as a Safety Element out of Context (SEooC), where its designers will have little or no idea how it will subsequently be utilised. Hence the safety manual must also capture insight from the IP developers about their expectations, in order to avoid inappropriate use.

 

At ARM we support users of targeted IP with safety documentation packages, which always includes a safety manual.

 

So, in summary, when planning for functional safety, think PDS:

  • Process
  • Development
  • Safety documentation package

The trend for the electronics industry remains the same as ever; we want chips that are smaller, faster and more efficient. When you look at the trajectory of SoC designs, you can see that the cost of integrating IP rises sharply when the process node changes. For example, at 10nm the IP integration cost is projected to be almost 4 times that of a 28nm process. It is a growing drain on project resources in terms of the money and effort needed to properly integrate a system.

 

In an effort to solve this integration issue, we need to look within the design flow to identify areas where improvements can be made. One of these is IP configuration. IP configurability is evolving due to the growing prevalence of highly complex IP that designers are integrating into their SoCs. Add to this the amount of competition in the IP market, where silicon partners are looking for IP that is tailored to their design in order to optimize system performance.

 

[Graph: IP integration cost per node]

 

The above graph, provided by Semico Research, shows the cost of first-time effort at each new node with design parameters maxed out. The trend is clear.

 

 

As systems become more complex, the configurability requirements for certain types of IP become exponentially more complex, e.g. a system interconnect (CoreLink NIC-450) or a debug and trace subsystem (CoreSight SoC-400).  These IPs can be considered to have an effectively infinite configuration space, which brings a new class of problem:

 

  • Where do I start?
  • How do I configure all the bits of the IP that I need?
  • How do I know it will work?

 

What we need, then, is more intelligent IP configuration that is based on the system context and configured with awareness of PPA constraints, making the downstream IP integration process simpler and highly automated.

 

Another thing to consider is the highly iterative nature of the IP Integration cycle. Between specification, configuration and integration of components it takes many versions before an optimized system can be built. When you add in the increase in data, dependencies and complexities of current IP, it only adds to the problem. Examples of complex IP configurations that need iteration include debug & trace, interrupts, interconnect, MMU, memory, I/O etc.

 

 

A solution we have been developing at ARM defines an intelligent IP configuration flow to make system integration more scalable and easier to manage. It involves the following:

 

  • Consistent method of IP configuration
  • Configure IP consistent with a system context
  • Automatic creation of IP micro architecture (µarchitecture synthesis)
  • Refinement step with quality assurance (µArchitecture DRCs)
  • Automatic integration of IP into the system (auto-integration)

 

[Image: IP catalog]

 

To enable this concept of intelligent IP configuration, you need tooling to automate the configuration and integration of IP, ensure system viability and reduce the time spent on iterations. ARM® Socrates IP Tooling can do this, using what we are calling ‘IP Creators’. ARM IP Creators have a unique flow (lifecycle) that includes features such as:

 

  • Metadata harvesting, ensuring IP configuration is consistent with the system
  • µArchitecture synthesis
  • µArchitecture DRCs
  • µArchitecture refinement
  • Auto-Integration

 

 

These features accelerate the design cycle (a case study below shows an 8x reduction), reduce risk and simplify system design. Let’s take a closer look at how this is done.

 

 

Metadata Harvesting for initial IP configuration

First, you need to automatically create the system specification. This is done by harvesting the system data, as well as identifying the interfaces on the particular IP, for example an interconnect. Our current flow reads IP-XACT metadata from a system and can infer certain interface configurations for an IP: for an interconnect, we can extract interface requirements such as AMBA® protocol type, data size and address size; for debug and trace, we can infer information like the number and size of ATB interfaces. This process accelerates the specification of the IP interfaces, and the information gathered drives the final IP configuration.

 

[Image: system specification]

 

 

The next step is to define and create the System IP µarchitecture. The system architect can input high-level information, e.g. data paths, memory maps, and other data, which is processed by algorithms to configure the IP. The µArchitecture synthesis automatically creates the IP in a way that is correct-by-construction, through design rule checks (DRCs) that validate the configuration. You can see in the image below the master/slave connections that are generated by the algorithms.

 

 

[Image: µArchitecture synthesis, showing generated master/slave connections]

 

 

The major effect of the µArchitecture synthesis is that configuration iterations are greatly reduced. It results in a system assembly process that is faster and easier. Interfaces are automatically AMBA-compliant through the IP-XACT-driven approach to integration. The image below shows a fully connected system resulting from the µArchitecture synthesis.

 

 

[Image: fully connected system assembly]

 

 

Once system integration is complete, a number of deliverables are generated that can be easily used by different stakeholders within the design team. The RTL of the integrated system design, testbench, test cases, design spec and reports are all automatically published and ready for the next step of SoC design.

 

[Image: RTL generation deliverables]

 

 

Putting this methodology of intelligent IP configuration and automated IP integration to the test, we conducted some internal studies. Typically the creation of a debug and trace subsystem is a time-consuming and iterative process. When using this new approach, the time spent was dramatically reduced from three months to just one week. Even more impressive was the elimination of 90 bugs when comparing the two approaches, as the intelligent methodology did not return a single bug during the design cycle.

 

[Image: debug and trace case study results]

 

Looking to Future Productivity Gains with Socrates IP Tooling

 

The SoCs being designed today have increased dramatically in complexity over the last few years, and will continue to do so. A combination of smaller process nodes, more complex IP and designs targeting highly specific performance points means that system integration plays an important role in the creation of an SoC. Using an automated tooling methodology based on designer input rules can make system assembly easier and faster. Looking to the future, there is potential for innovation around adding physical awareness as new metadata to enable better PPA analysis and trade-offs.

At DAC today, June 6th, we announced the creation of a new partnership program for design houses. Called the ARM Approved Design Partner program, this initiative creates a group of design houses which ARM is happy to recommend to anyone needing design services around ARM IP.

 

We have linked it very closely with the DesignStart program. Launched last year, DesignStart allows registered users to evaluate the Cortex-M0 processor IP completely free of charge. A follow-on fast-track licence route then allows easy and cost-effective access to the full IP to go into production. DesignStart has generated significant interest since launch and one thing we have noticed is that many registrants do not have in-house SoC design capability. To fill this gap, we have recruited ARM Approved Design Partners, all fully audited, approved and recommended by ARM for their capability in successfully designing with ARM IP.

 

To find out more about the program, have a look at http://www.arm.com/armapproved...

 

The founder members, all present at DAC to join in the launch, are Sondrel (based in Reading, UK), eInfoChips (based in Ahmedabad, India), Open-Silicon (based in Milpitas, CA) and SoC Solutions (based in Atlanta, GA). We are delighted to welcome them on board and to be able to recommend them.

 

Chris

Early adopters of ARM's 2017 premium mobile experience IP suite, including the ARM® Cortex®-A73 CPU, Mali™-G71 GPU, and the CoreLink™ CCI-550 cache coherent interconnect as well as the related Artisan POP™ technology, have successfully taped out using Synopsys tools and verification IP.

 

In support of ARM's launch of this new premium mobile suite, Synopsys issued a concurrent news release highlighting our mutual customer tape-out success and detailing the Synopsys products used in these tapeouts.

 

In addition, Synopsys announced the immediate availability of a Reference Implementation for the Cortex-A73 processor (using ARM Artisan POP technology) that you can use to jump-start an optimized implementation of your Cortex-A73 core.

 

Come see ARM, TSMC and Synopsys at DAC to learn more about the Reference Implementation over breakfast on DAC Monday; a great way to kick off DAC!

 

Congratulations to ARM on the well received product rollout and also to our mutual customers who are already moving toward products with this new premium mobile IP suite.

The end of the year is approaching, but I’d like to do one last delta before taking some time off. PowerPC vs. ARM seems like an appropriate stand-off. In this rendition, I will compare the e200z0 core and the Cortex-M4 core, which are the MCU implementations of the corresponding ISAs. For the sake of simplicity, each time the word PowerPC is uttered, I am referring to the e200z0 core; similarly, ARM will stand as a simplification of Cortex-M4.

 

Getting the obvious out of the way

PowerPC is sold both as silicon (i.e. MCUs) and as synthesizable IP blocks; ARM only sells IP, but a number of companies sell microcontrollers built around said IP. At the end of the day, the two cores cannot be compared in terms of technology node, because their implementation depends on a third party. I will say, however, that PowerPCs are typically used in automotive and industrial applications, which tend to use more robust technology nodes than the consumer applications where ARM is typically found. I suspect, but cannot confirm, that one of the reasons for this is that the ARM core is relatively big (physically) and really benefits from a smaller node. It is therefore not strange to find PowerPC devices qualified for −40 to 125°C ranges in LQFP packages, while ARM devices are normally only qualified for 0 to 85°C ranges and come in smaller BGA packages.

 

Architecture

Perhaps the easiest way to compare and contrast each standard is with a side-by-side comparison of the blocks defined by each spec:


Image 1: Side by side comparison of blocks defined in each spec; Cortex-M4 to the left, e200z0 to the right.

Similarly colored boxes show the equivalent blocks for each architecture. It should be immediately obvious that the Cortex-M4, on the left, has a significant number of blocks with no equivalent in the e200z0 architecture.

 

And this is what I’d like to talk about. Power.org has done an excellent job of defining a powerful core, one that is flexible and capable of being hooked up to an almost infinite number of peripherals. And then it stops. Standard peripherals, such as an interrupt handler unit or a debug trace unit, are not defined in the standard, which means each vendor is free to implement them as they wish. ARM, on the other hand, tightly integrates these “standard” peripherals into the core. ARM wins in this situation because tighter integration of debug peripherals means compatibility with standard tools, and tighter integration of the interrupt handler unit means quicker interrupts (but let’s not get ahead of ourselves). This approach also helps vendors integrating the IP, as they do not have to worry about handling these elements (which are more than likely far away from their target application, or from where they want to add value).

 

The direct effect of one approach vs. the other is quickly visible when it comes to interrupts: ARM’s Cortex-M4 guarantees a latency of 3 cycles from the time the interrupt is flagged to the time the core is actually doing something with it. All context registers are stored automatically. The e200z0, on the other hand, requires an external controller to flag the interrupt to the core as an external interrupt. Next, some code is needed to ensure that the context registers are correctly stored. Finally, more code is required to jump to the pending interrupt handler and service it. Latency is therefore not guaranteed, and will vary from implementation to implementation.
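The difference shows up directly in source code. Here is a hedged sketch (handler names are illustrative; the e200z0 details vary by toolchain):

/* Cortex-M4: the NVIC stacks r0-r3, r12, lr, pc and xPSR in hardware, so an
   ordinary C function can be installed directly in the vector table. */
void Timer0_Handler(void)            /* handler name is illustrative */
{
    /* clear the peripheral's flag and do the work; unstacking is automatic */
}

/* e200z0: context save/restore must come from code. Toolchains typically
   provide an interrupt attribute or #pragma that emits the prologue/epilogue
   (exact syntax varies by compiler), and the handler must also acknowledge
   the external interrupt controller before returning. */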

 

But that is not to say that the e200z0 is inferior. Let’s take a look at Table 1:

Feature | Cortex-M4 | e200z0
Execution | In-order | In-order
Memory management/protection unit | Yes | No
Instruction cache | No | No
Signal processing extension | Yes | No
Pipeline | 3-stage | 4-stage
Branch processing unit | Not explicit | Yes
Multiply cycles | 1 | 1
Integer divide cycles | 2-12 | 5-34
Endianness | Little | Big
Architecture | Harvard | Harvard
Interrupt controller | Internal | External
Jump-to-ISR latency | 3 cycles | Code dependent; several cycles
Relocatable ISR table | Yes | Yes
Debug interfaces | JTAG, SWD | JTAG
Number of core registers | 13 + SP, LR, PC (16 total) | 32 + SP, CR, LR, CTR
Instruction set supported | Thumb 16-bit instructions | VLE 16-bit instructions
Table 1.

In fact, when you look at the generalities, the e200z0 and the Cortex-M4 are very similar: Harvard-architecture, 32-bit RISC machines with no out-of-order execution and 1-cycle execution times for most instructions. Yes, the Cortex-M4 is about twice as fast as the e200z0 when it comes to division, but the fact that the latter has double the number of core registers means that it can economize on load/store cycles.

 

Which brings us to the instruction set architecture.

ISA

In a similar effort, both ARM and Power.org have created extensions to their original ISAs with the goal of reformatting instructions into 16-bit words to help with code density. Both communities have since released devices that are only compatible with these extensions, removing all support for the original ISA. This is the case for both the e200z0 and the Cortex-M4, with the Variable Length Encoding and Thumb ISAs, respectively.

 

Comparing and contrasting both ISAs probably deserves a blog entry of its own, but the gist of it is that both instruction sets have similar encodings. Perhaps worthy of a special mention is Thumb’s immediate rotate addressing mode, which allows a core register to be shifted while another operation is performed in the same execution cycle.

 

Truth be told, both ISAs are so complex that it is up to the compiler to fully exploit their advantages. Take, for example, the Cortex-M4 DSP extension, which adds a DSP-like unit capable of 1-cycle multiply-and-accumulate operations, among others. When writing code, a simple line such as

y = (m * x + b);

will compile to a standard sequence of loads, multiplies, adds, and stores. In order to use the DSP extension, an abstraction layer needs to be downloaded from ARM.com, and function-like calls need to be made (which are replaced by macros and take advantage of said extension).
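For instance, a dual 16-bit multiply-accumulate is reached through a function-like CMSIS intrinsic rather than plain C arithmetic. This is a hedged sketch, with the packing of the operand pairs assumed to have been done elsewhere:

#include "core_cm4.h"   /* CMSIS-Core header, normally pulled in via the device header */

/* One SMLAD instruction: acc + m0*x0 + m1*x1, i.e. both 16-bit multiplies
   and the accumulate in a single cycle. */
uint32_t mac_pairs(uint32_t m_pair, uint32_t x_pair, uint32_t acc)
{
    return __SMLAD(m_pair, x_pair, acc);
}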

 

Which means that code is no longer portable to, say, a PowerPC architecture.

 

Toolchain support

This category is tough. Both organizations have done an excellent job of standardizing their architectures, and a plethora of compilers and standard tools is available for both. Since both are also JTAG-compliant, this means that almost anything can be used to develop for them:

 

  • gcc

  • CodeWarrior

  • Green Hills

  • IAR Workbench (ARM only)

 

I’d say there’s a tie here; although there may be specialized tools in each case, debugging activities are not necessarily harder on one platform than on the other.

 

Conclusion

If both architectures were to hit the market for the first time today, with the same IP-based distribution model, it would be really hard to predict which would win. The Cortex-M4 is tightly integrated with an interrupt controller and debugging support, while the e200z0 allows vendors a greater amount of customization. The Cortex-M4 allows bit-shifting as part of a register load or store, but the e200z0 doesn’t need to perform loads and stores as often because it has more core registers. The Cortex-M4 is slightly faster at fixed-point division.  Toolchain support is excellent for both architectures. Without bringing these characteristics down to specific products, it’s hard to pick a winner!

References

Power ISA v. 2.06B

Cortex-M4 Reference Manual

By now you will have read the news about the latest ARM® Cortex®-A73 processor and Mali™-G71 GPU. These new processors allow for more performance in ever-thinner mobile devices, and accelerate new use cases such as Virtual Reality (VR), Augmented Reality (AR) and the playback and capture of rich 4K content. However, these applications place increased demands on the system, and require more data to be moved between processors, cameras, displays and memory. This is the job of the memory system.

 

[Image: new use cases driving system demand]

 

 

ARM Develops IP Together at the System Level

 

To get the best user experience the memory system must balance the demands of peak performance, low latency and high efficiency. The ARM CoreLink™ interconnect and memory controller IP provide the solution. ARM develops processor, multimedia and system IP together, including design, verification and performance optimization, to get the best overall system performance and to help our silicon partners get to market faster with a reduced integration cost.

[Figure: The memory system is key to user experience]

 

There are three key properties that the memory system must deliver:

 

  • Lower memory latency - to ensure a responsive and fluid experience. This helps maintain a high frame rate providing a more natural VR & AR experience, as well as improving most other use cases such as web browsing and social media interactions.
  • Higher peak bandwidth - to support the increase in pixels and frame rate expected by 4K and HDR content. We’re also seeing mobile devices with higher megapixel counts or multiple cameras; in both cases we need to move more data to and from memory.
  • Improved memory efficiency - to move more data in the same or lower power budget. This can be enabled by innovation in the interconnect, for example hardware cache coherency, as well as improvements in the memory controller to get the best utilization of dynamic memory.

 

This blog describes how the latest CoreLink System IP delivers on the above requirements.

 

 

Optimized Path to Memory with CoreLink CCI-550 and DMC-500

 

The ARM CoreLink CCI-550 Cache Coherent Interconnect and DMC-500 Dynamic Memory Controller have been optimized to get the best from Cortex-A73 and Mali-G71. ARM big.LITTLE™ processing has relied on CCI products to provide full cache coherency between Cortex processors for a number of years now. For the first time, Mali-G71 offers a fully coherent memory interface with AMBA® 4 ACE. This means sharing data between CPU and GPU is easier to develop, lower latency and lower power.

 

[Figure: Optimized memory system]

 

 

Accelerating Heterogeneous GPU Compute

 

GPU compute exists today, but with software or IO coherency it can be difficult to use. Here’s a quote from a middleware developer regarding the cost:

 

“30% of our development effort was spent on the design, implementation and debugging of complex software coherency.”

Mukund Srinivasan, VP Media Client Business, Ittiam Systems

 

 

A fully coherent CPU and GPU memory system offers a simplified programming model and improved performance efficiency. This is enabled by two fundamental technologies:

 

  • Shared Virtual Memory (SVM) - where all processors use the same virtual address to access a shared data buffer. Sharing data between processes is now as simple as passing a pointer.
  • Hardware Coherency - which ensures all coherent processors see the same shared data and removes the need to clean and invalidate caches.

 

The following chart summarizes the benefit of these technologies and highlights how a fully coherent memory system can provide a ‘fine-grained’ shared virtual memory where the CPU and GPU can work on a shared buffer at the same time.

 

 

[Figure: Shared virtual memory and full coherency]

For a more detailed explanation see this blog:

Exploring How Cache Coherency Accelerates Heterogeneous Compute

 

 

OpenCL 2.0 is one API that enables programming with fine-grained SVM. Initial benchmarking at ARM is showing promising results. We have created a simple test called “Workload Balancing” that is designed to stress the processing and moving of data between CPU and GPU. As you can see from the chart below, moving from software coherency to a fine-grained fully coherent memory system can reduce overheads by as much as 90%.

 

[Figure: GPU compute benefits with coherency]
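To give a feel for why fine-grained SVM reduces overhead, here is a minimal host-side sketch in OpenCL 2.0 C (assuming an already-created context, queue and kernel; the function and variable names are illustrative). The buffer is allocated once with clSVMAlloc and handed to the kernel by pointer, with no copies, maps or cache maintenance calls:

#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>

/* Sketch: allocate a fine-grained SVM buffer and pass it to a kernel.
 * 'ctx', 'queue' and 'kernel' are assumed to be created elsewhere. */
void run_on_shared_buffer(cl_context ctx, cl_command_queue queue,
                          cl_kernel kernel, size_t n)
{
    /* On fine-grain hardware, CPU and GPU may touch this buffer concurrently. */
    float *data = (float *)clSVMAlloc(ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        n * sizeof(float), 0);

    for (size_t i = 0; i < n; ++i)      /* CPU writes through the shared pointer... */
        data[i] = (float)i;

    clSetKernelArgSVMPointer(kernel, 0, data);  /* ...and the GPU sees the same data. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    clSVMFree(ctx, data);
}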

 

Increasing Cortex-A73 Processor Performance

 

A high performance and low latency path to memory for the Cortex processors is fundamental to providing a fluid and responsive experience for all applications. The snoop filter technology integrated into the CoreLink CCI-550 enables a higher peak performance and offers system power savings which are discussed later in the blog.

 

The following example shows how the snoop filter can improve memory performance of a Cortex-A73 in a system where the LITTLE core, Cortex-A53, is idle and running at a low frequency. Under these conditions, any big core memory access will snoop the LITTLE core and will see a higher latency. This could slow down any applications that access memory and may make the device feel sluggish and less responsive.

With the snoop filter enabled the memory requests are managed by the snoop filter and see a consistently low latency, even if the LITTLE core is in a lower power state and running at a low clock frequency.

 

[Figure: Snoop filter and CPU performance]

 

 

As can be seen in the chart below, when the snoop filter is enabled the memory tests in the ‘Geekbench’ benchmark see a significant improvement, as much as 241%. Other tests, such as integer and floating-point, run within the processor caches rather than accessing memory, so they see less of a benefit. Overall, the improvement in Geekbench score is as much as 28%. In terms of real-world applications this would deliver a more fluid user experience.

 

[Figure: Geekbench improvement with snoop filter]

 

Reducing Memory Latency with Advanced Quality-of-Service (QoS)

 

Reducing latency can give a boost to any application that is working with memory, especially gaming, VR, productivity and web browser tasks. CoreLink CCI-550, NIC-450 and DMC-500 introduce a new interface called ‘QoSAccept’ which is designed to minimize the latency of important memory requests.

Benchmarking within ARM has shown a 38% reduction in latency through the interconnect for worst-case traffic; in this example, a CPU workload is limited to one outstanding transaction.

 

 

[Figure: QoSAccept demonstrates lowest latency]

For more details, refer to this whitepaper:

Whitepaper: Optimizing Performance for an ARM Mobile Memory Subsystem

 

 

System Power Savings with CoreLink CCI-550

 

Mobile devices are getting ever thinner while compute requirements are increasing, which means the whole system must deliver improved power efficiency. The CoreLink CCI-550 and DMC-500 play an important role as they are central to memory system power. The snoop filter technology allows the number of coherent devices to scale without negatively impacting system power consumption. In fact, the snoop filter saves power in two ways:

 

  • On-chip power savings - by resolving coherency in one central location instead of broadcasting snoops to every processor.

  • DRAM + PHY power savings - by reducing the number of expensive external memory accesses, whenever data is found in on-chip caches.

 

As the chart below demonstrates, we see more power savings as the number of coherent ACE interfaces increases, and as the proportion of sharable data increases. In this example, “30% sharable” might represent a system where only the big.LITTLE CPU accesses are coherent, and “100% sharable” might represent a future GPU compute use case where all CPU and multimedia traffic is coherent.

 

[Figure: CCI-550 system power savings]

 

 

While this example shows a system with four ACE interfaces, the CoreLink CCI-550 can scale to a total of six ACE interfaces to support systems with the highest-performance 32-core Mali-G71.

 

 

Scalability to Minimize Area and Cost

 

Cost, including die area, is always important to the silicon partner and OEM. Reducing the area of silicon gates is also important for reducing power. For these reasons CoreLink CCI-550 has been designed to scale from low cost mobile up to high resolution, high performance tablets and clamshell devices. This scalability also allows the system integrator to tune the design to meet their exact system needs. In terms of peak system bandwidth, CoreLink CCI-550 can offer up to 60% higher peak bandwidth than the CoreLink CCI-500.

 

[Figure: Peak bandwidth and area scaling]

 

Memory System is Key to User Experience

 

To summarize, the interconnect and memory controller play an important role in delivering the performance expected from the latest Cortex and Mali processors. As noted above, CoreLink CCI-550 and DMC-500 can give a 28% increase in Geekbench score, a 38% reduction in memory latency, and potentially save hundreds of milliwatts of memory system power. This is fundamental to delivering the best possible user experience within a strict power envelope.

ARM’s coherent interconnect products are silicon proven, have been implemented across a range of applications, and have been licensed over 60 times by silicon partners including AMD, HiSilicon, NXP, Samsung and Xilinx to name a few.

I look forward to seeing CoreLink CCI-550 in the latest devices!

 

 

 

Further Information:

 

Please feel free to comment below if you have any questions.

Consider this: The performance of smartphones, nearly all of which are powered by ARM processors, has grown by 100x since 2009. One hundred times in seven years! With that has emerged entirely new functionality, lightning-fast user responsiveness, and immersive user experiences – all in the same power footprint. It’s really an unrivaled engineering achievement, given the challenging design constraints in the mobile space.

[Figure: Evolution of the smartphone]

This performance, functionality and user experience dynamic has driven a truly remarkable market, which will see more than 1.5 billion handsets sold in 2016.

With this consumer embrace, smartphone design has become, in many ways, the platform for future innovation. Augmented and virtual reality, ultra-HD visualization, object-based audio processing and computer vision all underlie the demand for extra system performance. At the same time, smartphone designs have slimmed considerably in recent years, which limits thermal dissipation and ratchets up the need for thoughtful power management design. Battery capacity cannot keep growing either, as smartphones have gotten as large as they practically can. To continue delivering more immersive user experiences and stay on the smartphone innovation path we’ve blazed in the past decade, we need to deliver more sustained performance with higher efficiency.

 

To this end, ARM has announced its latest high-performance processor, the Cortex-A73. After introducing Cortex-A72 just last year, ARM is accelerating its innovation pace with the Cortex-A73 processor, which will power premium smartphones by early 2017.

 

The Cortex-A73 is designed and optimized specifically for mobile and consumer devices. The aspects of the Cortex-A73 that I’m most excited about all come down to efficient performance:

 

  • Delivers the highest performance in the mobile power envelope, at frequencies up to 2.8GHz
  • With 30% better power efficiency to sustain the best user experience
  • Inside the smallest ARMv8-A footprint ever.

 

I’ve had the privilege of sitting alongside the design team that has created the Cortex-A73, with the specific intent of meeting this challenge: to be the most efficient and highest performance ARM processor. What follows is an overview of the main features and key enhancements of the Cortex-A73 and their resulting benefits.

 

Cortex-A73: ARMv8-A high-performance processor

 

[Figure: Cortex-A73 diagram]

 

Starting with the basics, the Cortex-A73 supports the full ARMv8-A architecture. Its feature set is ideal for mobile and consumer devices. ARMv8-A includes ARM TrustZone technology, NEON, virtualization and cryptography. In both 32-bit and 64-bit execution, the Cortex-A73 gives access to the widest mobile application and middleware ecosystem – mobile software is developed and optimized by default on the ARM architecture.

The Cortex-A73 includes a 128-bit AMBA 4 ACE interface enabling integration in ARM big.LITTLE systems, either with the highly efficient Cortex-A53 in premium designs or with our latest ultra-efficient Cortex-A35 processor in mid-range and more cost constrained designs.

 

Highest performance

The Cortex-A73 processor is designed for your next-generation premium smartphone. When implemented in advanced 10nm technology, the Cortex-A73 delivers 30% more sustained performance than our most recent high-performance CPU, the Cortex-A72. Running at frequencies up to 2.8GHz, the Cortex-A73 also delivers the highest peak performance, and its extreme energy efficiency brings sustained performance close to that peak. What you’ll notice in the chart below is that the Cortex-A73 can sustain operation at nearly peak frequency, a rarity in mobile phone processors today, where real-world frequencies get throttled back.

[Figure: Cortex-A73 maximizes performance]

 

 

Performance optimized for mobile

 

The Cortex-A73 micro-architecture includes several interesting performance optimizations that I can share (and quite a few others that I can’t). It supports a 64kB instruction cache, state-of-the-art branch prediction based on the most advanced algorithms, and high-performance instruction prefetching. The main performance improvements are actually in the data memory system. It uses advanced L1 and L2 data prefetchers with complex pattern detection. We have also optimized the store buffer for continuous write streams and increased the data cache to 64kB without any timing impact.

 

These enhancements translate into a performance uplift of up to 10% in mobile use cases compared to the Cortex-A72 at iso-frequency. We expect silicon designs with the Cortex-A73 to push frequency further than previous generations, a venture assisted by the increased efficiency. Moreover, the Cortex-A73 consistently beats the Cortex-A72 on all memory workloads by at least 15%, increasing performance across applications, operating system operations and complex compute such as NEON processing.

 

 

[Figure: Cortex-A73 performance optimized for mobile]

 

 

Power efficiency benefits

Even while delivering this uplift in performance, the Cortex-A73 requires less power than the Cortex-A72. The Cortex-A73 implements several optimizations to reduce power, such as an aggressive clock-gating scheme, power-optimized RAM organization, and optimal resource sharing between AArch32 and AArch64 execution.

 

Compared to the Cortex-A72, the power saving for a combination of integer workloads is above 20%, and even higher for workloads such as floating-point or memory access. This power efficiency enables a better user experience and extends battery life. Alternatively, it can be used to give extra headroom to the rest of the SoC, enabling the overall system and the graphics processor to increase performance and provide better visual effects, higher frame rates or new functionality.

 

[Figure: Cortex-A73 power efficiency benefit]

 

The smallest ARM Premium CPU

 

In addition to delivering the highest sustained and peak performance, the Cortex-A73 is even more compelling because it delivers this performance in the smallest area of any ARMv8-A premium processor. This translates into a premium experience at mid-range costs for the increasingly important mid-range smartphone market. The Cortex-A73 is smaller than the ARMv7-A Cortex-A15; compared to the Cortex-A57 and Cortex-A72, it offers 70% and 46% area reductions respectively, well beyond the benefit of the process technology alone. At iso-process, the Cortex-A73 core is up to 25% smaller than the Cortex-A72. Optimal for implementation in advanced technology nodes such as 16nm and 10nm, the Cortex-A73 also scales very efficiently in mass-market nodes such as 28nm to provide a significant performance uplift for mid-range devices. The reduced footprint frees silicon area for integrating more functionality or increasing the performance of other IP in premium systems, or for decreasing SoC and device costs in mid-range systems.

 

[Figure: Cortex-A73, the smallest ARM premium CPU]

 

Boost your mid-range smartphone

 

With our big.LITTLE technology and CoreLink CCI, ARM provides great scalability that enables our partners to differentiate and optimize their systems. What does that mean? SoC designers can create designs with one or two big cores and two or four LITTLE cores that rival the performance and user experience of premium designs. An exclusive L2 cache can scale down to 1MB and still provide enough cache to support the big cores in real-world high-performance workloads. big.LITTLE software can adapt to all of these scalable configurations by placing work optimally based on an energy model.

 

big.LITTLE technology is widely deployed in the mobile market today. The Cortex-A73, combined with the Cortex-A53, will power the next generation of premium smartphones, typically in an octa-core configuration. In addition, the Cortex-A73 provides the opportunity to boost the mid-range user experience to a higher level. For example, in a hexa-core big.LITTLE configuration, a dual-core Cortex-A73 with a quad-core Cortex-A53 or Cortex-A35 enables a significant performance uplift in the same or less area than an octa-core Cortex-A53 - a common topology that has been very successful in entry and mid-range devices. In comparison to an octa-core Cortex-A53, the Cortex-A73 hexa-core delivers 30% more multi-core performance and twice the single-thread peak performance, resulting in a considerable improvement in user experience thanks to reduced response times for applications such as web browsing and interface scrolling.

 

[Figure: Cortex-A73 delivers more performance]

 

In summary, I am proud to have worked alongside the team that has developed the most efficient high-performance processor, all in pursuit of the continuous improvement in user experience that has come to characterize mobile devices based on the ARM architecture. With the Cortex-A73 processor, you get more for less: more performance and more battery life for less power and less area. Later this year and in 2017, our partners will integrate the Cortex-A73, bringing new functionality and new innovation to premium smartphones, tablets, clamshells, DTVs, and a wide range of consumer devices. I can’t wait to see what they will build.

 

Related stories:

A walk-through of the Microarchitectural improvements in Cortex-A72

Introducing Cortex-A32: ARM’s smallest, lowest power ARMv8-A processor for next generation 32-bit embedded applications

Memory System is Key to User Experience with Cortex-A73 and Mali-G71

 

Cache Coherency and Shared Virtual Memory

The Heterogeneous System Architecture (HSA) Foundation is a not-for-profit consortium of SoC IP vendors, OEMs, academia, SoC vendors, OSVs and ISVs whose goal is to make it easier for software developers to take advantage of all the advanced processing hardware on a modern SoC. The CPU and GPU on a typical applications processor occupy a significant proportion of the die area, and applying these resources efficiently across multiple applications can improve the end-user experience. Done right, efficiency can be gained in power, performance, programmability and portability.

 

This blog focuses on some of the hardware innovations and changes that are relevant to shared virtual memory and cache coherency, which are components of the HSA hardware specification.

 

What is Shared Virtual Memory?

Traditional memory systems defined separate memory for the CPU and GPU. In the case of PCs, the GPU may have completely separate discrete memory chips on a different board. In these systems, any application that wants to share data between CPU and GPU needs to copy it from CPU memory to graphics memory, at a significant cost in latency and power.

 

Mobile systems have had a unified memory system for many years where all processors can access the same physical memory. However, even though this is physically possible, the software APIs and memory management hardware and software may not allow this. Graphics buffers may still be defined separately from other memory regions and data sharing may still require an expensive copy of data between buffers.

 

Shared virtual memory (SVM) allows processors to see the same view of memory; specifically, the same virtual address on the CPU and GPU will point to the same physical memory location. With this architecture, an application only needs to pass a pointer between processors that are sharing data.

 

There are multiple ways to implement SVM; it doesn’t mean you have to share the exact same page table. The only requirement is that if a buffer is to be shared between processors then it must appear in the page tables for both memory management units (MMUs). With SVM in place, sharing data becomes as simple as passing a pointer between processors.

 

So What is Cache Coherency?

Let’s go back to basics and ask what coherency means. Coherency is about ensuring that all processors, or bus masters, in the system see the same data. For example, if I have a processor that is creating a data structure in its local cache and then passing it to a GPU, both the processor and the GPU must see the same data. If the GPU instead reads from external DDR, it will read old, stale data.

 

There are three mechanisms to maintain coherency:

 

  • Disabling caching is the simplest mechanism, but it may cost significant processor performance. To get the highest performance, processors are pipelined to run at high frequency and access caches, which offer very low latency. Caching data that is accessed multiple times increases performance significantly and reduces DRAM accesses and power. Marking data as “non-cached” could hurt performance and power, so in reality this approach is not used.
  • Software managed coherency is the traditional solution to the data sharing problem. Here the software, usually device drivers, must clean dirty data from caches and invalidate old data to enable sharing with other processors or masters in the system. This takes processor cycles, bus bandwidth, and power.
  • Hardware managed coherency offers an alternative to simplify software. With this solution any cached data marked ‘shared’ will always be up to date, automatically. All processors and bus masters in that sharing domain see the exact same value.

 

Challenges with Software Coherency

A cache stores external memory contents close to the processor to reduce the latency and power of accesses. On-chip memory accesses are significantly lower power than external DRAM accesses.

 

Software managed coherency manages cache contents with two key mechanisms:

 

  • Cache Cleaning:
    • If any data stored in a cache is modified, it is marked as ‘dirty’ and must be written back to DRAM at some point in the future. The process of cleaning forces dirty data to be written to external memory. There are two ways to do this: 1) clean the whole cache, which would impact all applications, or 2) clean specific addresses one by one. Both are very expensive in CPU cycles.
    • With modern multi-core systems this cache cleaning must happen on all cores.
  • Cache Invalidation:
    • If a processor has a local copy of data but an external agent updates main memory, then the cache contents are out of date, or ‘stale’. Before reading this data the processor must remove the stale data from its caches; this is known as ‘invalidation’ (a cache line is marked invalid).
    • An example is a region of memory used as a shared buffer for network traffic, which may be updated by the network interface’s DMA hardware; a processor wishing to access this data must invalidate any old stale copy before reading the new data. The driver-style sketch after this list illustrates the maintenance calls involved.
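To make the cost tangible, here is a minimal sketch of software-managed coherency around such a receive buffer, using the Linux kernel DMA API (dma_sync_single_for_device and dma_sync_single_for_cpu are the standard kernel calls; dev, buf_handle and len are hypothetical names assumed to be set up elsewhere in the driver):

#include <linux/dma-mapping.h>

/* Sketch: the driver must explicitly hand the buffer back and forth
 * between CPU and device; each handover is a cache maintenance operation. */
static void start_rx(struct device *dev, dma_addr_t buf_handle, size_t len)
{
    /* Give the buffer to the device: ensure no dirty CPU cache lines
     * will later be evicted on top of the freshly DMA'd data. */
    dma_sync_single_for_device(dev, buf_handle, len, DMA_FROM_DEVICE);
    /* ... start the network interface DMA here ... */
}

static void finish_rx(struct device *dev, dma_addr_t buf_handle, size_t len)
{
    /* Take the buffer back for the CPU: invalidate stale cache lines
     * so reads observe the data the device just wrote. */
    dma_sync_single_for_cpu(dev, buf_handle, len, DMA_FROM_DEVICE);
    /* ... the CPU may now safely parse the received packet ... */
}

With hardware coherency, both calls (and the bugs that come from misplacing them) simply disappear.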

 

Complexity of Software Coherency

“We would like to connect more devices with hardware coherency to simplify software and accelerate product schedules”

“50% of debug time is spent on SW coherency issues as these are difficult to find and pinpoint”

Quotes from a system architect at an application processor vendor.

 

Software coherency is hard to debug: the cache cleaning and invalidation must be done at the right time. If done too often, it wastes power and CPU effort. If done too infrequently, it results in stale data that may cause unpredictable application behaviour, if not a crash. Debugging this is extremely difficult, as it presents as occasional data corruption.

 

Looking specifically at CPU and GPU sharing, this software cache maintenance is difficult to optimize, and applications on these systems tend to avoid sharing data because of the cost and complexity. One middleware vendor using GPU compute with software coherency noted that its developers spent around 30% of their time architecting, implementing and debugging the data sharing, including breaking image data down into sub-frames and carefully timing the mapping and unmapping functions.

 

When sharing is used with software coherency, the size of the task running on the GPU must be large enough to make it worthwhile, taking into account the cost of software coherency.

 

Hardware Coherency Requires an Advanced Bus Protocol

Extending hardware coherency to the system requires a coherent bus protocol, and in 2011 ARM® released the AMBA® 4 ACE specification, which introduced the “AXI Coherency Extensions” on top of the popular AXI protocol. The full ACE interface allows hardware coherency between processor clusters and allows an SMP operating system to extend to more cores.

 

With the example of two clusters, any shared access to memory can ‘snoop’ into the other cluster’s caches to see if the data is already on chip; if not, it is fetched from external memory (DDR). In mobile, this has enabled the big.LITTLE™ processing model which improves performance and power efficiency by utilizing the right core to suit the size of the task.

 

The AMBA 4 ACE-Lite interface is designed for IO (or one-way) coherent system masters like DMA engines, network interfaces and accelerators. These devices may not have any caches of their own, but they can read shared data from the ACE processors. Alternatively, they may have caches but these would still need to be cleaned and invalidated by software.

 

While hardware coherency may add some complexity to the interconnect and processors, it massively simplifies the software and enables applications that would not be possible with software coherency such as big.LITTLE processing.

 

Adding Hardware Coherency to the GPU

While processor clusters have implemented cache coherency protocols for many years, this is a new area for GPUs. As applications look to share more data between CPU and GPU, hardware cache coherency ensures this can be done at a low cost in power and latency, making sharing easier, more power efficient and higher performance than any software-managed mechanism. Most importantly, it makes it easy for the software developer to share data.

 

There are two ways a GPU could be connected with hardware coherency:

 

  • IO coherency (also known as one-way coherency) using ACE-Lite where the GPU can read from CPU caches. Examples include the ARM Mali™-T600, 700 and 800 series GPUs.
  • Full coherency using full ACE, where CPU and GPU can see each other’s caches.

 

[Figure: CoreLink CCI-550]

The Powerful Combination of SVM and Hardware Coherency

The following diagrams summarize what we’ve learned so far and also describe the coarse and fine grain shared virtual memory. These charts approximate elapsed time on the horizontal axis, and address space on the vertical axis.

 

[Figure: Traditional memory system - software coherency with data copies]

The above chart shows a traditional memory system, where software coherency requires data to be cleaned from caches and copied between processors to ‘share’ it. In addition to cache cleaning, the target processor’s cache would also need to invalidate any old data before reading the new data from DRAM. This is time consuming and power hungry, and it limits the applications that can take advantage of heterogeneous processing.

 

[Figure: Coarse-grained SVM with IO coherency]

With shared virtual memory the CPU and GPU can now share physical memory and operate on the same virtual addresses, which eliminates the copy. If we have an IO-coherent GPU, in other words a one-way coherent GPU that can read CPU caches, then we also remove the need to clean data from CPU caches. However, because this is one-way, the CPU cannot see the GPU caches, which means the GPU caches must be cleaned with cache maintenance operations after processing completes. This ‘coarse-grain’ SVM means the processors must take turns accessing the shared buffer, as the host-side sketch below illustrates.
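The following sketch shows what taking turns looks like in OpenCL 2.0 C under coarse-grained SVM (all names are illustrative; buf is assumed to have been allocated with clSVMAlloc without the fine-grain flag). The CPU must bracket its accesses with map/unmap so the runtime can perform any needed cache maintenance:

#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>

/* Sketch: coarse-grained SVM - CPU and GPU take turns on the buffer.
 * 'queue', 'kernel', 'buf' and 'n' are assumed to be set up elsewhere. */
void cpu_then_gpu_turns(cl_command_queue queue, cl_kernel kernel,
                        float *buf, size_t n)
{
    /* CPU's turn: map, write, unmap. */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, buf, n * sizeof(float),
                    0, NULL, NULL);
    for (size_t i = 0; i < n; ++i)
        buf[i] = (float)i;
    clEnqueueSVMUnmap(queue, buf, 0, NULL, NULL);

    /* GPU's turn: the kernel sees the CPU's writes once the unmap completes. */
    clSetKernelArgSVMPointer(kernel, 0, buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);
}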

 

[Figure: Fine-grained SVM with full coherency]

Finally, if we enable a fully coherent memory system, then both CPU and GPU see exactly the same data at all times, and we can use ‘fine-grained’ SVM. This means both processors can access the same buffer at the same time instead of taking turns, with handshaking between processors performed via cross-device atomics, as in the kernel sketch below. By removing all of the cache maintenance overheads we get the best overall performance.
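For illustration, the hypothetical OpenCL 2.0 kernel below uses a C11-style atomic with all-SVM-devices scope, so a counter incremented by the GPU is immediately visible to a CPU thread polling the same fine-grained SVM buffer (the kernel and argument names are assumptions):

// Sketch: GPU works on the shared buffer and signals progress through an
// atomic counter that the host polls on the same fine-grained SVM allocation.
kernel void process_and_signal(global float *data,
                               volatile global atomic_int *done_count)
{
    size_t i = get_global_id(0);
    data[i] *= 2.0f;                      // work on the shared buffer

    // Release ordering with all-SVM-devices scope: the host observes the
    // data write before it observes the incremented counter.
    atomic_fetch_add_explicit(done_count, 1,
                              memory_order_release,
                              memory_scope_all_svm_devices);
}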

 

Connecting Hardware with Software: Compute APIs

At this point it’s useful to map these hardware technologies to the software APIs. Compute APIs like OpenCL 2.0 can take full advantage of SVM and hardware coherency, and can run on HSA platforms. Not all OpenCL 2.0 implementations are the same; there are a number of optional features that can be enabled if the hardware supports them. These features can also be mapped to the HSA profiles, base profile and full profile, as shown in the table below.

 

OpenCL Feature       | Shared Virtual Memory   | Fully Coherent Memory                           | HSA Profile
Fine Grained Buffer  | Required, buffer level  | Required, fully coherent                        | Base Profile
Fine Grained System  | Required, full memory   | Required, fully coherent                        | Full Profile
Coarse Grain         | Required, buffer level  | Not required (legacy, software or IO coherency) | -

 

 

HSA always requires hardware coherency, and with the base profile the scope of shared virtual memory can be limited to the shared buffers. This means only the shared buffers would appear in both CPU and GPU page tables, not the full system memory. This may be easier and lower cost to implement in hardware.

 

Full coherency is required for fine grain, and this enables both CPU and GPU to work on different addresses within the same data buffer at the same time.

 

Full coherency also allows the use of atomic operations, which allows processors to work on the same address within the same buffer. Atomic operations allow synchronization between threads, much like in a multi-core CPU. Atomics are optional for OpenCL but required for HSA.

 

For coarse grain, if hardware coherency is not present, software-managed coherency must be used, including cache maintenance operations, or optionally IO coherency for the GPU.

 

Hardware Requirements for Cache Coherency and Shared Virtual Memory

The hardware required to implement these technologies already exists today in the form of fully coherent processors and cache coherent interconnects. The interconnect is responsible for connecting processors, peripherals and memory together on the system on chip (SoC). The AMD Kaveri APU already has fully coherent memory between the CPU and GPU. ARM offers IP such as the CoreLink™ CCI-550 Cache Coherent Interconnect, the Cortex®-A72 processor and the Mali Mimir GPU, which together support the full coherency and shared virtual memory techniques described above.

 

Interconnect innovations, such as snoop filters, are essential to support scaling to higher-performance memory systems. The snoop filter acts as a directory of processor cache contents and allows any memory access to be targeted directly at the processor that holds that data. More detail can be found in this blog: CoreLink CCI-500 and Snoop Filter.

 

Cache Coherency Brings Heterogeneous Compute One Step Closer

HSA, with full coherency and shared virtual memory, is all about delivering new, enhanced user experiences through advances in computing architectures that bring improvements across key areas:

 

  • performance
  • power efficiency
  • reduced software complexity

 

Application developers now have access to the complete compute potential of an SoC, where workloads can be moved seamlessly between computing devices, enabling right-sized computing for the given task.
