
The modern smartphone has long since ceased to be just a tool for calling other people. Nowadays we use it as a remote control for our world: recording our experiences first hand, entertaining us, augmenting our hobbies and keeping our always-on social calendars. Those are just some of the ways we consciously use our phones, but they also serve as the primary computing device behind embedded applications such as fitness monitoring and smart home functions like climate control and security.

 

Modern SoCs need to be designed with these use cases in mind, enabling both always-on computing and demanding compute tasks while remaining efficient enough to last all day. ARM partner MediaTek has a successful track record of designing SoCs tailored to the demands of its end users, people like you and me.

 

MediaTek have announced a new addition to the Helio line of SoCs, the Helio X20. The Helio X family of SoCs are known for using cutting edge innovation and technology to deliver powerful performance for the rich experiences that are demanded of the modern smartphone.

 

 

Helio X20.png

    MediaTek's flagship mobile SoC

 

 

The Helio X20 is the world’s first mobile processor with a Tri-Cluster CPU architecture, featuring ten processing cores. The Tri-Cluster architecture is an innovation based on ARM’s big.LITTLE™ processing technology that delivers increased parallel performance at significantly lower average power.  Each processor cluster is designed to efficiently handle different types of workload, enabling a better allocation of tasks which results in optimum performance and extended battery life for users.

 

 

For example:

  • The smallest cluster may be used for background computing or simple tasks such as sending messages
  • The middle cluster gives the user an unprecedented balance of computing power and energy efficiency
  • The biggest cluster will be called into play for a task that requires more performance, such as video streaming

 

 

The chip delivers leading-edge camera, display, gaming and audio features for today’s most demanding applications. The display is refreshed at an accelerated 120 frames per second for crisp, responsive scrolling of web content and maps, and for uncompromised motion viewing that supports mobile games with high-resolution graphics. The video and picture quality is sharp enough that the screen is forgotten; it feels like real life.

 

 

State of the art visuals with MiraVision

 

Let’s take a look under the hood to see how MediaTek managed to do this with the Helio X20.

 

 

ARM Technology Powers the Helio X20

 

The ARM Cortex-A72 processor is used in a dual-core configuration as the large cluster on the chip, running at 2.3GHz. ARM’s flagship application processor reaches this level of performance with 30% less power consumption, thanks to micro-architectural innovations that enhance floating point, integer and memory performance and improve the execution of every major class of workload.

 

The Cortex-A53 processor was used in the medium and small clusters, with a quad-core configuration running at 2.0GHz as the medium cluster and another quad-core with 1.4GHz as the small cluster. The 64-bit processor delivered these frequencies while operating within a low power and area footprint.

 

The graphics computing power is provided by the ARM Mali-T880, the highest-performance GPU in the Mali family. Running at 780MHz, it delivers 140% higher GPU performance in the Helio X20 chip compared to its predecessor, while also providing 60% better energy efficiency.

 

Advanced power management is provided through the Cortex-M4 based SCP (System Control Processor), which is directed by requests from the OS but is also “system aware” and can enable clock and power control on key system events.

 

With the Tri-Cluster architecture running different tasks simultaneously, the complexity of managing interrupts increases significantly. The CoreLink GIC-500 generic interrupt controller uses affinity-level routing to manage the increase in scale in a simple and efficient manner. This boosts processor efficiency, leading to some of the power improvements in the new chip.

 

 

MediaTek SoCs Enable Mobile Devices to Perform All Day

 

As the mobile phone has cemented itself as our primary computing device, we place more and more demands on it every day: monitoring our health, communicating with friends and entertaining us. The variety of use cases means that an optimized approach to system design is needed if we are to rely on our phone’s battery all day.

 

The MediaTek Helio X family focuses on uncompromised multimedia performance backed by state of the art mobile computing. The Helio X20 uses some world-leading innovations to deliver lasting premium performance due to its advanced power efficiency, and has already begun shipping to customers. The continued partnership between ARM and MediaTek allows both companies to satisfy customer demand in the mobile space.

(Huge thanks to the system validation team in Bangalore for providing me with all of the technical information here. Much appreciated!)

 

Functional validation is widely acknowledged as one of the primary bottlenecks in System-on-Chip (SoC) design. A significant portion of the engineering effort spent on productizing the SoC goes into validation. According to the Wilson Research Group, verification consumed more than 57% of a typical SoC project in 2014.

 

2014-WRG-BLOG-ASIC-8-1.png

Source: Wilson Research Group

 

 

In spite of these efforts, functional failures are still a prevalent risk for first-time designs. Since the advent of multi-processor chips, including heterogeneous designs, the complexity of SoCs has increased considerably. As you can see in the diagram below, the number of IP components in a SoC is growing at a strong rate.

 

#IP Blocks in System.png

 

Source: ChipDesignMag

 

 

SoCs have evolved into complex entities that integrate several diverse units of intellectual property (IP). A modern SoC may include several components such as CPUs, GPU, interconnect, memory controller, System MMU, interrupt controller etc. The IPs themselves are complex units of design that are verified individually. Yet, despite rigorous IP-level verification, it is not possible to detect all bugs – especially those that are sensitized only when the IPs interact within a system. This article intends to give you some behind-the-scenes insight into the system validation work done at ARM to enable a wide range of applications for our IP.

 

 

Many SoC design teams attempt to solve the verification problem individually, using a mix of homegrown and commercially available tools and methods. The goal of system validation at ARM is to provide partners with high quality IP that has been verified to interoperate correctly. This provides a standardized foundation upon which partners can build their own SoC validation solutions. Starting from a strong position, their design and verification efforts can be directed more at the differentiation they add to the SoC and its interactions with the rest of the system.

 

 

Verification Flow

 

The verification flow at ARM is similar to what is widely practiced in the industry.

 

Validation flow.png

The ARM verification flow pyramid

 

 

Verification of designs starts early, at the granularity of units, which combine to form a stand-alone IP. It is at unit level that engineers have the greatest visibility into the design: individual signals that would otherwise be buried deep within the design can be probed or forced to desired values to aid validation. Once unit-level verification has reached a degree of maturity, the units are combined to form a complete IP (e.g. a CPU), and only then can IP-level verification commence. For CPUs this is very often the first time testing with assembly programs can begin; most of the testing until this point is done by toggling individual wires and signals. At IP level the tests are written in assembly language: the processor fetches instructions from (simulated) memory, decodes them, executes them and so on. Once top-level verification reaches some stability, multiple IPs are combined into a system and the system validation effort begins.

 

IPs go through multiple milestones during their design-verification cycle that reflect their functional completeness and correctness. Of these, Alpha and Beta milestones are internal quality milestones.  LAC (Limited Access) represents the milestone after which lead partners get access to the IP. This is followed by EAC (Early Access), which represents the point after which the IP is ready to be fabricated for obtaining engineering samples and testing. By the REL (Release) milestone the IP has gone through rigorous testing and is ready for mass production.

 

IPs are usually between Alpha and Beta quality before going through the system validation flow. By this phase of the design cycle the IPs have already been subjected to a significant amount of testing and most low-level bugs have already been found. Stimulus has to be carefully crafted so that the internal state of the micro-architecture of each IP is stressed to the utmost. The stimulus is provided by either assembly code or by using specially designed verification IPs integrated into the system. ARM uses a combination of both methods.

 

Many of these bugs could result in severe malfunctions in the end product if they were left undetected. Based on past experience, ARM estimates these types of bugs take between 1 and 2 peta-cycles of verification to discover and 4 to 7 man-months of debug effort. In many cases, a delay that large would prove fatal to a chip’s opportunity to hit its target window in the market. Catching them early enough in the design cycle is critical to ensure the foundations of the IP are stable before it is integrated into an SoC.

 

 

 

System Validation

 

The nature of ARM’s IP means it is used in a diverse range of SoCs, from IoT devices to high-end smartphones to enterprise-class products. Ensuring that the technology does exactly what it is designed to do, in a consistent and reproducible manner, is the key goal of system validation, and the IP is robustly verified with that in mind. In other words, the focus of verification is the IP, but in a realistic system context. To this end, ARM tests IPs in a wide variety of realistic system configurations called Kits.

 

A kit is defined as a group of IPs integrated together as a subsystem for a specific target application segment (e.g. mobile, IoT, networking). It typically includes the complete range of IPs developed within ARM: CPUs, interconnect, memory controller, system controller, interrupt controller, debug logic, GPU and media processing components.

A kit is further broken down into smaller components called Elements, which can be considered the building blocks of kits. An element contains at least one major IP plus the white-space logic around it, though some elements have several IPs integrated together.

 

 

IP, Elements, Kits.png

 

 

These kits are designed to be representative of typical SoCs for different applications. One result is that they give ARM a more complete picture of the challenges the ecosystem faces in integrating various IP components to achieve a target system performance.

 

The system validation team uses a combination of stimulus and test methodology to stress-test kits. The stimulus is primarily software tests run on the CPUs in the system. The tests may be hand-written (in either assembly or a high-level language) or generated using Random Instruction Sequence (RIS) tools, which are explained in the sections below. In addition to code running on the CPUs, a set of Verification IPs (VIPs) is used to inject traffic into the system and to act as observers.

 

In preparation for validation, a test plan is created for every IP in the kit. Test planning captures the IP configurations, features to be verified, scenarios to be covered, stimulus, interoperability considerations with other IPs, verification metrics, tracking mechanisms, and the various flows that are part of verification. Testing of kits starts with simple stimulus that is gradually ramped up to more complex stress cases and scenarios.

 

The testing performs various subsystem-level assessments such as performance verification, functional verification, and power estimation. Reports documenting reference data (the performance, power, and functional quality of selected kits) are published internally. This article focuses on the functional aspects only; performance and power related topics will be covered in future blogs.

 

The system validation team at ARM has established a repeatable and automated kit development flow, which allows us to build multiple kits for different segments. ARM currently builds and validates about 25 kits annually.

 

The mix of IPs, their internal configuration, and the topology of the system are chosen to reflect the wide range of end uses. The kits are tested on two primary platforms, emulation and FPGA. Typically, testing starts on the emulator and soak testing is subsequently done on FPGA. On average, every IP is subjected to 5-6 trillion emulator cycles and 2-3 peta FPGA cycles of system validation. In order to run this level of testing, ARM has developed some internal tools.

 

 

 

System Validation Tools

 

There are three primary tools used in system validation, focused on areas such as the instruction pipeline, IP-level and system-level memory systems, system coherency, and interface-level interoperability. Two of these tools are Random Instruction Sequence (RIS) generators. RIS tools explore the architecture and micro-architecture design space in an automated fashion, attempting to trigger failures in the design, and are more effective at covering that space than hand-written directed tests. The tests they generate are multi-threaded assembly code, comprised of random ARM and Thumb instructions, designed to thoroughly exercise different portions of the implementation.

 

The third tool is a lightweight kernel that can be used as a platform for developing directed tests. The validation methodology uses a combination of directed testing and random-instruction-based automated testing. The kernel supports basic memory management, thread scheduling, and a subset of the pthreads API, which allows users to develop parameterized directed tests; a sketch of what such a test might look like is shown below.
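To give a flavour of the style of test this enables, here is a minimal sketch of a parameterized directed test written against the standard pthreads API (which the lightweight kernel is said to subset). It is a generic coherency stressor, several threads hammering adjacent words that are likely to share a cache line, invented for illustration and not one of ARM's actual tests; the thread count is the test parameter, and the GCC/clang __atomic builtins are assumed.

  /* Build with: cc -O2 -pthread directed_stress.c */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define ITERATIONS 100000

  static long shared[64];            /* adjacent words: likely to share cache lines */

  static void *stress(void *arg)
  {
      long idx = (long)arg;
      for (long i = 0; i < ITERATIONS; i++)
          __atomic_add_fetch(&shared[idx], 1, __ATOMIC_SEQ_CST);
      return NULL;
  }

  int main(int argc, char **argv)
  {
      long nthreads = (argc > 1) ? atol(argv[1]) : 4;   /* the test parameter */
      if (nthreads < 1 || nthreads > 64)
          nthreads = 4;

      pthread_t *tid = malloc(sizeof(*tid) * nthreads);

      for (long t = 0; t < nthreads; t++)
          pthread_create(&tid[t], NULL, stress, (void *)t);
      for (long t = 0; t < nthreads; t++)
          pthread_join(tid[t], NULL);

      /* Self-checking: each slot must hold exactly ITERATIONS increments. */
      for (long t = 0; t < nthreads; t++)
          if (shared[t] != ITERATIONS)
              printf("FAIL: slot %ld = %ld\n", t, shared[t]);
      printf("done\n");
      free(tid);
      return 0;
  }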

 

 

 

Methodology

 

In order to stress-test IP at the system level, a more random approach is used rather than a directed one. This enables ARM to cover a wide range of scenarios, stimulate multiple timing conditions and create complex events. To this end, kits support various verification-friendly features such as changing the clock ratios at different interfaces, enabling error injectors, and stubbing out components that are not required for a given feature. Bus clock ratios at various interfaces in the system, such as the CPU, interconnect and dynamic memory controller, can be changed to stimulate realistic system clocking conditions.

System Validation bring up.png
System validation bring up

The diagram above shows how the system is initially brought up and how test complexity is gradually scaled up.

 

Integration Tests & KVS

Initial testing starts with a set of simple integration tests that confirm basic stability of the kit and flush out minor integration issues. A suite of tests called the Kit Validation Suite (KVS) is then used to thoroughly test the integration of the kit. These tests are run early in the verification cycle to validate that the kit is good enough to run more stressful payloads. KVS can be configured to run on a wide variety of kits, and includes sub-suites to test integration, power, CoreSight debug and trace, and media IPs; there are specific tests in KVS for GPU and display integration as well as GPU coherence. Initial boot is usually done in simulation, gradually transitioning to emulators (hardware accelerators) for the integration testing.

 

RIS Boot and Bring up

After that, all the RIS tools are booted with basic bring-up tests on the kit to work through any hardware/software configuration issues.

 

RIS: Default and Focused Configurations

Once the kit is stable, the complexity of the tests, and therefore the stress they place on the system, is increased. Random stimulus can cover the design space faster than directed stimulus and requires less effort to create, so stress testing relies more on random stimulus than on directed tests. Initially the default configurations of the RIS tools are run; after a suitable number of verification cycles, the tools are re-configured to stress the specific IPs in the kit.

 

RIS Soak

In the final phase of system validation the kit is soak-tested on FPGAs. Though emulators are more debug-friendly, FPGAs are faster and can provide far more validation cycles. Therefore, once the IPs are stable and mature, ARM runs soak tests on FPGAs to find complex corner cases.

 

 

Metrics, Tracking, Coverage and Milestone Closure

 

The number of validation cycles run for every kit is one of the metrics tracked to ensure that the target number of validation cycles has been met. This is especially useful for confirming the soak-testing cycle target, increasing confidence in the quality of the IP across various applications. In addition, coverage is quantified and tracked using a statistical coverage method to ensure the full design, including potential corner cases, has been exercised sufficiently.

 

The latest version of the ARM Juno test chip was subjected to a total validation run time of 6,130 hours, the equivalent of eight and a half months of testing. This gives a unique perspective into corner cases within the system, making ARM better able to support partners who are debugging issues within their own designs. Furthermore, the bugs found during the validation process are fed back to the IP design teams, who use the information to improve the quality of the IP at each release milestone, as well as to guide next-generation products.

 

 

Summary

 

System complexity has increased in line with SoC performance capabilities, causing a significant growth in the amount of time and money spent on validation. ARM verifies its IP for interoperability before it is released to partners to make sure it is suitable for a wide range of applications. ARM’s IP teams are continuously designing at the leading edge, and are helped by the system validation team to ensure they work together in the systems our partners are building.


Frank Schirrmeister of Cadence Design Systems cites the validation of tool interoperability as one benefit: “As an ARM ecosystem partner, Cadence relies on pre-verified ARM cores and subsystems that can be easily integrated into the designs that we use to validate our tool interoperability. ARM’s software-driven verification approach reflects the industry’s shift toward the portable stimulus specification and allows us to validate the integration and interoperability of ARM cores and subsystems on all Cadence System Development Suite engines, including simulation, emulation and FPGA-based prototyping engines.”

 

Due to the wide variety of applications that the ARM partnership designs for, it is necessary to ensure our IP is functional in many different systems. The multi-stage approach to system validation at ARM gives our partners the peace of mind that they can rely on our IP. Over time the validation methodology has evolved into one that tests several system components and stresses most IPs in the system. In the future we have plans to extend and further improve our test methods to ensure an even higher standard of excellence across ARM IP.

 

Hi all,

 

I was recently interviewed by Ann Mutschler at Semiconductor Engineering regarding cache coherency and configurability. It's a useful summary of why we need hardware coherency to simplify software, and of the importance of configurability in interconnect and system products to optimize the performance, area and cost of SoCs.

 

 

Coherency, Cache And Configurability

 

Coherency is gaining traction across a wide spectrum of applications as systems vendors begin leveraging heterogeneous computing to improve performance, minimize power, and simplify software development.

 

Coherency is not a new concept, but making it easier to apply has always been a challenge. This is why it has largely been relegated to CPUs with identical processor cores. But the approach is now being applied in many more places, from high-end datacenters to mobile phones, and it is being applied across more cores in more devices.

 

“Today, in the networking and server spaces, we’re seeing heterogeneous processing there,” said Neil Parris, senior product manager in the Systems and Software Group at ARM. “It’s really a mixture of, for example, ARM CPUs with maybe different sizes of CPUs, but other processors such as DSP engines, as well. The reason they want the cache coherency comes down to efficiency and performance. You want to share data between the processors, and if you have hardware cache coherency, then the software doesn’t need to think about it. It can share the data really easily.”

Without hardware coherency, coherency has to be handled in software. “So every time you need to pass some data from one CPU to the next, you have to clean it out of one CPU cache into main memory,” Parris said. “You have to tell the next CPU if you’ve got any old copies of this data in the cache. You have to invalidate that and clean it out. Then you can read the new data. You can imagine that takes CPU cycles to do that. It takes unnecessary memory accesses and DRAM power to do that. So really, the hardware coherency is fundamental to improving the performance of the system.”
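As a rough illustration of the overhead Parris describes, the sketch below contrasts sharing a buffer with and without hardware coherency. The cache_clean_range() and cache_invalidate_range() helpers are hypothetical stand-ins (here defined as no-op stubs) for whatever cache maintenance primitives a given OS or bare-metal environment provides; with hardware coherency, the maintenance calls simply disappear.

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical cache maintenance primitives, stubbed out for this sketch.
   * A real implementation would issue the architecture's clean/invalidate
   * operations or call the OS-provided routines. */
  static void cache_clean_range(void *addr, size_t len)      { (void)addr; (void)len; }
  static void cache_invalidate_range(void *addr, size_t len) { (void)addr; (void)len; }

  /* Producer on processor A, consumer on processor B, sharing 'buf'. */

  void share_without_hw_coherency(uint8_t *buf, size_t len)
  {
      /* A fills the buffer ... */
      cache_clean_range(buf, len);        /* flush A's dirty lines to DRAM       */
      /* ... A signals B ... */
      cache_invalidate_range(buf, len);   /* B drops any stale copies it holds   */
      /* B can now read the data, at the cost of extra CPU cycles and DRAM traffic. */
  }

  void share_with_hw_coherency(uint8_t *buf, size_t len)
  {
      (void)buf; (void)len;
      /* A fills the buffer, signals B, and B reads it. A cache coherent
       * interconnect keeps the caches consistent, so no explicit maintenance
       * or extra memory accesses are needed. */
  }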

...

 

Read the rest at: Semiconductor Engineering .:. Coherency, Cache And Configurability

 

Thanks,

Neil.

Introduction

 

In ARM® Cortex®-A class CPUs, the Memory Management Unit (MMU) and Operating System (OS) work together to protect address spaces. A process running in unprivileged mode has its own virtual address space and cannot access other processes’ memory or memory mapped I/O devices directly using physical addresses. Any attempt to do this is met with a memory access violation exception. This is possible because of the hardware support offered by the Cortex-A class CPUs in the form of the MMU. Access to resources outside a process’ virtual address space is possible only by moving to a higher privilege level. This is normally done via a system call to the OS using an SVC instruction.

 

Embedded and deeply embedded systems powered by Cortex-M CPUs have to honor real-time constraints and therefore cannot afford to have as many layers of abstraction to protect system resources.  Cortex-M CPUs do not have MMUs for this reason. The applications running on Cortex-M processors often use an OS, usually a Real-Time Operating System (RTOS). Cortex-M CPUs have a Memory Protection Unit (MPU) that collaborates with the OS to implement a memory protection mechanism. Typically, the MPU and OS collaborate to create a privilege-stack. Unprivileged software can communicate with privileged software using well-defined APIs similar to the stacks on Cortex-A cores created by the OS and MMU.

 

Privilege levels ensure data protection to a certain extent, but in the real world privileged software can contain vulnerabilities. As IoT applications become more complex, ensuring privileged software is free of vulnerabilities becomes challenging, and in such cases the MPU alone is insufficient to protect data. On processors that have an MPU, when an interrupt handler is untrusted, software wrappers are used to sandbox the handler at an unprivileged level to protect other resources. This adds overhead to servicing untrusted interrupts, because the sandboxing involves reprogramming the MPU each time an untrusted handler is executed.

 

ARMv8-M introduces Security Extensions that provide hardware features for more secure devices. The Security Extensions allow trusted and system resources to be protected from untrusted handlers and applications, without the additional software sandboxing overhead of reprogramming the MPU. The Security Extensions work with the associated software model to restrict access to system resources and processes through well-defined interfaces, similar to system calls, thereby ensuring a higher level of protection and also offering an implicit privilege stack.

 

ARMv8-M Security Extensions

 

Security Extensions for ARMv8-M provide a mechanism to create a protected space within a processor system design (Figure 1). This allows multiple security domains to exist in the application, which might be required when there are multiple sources of firmware on the chip, or when applications have security requirements.

 

se-1.png

Figure 1: High level concept of Security Extensions for ARMv8-M

 

 

As outlined, access into the secure domain is provided through well-defined interfaces offered by software in the secure domain. These interfaces are the only way resources in the non-secure domain can access protected resources in the secure domain. Any direct access to resources in the secure space is met with security access violations. Once control is with a resource in the secure domain, the non-secure domain has no control over what happens in the secure domain.

 

se-2.png

Figure 2: Access to secure resources only through calls to secure state

 

 

Protection behind the secure domain creates implicit privilege levels – you can consider the non-secure domain as being at a lower privilege level than the secure domain. This is similar to what happens in an OS-based software stack (Figure 3) where all system resources are made available only across a privilege boundary.

 

se-3.png

Figure 3: Access to system resources only through calls to privileged level.

 

 

System resources in a higher privilege level are protected from applications/resources at a lower privilege level by layers of abstraction.  These resources can be accessed only through well-defined interfaces provided by resources at a higher privilege level. For example, if a user-mode application wants to access a peripheral such as the serial port or the video buffer, it needs to make a system call that enters privileged-mode. The user-mode application cannot directly write to the video buffer or access the serial port because it is at a lower privilege level – it has to ask the kernel. Any direct access to system resources results in a memory-access violation.

Anatomy of a Privilege Call vs. a Secure Call

 

Crossing privilege levels and crossing security domains can be considered analogous. Privilege-level transitions happen through system calls in a multi-privilege software stack. System calls are rarely invoked directly from user code; they are wrapped in libraries (e.g. glibc on Linux) where the system call is actually set up (Figure 4a). When user code calls a library routine, the routine marshals the parameters, sets up the system call number and issues the instruction (e.g. SVC) that puts the processor in privileged mode and hands control to the kernel. The kernel then services the application’s request for a system resource; during this service operation the user-mode application has little or no control.
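To make the flow concrete, here is a minimal sketch of such a library wrapper for a Cortex-M style service call. The service number (1), the parameter layout and the sys_write_serial name are all hypothetical; the wrapper simply marshals the arguments into registers and issues the SVC instruction, and the privileged SVC handler is assumed to return its result in r0 (GCC/armclang inline assembly syntax).

  #include <stdint.h>

  /* Hypothetical unprivileged wrapper around an SVC-based service call. */
  static inline uint32_t sys_write_serial(const char *buf, uint32_t len)
  {
      register uint32_t r0 __asm__("r0") = (uint32_t)(uintptr_t)buf; /* marshal arguments */
      register uint32_t r1 __asm__("r1") = len;

      __asm__ volatile ("svc #1"          /* trap into the privileged SVC handler */
                        : "+r" (r0)
                        : "r" (r1)
                        : "memory");
      return r0;                          /* result placed in r0 by the handler */
  }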

 

se-4a.png

Figure 4a: Privilege Call

se-4b.png

Figure 4b: Secure Call

Figure 4: Privilege Call vs Secure Call

 

Transitions into the secure state happen through well-defined interfaces offered by the secure software; entry into secure code can only happen through these interfaces. Like library wrappers for system calls, secure calls from the non-secure domain happen through Secure Gateway wrappers (Figure 4b). The Secure Gateway wrapper executes the Secure Gateway (SG) instruction to transition the security state and then calls the actual secure code, similar to the software-interrupt method of changing privilege levels during system calls.
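For ARMv8-M, toolchains expose this mechanism through CMSE (the Cortex-M Security Extensions support in armclang and GCC, enabled with -mcmse). The sketch below shows a hypothetical secure-side entry point: marking it cmse_nonsecure_entry makes the compiler place an SG-based veneer in non-secure-callable memory, so non-secure code can reach the secure data only through this well-defined interface. The function and variable names are invented for illustration.

  #include <arm_cmse.h>   /* CMSE support; requires compiling with -mcmse */
  #include <stdint.h>

  static uint32_t secure_key = 0xC0FFEE42u;   /* lives in secure memory only */

  /* Entry point callable from the non-secure state. The toolchain emits the
   * SG instruction and a veneer in the non-secure-callable (NSC) region. */
  __attribute__((cmse_nonsecure_entry))
  uint32_t secure_get_key_hash(uint32_t nonce)
  {
      /* Only this interface is reachable from the non-secure side; any direct
       * access to secure_key from non-secure code raises a SecureFault. */
      return secure_key ^ nonce;
  }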

 

Conclusion

 

ARMv8-M Security Extensions (SE) provide an implicit mechanism for implementing privilege levels. SE supplements the MPU and improves the capability of the system to provide enhanced security, all while still honoring real-time constraints.

 

The role of SE technology is about more than IP protection. Since secure operations can also be protected, the same technology can be used to safeguard critical system operations. As a result, it is possible to deploy ARMv8-M based systems in a wide range of applications, such as industrial and automotive, with SE used as a system reliability enhancement feature. This opens the potential for new ranges of industrial and automotive microcontroller devices.

 

In conclusion, the ARMv8-M Architecture will power the next generation of ARM Cortex-M processors. The Security Extensions technology in this architecture helps address one of the most important challenges in future IoT and embedded systems.

 

Further Reading

 

Whitepaper - ARMv8-M Architecture Technical Overview

“It’s interesting to think you have something in your hands that you don’t quite understand.”

Don Dingee.jpeg

You could apply that sentiment to inventions throughout history. You could apply that sentiment to the future of IoT, in fact, as we stand on the threshold of something big, yet we’re not quite sure how it’s going to play out.

But in this case, that quotation applies to ARM’s microprocessor technology, circa the early 1990s, and the dawn of the mobile era. The words, uttered with an undertone of awe, come from Don Dingee, an engineer and writer who cut his teeth in the semiconductor industry working for Motorola decades ago.

Dingee was talking about a book he’s co-written with SemiWiki founder Daniel Nenni on the rise of the mobile revolution and the history of ARM and the ARM ecosystem.

“Low power and small form factor weren’t things ARM founders set out to do,” argues Dingee, speaking from his rural Texas home near Austin. But Robin Saxby, former ARM CEO, helped shine a light on the value proposition, according to Dingee.

“They knew they weren’t consuming a lot of power,” Dingee said. “It was an artifact of the design-and-build process, rather than an objective of the design.”

The book, “Mobile Unleashed: The Origin and Evolution of ARM Processors In Our Devices,” traces the rise of mobile electronics systems design through the lens of the ARM ecosystem. The ecosystem began forming more than a quarter century ago when a group of engineers tried to figure out how to make their particular variant of the RISC architecture work in an increasingly crowded desktop and embedded computing marketplace. And today their mutual successes in mobile development have transformed societies around the world.

“Mobile Unleashed: The Origin and Evolution of ARM Processors In Our Devices” is available in print or on Kindle. It’s definitely worth a read if you’re both fascinated by history and interested in trying to pull some threads into the future.

 

Related stories:

--A Brief History of ARM: Part 1

--A Brief History of ARM: Part 2

[Note] This is an English translation of my article, which appeared in Interface magazine, published by CQ Publishing.

 

[1] Background of the strongest Cortex-M4.

 

[1-1] Silicon Vendors know -- Cortex-M4 is low power.

 

When we think of low power Cortex-M CPUs, the Cortex-M0 or M0+ come to mind first. Since about 2015, however, silicon vendors have increasingly adopted the Cortex-M4 as the CPU for low power microcontrollers (and sometimes the Cortex-M3, as in the Blue Gecko).
For example:
- Apollo (Ambiq Micro)
- STM32L4/STM32L411 (STMicroelectronics)
- Gecko series, e.g. Blue Gecko (Silicon Laboratories)
- MSP432/CC2640 (Texas Instruments)
- Bio-Processor (Samsung)

The recent trend is to promote both low power and high performance by using the Cortex-M4.

 

[1-2] A Floating Point Unit is desirable.

 

One reason Ambiq Micro adopted the Cortex-M4 is that, compared with competitors, a Cortex-M4 delivers higher performance for IoT workloads than a Cortex-M0 at the same power range [1]. The presence of an FPU is also a hidden advantage when porting MATLAB models [5]; Ambiq Micro seems to believe MATLAB could become a killer application of the IoT era.

The MSP432 from Texas Instruments (TI) is the successor of the MSP430, whose selling point was ultra low power 16-bit operation. TI's reason for adopting the Cortex-M4 was that microcontrollers will need much higher processing performance both in conventional industrial areas and in future IoT-related areas [2]. There is also a comment that the performance of the Cortex-M4 is roughly 10 times that of the Cortex-M0+ [3].

Coincidentally (or naturally), the comments from Ambiq Micro and TI are almost the same. In other words, the Cortex-M0/M0+ may offer too little performance for IoT applications, and FPU or DSP features are welcome.

How about the Cortex-M7?
If performance alone were the requirement, we could choose the Cortex-M7. However, its power consumption is likely to be higher and its performance may be more than is needed. Higher performance also means a bigger die, which works against vendors who want to pack a lot of functionality into a small chip. The Cortex-M4 (or the M3 in some areas) is therefore usually the most appropriate choice.

 

[1-3] Experiment to measure Cortex-M4F's FPU performance.

 

Using real development boards, I measured the FPU performance of the Cortex-M4. The boards are low-cost MCU boards from Freescale (now NXP): the FRDM-KL25Z (Cortex-M0+ based) and the FRDM-K64F (Cortex-M4F based). Since the Cortex-M0+ has no FPU, its floating point performance was measured using software emulation. For the Cortex-M4F, performance was measured both with the hardware FPU and with software FPU emulation; the software-emulation figure can be taken as roughly equivalent to the floating point performance of a Cortex-M3.

The measurements were made with the internal SysTick timer, counting CPU clock cycles, so the results show relative performance at the same operating clock frequency. The test suites used were the Whetstone and Linpack benchmarks, well-known benchmarks for measuring floating point performance, compiled with the IAR EWARM compiler. Although the results vary with the number of matrix elements, here the matrix size is (only) 50, for several practical reasons. The results are shown in Figure 1.
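A minimal sketch of this kind of cycle measurement is shown below, assuming a CMSIS environment on the Cortex-M4F board and that SysTick is clocked from the core clock. The device header name is an assumption, work() is only a placeholder for the Whetstone/Linpack kernel, and the workload must finish within the 24-bit SysTick range.

  #include <stdint.h>
  #include "MK64F12.h"   /* device header for the FRDM-K64F (assumed); pulls in CMSIS SysTick */

  static volatile float sink;

  static void work(void)
  {
      /* Placeholder single-precision workload standing in for the benchmark kernel. */
      float acc = 1.0f;
      for (int i = 1; i <= 1000; i++)
          acc = acc * 1.000001f + (float)i;
      sink = acc;                 /* keep the result so the loop is not optimised away */
  }

  uint32_t measure_cycles(void)
  {
      SysTick->LOAD = 0x00FFFFFF;                       /* 24-bit down counter        */
      SysTick->VAL  = 0;
      SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk |      /* count core clock cycles    */
                      SysTick_CTRL_ENABLE_Msk;

      uint32_t start = SysTick->VAL;
      work();
      uint32_t end   = SysTick->VAL;

      return (start - end) & 0x00FFFFFF;                /* counter counts down        */
  }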

In the Cortex-M4 case, the hardware FPU is about 60 to 80% faster than software emulation. Compared with the Cortex-M0+, the Cortex-M4 delivers roughly 6 times the performance. In addition, because the Cortex-M0+ uses a 2-stage pipeline, it cannot reach clock speeds as high as the Cortex-M4; taking achievable clock frequency into account as well (roughly 6x per clock multiplied by the higher clock rate), this supports the claim that the Cortex-M4 is about 10 times faster than the Cortex-M0+.

 

 

[2] Is Cortex-M4's power consumption lower than Cortex-M0/M0+?

 

[2-1] The power consumption of the Cortex-M4F and Cortex-M0 can be similar (when executing the same operation).

 

To provide a low power metric for CPUs, EEMBC has released ULPBench [4]. On ULPBench, the MSP432 scores 167.4, much better efficiency than the 16-bit MSP430 it succeeds, which scores about 110 to 120. As of April 2015, the MSP432 had the best efficiency among Cortex-M4 based MCUs.

However, the ULPBench score of the Cortex-M0+ based SAM L21 J18A-UES (Atmel) is 185.8, so the Cortex-M0+ may still be more efficient than the Cortex-M4. Rough ULPBench results are shown in Figure 2.

Although the Cortex-M0+ cannot reach high clock frequencies, the Cortex-M0 can run at relatively higher clock frequencies thanks to its longer pipeline (3 stages versus the Cortex-M0+'s 2). For example, the DA14680, part of Dialog Semiconductor's wearable-on-chip series, uses a Cortex-M0 that runs at up to 96MHz while consuming only 30uA/MHz; the SAM L21 above is 100uA/MHz. This means that in application areas that need simple functionality and ultra low power, such as stand-alone sensors, the Cortex-M0/M0+ are still valuable.

 

 

[2-2] Even though the Cortex-M0/M0+ exist, the Cortex-M4F can still be the strongest.

 

We should look at the energy consumed per operation. For a given operation, the faster it is processed, the less energy it consumes. This means the higher performance of the Cortex-M4 (and sometimes the Cortex-M3) can make it superior to the Cortex-M0/M0+ from an energy consumption point of view.

This follows from ARM's official figures, which put the Cortex-M4's Dhrystone and CoreMark performance per MHz about 45% above the Cortex-M0's. Both the Cortex-M0 and Cortex-M4 have a 3-stage pipeline, so the performance difference derives from the instruction set architecture: the Cortex-M0 implements essentially the original Thumb instruction set, while the Cortex-M4 implements Thumb-2. The Cortex-M4 can therefore achieve lower energy consumption because fewer instructions are needed to carry out a given operation.

As if to prove this point, Ambiq Micro's Cortex-M4F based Apollo MCU scored 377.5 on ULPBench, which is still the best score. Before that, the best score was 187.7, set by STMicroelectronics' Cortex-M4 based STM32L476; second place has since been taken by Analog Devices' Cortex-M3 based ADuCM302x with a score of 245.5. In practice, then, the Cortex-M4 can be more power efficient than the Cortex-M0/M0+.

 

[2-3] A what-if: would a "Cortex-M0F" be the strongest, if it existed?

 

The Cortex-M series emphasizes its scalable lineup, from the Cortex-M0 to the Cortex-M7. That is true and an important selling point, but the Cortex-M4 has been widely adopted in recent IoT devices and wearables, and the common wisdom that the Cortex-M0 offers the lowest power consumption seems to have been forgotten. The main reason is the lack of an FPU and DSP. If the Cortex-M0 had an FPU and DSP, the lowest-power, highest-efficiency MCU might be born.

 

[2-4] So is it the Cortex-M4F after all?

 

The recent fashion is the low power Cortex-M4, which means high performance (or clock frequency) is now expected as standard. In the past, applications requiring small die size and low power adopted the Cortex-M0 or Cortex-M0+. Since the Cortex-M0+ is the successor of the Cortex-M0, its arrival was expected to kill off the Cortex-M0.

However, the Cortex-M0 has survived and remains widely adopted for non-FPU/non-DSP application areas, more so than the Cortex-M0+. The reason is that the Cortex-M0 can run at more than 200MHz while the Cortex-M0+ cannot; this comes from the Cortex-M0 having the same 3-stage pipeline structure as the Cortex-M4, whereas the Cortex-M0+'s 2-stage pipeline does not seem able to reach a 200MHz clock frequency. (Here we set aside the differences in ISA.)

The main reason the Cortex-M0 remains valuable is its low power (apart from the ULPBench results), which derives from its small die size. I suspect that if the Cortex-M0 had an FPU and DSP, the die size of such a "Cortex-M0F" would be about the same as a Cortex-M4F, and so would its power consumption. In that sense, a Cortex-M0F would be meaningless.

 

[3] The significance of Cortex-M7.

 

[3-1] To get more performance, are caches and TCMs needed?

 

To get more performance, built-in caches and TCMs might be needed. This may have been the trigger for a rehabilitation of the ARM9 in some designs, which would be against ARM's expectations, and is perhaps why the Cortex-M7 was born. The true significance of the Cortex-M7 is still unknown.

Today, the Cortex-M series has taken a path of its own that may not match ARM's original expectations or roadmap. It may be time to reconsider the significance of the Cortex-M family once again.

 

<References>
[1] Subthreshold design at MCU-scale yields 10x energy efficiency.
http://www.electronics-eetimes.com/en/subthreshold-design-at-mcu-scale-yields-10x-energy-efficiency.html?cmp_id=7&news_id=222923565&vID=44#
[2] TI's 32-bit 'Successor' to the 16-bit MCU.
http://www.eetimes.com/document.asp?elq=cc9c541e84b142a8a92294c69eaea9c3&elqCampaignId=22285&elqaid=25047&elqat=1&elqTrackId=19651301ed71477bb7e9895fde1f0024&doc_id=1326109&page_number=1
[3] "MSP432" announced: a low-power ARM Cortex-M4 MCU carrying on the MSP430 lineage (in Japanese).
http://eetimes.jp/ee/articles/1504/02/news143.html
[4] EEMBC ULPBench web site.
http://www.eembc.org/ulpbench/
[5] Why Choose the ARM Cortex-M4 over the M0 for Wearables and IoT?
http://ambiqmicro.com/news/why-choose-arm-cortex-m4-over-m0-wearables-and-iot

We’d like to welcome all of you and describe a few interesting issues we encountered while working on ARMv8 support in FreeBSD. In this series we plan to talk a little about various bugs found in the kernel and how they were resolved.

 

Exception model in ARMv8

The new ARM architecture takes a different approach to exception levels. A typical figure showing all the levels can be found in the ARM documentation.

 

How does it look inside the FreeBSD kernel?

 

The FreeBSD kernel only makes use of the two lowest exception levels. All userspace processes run in EL0, while all kernel code executes in the more privileged EL1.

At first glance, the exception level model is drastically simplified compared with ARMv7, and now looks more like the x86_64 architecture than an old-fashioned RISC-like processor.

 

But how is the change in exception level handled in FreeBSD? Let’s take a syscall as an example. When a user process (via libc) wants to ask the kernel to do some work on its behalf, it must use a system call. On ARMv8 this is done by generating a special exception type, which is the only way to raise the execution privilege and start running kernel code.

 

Upon receiving an SVC call (an exception from userspace indicating a syscall), the processor jumps to a predefined vector and executes the following code:

 

ENTRY(handle_el0_sync)
  save_registers 0
  mov  x0, sp
  bl  do_el0_sync
  do_ast
  restore_registers 0
  eret
END(handle_el0_sync)

 

Without going into too much detail, this stores all registers on the stack (creating the trapframe passed as a parameter to do_el0_sync) and calls a C function to handle the event. In this case, it parses the SVC parameters and executes the appropriate syscall handler (svc_handler).

 

void
do_el0_sync(struct trapframe *frame)
{
  struct thread *td;

  td = curthread;
  td->td_frame = frame;
  …..
  switch(exception) {
  …...
    break;
  case EXCP_SVC:
    svc_handler(frame);
    break;
  ...
  default:
    ….
  }
}

 

Once the syscall is done, the function returns, handle_el0_sync restores all registers, and execution goes back to the user process.

 

Exceptions are not all bad: page faults

 

The most common exception on FreeBSD is the page fault. It is absolutely normal for one to happen when a user or kernel thread wants to, for example, touch a page that is being used for the first time.

 

The FreeBSD kernel uses advanced memory management features such as copy-on-write and lazy allocation. A page fault is then used as an indication of whether any of these operations should be performed by the kernel.

 

Of course, when a user process tries to do something it is not allowed to do, like dereference a NULL pointer, the page fault indicates an invalid operation and the kernel sends a killing signal to the process. The kernel acts as a guard in this case, not allowing a process to go anywhere outside its predefined bounds.

 

Stack on ARM

 

The ARM core can use the stack in various modes. The most common (and the one used by FreeBSD) is a descending stack: the stack grows towards lower memory addresses.

 



Dynamic stack growth in kernel threads, and how it can leave us with a dead system

 

To visualise the dangerous possibility, let’s discuss a real-life example encountered at the beginning of porting FreeBSD to ARMv8.

When a kernel thread is created, it shares its memory space with the rest of the kernel. The only thing that needs to be private is its stack which, as might be expected, is allocated with a kernel version of malloc.

 

It looks fine at first glance: we create the thread, malloc the stack for it, and everything should work just fine. But it didn’t. The problem was with the special feature mentioned before, lazy allocation. Physical allocation of pages is a time consuming process, so it is better to do it just in time, when a page is actually needed, i.e. when the first page fault happens on an address in the malloc’ed area.

 

It might not be obvious yet, but this can get the whole system stuck! Let’s see what happens. Assume the stack starts at 0x10008000 and grows down. At the beginning, malloc allocated only one page, because the top of the stack is always touched during creation of the thread and filled with some thread-specific data, descriptors and so on. When the call stack grows big enough, it eventually runs past the allocated space and falls into the page below (0x10006000 - 0x10006fff). The first access causes a page fault, which is supposed to be handled by allocating the required page, but not this time.

 

Take a closer look at the assembler:

 

.macro  save_registers el
.if \el == 1
  mov  x18, sp
  sub  sp, sp, #128
.endif
  sub  sp, sp, #(TF_SIZE + 16)
  stp  x29, x30, [sp, #(TF_SIZE)]
  stp  x28, x29, [sp, #(TF_X + 28 * 8)]
  stp  x26, x27, [sp, #(TF_X + 26 * 8)]
  stp  x24, x25, [sp, #(TF_X + 24 * 8)]
....

ENTRY(handle_el1h_sync)
  save_registers 1
  mov  x0, sp
  bl  do_el1h_sync
  restore_registers 1
  eret
END(handle_el1h_sync)

 

The first thing done in the EL1 sync handler is, yep, storing all registers onto the stack. But we have just run out of allocated space, so there is no stack accessible here and the EL1 sync exception repeats as soon as the first “stp” instruction is executed. What’s more, it keeps repeating forever, and the only way to recover is to hard-reset the board.

 

Ways to work around it

Unfortunately, the ARMv8 architecture is susceptible to this scenario. There is always a chance that a kernel thread exceeds its allocated stack range and ends up in the described state. What we can do is minimize the chance of that happening.

 

The following things can be done in FreeBSD:

  • When allocating the stack, ensure that all its pages are allocated (wired) so that no page fault can occur anywhere in the stack range. This is the solution currently implemented in FreeBSD and has worked well over a year of testing.
  • Allocate more stack than requested and, in the exception handler, check whether the stack has grown beyond a predefined size. We then still have some stack left, so the system can, for example, enter the debugger or do a sysdump (this was not implemented; a sketch of the idea follows below).
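A rough sketch of that second idea follows, in the style of the handler shown earlier. It is not what FreeBSD implements: td_kstack is the existing struct thread field holding the base of the kernel stack, while STACK_HEADROOM is a hypothetical reserve size and panic() stands in for entering the debugger or taking a sysdump.

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/proc.h>

  #define STACK_HEADROOM  512   /* hypothetical: bytes of stack kept in reserve */

  void
  do_el1h_sync(struct trapframe *frame)
  {
    struct thread *td = curthread;
    vm_offset_t sp = (vm_offset_t)frame;           /* the trapframe sits on the kstack */
    vm_offset_t stack_bottom = td->td_kstack;      /* lowest address of the kstack     */

    if (sp - stack_bottom < STACK_HEADROOM) {
      /* We are inside the reserved headroom: report the overflow while
       * there is still enough stack to do so, instead of recursing forever. */
      panic("kernel stack overflow: sp=%#lx", (unsigned long)sp);
    }

    /* ... normal EL1 synchronous exception handling ... */
  }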

 

How we can avoid this, and why ARMv7 was different

On the 32-bit ARM architecture, the situation was much easier. There, almost every exception level had its own stack pointer, so it was almost impossible for the system stack to overflow in this way. The lazy-alloc functionality was also easier to implement and use.

 

Conclusion

The ARMv8 architecture is superior in most aspects, but the programmer must be aware of some dangers hiding inside. I hope this short article was interesting and helps to visualise the issues we faced during the FreeBSD ARMv8 porting effort.

 

About Semihalf

Semihalf creates software for advanced solutions in the areas of platform infrastructure (operating systems, bootloaders), virtualization, networking and storage.  We make software tightly coupled with the underlying hardware to achieve maximum system capacity.

 

Technologies developed by Semihalf power a wide range of products, from consumer electronics to cloud data center elements and carrier-grade networking gear.

 

The team

Zbigniew Bodek <zbb@semihalf.com>

Dominik Ermel <der@semihalf.com>

Wojciech Macek <wma@semihalf.com>

Michał Stanek <mst@semihalf.com>

Chinese version (中文版): 三星全新Exynos 7 Octa 7870 为大众手机提升品质 (Samsung's new Exynos 7 Octa 7870 raises quality for mass-market phones)

All consumers have premium tastes. These days, we demand more of our phones than ever. Where once they were just for calling and texting, now we expect to be able to surf social media, check emails, edit documents, play games and heaps more. All of this has to be possible within the constraints of the mobile form factor and, perhaps more importantly, its power limits. Achieving these goals without a whacking premium price tag can be a real challenge, but ARM partner Samsung Electronics has just released their latest answer to this very modern problem.

 

Samsung Exynos 7 Octa 7870.png

(Samsung 7 Octa 7870)

 

 

Samsung have announced a new member of its Exynos 7 Octa processor family, the Samsung Exynos 7 Octa (7870). Samsung are continuing their innovation in the mobile space by enabling more consumers to access their latest technology advancements in SoC manufacturing. The new SoC delivers a cutting edge experience to a wider audience, with strong performance and power efficiency.

 

These improvements in the processor will enable users to enjoy seamless HD video streaming and console quality gaming on their mobile device. The integrated modem allows for this experience to be uninterrupted while on mobile networks, increasing the reliability of download speeds.

 

The new chip is targeting mass market mobile devices and is expected to be widely adopted where the aim is to provide a premium experience on a budget.

 

 

14nm FinFET.jpg

(Samsung’s 14nm FinFET technology)

 

 

It’s the first time that the advanced 14nm process has been utilised for mid-range mobile SoCs. Samsung debuted the process on the Exynos 7 Octa 7420 SoC in 2015. The new Exynos 7 Octa 7870 processor consumes over 30 percent less power than mobile SoCs built with 28nm High-k Metal Gate process technology at the same performance level.

 

The ARM Cortex-A53 is the first ARMv8-A high efficiency processor, providing 64-bit capability while being fully compatible with existing ARMv7-A software for 32-bit compute. In the Exynos 7870, Samsung have opted to take advantage of the high versatility and scalability of the ARM Cortex-A53 in an octa-core symmetrical multi-processor implementation, tuned to deliver a premium mobile experience on a mass-market budget.

 

On the graphics side, the ARM Mali™-T830 GPU provides an optimum blend of high area efficiency and high performance. Reducing the silicon area reduces cost while still supporting complex content such as 3D gaming. With one more arithmetic pipeline than its predecessor, the Mali-T830 can handle such content at higher speeds with maximum performance in a minimal silicon area. Additional optimizations such as quad prioritization, and bandwidth-saving features such as ARM Frame Buffer Compression (AFBC) and smart composition, combine to make the Mali-T830 a strong choice for mass-market mobile.

 

The ARM CoreLink™ CCI-400 Cache Coherent Interconnect enables the intelligent allocation of processing workloads between processing clusters. By providing rapid, coherent on-chip communication for the CPU and GPU, it enabled Samsung to optimize system performance in the Exynos 7 Octa 7870. It does this by providing the cache coherency that allows both clusters to share the same view of memory, saving power and freeing more CPU resource to be spent on delivering the best possible user experience. Because CoreLink System IP is designed, validated and optimised together with Cortex processors and Mali multimedia processors, designers can get optimal system performance from their SoC.

 

ARM CoreSight™ debug and trace technology was instrumental to the successful bring-up of the Exynos 7870. When designers are working on optimizations to eke out maximum performance, there is peace of mind in knowing that CoreSight’s real-time trace delivers fast visibility into the chip so that performance can be fine-tuned.

 

The Exynos 7870 also has an integrated LTE Cat.6 modem, based on the Cortex-R7, for connectivity. The LTE Cat.6 2CA modem supports 300Mbps downlink speeds and FDD-TDD joint carrier aggregation for better network flexibility. The ARM Cortex-R7 is a superscalar out-of-order processor with advanced dynamic and static branch prediction, providing the high-performance, real-time capability that makes it well suited to LTE-Advanced modems.

 

 

The ARM and Samsung partnership brings about sustained innovation

It’s exciting to see 14nm technology being made available at a reasonable price point so quickly after its debut in the market. We have all become used to searching for the latest killer app or use case that is only accessible at the leading edge. However in many ways what Samsung are doing with the Exynos 7 Octa (7870) is more powerful, as it enables a wide range of smartphone users to get increased performance from their mid-range devices. The continued partnership between ARM and Samsung enables both partners to create product offerings that satisfy customer demand in every niche in the market.

You may have assumed that Mobile World Congress is an event entirely dedicated to the smartphones and devices that have changed our lives over the past couple of decades. However, in amongst the new and shiny gadgets there is also a lot of exhibition real estate dedicated to the back-end of what makes the modern mobile experience what it is today (check out my review of the show here: 6 Takeaways from Mobile World Congress). While new devices such as the Samsung Electronics Galaxy S7 take the headlines, the user experience that extra processing power delivers on state-of-the-art phones is only as good as the mobile network doing the heavy lifting of connecting those devices to the Internet.

 

The ARM® booth at MWC showed some example demonstrations of how the world’s #1 computing ecosystem has come up with solutions that give network equipment providers and OEMs a new way to think about how they test and deploy networking infrastructure.

 

 

Cost-effective Mobile Networks

 

There is a massive presence at MWC from the telecommunications industry, which is trying to solve a number of challenges around mobile connectivity, among them a cost-effective way of increasing network coverage geographically. Enter Core Network Dynamics (CND), a startup from the Fraunhofer FOKUS institute in Germany, which is researching and providing the technology to shape mobile networks for 4G and beyond.

 

Their demonstration showed CND’s OpenEPC (Evolved Packet Core) software running a complete LTE core network on a compact, small-footprint server powered by an NXP LS1043.

 

For this, ARM showed a real LTE small cell from AirSpan connected to an RF-shielded box, inside of which sits a Samsung Galaxy phone with a CND SIM card. The box blocks out all wireless networks except for the LTE network coming from the AirSpan cell.

 

As the box needs to remain sealed, the phone is operated through a Raspberry Pi 2 outside of the box, acting as a remote console (with a large display).

 

During the demo, the phone is switched from airplane mode to connected and discovers the test LTE network inside the box. The phone connects and is operated from the outside to use any standard service from the Internet via the OpenEPC.

 

CND demo.png

 

The OpenEPC software on the NXP board with the ARM processor is a complete mobile core network. EPC, or Evolved Packet Core, is a network architecture that combines the advantages of previous-generation mobile network solutions with cost-cutting wireless environments such as WiFi or WiMAX.

 

The NXP LS1043 utilizes a quad-core ARM Cortex®-A53 processor connected by the CoreLink™ CCI-400 cache coherent interconnect. This powerful combination enables the networking SoC to deliver up to 10Gb/s performance with minimal CPU overhead.

 

NXP LS1043.png

Having the complete core network running in a compact server with a small footprint and high performance enables core networks to be decentralized and distributed to handle the increase in load that 5G and the Internet of Things bring, taking us one step closer to the vision of the Intelligent Flexible Cloud. As CND’s CEO Carsten Brinkschulte stated, “By distributing the core network and deploying at the edge (Mobile Edge Computing) or even running it on the processor of the small cell – we can avoid the round-trips to the central core network and the backhaul traffic associated with that.”

 

Breaking down barriers for optimized network infrastructure

 

The real significance of this demo is that it shows how you can take a generic piece of hardware and run whatever software you wish on it to run the network. By providing standardized hardware that can run different software stacks, multiple software vendors can develop software that runs networking workloads for specific use cases. This means that OEMs and network providers can mix and match hardware and software to suit differing requirements, not only reducing costs but also allowing them to customize network base stations for different geographies. With the number of ARM partners in the ecosystem that will potentially develop on this foundation, the choice available to OEMs can only increase to suit their target applications.

 

In this example the prospect of a decentralized network also has benefits in terms of reliability, increased coverage and unlocking new use cases. Resiliency and fast recovery are indispensable for a public safety communications network carrying mission-critical voice, video and data. A wider distribution of edge stations ensures that one component’s failure has minimal impact on service to the user. You can learn more about the benefits of decentralized networks here (Link).

 

 

Testing lab for next generation of equipment

 

Enea Pharos Lab.png

 

One of the issues with the design and ordering of infrastructure equipment is that the variety of boards and software stacks can leave OEMs wondering whether their choice is the perfect mix for their needs.

 

The software company Enea demonstrated an OPNFV lab at the ARM booth, designed for hosting continuous integration, deployment and testing of the OPNFV platform. As in any data centre there is a control node and a compute node (in this case an AppliedMicro X-Gene and a Cavium ThunderX respectively) that do the heavy processing.

 

It’s a demonstration of a cloud-based sandbox in which developers can run their particular software flow on the different types of server stacks in the lab. A user connects to the reference stack, codenamed Brahmaputra after the river, and runs their code through the lab. The software can come from any of the variety of vendors and projects built around the OPNFV platform, including OpenStack, Canonical and KVM.

 

One benefit of what was demonstrated is that network providers can now run their own lab tests, for the workloads that matter to them, on the different hardware included here. This real-life ‘try before you buy’ allows providers to know exactly what kind of performance to expect and to deploy systems that are optimal for the workloads they need.

 

When I spoke to Enea's senior systems architect Joe Kidder, he explained that the benefits of a lab like this aren’t reserved solely for deciding on the right hardware setup. “Indeed, companies are able to plug their server stacks into the lab even after it has been set up. It is great news for chipmakers targeting this end of the market, as it means they can quickly begin running benchmarks in evaluations.

 

“Some companies will use a lab like this to run tests on their key workloads, while others will get value from renting out their lab space to potential users.”

 

 

The future of mobile networks is likely to be ARM powered

 

The unsung hero of the mobile revolution is the network which enables all of the use cases and applications that we now take for granted. One of the common themes from the two demonstrations at the ARM booth was that there is a large ecosystem around ARM in the networking space, one that continues to evolve. Not only does this increase the pace of innovation and specialisation for target workloads, it also gives OEMs and network providers the choice they need when it comes to investing in the right hardware and software solution to meet their needs. This is essential for keeping up with the speed of technological advancement that is still occurring in the mobile space.

 

 

More Information

Core Network Dynamics OpenEPC

Enea launches new professional services packages for accelerating VNF development and deployment

Chinese version (中文版): 6 Takeaways from Mobile World Congress

Mobile World Congress isn’t only about mobile phones; it’s about the internet and the technologies that continue to enable the mobile revolution. It was my first time at MWC and these six things just blew me away.

 

 

Narrow-Band IoT

A number of exhibitors were making noise around Narrow-Band IoT (NB-IoT), a radio technology for embedded devices that focuses on indoor coverage, low cost, low power and the ability to support a large number of devices. It can be deployed in the GSM and LTE spectrum and transmits data at a rate of hundreds of bits, or a few kilobits, per second.

 

NB-IoT builds on the 3GPP network, so it benefits from the same security protocols such as link locking and encryption. It can reuse existing hardware that is already deployed; the only thing it needs is a software update. The fact that it operates in licensed spectrum and can be controlled for interference means operators can guarantee a certain quality of service, which boosts reliability.

 

A lot of embedded and IoT work has focused on connectivity through WiFi or Bluetooth radio, essentially local area networks that exist predominantly in the home. That creates limitations for smart cities and places outside of these networks, which is where the thinking behind NB-IoT came about.

 

One of its advantages is better reach than GSM. For example, T-Mobile demonstrated a car park with sensors underneath the asphalt that could still communicate and send alerts when a space is free. The combination of this and self-parking cars could be the first step towards being able to exit your car outside your destination and trust it to find its own parking space nearby.

 

T_Mobile_NBIOT[1].jpg

 

5G is on everybody’s lips

 

The oft-hidden part of the mobile experience was on display at the show, with plenty of chatter around next-generation mobile networks and 5G. ARM’s David Maidment laid out the steps that need to be followed in the whitepaper “The Route to 5G”. From the large network equipment manufacturers like Huawei, Ericsson and Nokia presenting their vision of the near future, to carriers such as Vodafone and Telefónica talking about the possible services that 5G will enable, it seems the new standard can’t come soon enough. There is much more to it than just utilizing the latest hardware and implementing a software stack on top of it. On the ARM booth were two demonstrations of innovative thinking that are giving network providers more to think about when selecting their equipment for the next generation, which I'll go into in detail in a later blog. ARM has just released a new real-time processor, the ARM Cortex-R8, which will be instrumental in delivering next-generation connectivity to mobile devices. Find out more in “A Potential Look Inside the 5G Modem Baseband”.

 

LTE Advanced to 5G.png

 

 

Virtual reality will change consumer experiences

 

2016 is the year when virtual reality is making the transition from breathless anticipation to real, commercial product. HTC launched their Vive VR headset as an alternative to the ARM Mali-based Samsung Gear VR headset, a collaboration between Samsung Mobile and Oculus. While the Vive was powered by a desktop when I tried it, the Gear has the advantage of linking in with the Samsung Galaxy phones, making it more portable. Mark Zuckerberg contributed to the excitement by backing Samsung Gear VR as the best mobile virtual reality headset, due to its price competitiveness and OLED display. He also declared VR the future of sharing, and announced that Facebook is experimenting with dynamic streaming for its 360 videos in the hope of quadrupling the video resolution while cutting the bandwidth needed to stream it. The ARM booth had a Samsung Gear VR showing off the Mali Ice Cave demo, which you can learn more about in Virtual Reality: The Ice Cave. The demos were mostly around gaming, but the potential for VR is huge, with possible applications in construction, professional training, art, and live political and sporting events or concerts. On a personal level, it still takes a bit of getting used to seeing rows of people “plugged” into computing like this. Matrix blue pill, anyone?

 

Samsung_VR[1].jpg

 

 

More devices mean more security concerns

 

Security has been a recurring theme throughout the entire electronics industry over the past 18 months or more. As everything becomes more connected, the realisation has dawned that every part of the network needs to be secure from hacking. In his keynote on Tuesday explaining the need for Internet of Things security, ARM’s CEO Simon Segars pointed out that over one billion health records were stolen in 2014. The time is now for embedding security in the IoT ecosystem, and it needs to be robust enough to fend off attacks and easy enough to use that people will follow the security guidelines. As you can see in the graphic below, a multi-layered approach to security across all network stakeholders is the best way to avoid attacks. ARM’s Beetle test chip is a proof of concept of how to deliver a secure foundation for IoT endpoints, which you can find out more about here: ARM enables IOT with Beetle Platform.

 

Securing the IoT.png

 

Competition is driving creativity in the smartphone space

 

There has been a spate of announcements this week, with many devices launched aimed at the top end of the smartphone market, such as the Samsung Galaxy S7, ZTE Blade S7, Xiaomi Mi5 and LG G5. It’s noticeable that device manufacturers are focusing on unique features to stand out from the competition and deliver the killer feature. For example, Samsung’s Galaxy S7 (powered by a Qualcomm Snapdragon 820 or Samsung Exynos 8 Octa 8890, depending on the market) features a dual-pixel sensor which should dramatically improve the quality of photos taken in low-light situations. For me the standout was the LG G5 (powered by the Snapdragon 820), which had the whole exhibition talking with its modular features. The ability to remove and add custom modules, such as a camera grip, extra battery or hi-fi speaker, gives users a degree of customized ownership according to their own needs.

 

 

LG_G5[1].jpg

 

The other surprise for me was the sheer number of companies promoting their smartphones. There were a large number of (to me) previously unknown companies with devices targeting a price point around €200, running Android Marshmallow on ARM-based 4G processors. While it may not grab the headlines, the mid-range tier of the market is a rich hive of competition, also evidenced by last week's announcement from Samsung that it will launch an SoC specifically for that market, the Exynos 7 Octa 7870.

 

 

It’s BIG

 

The first thing to comment about Mobile World Congress is that it’s absolutely massive. While I had been told about the size of the show beforehand, spread across 9 exhibition halls as well as conference halls and keynote sessions, I still wasn’t prepared for the sheer size of it all. The event is expected to see more than 94,000 visitors pass through over the 4 days of the show, and they will be treated to a barrage of bright lights, loud music and brand new technology. MWC is where the technology industry makes world headlines, and it shows. Just make sure to bring some comfortable shoes if you plan to check out all of the exhibition halls.

 

 

Did you attend MWC? Let me know what your impressions were of the show or if you have any opinion on the above by leaving a comment

Take one look at the outside world in 2016 and you will see that everything is becoming connected: our inanimate objects are growing smarter while our electronic devices gain ever more computing power and connectivity. Mark Zuckerberg said at a recent event that “Mobile networks need to continue evolving, high bandwidth and low latency is what everybody needs”. The ramifications are varied, but one thing that regularly gets overlooked is the strain this extra connectivity will place on the underlying network that supports all of the data being moved around the world.

 

Let’s take a look at the capabilities of previous mobile networks.

 

Wireless networks.gif.png

 

So what could the 5G network look like?

The growing need for a new generation of mobile network is being pushed by three major forcing functions.

 

Latency

5G will require ultra-low latency, in the realm of less than 1ms end to end. That is more than an order of magnitude lower than the acceptable latency in 4G, which is in the order of 30-40ms.

 

Bandwidth

According to British telecommunications company EE, over three quarters of the data we consume in the year 2030 is expected to be video content. This will put significant strain on the underlying network: monthly data demand will be around 2,200 petabytes (a petabyte is a million gigabytes), roughly 22 times the current rate. Multi-gigabit services will allow consumers to download digital content near instantaneously, and ultra-low latency connections will enable services such as virtual and augmented reality.

 

Greater reliability

The reliability of our mobile networks will assume greater importance as we use them more often for control and safety functions, particularly in automotive use cases. Looking at automotive infotainment, for example, passengers will expect the quality of their connection to remain constant regardless of location and speed.

 

You can go into more depth on the topic of the challenges that stand in the way of implementing the networking technology in the white paper “The Route to 5G”.

 

Bringing this back closer to our own pockets, the new generation of mobile network will also have ramifications on the modem baseband processor in your mobile internet enabled device, be it a smartphone, wearable device, or even a car. The baseband is the chip in a device that connects to mobile networks to deliver that always-on connected experience. Those chips will also need to meet the standards required of the network, in order to deliver the improvements in connectivity, reliability and bandwidth that 5G brings.

 

As the industry requirements for 5G are being hashed out, we have gone a step further and taken a look at what a 5G-ready modem baseband processor could look like.


Introducing the ARM Cortex-R8 real-time processor

The new ARM® Cortex®-R8 processor is the latest from the family of ARM processors optimized for high performance, hard real-time applications. ARM has been the engine of the cellular modem since the very first GSM handsets back in the days of 2G, and the Cortex-R8 offers double the performance of its predecessor.

 

 

Cortex-R8 Features.png

 

What could the 5G modem baseband system look like?

 

Cortex-R8.png

 

At the heart of the system is the CoreLink™ NIC-450, which delivers the extremely low latency that 5G processing demands by providing an optimized path from CPU to memory. NIC-450 is a highly configurable interconnect for low-power, low-latency communication across the chip and can be tailored to suit particular system requirements, following successful use in billions of devices. It has advanced Quality-of-Service (QoS) features to help meet some of the more stringent latency requirements in a real-time system such as this baseband system.

 

The CoreLink DMA-330 Direct Memory Access controller enables the set-up and supervision of direct transfers of large blocks of data to or from a peripheral, out of or into RAM. The DMA-330 frees the processor from using its own load and store instructions to move data in blocks. In this system, the effect of the DMA is to make data transfers to or from peripherals many times faster than the processor could achieve on its own, allowing the processor to focus on other tasks. It negotiates with the NIC-450 and is allocated the appropriate bandwidth.
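
To make the division of labour concrete, here is a minimal, hedged sketch in C contrasting a CPU-driven copy with handing the same transfer to a DMA engine. The dma_submit() and dma_wait() calls are hypothetical driver functions standing in for the DMA-330's real programming sequence (which is driven through its own instruction interface, as described in its TRM); they are not an ARM API.

    #include <stddef.h>
    #include <stdint.h>

    /* CPU-driven copy: the core spends its own cycles moving every word. */
    void copy_with_cpu(uint32_t *dst, const uint32_t *src, size_t words)
    {
        for (size_t i = 0; i < words; i++)
            dst[i] = src[i];
    }

    /* Hypothetical driver interface for a DMA engine such as the DMA-330. */
    extern int  dma_submit(void *dst, const void *src, size_t bytes);
    extern void dma_wait(int channel);

    /* DMA-driven copy: the core only sets up the transfer, then is free to
       run other work while the DMA engine moves the data and negotiates
       bandwidth with the interconnect. */
    void copy_with_dma(uint32_t *dst, const uint32_t *src, size_t words)
    {
        int channel = dma_submit(dst, src, words * sizeof(uint32_t));
        /* ... do other useful work here ... */
        dma_wait(channel);
    }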

 

Single-channel CoreLink DMC-500 Dynamic Memory Controller is used for rapid access to DDR memory, up to 8.5GB/s per channel.

 

The CoreLink interconnect and DMC have integrated QoS capabilities that enable predictable traffic prioritization, allowing deterministic processing to take place when low-latency masters require access to memory. This is important in the context of 5G, as enhanced timeliness and reliability are cornerstones of the network, and the baseband modem needs to be just as capable.

 

In real-time systems like this, CoreSight™ debug and trace technology is invaluable, giving designers a way to optimize SoC bring-up and reduce the risk involved in squeezing the best performance out of the chip.

All of ARM’s IP is designed, validated and optimized as a system to ensure it delivers predictable performance while meeting demanding low-power targets. As device manufacturers look to every possible avenue to increase battery life, the system context applied to ARM’s design process enables ever-increasing performance while reducing system power.

 

Conclusion

 

The race to become the first country or city that hosts a 5G network is heating up, with Tokyo among the frontrunners, aiming to have a network in place for the 2020 Olympic Games. The ARM Cortex-R8 processor is a strong step forward for the next generation of modem baseband systems that will support the new network. ARM’s system oriented approach to design will enable partners to realize connectivity that is fast and reliable.

The announcement of the ARM® Cortex®-A35 processor marked the beginning of a new family of ultra high efficiency application processors from ARM. Today, ARM announced the second member of that family, the Cortex-A32, a new 32-bit processor. Highlights of the Cortex-A32 include:

 

  • ARM’s smallest, lowest power ARMv8-A processor, optimized for 32-bit processing (supports the A32/T32 instruction set, and is fully compatible with ARMv7-A)
  • Provides ultra efficient 32-bit compute for the next generation of embedded products including consumer, wearable and IoT applications.

 

Roadmap.png

In this blog, I’ll provide the market context and some highlights of the Cortex-A32 while answering the question: Why did we create the Cortex-A32?

 

Embedded Markets

 

The embedded market is incredibly diverse. It covers innumerable products – almost everything that is not a phone, a PC, or a server - and spans a huge range of processing requirements. The diversity of requirements in embedded is well served by the three major processor families from ARM: Cortex-A, Cortex-R and Cortex-M. The fundamental differences between the A, R, and M families are shown below:

 

 

A R and M.png

 

Much has been written about Cortex-M processors in the embedded market; they are incredibly prevalent. Less attention has been given so far to the growing use of Cortex-A processors in embedded applications. This blog focuses on these rich-embedded applications, where a full OS is required. These are the sweet spot for Cortex-A.

 

Two fundamental aspects make rich-embedded applications different from the traditional embedded applications using Cortex-R and Cortex-M processors. The first is rich operating system support, which requires virtual memory and a memory management unit. The vast majority of Cortex-A based embedded products run full virtual-memory-based OSes like Linux, Android and Windows. The second aspect is higher performance. The performance needed is again very diverse, and in some cases embedded applications need performance approaching that of smartphones and laptops, which of course Cortex-A processors can deliver.

 

The rich embedded market is already well established. According to VDC estimates, ARM-based devices occupy over 70% market share in the rich embedded segment (SoCs). Just like the embedded market as a whole, the rich embedded market is extremely diverse. There are many use cases, some high performance and others more cost and power sensitive. Let’s look at a few examples: industrial devices, smart watches, smart glasses, and a whole range of products for the home, from thermostats to media hubs. These devices all use Cortex-A, and deliver a richer experience to users.

 

Market share.png

 

The rich embedded market is growing rapidly, fueled by two key drivers:

  • a wide choice of affordable silicon platforms delivering low cost, and high performance
  • the largest rich embedded software ecosystem

 

SBC.png

 

Today, more than 100 Cortex-A based Single Board Computers (SBCs) are available at various performance and cost points. Rich operating systems, open source and proprietary, have become more accessible, and this has opened up embedded development to a wider range of developers. The software ecosystem for Cortex-A processors also includes support from the leading RTOS and embedded tools vendors. Their interest in Cortex-A is driven by demand from their customers, who want to take advantage of Cortex-A performance, compatibility, wide availability, and the benefits of multiple suppliers and price/performance points.

 

Number1 Ecosystem.png

Much has been said lately about 64-bit, which is driving the smartphone and open compute markets; in embedded, however, the majority of the software ecosystem is focused on 32-bit software. While some embedded applications are moving to 64-bit, like high-end SBCs, NAS and ADAS systems, many embedded applications are sticking with 32-bit to keep costs and complexity low. We can expect a significant number of embedded devices to remain 32-bit for the foreseeable future.

 

Highlights of Cortex-A32 Processor

 

We built the Cortex-A32 for embedded, first and foremost. Embedded is an exciting market, and we wanted to continue to build processors that accelerate the innovation in this market. So, what benefits does the Cortex-A32 processor offer for rich embedded?

  1. ARMv8 architectural enhancements
  2. Higher efficiency and performance
  3. Scalability to target diverse embedded markets

 

Let us look at some details for each one of these key offerings.

1. ARMv8-A architectural enhancements

 

Cortex-A32 is the only ARMv8-A processor optimised for 32-bit compute. As such, the Cortex-A32 offers an ARMv8 upgrade path for applications that today use ARMv7-A processors like Cortex-A5 and Cortex-A7 or classic ARM processors like ARM926 and ARM1176.

 

The ARMv8-A architecture supports both 32-bit and 64-bit compute capabilities in the AArch32 and AArch64 execution states. Cortex-A32 is optimized to support the A32/T32 instruction set in the AArch32 execution state, which is ideal for 32-bit rich embedded applications that need the lowest cost and power. Even in AArch32, ARMv8-A adds more than 100 new instructions – and the Cortex-A32 benefits from all of these.
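
As a small, hedged illustration of what those new instructions offer, the sketch below uses the ACLE CRC32 intrinsics from <arm_acle.h>, which map onto one of the instruction groups ARMv8-A adds to AArch32. It assumes a toolchain invoked with CRC support enabled (for example -march=armv8-a+crc) and a core that implements the CRC extension, so treat it as an example rather than a guaranteed feature of every ARMv8-A device.

    #include <arm_acle.h>
    #include <stddef.h>
    #include <stdint.h>

    /* CRC-32 over a buffer of words, one CRC32W instruction per word. */
    uint32_t crc32_words(const uint32_t *data, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++)
            crc = __crc32w(crc, data[i]);
        return ~crc;
    }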

 

Architecture benefits.png

 

2. Higher efficiency and performance

 

Cortex-A32 is 25% more efficient (more performance per mW) than Cortex-A7 in the same process node. Cortex-A32 delivers this efficiency through performance improvements and power reduction, two often conflicting design goals that the Cortex-A32 team managed to deliver in tandem.

 

Efficiency Performance.png

 

The Cortex-A32 also delivers performance improvements compared to the Cortex-A5 and Cortex-A7 processors. The performance improvements relative to the Cortex-A5 range from 30% to a massive 1300% across a range of benchmarks relevant to embedded markets. Streaming and crypto are key benchmarks at the top end of this scale. Compared to the Cortex-A7, the Cortex-A32 offers 5% to 25% higher performance. To put things in perspective, the Cortex-A32 delivers similar performance to the Cortex-A9, which was the premium smartphone standard just a few years ago. That performance is coming to the lowest cost rich embedded devices now, and at significantly less power.

 

For integer workloads, the combination of performance improvements and power reduction provided by the Cortex-A32 translates into a greater than 25% efficiency gain over the Cortex-A7 and a more than 30% efficiency gain over the Cortex-A5. Compared to the Cortex-A35, the Cortex-A32 offers the same 32-bit performance but consumes 10% less power and has a 13% smaller core. This means that the Cortex-A32 is 10% more efficient than the Cortex-A35 processor in the 32-bit world.

 

3. Scalability

 

Given the diversity of embedded applications, we knew we had to make the Cortex-A32 scalable. Cortex-A32 therefore offers a wide range of configuration options. The diagram below shows two configurations of Cortex-A32 but there is a range of possibilities in between.

 

Scalability.png

 

The configuration on the left of the diagram above shows a typical performance-optimized multi-core configuration: quad-core, with larger cache sizes and optional features like the NEON and Crypto engines. This configuration provides excellent performance for most rich embedded applications and retains ARM’s low-power leadership, consuming less than 75mW per processor core when running at 1.0GHz on a 28nm process node. At the other extreme, the smallest configuration of the Cortex-A32 processor, with a physical implementation optimized for area, occupies less than a quarter of a square millimetre and consumes less than 4mW at 100MHz on the same 28nm process node. With this scalability, the Cortex-A32 is suitable for a wide range of rich embedded applications.

 

In summary, the lowest cost rich embedded applications are about to get a lot more exciting. Cortex-A is already the number one CPU architecture for rich embedded. The Cortex-A32 expands the Cortex-A family and adds our most efficient 32-bit application processor yet. The Cortex-A32 is set to drive future innovation in rich embedded and IoT, and I can’t wait to see what our partners will build with it.

ARM Cycle Models have long been used to perform design tasks such as:

 

  • IP Evaluation
  • System Architecture Exploration
  • Software Development
  • Performance Optimization

 

In October 2015, ARM acquired the assets of Carbon Design Systems with the primary goal of enabling earlier availability of cycle accurate models for ARM processors and system IP. The announcement of the ARM® Cortex®-R8 is the first step in demonstrating the benefits of early Cycle Model availability. Another goal is to provide Cycle Models which can be used in SystemC simulation environments. The Cortex-R8 model is the first Cycle Model available for use in the Accellera SystemC environment right from the start.

 

The Cortex-R8 model has been available to lead partners since the beginning of 2016 and will be generally available on ARM IP Exchange this month.

 

Earlier cycle accurate model availability has led to increased focus on using Cycle Models to understand new processors. This article describes some of the ways the Cycle Model has been used by ARM silicon partners to understand the Cortex-R8.

 

Prior to the early availability of Cycle Models, these tasks would have been performed using RTL simulation or FPGA boards. RTL simulation can be cumbersome, especially for software engineers doing benchmarking tasks, and it lacks software debugging and performance analysis features. FPGA boards are familiar to software engineers, but lack the ability to change CPU build-time parameters such as cache and TCM sizes.

 

The examples below provide more insight on how Cycle Models are being used.

 

Benchmarking

 

A common activity for a new processor such as Cortex-R8 is to run various benchmarks and measure how many cycles are required for various C functions. SoC Designer provides an integrated disassembly view which can be used to set breakpoints to run from point A to point B and measure cycle counts.

 

diss.PNG

 

DS-5 can also be connected to the Cortex-R8 for a full source code view of the software.

 

ds5.PNG

 

The cycle count is always visible on the toolbar of SoC Designer.

 

cycle-count.PNG

 

Many times a simple subtraction is all that is needed to measure cycle count between breakpoints.
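
For readers who prefer to instrument the code itself rather than set breakpoints, the same start, stop and subtract idea can be expressed in bare-metal C using the ARMv7 PMU cycle counter. This is a hedged sketch: it assumes privileged execution, and the CP15 encodings (architectural PMCR, PMCNTENSET and PMCCNTR accesses) should be checked against the Cortex-R8 Technical Reference Manual for your configuration.

    #include <stdint.h>

    static inline void pmu_enable_cycle_counter(void)
    {
        uint32_t v;
        __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v));  /* PMCR */
        v |= (1u << 0) | (1u << 2);        /* E: enable, C: reset cycle counter */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v));
        v = (1u << 31);                    /* enable the cycle counter */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(v));  /* PMCNTENSET */
    }

    static inline uint32_t pmu_read_cycles(void)
    {
        uint32_t c;
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(c));  /* PMCCNTR */
        return c;
    }

    /* Cycle cost of a region of interest, the software equivalent of
       subtracting the cycle counts at two breakpoints. */
    uint32_t measure_cycles(void (*region)(void))
    {
        pmu_enable_cycle_counter();
        uint32_t start = pmu_read_cycles();
        region();
        return pmu_read_cycles() - start;
    }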

 

After the first round of benchmarking is done, the code can be moved from external memory to TCM and execution repeated. The Cortex-R8 cycle model will boot from ITCM when the INITRAM parameters are set to true. Right clicking on the Cortex-R8 model and setting parameters make it easy to change between external memory and TCM.

 

param.PNG

 

In addition to just counting cycles, SoC Designer provides additional analysis features. One useful feature is a transaction view.

 

The transaction monitor can be used to make sure the expected transactions are occurring on the bus. For example, when running out of TCM little or no bus activity is expected on the AXI interface, and if there is activity it usually indicates incorrect configuration. Below shows a transaction view of the activity on the AXI interface when running from external memory. Each transaction has a start and end time to indicate how long it takes.

 

trans.PNG

 

All PMU events are instrumented and can be automatically captured in Cycle Models. These are viewed by enabling the profiling feature and looking at the results using the analyzer view. The hex values to the left of each event correspond to the event codes in the Technical Reference Manual. In addition to raw values, graphs of events over time can be created to identify hotspots.

 

an-pmu.PNG
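
The same event counters can also be read directly from software if you want to cross-check the analyzer output on hardware or in simulation. The sketch below is illustrative only: it programs ARMv7 PMU event counter 0 with an event code, and 0x03 (L1 data cache refill) is used purely as an example, since the valid codes for a particular core are the ones listed in its Technical Reference Manual.

    #include <stdint.h>

    /* Program event counter 0 to count the given TRM event code. */
    static inline void pmu_count_event(uint32_t event_code)
    {
        uint32_t v;
        __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v));          /* PMCR */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v | 1u));     /* global enable */
        v = 0;                                                             /* counter 0 */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(v));          /* PMSELR */
        __asm__ volatile("mcr p15, 0, %0, c9, c13, 1" :: "r"(event_code)); /* PMXEVTYPER */
        v = 1u << 0;                                                       /* enable counter 0 */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(v));          /* PMCNTENSET */
    }

    static inline uint32_t pmu_read_event(void)
    {
        uint32_t sel = 0, count;
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(sel));        /* select counter 0 */
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(count));      /* PMXEVCNTR */
        return count;
    }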

 

The analysis tools also provide information about bus utilization, latency, transaction counts, retired instructions, branch prediction, and cache metrics as shown below.  Custom reports can also be generated.

 

sys-met.PNG

 

After observing a benchmark in external memory and TCM, it’s common to change TCM sizes and cache sizes. Models with different cache sizes and TCM sizes can easily be configured and created using ARM IP Exchange and the impact on the benchmark observed. The IP configuration page is shown below. Generating a new model is as simple as selecting new values on the web page and pushing the build button. After the compilation is done the new model is ready for download and can replace the current Cortex-R8 model.

 

ip-config.PNG

 

Cache and Memory Latency

 

Another use of the Cortex-R8 Cycle Model is to analyze the performance impact of adding the PL310 L2 cache controller. There is a Cycle Model of the PL310 available from ARM IP Exchange. It can be added into a system and enabled by programming the registers of the cache controller. The register view is shown below.

 

pl310.PNG
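
As a point of reference, a bare-metal enable sequence for the PL310 looks roughly like the sketch below. The base address is an assumption (it comes from the SoC's memory map), the register offsets follow the L2C-310 register summary, and the 0xFF way mask assumes an 8-way configuration; all three should be checked against the system being modelled.

    #include <stdint.h>

    #define PL310_BASE      0x1F002000u   /* assumed, SoC-specific base address */
    #define PL310_CTRL      (*(volatile uint32_t *)(PL310_BASE + 0x100))
    #define PL310_AUX_CTRL  (*(volatile uint32_t *)(PL310_BASE + 0x104))
    #define PL310_INV_WAY   (*(volatile uint32_t *)(PL310_BASE + 0x77C))
    #define PL310_SYNC      (*(volatile uint32_t *)(PL310_BASE + 0x730))

    void pl310_enable(uint32_t aux_value)
    {
        PL310_AUX_CTRL = aux_value;      /* way size, associativity, prefetch, ... */
        PL310_INV_WAY  = 0xFFu;          /* invalidate all 8 ways before enabling */
        while (PL310_INV_WAY & 0xFFu)
            ;                            /* wait for the invalidation to finish */
        PL310_SYNC = 0;                  /* drain the controller's buffers */
        PL310_CTRL = 1;                  /* bit 0: L2 cache enable */
    }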

 

SoC Designer provides ideal memory models which can be configured for various wait states and delays. Performance of memory accesses using these memory models can be compared with adding the PL310 into the system. The same analysis tools can be used to determine latency values from the L2 cache and the overall performance impact of adding the L2 cache. Right clicking on the PL310 and enabling the profiling features will generate latency and throughput information for the analysis view.

 

Example systems using the Cortex-R8 and software to configure the system and run various programs are available from ARM System Exchange. The systems serve as a quick start by providing cycle accurate IP models, fully configured and initialized systems, and software source code. Most users take an example system as a starting point and then modify and customize it to meet particular design tasks.

 

Conclusion

 

Previously, the only ways to evaluate performance and understand the details of a new ARM processor were RTL simulation or FPGA boards with fixed configurations. ARM Cycle Models have become the new standard for IP evaluation, early benchmarking and performance analysis. The Cortex-R8 Cycle Model is available for use in SoC Designer and SystemC simulation. Example systems and software are available, models of different configurations can easily be generated using ARM IP Exchange, and the software debugging and performance analysis features make Cycle Models an easy-to-use environment in which to evaluate IP and make informed selection decisions.

Cortex-R8 Chip Image.png

On Thursday 18th February we announced our latest real-time processor IP, the ARM® Cortex®-R8. The Cortex-R8 is a quad-core, high-performance real-time processor, building on the ARMv7-R architecture already firmly established by the Cortex-R4, R5 and R7.

 

With Cortex-R8 we’re delivering increased performance and introducing new features to meet the demands of next-generation storage device controllers and mobile communications with a particular focus on the forthcoming 5G cellular wireless standards. Though this blog focuses on the application of Cortex-R8 in storage and modem products, the Cortex-R8 is also applicable to many other markets where the fastest real-time performance is required.

 

ARM architecture family

To give some orientation; a real-time Cortex-R processor is one of three variants of the ARM architecture family, the others being Cortex-A for applications and Cortex-M for microcontrollers. These three architectures have a lot in common, in terms of instruction set, programming model and support in the wider ARM ecosystem, but they are each specifically equipped for their intended application spaces.

 

For Cortex-A, that is primarily running a high-level operating system such as Linux or Android. You’ll find our rich portfolio of Cortex-A processors in just about every mobile phone, as well as in tablets, servers, enterprise systems, networking equipment, industrial controllers and so on. Cortex-M is profiled to enable our partners to build the very lowest power and lowest cost microcontrollers and edge devices in the Internet of Things, such as remote sensors and embedded wireless chips for standards like Bluetooth.

 

Cortex-R sits in between, with a range of processors and multi-core configuration options offering high performance with cached memory systems and a tightly-coupled memory system for fast and deterministic response to system events. This is what you need, for example, in a System on Chip that controls a storage device, especially magnetic storage with hard disks, which is typical of a hard real-time system where deadlines are measured in micro-seconds or less.

 

Meeting increased storage demands

In storage devices, and in particular for hard disk drives, the Cortex-R processors have long been established as number one choice for performance and response when it comes to controlling heads and motors and controlling the host interface. All the major hard disk manufacturers use Cortex-R processors, and they also ship in increasingly high volume in the solid state flash drive space, in both consumer and enterprise class storage.

 

Storage device capacities and interface data rates are still increasing rapidly, both for magnetic and solid state devices and we see increasingly higher input-output operations per second and increasingly complex algorithms for keeping track of data and managing errors as the physical limits of storage media are challenged.

 

Cortex-R8 Storage.png

 

Our new Cortex-R8 processor offers storage controller designers both additional performance and new AMBA® bus ports with error correcting code protection, amongst other things.  Some of these features are a direct outcome of our close engineering relationships with storage System On Chip architects in the ARM silicon partnership where we've worked together to optimise the technology boundary between the ARM processor IP and rest of the system.

 

LTE-Advanced Pro and 5G

Now, turning to cellular modems, here we see ARM’s processors already used in very high numbers within the modem sub-system of a mobile phone SoC, either as a stand-alone modem chip or, as is now usually the case, in highly-integrated modem plus application processor chips.

 

Cortex-R processors are well-suited to the modem task, where they both manage the scheduling of data flows through the signal processing for reception and transmission and run the protocol stack software tasks to establish and manage connections whilst a data, voice or video call is taking place. Once again, these are hard real-time tasks where the processor must respond to events in the communication channel with microsecond granularity; otherwise data is dropped and has to be re-transmitted over and over. The data rates and complexity are increasing, placing higher workload and feature-set demands on the modem processor.

 

Third-generation, and now fourth-generation, cellular communications using the LTE set of standards are established worldwide, with over a billion subscribers to mobile services. LTE and LTE-Advanced provide data rates upwards of 300Mbits per second, and importantly LTE-Advanced enables operators to maximise use of their spectrum allocation by aggregating transmission over a number of carrier frequencies. This flexibility of spectrum usage is very valuable to operators, who license it from governments, but it places significant additional workload on the modem processor, as it requires multiple instances of some tasks in the protocol software stack for each carrier.

A clearer future brings faster data rates

Now the future of cellular communications is becoming clearer, as outline standards and schedules have been set for the introduction of the fifth generation and the last fourth-generation standard, LTE-Advanced Pro. Substantially higher data rates to a gigabit and beyond, even more carrier frequencies, multiple antenna arrays and new features for emergency services and the like all contribute to increasing workload and feature-set requirements for the modem processor, and this is where the Cortex-R8 comes in.

 

Cortex-R8 5G Development.png

 

The bars in the image above show the period when we’ll see our partners designing and testing their chips, up until their market introduction in a mobile phone. The service roll-out happens after that. You can see we anticipate the second wave of 5G to be more challenging, when new air interfaces using very high millimetre-wave frequencies have to be developed.

 

The next generation LTE-Advanced Pro standard brings both WiFi and also new unlicensed band cellular technology together with existing LTE-Advanced in the same modem.  This creates a further substantial increase in the modem processing tasks.

 

Then, with 5G, even more carriers are planned.  There will be higher data rates, multi-dimensional antennas, direct phone to phone services, mission-critical services designed for first responders, low-latency services for vehicles and highways and new narrow band communications designed for the IoT and other capabilities for the 2020s.

 

Cortex-R8 LTE-Advanced Pro.png

All of this requires a real-time multi-core processor that can cope with increasing software workloads in the protocol stack layers and manage data scheduling through the modem signal processing and its various dedicated hardware accelerators for security, compression and the like.

 

In addition, modem designers are asking for even more layer-1 scheduling activities to be managed by software in the ARM processor instead of dedicated hardware as this allows more flexibility when switching between all the different communication standards.

 

And, like the storage use case, here again ARM has worked very closely with modem teams around the silicon partnership to understand their requirements and deliver a processor that both integrates neatly into the SoC hardware design and executes their software efficiently.

Cortex-R8 Modem in Phone.png

The modem processor and its associated DSP and hardware accelerators are key parts of a mobile phone SoC, along with the applications and graphics processing. ARM has introduced the Cortex-R8 to support this next set of LTE-Advanced Pro modems and the initial 5G design cycles. It is a very fast processor and can deliver a total of 15,000 Dhrystone MIPS from a quad-core configuration running at 1.5GHz on a 28 or 16nm silicon process.

 

Like our top-line application processors, it can execute instructions out of order.  This can be key to success in real-time applications like modems because it enables the processor to continue execution whilst outstanding memory or peripheral transactions are in flight.

 

Reliability is very important in storage applications where error detect and correct features must ensure that soft errors do not propagate through the control processor into the storage medium.

 

Like most ARM processors, the Cortex-R8 scales in terms of size and capability. Chip designers can optimise it for their application by selecting configurations of one to four CPU cores, level-1 memory sizes, a choice of bus interfaces, error handling features and so on. Also, once a chip is running, the software can power cores up and down depending on workload. For example, a modem may run all four cores during a video call but drop down to a single core when the phone is almost asleep in your pocket.
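
As a rough, hedged illustration of the idea (and not the mechanism a production modem subsystem would actually use), on a Linux-based system software can take cores offline and bring them back through the CPU hotplug interface in sysfs:

    #include <stdio.h>

    /* Bring a secondary CPU online (1) or take it offline (0) via sysfs. */
    static int set_cpu_online(int cpu, int online)
    {
        char path[64];
        snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", online ? 1 : 0);
        fclose(f);
        return 0;
    }

    /* For example, drop from four cores to one when the device goes idle. */
    void enter_idle_mode(void)
    {
        for (int cpu = 1; cpu < 4; cpu++)
            set_cpu_online(cpu, 0);
    }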

 

In common with all the Cortex-R processors, the Cortex-R8 takes interrupts into its pipeline as quickly as possible and then services them with code and data stored in a tightly-coupled memory, thus avoiding the longer, non-deterministic latency you get when fetching interrupt service routines into the cached memory system. The Cortex-R8 supports eight times as much TCM as the Cortex-R7 did, up to 2MB for each CPU core.
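
From the software side, this typically means placing the interrupt handler and its working data into TCM so the core never waits on the cached memory system. The sketch below is illustrative only: the .itcm and .dtcm section names and the peripheral address are assumptions that would come from a project's linker script and memory map.

    #include <stdint.h>

    /* Working data kept in data TCM for deterministic access. */
    volatile uint32_t rx_fifo[64] __attribute__((section(".dtcm")));
    volatile uint32_t rx_count    __attribute__((section(".dtcm")));

    /* Handler code placed in instruction TCM; 0x40001000 is an assumed
       peripheral data register address used only for illustration. */
    __attribute__((interrupt("IRQ"), section(".itcm")))
    void modem_rx_irq_handler(void)
    {
        rx_fifo[rx_count++ & 63u] = *(volatile uint32_t *)0x40001000u;
    }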

 

In the block diagram below you can see the pipeline, TCM and error handling features:

Cortex-R8 Processor Overview.png

Why is the Cortex-R8 attractive to System-on-Chip designers? Firstly, there is a lot of software already out there, for example modem protocol stacks and drivers going all the way back to 2G, then GPRS, HSPA and so on, followed by first-generation LTE.

 

All this software and the associated electronic system-level design, simulation and verification equipment and know-how represents a huge investment for the modem design teams at our silicon partners, so we must protect that investment by offering them scalability and forward compatibility, which the Cortex-R8 of course does.

 

In addition, the complexity of this software has increased dramatically, and the rest of the modem hardware is also very complex. So you can see that the Cortex-R8’s quad CPU cores and coherent memory system allow software execution to be parallelised across the four cores, while the various interfaces into the modem hardware can be used to achieve the best performing and lowest power overall design.

 

Another characteristic of all the ARM Cortex real-time processors is that they are designed to minimise latency for memory reads and writes by using a protected memory system (a memory protection unit rather than a memory management unit). This is different to the virtual memory system that is needed for a high-level OS like Linux or Android and, for the Cortex-R processors running a real-time operating system, it is key to their ability to start responding to hard real-time events in a tenth of a microsecond or less.
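
For illustration, programming one region of that protected memory system (the ARMv7-R MPU) from bare-metal C might look like the hedged sketch below. The base address, region size and attribute value are placeholder choices; a real system derives them from its own memory map and from the access permission and memory type encodings in the processor's TRM.

    #include <stdint.h>

    /* Configure one data MPU region using the ARMv7-R PMSA CP15 registers. */
    static void mpu_set_region(uint32_t region, uint32_t base,
                               uint32_t size_field, uint32_t access_ctrl)
    {
        __asm__ volatile("mcr p15, 0, %0, c6, c2, 0" :: "r"(region));       /* RGNR  */
        __asm__ volatile("mcr p15, 0, %0, c6, c1, 0" :: "r"(base));         /* DRBAR */
        __asm__ volatile("mcr p15, 0, %0, c6, c1, 4" :: "r"(access_ctrl));  /* DRACR */
        __asm__ volatile("mcr p15, 0, %0, c6, c1, 2"                        /* DRSR  */
                         :: "r"((size_field << 1) | 1u));                   /* enable */
    }

    void mpu_example(void)
    {
        /* Region 0: 1MB of normal, cacheable RAM at address 0, full access.
           Size field 19 encodes 2^(19+1) bytes; 0x303 is an illustrative
           attribute encoding (AP = full access, C = 1, B = 1). */
        mpu_set_region(0, 0x00000000u, 19u, 0x00000303u);
    }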

 

The diagram below demonstrates how we’ve evolved our real-time processor line-up with a combination of micro-architectural and multi-core developments to keep up the pace of innovation in communications through 3G, 4G and onward to 5G.

Cortex-R8 Real-time Processing.png

Of course, this has followed the semiconductor process technology roadmap. Broadly speaking, Cortex-R4 designs were on 65nm, Cortex-R5 designs on 40nm, Cortex-R7 designs on 28nm, and Cortex-R8 products will be on 16nm, 14nm and 10nm, maybe even 7nm. The process technology is a key enabler for increasing data rates and modem capabilities, as it allows us more transistors within the same cost and power budget. ARM develops its processors to take best advantage of this, and overall the phone user gains an ever-improving mobile experience.

 

With four cores to spread the software over, there are more advantages as it’s less likely that implementation techniques like voltage overdrive or very high frequency operation using low threshold transistors will be needed.  That of course can save a considerable amount of power.

 

So, we believe that Cortex-R8 is by far the best performing real-time processor IP available and it offers all the right features for its intended applications.  Its aggregate multi-core performance up to 28,000 CoreMarks is more than sufficient for the next set of LTE-Advanced Pro and first 5G modem designs, and testing with silicon partners’ software has already demonstrated this to be the case.

 

Combine that with large TCM memories, out-of-order execution,  the rich set of interface ports, error management etc. and you have the best solution for any modem or storage controller or similar high-performance deeply-embedded hard real-time application.

 

To conclude, we know that our processors are already the best and most widely used in deeply embedded hard real-time applications such as modems and storage controllers. And now, with the new Cortex-R8, we have delivered the next step in hard real-time performance and features to extend that leadership, as we enable the next design cycles for high-performance storage controllers and cellular modems targeting LTE-Advanced Pro and thereafter 5G communications standards.

 

To find out more about LTE Advanced Pro and 5G, read ARM’s Whitepaper “The Route to 5G”

Chinese version (中文版): Triggering the Next Mobile Computing Revolution with ARMv8 SoC Processors

 

I recently had the opportunity to reflect on the mobile computing revolution of the last five years. I use the term 'mobile computing' deliberately: the compute tasks we handle on mobile phones today directly rival those that were only possible on laptops and desktops several years ago. With an uninterrupted supply of power from the wall, laptop and desktop PCs needed fan-assisted cooling, and their architecture is designed around that capacity. Today, mobile devices run similarly demanding workloads for a full day (or more) on a single charge and serve as communications hub, entertainment center, game console, and mobile workstation. The architecture of ARM® based mobile devices is, and has always been, designed around the mobile footprint. Continuing to improve the user experience in that footprint requires a commitment to delivering the most out of each milliwatt and every millimeter of silicon.

 

The success of smartphones and tablets and of the software app economy (worth $27bn and growing) is largely based on SoCs (Systems-on-Chip) from ARM partners. Mobile SoCs balance ever-increasing performance with form factor, battery life and price point across an incredibly diverse range of consumers. Most of them to date have been based on the ARMv7-A architecture, which accounts for a 95% share of the growing smartphone market. The growing app ecosystem (with over 40bn downloads) has been largely designed and coded specifically for the ARM architecture, resulting in a vast application base. We are now at the transition point to ARMv8-A, the next generation in efficient computing.

 

2014 will see the arrival of numerous devices featuring the latest ARMv8-A architecture, opening the door for developers while retaining 100% compatibility with the vast app ecosystem based on 32-bit ARMv7. It is great to finally be at a point where the first ARMv8 mobile SoCs are coming to the market, and it is particularly positive that some of the upcoming SoCs employ ARM big.LITTLE® technology, which combines high-performance CPUs and high-efficiency CPUs in one processing sub-system capable of both 32-bit and 64-bit operation, dynamically moving workloads to the right-sized processor and saving upwards of 50% of the energy.

 

Qualcomm® recently announced their Snapdragon® 810 processor, which uses four Cortex®-A57 cores and four Cortex-A53 cores in a big.LITTLE configuration, and the Snapdragon 808 processor, which uses two Cortex-A57 cores and four Cortex-A53 cores, again in a big.LITTLE configuration. These processors are expected to be available in commercial devices by the first half of 2015 and will feature 64-bit ARMv8 support for Android. We have been working together with teams from Qualcomm Technologies and other ecosystem partners for several years to ensure that OEMs and OS providers are able to take full advantage of the ARMv8-A architecture, and that they can rely on the same design philosophy that has made ARMv7-A based Snapdragon processors so successful in the multiple segments of the mobile market.

 

My colleague  James Bruce and I recently collaborated with our counterparts at Qualcomm in writing a paper that delves further into ARMv8-A and explains the journey of bringing an ARMv8 SoC to market - I recommend it for anyone seeking to better understand the SoC design process and mobile processor market space.

 

The white paper (which you will find below) dispels a few myths about ARMv8-A (it's more than just 64 bit, it doesn't double code size, etc.) and outlines the approach one ARM partner takes in combining ARM IP with in-house IP to build a product line ranging from premium smartphone and tablets down to low-cost smartphone tiers for emerging markets.

 

The first half of the paper offers some useful insights into the mobile market, how ARM competes in the market, how Android is delivered on ARM platforms, and the benefits of the latest ARM Cortex-A processors and ARMv8 instruction set architecture.  The second half of the paper dives a bit deeper into Qualcomm's approach to delivering a complete SoC, combining in-house designed components with ARM IP, then optimizing the whole platform. It discusses Qualcomm's use of Cortex-A57 and Cortex-A53 along with big.LITTLE technology in the announced Snapdragon 808 and 810 SoCs, as well as their use of custom-designed CPUs, GPUs, and other components in the Snapdragon product line.

 

The ready availability of ARM IP and the flexibility of the ARM business model provide the freedom to mix and match, and the opportunity to innovate rapidly, which have been big factors in enabling ARM partners like Qualcomm to be so successful in the smartphone and mobile computing revolution.
