
SoC Implementation


The first Carbon Performance Analysis Kit (CPAK) demonstrating the AMBA 5 CHI protocol has been released on Carbon System Exchange. The design features the ARM Cortex-A57 configured for AMBA 5 CHI and the ARM CoreLink CCN-504 Cache Coherent Network. The design is a modest system with a single core running 64-bit bare-metal software with memory and a PL011 UART, but for anybody who digs into the details there is a lot to learn.

 

Here is a diagram of the system:

 

ccn-504-system1


AMBA 5 CHI Introduction

 

Engineers who have been working with ARM IP for some time will quickly realize AMBA 5 CHI is not an extension of any previous AMBA specifications. AMBA 5 CHI is both more and less complex compared to AMBA 4. CHI is more complex at the protocol layer, but less complex at the physical layer. AXI and ACE use Masters and Slaves, but CHI uses Request Nodes, Home Nodes, Slave Nodes, and Miscellaneous Nodes. All of these nodes are referenced using shorthand abbreviations as shown in the table below.

 

nodes2


Building the A57 with CHI

 

The latest r1p3 A57 is now available on Carbon IP Exchange. CHI can be selected as the external memory interface. The relevant section from the IP Exchange configuration form is shown below.

 

a57chi1

 

The CHI memory interface relies on the System Address Map (SAM) signals. All of the A57 input signals starting with SAM* are important in constructing a working system. These values are available as parameters on the A57 model, and are configured appropriately in the CPAK to work with the CCN-504.

 

Configuring the CCN-504


The CCN-504 Cache Coherent Network provides the connection between the A57 and memory. The CPAK uses two SN-F interfaces, since support for dual memory controllers is one of the key features of the IP. A similar set of SAM* parameters is available on the CCN-504 to configure the system address map. Like other ARM IP, the CCN uses the concept of PERIPHBASE to set the address of the internal, software-programmable registers.

 

Programming Highlights

 

The CCN-504 includes an integrated level 3 cache. The CPAK demonstrates the use of the L3 cache.

The CPAK startup assembly code also demonstrates other CCN-504 configuration, including how to set up barrier termination, load node ID lists, program system address map control registers, and more.


AMBA 5 CHI Waveforms

 

One of the best ways to start learning about AMBA 5 CHI is looking at the waveforms between the A57 and the CCN-504. The latest SoC Designer 7.15.5 supports CHI waveforms and displays Flits, the basic unit of transfer in the AMBA 5 CHI link layer.

 

wave

Summary


A new CPAK by Carbon Design Systems running 64-bit bare-metal software on the Cortex-A57 processor with CHI memory interface connected to the CCN-504 and memory is now available. It demonstrates the AMBA 5 CHI protocol, serves as a starting point for optimization of CCN-based systems, and is a valuable learning tool for projects considering AMBA 5 CHI.

My colleague, Tom De Schutter, wrote a good blog about a recent accomplishment of the Synopsys Press book "Better Software. Faster!" -- more than 3,000 copies in distribution to designers in more than 1,000 companies. The success of the book highlights the interest in using virtual prototyping as a key methodology to "shift left" product development.

 

You can download a free Better Software. Faster! eBook in English or Chinese by using either your SolvNet ID or email address. The Japanese edition is underway as well, so stay tuned for that.

 

The book, which includes case studies from thirteen companies, including one written by Rob Kaye of ARM, dives deep into virtual prototyping as the key methodology to enable concurrent hardware/software development by decoupling software development from hardware availability.

 

At ARM TechCon 2014, Nguyen Le, a Principal Design Verification Engineer in the Interactive Entertainment Business Unit at Microsoft Corp., documented a real-world case study using a formal app and verification management tools to achieve his code coverage goals significantly faster.


Specifically, in the paper titled “Advanced Verification Management and Coverage Closure Techniques”, Nguyen outlined his initial pain in verification management and improving coverage closure metrics, and how he conquered both these challenges – speeding up his regression run time by 3x, while simultaneously moving the overall coverage needle up to 97%, and saving 4 man-months in the process. The following article reports the highlights of his presentation and paper:


ARM® Techcon Paper Report: How Microsoft Saved 4 Man-Months Meeting Their Coverage Closure Goals Using Automated Verific…



I’m excited to introduce the most complex Carbon Performance Analysis Kit (CPAK) created by Carbon: an 8-core ARM Cortex-A53 system running 64-bit Linux with full Swap & Play support. This is also the first dual-cluster Linux CPAK available on Carbon System Exchange. It’s an important milestone for Carbon and for SoC Designer users because it enables system performance analysis for 64-bit multi-core Linux applications.

 

Here are the highlights of the system:

  • Dual-cluster, quad-core Cortex-A53 for a total of 8 cores
  • ARM CoreLink CCI-400 providing coherency between clusters
  • Fully configured GIC-400 interrupt controller delivering interrupts to all cores
  • New Global System Counter connected to A53 Generic Timers

 

Here is a diagram of the system.

octacore1

The design also supports fully automatic mapping to ARM Fast Models.

 

I would like to introduce some of the new functionality in this CPAK.

 

Dual Cluster System


The Cortex-A53 model supports the CLUSTERIDAFF inputs to set the Cluster ID. This value shows up for software in the MPIDR register. Values of 0 and 1 are used for each cluster, and each cluster has four cores. This means that CPU 3 in Cluster 1 has an MPIDR value of 0x80000103 as shown in the screenshot below.

 

mpidr1
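The MPIDR value quoted above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the ARMv8 MPIDR layout in which bits [7:0] hold Aff0 (the core number within the cluster), bits [15:8] hold Aff1 (the cluster ID set through CLUSTERIDAFF), and bit 31 reads as one. The function names are made up for illustration and are not part of the CPAK.

```python
# Sketch: compose and decode the lower 32 bits of MPIDR for a Cortex-A53 cluster.
# Assumed layout: Aff0 = bits [7:0], Aff1 = bits [15:8], bit 31 reads as 1.

def make_mpidr(cluster_id, cpu_id):
    """Build the MPIDR value software would read on a given core."""
    return (1 << 31) | ((cluster_id & 0xFF) << 8) | (cpu_id & 0xFF)


def decode_mpidr(mpidr):
    """Return (cluster_id, cpu_id) extracted from an MPIDR value."""
    return (mpidr >> 8) & 0xFF, mpidr & 0xFF


if __name__ == "__main__":
    # Two clusters of four cores each, as in the dual-cluster CPAK.
    for cluster in range(2):
        for cpu in range(4):
            print(f"cluster {cluster} cpu {cpu}: 0x{make_mpidr(cluster, cpu):08X}")
```

Calling `make_mpidr(1, 3)` gives 0x80000103, matching the value described for CPU 3 in Cluster 1.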


Global System Counter

 

Another requirement for a multi-cluster system is the use of a Global System Counter. A new model is now available in SoC Designer which is connected to the CNTVALUEB input of each A53. This ensures that the Generic Timer in each processor has the same counter values for software, even when the frequency of the processors may be different. This model also enables Swap & Play systems to work correctly by saving the counter value from the Fast Model simulation and restoring it in the Cycle Accurate simulation.

 

Generic Timer to GIC Connections


To create a multi-cluster system the GIC-400 is used as the interrupt controller, and the A53 Generic Timers are used as the system timers. This requires the connection of the Generic Timer signals from the A53 to the GIC-400. All of these signals start with nCNT and are wired to the GIC. When a Generic Timer generates an interrupt it leaves the CPU by way of the appropriate nCNT signal, goes to the GIC, and then back to the CPU using the appropriate nIRQ signal.

 

As I wrote in my ARM Techcon Blog, 64-bit Linux uses nCNTPNSIRQ, but all signals are connected for completeness.
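The round trip from timer to GIC and back can be written down as a small lookup. This is only a sketch: the interrupt IDs below are the conventional private peripheral interrupt (PPI) assignments for the Generic Timer outputs, which is an assumption here, since the actual IDs depend on how the nCNT* signals are wired to the GIC-400 in a given design. The function name is hypothetical.

```python
# Sketch: Generic Timer output signals and the PPI IDs they conventionally use.
# These IDs are an assumption; the real values depend on the GIC wiring.
TIMER_PPI = {
    "nCNTHPIRQ": 26,   # hypervisor physical timer
    "nCNTVIRQ": 27,    # virtual timer
    "nCNTPSIRQ": 29,   # secure physical timer
    "nCNTPNSIRQ": 30,  # non-secure physical timer (used by 64-bit Linux)
}


def route_timer_interrupt(signal, cpu):
    """Describe the path a timer interrupt takes out of and back into a CPU."""
    intid = TIMER_PPI[signal]
    return f"CPU{cpu} {signal} -> GIC PPI {intid} -> CPU{cpu} nIRQ"


if __name__ == "__main__":
    for name in TIMER_PPI:
        print(route_timer_interrupt(name, cpu=0))
```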

 

Event Connections

 

Additional signals which fall into the category of power management and connect between the two clusters are EVENTI and EVENTO. These signals are used for event communication using the WFE (wait for event) and SEV (send event) instructions. For a single cluster system all of the communication happens inside the processor, but for the multi-cluster system these signals must be connected.

WFE and SEV communication is used during the Linux boot. All 7 of the secondary cores execute a WFE and wait until the primary core wakes them up using the SEV instruction at the appropriate time. If the EVENTI and EVENTO signals are not connected the secondary cores will not wake up and run.

 

Boot Wrapper Modifications

 

The good news is that all of the software used in the 8-core CPAK is easily downloadable in source code format. A small boot wrapper is used to take care of starting the cores and doing a minimal amount of hardware configuration that Linux assumes to be already done. Sometimes there is additional hardware programming that is needed for proper cycle accurate operation that is not needed in a Fast Model system. These are similar to issues I covered in another article titled Sometimes Hardware Details Matter in ARM Embedded Systems Programming.

 

SMP Enable

 

Although not specific to multi-cluster systems, the A53 contains a bit in the CPUECTLR register named SMPEN, which must be set to 1 to enable hardware management of data coherency with the other cores in the cluster. Initially, this bit was not set in the boot wrapper from kernel.org, and the Linux kernel assumes it is already set, so the write was added to the boot wrapper during development.
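The register update itself is a single read-modify-write. The sketch below models it in Python rather than in the boot wrapper's assembly, and it assumes SMPEN sits at bit 6 of CPUECTLR, which should be checked against the Cortex-A53 Technical Reference Manual. The helper names are hypothetical.

```python
# Sketch: the read-modify-write that sets SMPEN, modelled on a plain integer.
# Assumption: SMPEN is bit 6 of CPUECTLR (verify against the Cortex-A53 TRM).
SMPEN_BIT = 6


def enable_smp(cpuectlr):
    """Return the register value with SMPEN set, leaving other bits unchanged."""
    return cpuectlr | (1 << SMPEN_BIT)


def smp_enabled(cpuectlr):
    """Report whether SMPEN is set in a register value."""
    return bool(cpuectlr & (1 << SMPEN_BIT))


if __name__ == "__main__":
    reset_value = 0x0  # illustrative starting value only
    updated = enable_smp(reset_value)
    print(f"CPUECTLR: 0x{reset_value:X} -> 0x{updated:X}")
```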

 

CCI Snoop Configuration

 

Another hardware programming task which is assumed by the Linux kernel is the enabling of snoop requests and responses between the clusters. The Snoop Control Register for each CCI-400 slave port is set to 0xc0000003 to enable coherency. This was also added to the boot wrapper during development of the CPAK.
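To see what 0xc0000003 means, the value can be split into fields. This sketch assumes the Snoop Control Register layout in which bit 0 enables snoop requests, bit 1 enables DVM message requests, and bits 30 and 31 are capability flags reporting that the interface supports snoops and DVM messages. That layout is an assumption to confirm in the CCI-400 Technical Reference Manual, and the names below are illustrative.

```python
# Sketch: decode a CCI-400 Snoop Control Register value into named fields.
# Assumed bit layout (verify against the CCI-400 TRM):
SNOOP_ENABLE = 1 << 0      # enable issuing of snoop requests
DVM_ENABLE = 1 << 1        # enable issuing of DVM message requests
SUPPORTS_SNOOPS = 1 << 30  # capability flag: interface supports snoops
SUPPORTS_DVM = 1 << 31     # capability flag: interface supports DVM messages


def decode_snoop_control(value):
    """Return a dict of field name -> bool for a Snoop Control Register value."""
    return {
        "snoop_enable": bool(value & SNOOP_ENABLE),
        "dvm_enable": bool(value & DVM_ENABLE),
        "supports_snoops": bool(value & SUPPORTS_SNOOPS),
        "supports_dvm": bool(value & SUPPORTS_DVM),
    }


if __name__ == "__main__":
    for field, is_set in decode_snoop_control(0xC0000003).items():
        print(f"{field}: {is_set}")
```

Under this layout, only the two low bits are actually enables; the upper two bits describe what the slave interface is capable of.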

The gaps between the boot wrapper functionality and Linux assumptions are somewhat expected since the boot wrapper was developed for ARM Fast Models and these details are not needed to run Linux on Fast Models, but nevertheless they make it challenging to create a functioning cycle accurate system. These changes are provided as a patch file in the CPAK so they can be easily applied to the original source code.

 

CPAK Contents

 

The CPAK comes with an application note which covers the construction of the Linux image.

 

The following items are configured to match the minimal hardware system design, and can be extended as the hardware design is modified.

  • File System: Custom file system configured and created using Buildroot
  • Kernel Image: Linux 3.14.0 configured to use the minimal hardware
  • Device Tree Blob:  Based on Versatile Express device tree for ARM Fast Models
  • Boot Wrapper: Small assembly boot wrapper available from kernel.org

 

A single executable file (.axf file) containing all of the above items is compiled. This file contains all of the artifacts and is a single image that is loaded and executed in SoC Designer.

One of the amazing things is there are no kernel source code changes required. It demonstrates how far Linux has come in the ARM world and the flexibility it now has in supporting a wide variety of hardware configurations.

 

Summary


An octa-core A53 Linux CPAK is now available which supports Swap & Play. The ability to boot the Linux kernel using Fast Models and migrate the simulation to cycle accurate execution enables system performance analysis for 64-bit multi-core systems running Linux applications.

 

Also, make sure to check out the other new CPAKs for 32-bit and 64-bit Linux for Cortex-A53 now available on Carbon System Exchange.

 

The “Brought up 8 CPUs” message below tells it all. A number of 64-bit Linux applications are provided in the file system, but users can easily add their favorite programs and run them by following the instructions in the app note.

 

8cpus


Join industry experts and a handful of ARM Partners at the Winchester Mystery House in San Jose, California (great venue for the subject) on Tuesday, October 14th as they unravel the strange and wonderful secrets of semiconductor intellectual property at Unlock the Mystery of IP. This free one-day conference will address cutting-edge semiconductor technology, market trends and projections, and challenges facing players in the IP industry.


Jim Feldhan, President of Semico Research Corporation, will ground the day's programming with two data-rich keynote presentations. Throughout the day, speakers will share 30-minute “deep-tech” presentations on today’s cutting-edge IP products to equip attendees with the knowledge they need to make informed decisions for their next design projects. Finally, there will be two panel discussions that will address today’s hottest topics: the Internet of Things (IoT) and IP subsystems.


 

  • "IP Subsystems: Build or Buy?"
    • Moderator:
      • Gabe Moretti, Extension Media
    • Panelists:


As if it couldn't get any better, a hosted bar and hors d'oeuvres networking reception will conclude the action-packed day. To register and to view the complete agenda, visit the IPextreme website.

The ARM Cortex-M7 processor is out, developed to address digital signal control markets that demand a blend of control and signal processing capabilities. The ARM Cortex-M7 has been designed with a variety of highly efficient signal-processing features to address the design demands of market applications such as next-generation vehicles, connected devices, and smart homes and factories.

 

In many of these end markets, engineering teams demand:

  • Maximum performance within power budgets
  • Maximum power savings targeting a given frequency

 

These are significant challenges to address, so how do we deal with them?

(Cadence recently published a white paper that details the challenges and some solutions. It described ways in which Cadence and ARM worked to optimize power and timing closure in the ARM Cortex-M7.)

 

We start by identifying and confronting the issues. Let’s take dynamic power, for example. Dynamic power is the largest component of total chip power consumption (the other components are short-circuit power and leakage power). It can quickly become a design challenge in leading designs.
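As a point of reference, the usual first-order model of dynamic (switching) power is P = α · C · V² · f, where α is the switching activity, C the switched capacitance, V the supply voltage, and f the clock frequency. The sketch below is not taken from the white paper, and every number in it is a made-up illustration; it only shows why reducing switched capacitance or supply voltage pays off.

```python
# Sketch: first-order dynamic power model, P = alpha * C * V^2 * f.
# All numeric inputs below are hypothetical and chosen only for illustration.

def dynamic_power(alpha, c_farads, vdd_volts, freq_hz):
    """Return switching power in watts for the given activity, load, supply and clock."""
    return alpha * c_farads * vdd_volts ** 2 * freq_hz


if __name__ == "__main__":
    baseline = dynamic_power(alpha=0.1, c_farads=1e-9, vdd_volts=1.0, freq_hz=200e6)
    less_cap = dynamic_power(alpha=0.1, c_farads=0.9e-9, vdd_volts=1.0, freq_hz=200e6)
    lower_vdd = dynamic_power(alpha=0.1, c_farads=1e-9, vdd_volts=0.9, freq_hz=200e6)
    print(f"baseline:              {baseline * 1e3:.2f} mW")
    print(f"10% less capacitance:  {less_cap * 1e3:.2f} mW")
    print(f"10% lower supply:      {lower_vdd * 1e3:.2f} mW")
```

Because voltage enters the model squared, a given percentage reduction in supply saves more than the same percentage reduction in capacitance.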

 

Then there are timing-closure challenges. One fundamental timing closure issue is the modeling of physical overcrowding. Among other things, this problem can be addressed by deftly managing layout issues (such as placement congestion, overlapping of arbitrary-shaped components, and routing congestion).

 

K.T. Moore, Group Director in Cadence’s Digital and Signoff Group, said:

“Closure requires a different way of thinking. You have to consider multiple constraints in the closure process with a unified objective function in mind. This is easier said than done because many constraints conflict with each other if you simply address their effects only on the surface.”         

 

In the past, teams relied solely on post-route optimization to salvage setup/hold timing in tough-to-close timing situations. But now we can rely on in-route optimization to bridge timing closure earlier during the routing step itself using track assignment.

 

In addition, opportunities exist to reduce area and gate capacitance in other ways.


The Approaches     

Among several methods, the team explored placement optimization using the GigaPlace engine, available in Cadence Encounter® Digital Implementation System 14.1. GigaPlace places the cells in a timing-driven mode by building up the slack profile of the paths and performing the placement adjustments based on these timing slacks.

 

The team also trained its sights on using in-route optimization for timing optimization to help hit the final frequency target.

 

Lastly, the team introduced the “dynamic power optimization engine” along with the “area reclaim” feature in the post-route stage. These options saved time and cut by nearly half the gap that earlier existed between the actual and desired power target.

 

By the end of this exercise, the team achieved power savings greater than 35% on the logic (excluding constants such as macros and so forth).

 

For complete details, check out the detailed white paper here.

 

Brian Fuller

Related stories:

-- Whitepaper: Pushing the Performance Boundaries of ARM Cortex-M Processors for Future Embedded Design

--Cortex-M7 Launches: Embedded, IoT and Wearables

--New Cortex-M7 Processor Balances Performance, Power

--The new ARM® Cortex®-M7 »

Registration: Xilinx University Program : Workshops Schedule


Venue: Cidade De Goa, Goa, India

 

Date & Time: 8 AM - 4 PM, Tuesday, December 16th 2014

 

The ARM University Program and Xilinx University Program (XUP) will be conducting a faculty workshop around the ARM SoC Design Lab-in-a-Box (LiB). The Lab-in-a-Box is about designing an SoC around the ARM Cortex®-M0 DesignStart™ Processor Core with Peripheral Interfaces using the AHB-Lite Bus. This LiB targets courses such as Design of Digital Systems or Embedded System Design using FPGA and is part of ARM University Program's ongoing commitment to share ARM technology with academia globally. The LiB has been designed with academics in mind and will allow participants to gain first hand experience not only on how to teach the material in their own courses, but also the essentials of hands-on SoC Design. The workshop will cover topics such as: Designing AHB-Lite Compliant Hardware Peripherals such as Memory, UART and GPIO, to name a few; Integrating these Peripherals around the ARM Cortex-M0 core and Implementing the SoC on FPGA.

 

Workshop Agenda

  • Introduction to ARM Cortex-M0 DesignStart Processor Core
  • Overview of AHB-Lite (AMBA 3) Bus Protocol
  • Xilinx Artix-7 FPGA Architecture
  • Xilinx Vivado Design Flow
  • Simple AHB-Lite Peripheral Design and Integration
  • Introduction to UART Peripheral
  • Introduction to Interrupts and CMSIS
  • Integrating UART Peripheral and Interrupt
  • Snake Game Application Demo

 

Preparation

  • Faculty attendees must come with their own laptops running Windows OS and already installed with Keil MDK-ARM. This can be downloaded at Keil MDK-ARM.
  • Each Faculty attendee must individually register at the DesignStart Portal using an Official University Email ID for the "ARM Cortex-M0 DesignStart Processor" IP download instructions and then download the IP to her/his laptop before coming to the workshop.
  • Knowledge of embedded system design and experience in programming microcontrollers will be helpful.
  • Attendees are expected to make their own travel and stay arrangements.

 

If you require any further information please write to university@arm.com. We look forward to seeing you on the 16th of December, 2014!

This week is the 10th year for ARM Techcon, which has evolved into the best place for all things related to ARM technology. I will be attending this year, and giving a presentation on Friday at 3:30 titled “Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models”.

 

Based on the agenda this year, ARMv8 will be one of the primary topics. For the past few years there have been presentations about ARMv8, but it’s clear many people now have hands-on experience and are ready to share it at the conference. To get warmed up for ARM Techcon, I will share a couple of fun facts about 64-bit Linux on ARMv8.


Swap & Play Technology

 

One of the differentiating technologies of Carbon SoC Designer is Swap & Play, which enables a system to be simulated with ARM Fast Models to a breakpoint and saved as a checkpoint. The simulation checkpoint can be restored into a cycle-accurate simulator constructed with models built directly from the RTL of the IP. The most common use case for this technology is running benchmarks on top of the Linux operating system. Swap & Play is attractive for this application because the Linux boot is only a means for setting up the benchmark, and the accuracy is critical to the benchmark results and the system performance analysis. It may seem strange to simulate Linux using cycle accurate models because it requires billions of instructions, but there are times when being able to run Linux benchmarks on accurate hardware models is invaluable. In fact, this is probably required before a chip can complete functional verification.

 

swap n play2 resized 600

 

One of the useful features of the ARM® Cortex®-A50 series processors is the backward compatibility with ARMv7 software. I have had good results running software binaries from A15 designs directly on A53 with no changes. Mobile devices have even started appearing with A53 processors that are running Android in 32-bit mode which have the possibility of upgrading to 64-bit Android in the future.

 

One of the reasons we always focus on the system at Carbon is because today's IP is complex and configurable, and this can lead to integration pitfalls which were not anticipated. Take for instance 64-bit Linux on ARMv8. It would be a reasonable assumption that if a design has A53 cores and is successfully running 32-bit Linux, it should be able to run 64-bit Linux just by changing the software.

 

Below are a couple of fun facts related to migrating from 32-bit Linux to 64-bit Linux on ARMv8 to get warmed up for ARM Techcon 2014.


Generic Timer Usage

 

The A15 and A53 offer a similar set of four Generic Timers. Many multi-cluster A15 designs (and even some single-cluster ones) have used the GIC-400 as an interrupt controller instead of the internal A15 interrupt controller, so the update to A53 seems straightforward: change the CPU and run the same A15 software on the A53 in AArch32 state.

 

It turns out that 32-bit Linux uses the Virtual Generic Timer and 64-bit Linux uses the Non-Secure Physical Timer as the primary Linux timer. From a hardware design view, this probably doesn’t matter much as long as all of the nCNT* signals are connected from the CPU to the GIC, but understanding this difference when doing system debugging or building minimal Linux systems for System Performance Analysis is helpful. As I wrote in previous articles, architects doing System Performance Analysis are typically not interested in the same level of netlist detail that the RTL integration engineer would be performing, so knowing the minimal set of connections between key components in the CPU subsystem needed to run a system benchmark is helpful.

 

Below is a comparison of the Generic Timers in 32-bit and 64-bit Linux. The CNTV registers are used in 32-bit Linux and the CNTP registers are used in 64-bit Linux. The CTL register shows which timer is enabled and the CVAL register having a non-zero value indicates the active timer.

 

timer compare resized 600
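The same check can be expressed in code. This is a minimal sketch, assuming the architectural Generic Timer control register layout in which bit 0 is ENABLE, bit 1 is IMASK, and bit 2 is ISTATUS. The register snapshots are invented values for illustration, not readings from the CPAK.

```python
# Sketch: decide which Generic Timer is in use from CTL/CVAL register snapshots.
# CNTx_CTL layout: bit 0 = ENABLE, bit 1 = IMASK, bit 2 = ISTATUS.
ENABLE = 1 << 0
IMASK = 1 << 1
ISTATUS = 1 << 2


def active_timers(regs):
    """regs maps a timer name ('CNTV', 'CNTP') to a (ctl, cval) pair.

    A timer counts as active when it is enabled and has a non-zero compare value.
    """
    return [
        name
        for name, (ctl, cval) in sorted(regs.items())
        if (ctl & ENABLE) and cval != 0
    ]


if __name__ == "__main__":
    # Hypothetical snapshots: one resembling a 32-bit kernel, one a 64-bit kernel.
    linux32 = {"CNTV": (0x1, 0x12345678), "CNTP": (0x0, 0x0)}
    linux64 = {"CNTV": (0x0, 0x0), "CNTP": (0x1, 0x9ABCDEF0)}
    print("32-bit Linux uses:", active_timers(linux32))
    print("64-bit Linux uses:", active_timers(linux64))
```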


Processor Mode

 

The reason for the different timers is likely because 32-bit Linux runs in supervisor mode in the secure state, and 64-bit Linux runs in normal (non-secure) mode. I first learned about these processor modes when I was experimenting with running kvm on my Samsung Chromebook, which contains the Exynos 5 dual-core A15. I found out that to run a hypervisor like kvm I had to start Linux in the hypervisor mode, and the default configuration is to run in supervisor mode. After some changes to the bootloader setup, I was able to get Linux running in hypervisor mode and run kvm.

 

It may seem like the differences between the various modes are minor and unlikely to make any difference to the system design beyond the processors, but consider the following scenario.

 

Running 32-bit Linux on A53 in AArch32 state runs fine using CCI-400, NIC-400, and GIC-400 combined with some additional peripherals. The exact same system would be expected to run 64-bit Linux without any changes. What if, however, the slave port of the NIC-400 which receives data from the CCI-400 was configured in AMBA Designer for secure access? This is one of the three possible configuration choices for slave ports. Here are the descriptions of AMBA Designer choices for the slave port:

nic 400 slave

 

If secure was selected, the system would run fine with 32-bit Linux, but would fail when running 64-bit Linux because the non-secure transactions from the A53 would be presented as secure transactions to the GIC (because of the NIC-400 configuration) and would result in reading wrong values from GIC registers such as the Interrupt Acknowledge Register (IAR) when trying to determine which peripheral is signaling an interrupt. The result would be a difficult to debug looping behavior in which the kernel is unable to service the proper interrupt. All of this because of a NIC-400 configuration parameter. For more information on the NIC-400 design flow, a recording of the recent Carbon Webinar is available.


Summary

 

As you can see, seemingly minor differences in the processor operating mode between 32-bit and 64-bit Linux can impact IP configuration as well as connections between IP. These are just two small examples of why ARM Techcon 2014 should be an exciting conference as the ARM community shares experiences with ARMv8.

 

Make sure to stop by the Carbon booth for the latest product updates and information about Carbon System Exchange, the new portal for pre-built virtual prototypes.

 

Jason Andrews

That's right, designers! If you'd like to see some stuff for FREE and even get a few free lunches at ARM TechCon 2014, you need to register in advance and use code ARMEXP100 to get a free Expo Pass. With that pass you can attend:

See you there at ARM TechCon 2014.

-phil

Time flies: 10 years ago, ARM rolled the first of what's become the storied line of Cortex-M processors (chart right). The launch of that family couldn't have been timed better, starting as it did just as the world of system design moved toward mobile solutions that were both power and size constrained. Today mobile, wearables and IoT are huge and show no sign of slowing. Consider that during those 10 years, more than 8 billion Cortex-M cores have shipped, more than half of them in the past 18 months, according to Thomas Ensergueix, ARM senior product marketing manager.

 

The latest in the series was announced this week, when ARM rolled the Cortex-M7 with an eye toward that always tricky balance between a design team's need for performance and its power constraints.

The 32-bit device doubles the compute and digital signal processing (DSP) capability of existing (and powerful) ARM-based MCUs while keeping power under control.

Using a Cadence implementation flow, design teams can wring even more power optimization from the M7 and tackle tough parasitic extraction issues along the way. We'll post about that in detail next week during ARM TechCon.


For now, here's the latest on the M7 launch, announced Sept. 24:

 

Targeting

Market applications for the ARM Cortex-M7 include next-generation vehicles, connected devices, and smart homes and factories. Companies including Atmel, Freescale and ST Microelectronics are counted among early licensees.

More information about the new processor is available at ARM.


Next week at ARM TechCon, Cadence's Paddy Mamtora, Product Engineering Group Director, Digital and Signoff Business Unit, will join with ARM Principal Engineer Aditya Bedi Wednesday (Oct. 1) at 4 p.m. to talk about pushing the boundaries of embedded design with Cortex-M.

 

Related Stories:

-- Getting a Glimpse at the Future Early – Cadence & ARM at ARM TechCon 2014!

Ok, with one week to go, I'm getting excited about this year's ARM TechCon!

 

At ARM TechCon, Synopsys, ARM and our mutual customers (e.g., AMD, HiSilicon, Samsung and STMicroelectronics) will be sharing technical content and successful examples of solutions for ARM-based design spanning SoC implementation, verification, prototyping and virtual prototyping. All but one of these sessions require only a free EXPO badge (not a full conference badge), so you can just drop by for one or more as your time permits. We even have two free lunch sessions with excellent technical presentations.

 

NOTE: Please be sure to register first and use code ARMEXP100 to get a FREE expo pass.

 

Wednesday, October 1 – Mission City Ballroom 1

11:00 – 11:50am: Performance Analysis and Optimization of ARM® CoreLink NIC-400 based Systems Using Synopsys Platform Architect

12:00 – 12:50pm: Turbocharge Verification of your ARM-based Systems with Synopsys Hybrid Emulation

1:00-1:50pm: A Processor-Based Approach to Acceleration in Modern SoCs
Lunch will be provided for all attendees

2:00-2:50pm: Integrate Pre-Verified Synopsys IP Subsystems into an ARM-Based SoC in Minutes

3:00-3:50pm: Accelerating Development of Fujitsu Embedded Platform SoC using Synopsys Virtual Prototyping and Galaxy Implementation Solutions

4:00-4:50pm: Efficient hardening of ARM® Cortex®-A57/Cortex-A53 Processor Subsystems in FD-SOI Process Technology with Synopsys Galaxy Platform (presentation by STMicroelectronics)

Thursday, October 2 – Grand Ballroom A

10:30-11:20am: AMD Tapeout of a High-Performance ARM® Cortex-A57® Processor-Based Server SoC using Synopsys Galaxy Design Platform

11:30am -12:20pm: Addressing 16nm FinFET Challenges to Tapeout HiSilicon’s 50M+ Gate ARM® Cortex-A57® Processor-based SoC using Synopsys IC Compiler (learn about the first 16nm FinFET networking processor running up to 2.6 GHz)

12:30-1:20pm: Q&A Panel with AMD, HiSilicon and STMicroelectronics: Achieving Optimum Results on the Latest ARM® Cortex®-A Processor Family with Galaxy Platform
Lunch will be provided for all attendees

1:30 – 2:20pm: Innovation in Debug for ARM-based SoCs Driving Innovation Verification and HW-SW Bring-up

2:30 – 4:20pm: ARM-Samsung-Synopsys: A Simple Formula for Success with Next-Generation Wearables to High-Performance SoCs

Friday, October 3 – Grand Ballroom H

1:30-2:20 pm: Performance Analysis and Verification of an ARM® based SoC Interconnect
**NOTE: This session requires a conference badge to attend

- See more info at: Synopsys at ARM Technology Conference 2014

 

As always, please check out our microsite www.synopsys.com/ARM for more information about Synopsys' optimized solutions for ARM-based design.

 

Oh, yeah, and please root for me in the ARM Step Challenge as I use my ARM-powered Fitbit to go up against my arch rivals, John Heinlein and Brian Fuller. After a hard-fought competition at DAC 2014, I'm ready for the rematch!

http://schedule.armtechcon.com/session/synopsys-addressing-16nm-finfet-challenges-to-tapeout-a-50m-arm-cortex-a57-processor-based-soc-using-synopsys-ic-compiler

Interesting article on Semiconductor Engineering where options other than increasing clock frequency are considered for improving performance and data throughput of advanced node devices:

Semiconductor Engineering .:. Making Chips Run Faster


Our approach is to provide the 'different IP' as suggested in the article. On the advanced nodes, performance optimisation schemes require conditions to be accurately monitored on-chip and within the core. We believe that PVT conditions should be monitored by small analog sensors such as accurate temperature sensors, voltage monitors (core and IO) and process monitors. Quite simply, the more accurately you sense conditions, the more watts can be saved in both the idle (leakage) and active states of a device. Our embedded temperature sensors, for example, have been developed to monitor to a high accuracy for this reason. Once you have the 'gauges' in place you can then play with the 'levers' by implementing Dynamic Voltage and Frequency Scaling (DVFS) schemes or Energy Optimisation Controllers (EOCs), which are able to vary system clock frequencies, supplies and block powering schemes.
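The 'gauges and levers' idea can be illustrated with a toy DVFS policy. Everything in this sketch is hypothetical: the operating points, the temperature thresholds, and the function name are invented for illustration and do not describe any particular sensor or controller product.

```python
# Sketch: a toy DVFS policy that picks an operating point from a temperature reading.
# Hypothetical operating points: (frequency in MHz, supply voltage in V), fastest first.
OPERATING_POINTS = [(1000, 1.00), (800, 0.90), (600, 0.80)]


def select_operating_point(temp_c, throttle_c=85.0, critical_c=105.0):
    """Return the (MHz, V) pair to apply for a given die temperature in Celsius."""
    if temp_c >= critical_c:
        return OPERATING_POINTS[-1]  # hottest case: slowest, lowest-voltage point
    if temp_c >= throttle_c:
        return OPERATING_POINTS[1]   # warm: step down one level
    return OPERATING_POINTS[0]       # cool: run at full speed


if __name__ == "__main__":
    for reading in (50.0, 90.0, 110.0):
        mhz, volts = select_operating_point(reading)
        print(f"{reading:5.1f} C -> {mhz} MHz at {volts:.2f} V")
```

A more accurate sensor lets the thresholds sit closer to the real limits, so less margin has to be given away.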


Again, we believe that these peripheral monitors are nowadays less of a 'nice to have' and more of a critical requirement. With that, these monitors must be reliable and testable in situ, as failing sensors could have a dramatic effect on the system.


Another point is that we're seeing device architectures that cannot cope with each and every block being enabled. With increased gate densities on 28nm bulk and FinFET, hence greater power densities, hence greater thermal dissipation, we're seeing that devices cannot be fully powered up and, at the same time, operate within reasonable power consumption limits.


All these problems of coping with PVT conditions on-chip and the increasing process variability on advanced nodes mean that the challenges, and opportunities of innovation, for implementing more accurate, better distributed embedded sensors and effective Energy Optimisation (EO) schemes are here to stay.

What did you think of Apple’s latest blockbuster product announcements this week, the iPhone 6 and the smart watch? Can’t wait to buy them?

 

I thought the announcement was fascinating but not because I’m running out to buy or pre-order the devices (heck, I've only had my iPhone 5 for six months). It was fascinating because it illuminates a fundamental shift in electronics system design. And, at its heart, the story is about the difference between mammals and insects (more on this in a moment).

 

Two paths

Smart phones and wearables represent two distinctly different ways to design systems. The smart phone architecture, generally speaking, descends from computer systems design: big, powerful, OS-centric, able to manage a multiplicity of tasks. An industry ecosystem has coalesced around these high-volume devices with a standard array of products and services: various processors, RF basebands, memory subsystems, sensor technologies, and so forth.

 

The world of wearables design is completely different. It's a subset of a broadly defined Internet of Things or Internet of Everything sector. The applications in this area are arguably almost infinite in number and wildly diverse. And as such, their technology requirements (their power, performance and area considerations) are just as varied. One size generally fits only one, not all.

 

Mammals and Insects

Cadence IP Group CTO Chris Rowen likens it to mammals versus insects (see comparison chart nearby, from Current Results). In a conversation we had recently, he noted that the smart phone/tablet/PC/server world can be viewed as mammals: a relatively small number of species in the ecosystem functioning as generalists. IoT applications, on the other hand, are more like insects: too numerous to count and having key, highly specialized roles within a larger ecosystem.

 

mammals-v-insects-chart.jpg

This ecosystem requires a holistic approach to system design enablement from IP implementation and verification all the way to tape out. It requires an awareness of what Rowen calls cognitive layering. Oversimplified, this means matching the right processing, power, memory, and software attributes with the right tasks at the right time. We'll be writing more about this in the coming months.

 

shift left

This system-design perspective is one of the drivers behind the "shift left" trend (chart, left) that my colleague Frank Schirrmeister often writes about. Our industry has talked for many years about the need for hardware-software codesign to speed time to market, but today increasing system complexity and diversity require it as well. Not understanding how your hardware design affects your application software (and vice versa) at the earliest stages of your design can be perilous.

 

I know nothing about how Apple or Samsung apportions its design teams and how those teams are traveling along these two distinct paths. But I suspect the teams are different, as are their design approaches to systems and SoCs.

 

Just consider the Apple watch. Apple's promotional materials laid out the challenge: "Massive constraints have a way of inspiring interesting, creative solutions.... No traditional computer architecture could fit within such a confined space." Apple engineers responded with the S1 SoC, a system-in-package device that includes processing and sensing.


You could consider the S1 (what little we know about it) to bridge the mammals-insect worlds, but just imagine the thinking that goes into far more specialized "insect" applications. It's going to be a fascinating future indeed.

 



Related Stories

- Sealing the Seams in System Design

- Q&A with Nimish Modi: Going Beyond Traditional EDA

The processors from ARM® get all of the attention.  After all, ARM partners have shipped over 50 billion ARM processors so far, 10 billion of those in 2013 alone.  With so many processors shipping, you would think that this would be reflected on Carbon's IP Exchange web portal and that ARM's processors would be the most popular IP models created and downloaded.

 

In truth, it's pretty rare that any of ARM's processors top the list of the most popular models generated on Carbon's IP portal.  That title consistently goes to one of ARM's CoreLink interconnect offerings.  Month in and month out, one of the NIC-400, NIC-301 and PL301 interconnect models tops the list.  The comparison is a bit unfair, since the typical architect will try out only a handful of different processor configurations, but it's not at all uncommon for a single user to create dozens of configurations for the system interconnect.  It does, though, reflect the importance that users place on having accurate models for the components in their system that have the greatest impact on overall performance.  (Not surprisingly, the next most commonly created type of component on our portal is a memory controller.)

 

Carbon has blogged a few times about the importance of accurate models for the NIC-400 and NIC-301 for system tasks ranging from IP selection to accurate firmware bringup and debug.  We've discussed how accurate interconnect models enable you to avoid arbitration problems, detect system bottlenecks and meet your price, performance and area targets.  On September 11th at 1pm EDT (17:00 GMT) we'll be holding a webinar together with ARM to talk about the impact of NIC-400 configuration choices on the performance of the system.  We'll see how this performance can be analyzed and optimized not just with a few traffic generators and bare metal software but also with a complete Linux operating system running commercial benchmarks.  Far too often, this type of performance optimization waits until emulation or FPGA prototyping, when it's too late to really have much impact.  We'll demonstrate how you can use Swap & Play to get your system booted quickly and then switch to 100% accurate models to run those important system benchmarks that drive performance decisions.  We'll also show how you can use our CPAKs to get up and running within minutes of download.  The demos will be done using a multi-cluster Cortex-A53 system, but the techniques apply no matter what processor you're using.

 

The webinar will feature sections by William Orme, ARM's product manager for the NIC-400, as well as multiple demonstrations by Eric Sondhi, a corporate applications engineer here at Carbon.  Although the webinar will be available as a recording afterwards, I'd urge you to attend live if possible to ensure that your questions are answered.  You can, of course, always get answers to any questions you have by clicking on the button below.

 

Sign up for the Pre-silicon Optimization of System Designs using the ARM® CoreLink™ NIC-400 Interconnect webinar.

 

Request More Information    Optimization of ARM Cortex-A15 and AMBA4 Designs using a Virtual Prototype    AXI Interconnect Optimization using a Virtual Prototype

An interesting article by Daniel Payne on SemiWiki that approaches the predicted finFET issues through simulation analysis:

SemiWiki - FinFET Design for Power, Noise and Reliability

 

Many of the points raised are of interest to us, as an advanced node development team, and to our customers. Gate density (as is the intention on finFET!) is a significant contributor to thermal issues and IR drop issues. We believe that Moortec Semiconductor's approach complements the analysis from simulation. We provide embedded temperature sensors, voltage and process monitors, essentially 'lifting the lid' on on-chip PVT conditions for advanced node SoCs (Analog IP and Custom Mixed Signal ASIC IC Chip Design Services).

 

In a thermal context, gate density equates to power density and, in turn, localised thermal issues. Only accurate core temperature sensors placed near potential hot spots provide the system with sufficient feedback to implement a dynamic control scheme for clock speed or supply. Schemes such as DVFS are becoming the big application area for on-chip PVT monitors, as you can then optimise performance on a per-chip basis (we prefer to use the term 'Energy Optimisation', which leads to 'Energy Optimisation Controller' schemes being implemented within a system).

 

In terms of IR drop, as the gate density increases and the impedance of metal tracking for supplies increases, together with reduced headroom due to supply reduction, we're seeing a greater problem for advanced nodes. Using on-chip core voltage supply monitors allows chip developers to see what the supply conditions are really like and how they compare to simulation results. In addition, when the data output from these monitors is included at the architectural level of an SoC, the power supplies can be optimised for better performance, or power saving, as required. We can only see demand for such monitors increasing as we move down the technology curve.
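As a rough illustration of comparing monitor readings against simulation, the Python sketch below computes a static IR drop from Ohm's law and then checks a measured core voltage against both the simulated drop and a minimum operating voltage. The function names and every numeric value are hypothetical, chosen only to show the arithmetic; real sign-off analysis also has to consider dynamic (transient) drop.

```python
def ir_drop_mv(current_ma, resistance_mohm):
    """Static IR drop in millivolts (V = I * R).

    mA times mOhm gives microvolts, so divide by 1000 to get mV.
    """
    return current_ma * resistance_mohm / 1000.0


def compare_supply(nominal_mv, simulated_drop_mv, measured_mv, min_mv):
    """Compare an on-chip voltage monitor reading with simulation.

    nominal_mv        -- supply voltage at the regulator (assumed known)
    simulated_drop_mv -- IR drop predicted by simulation
    measured_mv       -- core voltage reported by the on-chip monitor
    min_mv            -- minimum core voltage for correct operation
    """
    measured_drop_mv = nominal_mv - measured_mv
    return {
        "measured_drop_mv": measured_drop_mv,
        "excess_over_sim_mv": measured_drop_mv - simulated_drop_mv,
        "headroom_mv": measured_mv - min_mv,
        "violation": measured_mv < min_mv,
    }
```

For example, 2000 mA through 25 mOhm of supply tracking gives `ir_drop_mv(2000, 25)` of 50.0 mV. If the monitor then reports 830 mV on a 900 mV supply, `compare_supply(900, 50.0, 830, 810)` shows the silicon dropping 20 mV more than simulated, with 20 mV of headroom left before the assumed 810 mV minimum.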
