
SoC Implementation


After ARM and Synopsys jointly announced at ARM TechCon the extended collaboration breadth and depth enabled by our new multi-year IP subscription agreement, I put together an article describing some of the existing design solutions we have in place across the entire design flow (including optimized implementation, verification/debug/emulation/VIP, complementary interface, AMBA interconnect and memories/logic libraries, FPGA and virtual prototyping, etc.).


The best way to learn about most of these collaborative efforts is to hear our mutual customers talking about their successes using them to design creative, state-of-the-art products. The article gives a few examples of these, including several presentations (e.g., AMD, STMicroelectronics, Samsung) that were recorded live at TechCon and are available through www.synopsys.com/ARM, the go-to place for design solutions for your ARM-based products. Look for the "Videos" tab on the right-hand side of the pages on the microsite (and the "more" link at the end of the Videos list to see all videos).


I suggest you start with the collaboration summary article, take a look at the Synopsys-ARM solution microsite, then view a few of the mutual customer (and ARM & Synopsys) videos to get a good idea of what's available for you today. If you have any questions, talk with your Synopsys application consultant.

How to Break Through the 3GHz Barrier - New On-Demand Webinar!

If you missed the live event last week, and are wondering just what you are going to do with all the free time you will have during the holiday break -- don't worry -- the on-demand version of this informative webinar is now available!
Check out Breaking Through 3.0GHz with ARM Cortex-A53 to learn how to balance the need for high-performance with low-power requirements and small area. Find out what methodologies were developed for the ARM Artisan® POP IP-based Cortex®-A53 implementation solution.


This will kill at least an hour before someone pulls out the old 'honey do' list!

I started my design career far too long ago doing system verification on a multi-processor server design.  Basically, I was charged with assembling a model of the system and then writing some tests to exercise it.  This was long before the days of virtual prototypes so I assembled the system in RTL simulation using an LMSI hardware modeler to represent the existing processor and cache components in the system.  When it came time to get software up and running on the system, I started off by writing a few directed tests to run on the processors in the design.  These tests were designed to stress the system but configuring all of the components started to become burdensome so I went to the software team and started borrowing the code that they were writing for the eventual real silicon and got it up and running on my system model.  After spending far too much time figuring out the problems in my verification system (it was my first job after all) I started finding real system problems.  Software driven verification was finding problems that the hardware verification team had missed.  Since this was the first time that software was being run on the real hardware, albeit as a simulation model, we found numerous problems in both the hardware and the software.


Evolving Approaches

A few years later, I migrated from doing design work to working as an applications engineer at Quickturn Systems. I saw firsthand the huge amounts of money and design resources which companies would allocate to assemble a model of the system before silicon.  They were taking a lot of the same approach that I had done in my first job running real software on real hardware.  Instead of assembling the system in software talking to a hardware modeler they were using a cobbled together system with a washing machine sized emulator hooking into a specially designed hardware board with what seemed like miles of spaghetti-like cables in between.  The hardware teams would typically use the systems during the day time hours to do their system verification work with the software teams relegated to nighttime hours for their time on the box.  (After all, emulators were an expensive resource.  It made sense to schedule them for round the clock usage and I spent more than one 2am session in the lab to help keep the boxes running.)  The value was high though.  The interaction of real hardware and real software before silicon accelerated design schedules and found corner case functional and performance issues that would have otherwise made their way into silicon.  It was difficult, it was expensive but for many design teams, it was worth it.


Fast forward a few years to the present day and it looks like the path to do this system level validation is evolving once again.  We’ve seen an increasing number of design teams adopt a validation strategy that involves using system level software to drive the validation of the design long before silicon.  While this has historically been done using the actual system software (getting to that boot prompt well before tapeout is still a typical milestone) many design teams are now crafting software specifically for the purpose of validating their system.  This software can either be targeted software they’ve written themselves or third party verification software from leading companies like Breker Systems.  We recently published an article together with Breker in EETimes which talks about doing precisely this to address the huge problem of coherency validation in the newest generation of ARMv8-based SoC designs.  It goes into a good amount of depth on cache coherency so I'd certainly recommend it if your next design is using hardware coherency.  The CPAK discussed in the article uses two clusters of four-core ARM Cortex-A53 CPUs but it can be easily modified to better represent your actual design.

Breker A53 CPAK


While emulation and FPGAs remain a popular, albeit expensive, way to execute this validation software, an increasing number of teams have been performing this valuable step on virtual prototypes.  Using virtual prototypes together with system software isn’t new of course, that’s been done for a long time.  The latest wrinkle though is the ability to do this software development on a virtual prototype which is actually an accurate representation of the system.  Traditionally, virtual prototypes have been functional models only and have abstracted away the implementation details of the system in order to achieve performance.  Today, it is possible to use virtual prototypes that have both the speed of these high level virtual prototypes (10s to 100s of MIPS) but also still have all of the accuracy of the RTL implementation.  What’s more, many of these systems are already built and have system software already running on them.


Of course, when I say RTL-accurate, alarm bells start going off.  Does this mean I have to debug my software using waveforms?  Am I going to have to learn how to run a hardware simulator?  And of course: Don’t accurate models run too slowly to execute real software?  Thankfully, the answer to all of these questions is no.  Let’s see why.


Getting Started With Software Driven Verification

Most times, the fastest way to get a working model of your system is to take a working model of a system similar to yours and port it to more closely represent your design.  This is the reason that we have so many CPAKs on our System Exchange web portal.  Using the search parameters, you can easily narrow down the lengthy list of pre-built systems (well over 100 at the time of writing) to one that most closely matches your design.  You can even choose the software you want to run, from the simplest bare-metal benchmark to a full Linux boot and OS-level benchmarks.


Once downloaded, the CPAK can be easily customized to mimic your actual design.  You can do this by using models from IP Exchange, RTL models you’ve compiled using Carbon Model Studio or SystemC models.  They can either add to the supplied system or replace existing components.


Your next step depends upon your design needs.  If you want to develop high level software without need for system accuracy you can certainly do so.  Simply use the ARM Fast Model representation of the system.  This enables you to run at Fast Model speeds which are typically in the 10s to 100s of MIPS.  You can even execute in a hybrid configuration if desired, mixing Fast Model components together with Carbonized RTL models.  This is a common use case for components such as GPUs which don’t have Fast Model representations.  The system runs at Fast Model speeds except when accessing the GPU or rendering a frame.  This approach enables a fast OS boot before beginning video operations which then run accurately since the GPU is RTL-accurate.  Bear in mind of course that any hybrid combination of Fast Models and accurate models will not generate the same types of accesses to the GPU as would be seen in a real system.  This is true if this hybrid combination takes place entirely in the virtual world or by tying a virtual prototype to an emulator.  Since Fast Model representations are functional only and don’t attempt to correctly model cycle accuracy this type of an approach is only well-suited for software development and not for system architecture or validation.  For those tasks, only an accurate system model will do.


This brings us back to where we started: using software on an accurate representation of the system to drive validation.  Before we talk more about that though, I should answer the questions I mentioned above about debugging and execution speed.  The debugging question is an easy one.  Although the waveforms are there if you really want to be masochistic, all Carbon models of ARM IP available on our IP Exchange web portal contain an integration with ARM’s DS-5 debugger to enable truly interactive debugging.  This isn’t a post-process “gee, I wish I could change that value but it’s too late now” integration.  It’s one that enables the designer to view and modify the contents of any register or memory location while the program is running.  The entire system also runs in a complete virtual prototype environment so no hardware simulator connection is necessary.  There are no complicated command lines to learn and no extra licenses to check out at runtime.


This of course brings us back to the speed question. After all, functional models run faster than accurate models precisely because they’ve eliminated accuracy.  How can we get the speeds needed to boot an OS or develop application level software and still expect to have accuracy?  This is where Carbon’s Swap & Play technology comes to the rescue.  Our virtual prototypes and CPAKs have the ability to start running using a Fast Model representation of the system and then swap over to 100% accurate models at any software breakpoint. This approach lets you boot your OS in under a minute, far faster than with any emulator or FPGA prototype, and then continue running with the accurate representation.  This enables the tasks that require accuracy such as performance optimization or system validation.  You can even create multiple breakpoints to start running accurately at different points in the system execution.
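To make the speed argument concrete, here is a back-of-the-envelope sketch in Python. The boot instruction count and the cycle-accurate simulation speed are assumptions I picked for illustration, not measured Carbon figures; only the "10s to 100s of MIPS" range for Fast Models comes from the discussion above.

```python
def seconds_to_run(instructions, instructions_per_second):
    """Wall-clock seconds needed to execute a given number of instructions."""
    return instructions / instructions_per_second

# All three figures below are illustrative assumptions, not measurements.
BOOT_INSTRUCTIONS = 2_000_000_000   # assumed size of an OS boot
FAST_MODEL_IPS = 50_000_000         # 50 MIPS, inside the "10s to 100s of MIPS" range
ACCURATE_IPS = 100_000              # assumed speed of a cycle-accurate simulation

fast_boot = seconds_to_run(BOOT_INSTRUCTIONS, FAST_MODEL_IPS)
accurate_boot = seconds_to_run(BOOT_INSTRUCTIONS, ACCURATE_IPS)

print(f"Boot on Fast Models:     {fast_boot:.0f} s")
print(f"Boot on accurate models: {accurate_boot / 3600:.1f} h")
```

Under these assumptions the boot fits comfortably inside a minute on Fast Models and takes hours on accurate models, which is why swapping at a breakpoint after the boot is attractive.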



Software can be a very effective way to verify the behavior of an SoC long before tapeout.  Whether you're using actual system software to do this or leveraging dedicated system verification software from companies like Breker, you have the ability to see true system behavior and fix problems earlier in the design cycle.  Virtual prototypes simplify this task with their ability to offer true interactive debugging of both hardware and software with complete visibility. Carbon's CPAKs offer a great way to further accelerate the development of these systems and let that verification task start earlier in the cycle.


We recently ran a webinar that covered common pitfalls in verification and performance analysis of cache-coherent ARM-based designs. Don't worry if you missed it - you can register to watch the recorded session. Here are the particulars:


Click here to register.


Date: December 3, 2014
Time: 11:00 AM PST
(Note: The event will be recorded; if you register and are not able to attend an e-mail notification will be sent advising you of the event's availability for viewing within 24 hours)

Event Summary:
We will be covering how Verification IP for AMBA enables users to generate correct coherent stimulus for cache coherent SoC verification.

The technical webinar will cover the complexities of configuration, stimulus, coverage and checking. We will highlight and discuss common verification pitfalls, for example:

1) Lack of system checks results in late discovery of coherency issues

2) Lack of time and expertise to create complex scenarios (e.g., concurrent accesses to same cache line, trigger interesting cache transitions, etc.)

3) Insufficient performance analysis due to time pressures

4) Excessive debug time to find root cause of failures

The webinar will include many examples of how users can address these and other pitfalls using simple techniques and advanced VIP.



Neill Mullinger

Product Marketing Manager for Verification IP, Synopsys

Neill Mullinger is a product marketing manager at Synopsys for verification IP. Neill joined Synopsys in 2000 and has been focused on verification IP and protocol verification since 2002. He brings more than 25 years of experience in the hardware and EDA industries as an applications engineer and product manager.



Tushar Mattu

Corporate Application Engineer (CAE) for Verification Group, Synopsys

For more than 10 years, Tushar has been working as a verification solutions engineer at Synopsys. Tushar has been supporting some of Synopsys’ key customers to architect testbenches using best verification practices based on VMM and UVM. Currently, Tushar’s focus is on AMBA Verification IP, and he works closely with VIP users.

03 DECEMBER, 2014

Today, Huawei has introduced a new addition to its own processor family. The Huawei Kirin 620 system-on-a-chip is definitely not a top-of-the-line chipset, but is nonetheless a solid performer, designed for mid-range devices. The CPU has a 64-bit architecture and comes with 8 cores, clocked at 1.2GHz.

It is a 28nm chip, based on the Cortex-A53 with LPDDR3 RAM support. As far as connectivity goes, it will offer GSM / TD-SCDMA / WCDMA / TD-LTE / LTE FDD support, as well as Cat. 4 LTE for speeds up to 150Mbps. The GPU used is a Mali-450 MP4, which is a little dated. Camera support is limited to a 13MP sensor, while video encoding and decoding capabilities can handle up to 1080p resolution at 30Hz.


From the link below you can find more information:

Huawei releases a new octa-core Kirin 620 chipset - GSMArena.com news

The first Carbon Performance Analysis Kit (CPAK) demonstrating the AMBA 5 CHI protocol has been released on Carbon System Exchange. The design features the ARM Cortex-A57 configured for AMBA 5 CHI and the ARM CoreLink CCN-504 Cache Coherent Network. The design is a modest system with a single core running 64-bit bare-metal software with memory and a PL011 UART, but for anybody who digs into the details there is a lot to learn.


Here is a diagram of the system:



AMBA 5 CHI Introduction


Engineers who have been working with ARM IP for some time will quickly realize AMBA 5 CHI is not an extension of any previous AMBA specifications. AMBA 5 CHI is both more and less complex compared to AMBA 4. CHI is more complex at the protocol layer, but less complex at the physical layer. AXI and ACE use Masters and Slaves, but CHI uses Request Nodes, Home Nodes, Slave Nodes, and Miscellaneous Nodes. All of these nodes are referenced using shorthand abbreviations as shown in the table below.
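For readers who want the abbreviations in a form they can paste into a script, here is a small Python sketch. The descriptions are my own summary of the node types as I understand them from the AMBA 5 CHI specification, not spec wording, and the helper name `describe` is hypothetical.

```python
# Node type abbreviations used by AMBA 5 CHI (summary wording, not spec text).
CHI_NODE_TYPES = {
    "RN-F": "Fully coherent Request Node",
    "RN-D": "I/O coherent Request Node with DVM support",
    "RN-I": "I/O coherent Request Node",
    "HN-F": "Fully coherent Home Node",
    "HN-I": "I/O (non-coherent) Home Node",
    "MN":   "Miscellaneous Node",
    "SN-F": "Slave Node for normal memory, paired with HN-F",
    "SN-I": "Slave Node for peripherals or normal memory, paired with HN-I",
}

def describe(abbreviation):
    """Look up a CHI node abbreviation, ignoring case."""
    return CHI_NODE_TYPES.get(abbreviation.upper(), "unknown node type")

for name in ("RN-F", "HN-F", "SN-F"):
    print(f"{name}: {describe(name)}")
```

In the CPAK described here, the A57 plays the RN-F role and the memory ports on the CCN-504 are SN-F interfaces.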



Building the A57 with CHI


The latest r1p3 A57 is now available on Carbon IP Exchange. CHI can be selected as the external memory interface. The relevant section from the IP Exchange configuration form is shown below.




The CHI memory interface relies on the System Address Map (SAM) signals. All of the A57 input signals starting with SAM* are important in constructing a working system. These values are available as parameters on the A57 model, and are configured appropriately in the CPAK to work with the CCN-504.


Configuring the CCN-504

The CCN-504 Cache Coherent Network provides the connection between the A57 and memory. The CPAK uses two SN-F interfaces since support for dual memory controllers is one of the key features of the IP. A similar set of SAM* parameters is available on the CCN-504 to configure the system address map. Like other ARM IP, the CCN uses the concept of PERIPHBASE to set the address of the internal, software programmable registers.


Programming Highlights


The CCN-504 includes an integrated level 3 cache. The CPAK demonstrates the use of the L3 cache.

The CPAK startup assembly code also demonstrates other CCN-504 configuration, including how to set up barrier termination, load node ID lists, program system address map control registers, and more.

AMBA 5 CHI Waveforms


One of the best ways to start learning about AMBA 5 CHI is looking at the waveforms between the A57 and the CCN-504. The latest SoC Designer 7.15.5 supports CHI waveforms and displays Flits, the basic unit of transfer in the AMBA 5 CHI link layer.




A new CPAK by Carbon Design Systems running 64-bit bare-metal software on the Cortex-A57 processor with CHI memory interface connected to the CCN-504 and memory is now available. It demonstrates the AMBA 5 CHI protocol, serves as a starting point for optimization of CCN-based systems, and is a valuable learning tool for projects considering AMBA 5 CHI.

My colleague, Tom De Schutter, wrote a good blog about a recent accomplishment of the Synopsys Press book "Better Software. Faster!" -- more than 3,000 copies in distribution to designers in more than 1,000 companies. The success of the book highlights the interest in using virtual prototyping as a key methodology to "shift left" product development.


You can download a free Better Software. Faster! eBook in English or Chinese by using either your SolvNet ID or email address. The Japanese edition is underway as well, so stay tuned for that.


The book, which includes case studies from thirteen companies, including one written by Rob Kaye of ARM, dives deep into virtual prototyping as the key methodology to enable concurrent hardware/software development by decoupling the dependency of the software development from hardware availability.


At ARM TechCon 2014, Nguyen Le, a Principal Design Verification Engineer in the Interactive Entertainment Business Unit at Microsoft Corp. documented a real world case study using a formal app and verification management tools to achieve his code coverage goals significantly faster.

Specifically, in the paper titled “Advanced Verification Management and Coverage Closure Techniques”, Nguyen outlined his initial pain in verification management and improving coverage closure metrics, and how he conquered both these challenges – speeding up his regression run time by 3x, while simultaneously moving the overall coverage needle up to 97%, and saving 4 man-months in the process. The following article reports the highlights of his presentation and paper:

ARM® Techcon Paper Report: How Microsoft Saved 4 Man-Months Meeting Their Coverage Closure Goals Using Automated Verific…

I’m excited to introduce the most complex Carbon Performance Analysis Kit (CPAK) created by Carbon: an 8-core ARM Cortex-A53 system running 64-bit Linux with full Swap & Play support. This is also the first dual-cluster Linux CPAK available on Carbon System Exchange. It’s an important milestone for Carbon and for SoC Designer users because it enables system performance analysis for 64-bit multi-core Linux applications.


Here are the highlights of the system:

  • Dual-cluster, quad-core Cortex-A53 for a total of 8 cores
  • ARM CoreLink CCI-400 providing coherency between clusters
  • Fully configured GIC-400 interrupt controller delivering interrupts to all cores
  • New Global System Counter connected to A53 Generic Timers


Here is a diagram of the system.


The design also supports fully automatic mapping to ARM Fast Models.


I would like to introduce some of the new functionality in this CPAK.


Dual Cluster System

The Cortex-A53 model supports the CLUSTERIDAFF inputs to set the Cluster ID. This value shows up for software in the MPIDR register. Values of 0 and 1 are used for each cluster, and each cluster has four cores. This means that CPU 3 in Cluster 1 has an MPIDR value of 0x80000103 as shown in the screenshot below.
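The MPIDR value quoted above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the usual layout for this kind of system: the CPU number in bits [7:0], the cluster ID in bits [15:8], and bit 31 reading as one. The function names are hypothetical.

```python
RES1_BIT31 = 1 << 31  # bit 31 of MPIDR reads as one

def mpidr(cluster_id, cpu_id):
    """Build the low 32 bits of MPIDR from a cluster ID and a CPU number."""
    return RES1_BIT31 | ((cluster_id & 0xFF) << 8) | (cpu_id & 0xFF)

def decode(value):
    """Split an MPIDR value back into its cluster and CPU fields."""
    return {"cluster": (value >> 8) & 0xFF, "cpu": value & 0xFF}

# Print the value software would see on every core of a 2 x 4 system.
for cluster in range(2):
    for cpu in range(4):
        print(f"cluster {cluster} cpu {cpu}: {mpidr(cluster, cpu):#010x}")
```

The last line printed is for CPU 3 in Cluster 1 and matches the 0x80000103 value mentioned above.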



Global System Counter


Another requirement for a multi-cluster system is the use of a Global System Counter. A new model is now available in SoC Designer which is connected to the CNTVALUEB input of each A53. This ensures that the Generic Timer in each processor has the same counter values for software, even when the frequency of the processors may be different. This model also enables Swap & Play systems to work correctly by saving the counter value from the Fast Model simulation and restoring it in the Cycle Accurate simulation.


Generic Timer to GIC Connections

To create a multi-cluster system the GIC-400 is used as the interrupt controller, and the A53 Generic Timers are used as the system timers. This requires the connection of the Generic Timer signals from the A53 to the GIC-400. All of these signals start with nCNT and are wired to the GIC. When a Generic Timer generates an interrupt it leaves the CPU by way of the appropriate nCNT signal, goes to the GIC, and then back to the CPU using the appropriate nIRQ signal.


As I wrote in my ARM Techcon Blog, 64-bit Linux uses nCNTPNSIRQ, but all signals are connected for completeness.
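The round trip from timer to GIC and back can be written down as a small lookup in Python. The interrupt IDs here are the conventional private peripheral interrupt numbers used for the ARM Generic Timer in most ARMv8 platforms; I am assuming the CPAK follows that convention, and the path strings are made-up labels rather than real SoC Designer port names.

```python
# Conventional GIC interrupt IDs for the Generic Timer outputs (assumed here).
TIMER_INTERRUPT_ID = {
    "nCNTHPIRQ": 26,   # hypervisor physical timer
    "nCNTVIRQ": 27,    # virtual timer
    "nCNTPSIRQ": 29,   # secure physical timer
    "nCNTPNSIRQ": 30,  # non-secure physical timer, the one 64-bit Linux uses
}

def interrupt_path(core, signal):
    """Return the hops a timer interrupt takes: CPU -> GIC -> same CPU."""
    irq_id = TIMER_INTERRUPT_ID[signal]
    return [
        f"cpu{core}.{signal}",           # leaves the CPU on the timer output
        f"gic400.cpu{core}.id{irq_id}",  # arrives at the GIC as a private interrupt
        f"cpu{core}.nIRQ",               # returns to the CPU on nIRQ
    ]

print(" -> ".join(interrupt_path(3, "nCNTPNSIRQ")))
```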


Event Connections


Additional signals which fall into the category of power management and connect between the two clusters are EVENTI and EVENTO. These signals are used for event communication using the WFE (wait for event) and SEV (send event) instructions. For a single cluster system all of the communication happens inside the processor, but for the multi-cluster system these signals must be connected.

WFE and SEV communication is used during the Linux boot. All 7 of the secondary cores execute a WFE and wait until the primary core wakes them up using the SEV instruction at the appropriate time. If the EVENTI and EVENTO signals are not connected the secondary cores will not wake up and run.
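The failure mode is easy to picture with a software analogy. The Python sketch below uses threads in place of cores and a `threading.Event` in place of the EVENTI/EVENTO wiring. It is only an analogy: real WFE has no timeout (the one here just lets the sketch finish), and none of these names come from the CPAK.

```python
import threading

def simulate_boot(num_cores=8, events_connected=True, timeout_s=0.2):
    """Return the sorted list of core numbers that end up running."""
    wake = threading.Event()   # stands in for the EVENTO -> EVENTI wiring
    running = [0]              # core 0 is the primary and always runs
    lock = threading.Lock()

    def secondary(core):
        # WFE: park until an event arrives (or give up after timeout_s)
        if wake.wait(timeout_s):
            with lock:
                running.append(core)

    threads = [threading.Thread(target=secondary, args=(core,))
               for core in range(1, num_cores)]
    for t in threads:
        t.start()
    if events_connected:
        wake.set()             # SEV: the primary core wakes everyone
    for t in threads:
        t.join()
    return sorted(running)

print("connected:    ", simulate_boot())
print("not connected:", simulate_boot(events_connected=False))
```

With the event wired up all eight cores report in; without it only core 0 does, which is the software view of the missing EVENTI and EVENTO connections.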


Boot Wrapper Modifications


The good news is that all of the software used in the 8-core CPAK is easily downloadable in source code format. A small boot wrapper is used to take care of starting the cores and doing a minimal amount of hardware configuration that Linux assumes to be already done. Sometimes there is additional hardware programming that is needed for proper cycle accurate operation that is not needed in a Fast Model system. These are similar to issues I covered in another article titled Sometimes Hardware Details Matter in ARM Embedded Systems Programming.


SMP Enable


Although not specific to multi-cluster, the A53 contains a bit in the CPUECTLR register named SMPEN which must be set to 1 to enable hardware management of data coherency with the other cores in the cluster. Initially, this was not set in the boot wrapper from kernel.org and the Linux kernel assumes it is already done so it was added to the boot wrapper during development.
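The change itself is a single read-modify-write. Here is a sketch of the bit manipulation in Python; I am assuming SMPEN sits at bit 6 of CPUECTLR, which is my reading of the Cortex-A53 documentation, so check the TRM for your revision. The real boot wrapper does this in assembly with system register accesses, which are not shown.

```python
SMPEN = 1 << 6  # assumed position of the SMPEN bit in CPUECTLR

def enable_smp(cpuectlr):
    """Return the register value with SMPEN set, leaving other bits alone."""
    return cpuectlr | SMPEN

def smp_enabled(cpuectlr):
    """True if hardware coherency management is switched on."""
    return bool(cpuectlr & SMPEN)

reset_value = 0x0  # illustrative starting value, not the documented reset value
print(f"before: {reset_value:#x}, SMP enabled: {smp_enabled(reset_value)}")
new_value = enable_smp(reset_value)
print(f"after:  {new_value:#x}, SMP enabled: {smp_enabled(new_value)}")
```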


CCI Snoop Configuration


Another hardware programming task which is assumed by the Linux kernel is the enabling of snoop requests and responses between the clusters. The Snoop Control Register for each CCI-400 slave port is set to 0xc0000003 to enable coherency. This was also added to the boot wrapper during development of the CPAK.
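It helps to see which bits that value actually contains. The Python sketch below decodes it using the Snoop Control Register layout as I recall it from the CCI-400 documentation: two enable bits at the bottom and two capability bits at the top. Treat the field names as a paraphrase and verify them against the TRM.

```python
# Snoop Control Register fields (paraphrased; bit positions are an assumption).
SNOOP_CTRL_FIELDS = {
    0: "snoop requests enabled",
    1: "DVM message requests enabled",
    30: "interface supports snoops",
    31: "interface supports DVM messages",
}

def decode_snoop_ctrl(value):
    """List the named fields that are set in a Snoop Control Register value."""
    return [name for bit, name in SNOOP_CTRL_FIELDS.items() if value & (1 << bit)]

for field in decode_snoop_ctrl(0xC0000003):
    print(field)
```

All four fields are set in 0xc0000003: the interface advertises snoop and DVM support, and both are enabled.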

The gaps between the boot wrapper functionality and Linux assumptions are somewhat expected since the boot wrapper was developed for ARM Fast Models and these details are not needed to run Linux on Fast Models, but nevertheless they make it challenging to create a functioning cycle accurate system. These changes are provided as a patch file in the CPAK so they can be easily applied to the original source code.


CPAK Contents


The CPAK comes with an application note which covers the construction of the Linux image.


The following items are configured to match the minimal hardware system design, and can be extended as the hardware design is modified.

  • File System: Custom file system configured and created using Buildroot
  • Kernel Image: Linux 3.14.0 configured to use the minimal hardware
  • Device Tree Blob:  Based on Versatile Express device tree for ARM Fast Models
  • Boot Wrapper: Small assembly boot wrapper available from kernel.org


A single executable file (.axf file) containing all of the above items is compiled. This file contains all of the artifacts and is a single image that is loaded and executed in SoC Designer.

One of the amazing things is there are no kernel source code changes required. It demonstrates how far Linux has come in the ARM world and the flexibility it now has in supporting a wide variety of hardware configurations.



An octa-core A53 Linux CPAK is now available which supports Swap & Play. The ability to boot the Linux kernel using Fast Models and migrate the simulation to cycle accurate execution enables system performance analysis for 64-bit multi-core systems running Linux applications.


Also, make sure to check out the other new CPAKs for 32-bit and 64-bit Linux for Cortex-A53 now available on Carbon System Exchange.


The “Brought up 8 CPUs” message below tells it all. A number of 64-bit Linux applications are provided in the file system, but users can easily add their favorite programs and run them by following the instructions in the app note.




Join industry experts and a handful of ARM Partners at the Winchester Mystery House in San Jose, California (great venue for the subject) on Tuesday, October 14th as they unravel the strange and wonderful secrets of semiconductor intellectual property at Unlock the Mystery of IP. This free one-day conference will address cutting-edge semiconductor technology, market trends and projections, and challenges facing players in the IP industry.

Jim Feldhan, President of Semico Research Corporation, will ground the day's programming with two data-rich keynote presentations. Throughout the day, speakers will share 30-minute “deep-tech” presentations on today’s cutting-edge IP products to equip attendees with the knowledge they need to make informed decisions for their next design projects. Then finally, there will be two panel discussions that will address today’s hottest topics: the Internet of Things (IoT) and IP subsystems.


  • "IP Subsystems: Build or Buy?"
    • Moderator:
      • Gabe Moretti, Extension Media
    • Panelists:

As if it couldn't get any better, a hosted bar and hors d'oeuvres networking reception will conclude the action-packed day. To register and to view the complete agenda, visit the IPextreme website.

The ARM Cortex-M7 processor is out, developed to address digital signal control markets that demand a blend of control and signal processing capabilities. The ARM Cortex-M7 has been designed with a variety of highly efficient signal-processing features to address the design demands of market applications such as next-generation vehicles, connected devices, and smart homes and factories.


In many of these end markets, engineering teams demand:

  • Maximum performance within power budgets
  • Maximum power savings targeting a given frequency


These are significant challenges to address, so how do we deal with them?

(Cadence recently published a white paper that details the challenges and some solutions. It described ways in which Cadence and ARM worked to optimize power and timing closure in the ARM Cortex-M7.)


We start by identifying and confronting the issues. Let’s take dynamic power, for example. Dynamic power is the largest component of total chip power consumption (the other components are short-circuit power and leakage power). It can quickly become a design challenge in leading designs.
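To see why dynamic power gets so much attention, it helps to plug numbers into the standard first-order switching power relation, P = a * C * V^2 * f. The Python sketch below uses that textbook formula; every number in it is made up for illustration and none of them come from the Cortex-M7 work or the white paper.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order switching power in watts: activity * C * V^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz

# Illustrative values only.
baseline = dynamic_power(activity=0.2, capacitance_f=1e-9, vdd_v=0.9, freq_hz=400e6)

# Trimming switched capacitance by 10% saves 10% of dynamic power ...
less_cap = dynamic_power(activity=0.2, capacitance_f=0.9e-9, vdd_v=0.9, freq_hz=400e6)

# ... while a 10% lower supply voltage saves 19%, because of the V^2 term.
less_vdd = dynamic_power(activity=0.2, capacitance_f=1e-9, vdd_v=0.81, freq_hz=400e6)

print(f"baseline:         {baseline * 1e3:.1f} mW")
print(f"10% less C:       {less_cap * 1e3:.1f} mW")
print(f"10% lower supply: {less_vdd * 1e3:.1f} mW")
```

The formula also shows why area and gate capacitance reductions translate directly into power savings at a fixed frequency and voltage.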


Then there are timing-closure challenges. One fundamental timing closure issue is the modeling of physical overcrowding. Among other things, this problem can be addressed by deftly managing layout issues (such as placement congestion, overlapping of arbitrary-shaped components, and routing congestion).


K.T. Moore, Group Director in Cadence’s Digital and Signoff Group, said:

“Closure requires a different way of thinking. You have to consider multiple constraints in the closure process with a unified objective function in mind. This is easier said than done because many constraints conflict with each other if you simply address their effects only on the surface.”         


In the past, teams relied solely on post-route optimization to salvage setup/hold timing in tough-to-close timing situations. But now we can rely on in-route optimization to bridge timing closure earlier during the routing step itself using track assignment.


In addition, opportunities exist to reduce area and gate capacitance in other ways.

The Approaches     

Among several methods, the team explored placement optimization using the GigaPlace engine, available in Cadence Encounter® Digital Implementation System 14.1. GigaPlace places the cells in a timing-driven mode by building up the slack profile of the paths and performing the placement adjustments based on these timing slacks.
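For readers less familiar with the term, slack is simply the required arrival time minus the actual arrival time on a path, and a negative value means the path misses timing. The Python sketch below ranks a few paths by slack. It illustrates the concept only and makes no claim about how GigaPlace works internally; the path names and delays are invented.

```python
def slack_ns(required_ns, arrival_ns):
    """Timing slack: positive means the path meets timing, negative means it fails."""
    return required_ns - arrival_ns

# Hypothetical paths: name -> (required time, arrival time) in nanoseconds.
PATHS = {
    "alu_to_regfile": (2.5, 3.0),
    "fetch_to_decode": (2.5, 2.0),
    "lsu_to_cache": (2.5, 2.75),
}

def worst_first(paths):
    """Order path names from the most negative slack to the most positive."""
    return sorted(paths, key=lambda name: slack_ns(*paths[name]))

for name in worst_first(PATHS):
    print(f"{name}: slack {slack_ns(*PATHS[name]):+.2f} ns")
```

A timing-driven placer would concentrate its effort on the paths at the top of a list like this.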


The team also trained its sights on using in-route optimization for timing optimization to help hit the final frequency target.


Lastly, the team introduced the “dynamic power optimization engine” along with the “area reclaim” feature in the post-route stage. These options saved time and cut by nearly half the gap that earlier existed between the actual and desired power target.


By the end of this exercise, the team achieved power savings greater than 35% on the logic (excluding constants such as macros and so forth).


For complete details, check out the detailed white paper here.


Brian Fuller

Related stories:

-- Whitepaper: Pushing the Performance Boundaries of ARM Cortex-M Processors for Future Embedded Design

-- Cortex-M7 Launches: Embedded, IoT and Wearables

-- New Cortex-M7 Processor Balances Performance, Power

-- The new ARM® Cortex®-M7

Registration: Xilinx University Program : Workshops Schedule

Venue: Cidade De Goa, Goa, India


Date & Time: 8 AM - 4 PM, Tuesday, December 16th 2014


The ARM University Program and Xilinx University Program (XUP) will be conducting a faculty workshop around the ARM SoC Design Lab-in-a-Box (LiB). The Lab-in-a-Box is about designing an SoC around the ARM Cortex®-M0 DesignStart™ Processor Core with Peripheral Interfaces using the AHB-Lite Bus. This LiB targets courses such as Design of Digital Systems or Embedded System Design using FPGA and is part of ARM University Program's ongoing commitment to share ARM technology with academia globally. The LiB has been designed with academics in mind and will allow participants to gain first hand experience not only on how to teach the material in their own courses, but also the essentials of hands-on SoC Design. The workshop will cover topics such as: Designing AHB-Lite Compliant Hardware Peripherals such as Memory, UART and GPIO, to name a few; Integrating these Peripherals around the ARM Cortex-M0 core and Implementing the SoC on FPGA.


Workshop Agenda

  • Introduction to ARM Cortex-M0 DesignStart Processor Core
  • Overview of AHB-Lite (AMBA 3) Bus Protocol
  • Xilinx Artix-7 FPGA Architecture
  • Xilinx Vivado Design Flow
  • Simple AHB-Lite Peripheral Design and Integration
  • Introduction to UART Peripheral
  • Introduction to Interrupts and CMSIS
  • Integrating UART Peripheral and Interrupt
  • Snake Game Application Demo
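As a rough illustration of the peripheral integration step in the agenda, the sketch below models in Python how an AHB-Lite address decoder picks one slave per transfer. The memory map, the region names, and the `decode` helper are hypothetical and are not taken from the Lab-in-a-Box material; a real design implements this logic in RTL.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str
    base: int
    size: int


# Hypothetical memory map, chosen only for illustration.
MEMORY_MAP = [
    Region("MEM", 0x0000_0000, 0x0001_0000),
    Region("UART", 0x5100_0000, 0x0000_1000),
    Region("GPIO", 0x5300_0000, 0x0000_1000),
]


def decode(haddr: int) -> Optional[str]:
    """Return the slave whose select line would be asserted for HADDR."""
    for region in MEMORY_MAP:
        if region.base <= haddr < region.base + region.size:
            return region.name
    # Unmapped address: a real system routes this to a default slave.
    return None
```

Adding a peripheral then amounts to adding one entry to the map, which mirrors how a new slave gets its own select line from the decoder.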



  • Faculty attendees must bring their own laptops running Windows OS with Keil MDK-ARM already installed. This can be downloaded at Keil MDK-ARM.
  • Each Faculty attendee must individually register at the DesignStart Portal using an Official University Email ID for the "ARM Cortex-M0 DesignStart Processor" IP download instructions and then download the IP to her/his laptop before coming to the workshop.
  • Knowledge of embedded system design and experience in programming microcontrollers will be helpful.
  • Attendees are expected to make their own travel and stay arrangements.


If you require any further information please write to university@arm.com. We look forward to seeing you on the 16th of December, 2014!

This week marks the 10th year of ARM TechCon, which has evolved into the best place for all things related to ARM technology. I will be attending this year, and giving a presentation on Friday at 3:30 titled “Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models”.


Based on the agenda this year, ARMv8 will be one of the primary topics. For the past few years there have been presentations about ARMv8, but it’s clear many people now have hands-on experience and are ready to share it at the conference. To get warmed up for ARM TechCon, I will share a couple of fun facts about 64-bit Linux on ARMv8.

Swap & Play Technology


One of the differentiating technologies of Carbon SoC Designer is Swap & Play, which enables a system to be simulated with ARM Fast Models to a breakpoint and saved as a checkpoint. The simulation checkpoint can be restored into a cycle-accurate simulator constructed with models built directly from the RTL of the IP. The most common use case for this technology is running benchmarks on top of the Linux operating system. Swap & Play is attractive for this application because the Linux boot is only a means of setting up the benchmark, while accuracy is critical to the benchmark results and the system performance analysis. It may seem strange to simulate Linux using cycle-accurate models because it requires billions of instructions, but there are times when being able to run Linux benchmarks on accurate hardware models is invaluable. In fact, this is probably required before a chip can complete functional verification.


[Figure: Swap & Play]
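The idea can be sketched in a few lines of Python. This is a toy model only, not the Carbon SoC Designer or Fast Models API: the class names, the tiny "program" format, and the fixed cost of three cycles per instruction are all assumptions made for illustration. The point is simply that the fast model carries the architectural state to a breakpoint, and the cycle-accurate model resumes from a copy of that state while counting cycles.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ArchState:
    """Architectural state shared by both models (hypothetical, minimal)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0, 0, 0, 0])


def step(state, program):
    """Execute one toy instruction: (register index, value to add)."""
    reg, value = program[state.pc]
    state.regs[reg] += value
    state.pc += 1


class FastModel:
    """Functional model: executes instructions but tracks no timing."""

    def __init__(self):
        self.state = ArchState()

    def run_until(self, program, breakpoint_pc):
        while self.state.pc < breakpoint_pc:
            step(self.state, program)

    def save_checkpoint(self):
        return copy.deepcopy(self.state)


class CycleAccurateModel:
    """Stand-in for an RTL-derived model: same behavior, plus a cycle count."""

    CYCLES_PER_INSTRUCTION = 3  # placeholder for real timing

    def __init__(self, checkpoint):
        self.state = copy.deepcopy(checkpoint)
        self.cycles = 0

    def run(self, program):
        while self.state.pc < len(program):
            step(self.state, program)
            self.cycles += self.CYCLES_PER_INSTRUCTION
```

In use, the "boot" portion of the program runs on `FastModel` up to the breakpoint, `save_checkpoint()` captures the state, and `CycleAccurateModel` is constructed from that checkpoint so that only the benchmark portion is run with timing.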


One of the useful features of the ARM® Cortex®-A50 series processors is the backward compatibility with ARMv7 software. I have had good results running software binaries from A15 designs directly on A53 with no changes. Mobile devices with A53 processors have even started appearing that run Android in 32-bit mode, with the possibility of upgrading to 64-bit Android in the future.


One of the reasons we always focus on the system at Carbon is that today's IP is complex and configurable, and this can lead to integration pitfalls that were not anticipated. Take for instance 64-bit Linux on ARMv8. It would be a reasonable assumption that if a design has A53 cores and is successfully running 32-bit Linux, it should be able to run 64-bit Linux just by changing the software.


Below are a couple of fun facts related to migrating from 32-bit Linux to 64-bit Linux on ARMv8 to get warmed up for ARM TechCon 2014.

Generic Timer Usage


The A15 and A53 offer a similar set of four Generic Timers. Many multi-cluster A15 designs (and even some single-cluster ones) have used the GIC-400 as an interrupt controller instead of the internal A15 interrupt controller, so the update to A53 seems straightforward: change the CPU and run the same A15 software on the A53 in AArch32 state.


It turns out that 32-bit Linux uses the Virtual Generic Timer and 64-bit Linux uses the Non-Secure Physical Timer as the primary Linux timer. From a hardware design view, this probably doesn’t matter much as long as all of the nCNT* signals are connected from the CPU to the GIC, but understanding this difference is helpful when doing system debugging or building minimal Linux systems for System Performance Analysis. As I wrote in previous articles, architects doing System Performance Analysis are typically not interested in the same level of netlist detail that the RTL integration engineer works with, so it helps to know the minimal set of connections between key components in the CPU subsystem needed to run a system benchmark.
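For a minimal system, a quick check like the following Python sketch can flag a missing timer connection before a long simulation run. The timer choice per kernel comes from the paragraph above; the `linux32`/`linux64` keys and the helper names are hypothetical, and the interrupt IDs are an assumption based on the usual ARM private peripheral interrupt assignment, so verify them against your own GIC integration.

```python
# Primary Linux timer per kernel flavor, as described in the text above.
PRIMARY_TIMER = {
    "linux32": "virtual",
    "linux64": "nonsecure_physical",
}

# CPU timer output signal and the interrupt ID it conventionally drives.
# The IDs are an assumption (typical PPI assignment), not from the article.
TIMER_IRQ = {
    "hypervisor": ("nCNTHPIRQ", 26),
    "virtual": ("nCNTVIRQ", 27),
    "secure_physical": ("nCNTPSIRQ", 29),
    "nonsecure_physical": ("nCNTPNSIRQ", 30),
}


def required_timer_irq(kernel):
    """Return (signal name, interrupt ID) the given kernel depends on."""
    return TIMER_IRQ[PRIMARY_TIMER[kernel]]


def missing_connections(kernel, connected_signals):
    """List timer signals the kernel needs that are not wired to the GIC."""
    signal, _ = required_timer_irq(kernel)
    return [] if signal in connected_signals else [signal]
```

A system that only wires `nCNTVIRQ` would pass this check for 32-bit Linux and fail it for 64-bit Linux, which is exactly the kind of difference that is easy to miss in a stripped-down performance model.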


Below is a comparison of the Generic Timers in 32-bit and 64-bit Linux. The CNTV registers are used in 32-bit Linux and the CNTP registers are used in 64-bit Linux. The CTL register shows which timer is enabled, and a non-zero value in the CVAL register indicates the active timer.


[Figure: Generic Timer register comparison between 32-bit and 64-bit Linux]
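When reading these registers from a debugger or a simulation dump, a small helper makes the comparison less error-prone. The Python sketch below decodes the control register using the Generic Timer bit layout (bit 0 ENABLE, bit 1 IMASK, bit 2 ISTATUS) and applies the rule of thumb from the paragraph above. The function names and the sample register values in the usage note are hypothetical.

```python
ENABLE = 1 << 0   # timer enabled
IMASK = 1 << 1    # timer interrupt masked
ISTATUS = 1 << 2  # timer condition met


def decode_ctl(ctl):
    """Split a CNTx_CTL value into its three status fields."""
    return {
        "enabled": bool(ctl & ENABLE),
        "masked": bool(ctl & IMASK),
        "condition_met": bool(ctl & ISTATUS),
    }


def active_timers(regs):
    """regs maps a timer name to (CTL, CVAL); return the timers in use.

    A timer counts as active when it is enabled and has a non-zero
    compare value, following the rule of thumb in the text.
    """
    return [
        name
        for name, (ctl, cval) in regs.items()
        if (ctl & ENABLE) and cval != 0
    ]
```

For example, passing `{"CNTV": (0x5, 0x1234), "CNTP": (0x0, 0x0)}` identifies `CNTV` as the active timer, which is what you would expect to see on a 32-bit Linux system.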

Processor Mode


The different timers are likely used because 32-bit Linux runs in supervisor mode in the secure state, and 64-bit Linux runs in normal (non-secure) mode. I first learned about these processor modes when I was experimenting with running KVM on my Samsung Chromebook, which contains the Exynos 5 dual-core A15. I found out that to run a hypervisor like KVM I had to start Linux in hypervisor mode, whereas the default configuration starts it in supervisor mode. After some changes to the bootloader setup, I was able to get Linux running in hypervisor mode and run KVM.
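When checking which mode a 32-bit kernel actually started in, the mode can be read from the low five bits of the CPSR. The Python sketch below decodes that field using the AArch32 mode encodings; the `cpsr_mode` helper name and the sample CPSR values in the usage note are hypothetical.

```python
# AArch32 CPSR.M[4:0] mode encodings.
CPSR_MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10110: "Monitor",
    0b10111: "Abort",
    0b11010: "Hypervisor",
    0b11011: "Undefined",
    0b11111: "System",
}


def cpsr_mode(cpsr):
    """Return the processor mode name encoded in a CPSR value."""
    return CPSR_MODES.get(cpsr & 0x1F, "Reserved")
```

A CPSR of `0x600001D3` decodes to Supervisor, while a value ending in `0x1A` in the mode field, such as `0x000001DA`, decodes to Hypervisor.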


It may seem like the differences between the various modes are minor and unlikely to make any difference to the system design beyond the processors, but consider the following scenario.


Running 32-bit Linux on A53 in AArch32 state runs fine using CCI-400, NIC-400, and GIC-400 combined with some additional peripherals. The exact same system would be expected to run 64-bit Linux without any changes. What if, however, the slave port of the NIC-400 which receives data from the CCI-400 was configured in AMBA Designer for secure access? This is one of the three possible configuration choices for slave ports. Here are the descriptions of AMBA Designer choices for the slave port:

[Figure: AMBA Designer descriptions of the NIC-400 slave port configuration choices]


If secure was selected, the system would run fine with 32-bit Linux but would fail when running 64-bit Linux. Because of the NIC-400 configuration, the non-secure transactions from the A53 would be presented to the GIC as secure transactions, and the kernel would read wrong values from GIC registers such as the Interrupt Acknowledge Register (IAR) when trying to determine which peripheral is signaling an interrupt. The result would be a difficult-to-debug looping behavior in which the kernel is unable to service the proper interrupt. All of this because of a NIC-400 configuration parameter. For more information on the NIC-400 design flow, a recording of the recent Carbon Webinar is available.
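The effect of the slave port setting can be sketched as a small Python model of the protection bits on a transaction. In AXI, bit 1 of AxPROT marks a transaction as non-secure when set. The three mode names used here (`secure`, `non_secure`, `per_access`) are my own labels for the three configuration choices and may not match the exact wording in AMBA Designer, and the helper names are hypothetical.

```python
NONSECURE_BIT = 1 << 1  # AxPROT[1]: 1 = non-secure, 0 = secure


def forwarded_axprot(port_mode, axprot):
    """Return the AxPROT value the downstream slave sees.

    port_mode is an assumed label for the slave port configuration.
    """
    if port_mode == "secure":
        return axprot & 0b101          # force the transaction to secure
    if port_mode == "non_secure":
        return axprot | NONSECURE_BIT  # force the transaction to non-secure
    if port_mode == "per_access":
        return axprot                  # pass the master's own setting through
    raise ValueError(f"unknown port mode: {port_mode}")


def seen_as_secure(port_mode, axprot):
    """True if the downstream slave treats the access as secure."""
    return not (forwarded_axprot(port_mode, axprot) & NONSECURE_BIT)
```

With the port set to `secure`, a non-secure access (`axprot = 0b010`) from the CPU arrives at the GIC as a secure one, which is the situation that breaks 64-bit Linux in the scenario above. With `per_access`, the same access stays non-secure.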



As you can see, seemingly minor differences in the processor operating mode between 32-bit and 64-bit Linux can impact IP configuration as well as connections between IP. These are just two small examples of why ARM TechCon 2014 should be an exciting conference as the ARM community shares experiences with ARMv8.


Make sure to stop by the Carbon booth for the latest product updates and information about Carbon System Exchange, the new portal for pre-built virtual prototypes.


Jason Andrews

That's right, designers! If you'd like to see some stuff for FREE and even get a few free lunches at ARM TechCon 2014, you need to register in advance and use code ARMEXP100 to get a free Expo Pass. With that pass you can attend:

See you there at ARM TechCon 2014.


Time flies: 10 years ago, ARM rolled the first of what's become the storied line of Cortex-M processors. The launch of that family couldn't have been timed better, starting as it did just as the world of system design moved toward mobile solutions that were both power- and size-constrained. Today mobile, wearables and IoT are huge and show no sign of slowing. Consider that during those 10 years, more than 8 billion Cortex-M cores have shipped, more than half of them in the past 18 months, according to Thomas Ensergueix, ARM senior product marketing manager.


The latest in the series was announced this week, when ARM rolled the Cortex-M7 with an eye toward that always tricky balance between a design team's need for performance and its power constraints.

The 32-bit device doubles the compute and digital signal processing (DSP) capability of existing (and powerful) ARM-based MCUs while keeping power under control.

Using a Cadence implementation flow, design teams can wring even more power optimization from the M7 and tackle tough parasitic extraction issues along the way. We'll post about that in detail next week during ARM TechCon.

For now, here's the latest on the M7 launch, announced Sept. 24:



Market applications for the ARM Cortex-M7 include next-generation vehicles, connected devices, and smart homes and factories. Companies including Atmel, Freescale and STMicroelectronics are counted among early licensees.

More information about the new processor is available at ARM.

Next week at ARM TechCon, Cadence's Paddy Mamtora, Product Engineering Group Director, Digital and Signoff Business Unit, will join with ARM Principal Engineer Aditya Bedi Wednesday (Oct. 1) at 4 p.m. to talk about pushing the boundaries of embedded design with Cortex-M.


Related Stories:

-- Getting a Glimpse at the Future Early – Cadence & ARM at ARM TechCon 2014!
