
SoC Implementation


Introduction

In Part 1 of this blog series (found here) we introduced the ARM CoreLink™ CCI-500 Cache Coherent Interconnect and described some of the new configurable features available over and above what the previous generation CoreLink™ CCI-400 offered. We described how a UVM testbench is needed in order to start exploring the enhanced performance capabilities on offer, and we introduced an automation tool, Interconnect Workbench, which removes the need for manual testbench creation.


In Part 2 of the blog we start by exploring how CoreLink CCI-500 performs in a CoreLink CCI-400-like configuration, and follow that by showing the full performance potential of CoreLink CCI-500 when configured for maximum performance.

 

CoreLink CCI-500 as a CoreLink CCI-400 replacement

As a first experiment we have configured CoreLink CCI-500 with 2x ACE input ports, 3x ACE-Lite input ports, 2x memory ports and 1 system port. This matches the fixed configuration of the previous-generation Cache Coherent Interconnect, the CoreLink CCI-400. We have then created a scenario which drives saturating transactions into all of the input ports, targeting the two memory ports. The transactions are all defined as Non-shareable, so we eliminate the effect of L2 cache snoops and see just the raw throughput. Running the testbench at 500MHz also provides a useful point of reference, as many CoreLink CCI-400 designs run at this speed.

 

As in Part 1, we can easily generate a testbench for the CoreLink CCI-500 configuration; a diagram of the generated UVM testbench is shown below.

 

[Figure: Generated UVM testbench for the CCI-400-like CoreLink CCI-500 configuration]

 

The generated testbench contains all the necessary instances of fully configured AMBA VIP, along with an instance of the Interconnect Validator VIP connected to all of the interface VIP. This additional VIP provides full system scoreboard functionality to support tracking each transaction from its entry point to its exit point, including all the coherency modelling needed for ACE. It also captures all the timing details needed for performance analysis; the graphs shown in this blog all come from this source.

 

It is also important to note that all the Slave VIP instances which model the memory in the system are configured with zero delay, so that we see only the effective delays and bandwidth of the CoreLink CCI-500 itself.

[Figure: Read and write bandwidth for the CCI-400-like configuration at 500MHz]

 

As can be seen from the chart, the CCI-500 delivers around 14GB/s of both READ and WRITE bandwidth.
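As a rough cross-check of that figure (my arithmetic, assuming the 128-bit data paths typical of CCI interconnects, i.e. 16 bytes per port per cycle): 2 memory ports × 16 bytes × 500MHz gives a theoretical peak of 16GB/s in each direction, so 14GB/s represents utilization in the high eighties of percent.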

 

Unleashing the full CoreLink CCI-500 performance

The big benefit of CoreLink CCI-500 over its predecessor is the capability to support two additional memory ports over and above the two supported by CoreLink CCI-400; in addition, the design can be targeted to run at 667MHz in appropriate technologies. The figure below compares the same non-shareable saturating test running on the CoreLink CCI-500 configuration described previously, i.e. with two memory ports at 500MHz (the red lines), against a CoreLink CCI-500 configuration with four memory ports running at 667MHz (the blue lines). The graphs show both read and write bandwidth for the two implementations, and the improvement is clear: around 14GB/s vs 33GB/s of bandwidth for both read and write traffic.

 

[Figure: Read and write bandwidth, two memory ports at 500MHz (red) vs four memory ports at 667MHz (blue)]


One of the interesting phenomena we see in the simulations is a delay between simulation startup and high bandwidth levels being achieved, despite the fact that all masters are trying to make saturating memory accesses from the start. This is caused by the need for the Snoop Filter (more blogs are coming to explain the Snoop Filter) to initialize its RAM. The configuration we have chosen for this test is the “Large” configuration of CoreLink CCI-500, with four memory ports (blue graphs) and also 4x ACE ports (compared with 2x in the 500MHz case). To support more ACE ports the Snoop Filter RAM is larger, and hence takes longer to initialize.


Managing Bandwidth Requests

In order to understand how well CoreLink CCI-500 handles demanding scenarios it is useful to visualize how much it is stalling the requesting masters; this is sometimes called “back pressure”. A useful proxy for back pressure in AMBA infrastructure is the concept of Outstanding Transactions. The AMBA ACE, ACE-Lite, AXI4 and AXI3 protocols all support multiple outstanding transactions, provided that the receiving interface can support them.

 

As multiple transactions are issued into the system, the number of Outstanding Transactions (OT), i.e. transactions that have been initiated but are incomplete, increases. The OT level will increase until the receiving interface (in this case the CoreLink CCI-500) throttles it; the limit is generally called the read_acceptance or write_acceptance limit. The chart below shows WRITE bandwidth in red (we saw this on the last chart) plotted against the WRITE OT level, in blue, for all of the initiating masters combined. While the system is stalled waiting for the Snoop Filter RAM to be initialized, the OT level is flat; once the RAM initialization is complete, the CoreLink CCI-500 starts trying to balance the requesting masters.

 

[Figure: WRITE bandwidth (red) plotted against combined WRITE OT level (blue)]

 

After a brief peak of nearly 100 Outstanding Transactions, the OT level gradually decreases to settle at around 65 OT, plus or minus around 5 OT. We could run the simulation for longer to confirm whether this is a steady state.
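As a back-of-the-envelope sanity check (general queueing reasoning with an assumed 64-byte transaction size, not a figure taken from the measurements), Little's Law ties the settled OT level to bandwidth BW, round-trip latency L and transaction size B:

\[ \text{OT} \approx \frac{\text{BW} \times L}{B} \]

Sustaining 33GB/s with around 65 outstanding 64-byte writes would then imply a round-trip latency of roughly 65 × 64B / 33GB/s ≈ 126ns, i.e. about 84 cycles at 667MHz.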

 

If you happen to be attending DVCon in a week's time, I will be presenting jointly with Simon Rance from ARM at Session 6.3 on Tuesday 3rd March 2015; we would be very happy to chat about this or other ARM system performance topics.

 

Event Details | DVCon

 

Watch out for more parts of this blog in which we will further explore key features of CoreLink CCI-500.

 

Exploring the ARM CoreLink™ CCI-500 performance envelope - Part 1

Carbon cycle accurate models of ARM CPUs enable system performance analysis by providing access to the Performance Monitor Unit (PMU). Carbon models instrument the PMU registers and record PMU events into the Carbon System Analyzer database without any software programming. Contrast this non-intrusive PMU event collection with other common ways of executing software:

 

  • ARM Fast Models focus on speed and have limited ability to access PMU events
  • Simulating or emulating CPU RTL does not provide automatic instrumentation and event collection
  • Silicon requires software programming to enable and collect events from the PMU


The ARM Cortex-A53 is a good example to demonstrate the features of SoC Designer. The A53 PMU implements the PMUv3 architecture and gathers statistics on the processor and memory system. It provides six counters which can count any of the available events.


The Carbon A53 model instruments the PMU events to gather statistics without any software programming. This means all of the PMU events (not just six) can be captured from a single simulation.


The A53 PMU Events can be found in the Technical Reference Manual (TRM) in Chapter 12. Below is a partial list of PMU events just to provide some flavor of the types of events that are collected. The TRM details all of the events the PMU contains.


[Figure: Partial list of Cortex-A53 PMU events]

 

Profiling can be enabled by right-clicking on a CPU model and selecting the Profiling menu. Any or all of the PMU events can be enabled. Any simulation done with profiling enabled will write the selected PMU events into the Carbon System Analyzer database.

[Figure: Profiling menu for a CPU model]



Bare Metal Software

 

The automatic instrumentation of PMU events is ideal for bare metal software since it requires no programming and will automatically cover the entire timeline of the software test or benchmark. Full control is available to enable the PMU events at any time by stopping the simulator and enabling or disabling profiling.

 

All of the profiling data from the PMU events, along with the bus transactions and the software profiling information, ends up in the Carbon Analyzer database. The picture below shows a section of the Carbon Analyzer GUI loaded with PMU events, bus activity, and software activity.

 

 

[Figure: Carbon Analyzer GUI showing PMU events, bus activity, and software activity]


The Carbon Analyzer provides many out-of-the-box calculations of interesting metrics, as well as a complete API which allows plugins to be written to compute additional system- or application-specific metrics.


Linux Performance Analysis

 

Things get more interesting in a Linux environment. A common use case is to run Linux benchmarks to profile how the software executes on a given hardware design. Linux can be booted quickly and then a benchmark can be run using a cycle accurate virtual prototype by making use of Swap & Play.

 

Profiling enables events to be collected in the analyzer database, but the user doesn’t have the ability to understand which events apply to each Linux process or to differentiate events from the Linux kernel vs. those from user space programs. It’s also more difficult to determine when to start and stop event collection for a Linux application. Control can be improved by using techniques from Three Tips for Using Linux Swap & Play with ARM Cortex-A Systems.


Using PMU Counters from User Space

 

Since the PMU can be used for Linux benchmarks, the first thing that comes to mind is to write some initialization code to setup the PMU, enable counters, run the test, and collect the PMU events at the end. This strategy works pretty well for those willing to get their hands dirty writing system control coprocessor instructions.


Enable User Space Access

 

The first step to being able to write a Linux application which accesses the PMU is to enable user mode access. This needs to be done from the Linux kernel. It's very easy to do, but requires a kernel module to be loaded or compiled into the kernel. All that is needed is to set bit 0 in the PMUSERENR register to 1. It takes only one instruction, but it must be executed from within the kernel. The main section of code is shown below.

 

[Figure: kernel module code enabling user space access to the PMU]
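As a minimal sketch of what such a module can look like (an illustration assuming an AArch32 kernel and the CP15 encoding of PMUSERENR, not the downloadable source):

#include <linux/module.h>
#include <linux/smp.h>

/* Illustrative sketch: write PMUSERENR (CP15 c9, c14, 0) on every CPU.
 * Bit 0 = 1 grants user space access to the PMU; on an AArch64 kernel
 * the equivalent register is PMUSERENR_EL0. */
static void pmu_user_access(void *data)
{
	unsigned long enable = (unsigned long)data;

	asm volatile("mcr p15, 0, %0, c9, c14, 0" : : "r" (enable));
}

static int __init enable_pmu_init(void)
{
	on_each_cpu(pmu_user_access, (void *)1UL, 1);
	return 0;
}

static void __exit enable_pmu_exit(void)
{
	/* Restore the user mode enable bit to 0 on unload */
	on_each_cpu(pmu_user_access, (void *)0UL, 1);
}

module_init(enable_pmu_init);
module_exit(enable_pmu_exit);
MODULE_LICENSE("GPL");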

 

Building a kernel module requires a source tree for the running kernel. If you are using a Carbon Performance Analysis Kit (CPAK), this source tree is available in the CPAK or can easily be downloaded by using the CPAK scripts.

 

A source code example as well as a Makefile to build it can be obtained by registering here.

 

The module can either be loaded dynamically into a running kernel or added to the static kernel build. When working with CPAKs it’s easier for me to just add it to the kernel. When I’m working with a board where I can natively compile it on the machine, it’s easier to load it dynamically using:


$ sudo insmod enable_pmu.ko


Remember to use the lsmod command to see which modules are loaded and the rmmod command to unload it when finished.


The exit function of the module sets the user mode enable bit back to 0 to restore the original value.


PMU Application

 

Once user mode access to the PMU has been granted, benchmark programs can take advantage of the PMU to count events such as cycles and instructions. One possible flow from a user space program is:

  • Reset count values
  • Select which of the six PMU counter registers to use
  • Set the event to be counted, such as instructions executed
  • Enable the counters to start counting

Once this is done, the benchmark application can read the current values, run the code of interest, and then read the values again to determine how many events occurred during the code of interest.

 

[Figure: user space application flow for PMU counter access]

 

The cycle counter is distinct from the six event count registers; it is read from a separate CP15 system control register. For this example, event 0x8, instructions architecturally executed, is monitored using event count register 0. Please take a look at the source code for the simple test application used to count cycles and instructions of a simple printf() call.
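For flavor, here is a minimal sketch of that flow (my illustration using the AArch32 CP15 encodings, compiled with an ARM cross-compiler; it is not the actual registered download):

#include <stdio.h>

/* Illustrative sketch: select event counter 0, count event 0x8
 * (instructions architecturally executed), and also read the
 * dedicated cycle counter. Requires PMUSERENR bit 0 to be set. */
static inline void pmu_start(void)
{
	asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r" (0));    /* PMSELR: select counter 0 */
	asm volatile("mcr p15, 0, %0, c9, c13, 1" : : "r" (0x8));  /* PMXEVTYPER: event 0x8 */
	asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r" (0x80000001u)); /* PMCNTENSET: ctr 0 + cycle ctr */
	asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r" (0x7));  /* PMCR: enable + reset counters */
}

static inline unsigned int pmu_cycles(void)
{
	unsigned int v;
	asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (v));     /* PMCCNTR */
	return v;
}

static inline unsigned int pmu_events(void)
{
	unsigned int v;
	asm volatile("mrc p15, 0, %0, c9, c13, 2" : "=r" (v));     /* PMXEVCNTR (selected counter) */
	return v;
}

int main(void)
{
	pmu_start();
	unsigned int c0 = pmu_cycles(), e0 = pmu_events();

	printf("hello PMU\n");                     /* the code of interest */

	unsigned int c1 = pmu_cycles(), e1 = pmu_events();
	printf("cycles=%u instructions=%u\n", c1 - c0, e1 - e0);
	return 0;
}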

 

Summary

 

This article provided an introduction to using the Carbon Analyzer to automatically gather information on ARM PMU events for bare metal and Linux software workloads. Carbon models provide full access to all PMU events during a single simulation with no software changes and no limitations on the number of events captured.

 

It also explained how additional control can be achieved by writing software to access the PMU directly from a Linux test program or benchmark application. This can be done with no kernel changes, but it does require the PMU to be enabled from user mode and is limited by the number of counters available in the PMU: six for CPUs such as the Cortex-A15 and A57.

 

Next time I will look at an alternative approach to use the ARM Linux PMU driver and a system call to collect PMU events. 

There's little doubt that the Internet of Things (IoT) market is keenly dependent on sensor design. This raises a number of engineering challenges, not the least of which are power, area and integration. In addition, because it's a fast-growing, fast-changing segment, time to market is critical. This means design tools and methodologies need to evolve to enable these systems.


In the second of two webinars on the topic, experts from ARM, Cadence and Coventor discuss:

  • How to create a MEMS component using Coventor tools
  • How to design and integrate analog conditioning circuits using Cadence platforms
  • How to design energy-efficient discrete smart sensors and sensor fusion hubs with the ARM® Cortex®-M processor family

The recording of this webinar about MEMS, IoT and sensor design is now available.


My Cadence colleague, Richard Goering, offers a summary. In the webinar, Tim Menasveta, CPU Product Manager for Cortex-M0 and Cortex-M0+ Processors at ARM, Ian Dennison, Solutions Marketing Senior Group Director for the Custom IC and PCB Groups at Cadence, and Chris Welham, Worldwide Applications Engineering Manager at Coventor, present how to design a MEMS vibration sensor.


In the first webinar, Diya Soubra, CPU Product Manager for ARM Cortex-M3 processors at ARM, and Dennison guided listeners through ways to reduce time to market and realize power-performance-area design targets.

 

Related stories:

-- IoT Webinar Series Part 1

-- IoT Webinar Series Part 2

-- Upcoming Webinar: SoC Verification Challenges in the IoT Age

-- Iconic wearable hits the Million mark, sign of things to come?

-- Whitepaper: Pushing the Performance Boundaries of ARM Cortex-M Processors for Future Embedded Design

-- Cortex-M7 Launches: Embedded, IoT and Wearables

-- New Cortex-M7 Processor Balances Performance, Power

-- The new ARM® Cortex®-M7

Introduction

You may have noticed the ARM announcement last week of a group of Premium Mobile products (if not, you can find it here: ARM Sets New Standard for the Premium Mobile Experience - ARM) covering new processor core IP, new GPU IP and new interconnect IP. While the headlines may belong to the core and GPU announcements, I want to focus on the new Cache Coherent Interconnect, the ARM CoreLink™ CCI-500. The real-world system performance of any mobile SoC is ultimately determined by the choice of DDR technology and how effectively the SoC architecture can squeeze every last drop of performance out of that DDR.

 

In this multi-part blog I want to explore how early performance exploration gives users of the CCI-500 valuable insight into the configurability that CCI-500 brings and into how the IP behaves under different loading conditions, and therefore enables architects, implementers and verification engineers to be better prepared for projects which plan to include CCI-500 in their SoCs.

 

Configurability

One of the key differences between the previous generation of CCI and the latest is the configurability of the IP. CCI-500 allows users to tune the interconnect more effectively to match the needs of a broad range of mobile SoCs. For example, the number of coherent clusters supported by the AMBA ACE protocol has increased to a maximum of 4, but can also be reduced to 1 for smaller applications. The following table provides a few example CCI-500 configurations that might be commonly used.

 

 

Example        Address Width   Coherent Clusters (ACE)   I/O Coherent (ACE-Lite)   Memory Ports   System Ports
Small          34              2                         1                         1              1
Large          40              4                         3                         4              2
Smart Phone    34              2                         3                         2              1
Tablet         34              2                         5                         4              2

 

This configurability obviously allows better matching of the CCI configuration to target SoC requirements; however, it poses a number of questions. For example, what memory bandwidth can a given configuration support? Will adding more ports get me the performance I need? What impact does changing the configuration have?

 

Exploring these configuration options in a meaningful way requires accurate measurements of the RTL performance of the IP. This is exactly the kind of challenge that the Cadence Interconnect Workbench was architected to address.

 

Creating a UVM Testbench

Taking the Large example from the previous table, the following diagram illustrates the UVM testbench features which are required to start cycle-accurate performance exploration.

[Figure: UVM testbench for the Large CCI-500 configuration]

As can be seen, the testbench comprises a number of instances of AMBA VIP to drive each of the 14 interfaces, as well as a system scoreboard called Interconnect Validator which tracks transactions through their life-cycle. In addition, test sequences are required to define the shape of the AMBA traffic to be injected into the CCI-500 configuration.

 

The power of Interconnect Workbench is that this potentially tedious, time-consuming and error-prone testbench creation task can be completely automated through a simple spreadsheet.

 

Shown below is a spreadsheet for a simple configuration (the "small" example in the table) from which a fully working, automatically generated UVM testbench can be created in a matter of minutes. This simple example is chosen purely to make it easier to view in this blog. We have created and tested templates for all the examples listed in the table.

 

[Figure: configuration spreadsheet for the "small" example]

As can be seen, creating this spreadsheet is a much simpler task than writing the tens of thousands of lines of SystemVerilog code by hand.

 

In the next part of the blog I will present some of the performance results that can be easily extracted using the automated testbench, and show how different CCI setups can be readily compared to ensure the correct configuration is identified early in your project.

 

Exploring the ARM CoreLink™ CCI-500 performance envelope – Part 2


When we look back on today, we might say it was the dawn of server-in-your-pocket technology.

 

That’s because ARM announced a new premium mobile IP suite (New Announcements: ARM Sets New Standard for the Premium Mobile Experience) that includes the ARM Cortex-A72 64-bit processor core, ARM Mali-T860 and T880 GPUs and a new, faster CoreLink interconnect.

 

In tandem, Cadence announced a reference flow for the suite that supports advanced manufacturing processes including TSMC 16-nanometer FinFET Plus. Also available with the Cadence flow is ARM Artisan physical IP and ARM POP IP for the ARM Cortex-A72 processor and ARM Mali-T860 and T880 GPUs, enabling designers to meet aggressive processor performance and power goals.

 

[Figure: ARM Cortex-A72 chip diagram]

Dr. Chi-Ping Hsu, Cadence senior vice president and chief strategy officer for EDA, said the two companies worked together to ensure that the Cadence flow allows customers to integrate the ARM Mali-T880 GPU and ARM CoreLink CCI-500 to achieve optimal results at advanced process nodes.

 

ARM used the Cadence digital and system-to-silicon verification tools and IP during the ARM Cortex-A72 processor development to ensure that the flow met complex mobile design requirements, Hsu added.

So why a server in your pocket? Because this innovation brings the processing power of a server, or more, to the mobile device in your pocket.

 

What’s more, it won’t burn a hole in your pocket. The Cortex-A72 is 3.5x faster than 2014's 32-bit Cortex-A15 and consumes 75 percent less power, according to ARM.

 

The Register reported that the A72 has twice the performance and half the power of ARM’s 64-bit flagship, the A57. That processor has been targeted at servers, among other applications. So, server in your pocket? Indeed.

 

Related stories:

 

Cadence System Design and Verification

Cadence Functional Verification

Optimizing ARM Cortex-M7 with Cadence


January is always so deceptive. We return from the holidays to desks we cleared and cleaned in December, and it all seems so peaceful, so manageable. But before you know it, they’re piled high with work and the scrum's begun. So it is in the first weeks of 2015. What lies ahead? Plenty.


Here’s what I’m seeing:

 

The past is not necessarily a predictor but it’s always prologue, so I’m looking forward to watching how design challenges evolve and emerge in 2015 and how our industry, collectively, tackles them.


And speaking of challenges, will the ecosystem see another ARM Step Challenge (The Aftermath of the ARM Step Challenge at DAC) this year? One can only hope, Brad Nemire!

ARM University Program (AUP) and Xilinx University Program (XUP) together organized a one-day workshop (ODW) on System-on-Chip (SoC) Design as part of their Faculty Development Program (FDP) initiatives in India on December 16, 2014. The workshop was co-located with IEEE's International Conference on High-performance Computing, HiPC 2014, in Goa, India. The participants came from all across India as well as neighboring Sri Lanka, representing a unique cross-section of universities: from Ramdeobaba College of Engineering Nagpur, University of Moratuwa Sri Lanka, BITS Pilani Goa and Vellore Institute of Technology Chennai to BMS College of Engineering Bangalore, Indian Institute of Technology Kharagpur, DKTE Society's Textile and Engineering Institute Ichalkaranji and RMK College of Engineering and Technology Chennai. The aim of the workshop was to showcase AUP's flagship Lab-in-a-Box (LiB) on the first principles of SoC Design as a conceptual hands-on aid to faculty contemplating introducing SoC Design into a university curriculum. This particular LiB comprises the ARM Cortex-M0 DesignStart Processor IP, made available at no cost to university faculty through a EULA by ARM; Basys3 or Nexys4 FPGA boards donated by Xilinx; 100 licenses of the Keil MDK Pro microcontroller software development tool donated by ARM; and a full suite of teaching materials specially created for academics, donated by AUP.

 

The workshop began with an introduction to SoC Design elucidating its three essential ingredients, namely the processor core, the bus interconnect that stitches together the peripheral interfaces, and the interfaces themselves, such as UART and GPIO, to name a couple. The introduction included an overview of the ARM Cortex-M0 processor architecture and later touched upon the essentials of the AMBA 3 AHB-Lite bus protocol; the ARM Cortex-M series of processor cores are all AHB-Lite compliant. The participants marveled at how the specially pre-configured ARM Cortex-M0 DesignStart processor IP core made understanding and dealing with the AHB-Lite signals incredibly simple, building their confidence in this introductory-level SoC Design flow designed for academia. The explanation of the AHB-Lite protocol was followed by an introduction to the Artix-7 FPGA architecture and an overview of the Vivado Design Flow that the Artix-7 FPGA chip on the Basys3 board requires for its configuration.

 

Parimal Patel of Xilinx University Program
Sadanand Gulwadi of ARM University Program

 

The introductions were immediately followed by a very basic lab – integrating LEDs on Basys3 with the IP core – for the proper learning and understanding of the Vivado Design Flow on Basys3, key to full and complete adoption of the SoC Design LiB in a university curriculum. The basic lab also involved integrating a peripheral memory block from which the processor IP core would fetch instructions. The more complex labs following lunch were made easy to follow through a continually and gradually evolving flow. Alongside the LED connectivity, the UART interface first had to be integrated to establish connectivity between a terminal window application, such as TeraTerm on a user laptop, and the SoC. A memory controller was simultaneously integrated to manage the flash memory external to the Artix-7 FPGA on Basys3. The printf and scanf functions were re-targeted to use the UART peripheral interface to input and output characters, enabling text messages and memory contents to be displayed on the terminal window application for better debugging. With the UART interface in place and working, the next lab illustrated how to interrupt the processor core from a low-power mode. A character sent from the serial terminal of a user laptop through the UART interface generated the interrupt signal to wake up the processor core. After waking up, the processor core executed instructions to display the received character using the LEDs.

 

Faculty Participants Absorbed in Hands-on Labs

Discussions with the participants during the workshop revealed they were interested in developing the ability to design their own SoCs around a soft processor IP core such as the ARM Cortex-M0 DesignStart Processor IP, since with this ability they would be able to tailor SoCs to their intended applications. That in turn would mean just the minimum necessary interfaces around the SoC, keeping designs simple and costs economical, with fewer gates, lower power, simpler fabrication and easier debug.


The faculty participants, in response, were made aware that designing a SoC with the ARM Cortex-M0 DesignStart IP actually presents the possibility of taping out the SoC. That would further mean they could potentially expose their students and researchers to other areas, such as back-end design or silicon validation, verification and testing, which are otherwise harder to introduce in a university setting. The participants were also informed that, to help universities with fabrication requests at affordable costs, service providers such as MOSIS in the US and EUROPRACTICE in Europe have made available special Multi-Project Wafer (MPW) services.

Today, I have three tips for using Swap & Play with Linux systems.

 

  1. Launching benchmark software automatically on boot
  2. Setting application breakpoints for Swap & Play checkpoints
  3. Adding markers in benchmark software to track progress


With the availability of the Cortex-A15 and Cortex-A53 Swap & Play models as well as the upcoming release of the Cortex-A57 Swap & Play model, Carbon users are able to run Linux benchmark applications for system performance analysis. This enables users to create, validate, and analyze the combination of hardware and software using cycle accurate virtual prototypes running realistic software workloads. Combine this with access to models of candidate IP, and the result is a unique flow which delivers cycle accurate ARM system models to design teams.



Swap & Play Overview

 

Carbon Swap & Play technology enables high-performance simulation (based on ARM Fast Models) to be executed up to a user-specified breakpoint, with the state of the simulation then transferred to a cycle accurate virtual prototype. One of the most common uses of Swap & Play is to run Linux benchmark applications to profile how the software executes on a given hardware design. Linux can be booted quickly and then the benchmark run using the cycle accurate virtual prototype. These tips make it easier to automate the entire process and get to the system performance analysis.


Launch Benchmarks on Boot


The first tip is to automatically launch the benchmark when Linux is booted. Carbon Linux CPAKs on System Exchange use a single executable file (.axf) for each system, with the following artifacts linked into the image:

  • Minimal Boot loader
  • Kernel image
  • Device Tree
  • RAM-based File System with applications

To customize and automate the execution of a desired Linux benchmark application, a Linux device tree entry can be created to select the application to run after boot.

 

The device tree support for “include” can be used to include a .dtsi file containing the kernel command line, which launches the desired Linux application.

 

Below is the top of the device tree source file from an A15 CPAK. If one of the benchmarks to be run is the bw_pipe test from the LMbench suite, a .dtsi file is included in the device tree.

 

[Figure: top of the device tree source file, showing the include line]

 

The include line pulls in a description of the kernel command line. For example, if the bw_pipe benchmark from LMbench is to be run, the include file contains the kernel arguments shown below:

 

[Figure: kernel command line arguments in the included .dtsi file]

 

The rdinit kernel command line parameter is used to launch a script that automatically executes the Linux application to be run. The bw_pipe.sh script can then run the bw_pipe executable with the desired command line arguments.

 

Scripting, or manually editing the device tree, can be used to modify the include line for each benchmark to be run, and a unique .axf file can be created for each Linux application. This gives an easy-to-use .axf file that will automatically launch the benchmark without the need for any interactive typing. Having unique .axf files for each benchmark also makes it easy to hand off to other engineers, since they don’t need to know anything about how to run the benchmark; just load the .axf file and the application will run automatically.

 

I also recommend creating an .axf image which runs /bin/bash, to use for testing new benchmark applications in the file system. I normally run all of the benchmarks manually from the shell first on the ARM Fast Model to make sure they are working correctly.


Setting Application Breakpoints

 

Once benchmarks are automatically running after boot, the next step is to set application breakpoints to use for Swap & Play checkpoints. Linux uses virtual memory which can make it difficult to set breakpoints in user space. While there are application-aware debuggers and other techniques to debug applications, most are either difficult to automate or overkill for system performance analysis.

 

One way to easily locate breakpoints is to call from the application into the Linux kernel, where it is much easier to put a breakpoint. Any system call which is unused by the benchmark application can be utilized for locating breakpoints. Preferably, the chosen system call would not have any other side effects that would impact the benchmark results.

 

To illustrate how to do this, consider a benchmark application to be run automatically on boot. Let’s say the first checkpoint should be taken when main() begins. Place a call to the sched_yield() function as the first action in main(), and make sure to include the header file sched.h in the C program. This will call into the Linux kernel, and a breakpoint can be placed in the Linux kernel file kernel/sched/core.c at the entry point for the sched_yield system call.

 

Here is the bw_pipe benchmark with the added system call.

 

[Figure: bw_pipe benchmark source with the added sched_yield() call]
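In outline, the pattern looks like this (a minimal sketch of the idea, not the actual bw_pipe source):

#include <sched.h>

/* Sketch of the checkpoint pattern: each sched_yield() traps into the
 * kernel, where a breakpoint on the sched_yield entry point marks the
 * spot to save a Swap & Play checkpoint. */
int main(void)
{
	sched_yield();   /* checkpoint: start of the region of interest */

	/* ... benchmark body runs here ... */

	sched_yield();   /* checkpoint: end of the region of interest */
	return 0;
}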

 

Put a breakpoint in the Linux kernel at the system call and, when the breakpoint is hit, save the Swap & Play checkpoint. Here is the code in the Linux kernel.

 

[Figure: sched_yield system call entry point in kernel/sched/core.c, where the breakpoint is placed]

 

The same technique can be used to easily identify other locations in the benchmark application, including the end of the benchmark, to stop simulation and gather results for analysis.

 

The sched_yield system call yields the current processor to other threads, but in the controlled environment of a benchmark application it is not likely to do any rescheduling at the start or at the end of a program. If used in the middle of a multi-threaded benchmark it may impact the scheduler.

 

Tracking Benchmark Progress


From time to time it is nice to see that a benchmark is proceeding as expected and to be able to estimate how long it will take to finish. Using print statements is one way to do this, but adding too many print statements can negatively impact performance analysis. Amazingly, even a simple printf() call in a C program to a UART under Linux is a somewhat complex sequence involving the C library, some system calls, UART device driver activations, and 4 or 5 interrupts for an ordinary length string.

 

A lighter-weight way to get some feedback about benchmark application progress is to bypass all of the printf() overhead and make a system call directly from the benchmark application, using very short strings which can be processed with one interrupt and fit in the UART FIFO.

 

Below is a C program showing how to do it.

 

[Figure: C program making the write system call directly]
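Something along these lines (a sketch with hypothetical marker strings, not the exact program from the original post):

#include <unistd.h>
#include <sys/syscall.h>

/* Progress markers that bypass stdio: invoke the write system call
 * directly with a string short enough to fit in the UART FIFO, so a
 * marker completes with a single interrupt. */
static void marker(const char *s, size_t len)
{
	syscall(SYS_write, 1, s, len);
}

int main(void)
{
	marker("1\n", 2);     /* phase 1 reached */
	/* ... first part of the benchmark ... */
	marker("2\n", 2);     /* phase 2 reached */
	return 0;
}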

 

By using short strings which are just a few characters, it’s easy to insert some markers in the benchmark application to track progress without getting in the way of benchmark results. This is also a great tool to really learn what happens when a Linux system call is invoked by tracing the activity in the kernel from the start of the system call to the UART driver.

 

Summary

 

Hopefully these 3 tips will help Swap & Play users run benchmark applications and get the most benefit when doing system performance analysis. I’m sure readers have other ideas on how best to automate the running of Linux application benchmarks, as well as for locating Swap & Play breakpoints, but these should get the creative ideas flowing.

 

Jason Andrews

After ARM and Synopsys jointly announced at ARM TechCon the extended collaboration breadth and depth enabled by our new multi-year IP subscription agreement, I put together an article describing some of the existing design solutions we have in place across the entire design flow (including optimized implementation; verification, debug, emulation and VIP; complementary interface, AMBA interconnect and memory/logic-library IP; and FPGA and virtual prototyping).

 

The best way to learn about most of these collaborative efforts is to hear our mutual customers talking about their successes using them to design creative, state-of-the-art products. The article gives a few examples of these, including a few presentations (e.g., AMD, STMicroelectronics, Samsung) that were recorded live at TechCon and are available through www.synopsys.com/ARM, the go-to place for design solutions for your ARM-based products. Look for the "Videos" tab on the right-hand side of the pages on the www.synopsys.com/ARM microsite (and the "more" link at the end of the Videos list to see all videos).

 

I suggest you start with the collaboration summary article, take a look at the Synopsys-ARM solution microsite, then view a few of the mutual customer (and ARM & Synopsys) videos to get a good idea of what's available for you today. If you have any questions, talk with your Synopsys application consultant.

How to Break Through the 3 GHz Barrier - New On-Demand Webinar!

If you missed the live event last week, and are wondering just what you are going to do with all the free time you will have during the holiday break -- don't worry -- the on-demand version of this informative webinar is now available!
Check out Breaking Through 3.0GHz with ARM Cortex-A53 to learn how to balance the need for high-performance with low-power requirements and small area. Find out what methodologies were developed for the ARM Artisan® POP IP-based Cortex®-A53 implementation solution.

 

This will kill at least an hour before someone pulls out the old 'honey do' list!

I started my design career far too long ago doing system verification on a multi-processor server design.  Basically, I was charged with assembling a model of the system and then writing some tests to exercise it.  This was long before the days of virtual prototypes so I assembled the system in RTL simulation using an LMSI hardware modeler to represent the existing processor and cache components in the system.  When it came time to get software up and running on the system, I started off by writing a few directed tests to run on the processors in the design.  These tests were designed to stress the system but configuring all of the components started to become burdensome so I went to the software team and started borrowing the code that they were writing for the eventual real silicon and got it up and running on my system model.  After spending far too much time figuring out the problems in my verification system (it was my first job after all) I started finding real system problems.  Software driven verification was finding problems that the hardware verification team had missed.  Since this was the first time that software was being run on the real hardware, albeit as a simulation model, we found numerous problems in both the hardware and the software.

 

Evolving Approaches

A few years later, I migrated from doing design work to working as an applications engineer at Quickturn Systems. I saw firsthand the huge amounts of money and design resources which companies would allocate to assemble a model of the system before silicon.  They were taking a lot of the same approach that I had done in my first job running real software on real hardware.  Instead of assembling the system in software talking to a hardware modeler they were using a cobbled together system with a washing machine sized emulator hooking into a specially designed hardware board with what seemed like miles of spaghetti-like cables in between.  The hardware teams would typically use the systems during the day time hours to do their system verification work with the software teams relegated to nighttime hours for their time on the box.  (After all, emulators were an expensive resource.  It made sense to schedule them for round the clock usage and I spent more than one 2am session in the lab to help keep the boxes running.)  The value was high though.  The interaction of real hardware and real software before silicon accelerated design schedules and found corner case functional and performance issues that would have otherwise made their way into silicon.  It was difficult, it was expensive but for many design teams, it was worth it.

 

Fast forward a few years to the present day and it looks like the path to do this system level validation is evolving once again.  We’ve seen an increasing number of design teams adopt a validation strategy that involves using system level software to drive the validation of the design long before silicon.  While this has historically been done using the actual system software (getting to that boot prompt well before tapeout is still a typical milestone), many design teams are now crafting software specifically for the purpose of validating their system.  This software can either be targeted software they’ve written themselves or third party verification software from leading companies like Breker Systems.  We recently published an article together with Breker in EETimes which talks about doing precisely this to address the huge problem of coherency validation in the newest generation of ARMv8-based SoC designs.  It goes into a good amount of depth on cache coherency so I'd certainly recommend it if your next design is using hardware coherency.  The CPAK discussed in the article uses two clusters of four-core ARM Cortex-A53 CPUs but it can be easily modified to better represent your actual design.

[Figure: Breker A53 CPAK]

 

While emulation and FPGAs remain a popular, albeit expensive, way to execute this validation software, an increasing number of teams have been performing this valuable step on virtual prototypes.  Using virtual prototypes together with system software isn’t new of course, that’s been done for a long time.  The latest wrinkle though is the ability to do this software development on a virtual prototype which is actually an accurate representation of the system.  Traditionally, virtual prototypes have been functional models only and have abstracted away the implementation details of the system in order to achieve performance.  Today, it is possible to use virtual prototypes that have both the speed of these high level virtual prototypes (10s to 100s of MIPS) but also still have all of the accuracy of the RTL implementation.  What’s more, many of these systems are already built and have system software already running on them.

 

Of course, when I say RTL-accurate, alarm bells start going off.  Does this mean I have to debug my software using waveforms?  Am I going to have to learn how to run a hardware simulator?  And of course: don’t accurate models run too slowly to execute real software?  Thankfully, the answer to all of these questions is no.  Let’s see why.

 

Getting Started With Software Driven Verification

Most times, the fastest way to get a working model of your system is to take a working model of a system similar to yours and port it to more closely represent your design.  This is the reason that we have so many CPAKs on our System Exchange web portal.  Using the search parameters, you can easily narrow down the lengthy list of pre-built systems (well over 100 when this blog is written) to one that most closely matches your design.  You can even choose the software you want to run, from the simplest bare-metal benchmark to a full Linux boot and OS-level benchmarks.

 

Once downloaded, the CPAK can be easily customized to mimic your actual design.  You can do this by using models from IP Exchange, RTL models you’ve compiled using Carbon Model Studio or SystemC models.  They can either add to the supplied system or replace existing components.

 

Your next step depends upon your design needs.  If you want to develop high level software without need for system accuracy you can certainly do so.  Simply use the ARM Fast Model representation of the system.  This enables you to run at Fast Model speeds which are typically in the 10s to 100s of MIPS.  You can even execute in a hybrid configuration if desired, mixing Fast Model components together with Carbonized RTL models.  This is a common use case for components such as GPUs which don’t have Fast Model representations.  The system runs at Fast Model speeds except when accessing the GPU or rendering a frame.  This approach enables a fast OS boot before beginning video operations which then run accurately since the GPU is RTL-accurate.  Bear in mind of course that any hybrid combination of Fast Models and accurate models will not generate the same types of accesses to the GPU as would be seen in a real system.  This is true if this hybrid combination takes place entirely in the virtual world or by tying a virtual prototype to an emulator.  Since Fast Model representations are functional only and don’t attempt to correctly model cycle accuracy this type of an approach is only well-suited for software development and not for system architecture or validation.  For those tasks, only an accurate system model will do.

 

This brings us back to where we started: using software on an accurate representation of the system to drive validation.  Before we talk more about that though, I should answer the questions I mentioned above about debugging and execution speed.  The debugging question is an easy one.  Although the waveforms are there if you really want to be masochistic, all Carbon models of ARM IP available on our IP Exchange web portal contain an integration with ARM’s DS-5 debugger to enable truly interactive debugging.  This isn’t a post-process “gee, I wish I could change that value but it’s too late now” integration.  It’s one that enables the designer to view and modify the contents of any register or memory location while the program is running.  The entire system also runs in a complete virtual prototype environment so no hardware simulator connection is necessary.  No complicated command lines needed or extra licenses to check out at runtime.

 

This of course brings us back to the speed question. After all, functional models run faster than accurate models precisely because they’ve eliminated accuracy.  How can we get the speeds needed to boot an OS or develop application level software and still expect to have accuracy?  This is where Carbon’s Swap & Play technology comes to the rescue.  Our virtual prototypes and CPAKs have the ability to start running using a Fast Model representation of the system and then swap over to 100% accurate models at any software breakpoint. This approach lets you boot your OS in under a minute, far faster than with any emulator or FPGA prototype, and then continue running with the accurate representation.  This enables the tasks that require accuracy such as performance optimization or system validation.  You can even create multiple breakpoints to start running accurately at different points in the system execution.

 

Conclusion

Software can be a very effective way to verify the behavior of an SoC long before tapeout.  Whether you're using actual system software to do this or leveraging dedicated system verification software from companies like Breker, you have the ability to see true system behavior and fix problems earlier in the design cycle.  Virtual prototypes simplify this task with their ability to offer true interactive debugging of both hardware and software with complete visibility. Carbon's CPAKs offer a great way to further accelerate the development of these systems and let that verification task start earlier in the cycle.


We recently ran a webinar that covered common pitfalls in verification and performance analysis of cache-coherent ARM-based designs. Don't worry if you missed it - you can register to watch the recorded session. Here are the particulars:

 

Click here to register.

 

Date: December 3, 2014
Time: 11:00 AM PST
(Note: The event will be recorded; if you register and are not able to attend an e-mail notification will be sent advising you of the event's availability for viewing within 24 hours)

Event Summary:
We will be covering how Verification IP for AMBA enables users to generate correct coherent stimulus for cache coherent SoC verification.


The technical webinar will cover the complexities of configuration, stimulus, coverage and checking. We will highlight and discuss common verification pitfalls, for example:

1) Lack of system checks results in late discovery of coherency issues

2) Lack of time and expertise to create complex scenarios (e.g., concurrent accesses to same cache line, trigger interesting cache transitions, etc.)

3) Insufficient performance analysis due to time pressures

4) Excessive debug time to find root cause of failures


The webinar will include many examples of how users can address these and other pitfalls using simple techniques and advanced VIP.


Speakers:

 

Neill Mullinger

Product Marketing Manager for Verification IP, Synopsys

Neill Mullinger is a product marketing manager at Synopsys for verification IP. Neill joined Synopsys in 2000 and has been focused on verification IP and protocol verification since 2002. He brings more than 25 years of experience in the hardware and EDA industries as an applications engineer and product manager.

 

 

Tushar Mattu

Corporate Application Engineer (CAE) for Verification Group, Synopsys

For more than 10 years, Tushar has been working as a verification solutions engineer at Synopsys, supporting some of Synopsys’ key customers in architecting testbenches using best verification practices based on VMM and UVM. Currently, Tushar’s focus is on AMBA Verification IP, and he works closely with VIP users.

03 DECEMBER, 2014

Today, Huawei has introduced a new addition to its own processor family. The Huawei Kirin 620 system-on-a-chip is definitely not a top-of-the-line chipset, but is nonetheless a solid performer, designed for mid-range devices. The CPU has a 64-bit architecture and comes with 8 cores, clocked at 1.2GHz.

It is a 28nm chip, based on the Cortex-A53, with LPDDR3 RAM support. As far as connectivity goes, it will offer GSM / TD-SCDMA / WCDMA / TD-LTE / LTE FDD support, as well as Cat. 4 LTE for speeds up to 150Mbps. The GPU used is a Mali-450 MP4, which is a little dated. Camera support is limited to a 13MP sensor, while video encoding and decoding capabilities can handle up to 1080p resolutions at 30Hz.


From the link below you can find more information:

Huawei releases a new octa-core Kirin 620 chipset - GSMArena.com news

The first Carbon Performance Analysis Kit (CPAK) demonstrating the AMBA 5 CHI protocol has been released on Carbon System Exchange. The design features the ARM Cortex-A57 configured for AMBA 5 CHI and the ARM CoreLink CCN-504 Cache Coherent Network. The design is a modest system with a single core running 64-bit bare-metal software with memory and a PL011 UART, but for anybody who digs into the details there is a lot to learn.

 

Here is a diagram of the system:

 

[Figure: Cortex-A57 and CoreLink CCN-504 system diagram]


AMBA 5 CHI Introduction

 

Engineers who have been working with ARM IP for some time will quickly realize that AMBA 5 CHI is not an extension of any previous AMBA specification. AMBA 5 CHI is both more and less complex compared to AMBA 4: more complex at the protocol layer, but less complex at the physical layer. AXI and ACE use Masters and Slaves, but CHI uses Request Nodes, Home Nodes, Slave Nodes, and Miscellaneous Nodes. All of these nodes are referenced using shorthand abbreviations, as shown in the table below.

 

Node type             Abbreviation
Request Node          RN
Home Node             HN
Slave Node            SN
Miscellaneous Node    MN


Building the A57 with CHI

 

The latest r1p3 A57 is now available on Carbon IP Exchange. CHI can be selected as the external memory interface. The relevant section from the IP Exchange configuration form is shown below.

 

[Figure: A57 configuration form on IP Exchange with CHI selected as the external memory interface]

 

The CHI memory interface relies on the System Address Map (SAM) signals. All of the A57 input signals starting with SAM* are important in constructing a working system. These values are available as parameters on the A57 model, and are configured appropriately in the CPAK to work with the CCN-504.

 

Configuring the CCN-504


The CCN-504 Cache Coherent Network provides the connection between the A57 and memory. The CPAK uses two SN-F interfaces, since dual memory controllers are one of the key features of the IP. A similar set of SAM* parameters is available on the CCN-504 to configure the system address map. Like other ARM IP, the CCN uses the concept of PERIPHBASE to set the address of the internal, software-programmable registers.

 

Programming Highlights

 

The CCN-504 includes an integrated level 3 cache, and the CPAK demonstrates its use.

The CPAK startup assembly code also demonstrates other CCN-504 configuration steps, including how to set up barrier termination, load node ID lists, program the system address map control registers, and more.


AMBA 5 CHI Waveforms

 

One of the best ways to start learning about AMBA 5 CHI is to look at the waveforms between the A57 and the CCN-504. The latest SoC Designer 7.15.5 supports CHI waveforms and displays flits, the basic unit of transfer in the AMBA 5 CHI link layer.

 

[Figure: AMBA 5 CHI waveform showing flits between the A57 and the CCN-504]

Summary


A new CPAK by Carbon Design Systems running 64-bit bare-metal software on the Cortex-A57 processor with CHI memory interface connected to the CCN-504 and memory is now available. It demonstrates the AMBA 5 CHI protocol, serves as a starting point for optimization of CCN-based systems, and is a valuable learning tool for projects considering AMBA 5 CHI.

My colleague, Tom De Schutter, wrote a good blog about a recent accomplishment of the Synopsys Press book "Better Software. Faster!" -- more than 3,000 copies in distribution to designers in more than 1,000 companies. The success of the book highlights the interest in using virtual prototyping as a key methodology to "shift left" product development.

 

You can download a free Better Software. Faster! eBook in English or Chinese by using either your SolvNet ID or email address. The Japanese edition is underway as well, so stay tuned for that.

 

The book, which includes case studies from thirteen companies, including one written by Rob Kaye of ARM, dives deep into virtual prototyping as the key methodology to enable concurrent hardware/software development by decoupling software development from hardware availability.

 
