
ARM Processors


Chinese version: From Months to Days: IP Integration Innovation at 52DAC

Some innovations deliver such a dramatic productivity shift that they are often only appreciated when viewed with the perspective of history. Isambard Kingdom Brunel built the first train line from London to Bristol and cut travel times from days to hours. In doing so he actually moved time itself: at the time, Bristol was about 10 minutes behind London. Clocks in the 19th century were based on sunrise and sunset in each location, as it was never necessary to be so precise when travel between two places meant a much longer journey. However, the real benefit he provided was literally giving time to people: shortening the journey enabled them to dedicate time to solving other problems. The spread of train tracks across the UK and the rest of the world enabled the rapid development of the Industrial Revolution, which provided the foundation for the modern world. There is a fantastic documentary on Isambard Brunel on YouTube for those who wish to find out more.




The Clifton suspension bridge in Bristol. A revolutionary construct in 1864 that dramatically cut travel times between London and Bristol



The narrative of saving time on one task to allow a person to focus on other issues is incredibly apt when referring to tools that help with SoC development. Ever since the birth of the EDA movement in the 1980s, tools have played a key role in automating tasks and freeing up time to work on more important design decisions.



When you look at the way SoCs and subsystems are put together currently, it consists of manual stitching in RTL. Nowadays, for example, an ARM® CoreSight™ debug and trace subsystem will consist of thousands of connections, hundreds of registers and hundreds of interface ports. That all adds up to a highly complex bit of IP. Trying to stitch this together in a way that gets the most out of the system, knowing where the interfaces are and what they need to do, requires true experts who have intimate knowledge of how the architecture works. Even for someone with vast experience it is still a time-consuming, error-prone process that requires constant reference to Technical Reference Manuals (TRMs). Let's face it: everybody has something better to be doing than looking at TRMs. Add to that the fact that designers are using IP from different sources, which may not all follow the same standard: some is in RTL and some in IP-XACT. Today it takes months to properly build a functional subsystem.



As the premier design automation conference, DAC is instrumental in showcasing technology that brings huge improvements to SoC development. This year at DAC52 ARM will unveil new technologies that will transform the way complex debug subsystems and entire SoCs are configured and integrated. Similar to Brunel’s breakthrough 150 years ago, it will reduce the time scale required from months to days and enable system architects to focus their attention on delivering performance improvements through system optimization instead of toiling away at manual stitching. In a previous conversation, ARM IP Tooling Architect David Murray gave away a few hints about how this productivity shift will be achieved:



“What we’re trying to do here is use the metadata to give a fast, correct configuration in a system context. What I mean by system context is that you can see how different system requirements have a knock-on effect on the configuration of each IP and the system as a whole. What that allows us to do is reduce the time that’s spent on actually integrating the parts into a system because 90% of that work will have been done through intelligent configuration. In order to realize our ‘System in a Day’ vision for IP integration we need to do it through intelligent configuration. You need to have a solution for these complex IP blocks so that they can reconfigure themselves as the system is being defined.”

This level of fast configuration that can show the system implications in real time is a revolutionary step forward in terms of SoC development. Its true value may only be seen in a few years when users of the technology have squeezed even more performance out of SoCs with the time that has been freed up for them. Find out more about ARM’s commitment to simplifying the process of system integration at Booth #2428 at 52DAC.


DoCoMo: First operator worldwide to launch FIDO authentication on devices

I am reporting here from Tokyo following NTT DoCoMo’s press event today. It’s so exciting to see so many people gathered here for this important event talking about DoCoMo’s recent announcement of handsets with FIDO UAF 1.0 capability. The benefits of FIDO really boil down to three words: security with simplicity. There are a couple of key announcements coming out around this:


It is exciting to see a Tier 1 carrier embrace FIDO to bring convenience and security to a variety of consumer services; it demonstrates how carriers, when proactive on new technology, can drive incremental monetization. In conjunction with this comes DoCoMo’s announcement of “d-point”, which appears to be a loyalty program in partnership with Lawson, Inc. According to the press release, this program appears to allow you to receive points at LAWSON stores and redeem them for a variety of DoCoMo services.


ARM is very pleased to be a board member of the FIDO Alliance and actively involved in ensuring that FIDO security is enhanced further through the use of ARM TrustZone® and Trusted Execution Environments (TEEs). My colleague at ARM, Rob Coombs, also a FIDO board member, details more about how FIDO can be enhanced using a TEE in his blog and whitepaper.


FIDO complemented by a TEE will have a dramatic impact on simplifying authentication across a variety of services and can drastically increase engagement by consumers, which can ultimately result in increased monetization opportunities for carriers and the service providers who work with them.

Identity thieves are getting quite sophisticated when it comes to stealing your username and password. We might be wary of an unsolicited email containing a link, but what if it came from a friend's email account? This happened to my wife recently: her friend's PC had been taken over and was sending believable emails to all her contacts (the PC was later encrypted by the hackers and held to ransom for bitcoins). If you clicked the link, up would pop a window purporting to be from Google asking you to log in with your username and password. Then there is the issue of having to remember too many long and complex passwords for the different web services we all use. I think most of us would agree that passwords aren't safe and they are painful to use.

Fortunately, the combination of a new authentication protocol called FIDO (Fast ID Online) and biometrics is changing the landscape rapidly. The FIDO Alliance is a group of approximately 200 companies working together to create a new protocol that provides simpler, stronger authentication. It can work with many different types of authenticator, such as a fingerprint sensor, iris scanner or trusted PIN entry. The device (not the remote server) creates a public/private key pair for each combination of user, device and relying party during registration and provides the public key to the relying party. The sensitive parts of the algorithm (e.g. crypto, matching, key stores) need to be protected from scalable attacks. Fortunately, ARM-based applications processors usually implement a TrustZone-based Trusted Execution Environment consisting of isolation hardware, authenticated trusted boot and a small Trusted OS. The TEE is being standardised by GlobalPlatform, who are working on a security certification scheme so that it will soon be possible for platforms to be tested by third-party labs. The attached white paper looks at how the TrustZone-based TEE is being used with FIDO-based systems to protect assets and accelerate the revolution to a world without passwords.

Chinese version: The Connected Community Lands at 52DAC

Ladies and gentlemen, it's that time of the year again: DAC is less than three weeks away! I will be in San Francisco to provide daily updates of all the major news from the Moscone Center for those of you who won't be attending, so you won't miss out on anything from the three-day exhibition. The ARM Connected Community is the place to be to find out about DAC news, photos, videos, partner announcements and gossip from the show floor throughout the event! In this blog I'll give you a flavour of what DAC is all about as well as highlighting how ARM is engaging with partners in workshops, panels and poster sessions.


When you look back over the history of DAC, there are a few common themes. One is the interwoven nature of DAC with another three-letter acronym: EDA. The two have been closely tied ever since the first trade show for EDA was held at DAC in 1984. Indeed, close connections are a theme of DAC that I will explain further below as I highlight the many events that ARM is participating in at this year’s show. The other striking point about this event is its seemingly relentless pursuit of improvements in system design and automation.





Binoculars can only show you the second best view in San Francisco




Even DAC itself is not immune to the forces of creative destruction. The inaugural conference in 1964 was named Society to Help Avoid Redundant Effort (SHARE), but its name was soon changed to something deemed to be more ‘streamlined’, and thus the Design Automation Conference (DAC) was born. Similarly, the content of DAC has evolved significantly over the years. Initially it was an academic-only conference that displayed technical methodologies, but it has since grown into the all-encompassing event it is today. In its current form it dedicates 30% of its content to embedded systems and software, has a strong automotive section and still harkens back to its roots with ACM/IEEE refereed research content that remains the backbone of the conference. On the business side of DAC, the industry it supports is healthier than ever. An April 13 report from the EDA Consortium announced that the EDA industry has broken even more records, with Q4 2014 revenues reaching $2,104 million, an increase of 11.9 percent. As they say, what’s good for the goose is good for the gander, and this level of unprecedented success for EDA bodes well for DAC this time out and in many years to come.


DAC is also incredibly important as a milestone for the year. Coming in June, it represents a half-way point and allows one to take stock of what has happened in the 12 months since the previous time we all got together in San Francisco. Looking back 12 months makes you realise how quickly time flies by, as DAC51 was when ARM announced its acquisition of IP integration experts Duolog Technologies. Stay tuned for my next blog, when I reveal a little more on how Duolog has become part of ARM and what they have achieved in the past year.


This year will see a continuation of ARM’s long-standing tradition of giving a spotlight to its ecosystem partners within the ARM Connected Community (booth #2414). Partners include virtual prototyping experts, software simulation and embedded developers. Here you can see a full list of the ARM partners exhibiting at DAC.





"The ARM Connected Community is unified by a set of common values"

In a landmark essay, James Moore defines the connected technology community as “a vast global collection of companies that are unified by a few standards and core technologies such as architecture, radios and signal processors. But what unifies them most – and this is most fascinating - is a set of values about openness of ideas and technologies, treating each other well, and finding creative ways of profit sharing and risk mitigation so all members can thrive”.


This set of values embodies one of the major themes of DAC: the feeling that everything is interlinked. There is a great sense of camaraderie amongst people from various companies, as many will have worked together in the past or attended the same events for years. The same is true for the technology itself, with many small EDA vendors proudly proclaiming that their solution is compatible with design flows from the big design houses and industry standards. This combines to provide an atmosphere that is 50% tech conference and 50% high school reunion. The many social events add to this and let people blow off some steam and socialise after spending hours talking ‘shop’ all day. I’m a firm believer that real, lasting relationships are forged when people are a little more relaxed and open to chatting at events like the Cadence Design Systems Denali Party, but that could be the Irish in me.


In an industry that is driven in a large part by the innovation of startups, it is crucial there is compatibility between smaller and larger companies. Small companies generally need to comply with established standards in order to fit seamlessly into the design flow, while on the other hand larger companies must stay aware of and open to new methodologies that could potentially take off. DAC is an excellent microcosm of the semiconductor industry as you have a number of different parts of the SoC supply chain:






Modern SoC design requires input from many different partners


I’m sure I’ve left out a stakeholder in that list so please correct me in the comments section with the other important parts of SoC design. There is huge value in each of these stakeholders being in sync and working together. Maintaining a vibrant and successful ecosystem is one of the best ways to drive innovation to the benefit of the industry and the end user.





ARM working together with partners at DAC


The sheer number of panel discussions, joint demos and workshops is a real breath of fresh air and is living proof of the values I mentioned above: openness of ideas and technologies, treating each other well and finding creative ways of profit sharing and risk mitigation. ARM is engaging in a number of joint efforts to highlight what can be done when multiple parties are committed to working together. One example is happening at a breakfast session on Tuesday June 9th at 7:30am in the Park Central Hotel, when ARM teams up with Synopsys and Samsung to implement a Cortex®-A53 processor and a CoreLink™ CCN-502 Cache Coherent Network on a 14FF process. It is only through deep collaboration and a generous degree of open communication that chips are brought to silicon, which is what we intend to demonstrate.


On a similar theme, ARM’s Brenda Westcott chairs an IP implementation session on Monday June 8th at 1:30pm in Room 101 that includes discussion of synthesis constraint methodologies for high-speed SERDES, techniques for mapping analog IP to different foundries, on-chip POP package co-design for DDR interfaces, and using SystemC for hardware/software for ULP Wi-Fi IP. The concluding panel will be moderated by Semiconductor Engineering’s Ed Sperling and will showcase Xilinx's Darren Jones, Cavium's Surya Hotah, and EZChip's Bob Doud.


Another highlight is a subsystem IP & IP integration discussion on Monday June 8th at 4:30pm in Room 101, chaired by Taiwan Semiconductor Manufacturing Corp. (TSMC)'s Clark Chen, which will go into detail on design for analytics, IoT processor IP platforms, ISO 26262 automotive safety certification, and vision processing subsystems. The concluding panel will be moderated by Semiconductor Engineering’s Ann Mutschler and will showcase ARM’s Leah Schuth, Cadence’s Thomas Wong, Global Unichip Corp's Lewis Chu, and Synopsys' Navraj Nandra.





Here’s a short list of the other sessions that ARM will be involved in at DAC



| Session Title | ARM speaker(s) | Date & Time | Location | Event Type |
| Workshop 6: Design Automation for HPC, Clouds, and Server-Class SoCs | Rob Aitken | 6/7 8:30 AM-5:00 PM | | |
| Workshop 9: Interdisciplinary Academia Industry Collaboration Models and Partnerships | Sadanand Gulwadi | 6/7 9:00 AM-5:00 PM | | |
| IP Implementation | Brenda Westcott | 6/8 1:30 PM-2:30 PM | | IP Track |
| Subsystem IP & IP Integration | Leah Schuth | 6/8 4:30 PM-5:30 PM | | IP Track |
| It’s All in the Margins | Brian Cline & Rob Aitken | 6/9 10:30 AM-12:00 PM | | Special Session |
| Planar to FinFET | Brian Cline | 6/9 10:30 AM-12:00 PM | | Designer Track |
| Build Your Own? Why Not! | Dominic Pajak & Rob Aitken | 6/9 1:30 PM-3:00 PM | | Special Session |
| DAC Designer and IP Track Poster Session | Neha Agarwal(2), Anand Balan, Rupal Gandhi, Rob Kaye(2), Faisal Khoja, Chris Lamb, Ramesh Manohar, Joonsoo Park & Sagar Undale | 6/9 4:30 PM-6:00 PM | Exhibit Floor | Designer and IP Track Poster |
| Scalable Verification: Evolution or Revolution? | Bill Greene | 6/10 4:30 PM-6:00 PM | | |
| Work-in-Progress Poster Session | Abdellah Bakhali, Fabrice Blanc & Vikas Chandra | 6/10 6:00 PM-7:00 PM | Esplanade Foyer | |
| Making Designs Better: A Holistic Approach | Geoffray Lacourba | 6/11 10:30 AM-12:00 PM | | Research Paper Session |
| Validation, Validation, and Validation: The 1-2-3 of Secure SoC | Vikas Chandra | 6/11 1:30 PM-3:00 PM | | Special Session |
| The Long and Winding Road to IoT Connectivity: Are We There Yet? | David Flynn | 6/11 4:00 PM-5:30 PM | | |







This is the list of confirmed appearances so far; there are some more that will be confirmed in the days to come. I will keep this table updated so that you have all of the appearances ARM is making with its partners, or alternatively you can pop by the ARM booth #2414 at DAC to find out about our activities. What event at DAC are you looking forward to most? See you there soon!

Following on from the popularity of the Cortex-A Series Programmer’s Guide for ARMv7-A, there is now a programmer's guide for processors implementing the ARMv8-A architecture profile.


The new Cortex-A Series Programmer's Guide for ARMv8-A is available now and does not require a click-through agreement to download.


Besides a general introduction to the ARMv8-A architecture, the guide covers:


  • The ARMv8-A A64 and A32 instruction sets

The most significant change introduced in the ARMv8-A architecture is the addition of a 64-bit instruction set called A64. This set complements the existing 32-bit instruction set architecture. This addition provides access to 64-bit wide integer registers and data operations, and the ability to use 64-bit sized pointers to memory.

The AArch64 execution state provides thirty-one 64-bit general-purpose registers.

ARMv8-A also includes the original ARM instruction set, now called A32.

  • Floating-point and NEON improvements (ARM Advanced SIMD architecture)

There are now thirty-two 128-bit registers, rather than the 16 available for ARMv7.

Smaller registers are no longer packed into larger registers, but are mapped one-to-one to the lower-order bits of the 128-bit register. A single-precision floating-point value uses the lower 32 bits, while a double-precision value uses the lower 64 bits of the 128-bit register.

ARMv8-A supports both single-precision (32-bit) and double-precision (64-bit) floating-point vector data types and arithmetic as defined by the IEEE 754 floating-point standard.

  • The Application Binary Interface (ABI) for ARM 64-bit software

The ABI specifies fundamental rules to which all executable native code modules must adhere so that they can work correctly together.

  • Changes to the exception model and exception handling

In ARMv8-A, execution occurs at one of four Exception levels. In AArch64 state, the Exception level determines the level of privilege, in a similar way to the privilege levels defined in ARMv7.

  • Porting software to A64

For many applications, porting code from older versions of the ARM Architecture, or other processor architectures, to A64 means simply recompiling the source code. However, there are a number of areas where code is not fully portable such as constants, atomic load and store, and conditional execution.

  • Memory access improvements

The ARMv8-A architecture employs a weakly-ordered model of memory. In general terms, this means that the order of memory accesses is not required to be the same as the program order for load and store operations. Hardware optimizations, such as the use of cache and write buffer, improve the performance of the processor. Bandwidth between the processor and external memory can be reduced and the long latencies associated with such external memory accesses are hidden.

  • Improvements for multi-core systems

The ARMv8-A architecture provides a significant level of support for systems containing multiple processing elements.

The Cortex-A53 and Cortex-A57 processors support coherency management between the different cores in a cluster to ensure that all processors or bus masters within a system have the same view of shared memory.

Coherency management and a shared interrupt controller simplify creating big.LITTLE systems that combine energy-efficient LITTLE cores with high-performance big cores.

The exception level access restrictions and TrustZone extensions provide high security for multiple processors running at different privilege levels.

  • New ARMv8-A Models

Platform models enable development of software without the requirement for actual hardware. Software models provide models of processors and devices from the perspective of a programmer. The functional behavior of a model is equivalent to real hardware.
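The register layout described under "Floating-point and NEON improvements" above can be illustrated with a small sketch. This is a toy model written for this summary, not code from the guide: the helper names (`pack_single`, `pack_double`, `s_view`, `d_view`) are hypothetical, and a 128-bit register is simply represented as a Python integer. It shows only the read side of the mapping, where a single-precision value is taken from the lower 32 bits and a double-precision value from the lower 64 bits.

```python
import struct

LOW32 = (1 << 32) - 1   # mask for the lower 32 bits
LOW64 = (1 << 64) - 1   # mask for the lower 64 bits


def pack_single(x: float) -> int:
    """Return the 32-bit IEEE 754 bit pattern of x as an integer."""
    return int.from_bytes(struct.pack("<f", x), "little")


def pack_double(x: float) -> int:
    """Return the 64-bit IEEE 754 bit pattern of x as an integer."""
    return int.from_bytes(struct.pack("<d", x), "little")


def s_view(v128: int) -> float:
    """Interpret the lower 32 bits of a 128-bit value as a single."""
    return struct.unpack("<f", (v128 & LOW32).to_bytes(4, "little"))[0]


def d_view(v128: int) -> float:
    """Interpret the lower 64 bits of a 128-bit value as a double."""
    return struct.unpack("<d", (v128 & LOW64).to_bytes(8, "little"))[0]


# Arbitrary data in bits [127:64]; the double lives in bits [63:0].
v_with_double = (0x0123456789ABCDEF << 64) | pack_double(1.5)

# Arbitrary data in bits [127:32]; the single lives in bits [31:0].
v_with_single = (0xFFFFFFFFFFFFFFFFFFFFFFFF << 32) | pack_single(0.25)

print(d_view(v_with_double))
print(s_view(v_with_single))
```

Whatever sits in the upper bits, the scalar views depend only on the low-order bits of the same register, which is the one-to-one mapping the guide describes.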


The programmer's guide complements rather than replaces other ARM documentation for the Cortex-A series processors.

For information on a specific processor, see the appropriate ARM Technical Reference Manual:

The most important and definitive reference for the ARMv8-A architecture remains the ARMv8-A Reference Manual.

1/ The formula for the WRAP burst address in the specification is:

"For a WRAP burst, if Address_N = Wrap_Boundary + (Number_Bytes × Burst_Length), then:

  • use this equation for the current transfer:
    • Address_N = Wrap_Boundary
  • use this equation for any subsequent transfers:
    • Address_N = Start_Address + ((N-1) × Number_Bytes) - (Number_Bytes × Burst_Length)"

--> I think the Start_Address in this formula must be the Aligned_Address. Please could you send me the correct information about this.
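To make the quoted equations concrete, here is a minimal Python sketch of the address sequence for a WRAP burst. The function name `wrap_burst_addresses` is hypothetical. The definitions of `Wrap_Boundary` and `Aligned_Address`, and the two constraints checked at the top (burst length of 2, 4, 8 or 16, and a start address aligned to the transfer size), are my reading of the AXI specification rather than anything stated in this post, so treat them as assumptions. If the alignment constraint holds, Start_Address and Aligned_Address are equal for a WRAP burst, and either name gives the same result in the formula.

```python
def wrap_burst_addresses(start_address, number_bytes, burst_length):
    """Return the address of each transfer in an AXI WRAP burst (sketch)."""
    # Assumed constraints for WRAP bursts, per my reading of the AXI spec.
    if burst_length not in (2, 4, 8, 16):
        raise ValueError("WRAP burst length must be 2, 4, 8 or 16")
    if start_address % number_bytes != 0:
        raise ValueError("WRAP start address must be aligned to the transfer size")

    total_bytes = number_bytes * burst_length
    # Lowest address in the burst: start address rounded down to the burst size.
    wrap_boundary = (start_address // total_bytes) * total_bytes
    aligned_address = (start_address // number_bytes) * number_bytes

    addresses = []
    wrapped = False
    for n in range(1, burst_length + 1):
        if n == 1:
            addr = start_address
        elif wrapped:
            # Equation for transfers after the wrap has happened.
            addr = start_address + (n - 1) * number_bytes - total_bytes
        else:
            addr = aligned_address + (n - 1) * number_bytes
            if addr == wrap_boundary + total_bytes:
                # Equation for the transfer that hits the upper limit.
                addr = wrap_boundary
                wrapped = True
        addresses.append(addr)
    return addresses


# Four 4-byte transfers starting at 0x38 wrap within 0x30..0x3F.
print([hex(a) for a in wrap_burst_addresses(0x38, 4, 4)])
```

With a start address of 0x38, the third transfer would land on 0x40, which is the wrap boundary plus the burst size, so it wraps to 0x30 and the final transfer follows at 0x34.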


2/ I am designing an AMBA AXI3 master. I think write data interleaving and out-of-order write data are supported by the interconnect because the many masters it serves have different speeds. Does an AXI master have to support write data interleaving and out-of-order write data when it is connected directly to an AXI slave? I think an AXI master must issue the data of write transactions in the same order in which it issues the transaction addresses. Please could you send me the correct information about this.


3/ Why did AXI4 remove the write data interleaving feature? I think it is very helpful for high performance.

Chinese version: A Brief History of ARM: Part 2

Continuing on from my earlier blog, A Brief History of ARM: Part 1, I pick up the history of ARM® in 1998, when it became a public company after being floated on the London Stock Exchange (FTSE) and the NASDAQ. Please see the ARM milestones on the website for a more in-depth look at the breakdown of the years ARM has been operating. There is also a nice section on the website that details the range of classic ARM processors.


Late 1990s: Growth, the dot-com crash and the pressures of being a public company

The late 1990s were a booming time for technology, and ARM was at the height of that boom. The problem was the disruptive internet bubble, in which many technology companies sprang up out of nowhere and were valued at ridiculous amounts even though they were built on debt, often with very little in the way of concrete revenue forecasts. ARM was riding the high stock valuations of technology companies with a share price of over £10, valuing the company at over 300 times its actual earnings for 1999 (crazy, I know), and it was ranked 30th on the FTSE 100 index of most valuable companies. A valuation that massive for a company of ARM's size at the time is something that has not been seen since the dot-com boom.


In the early 2000s the inevitable happened and the crash came. The technology sector crumbled and devalued overall on the stock market by 80-90%. Irrespective of whether a company was profitable or not, it took a sharp decline. This is best illustrated by global semiconductor revenues for those years:




Semiconductor revenue since 1998 - note the drop at 2001 and 2009 recessions - Credit - Can lean innovation bring growth and profits back to semiconductors? | Solid State Technology



Even though ARM hit its earnings targets and had no debt or financial disappointments, it still felt the squeeze of the recession. People who worked here at that time have described it as a period of ‘unhappiness’ and ‘not a very nice place to be’. Industry redundancies came, but the workforce at ARM thankfully wasn’t hit too badly. ARM had entered a new age where it didn’t live quarter to quarter. The company carefully laid out roadmaps for where the business would be in 5 years and started to follow a long-term plan. In 2001 Warren East was appointed CEO of ARM, with Robin Saxby taking up the role of Chairman. The vision of becoming the standard processor architecture was coming true.



The Age of Maturity, 2002-2005

Microprocessors had become so small that they only occupied a small part of the chip, so the issue was how to build software-based systems on a single chip, or system on chip (SoC) solutions, notwithstanding the huge investment and cost of ownership associated with maintaining a proprietary processor architecture. The majority of companies had neither design teams with the ability to build their own microprocessor nor the tools needed to make them usable. This is one of the main reasons that microprocessors were one of the first products to use the IP license model and, as a result, ARM was designed into more and more SoCs, especially in the explosively growing cell-phone market, where ARM had gradually become the de facto standard.


However, the ARM core was “hard IP” and its application to different technologies was a real problem. ARM needed to produce a synthesizable core that could be licensed to anyone without needing a technology-specific port of the core. In 2001 the ARM926EJ-S was announced. It was fully synthesizable, with a 5-stage pipeline and an integrated MMU, as well as hardware support for Java acceleration and some DSP operations. It went on to be licensed by over 100 silicon vendors worldwide and has shipped multiple billions of units.


The importance of the success of ARM’s partners was something that really stood out to me when I joined ARM nearly a year ago. As mentioned in my previous blog, the success of partners means success for ARM. It is one of the rare times in business when each company with which ARM conducts business has a symbiotic relationship of ‘better together’. This approach is aided by the fact that ARM is British, and headquartered in Britain (a somewhat neutral ground for the semiconductor industry). ARM has never been focused solely on a select number of partners, and to this day has had multiple hundreds of them (as seen below)



The ARM Connected Community includes 1200+ Partners



After the dot-com crash the industry was still recovering, as was ARM. However, there was steady growth: the ARM9 became the new ARM7, which became the ARM9E and then the ARM10. The ARM10 and ARM11 technology really broke new ground in terms of low-power, high-performance processing. ARM tripled its headcount from 400 people to 1,300 people in only 3 years! But ARM was a more mature, smarter company by this point, and realized it could not sustain the ‘up and to the right’ trend with its current offerings alone. It decided to diversify the offerings to cover all the needs of the industry.



The Age of Cortex®, 2005-2012

The Cortex family was the diversification that ARM brought to the industry. Cortex-A continued on from the ARM11, following the trend of leading-edge mobile applications demanding higher performance. Cortex-R provided high-performance processors that catered for highly specialized real-time requirements. Cortex-M provided extremely low-power, low-cost cores to the microcontroller industry. This was driven by the simple observation that the market for high-performance processors is huge, but the market for low-cost microcontrollers is truly colossal, and this market was not well addressed by the latest ARM cores.



The most recent Cortex family of processors.



By 2008 the smartphone market was booming, and the demand for increased performance while maintaining a long battery life presented quite a challenge. Ever more powerful single-core architectures would not be the solution forever, and ARM responded with the Cortex-A9 MPCore, a multi-core processor that was better able to address the huge dynamic range in processing needed to accommodate a smartphone's vastly different user needs, from gaming to texting. This was further improved with the introduction of the heterogeneous big.LITTLE™ approach in 2011, which provides high performance with a powerful core when required and then switches back to a much lower-power core when high performance is not needed.

ARM currently has a 96% share in the mobile market, and shows no signs of slowing down.


If you want a more detailed breakdown of the Cortex series, please view this great breakdown by Chris Shore - in his blog Navigating the Cortex Maze



The Age of Leadership and Diversification, 2012 to the future

The ARM model is another really interesting aspect of the history. The licensing provides up-front revenue for ARM, but the royalties don’t come into play until up to 5 years after. So technically, ARM is licensing processors that don’t start fully generating revenue until 5 years later. An example of this is the ARM7, which is no longer sold or supported by ARM, but you can see it has shipped more each year (date range up to 2011).




ARM7 shipments up to 2011 - note the upward trend even though ARM stopped supporting it many years ago.




With the Cortex series being over 10 years old, the royalties will be building up for the chips produced over the next few years, and we can see from the graph below that ARM-based units are growing massively year on year, currently at 12 billion a year, and have just passed 60 billion total units shipped. At this rate, ARM is set to reach 20 billion units shipped a year (and a total of 150 billion shipped) by 2020. This is simply a number beyond belief and expresses the importance of partners to ARM once again.






ARM partner shipments from the most recent ARM Strategic Report





What’s next for ARM?


Well if you want to know more, these blogs may be of interest!

Meet the new ARM Cortex-M7 processor: supercharging embedded devices by Bee Hayes-Thakore

ARM Cortex-A72 and the New Premium Mobile Experience by Nandan Nayampally

System IP for 2016 Premium Mobile Systems by Andy Nightingale

ARM Cortex-R real-time processors speed your mobile communications by Chris Turner


To finish, I'll share a great photo which Chris Shore introduced to me. It really demonstrates the innovation of the ARM processor over the years. Pushing performance barriers and power demands is not the only way to go. Only one part of the market lies in this quadrant.








Here is the very first ARM1, developed by Acorn in 1985. 6000 gates and 50mm2 on 3u technology. Beside it is the Cortex-M0, 8000 gates and less than 1/10000th of the size on 20nm technology. At the other end of the scale, here is a dual-core Cortex-A9 with Mali-400 graphics core. Weighing in at 100M gates, on 40nm technology it is almost exactly the same size as the original ARM1 yet it is nearly impossible to explain the vast difference in performance.

It is certainly true to say that no other processor company offers a range of solutions which are as diverse or as complete. ARM is a unique company, and its history proves that!


If you feel there are more topics I should research for a blog, please comment and I'll get working on it!

Thanks to all that helped with the research of this blog including Eoin McCann, Alban Rampon and Chris Shore

How the ARM Architecture has fostered differentiation through diversity?


Since 2014, there has been an ever-increasing number of devices shipping with ARMv8-A based Cortex processors – ranging from $65 smartphones to premium flagship devices. This is a wide range, and evidence of the ways in which the transition to 64-bit continues the advance in system design and process technology in the mobile space, enabling a fresh wave of innovation on the ARM architecture. I thought now would be a good time to explore the degrees of freedom ARM partners have in building SoCs based on the ARM CPU architecture.


When designing a CPU, ARM IP offers two levels of possible differentiation through the ARM licensing model – proprietary or custom microarchitecture and an ARM Cortex processor with system design and implementation choices. Both are fully compatible with the ARM architecture.



  • Proprietary microarchitecture

This allows our partners to license one of the architectures (e.g. ARMv8-A or ARMv7) and have their own implementation of the ARM ISA. The ISA remains unaltered in these cases but partners can choose their own approach to design a CPU from the ground up that complies with the ARM architecture specification.


ARM partners do this to target unique design points or features to address specific segments of the market, albeit at higher development cost. It is important to remember that independently developed, proprietary microarchitecture CPUs based on the ARM architecture have to pass an ARM-mandated compliance suite to ensure that they are 100% compatible with the ARM architecture. This ensures the ecosystem value of the ARM partnership is preserved and enhanced - code written for custom ARM Architecture CPUs will run on other ARM CPUs.


  • ARM Cortex Processor

Partners license ARM designed implementations of the ARM Architecture, such as the ARM Cortex-A processors. At ARM, we are focused on sustaining and growing the largest ecosystem on the planet for efficient computing. Software developed for one ARM-based SoC will run on any other ARM-based SoC that uses the same or newer version of the ARM architecture.


When licensing any combination of Cortex Processors, partners configure the cores to suit their applications without modifying the microarchitecture. This retains the strong foundation of software compatibility. We take great care to ensure that no special modifications are made that could break this compatibility – it is extremely important that all ARM SoCs in a given profile (Cortex-A, Cortex-R, Cortex-M) are software compatible so that the ecosystem is as broad and deep as possible.


Innovation and differentiation within the ARM ecosystem


Even with a “standard” Cortex CPU, there are many ways that partners can in fact differentiate.


  • CPU configuration:

Partners who license ARM CPUs can choose the cache size (L1 and L2), bus interface (e.g. AMBA4 or AMBA5), number of cores in a cluster (1 to 4), and how many CPU clusters to use in the design (2 clusters in a big.LITTLE design, for example). We have seen partners build 2+4 big.LITTLE configurations with 2 high performance cores and 4 max efficiency cores for midrange and premium smartphone markets, and 4+4 topologies for higher end smartphone and tablet markets. Similarly, we have seen partners build 2 clusters of 4 LITTLE cores to deliver octa-core capabilities at low to mid-range price points.


L2 size is an important factor in performance on many benchmarks, so high-end designs often push L2 sizes to 2MB for the high performance CPU cluster; low-end and mid-range designs can sometimes play this trade-off differently, with a 1MB L2 for the high performance cluster, or 512kB L2 cache size for a high efficiency CPU cluster in a big.LITTLE SoC, trading off performance for cost savings. This range of configurability allows ARM partners to tailor the CPU capacity in their SoC to their target markets, while retaining full compatibility with the ARM architecture and full access to the benefits of the ARM ecosystem.
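To make the configuration space concrete, here is a minimal sketch of how these choices might be modelled. The `Cluster` class and `topology` helper are hypothetical names for illustration only, not an ARM configuration format; the ranges (1 to 4 cores per cluster, L2 sizes from 512kB to 2MB) are taken from the text.

```python
from dataclasses import dataclass

# Hypothetical model for illustration; this is not an ARM configuration format.


@dataclass(frozen=True)
class Cluster:
    core_type: str  # e.g. "big" or "LITTLE"
    cores: int      # 1 to 4 cores per cluster, per the text
    l2_kb: int      # shared L2 cache size in kB

    def __post_init__(self):
        if not 1 <= self.cores <= 4:
            raise ValueError("a cluster holds 1 to 4 cores")
        if self.l2_kb <= 0:
            raise ValueError("L2 size must be positive")


def topology(clusters):
    """Summarise a design as a label such as '2+4' plus the total core count."""
    label = "+".join(str(c.cores) for c in clusters)
    return label, sum(c.cores for c in clusters)


# A premium 2+4 big.LITTLE design with a 2MB L2 on the big cluster,
# and a cost-focused design built from two clusters of four LITTLE cores.
premium = [Cluster("big", 2, 2048), Cluster("LITTLE", 4, 512)]
octa_little = [Cluster("LITTLE", 4, 512), Cluster("LITTLE", 4, 512)]

print(topology(premium))      # ('2+4', 6)
print(topology(octa_little))  # ('4+4', 8)
```

The point of the sketch is that two SoCs built from the same licensed cores can differ in every field here while remaining identical from a software perspective.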


  • Power domains:

Cortex-A CPU IP comes with optional power domains around each CPU core, the L2 subsystem, and other areas of the design. Partners can choose how to implement these voltage domains, and can choose to share or group some domains. Further to this, ARM introduced state retention modes for CPU cores and for the Advanced SIMD units in some of our more recent CPUs that partners can optionally use to offer finer grained power management in the SoC.

  • Peripherals:

There are of course numerous peripherals and interfaces beyond the CPU, GPU, and other processing subsystems that can differentiate an SoC. By taking standard Cortex-A CPUs, some partners choose to devote more of their engineering resources to optimizing and tuning specific peripherals and interfaces to differentiate their SoCs.


  • Memory system performance:

Although every Cortex-A CPU is equivalent to every other Cortex-A CPU of the same revision in terms of performance within the CPU, often CPU performance depends quite heavily on memory system performance, and we can observe two Cortex-A CPUs of the same type delivering significantly different performance as a result of this. As one example, the latency to L2 memory depends on the number of slices a partner uses to meet timing for their target frequency; a partner with lower latency to the L2 will have an advantage in performance benchmarks that spill outside the L1 instruction or data caches. As another example, the latency to main memory can differ a lot from one SoC to another - if one SoC has a memory latency of 100 cycles and the other 140 cycles, the 100 cycle latency memory system will be a big advantage in many (but not all) of the key benchmarks, and is often an observable advantage in terms of delivered performance on real-world workloads.
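A first-order stall model shows why the 100-cycle versus 140-cycle example matters. The base CPI and the miss rate below are assumed values chosen for illustration, not measurements of any Cortex-A CPU, and the model ignores overlapping misses, so it overstates the penalty on cores that keep several misses in flight.

```python
def cycles_per_instruction(base_cpi, misses_per_kilo_instr, memory_latency_cycles):
    """Simple stall model: every miss to main memory pays the full latency."""
    return base_cpi + (misses_per_kilo_instr / 1000.0) * memory_latency_cycles


BASE_CPI = 1.0  # assumed CPI with a perfect memory system
MPKI = 5.0      # assumed misses to main memory per 1000 instructions

fast_soc = cycles_per_instruction(BASE_CPI, MPKI, 100)  # 100-cycle memory latency
slow_soc = cycles_per_instruction(BASE_CPI, MPKI, 140)  # 140-cycle memory latency
speedup = slow_soc / fast_soc

print(f"CPI at 100 cycles: {fast_soc:.2f}")
print(f"CPI at 140 cycles: {slow_soc:.2f}")
print(f"Advantage of the lower-latency SoC: {(speedup - 1) * 100:.1f}%")
```

With these assumed inputs the same CPU is roughly 13% faster on the lower-latency memory system. A workload that rarely misses the caches would see almost no difference, which is why the advantage shows up in many but not all benchmarks.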


Often partners seek to differentiate on memory system performance, recognizing the large impact this has on overall performance even against other SoCs with the same Cortex-A CPU. One last point on the topic of memory system performance; CPU performance is very sensitive to latency to main memory, and GPU performance is more sensitive to bandwidth to main memory, so ARM partners will optimize and balance between latency and bandwidth in the design of the memory system for their target applications.


  • SoC level power management:

The way in which a given SoC manages power incorporates several different mechanisms to slow down or shut down components when under light or zero demand during different phases of use. With so many different components in an SoC design, ARM partners have a lot of ways in which they can manage power, and some partners differentiate on the power management mechanisms in the SoC, the big.LITTLE tuning and power management framework, or the software that organizes the management of component shutdown and presents it to the OS or middleware.


There are several system and implementation choices which further offer ways for ARM partners to differentiate when using Cortex-A standard CPUs:

  • Process node: ARM IP is shipped as synthesizable RTL that can be implemented on several different process nodes. Today (early 2015) partners at the highest premium end of the market are building with ARM IP on 16nm and 14nm, while many premium designs are being built and currently shipping on 20nm, with a range of designs targeting 28nm for lower-cost premium SoC platforms for the mid-range and entry level. The frequency and power characteristics can vary significantly for the same ARM CPU implemented on different process nodes, so the choice of process remains one of the main (and most obvious) ways that partners differentiate on ARM IP.


  • Physical implementation: The time and effort spent on physical placement, routing, and optimization of the logic and RAM arrays in a design can significantly differentiate one Cortex-A CPU from another. For example, investment in physical design can produce higher maximum frequency for the same design, lower power at the same maximum frequency, or some combination of the two. Also, partners sometimes iterate on the physical design of a CPU, such that the 2nd or 3rd generation of a product can be significantly improved in power, performance, and area (cost) characteristics due to improvements in the physical implementation of the same Cortex-A CPU, providing further differentiation for the partner. ARM POP IP has been a factor in improving the quality of results that can be achieved in physical design by partners, and also improves the next differentiation factor in this list… time to market.


  • Time to market: Release windows are critical in markets like high-end smartphones and premium tablets, where a delay of one month can mean missing a whole year design cycle for devices with an annual refresh. Some partners differentiate on being very fast to market based on designs with Cortex-A CPUs. Often in those fast markets, the initial SoC product will be followed with a revised version that improves on the original.


  • GPU, ISP, video and audio subsystems: In a modern mobile SoC, the performance of the chip is often influenced even more strongly by the performance of the graphics processor, the image processing, the video and audio subsystem, and of course the way these components all work together. ARM provides industry leading IP in the Mali GPU and video subsystem, but we allow our partners to mix and match between our IP, their own IP, and that of 3rd parties. This allows the ARM partnership to experiment with different combinations of IP, iterate rapidly, and compete for the best combination in each device generation. This competitive iteration has led to rapid innovation in smartphones and tablets and is a key benefit of the ARM ecosystem, a benefit that is now well established in networking markets, and making inroads into server markets, for example.


  • System design: The way in which the CPU, GPU, ISP, video subsystem, coherent interconnect, and memory system work together as a combined system is an increasingly important factor in modern SoCs, and a key way for partners to differentiate their chips. Examples of differentiation in the system design include the use and configuration of cache coherent interconnect, next level cache memories, dynamic memory controllers, and the software that configures the system and optimizes things like power down modes and operating points at run-time.


  • Software: Beyond the hardware IP and custom components in an SoC, there is of course the software that configures and operates the SoC. The key attribute we have been discussing is the compatibility of all ARM-based designs, so that the Linux kernel, application software, and middleware all run the same on ARM-based CPUs. ARM partners can differentiate along all of the dimensions listed above, and still maintain full software compatibility that allows them to tap in to the vast wealth of software written for the ARM architecture. The chip support and board support packages with a given SoC can be a point of differentiation for ARM partners that invest there.


As a result of all of these opportunities for differentiation, any 2 Cortex-A57, Cortex-A72, or Cortex-A53-based processors can be quite different in their system, power, and performance characteristics, while still being identical from a software perspective.


A quick listing of ways the performance can differ (summarizing some of the points made above):

  • Max frequency (and max sustainable frequency - influenced by power)
  • Power (affects sustained frequency in a thermally constrained environment)
  • Latency to the L2
  • Latency to main memory
  • Bandwidth to main memory
  • L2 size (and L1 size for some ARM CPUs)
  • big.LITTLE topology - number of cores
  • big.LITTLE tuning and scheduling policy
  • Coherent interconnect


Beyond all these of course, our partners innovate around the core with their own IP blocks and design techniques.


To sum up, ARM prioritizes the value of the ecosystem - that ability to design code for all ARM-based CPUs of a given architecture release - and offers partners two ways in which to achieve this – through proprietary microarchitecture or by licensing standard ARM Cortex CPUs. There remain numerous important ways in which ARM partners can differentiate and as our partners can and do differentiate along all of these dimensions, it is very important to analyze these characteristics when assessing one SoC based on an ARM Cortex-A core against another.


A benefit of this range of configurability and differentiation is that ARM CPU IP can scale to address a broad range of different markets, and the ARM partnership can respond quickly as new markets start to emerge. An example of this is the recent emergence of the wearables market. ARM partners have repurposed low-end smartphone SoCs, based on the very low-power Cortex-A7, to service the initial wave of watches, along with even lower power Cortex-M CPUs (an order of magnitude less power) for fitness bands and other wearables that don’t require a UI, complex display, or MMU-based OS. Now we are starting to see Cortex-A7 based designs optimized specifically for wearable products, and the targeted physical implementation enables a low-power wearable implementation that runs under 10mW at 100MHz for a full Apps core - this coming from the same Cortex-A CPU that is shipping in 8-core 2GHz versions for low-cost mid-range smartphones.


Clearly it is important for OEMs to assess the many differentiating factors when choosing between ARM-based SoCs for devices, and it is even more critical for ARM partners to differentiate along each of these paths in the competitive market for SoCs in the ARM ecosystem. It is through this freedom of choice that the ARM partnership has innovated so rapidly and will continue to do so as the ARM ecosystem expands to more fully serve other markets.

In early 2015, ARM announced a suite of IP for Premium Mobile designs, with the ARM® Cortex®-A72 Processor delivering a 3.5x increase in sustained delivered performance over 28nm Cortex-A15 designs from just two years ago. At ARM we are focused on delivering IP that enables the most efficient computing, and the Cortex-A72 micro-architecture is the result of several enhancements that increase performance while simultaneously decreasing power. Last week, at the Linley Mobile Conference, I had the pleasure of introducing the audience to the micro-architectural detail of this new processor, which I thought would be good to summarize here as there has been quite some interest.


From a CPU Performance view, we have seen a tremendous growth in performance: a 50x increase in the last five years (15x at the individual core level). The graph below zooms in on performance increases in single core workloads broken into floating point, memory, and integer performance. All points shown up to 2015 are measured from devices in market, and Cortex-A72 is projected based on lab measurements to give a preview of what is expected later in 2015 and in 2016. The micro-architectural improvements in Cortex-A72 result in a tremendous increase across all aspects – floating point, CPU memory and integer performance. For the next generation of mobile designs, Cortex-A72, particularly on 14nm/16nm process technology, is going to change the game – the combination of the performance and efficiency of this process node and CPU are extremely compelling.




The improvements shown here have come through improvements at the micro-architectural level coupled with increasing clock frequencies. But delivering peak performance alone isn’t the challenge of designers – mobile devices are characterized by the constrained thermal envelope SoC designers have to operate within. Hence, to increase performance within the mobile power and thermal envelope, turning up the frequency or increasing the issue rate in the micro architecture isn’t the answer – you have to improve power efficiency.




The Cortex-A72 micro-architectural improvements increase efficiency so much that it can deliver the same performance as Cortex-A15 at half the power even on 28nm, and at 75% less power on 14/16nm FinFET nodes. The performance of a Cortex-A15 CPU can be reproduced on the Cortex-A72 processor at reduced frequency and voltage, resulting in a dramatic power reduction. However, mobile apps often push the CPU to maximum performance rather than a specific absolute required level of performance. In this case, a 2.5GHz Cortex-A72 CPU consumes 30~35% less power than the 28nm Cortex-A15 processor, while still delivering more than 2x the peak performance.
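The reason reduced frequency and voltage pay off so well can be seen from the usual first-order relation for switching power, P = C·V²·f. The sketch below uses invented operating points purely for illustration; they are not Cortex-A15 or Cortex-A72 figures, and leakage power is ignored.

```python
def dynamic_power(capacitance, voltage, frequency_hz):
    """First-order switching power: P = C * V^2 * f (leakage ignored)."""
    return capacitance * voltage ** 2 * frequency_hz


# Assumed, normalised operating points (not real silicon data):
# the scaled point runs at 80% of the frequency and 85% of the voltage.
baseline = dynamic_power(capacitance=1.0, voltage=1.0, frequency_hz=2.0e9)
scaled = dynamic_power(capacitance=1.0, voltage=0.85, frequency_hz=1.6e9)
ratio = scaled / baseline

print(f"Power at the scaled operating point: {ratio * 100:.1f}% of baseline")
```

Because voltage enters as a square, a core that reaches the required performance at a lower clock can also drop its supply voltage, so power falls faster than frequency. In this assumed example a 20% frequency reduction yields roughly a 42% power reduction.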


Enhancements to the Cortex-A72 micro-architecture

Below is a simplified view of the micro-architecture. Those familiar with the Cortex-A57 pipeline will recognize that the Cortex-A72 CPU sports a similar 3-wide decode, 8-wide issue pipeline. However, in Cortex-A72 the dispatch unit has been widened to deliver up to 5 instructions (micro-ops) per cycle to the execution pipelines.



I list here some key changes and the difference they make (an exhaustive list would be too long!) that highlight the way in which the design of Cortex-A72 CPU was approached, beginning with the pipeline front end.

Pipeline front end

One of the most impactful changes in the Cortex-A72 micro-architecture is the move to a sophisticated new branch prediction unit. There is an interesting trade-off here - a larger branch predictor can cost more power, but for realistic workloads where branch misses occur, the new predictor’s reduction in branch miss rate more than pays for itself in reduction of mis-prediction and mis-speculation. This reduces overall power, while simultaneously improving performance across a broad range of benchmarks.


The instruction cache has been redesigned to optimize tag look-up such that the power of the 3-way cache is similar to the power of a direct mapped cache – doing early tag lookup in just one way of the data RAM instead of 3 ways. The TLBs and micro BTBs have been regionalized, so that the upper bits can be disabled for the common case when page lookups and branch targets are closer rather than farther away. Similarly, small-offset branch-target optimizations reduce power when your branch target is close. Suppression of superfluous branch predictor accesses will reduce power in large basic blocks of code – the A72 recognizes these and does not access the branch predictor during those loops to save power.


Decode/Rename block

Of the many changes in the decode block, the biggest change is in handling of microOps – the Cortex-A72 keeps them more complex up to the dispatch stages – this increases performance and reduces decode power. AArch64 instruction-fusion capability deepens the window for instruction level parallelism. In addition to this, the decode block has undergone extensive decoder power optimization, with buffer optimization and flow-control optimizations throughout the decode/rename unit.



In the dispatch/retire section of the pipeline, the effective dispatch bandwidth has increased to 5-wide dispatch, offering increased performance (by increasing instruction throughput), while reducing decode power – decoding full instructions rather than microOps gets more work done per stage for those instructions. Cortex-A72 also features a power-optimized reorganization of the architectural and speculative register files, with significant port and area reduction. It also has optimizations in the commit-queue and register-status FIFOs, arranging and organizing them in a more power-efficient manner.


One final example of the improvements in the dispatch/retire section is the suppression of superfluous register-file accesses - detecting cases where operand data is guaranteed to be in the forwarding network. Every time you avoid a read from the register file, you save power.


Floating Point Unit and Advanced SIMD

Here the biggest improvement is the introduction of new lower latency FP functional units. We’ve reduced latencies to:

  • 3-cycle FMUL unit (40% latency reduction)
  • 3-cycle FADD unit (25% latency reduction)
  • 6-cycle FMAC (33% latency reduction)
  • 2-cycle CVT units (50% latency reduction)

These are very fast floating point latencies, comparable with the latest high performance server and PC CPUs. Floating point latency is important in typical mobile and consumer use cases where there is commonly a mix of FP and integer work. In these settings, the latency between computation and result is critical. Shorter latencies mean integer instructions waiting on the results of those instructions are less likely to be stalled.
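The quoted percentages can be turned back into the latencies they imply for the previous design. The "previous" values printed below are derived from the numbers in the list above; they are not stated directly in the text.

```python
# (new latency in cycles, quoted fractional reduction) from the list above
improved = {
    "FMUL": (3, 0.40),
    "FADD": (3, 0.25),
    "FMAC": (6, 0.33),
    "CVT": (2, 0.50),
}


def implied_previous_latency(new_cycles, reduction):
    """Invert new = old * (1 - reduction) and round to whole cycles."""
    return round(new_cycles / (1.0 - reduction))


previous = {unit: implied_previous_latency(*spec) for unit, spec in improved.items()}

for unit, (new_cycles, _) in improved.items():
    print(f"{unit}: {previous[unit]} cycles -> {new_cycles} cycles")
```

This gives implied earlier latencies of 5, 4, 9 and 4 cycles for FMUL, FADD, FMAC and CVT respectively, so a dependent chain of floating point operations finishes in noticeably fewer cycles.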


This performance increase shows up in SpecFP and SpecFP2006 as an uplift of approximately 25%. This type of improvement is less useful for high-performance compute applications where pure floating point throughput is required. For mobile use cases, floating point shows up in combination with integer work. A good example of this combination of floating point and integer is JavaScript code, where the native data type is double precision float. In addition, the divide unit has gone to a Radix-16 FP divider, doubling the throughput of divide instructions executed.


Other improvements in this area of the design include an improved issue-queue load-balancing algorithm, and multiple zero-cycle forwarding data paths resulting in improved performance and reduced power. Finally, the design features a source-reduction in the integer issue-queue which cuts power without performance loss.


Load/Store unit

The Load/Store unit features several key optimizations. The main improvement is the replacement of the pre-fetcher with a more sophisticated combined L1/L2 data prefetcher - it is more advanced and recognizes more streams. The Load/Store unit also includes late-pipe power reduction with a L1 D-cache hit predictor. Performance tuning of Load-Store buffer structures improves both effective utilization as well as capacity. Increased parallelism in MMU table-walker and a reduced-latency L2 main TLB improves performance on typical code scenarios where data is spread across many data pages, for example in Web browsing where data memory pages typically change frequently. Finally, there is power optimization in configurable pipeline support logic, and extensive power optimization in L2-idle scenario. The combination of the new pre-fetcher and other changes enable an increase of more than 50% CPU memory bandwidth.
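To illustrate what "recognizing a stream" means, here is a toy stride detector. It is a generic textbook-style sketch with hypothetical names, not a description of the Cortex-A72 prefetcher, which tracks more streams and coordinates the L1 and L2 caches.

```python
class StridePrefetcher:
    """Toy single-stream stride detector (illustrative only)."""

    def __init__(self, confirmations=2):
        self.last_addr = None
        self.stride = None
        self.hits = 0
        self.confirmations = confirmations  # repeats needed before prefetching

    def access(self, addr):
        """Record a demand access; return an address to prefetch, or None."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.stride:
                self.hits += 1      # same stride seen again
            else:
                self.stride = stride
                self.hits = 1       # start tracking a new candidate stride
            if self.hits >= self.confirmations:
                prefetch = addr + self.stride
        self.last_addr = addr
        return prefetch


pf = StridePrefetcher()
requests = [pf.access(a) for a in (0x1000, 0x1040, 0x1080, 0x10C0)]
print([hex(r) if r is not None else None for r in requests])
# [None, None, '0x10c0', '0x1100']
```

Once the 0x40 stride has been seen twice, the detector starts requesting the next line ahead of the demand access, which is how a prefetcher hides memory latency on regular access patterns.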


In summary, the Cortex-A72 micro-architecture is a significant change from Cortex-A57. There are several changes to the micro-architecture working in concert to produce a big step up in performance while, importantly, also consuming less power. The important takeaways for the ARM Cortex-A72 are:


  • Performance improvements in numerous key areas
  • Generational performance upside across all workload categories
  • Extensive power-efficiency improvements throughout the microarchitecture
  • Reduced-area, lower-cost solution


Cortex-A72 is truly the next generation of high-efficiency compute. A Cortex-A72 CPU based system increases performance and system bandwidth further when combined with ARM CoreLink CCI-500 Interconnect. A premium mobile subsystem will also contain the Cortex-A53 in a big.LITTLE CPU subsystem to reduce power and increase sustained performance in thermally constrained workloads such as gaming. Finally, the ARM Mali-T880 graphics processor can combine with Cortex-A72 and CCI-500 to create a premium mobile system. That’s all for this edition of the blog. What further features of the Cortex-A72 and the big.LITTLE premium mobile system are you interested in hearing about in future editions?

A technical report from ARM Reveals Cortex-A72 Architecture Details, hope it helps.

The modern SoC is a feat of engineering that continually squeezes greater performance from defined power and area constraints. However the arch nemesis of reliability is complexity.


“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”

          — Brian W.Kernighan and P. J. Plauger in The Elements of Programming Style


As SoC complexity continues to grow exponentially, it is only wise to build some advanced debug capability into the SoC. We’re all familiar with the concept of “a stitch in time saves nine” and this is particularly relevant for debugging; the later you find a bug, the more tedious, time-consuming and expensive it becomes to resolve. Visibility is a precious resource to system designers, as it gives them an opportunity to spot bugs early, and make subtle changes that can alter and optimize an SoC’s performance. On-chip visibility acts as a screening process to identify any snags.

There are certain SoC bugs that tend to manifest themselves through either a data corruption or a system lock up which occurs only when a series of contributing factors align to cause the fault. Factors may be as diverse as manufacturing tolerances being exceeded, bit errors being introduced, complex real-world software exercising new unvalidated spaces, or race conditions between multiple out-of-order transactions.

So if your design does get hit by a rare issue that is extremely difficult to reproduce and tricky to diagnose, it’s critical you have some tools to deploy to help you get to the bottom of the problem as fast as possible. Almost by definition, any bug found in silicon is not going to be found by a simple test case you can run on a simulator or emulator of parts of your design.




Diagnosing the problem

The complexity of multi-core processors and cache coherent interconnects means much of what was previously visible through CoreSight Embedded Trace Macrocells (ETM), essentially the programmer’s view, is now hidden inside the IP blocks.

With this in mind, ARM has developed a new weapon to add to the CoreSight on-chip debug and trace armoury, the CoreSight™ ELA-500 Embedded Logic Analyzer, in order to provide a more accurate diagnosis of system bugs. As the name suggests, this is a logic analyzer-like IP block for embedding into your SoC. It can monitor up to 12 groups of 128 signals, generate triggers from assertion-like conditions, and collect a recent trace history of selected signals in a small embedded SRAM.




Example debug setup with ELA

So step one is to find out the state your system has got itself into and what illegal or suspicious condition has occurred. The trace aspect of debug is similar to a detective using CCTV cameras when solving a crime. To help with this, the ELA-500 contains a way to set up complex multi-state conditional triggers such as:


  • Trace next 6 write requests plus cache attributes to address 0x12345678
  • Load request from core 0 to address A will advance to trigger_state_1 which
    will then trigger debug mode after core 1 read from address A

The ELA-500 provides a number of tools to discover any malicious conditions:


  • A state machine with 4 trigger states programmable in any sequence including loops
  • Each trigger state can select one of the 12 signal groups as input for trigger conditions
  • Each trigger condition is programmable for comparisons to mask and match any
    combination of 128 signals:  =, !=,  >,  >=, <, <=
  • Each trigger state has a 32-bit counter input to count events, count clock cycles or act
    as watchdog timer
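The second example trigger above (a load from core 0 to address A, followed by a read from core 1 to the same address) can be sketched as a small state machine. Everything here is a behavioural illustration with hypothetical names and a made-up signal packing; it is not the ELA-500 register interface, and the 32-bit counters are left out for brevity.

```python
import operator
from dataclasses import dataclass

COMPARATORS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
               ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class TriggerState:
    group: int        # which of the 12 signal groups this state observes
    mask: int         # which of the 128 signal bits take part in the compare
    value: int        # value compared against the masked signals
    compare: str      # one of the six comparison types
    next_state: int   # state entered when the condition is met
    action: str = ""  # optional output action, e.g. "enter_debug"


class TriggerMachine:
    def __init__(self, states):
        if not 1 <= len(states) <= 4:
            raise ValueError("the text describes 4 trigger states")
        self.states = states
        self.current = 0

    def sample(self, groups):
        """groups maps group index to the value sampled this cycle."""
        state = self.states[self.current]
        observed = groups.get(state.group, 0) & state.mask
        if COMPARATORS[state.compare](observed, state.value):
            self.current = state.next_state
            return state.action
        return ""


def encode(addr, core, is_load):
    """Hypothetical packing: address in bits 0-31, core in 32-35, load in 36."""
    return addr | (core << 32) | (int(is_load) << 36)


ADDR_A = 0x12345678
ALL_BITS = (1 << 37) - 1
machine = TriggerMachine([
    TriggerState(0, ALL_BITS, encode(ADDR_A, 0, True), "==", next_state=1),
    TriggerState(0, ALL_BITS, encode(ADDR_A, 1, True), "==", next_state=1,
                 action="enter_debug"),
])

traffic = [encode(ADDR_A, 1, True),   # core 1 first: ignored in state 0
           encode(ADDR_A, 0, True),   # core 0 load: advance to state 1
           encode(0x1000, 1, True),   # unrelated address: no match
           encode(ADDR_A, 1, True)]   # core 1 read of A: trigger
actions = [machine.sample({0: t}) for t in traffic]
print(actions)  # ['', '', '', 'enter_debug']
```

Note that the first core 1 access does not fire the trigger because the machine has not yet seen the core 0 load. That ordering sensitivity is what distinguishes a multi-state trigger from a single comparator.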



  Figure 1: Trigger set up in the ELA-500


Step 2 is to start looking at what happened around this suspect state or condition, which can be done by storing selected signal states to the ELA-500 dedicated SRAM, configurable between four and over one billion trace data entries, or by triggering another action outside the ELA-500.

Up to 8 programmable output actions can be triggered for each trigger state, such as: stop clocks, enter debug state, start/stop signal trace, trigger another logic analyser or ETM, or assert a CPU interrupt.

It is likely that, from the information gleaned, new trigger conditions will be set to see what other unexpected conditions or states are occurring, so steps 1 and 2 are repeated to establish the chain of events leading to the error condition.

For really extreme cases even further visibility may be required around the trigger condition, not visible except through a scan chain dump. For this, step 3 is to program a stop clock action on the ELA-500 and then use a scan chain dump and information on the SoC’s scan chains to provide the exact state of any and all registers within the SoC on a scan chain. The ELA-500 here provides the precision on which scan chain dumps to analyse, so less of this time-consuming exercise needs to be done.
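The idea of keeping a recent trace history that is frozen by a trigger action can be sketched with a ring buffer. The class and method names are hypothetical and the depth is tiny for readability; this illustrates the concept rather than the ELA-500 SRAM interface.

```python
from collections import deque


class TraceBuffer:
    """Ring buffer that keeps only the most recent samples (illustrative)."""

    def __init__(self, depth):
        self.entries = deque(maxlen=depth)  # oldest entries are overwritten
        self.running = True

    def capture(self, sample):
        if self.running:
            self.entries.append(sample)

    def stop(self):
        """Model a 'stop signal trace' output action from a trigger state."""
        self.running = False


buf = TraceBuffer(depth=4)
for cycle in range(10):
    buf.capture(cycle)   # here the sample is just the cycle number
    if cycle == 6:       # assume the trigger condition fires on cycle 6
        buf.stop()

history = list(buf.entries)
print(history)  # [3, 4, 5, 6]
```

After the stop, the buffer holds the samples leading up to the trigger, which is exactly the window a debugger needs when reading the trace back afterwards.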



Where to deploy an Embedded Logic Analyzer

The ELA-500 can monitor any signal you connect to its inputs. SoC designers will benefit from connecting up signals from ARM IP and proprietary or third party IP. A typical design might contain multiple ELA-500s deployed to monitor signals in different domains of the SoC, as shown in figure 2, with one per main processor cluster, one for the Cache Coherent Interconnect and one for other signals selected by the SoC designer.



Figure 2: Example deployment of the ELA-500 in a system


Figure 2 shows the clock stop requests (in red) running to the Clock Controller from each ELA and the connectivity (in black) of trigger in/out to the CoreSight Cross Trigger Interfaces (CTI) and the Cross Trigger Matrix (CTM). The debug APB bus is used both to set up trigger conditions and to read back the contents of the ELA’s SRAM, as controlled by the debugging tool, such as the ARM® DS-5™ debug tool.




Connecting the ELA-500 to the Cortex-A72 processor

For connection to ARM IP a Logic Analyzer IP Kit (LAK-500A) is provided with a pre-selected set of signals for that IP. The first of these is available for the recently released Cortex®-A72 processor to ensure the ELA-500 can sample signals at the maximum operating frequency of the Cortex-A72 without any impact on the operation of the processor.

The LAK-500A Logic Analyzer IP Kit includes the following:


  • Documented debug signal list and organization into 12 signal groups of 128 debug signals
  • A port puncher script that takes the debug signal list and adds connection to the top level
    ports of the Cortex-A72 processor. The script also has an option to add a register slice to
    debug signals to ensure timing closure
  • A LEC script to ensure nothing but the debug ports changed in the Cortex-A72 processor

The observation interface signals provide debug visibility of: each core-to-L2 interface, power-management interfaces, and the L2 memory system power-management interface. The core-to-L2 interface provides visibility of the physical addresses of L1 misses to the L2, and the following transaction details:


  • Memory type: normal, device, or strongly ordered
  • Read or write
  • Fetches
  • DSB or DMB
  • AArch32 or AArch64
  • L1 set index
  • Byte transfer size
  • Last data received
  • Memory attributes: not shareable, inner shareable, or outer and inner shareable
  • Whether access is from privileged mode
  • Read type: read clean, read unique, icache, data cache, or TLB invalidate
  • Write type: eviction, device, unique, or streaming
  • Eviction has double bit ECC error
  • Signals that determine proper operation of the Load/Store L2 interface
  • Core snoops, including cache maintenance, Instruction Cache Maintenance Operation
    (ICMO) and TLB Maintenance Operation (TMO)
  • L2 pre-fetch

Future support is planned for new ARM Cortex-A and Mali™ processors as well as the CoreLink™ CCI Cache Coherent Interconnects, where transactions in flight and snoop traffic can be observed.



CoreSight ELA-500 can find corner-case bugs

The CoreSight ELA-500 provides visibility into the states leading to lock-ups and data corruption. It provides visibility of CPU loads, stores, speculative fetches, cache activity and transaction lifecycle, properties that are not visible with existing ETM instruction trace. This offers greater scope for finding corner-case bugs that could spell disaster if discovered too late.

The ELA-500 can monitor error states and hazard conditions across the SoC, giving the visibility needed to debug lock-ups in designs without resorting to complex scan chain dump analysis, as well as cases with invalid accesses to device memory. The ELA can spot data corruptions early, whereas conventional timeouts occur too late and causation events are often lost or overwritten. I go into even more detail on some of the use cases for the CoreSight ELA-500 in a video interview with silicon debug expert Mark LaVine.
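
The point about causation events being lost or overwritten can be illustrated with a small sketch of trigger-stopped capture into a fixed-depth circular buffer, a common logic-analyzer technique. This is a software model under stated assumptions, not the ELA-500 implementation: `capture_until_trigger`, its parameters and the buffer depth are hypothetical names chosen for the example.

```python
from collections import deque


def capture_until_trigger(samples, is_trigger, depth):
    """Keep the last `depth` samples up to and including the trigger."""
    history = deque(maxlen=depth)   # oldest entries fall off automatically
    for sample in samples:
        history.append(sample)
        if is_trigger(sample):
            return list(history), True   # freeze the buffer at the trigger
    return list(history), False          # trigger never fired
```

Because capture stops the moment the trigger fires, the buffer still holds the events that led up to the fault. A late timeout, by contrast, would let the buffer keep wrapping and overwrite them.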

All this ensures you have the fastest debug route available should your SoC suffer a catastrophic failure found only when the silicon comes back and full software is running on the device.



A full specification of the CoreSight ELA-500 can be found on the ARM Infocenter



You can find more information on the CoreSight ELA-500 webpage

On the list of activities that system designers enjoy doing, “debugging” is invariably near the bottom. That’s because it is often complicated, time-consuming and downright frustrating to track down and identify what is going wrong on the chip. However it goes without saying that debug is a critical part of SoC development. The ‘quality control’ that debugging provides means that OEMs can be assured of a high standard of functionality from a chip. The peace of mind this affords is invaluable, much in the same way you would be a lot more relaxed in the knowledge the new car you have just bought has had its brakes tested for quality assurance.


An effective debug strategy requires an experienced head and a good set of tools to get things done properly. When it comes to experience, Mark LaVine is an expert on the matter, having spent the last 15 years developing debug and trace solutions in order to minimize the frustrations that system designers feel when attempting to diagnose on-chip problems. Mark sat down recently with William Orme to talk about some of the common challenges related to silicon debug and some of the strategies available to overcome them.


In the video below he opens up about the topic of silicon debug and the major problems that surround this area: “today we’re looking at highly integrated products with very limited visibility”. He goes through some of the scenarios that lead to bugs being found in silicon, as well as the implications they can have: “usually it’s either a lock up or data corruption. Data corruption is the most difficult area to debug because typically it gets detected very late, from where the originating corruption occurred. To do experiments and trace back the original source can be very time-consuming”.






Mark has just finished working on the development of the brand new ARM® CoreSight™ ELA-500 Embedded Logic Analyzer, which is designed especially to diagnose and identify corner-case bugs. These are the type of bugs that typically slip through the net of normal debug and trace protocols and only show up later in the process, when getting rid of them suddenly becomes a more arduous task, not to mention the greater costs involved in removing bugs found in silicon. In the video below you can see Mark speak about some examples of how the ELA-500 could be used to provide greater visibility and detect these issues before it’s too late, including on the new Cortex®-A72 processor: “With the Cortex-A72 processor we provide a visibility on the CPU to L2 interface, which is very useful for accesses that could go external. In the case of a hang or lock-out you could find out which accesses were going on prior to the lockup. For other things like data corruption you could get a trace of those instructions”.



An example of the ELA-500 being deployed in a system




If you have any questions for Mark on the subject of silicon debug then please leave them in the comments section below and we will do our best to answer them here or with a follow up video.


My colleague William Orme has also written a blog that goes into more detail on the ELA-500 and how it succeeds in Taking the fear out of silicon debug.


For more information on the CoreSight ELA-500

Chinese version 中文版:ARM简史:第一部分

One of the things I have noticed about ARM over the last year that I have been working here is that people have a great interest in ARM’s history. After a quick Google search and multiple open tabs I realized that there were many debates and comments on the actual history of ARM.

You can easily obtain the timeline of events of ARM as a company, but it doesn’t really tell the story of how ARM came into existence and how it rose to the top of its respective industry. It does however give you a full timeline of licensees of ARM and the key moments in the company's history. Please join in the debate in the comment section if you feel there is more to add or if there are topics you would like to see researched in further blog entries. Also don’t be afraid to comment with corrections or extra information from the era of 1980-1997 on which this Part 1 blog is based.

This blog will be posted over two entries – The History of ARM: Part 1 and The History of ARM: Part 2.

The Beginning: Acorn Computers Ltd

Any British person over the age of 30 will most likely remember Acorn Computers Ltd and the extremely popular BBC Micro (launched with a 6502 processor in 1981). The background of Acorn is a very interesting story in itself (and probably deserves its own blog), set in the booming computer industry of the 1980s. The founders of Acorn Computers Ltd were Christopher Curry and Hermann Hauser. Chris Curry was known for working very closely with Clive Sinclair of Sinclair Radionics Ltd for over 13 years. After some financial trouble Sinclair sought government help, but when he lost full control of Sinclair Radionics he started a new venture called Science of Cambridge Ltd, later known as Sinclair Research Ltd. Chris Curry was one of the main people in the new venture, but after a disagreement with Sinclair on the direction of the company, Curry decided to leave.

Curry soon partnered with Hermann Hauser, an Austrian with a PhD in physics who had studied English in Cambridge at the age of 15 and liked it so much that he returned for his PhD. Together they set up CPU Ltd, which stood for Cambridge Processor Unit, whose products included microprocessor controllers for fruit machines that could stop crafty hackers from getting big payouts from the machine. They launched Acorn Computers as the trading name of CPU to keep the two ventures separate. Apparently the reasoning behind the naming of Acorn was to be ahead of Apple computers in the telephone directory!

Fast forward a few years and they landed a fantastic opportunity to produce the BBC Micro, a government initiative to put a computer in every classroom in Britain. Sophie Wilson and Steve Furber were two talented computer scientists from the University of Cambridge who were given the wonderful task of coming up with the microprocessor design for Acorn’s own 32-bit processor, with little to no resources. Therefore the design had to be good, but simple: Sophie developed the instruction set for the ARM1 and Steve worked on the chip design. The first ever ARM design was created in 808 lines of BASIC and, to quote Sophie from a Telegraph interview, ‘We accomplished this by thinking about things very, very carefully beforehand’. Development on the Acorn RISC Machine didn't start until some time around late 1983 or early 1984. The first chip was delivered to Acorn (then in the building we now know as ARM2) on 26th April 1985. The 30th birthday of the architecture is this year! The Acorn Archimedes, which was released in 1987, was the first RISC-based home computer.

If there is enough interest I will do a full blog on the history of Acorn Computers Ltd, but for now you can find a great TV movie by the BBC called Micro Men. Watch out for the Sophie Wilson cameo appearance! (Credit to the BBC - Source here for British iPlayer users)

Micro Men - A BBC Movie


ARM is founded.

ARM back then stood for ‘Advanced RISC Machines’, but to answer the age-old question asked by many people these days, it no longer stands for anything. As the machines it was named after are long since outdated, ARM continued with its name, which, funnily enough, means nothing! It does have a cool logo though!


ARM Logo (2015)



The company was founded in November 1990 as Advanced RISC Machines Ltd and structured as a joint venture between Acorn Computers, Apple Computer (now Apple Inc.) and VLSI Technology. The reason was that Apple wanted to use ARM technology but didn’t want to base a product on IP from Acorn, which at the time was considered a competitor. Apple invested the cash, VLSI Technology provided the tools, and Acorn provided the 12 engineers. With that ARM was born, along with its luxury office in Cambridge: a barn!

ARM headquarters



In an earlier venture, Hermann Hauser had also created the Cambridge Processor Unit, or CPU. While at Motorola, Robin Saxby supplied chips to Hermann at CPU. Robin was interviewed and offered the job as CEO around 1991. In 1993 the Apple Newton was launched on the ARM architecture. Anyone who has ever used an Apple Newton will know it wasn't the best piece of technology; unfortunately Apple overreached for the technology that was available to them at the time, and the Newton had flaws which vastly lowered its usability. Due to these factors ARM realized it could not sustain success on single products, and Sir Robin introduced the IP business model, which wasn’t common at the time. The ARM processor was licensed to many semiconductor companies for an upfront license fee and then royalties on production silicon. This made ARM a partner to all these companies, effectively trying to speed up their time to market as it benefited both ARM and its partners. For me personally, this model was one that was never taught to us in school, and it doesn’t really show its head in the business world much, but it creates a fantastic model of using ARM architecture in a large ecosystem, which effectively helps everyone in the industry towards a common goal: creating and producing cutting-edge technology.

TI, ARM7, and Nokia

The crucial break for ARM came in 1993 with Texas Instruments (TI). This was the break that gave ARM credibility and proved the viability of the company’s novel licensing business model. The deal drove ARM to formalize its licensing business model and also drove it to make more cost-effective products. Similar deals with Samsung and Sharp proved that networking within the industry was crucial in building enthusiastic support for ARM’s products and in gaining new licensing deals. These licensing deals also led to new opportunities for the development of the RISC architecture. ARM’s relatively small size and dynamic culture gave it a response-time advantage in product development. ARM’s big break came in 1994, during the mobile revolution, when realistically small mobile devices became a reality. The stars aligned and ARM was in the right place at the right time. Nokia were advised to use an ARM-based system design from TI for their upcoming GSM mobile phone. Nokia were initially against using ARM because of memory concerns: the memory required would have pushed up the overall system cost. This led to ARM creating a custom 16-bit instruction set that lowered the memory demands, and this was the design that was licensed by TI and sold to Nokia. The first ARM-powered GSM phone was the Nokia 6110 and it was a massive success. The ARM7 became the flagship mobile design for ARM; it has since been used by over 165 licensees and has shipped in over 10 billion chips since 1994.
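
As a rough illustration of why a 16-bit instruction encoding lowers memory demands, the arithmetic can be sketched as below. The instruction counts and the 30% instruction overhead for the narrower encoding are assumptions chosen for the example; they are not figures from ARM, TI or Nokia.

```python
def code_size_bytes(instruction_count, bits_per_instruction):
    """Memory needed to store a program with fixed-width instructions."""
    return instruction_count * bits_per_instruction // 8


# Hypothetical program: 10,000 instructions when encoded in 32 bits,
# assumed to need 30% more instructions when encoded in 16 bits.
wide = code_size_bytes(10_000, 32)
narrow = code_size_bytes(13_000, 16)
saving = 1 - narrow / wide   # fraction of code memory saved
```

Under these assumptions the narrower encoding still takes noticeably less memory even though it needs more instructions, which is the kind of trade-off that matters when memory dominates system cost.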


Nokia 6110 - the first ARM-powered GSM phone (you may remember playing hours of the game Snake!)

Going Public

By the end of 1997, ARM had grown to become a £26.6m private business with £2.9m net income, and the time had come to float the company. Although the company had been preparing to float for three years, the tech sector was in a bubble at the time and everyone involved was very apprehensive, but felt it was the right move for the company to capitalize on the massive investment in the tech sector of the time.

On April 17th, 1998, ARM Holdings PLC completed a joint listing on the London Stock Exchange and NASDAQ with an IPO at £5.75. The reason for the joint listing was twofold. First, NASDAQ was the market through which ARM believed it would gain the sort of valuation it deserved in the tech bubble of the time, which was mainly based in the States. Second, the two major shareholders of ARM were American and English, and ARM wished to allow existing Acorn shareholders in the UK to have continued involvement. ARM going public caused the stock to soar and turned the small British semiconductor design company into a billion-dollar company in a matter of months!


ARM Holdings was publicly listed in early 1998



Please continue the ARM journey with A Brief History of ARM: Part 2 (1997-2015). Please leave comments and feedback for items you would like to see discussed!

Credit to Markus Levy, Convergence Promotions. See here for more detailed information on the technologies used during those early years.

Credit also to the internal help received by many during the writing of this blog.

Last week, several of our partners unveiled new Chrome OS devices powered by Cortex-A17 based processors. These new products include two Chromebooks from Haier and HiSense at very competitive low prices, a convertible laptop-tablet called the Chromebook Flip and a brand new kind of HDMI dongle called Chromebit, both from Asus.




Following Cortex-A17’s top score in Antutu’s “Best Performance Android Smartphones 2014”, these new devices re-affirm the capabilities of Cortex-A17 CPU in combination with ARM Mali-T760 GPU, to provide a high-performance computing experience in devices such as tablets in highly cost-effective implementations.

The announcement of these new devices is a very good opportunity to review the characteristics of the Cortex-A17 that make it a success in many popular consumer products like smartphones, tablets and OTT devices that require the highest performance in thermally constrained form factors.

Cortex-A17 - A Balanced design for premium performance and cost efficiency


Cortex-A17 is the third generation of ARMv7-A out-of-order processors, following successful products such as Cortex-A9 and Cortex-A15. The Cortex-A17 processor was designed to some very aggressive PPA goals, including:

  1. provide a significant boost in performance over current-generation CPUs with improved branch prediction and out-of-order issue capabilities
  2. maintain an optimal power and area profile that fits thermally constrained form factors, especially by keeping a 2-way superscalar architecture
  3. build a micro-architecture that is tuned and optimized towards mobile workloads through, for instance, better use of the memory system

  This enables Cortex-A17 to provide the best single-thread performance for 32-bit applications of any ARMv7-A core.




Single-thread performance is critical for the user experience as it is at the heart of key applications like the user interface and, above all, web browsing. While Cortex-A17 and Cortex-A15 have similar SpecInt2k results, Cortex-A17 exceeds Cortex-A15 performance for web browsing, enabling the new Chrome OS devices to score better than successful devices from 2014.



Source arstechnica.com


Cortex-A17 achieves higher performance on benchmarks representative of today's complex and demanding real-world web applications running on mobile and desktop browsers, such as Kraken, Octane and SunSpider. This is achieved through a combination of design optimizations, especially around the memory system and streaming performance. These optimizations are delivered within an optimal power and area profile, resulting in better power efficiency. Better power efficiency allows the maximum frequency to be sustained before the SoC hits its thermal limits, and so directly translates into a performance uplift. Area is also significant as it contributes to silicon cost as well as leakage power. The Cortex-A17 has been extensively tuned, and is considerably more area and power efficient than Cortex-A15 and similar to Cortex-A9.
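
The link between power efficiency and sustained frequency can be sketched with a deliberately simplified model. It assumes power scales linearly with frequency at a fixed voltage, and every number and name in it (`sustained_frequency_mhz`, the budget, the milliwatts per MHz) is hypothetical rather than measured Cortex-A17 data.

```python
def sustained_frequency_mhz(max_freq_mhz, thermal_budget_mw, mw_per_mhz):
    """Highest frequency that stays within the thermal power budget."""
    thermal_limit_mhz = thermal_budget_mw / mw_per_mhz
    return min(max_freq_mhz, thermal_limit_mhz)


# Same 1800 MHz design and 1500 mW budget, two hypothetical efficiencies.
less_efficient = sustained_frequency_mhz(1800, 1500, 1.0)
more_efficient = sustained_frequency_mhz(1800, 1500, 0.75)
```

In this model the less efficient core is throttled by the thermal budget, while the more efficient one can hold its maximum frequency, which is how an efficiency gain shows up as a performance gain.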


This power efficiency enables our partners to optimize Cortex-A17, especially in a mature and cost-efficient node like 28nm. The Cortex-A17 has broad support from ARM Physical IP in 28nm, such as ARM Artisan POP IP, which allows system design with the lowest risk and a fast time-to-market.


An optimized software ecosystem is fundamental for a great user experience. Today’s mobile world is based around the ARMv7-A architecture which supports over one million applications across many device categories. The Cortex-A17 processor leverages the popular applications and libraries that are specifically optimized for performance and efficiency on this architecture. New ecosystems around ARMv8-A are being built, and these complement the ARMv7-A ecosystem, particularly where a 64-bit instruction set is a necessity such as in server and enterprise applications.


What’s coming next for Cortex-A17?


We are very happy to see our partners introducing innovative new devices and enabling access to premium performance at a very attractive price. In the coming months, Cortex-A17 will continue to be at the heart of a great number of new mid-range devices while Cortex-A57 will power high-end products. Cortex-A17 is today's choice for 32-bit devices that require the highest performance in thermally constrained form factors. So we are expecting to see more and more Cortex-A17 devices, from smartphones to smart TVs and set-top boxes, but also in key markets with similar technology constraints like home networking, industrial applications and high-end wearables.

Which new Cortex-A17 devices will your imagination build?

While TI has slowly shrunk the Tiva (formerly Luminary Micro) families, a few weeks ago it sampled a Cortex-M4F part. Guess in which top-level family the Cortex-M4F has landed? Find out in our latest post...
