
ARM Processors


Yesterday we released version 3.10.0 of Valgrind, a GPL'd framework for building simulation-based debugging and profiling tools.  3.10.0 is the first official release to support 64-bit ARMv8.  The port is available from http://www.valgrind.org, and the release notes are available at http://www.valgrind.org/docs/manual/dist.news.html.


Porting the framework to the 64-bit ARM instruction set has been relatively straightforward.  The main challenge has been the large number of SIMD instructions, with some instructions involving significant arithmetical complexity: saturation, rounding, doubling and lane-width changes.  On the whole, the 64-bit instruction set is easier to simulate efficiently than the 32-bit ARMv7 instruction set, as it lacks dynamically conditionalised instructions (a la Thumb) and partial condition code updates, both of which hinder fast simulation.  As the port matures I expect it to attain performance comparable with other Valgrind-supported architectures.


Porting the tools based on the framework was almost no effort, because the framework is specifically designed to insulate tools from the details of underlying instruction sets.  Currently the following tools work well enough for serious use: Memcheck (memory checking), Helgrind, DRD (thread checking), Cachegrind and Massif (time and space profiling).


Initial development was done using cross-compilation and running on the ARM Foundation model, which proved to be a reliable starting point.  Further development was done on an ARM Juno board running a Fedora snapshot.  The Juno board made a big difference, as it facilitated building Valgrind "natively" and can build and run regression tests in a reasonable time frame.


We look forward to feedback from developers using the port to debug/profile serious workloads, on the order of millions to tens of millions of lines of C++.

Embedded processors are frequently compared through the results of Power, Performance and Area (PPA) implementation analysis. Jatin Mistry and I have created a whitepaper that describes the specific details of the PPA analyses performed on the Cortex-R Series processors.


High-level figures are often quoted for processors. For example, the "Performance" tab at http://www.arm.com/products/processors/cortex-r/cortex-r5.php shows top-level details of the Cortex-R5 in a mainstream low-power process technology (40nm LP) with high-density, standard-performance cell libraries, a 32KB instruction cache and a 32KB data cache, giving a total area of 0.45mm².

However, behind the top-level power, performance and area results there are many variables and details that can dramatically alter these figures. Different implementations target different configurations, for example the cache sizes or inclusion of the Floating Point Unit (FPU), and target different goals, for example aiming to achieve the highest possible frequency or the lowest possible area. The process and libraries used have a dramatic effect. The attached whitepaper describes the process we use to perform a PPA analysis for the Cortex-R Series processors.


The goal of the whitepaper is to describe, for those without really deep processor implementation knowledge, the many variables that should be understood to get real value from any PPA data presented. This understanding lets you estimate the real PPA of your own proposed processor implementation and make fair comparisons between processors, whether from a single IP partner or from different processor IP vendors.


Any PPA data without understanding the details behind it is of very little value. We hope that you find it informative.

What is the connection between rugby football, interconnect and performance analysis kits?


There is a seemingly never-ending march towards smaller, cheaper and more efficient complex chip designs, and every component of the modern SoC is being squeezed for more with each new design. There are diminishing returns when seeking improvements, and designers need to be creative in order to find new ways to eke out those extra bits of performance that ultimately make the difference across the entire chip. The World Cup-winning rugby coach Sir Clive Woodward famously stated that excellence was best achieved by improving 100 factors by 1%, and this theory certainly holds true for a lot of the SoCs that are being designed these days. Staying on the theme of rugby for the moment, the interconnect is like a scrum half (or a quarterback for those of you who live west of the Atlantic!) as it acts as the go-between for each component and marshals them effectively to make the chip greater than the sum of its parts. A scrum half's performance is measured by the speed and efficiency with which he passes the ball to his teammates, thus enabling them to do their job more effectively, which is how you would want your system interconnect to function.


This role increases in importance as massive growth in system integration places on-chip communication at the centre of system performance. The ARM CoreLink NIC-400 is a very powerful and highly configurable interconnect with many high-end features to manage traffic passing through it. It is in fact so configurable that it is regularly one of the most popular IP models created and downloaded on Carbon Design Systems' IP exchange portal for virtual prototyping (found here). This configurability allows a single user to create dozens of models for the system interconnect, and reflects the importance that users place on having accurate models for the components in their system that have a great influence on overall performance. With so many parameters in play, the ability to test the interconnect within the system prior to tapeout is clearly of great value. Just setting all parameters to maximum performance is rarely a sensible option, as power and cost budgets demand that less silicon is used to achieve the same levels of performance. Full system modelling allows refinement to save silicon area and reduce the number of wires without compromising performance goals.


While the configurability of the interconnect is an inherent and indeed crucial part of its effectiveness, the vast amount of choices available also means that users often do not fully optimise the interconnect to their individual system. This is where virtual prototyping tools come into the equation, and help designers to avoid arbitration problems, detect system bottlenecks and give a better picture of how to manage PPA requirements. This ability to foresee and avoid potential issues before they become a problem is invaluable in an age where the pressure to get designs right first-time and on time is a concern of every system architect. Additionally, the depth of analysis that the Carbon tool can undertake provides fast and meaningful feedback that can help you measurably improve your design. Last year I co-wrote a white paper on this subject with Bill Neifert, titled “Getting the most out of the ARM CoreLink NIC-400”, which is available to download.

In the example shown here, a simple NIC-400 is configured with two masters and two slaves. The masters are set up to mimic the data loads from a CPU and DMA controller and the dummy targets are an Ethernet MAC and a DDR3 memory controller. Of course, since the traffic generators are quite configurable, it’s possible to model any number of different sources or targets and we’ll get more into that in a bit. Note though that we’re analysing traffic on any of the connections. The graphs shown here track the latency on the CPU interface and the queues in the DDR controller. The exact metrics for the system in question will of course vary based upon design targets however. It’s also beneficial to correlate data across multiple analysis windows and indeed even across multiple runs.


The important thing we've done here is establish a framework to begin gathering quantitative data on the performance of the NIC-400 so we can track how well it meets the requirements. Analysing the results will likely lead to reconfiguration, recompilation and re-simulation. It's not unheard of to iterate through hundreds of design possibilities with only slight changes in parameters. It's important to vary the traffic parameters as well as the NIC parameters, however, since the true performance metric of the NIC-400, and really of all interconnect IP, is how it affects the behaviour of the entire system.


I will be going into more detail on all of this on Thursday at 18:00 BST (1:00 pm EDT, 10:00 am PDT) in a webinar titled “Pre-silicon optimisation of system designs using the ARM CoreLink NIC-400 Interconnect” with Eric Sondhi, a corporate applications engineer at Carbon Design Systems. You can register for the webinar here, and make sure to attend live to ensure that your questions are answered immediately.

The ARM® Cortex®-R family is perhaps the unsung hero of the ARM powered world, quietly running infrastructure from Hard Disk Drive and Solid State Drive controllers, through to mobile phone baseband processing and even automotive ABS controllers. While not having the all-out performance of the Cortex-A series application processors, the Cortex-R family of processors provide several key benefits for systems requiring hard, real-time performance.


The main differences between application processors and real-time processors are:

  • Deterministic timing - A system is said to be real-time if the total correctness of an operation depends not only on its logical correctness, but also on the time in which it is performed.
  • Latency - There are time constraints to respond to external events. A car braking system must consistently respond within a certain time. The ARM Real-time (R) profile defines an architecture aimed at systems that require deterministic timing and low interrupt latency.
  • Safety and reliability - For embedded applications requiring high performance combined with high reliability, Cortex-R series processors provide features such as soft and hard error management, redundant dual-core systems using two cores in lock-step, and Error Correcting Codes (ECC) on all external buses.


The new ARM Cortex-R Series Programmer’s Guide extends the software development series of programming guides available from ARM by covering Cortex-R series processors conforming to the ARMv7-R architecture.


The Cortex-R Series Programmer’s Guide describes the following areas which differ between the Cortex-R series and the Cortex-A and Cortex-M series:

  • Floating-point support is available as an option on most Cortex-R series processors to provide computation functionality compliant with the IEEE 754 standard.
  • Unlike most other ARM processors, Cortex-R processors typically have some memory that is tightly coupled to the processor core to minimize access time and guarantee latency for critical routines.
  • The Cortex-R processors use an MPU instead of an MMU. The MPU enables you to partition memory into regions and set individual protection attributes for each region.
  • Fast and consistent interrupt response is a key feature of the Cortex-R processors.
  • Fault detection and control can be provided by lock-step processors, ECC on buses and memory, and watchdog timers.


This guide is aimed at anyone writing software for the Cortex-R family of processors, and complements, rather than replaces the existing documentation for the Cortex-R family.

If you’re new to using Cortex-R processors and looking to understand where to begin writing bare-metal programs, or you’re an experienced applications designer wanting to understand how to make the most of the underlying processor, then this guide is a good introduction to the Cortex-R family.


The document is only available to registered ARM customers. See, Cortex-R Series Programmer's Guide.

Current specifications for Rayeager PX2 enhanced board:

SoC – Rockchip PX2: dual-core ARM Cortex-A9, up to 1.4GHz; Mali-400MP4 quad-core GPU, up to 400MHz

System Memory – 2GB/1GB DDR3

Storage – 8GB eMMC flash + micro SD slot

Video I/O

HDMI 1080P

VGA 1080P

LCD (selectable)

Audio Output / Input – HDMI, optical S/PDIF, headphone, and built-in MIC

Connectivity – Gigabit Ethernet, dual band 802.11 b/g/n Wi-Fi with external antenna, and Bluetooth

USB – 3x USB 2.0 host ports, 1x micro USB OTG

Expansion Headers – YCBCR_IN x1, CVBS_IN x1, Keys x5, Gsensor x1, Compass x1, RTC x1, UART to USB debug port x1

Power Supply – DC 5V @ 2.0A with HDD support Li-battery / PMIC TPS659102

Dimensions – 150 x 97 mm


The Rayeager PX2 enhanced development board is 100% open source hardware, including the hardware schematics, component placement and component datasheets.

The board supports Android 4.4.2 and Ubuntu, and the SDKs, tutorials and hardware files will all be available from ChipSpark.com.

Processors and microcontrollers built on the ARM architecture are identified by the architecture version they adopt, the profile and its variants.

So far 7 versions of the ARM architecture have been defined, of which only 4 are currently in use. They are identified by the prefix ARMv: ARMv4, ARMv5, ARMv6 and ARMv7.

Taking the most recent, ARMv7, three usage profiles are defined: ARMv7-A, ARMv7-R and ARMv7-M. They are used respectively for general-purpose application processors, for processors and microcontrollers in critical, real-time applications, and for general-purpose microcontrollers.


Variants are identified by letters appended to the version. At present the following exist:

  • ARMv4,
    a variant that includes only the standard ARM instruction set.
  • ARMv4T,
    this variant adds the Thumb instruction set.
  • ARMv5T,
    improvements to interworking and to the ARM instructions; adds "Count Leading Zeros" (CLZ) and the "Software Breakpoint" (BKPT) instruction.
  • ARMv5TE,
    improved arithmetic support for digital signal processing (DSP) algorithms; adds "Preload Data" (PLD), "Load Register Dual" (LDRD), "Store Register Dual" (STRD) and 64-bit transfers to and from coprocessor registers (MCRR, MRRC).
  • ARMv5TEJ,
    adds the BXJ instruction and other support for the Jazelle® architecture extension.
  • ARMv6,
    adds new instructions to the standard ARM instruction set, and formalizes and revises the memory model and the Debug architecture.
  • ARMv6K,
    adds multiprocessing support instructions to the standard instruction set and some extra memory model features.
  • ARMv6T2,
    introduces Thumb-2 technology, a major development of the Thumb instruction set that provides a level of functionality similar to the standard ARM instruction set.

There are also optional extensions, which can be added at the manufacturer's discretion. The extensions are divided into groups; some of them are listed below:

  • Instruction set extensions
    • Jazelle is the extension that turns the ARMv5TE architecture variant into ARMv5TEJ.
    • Virtualization Extension.
    • ThumbEE is an extension that provides an extended version of the standard Thumb instruction set and supports dynamically generated code. In ARMv7 it is mandatory in the ARMv7-A profile and optional in the ARMv7-R profile.
    • The Floating-point Extension is a floating-point coprocessor extension. It has historically been called the VFP Extension.
    • Advanced SIMD is an instruction set extension that adds "Single Instruction Multiple Data" (SIMD) instructions for vector operations on integer and single-precision floating-point data types, held in doubleword and quadword registers.
  • Architecture extensions
    • Security Extensions.
    • Multiprocessing Extensions.
    • Large Physical Address Extension.
    • Virtualization Extensions.

I proposed this summary for Wikipedia at the link: Arquitetura ARM – Wikipédia, a enciclopédia livre

To enable or disable an interrupt on a Cortex-M0 there are two registers. This is the best way to avoid race conditions, whether or not the environment is multitasking, and it also reduces the number of assembly instructions. The same procedure is used to generate an interrupt from software.


When you use multitasking on a microcontroller, which is not very common on 8-bit microcontrollers, you need certain procedures to avoid problems.


In a multitasking environment, two or more tasks (or even a single interrupt handler alongside the main process) may act on the same register, changing its bits to enable or disable an interrupt, or to simulate an external interrupt from code. To avoid race conditions, that is, contention for the register, in as few steps as possible, Cortex-M microcontrollers use two registers for each function: one pair to enable and disable an interrupt, and one pair to put an interrupt into the pending state or remove it from that state.


Putting an interrupt into the pending state is like triggering it, simulating the external event at its source. You can also cancel it by clearing the pending state before the interrupt is processed.


There are two registers for enabling and disabling an interrupt, called SETENA and CLRENA ("Set Enable Interrupt" and "Clear Enable Interrupt" respectively). They belong to the set of registers in the NVIC (Nested Vectored Interrupt Controller), a block outside the processor core that manages interrupts and exceptions. The figure below, taken from Joseph Yiu's book [1], shows the memory map through which the NVIC registers are accessed and configured. These registers lie in the region from 0xE0000000 to 0xFFFFFFFF, more precisely in the System Control Space (SCS), which spans 0xE000E000 to 0xE000EFFF and in turn sits within the Private Peripheral Bus (PPB).

[Figure: memory map showing where the NVIC registers sit in the System Control Space, from [1]]

The CMSIS package offers broad support, through functions and macros, for managing these registers, but we will focus on coding in C and assembly to understand the architectural benefits that ARM gives us.


The SETENA register mentioned above is accessed at address 0xE000E100 and can be read and written. After a reset its value is 0x00000000. Each bit represents the state of one interrupt: bit 0 is interrupt #0, that is exception #16; bit 2 is interrupt #2, that is exception #18; and so on.
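
The numbering described above can be written down as two small helpers. This is only a sketch: the function names `irq_to_exception` and `irq_to_mask` are made up for illustration and are not part of CMSIS, and the limit of 32 reflects the single 32-bit SETENA register of the Cortex-M0.

```c
#include <assert.h>
#include <stdint.h>

/* Exceptions 0 to 15 are system exceptions, so external interrupt #0
 * is exception #16. */
#define FIRST_EXTERNAL_EXCEPTION 16u

/* Hypothetical helper: exception number of external interrupt `irq`. */
uint32_t irq_to_exception(uint32_t irq)
{
    assert(irq < 32u); /* one 32-bit register: at most 32 interrupts */
    return irq + FIRST_EXTERNAL_EXCEPTION;
}

/* Hypothetical helper: bit mask that selects `irq` in SETENA, CLRENA,
 * SETPEND or CLRPEND (bit n corresponds to interrupt #n). */
uint32_t irq_to_mask(uint32_t irq)
{
    assert(irq < 32u);
    return UINT32_C(1) << irq;
}
```

For interrupt #2 the helpers give exception #18 and the mask 0x4, matching the example values used in this post.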


The register paired with it, CLRENA, clears the state set through SETENA and is accessed at address 0xE000E180.


These two registers are therefore used to enable and disable interrupts. As mentioned, there are further registers that represent the external occurrence of an interrupt and can be used to simulate that occurrence in software: SETPEND, accessed at address 0xE000E200, which marks an interrupt as pending, and CLRPEND, accessed at address 0xE000E280. We will look at these in more detail below.


As already mentioned, the SETENA register enables interrupts: simply set to 1 the bit corresponding to the interrupt you want to enable. Nothing happens when you write 0 to a bit, that is, when you clear it, because this register serves only to enable an interrupt and to find out whether it is enabled.


To disable an interrupt you need the register paired with SETENA, which is called CLRENA. An interesting fact is that this register is not the inverse of SETENA; it only has the opposite function. To find out whether an interrupt is disabled, check whether its bit in SETENA is 0. To disable an interrupt, write 1 to the corresponding bit of CLRENA; writing 0 to this register has no effect.

The other two registers, SETPEND and CLRPEND, identify pending interrupts. They work like SETENA and CLRENA, but their purpose is to report interrupts waiting to be serviced: reading SETPEND tells you which interrupts are pending according to the bits that are set, with the same bit order as SETENA. As already noted, you can also simulate the occurrence of an interrupt by writing 1 to its bit; as soon as you do, the interrupt is raised and can be serviced by its handler/vector. Now suppose you are inside another interrupt handler and, while you are manipulating a peripheral, that peripheral accidentally raises an interrupt whose pending state you want to remove. Simply write 1 to the corresponding bit of CLRPEND and the interrupt is no longer pending. As with the previous pair, writing 0 to either register has no effect.
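
The set/clear behaviour described above can be tried out on a PC with a small software model. This is a sketch only: the `nvic_pair_t` type and the `nvic_*` functions are hypothetical stand-ins for one register pair (SETENA/CLRENA or SETPEND/CLRPEND), not real hardware accesses and not the CMSIS API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of one NVIC register pair. On real
 * hardware these are memory-mapped registers; here the shared state
 * is just a field in a struct. */
typedef struct {
    uint32_t state; /* one bit per interrupt: 1 = enabled (or pending) */
} nvic_pair_t;

/* Write to the "set" register: 1 bits set state, 0 bits are ignored. */
void nvic_write_set(nvic_pair_t *pair, uint32_t value)
{
    assert(pair != 0);
    pair->state |= value;
}

/* Write to the "clear" register: 1 bits clear state, 0 bits are ignored. */
void nvic_write_clear(nvic_pair_t *pair, uint32_t value)
{
    assert(pair != 0);
    pair->state &= ~value;
}

/* Read the "set" register: returns the current state of every interrupt. */
uint32_t nvic_read_set(const nvic_pair_t *pair)
{
    assert(pair != 0);
    return pair->state;
}
```

Starting from a zeroed model, writing 0x4 and then 0x1 to the set register leaves a state of 0x5, writing 0 to either register changes nothing, and writing 0x4 to the clear register leaves 0x1. No write needs to know the previous contents.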


Contention for the registers (Race Condition)


Register contention is a very common problem in multitasking systems. The ARM architecture uses two registers with opposite functions precisely to avoid having to read the register before changing its state, so there is no contention and no lost state.


Besides the contention problem, where two processes can modify the same register and one loses the change made by the other, there is also the number of steps needed to make the change. With this approach we do not need to read the current state of the register in order to rewrite it. We just change the desired bit and no other state is lost, because writing 0 is ignored; in other words, you cannot switch a bit to the opposite state through the same register.


Consider the C code below, which writes the value 0x4 (binary 00000100) to the SETENA register.

*((volatile unsigned long *) (0xE000E100)) = 0x4; // Enable interrupt #2


This write affects only the bits set to 1; bits with value 0 are ignored. This strategy avoids having to read the register, merge the bits and write the result back. The same code in assembly is shown below.


LDR    R0, =0xE000E100    ; load the address of the SETENA register into R0
MOVS   R1, #0x04          ; move the value 0x4 into R1, binary 00000100,
                          ; bit 2, which is interrupt #2 (exception #18)
STR    R1, [R0]           ; write the contents of R1 to the address held in R0

Note that only three assembly instructions are needed to enable a given interrupt without affecting the state of the others.

We used the instructions LDR, MOVS and STR,
and the registers R0 and R1.

As you can see, there is no need to read the register before changing it, because it only takes writes of the value 1 into account. A written 0 is ignored, so you cannot accidentally wipe out changes made by other processes.


For teaching purposes, let us look at the conventional approach, that is, a single register to enable and disable an interrupt. We use the same address here, but this does not reflect the real hardware.


*((volatile unsigned long *) (0xE000E100)) = *((volatile unsigned long *) (0xE000E100)) | 0x4; // Enable interrupt #2 (read-modify-write)


As you can see, in the traditional procedure you first read the register, modify the value obtained and write it back to the same register. But what would happen if a second process took over execution at that moment and also changed the register? You would be holding a stale value, and would then overwrite the change made by the second process. The same thing is shown in assembly below; note that it takes more instructions and so widens the window for contention.


MOVS     R2, #0x04        ; bit mask, only bit 2 is set
LDR      R0, =0xE000E100  ; load the address of SETENA into R0
LDR      R1, [R0]         ; read the current state of the register
ORRS     R1, R1, R2       ; merge the new value of bit 2 into the value read
STR      R1, [R0]         ; write the value back to the register

As can be seen, two more instructions are needed to enable an interrupt. In addition, between the execution of instruction 3 (LDR) and instruction 5 (STR) the value of the register can be changed, which leaves the value held in R1 stale.

In this example we used the instructions MOVS, LDR, ORRS and STR, and the registers R0, R1 and R2.


The same considerations apply to the SETPEND and CLRPEND register pair: with a single read-modify-write register, lost updates could cause unpredictable situations and unwanted behaviour, such as loss of synchronisation between sequences of interrupts.
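
The lost update discussed in this section can be reproduced deterministically in plain C by writing out the interleaving by hand. This is a sketch under stated assumptions: both functions are hypothetical models that run on a PC, the "handler" is simulated by an ordinary statement placed between the task's read and write, and the task and handler are assumed to touch different interrupts.

```c
#include <assert.h>
#include <stdint.h>

/* Conventional single register: the task enables `task_mask` with a
 * read-modify-write while a handler disables `isr_mask` in between.
 * Returns the final register value. */
uint32_t enable_rmw_interrupted(uint32_t reg, uint32_t task_mask,
                                uint32_t isr_mask)
{
    assert((task_mask & isr_mask) == 0u); /* different interrupts */
    uint32_t copy = reg;  /* LDR:  task reads the register            */
    reg &= ~isr_mask;     /* handler preempts and disables isr_mask   */
    copy |= task_mask;    /* ORRS: task updates its now stale copy    */
    reg = copy;           /* STR:  stale copy overwrites the handler  */
    return reg;
}

/* Set/clear pair: the same interleaving, but each side performs a
 * single write and no read. Returns the final enable state. */
uint32_t enable_set_clear_interrupted(uint32_t state, uint32_t task_mask,
                                      uint32_t isr_mask)
{
    assert((task_mask & isr_mask) == 0u);
    state &= ~isr_mask;   /* handler writes isr_mask to the clear register */
    state |= task_mask;   /* task writes task_mask to the set register     */
    return state;
}
```

With interrupt #0 initially enabled (0x1), a task enabling interrupt #2 (0x4) and a handler disabling interrupt #0, the read-modify-write version ends at 0x5, so the handler's change is lost, while the set/clear version ends at 0x4 as intended.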

This post consists of notes from my study of the Cortex-M architecture, in particular the Cortex-M0, and may be changed and improved as my studies progress.



[1] - The Definitive Guide to the ARM Cortex-M0, Joseph Yiu


Chinese Information  中文信息:参与ARM技术培训的新途径

Just a short update to highlight an exciting new development. In response to demand, ARM has launched a limited program of public open-enrollment training courses. We are hosting these at our major regional support centres in San Jose, Cambridge and Shanghai. The program, as I say, is limited at present but touches several of our most popular courses, including Cortex-M System Design, TrustZone and ARMv8 Software Development.


You can check out the full schedule here: ARM Training Courses - ARM


If you have any questions, please don't hesitate to contact the ARM training team: Contact Support - ARM



AnandTech has published a good article about Cortex-M. You can read it at the link: AnandTech | ARM's Cortex M: Even Smaller and Lower Power CPU Cores

A study recently carried out by Cambridge University found that the global cost of software debugging has risen to the princely sum of $312 billion every year, and that developers spend an average of 50% of their programming time finding and fixing bugs (read the full story here). Divide that massive sum by 7.1 billion people on the planet and it works out at $44 per person. Put another way, it’s enough to buy everyone in the world a Raspberry Pi!

Furthermore, the trend for increasing complexity in SoC design (see graph below) means that this problem will only take up more resources in terms of time and money going forward. It is an issue that has given SoC architects and system developers headaches for years.

ITRS 2007 SoC Consumer Portable Design Complexity Trends

With that said, a well-thought-out debug and trace solution for your SoC can help manage the increased complexity by providing the right hardware visibility and hooks. Software developers can make use of this key functionality to develop optimized software in a timely manner with reduced risk of bugs. Each of the following 4 key use-cases (see picture below) can be addressed for your SoC design with a customized debug and trace solution that allows for:

  • Faster SoC bring-up
  • Easy and quick software debug
  • In-field analysis and postmortem debug
  • System optimization via profiling, event synchronization



The ARM CoreSight SoC product is designed to offer a comprehensive solution that can be tailored to meet specific requirements. The CoreSight SoC-400 allows you to:

  • Design for large systems with multiple cores through use of configurable components
  • Maximize debug visibility using a combination of debug components
  • Use IP-XACT descriptors for all components to automate stitching and for testbench generation
  • Support different trace bandwidth requirements for complex SoCs
  • Accelerate design and verification through example subsystems, testbenches, test cases and necessary verification IP components
  • Support multiple hardware-debug models for multiple use cases


When all of this is put together in a wider context, ARM CoreSight IP gives design teams a real advantage through its innovative debug logic that reduces design development and software debug cycles significantly. Furthermore, if we think of debug as solving a murder through the use of backward reasoning, then trace is the video surveillance that pinpoints the culprit. Trace is invaluable as it provides real-time visibility into errors, dramatically cutting down design cycles and iterations.

I recently conducted a webinar on how to build an effective and customized debug and trace solution for a multi-core SoC. Register here for free to access the webinar recording.

There is a corresponding White Paper that goes into a lot more detail on the ARM Debug and Trace IP page.

The White Paper provides the following:

  • High-level steps on building a debug and trace solution
  • Recommended design and verification flow
  • Advantages of using SoC-400 at each stage of your development process
  • Pointers to further information and useful references

Dwight Eisenhower may not have lived until the age of semiconductors, but his quote of “No battle was ever won according to plan, but no battle was ever won without one” rings true in the context of debug subsystem design. Understanding debug and trace hardware features and capabilities is key to building a solution to meet YOUR specific requirements. The paper discusses some of the key design decisions faced by architects.

Stay tuned for more upcoming exciting news about ARM CoreSight IP or sign up for ARM TechCon 2014 to see it for yourself! TechCon will be the first time that members of the public will be able to demo the new design environment for building debug and trace subsystems. This makes it even easier to configure and integrate ARM CoreSight IP within a large system, and will help users cut down on that $312 billion global debug bill. If you have any questions or comments about ARM CoreSight IP or this blog, please write them below and I will get back to you as soon as possible.

I have followed some tutorials on the internet and found one in particular quite interesting and didactic for those just starting to program ARM bare metal. The blog is Freedom Embedded | Balau's technical blog on open hardware, free software and security.


Below is a summary of the commands needed to build a Hello World program. I am noting them here, and in the near future I may supplement this information and turn it into a more detailed tutorial.


Based on the link: Hello world for bare metal ARM using QEMU | Freedom Embedded


Compile the code with the following commands:

$ arm-none-eabi-as -mcpu=arm926ej-s -g startup.s -o startup.o
$ arm-none-eabi-gcc -c -mcpu=arm926ej-s -g test.c -o test.o
$ arm-none-eabi-ld -T test.ld test.o startup.o -o test.elf
$ arm-none-eabi-objcopy -O binary test.elf test.bin


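And run it with the following command. The -s option opens a GDB server on port 1234 and -S keeps the CPU halted until a debugger tells it to continue:

$ qemu-system-arm -M versatilepb -m 128M -s -nographic -S -kernel test.bin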


Debug with GDB. Start an ARM-capable GDB (for example arm-none-eabi-gdb, if your toolchain provides it) and, when you get the GDB prompt, type:

target remote localhost:1234
file test.elf


When you have finished working with QEMU, if you have problems with the terminal, use the command:


stty sane


to fix it.

It has been a full seven months since AMD released detailed information about its Opteron A1100 server CPU, and twenty-two months since its announcement. Today, at the Hot Chips conference in Cupertino, CA, AMD revealed the final pieces of its ARM powered server strategy, headlined by the A1100.

You can find more information from AnandTech Portal | AMD’s Big Bet on ARM Powered Servers: Opteron A1100 Revealed


I am interested in the following questions:

1. The Opteron A1100 uses eight Cortex-A57 cores organized as four A57 clusters. Since one A57 cluster can contain up to four A57 cores, why doesn't the A1100 use two A57 clusters with four cores each?

2. Cortex-A57 supports TrustZone, but the A1100 still uses a Cortex-A5 to implement TrustZone. I guess the purpose of this design is to stay compatible with other AMD SoCs, which use x86 cores as the main CPU and a Cortex-A5 as the TrustZone CPU.


What do you think about Opteron A1100?

Hello, and welcome to my ARM programming tutorial series. I would like to give a big thank-you to Abhishek Agrawal, a final-year undergraduate student at IIT Kharagpur, for his help in completing this blog.

Many students wonder where to start reading about ARM microcontrollers. Although there are a lot of tutorials and books available on the internet, many of them are not aimed at beginners in ARM assembly programming. Here we have started a blog and YouTube video tutorial series for those beginners.

Let’s start with the basics. ARM stands for Advanced RISC Machine, and RISC machines have become very powerful these days. ARM processors are based entirely on the RISC architecture. This approach reduces hardware cost and produces less heat than traditional x86 designs, which makes it power efficient. It also has a highly optimized instruction set.

A RISC architecture is also known as a load-store architecture: data-processing instructions operate only on registers and cannot work on memory directly. To operate on memory, the processor must first load the contents of the desired memory location into a register, perform the operation, and then store the result back to memory from a general-purpose register.


ARM microcontrollers are the most widely used microcontrollers in the world. In 2005, about 98% of all mobile phones sold used at least one ARM processor.

ARM cores originally used fixed-length, 32-bit wide instructions, but later versions of the architecture also support a variable-length instruction set that provides both 32-bit and 16-bit wide instructions for improved code density. Currently, the ARM processors in most mobile phones and embedded hardware have 32-bit architectures.

More recently, the ARMv8-A architecture, announced in October 2011, added support for a 64-bit address space and 64-bit arithmetic. It is more power efficient and covers a wider performance range.


ARM has three series of Cortex processors: ARM Cortex-A, ARM Cortex-R and ARM Cortex-M. Cortex-A processors are intended for application systems such as smartphones. Cortex-R processors target real-time systems, such as space and missile applications. The Cortex-M series is the one mostly used in general-purpose microcontroller applications such as motor control and LED or LCD interfaces.

                       ARM Architecture.png

The ARM Cortex-M series has five different sub-series of microcontrollers:

  1. Cortex-M0
  2. Cortex-M0+
  3. Cortex-M1
  4. Cortex-M3
  5. Cortex-M4

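The interesting thing is that all of these microcontrollers are based on a 32-bit processor architecture, yet none of them executes the classic 32-bit ARM instruction set. The Cortex-M0, M0+ and M1 use the mostly 16-bit Thumb instruction set plus a handful of 32-bit Thumb-2 instructions, while the Cortex-M3 and M4 use the full Thumb-2 instruction set, which mixes 16-bit and 32-bit instructions.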

The ARM Cortex-M0 is mostly preferred where the requirements are low power and lowest cost. It has almost all the general features of a microcontroller. It has a Nested Vectored Interrupt Controller (NVIC), which is tightly coupled to the processor core to allow low-latency exception processing. The NVIC features listed below are those of the larger Cortex-M3 and Cortex-M4 cores; the Cortex-M0 implements a reduced version with up to 32 external interrupts and four priority levels:

  1. A configurable number of external interrupts, from 1 to 240, although the actual number of interrupts in hardware depends on the chip manufacturer
  2. A configurable number of priority bits, from three to eight
  3. Support for level and pulse interrupts
  4. Support for dynamic reprioritization of interrupts
  5. Priority grouping

Another important feature is the Wakeup Interrupt Controller (WIC) interface.

The WIC can detect an interrupt and wake the processor even from deep sleep mode, where the processor rests in its minimum power consumption state. Wireless sensor networks use this feature to reach the lowest possible power consumption.

Another important feature is data watchpoints and breakpoints, provided by the on-chip debug unit. In debug mode we can monitor the state of the processor in each and every clock cycle.


The ARM Cortex-M0+ is a superset of the Cortex-M0 processor in terms of instruction set, i.e. code written for the Cortex-M0 instruction set is 100% compatible with Cortex-M0+ processors. The Cortex-M0+ low-latency I/O interface provides “Harvard-like” access to peripherals, which improves overall cycle efficiency for I/O access.

               m4F instruction.jpg

The ARM Cortex-M3 adds more features on top of the Cortex-M0+ sub-series. Its main features are a 1-cycle 32-bit hardware multiply, a 2-12 cycle 32-bit hardware divide, and saturated math support.

The Cortex-M4 is a Cortex-M3 plus DSP instructions and an optional on-chip floating-point unit (FPU). If a core contains an FPU it is known as a Cortex-M4F; otherwise it is simply a Cortex-M4.

If you are interested in the ARM Accredited MCU Engineer (AAME) qualification, I'm sure you'll be delighted to know that it now has its own Study Guide to help prepare for the test. This goes alongside the existing AAE Study Guide. You can find it, along with all other public ARM documentation on our documentation portal at infocenter.arm.com


Here is a direct link to the document: ARM Information Center


Happy studying!



Chinese Version 中文版:引发下一次移动计算革命-ARMv8 SoC处理器

I recently had the opportunity to reflect on the mobile computing revolution of the last five years. I use the term 'mobile computing' deliberately - the compute tasks we handle on mobile phones today directly rival those that were only possible on laptops and desktops several years ago. With an uninterrupted power supply from the wall, laptop and desktop PCs needed fan-assisted cooling, and their architecture was designed around that capacity. Today, mobile devices run similarly demanding workloads for a full day (or more) on a single charge and serve as communications hub, entertainment center, game console, and mobile workstation. The architecture of ARM®-based mobile devices is and has always been designed around the mobile footprint. Continuing to improve the user experience in that footprint requires commitment to deliver the most out of each milliwatt and every millimeter of silicon.

The success of smartphones and tablets and the software app economy (worth $27 bn and growing) is largely based on SoCs (System-on-Chips) from ARM Partners. Mobile SoCs balance ever-increasing performance with form factor, battery life and price point across an incredibly diverse range of consumers. Most of them to date have been based on the ARMv7-A architecture, accounting for a 95% share of the growing smartphone market. The growing app ecosystem (with over 40bn downloads) has been largely designed and coded specifically for the ARM architecture, resulting in a vast application base. We are now at the transition point to ARMv8-A, the next generation in efficient computing.


2014 will see the arrival of numerous devices featuring the latest ARMv8-A architecture, opening the door for developers while retaining 100% compatibility with the vast app ecosystem based on 32-bit ARMv7. It is great to finally be at a point where the first ARMv8 mobile SoCs are coming to the market, and it is particularly positive that some of the upcoming SoCs employ ARM big.LITTLE® technology, which combines high-performance CPUs and high-efficiency CPUs in one processing sub-system, capable of both 32-bit and 64-bit operation while dynamically moving workloads to the right-size processor and saving upwards of 50% of the energy.

Qualcomm® recently announced their Snapdragon® 810 processor which uses four Cortex®-A57 cores and four Cortex-A53 cores in a big.LITTLE configuration, and the Snapdragon 808 processor which uses two Cortex®-A57 cores and four Cortex-A53 cores, again in a big.LITTLE configuration. These processors are expected to be available in commercial devices by the first half of 2015 and will feature 64-bit ARMv8 support for Android. We have been working together with teams from Qualcomm Technologies and other ecosystem partners for several to ensure that OEMs and OS providers are able to take full advantage of the ARMv8-A Architecture, ensuring that they can rely on the same design philosophy that has made ARMv7-A based Snapdragon processors so successful in the multiple segments of the mobile market.


My colleague  James Bruce and I recently collaborated with our counterparts at Qualcomm in writing a paper that delves further into ARMv8-A and explains the journey of bringing an ARMv8 SoC to market - I recommend it for anyone seeking to better understand the SoC design process and mobile processor market space.


The white paper (which you will find below) dispels a few myths about ARMv8-A (it's more than just 64 bit, it doesn't double code size, etc.) and outlines the approach one ARM partner takes in combining ARM IP with in-house IP to build a product line ranging from premium smartphones and tablets down to low-cost smartphone tiers for emerging markets.


The first half of the paper offers some useful insights into the mobile market, how ARM competes in the market, how Android is delivered on ARM platforms, and the benefits of the latest ARM Cortex-A processors and ARMv8 instruction set architecture.  The second half of the paper dives a bit deeper into Qualcomm's approach to delivering a complete SoC, combining in-house designed components with ARM IP, then optimizing the whole platform. It discusses Qualcomm's use of Cortex-A57 and Cortex-A53 along with big.LITTLE technology in the announced Snapdragon 808 and 810 SoCs, as well as their use of custom-designed CPUs, GPUs, and other components in the Snapdragon product line.


The ready availability of ARM IP and the flexibility of the ARM business model provide the freedom to mix and match and the opportunity to innovate rapidly. Both have been a big factor in enabling ARM partners like Qualcomm to be so successful in the smartphone and mobile computing revolution.
