Linux® has become a popular operating system in embedded products, and as a result is experiencing increased usage for system design and verification. This means that a new set of people, who are not traditional embedded software engineers, need to learn to work with Linux to accomplish daily tasks.
For system design tasks, Carbon users generally proceed from bare-metal software benchmarks such as Dhrystone and CoreMark to running benchmark applications on an operating system. If the design uses ARM® Cortex™-A series processors, the most common OS choice is Linux. Linux has come a long way with respect to ARM Architecture since Linus Torvalds made his famous complaint about ARM support back in March, 2011. The Linux Device Tree has made it very easy for those of us who work with simulated hardware, and often partial systems, to be able to run Linux with almost no changes the kernel source code.
As I mentioned in my previous blog, I recently joined Carbon. I decided the best way to learn would be to use the tools the way a customer would. I’ll talk below about the steps I took to get Linux running on a minimal system using the ARM Cortex-A15 processor. If you’re using a Cortex-A15, you’ll get even more benefit as this platform is available for download as a Cortex-A15 CPAK.
It’s easier to start with small system and iteratively add complexity – add peripherals, enable drivers, and verify at each phase that the system functions as expected. Starting off with a larger system introduces the difficulty of extracting hardware; it can be tricky to identify the software that relies on the hardware you want to remove.
It has also been my observation that many engineers designing and using leading edge SoCs are keenly interested in the performance details of key parts of chip such as the processors, interconnect, memory controller, and GPU. These processor sub-system architects don’t always want or need a lot of slower peripherals that are typically assumed to be present in systems running Linux.
The primary goal is to identify the minimum hardware needed to run Linux on an ARM Cortex-A15. It should be possible to run Linux with nothing more than the A15 CPU, memory, and a UART, so I set out to try it.
A secondary goal is to use the latest kernel from kernel.org and change as little of the Linux source code as possible. Minimizing source code changes makes it easier to update to new versions of Linux as they are released.
Simulation speed is far more important than hardware accuracy when experimenting with Linux configurations to confirm a working kernel. This is an ideal situation in which to generate an ARM Fast Model design from the Carbon SoC Designer Plus canvas. After the kernel is working with the Fast Model design, it is easy to run the Carbon cycle accurate simulator, sdsim, for benchmarking and utilize Swap & Play to confirm a fully working virtual prototype that is both fast and accurate.
To meet the goal of changing as little kernel source code as possible, I started from a currently-supported platform, the ARM Versatile™ Express, then configured the kernel to use only the minimal hardware. New kernel versions will continue to support Versatile Express, making future upgrades easy to do.
I decided to call my new hardware design the a15mini to indicate a Cortex-A15 system with minimal hardware. The Versatile Express design requires specifying the memory map and interrupt connections. The memory for the Cortex-A series Versatile Express is from 0x80000000 to 0xffffffff. The first PL011 UART is located at 0x1c090000 and uses interrupt 5 on the A15 IRQS[n:0] input request lines connected to the internal Generic Interrupt Controller (GIC). The Cortex-A15 needs to have the base address for the internal memory mapped peripherals (PERIPHBASE) set to 0x2c000000. The only other relevant information is that the UART runs from a 24 MHz reference clock.
Creating the a15mini with SoC Designer consists of instantiating the models and connecting them on the canvas using sdcanvas. Using cycle accurate models means more detail is needed to create the design. Instead of just the CPU, simple address decoder, and memory I used the ARM CCI-400 Cache Coherent Interconnect and the NIC-301 interconnect. These models, along with the Cortex-A15 are built using Carbon IP Exchange, the web-based portal that builds models directly from ARM RTL code. It’s pretty amazing to think that I answer a few questions or submit an XML file from AMBA Designer and get back a simple to use model in the form of a .so file and know that it was generated from a very complex RTL design of a processor like the Cortex-A15. It’s as if I’m using millions on lines of Verilog code and I never see any of it.
The design is shown below:
Creating a Linux image for the a15mini is a little more complex than the hardware design procedure.
There are many ways to prepare Linux, but at a minimum the following items are needed:
For this experiment I decided to make things as easy to work with as possible. The most straightforward way was to use a single executable file (ELF file) containing all of the above items; anybody who wants to run the platform needs only this one file to represent all of the artifacts. A drawback of this approach is that the file must be regenerated when any of the items changes, but it creates a generic solution that can be run on any kind of simulator.
I downloaded Linux 3.13.1 from kernel.org as the starting point. This was the latest kernel at the time I did the initial simulation, but new versions are released frequently.
The default configuration for the Versatile Express is found in the Linux source tree at: arch/arm/configs/vexpress_defconfigFirst, I use the Versatile Express configuration as the baseline by running:
$ make ARCH=arm vexpress_defconfig
To get up and running quickly, I cheated - I copied the file new-buildroot-rootfs.cpio.gz from the Carbon ARM Cortex-A9 Linux CPAK and renamed it fs.cpio.gz. (A future article may cover the various ways to make file system images.)
To create the single executable file with all of the needed artifacts, I needed to embed the file system image in the kernel, and append the Device Tree Blob at the end of the kernel image.To embed the file system image in the kernel, you can use any of the Linux configuration interfaces. I tend to use menuconfig:
$ make ARCH=arm menuconfig
I navigated to the General Setup menu (see the image below), scrolled down to “Initramfs sources file(s)” and added the name of the file system image, fs.cpio.gz. I put this file at the top of the Linux source tree so no additional path is needed.
To append the Device Tree Blob at the end of the kernel image, access the Boot options menu item “Use appended device tree blog to zImage (EXPERIMENTAL).” I enabled this to append the .dtb file at the end of the zImage file (see in the image below).
While I was in the Boot options menu I also set the Default kernel command string by adding root=/dev/ram and earlyprintk. This specifies to use a ram based root file system; the only possible choice since no other storage is included in the hardware design. There are many ways to set the default kernel command string, but this approach works for well for this application, in which we want to link everything into a single elf file.
The bad news is I wasn’t able to run the 3.13.1 kernel on the a15mini without any kernel source code changes. The good news is I came within 1 line of the goal. I needed to edit the file arch/arm/mach-vexpress/v2m.c to remove the line that configures the kernel scheduler clock to read a time value from the Versatile Express System Registers (this peripheral has a register which provides a time value). To achieve my goal of including a minimum of hardware, I wanted to forgo the System Registers.
The line I removed is in the function v2m_dt_init_early() and is line number 423, the call to versatile_sched_clock_init().
Now the source tree could be compiled. I work on Ubuntu 13.10 and use the cross-compiler with the GNU prefix arm-linux-gnueabi, so my compile command was:
$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- -j 4
The most important compilation result is the kernel image file arch/arm/boot/zImage.
The Versatile Express device trees for ARM Fast Models are available from linux-arm.org. The easiest way to get the files is to use git:
$ git clone git://linux-arm.org/arm-dts.git
The file that is the closest match for the a15mini in the fast_models/ directory rtsm_ve-cortex_a15x1.dts
This file also includes a .dtsi file named rtsm_ve-motherboard.dtsi. The main work to support the a15mini hardware system was to modify these two device tree files so they match the hardware available by removing all of the hardware that doesn’t exist.
So first, I edited rtsm_ve-motherboard.dtsi t1to remove all of the peripherals that are gone from the a15mini such as flash, Ethernet, keyboard/mouse controller, three extra UARTs, watch dog timer, and more. I left the original structure, but shrunk the file all the way down to just the UART!
Linux needs a timer to run. I removed the SP804 timers that are present in the Versatile Express and the System Register block that provides the time value for the scheduler clock. Replacing these, I want to use an internal timer in the A15, sometimes called the Architected timer, to keep the hardware design as small as possible.
Support for the ARM Architected timer is already present in rtsm_ve-cortex_a15x1.dts, but I found that it didn’t work right away when I commented out the call to versatile_clock_sched_init(). The kernel crashed at the point of setting up the Architected timer, and printed a message that there was no frequency available. By looking through other device tree files, I found that some armv7-timer entries had a clock-frequency attached to them, while the Versatile Express entry did not. After adding the clock-frequency to the timer in rtsm_ve-cortex_a15x1.dts, the timer worked, and the Linux works as expected with the internal timer. No external timer is needed!
I have attached the two device tree files so you can take a look at them as needed.
After the two device tree files have been shrunk down to support only the hardware available on the a15mini, they can be compiled into the device tree blob using the device tree compiler, dtc, which is available in the scripts/dtc directory of the kernel source tree.
I did not add anything to my PATH to find dtc, I just reached over into the kernel source tree and ran:
$ ../linux-3.13/scripts/dtc/dtc -O dtb -o rtsm_ve-cortex_a15x1.dtb\ fast_models/rtsm_ve-cortex_a15x1.dts
To use the feature that appends the device tree to the Linux kernel image, I simply copied the zImage from the kernel tree.
$ cp linux-3.13.1/arch/arm/boot/zImage .
Then concatenated the device tree blob to the end of the kernel image:
$ cat arm-dts/rtsm_ve-cortex_a15x1.dtb >> zImage
There are many ways to create a boot loader, but to meet the goal of a single, all-inclusive executable file, a small assembly boot loader is the best. A great example of this is available in the ARM Fast Models ThirdParty IP package. This is an add-on to ARM Fast Models which contains all of the open source software that can be run on Fast Models examples. The majority of the package is the Linux source trees and file system images for different examples.
I selected two really useful files from the RTSM_Linux source code that comes with the ThirdParty IP package:
The addition of these two files, combined with the modified zImage (including embedded file system and concatenated device tree) means that everything is now present in the single executable file.
I also made some minor adjustments to the Makefile, inserting my paths, compiler name, and file names so it would generate the file a15-linux.axf as the final output that will be used in simulation.The updated Makefile is shown below:
Because the interconnect is more complex than a simple address decoder, there are two ways to confirm the design is correct.
I would recommend both methods to make sure the design is functioning properly and there are no errors in connecting the interrupt, setting CPU model parameters, or other common mistakes made during design construction. My most common mistake is forgetting to set the PERIPHBASE address of the A15 to 0x2c000000.
Since I’m focusing on Linux I will describe how to run a fast version of the design.
From the Tools menu in sdcanvas select FastModel System…
The Fast Model System Creator dialog will appear as shown below. Clicking the Create button will generate a Fast Model equivalent system automatically from the cycle accurate system.
This Fast Model system can be run in sdsim using the Simulation -> Simulate System … (or F5). When the simulator starts specify the a15-linux.axf file that was created in the last section and watch Linux boot in a matter of seconds. The terminal below shows 3.13.1 Linux with the machine type reported as Versatile Express.
Now we have established a working cycle accurate simulation and a Fast Model simulation for the same design, both of which can be run with the SoC Designer simulator, sdsim.
Working with Linux on a cycle accurate simulation is a great way to study all of the details of the hardware, software combination such as bus utilization, cache metrics, cache snooping, barrier transactions, and much more. It’s exciting until you realize that Linux takes about 300M instructions to boot and well over a billion clock cycles.
One alternative to waiting for a full cycle accurate simulation is to create Swap & Play checkpoints at various points of interest and then load the checkpoints into the cycle accurate simulation. To create a checkpoint run the Fast Model simulation and then stop the simulator using either a breakpoint or just hitting the Stop button at the place you want to stop, such as the Linux prompt. Use the File -> Save As menu in sdsim and select Swap & Play Checkpoint (*.mxc) as shown below:
The next dialog will ask for a name of the checkpoint and a location for the file. Enter any name and hit OK to save the checkpoint.
To load the checkpoint into the cycle accurate simulation load the design into sdsim and then use the File -> Restore checkpoint view…
Select the checkpoint saved in the Fast Model simulation and it will load into the cycle accurate simulation. There is no need to even load an image file for this case since it will be restored by the checkpoint. The disassembly window and the register window will show the same location that was saved from the fast model simulation. I saved a checkpoint at the Linux prompt and restored it into the cycle accurate simulation and I can even see I’m sitting at the WFI instruction in the Linux idle loop.
A debugger such as RealView or DS-5 can also be connected to start source level debugging.
An alternative trick I use is the addr2line utility to find out where in the code the Disassembly or Register Window is showing. This is useful to just take a peek at the current location without starting the full debugger. For the disassembly window above I do:
$ arm-linux-gnueabi-addr2line -e vmlinux 0x80014524 /home/cds/jasona/kernel.org/linux-x1/linux-3.13.1/arch/arm/mm/proc-v7.S:73
Now I can see the source code for the current location as a quick check. Sure enough, I’m sitting at the Linux idle loop as expected since nothing is happening sitting in the shell at the prompt. Here is the code:
It’s easy to imagine using swap & play to load checkpoints and do cycle accurate debugging as well as performance analysis for benchmarks. Carbon users typically run benchmarks such as Dhrystone, CoreMark, and LMbench as Linux applications.
So far, this was done using a single core A15 CPU. It’s interesting but there are no actual systems that use a single core A15. Next time I will show how to extend the a15mini for multi-core simulation by moving to the ARM Cortex-A15x2 CPU and running SMP Linux, again with as little hardware as possible. The dual-core A15 matches one of my favorite machines, the Samsung Chromebook.
As I mentioned at the beginning, all of the work that I’ve done here porting Linux and setting up a system which can be used with Swap & Play is available as a CPAK on Carbon IP Exchange. You can use this system to port your own version of Linux or customize the hardware configuration to match that of your own design.
Jason Andrews
FYI the reason Linux complains about setting up the architected timer is because it really wants something (before Linux is entered) to have configured CNTFRQ to the right value - on the basis that this register cannot be set unless in Secure world and the Linux kernel has to assume it may be running in Non-Secure. The "clock-frequency" property of the timer device tree node is a legacy from before that was decided and codified.
You can do that operation in the boot wrapper (boot.S/model.lds) which actually exists as a separate and usable source tree, and also makes the whole device tree appending thing so much easier. There's a copy of it here, for example: Linaro Git Code Hosting - arm/models/boot-wrapper.git/summary