Parallel heterogeneous computing for IoT-boards and nanocomputers with Armv8 and AArch64 hardware architecture

November 20, 2020

15 minute read time.

This is a guest blog contribution from Arthur Ratz

Build and run modern parallel code in C++17, implemented with the CL/SYCL programming model specification, on IoT-boards and innovative tiny-sized nanocomputers based on the revolutionary cluster-symmetric Arm Cortex-A72 CPUs with the Arm AArch64 architecture.

This blog article provides practical guidelines, tips, and a tutorial for building modern parallel code in C++17/2x, implemented using the CL/SYCL programming model, and running it on the next generation of IoT-boards based on the innovative quad-core, 64-bit RISC Arm Cortex-A72 CPUs.

Readers will find out how to deliver parallel code in C++17 with the open-source distribution of Aksel Alpay's hipSYCL library project, and how to install and configure the LLVM and Clang-9.x.x Arm AArch64 toolchains for building parallel code executables and running them on the powerful Arm Cortex-A72 CPUs with the Arm AArch64 architecture. The article focuses mainly on building and running parallel code executables on the latest Raspberry Pi 4B+ boards, based on the Broadcom BCM2711 SoC-chips designed especially for embedded systems and IoT.

Raspberry Pi 4B+ IoT-boards based on Arm Cortex-A72 CPUs

In 2015, Arm announced the revolutionary new symmetric Cortex-A72 CPUs with the 64-bit Armv8-A hardware architecture, fully supporting parallel computations at scale. This opened the next era of IoT-boards and tiny-sized nanocomputers, including the Raspberry Pi 4B+ boards. They are designed for massively collecting and processing data in real time, as the most essential constituent of embedded systems and IoT-clusters.

The Arm Cortex-A72 CPUs operate at a clock frequency of 1.8 GHz and are paired with the latest LPDDR4-3200 RAM, with a capacity of up to 8 GB depending on the SoC-chip and IoT-board model. They meet the expectations of software developers and system engineers engaged in designing high-performance embedded systems and IoT-clusters. Also, the Cortex-A72 CPUs have a high L2 cache capacity that varies from 512 KiB to 4 MiB, depending on the specific CPU model and revision.

One example of the Arm Cortex-A72 in use is the innovative BCM2711 SoC-chip from Broadcom, which powers the Raspberry Pi 4B+ IoT-boards from the Raspberry Pi Foundation.

The Raspberry Pi boards are known as “reliable” and “fast” tiny-sized nanocomputers, well suited to data mining and parallel computing. New hardware architectural features of Arm's cluster-symmetric 64-bit RISC CPUs, such as DSP, SIMD, VFPv4, and hardware virtualization support, significantly improve the performance, acceleration, and scalability of Raspberry Pi boards that process data massively in parallel.

Specifically, Raspberry Pi boards based on the Arm Cortex-A72 CPU with 4 GiB of RAM or more are the most suitable solution for IoT data mining and parallel computing. Also, the BCM2711B0 SoC-chips are bundled with various integrated devices and peripherals, such as Broadcom VideoCore VI GPUs at 500 MHz, PCI-Ex gigabit Ethernet adapters, and so on.

All that we need for parallel computing with IoT is a Raspberry Pi 4B+, or any other IoT-board whose SoC-chip is based on Arm Cortex-A72 CPUs and LPDDR4 system memory.

We demonstrate how to set up a Raspberry Pi 4B+ board for first use, out of the box.

Here is a brief checklist of the hardware and software requirements that must be met beforehand.

Hardware:

Raspberry Pi 4 model B0, 4GB IoT board
Micro-SD card 16GB for Raspbian OS and data storage
DC power supply: 5.0V/2-3A with USB Type-C connector (minimum 3A - for data mining and parallel computing)

Software:

Raspbian Buster 10.6.0 full OS
Raspbian Imager 1.4
MobaXterm 20.3 build 4396, or any other SSH-client

Setting up a Raspberry Pi 4B IoT board

Before we begin, we must download the latest release of the Raspbian Buster 10.6.0 full OS image from the official Raspberry Pi repository. We also need to download and use the Raspbian Imager 1.4 application that is available for various platforms, such as Windows, Linux, or macOS.

We must also download and install the MobaXterm application for connecting to the Raspberry Pi board remotely over the SSH or FTP protocols:

MobaXterm 20.3

Once the Raspbian Buster OS image and the Imager application have been downloaded and installed, use the Imager application to do the following:

Erase the SD-card, formatting it to the FAT32 filesystem, by default
Extract the pre-installed Raspbian Buster OS image (*.img) to the SD-card

Once the previous steps have been completed, remove the SD-card from the card-reader and plug it into the Raspberry Pi board’s SD-card slot. Then, attach the micro-HDMI and Ethernet cables. Finally, plug in the DC power supply cable's connector and turn on the board. The system boots up with the Raspbian Buster OS installed on the SD-card and prompts you to perform several post-installation steps to configure it for first use.

Once the board has been powered on, make sure that all of the following post-installation steps are completed:

Open the bash-console and set the ‘root’ password
```
pi@raspberrypi4:~ $ sudo passwd root
```
Log in to the Raspbian bash-console with 'root' privileges
```
pi@raspberrypi4:~ $ sudo -s
```

Upgrade the Raspbian's Linux base system and firmware, using the following commands

root@raspberrypi4:~# sudo apt update
root@raspberrypi4:~# sudo apt full-upgrade
root@raspberrypi4:~# sudo rpi-update

Reboot the system, for the first time

root@raspberrypi4:~# sudo shutdown -r now

Install the latest Raspbian bootloader and reboot the system once again

root@raspberrypi4:~# sudo rpi-eeprom-update -d -a
root@raspberrypi4:~# sudo shutdown -r now

Launch the 'raspi-config' setup tool
```
root@raspberrypi4:~# sudo raspi-config
```
Complete the following steps, using the 'raspi-config' tool

* Update the 'raspi-config' tool:

Graphic showing the raspi-config tool

* Disable the Raspbian's desktop GUI on boot:

System options >> Boot / Autologin >> Console autologin:

Graphic showing the console login

* Expand the root ‘/’ partition size on the SD-card:

Graphic showing expand the root

After performing the Raspbian post-install configuration, reboot the system. After rebooting, you will be prompted to log in. Use the ‘root’ username and the password set previously to log in to the bash-console with root privileges.

Once you have logged in, install the following packages from the APT-repositories by running this command in the bash-console:

root@raspberrypi4:~# sudo apt install -y net-tools openssh-server

These two packages are required for configuring the Raspberry Pi's network interface and the OpenSSH-server, so that you can connect to the board remotely over the SSH protocol by using MobaXterm.

Configure the board’s network interface ‘eth0’ by modifying the /etc/network/interfaces, for example:

auto eth0
iface eth0 inet static
address 192.168.87.100
netmask 255.255.255.0
broadcast 192.168.87.255
gateway 192.168.87.254
dns-nameservers 192.168.87.254

Next, perform a basic configuration of the OpenSSH-server by uncommenting and setting these lines in /etc/ssh/sshd_config:

PermitRootLogin yes
StrictModes no

PasswordAuthentication yes
PermitEmptyPasswords yes

This enables 'root' login to the bash-console over the SSH protocol with password authentication. The PermitEmptyPasswords option additionally allows logins for accounts whose password is empty.

Finally, try connecting to the board over the network by using the MobaXterm application to open a remote SSH-session to the host with the IP-address 192.168.87.100. You should be able to log in to the Raspbian bash-console with the credentials set previously.

Graphic showing the bash console

Developing a parallel code in C++17 using CL/SYCL model

In 2020, Khronos Group announced the revolutionary new heterogeneous parallel compute platform (XPU). It provides the ability to offload the execution of "heavy" data processing workloads to a wide range of hardware acceleration targets (for example, GPGPUs or FPGAs), rather than to the host CPUs only. Conceptually, parallel code development for the XPU-platform is entirely based on the Khronos CL/SYCL programming model specification, an abstraction layer over the OpenCL 2.0 library. Here is a tiny example illustrating code in C++17, implemented using the CL/SYCL model abstraction layer.

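#include <CL/sycl.hpp>
#include <cstdint>

constexpr std::uint32_t N = 1000;

int main() {
    // Default-constructed queue: uses the default device
    cl::sycl::queue q{};

    // Submit a command group whose kernel runs N iterations
    q.submit([&](cl::sycl::handler& cgh) {
        cgh.parallel_for<class Kernel>(cl::sycl::range<1>{N},
            [=](cl::sycl::id<1> idx) {
                // Do some work in parallel
            });
    });

    // Block until the submitted kernel has finished
    q.wait();
    return 0;
}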

The C++17 code fragment shown previously is based entirely on the CL/SYCL programming model. It instantiates a cl::sycl::queue{} object with the default initializer list, which submits SYCL-kernels for execution on the acceleration target used by default, the host CPU. Next, it invokes the cl::sycl::queue::submit(...) method, whose lambda receives a single argument, a cl::sycl::handler{} object. The handler gives access to methods that provide the basic kernel functionality, built on various parallel algorithms, including the cl::sycl::handler::parallel_for(...) method.

This method implements a tight parallel loop, spawned from within a running kernel. Conceptually, each iteration of the loop is executed in parallel by its own thread. The cl::sycl::handler::parallel_for(...) method accepts two main arguments: a cl::sycl::range<>{} object and a lambda-function that is invoked during each loop iteration. The cl::sycl::range<>{} object defines the number of parallel loop iterations to execute in each dimension, which covers the case where multiple nested loops are collapsed to process multi-dimensional data.
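On a quad-core CPU target, a runtime cannot literally give each of N iterations its own hardware thread, so the iterations have to be divided among a small number of workers. The following is a minimal sketch of one possible static split, written in plain C++17 so that it builds without a SYCL library. The helper name split_iterations is hypothetical, and the article does not describe how hipSYCL actually schedules iterations, so treat this only as an illustration of the idea.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of SYCL or hipSYCL): splits the iteration
// space [0, n) into at most `workers` contiguous half-open chunks
// [begin, end) whose sizes differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>>
split_iterations(std::size_t n, std::size_t workers) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (n == 0 || workers == 0) {
        return chunks;  // nothing to schedule
    }
    if (workers > n) {
        workers = n;    // never create empty chunks
    }
    const std::size_t base = n / workers;   // minimum chunk size
    const std::size_t extra = n % workers;  // first `extra` chunks get one more
    std::size_t begin = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t size = base + (w < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + size);
        begin += size;
    }
    return chunks;
}
```

For example, split_iterations(1000, 4) yields four chunks of 250 iterations each, one per Cortex-A72 core.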

In the code above, a cl::sycl::range<1>{N} object schedules N iterations of the parallel loop in a single dimension. The lambda-function passed to the parallel_for(...) method accepts a single argument, a cl::sycl::id<>{} object. Like cl::sycl::range<>{}, this object is a vector-like container in which each element is the index value for one dimension of the current iteration of the parallel loop. Passed as an argument to the code in the lambda-function's scope, this object is used for retrieving the specific index values. The lambda-function's body contains the code that does the data processing in parallel.
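To make the relationship between a range and an id more concrete, here is a small sketch in plain C++17 of how a two-dimensional index maps to a position in a flat, row-major array and back. The types Range2 and Id2 and the functions linearize and delinearize are hypothetical stand-ins introduced for illustration. They are not the cl::sycl::range and cl::sycl::id classes, which carry more functionality.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for a 2-D range and a 2-D id; plain arrays are
// used so that the sketch builds without a SYCL library.
using Range2 = std::array<std::size_t, 2>;  // {rows, cols}
using Id2 = std::array<std::size_t, 2>;     // {row, col}

// Row-major linear index of `id` inside `range`.
std::size_t linearize(const Id2& id, const Range2& range) {
    return id[0] * range[1] + id[1];
}

// Inverse mapping: recover the 2-D index from a linear index.
Id2 delinearize(std::size_t linear, const Range2& range) {
    return Id2{linear / range[1], linear % range[1]};
}
```

With a 4 x 5 range, the index {2, 3} maps to the linear position 2 * 5 + 3 = 13, and delinearize(13, ...) returns {2, 3} again. This is the kind of arithmetic a kernel performs when it processes a matrix stored in a one-dimensional buffer.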

After the kernel has been submitted to the queue and spawned for execution, the code invokes the cl::sycl::queue::wait() method with no arguments to set a synchronization barrier. This ensures that no further host code is executed until the spawned kernel has completed its parallel work.

The CL/SYCL heterogeneous programming model is highly efficient and can be used for a wide range of applications.

However, Intel Corp. and Codeplay Software soon deprecated CL/SYCL support for hardware architectures other than the "native" x86_64. This made it impossible to deliver parallel C++ code with their CL/SYCL libraries for Arm/AArch64 and other architectures.

Presently, there are a number of open-source CL/SYCL library projects, developed by many developers and enthusiasts, that support more hardware architectures than x86_64 alone. In 2019, Aksel Alpay at Heidelberg University (Germany) implemented a library for the latest CL/SYCL programming model specification that targets several hardware architectures, including the Raspberry Pi's Arm and AArch64 architectures. He contributed it to GitHub as the hipSYCL open-source library project (https://github.com/illuhad/hipSYCL).

Next, we discuss how to install and configure the LLVM/Clang-9.x.x compilers and toolchains and the hipSYCL library distribution, in order to deliver modern parallel code in C++17 based on the library.

Installing and configuring LLVM/Clang-9.x.x

Before using Aksel Alpay's hipSYCL library distribution, the LLVM/Clang-9.x.x compilers and the Arm/AArch64 toolchains must be properly installed and configured. To do that, complete the following steps.

Update the Raspbian's APT-repositories and install the following prerequisite packages:
```
root@raspberrypi4:~# sudo apt update
root@raspberrypi4:~# sudo apt install -y bison flex python python3 snap snapd git wget
```
The previous command installs the alternative 'snap' package manager, which is required for installing a suitable version of the 'cmake' utility (>= 3.18.0), together with the 'python' and 'python3' distributions and the 'bison' and 'flex' utilities. All of them are needed for building the hipSYCL open-source project from scratch with the 'cmake' utility.
Install the 'cmake' >= 3.18.0 utility and the 'clangd' language server by using the 'snap' package manager:
```
root@raspberrypi4:~# sudo snap install cmake --classic
root@raspberrypi4:~# sudo snap install clangd --classic
```
After installing the 'cmake' utility, check that it works and that the correct version has been installed from the 'snap'-repository by using the following command:
```
root@raspberrypi4:~# sudo cmake --version
```
You should see output similar to the following after running this command:
```
cmake version 3.18.4

CMake suite maintained and supported by Kitware (kitware.com/cmake).
```

Install the latest Boost, POSIX-Threads, and C/C++ standard runtime libraries for the LLVM/Clang toolchain:

root@raspberrypi4:~# sudo apt install -y libc++-dev libc++1 libc++abi-dev libc++abi1 libpthread-stubs0-dev libpthread-workqueue-dev

root@raspberrypi4:~# sudo apt install -y clang-format clang-tidy clang-tools clang libc++-dev libc++1 libc++abi-dev libc++abi1 libclang-dev libclang1 liblldb-dev libllvm-ocaml-dev libomp-dev libomp5 lld lldb llvm-dev llvm-runtime llvm python-clang libboost-all-dev

Download and add the LLVM/Clang's APT-repositories security key:

root@raspberrypi4:~# wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -

Append the LLVM/Clang’s repository URLs to the APT’s sources list:

root@raspberrypi4:~# echo "deb http://apt.llvm.org/buster/ llvm-toolchain-buster main" >> /etc/apt/sources.list.d/raspi.list

root@raspberrypi4:~# echo "deb-src http://apt.llvm.org/buster/ llvm-toolchain-buster main" >> /etc/apt/sources.list.d/raspi.list

The two previous steps, adding the security key and the repository URLs, are necessary for installing the LLVM/Clang-9.x.x compilers and toolchains from that APT-repository.

Remove the existing symlinks to any previously installed versions of LLVM/Clang:
```
root@raspberrypi4:~# cd /usr/bin && rm -f clang clang++
```
Update the APT-repositories, once again, and install the LLVM/Clang’s compilers, debugger, and linker:
```
root@raspberrypi4:~# sudo apt update
root@raspberrypi4:~# sudo apt install -y clang-9 lldb-9 lld-9
```

Create the corresponding symlinks to the installed ‘clang-9’ and ‘clang++-9’ compilers:

root@raspberrypi4:~# cd /usr/bin && ln -s clang-9 clang
root@raspberrypi4:~# cd /usr/bin && ln -s clang++-9 clang++

Finally, you should be able to use the ‘clang’ and ‘clang++’ commands in the bash-console:
```
root@raspberrypi4:~# clang --version && clang++ --version
```
The previous command prints the version of LLVM/Clang that has been installed.

After running it, you should see output similar to the following:

clang version 9.0.1-6+rpi1~bpo10+1
Target: armv6k-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
clang version 9.0.1-6+rpi1~bpo10+1
Target: armv6k-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
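Note that the target triple reported above, armv6k-unknown-linux-gnueabihf, names a 32-bit Arm target. To check which architecture a particular build is really targeting, a small probe based on the compiler's predefined macros can help. The sketch below assumes a Clang or GCC compiler, which define __aarch64__ and __arm__ for the respective targets. The function names target_architecture and pointer_bits are hypothetical and introduced only for this illustration.

```cpp
#include <string>

// Reports the architecture the compiler is targeting, based on
// predefined macros that Clang and GCC set for each target.
std::string target_architecture() {
#if defined(__aarch64__)
    return "AArch64 (64-bit Arm)";
#elif defined(__arm__)
    return "AArch32 (32-bit Arm)";
#elif defined(__x86_64__)
    return "x86_64";
#else
    return "other";
#endif
}

// Pointer width in bits: 64 on AArch64, 32 on 32-bit Arm targets.
constexpr unsigned pointer_bits() {
    return static_cast<unsigned>(sizeof(void*) * 8);
}
```

Printing both values from a short main() compiled with the installed 'clang++' shows whether the toolchain produces 64-bit AArch64 code or 32-bit Arm code.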

Downloading and building hipSYCL library distribution

Another essential step is downloading the open-source hipSYCL library staging distribution from GitHub and building it from sources.

This is typically done by completing the following steps:

Download the hipSYCL project's distribution, cloning it from GitHub:
```
root@raspberrypi4:~# git clone https://github.com/llvm/llvm-project llvm-project
root@raspberrypi4:~# git clone --recurse-submodules https://github.com/illuhad/hipSYCL
```
Aksel Alpay's hipSYCL project distribution has several dependencies on another open-source project, LLVM/Clang. That is why we need to clone both distributions to build the hipSYCL runtime library from scratch.

Set the environment variables required for building the hipSYCL project from sources by using the 'export' and 'env' commands, and append the same lines to the .bashrc profile script:

export LLVM_INSTALL_PREFIX=/usr
export LLVM_DIR=~/llvm-project/llvm
export CLANG_EXECUTABLE_PATH=/usr/bin/clang++
export CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include

echo "export LLVM_INSTALL_PREFIX=/usr" >> /root/.bashrc
echo "export LLVM_DIR=~/llvm-project/llvm" >> /root/.bashrc
echo "export CLANG_EXECUTABLE_PATH=/usr/bin/clang++" >> /root/.bashrc
echo "export CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include" >> /root/.bashrc

env LLVM_INSTALL_PREFIX=/usr
env LLVM_DIR=~/llvm-project/llvm
env CLANG_EXECUTABLE_PATH=/usr/bin/clang++
env CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include

Create and change to the ~/hipSYCL/build subdirectory under the hipSYCL project's main directory:
```
root@raspberrypi4:~# mkdir ~/hipSYCL/build && cd ~/hipSYCL/build
```

Configure the hipSYCL project's sources using 'cmake' utility:

root@raspberrypi4:~# cmake -DCMAKE_INSTALL_PREFIX=/opt/hipSYCL ..

Build and install the hipSYCL runtime library using the GNU 'make' command:
```
root@raspberrypi4:~# make -j $(nproc) && make install -j $(nproc)
```
Copy the libhipSYCL-rt.so runtime library to the Raspbian's default libraries location:
```
root@raspberrypi4:~# cp /opt/hipSYCL/lib/libhipSYCL-rt.so /usr/lib/libhipSYCL-rt.so
```

Set the environment variables required for building source code with the hipSYCL runtime library and the LLVM/Clang compilers:

export PATH=$PATH:/opt/hipSYCL/bin
export C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/hipSYCL/include
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib

echo 'export PATH=$PATH:/opt/hipSYCL/bin' >> /root/.bashrc
echo 'export C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/hipSYCL/include' >> /root/.bashrc
echo 'export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include' >> /root/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib' >> /root/.bashrc

env PATH=$PATH:/opt/hipSYCL/bin
env C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/hipSYCL/include
env CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include
env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib

Running a parallel CL/SYCL code in C++17 on Raspberry Pi 4B+

We are now all set with installing and configuring LLVM/Clang and the hipSYCL library. It is strongly recommended to build and run the 'matmul_hipsycl' sample executable to make sure that everything works.

Here are the most common steps for building the following sample from sources:

rm -rf ~/sources
mkdir ~/sources && cd ~/sources
cp ~/matmul_hipsycl.tar.gz ~/sources/matmul_hipsycl.tar.gz
tar -xvf matmul_hipsycl.tar.gz
rm -f matmul_hipsycl.tar.gz

The previous commands create the ~/sources subdirectory and extract the sample's sources from the matmul_hipsycl.tar.gz archive.

To build the sample's executable, simply use the GNU 'make' command:

root@raspberrypi4:~# make all

This invokes the 'syclcc-clang' tool to build the executable:

syclcc-clang -O3 -std=c++17 -o matrix_mul_rpi4 src/matrix_mul_rpi4b.cpp -lstdc++

This command compiles the C++17 code with the highest level of code optimization (-O3) enabled and links it with the C++ standard library runtime.

Note: Along with the runtime library, the hipSYCL build also provides the 'syclcc' and 'syclcc-clang' tools. These are used for building parallel code in C++17 that is implemented with the hipSYCL library. Using these tools differs slightly from the regular usage of the 'clang' and 'clang++' commands. However, 'syclcc' and 'syclcc-clang' accept the same compiler and linker options as the original 'clang' and 'clang++' commands.

After compiling with these tools, grant execution privileges to the 'matrix_mul_rpi4' file generated by the compiler, using the following command:

root@raspberrypi4:~# chmod +rwx matrix_mul_rpi4

Run the executable, in the bash-console:

root@raspberrypi4:~# ./matrix_mul_rpi4

After running it, the executable prints output similar to the following:

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Multiplication C = A x B:

Matrix C:

323 445 243 343 363 316 495 382 463 374
322 329 328 388 378 395 392 432 470 326
398 357 337 366 386 407 478 457 520 374
543 531 382 470 555 520 602 534 639 505
294 388 277 314 278 330 430 319 396 372
447 445 433 485 524 505 604 535 628 509
445 468 349 432 511 391 552 449 534 470
434 454 339 417 502 455 533 498 588 444
470 340 416 364 401 396 485 417 496 464
431 421 325 325 272 331 420 385 419 468


Execution time: 5 ms
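One way to gain confidence in output like this is to compare the parallel result against a simple serial reference. The sample's source code is not reproduced in this article, so the following is only a sketch under stated assumptions: the Matrix alias and the multiply_reference function are hypothetical names, matrices are stored as nested std::vector rows, and the element type long is an arbitrary choice for integer data such as the values shown above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical matrix type for this sketch: a vector of rows.
using Matrix = std::vector<std::vector<long>>;

// Serial reference: C = A x B for an (n x k) matrix A and a (k x m) matrix B.
// Assumes A and B are non-empty, rectangular, and dimensionally compatible.
Matrix multiply_reference(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    const std::size_t k = b.size();
    const std::size_t m = b[0].size();
    Matrix c(n, std::vector<long>(m, 0));
    for (std::size_t i = 0; i < n; ++i) {
        // The i-p-j loop order walks the rows of B and C contiguously.
        for (std::size_t p = 0; p < k; ++p) {
            const long a_ip = a[i][p];
            for (std::size_t j = 0; j < m; ++j) {
                c[i][j] += a_ip * b[p][j];
            }
        }
    }
    return c;
}
```

For instance, multiplying {{1, 2}, {3, 4}} by {{5, 6}, {7, 8}} gives {{19, 22}, {43, 50}}. Comparing every element of such a reference result with the matrix computed by the parallel kernel is a quick correctness check before measuring execution time.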

Optionally, we can evaluate the performance of the parallel code while it is running by installing and using the following utilities ('top' is provided by the 'procps' package):

root@raspberrypi4:~# sudo apt install -y procps htop

The 'htop' utility visualizes CPU and system memory utilization while the parallel code executable is running:
Graphic showing the parallel code executable

Summary

Micro-FPGAs and pocket-sized GPGPUs with compute capabilities, connected externally to an IoT-board through GPIO or USB interfaces, are the next step for parallel computing with IoT. Tiny-sized FPGAs and GPGPUs make it possible to perform even more complex and “heavy” computations in parallel, drastically increasing the speed-up when processing huge amounts of data in real time.

Another essential aspect of parallel computing with IoT is the continued development of libraries and frameworks that implement the CL/SYCL programming model specification and support the heterogeneous compute platform (XPU). Presently, the latest versions of these libraries support offloading parallel code execution to the host CPU acceleration target. Other acceleration hardware for nanocomputers, such as small-sized GPGPUs and FPGAs, has not yet been designed and manufactured by vendors.

In fact, parallel computing with the Raspberry Pi and other IoT-boards based on the Arm Cortex-A72 cluster of 64-bit RISC CPUs is of interest to software developers and hardware technicians who assess the performance of existing computational processes when these run in parallel on IoT devices.

In conclusion, IoT-based parallel computing generally benefits the overall performance of cloud-based solutions intended for collecting and massively processing big data in real time and, as a result, positively impacts the quality of machine learning (ML) and data analytics.
