Build and run modern parallel code, written in C++17 using the CL/SYCL programming model specification, on IoT boards and innovative tiny nanocomputers based on the cluster-symmetric Arm Cortex-A72 CPUs with the Arm AArch64 architecture.
The following blog article provides practical guidelines, tips, and a tutorial for building modern parallel code in C++17/2x, implemented using the CL/SYCL programming model, and running it on the next generation of IoT boards based on the innovative Arm Cortex-A72 quad-core 64-bit RISC CPUs.
Readers will find out how to deliver parallel code in C++17 with Aksel Alpay's open-source hipSYCL library distribution, and how to install and configure the LLVM/Clang-9.x.x Arm AArch64 toolchains for building parallel code executables and running them on the powerful Arm Cortex-A72 CPUs with the Arm AArch64 architecture. This blog article mainly focuses on building and running these parallel executables on the latest Raspberry Pi 4B+ boards, based on the Broadcom BCM2711 SoC, especially designed for embedded systems and the IoT.
In 2016, Arm announced the release of its revolutionary new symmetric Cortex-A72 CPUs with the 64-bit Armv8-A hardware architecture, fully supporting parallel computation at scale. This opened the next tremendous era of IoT boards and tiny nanocomputers, including the Raspberry Pi 4B+. They are designed for massively collecting and processing data in real time, as the most essential constituent of embedded systems and IoT clusters.
The Arm Cortex-A72 CPUs operate at a 1.8 GHz clock frequency and are paired with the latest LPDDR4-3200 MHz RAM, with a capacity of up to 8 GB depending on the SoC and the IoT board model. They meet the expectations of software developers and system engineers engaged in designing high-performance embedded systems and IoT clusters. Also, the Cortex-A72 CPUs have a revolutionary large L2 cache, whose capacity varies from 512 KiB to 4 MiB depending on the specific CPU model and revision.
An example of using the Arm Cortex-A72 is the innovative BCM2711 SoC and the Raspberry Pi 4B+ IoT boards, manufactured by Broadcom and the Raspberry Pi Foundation.
The Raspberry Pi boards are known as reliable and fast tiny nanocomputers, well suited for data mining and parallel computing. Principally new hardware architectural features of Arm's cluster-symmetric 64-bit RISC CPUs, such as DSP, SIMD, VFPv4, and hardware virtualization support, brought significant improvements in performance, acceleration, and scalability when using the Raspberry Pi to massively process data in parallel.
Specifically, a Raspberry Pi based on the Arm Cortex-A72 CPU with 4 GiB of RAM or more installed is the most suitable solution for IoT data mining and parallel computing. Also, the BCM2711B0 SoC is bundled with a variety of integrated devices and peripherals, such as the Broadcom VideoCore VI @ 500 MHz GPU, a PCIe gigabit Ethernet adapter, and so on.
All we need for parallel computing with the IoT is a Raspberry Pi 4B+, or any other IoT board whose SoC is built on Arm Cortex-A72 CPUs and LPDDR4 system memory.
We demonstrate how to set up a Raspberry Pi 4B+ board for first use, out of the box.
Here is a brief checklist of the hardware and software requirements that must be met beforehand.
Before we begin, we must download the latest release of the Raspbian Buster 10.6.0 full OS image from the official Raspberry Pi repository. We also need to download and use the Raspberry Pi Imager 1.4 application, which is available for various platforms, such as Windows, Linux, or macOS.
We must also download and install the MobaXterm application for establishing a remote connection to the Raspberry Pi board over the SSH or FTP protocols:
Once the Raspbian Buster OS and the Imager application have been successfully downloaded and installed, we use the Imager application to do the following:
Once the previous steps have been successfully completed, remove the SD card from the card reader and plug it into the Raspberry Pi board's SD card slot. Then attach the micro-HDMI and Ethernet cables. Finally, plug in the DC power supply cable's connector and turn on the board. The system boots up with the Raspbian Buster OS installed to the SD card, prompting you to perform several post-installation steps to configure it for first use.
Once the board has been powered on, make sure that all of the following post-installation steps have been completed:
pi@raspberrypi4:~ $ sudo passwd root
pi@raspberrypi4:~ $ sudo -s
root@raspberrypi4:~# sudo apt update
root@raspberrypi4:~# sudo apt full-upgrade
root@raspberrypi4:~# sudo rpi-update
root@raspberrypi4:~# sudo shutdown -r now
root@raspberrypi4:~# sudo rpi-eeprom-update -d -a
root@raspberrypi4:~# sudo shutdown -r now
root@raspberrypi4:~# sudo raspi-config
* Update the 'raspi-config' tool:
* Disable the Raspbian's desktop GUI on boot:
System options >> Boot / Autologin >> Console autologin:
* Expand the root ‘/’ partition size on the SD-card:
After performing the Raspbian post-install configuration, finally reboot the system. After rebooting, you will be prompted to log in. Use the 'root' username and the password previously set to log in to the bash console with root privileges.
Once you have successfully logged in, install the following packages from the APT repositories by using this command in the bash console:
root@raspberrypi4:~# sudo apt install -y net-tools openssh-server
These two packages are required for configuring both the Raspberry Pi's network interface and the OpenSSH server, so that we can connect to the board remotely over the SSH protocol by using MobaXterm.
Configure the board’s network interface ‘eth0’ by modifying the /etc/network/interfaces, for example:
auto eth0
iface eth0 inet static
    address 192.168.87.100
    netmask 255.255.255.0
    broadcast 192.168.87.255
    gateway 192.168.87.254
    dns-nameservers 192.168.87.254
After configuring the network interface, perform a basic configuration of the OpenSSH server by uncommenting these lines in /etc/ssh/sshd_config:
PermitRootLogin yes
StrictModes no
PasswordAuthentication yes
PermitEmptyPasswords yes
This enables 'root' login to the bash console over the SSH protocol without entering a password.
Finally, try connecting to the board over the network by using the MobaXterm application and opening a remote SSH session to the host with the IP address 192.168.87.100. You should be able to successfully log in to the Raspbian bash console with the credentials previously set.
In 2020, the Khronos Group announced a revolutionary new heterogeneous parallel compute platform (XPU). It provides the ability to offload the execution of "heavy" data processing workloads to a wide range of hardware acceleration targets (for example, GPGPUs or FPGAs), rather than to the host CPUs only. Conceptually, parallel code development on the XPU platform is entirely based on the Khronos CL/SYCL programming model specification, an abstraction layer on top of the OpenCL 2.0 library. Here is a tiny example illustrating C++17 code implemented using the CL/SYCL model abstraction layer.
#include <CL/sycl.hpp>
#include <cstdint>

using namespace cl::sycl;

constexpr std::uint32_t N = 1000;

int main() {
    cl::sycl::queue q{};
    q.submit([&](cl::sycl::handler& cgh) {
        cgh.parallel_for<class Kernel>(cl::sycl::range<1>{N},
            [=](cl::sycl::id<1> idx) {
                // Do some work in parallel
            });
    });
    q.wait();
    return 0;
}
The C++17 code fragment shown above is built entirely on the CL/SYCL programming model. It instantiates a cl::sycl::queue{} object with the default parameter initializer list, so that SYCL kernels are submitted for execution to the default host CPU acceleration target. Next, it invokes the queue's submit(...) method, which takes a single argument: a cl::sycl::handler{} object that gives access to the basic kernel functionality, including various parallel algorithms such as the cl::sycl::handler::parallel_for(...) method.
The parallel_for(...) method is used for implementing a tight parallel loop, spawned from within a running kernel. Each iteration of this loop is executed in parallel by its own thread. The cl::sycl::handler::parallel_for(...) method accepts two main arguments: a cl::sycl::range<>{} object and a specific lambda function invoked during each loop iteration. The cl::sycl::range<>{} object basically defines the number of parallel loop iterations executed for each specific dimension, for the case when multiple nested loops are collapsed while processing multi-dimensional data.
In the code above, a cl::sycl::range<1>{N} object is used for scheduling N iterations of the parallel loop in a single dimension. The lambda function of the parallel_for(...) method accepts a single argument: a cl::sycl::id<>{} object. Like cl::sycl::range<>{}, this object implements a vector container whose elements are the index values for each dimension and each iteration of the parallel loop. Passed as an argument to the code in the lambda function's scope, this object is used for retrieving the specific index values. The lambda function's body contains the code that does some of the data processing in parallel.
After a specific kernel has been submitted to the queue and spawned for execution, the code invokes the queue's wait() method with no arguments to set a synchronization barrier. This ensures that no further code is executed until the spawned kernel has completed its parallel work.
The CL/SYCL heterogeneous programming model is highly efficient and can be used for a wide range of applications.
However, Intel Corp. and Codeplay Software Ltd. soon deprecated support of CL/SYCL for hardware architectures other than the "native" x86_64. This made it impossible to deliver parallel C++ code targeting Arm/AArch64 and other architectures by using their specific CL/SYCL libraries.
Presently, there are a number of open-source CL/SYCL library projects, developed by a vast community of developers and enthusiasts, that support more hardware architectures than x86_64 only. In 2019, Aksel Alpay at Heidelberg University (Germany) implemented the latest CL/SYCL programming model layer specification for a number of hardware architectures, including the Raspberry Pi's Arm AArch64 architecture, and contributed the hipSYCL open-source library project distribution to GitHub (https://github.com/illuhad/hipSYCL).
Next, we discuss how to install and configure the LLVM/Clang-9.x.x compilers, the toolchains, and the hipSYCL library distribution, in order to deliver modern parallel code in C++17 based on this library.
Before using Aksel Alpay's hipSYCL library distribution, the specific LLVM/Clang-9.x.x compilers and the Arm/AArch64 toolchains must be properly installed and configured. To do that, make sure that you have completed the following steps.
root@raspberrypi4:~# sudo apt update
root@raspberrypi4:~# sudo apt install -y bison flex python python3 snap snapd git wget
root@raspberrypi4:~# sudo snap install cmake --classic
root@raspberrypi4:~# sudo snap install clangd --classic
root@raspberrypi4:~# sudo cmake --version
cmake version 3.18.4

CMake suite maintained and supported by Kitware (kitware.com/cmake).
root@raspberrypi4:~# sudo apt install -y libc++-dev libc++1 libc++abi-dev libc++abi1 libpthread-stubs0-dev libpthread-workqueue-dev
root@raspberrypi4:~# sudo apt install -y clang-format clang-tidy clang-tools clang libc++-dev libc++1 libc++abi-dev libc++abi1 libclang-dev libclang1 liblldb-dev libllvm-ocaml-dev libomp-dev libomp5 lld lldb llvm-dev llvm-runtime llvm python-clang libboost-all-dev
root@raspberrypi4:~# wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
root@raspberrypi4:~# echo "deb http://apt.llvm.org/buster/ llvm-toolchain-buster main" >> /etc/apt/sources.list.d/raspi.list
root@raspberrypi4:~# echo "deb-src http://apt.llvm.org/buster/ llvm-toolchain-buster main" >> /etc/apt/sources.list.d/raspi.list
root@raspberrypi4:~# cd /usr/bin && rm -f clang clang++
root@raspberrypi4:~# sudo apt update
root@raspberrypi4:~# sudo apt install -y clang-9 lldb-9 lld-9
root@raspberrypi4:~# cd /usr/bin && ln -s clang-9 clang
root@raspberrypi4:~# cd /usr/bin && ln -s clang++-9 clang++
root@raspberrypi4:~# clang --version && clang++ --version
After using the commands, you must see the following output:
clang version 9.0.1-6+rpi1~bpo10+1
Target: armv6k-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
clang version 9.0.1-6+rpi1~bpo10+1
Target: armv6k-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
Another essential step is downloading and building the open-source hipSYCL library staging distribution from its sources, published on GitHub.
This is typically done by completing the following steps:
root@raspberrypi4:~# git clone https://github.com/llvm/llvm-project llvm-project
root@raspberrypi4:~# git clone --recurse-submodules https://github.com/illuhad/hipSYCL
export LLVM_INSTALL_PREFIX=/usr
export LLVM_DIR=~/llvm-project/llvm
export CLANG_EXECUTABLE_PATH=/usr/bin/clang++
export CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include
echo "export LLVM_INSTALL_PREFIX=/usr" >> /root/.bashrc
echo "export LLVM_DIR=~/llvm-project/llvm" >> /root/.bashrc
echo "export CLANG_EXECUTABLE_PATH=/usr/bin/clang++" >> /root/.bashrc
echo "export CLANG_INCLUDE_PATH=$LLVM_INSTALL_PREFIX/include/clang/9.0.1/include" >> /root/.bashrc
root@raspberrypi4:~# mkdir ~/hipSYCL/build && cd ~/hipSYCL/build
root@raspberrypi4:~# cmake -DCMAKE_INSTALL_PREFIX=/opt/hipSYCL ..
root@raspberrypi4:~# make -j $(nproc) && make install -j $(nproc)
root@raspberrypi4:~# cp /opt/hipSYCL/lib/libhipSYCL-rt.so /usr/lib/libhipSYCL-rt.so
export PATH=$PATH:/opt/hipSYCL/bin
export C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/hipSYCL/include
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hipSYCL/lib
echo "export PATH=\$PATH:/opt/hipSYCL/bin" >> /root/.bashrc
echo "export C_INCLUDE_PATH=\$C_INCLUDE_PATH:/opt/hipSYCL/include" >> /root/.bashrc
echo "export CPLUS_INCLUDE_PATH=\$CPLUS_INCLUDE_PATH:/opt/hipSYCL/include" >> /root/.bashrc
echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:/opt/hipSYCL/lib" >> /root/.bashrc
We are finally all set with installing and configuring the LLVM/Clang toolchain and the hipSYCL library. It is strongly recommended to build and run the 'matmul_hipsycl' sample's executable to make sure that everything works just fine:
Here are the most common steps for building the following sample from sources:
rm -rf ~/sources
mkdir ~/sources && cd ~/sources
cp ~/matmul_hipsycl.tar.gz ~/sources/matmul_hipsycl.tar.gz
tar -xvf matmul_hipsycl.tar.gz
rm -f matmul_hipsycl.tar.gz
The set of commands above creates the ~/sources subdirectory and extracts the sample's sources from the matmul_hipsycl.tar.gz archive.
To build the sample's executable, simply use the GNU 'make' command:
root@raspberrypi4:~# make all
This invokes the 'syclcc-clang' command to build the executable:
syclcc-clang -O3 -std=c++17 -o matrix_mul_rpi4 src/matrix_mul_rpi4b.cpp -lstdc++
This command compiles the specific C++17 code with the highest optimization level (-O3) enabled and links it against the C++ standard library runtime.
Note: Along with the library runtime, the built hipSYCL project also provides the 'syclcc' and 'syclcc-clang' tools, which are used for building parallel C++17 code implemented with the hipSYCL library. Using these tools differs slightly from the regular usage of the 'clang' and 'clang++' commands; however, 'syclcc' and 'syclcc-clang' still accept the same compiler and linker options as the original 'clang' and 'clang++' commands.
After performing the compilation with these tools, grant execution privileges to the 'matrix_mul_rpi4' file generated by the compiler, using the following command:
root@raspberrypi4:~# chmod +rwx matrix_mul_rpi4
Run the executable, in the bash-console:
root@raspberrypi4:~# ./matrix_mul_rpi4
After running it, the execution ends up with output similar to the following:
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Multiplication C = A x B:
Matrix C:
323 445 243 343 363 316 495 382 463 374
322 329 328 388 378 395 392 432 470 326
398 357 337 366 386 407 478 457 520 374
543 531 382 470 555 520 602 534 639 505
294 388 277 314 278 330 430 319 396 372
447 445 433 485 524 505 604 535 628 509
445 468 349 432 511 391 552 449 534 470
434 454 339 417 502 455 533 498 588 444
470 340 416 364 401 396 485 417 496 464
431 421 325 325 272 331 420 385 419 468
Execution time: 5 ms
Optionally, we can evaluate the performance of the parallel code being executed by installing and using the following utilities:
root@raspberrypi4:~# sudo apt install -y htop
The 'htop' utility visualizes the CPU and system memory utilization while the parallel code executable is running:
Micro-FPGAs, as well as pocket-sized GPGPUs with compute capabilities, connected to an IoT board externally over GPIO or USB interfaces, are the next step in parallel computing with the IoT. Using tiny FPGAs and GPGPUs provides an opportunity to perform even more complex and "heavy" computations in parallel, drastically increasing the actual performance speed-up while processing huge amounts of big data in real time.
Obviously, another essential aspect of parallel computing with the IoT is the continued development of the specific libraries and frameworks providing the CL/SYCL model layer specification and heterogeneous compute platform (XPU) support. Presently, the latest versions of these libraries support offloading parallel code execution to the host CPU acceleration targets only, since other acceleration hardware, such as small GPGPUs and FPGAs for nanocomputers, has not yet been designed and manufactured by vendors.
In fact, parallel computing with the Raspberry Pi and other specific IoT boards based on the cluster-symmetric 64-bit Arm Cortex-A72 RISC CPUs is of interest to software developers and hardware technicians conducting performance assessments of existing computational processes run in parallel with the IoT.
In conclusion, applying IoT-based parallel computing generally benefits the overall performance of cloud-based solutions intended for collecting and massively processing big data in real time, and, as a result, positively impacts the quality of machine learning (ML) and data analytics itself.