The long-awaited GTC21 by NVIDIA has finally kicked off, an event I have personally anticipated for some time now. Stacked with keynotes, labs, tech talks, panels, and much more, this conference is like a virtual playground for developers. Industry leaders, experts, educators, enthusiasts, and technical evangelists from around the world are here to learn and share in what is bound to be years of explosive growth and innovation for the tech sector.
I am sure there will be many more great announcements and amazing content coming out of GTC. Still, I wanted to first draw your attention to the NVIDIA Arm HPC Developer Kit that was referenced in Monday's opening keynote.
NVIDIA CEO Jensen Huang announces the new Ampere Altra + NVIDIA A100 HPC developer kits (second from left).
This blog gives a quick overview of the new HPC developer kit. We talk about the already robust suite of NVIDIA and Arm developer tools (such as the NVIDIA HPC SDK and Arm Allinea Studio), and finally, we dive into a performance investigation for the Ampere Altra CPU + NVIDIA A100 setup. At the end of this blog, you will also find an easy way to express your early interest in the kit and opt in for updates from NVIDIA as it gets closer to becoming available.
Now, let us look at this kit.
During the GTC opening keynote, a graphic of this HPC developer kit caught my attention. It features an Ampere® Altra® CPU with 80 Arm Neoverse cores running up to 3.3GHz, dual NVIDIA A100 GPUs, and two NVIDIA BlueField-2® DPUs, so naturally I had to take a closer look.
Server SKU: GIGABYTE G242-P32
Form factor: 2U rack (87.5 x 438 x 820 mm; 3.44" x 17.24" x 32.28")
CPU: 1x Ampere Computing Altra Q80-30
Memory: 512 GB DDR4
Storage: 6 TB SAS/SATA 3.5"
GPU: 2x A100 PCIe 40GB
Network: 2x BF2 (IB single port, MBF2M345A-HESOT)
This system is set up to drive a wide range of workloads efficiently, and the two NVIDIA BlueField-2® DPUs accelerate networking, storage, and security. By pairing this kit with the NVIDIA HPC SDK and the Arm Allinea Studio tool suite, developers can create, migrate, and test the limits of their HPC and AI applications on GPU-accelerated Arm-based hardware. Things keep getting better for Arm developers.
It is time to explore the ecosystem of tools and some benchmarks that might help illustrate the capabilities of this new developer kit.
NVIDIA’s HPC SDK is the essential suite of tools for HPC developers on NVIDIA platforms.
The HPC SDK enables porting of important community and independent software vendor (ISV) applications to systems with multiprocessor Arm CPUs and NVIDIA data center GPUs. The NVIDIA Fortran, C, and C++ compilers fully optimize for both multiprocessor CPUs and NVIDIA GPUs. This lets HPC developers write and tune parallel applications for heterogeneous CPU+GPU servers using GPU-accelerated math libraries, standard Fortran/C++ parallel language features, parallel OpenACC and OpenMP directives, and CUDA.
The NVIDIA math libraries provide drop-in, highly optimized GPU-acceleration for linear algebra and signal processing algorithms fundamental to HPC. In addition to providing an easy on-ramp to GPU acceleration, math libraries provide speed-of-light performance for supported APIs and enable users to automatically benefit from newer GPU architectures as they are released.
NVIDIA’s vision for programming heterogeneous multiprocessor CPU+GPU systems: Start by using drop-in, GPU-optimized math libraries. Achieve initial GPU-acceleration using standard C++17 parallel algorithms and Fortran parallel language features. Use pragmas and directives to fill the standard language gaps (for example, data movement), and finally, optimize performance with CUDA.
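The directive step in that progression can be sketched with a SAXPY loop. This is a minimal illustration, not code from the HPC SDK: the function name `saxpy` is made up for the example, and the OpenACC directive with its `copyin` and `copy` data clauses is just one possible way to express the data movement. A compiler with OpenACC support (for example, nvc++ with the -acc flag) may offload the loop to a GPU; a compiler without it ignores the pragma and runs the loop serially on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example function: computes y[i] = a * x[i] + y[i] in place.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();

    // The data clauses describe the movement the base language cannot express:
    // x is only read on the device, y is read and written back.
    // Compilers without OpenACC support simply ignore this pragma.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Because the loop body is ordinary C++, the same source builds and gives the same result with or without offload, which is what makes directives a low-risk way to fill the gaps left by the standard language features.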
The HPC SDK is:
In addition to the NVIDIA HPC SDK, the Arm Allinea Studio is a suite of scalable, Arm-optimized developer tools targeted at HPC application developers. It enables end users to create, migrate, and innovate serial and parallel codes for Arm-based HPC systems, especially systems with GPU accelerators. HPC applications are commonly written in multiple languages and involve multiple layers of parallelism, so the Arm Allinea Studio supports C, C++, Fortran, Python, MPI, and OpenMP.
The three major components of Arm Allinea Studio are the Arm Compiler for Linux (ACfL), the Arm Performance Libraries (ArmPL), and Arm Forge. ACfL is an LLVM-based CPU compiler for server-class AArch64 platforms. It supports auto-vectorization (both SVE and Neon) in C, C++, and Fortran, and includes optimized runtime libraries that take advantage of hardware features like LSE atomics. ACfL is fully compatible with major MPI distributions like Open MPI, HPE MPI, and MVAPICH.
The ArmPL implements BLAS, LAPACK, and FFTW interfaces to microarchitecturally optimized maths libraries for scientific computing. It supports ACfL and GCC compilers and works well with parallel math libraries like ScaLAPACK. It is an excellent choice for working with dense single- and double-precision real data and complex data. It also includes a growing set of sparse matrix functions (for example, SpMV), using an inspector-executor framework, for high-performing sparse solutions.
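For readers unfamiliar with SpMV, here is a minimal sketch of a sparse matrix-vector product over a matrix stored in compressed sparse row (CSR) form. It is a plain reference loop, not the ArmPL API: the `CsrMatrix` struct and the `spmv` function are hypothetical names introduced only for this example. A library built on an inspector-executor design would typically analyze the sparsity pattern once and reuse an optimized plan across many multiplications; this sketch omits that inspection step and shows only the baseline computation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container for illustration; not an ArmPL type.
struct CsrMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::size_t> row_ptr;  // rows + 1 offsets into col_idx/values
    std::vector<std::size_t> col_idx;  // column index of each stored entry
    std::vector<double> values;        // value of each stored entry
};

// Computes y = A * x, touching only the stored (nonzero) entries of A.
std::vector<double> spmv(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.row_ptr.size() == a.rows + 1);
    assert(a.col_idx.size() == a.values.size());
    assert(x.size() == a.cols);

    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        double sum = 0.0;
        // Entries of row r live in the half-open range [row_ptr[r], row_ptr[r + 1]).
        for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            sum += a.values[k] * x[a.col_idx[k]];
        }
        y[r] = sum;
    }
    return y;
}
```

For the 2 x 3 matrix with rows (1, 0, 2) and (0, 3, 0) and x = (1, 1, 1), the product is y = (3, 3). The indirect access `x[a.col_idx[k]]` is what makes this kernel sensitive to memory and cache behavior rather than raw arithmetic throughput.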
Debugging and performance profiling are provided by Arm Forge. Forge's DDT scalable debugger is popular with HPC centers worldwide for its cross-platform support of C, C++, Fortran, and Python parallel applications on CPUs, GPUs, Intel, POWER, and Arm architectures. Forge’s MAP profiler is an extremely scalable low-overhead solution for performance characterization. It helps developers to accelerate their code by revealing the causes of slow performance and is commonly used at scales ranging from multiprocessor Linux workstations through to the largest supercomputers. Runtime overhead is typically under 5% and it fully supports C, C++, and Fortran with no relinking, instrumentation, or code changes required.
In anticipation of the NVIDIA Arm HPC Developer Kit launch, Arm and NVIDIA have characterized the performance of over two dozen key HPC applications on the Ampere Altra, NVIDIA A100, and AMD EPYC 7742. We used the NVIDIA HPC SDK, the Arm Allinea Studio, and the GNU 10.2 toolchains, combined with optimized math libraries from NVIDIA and Arm. We selected the best times from both the x86 and Arm platforms, but did not change the application source code. Better performance may be possible, but this is representative of the “out of the box” experience for scientists and engineers.
When comparing the performance of GPU-accelerated applications on x86-based and Arm-based computing platforms, we found that average application performance is the same. Memory-bound CPU-only applications also perform similarly on both platforms, but there were a few cases where the Arm-based Ampere Altra outperformed the AMD EPYC 7742.
To better understand the reason for these speedups, we used Arm Forge from the Arm Allinea Studio to characterize application performance on both the x86-based and the Arm-based platform. This apples-to-apples comparison across platforms showed that the speedups were due to the Altra having twice as much L1 and L2 data cache per core as the EPYC. CPU-only applications with large, sparse working sets benefit from the Altra’s high cache per core. GPU-accelerated applications benefit from the low kernel launch latency and enjoy lower CPU power consumption in all cases. For more details, see NVIDIA GTC session S32758 “HPC Applications on Arm + NVIDIA A100”.
If you are as excited about these new kits as I am and interested in learning more, cruise on over to the NVIDIA Arm HPC Developer Kit landing page to express your early interest in this hardware.