New NVIDIA HPC developer kit for Arm developers

Robert Wolff
April 14, 2021

The long-awaited GTC21 by NVIDIA has finally kicked off, an event I have personally anticipated for some time now. Stacked with keynotes, labs, tech talks, panels, and much more, this conference is like a virtual playground for developers. Industry leaders, experts, educators, enthusiasts, and technical evangelists from around the world are here to learn and share in what is bound to be years of explosive growth and innovation for the tech sector.

I am sure there will be many more great announcements and amazing content coming out of GTC. Still, I wanted to first draw your attention to the NVIDIA Arm HPC Developer Kit that was referenced in Monday's opening keynote.


NVIDIA CEO Jensen Huang announces new Ampere Altra + NVIDIA A100 HPC Developer Kits (second from left).

This blog gives a quick overview of the new HPC developer kit. We talk about the already robust suite of NVIDIA and Arm developer tools (such as the NVIDIA HPC SDK and Arm Allinea Studio), and finally, we dive into a performance investigation for the Ampere Altra CPU + NVIDIA A100 setup. At the end of this blog, you will also find an easy way to express your early interest in the kit and opt in for updates from NVIDIA as it gets closer to becoming available.

Now, let us look at this kit.

The NVIDIA Arm HPC Developer Kit

During the GTC opening keynote, a graphic of this HPC developer kit caught my attention. It features an Ampere® Altra® CPU with 80 Arm Neoverse cores running at up to 3.3 GHz, dual NVIDIA A100 GPUs, and two NVIDIA BlueField-2® DPUs, so naturally I had to take a closer look.

Server SKU:   GIGABYTE G242-P32
Form factor:  2U rack (87.5 x 438 x 820 mm; 3.44" x 17.24" x 32.28")
CPU:          1x Ampere Computing Altra Q80-30
Memory:       512 GB DDR4
Storage:      6 TB SAS/SATA 3.5”
GPU:          2x A100 PCIe 40GB
Network:      2x BF2 (IB single port, MBF2M345A-HESOT)

This system is clearly set up to drive a wide range of workloads efficiently, and the two NVIDIA BlueField-2® DPUs accelerate networking, storage, and security. By pairing this kit with the NVIDIA HPC SDK and the Arm Allinea Studio tool suite, developers can create, migrate, and test the limits of their HPC and AI applications on GPU-accelerated Arm-based hardware. Things keep getting better for Arm developers.

It is time to explore the ecosystem of tools and some benchmarks that might help illustrate the capabilities of this new developer kit.

Tools and performance investigation

NVIDIA HPC SDK for Arm-based platforms

NVIDIA’s HPC SDK is the essential suite of tools for HPC developers on NVIDIA platforms. 

The HPC SDK enables porting of important community and independent software vendor (ISV) applications to systems with multiprocessor Arm CPUs and NVIDIA data center GPUs. The NVIDIA Fortran, C, and C++ compilers are fully optimizing for both multiprocessor CPUs and NVIDIA GPUs, enabling HPC developers to write and tune parallel applications for heterogeneous CPU+GPU servers using GPU-accelerated math libraries, standard Fortran/C++ parallel language features, parallel OpenACC and OpenMP directives, and CUDA.

The NVIDIA math libraries provide drop-in, highly optimized GPU-acceleration for linear algebra and signal processing algorithms fundamental to HPC. In addition to providing an easy on-ramp to GPU acceleration, math libraries provide speed-of-light performance for supported APIs and enable users to automatically benefit from newer GPU architectures as they are released.

NVIDIA’s vision for programming heterogeneous multiprocessor CPU+GPU systems: Start by using drop-in, GPU-optimized math libraries. Achieve initial GPU-acceleration using standard C++17 parallel algorithms and Fortran parallel language features. Use pragmas and directives to fill the standard language gaps (for example, data movement), and finally, optimize performance with CUDA. 
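As a rough illustration of the "directives fill the gaps" step, here is a minimal sketch of a SAXPY loop annotated with an OpenACC directive. The function name `saxpy` and the data clauses are choices made for this example and do not come from the keynote or the SDK documentation. The `copyin` and `copy` clauses express the data movement that standard C++ cannot describe. A compiler without OpenACC support simply ignores the pragma and runs the loop serially on the CPU, so the same source works on both paths.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// SAXPY: y[i] = a * x[i] + y[i] for every element.
// The OpenACC directive asks an OpenACC-capable compiler to offload the loop,
// copying x to the device and copying y to the device and back.
// Compilers without OpenACC support ignore the pragma and run the loop serially.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();  // raw pointers so the data clauses can name array sections
    float* yp = y.data();

    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

With the NVIDIA HPC SDK, a file like this would typically be built with `nvc++ -acc`; with other compilers it should still build as ordinary serial C++, possibly with a warning about the unrecognized pragma.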

The HPC SDK is:

  • Supported on 99 percent of the top 500 HPC systems as of November 2019, including Arm and Arm+GPU
  • The only product that includes comprehensive support for programming GPUs using standard language constructs in C++ and Fortran, OpenACC and OpenMP directives, and CUDA
  • Complete with optimized math and communications libraries and performance analysis and debugging tools for both CPUs and GPUs
  • Fully interoperable with third-party tools
  • Freely available (multiple releases throughout the year)

Additional enablement with Arm Allinea Studio

In addition to the NVIDIA HPC SDK, the Arm Allinea Studio is a suite of scalable, Arm-optimized developer tools targeted at HPC application developers. It enables end users to create, migrate, and innovate serial and parallel codes for Arm-based HPC systems, especially systems with GPU accelerators. HPC applications are commonly written in multiple languages and involve multiple layers of parallelism, so the Arm Allinea Studio supports C, C++, Fortran, Python, MPI, and OpenMP.

The three major components of Arm Allinea Studio are the Arm Compiler for Linux (ACfL), the Arm Performance Libraries (ArmPL), and Arm Forge. ACfL is an LLVM-based CPU compiler for server-class AArch64 platforms. It supports auto-vectorization (both SVE and Neon) in C, C++, and Fortran, and includes optimized runtime libraries that take advantage of hardware features like LSE atomics. ACfL is fully compatible with major MPI distributions like Open MPI, HPE MPI, and MVAPICH.

The ArmPL implements BLAS, LAPACK, and FFTW interfaces to microarchitecturally optimized math libraries for scientific computing. It supports the ACfL and GCC compilers and works well with parallel math libraries like ScaLAPACK. It is an excellent choice for working with dense single- and double-precision real and complex data. It also includes a growing set of sparse matrix functions (for example, SpMV) that use an inspector-executor framework for high-performing sparse solutions.
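For readers unfamiliar with SpMV (sparse matrix-vector multiplication), here is a minimal reference sketch in plain C++ using the compressed sparse row (CSR) layout. The `CsrMatrix` struct and `spmv` function are hypothetical names for illustration only; this is not the ArmPL API. An inspector-executor library would first analyze the sparsity pattern and then run a tuned kernel, whereas this sketch shows only the basic computation that such a kernel performs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container for illustration; not an ArmPL type.
struct CsrMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::size_t> row_ptr;  // rows + 1 entries; row r spans [row_ptr[r], row_ptr[r + 1])
    std::vector<std::size_t> col_idx;  // column index of each stored value
    std::vector<double> values;        // the stored (nonzero) values
};

// Computes y = A * x by visiting only the stored entries of each row.
std::vector<double> spmv(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.row_ptr.size() == a.rows + 1);
    assert(a.col_idx.size() == a.values.size());
    assert(x.size() == a.cols);

    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        double sum = 0.0;
        for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            sum += a.values[k] * x[a.col_idx[k]];
        }
        y[r] = sum;
    }
    return y;
}
```

The indirect access `x[a.col_idx[k]]` is what makes SpMV sensitive to memory and cache behavior, which is why tuned library implementations can matter for sparse workloads.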

Debugging and performance profiling are provided by Arm Forge. Forge's DDT scalable debugger is popular with HPC centers worldwide for its cross-platform support of C, C++, Fortran, and Python parallel applications on CPUs, GPUs, Intel, POWER, and Arm architectures. Forge’s MAP profiler is an extremely scalable low-overhead solution for performance characterization. It helps developers to accelerate their code by revealing the causes of slow performance and is commonly used at scales ranging from multiprocessor Linux workstations through to the largest supercomputers. Runtime overhead is typically under 5% and it fully supports C, C++, and Fortran with no relinking, instrumentation, or code changes required.

Ampere Altra + NVIDIA A100 performance investigation

In anticipation of the NVIDIA Arm HPC Developer Kit launch, Arm and NVIDIA have characterized the performance of over two dozen key HPC applications on the Ampere Altra, NVIDIA A100, and AMD EPYC 7742. We used the NVIDIA HPC SDK, the Arm Allinea Studio, and the GNU 10.2 toolchains, combined with optimized math libraries from NVIDIA and Arm. We selected the best times from both the x86 and Arm platforms, but did not change the application source code. Better performance may be possible, but this is representative of the “out of the box” experience for scientists and engineers.

When comparing the performance of GPU-accelerated applications on x86-based and Arm-based computing platforms, we found that average application performance is the same. Memory-bound CPU-only applications also perform similarly on both platforms, but there were a few cases where the Arm-based Ampere Altra outperformed the AMD EPYC 7742.

To better understand the reason for these speedups, we used Arm Forge from the Arm Allinea Studio to characterize application performance on both the x86-based and the Arm-based platform. This apples-to-apples comparison across platforms showed that the speedups were due to the Altra having twice as much L1 and L2 data cache per core as the EPYC. CPU-only applications with large, sparse working sets benefit from the Altra’s high cache per core. GPU-accelerated applications benefit from the low kernel launch latency and enjoy lower CPU power consumption in all cases. For more details, see NVIDIA GTC session S32758 “HPC Applications on Arm + NVIDIA A100”.

Get access to the kit

If you are as excited about these new kits as I am and interested in learning more, cruise on over to the NVIDIA Arm HPC Developer Kit landing page to express your early interest in this hardware.
