
Arm enables more developers than ever with Allinea Studio 20.2 and a free version of Arm Performance Libraries

Patrick Wohlschlegel
July 22, 2020

The team is very pleased to announce that Arm Compiler for Linux Version 20.2 (bundled in Arm Allinea Studio 20.2) is now available on developer.arm.com. In addition to the commercial toolchain of compiler and libraries, there is now a free-to-use version of our performance libraries (targeted at cloud users of N1 hardware) available here.

This release focuses on incremental improvements, defect fixes, and internal infrastructure improvements because of a shorter development cycle. Highlights of this release include:

  • Introduction of a free Arm Performance Libraries edition
  • Arm Performance Libraries is now redistributable
  • Performance optimizations and A64FX tuning to our JIT-based Fast Fourier Transforms (introduced in 20.1)
  • Neoverse-N1 performance tuning
  • Improvements to the Scalable Vector Extensions (SVE) versions of libamath functions (namely exp, expf, log, logf, sin, sinf, cos, and cosf)
  • Inclusion of GCC 9.3.

C/C++ and Fortran Compilers

A64FX support and SVE integer dot-product support

With the arrival of Fujitsu's A64FX, SVE instructions have become available in real hardware for the first time. Arm supports this major milestone by further improving the reliability and quality of SVE code generation for large codes, including codes that rely on the Arm C Language Extensions (ACLE). Most of these improvements are 'under the hood', but one feature worth mentioning is support for auto-vectorization of integer dot product calculations.

For example, the following code:

void dotp( short *out, short *a, short *b, int N)
{
  int acc = 0;
  for (int i=0; i<N; i++) {
    acc += a[i] * b[i];
  }
  *out = acc >> 16;
}

now generates an inner-loop that makes use of the SVE SDOT instruction, as follows:

.LBB0_2:                                // =>This Inner Loop Header: Depth=1
	ld1h	{ z1.h }, p0/z, [x1, x8, lsl #1]
	ld1h	{ z2.h }, p0/z, [x2, x8, lsl #1]
	inch	x8
	whilelo	p0.h, x8, x9
	sdot	z0.d, z2.h, z1.h
	b.mi	.LBB0_2
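
To make the inner loop above easier to follow, here is a minimal scalar sketch of what the `sdot z0.d, z2.h, z1.h` form does: each 64-bit accumulator lane adds the four products from its group of 16-bit source lanes. The names `sdot_model` and `dot_model` are hypothetical, and the 512-bit vector length is an assumption chosen to match A64FX. This is a model for illustration, not the compiler's output or an ACLE intrinsic. Note that the model accumulates in 64 bits, as the `.d` lanes do, whereas the C source accumulates in a 32-bit `int`.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed vector length for this sketch: 512 bits, as on A64FX. */
#define VL_BITS 512
#define H_LANES (VL_BITS / 16) /* 32 halfword source lanes */
#define D_LANES (VL_BITS / 64) /* 8 doubleword accumulator lanes */

/* Scalar model of "sdot zda.d, zn.h, zm.h": every 64-bit accumulator
 * lane adds the four 16-bit x 16-bit products from its own group. */
static void sdot_model(int64_t acc[D_LANES],
                       const int16_t zn[H_LANES],
                       const int16_t zm[H_LANES])
{
    for (size_t d = 0; d < D_LANES; d++) {
        for (size_t j = 0; j < 4; j++) {
            acc[d] += (int64_t)zn[4 * d + j] * (int64_t)zm[4 * d + j];
        }
    }
}

/* Dot product built on the model. Lanes past the end of the arrays are
 * zeroed, mirroring the zeroing predicated loads (ld1h ... p0/z), and
 * the accumulator lanes are summed once after the loop. */
static int64_t dot_model(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc[D_LANES] = {0};

    for (size_t i = 0; i < n; i += H_LANES) {
        int16_t va[H_LANES] = {0};
        int16_t vb[H_LANES] = {0};
        for (size_t l = 0; l < H_LANES && i + l < n; l++) {
            va[l] = a[i + l];
            vb[l] = b[i + l];
        }
        sdot_model(acc, va, vb);
    }

    int64_t sum = 0;
    for (size_t d = 0; d < D_LANES; d++) {
        sum += acc[d];
    }
    return sum;
}
```

Because the widening multiply and the accumulation happen in a single instruction, the loop body needs only the two loads, the index increment, the predicate update, and the `sdot` itself.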


Deprecated ACLE features

Arm Compiler for Linux now warns against using deprecated SVE/SVE2 ACLE features. Support for these features will be removed in the next major release.
For clarity, the SVE/SVE2 ACLE specification has deprecated two features: use of the svcdot function with unsigned arguments, and accessing individual elements of ACLE vector structs using the '.' operator. As an example of the second feature, code such as ((svint8x2_t) foo).v1 is now deprecated, and you should use svget2((svint8x2_t) foo, 0) instead. For more information, see the ARM C Language Extensions for SVE specification.

Mathematical Libraries

Introduction of a free edition of Arm Performance Libraries

Until recently, users of Arm Performance Libraries were primarily software developers working on traditional on-premises supercomputers. With the success of Amazon's Graviton 2 and the advent of "HPC in the Cloud on Arm", new requests have been flowing in, and a number of you have asked us for easier access to Arm Performance Libraries. These requests have been heard. To support the long tail of developers who use BLAS, FFTW, or other intensive maths functions in their applications, we have created a free version of Arm Performance Libraries (available here). This library comes with a short, simplified EULA. The free Arm Performance Libraries package is compatible with the GCC compiler and all Arm v8.1+ cores, and is optimized for Neoverse N1. This new edition nicely complements the commercially supported version of Arm Performance Libraries included in Allinea Studio.

In addition, both the commercial and free versions of Arm Performance Libraries can now be redistributed as part of your applications: you can link your applications to Arm Performance Libraries and ship the accelerated binary to your end users. Scientific teams working on COVID-19 research using Folding@Home were among the first to use this new capability. We were delighted to learn that they enjoyed a free and impressive 25% performance boost on Arm-based servers.

Improved performance for SVE implementations of key libamath functions

The arrival of A64FX was also a focus for the Arm Performance Libraries team, which has been working on faster SVE implementations of some of the most commonly used trigonometric and exponential functions. These efforts translate into a performance boost across a wide range of applications. With a vector width of 512 bits, SVE on A64FX gives four times the throughput of 128-bit Neon. The following graph illustrates that our new implementations take advantage of this benefit effectively, while remaining as accurate as the corresponding Neon functions (to within 3.5 ULP). For log and exp, other optimizations to our implementations provide extra gains: we see five-fold performance improvements in these cases.

[Figure: Arm Performance Libraries 20.2 libamath performance]
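
For readers unfamiliar with the ULP (unit in the last place) measure quoted above, the following is a minimal sketch of one common way to express the error of a single-precision result against a double-precision reference. The helper names `ulp_of` and `ulp_error` are hypothetical, and this is not necessarily how the Arm Performance Libraries team measures accuracy. The sketch assumes finite IEEE 754 single-precision inputs and takes the ULP size from the spacing just above the reference value.

```c
#include <stdint.h>
#include <string.h>

/* Size of one ULP at x: the gap between |x| and the next larger float.
 * Assumes x is finite and uses the IEEE 754 binary32 bit layout, where
 * adding 1 to the bit pattern of a non-negative float gives the next
 * representable value. */
static double ulp_of(float x)
{
    float ax = x < 0.0f ? -x : x;
    float next;
    uint32_t bits;

    memcpy(&bits, &ax, sizeof bits);
    bits += 1u;
    memcpy(&next, &bits, sizeof next);
    return (double)next - (double)ax;
}

/* Error of the single-precision result `got`, in ULPs, measured against
 * the higher-precision reference `want`. */
static double ulp_error(float got, double want)
{
    double diff = (double)got - want;

    if (diff < 0.0) {
        diff = -diff;
    }
    return diff / ulp_of((float)want);
}
```

A result would meet a 3.5 ULP bound when, for example, `ulp_error(expf(x), exp((double)x))` stays at or below 3.5 for every tested `x` (both functions are declared in `<math.h>`).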

Tuned SGEMM/DGEMM performance for AWS Graviton 2

The Arm Performance Libraries team has been working on tuning matrix-matrix multiplication for Amazon Web Services' Graviton 2 M6g instances, which are based on Arm Neoverse-N1 cores. The following graphs show how close the single-precision (SGEMM) and double-precision (DGEMM) routines now come to maximizing compute throughput for different numbers of threads. When using all 64 cores, we achieve over 85% of the machine's theoretical peak performance in both cases. With fewer threads, we attain a slightly higher fraction of the peak performance of the threads used (owing to reduced contention on shared resources). For example, when using a single thread, we are over 92% efficient for some double-precision problems. We continue to work on tuning the performance of these routines, making sure that DGEMM attains a higher percentage of peak for smaller problem sizes.

[Figure: Arm Performance Libraries SGEMM performance]

[Figure: Arm Performance Libraries DGEMM performance]
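
The "percentage of theoretical peak" figures above can be reproduced from a timed GEMM call with some simple arithmetic. The sketch below shows one way to do it; the function names are hypothetical, and the per-core figures in the usage note (clock speed and floating-point operations per cycle) are assumptions that do not come from this post.

```c
/* Floating-point operations performed by an (m x k) by (k x n) matrix
 * multiplication: one multiply and one add per inner-product term. */
static double gemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Theoretical peak in GFLOP/s for `cores` cores clocked at `ghz` GHz,
 * each retiring `flops_per_cycle` floating-point operations per cycle. */
static double peak_gflops(int cores, double ghz, int flops_per_cycle)
{
    return cores * ghz * flops_per_cycle;
}

/* Fraction of peak reached by a GEMM call that took `seconds` seconds,
 * where `peak` is in GFLOP/s. A value of 0.85 means 85% of peak. */
static double gemm_efficiency(double m, double n, double k,
                              double seconds, double peak)
{
    double achieved_gflops = gemm_flops(m, n, k) / seconds / 1e9;

    return achieved_gflops / peak;
}
```

As an illustration, assuming a 2.5 GHz clock and 8 double-precision operations per cycle per core (two 128-bit FMA pipelines), `peak_gflops(64, 2.5, 8)` gives 1280 GFLOP/s for a 64-core instance. Single precision doubles the per-cycle figure under the same assumptions.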

Documentation

Developer and reference guides

The Arm Fortran Compiler and Arm C/C++ Compiler Developer and reference guides have been restructured for the 20.2 release. If you have bookmarked chapters or topics in our existing guides, you might need to update these because some URLs have changed. The new structure brings a task-orientated focus to the content and makes it easier to find through search engines.

The documentation hub on the Arm Developer website is being redesigned to bring many improvements over the existing platform. To learn more about these improvements, see the blog post about the upcoming changes to Arm's technical documentation hub. When the new documentation hub is live, the Arm Performance Libraries Reference guide will be available online in HTML format (in addition to PDF format).

Porting and tuning guides

To provide you with the latest information about porting your codes to SVE-enabled targets, we have recently released version 2.1 of our Porting and Tuning HPC Applications for Arm SVE guide. In this version, the content has been organized to focus around four key goals:

  1. Learning about Scalable Vector Extension (SVE) 
  2. Porting and optimizing your applications
  3. Developing code for SVE
  4. Emulating SVE code on non-SVE hardware

Some content from the Porting and Tuning HPC Applications for Arm guide is now also included in the Porting and Tuning HPC Applications for Arm SVE guide. It has been updated to be in the context of porting to SVE-enabled targets.

Both porting guides are available as part of the Arm Compiler for Linux package in an offline-accessible HTML format. You can find the content in <install-location>/share/doc.

Support

If you have questions or want to raise an issue, you can do so by emailing the HPC software support team or by visiting the support page. Most of the requests are answered within a single working day. The HPC ecosystem pages also have valuable information to get you started on Arm-based servers.

Conclusion

Despite the very unusual times we are all experiencing, our team has made tremendous progress in enabling the acceleration of a wide range of workloads running on Arm-based servers. I am excited to announce the availability of the free, redistributable edition of Arm Performance Libraries, along with Arm Allinea Studio 20.2 and its major enhancements to our Linux compiler and our optimized mathematical libraries. Our next major version of the Arm Compiler for Linux is expected towards the end of December 2020 and will include major performance improvements for SVE-based microarchitectures.

The team joins me in wishing all of you and your families the very best. Stay healthy, stay safe.
