Arm enables more developers than ever with Allinea Studio 20.2 and a free version of Arm Performance Libraries

July 22, 2020

The team is very pleased to announce that Arm Compiler for Linux Version 20.2 (bundled in Arm Allinea Studio 20.2) is now available on developer.arm.com. In addition to the commercial toolchain of compiler and libraries, there is now a free-to-use version of our performance libraries (targeted at cloud users of N1 hardware) available here.

Because of a shortened development cycle, this release focuses on incremental improvements, defect fixes, and internal infrastructure improvements. Highlights of this release include:

Introduction of a free Arm Performance Libraries edition
Arm Performance Libraries is now redistributable
Performance optimizations and A64FX tuning to our JIT-based Fast Fourier Transforms (introduced in 20.1)
Neoverse-N1 performance tuning
Improvements to the Scalable Vector Extension (SVE) versions of libamath functions (namely exp, expf, log, logf, sin, sinf, cos, and cosf)
Inclusion of GCC 9.3

C/C++ and Fortran Compilers

A64FX support and SVE integer dot-product support

With the arrival of Fujitsu's A64FX, SVE instructions have become available in real hardware for the first time. Arm supports this major milestone by further improving the reliability and quality of SVE code generation for large codes, including codes that rely on the Arm C Language Extensions (ACLE). Most of these improvements are 'under the hood', but one feature worth mentioning is support for auto-vectorization of integer dot-product calculations.

For example, the following code:

void dotp( short *out, short *a, short *b, int N)
{
  int acc = 0;
  for (int i=0; i<N; i++) {
    acc += a[i] * b[i];
  }
  *out = acc >> 16;
}

now generates an inner loop that makes use of the SVE SDOT instruction, as follows:

.LBB0_2:                                // =>This Inner Loop Header: Depth=1
	ld1h	{ z1.h }, p0/z, [x1, x8, lsl #1]
	ld1h	{ z2.h }, p0/z, [x2, x8, lsl #1]
	inch	x8
	whilelo	p0.h, x8, x9
	sdot	z0.d, z2.h, z1.h
	b.mi	.LBB0_2
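To make the effect of the SDOT instruction in the loop above easier to follow, here is a minimal scalar model in portable C. It is a sketch, not compiler output: the names sdot_model and dotp_model are hypothetical, and the 512-bit vector length is an assumption chosen to match A64FX (real SVE code is vector-length agnostic). With 64-bit accumulator lanes and 16-bit inputs, each accumulator lane gains the sum of four adjacent 16-bit products per instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed vector length for this model only: 512 bits, as on A64FX. */
enum { VL_BITS = 512, H_LANES = VL_BITS / 16, D_LANES = VL_BITS / 64 };

/* Scalar model of `sdot zda.d, zn.h, zm.h`: every 64-bit accumulator
 * lane gains the sum of four signed 16-bit products. */
static void sdot_model(int64_t acc[D_LANES],
                       const int16_t zn[H_LANES],
                       const int16_t zm[H_LANES])
{
    for (int d = 0; d < D_LANES; d++)
        for (int j = 0; j < 4; j++)
            acc[d] += (int64_t)zn[4 * d + j] * zm[4 * d + j];
}

/* Dot product computed one vector at a time. Lanes beyond n are left
 * at zero, mirroring the zeroing predicated loads (p0/z) in the listing. */
static int64_t dotp_model(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc[D_LANES] = {0};

    for (size_t i = 0; i < n; i += (size_t)H_LANES) {
        int16_t va[H_LANES] = {0};
        int16_t vb[H_LANES] = {0};
        for (size_t l = 0; l < (size_t)H_LANES && i + l < n; l++) {
            va[l] = a[i + l];
            vb[l] = b[i + l];
        }
        sdot_model(acc, va, vb);
    }

    /* Final horizontal reduction of the accumulator lanes. */
    int64_t sum = 0;
    for (int d = 0; d < D_LANES; d++)
        sum += acc[d];
    return sum;
}
```

Calling dotp_model(a, b, n) returns the same sum as a plain scalar loop over a[i] * b[i]; the model only regroups the additions the way the vector instruction does.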

Deprecated ACLE features

Arm Compiler for Linux now warns when code uses deprecated SVE/SVE2 ACLE features. Support for these features will be removed in the next major release.
For clarity, the SVE/SVE2 ACLE specification has deprecated two features: use of the svcdot function with unsigned arguments, and accessing individual elements of ACLE vector structs using the '.' operator. As an example of the second feature, code such as ((svint8x2_t) foo).v1 is now deprecated, and you should use svget2((svint8x2_t) foo, 0) instead. For more information, see the Arm C Language Extensions for SVE specification.

Mathematical Libraries

Introduction of a free edition of Arm Performance Libraries

Until recently, users of Arm Performance Libraries were primarily software developers accessing a traditional on-premise supercomputer. With the success of Amazon's Graviton 2 and the advent of "HPC in the Cloud on Arm", new requests have been flowing in, and a number of you have asked for easier access to Arm Performance Libraries. These requests have been heard.

To support the long tail of developers who use BLAS, FFTs, or other intensive maths functions in their applications, we have created a free version of Arm Performance Libraries (available here). This library comes with a short, simplified EULA. The free Arm Performance Libraries package is compatible with GCC and with all Armv8.1+ cores, and is optimized for Neoverse N1. This new edition complements the commercially supported version of Arm Performance Libraries included in Allinea Studio.

In addition, both the commercial and free versions of Arm Performance Libraries can now be redistributed as part of your applications: you can link your applications to Arm Performance Libraries and ship the accelerated binary to your end users. Scientific teams working on COVID-19 research using Folding@Home were among the first to use this new capability. We were delighted to learn that they gained an impressive 25% performance boost on Arm-based servers at no cost.

Improved performance for SVE implementations of key libamath functions

The arrival of A64FX was also a focus for the Arm Performance Libraries team, which has been working on faster SVE implementations of some of the most commonly used trigonometric and exponential functions. These efforts translate into a performance boost across a wide range of applications. With a vector width of 512 bits, SVE on A64FX offers four times the throughput of 128-bit Neon. The following graph illustrates that our new implementations take advantage of this benefit effectively, while achieving the same accuracy as the associated Neon functions (errors within 3.5 ULP). For log and exp, other optimizations to our implementations provide extra gains: we see a five-times performance improvement in these cases.

This is a graph showing the Arm PL 20.2 libamath performance
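The two figures quoted in this section, the four-times throughput ratio and the ULP accuracy bound, can be illustrated with a short C sketch. The helper names below are hypothetical and are not part of libamath. The ULP helper counts how many representable floats lie between two finite values; this is one common way to express such distances, whereas a fractional bound such as 3.5 ULP would in practice be measured against a higher-precision reference.

```c
#include <stdint.h>
#include <string.h>

/* Elements processed by one vector instruction. */
static unsigned lanes_per_vector(unsigned vector_bits, unsigned element_bits)
{
    return vector_bits / element_bits;
}

/* Map a float's bit pattern onto an ordered integer line so that
 * adjacent representable floats differ by exactly 1 (+0 and -0 both map to 0). */
static int64_t ordered_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return (u & 0x80000000u) ? -(int64_t)(u & 0x7fffffffu) : (int64_t)u;
}

/* Distance between two finite floats in units in the last place (ULP).
 * NaN and infinity are not handled in this sketch. */
static int64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}
```

For single precision, lanes_per_vector(512, 32) is 16 and lanes_per_vector(128, 32) is 4, which is where the factor of four comes from. To check a routine, ulp_distance(result, reference) would be compared against the stated bound, with the reference value rounded from a higher-precision computation.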

Tuned SGEMM/DGEMM performance for AWS Graviton 2

The Arm Performance Libraries team has been working on tuning matrix-matrix multiplication for Amazon Web Services' Graviton 2 M6g instances, which are based on Arm Neoverse-N1 cores. The following graphs show how close the tuned single-precision (SGEMM) and double-precision (DGEMM) routines now come to maximizing compute throughput for different numbers of threads. When using all 64 cores, we achieve over 85% of the machine's theoretical peak performance in both cases. With fewer threads, we attain a slightly higher fraction of the peak performance of the threads used, owing to reduced contention on shared resources. For example, when using a single thread, we are over 92% efficient for some double-precision problems. We continue to tune these routines, with the aim that DGEMM attains a higher percentage of peak for smaller problem sizes.

This is a graph showing the Arm PL SGEMM
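The "percentage of theoretical peak" figures above follow from simple arithmetic, sketched here in C. The function names are hypothetical, and the machine parameters used in the usage note (a 2.5 GHz clock and 8 double-precision floating-point operations per cycle per core) are assumptions for illustration that do not come from this post.

```c
/* Floating-point operations in an (m x k) by (k x n) matrix product,
 * counting one multiply and one add per inner-product term. */
static double gemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Theoretical peak in FLOP/s for the cores actually used. */
static double peak_flops(int cores, double clock_hz, int flops_per_cycle_per_core)
{
    return cores * clock_hz * flops_per_cycle_per_core;
}

/* Fraction of peak achieved by a GEMM call that took `seconds` to run. */
static double gemm_efficiency(double m, double n, double k,
                              double seconds, double peak)
{
    return gemm_flops(m, n, k) / seconds / peak;
}
```

Under the assumed parameters, peak_flops(64, 2.5e9, 8) is 1.28e12 FLOP/s, so a 4000 x 4000 x 4000 DGEMM that completed in 0.125 seconds would be running at about 80% of peak. Using the thread count rather than the full core count in peak_flops gives the per-thread efficiency discussed above.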

Documentation

Developer and reference guides

The Arm Fortran Compiler and Arm C/C++ Compiler developer and reference guides have been restructured for the 20.2 release. If you have bookmarked chapters or topics in our existing guides, you might need to update these bookmarks because some URLs have changed. The new structure gives the content a task-oriented focus and makes it easier to find through search engines.

The documentation hub on the Arm Developer website is being redesigned to bring many improvements over the existing platform. To learn more about these improvements, see the blog post about the upcoming changes to Arm's technical documentation hub. When the new documentation hub is live, the Arm Performance Libraries Reference guide will be available online in HTML format (in addition to being available in PDF format).

Porting and tuning guides

To provide you with the latest information about porting your codes to SVE-enabled targets, we have recently released version 2.1 of our Porting and Tuning HPC Applications for Arm SVE guide. In this version, the content is organized around four key goals:

Learning about Scalable Vector Extension (SVE)
Porting and optimizing your applications
Developing code for SVE
Emulating SVE code on non-SVE hardware

Some content from the Porting and Tuning HPC Applications for Arm guide is now also included in the Porting and Tuning HPC Applications for Arm SVE guide, where it has been updated for the context of porting to SVE-enabled targets.

Both porting guides are available as part of the Arm Compiler for Linux package in an offline-accessible HTML format. You can find the content in <install-location>/share/doc.

Support

If you have questions or want to raise an issue, you can do so by emailing the HPC software support team or by visiting the support page. Most requests are answered within a single working day. The HPC ecosystem pages also have valuable information to get you started on Arm-based servers.

Conclusion

Despite the very unusual times we are all experiencing, our team has made tremendous progress in enabling the acceleration of a wide range of workloads running on Arm-based servers. I am excited to announce the availability of the free, redistributable edition of Arm Performance Libraries, together with Arm Allinea Studio 20.2, which brings major enhancements to our Linux compiler and our optimized mathematical libraries. Our next major version of the Arm Compiler for Linux is expected towards the end of December 2020 and will include major performance improvements for SVE-based microarchitectures.

The team joins me in wishing all of you and your families the very best. Stay healthy, stay safe.
