The team is very pleased to announce that Arm Compiler for Linux Version 20.2 (bundled in Arm Allinea Studio 20.2) is now available on developer.arm.com. In addition to the commercial toolchain of compiler and libraries, there is now a free-to-use version of our performance libraries (targeted at cloud users of N1 hardware) available here.
Because of a shortened development cycle, this release focuses on incremental improvements, defect fixes, and internal infrastructure work. Highlights of this release include:
With the arrival of Fujitsu's A64FX, SVE instructions have become available in real hardware for the first time. Arm supports this major milestone by further improving the reliability and code quality of SVE code generation for large codes, including codes that rely on the Arm C Language Extensions (ACLE). Most of these improvements are 'under the hood', but one feature worth mentioning is support for auto-vectorization of integer dot product calculations.
For example, the following code:
    void dotp(short *out, short *a, short *b, int N) {
      int acc = 0;
      for (int i = 0; i < N; i++) {
        acc += a[i] * b[i];
      }
      *out = acc >> 16;
    }
now generates an inner loop that makes use of the SVE SDOT instruction, as follows:
    .LBB0_2:                        // =>This Inner Loop Header: Depth=1
        ld1h    { z1.h }, p0/z, [x1, x8, lsl #1]
        ld1h    { z2.h }, p0/z, [x2, x8, lsl #1]
        inch    x8
        whilelo p0.h, x8, x9
        sdot    z0.d, z2.h, z1.h
        b.mi    .LBB0_2
Arm Compiler for Linux now warns against using deprecated SVE/SVE2 ACLE features. Support for these features will be removed in the next major release. For clarity, the SVE/SVE2 ACLE specification has deprecated two features: use of the svcdot function with unsigned arguments, and accessing individual elements of ACLE vector structs using the '.' operator. As an example of the second feature, code such as ((svint8x2_t) foo).v1 is now deprecated, and you should use svget2((svint8x2_t) foo, 0) instead. For more information, please see the Arm C Language Extensions for SVE specification.
Until recently, users of Arm Performance Libraries were primarily software developers accessing a traditional on-premise supercomputer. With the success of Amazon's Graviton 2 and the advent of "HPC in the Cloud on Arm", new requests have been flowing in. A number of you have asked us for easier access to Arm Performance Libraries, and these requests have been heard. To support the long tail of developers who use BLAS, FFTs or other intensive maths functions in their applications, we have created a free version of Arm Performance Libraries (available here). This library comes with a short, simplified EULA. The free Arm Performance Libraries package is compatible with the GCC compiler and all Arm v8.1+ cores, and is optimized for Neoverse N1. This new edition nicely complements the commercially supported version of Arm Performance Libraries included in Allinea Studio.
In addition, both the commercial and free versions of Arm Performance Libraries can now be redistributed as part of your applications: you can link your applications to Arm Performance Libraries and ship the accelerated binary to your end users. Scientific teams working on COVID-19 research using Folding@Home were among the first to use this new capability. We were delighted to learn that they enjoyed a free and impressive 25% performance boost on Arm-based servers.
The arrival of A64FX was also a focus for the Arm Performance Libraries team, which has been working on faster SVE implementations of some of the most commonly used trigonometric and exponential functions. These efforts translate into a performance boost across a wide range of applications. With a vector width of 512 bits, SVE on A64FX gives four times the throughput of 128-bit Neon. The following graph illustrates that our new implementations take advantage of this benefit effectively, while achieving the same accuracy as the associated Neon functions (errors of at most 3.5 ULP). For log and exp, other optimizations to our implementations provide extra gains: we see fivefold performance improvements in these cases.
The Arm Performance Libraries team has been working on tuning matrix-matrix multiplication for Amazon Web Services' Graviton 2 M6g instances, which are based on Arm Neoverse-N1 cores. The following graphs show how close the tuned single-precision (SGEMM) and double-precision (DGEMM) routines now come to maximizing compute throughput for different numbers of threads. When using all 64 cores, we achieve over 85% of the machine's theoretical peak performance in both cases. With fewer threads, we attain a slightly higher fraction of the peak performance of the threads used (owing to reduced contention on shared resources). For example, when using a single thread, we are over 92% efficient for some double-precision problems. We continue to tune these routines, with the aim that DGEMM attains a higher percentage of peak for smaller problem sizes.
The Arm Fortran Compiler and Arm C/C++ Compiler developer and reference guides have been restructured for the 20.2 release. If you have bookmarked chapters or topics in our existing guides, you might need to update these bookmarks because some URLs have changed. The new structure brings a task-oriented focus to the content and makes it easier to find through search engines.
The documentation hub on the Arm Developer website is being redesigned to bring many improvements over the existing platform. To learn more about these improvements, see the blog post about the upcoming changes to Arm's technical documentation hub. When the new documentation hub is live, the Arm Performance Libraries Reference guide will be available online in HTML format (in addition to being available in PDF format).
To provide you with the latest information about porting your codes to SVE-enabled targets, we have recently released version 2.1 of our Porting and Tuning HPC Applications for Arm SVE guide. In this version, the content has been organized to focus around four key goals:
Some content from the Porting and Tuning HPC Applications for Arm guide is now also included in the Porting and Tuning HPC Applications for Arm SVE guide. It has been updated to be in the context of porting to SVE-enabled targets.
Both porting guides are available as part of the Arm Compiler for Linux package in an offline-accessible HTML format. You can find the content in <install-location>/share/doc.
If you have questions or want to raise an issue, you can do so by emailing the HPC software support team or by visiting the support page. Most of the requests are answered within a single working day. The HPC ecosystem pages also have valuable information to get you started on Arm-based servers.
Despite the very unusual times we are all experiencing, our team has made tremendous progress in enabling the acceleration of a wide range of workloads running on Arm-based servers. I am excited to announce the availability of the free, redistributable edition of Arm Performance Libraries, as well as Arm Allinea Studio 20.2 with major enhancements to our Linux compiler and our optimized mathematical libraries. Our next major version of the Arm Compiler for Linux is expected towards the end of December 2020 and will include major performance improvements for SVE-based microarchitectures.
The team joins me in wishing all of you and your families the very best. Stay healthy, stay safe.