Neural Networks are a key component of Machine Learning (ML) applications. Project Trillium, Arm’s heterogeneous ML platform, provides a range of technologies in this field, including instructions that accelerate such applications running on CPUs based on the Arm®v8-A architecture.
The next revision of the Armv8-A architecture will introduce Neon and SVE vector instructions designed to accelerate certain computations using the BFloat16 (BF16) floating-point number format. BF16 has recently emerged as a format tailored specifically to high-performance processing of Neural Networks (NNs). BF16 is a truncated form of the IEEE 754 [ieee754-2008] single-precision representation (IEEE-FP32), retaining only 7 fraction bits instead of 23 (see Figure 1).
Figure 1: A comparison of BFloat16 with IEEE 754 single- and half-precision.
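To make the layout concrete, the short Python sketch below (the helper names are ours, not part of any Arm tool or library) shows that a BF16 encoding is simply the top 16 bits of the corresponding IEEE-FP32 encoding: 1 sign bit, 8 exponent bits and 7 fraction bits.

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the IEEE-FP32 encoding of x as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bf16_truncate(x: float) -> int:
    """BF16 encoding of x obtained by truncation: keep the top 16 bits of FP32."""
    return fp32_bits(x) >> 16

def bf16_to_fp32(b: int) -> float:
    """Widen a BF16 encoding back to FP32 by appending 16 zero fraction bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

x = 3.14159
b = bf16_truncate(x)
print(f"FP32 bits: {fp32_bits(x):032b}")   # 1 sign | 8 exponent | 23 fraction
print(f"BF16 bits: {b:016b}")              # 1 sign | 8 exponent | 7 fraction
print(bf16_to_fp32(b))                     # 3.140625: same range, less precision
```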
Several major CPU and GPU architectures, and Neural Network accelerators (NPUs), have announced an intention to support BF16. The advantages of BF16 for Neural Networks are:
- BF16 keeps the same 8-bit exponent as IEEE-FP32, so it covers the same dynamic range, which matters more for NN training and inference than fraction precision.
- BF16 halves the storage, cache and memory bandwidth requirements of weights and activations compared to IEEE-FP32.
- Conversion between BF16 and IEEE-FP32 is simple, since a BF16 value is the upper half of the corresponding IEEE-FP32 encoding.
Arm is introducing four new instructions to each of the SVE, AArch64 Neon and AArch32 Neon SIMD instruction sets to accelerate the multiplication of matrices of BF16 values, by far the most common computation performed in Neural Networks. The new BF16 multiply instructions accept BF16 inputs but do not generate BF16 results; instead they implicitly accumulate into an IEEE-FP32 intermediate result to improve the accuracy of the final BF16 output matrix. The new SVE and AArch64 Neon instructions are as follows:
- BFDOT: a dot product of pairs of adjacent BF16 elements, accumulated into each IEEE-FP32 element of the destination vector (see the sketch below).
- BFMMLA: a multiplication of a 2x4 matrix of BF16 values by a 4x2 matrix of BF16 values, accumulated into a 2x2 matrix of IEEE-FP32 values in the destination.
- BFMLAL: a widening multiply-accumulate of the even- or odd-numbered BF16 elements into the IEEE-FP32 elements of the destination vector.
- BFCVT: a conversion of IEEE-FP32 values to BF16 format.
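As an illustration of the dataflow, the scalar model below (our own sketch, not Arm pseudocode, and ignoring the rounding and subnormal behavior discussed later) shows what one 32-bit lane of a BFDOT result accumulates: a two-way dot product of BF16 element pairs added into an IEEE-FP32 value.

```python
import struct

def bf16_as_fp32(x: float) -> float:
    """Quantize x to BF16 by truncation and widen it back to FP32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bfdot_lane(acc: float, a: tuple, b: tuple) -> float:
    """One 32-bit lane: acc += a[0]*b[0] + a[1]*b[1], BF16 inputs, FP32 accumulation.
    Python floats are double precision, so this models the dataflow, not bit-exact results."""
    return (acc
            + bf16_as_fp32(a[0]) * bf16_as_fp32(b[0])
            + bf16_as_fp32(a[1]) * bf16_as_fp32(b[1]))

print(bfdot_lane(0.0, (1.5, 2.0), (0.5, 4.0)))   # 1.5*0.5 + 2.0*4.0 = 8.75
```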
Sadly, there is no IEEE standard for BF16 numeric behaviors, and different architectures, accelerators and software libraries have adopted slightly different aspects of the IEEE 754 floating-point standard to govern the numeric behavior of arithmetic on BF16 values.
Arm’s new BFMLAL and BFCVT instructions adopt the IEEE 754 standard for rounding and subnormal processing, honoring all of the floating-point controls that apply to IEEE-FP32 arithmetic and conversion instructions. The BFMLAL instruction allows developers, where necessary, to generate bit-identical results on Arm processors compared to some non-Arm ISAs, and the BFCVT instruction permits accuracy much closer to full IEEE-FP32.
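For example, narrowing IEEE-FP32 to BF16 with the IEEE default “Round to Nearest, ties to Even” rounding can be modelled as below. This is our own sketch of the default case only: BFCVT additionally honors the rounding-mode, subnormal and exception controls described above, and NaN inputs are not handled here.

```python
import struct

def fp32_to_bf16_rne(x: float) -> int:
    """Narrow an IEEE-FP32 value to BF16 using round-to-nearest, ties-to-even.
    Unlike plain truncation, the 16 discarded bits are used to round the result.
    NaN inputs and non-default rounding modes are not handled in this sketch."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1            # lowest bit that will be kept
    rounding_bias = 0x7FFF + lsb      # above half rounds up, exactly half rounds to even
    return (bits + rounding_bias) >> 16

print(hex(fp32_to_bf16_rne(2.71828)))   # 0x402e (~2.71875); truncation gives 0x402d (~2.703125)
```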
However, BFDOT and BFMMLA are specifically intended to accelerate matrix multiplications by delivering a considerably higher multiply throughput than BFMLAL. For these instructions Arm made numeric simplifications designed to minimize the hardware complexity compared to honoring the full set of IEEE-FP32 controls:
- They ignore the rounding mode selected in the floating-point control register and instead use a fixed “Round to Odd” mode for the IEEE-FP32 intermediate results (illustrated in the sketch below).
- They flush subnormal inputs and outputs to zero.
- They do not generate IEEE floating-point exceptions; invalid operations and overflows simply produce the standard default results (NaN, ±Infinity).
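To make the “Round to Odd” simplification concrete, the sketch below (our own model, shown on the simpler FP32-to-BF16 narrowing; in BFDOT and BFMMLA the mode actually applies when forming the IEEE-FP32 intermediate sums) contrasts it with round-to-nearest: an inexact result is truncated and its least-significant bit is forced to 1, preserving “sticky” information for any later rounding [round-odd].

```python
import struct

def fp32_to_bf16_rne(x: float) -> int:
    """Round to nearest, ties to even (as in the previous sketch; NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits + 0x7FFF + ((bits >> 16) & 1)) >> 16

def fp32_to_bf16_odd(x: float) -> int:
    """Round to odd: truncate, then force the lowest kept bit to 1 if any discarded
    bit was non-zero, so the fact that the result was inexact is never silently lost."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    result = bits >> 16
    if bits & 0xFFFF:        # inexact
        result |= 1
    return result

x = 1.00390625               # exactly halfway between two adjacent BF16 values
print(hex(fp32_to_bf16_rne(x)), hex(fp32_to_bf16_odd(x)))   # 0x3f80 vs 0x3f81
```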
RTL synthesis experiments showed that these simplifications reduced the area of the BFDOT block to 65% or less of a block that honors the full set of IEEE-FP32 controls, with a similar reduction in power consumption. This reduction allows at least 50% more BFDOT blocks to be instantiated within the same area and power budget as fully IEEE-compliant blocks. These additional blocks mean that a BFMMLA instruction can deliver a peak throughput of up to four times as many multiplies per cycle as the BFMLAL instruction. Conversely, these simplifications may permit more area-constrained CPUs to reuse their existing IEEE-FP32 multiply-accumulate blocks to perform a BFDOT operation, delivering up to twice as many multiplies per cycle with only a small increase in area.
Figure 2 shows the performance speedup from using BFDOT and BFMMLA instructions relative to BFMLAL. These results derive from modelling a notional high-performance Arm Cortex-A class CPU with 256-bit SVE vectors. BFDOT provides a speedup of between 1.6x and 2.2x, while BFMMLA provides an improvement of between 2.3x and 3.4x. BFMMLA does not reach the peak 4x speedup because of the additional matrix data rearrangements required to use it; this overhead can be mitigated by pre-arranging the constant network weights.
Figure 2: Speedup from using BFDOT and BFMMLA.
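The sketch below is a simplified model of the BFMMLA dataflow per 128-bit vector segment: a 2x4 block of BF16 values from one source is multiplied by a 4x2 block from the other, and the 2x2 FP32 product is accumulated into the destination. Gathering the input matrices into these small blocks is the rearrangement overhead mentioned above; the exact in-register element ordering and the rounding behavior are deliberately omitted here.

```python
def bfmmla_segment(acc_2x2, a_2x4, b_4x2):
    """acc[i][j] += sum_k a[i][k] * b[k][j].
    Inputs are BF16 values already widened to FP32 (plain Python floats here),
    so this models the blocked dataflow rather than bit-exact arithmetic."""
    for i in range(2):
        for j in range(2):
            for k in range(4):
                acc_2x2[i][j] += a_2x4[i][k] * b_4x2[k][j]
    return acc_2x2

acc = [[0.0, 0.0], [0.0, 0.0]]
a = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
b = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 0.0],
     [0.0, 1.0]]
print(bfmmla_segment(acc, a, b))   # [[4.0, 6.0], [12.0, 14.0]]
```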
IEEE floating-point exceptions are typically not used by NN framework libraries, and not supported by many accelerators. The IEEE inexact, underflow, and divide-by-zero exceptions are not relevant to NN processing, and the IEEE standard results (NaN, ±Infinity) for invalid operation and overflow exceptions provide sufficient diagnostic information for debugging NN software. Support for IEEE subnormal values is similarly irrelevant to NN processing.
To validate our choice of “Round to Odd” (RO) mode [round-odd] instead of the default IEEE “Round to Nearest, ties to Even” (RN-E) mode for the BFDOT and BFMMLA instructions, three experiments were performed to compare the results of large BF16 matrix multiplications, as follows:
- A baseline multiplication accumulating the IEEE-FP32 intermediate results using RN-E mode, with the dot products evaluated in the forward direction.
- The same multiplication accumulating the intermediate results using RO mode, in the forward direction.
- The same multiplication accumulating the intermediate results using RN-E mode, but with the dot products evaluated in the reverse direction.
The direction refers to the order in which the dot product calculations are performed. The third experiment performed the inner loop of each matrix multiplication using RN-E mode, but in the reverse order compared to the first experiment. This acts as a proxy for the reordering of calculations which can occur as a result of different matrix multiplication software algorithms or hardware implementations. The simulations showed that using RO instead of RN-E increased the probability of the final BF16 result differing from the baseline by between 0.04% and 0.14%, whereas accumulating the RN-E data in a different order increased the probability of the final BF16 result differing by between 0.02% and 0.08%, due to the non-associativity of floating-point addition. In both cases the vast majority (over 90%) of the differences were in the least-significant bit only.
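The ordering effect is easy to reproduce: the toy example below sums the same four values in forward and reverse order, rounding to IEEE-FP32 after every addition, and obtains two different results.

```python
import struct

def f32(x: float) -> float:
    """Round a Python double to the nearest IEEE-FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

vals = [1e8, 1.0, -1e8, 1.0]

forward = 0.0
for v in vals:
    forward = f32(forward + f32(v))

reverse = 0.0
for v in reversed(vals):
    reverse = f32(reverse + f32(v))

print(forward, reverse)   # 1.0 0.0 -- same data, different order, different result
```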
It was expected that such small differences would not affect NN accuracy, and a second set of experiments using the new instructions on actual NNs was performed to confirm this.
Accuracy scores were obtained from running the DeepSpeech LSTM network for speech recognition and two image classification Convolutional Neural Networks, Inception v3 and ResNet50 v1. All three NNs were trained using standard IEEE-FP32 arithmetic. For these experiments, both the weights and the input activation data from the ImageNet validation set were first converted into BF16 format using RN-E. As before, RN-E results were derived twice by executing the dot product inner loops in both forward and reverse directions. The simulations showed that using RO instead of RN-E made virtually no difference to the accuracy, recall or error rate of the three NNs studied; the largest difference measured was 0.00002, with similar differences appearing in the reversed RN-E experiment.
Summary
Arm’s new BF16 instructions will be included in the next update of the Armv8-A architecture and will be implemented in upcoming CPUs from Arm and its partners. This will enable significant performance improvements for ML training and inference workloads that exploit the increasingly popular BFloat16 format.
Experienced developers in the floating-point arithmetic field will appreciate that some results of a matrix multiplication performed using Arm’s BF16 instructions may differ marginally from those produced by other ISAs. However, such marginal differences occur with a similar frequency and magnitude to those expected from different instructions, software algorithms or hardware accelerators that change the order of calculations, and they do not materially affect the accuracy of neural networks.
Look out for a further blog announcing the next architecture update. Following that announcement, support for the new BF16 instructions will be available in tools, models, and libraries such as the Arm Compute Library and Arm NN.
References
[ieee754-2008] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008, pp. 1-70, 29 Aug. 2008. URL: https://ieeexplore.ieee.org/document/4610935
[round-odd] Sylvie Boldo, Guillaume Melquiond. “Emulation of a FMA and correctly-rounded sums: proved algorithms using rounding to odd.” IEEE Transactions on Computers, 57 (4), pp. 462-471, 2008. URL: https://hal.inria.fr/inria-00080427