Today, we’re excited about the announcement of the M‑profile vector extensions (MVE) for the Armv8‑M, which started in Arm’s research group several years ago. When we were asked to increase the DSP performance of Arm Cortex‑M processors, naturally our first thought was to just add the existing Neon technology. However, the need to support a range of performance points within the area constraints of typical Cortex‑M applications meant we had to start from scratch. As a lighter noble gas, Helium seemed an apt name for the research project, made perfect by the fact that the nominal goals (for a mid-range processor) were a 4x performance increase for a 2x increase in data path width, which coincides with Helium’s atomic weight and number. As it turns out, we managed to beat the 4x target on many digital signal processing (DSP) and machine learning (ML) kernels. Needless to say, the name Helium stuck, and was adopted as the branding for the MVE for the Arm Cortex-M processor series.
Half the battle in creating a processor with good DSP performance is feeding it enough data. On Cortex‑A processors the 128‑bit Neon loads can easily be pulled straight from the data cache. But it’s common for Cortex‑M processors to be cache‑less, and instead have a low latency SRAM used as the main memory. Since widening the path to the SRAM (which is often only 32 bits) to 128 bits would be unacceptable for many systems, we were faced with the possibility of memory operations stalling for up to four cycles. Similarly, the multipliers used in the multiply and accumulate (MAC) instructions take a lot of area, and having 4 x 32-bit multipliers on a small Cortex‑M processor wasn’t going to fly. To put the area constraints into perspective, there can be orders of magnitude difference in size between the smallest Cortex-M processor and a powerful out‑of‑order Cortex‑A processor. So, when creating the M‑profile architecture, we really have to think about every last gate. To make the most out of the available hardware, we need to keep expensive resources like the path to memory and multipliers simultaneously busy every cycle. On a high‑performance processor like Cortex‑M7, this could be accomplished by dual issuing a vector load with a vector MAC. But an important goal was to increase DSP performance over a range of different performance points, not just at the high end. Adapting some technology from the decades‑old idea of vector chaining helps address these problems.
The diagram above shows an alternating sequence of vector load (VLDR) and vector MAC (VMLA) instructions executing over four clock cycles. This would require a 128‑bit wide path to memory, and four MAC blocks, both of which would be idle half the time. You can see each 128‑bit wide instruction is split up into four equally sized chunks, which the MVE architecture calls “beats” (labelled A to D). These beats are always 32 bits’ worth of compute regardless of the element size, so a beat could contain 1 x 32‑bit MAC, or 4 x 8‑bit MACs. Since the load and MAC hardware are separate, the execution of these beats can be overlapped as shown below.
Even if the value loaded by the VLDR is used by the subsequent VMLA, the instructions can still be overlapped. This is because beat A of the VMLA only depends on beat A of the VLDR, which occurred on the previous cycle, so overlapping beats A and B with beats C and D doesn’t require time travel. In this example, we get the same performance as a processor with a 128‑bit data path, but with half the hardware. The concept of “beatwise” execution enables efficient implementation of multiple performance points. For example, the diagram below shows how a processor with only a 32‑bit data path could handle the same instructions. This is quite attractive as it enables double the performance of a single‑issue scalar processor (loading and performing MACs on 8 x 32‑bit values in eight cycles), but without the area and power penalty of dual issuing scalar instructions.
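To make the overlap concrete, here is a minimal Python sketch of the beat schedule described above. It assumes every instruction has four beats (A to D) and that each instruction starts two beats after its predecessor, which is the 50% overlap the diagrams show. The function name `beat_schedule` and the `NAME#index` labels are made up for illustration and aren't part of MVE, and the sketch ignores everything except which beats share a cycle.

```python
BEATS = "ABCD"        # beat labels used in this post
OVERLAP_OFFSET = 2    # assumption: the next instruction starts two beats later


def beat_schedule(instructions, beats_per_cycle):
    """Return a list of cycles; each cycle lists the (instruction, beat) pairs it runs."""
    if beats_per_cycle not in (1, 2):
        raise ValueError("this sketch only models single- and dual-beat implementations")

    # Total number of beat slots needed once the instructions are overlapped.
    n_slots = OVERLAP_OFFSET * (len(instructions) - 1) + len(BEATS) if instructions else 0

    # Work out which beats are active in each slot.
    slots = []
    for t in range(n_slots):
        active = []
        for i, name in enumerate(instructions):
            beat = t - OVERLAP_OFFSET * i
            if 0 <= beat < len(BEATS):
                active.append((f"{name}#{i}", BEATS[beat]))
        slots.append(active)

    # A dual-beat implementation runs two slots per cycle, a single-beat one runs one.
    cycles = []
    for start in range(0, n_slots, beats_per_cycle):
        cycle = []
        for slot in slots[start:start + beats_per_cycle]:
            cycle.extend(slot)
        cycles.append(cycle)
    return cycles


if __name__ == "__main__":
    for n, cycle in enumerate(beat_schedule(["VLDR", "VMLA", "VLDR", "VMLA"], 2)):
        print(n, cycle)
```

With `beats_per_cycle=2`, the second cycle pairs beats C and D of the first VLDR with beats A and B of the following VMLA. With `beats_per_cycle=1`, the same beats are spread over twice as many cycles, so the load and MAC hardware only need to be 32 bits wide.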
MVE supports scaling up to a quad beat per cycle implementation, at which point the beatwise execution collapses to a more conventional SIMD approach. This helps to keep the implementation complexity manageable on high-performance processors.
Beatwise execution sounds great, but it does raise some interesting challenges for the rest of the architecture.
Through a lot of hard work (and dusting off the architecture history books) MVE manages to turn some very demanding power, area, and interrupt latency constraints to its advantage. I hope you’ll join me for the second part of this series, where we go down the rabbit hole of some mind bending (or should I say twisting) interleaving load/store instructions.
Learn More About MVE
This post is the first in a four part series. Read the other parts of the series using the links below:
Part 2 - Making Helium: Sudoku, registers and rabbits
Part 3 - Making Helium: Going around in circles
Part 4 - Making Helium: Bringing Amdahl's law to heel
Hi, I have some questions about ECI and the handling of a faulted overlapped pair of instructions in MVE.
According to the Armv8-M Architecture Reference Manual, the valid ECI values only cover the following combinations:
Completed beats: A0
Completed beats: A0 A1
Completed beats: A0 A1 A2
Completed beats: A0 A1 A2 B0
There is no valid ECI value for completed beats A0 A1 B0 or A0 A1 A2 B0 B1. Is there an ECI value reserved for these combinations, or would allowing them cause problems for the architecture?
If A2 and B0 are overlapped, is there still a strong ordering between A2 and B0? Can B0 raise a sync fault before A2 completes?
By changing the amount of work done per architecture tick, it’s possible to scale the performance to match different sized CPUs. However, changing the phase of overlap doesn’t affect performance much. For example, executing B0 in parallel with A2 gives you the same performance as starting it in parallel with A1 or A3. Only supporting a 50% overlap (i.e. B0 executing in parallel with A2) makes the hardware a lot easier to build because it’s easier to divide up the available hardware: for example, the B instruction accesses the bottom half of the register file while the A instruction accesses the top half. One important thing to keep in mind here is that while a CPU can choose to never produce a particular ECI value, all CPUs must be able to consume all ECI values. This is required so that a thread that was started on one CPU can be migrated to a different CPU that supports a different amount of beat parallelism. This is why we don’t permit some ECI encodings (like A0 A1 B0 and A0 A1 A2 B0 B1), as they would increase the complexity of all Helium-enabled CPUs but wouldn’t improve performance.
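As a rough illustration of that restriction, here is a hypothetical Python sketch that checks a set of completed beats against the combinations quoted earlier in this thread. The names `PERMITTED` and `has_eci_encoding` are invented for the example, and the set only contains the four combinations listed above, so treat it as a reading aid rather than a copy of the table in the manual.

```python
# Completed-beat combinations quoted in this thread as having a valid ECI value.
# Assumption: this is only the list from the question above, not the manual's full table.
PERMITTED = {
    ("A0",),
    ("A0", "A1"),
    ("A0", "A1", "A2"),
    ("A0", "A1", "A2", "B0"),
}


def has_eci_encoding(completed_beats):
    """Return True if the completed beats match one of the combinations quoted above."""
    return tuple(sorted(completed_beats)) in PERMITTED


if __name__ == "__main__":
    print(has_eci_encoding(["A0", "A1", "A2", "B0"]))        # 50% overlap, permitted
    print(has_eci_encoding(["A0", "A1", "B0"]))              # different phase, not permitted
    print(has_eci_encoding(["A0", "A1", "A2", "B0", "B1"]))  # not permitted
```

Every CPU has to be able to resume from each entry in the permitted set, which is why keeping that set small matters more than covering every possible phase of overlap.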
From a strict architecture point of view, an architecture tick is atomic. So, if B0 overlaps with A2, then either both B0 and A2 complete, or if there’s a fault / exception, neither of them completes. However, as usual, CPUs are permitted to break these rules if all the observable side effects are still architecturally valid. For example, if B0 faults then a CPU could choose to complete A2 and A3 and report the exception return address as the B instruction. This is valid because you’d get the same architecture state if the CPU had chosen not to overlap instructions A and B in the first place.
Thanks for sharing these details which are really helpful!
I have another question about the scenario that B0 overlaps with A2 and B0 hits a sync bus fault:
Is it valid for a CPU to complete only A2 (rather than both A2 and A3), raise the fault for B0, report the exception return address as the A instruction, and record the ECI as A0 A1 A2 completed? Or does that break the atomicity of an architecture tick?
In the case of a sync fault on B0, I’d say you wouldn’t be allowed to complete A2 unless you also completed A3 (and adjusted the return address to point to the B instruction). As you mentioned, completing just A2 would break the atomicity rules around that tick.