Making Helium: Why not just add Neon? (1/4)

Thomas Grocutt
February 14, 2019

Today, we’re excited about the announcement of the M-profile vector extension (MVE) for Armv8-M, which started in Arm’s research group several years ago. When we were asked to increase the DSP performance of Arm Cortex-M processors, naturally our first thought was to just add the existing Neon technology. However, the need to support a range of performance points within the area constraints of typical Cortex-M applications meant we had to start from scratch. As a lighter noble gas, Helium seemed an apt name for the research project, made perfect by the fact that the nominal goals (for a mid-range processor) were a 4x performance increase for a 2x increase in data path width, matching Helium’s atomic weight (4) and atomic number (2). As it turns out, we managed to beat the 4x target on many digital signal processing (DSP) and machine learning (ML) kernels. Needless to say, the name Helium stuck, and was adopted as the branding for MVE in the Arm Cortex-M processor series.

Half the battle in creating a processor with good DSP performance is feeding it enough data. On Cortex-A processors, the 128-bit Neon loads can easily be pulled straight from the data cache. But it’s common for Cortex-M processors to be cache-less and instead use a low-latency SRAM as the main memory. Since widening the path to the SRAM (which is often only 32 bits) to 128 bits would be unacceptable for many systems, we were faced with the possibility of memory operations stalling for up to four cycles. Similarly, the multipliers used in multiply-and-accumulate (MAC) instructions take a lot of area, and having 4 x 32-bit multipliers on a small Cortex-M processor wasn’t going to fly. To put the area constraints into perspective, there can be orders of magnitude difference in size between the smallest Cortex-M processor and a powerful out-of-order Cortex-A processor. So, when creating the M-profile architecture, we really have to think about every last gate. To make the most of the available hardware, we need to keep expensive resources like the path to memory and the multipliers simultaneously busy every cycle. On a high-performance processor like Cortex-M7, this could be accomplished by dual-issuing a vector load with a vector MAC. But an important goal was to increase DSP performance over a range of different performance points, not just at the high end. Adapting some technology from the decades-old idea of vector chaining helps address these problems.

[Diagram: alternating VLDR and VMLA instructions executing over four clock cycles]

The diagram above shows an alternating sequence of vector load (VLDR) and vector MAC (VMLA) instructions executing over four clock cycles. This would require a 128‑bit wide path to memory, and four MAC blocks, both of which would be idle half the time. You can see each 128‑bit wide instruction is split up into four equally sized chunks, which the MVE architecture calls “beats” (labelled A to D). These beats are always 32‑bits worth of compute regardless of the element size, so a beat could contain 1 x 32‑bit MAC, or 4 x 8‑bit MACs. Since the load and MAC hardware are separate, the execution of these beats can be overlapped as shown below.

[Diagram: VLDR and VMLA beats overlapped, keeping the load and MAC hardware busy every cycle]

Even if the value loaded by the VLDR is used by the subsequent VMLA, the instructions can still be overlapped. This is because beat A of the VMLA only depends on beat A of the VLDR, which occurred on the previous cycle, so overlapping beats A and B with beats C and D doesn’t require time travel. In this example, we get the same performance as a processor with a 128-bit data path, but with half the hardware. The concept of “beatwise” execution enables efficient implementation of multiple performance points. For example, the diagram below shows how a processor with only a 32-bit data path could handle the same instructions. This is quite attractive, as it enables double the performance of a single-issue scalar processor (loading and performing MACs on 8 x 32-bit values in eight cycles), but without the area and power penalty of dual-issuing scalar instructions.

[Diagram: the same instructions executing one beat per cycle on a 32-bit data path]

MVE supports scaling up to a quad-beat-per-cycle implementation, at which point beatwise execution collapses to a more conventional SIMD approach. This helps to keep the implementation complexity manageable on high-performance processors.
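To make the diagrams concrete, here is a minimal intrinsics sketch of the kind of alternating load/MAC kernel discussed above. It assumes the standard ACLE MVE intrinsics from arm_mve.h (vld1q, vmlaq, vst1q); the kernel itself is a hypothetical example, not code from the architecture specification:

```c
#include <arm_mve.h>   /* ACLE MVE (Helium) intrinsics, e.g. compile with -mcpu=cortex-m55 */
#include <stdint.h>

/* Hypothetical example: y[i] += x[i] * coeff for n elements (n a multiple of 4).
 * Each iteration issues a 128-bit vector load (VLDR) and a vector MAC (VMLA);
 * on a dual-beat implementation the beats of adjacent instructions overlap
 * as in the second diagram above. */
void mac_kernel(int32_t *y, const int32_t *x, int32_t coeff, int n)
{
    for (int i = 0; i < n; i += 4) {
        int32x4_t vx = vld1q(&x[i]);  /* VLDR: four beats of 32-bit loads  */
        int32x4_t vy = vld1q(&y[i]);
        vy = vmlaq(vy, vx, coeff);    /* VMLA: four beats of 32-bit MACs   */
        vst1q(&y[i], vy);             /* VSTR: four beats of 32-bit stores */
    }
}
```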

Beatwise execution sounds great, but it does raise some interesting challenges for the rest of the architecture.

  • Because multiple partially executed instructions can be in flight at the same time, interrupt and fault handling could become quite complex. For example, if beat D of the VLDR in the diagram above encountered a fault, implementations would normally have to roll back the write to the register file caused by beat A of the VMLA on the previous cycle. Buffering old values in case of rollback would not be in line with our philosophy of making every last gate work for its lunch. To avoid the need for this, the processor stores a special ECI value on exception entry, which indicates which beats of the subsequent instructions have already been executed. On exception return, the processor uses this to identify which beats to skip. Being able to quickly jump out of an instruction without having to roll back, or wait until it completes, also helps preserve the fast and deterministic interrupt handling that Cortex-M is known for.
  • If an instruction involves crossing beat boundaries, we again have a time travel problem. This crossing behavior commonly shows up in widening and narrowing operations. A good example is the VMLAL instruction in the Neon architecture, which can multiply and accumulate a vector of 32-bit values into 64-bit accumulators. Unfortunately, these sorts of widening operations are typically required to preserve the full range of the multiplier output. MVE addresses this problem by using the general-purpose “R” register file for the accumulators (see the sketch after this list). As a bonus, this reduces the register pressure on the vector registers and enables MVE to get good performance with half the vector registers of the Neon architecture. Making extensive use of the general-purpose register file (as MVE does) wouldn’t normally be done in a vector architecture, as that register file tends to be physically a long way from the vector unit. This is especially true on high-performance out-of-order processors, where the long physical distances would limit performance. However, this is one place where we can turn the smaller-scale nature of typical Cortex-M processors to our advantage.
  • To make sure that overlapped execution is well balanced and stall free, every instruction should describe 128‑bits of work, no more and no less. This can raise some interesting challenges, but I’ll save that for part two of this blog series.
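As an illustration of the second point above, here is a minimal dot-product sketch using the standard ACLE MVE intrinsics (arm_mve.h). The VMLALDAVA instruction multiplies 32-bit lanes and accumulates the widened 64-bit sum in a general-purpose register pair, so no cross-beat widening is needed in the vector file; the function itself is a hypothetical example:

```c
#include <arm_mve.h>
#include <stdint.h>

/* Hypothetical example: dot product of two int32 arrays (n a multiple of 4).
 * The 64-bit accumulator lives in a pair of general-purpose "R" registers,
 * not in the vector register file, as described in the second bullet above. */
int64_t dot_s32(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i += 4) {
        int32x4_t va = vld1q(&a[i]);
        int32x4_t vb = vld1q(&b[i]);
        acc = vmlaldavaq(acc, va, vb); /* VMLALDAVA: 32x32->64 multiply, add across, accumulate */
    }
    return acc;
}
```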

Through a lot of hard work (and dusting off the architecture history books), MVE manages to turn some very demanding power, area, and interrupt latency constraints to its advantage. I hope you’ll join me for the second part of this series, where we go down the rabbit hole of some mind-bending (or should I say twisting?) interleaving load/store instructions.

Learn More About MVE 

This post is the first in a four-part series. Read the other parts of the series using the links below:

Part 2 - Making Helium: Sudoku, registers and rabbits

Part 3 - Making Helium: Going around in circles 

Part 4 - Making Helium: Bringing Amdahl's law to heel

Arm Helium Technology: MVE for Arm Cortex-M Processors Reference Book Now Available!

This new book is the ideal gateway into Arm’s Helium technology, including both theoretical and practical sections that introduce the technology at an accessible level.

Download the free eBook 

Comments
  • Thomas Grocutt, in reply to 2henwei

    The idea behind MVE is that you can scale performance by changing the number of beats per cycle that the processor executes.

    On a really small processor, you may only have a 32-bit path to memory and only execute a single beat per cycle (as shown in the last diagram). Whereas on a bigger processor, you may have a 64-bit or even a 128-bit data path, and therefore be able to execute two or four beats per cycle respectively (the middle diagram shows a two-beats-per-cycle design). The key is that MVE is able to keep both the memory path and the multiplier(s) busy at the same time, regardless of whether it’s a small single-beat-per-cycle processor or something more powerful. This enables you to get the most out of all the hardware you have.

  • 2henwei

    If you only have a 32-bit data path, why can you load beats A and B in the same cycle?

  • beru, in reply to Thomas Grocutt

    Thanks to the newly added VLDR* (vector gather load) and VMULH instructions, I guess vectorizing division by invariant integers using multiplication and shift will finally be possible. I’m just hoping it will be a lot faster than sequential SDIV/UDIV instructions.

    I still wonder about the lack of reciprocal instructions, even for floating-point types. But maybe I should forget about cycle counts and be delighted with the high accuracy of the VDIV and VSQRT instructions. So I will forget about this design choice and just keep dreaming that ARM-software will release vrcp28, vrsqrt28, vexp23, vrcp14, etc. as library functions written with Armv8.1-M instructions.

  • Thomas Grocutt, in reply to beru

    You’re right that the floating-point instructions are optional. However, the scalar integer divide instructions (SDIV/UDIV) are still available. A common approach when vectorising code with a divide is either to arrange things so that the number being divided by is a power of two (so that a right shift can be used instead of a true divide), or to multiply the vector by the reciprocal of the number you want to divide by. Since both of these options work well, there wasn’t enough justification for a dedicated vector divide, which would have been a slower operation. It’s also possible to build square root and reciprocal library functions out of the available Armv8.1-M instructions that offer good performance.
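    A minimal sketch of that reciprocal-multiply approach, assuming the standard ACLE MVE intrinsics from arm_mve.h (vmulhq, vshrq, vdupq_n_u32). The constant 0xAAAAAAAB and the shift of 1 are the well-known magic pair for an unsigned divide-by-3; other invariant divisors need their own precomputed constants:

```c
#include <arm_mve.h>
#include <stdint.h>

/* Hypothetical example: vectorized unsigned divide-by-3 via multiply-high
 * and shift. VMULH returns the high 32 bits of the 64-bit product, which
 * is exactly what the multiply-by-reciprocal trick needs. */
uint32x4_t div3_u32(uint32x4_t n)
{
    uint32x4_t hi = vmulhq(n, vdupq_n_u32(0xAAAAAAABu)); /* VMULH.U32      */
    return vshrq(hi, 1);                                 /* VSHR #1: n / 3 */
}
```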

  • beru

    The Armv8-M floating-point extension is optional, so if it is omitted, my understanding is that floating-point instructions like VDIV and VSQRT are also unavailable. In that case, how can programmers implement integer division, square root, and so on efficiently using Helium?

    Also, the Neon instruction set includes reciprocal and reciprocal square root (estimate and step) instructions, but those aren’t included in Armv8-M. Does that mean they are considered somewhat obsolete in recent circuit designs? I hope VDIV and VSQRT will be acceptably fast then.
