When we designed the M‑Profile Vector Extensions (MVE), known as Arm Helium technology for the Arm Cortex-M processor series, we wanted to make it a good fit for a wide variety of digital signal processing (DSP) applications. The ability to perform computation on data efficiently is only half the story. Equally important is the ability to access and store this data in memory.
As we described in an earlier post, Helium is a four-beat vector architecture. The most straightforward way to load data into vectors is a contiguous load operation (see Figure 1). For each beat, memory is accessed in sequence starting from a base address specified in a scalar register. Regardless of the data type targeted (8, 16, or 32‑bit), this can be performed efficiently through accesses that make full use of the bus width, because the data elements are adjacent in both the memory and the vector. The same is true for store operations.
Figure 1: Contiguous load operation
As memory is a scarce resource, it is normal to pack data as tightly as possible, using the smallest data type that will accommodate it. However, when processing the data, more headroom might be required to avoid overflows in intermediate stages of computation. This could be performed as a standalone widening instruction, but as explained in the first post in this series, this would require time travel. (For an 8 to 32‑bit expansion, expanding data into the last beat requires data from the first beat, which is no longer available.) As a result, the expansion instruction could not be overlapped with anything else, which would have a detrimental impact on performance. Instead, Helium introduces size-changing memory operations. Here, the data can be loaded efficiently as a single 8, 16, or 32‑bit access for each beat, and either zero- or sign-extended to match the desired data type. In the example below (see Figure 2), we want to perform an 8‑bit load expanding to 16 bits for each vector lane. Two 8‑bit data samples are loaded as a single 16‑bit load operation, and each one is extended to 16 bits before being written to the vector lane. Similarly, for stores, the data can be truncated to the desired size and stored efficiently.
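To make the expanding-load semantics concrete, here is a minimal plain-C model of what a widening load does per lane: read packed 8‑bit samples and sign-extend each into a 16‑bit destination. The function name and loop form are illustrative only; on Helium the whole operation is a single vector instruction.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain-C model of an expanding load: read n packed int8 samples and
 * sign-extend each into an int16 lane. On Helium this would be one
 * widening vector load; here it is an explicit scalar loop. */
void widen_load_s8_to_s16(const int8_t *src, int16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)src[i];   /* the cast performs sign extension */
}
```

A store in the opposite direction would simply truncate each 16‑bit lane back to 8 bits.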
Figure 2: Expanding load
The Helium load and store instructions have the same rich set of addressing modes as the rest of the M‑Profile architecture, with features like pre- or post-incrementing and pointer writeback support. This eliminates the need for separate pointer manipulation operations in most cases.
DSP applications often operate on data structures instead of individual elements. For example, stereo audio data is usually stored as an interleaved stream of left and right values. Similarly, image data is often stored as interleaved Red-Green-Blue-Alpha values. This was the subject of the previous post, which described the structured load/store instructions that can efficiently do this.
Sometimes, data stored in memory cannot be structured in a convenient way that makes contiguous accesses possible. In some architectures, this can actually prevent code from being vectorized. Helium solves this problem with scatter‑gather operations. These take a vector of offsets from a base address, so that multiple non‑contiguous addresses can be accessed with a single instruction (see Figure 3). They can also extend or truncate the data being accessed.
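The gather semantics can be sketched in a few lines of plain C: each destination lane is loaded from the base address plus its own per-lane offset. This is an illustrative model, not the hardware implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain-C model of a gather load: lane i of the destination is read from
 * base[offsets[i]], so the accessed locations need not be contiguous. */
void gather_load_u32(const uint32_t *base, const uint32_t *offsets,
                     uint32_t *dst, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        dst[i] = base[offsets[i]];
}
```

A scatter store is the mirror image: each lane is written to its own offset instead of read from it.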
Figure 3: Gather load
Scatter-gather operations are powerful instructions that provide a great deal of flexibility to applications. For example, they are almost indispensable for implementing an FFT (Fast Fourier Transform); in this algorithm, the memory accesses for the first or last butterfly phases need to be performed with bit-reversed addressing. A dedicated bit-reversal instruction (VBRSR) produces the bit-reversed addressing pattern that is then used by these scatter-gather instructions. These instructions overlap for performance and together are much more efficient at loading the data than an equivalent sequence of scalar instructions. They are also much easier to use! While contiguous vector accesses offer better performance for regular patterns, because scatter-gather instructions cannot make assumptions about the layout of the data, a gather load instruction must perform as many separate accesses as there are distinct elements. That can be up to 16 accesses for 8‑bit data – it’s the load that just keeps on loading.
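For readers unfamiliar with bit-reversed addressing, a scalar sketch may help. The function below reverses the low bits of an index, which is the pattern FFT butterflies need; VBRSR produces such values in vector form, while this loop is purely for illustration.

```c
#include <stdint.h>

/* Reverse the low 'bits' bits of 'index' — the access pattern used by
 * FFT butterfly phases. A scalar illustration of what VBRSR computes
 * per lane in hardware. */
uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (index & 1);   /* shift reversed bits in from the right */
        index >>= 1;
    }
    return r;
}
```

For an 8-point FFT (3 address bits), index 1 (binary 001) maps to 4 (binary 100), so a gather using these offsets reads the samples in bit-reversed order.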
DSP applications often use a memory arrangement known as circular addressing. Elements can be accessed in sequence up to the configured size of the buffer, after which accesses wrap around to the first element (see Figure 4). For example, a four-element read starting at element N‑1 would access elements N‑1, 0, 1, 2.
Figure 4: Circular buffer example
This has many uses in DSP applications, including avoiding pointer manipulation when only the most recent N samples of a data stream are needed. In FIR filters, the last N data samples need to be multiplied by a set of coefficients to produce the desired filter response. When a new data sample arrives, it is the previous N‑1 samples and the new sample that need to be processed; the oldest sample is no longer used. The data could be rearranged so that the buffer to be processed always contains the elements in the right order, but this would require considerable effort to copy each sample to a different location before processing can start. If a circular buffer is used, data can be accessed in place, wrapping around when necessary, and only one write is required to replace the oldest sample with the newest one.
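A minimal scalar sketch of one FIR step over a circular state buffer shows the idea: the newest sample overwrites the oldest, and the taps are read walking backwards with a modulo wrap. The function name, tap count, and float types are illustrative choices, not from the Helium specification.

```c
#include <stddef.h>

#define TAPS 4  /* illustrative filter length */

/* One FIR step using a circular state buffer: a single write replaces the
 * oldest sample, then the taps are accumulated newest-first, wrapping the
 * index around the buffer. */
float fir_step(float state[TAPS], size_t *head, const float coeff[TAPS],
               float new_sample)
{
    state[*head] = new_sample;          /* one write replaces the oldest sample */
    float acc = 0.0f;
    size_t idx = *head;
    for (size_t k = 0; k < TAPS; k++) {
        acc += coeff[k] * state[idx];   /* coeff[0] pairs with the newest sample */
        idx = (idx + TAPS - 1) % TAPS;  /* step backwards in time, wrapping */
    }
    *head = (*head + 1) % TAPS;         /* advance the write position */
    return acc;
}
```

No samples are ever copied to keep the buffer "in order" – only the head index moves.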
Some DSPs implement circular buffers through dedicated access instructions and dedicated registers for the start and end addresses. Each time the pointer is incremented, the hardware compares it to the end address and wraps around accordingly. This means that the number of simultaneous circular buffers supported is limited by the available hardware. It also implies a large amount of extra state that needs to be preserved on every interrupt, which affects latency. The hardware required to support this is not negligible; in a typical implementation, more complex address generation units would be required. To avoid this, some DSPs require that circular buffers be a power of two in size and that the address of the buffer be aligned to a multiple of that size. The wrap can then be achieved by ANDing the pointer with a bitmask, making the hardware much simpler. However, this imposes restrictions on the placement and use of these buffers, notably making them almost impossible to use directly from a high-level language. Since the M‑Profile ethos is to make everything easy to use from C, we needed to come up with a better approach.
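The power-of-two scheme described above reduces the wrap to a single AND, as the following sketch shows (a generic model of that class of DSP hardware, not any specific architecture):

```c
#include <stdint.h>

/* Model of power-of-two circular addressing: when the buffer size is a
 * power of two, wrapping an index reduces to ANDing with (size - 1). */
static inline uint32_t wrap_pow2(uint32_t index, uint32_t size)
{
    /* precondition: size is a power of two */
    return index & (size - 1u);
}
```

The cheapness of this trick is exactly why it is popular, but the size and alignment preconditions are what make such buffers so awkward to express in C.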
Our solution is to split the circular buffer into two distinct operations, in a similar manner to the bit-reversed addressing discussed above. We combine an instruction that generates wrapping offsets with a scatter‑gather instruction to access data at these offsets. This achieves flexibility of buffer size and placement and does not require dedicated hardware on the critical path. The circular buffer generating instruction (VIWDUP) creates a vector containing a sequence of incrementing offsets that wrap when an end position is reached (see Figure 5). The instruction populates vector register Q0 with a sequence starting at the value of R0, wrapping when the value of R1 is reached. In the example shown, it then writes back an updated start offset of 2 to R0. A clever design feature of this instruction is that the vector of offsets written to Q0 is regenerated from just the scalar values each time. As the scatter‑gather instruction that uses the offsets is normally the next instruction, Q0 can be reused for other purposes straight afterwards. The immediate value specifies the amount by which the offsets increment, which is useful for handling different element sizes. For example, if 32‑bit values are being loaded, an increment of four bytes would be used. Since arbitrary increment or decrement amounts can be specified, this instruction can be used in other cases where a generic pattern of numbers is required. In this way, Helium can offer any number of circular buffers, with flexible size, direction, and alignment in memory, all the while using only existing hardware and providing a sequence-generating instruction in the process.
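Based on the description above, the offset generation can be modeled in scalar C as follows: produce a run of incrementing offsets that wrap to zero at the limit, and write the updated start value back for the next loop iteration, mirroring the writeback to R0. The function name and exact wrap condition are illustrative, not a normative statement of VIWDUP's encoding.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of VIWDUP-style offset generation: fill 'lanes' offsets
 * starting at *start, stepping by 'imm' bytes and wrapping to 0 at
 * 'limit'. *start is updated afterwards, mirroring the R0 writeback. */
void wrapping_offsets(uint32_t *offsets, size_t lanes,
                      uint32_t *start, uint32_t limit, uint32_t imm)
{
    uint32_t v = *start;
    for (size_t i = 0; i < lanes; i++) {
        offsets[i] = v;
        v += imm;
        if (v >= limit)
            v = 0;          /* wrap back to the start of the buffer */
    }
    *start = v;             /* writeback for the next iteration */
}
```

With a start of 6, a limit of 8, and an increment of 1, four lanes come out as 6, 7, 0, 1 and the written-back start is 2 – matching the Figure 5 example.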
Figure 5: Example operation of the sequence generating instruction
But what about performance? While an additional offset-generating instruction (the VIWDUP) is required, we found that in many cases the overhead could be hidden by overlapping it with the memory access itself. In all cases, this was less than the computational effort of managing the wrapping without hardware support. We also said earlier that it was desirable to use contiguous accesses where possible for performance reasons. Circular buffers are interesting in the sense that most of the accesses will be contiguous, with only occasional discontinuities when wrapping occurs. One option would be for the scatter-gather to compare the offset values and merge accesses that are contiguous. Unfortunately, this would involve a fair amount of extra hardware and would add a lot of extra complexity into a critical part of the design. Paying the scatter‑gather performance penalty when the loads are contiguous did not sit well with our mantra of squeezing every last drop of performance out of each gate.
When trying to find a solution to this, we noticed that the offset generation instruction (VIWDUP) already knows where the wraparound points are. If this information could be passed to the scatter-gather, it could promote accesses to contiguous accesses without the need for expensive and time-consuming offset comparators. Could we specify an extra scalar register to transport this information? Unfortunately, this would increase the number of read ports required, and the scalar dependency from VIWDUP to the scatter‑gather would prevent the instructions overlapping. Could Helium implementations instead store this information in hidden microarchitecture metadata, which is cleared if the vector is modified? Normally this would not be an option as the metadata would need to be preserved on interrupts and that would affect latency. But we realized that in this case we do not need to preserve it. In the rare cases that exceptions occur, the fall-back is to execute the scatter-gather normally rather than optimizing for contiguous accesses. By using volatile hidden metadata to indicate contiguous accesses, we can optimize performance for the common non-wrapping case, while avoiding extra architectural state and interrupt latency.
Working within a constrained environment can be challenging, and Helium constantly required us to find innovative solutions to squeeze ever more performance from the hardware. We worked hard to co-design the architecture and microarchitecture, to find a range of memory access instructions that are up to the task of supporting the needs of DSP applications while minimizing the amount of hardware required to implement them. With our approach to circular buffers in particular, we continue the M-Profile tradition of making each gate work for its lunch to achieve performance at low area cost, while providing a frustration-free end-user experience.
Please join us again for the final part of this blog series, where my colleague Thomas will explain how we kept Amdahl’s law at bay with some interesting loop handling instructions.
To learn more about the new features of Arm Helium technology, a vector extension of the Armv8.1-M architecture, download the white paper below.
Download white paper
This post is the third in a four part series. Read the other parts of the series using the links below:
Part 1 - Making Helium: Why not just add Neon?
Part 2 - Making Helium: Sudoku, registers and rabbits
Part 4 - Making Helium: Bringing Amdahl's law to heel
This new book is the ideal gateway into Arm’s Helium technology, including both theoretical and practical sections that introduce the technology at an accessible level.
Download the free eBook
“As the scatter‑gather instruction that uses the offsets is normally the next instruction, Q0 can be reused for other purposes straight afterwards.”
Here, should it be Q0 or R0?
Hi, thanks for the question! Q0 is the vector of offsets for the scatter-gather, so once the scatter-gather has consumed it, Q0 can be used for other purposes. This means it does not restrict the overall number of available vector registers. R0 would still need to be maintained, as it will be used in subsequent loop iterations to generate the next offsets.