Making Helium: Sudoku, registers and rabbits (2/4)

February 21, 2019

8 minute read time.

In the first part of this blog series, I explained the “beatwise” execution that’s at the heart of Helium technology (also known as the M-profile Vector Extensions / MVE). There’s some useful background information in that post, so it’s worth looking at if you haven’t already read it.

An important part of DSP processing is efficiently handling different data formats, which often need to be transformed into a different arrangement for computation. A good example of this is image data, which is typically stored as an interleaved stream of red, green, blue, and alpha pixel values. However, to vectorise the computation you may need all the red pixels in one vector, all the green pixels in another vector and so on. In the Neon architecture the VLD4 / VST4 instructions can perform this transformation, as illustrated below.

Data in registers

VST4 interleaves 4 x 128‑bit registers, storing a total of 512‑bits of data. The Neon architecture has a variety of interleaving / deinterleaving operations to support different formats. For example, a VST2 is provided that can be used to interleave the left and right channels of stereo audio. These instructions also support different element sizes from 8 to 32‑bit.

One of the main benefits of MVE’s “beatwise” execution is that it allows memory and ALU operations to be overlapped, even on single issue processors. As shown below, to achieve the doubling in performance that this technique enables, all instructions must perform the same amount of work.

Execution time ALU store

As can be seen, the performance increase from the overlapping would be seriously degraded by a wide store instruction like VST4. The solution that MVE provides is to split the store into chunks that are well balanced with the ALU operations, with each chunk storing 128‑bits of data. MVE enables this four‑way interleaving to be made up of four instructions called VST40, VST41, VST42 and VST43. Unfortunately, this isn’t the end of the story, as there are a whole host of related issues:

The obvious split would be for each of the four instructions to store a different stream (e.g. VST40
storing all the red pixels, VST41 storing the green pixels, etc). For 8‑bit pixel data this would mean each instruction would store 16 non‑contiguous bytes. This sort of access pattern would be tortuous for the memory subsystem and would result in a significant amount of stalling. Instead the instructions need to generate large contiguous block requests.
To work correctly with other vector instructions, the register file ports must be set up to access the rows of the register file (i.e. a whole vector register), rather than the columns (i.e. the first byte of four
registers) as would be needed if interleaving data to a contiguous block of memory.
To avoid the sort of time travel issue described in my previous blog post, we need to be able to split the instruction into “beats” where bits [63:0] of the register are read first, and then bits [127:64] are read on the subsequent cycle.
The solution must work for both interleaving by two and by four, and when operating on 8, 16 and 32‑bit data.

With all these conflicting constraints we’re so far down the rabbit hole that I’m reminded of Alice in Wonderland:

Alice: This is impossible.
The Mad Hatter: Only if you believe it is.

So, let’s suspend disbelief and take a closer look at a read port and see where it takes us.

Read port 1

MVE reuses the floating‑point register file, so the vector registers (Q0 to Q7) are made up of groups of four “S” registers. The same row is selected for each column mux and the data combined to access an entire Q register (see figure above). But what if instead of being able to select from any register in a column, the ports were twisted and we could select from registers in alternating columns, as shown below:
Read port 2

If the control inputs on 8:1 muxes are set to the same value a row can be read (e.g. S0 and S1). However, if different values are used, a pair of values in a column (e.g. S0 and S4) can be read. Now this is starting to look promising as it gives us the ability to read from columns as well as rows. If we zoom in a bit to the bottom corner of the diagram and replace the register numbers with the numbers of the muxes they’re connected to we get this:

Read port 3

This resembles a simple sudoku puzzle with each number appearing exactly once in every row and column within the repeating matrix, but on a 2 x 2 matrix, instead of the usual 9 x 9. This pattern only gives us a solution for interleaving by two, as it can only read two values from a column and only handles 32‑bit values (the width of the muxes). Since we need one pattern that works with all combinations of interleaving patterns and data widths, we can imagine stacking all the combinations vertically to get a multi resolution, 3D sudoku puzzle. While solving each layer is trivial, solving the full 3D puzzle as a whole is mind blowing. In addition, we still need to consider the other constraints I mentioned above, like contiguous memory accesses, and being able to split the accesses to the upper and lower 64‑bits of a register across different cycles.

After a bit of head scratching I realised I could split the problem in two: identifying a way to represent the full set of constraints in a single unified problem space, and the monotonous task of solving those constraints. Since the pattern resembles a really complex sudoku problem, and many sudoku programs are based on SAT solvers, I had the idea of using a SAT solver to perform the monotonous constraints solving task. With some caffeine‑boosted creativity I came up with a way of representing all the constraints and after a bit of debugging I had the first working solution. It was incredibly messy and would have made the control logic for the muxes a nightmare to implement, but at least I knew I was onto a winner. Again I didn’t want to have to try and manually clean up the solution, so with a bit of ingenuity I added some extra constraints to introduce some symmetry, and I got the final solution. The answer turns out to be a pair of double nested quadruple helixes:

Double nested

So that you can see the nested helixes, I’ve highlighted the path of a single mux in the diagrams below. As you can see, the path alternates between 32‑bit “S” registers every row (shown on the left), and swaps between the upper and lower 16‑bit halves of the “S” registers every two rows (shown on the right).

Nested 2 I had a hunch that the twisting approach wouldn’t work for an interleaving by three, but keeping the Mad Hatter in mind I didn’t want to just give up on this without a fight. As it turns out I was right and the SAT solver formally proved there wasn’t a solution.

This twisting approach means the data in both the rows and the columns of the register file can be accessed. However the snag is that the bytes returned by the read port can be in the wrong order, and that order depends on which registers are being accessed. To correct this, a crossbar mux is needed to swap everything back to the correct positions. As it happens we get this crossbar for free as it’s already needed for other instructions like VREV, and the instructions that natively operate on complex numbers. This again plays to our mantra of “if you have to have a piece of hardware, make it work for its lunch”.

The diagram below shows some of the instruction access patterns derived from the read port twisting pattern. The first case (VST2n.S32) shows 32‑bits (S32) being read from vector registers Q0 and Q1 and being interleaved by two (e.g. for left and right audio channels). The colours represent the parts of the registers that are read by each of the two instructions (i.e. VST20 reads the orange sections), and the text in the elements indicates which byte offsets from memory are stored.

VST2n

You’ll notice that both the S8 and S16 patterns above place the same data within the same coloured sections of the registers; the only difference is the arrangement of the bytes within a section. This means that the 16‑bit pattern can also support 8‑bit by just using a different configuration in the crossbar mux. These patterns also work for the equivalent write port used by load instructions. In addition to making it possible to build the register file ports, the patterns also mean the memory accesses are always a pair of 64‑bit contiguous blocks, which makes the memory accesses efficient. As an added bonus, bit[3] of the addresses of these blocks is always different, so they can be dispatched in parallel on systems that have two banks of interleaved 64‑bit memories.

There were two important lessons that the research team took from these instructions. Firstly, when pushing the boundaries of gate count and efficiency this hard, you really do have to design the fine details of a microarchitecture in parallel with the architecture. Secondly, keep the faith and believe in the impossible (until formally proven otherwise…).

In the next part of this blog series, my colleague, François, will continue the memory access theme and describe some techniques for implementing circular buffers.

Learn More About MVE

This post is the second in a four part series. Read the other parts of the series using the links below:

Part 1 - Making Helium: Why not just add Neon?

Part 3 - Making Helium: Going around in circles

Part 4 - Making Helium: Bringing Amdahl's law to heel

Arm Helium Technology MVE for Arm Cortex-M Processors Reference Book Now Available!

This new book is the ideal gateway into Arm’s Helium technology, including both theoretical and practical sections that introduce the technology at an accessible level.

Download the free eBook

Top Comments

Thomas Grocutt over 6 years ago in reply to zhengwang721
The different colours represent different instructions. E.G. VST40.S32 is shown in orange, and VST41.S32 in green, etc. So the order of the memory accesses depends on the order in the program. If we assume the program order of the instructions is: VST40.S32, VST41.S32, VST42.S32, VST43.S32, then the memory order would be the one you suggested.

Microarchitectures can use all sorts of interesting tricks to increase performance, and it’s certainly possible to build a processor that performs M[47:40] in parallel with M[15:8]. However there’s a number of reasons why you may not want to:

Because a VST40.S32 and a VST41.S32 are likely to use the same hardware blocks, the compiler will try and schedule a different type of instruction in between them. EG VST40.S32, VMLALV, VST41.S32, VMLALV. This way the stores can overlap with the MACs, keeping both blocks busy at the same time. As a result it may not be worth the extra hardware to support storing M[47:40] in parallel with M[15:8], as it’s unlikely the VST40.S32 and VST41.S32 instructions would be next to each other in the program.

Performing M[47:40] in parallel with M[15:8] would require a 128bit data path to memory. If you had such a wide microarchitecture it would actually be simpler to perform M[7:0] in parallel with M[47:40], and effectively do the whole of the VST40.S32 in one go.

The access patterns are designed to make the microarchitecture easy to implement (eg by making sure the memory accesses are large contiguous blocks). Microarchitecture is all about providing the illusion the architecture is being strictly followed, when actually all manner of prediction and reordering is being performed under the hood. It’s certainly possible for microarchitectures to deviate from these patterns, but I suspect they won’t as there wouldn’t be much benefit in doing so.
- Cancel
- Up +1 Down
- Reply
- More
- Cancel
zhengwang721 over 6 years ago

In the last picture, take the VST4n.S32 as example. What's the timing sequence of the data flow?

Does it follow the color like M[7:0]->M[47:40]->M[15:8]->M[55:48]->M[23:16]->M[63:56]->M[31:24]->M[39:32]?

Can the M[47:40] and M[15:8] been overlapped?

It looks like a microarchitecture here. Do you think if there could be an implementation doesn't follow this?

Thanks!
- Cancel
- Up +1 Down
- Reply
- More
- Cancel

Research Articles

HOL4 users' workshop 2025

Hrutvik Kanabar

Tue 10th - Wed 11th June 2025. A workshop to bring together developers/users of the HOL4 interactive theorem prover.
- March 24, 2025
TinyML: Ubiquitous embedded intelligence

Becky Ellis

With Arm’s vast microprocessor ecosystem at its foundation, the world is entering a new era of Tiny ML. Professor Vijay Janapa Reddi walks us through this emerging field.
- November 28, 2024
To the edge and beyond

Becky Ellis

London South Bank University’s Electrical and Electronic Engineering department have been using Arm IP and teaching resources as core elements in their courses and student projects.
- November 5, 2024

Research Articles

Making Helium: Sudoku, registers and rabbits (2/4)

Arm Helium Technology MVE for Arm Cortex-M Processors Reference Book Now Available!

Top Comments

HOL4 users' workshop 2025

TinyML: Ubiquitous embedded intelligence

To the edge and beyond