In the first part of this blog series, I explained the “beatwise” execution that’s at the heart of Helium technology (also known as the M-profile Vector Extensions / MVE). There’s some useful background information in that post, so it’s worth looking at if you haven’t already read it.
An important part of DSP processing is efficiently handling different data formats, which often need to be transformed into a different arrangement for computation. A good example of this is image data, which is typically stored as an interleaved stream of red, green, blue, and alpha pixel values. However, to vectorise the computation you may need all the red pixels in one vector, all the green pixels in another vector and so on. In the Neon architecture the VLD4 / VST4 instructions can perform this transformation, as illustrated below.
VST4 interleaves 4 x 128-bit registers, storing a total of 512 bits of data. The Neon architecture has a variety of interleaving / deinterleaving operations to support different formats. For example, VST2 can be used to interleave the left and right channels of stereo audio. These instructions also support element sizes from 8-bit to 32-bit.
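To make the RGBA example concrete, here is a short sketch using the Neon intrinsics from arm_neon.h; the function name and the saturating +16 brightness tweak are just illustrative, and tail pixels are left to scalar code.

```c
/* Illustrative sketch: de-interleave RGBA pixels with Neon's vld4q_u8 so
 * each colour channel lands in its own 128-bit vector, process one
 * channel, then re-interleave with vst4q_u8.  Tail pixels (when `pixels`
 * is not a multiple of 16) would be handled by scalar code, omitted here. */
#include <arm_neon.h>
#include <stdint.h>

void brighten_red(uint8_t *rgba, int pixels) {
    for (int i = 0; i + 16 <= pixels; i += 16) {
        uint8x16x4_t px = vld4q_u8(rgba + 4 * i);   /* split into R,G,B,A  */
        /* px.val[0] now holds 16 red bytes, val[1] green, and so on.      */
        px.val[0] = vqaddq_u8(px.val[0], vdupq_n_u8(16)); /* saturating +16 */
        vst4q_u8(rgba + 4 * i, px);                 /* back to RGBA order  */
    }
}
```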
One of the main benefits of MVE's "beatwise" execution is that it allows memory and ALU operations to be overlapped, even on single-issue processors. As shown below, to achieve the doubling in performance that this technique enables, all instructions must perform the same amount of work.
As can be seen, the performance increase from the overlapping would be seriously degraded by a wide store instruction like VST4. The solution MVE provides is to split the store into chunks that are well balanced with the ALU operations, with each chunk storing 128 bits of data. The four-way interleaving is therefore made up of four instructions called VST40, VST41, VST42 and VST43. Unfortunately, this isn't the end of the story, as there are a whole host of related issues, such as keeping the memory accesses contiguous and being able to split the accesses to the upper and lower 64 bits of a register across different cycles.
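Before digging into those issues, it's worth seeing what this looks like from the programmer's side. The sketch below assumes the ACLE MVE intrinsics in arm_mve.h are available (e.g. when targeting a Helium-capable core such as Cortex-M55); a single vst4q call should compile down to the VST40/VST41/VST42/VST43 sequence described above.

```c
/* Hedged sketch of the four-way interleaving store from C, assuming the
 * ACLE MVE intrinsics in arm_mve.h.  The compiler is expected to expand
 * the single vst4q_u32 call into the VST40..VST43 instruction group. */
#include <arm_mve.h>
#include <stdint.h>

void store_interleaved(uint32_t *dst, uint32x4_t ch0, uint32x4_t ch1,
                       uint32x4_t ch2, uint32x4_t ch3) {
    uint32x4x4_t v;
    v.val[0] = ch0;            /* becomes element 0 of every group of four */
    v.val[1] = ch1;
    v.val[2] = ch2;
    v.val[3] = ch3;
    vst4q_u32(dst, v);         /* 512 bits stored as ch0,ch1,ch2,ch3,...   */
}
```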
With all these conflicting constraints we’re so far down the rabbit hole that I’m reminded of Alice in Wonderland:
Alice: This is impossible. The Mad Hatter: Only if you believe it is.
So, let’s suspend disbelief and take a closer look at a read port and see where it takes us.
MVE reuses the floating‑point register file, so the vector registers (Q0 to Q7) are made up of groups of four “S” registers. The same row is selected for each column mux and the data combined to access an entire Q register (see figure above). But what if instead of being able to select from any register in a column, the ports were twisted and we could select from registers in alternating columns, as shown below:
If the control inputs on the 8:1 muxes are set to the same value, a row can be read (e.g. S0 and S1). However, if different values are used, a pair of values in a column (e.g. S0 and S4) can be read instead. Now this is starting to look promising, as it gives us the ability to read from columns as well as rows. If we zoom in on the bottom corner of the diagram and replace the register numbers with the numbers of the muxes they're connected to, we get this:
This resembles a simple sudoku puzzle, with each number appearing exactly once in every row and column of the repeating pattern, but on a 2 x 2 grid instead of the usual 9 x 9. This pattern only gives us a solution for interleaving by two, as it can only read two values from a column and only handles 32-bit values (the width of the muxes). Since we need a single pattern that works with every combination of interleaving pattern and data width, we can imagine stacking all the combinations vertically to get a multi-resolution, 3D sudoku puzzle. While solving each layer is trivial, solving the full 3D puzzle as a whole is mind-blowing. In addition, we still need to consider the other constraints I mentioned above, like contiguous memory accesses, and being able to split the accesses to the upper and lower 64 bits of a register across different cycles.
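To make the row-versus-column trick concrete before tackling the full puzzle, here is a tiny C model of a two-column slice of a twisted read port. The register indices, the four-row depth and the twist pattern are simplifications of the description above (the real ports use 8:1 muxes over the whole column), not the actual hardware.

```c
/* Toy model of a twisted read port: even rows feed mux A from column 0
 * and mux B from column 1, odd rows are swapped.  Equal select values
 * read a row (S0, S1); different select values read down a column
 * (S0, S4), matching the examples in the text. */
#include <stdio.h>

#define ROWS 4
static unsigned rf[ROWS][2];              /* rf[row][col] holds S(4*row+col) */

static unsigned mux_a(unsigned sel) {     /* sel picks the row               */
    return rf[sel][sel & 1];              /* even row -> col 0, odd -> col 1 */
}
static unsigned mux_b(unsigned sel) {
    return rf[sel][(sel & 1) ^ 1];        /* even row -> col 1, odd -> col 0 */
}

int main(void) {
    for (unsigned r = 0; r < ROWS; r++) { /* fill with S-register numbers    */
        rf[r][0] = 4 * r;                 /* S0, S4, S8, S12                 */
        rf[r][1] = 4 * r + 1;             /* S1, S5, S9, S13                 */
    }
    printf("same selects:      S%u S%u (a row)\n",    mux_a(0), mux_b(0));
    printf("different selects: S%u S%u (a column)\n", mux_a(0), mux_b(1));
    return 0;
}
```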
After a bit of head scratching, I realised I could split the problem in two: finding a way to represent the full set of constraints in a single unified problem space, and the monotonous task of actually solving those constraints. Since the pattern resembles a really complex sudoku problem, and many sudoku programs are based on SAT solvers, I had the idea of using a SAT solver for the monotonous constraint-solving part. With some caffeine-boosted creativity I came up with a way of representing all the constraints, and after a bit of debugging I had the first working solution. It was incredibly messy and would have made the control logic for the muxes a nightmare to implement, but at least I knew I was onto a winner. Again, I didn't want to clean up the solution by hand, so with a bit of ingenuity I added some extra constraints to introduce symmetry, and out came the final solution. The answer turns out to be a pair of double nested quadruple helixes:
So that you can see the nested helixes, I’ve highlighted the path of a single mux in the diagrams below. As you can see, the path alternates between 32‑bit “S” registers every row (shown on the left), and swaps between the upper and lower 16‑bit halves of the “S” registers every two rows (shown on the right).
I had a hunch that the twisting approach wouldn't work for interleaving by three, but keeping the Mad Hatter in mind I didn't want to give up on it without a fight. As it turns out, I was right: the SAT solver formally proved there was no solution.
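For a flavour of what representing such constraints for a SAT solver involves, the sketch below emits DIMACS CNF for the toy 2 x 2 case: variable v(r,c,k) is true when the cell at row r, column c is driven by mux k, and every cell, row and column must see each mux exactly once. The scale and encoding are purely illustrative, not the actual tooling or constraint set used during the design; any DIMACS-reading SAT solver can consume the output.

```c
/* Toy illustration of encoding the sudoku-like constraints as CNF.
 * "Exactly one" is expressed as one at-least-one clause plus pairwise
 * at-most-one clauses.  The real problem stacks a layer per interleaving
 * pattern and element size and adds memory and symmetry constraints. */
#include <stdio.h>

#define N 2                                   /* 2x2 grid, mux ids 0..1 */
static int v(int r, int c, int k) { return (r * N + c) * N + k + 1; }

int main(void) {
    printf("p cnf %d %d\n", N * N * N, 3 * N * N * (1 + N * (N - 1) / 2));
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++) {
            for (int k = 0; k < N; k++) printf("%d ", v(a, b, k)); /* cell (a,b) */
            printf("0\n");
            for (int k = 0; k < N; k++) printf("%d ", v(a, k, b)); /* row a, mux b */
            printf("0\n");
            for (int k = 0; k < N; k++) printf("%d ", v(k, a, b)); /* col a, mux b */
            printf("0\n");
            for (int i = 0; i < N; i++)
                for (int j = i + 1; j < N; j++) {
                    printf("%d %d 0\n", -v(a, b, i), -v(a, b, j)); /* cell */
                    printf("%d %d 0\n", -v(a, i, b), -v(a, j, b)); /* row  */
                    printf("%d %d 0\n", -v(i, a, b), -v(j, a, b)); /* col  */
                }
        }
    return 0;
}
```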
This twisting approach means the data in both the rows and the columns of the register file can be accessed. However the snag is that the bytes returned by the read port can be in the wrong order, and that order depends on which registers are being accessed. To correct this, a crossbar mux is needed to swap everything back to the correct positions. As it happens we get this crossbar for free as it’s already needed for other instructions like VREV, and the instructions that natively operate on complex numbers. This again plays to our mantra of “if you have to have a piece of hardware, make it work for its lunch”.
The diagram below shows some of the instruction access patterns derived from the read port twisting pattern. The first case (VST2n.S32) shows 32-bit elements (S32) being read from vector registers Q0 and Q1 and interleaved by two (e.g. for left and right audio channels). The colours represent the parts of the registers read by each of the two instructions (i.e. VST20 reads the orange sections), and the text in the elements indicates the byte offsets in memory at which they are stored.
You’ll notice that both the S8 and S16 patterns above place the same data within the same coloured sections of the registers; the only difference is the arrangement of the bytes within a section. This means that the 16‑bit pattern can also support 8‑bit by just using a different configuration in the crossbar mux. These patterns also work for the equivalent write port used by load instructions. In addition to making it possible to build the register file ports, the patterns also mean the memory accesses are always a pair of 64‑bit contiguous blocks, which makes the memory accesses efficient. As an added bonus, bit[3] of the addresses of these blocks is always different, so they can be dispatched in parallel on systems that have two banks of interleaved 64‑bit memories.
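As a concrete use of the VLD2n / VST2n pairs, here is a hedged sketch of the stereo case, again assuming the ACLE MVE intrinsics in arm_mve.h; halving the right channel is just a stand-in for whatever per-channel processing is needed, and tail frames are omitted.

```c
/* Hedged sketch of the stereo use case: vld2q/vst2q map onto the
 * VLD2n/VST2n instruction pairs, de-interleaving L/R samples so each
 * channel can be processed as a whole vector. */
#include <arm_mve.h>
#include <stdint.h>

void attenuate_right(int16_t *lr, int frames) {
    for (int i = 0; i + 8 <= frames; i += 8) {       /* 8 L/R pairs per step */
        int16x8x2_t ch = vld2q_s16(lr + 2 * i);      /* val[0]=L, val[1]=R   */
        ch.val[1] = vshrq_n_s16(ch.val[1], 1);       /* right channel >> 1   */
        vst2q_s16(lr + 2 * i, ch);                   /* re-interleave        */
    }
}
```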
There were two important lessons that the research team took from these instructions. Firstly, when pushing the boundaries of gate count and efficiency this hard, you really do have to design the fine details of a microarchitecture in parallel with the architecture. Secondly, keep the faith and believe in the impossible (until formally proven otherwise…).
In the next part of this blog series, my colleague, François, will continue the memory access theme and describe some techniques for implementing circular buffers.
Learn More About MVE
This post is the second in a four-part series. Read the other parts of the series using the links below:
Part 1 - Making Helium: Why not just add Neon?
Part 3 - Making Helium: Going around in circles
Part 4 - Making Helium: Bringing Amdahl's law to heel
This new book is the ideal gateway into Arm’s Helium technology, including both theoretical and practical sections that introduce the technology at an accessible level.
Download the free eBook
Question (from the comments): In the last picture, taking VST4n.S32 as an example, what is the timing sequence of the data flow? Does it follow the colours, i.e. M[7:0] -> M[47:40] -> M[15:8] -> M[55:48] -> M[23:16] -> M[63:56] -> M[31:24] -> M[39:32]? Can the M[47:40] and M[15:8] accesses be overlapped? This looks like one particular microarchitecture; do you think there could be an implementation that doesn't follow it? Thanks!

Answer: The different colours represent different instructions, e.g. VST40.S32 is shown in orange, VST41.S32 in green, and so on, so the order of the memory accesses depends on the order of the instructions in the program. If we assume the program order is VST40.S32, VST41.S32, VST42.S32, VST43.S32, then the memory order would be the one you suggested. Microarchitectures can use all sorts of interesting tricks to increase performance, and it's certainly possible to build a processor that performs M[47:40] in parallel with M[15:8]; however, there are a number of reasons why you may not want to. The access patterns are designed to make the microarchitecture easy to implement (e.g. by making sure the memory accesses are large contiguous blocks), and microarchitecture is all about providing the illusion that the architecture is being strictly followed, while all manner of prediction and reordering happens under the hood. It's certainly possible for a microarchitecture to deviate from these patterns, but I suspect they won't, as there wouldn't be much benefit in doing so.