In the previous posts in this blog series we looked at how the Armv8.1-M architecture with Arm Helium technology (also known as MVE) handles vector instructions. The problem is that whenever code is vectorized it’s not long before Amdahl's law sneaks up on you and bites your ankles. For those of you who haven’t heard of it before, Amdahl's law states that the parts of an algorithm that can’t be parallelized quickly become a performance bottleneck (a full description can be found here). For example, if 50% of a workload can be parallelized, then even if that part was infinitely parallel the maximum speed up you could get is 2x. I don’t know about you, but if I’d managed to infinitely parallelize something, I’d be more than a little irritated by a meagre 2x speedup! When we designed Helium we had to consider everything that’s typically used around the vector instructions, as well as the vector instructions themselves, so that we could maximize performance.
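For the record, that 2x ceiling falls straight out of Amdahl's formula. If a fraction p of the work is sped up by a factor s, the overall speedup is

\[
\text{speedup} = \frac{1}{(1 - p) + p/s} \;\le\; \frac{1}{1 - p},
\]

so with p = 0.5 the best you can ever achieve is 1 / (1 - 0.5) = 2, however large s gets.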
Serial code is common in loop handling, where the overheads caused by the serial code can be considerable, especially for small loops. The memory copy code below is a good example of this:
loopStart:
LDRB R3, [R1], #1
STRB R3, [R0], #1
SUBS R2, R2, #1
BNE loopStart
The decrement of the loop iteration count and the conditional branch back to the top of the loop account for 50% of the instructions in the loop. Many small Cortex-M processors don’t have branch predictors (the extreme area efficiency of small Cortex-M processors means that many branch predictors would be several times larger than the whole processor). As a result, the runtime overhead is actually higher than 50% due to the branch penalty. Loop unrolling can help reduce the overhead by amortizing it over several iterations, but it increases code size and can further complicate the process of vectorizing the code. Given that many DSP kernels have small loops, it was important to address these problems during the Helium research project. Many dedicated DSP processors support zero-overhead loops. One way of accomplishing this is to have a REPEAT instruction that tells the processor to repeat the next N instructions a given number of times:
REPEAT R2, #2 // Repeat the next 2 instructions R2 times
LDRB R3, [R1], #1
STRB R3, [R0], #1
The processor must keep track of several pieces of data: the address of the first instruction in the loop, how many instructions are in the loop (or, equivalently, the address of the last one), and how many iterations remain.
Keeping track of all of this when handling interrupts can be problematic, so some DSPs simply delay interrupts until the loop is complete. If there’s a large number of iterations to perform, this could take a considerable amount of time, and it certainly isn’t compatible with the fast and deterministic interrupt latency expected of Cortex-M processors. This sort of approach also wouldn’t work for handling precise faults, like a permissions violation leading to a MemManage fault. Another approach would be to add extra registers to hold the loop state, but these new registers would have to be saved and restored on exception entry and return, which again would increase interrupt latency. To get around this problem, Armv8.1-M has a pair of loop instructions:
WLS LR, R2, loopEnd
loopStart:
LDRB R3, [R1], #1
STRB R3, [R0], #1
LE LR, loopStart
loopEnd:
The loop is started by a While Loop Start (WLS) instruction, which copies the iteration count to LR and branches to the end of the loop if the iteration count is zero. There’s also a Do Loop Start (DLS) instruction that can be used to set up a loop where at least one iteration is always performed. The Loop End (LE) instruction checks LR to see if another iteration is required and, if so, it branches back to the start. The interesting thing is that the processor can cache the information provided by the LE instruction (i.e. the position of the start and end of the loop) so on the next iteration it can branch back to the start of the loop before it even fetches the LE. So, the sequence of instructions executed by the processor would look like this:
WLS LR, R2, loopEnd
LDRB R3, [R1], #1
STRB R3, [R0], #1
LE LR, loopStart // Loop cache populated
LDRB R3, [R1], #1
STRB R3, [R0], #1
LDRB R3, [R1], #1 // Cached info allows LE to be skipped
STRB R3, [R0], #1
LDRB R3, [R1], #1
STRB R3, [R0], #1
...
A nice side effect of having a loop instruction at the end of the loop is that it will be re-executed if the cached loop information is flushed. Re-executing the LE instruction would then re-populate the cache. As shown below, this enables the existing fast interrupt handling to be preserved as the loop start and end addresses don’t need to be saved.
WLS LR, R2, loopEnd
LDRB R3, [R1], #1
STRB R3, [R0], #1
LE LR, loopStart // Loop cache populated
LDRB R3, [R1], #1
STRB R3, [R0], #1
LDRB R3, [R1], #1
<IRQ taken, loop cache flushed>
<IRQ returns>
STRB R3, [R0], #1
LE LR, loopStart // LE re-executed, cache repopulated
LDRB R3, [R1], #1
STRB R3, [R0], #1
LDRB R3, [R1], #1
STRB R3, [R0], #1
...
Apart from some setup on the first iteration and when recovering from interrupts, all the time is actually spent doing the memory copy rather than the loop handling. In addition, because the processor knows the sequence of instructions in advance, it can always fill the pipeline with the correct instructions. This eliminates the pipeline flush and subsequent branch penalty. So, we can vectorize this loop with the comforting knowledge that Amdahl's law has been muzzled and our ankles are safe (for the moment).
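Here’s a sketch of what a straightforwardly vectorized version of the copy loop could look like (not taken from any library code, and assuming the byte count is an exact multiple of 16, with R2 already holding the iteration count, i.e. bytes / 16):

WLS LR, R2, loopEnd // R2 = number of 16-byte iterations
loopStart:
VLDRB.8 Q0, [R1], #16 // load 16 bytes from the source
VSTRB.8 Q0, [R0], #16 // store 16 bytes to the destination
LE LR, loopStart
loopEnd:

The obvious catch is that real byte counts are rarely a neat multiple of 16, which is exactly where the tail predication described below comes in.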
When vectorizing code it’s common for a loop to start and end with different types of instructions, for example a vector load (VLDR) and a vector multiply accumulate (VMLA). When such a loop is executed it produces a really long unbroken chain of alternating VLDR / VMLA operations (as shown below). This unbroken chain enables the processor to get the maximum benefit from overlapping the instructions, as it can even overlap from the end of one iteration of the loop to the start of the next, boosting performance even further. You can read more about instruction overlapping in the first part of this blog series.
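For instance, a hypothetical kernel along the following lines (the register assignments are made up for illustration: R0 = source pointer, R2 = iteration count, R3 = a scalar multiplier, with Q1 accumulating four per-lane partial sums) issues nothing but alternating vector loads and multiply-accumulates once the LE has been cached:

DLS LR, R2 // set up a low-overhead loop of R2 iterations
kernelLoop:
VLDRW.32 Q0, [R0], #16 // load the next four 32-bit elements
VMLA.S32 Q1, Q0, R3 // per-lane multiply-accumulate: Q1[i] += Q0[i] * R3
LE LR, kernelLoop // loops back with no branch penalty once cached

With the loop overhead out of the way, the executed stream is simply VLDRW, VMLA, VLDRW, VMLA, and so on, so each memory operation can overlap with the multiply-accumulate next to it.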
Vectorizing code can be problematic when the amount of data to be processed isn’t a multiple of the vector length. A typical solution is to process the full vectors first, and then have a serial / non-vectorized tail clean-up loop that processes the remaining elements. Before I know it, Amdahl's law is back and I’ve got a sharp pain in my ankle! The vectors in Helium can hold 16 x 8-bit values, so if we had a vectorized memcpy of 31 bytes, just under half of the copy would be executed serially by the tail loop instead of in parallel by vector instructions. To solve this, we added tail predication variants of the loop instructions (e.g. WLSTP, LETP). For these tail-predicated loops, LR holds the number of vector elements to be processed, instead of the number of loop iterations to perform. The loop start instruction (WLSTP) has a size field (the “.8” in the memcpy example below) that specifies the width of the elements to be processed.
memcpy:
PUSH {R0, LR}
WLSTP.8 LR, R2, vectLoopEnd // R2 = element / byte count
vectLoopStart:
VLDRB.8 Q0, [R1], #16 // R1 = srcPtr
VSTRB.8 Q0, [R0], #16 // R0 = destPtr
LETP LR, vectLoopStart
vectLoopEnd:
POP {R0, PC}
Anyone who’s looked at other optimized memcpy routines will probably be surprised by the simplicity of this example, but with Helium this really is all you need for an optimal, fully vectorized solution. Here’s how it works: the processor uses the size field together with the number of elements remaining to calculate how many iterations are left. If the number of elements to be processed on the last iteration is less than the vector length, the appropriate number of elements at the end of the vector are disabled. So in our previous example of copying 31 bytes, Helium would copy 16 bytes in parallel on the first iteration, and then 15 bytes in parallel on the next iteration. Not only does this put the muzzle back on Amdahl's law and give us back our performance, but it also eliminates the serial tail code completely, reducing code size and making development easier.
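To make that concrete, the sequence executed for the 31-byte case would look roughly like this (a sketch of the architectural behaviour, assuming R2 = 31 on entry to the memcpy above):

WLSTP.8 LR, R2, vectLoopEnd // 31 elements to process
VLDRB.8 Q0, [R1], #16 // iteration 1: all 16 lanes enabled
VSTRB.8 Q0, [R0], #16
LETP LR, vectLoopStart // 15 elements left, so loop again
VLDRB.8 Q0, [R1], #16 // iteration 2: only the bottom 15 lanes enabled
VSTRB.8 Q0, [R0], #16 // the top lane is disabled, so the copy never touches anything beyond the 31 bytes requested

After the second iteration there are no elements left, so execution continues at vectLoopEnd without any serial tail loop.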
As we’ve said a few times in this blog series, every last gate must work for its lunch. So now that we’ve got hardware that can cache branch information and eliminate the branch penalty, the question becomes “what else can we do to squeeze performance out of it?” One performance-boosting piece of architecture that’s fallen out of fashion is the branch delay slot (more details can be found here). In theory they’re a nice idea, as they enable the processor to get useful work out of the instructions that have already entered the pipeline when a branch is executed. Reducing the branch penalty can be quite important, especially on small Cortex-M processors that don’t have branch predictors. However, in practice, delay slots have quite a few downsides: the number of slots is fixed by the pipeline of the processor the architecture was first implemented on, effectively baking one microarchitecture into the instruction set; the compiler can’t always find useful instructions to put in the slots, so they often end up padded with NOPs that waste code size; and they complicate interrupt and exception handling, since the processor has to record whether it was part-way through a branch and its delay slots when the exception was taken.
The perfect solution would be a way for the compiler to specify as many (or as few) branch delay instructions as it can identify in the code, and for the processor to be able to make use of as many branch delay slots as its pipeline naturally provides. This sounds like it would be expensive to implement, but there’s a trick. The LE instruction means we already have hardware to eliminate the branch penalty for cached branches. So we can get our perfect solution by adding a Branch Future instruction that populates this cache with the details of an up-and-coming branch. Armv8.1-M has multiple variants of this instruction for different use cases, including Branch Future and Exchange (BFX) for function returns, Branch Future and Link (BFL) for function calls and so on.
add4:
BFX branchPoint, LR // Set up branch to LR just before
ADD R0, R0, R1 // branchPoint is reached, branch
ADD R0, R0, R2 // penalty avoided
ADD R0, R0, R3
branchPoint:
BX LR // Only executed if branch cache flushed
<branch penalty>
The example function above just adds four numbers together and returns the result. Since the branch to the return address (stored in LR) isn’t dependent on any calculations, the BFX instruction can be placed right at the start of the function. This gives the processor a lot of advance warning that a branch is coming, helping it to eliminate as much of the branch penalty as possible. The processor will branch to the return address just before it gets to the branchPoint label, so it doesn’t even need to fetch the “BX LR” instruction that would normally cause the function return. In fact, the “BX LR” is only present as a fallback path in case the cache gets flushed between the BFX and the branchPoint label, such as when an interrupt is taken. Since the position of the branchPoint label is a parameter, we’ve essentially got a microarchitecture-agnostic, variable-length branch delay slot. So we get all the benefits of branch delay slots without the downsides, and it is almost free because we can reuse the LE caching hardware. This is too much of a bargain to turn down.
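The BFL variant follows the same pattern for function calls. As a rough, purely illustrative sketch (the labels, registers and function name here are made up for the example), it might be used like this:

caller:
BFL callPoint, someFunc // set up the call: LR is written and the branch to someFunc
MOV R0, R4 // happens just before callPoint is reached, so these
MOV R1, R5 // argument moves hide the branch penalty
callPoint:
BL someFunc // fallback, only executed if the branch cache was flushed

Just as with the BFX example, the BL is only there as a fallback path; in the common case the processor has already branched to someFunc by the time callPoint would have been fetched.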
Due to the high-performance goals and tight area / interrupt latency constraints we faced, designing Helium was like trying to design a multi-dimensional jigsaw puzzle, where the shape of half of the pieces is already fixed. Seemingly unrelated parts of the architecture can interact to produce either unexpected benefits or interesting problems to be solved. I hope that through this blog series you’ve got a sense of how we addressed some of the issues that we faced, and how the Helium puzzle fits together as a result. If you haven’t already read them, the previous blogs can be found below.
Part 1 - Making Helium: Why not just add Neon?
Part 2 - Making Helium: Sudoku, registers and rabbits
Part 3 - Making Helium: Going around in circles
I think I can speak for the rest of the Helium Research team when I say we look forward to seeing the new classes of applications that will emerge and be enabled by Helium.
To learn more about the new features of Arm Helium technology in the Armv8.1-M architecture, download the white paper below.
Download white paper
This new book is the ideal gateway into Arm’s Helium technology, including both theoretical and practical sections that introduce the technology at an accessible level.
Download the free eBook