Porting SSE to Neon: Are libraries the way forward?

July 1, 2021

Games and applications are increasingly moving to devices that support Arm Neon. Some games do not want to miss out on mobile, since it is the biggest market segment; other developers have found that the Windows on Arm ecosystem is now mature enough to make the move to native. Many have made the move already: PC and console games such as Grand Theft Auto, Fortnite, and Brawlhalla have shifted to mobile, and PC and laptop apps such as Photoshop, Zoom, and Visual Studio Code support Arm Neon.

When you are looking to move to Arm Neon, a lot of code in the app or game will just need recompiling with the new target. If you need it, there is help on framework support, but what about that hand-written Intel intrinsics code?

Figure 1: Intel Memory to intrinsics Register Alignment

Figure 2: Example intrinsics instruction - Multiply Accumulate (MLA)

As a reminder, intrinsics are an easy way to write specialized SIMD (Single Instruction, Multiple Data) code. They provide the same functions as assembly instructions, but directly in C/C++, without you having to worry about which particular registers are used, the ABI, and so on.

Arm has its own intrinsics – Neon – but intrinsics code takes time, effort, and plenty of thought, especially if it is a new suite of commands you are unfamiliar with. Is there an easier way?

The good news is that there is an easier way, with either of two great open-source libraries for converting Intel’s SSE (Streaming SIMD Extensions) intrinsics to Neon. The first, and older, library is called precisely that – SSE2Neon – because that is what it does: it converts SSE to Neon. The other library – SIMDe, or “SIMD Everywhere” – aims to make as much SIMD code translatable to as many different architectures as possible. Blender, OGRE, and FoundationDB are among the many products that have used one of these libraries to port to Arm Neon.

The bad news? Well, you wrote intrinsics because those bits of the code needed great performance. So you need to decide how much it is worth writing code by hand to improve efficiency versus the easy porting and maintenance of the libraries. This blog is here to help you work out that balance.

The practicals

We are porting to the 64-bit Armv8-A architecture, which current processors use, and we presume a straight 128-bit SSE to 128-bit Neon port for simplicity. Armv9-A should be very similar for this Neon use case. Although not looked at specifically here, the advice is largely still sound if you are converting 256-bit AVX code. The libraries have to split everything in two to fit 128-bit Neon, but that would need doing if you were converting by hand too.
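As a rough sketch of what that splitting looks like, a 256-bit vector of eight floats can be carried as two 128-bit Neon registers, with each operation applied to both halves. The `vec8f` type and its helper functions are hypothetical names for illustration and are not taken from either library; the non-Arm branch exists only so the sketch builds on other targets.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Eight floats held as two 128-bit Neon registers. */
typedef struct { float32x4_t lo, hi; } vec8f;

static inline vec8f vec8f_load(const float *p)
{
    vec8f r;
    r.lo = vld1q_f32(p);      /* elements 0..3 */
    r.hi = vld1q_f32(p + 4);  /* elements 4..7 */
    return r;
}

static inline vec8f vec8f_add(vec8f a, vec8f b)
{
    vec8f r;
    r.lo = vaddq_f32(a.lo, b.lo);  /* one 256-bit add becomes two 128-bit adds */
    r.hi = vaddq_f32(a.hi, b.hi);
    return r;
}

static inline void vec8f_store(float *p, vec8f v)
{
    vst1q_f32(p, v.lo);
    vst1q_f32(p + 4, v.hi);
}
#else
/* Plain C stand-in so the sketch compiles on targets without Neon. */
typedef struct { float lane[8]; } vec8f;

static inline vec8f vec8f_load(const float *p)
{
    vec8f r;
    for (int i = 0; i < 8; ++i) r.lane[i] = p[i];
    return r;
}

static inline vec8f vec8f_add(vec8f a, vec8f b)
{
    vec8f r;
    for (int i = 0; i < 8; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

static inline void vec8f_store(float *p, vec8f v)
{
    for (int i = 0; i < 8; ++i) p[i] = v.lane[i];
}
#endif
```

Whether the split is done by a library or by hand, the work per operation doubles in the same way, which is why the advice in this blog carries over.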

I will start with the conclusion: unless your project’s use of intrinsics is very small, it makes sense to start your port with SSE2Neon or SIMDe. This makes sense because it simplifies the task of porting, regardless of whether you end up replacing library code with hand-written Neon.

The libraries allow you to replace as much or as little as you like with a true native port. So using a library means you can still try and do better than the library does. Indeed, you could decide to replace the library entirely in the end, and it would still make sense to use it in the porting process. The libraries allow you to quickly get your project compiling and working on Neon, even if you still want to improve performance after that. You do not need to rewrite all your SSE intrinsics code before anything works, and you have a functionally correct port working before you begin your optimization.

The libraries provide as good a transliteration as possible, and a sizeable community contributes to keep each mapping as good as it can be without knowing why you are calling those intrinsics. However, when you are doing a port you know the algorithm that calls the intrinsics, so you can adjust the surrounding code to suit Neon and get much bigger improvements. Knowing your code and what you want to achieve, you may also find Neon intrinsics that have no SSE counterpart. For example, later Neon implementations have complex number support, so some functions can be done with one intrinsic call instead of the many needed in an SSE function and its direct Neon translation. There are also areas like 8-bit support where many functions can be done directly with Neon calls, rather than in more complicated ways with SSE (if they can be done at all). Later Neon and later SSE implementations each have their own dot product intrinsics, but the libraries do not map them to each other and instead implement many multiply accumulates, so code using the more specialized recent calls may be improved.
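To illustrate what adjusting the surrounding code can mean, here is a minimal sketch of a dot product over two float arrays. It assumes AArch64 Neon, and `dot_product` is a hypothetical function name. Rather than reducing each group of four products to a single number inside the loop, as a call-by-call translation of a per-iteration horizontal operation would, it keeps four running sums in one register and reduces them once at the end. The scalar branch is there only so the sketch builds on other targets.

```c
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Dot product of two float arrays; n is assumed to be a multiple of 4. */
float dot_product(const float *a, const float *b, size_t n)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Four running sums in one register: one fused multiply-add per
       four elements, with no horizontal work inside the loop. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (size_t i = 0; i < n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    /* Single add across the four lanes, done once at the end. */
    return vaddvq_f32(acc);
#else
    /* Portable fallback for targets without AArch64 Neon. */
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
#endif
}
```

Note that the Neon path adds the products in a different order from the scalar loop, so results for general floating-point data can differ in the last bits.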

Generally, the best first step for your SSE to Neon port is to run Arm’s Porting Advisor tool. This will give you a check of any portability issues, including how many SSE intrinsics there are and where, along with a summary of the scope of the porting effort required. With the SSE intrinsics likely to be the bulk of the effort, you will be able to assess the best route forward. Hopefully, there is not much else to clear up and you will quickly cover the rest.

If the intrinsics use of your project is very small, it may make sense to do it by hand. If it is only a couple of small functions, then it is no hassle to make SSE and Neon variants, and you are able to consider the intrinsics available and form your algorithm around them for efficiency.

But if your project is not that special case and you decide to use a library – which one should you use?

SIMDe vs SSE2Neon

Both libraries work in the same way: the Intel #include for intrinsics is replaced by the library #include. In the implementation, the libraries then detect which intrinsics are available from the compiler. There are three outcomes of this detection:

If it is Intel, then it falls straight through to the original implementation.
If it is Arm, it converts to Neon.
If it is neither, then it uses a non-intrinsics implementation.
For SSE2Neon, the include changes are all that is needed.

For SIMDe, there is a define to avoid code changes beyond the different includes, but it is recommended to instead make code changes by adding SIMDe prefixes to the SSE functions for clarity.
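The three-way detection can be pictured with a small sketch built on compiler-defined macros. This is not code from either library: the `vec4f` type and the `vec4f_load`, `vec4f_add`, `vec4f_store`, and `add4` names are hypothetical, and the real libraries cover far more intrinsics and use more thorough feature checks.

```c
#if defined(__SSE__) || defined(_M_X64)
/* Outcome 1: Intel target, use the original intrinsics directly. */
#include <xmmintrin.h>
typedef __m128 vec4f;
static inline vec4f vec4f_load(const float *p) { return _mm_loadu_ps(p); }
static inline vec4f vec4f_add(vec4f a, vec4f b) { return _mm_add_ps(a, b); }
static inline void vec4f_store(float *p, vec4f v) { _mm_storeu_ps(p, v); }

#elif defined(__ARM_NEON)
/* Outcome 2: Arm target, express the same operations with Neon. */
#include <arm_neon.h>
typedef float32x4_t vec4f;
static inline vec4f vec4f_load(const float *p) { return vld1q_f32(p); }
static inline vec4f vec4f_add(vec4f a, vec4f b) { return vaddq_f32(a, b); }
static inline void vec4f_store(float *p, vec4f v) { vst1q_f32(p, v); }

#else
/* Outcome 3: neither, fall back to a plain C implementation. */
typedef struct { float lane[4]; } vec4f;
static inline vec4f vec4f_load(const float *p)
{
    vec4f r;
    for (int i = 0; i < 4; ++i) r.lane[i] = p[i];
    return r;
}
static inline vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
static inline void vec4f_store(float *p, vec4f v)
{
    for (int i = 0; i < 4; ++i) p[i] = v.lane[i];
}
#endif

/* Calling code is written once against the wrapper names. */
void add4(const float *a, const float *b, float *out)
{
    vec4f_store(out, vec4f_add(vec4f_load(a), vec4f_load(b)));
}
```

The calling code does not change between targets; only the header decides which implementation it gets.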

SSE2Neon supports MMX and SSE, but if you have AVX code you need to use SIMDe. SIMDe supports AVX/AVX2, but only has partial support for AVX512.

For making a choice beyond AVX support, SSE2Neon’s selling point is its simplicity. If you are only ever considering this one-way port, it is probably the way to go, rather than having all the options that SIMDe gives you.

SIMDe though offers you a way to cover your bases for further ports, and potentially for future technology changes. If you wish to support WebAssembly, or implement some feature in Neon and have it automatically work back on SSE, these options are covered. SIMDe intends to expand over time, so as new SIMD technologies are released from both Arm and Intel, you will be able to cover those too. For instance, SVE2 is an improvement on Neon in Arm’s v9 architecture that was announced recently.

They are both open-source, so have benefited from numerous contributions, as people have covered all the functions they needed, and improved them if they have found more efficient ways. SSE2Neon is older, and its code was initially used for SIMDe to cover the SSE to Neon use case, so the implementations are near identical. Presently both are maintained, so there does not seem to be a performance reason to choose one over the other.

After the initial port

So you have ported with the library and have a functionally correct version of your code running on Arm. Are you done?

If you are happy with the performance, then yes, you are done, but you may well not be. You implemented intrinsics because performance was critical in those pieces of code, so where there is no perfect mapping from SSE to Neon, that code may need improvement.

There are some intrinsics that do enjoy a perfect, or near perfect, mapping. If you have passages of code with these, there may be zero performance improvement from any time spent on them. So, where best to spend our effort?

The obvious answer is to profile – the Arm device may have different bottlenecks to an Intel one anyway, so there may be different pieces of code that need improvement. But for the intrinsics, we can give some pointers as to which pieces of code are most likely to warrant a closer look.

Most basic arithmetic and logic functions are either a perfect mapping, or near to perfect. The reinterpret (cast) intrinsics are no-cost compiler directives, and the load, set, and store intrinsics also map pretty well with their library translations. This means that simple math will often not need any additional work. So a 2D distance calculation can probably be left to the library translation:

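#include <stddef.h>     // size_t
#include <xmmintrin.h>  // SSE intrinsics; replaced by the library header when porting

// Writes sqrt(x*x + y*y) for each pair of inputs, four floats per iteration.
// Assumes size is a multiple of 4 and that all three arrays are 16-byte aligned,
// as required by _mm_load_ps and _mm_store_ps.
void distances(float* xDists, float* yDists, float* results, int size)
{
    __m128 Xs, Ys, m1, m2, m3, res;
    for (size_t index = 0; index < size; index += 4)
    {
        Xs = _mm_load_ps(xDists + index);    // load four X values
        Ys = _mm_load_ps(yDists + index);    // load four Y values
        m1 = _mm_mul_ps(Xs, Xs);             // X squared
        m2 = _mm_mul_ps(Ys, Ys);             // Y squared
        m3 = _mm_add_ps(m1, m2);             // X squared + Y squared
        res = _mm_sqrt_ps(m3);               // square root of the sum
        _mm_store_ps(results + index, res);  // store four distances
    }
}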

Figure 3: SSE for vectorized 2D distance calculations

distances(float*, float*, float*, int):
        sxtw    x4, w3
        cbz     w3, .L1
        sub     x3, x4, #1
        add     x4, x0, 16
        lsr     x3, x3, 2
        add     x3, x4, x3, lsl 4
.L3:
        ldr     q0, [x1], 16
        ldr     q1, [x0], 16
        fmul    v0.4s, v0.4s, v0.4s
        fmla    v0.4s, v1.4s, v1.4s
        fsqrt   v0.4s, v0.4s
        str     q0, [x2], 16
        cmp     x0, x3
        bne     .L3
.L1:
        ret

Figure 4: GCC compilation of SIMDe Neon translation on Godbolt
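For comparison, a hand-written version of the same loop might look like the sketch below. It assumes AArch64 Neon, and `distances_native` is a hypothetical name. The Neon branch uses one load per input, a multiply, a fused multiply-add, a square root, and a store, which is essentially the instruction sequence the compiler already produced from the library translation, so there is little to gain from rewriting code like this. The SSE and scalar branches are included only so the sketch builds on other targets.

```c
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#else
#include <math.h>
#endif

/* results[i] = sqrt(x[i]^2 + y[i]^2); size is assumed to be a multiple of 4. */
void distances_native(const float *xDists, const float *yDists,
                      float *results, size_t size)
{
    for (size_t index = 0; index < size; index += 4) {
#if defined(__aarch64__) && defined(__ARM_NEON)
        float32x4_t xs = vld1q_f32(xDists + index);
        float32x4_t ys = vld1q_f32(yDists + index);
        float32x4_t sum = vmulq_f32(xs, xs);   /* x*x */
        sum = vfmaq_f32(sum, ys, ys);          /* x*x + y*y in one instruction */
        vst1q_f32(results + index, vsqrtq_f32(sum));
#elif defined(__SSE__)
        /* Unaligned loads and stores, so callers need no special alignment. */
        __m128 xs = _mm_loadu_ps(xDists + index);
        __m128 ys = _mm_loadu_ps(yDists + index);
        __m128 sum = _mm_add_ps(_mm_mul_ps(xs, xs), _mm_mul_ps(ys, ys));
        _mm_storeu_ps(results + index, _mm_sqrt_ps(sum));
#else
        for (size_t k = index; k < index + 4; ++k) {
            results[k] = sqrtf(xDists[k] * xDists[k] + yDists[k] * yDists[k]);
        }
#endif
    }
}
```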

The loads and stores are not quite perfect partly because the SSE ordering of the vector within the SIMD type is the reverse of Neon, but this fundamental difference causes fewer issues than one might expect.

To go through some categories of functions – and there are individual exceptions within categories – we can create this list:

As noted above, basic arithmetic and logic functions are either a perfect mapping or near to it. Loads, sets, and stores mostly map pretty well.
A lot of more complicated maths functions (sqrt, avg, min, max) map well from SSE to Neon, although absolute value is usually not so great (but occasionally perfect).
Bit shifts are an area where you may be more likely to get a gain by writing some native Neon, along with negations and aligns.
Converts, inserts, and extracts generally map pretty well, so should be low priority for any specialized Neon code.
Compares mostly map well, although gains may be possible on specific compares to NaN, 0 or 1 where Intel has specialized functions.
Moves mostly map well, but movemasks mostly do not.
Higher precision division and sqrt make the mapping less efficient – do you need the precision?
“Pairwise” functions (horizontal operations where the numbers within one intrinsics vector are used together) can be less efficient, especially subtractions, so have potential for improvement by specialized Neon code.
Very specialized SSE functions like _mm_dp_ps and _mm_minpos_epu16 are worst cases where a different algorithm for Neon is likely to have significant improvement.
Crypto functions and bit tests similarly may well be worth looking at whether specialized Neon code should be implemented.
Blends, shuffles, and rounding have more potential for gain than arithmetic-type functions.
Narrowing and Widening similarly are not as bad as some, but still have some potential for gain with hand-written code.
Recent function additions to Neon or SSE are less likely to be well mapped yet – but may be an opportunity to contribute to the open-source libraries.
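As one illustration from the list above, movemasks are often used only to ask whether any lane passed a compare. The sketch below assumes that is the intent of the original code and that the target is AArch64; `any_less_than` is a hypothetical helper name. On Neon the question can be answered with a maximum across lanes instead of building the bitmask that `_mm_movemask_ps` returns. The SSE and scalar branches are there so the sketch builds on other targets.

```c
#include <stdbool.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Returns true if a[i] < b[i] for any of the four lanes. */
bool any_less_than(const float *a, const float *b)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Each lane of the compare result is all ones or all zeros, so the
       maximum across lanes is nonzero exactly when some lane matched. */
    uint32x4_t lt = vcltq_f32(vld1q_f32(a), vld1q_f32(b));
    return vmaxvq_u32(lt) != 0;
#elif defined(__SSE__)
    /* Original SSE idiom: gather the sign bits into a 4-bit mask. */
    __m128 lt = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return _mm_movemask_ps(lt) != 0;
#else
    for (int i = 0; i < 4; ++i) {
        if (a[i] < b[i]) return true;
    }
    return false;
#endif
}
```

If the calling code needs the individual mask bits rather than a yes or no answer, this shortcut does not apply and the library translation or a different approach is needed.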

How much you replace will depend not just on efficiency, but also on the time available to port and the cost of maintenance. Having the SSE functions just “run” on Neon adds virtually no maintenance overhead, whereas with specialized code any change needs to be implemented and tested twice. Those decisions, and how much specialized code to write, will need to be made on a project-by-project basis.

Conclusion

I gave away the answer early on, but for any projects using more than a handful of intrinsics, the use of one of these two open-source libraries – SSE2Neon and SIMDe – is the way to go when porting. After the initial port, how much further work to do – from nothing, to the complete replacement of the library, and anything in-between – is a decision to be made per project. There is a balance to be struck between how efficient and fast the code has to be, versus how much time is available for the port and how much maintenance overhead is acceptable.

Once you have decided how far to go with specializing the port, hopefully this blog helps you to know where the bigger gains are likely to be made. The most common maths functions, and inserting and removing data from intrinsics vectors, work well with the libraries, so other areas are more useful targets for separate specialized code. Although the transliteration between intrinsics is as good as possible in the libraries, there is potential for performance gains where code is written with an algorithm that suits the intrinsics available in Neon. The biggest potential improvements will be where Neon has intrinsics that SSE does not.

To write that specialized Neon code you will need to know the Neon intrinsics, so please look at Arm’s other Neon resources. Optimizing with C/C++ intrinsics is a good starting point, and the complete searchable list of intrinsics is very useful. If you wish to go back to basics, there is an introduction to the concepts of Neon in general, and for the more advanced, the same principles apply in the Neon assembly documentation. Lastly, you can also find an article on how best to let the compiler auto-vectorize without intrinsics. Happy porting!
