Games and applications are increasingly looking at moving to devices that support Arm Neon: games that do not want to miss out on mobile, the biggest market segment, and developers who have found that the Windows on Arm ecosystem is now mature enough to make the move to native. Many have made the move already. PC and console games such as Grand Theft Auto, Fortnite, and Brawlhalla have shifted to mobile, and PC and laptop apps including Photoshop, Zoom, and Visual Studio Code support Arm Neon.
When you move to Arm Neon, much of the code in your app or game will just need recompiling for the new target. There is help available on framework support if you need it, but what about that hand-written Intel intrinsics code?
Figure 1: Intel Memory to intrinsics Register Alignment
Figure 2: Example intrinsics instruction - Multiply Accumulate (MLA)
As a reminder, intrinsics are an easy way to write specialized SIMD (Single Instruction, Multiple Data) instructions. They provide the same functionality as assembly instructions, but directly in C/C++, without having to worry about which particular registers are used, the ABI, and so on.
Arm has its own intrinsics – Neon – but intrinsics code takes time, effort, and plenty of thought, especially if it is a new suite of commands you are unfamiliar with. Is there an easier way?
The good news is: there is an easier way – with either of two great open-source libraries for converting Intel’s SSE (Streaming SIMD Extensions) intrinsics to Neon. The first, and older, library is called precisely that – SSE2Neon – because that is what it does: converts SSE to Neon. The other library – SIMDe, or “SIMD Everywhere” – aims to make as much SIMD code translatable to as many different architectures as possible. Blender, OGRE, and FoundationDB are among the many products that have used one of these libraries to port to Arm Neon.
The bad news? Well, you wrote intrinsics because those bits of the code needed great performance. So you need to decide how much it is worth writing code by hand to improve efficiency versus the easy porting and maintenance of the libraries. This blog is here to help you work out that balance.
We are porting to the 64-bit Armv8-A architecture, which current processors use, and we assume a straight 128-bit SSE to 128-bit Neon port for simplicity. Armv9-A should be very similar for this Neon use case. Although not covered specifically, the advice largely still holds if you are converting 256-bit code (AVX): the library has to split everything in two to fit 128-bit Neon, but that split would be needed if you were converting by hand too.
I start with the conclusion: unless your project’s use of intrinsics is very small, it makes sense to start your port with one of these libraries. This simplifies the task of porting, regardless of whether you end up replacing the library code with hand-written Neon.
The libraries allow you to replace as much or as little as you like with a true native port. So using a library means you can still try and do better than the library does. Indeed, you could decide to replace the library entirely in the end, and it would still make sense to use it in the porting process. The libraries allow you to quickly get your project compiling and working on Neon, even if you still want to improve performance after that. You do not need to rewrite all your SSE intrinsics code before anything works, and you have a functionally correct port working before you begin your optimization.
The libraries provide as good a transliteration as possible, with a sizeable community contributing to make each mapping as efficient as it can be without knowing why you are calling those intrinsics. When you are doing a port, however, you know the algorithm behind the intrinsics, and you can adjust the surrounding code to suit Neon and get much bigger improvements. Knowing your code and what you want to achieve, there may also be Neon intrinsics available that have no SSE equivalent. For example, later Neon implementations have complex number support, so work that takes many intrinsic calls in an SSE function – and in its direct Neon translation – can be done with a single call. There are also areas like 8-bit support where many operations can be done directly with Neon calls, rather than in more complicated ways with SSE (if they can be done at all). Later Neon implementations and later SSE implementations each have dot product intrinsics, but the libraries do not map them to each other, instead emitting many multiply-accumulates – so the more specialized, recent calls may be worth improving by hand.
Generally, the best first step for your SSE to Neon port is to run with one of the libraries.
If the intrinsics use of your project is very small, it may make sense to port by hand. If it is only a couple of small functions, it is no hassle to maintain SSE and Neon variants, and you can consider the intrinsics available and shape your algorithm around them for efficiency.
But if your project is not that special case and you decide to use a library – which one should you use?
Both libraries work in the same way: the Intel #include for intrinsics is replaced by the library #include. Then, in the implementation, the libraries detect which intrinsics the compiler makes available. There are three outcomes of this detection: the native instruction is available and used directly; or a translation to the target architecture’s intrinsics is used; or, where no good mapping exists, a portable plain-C fallback is used.
For SIMDe, there is a define to avoid code changes beyond the different includes, but it is recommended instead to make code changes, adding SIMDe prefixes to the SSE functions for clarity.
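As a sketch of what those code changes look like – the include paths and names below follow each project’s public headers (sse2neon.h; SIMDe’s simde/x86/ headers, simde_ prefixes, and its SIMDE_ENABLE_NATIVE_ALIASES define), but check the libraries’ documentation for your version:

```c
/* Option 1 - SSE2Neon: swap the Intel header for the library header;
   existing _mm_* calls then compile unchanged on Arm. */
/* #include <xmmintrin.h> */
#include "sse2neon.h"

/* Option 2 - SIMDe with explicit prefixes (the recommended style): */
#include <simde/x86/sse2.h>
simde__m128 add4(simde__m128 a, simde__m128 b)
{
    return simde_mm_add_ps(a, b);   /* was _mm_add_ps(a, b) */
}

/* Option 3 - SIMDe keeping the original _mm_* names via its alias define: */
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/sse2.h>
```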
SSE2Neon supports MMX and SSE, but if you have AVX code you need to use SIMDe. SIMDe supports AVX/AVX2, but only has partial support for AVX512.
Beyond AVX support, SSE2Neon’s selling point is its simplicity. If you are only ever considering this one-way port, it is probably the way to go, rather than taking on all the options that SIMDe gives you.
SIMDe, though, offers you a way to cover your bases for further ports, and potentially for future technology changes. If you wish to support WebAssembly, or implement some feature in Neon and have it automatically work back on SSE, these options are covered. SIMDe intends to expand over time, so as new SIMD technologies are released from both Arm and Intel, you will be able to cover those too. For instance, SVE2, recently announced in the Armv9 architecture, is an improvement on Neon.
They are both open-source, so have benefited from numerous contributions, as people have covered all the functions they needed, and improved them if they have found more efficient ways. SSE2Neon is older, and its code was initially used for SIMDe to cover the SSE to Neon use case, so the implementations are near identical. Presently both are maintained, so there does not seem to be a performance reason to choose one over the other.
So, you have ported with the library and have a functionally correct version of your code running on Arm – are you done?
If you are happy with the performance, then yes, you are done – but you may not be. You implemented intrinsics because performance was critical in those pieces of code, so where there is no perfect mapping from SSE to Neon, that code may need improvement.
There are some intrinsics that do enjoy a perfect, or near perfect, mapping. If you have passages of code built from these, there may be zero performance improvement from any time spent on them. So, where best to spend your effort?
The obvious answer is to profile – the Arm device may have different bottlenecks to an Intel one anyway, so there may be different pieces of code that need improvement. But for the intrinsics, we can give some pointers as to which pieces of code are most likely to warrant a closer look.
Most basic arithmetic and logic functions have a perfect, or near perfect, mapping. Along with the reinterpret intrinsics – which are no-cost compiler directives – the load, set, and store intrinsics also map well in their library translations. This means simple math will often need no additional work, so a 2D distance calculation can probably be left to the library translation:
void distances(float* xDists, float* yDists, float* results, int size)
{
    __m128 Xs, Ys, m1, m2, m3, res;
    for (int index = 0; index < size; index += 4)
    {
        Xs = _mm_load_ps(xDists + index);
        Ys = _mm_load_ps(yDists + index);
        m1 = _mm_mul_ps(Xs, Xs);
        m2 = _mm_mul_ps(Ys, Ys);
        m3 = _mm_add_ps(m1, m2);
        res = _mm_sqrt_ps(m3);
        _mm_store_ps(results + index, res);
    }
}
Figure 3: SSE for vectorized 2D distance calculations
distances(float*, float*, float*, int):
sxtw x4, w3
cbz w3, .L1
sub x3, x4, #1
add x4, x0, 16
lsr x3, x3, 2
add x3, x4, x3, lsl 4
ldr q0, [x1], 16
ldr q1, [x0], 16
fmul v0.4s, v0.4s, v0.4s
fmla v0.4s, v1.4s, v1.4s
fsqrt v0.4s, v0.4s
str q0, [x2], 16
cmp x0, x3
Figure 4: GCC compilation of SIMDe Neon translation on Godbolt
The loads and stores are not quite perfect partly because the SSE ordering of the vector within the SIMD type is the reverse of Neon, but this fundamental difference causes fewer issues than one might expect.
To go through some categories of functions – with individual exceptions within each category – we can summarize:
- Basic arithmetic and logic functions: perfect or near perfect mappings, so little to gain from hand-porting.
- Reinterpret, load, set, and store intrinsics: no-cost or well mapped, and usually fine as translated.
- Dot products: the libraries translate these as many multiply-accumulates rather than using the newer dot product intrinsics, so they are worth a closer look.
- Complex number and 8-bit operations: later Neon implementations have dedicated intrinsics with no SSE equivalent, so there is potential for big gains from hand-written Neon.
How much you replace will depend not just on efficiency, but on the time available to port and the cost of maintenance. Having the SSE functions just “run” on Neon requires virtually no overhead, whereas specialized code means any change needs implementing and testing twice. Those decisions, and how much specialized code to write, will need to be made on a project-by-project basis.
I gave away the answer early on, but for any projects using more than a handful of intrinsics, the use of one of these two open-source libraries – SSE2Neon and SIMDe – is the way to go when porting. After the initial port, how much further work to do – from nothing, to the complete replacement of the library, and anything in-between – is a decision to be made per project. There is a balance to be struck between how efficient and fast the code has to be, versus how much time is available for the port and how much maintenance overhead is acceptable.
Once a decision is made on how far to go with specializing the port, hopefully this blog helps you to know where the bigger gains are likely to be made. The most common math functions, and inserting and removing data from intrinsics, work well with the libraries, so other areas are more useful targets for separate specialized code. Although the libraries’ transliteration between intrinsics is as good as possible, code rewritten around an algorithm that suits the intrinsics available in Neon has real potential for performance gains – and where Neon has intrinsics that SSE does not, the potential improvements are biggest.
To be able to write that specialized Neon code you will need to know the Neon intrinsics, so please look at Arm’s other Neon resources. Optimizing with C/C++ intrinsics is a good starting point, and the complete searchable list of intrinsics is very useful. If you wish to go back to basics, there is an introduction to the concepts of Neon in general, and for the more advanced the principles are the same in the Neon assembly documentation. And lastly, you can also find an article on how best to let the compiler auto-vectorize without intrinsics. Happy porting!