With the arrival of Windows RT and the opening of the Windows Store, you can now develop Windows Store apps on ARM platforms and make them available to 200+ markets. If you are an Android or ARM Linux developer, you are probably already using ARM® NEON™ to optimize your applications, or benefiting from NEON-optimized libraries. You can use NEON to speed up your Windows RT applications as well.
NEON is a wide SIMD data processing architecture extension introduced in the ARMv7 architecture. It performs "Packed SIMD" processing, in which one instruction operates on several data elements packed into a single register, and it can be used to optimize multimedia codec algorithms, 2D/3D graphics libraries and other data processing applications. NEON has proven very popular in both open-source projects and proprietary applications. The WebM Multimedia project and Android's Skia library are good examples of software libraries that use NEON instructions.
Windows RT also uses NEON for optimization. The Microsoft Visual C++ compiler supports NEON intrinsics, with an implementation close to that of the ARM RVCT 4.1 compiler. You get access to the NEON intrinsics by including the arm_neon.h header file, just as you would for Linux/Android development. Refer to MSDN for more details. The SIMD C++ math library (DirectXMath.h) is implemented using NEON intrinsics and serves as a good reference.
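To make the "include arm_neon.h and call intrinsics" step concrete, here is a minimal sketch that adds two arrays of 16-bit integers, eight lanes at a time. The function name add_i16 and the SKETCH_HAVE_NEON macro are made up for this illustration, and the scalar fallback is there only so the same file also builds on targets without NEON.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// NEON intrinsics are available when targeting ARM with Visual C++ (_M_ARM)
// or when GCC/Clang define __ARM_NEON. SKETCH_HAVE_NEON is a name invented
// for this sketch, not something the compiler provides.
#if defined(_M_ARM) || defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

// Adds two int16 arrays element by element (hypothetical helper).
void add_i16(const int16_t* a, const int16_t* b, int16_t* out, std::size_t n)
{
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if SKETCH_HAVE_NEON
    // Process eight 16-bit lanes per iteration in 128-bit Q registers.
    for (; i + 8 <= n; i += 8) {
        int16x8_t va = vld1q_s16(a + i);
        int16x8_t vb = vld1q_s16(b + i);
        vst1q_s16(out + i, vaddq_s16(va, vb));
    }
#endif
    // Scalar loop: handles the tail, or everything when NEON is unavailable.
    for (; i < n; ++i) {
        out[i] = static_cast<int16_t>(a[i] + b[i]);
    }
}
```

The scalar tail loop matters in practice: array lengths are rarely a multiple of the vector width, so the leftover elements need a non-vector path.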
As an example, I decided to port the HelloNEON program from the Android NDK to see how easy it is to use NEON intrinsics on Windows RT. The HelloNEON program has two advantages. It is small and nicely written, so it is easy to understand and to modify if needed. It also offers both C and NEON implementations of the same routine, so I can easily show the benefit of NEON optimization.
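For readers who have not seen HelloNEON, the comparison it makes is between a plain C FIR filter and a NEON-intrinsics version of the same filter. The following is a simplified sketch of that pairing, not the NDK sample's actual source: the function names, the FIR_HAVE_NEON macro, and the assumption that the input holds width + ksize - 1 samples are choices made for this illustration.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_M_ARM) || defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define FIR_HAVE_NEON 1   // macro name invented for this sketch
#else
#define FIR_HAVE_NEON 0
#endif

// Scalar reference. `in` must hold width + ksize - 1 samples.
void fir_filter_c(int16_t* out, const int16_t* in, const int16_t* kernel,
                  int width, int ksize)
{
    for (int n = 0; n < width; ++n) {
        int32_t sum = 0;
        for (int m = 0; m < ksize; ++m) {
            sum += kernel[m] * in[n + m];
        }
        // Round and scale the 32-bit accumulator back to 16 bits.
        out[n] = static_cast<int16_t>((sum + 0x8000) >> 16);
    }
}

// NEON-intrinsics variant; falls back to the scalar code without NEON.
void fir_filter_neon(int16_t* out, const int16_t* in, const int16_t* kernel,
                     int width, int ksize)
{
    assert(ksize % 4 == 0);  // this sketch handles four taps per step
#if FIR_HAVE_NEON
    for (int n = 0; n < width; ++n) {
        int32x4_t acc = vdupq_n_s32(0);
        for (int m = 0; m < ksize; m += 4) {
            int16x4_t k = vld1_s16(kernel + m);
            int16x4_t x = vld1_s16(in + n + m);
            acc = vmlal_s16(acc, k, x);  // widening multiply-accumulate
        }
        // Horizontal add of the four 32-bit partial sums.
        int32x2_t s = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
        s = vpadd_s32(s, s);
        int32_t sum = vget_lane_s32(s, 0);
        out[n] = static_cast<int16_t>((sum + 0x8000) >> 16);
    }
#else
    fir_filter_c(out, in, kernel, width, ksize);
#endif
}
```

Keeping the scalar version next to the intrinsics version is useful beyond benchmarking: it doubles as a correctness reference when the vector code is changed.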
As it turns out, not much work is needed. All I have to do is create a WinRT component project that contains the bulk of the function implementations, replace the Linux system call for getting the timestamp with its Windows equivalent, wrap the main routines as a WinRT component, and finally implement a simple JavaScript-based Windows Store app as a front end for starting the tests.
Rewriting the timestamp function:
Android Version:
Windows Version:
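A minimal sketch of this kind of swap, with both variants behind one function. It assumes the Android side reads the clock with clock_gettime and that the Windows side uses QueryPerformanceCounter; both are assumptions made for illustration, and the article's own listings may have used different calls (GetTickCount64 would be another candidate). The function name now_ms is likewise only a label used here.

```cpp
#include <cassert>

#if defined(_WIN32)
#include <windows.h>
#else
#include <time.h>
#endif

// Returns a timestamp in milliseconds, suitable for measuring intervals.
double now_ms()
{
#if defined(_WIN32)
    // Windows: high-resolution performance counter (assumed choice).
    LARGE_INTEGER freq;
    LARGE_INTEGER count;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&count);
    assert(freq.QuadPart > 0);
    return 1000.0 * static_cast<double>(count.QuadPart)
                  / static_cast<double>(freq.QuadPart);
#else
    // Linux/Android: POSIX clock (assumed choice of CLOCK_MONOTONIC).
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return 1000.0 * static_cast<double>(ts.tv_sec)
         + static_cast<double>(ts.tv_nsec) / 1e6;
#endif
}
```

A benchmark would call now_ms() before and after the routine under test and report the difference, so only the interval matters, not the absolute value.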
Wrap the main routines as a WinRT component:
Implement a simple UI with JavaScript/HTML5:
Once the coding is done, set the platform to 'ARM' and the build configuration to 'Release'. You also have to set up remote debugging to run the program on your Windows RT device. In my case, I tested it on my Surface RT tablet.
The result is great: after normalizing the timings, the NEON version is about twice as fast as the C version.
So, without any hardware change, I get roughly a 100% improvement from NEON optimization over the original C implementation. The result will vary depending on your functions or algorithms, but the benefit is clear. It is also worth noting that this is a single-threaded implementation. The Surface RT device uses the NVIDIA® Tegra® T30 chip, which contains a quad-core ARM Cortex™-A9 MPCore CPU. If your function or algorithm can be cleanly divided into independent processing blocks, a multi-threaded implementation will give you further gains.
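As a rough sketch of what "independent processing blocks" can look like in code, the example below splits an array into contiguous chunks and hands each chunk to its own thread. It uses standard C++ std::thread for portability; a Windows Store app might use the Parallel Patterns Library instead, and that choice is not covered here. The names process_block and process_parallel are hypothetical, and the doubling loop stands in for whatever NEON-optimized kernel you actually run.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-block kernel: doubles every sample in place.
// In a real application this would be the NEON-optimized routine.
void process_block(int* data, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        data[i] *= 2;
    }
}

// Splits `data` into up to `num_threads` contiguous blocks and processes
// each block on its own thread. Blocks do not overlap, so no locking is
// needed.
void process_parallel(std::vector<int>& data, unsigned num_threads)
{
    assert(num_threads > 0);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, static_cast<std::size_t>(t) * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) {
            break;  // nothing left to hand out
        }
        workers.emplace_back(process_block, data.data() + begin, end - begin);
    }
    for (std::thread& w : workers) {
        w.join();
    }
}
```

On a quad-core part such as the one in the Surface RT, calling process_parallel(samples, 4) would give each core one block; whether that yields a near-linear speedup depends on memory bandwidth and on how evenly the work divides.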
With NEON intrinsics support in the Microsoft Visual C++ compiler, using NEON to speed up your Windows RT application is as easy as including the relevant header file and setting the right compiler options. With so many applications benefiting from NEON optimization, yours should too. For more information on NEON, check out the ARM online infocenter, where you can also find the NEON programming reference guide.
Is it true that the ARM instruction set is limited when coding for Windows RT? I've heard that you can only use Thumb-2 instructions. Is that the case, and where is this documented, please? Thanks!