Co-Authors: Ravi Malhotra and Jici Gao from Arm, with support from Konstantinos Margaritis from VectorCamp
This blog describes the work done by Arm and partners to accelerate regular expression parsing using vector engines in Arm Neoverse platforms, and its applicability to open-source Deep-packet-inspection applications like Snort.
With the shift of enterprise computing from on-premises to the cloud, it has become critical to ensure the security of data transfers and to detect possible attacks before they happen. To address this concern, traditional network security appliances and VPN gateways have transformed into Unified Threat Management (UTM) systems. These systems analyze streams of data, and how devices use that data, to detect patterns and anomalies. Inspecting every byte of packet payload ('deep-packet') at typical network speeds can be very compute-intensive, and this is where CPU SIMD vector engines can help by analyzing large sets of data in parallel.
Intel developed an open-source regular expression (regex) parsing library called Hyperscan that leveraged its SSE and AVX vector-engines and integrated it with a popular Deep-packet-inspection application, Snort. More info here: https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf.
To create a regex parsing library optimized for Arm platforms, Arm collaborated with VectorCamp, which specializes in software optimization and SIMD vectorization across a range of popular CPU architectures. Together, we created an architecture-inclusive fork of Hyperscan called Vectorscan, which preserves support for x86 and modifies the framework to allow for other architectures and vector engine implementations.
The goal for Vectorscan was to preserve API compatibility with Hyperscan, so that it can be used as a drop-in replacement in applications like Snort. Splitting the SIMD code into a separate library not only made the code portable across multiple architectures, but also reduced code size by one third in the relevant SIMD routines.
Vectorscan currently supports the Neon vector engine, and Arm's focus is to continue optimizing for Arm Neoverse platforms, including newer vector-engine implementations like the Scalable Vector Extension (SVE/SVE2). Compatibility with the Hyperscan project is preserved - algorithms and Intel architecture optimizations were cherry-picked and integrated into Vectorscan to allow Linux distros to bundle and maintain one package instead of two.
At the same time, generic contributions like cache prefetching, grouping of loads and stores, and algorithm optimizations provided performance gains across all architectures. Cleanup and restructuring also make the code easier to debug and profile.
On the Arm architecture, Vectorscan provides a performance uplift of 20-40% over the default regex implementations within Snort. The chart below shows a single-core comparison of Vectorscan against the default regex implementations in Snort on a Neoverse N1-based Ampere® Altra® CPU, using the Arm Neon vector engines within N1. Future vector-engine implementations such as Arm SVE and SVE2 are expected to provide a further uplift.
Figure 1: Performance of Vectorscan versus default Regex Implementations in Snort (in Mbps)
We also compared Vectorscan performance on Arm and alternative systems with:
The Ampere Altra compares well in both single and multi-core performance.
Vectorscan performance scales linearly - all the way up to 80 cores on Ampere Altra, providing an overall socket performance well above 20 Gbps. This allows users to flexibly allocate cores between deep-packet-inspection and other security packet-processing tasks like network-proxy, VPN, IDS/IPS etc. By comparison, with only 24 cores available, the x86 system throughput maxes out at about 10 Gbps.
Figure 2: Ampere Altra vs. x86 Throughput Scaling Comparison
However, a typical UTM appliance must also perform other packet-processing tasks, including firewall, NAPT, tunneling, and encryption, which take up a significant amount of CPU bandwidth. The Ampere Altra uses only 30% of its cores (24 out of 80) to achieve 10 Gbps of Vectorscan performance, compared to close to 100% on an Intel Xeon 8268 (24 out of 24).
Figure 3: System CPU Utilization for 10Gbps Vectorscan
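The utilization figures above are simple ratios, and a short sketch makes the arithmetic explicit. The core counts and the 10 Gbps target are taken from the text. The per-core throughput and the full-socket projection are back-of-the-envelope estimates that assume the linear scaling described earlier; they are not measured results, and the function names are purely illustrative.

```python
# Figures quoted in the text; everything derived from them is an estimate.
TARGET_GBPS = 10.0
ALTRA_TOTAL_CORES = 80
ALTRA_CORES_USED = 24
XEON_TOTAL_CORES = 24
XEON_CORES_USED = 24


def utilization(cores_used: int, total_cores: int) -> float:
    """Fraction of a socket's cores spent on regex scanning."""
    return cores_used / total_cores


def headroom(cores_used: int, total_cores: int) -> int:
    """Cores left over for firewall, NAPT, tunneling, encryption, etc."""
    return total_cores - cores_used


altra_util = utilization(ALTRA_CORES_USED, ALTRA_TOTAL_CORES)
xeon_util = utilization(XEON_CORES_USED, XEON_TOTAL_CORES)
altra_free = headroom(ALTRA_CORES_USED, ALTRA_TOTAL_CORES)
xeon_free = headroom(XEON_CORES_USED, XEON_TOTAL_CORES)

# Assumption: throughput scales linearly with core count, so the per-core
# rate implied by "24 cores for 10 Gbps" can be extrapolated to the socket.
altra_gbps_per_core = TARGET_GBPS / ALTRA_CORES_USED
altra_projected_socket_gbps = altra_gbps_per_core * ALTRA_TOTAL_CORES

print(f"Altra: {altra_util:.0%} of cores used, {altra_free} cores free")
print(f"Xeon:  {xeon_util:.0%} of cores used, {xeon_free} cores free")
print(f"Altra linear projection: {altra_projected_socket_gbps:.1f} Gbps per socket")
```

Under the linear-scaling assumption, the projection lands above the 20 Gbps socket figure quoted earlier, so the two statements are at least consistent with each other. Real deployments will see some deviation from perfect linearity.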
The work done by Arm and partners on Vectorscan provides network security application developers with a regex library that is portable across multiple architectures, is optimized to leverage SIMD acceleration, and preserves legacy compatibility. This enables Arm Neoverse-based platforms to provide deep-packet-inspection performance that is not only comparable with alternatives at a per-core level, but also achieved at significantly lower system-level CPU utilization, leaving headroom for other packet-processing tasks in the same system.
Learn more about Neoverse N1