Co-Authors: Ravi Malhotra and Jici Gao from Arm, with support from Konstantinos Margaritis from VectorCamp
This blog describes the work done by Arm and partners to accelerate regular expression parsing using vector engines in Arm Neoverse platforms, and its applicability to open-source Deep-packet-inspection applications like Snort.
With the shift of enterprise computing from on-premises to the cloud, it has become critical to secure data transfers and to detect possible attacks before they happen. To address this concern, traditional network security appliances and VPN gateways have transformed into Unified Threat Management (UTM) systems. These systems analyze data streams and their usage by devices to detect patterns and anomalies. Inspecting every byte of packet payload ('deep-packet' inspection) at typical network speeds can be very compute intensive, and this is where CPU SIMD vector engines can help by analyzing large sets of data in parallel.
Intel developed an open-source regular expression (regex) parsing library called Hyperscan that leverages its SSE and AVX vector engines, and integrated it with the popular deep-packet-inspection application Snort. More info here: https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf.
To create a regex parsing library optimized for Arm platforms, Arm collaborated with VectorCamp, who specialize in software optimization and SIMD vectorization across a range of popular CPU architectures. Together, we created an architecture-inclusive fork of Hyperscan called Vectorscan (https://github.com/VectorCamp/vectorscan) that preserves support for x86 and modifies the framework to allow for other architectures and vector engine implementations.
The goal for Vectorscan was to preserve API compatibility with Hyperscan, allowing it to be used as a drop-in replacement in applications like Snort. Splitting the SIMD code into a separate library not only enabled portability across multiple architectures, but also reduced code size by one-third in the relevant SIMD routines.
Vectorscan currently supports the Neon vector engine, and Arm's focus is to continue further optimizations on Arm Neoverse platforms, including newer vector-engine implementations like the Scalable Vector Extension (SVE/SVE2). Compatibility with the Hyperscan project is preserved: algorithms and Intel architecture optimizations were cherry-picked and integrated into Vectorscan to allow Linux distros to bundle and maintain one package instead of two.
At the same time, generic contributions like cache prefetching, grouping of loads and stores, and algorithm optimizations provided performance gains across all architectures. Cleanup and restructuring also make the code easier to debug and profile.
On the Arm architecture, Vectorscan provides a performance uplift of 20-40% over the default regex implementations within Snort. The chart below shows a single-core comparison of Vectorscan vs. the default regex implementations in Snort on a Neoverse N1-based Ampere® Altra® CPU, using the Arm Neon vector engines within N1. Future vector-engine implementations like Arm SVE and SVE2 will provide even further uplift.
Figure 1: Performance of Vectorscan versus default Regex Implementations in Snort (in Mbps)
We also compared Vectorscan performance on Arm against alternative systems. The Ampere Altra compares well in both single-core and multi-core performance.
Vectorscan performance scales linearly, all the way up to 80 cores on Ampere Altra, providing an overall socket performance well above 20 Gbps. This allows users to flexibly allocate cores between deep-packet-inspection and other security packet-processing tasks like network proxying, VPN, and IDS/IPS. By comparison, with only 24 cores available, the x86 system throughput maxes out at about 10 Gbps.
Figure 2: Ampere Altra vs. x86 Throughput Scaling Comparison
However, a typical UTM appliance must perform other packet-processing tasks, including firewall, NAPT, tunneling, and encryption, which take up a significant amount of CPU bandwidth. The Ampere Altra utilizes only 30% of its cores (24 out of 80) to achieve 10 Gbps of Vectorscan performance, compared to close to 100% on an Intel Xeon 8268 (24 out of 24).
Figure 3: System CPU Utilization for 10Gbps Vectorscan
The work done by Arm and partners on Vectorscan provides network security application developers with a regex library that is portable across multiple architectures, optimized to leverage SIMD acceleration, and API-compatible with Hyperscan. This enables Arm Neoverse-based platforms to deliver deep-packet-inspection performance that is not only comparable with alternatives at a per-core level, but achieved at significantly lower system-level CPU utilization, leaving headroom for other packet-processing tasks in the same system.
[CTAToken URL = "https://www.arm.com/products/silicon-ip-cpu/neoverse/neoverse-n1" target="_blank" text="Learn more about Neoverse N1" class ="green"]