Accelerating Deep-packet-inspection (DPI) with Neon on Arm Neoverse platforms

October 6, 2021

Co-Authors: Ravi Malhotra and Jici Gao from Arm, with support from Konstantinos Margaritis from VectorCamp

This blog describes the work done by Arm and partners to accelerate regular expression parsing using vector engines in Arm Neoverse platforms, and its applicability to open-source Deep-packet-inspection applications like Snort.

Background

With the shift of enterprise computing from on-premises to the cloud, it has become critical to secure data transfers and to detect possible attacks before they happen. To address this concern, traditional network security appliances and VPN gateways have evolved into Unified Threat Management (UTM) systems. These systems analyze data streams and how devices use them to detect patterns and anomalies. Inspecting every byte of packet payload ('deep-packet') at typical network speeds can be very compute intensive, and this is where CPU SIMD vector engines can help by analyzing large sets of data in parallel.

Intel developed an open-source regular expression (regex) parsing library called Hyperscan that leveraged its SSE and AVX vector-engines and integrated it with a popular Deep-packet-inspection application, Snort. More info here: https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf.

Vectorscan

https://github.com/VectorCamp/vectorscan

To create a regex parsing library optimized for Arm platforms, Arm collaborated with VectorCamp, which specializes in software optimization and SIMD vectorization across a range of popular CPU architectures. Together, we created an architecture-inclusive fork of Hyperscan called Vectorscan, which preserves support for x86 and modifies the framework to allow for other architectures and vector engine implementations.

The goal for Vectorscan was to preserve API compatibility with Hyperscan, so that it can be used as a drop-in replacement in applications like Snort. Splitting the SIMD code into a separate library not only enabled portability across multiple architectures, but also reduced code size by one third in the relevant SIMD routines.

Vectorscan currently supports the Neon vector engine, and Arm's focus is to continue optimizing for Arm Neoverse platforms, including newer vector-engine implementations like the Scalable Vector Extension (SVE/SVE2). Compatibility with the Hyperscan project is preserved: algorithms and Intel architecture optimizations were cherry-picked and integrated into Vectorscan to allow Linux distros to bundle and maintain one package instead of two.

At the same time, generic contributions like cache prefetching, grouping of loads and stores, and algorithm optimizations provided performance gains across all architectures. Cleanup and restructuring also make the code easier to debug and profile.

Performance results

On the Arm architecture, Vectorscan provides a performance uplift of 20-40% over the default regex implementations within Snort. The chart below shows a single-core comparison of Vectorscan vs. the default regex implementations in Snort on a Neoverse N1-based Ampere® Altra® CPU. This uses the Arm Neon vector engines within N1. Future vector-engine implementations like Arm SVE and SVE2 are expected to provide further uplift.

Figure 1: Performance of Vectorscan versus default Regex Implementations in Snort (in Mbps)

We also compared Vectorscan performance on the following Arm and x86 systems:

Ampere Altra with 80 Arm Neoverse N1 cores @ 3.0 GHz
Intel Xeon Platinum 8268 (Cascade Lake) with 24 cores @ 2.9 GHz

The Ampere Altra compares well in both single and multi-core performance.

Vectorscan performance scales linearly all the way up to 80 cores on Ampere Altra, providing an overall socket performance well above 20 Gbps. This allows users to flexibly allocate cores between deep-packet-inspection and other security packet-processing tasks like network proxy, VPN, and IDS/IPS. By comparison, with only 24 cores available, the x86 system throughput maxes out at about 10 Gbps.
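Linear scaling makes core budgeting a simple proportion. The sketch below is a back-of-envelope model, not a measurement: it assumes throughput grows strictly linearly with core count and uses the data point quoted above (about 10 Gbps on 24 cores) as its reference. The function name and the target throughputs are hypothetical.

```python
import math


def cores_needed(target_gbps: float, ref_gbps: float, ref_cores: int) -> int:
    """Cores needed for target_gbps, assuming throughput scales linearly
    from a reference point of ref_gbps measured on ref_cores."""
    if target_gbps < 0 or ref_gbps <= 0 or ref_cores <= 0:
        raise ValueError("throughput and core counts must be positive")
    # Multiply before dividing, then round up to whole cores.
    return math.ceil(target_gbps * ref_cores / ref_gbps)


# Reference point taken from the text: about 10 Gbps on 24 cores.
REF_GBPS, REF_CORES = 10.0, 24

for target in (2.5, 5.0, 10.0):
    needed = cores_needed(target, REF_GBPS, REF_CORES)
    print(f"{target:>4} Gbps -> {needed} cores")
```

Real deployments will deviate from this model once memory bandwidth, NIC queues, or other shared resources become the bottleneck, so treat the output as a starting estimate only.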

Figure 2: Ampere Altra vs. x86 Throughput Scaling Comparison

However, a typical UTM appliance must also perform other packet-processing tasks, including firewall, NAPT, tunneling, and encryption, which take up a significant amount of CPU bandwidth. The Ampere Altra uses only 30% of its cores (24 out of 80) to achieve 10 Gbps of Vectorscan performance, compared to close to 100% (24 out of 24) on an Intel Xeon 8268.
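The utilization figures follow directly from the core counts. As a quick check, this sketch computes the share of cores spent on deep-packet-inspection and the cores left over for other tasks. The core counts come from the text; the function name and the dictionary layout are just illustrative choices.

```python
def dpi_core_share(dpi_cores: int, total_cores: int) -> float:
    """Percentage of a socket's cores consumed by deep-packet-inspection."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    if not 0 <= dpi_cores <= total_cores:
        raise ValueError("dpi_cores must be between 0 and total_cores")
    return 100 * dpi_cores / total_cores


# (cores used for 10 Gbps of Vectorscan, total cores), as stated in the text.
systems = {"Ampere Altra": (24, 80), "Intel Xeon 8268": (24, 24)}

for name, (dpi, total) in systems.items():
    share = dpi_core_share(dpi, total)
    print(f"{name}: {share:.0f}% of cores on DPI, {total - dpi} cores left over")
```

On these numbers, the Altra keeps 56 cores free for firewall, NAPT, tunneling, and encryption work, while the 24-core x86 system has none to spare.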

Figure 3: System CPU Utilization for 10Gbps Vectorscan

Conclusion

The work done by Arm and partners with Vectorscan provides network security application developers with a regex library that is portable across multiple architectures, optimized to leverage SIMD acceleration, and preserves legacy compatibility. This enables Arm Neoverse-based platforms to provide deep-packet-inspection performance that is comparable with alternatives at a per-core level while using a significantly lower share of system CPU, which leaves headroom for other packet-processing tasks in the same system.

Learn more about Neoverse N1

Appendix

System configuration

Arm - Ampere Altra: https://amperecomputing.com/wp-content/uploads/2021/06/Altra_Rev_A1_DS_v1.10_20210612-1.pdf
1. 80 Neoverse N1 cores @ 3.0 GHz

x86 - Intel Xeon Platinum 8268: https://ark.intel.com/content/www/us/en/ark/products/192481/intel-xeon-platinum-8268-processor-35-75m-cache-2-90-ghz.html
1. 24 cores / 48 threads @ 2.9 GHz, HT = off

Software/test configuration

1. Snort version = 3.1.1.0
2. Vectorscan version = 5.3.0
3. Linux version = 5.10.9
4. PCAP = maccdc2012_00001.pcap from https://www.netresec.com/?page=MACCDC
