Accelerating Deep-packet-inspection (DPI) with Neon on Arm Neoverse platforms

Ravi Malhotra
October 6, 2021

Co-Authors: Ravi Malhotra and Jici Gao from Arm, with support from Konstantinos Margaritis from VectorCamp

This blog describes the work done by Arm and partners to accelerate regular expression parsing using vector engines in Arm Neoverse platforms, and its applicability to open-source Deep-packet-inspection applications like Snort.

Background

With the shift of enterprise computing from on-premises to the cloud, it has become critical to secure data transfers and to detect possible attacks before they happen. To address this concern, traditional network security appliances and VPN gateways have transformed into Unified Threat Management (UTM) systems. These systems analyze streams of data, and how devices use that data, to detect patterns and anomalies. Inspecting every byte of packet payload ('deep-packet') at typical network speeds can be very compute intensive, and this is where CPU SIMD vector engines can help by analyzing large sets of data in parallel.
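To make the workload concrete, here is a minimal scalar sketch of what inspecting a payload against a set of signatures involves. It uses Python's standard `re` module rather than Hyperscan or Vectorscan, and the signature names and patterns are hypothetical, chosen only for illustration. A SIMD regex engine aims to match all patterns in a single pass over the data instead of looping over them one by one as this sketch does.

```python
import re

# Hypothetical signatures for illustration only; real Snort rules are far richer.
SIGNATURES = {
    "sql_injection": rb"(?i)union\s+select",
    "path_traversal": rb"\.\./\.\./",
    "shell_download": rb"(?i)wget\s+http://",
}

# Compile once, then scan many payloads. Hyperscan-style libraries use the
# same split between a compile phase and a scan phase.
COMPILED = {name: re.compile(pattern) for name, pattern in SIGNATURES.items()}


def inspect_payload(payload: bytes) -> list[tuple[str, int]]:
    """Return (signature name, byte offset) for every match, ordered by offset."""
    hits = []
    for name, regex in COMPILED.items():
        for match in regex.finditer(payload):
            hits.append((name, match.start()))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    packet = b"GET /../../etc/passwd?q=1 UNION SELECT name FROM users HTTP/1.1"
    for name, offset in inspect_payload(packet):
        print(f"{name} at byte {offset}")
```

Every byte of every payload is examined once per signature here, which is why the cost grows quickly with rule count and line rate.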

Intel developed an open-source regular expression (regex) parsing library called Hyperscan that leveraged its SSE and AVX vector-engines and integrated it with a popular Deep-packet-inspection application, Snort. More info here: https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf.

Vectorscan

https://github.com/VectorCamp/vectorscan

To create a regex parsing library optimized for Arm platforms, Arm collaborated with VectorCamp, who specialize in software optimization and SIMD vectorization across a range of popular CPU architectures. Together, we created an architecture-inclusive fork of Hyperscan called Vectorscan that preserves support for x86 and modifies the framework to allow for other architectures and vector engine implementations.

The goal for Vectorscan was to preserve API compatibility with Hyperscan, so that it can be used as a drop-in replacement in applications like Snort. Splitting the SIMD code into a separate library not only enabled portability across multiple architectures, but also reduced code size by one third in the relevant SIMD routines.

Vectorscan currently supports the Neon vector engine, and Arm's focus is to continue optimizing it for Arm Neoverse platforms, including newer vector-engine implementations like the Scalable Vector Extension (SVE/SVE2). Compatibility with the Hyperscan project is preserved: algorithms and Intel architecture optimizations were cherry-picked and integrated into Vectorscan to allow Linux distros to bundle and maintain one package instead of two.

At the same time, generic contributions like cache prefetching, grouping of loads and stores, and algorithm optimizations provided performance gains across all architectures. Cleanup and restructuring also make the code easier to debug and profile.

Performance results

On the Arm architecture, Vectorscan provides a performance uplift of 20-40% over the default regex implementations within Snort. The chart below shows a single-core comparison of Vectorscan vs. the default regex implementations in Snort on a Neoverse N1-based Ampere® Altra® CPU. This uses the Arm Neon vector engines within N1. Future vector-engine implementations like Arm SVE and SVE2 will provide even further uplift.


Figure 1: Performance of Vectorscan versus default Regex Implementations in Snort (in Mbps)

We also compared Vectorscan performance on Arm and alternative systems with:

  • Ampere Altra with 80 Arm Neoverse N1 cores @ 3.0 GHz
  • Intel Cascade Lake Xeon Platinum 8268 with 24 cores @ 2.9 GHz

The Ampere Altra compares well in both single-core and multi-core performance.

Vectorscan performance scales linearly, all the way up to 80 cores on Ampere Altra, providing an overall socket performance well above 20 Gbps. This allows users to flexibly allocate cores between deep-packet-inspection and other security packet-processing tasks such as network proxy, VPN, and IDS/IPS. By comparison, with only 24 cores available, the x86 system throughput maxes out at about 10 Gbps.

Figure 2: Ampere Altra vs. x86 Throughput Scaling Comparison

However, a typical UTM appliance must also perform other packet-processing tasks, including firewall, NAPT, tunneling, and encryption, which take up a significant amount of CPU bandwidth. The Ampere Altra uses only 30% of its cores (24 out of 80) to achieve 10 Gbps of Vectorscan throughput, compared with close to 100% on an Intel Xeon 8268 (24 out of 24).


Figure 3: System CPU Utilization for 10 Gbps Vectorscan
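The utilization figures follow directly from the core counts quoted above. The short sketch below reproduces that arithmetic. The function name is hypothetical, and it assumes utilization is simply the share of cores dedicated to Vectorscan, which is how the percentages in the text are derived.

```python
def core_budget(total_cores: int, dpi_cores: int) -> tuple[float, int]:
    """Return (percent of cores used for DPI, cores left for other tasks)."""
    if not 0 <= dpi_cores <= total_cores:
        raise ValueError("dpi_cores must be between 0 and total_cores")
    utilization = 100.0 * dpi_cores / total_cores
    return utilization, total_cores - dpi_cores


if __name__ == "__main__":
    # Core counts taken from the comparison in the text (10 Gbps target).
    for name, total, used in [("Ampere Altra", 80, 24), ("Intel Xeon 8268", 24, 24)]:
        percent, spare = core_budget(total, used)
        print(f"{name}: {percent:.0f}% of cores for DPI, {spare} cores spare")
```

On these numbers the Altra keeps 56 cores free for firewall, tunneling, and encryption work, while the Xeon has none left over.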

Conclusion

The work done by Arm and partners with Vectorscan provides network security application developers with a regex library that is portable across multiple architectures, is optimized to leverage SIMD acceleration, and preserves legacy compatibility. This enables Arm Neoverse-based platforms to provide deep-packet-inspection performance that is not only comparable with alternatives at a per-core level, but also achieved at significantly lower system-level CPU utilization, leaving headroom for other packet-processing tasks in the same system.

Learn more about Neoverse N1

Appendix

System configuration

  1. Arm - Ampere Altra: https://amperecomputing.com/wp-content/uploads/2021/06/Altra_Rev_A1_DS_v1.10_20210612-1.pdf
    1. 80 Neoverse N1 cores @ 3.0 GHz
  2. x86 - Intel Xeon 8268 Platinum: https://ark.intel.com/content/www/us/en/ark/products/192481/intel-xeon-platinum-8268-processor-35-75m-cache-2-90-ghz.html
    1. 24 cores / 48 threads @ 2.9 GHz, HT = off
  3. Software/test configuration
    1. Snort Version = 3.1.1.0
    2. Vectorscan 5.3.0
    3. Linux Version = 5.10.9
    4. PCAP = maccdc2012_00001.pcap from https://www.netresec.com/?page=MACCDC