Accelerate multi-token search in strings with SVE2 SVMATCH instruction

September 25, 2024

2 minute read time.

SVMATCH is an instruction that is introduced by Arm SVE2. This instruction aims to accelerate a very common task in software engineering: to check existence of multiple tokens in a string.

As an example, consider parsing CSV files. To extract records and fields, we must cut input into rows and columns. Rows can be separated by '\r', '\n', and columns are often separated by ','. These 3 tokens must be handled differently than normal input. A trivial CSV parser processes 1 character in each iteration.

To improve performance, we want to parse by batch with Neon vectorization. As explain in the following steps:

Load multiple bytes into a vector
Check existence of special tokens in the vector
Big win if there's no special token.
Fallback to the trivial character by character approach if special token exists.

Step 2 is critical for performance. That is, how to efficiently check existence of the 3 special tokens ('\r', '\n', ',') in a vector.

Traditionally, we implement it by comparing the vector with 3 tokens separately, then "or” the 3 results. This approach is tedious. More importantly, it does not scale. Image supporting additional tokens, like escape '\' and quote '"', the time complexity increases linearly.

SVMATCH simplifies this task significantly. We can match multiple tokens in constant time.

For example, svbool_t svmatch_u8(svbool_t pg, svuint8_t op1, svuint8_t op2)

op1 (haystack): a vector contains the input string.
op2 (needle): a vector contains the tokens to be matched in input string.
return: a predicate with each bit represents occurrence of any token in the input string.

The following table illustrates the behavior of SVMATCH. The result bits are set for positions where the op1 character exists in op2.

Note as there are only 3 tokens to be matched, op2 vector is populated with duplicated tokens (the comma).

op1 (input csv string vector)	a	,	0	1	2	3	4	5	6	\r	\n	+	-	*	/	?
op2 (special tokens vector)	\r	\n	,	,	,	,	,	,	,	,	,	,	,	,	,	,
result (predicate)	0	1	0	0	0	0	0	0	0	1	1	0	0	0	0	0

On Neoverse-N2, SVMATCH costs 2 cycles. We can match at most 16 tokens in a 16 bytes string with 1 instruction. While the traditional approach requires 16 compare and 15 "or" instructions.

Optimize Sonic JSON decoder

This is a real world case to optimize Sonic JSON decoder with SVMATCH.

Sonic is a fast JSON serializing & deserializing library from Bytedance. We profiled Sonic on Neoverse-N2 and found below hot function.

Fullscreen

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// original code: 4 NEON compare and 3 "or" operations
sonic_force_inline uint64_t GetNonSpaceBits(const uint8_t *data) {
  uint8x16_t v = vld1q_u8(data);
  uint8x16_t m1 = vceqq_u8(v, vdupq_n_u8(' '));
  uint8x16_t m2 = vceqq_u8(v, vdupq_n_u8('\t'));
  uint8x16_t m3 = vceqq_u8(v, vdupq_n_u8('\n'));
  uint8x16_t m4 = vceqq_u8(v, vdupq_n_u8('\r'));
  uint8x16_t m5 = vorrq_u8(m1, m2);
  uint8x16_t m6 = vorrq_u8(m3, m4);
  uint8x16_t m7 = vorrq_u8(m5, m6);
  uint8x16_t m8 = vmvnq_u8(m7);
  return to_bitmask(m8);
}
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

// original code: 4 NEON compare and 3 "or" operations
sonic_force_inline uint64_t GetNonSpaceBits(const uint8_t *data) {
  uint8x16_t v = vld1q_u8(data);
  uint8x16_t m1 = vceqq_u8(v, vdupq_n_u8(' '));
  uint8x16_t m2 = vceqq_u8(v, vdupq_n_u8('\t'));
  uint8x16_t m3 = vceqq_u8(v, vdupq_n_u8('\n'));
  uint8x16_t m4 = vceqq_u8(v, vdupq_n_u8('\r'));

  uint8x16_t m5 = vorrq_u8(m1, m2);
  uint8x16_t m6 = vorrq_u8(m3, m4);
  uint8x16_t m7 = vorrq_u8(m5, m6);
  uint8x16_t m8 = vmvnq_u8(m7);

  return to_bitmask(m8);
}

This code loads 16 bytes into a NEON register and compares it with 4 tokens (' ', '\t', '\n', '\r'). It uses 4 vectors, compare (vceqq) and 3 vector "or" (vorrq) instructions. This is the typical case we can improve with SVMATCH.

The SVMATCH optimized code is much more concise.

Fullscreen

1
2
3
4
5
6
7
8
// optimize with SVMATCH
sonic_force_inline svbool_t GetNonSpaceBits(const uint8_t *data) {
  const svuint8x16_t v = svld1_u8(svptrue_b8(), data);
  // load four tokens: tab(09), LF(0a), CR(0d), space(20)
  svuint8x16_t tokens = svreinterpret_u8_u32(svdup_n_u32(0x090a0d20U));
  return svnmatch_u8(svptrue_b8(), v, tokens);
}
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

// optimize with SVMATCH
sonic_force_inline svbool_t GetNonSpaceBits(const uint8_t *data) {
  const svuint8x16_t v = svld1_u8(svptrue_b8(), data);

  // load four tokens: tab(09), LF(0a), CR(0d), space(20)
  svuint8x16_t tokens = svreinterpret_u8_u32(svdup_n_u32(0x090a0d20U));
  return svnmatch_u8(svptrue_b8(), v, tokens);
}

We see significant performance improvement from SVMATCH. In the table:

Benchmark: Sonic decoder microbenchmarks
Original: performance of original code
SVMATCH: performance of SVMATCH optimized code
Improvement: performance uplift from SVMATCH

Benchmark	Original (Gi/s)	SVMATCH (Gi/s)	Improvement
gsoc-2018/Decode_SonicDyn	2.38736	2.76677	15.89%
citm_catalog/Decode_SonicDyn	1.41729	1.76191	24.32%
otfcc/Decode_SonicDyn	0.39992	0.41342	3.38%
fgo/Decode_SonicDyn	0.69160	0.71630	3.57%
twitter/Decode_SonicDyn	1.33604	1.58737	18.81%
twitterescaped/Decode_SonicDyn	1.24759	1.30216	4.37%
github_events/Decode_SonicDyn	1.38961	1.65635	19.20%
canada/Decode_SonicDyn	0.52615	0.52452	-0.31%
poet/Decode_SonicDyn	2.06297	2.40383	16.52%
lottie/Decode_SonicDyn	0.41990	0.43882	4.51%
book/Decode_SonicDyn	0.45662	0.48720	6.70%

Sonic JSON Decoder Benchmarks

^{Comparison of Sonic JSON Decoder Performance}

Upstream patch for reference: https://github.com/bytedance/sonic-cpp/pull/92

0 comments
0 members are here

Architectures and Processors blog

Introducing GICv5: Scalable and secure interrupt management for Arm

Christoffer Dall

Introducing Arm GICv5: a scalable, hypervisor-free interrupt controller for modern multi-core systems with improved virtualization and real-time support.
- April 28, 2025
Getting started with AARCHMRS Features.json using Python

Joh

A high-level introduction to the Arm Architecture Machine Readable Specification (AARCHMRS) Features.json with some examples to interpret and start to work with the available data using Python.
- April 8, 2025
Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC

Samer El-Haj-Mahmoud

Arm and 9elements Cyber Security have brought a prototype of OpenBMC to the Arm Neoverse Compute Subsystem (CSS) to advancing server manageability.
- January 28, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

Accelerate multi-token search in strings with SVE2 SVMATCH instruction

Optimize Sonic JSON decoder

Introducing GICv5: Scalable and secure interrupt management for Arm

Getting started with AARCHMRS Features.json using Python

Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC