
Accelerating PCRE regex matching with data prefetch on ARM Cortex-A7

(This could have been a blog post, but I am not able to create that kind of content.)

Pattern matching today is a bit different from what it was in the past. The syntax of modern regular expressions (or regexes for short) is based on controlled backtracking, which gives regex users great freedom but is also a considerable challenge from an implementation viewpoint. Regex engines now have an executor similar to that of common scripting languages, and they are accelerated the same way: with just-in-time (JIT) compilers.

PCRE (Perl Compatible Regular Expressions) (www.pcre.org) is a well-known regular expression engine. It has a JIT compiler which compiles patterns into a platform-independent low-level intermediate representation (LIR), and this LIR is then transformed into machine code by the SLJIT compiler.

Unlike PCRE, SLJIT (https://sljit.sourceforge.net) is a less well-known JIT compiler. It translates an assembly-like LIR into machine code, and it supports three ARM instruction sets: the traditional 32-bit, the Thumb-2, and the 64-bit instruction sets. The key challenge of this project is defining a common LIR which is feature-rich yet can still be translated efficiently to many CPU architectures.

Finding the possible start of a regex match (fast skipping) is an important optimization for high-performance regex engines. The JIT compiler of PCRE has about half a dozen algorithms for this purpose, and we were curious whether data prefetching could be used to improve the performance of fast skipping further.
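
To illustrate the idea (this is only a rough sketch, not PCRE's actual code), the simplest fast skipping strategy scans the subject for the first character of the pattern and only runs the full matcher at candidate positions; the match_here() helper below is hypothetical:

#include <stddef.h>

/* Hypothetical full matcher: tries to match the whole pattern at 'pos'. */
int match_here(const unsigned char *pos, const unsigned char *end);

/* Fast skipping sketch: for a pattern such as /how to/ only positions
   holding 'h' can start a match, so everything else is skipped quickly. */
size_t count_matches(const unsigned char *subject, size_t length)
{
    const unsigned char *p = subject;
    const unsigned char *end = subject + length;
    size_t matches = 0;

    while (p < end) {
        while (p < end && *p != 'h')   /* skip to the next candidate */
            p++;
        if (p == end)
            break;
        if (match_here(p, end))
            matches++;
        p++;
    }
    return matches;
}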

We picked a Raspberry Pi 2 (https://www.raspberrypi.org/products/raspberry-pi-2-model-b/) computer for our measurements. This popular device has a 900 MHz ARM Cortex-A7 CPU and 1 GByte of RAM.

We extended one of the fast skipping algorithms with data prefetching and chose a few patterns which are accelerated by the selected algorithm. The input data was a 20 MByte text file (an ebook). To measure both 8-bit and 16-bit matching speed, the input, whose original character width is 8 bits, was first converted to 16-bit characters. This conversion doubled the input size to 40 MByte.
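
The post does not show its measurement harness; assuming the PCRE2 8-bit API for illustration (the classic PCRE library may equally have been used), counting and timing the matches for one pattern could look roughly like this, with the 16-bit case using the pcre2-16 library and the widened copy of the input:

/* Build (assumption): gcc -O2 bench.c -lpcre2-8 */
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stddef.h>
#include <time.h>

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Compile the pattern with the JIT enabled, then count all matches in the
   subject and report the elapsed matching time in milliseconds. */
size_t count_matches_timed(const char *pattern, const unsigned char *subject,
                           size_t length, double *elapsed_ms)
{
    int errcode;
    PCRE2_SIZE erroffset;
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)pattern, PCRE2_ZERO_TERMINATED,
                                   0, &errcode, &erroffset, NULL);
    if (re == NULL)
        return 0;
    pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);

    size_t matches = 0;
    PCRE2_SIZE offset = 0;
    double start = now_ms();
    while (offset < length &&
           pcre2_match(re, (PCRE2_SPTR)subject, length, offset, 0, md, NULL) >= 0) {
        PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
        matches++;
        /* Continue after the match; step one code unit on an empty match. */
        offset = (ov[1] > offset) ? ov[1] : offset + 1;
    }
    *elapsed_ms = now_ms() - start;

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return matches;
}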

Original results:

Pattern: /how to/ (number of matches: 395)
  Runtime with 8-bit characters: 57 ms
  Runtime with 16-bit characters: 64 ms
Pattern: /^how to/ (number of matches: 33)
  Runtime with 8-bit characters: 57 ms
  Runtime with 16-bit characters: 64 ms
Pattern: /how( to|ever)/ (number of matches: 842)
  Runtime with 8-bit characters: 69 ms
  Runtime with 16-bit characters: 77 ms

The 16-bit matching performance is a bit slower, probably due to the doubled input size. After prefetching the data from 64 bytes ahead of the current string pointer we got the following results:
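
In the experiment the prefetch is emitted by the JIT compiler into the generated skip loop (on ARM it is a PLD instruction). The C-level sketch below shows the same idea using GCC/Clang's __builtin_prefetch with an assumed 64-byte distance; the real JIT loop advances by more than one character per iteration, so it issues the hint less often:

#include <stddef.h>

int match_here(const unsigned char *pos, const unsigned char *end); /* hypothetical */

#define PREFETCH_DISTANCE 64   /* bytes ahead of the current string pointer */

size_t count_matches_prefetch(const unsigned char *subject, size_t length)
{
    const unsigned char *p = subject;
    const unsigned char *end = subject + length;
    size_t matches = 0;

    while (p < end) {
        /* Ask the CPU to start loading the cache line we will scan soon;
           on ARM this typically compiles to a PLD instruction. */
        __builtin_prefetch(p + PREFETCH_DISTANCE, 0 /* read */, 0 /* streaming */);

        if (*p == 'h' && match_here(p, end))
            matches++;
        p++;
    }
    return matches;
}

The 128-byte experiment further below only changes PREFETCH_DISTANCE; a larger distance gives the memory system more time to fill the cache line before the scan reaches it.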

Pattern: /how to/ (number of matches: 395)
  Runtime with 8-bit characters: 50 ms (14% speedup)
  Runtime with 16-bit characters: 57 ms (12% speedup)
Pattern: /^how to/ (number of matches: 33)
  Runtime with 8-bit characters: 50 ms (14% speedup)
  Runtime with 16-bit characters: 57 ms (12% speedup)
Pattern: /how( to|ever)/ (number of matches: 842)
  Runtime with 8-bit characters: 62 ms (11% speedup)
  Runtime with 16-bit characters: 68 ms (13% speedup)

The prefetch instruction visibly improved the performance of the engine! The speedup is around 12% for both character widths. We also tried prefetching from a 128-byte offset:

Pattern: /how to/ (number of matches: 395)
  Runtime with 8-bit characters: 50 ms (14% speedup)
  Runtime with 16-bit characters: 52 ms (23% speedup)
Pattern: /^how to/ (number of matches: 33)
  Runtime with 8-bit characters: 50 ms (14% speedup)
  Runtime with 16-bit characters: 52 ms (23% speedup)
Pattern: /how( to|ever)/ (number of matches: 842)
  Runtime with 8-bit characters: 63 ms (10% speedup)
  Runtime with 16-bit characters: 64 ms (20% speedup)

Although the 8-bit results are mostly unchanged, the gain on the 16-bit results nearly doubled! More interestingly, the 16-bit matching speed is now close to the 8-bit matching speed even though twice as much memory is scanned.

The conclusion is that prefetch instructions do improve pattern matching speed. These results convinced me to add prefetch support to the main PCRE repository later. The next step could be porting the SIMD-accelerated fast skipping algorithms to ARM.
