Refining MurmurHash64A for greater efficiency in Libstdc++

Zongyao Zhang
October 16, 2025
5 minute read time.

Introduction

MurmurHash is a non-cryptographic hash function designed for general-purpose hashing, such as hash tables and hash maps. It is widely used because it is fast, efficient, and has good distribution properties, which make it well suited to hash-based data structures and algorithms. MurmurHash is a family of algorithms; this blog post focuses on MurmurHash64A, the variant most commonly used on 64-bit platforms.
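
To make the later changes easier to follow, here is a sketch of the well-known reference form of MurmurHash64A (Austin Appleby's public-domain version). The input is consumed in full 8-byte chunks, and the 1 to 7 leftover tail bytes are folded in at the end. libstdc++ carries its own variant that differs in minor details but follows the same chunks-plus-tail structure, and the tail handling is exactly what this post optimizes.

// Sketch of the reference MurmurHash64A; libstdc++'s internal variant
// differs in details but shares the chunks-plus-tail structure.
#include <cstdint>

uint64_t MurmurHash64A(const void* key, int len, uint64_t seed)
{
    const uint64_t m = 0xc6a4a7935bd1e995ULL;
    const int r = 47;
    uint64_t h = seed ^ (len * m);

    const uint64_t* data = static_cast<const uint64_t*>(key);
    const uint64_t* end = data + (len / 8);

    while (data != end)                 // mix full 8-byte chunks
    {
        uint64_t k = *data++;
        k *= m; k ^= k >> r; k *= m;
        h ^= k; h *= m;
    }

    const unsigned char* tail = reinterpret_cast<const unsigned char*>(data);
    switch (len & 7)                    // fold in the 1-7 remaining bytes
    {
    case 7: h ^= uint64_t(tail[6]) << 48; [[fallthrough]];
    case 6: h ^= uint64_t(tail[5]) << 40; [[fallthrough]];
    case 5: h ^= uint64_t(tail[4]) << 32; [[fallthrough]];
    case 4: h ^= uint64_t(tail[3]) << 24; [[fallthrough]];
    case 3: h ^= uint64_t(tail[2]) << 16; [[fallthrough]];
    case 2: h ^= uint64_t(tail[1]) << 8;  [[fallthrough]];
    case 1: h ^= uint64_t(tail[0]);       h *= m;
    }

    h ^= h >> r; h *= m; h ^= h >> r;
    return h;
}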

bench_hash_functions is a benchmarking tool that evaluates and compares the performance of non-cryptographic hash functions. It measures the speed and efficiency of different hashing algorithms, which helps developers select the most suitable hash function for their specific use cases.

I will improve the implementation of the hash algorithm in libstdc++, then use bench_hash_functions to test the effectiveness of the improvements. Finally, I will use the perf tool to verify the reduction in memory loads and the changes at the assembly-instruction level before and after the improvements.

Input key generation 

To better reflect real-world usage scenarios, where hash functions are typically applied to relatively short keys, we created a dataset of 10 million keys with lengths ranging from 1 to 1024 bytes. Because keys used in hash functions tend to be small in practice, we designed the key-length distribution to closely mimic realistic workloads.

This dataset serves as the foundation for our performance benchmarks and helps us evaluate how different hashing strategies behave under practical conditions. You can see the distribution of key lengths in the figure below. If you are curious about the key generation algorithm, feel free to reach out; I am happy to share the details, and a rough illustrative sketch follows the figure.

Input key distribution
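
The exact key generation algorithm is not reproduced here, but as a rough illustration, the sketch below builds a similarly short-key-heavy dataset by drawing lengths from a truncated geometric distribution capped at 1024 bytes. The distribution parameter, the key alphabet, and the output file name are assumptions chosen for illustration, not the values behind the published dataset.

// Illustrative key generator: lengths are skewed toward small values, as in
// realistic hash-table workloads, and capped at 1024 bytes.
#include <algorithm>
#include <fstream>
#include <random>
#include <string>

int main()
{
    std::mt19937_64 rng(42);                                  // fixed seed for a reproducible dataset
    std::geometric_distribution<int> length_dist(0.02);      // assumed parameter; mean length around 50 bytes
    std::uniform_int_distribution<int> byte_dist('a', 'z');  // assumed simple printable alphabet
    std::ofstream out("keys.txt");                            // hypothetical output file, one key per line

    for (int i = 0; i < 10'000'000; ++i)
    {
        int len = std::min(1 + length_dist(rng), 1024);       // 1..1024 bytes, mostly short
        std::string key(len, '\0');
        for (char& c : key)
            c = static_cast<char>(byte_dist(rng));
        out << key << '\n';
    }
}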

Code change for benchmarks

While working with the bench_hash_functions repository, I found that the benchmark function MurmurHash64A was not calling the standard C++ library’s implementation of MurmurHash. To ensure we benchmark the correct version, specifically libstdc++'s hash function, I made the code changes shown below.

// 64-bit hash for 64-bit platforms

#include <string>
#include <string_view>
#include <functional>
#include <cstdint>
#include <cstring>

uint64_t MurmurHash64A(const void* key, int len, uint64_t seed)
{
    // use std::string_view to wrap raw memory buffers,
    // allowing us to treat them as lightweight,
    // non-owning string-like objects without copying the data.
    std::string_view sv(reinterpret_cast<const char*>(key), len);

    std::hash<std::string_view> hasher;
    size_t h = hasher(sv);

    return static_cast<uint64_t>(h);
}
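
A quick usage sketch, assuming it is compiled together with the wrapper above. Note that the seed argument is ignored by this wrapper: std::hash applies its own fixed internal seed, so identical inputs always hash to the same value.

#include <cstdint>
#include <cstdio>

int main()
{
    // Hypothetical sample key; any byte buffer works.
    const char key[] = "example.com/path?q=1";

    // The seed passed here has no effect on the result (see note above).
    uint64_t h = MurmurHash64A(key, static_cast<int>(sizeof(key) - 1), 0);

    std::printf("hash = %016llx\n", static_cast<unsigned long long>(h));
    return 0;
}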

Code change for GCC

In the libstdc++ implementation of MurmurHash64A, a do-while loop loads the remaining bytes of the input string, that is, the bytes left over after all full 8-byte chunks have been processed. These leftover bytes are loaded one at a time, which is inefficient.

To optimize this, we can restructure the loading logic. At most 7 bytes remain, so the bits of the remaining length tell us exactly which loads are needed:

  • If bit 2 of the length is set (4 or more bytes remain), load 4 bytes at once.
  • If bit 1 is set, load the next 2 bytes.
  • If bit 0 is set, load the final byte.

This tiered approach reduces the number of memory accesses. On Arm architectures, the compiler can emit the ldr, ldrh, and ldrb instructions to load the 4-byte, 2-byte, and 1-byte segments respectively, reducing the instruction count and improving speed.

Get the code from https://github.com/gcc-mirror/gcc.git, check out releases/gcc-14.2.0, and make the following changes:

$ git diff libstdc++-v3/libsupc++/hash_bytes.cc
diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
index 3665375096a..be8863b0e79 100644
--- a/libstdc++-v3/libsupc++/hash_bytes.cc
+++ b/libstdc++-v3/libsupc++/hash_bytes.cc
@@ -33,6 +33,8 @@
 // FNV is provided primarily for backward compatibility.
 
 #include <bits/hash_bytes.h>
+#include <cstdint>
+
 
 namespace
 {
@@ -50,13 +52,29 @@ namespace
   load_bytes(const char* p, int n)
   {
     std::size_t result = 0;
-    --n;
-    do
-      result = (result << 8) + static_cast<unsigned char>(p[n]);
-    while (--n >= 0);
+    const uint8_t* cp = reinterpret_cast<const uint8_t*>(p);
+    int avail = n;
+
+    if (n & 4) {
+       avail -= 4;
+       uint32_t v4 = *reinterpret_cast<const uint32_t*>(cp + avail);
+       result = static_cast<std::size_t>(v4) << (avail * 8);
+    }
+
+    if (n & 2) {
+       avail -= 2;
+       uint16_t v2 = *reinterpret_cast<const uint16_t*>(cp + avail);
+       result |= static_cast<std::size_t>(v2) << (avail * 8);
+    }
+
+    if (n & 1) {
+      result |= cp[0];
+    }
+
     return result;
   }
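
Before rebuilding GCC, it can be reassuring to cross-check the new tiered loads against the original byte-by-byte loop in isolation. The standalone sketch below (not part of the patch) compares both versions for every possible remainder length from 1 to 7 on random data; it uses memcpy for the 4- and 2-byte loads to avoid unaligned-access concerns, and assumes a little-endian target such as Neoverse.

// Standalone correctness check, independent of the patch above.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Original libstdc++ logic: load the n remaining bytes one at a time.
static std::size_t load_bytes_old(const char* p, int n)
{
    std::size_t result = 0;
    --n;
    do
        result = (result << 8) + static_cast<unsigned char>(p[n]);
    while (--n >= 0);
    return result;
}

// Tiered logic from the patch: 4-, 2-, and 1-byte loads selected by the bits of n.
// memcpy is used here instead of pointer casts to stay free of unaligned-access
// issues; the result matches the byte-by-byte loop on little-endian targets.
static std::size_t load_bytes_new(const char* p, int n)
{
    std::size_t result = 0;
    const uint8_t* cp = reinterpret_cast<const uint8_t*>(p);
    int avail = n;
    if (n & 4) {
        avail -= 4;
        uint32_t v4;
        std::memcpy(&v4, cp + avail, sizeof(v4));
        result = static_cast<std::size_t>(v4) << (avail * 8);
    }
    if (n & 2) {
        avail -= 2;
        uint16_t v2;
        std::memcpy(&v2, cp + avail, sizeof(v2));
        result |= static_cast<std::size_t>(v2) << (avail * 8);
    }
    if (n & 1)
        result |= cp[0];
    return result;
}

int main()
{
    char buf[8];
    for (int trial = 0; trial < 100000; ++trial) {
        for (char& b : buf)
            b = static_cast<char>(std::rand());
        for (int n = 1; n <= 7; ++n)
            if (load_bytes_old(buf, n) != load_bytes_new(buf, n)) {
                std::printf("mismatch for n = %d\n", n);
                return 1;
            }
    }
    std::puts("tiered loads match the byte-by-byte loop for all remainder lengths");
    return 0;
}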

Build the code with the following commands:

mkdir build
cd build/
../configure --prefix=/opt/install_libstdcpp --enable-languages=c,c++ --disable-multilib
make -j$(nproc)
make install

Before running the benchmarks, point the dynamic linker at the freshly built library by setting export LD_LIBRARY_PATH=/opt/install_libstdcpp/lib64:$LD_LIBRARY_PATH.

Benchmark result

Here are the benchmark results on a Neoverse V2-based Grace server running Ubuntu 24.04 with Linux kernel 6.8.0-79-generic.

Before applying any code modifications to GCC:

./perf_stdlib 10000000 --input ~/blog/key.urls --hasher murmurhash2_64
read 1000000 strings
read 2000000 strings
read 3000000 strings
read 4000000 strings
read 5000000 strings
read 6000000 strings
read 7000000 strings
read 8000000 strings
read 9000000 strings
read 10000000 strings
num_strings 10000000
max_string_length 1024
total_length 448644715
avg_string_length 44.86

=== murmurhash2_64 ===
#ignore: 6272232597482875923
Hash -- nanosec_per_key = 14.52
#ignore: 2498494660000
Hash+Rand -- nanosec_per_key = 19.14
min value = 270
max value = 19490
left = 6313588 (63.14%)
right = 3686412 (36.86%)

After modifying the GCC code, the benchmark results are as follows:

./perf_stdlib 10000000 --input ~/blog/key.urls --hasher murmurhash2_64
read 1000000 strings
read 2000000 strings
read 3000000 strings
read 4000000 strings
read 5000000 strings
read 6000000 strings
read 7000000 strings
read 8000000 strings
read 9000000 strings
read 10000000 strings
num_strings 10000000
max_string_length 1024
total_length 448644715
avg_string_length 44.86

=== murmurhash2_64 ===
#ignore: 6272232597482875923
Hash -- nanosec_per_key = 13.28
#ignore: 2498494660000
Hash+Rand -- nanosec_per_key = 19.33
min value = 270
max value = 19490
left = 6313588 (63.14%)
right = 3686412 (36.86%)

The time to compute a single key hash, measured in nanoseconds per key, dropped from 14.52 to 13.28 after the optimization. This represents about a 9% performance improvement.

Graph: nanoseconds per key before (14.52) and after (13.28) the optimization, a roughly 9% improvement.

Verifying the results

Perf stat

We use the following perf command to inspect memory access counts via the Performance Monitoring Unit (PMU):

perf stat -e armv8_pmuv3_0/mem_access/ ./perf_stdlib 10000000 --input ~/blog/key.urls --hasher murmurhash2_64

Before applying any optimizations, the memory access count was: "5027148198 armv8_pmuv3_0/mem_access/".

After the code changes, the memory access count dropped to: "4923663819 armv8_pmuv3_0/mem_access/".

This is a reduction of roughly 2%, showing that the optimized version performs fewer memory load operations and suggesting improved efficiency.

Perf assembly

We can also examine how the assembly code changed by using the perf record and perf report commands.

perf record -g  ./perf_stdlib 10000000 --input ~/blog/key.urls --hasher murmurhash2_64

perf report -i perf.data

The left side shows the assembly code before optimization, while the right side shows the assembly code after optimization.

Diagram: assembly code before optimization (left) and after optimization (right).

The left-hand side shows a loop that loads one byte from memory per iteration, whereas the right-hand side uses ldr, ldrh, and ldrb instructions to load 4, 2, and 1 bytes of data respectively.

Summary

This optimization reduces the number of memory accesses by changing how the remaining tail bytes are loaded. It improves the overall performance of MurmurHash64A in libstdc++, with a noticeable effect on Arm architectures, and is a practical, meaningful gain for workloads that depend on efficient hashing.

This optimization can also be directly applied to code bases that have their own implementation of the MurmurHash64A algorithm, such as Meta’s folly library.
