BOLT is a post-link optimization technology that improves performance for a variety of workloads. Previously, BOLT was enabled through CoreSight and perf, which improved performance for some typical workloads. Find out more about BOLT optimization technology in the earlier blog. However, that method requires CoreSight to capture branch data for perf, which is not convenient to deploy in a production environment.
BOLT instrumentation is an alternative method. It optimizes the executable binary based on profile data collected by instrumenting the binary and running it. Only the llvm-bolt utility is required, because there is no dependency on CoreSight or perf.
This blog walks through the steps to enable BOLT instrumentation for MongoDB and presents the benchmark results.
Two Alibaba ECS instances are reserved for the benchmark. The client runs ycsb while the server runs MongoDB. A 200G AutoPL ESSD, which has a higher bandwidth, is attached to the server to ensure that the drive is not a bottleneck.
MongoDB BOLT instrumentation test environment
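There are two steps when running ycsb: load and run. With the settings below, the load step inserts 40000000 records and the run step executes 5000000 operations. Here $1 is the address of the MongoDB server, passed as the first argument to the script. Run the following commands: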
REC_CNT=40000000
OP_CNT=5000000
./bin/ycsb.sh load mongodb -s -P workloads/workloada -p recordcount=$REC_CNT -p operationcount=$OP_CNT -threads 64 -p mongodb.url="mongodb://$1:27017/ali"
./bin/ycsb.sh run mongodb -s -P workloads/workloada -p recordcount=$REC_CNT -p operationcount=$OP_CNT -threads 64 -p mongodb.url="mongodb://$1:27017/ali"
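1. Build mongod with relocations emitted so that BOLT can rewrite the binary (name the result mongod.orig):
python3 buildscripts/scons.py DESTDIR=$WORKSPACE/install/mongo install-mongod \
    CCFLAGS="-fno-reorder-blocks-and-partition -mcpu=native -O3 -w" \
    LINKFLAGS="-Wl,--emit-relocs" --disable-warnings-as-errors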
2. Instrument mongod.orig to produce an instrumented executable (name it mongod.inst):
llvm-bolt mongod.orig -instrument -o mongod.inst --instrumentation-file=`pwd`/prof.fdata --instrumentation-sleep-time=60
3. Start mongod.inst and run ycsb to collect profile data. Run the following command:
OP_CNT=5000000
./bin/ycsb.sh load mongodb -s -P workloads/workloada -p operationcount=$OP_CNT -threads 64 -p mongodb.url="mongodb://$1:27017/ali"
./bin/ycsb.sh run mongodb -s -P workloads/workloada -p operationcount=$OP_CNT -threads 64 -p mongodb.url="mongodb://$1:27017/ali"
4. Stop mongod.inst
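Because mongod is a long-running server, the --instrumentation-sleep-time=60 option used when instrumenting asks the instrumented binary to write the profile periodically instead of only at exit. Before stopping mongod.inst, it can be worth checking that prof.fdata has actually been written. The helper below is a minimal sketch of such a check. The function name wait_for_profile and the default timeout are illustrative assumptions, not part of BOLT.

```shell
# Wait until the profile file exists and is non-empty, or give up.
# $1 = path to the profile file (for example prof.fdata)
# $2 = timeout in seconds (hypothetical default: 120)
wait_for_profile() {
    file=$1
    timeout=${2:-120}
    waited=0
    while [ ! -s "$file" ]; do
        # Stop waiting once the timeout has elapsed.
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, wait_for_profile "$(pwd)/prof.fdata" 120 returns success once the profile is on disk, after which mongod.inst can be stopped.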
1. Convert mongod.orig to an optimized executable (name it mongod.bolt):
llvm-bolt mongod.orig -o mongod.bolt -data=prof.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats
2. Run the benchmark against mongod.orig and mongod.bolt, and compare the results.
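The comparison in the last step can be scripted. The sketch below assumes that the ycsb output of each run was saved to a file and that it contains the usual "[OVERALL], Throughput(ops/sec), <value>" summary line. The function names and the log file names are hypothetical.

```shell
# Print the overall throughput (ops/sec) reported in a saved ycsb output file.
# $1 = path to the saved output
ycsb_throughput() {
    awk -F', *' '$1 == "[OVERALL]" && $2 == "Throughput(ops/sec)" { print $3 }' "$1"
}

# Print the percentage change from a baseline value to a new value.
# $1 = baseline (mongod.orig), $2 = new value (mongod.bolt)
percent_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Usage, with hypothetical log file names:
#   percent_change "$(ycsb_throughput orig.log)" "$(ycsb_throughput bolt.log)"
```

The same pattern can be extended to the per-operation latency lines in the ycsb summary.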
The benchmark shows that MongoDB throughput improved by 58% for INSERT and by 52% for READ and UPDATE. Latencies also dropped significantly with BOLT enabled.
INSERT:
READ and UPDATE (with ratio 1:1):
With BOLT, throughput increased by 58% for INSERT and by 52% for READ and UPDATE:
Throughput improvement report for BOLT
With BOLT, average latency decreased by 37% for INSERT, 35% for READ, and 34% for UPDATE:
Latency improvement report for BOLT
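The throughput and latency figures are consistent with each other. As a rough check, assume that the 64 ycsb threads form a closed loop, so that throughput is approximately the thread count divided by the average latency. Under that assumption, which ignores client-side overhead, a throughput gain of g% implies a latency drop of (1 - 1/(1 + g/100)) * 100 percent. The function name below is illustrative.

```shell
# Expected average-latency drop (percent) for a given throughput gain (percent),
# assuming a closed loop with a fixed number of client threads.
# $1 = throughput gain in percent
expected_latency_drop() {
    awk -v gain="$1" \
        'BEGIN { printf "%.1f\n", (1 - 1 / (1 + gain / 100)) * 100 }'
}

# expected_latency_drop 58   prints 36.7 (INSERT)
# expected_latency_drop 52   prints 34.2 (READ and UPDATE)
```

These estimates are close to the measured drops of 37% for INSERT and 34% to 35% for READ and UPDATE.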
The perf data shows that L1-icache-misses, branch-misses, and iTLB-load-misses dropped significantly. Use the following command to capture perf data:
perf stat -e instructions,L1-icache-misses,branches,branch-misses,iTLB-load,iTLB-load-misses -p `pgrep mongo` -- sleep 60
Perf data report for BOLT
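To compare the two binaries, it helps to turn raw counts into rates, for example branch-misses as a percentage of branches. The sketch below assumes that the counters were saved in CSV form by adding -x, and -o <file> to the perf stat command, and that the count is in the first field and the event name in the third. The function name perf_ratio is hypothetical, and event names may carry suffixes on some systems.

```shell
# Print numerator/denominator as a percentage from perf stat CSV output.
# $1 = CSV file written by: perf stat -x, -o <file> ...
# $2 = numerator event name, $3 = denominator event name
perf_ratio() {
    awk -F, -v num="$2" -v den="$3" '
        $3 == num { n = $1 }   # count for the numerator event
        $3 == den { d = $1 }   # count for the denominator event
        END { if (d > 0) printf "%.2f\n", n / d * 100 }
    ' "$1"
}

# Usage, with a hypothetical file name:
#   perf_ratio mongod.bolt.csv branch-misses branches
```

Running it on the output captured for mongod.orig and for mongod.bolt gives directly comparable miss rates.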
BOLT instrumentation results in a 58% throughput uplift for MongoDB INSERT tests and a 52% uplift for READ and UPDATE tests, while latencies drop significantly. Moreover, the instrumentation method is easy to deploy because it has no dependency on hardware counters or perf.