Co-authors: Martin Ma and Zaiping Bie
ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP) of queries. It offers industry-leading query performance while significantly reducing storage requirements through columnar storage and compression. It has become very popular in the OLAP field over the past several years and is widely used by many enterprises.
In this blog post, we compare the query latency (processing time) and throughput of ClickHouse on two Amazon EC2 instance families over a range of instance sizes. These instance families are the Amazon EC2 C7g (based on Arm Neoverse-powered AWS Graviton3 processors) and C6i (based on 3rd Generation Intel Xeon Scalable processors). Our findings demonstrate that ClickHouse deployments on C7g instances can achieve a query latency advantage of up to 26% over C6i instances. The following sections cover the details of our testing methodology and results.
For the benchmark setup, the ClickHouse server and client are deployed on separate instances. We connect the ClickHouse client to the ClickHouse server and repeatedly send preset queries. We then collect query processing time and throughput to compare performance between C7g and C6i instances.
To achieve the best performance, we build ClickHouse with the latest Clang following the official procedure, and we also apply the CMake NATIVE and AVX-related flags listed below.
| Architecture | ClickHouse CMake flags |
|---|---|
| AArch64 | -DARCH_NATIVE=ON |
| x86 | -DENABLE_AVX2=ON -DENABLE_AVX2_FOR_SPEC_OP=ON -DENABLE_AVX512=ON -DENABLE_AVX512_FOR_SPEC_OP=ON |
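As a rough illustration, the flag selection above could be scripted when the same build job targets both architectures. This is only a sketch: the function name `clickhouse_cmake_flags` and the architecture strings used as dictionary keys are assumptions made for illustration; only the flags themselves come from the table.

```python
import platform

# Flags copied from the table above. The keys are the strings that
# platform.machine() typically reports on Linux (an assumption of this sketch).
ARCH_FLAGS = {
    "aarch64": ["-DARCH_NATIVE=ON"],
    "x86_64": [
        "-DENABLE_AVX2=ON",
        "-DENABLE_AVX2_FOR_SPEC_OP=ON",
        "-DENABLE_AVX512=ON",
        "-DENABLE_AVX512_FOR_SPEC_OP=ON",
    ],
}


def clickhouse_cmake_flags(machine=None):
    """Return the extra CMake flags for the given (or the current) architecture."""
    machine = machine or platform.machine()
    if machine not in ARCH_FLAGS:
        raise ValueError(f"no flags listed for architecture {machine!r}")
    return list(ARCH_FLAGS[machine])


print(" ".join(clickhouse_cmake_flags("aarch64")))
print(" ".join(clickhouse_cmake_flags("x86_64")))
```

The returned list can be appended to whatever `cmake` invocation the official build procedure prescribes.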
To align jemalloc behavior on C7g and C6i, the following jemalloc parameters are configured in jemalloc_internal_defs.h.in.
| jemalloc parameter | Value |
|---|---|
| LG_PAGE | 12 (one page is 2^LG_PAGE bytes) |
| LG_HUGEPAGE | 21 (one huge page is 2^LG_HUGEPAGE bytes) |
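To make the two exponents concrete, the short calculation below converts them to byte sizes. The variable names simply mirror the jemalloc parameter names; nothing here reads or writes jemalloc's configuration.

```python
# Exponents as configured in the table above.
LG_PAGE = 12
LG_HUGEPAGE = 21

# A size of 2^n bytes is a left shift of 1 by n bits.
page_bytes = 1 << LG_PAGE
hugepage_bytes = 1 << LG_HUGEPAGE

print(f"page size:      {page_bytes} bytes ({page_bytes // 1024} KiB)")
print(f"huge page size: {hugepage_bytes} bytes ({hugepage_bytes // (1024 * 1024)} MiB)")
```

In other words, both builds assume 4 KiB pages and 2 MiB huge pages.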
The ClickHouse server runs on C7g/C6i instance families across a range of instance sizes.
The benchmark client runs on a single C7g.4xlarge instance.
The following table summarizes the tested instance types.
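| Instance Type | Instance Size (vCPU) | Memory (GiB) | Storage |
|---|---|---|---|
| C7g / C6i | 2xlarge (8) | 16 | 50GB (EBS gp3) |
| C7g / C6i | 4xlarge (16) | 32 | 50GB (EBS gp3) |
| C7g / C6i | 8xlarge (32) | 64 | 50GB (EBS gp3) |
| C7g / C6i | 16xlarge (64) | 128 | 50GB (EBS gp3) |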
The software versions and test parameters are as follows:

| Software | Version |
|---|---|
| ClickHouse | v22.5.1.2079-stable |
| Operating System | Amazon Linux 2 |
| Kernel | 5.10.112-108.499.amzn2.aarch64 (C7g), 5.10.112-108.499.amzn2.x86_64 (C6i) |

| ClickHouse server parameter | Value |
|---|---|
| max_threads | vCPU number |
Note: the max_threads parameter specifies the number of worker threads used for parallel query processing on the ClickHouse server; its default value is the number of physical CPU cores. With this default setting, C7g instances outperform C6i instances by 40%, but up to half of the CPU resources on C6i instances sit idle while C7g instances are fully utilized (each C6i physical core exposes two vCPUs through Hyper-Threading, whereas each C7g vCPU is a physical core). To fully utilize the CPU resources on C6i, we set max_threads to the number of vCPUs on both C7g and C6i instances in this comparison.
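The post does not say how the setting was applied. One possibility, shown here purely as a sketch, is a session-level `SET max_threads` statement whose value is the vCPU count of the server instance. The helper name `max_threads_statement` is hypothetical, and `os.cpu_count()` only gives the right number if the script runs on the server instance itself.

```python
import os


def max_threads_statement(vcpus=None):
    """Build a ClickHouse SET statement that pins max_threads to the vCPU count."""
    if vcpus is None:
        # Logical CPUs visible to the OS; on EC2 this corresponds to the vCPU count.
        vcpus = os.cpu_count()
    if not vcpus or vcpus < 1:
        raise ValueError("could not determine the vCPU count")
    return f"SET max_threads = {int(vcpus)}"


# A 4xlarge instance has 16 vCPUs.
print(max_threads_statement(16))
```

The same value could instead be placed in a settings profile on the server, which avoids repeating the statement in every session.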
We use the web analytics dataset (the “hits” table, containing 100 million rows) and the 43 typical queries provided by the official ClickHouse benchmark method to collect query processing times.
For each of these 43 queries, the average query time is the arithmetic mean of 10 consecutive runs after one warm-up run. The total query time, shown in the following table, is the sum of the average times of the 43 queries. We observed a performance uplift of up to 25.8% when running ClickHouse on C7g instances compared to C6i instances.
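The measurement procedure described above can be sketched as follows. This is not the authors' harness: `run_query` is a hypothetical callable that sends one query to the server and returns its processing time in seconds, and the fake runner at the bottom exists only so that the sketch runs without a ClickHouse server.

```python
from statistics import mean


def average_query_time(run_query, query, warmups=1, repeats=10):
    """Arithmetic mean of `repeats` runs, measured after `warmups` discarded runs."""
    for _ in range(warmups):
        run_query(query)  # warm-up result is thrown away
    return mean(run_query(query) for _ in range(repeats))


def total_query_time(run_query, queries, warmups=1, repeats=10):
    """Sum of the per-query average times over the whole query set."""
    return sum(average_query_time(run_query, q, warmups, repeats) for q in queries)


def fake_run_query(query):
    # Stand-in for a real client call: pretends every query takes 0.5 seconds.
    return 0.5


print(total_query_time(fake_run_query, ["SELECT 1", "SELECT 2"]))
```

In a real run, `run_query` would wrap the ClickHouse client and `queries` would hold the 43 benchmark queries.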
The following table compares the total query processing time (lower is better) on C7g and C6i.
| Instance Size | C7g (Sec) | C6i (Sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 34.95 | 42.77 | 18.3% |
| 4xlarge | 18.91 | 24.57 | 23.0% |
| 8xlarge | 11.72 | 15.57 | 24.8% |
| 16xlarge | 9.02 | 12.16 | 25.8% |
Table 1. ClickHouse query processing time benchmark results on C7g vs C6i
Figure 1. Query time Performance gains for C7g vs. C6i
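The performance gain reported for query time appears to be the reduction in processing time relative to C6i. That formula is inferred from the published numbers rather than stated in the post, so treat the sketch below as a consistency check; `latency_gain` is a name chosen for illustration.

```python
def latency_gain(c7g_seconds, c6i_seconds):
    """Reduction in query time on C7g, as a percentage of the C6i time."""
    return (c6i_seconds - c7g_seconds) / c6i_seconds * 100


# Total query times from Table 1: size -> (C7g seconds, C6i seconds).
table_1 = {
    "2xlarge": (34.95, 42.77),
    "4xlarge": (18.91, 24.57),
    "8xlarge": (11.72, 15.57),
    "16xlarge": (9.02, 12.16),
}

for size, (c7g, c6i) in table_1.items():
    print(f"{size}: {latency_gain(c7g, c6i):.1f}%")
```

Because the published times are already rounded, a recomputed value can differ from the published gain in the last digit.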
We also selected the three most time-consuming queries (Query 19, Query 33, and Query 34) to observe the performance uplift on C7g instances compared to C6i instances.
Query 19
SELECT UserID, toMinute(EventTime) AS m, SearchPhrase, count() FROM hits_100m_obfuscated GROUP BY UserID, m, SearchPhrase ORDER BY count() DESC LIMIT 10;
Query 33
SELECT WatchID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM hits_100m_obfuscated GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
Query 34
SELECT URL, count() AS c FROM hits_100m_obfuscated GROUP BY URL ORDER BY c DESC LIMIT 10;
The following tables show the processing times (lower is better) of these three queries on C7g and C6i instances.
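| Instance Size | C7g (sec) | C6i (sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 3.995 | 4.918 | 18.8% |
| 4xlarge | 2.002 | 2.736 | 26.8% |
| 8xlarge | 1.101 | 1.558 | 29.3% |
| 16xlarge | 0.690 | 1.010 | 31.7% |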
Table 2. Query 19 results on C7g vs C6i
Figure 2. Query 19 Performance gains for C7g vs. C6i instances
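| Instance Size | C7g (sec) | C6i (sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 4.562 | 4.947 | 7.8% |
| 4xlarge | 2.351 | 2.816 | 16.5% |
| 8xlarge | 1.578 | 2.107 | 25.1% |
| 16xlarge | 1.137 | 1.608 | |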
Table 3. Query 33 results on C7g vs C6i
Figure 3. Query 33 Performance gains for C7g vs. C6i instances
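| Instance Size | C7g (sec) | C6i (sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 3.225 | 3.766 | 14.4% |
| 4xlarge | 1.793 | 2.171 | 17.4% |
| 8xlarge | 1.066 | 1.325 | 19.6% |
| 16xlarge | 0.774 | 1.036 | 25.4% |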
Table 4. Query 34 results on C7g vs C6i
Figure 4. Query 34 Performance gains for C7g vs. C6i instances
We used the official ClickHouse benchmark tool (clickhouse-benchmark) to collect throughput data with the same dataset and queries. After a warm-up phase, each test uses the benchmark tool to continuously send all 43 queries to the server and reports queries per second (QPS) at the end of the test. We observed a performance uplift of up to 31.6% with a single connection when running ClickHouse on C7g instances compared to C6i instances.
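For readers who want to reproduce a run, the command line could be assembled along these lines. The sketch only builds the command and does not execute it. The host name `clickhouse-server.example` and the file name `queries.sql` are placeholders, and `benchmark_command` is a hypothetical helper. The `--host` and `--concurrency` options, and reading queries from standard input, follow the clickhouse-benchmark documentation.

```python
import shlex


def benchmark_command(host, concurrency=1):
    """Build the clickhouse-benchmark argument list for one test run."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return [
        "clickhouse-benchmark",
        f"--host={host}",
        f"--concurrency={concurrency}",
    ]


# clickhouse-benchmark reads the queries to send from standard input.
command = shlex.join(benchmark_command("clickhouse-server.example", concurrency=1))
print(f"{command} < queries.sql")
```

Raising `concurrency` gives the multi-connection variant of the same test.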
The following table shows the QPS (higher is better) comparison for the default single connection scenario (clickhouse-benchmark --concurrency=1) on C7g and C6i.
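| Instance Size | C7g (Queries/Sec) | C6i (Queries/Sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 0.684 | 0.581 | 17.7% |
| 4xlarge | 2.249 | 1.738 | 29.4% |
| 8xlarge | 3.529 | 2.709 | 30.3% |
| 16xlarge | 4.536 | 3.446 | 31.6% |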
Table 5. ClickHouse throughput performance results (single connection) on C7g vs C6i
Figure 5. ClickHouse throughput performance gain (single connection) for C7g vs. C6i instances
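For throughput, the published gain is consistent with the increase in QPS relative to C6i. As with query time, this formula is inferred from the numbers, not quoted from the post, and `throughput_gain` is an illustrative name.

```python
def throughput_gain(c7g_qps, c6i_qps):
    """Increase in queries per second on C7g, as a percentage of the C6i value."""
    return (c7g_qps - c6i_qps) / c6i_qps * 100


# (C7g QPS, C6i QPS) pairs from Table 5, in the order they appear.
table_5 = [
    (0.684, 0.581),
    (2.249, 1.738),
    (3.529, 2.709),
    (4.536, 3.446),
]

for c7g, c6i in table_5:
    print(f"{c7g} vs {c6i}: {throughput_gain(c7g, c6i):.1f}%")
```

Note the different direction from the query-time tables: there the gain is a reduction measured against the C6i time, here it is an increase measured against the C6i throughput.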
The following table shows the QPS comparison for the multi-connection scenario (clickhouse-benchmark --concurrency=N) on C7g and C6i. Note: xlarge, 2xlarge, and 4xlarge instances cannot support multiple connections because of their limited memory capacity.
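| Instance Size | Concurrency | C7g (Queries/Sec) | C6i (Queries/Sec) | Performance gain |
|---|---|---|---|---|
| 8xlarge | 2 | 4.125 | 2.968 | 39.0% |
| 8xlarge | 4 | 4.138 | 2.931 | 41.2% |
| 8xlarge | 6 | 4.182 | 2.947 | 41.9% |
| 8xlarge | 8 | 4.108 | 2.914 | 41.0% |
| 16xlarge | 2 | 5.847 | 4.003 | 46.1% |
| 16xlarge | 4 | 6.195 | 4.071 | 52.2% |
| 16xlarge | 6 | 6.329 | 4.093 | 54.6% |
| 16xlarge | 8 | 6.290 | 4.112 | 53.0% |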
Table 6. ClickHouse throughput performance results (multi connection) on C7g vs C6i
Figure 6. ClickHouse throughput performance gain (multi connection) for C7g vs. C6i instances
By deploying on AWS Graviton3-based C7g instances, ClickHouse has seen query latency (processing time) reduced by up to 26% and single-connection throughput increased by up to 32%, in addition to a 20% saving on instance price. This comparison is against equally configured C6i instances based on 3rd Generation Intel Xeon Scalable processors.
Visit the AWS Graviton3 page for customer stories on adoption of Arm-based processors. For details on how to migrate existing applications to AWS Graviton, please check this GitHub page. For any queries related to your software workloads running on Arm Neoverse platforms, feel free to reach out to us at sw-ecosystem@arm.com.