Improve ClickHouse Performance up to 26% by using AWS Graviton3

July 12, 2022

Co-authors: Martin Ma and Zaiping Bie

Introduction

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP) of queries. It offers best-in-industry query performance while significantly reducing storage requirements through its use of columnar storage and compression. It has been popular in the OLAP field for the past several years and is widely used by many enterprises.

In this blog, we compare the query latency (processing time) and throughput of ClickHouse on two Amazon EC2 instance families over a range of instance sizes. These instance families are Amazon EC2 C7g (based on Arm Neoverse-powered AWS Graviton3 processors) and C6i (based on 3rd Generation Intel Xeon Scalable processors). Our findings show that ClickHouse deployments on C7g instances can achieve up to a 26% performance advantage over C6i instances. The following sections cover the details of our testing methodology and results.

Performance benchmark setup and result

For the benchmark setup, the ClickHouse server and client are deployed on separate instances. We connect the ClickHouse client to the ClickHouse server and repeatedly send preset queries. We then collect query processing time and throughput to compare performance between C7g and C6i instances.

Build Config

To achieve the best performance, we build ClickHouse with the latest Clang, following the official procedure, and also apply the CMake ARCH_NATIVE and AVX-related flags as follows.

Architecture	ClickHouse CMake flags
AArch64	-DARCH_NATIVE=ON
x86	-DARCH_NATIVE=ON -DENABLE_AVX2=ON -DENABLE_AVX2_FOR_SPEC_OP=ON -DENABLE_AVX512=ON -DENABLE_AVX512_FOR_SPEC_OP=ON
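
As a rough illustration of the table above, the following Python sketch picks the flag set for a given architecture. The flags are copied from the table; the helper name `clickhouse_cmake_flags` and the machine-string matching are assumptions made for this example and are not part of the ClickHouse build system.

```python
import platform

# Flags copied from the build table; the helper itself is only illustrative.
COMMON_FLAGS = ["-DARCH_NATIVE=ON"]
X86_ONLY_FLAGS = [
    "-DENABLE_AVX2=ON",
    "-DENABLE_AVX2_FOR_SPEC_OP=ON",
    "-DENABLE_AVX512=ON",
    "-DENABLE_AVX512_FOR_SPEC_OP=ON",
]


def clickhouse_cmake_flags(machine=None):
    """Return the CMake flags for a machine string (defaults to this host)."""
    machine = (machine or platform.machine()).lower()
    if machine in ("aarch64", "arm64"):
        return list(COMMON_FLAGS)
    if machine in ("x86_64", "amd64"):
        return COMMON_FLAGS + X86_ONLY_FLAGS
    raise ValueError(f"unsupported architecture: {machine}")


# Print the flag set for each architecture in the table.
for arch in ("aarch64", "x86_64"):
    print(arch, " ".join(clickhouse_cmake_flags(arch)))
```

The resulting flags would be appended to the usual `cmake` invocation from the official build procedure.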

To align jemalloc behavior on C7g and C6i, the following jemalloc parameters are configured in jemalloc_internal_defs.h.in.

jemalloc parameter	value
LG_PAGE	12 (One page is 2^LG_PAGE bytes)
LG_HUGEPAGE	21 (One huge page is 2^LG_HUGEPAGE bytes)
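
For reference, the two values are base-2 logarithms of sizes in bytes. The short Python snippet below only does that arithmetic; the variable names mirror the jemalloc parameters and nothing else is assumed.

```python
# LG_* values are log2 of a size in bytes, as stated in the table.
LG_PAGE = 12
LG_HUGEPAGE = 21

page_bytes = 1 << LG_PAGE          # 2^12 bytes
hugepage_bytes = 1 << LG_HUGEPAGE  # 2^21 bytes

print(page_bytes)                      # 4096 bytes, i.e. 4 KiB
print(hugepage_bytes // (1024 * 1024)) # 2, i.e. 2 MiB
print(hugepage_bytes // page_bytes)    # 512 regular pages per huge page
```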

Server Config

The ClickHouse server runs on C7g/C6i instance families across a range of instance sizes.

The benchmark client runs on a single C7g.4xlarge instance.

The following table summarizes the tested instance types.

Instance Type	Instance Size (vCPU)	Memory (GiB)	Storage
C7g / C6i	2xlarge (8)	16	50GB (EBS gp3)
	4xlarge (16)	32
	8xlarge (32)	64
	16xlarge (64)	128

The software versions and test parameters are as follows:

Software	Version
ClickHouse	v22.5.1.2079-stable
Operating System	Amazon Linux 2
Kernel	5.10.112-108.499.amzn2.aarch64 (C7g), 5.10.112-108.499.amzn2.x86_64 (C6i)

ClickHouse server parameter	value
max_threads	vCPU number

Note: the max_threads parameter specifies the number of worker threads for parallel query processing on the ClickHouse server; its default value is the number of physical CPU cores. With this default setting, C7g instances outperform C6i instances by 40%, but up to half of the CPU resources on C6i instances sit idle while C7g instances are fully utilized. To fully utilize the CPU resources on C6i, we set max_threads to the vCPU count on both C7g and C6i instances in this comparison.
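
The effect described in the note can be sketched with a little arithmetic. The sketch below assumes that each C7g vCPU is a physical core while C6i exposes two hardware threads per physical core; that mapping is an assumption added here, not something measured in this blog, and the function names are hypothetical.

```python
# Assumption (not from the article): hardware threads exposed per physical core.
THREADS_PER_CORE = {"C7g": 1, "C6i": 2}


def physical_cores(family, vcpus):
    """Physical core count, which is the default max_threads per the note."""
    return vcpus // THREADS_PER_CORE[family]


def idle_vcpu_fraction(vcpus, max_threads):
    """Fraction of vCPUs left without a worker thread for a single query."""
    busy = min(max_threads, vcpus)
    return 1 - busy / vcpus


for vcpus in (8, 16, 32, 64):
    for family in ("C7g", "C6i"):
        default_threads = physical_cores(family, vcpus)
        idle = idle_vcpu_fraction(vcpus, default_threads)
        print(f"{family} {vcpus} vCPU: default max_threads={default_threads}, idle={idle:.0%}")
```

Under that assumption, the default leaves half of the C6i vCPUs without a worker thread, which is consistent with the "up to half" figure in the note; setting max_threads to the vCPU count removes the gap.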

Query Time Test

We use the web analytics dataset (the “hits” table, containing 100 million rows) and 43 typical queries, both provided by the official ClickHouse benchmark method, to collect query processing times.

For each of these 43 typical queries, the average query time is the arithmetic mean of 10 consecutive queries after one warmup query. The total query time, as shown in the following table, is the sum of the average times of these 43 queries. We observed a performance uplift of up to 25.8% (on the 16xlarge size) by running ClickHouse on C7g instances compared to C6i instances.
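
The aggregation described above can be sketched in a few lines of Python. The function names and the sample timings are made up for illustration; they are not the scripts or measurements used for this blog.

```python
from statistics import mean


def average_query_time(timings, warmup=1, runs=10):
    """Mean of `runs` consecutive timings after discarding `warmup` runs."""
    measured = timings[warmup:warmup + runs]
    if len(measured) != runs:
        raise ValueError(f"expected {warmup + runs} timings, got {len(timings)}")
    return mean(measured)


def total_query_time(per_query_timings):
    """Sum of the per-query averages (43 queries in the benchmark)."""
    return sum(average_query_time(t) for t in per_query_timings.values())


# Made-up timings in seconds: one slow warmup run followed by 10 measured runs.
sample = {
    "query_1": [5.0] + [1.0] * 10,
    "query_2": [5.0] + [2.0] * 10,
}
print(total_query_time(sample))  # 3.0
```

Discarding the first run keeps cold-cache effects out of the average.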

The following table compares the total query processing time (lower is better) on C7g and C6i.

Instance Size	C7g (Sec)	C6i (Sec)	Performance gain
2xlarge	34.95	42.77	18.3%
4xlarge	18.91	24.57	23.0%
8xlarge	11.72	15.57	24.8%
16xlarge	9.02	12.16	25.8%

Table 1. ClickHouse query processing time benchmark results on C7g vs C6i

Figure 1. Query time performance gain for C7g vs. C6i
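
The "Performance gain" column in Table 1 appears to be the reduction in total query time relative to C6i. That formula is inferred from the published values rather than stated in the text, so treat the Python sketch below as a consistency check under that assumption.

```python
def latency_gain_percent(c7g_seconds, c6i_seconds):
    """Time saved on C7g as a percentage of the C6i time (inferred formula)."""
    return (c6i_seconds - c7g_seconds) / c6i_seconds * 100


# (C7g seconds, C6i seconds, published gain %) copied from Table 1.
TABLE_1 = {
    "2xlarge": (34.95, 42.77, 18.3),
    "4xlarge": (18.91, 24.57, 23.0),
    "8xlarge": (11.72, 15.57, 24.8),
    "16xlarge": (9.02, 12.16, 25.8),
}

for size, (c7g, c6i, published) in TABLE_1.items():
    computed = latency_gain_percent(c7g, c6i)
    print(f"{size}: computed {computed:.2f}%, published {published}%")
```

The computed values agree with the table to within 0.1 percentage point; small differences are expected because the published times are rounded.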

We also selected the three queries that consume the most processing time (Query 19, Query 33, and Query 34) to observe the performance uplift on C7g instances compared to C6i instances.

Query 19	SELECT UserID, toMinute(EventTime) AS m, SearchPhrase, count() FROM hits_100m_obfuscated GROUP BY UserID, m, SearchPhrase ORDER BY count() DESC LIMIT 10;
Query 33	SELECT WatchID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM hits_100m_obfuscated GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
Query 34	SELECT URL, count() AS c FROM hits_100m_obfuscated GROUP BY URL ORDER BY c DESC LIMIT 10;

The following tables show the results for these three queries on C7g and C6i instances (lower is better).

Instance Size	C7g (Sec)	C6i (Sec)	Performance gain
2xlarge	3.995	4.918	18.8%
4xlarge	2.002	2.736	26.8%
8xlarge	1.101	1.558	29.3%
16xlarge	0.690	1.010	31.7%

Table 2. Query 19 results on C7g vs C6i

Figure 2. Query 19 Performance gains for C7g vs. C6i instances

Instance Size	C7g (Sec)	C6i (Sec)	Performance gain
2xlarge	4.562	4.947	7.8%
4xlarge	2.351	2.816	16.5%
8xlarge	1.578	2.107	25.1%
16xlarge	1.137	1.608	29.3%

Table 3. Query 33 results on C7g vs C6i

Figure 3. Query 33 Performance gains for C7g vs. C6i instances

Instance Size	C7g (Sec)	C6i (Sec)	Performance gain
2xlarge	3.225	3.766	14.4%
4xlarge	1.793	2.171	17.4%
8xlarge	1.066	1.325	19.6%
16xlarge	0.774	1.036	25.4%

Table 4. Query 34 results on C7g vs C6i

Figure 4. Query 34 Performance gains for C7g vs. C6i instances

Throughput Test

We used the official ClickHouse benchmark tool (clickhouse-benchmark) to collect throughput data based on the same dataset and queries. After a warmup phase, each test uses the benchmark tool to continuously send all 43 typical queries to the server and reports queries per second (QPS) at the end of the test. We observed a performance uplift of up to 31.6% in the single-connection scenario by running ClickHouse on C7g instances compared to C6i instances.
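
A minimal sketch of how such a run could be driven from Python is shown below. It only assembles the command line; `--host`, `--concurrency`, and `--timelimit` are existing clickhouse-benchmark options, while the host address, the time limit, and the `queries.sql` file name are placeholders, not the values used in these tests.

```python
import shlex


def benchmark_command(host, concurrency=1, timelimit=None):
    """Build a clickhouse-benchmark command line as an argument list."""
    cmd = [
        "clickhouse-benchmark",
        f"--host={host}",
        f"--concurrency={concurrency}",
    ]
    if timelimit is not None:
        cmd.append(f"--timelimit={timelimit}")
    return cmd


# Placeholder server address; the tool reads the queries from standard input.
cmd = benchmark_command("203.0.113.10", concurrency=1, timelimit=600)
print(shlex.join(cmd) + " < queries.sql")
```

The list form can be passed to `subprocess.run` with the query file opened as `stdin`, which avoids shell quoting issues.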

The following table shows the QPS (higher is better) comparison for the default single connection scenario (clickhouse-benchmark --concurrency=1) on C7g and C6i.

Instance Size	C7g (Queries/Sec)	C6i (Queries/Sec)	Performance gain
2xlarge	0.684	0.581	17.7%
4xlarge	2.249	1.738	29.4%
8xlarge	3.529	2.709	30.3%
16xlarge	4.536	3.446	31.6%

Table 5. ClickHouse throughput performance results (single connection) on C7g vs C6i

Figure 5. ClickHouse throughput performance gain (single connection) for C7g vs. C6i instances
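
For throughput, the "Performance gain" column appears to be the QPS increase relative to C6i. As with the latency tables, the formula is inferred from the published values, and the Python sketch below is only a check under that assumption.

```python
def throughput_gain_percent(c7g_qps, c6i_qps):
    """Extra queries per second on C7g as a percentage of C6i (inferred formula)."""
    return (c7g_qps / c6i_qps - 1) * 100


# (C7g QPS, C6i QPS, published gain %) copied from Table 5.
TABLE_5 = {
    "2xlarge": (0.684, 0.581, 17.7),
    "4xlarge": (2.249, 1.738, 29.4),
    "8xlarge": (3.529, 2.709, 30.3),
    "16xlarge": (4.536, 3.446, 31.6),
}

for size, (c7g, c6i, published) in TABLE_5.items():
    computed = throughput_gain_percent(c7g, c6i)
    print(f"{size}: computed {computed:.2f}%, published {published}%")
```

Note that this is a different measure from the time reduction used for the query time tables, so the latency and throughput percentages are not directly comparable.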

The following table shows the QPS comparison for a multi-connection scenario (clickhouse-benchmark --concurrency=N) on C7g and C6i. Note: the smaller instance sizes (xlarge, 2xlarge, and 4xlarge) cannot support the multi-connection scenario because of their limited memory capacity.

Instance Size	Concurrency	C7g (Queries/Sec)	C6i (Queries/Sec)	Performance gain
8xlarge	2	4.125	2.968	39.0%
	4	4.138	2.931	41.2%
	6	4.182	2.947	41.9%
	8	4.108	2.914	41.0%
16xlarge	2	5.847	4.003	46.1%
	4	6.195	4.071	52.2%
	6	6.329	4.093	54.6%
	8	6.290	4.112	53.0%

Table 6. ClickHouse throughput performance results (multi connection) on C7g vs C6i

Figure 6. ClickHouse throughput performance gain (multi connection) for C7g vs. C6i instances

Conclusion

In addition to a 20% instance price savings, deploying ClickHouse on AWS Graviton3-based C7g instances reduced query latency (processing time) by up to 26% and increased single-connection throughput by up to 32%. This comparison is against equally configured C6i instances based on 3rd Generation Intel Xeon Scalable processors.
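
To see how price and throughput combine, the Python sketch below estimates the relative cost per query. It assumes that the 20% savings means the C7g hourly price is 20% lower than C6i and that the 32% throughput gain applies to the workload in question; both are simplifying assumptions, and the result is an illustration, not a measured figure.

```python
def relative_cost_per_query(price_savings, throughput_gain):
    """C7g cost per query as a fraction of the C6i cost per query.

    price_savings and throughput_gain are fractions, e.g. 0.20 and 0.32.
    """
    return (1 - price_savings) / (1 + throughput_gain)


ratio = relative_cost_per_query(0.20, 0.32)
print(f"{ratio:.3f}")  # 0.606
```

Under these assumptions, each query on C7g would cost roughly 61% of what it costs on C6i.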

Visit the AWS Graviton3 page for customer stories on adoption of Arm-based processors. For details on how to migrate existing applications to AWS Graviton, please check this GitHub page. For any queries related to your software workloads running on Arm Neoverse platforms, feel free to reach out to us at sw-ecosystem@arm.com.
