Improve ClickHouse Performance up to 26% by using AWS Graviton3

Martin Ma
July 12, 2022

Co-authors: Martin Ma and Zaiping Bie


Introduction

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP) of queries. It delivers industry-leading query performance while significantly reducing storage requirements through columnar storage and compression. It has been popular in the OLAP field for the past several years and is widely used by many enterprises.

In this blog, we compare the query latency (processing time) and throughput of ClickHouse on two Amazon EC2 instance families over a range of instance sizes. These instance families are the Amazon EC2 C7g (based on Arm Neoverse-powered AWS Graviton3 processors) and C6i (based on 3rd Generation Intel Xeon Scalable processors). Our findings demonstrate that ClickHouse deployments on C7g instances can achieve up to 26% performance advantage over C6i instances. The following sections cover the details of our testing methodology and results.

Performance benchmark setup and results

For the benchmark setup, the ClickHouse server and client are deployed in different instances. We connect the ClickHouse client to the ClickHouse server and repeatedly send preset queries. We then collect query processing time and throughput to compare performance between C7g and C6i instances.

Build Config

To achieve the best performance, in addition to building ClickHouse with the latest Clang per the official procedure, we apply the following CMake native-architecture and AVX-related flags.

| Architecture | ClickHouse CMake flags |
|---|---|
| AArch64 | -DARCH_NATIVE=ON |
| x86 | -DARCH_NATIVE=ON -DENABLE_AVX2=ON -DENABLE_AVX2_FOR_SPEC_OP=ON -DENABLE_AVX512=ON -DENABLE_AVX512_FOR_SPEC_OP=ON |
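As a rough illustration, the per-architecture flag selection in the table above could be scripted as follows. This is only a sketch: the helper name `clickhouse_cmake_flags` is hypothetical, the flag names are copied from the table, and the mapping from machine strings to architectures is an assumption.

```python
# Flags copied from the build-config table; nothing else is assumed
# about the ClickHouse build system.
COMMON_FLAGS = ["-DARCH_NATIVE=ON"]
X86_ONLY_FLAGS = [
    "-DENABLE_AVX2=ON",
    "-DENABLE_AVX2_FOR_SPEC_OP=ON",
    "-DENABLE_AVX512=ON",
    "-DENABLE_AVX512_FOR_SPEC_OP=ON",
]


def clickhouse_cmake_flags(machine: str) -> list[str]:
    """Return the CMake flags from the table for a machine string (hypothetical helper)."""
    machine = machine.lower()
    if machine in ("aarch64", "arm64"):
        return list(COMMON_FLAGS)
    if machine in ("x86_64", "amd64"):
        return COMMON_FLAGS + X86_ONLY_FLAGS
    raise ValueError(f"unsupported architecture: {machine}")


# Example: the flag string for an AArch64 (C7g) build.
print(" ".join(clickhouse_cmake_flags("aarch64")))
```

The resulting list would be appended to the usual `cmake` invocation from the official build procedure.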

To align jemalloc behavior on C7g and C6i, the following jemalloc parameters are configured in jemalloc_internal_defs.h.in.

| jemalloc parameter | Value |
|---|---|
| LG_PAGE | 12 (One page is 2^LG_PAGE bytes) |
| LG_HUGEPAGE | 21 (One huge page is 2^LG_HUGEPAGE bytes) |
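For reference, the page sizes implied by these two values can be worked out directly. The snippet below is plain arithmetic on the numbers in the table and is not part of the jemalloc or ClickHouse sources.

```python
LG_PAGE = 12      # value from the table above
LG_HUGEPAGE = 21  # value from the table above

page_bytes = 1 << LG_PAGE          # 2**12 bytes
hugepage_bytes = 1 << LG_HUGEPAGE  # 2**21 bytes

print(page_bytes)                 # 4096 bytes, i.e. 4 KiB
print(hugepage_bytes // 1024**2)  # 2, i.e. 2 MiB
```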

Server Config

The ClickHouse server runs on C7g/C6i instance families across a range of instance sizes.

The benchmark client runs on a single C7g.4xlarge instance.

The following table summarizes the tested instance types.

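| Instance Type | Instance Size (vCPU) | Memory (GiB) | Storage |
|---|---|---|---|
| C7g / C6i | 2xlarge (8) | 16 | 50GB (EBS gp3) |
| C7g / C6i | 4xlarge (16) | 32 | 50GB (EBS gp3) |
| C7g / C6i | 8xlarge (32) | 64 | 50GB (EBS gp3) |
| C7g / C6i | 16xlarge (64) | 128 | 50GB (EBS gp3) |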

The software versions and test parameters are as follows:

| Software | Version |
|---|---|
| ClickHouse | v22.5.1.2079-stable |
| Operating System | Amazon Linux 2 |
| Kernel | 5.10.112-108.499.amzn2.aarch64, 5.10.112-108.499.amzn2.x86_64 |

| ClickHouse server parameter | Value |
|---|---|
| max_threads | vCPU number |

Note: the max_threads parameter specifies the number of worker threads used for parallel query processing on the ClickHouse server; its default value is the number of physical CPU cores. With this default setting, C7g instances outperform C6i instances by 40%, but up to half of the CPU resources on C6i instances sit idle while C7g instances are fully utilized. To fully utilize the CPU resources on C6i, we set max_threads to the vCPU count on both C7g and C6i instances in this comparison.
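The effect described in the note can be illustrated with simple arithmetic. The sketch below assumes two hardware threads per physical core on C6i (Hyper-Threading) and one on C7g, which is consistent with the note's observation that up to half of the C6i CPU resources are idle under the default. The function name is hypothetical.

```python
def default_max_threads(vcpus: int, threads_per_core: int) -> int:
    """Default max_threads as described in the note: one thread per physical core."""
    return vcpus // threads_per_core


# 16xlarge has 64 vCPUs on both families (see the instance table).
VCPUS = 64
for family, threads_per_core in (("C7g", 1), ("C6i", 2)):  # threads per core: assumption
    default = default_max_threads(VCPUS, threads_per_core)
    print(f"{family}: default max_threads={default}, tuned max_threads={VCPUS}")
```

Under these assumptions, the default leaves C6i with 32 worker threads on 64 vCPUs, while setting max_threads to the vCPU count gives both families 64.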

Query Time Test

We use the web analytics dataset (the “hits” table, containing 100 million rows) and the 43 typical queries provided by the official ClickHouse benchmark method to collect query processing times.

For each of these 43 queries, the average query time is the arithmetic mean of 10 consecutive runs after one warmup run. The total query time, as shown in the following table, is the sum of the average times of these 43 queries. We observed up to a 25.8% performance uplift when running ClickHouse on C7g instances compared to C6i instances.
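The measurement procedure described above can be sketched as follows. This is a minimal illustration, not the official benchmark script: `run_query` is a hypothetical callable that sends one query to the server and blocks until it completes, and timing on the client with a wall clock is an assumption.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def average_query_time(run_query: Callable[[str], None], query: str,
                       warmups: int = 1, repeats: int = 10) -> float:
    """Mean wall-clock seconds over `repeats` runs, after `warmups` untimed runs."""
    for _ in range(warmups):
        run_query(query)  # warmup run, not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        samples.append(time.perf_counter() - start)
    return mean(samples)


def total_query_time(run_query: Callable[[str], None],
                     queries: Sequence[str]) -> float:
    """Sum of the per-query averages across all queries."""
    return sum(average_query_time(run_query, q) for q in queries)
```

With the 43 benchmark queries, `total_query_time` would issue 43 × 11 = 473 queries in total: one warmup plus ten timed runs per query.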

The following table compares total query processing time (lower is better) between C7g and C6i.

| Instance Size | C7g (sec) | C6i (sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 34.95 | 42.77 | 18.3% |
| 4xlarge | 18.91 | 24.57 | 23.0% |
| 8xlarge | 11.72 | 15.57 | 24.8% |
| 16xlarge | 9.02 | 12.16 | 25.8% |

Table 1. ClickHouse query processing time benchmark results on C7g vs C6i

Figure 1. Query time performance gain for C7g vs. C6i
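The post does not state how the "Performance gain" column is derived. The values in Table 1 appear consistent with the reduction in query time relative to C6i, that is, (C6i − C7g) / C6i. The sketch below checks this inferred formula against three rows of the table; the function name is hypothetical.

```python
def time_reduction_pct(c7g_sec: float, c6i_sec: float) -> float:
    """Percentage reduction in query time on C7g relative to C6i (inferred formula)."""
    return (c6i_sec - c7g_sec) / c6i_sec * 100


# (C7g, C6i) total query times in seconds, copied from Table 1.
table1 = {
    "2xlarge": (34.95, 42.77),
    "4xlarge": (18.91, 24.57),
    "16xlarge": (9.02, 12.16),
}

for size, (c7g, c6i) in table1.items():
    print(f"{size}: {time_reduction_pct(c7g, c6i):.1f}%")
# prints 18.3%, 23.0%, and 25.8%, matching the table
```

Some rows recomputed from the rounded times can differ from the published value by about 0.1 percentage point, presumably because the published gains were computed from unrounded measurements.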


We also selected the three most time-consuming queries (Query 19, Query 33, and Query 34) to examine the performance uplift on C7g instances compared to C6i instances in more detail.

Query 19

SELECT UserID, toMinute(EventTime) AS m, SearchPhrase, count() FROM hits_100m_obfuscated GROUP BY UserID, m, SearchPhrase ORDER BY count() DESC LIMIT 10;

Query 33

SELECT WatchID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM hits_100m_obfuscated GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;

Query 34

SELECT URL, count() AS c FROM hits_100m_obfuscated GROUP BY URL ORDER BY c DESC LIMIT 10;

The following tables show the results for these three queries on C7g and C6i instances (lower is better).

| Instance Size | C7g (sec) | C6i (sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 3.995 | 4.918 | 18.8% |
| 4xlarge | 2.002 | 2.736 | 26.8% |
| 8xlarge | 1.101 | 1.558 | 29.3% |
| 16xlarge | 0.690 | 1.010 | 31.7% |

Table 2. Query 19 results on C7g vs C6i

Figure 2. Query 19 Performance gains for C7g vs. C6i instances

| Instance Size | C7g (sec) | C6i (sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 4.562 | 4.947 | 7.8% |
| 4xlarge | 2.351 | 2.816 | 16.5% |
| 8xlarge | 1.578 | 2.107 | 25.1% |
| 16xlarge | 1.137 | 1.608 | 29.3% |

Table 3. Query 33 results on C7g vs C6i

Figure 3. Query 33 Performance gains for C7g vs. C6i instances

| Instance Size | C7g (sec) | C6i (sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 3.225 | 3.766 | 14.4% |
| 4xlarge | 1.793 | 2.171 | 17.4% |
| 8xlarge | 1.066 | 1.325 | 19.6% |
| 16xlarge | 0.774 | 1.036 | 25.4% |

Table 4. Query 34 results on C7g vs C6i

Figure 4. Query 34 Performance gains for C7g vs. C6i instances

Throughput Test

We used the official ClickHouse benchmark tool (clickhouse-benchmark) to collect throughput data with the same dataset and queries. After a warmup phase, each test uses the benchmark tool to continuously send all 43 typical queries to the server and reports queries per second (QPS) at the end of the test. We observed up to a 31.6% performance uplift in the single-connection scenario when running ClickHouse on C7g instances compared to C6i instances.
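A throughput run of this kind could be driven from a script along the following lines. This is a sketch under stated assumptions rather than the authors' harness: the host name, the queries file path, and the time limit are hypothetical, and the post does not say how long each test ran. It relies on clickhouse-benchmark reading queries from standard input and accepting the `--host`, `--concurrency`, and `--timelimit` options.

```python
import subprocess


def benchmark_cmd(host: str, concurrency: int = 1,
                  timelimit_sec: int = 600) -> list[str]:
    """Build a clickhouse-benchmark command line (time limit is an assumed value)."""
    return [
        "clickhouse-benchmark",
        f"--host={host}",
        f"--concurrency={concurrency}",
        f"--timelimit={timelimit_sec}",
    ]


def run_benchmark(host: str, queries_path: str,
                  concurrency: int = 1) -> subprocess.CompletedProcess:
    """Feed a file with one query per line to clickhouse-benchmark via stdin."""
    with open(queries_path) as queries:
        return subprocess.run(
            benchmark_cmd(host, concurrency),
            stdin=queries,
            capture_output=True,
            text=True,
            check=True,
        )
```

For example, `run_benchmark("clickhouse-server.example", "queries.sql", concurrency=4)` would correspond to one cell of a multi-connection test; both arguments here are placeholders.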

The following table shows the QPS (higher is better) comparison for the default single connection scenario (clickhouse-benchmark --concurrency=1) on C7g and C6i.

| Instance Size | C7g (Queries/Sec) | C6i (Queries/Sec) | Performance gain |
|---|---|---|---|
| 2xlarge | 0.684 | 0.581 | 17.7% |
| 4xlarge | 2.249 | 1.738 | 29.4% |
| 8xlarge | 3.529 | 2.709 | 30.3% |
| 16xlarge | 4.536 | 3.446 | 31.6% |

Table 5. ClickHouse throughput performance results (single connection) on C7g vs C6i

Figure 5. ClickHouse throughput performance gain (single connection) for C7g vs. C6i instances
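For throughput, where higher is better, the gain column appears to be the QPS increase relative to C6i, that is, (C7g − C6i) / C6i. This formula is inferred from the numbers rather than stated in the post; the sketch below reproduces the Table 5 column with a hypothetical helper.

```python
def throughput_gain_pct(c7g_qps: float, c6i_qps: float) -> float:
    """Percentage QPS increase on C7g relative to C6i (inferred formula)."""
    return (c7g_qps - c6i_qps) / c6i_qps * 100


# (C7g, C6i) queries per second, copied from Table 5.
table5 = {
    "2xlarge": (0.684, 0.581),
    "4xlarge": (2.249, 1.738),
    "8xlarge": (3.529, 2.709),
    "16xlarge": (4.536, 3.446),
}

for size, (c7g, c6i) in table5.items():
    print(f"{size}: {throughput_gain_pct(c7g, c6i):.1f}%")
# prints 17.7%, 29.4%, 30.3%, and 31.6%, matching the table
```

Note that both the query-time and throughput gains use C6i as the baseline in the denominator.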

The following table shows the QPS comparison for a multi-connection scenario (clickhouse-benchmark --concurrency=N) on C7g and C6i. Note: the xlarge, 2xlarge, and 4xlarge instances could not run the multi-connection test because of memory capacity limits.

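| Instance Size | Concurrency | C7g (Queries/Sec) | C6i (Queries/Sec) | Performance gain |
|---|---|---|---|---|
| 8xlarge | 2 | 4.125 | 2.968 | 39.0% |
| 8xlarge | 4 | 4.138 | 2.931 | 41.2% |
| 8xlarge | 6 | 4.182 | 2.947 | 41.9% |
| 8xlarge | 8 | 4.108 | 2.914 | 41.0% |
| 16xlarge | 2 | 5.847 | 4.003 | 46.1% |
| 16xlarge | 4 | 6.195 | 4.071 | 52.2% |
| 16xlarge | 6 | 6.329 | 4.093 | 54.6% |
| 16xlarge | 8 | 6.290 | 4.112 | 53.0% |

Table 6. ClickHouse throughput performance results (multi connection) on C7g vs C6i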

Figure 6. ClickHouse throughput performance gain (multi connection) for C7g vs. C6i instances

Conclusion

In addition to a 20% instance price savings, deploying ClickHouse on AWS Graviton3-based C7g instances reduced query latency (processing time) by up to 26% and increased single-connection throughput by up to 32%. These comparisons are against equally configured C6i instances based on 3rd Generation Intel Xeon Scalable processors.

Visit the AWS Graviton3 page for customer stories on adoption of Arm-based processors. For details on how to migrate existing applications to AWS Graviton, please check this GitHub page. For any queries related to your software workloads running on Arm Neoverse platforms, feel free to reach out to us at sw-ecosystem@arm.com.
