Gain up to 58% price-performance benefits when deploying Apache Spark on AWS Graviton2

Masoud Koleini
March 15, 2022
7 minute read time.

Introduction

Apache Spark is a distributed big data computation engine that runs over a cluster of machines. On Spark, parallel computations can be executed using a dataset abstraction called RDD (Resilient Distributed Dataset), or as SQL queries using the Spark SQL API. Spark Streaming is a Spark module that allows users to express streaming data as RDD/dataframe abstractions. These abstractions run through functions, and the results are stored in a storage backend or sent to an output stream. Apache Spark is mainly suited to big data processing, where the data cannot reside in the memory of a single machine.

In this blog, we compare the performance of Apache Spark running on a single-node Yarn cluster on M6gd (AWS Graviton2) and M5d (Intel Skylake-SP or Cascade Lake) AWS instances. HDFS (Apache Hadoop Distributed File System) data is stored on NVMe-based SSD storage attached to the instance.

AWS Graviton2: an Arm Neoverse-based processor

AWS has built the Graviton2 processor using Arm Neoverse cores, which are optimized for running cloud workloads. Graviton2-based instances provide better price-performance than comparable x86 instances. The size of the advantage depends on the type of application deployed. Some applications can reach up to a 40% increase in price-performance when compared to their x86 counterparts.

AWS has reported improved performance on Graviton2-based Amazon EMR when running Spark workloads, resulting in even lower prices for running the cluster on M6g, C6g, and R6g instance types.

Benchmark environment

Benchmark tool

HiBench is a big data benchmarking tool for Hadoop, Spark, Flink, and Kafka and is used in this blog to compare Spark performance on AWS instances with different architectures. It has six workload categories: Micro Benchmarks, Machine Learning, SQL, Graph, Streaming, and Web Search. Each workload defines a data scale profile from the set tiny, small, large, huge, gigantic, and bigdata. The size of each data profile is defined in the workload configuration file.

To compare the throughput of Spark computations on different AWS instances, two workloads from two different categories were selected:

  • Spark SQL: Aggregation from SQL workloads
    • Implements the Spark SQL aggregation function.
  • Spark ML: K-means from ML workloads
    • Implements the K-means clustering algorithm, which involves a large amount of data processing.
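
To make the two workload types concrete, here is a minimal single-machine sketch of what each one computes. It is not the HiBench or Spark implementation: the function names, the row fields (`source_ip`, `ad_revenue`), and the sample data are hypothetical, and the K-means routine is plain Lloyd's algorithm with random initial centroids.

```python
import random
from collections import defaultdict


def aggregate_sum(rows, key, value):
    """Group rows by `key` and sum `value` (the shape of a GROUP BY ... SUM query)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=50, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # hypothetical choice: random initial centroids
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Hypothetical sample data, for illustration only.
visits = [
    {"source_ip": "10.0.0.1", "ad_revenue": 1.5},
    {"source_ip": "10.0.0.2", "ad_revenue": 0.25},
    {"source_ip": "10.0.0.1", "ad_revenue": 0.5},
]
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

print(aggregate_sum(visits, "source_ip", "ad_revenue"))
print(sorted(kmeans(points, k=2)))
```

On Spark, the same logic is spread across executors: aggregation shuffles partial sums by key, and each K-means iteration makes a full pass over the dataset, which is why both workloads stress throughput at the larger data scales.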

The storage required by some of the bigdata benchmarks is larger than the NVMe partition on M5d (x86) instances. Therefore, we omit the bigdata profile tests in this blog. In addition, since Spark is specifically used for large dataset computations, we also ignore the results for tiny and small datasets.

Spark cluster

For the benchmarks, the Yarn Hadoop cluster is pseudo-distributed, where Hadoop runs on a single node with its daemons running on separate JVMs. Benchmark data is stored inside an HDFS cluster running on the same node using NVMe SSD as its storage backend.

The Spark version is 3.1.2 and the Yarn Hadoop version is 3.3.1. The Spark/Yarn configuration follows best practices and recommendations, factoring in the resources available on the instances tested. The following table shows the number of executors, executor cores, and executor memory used for the different instance sizes:

 

Instance            Yarn executor cores   Num of executors   Executor memory (GB)
m6gd/m5d.8xlarge    5                     5                  16
m6gd/m5d.12xlarge   5                     9                  16
m6gd/m5d.16xlarge   5                     12                 16

Table 1. Yarn executor parameters for different instances

Cluster parameters can be tuned based on the type of computation running on the cluster. For simplicity, the Spark benchmarks use the same parameters for all workloads. The number of executor cores is set to 5, and the number of executors is chosen so that the total number of cores remains below the number of vCPUs available on the instance. The Yarn executor memory follows the same principle: the total memory used by executors remains below the total available memory. Some vCPUs and memory are left for operating system processes and daemons, following the recommended settings in the HiBench introduction.
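
The sizing rule above can be checked with a few lines of arithmetic. The sketch below takes the executor settings from Table 1 and the instance sizes from Table 2 and reports how many vCPUs and how much memory remain for the operating system and daemons. The names `INSTANCES` and `headroom` are illustrative and not part of any Spark or HiBench API.

```python
EXECUTOR_CORES = 5        # cores per executor (Table 1)
EXECUTOR_MEMORY_GB = 16   # memory per executor (Table 1)

# vCPU and memory figures from Table 2, executor counts from Table 1.
INSTANCES = {
    "8xlarge": {"vcpus": 32, "memory_gb": 128, "executors": 5},
    "12xlarge": {"vcpus": 48, "memory_gb": 192, "executors": 9},
    "16xlarge": {"vcpus": 64, "memory_gb": 256, "executors": 12},
}


def headroom(spec):
    """Return (spare vCPUs, spare memory in GB) left after all executors are placed."""
    used_cores = spec["executors"] * EXECUTOR_CORES
    used_memory_gb = spec["executors"] * EXECUTOR_MEMORY_GB
    return spec["vcpus"] - used_cores, spec["memory_gb"] - used_memory_gb


for size, spec in INSTANCES.items():
    spare_cores, spare_memory_gb = headroom(spec)
    # Both values must stay positive for the configuration to fit on the instance.
    print(f"{size}: {spare_cores} vCPUs and {spare_memory_gb} GB left for OS and daemons")
```

In a Spark deployment these three values correspond to the `spark.executor.cores`, `spark.executor.instances`, and `spark.executor.memory` properties.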

AWS instance setup

Benchmarks run on general-purpose compute instances with NVMe support for three different sizes: 8xlarge, 12xlarge, and 16xlarge.

The specifications of the instances are as follows:

 

Instance            vCPU   Memory (GB)   Storage (m6gd/m5d)
m6gd/m5d.8xlarge    32     128           1 x 1900/2 x 600 NVMe SSD
m6gd/m5d.12xlarge   48     192           2 x 1425/600 NVMe SSD
m6gd/m5d.16xlarge   64     256           2 x 1900/900 NVMe SSD

Table 2. Instance configurations under test

Price-performance comparison

The benchmarks run on a single node, resembling a Spark cluster with very low network latency. The metric of interest is throughput in bytes/s. Figures 1 and 2 show the price-performance in bytes/$ (bar graph, left axis) of Spark running on m6gd.16xlarge and m5d.16xlarge, two instances with similar resources.

The instances were deployed in us-east-1, and price-performance is calculated from on-demand prices. For the workloads selected, the price-performance benefit of M6gd (line graph, right axis) reaches 37% for aggregation and 58% for K-means on the gigantic dataset.

Figure 1. SQL/Aggregation price-performance comparison, 16xlarge instances


Figure 2. ML/K-means price-performance comparison, 16xlarge instances


Tables 3 and 4 list the price-performance numbers and the percentage improvement of the M6gd instances over the M5d instances for the aggregation and K-means workloads (the data behind Figures 1 and 2).

sql/aggregation m6gd.16xlarge m5d.16xlarge m6gd Improvement %
large 1060460592 918990597.3 15.39
huge 9932358684 8081299115 22.91
gigantic 65600271986 47821068916 37.18

Table 3. SQL/Aggregation bytes/$ on large to gigantic datasets for 16xlarge instances

ml/kmeans m6gd.16xlarge m5d.16xlarge m6gd Improvement %
large 105738662887.17 75821708628.32 39.46
huge 351306313606.20 241592742477.88 45.41
gigantic 507086965154.87 319719116150.44 58.60

Table 4. ML/K-means bytes/$ on large to gigantic datasets for 16xlarge instances
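
The bytes/$ values can be derived from throughput and hourly price. The sketch below does this for the gigantic aggregation run on the 16xlarge instances, using the throughput figures reported in Table 7. The hourly prices are assumptions: the post does not state them, and the values used here ($2.8928 for m6gd.16xlarge and $3.616 for m5d.16xlarge) were chosen because they reproduce the Table 3 figures.

```python
SECONDS_PER_HOUR = 3600

# Assumed us-east-1 on-demand prices in USD per hour (not stated in the post).
M6GD_16XLARGE_PRICE = 2.8928
M5D_16XLARGE_PRICE = 3.616


def bytes_per_dollar(throughput_bytes_per_s, price_usd_per_hour):
    """Bytes processed per dollar: bytes per hour divided by dollars per hour."""
    return throughput_bytes_per_s * SECONDS_PER_HOUR / price_usd_per_hour


def improvement_pct(candidate, baseline):
    """Percentage by which `candidate` exceeds `baseline`."""
    return (candidate - baseline) / baseline * 100


# Gigantic SQL aggregation throughput in bytes/s (Table 7).
m6gd = bytes_per_dollar(52713463, M6GD_16XLARGE_PRICE)
m5d = bytes_per_dollar(48033607, M5D_16XLARGE_PRICE)

print(round(m6gd), round(m5d), round(improvement_pct(m6gd, m5d), 2))
```

Under these assumed prices, the result matches the gigantic row of Table 3 (65600271986, 47821068916, and 37.18). It also shows why a 9.74% throughput lead becomes a 37% price-performance lead: the M6gd instance is both faster on this dataset and cheaper per hour.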

Spark clusters usually process terabytes of data, spawning many worker nodes on which to run executors. The price-performance calculations (Tables 3 and 4) show how users can significantly reduce costs by running their cluster on M6gd instances. The throughput benchmarks in the next section show the compute performance behind these price-performance numbers.

Benchmark performance results

The charts in this section show benchmark results on 8xlarge, 12xlarge and 16xlarge instances. The left axis on each chart is the throughput in bytes/s (bar graph). The right axis is M6gd performance improvements over M5d instances (line graph).

Spark SQL benchmarks

The following charts compare aggregation throughput on M6gd and M5d instances in the 8xlarge, 12xlarge, and 16xlarge sizes. On the gigantic dataset, the HiBench SQL aggregation benchmark shows 29% (8xlarge), 18% (12xlarge), and 9.74% (16xlarge) higher throughput for M6gd over M5d. On the large and huge datasets, M5d is ahead by between 1.7% and 14.5% (Tables 5 to 7).

Figure 3. SQL/Aggregation throughput comparison, 8xlarge instances


Figure 4. SQL/Aggregation throughput comparison, 12xlarge instances


Figure 5. SQL/Aggregation throughput comparison, 16xlarge instances


Tables 5, 6, and 7 list the throughput numbers and the M6gd improvements over M5d instances for the SQL aggregation benchmark.

sql/aggregation m6gd.8xlarge m5d.8xlarge m6gd Improvement %
large 942415 1102030 -14.48
huge 8700149 9124867 -4.65
gigantic 48871786 37851898 29.11

Table 5. SQL/Aggregation throughput comparison data, 8xlarge instances

sql/aggregation m6gd.12xlarge m5d.12xlarge m6gd Improvement %
large 928698 1013234 -8.34
huge 8460299 9049525 -6.51
gigantic 55229252 46792922 18.03

Table 6. SQL/Aggregation throughput comparison data, 12xlarge instances

sql/aggregation m6gd.16xlarge m5d.16xlarge m6gd Improvement %
large 852139 923075 -7.68
huge 7981202 8117216 -1.68
gigantic 52713463 48033607 9.74

Table 7. SQL/Aggregation throughput comparison data, 16xlarge instances

Spark ML benchmarks

The charts in Figures 6, 7, and 8 show K-means benchmarks on M6gd and M5d instances. The K-means benchmarks show up to 21.6% higher throughput on the huge dataset for 8xlarge instances, and 23.6% (12xlarge) and 26.88% (16xlarge) higher throughput on the gigantic dataset.


Figure 6. ML/K-means throughput comparison, 8xlarge instances

Figure 7. ML/K-means throughput comparison, 12xlarge instances


Figure 8. ML/K-means throughput comparison, 16xlarge instances


Tables 8, 9, and 10 list the throughput numbers and the M6gd improvements over M5d instances for the K-means benchmarks.

ml/kmeans m6gd.8xlarge m5d.8xlarge m6gd Improvement %
large 80238342 79914573 0.41
huge 212746148 174920830 21.62

Table 8. ML/K-means throughput comparison data, 8xlarge instances

ml/kmeans m6gd.12xlarge m5d.12xlarge m6gd Improvement %
large 89087828 87693452 1.59
huge 268752373 228047753 17.85
gigantic 304764667 246581359 23.60

Table 9. ML/K-means throughput comparison data, 12xlarge instances

ml/kmeans m6gd.16xlarge m5d.16xlarge m6gd Improvement %
large 84966890 76158694 11.57
huge 282294140 242666488 16.33
gigantic 407472548 321140090 26.88

Table 10. ML/K-means throughput comparison data, 16xlarge instances

Selecting the right EBS volume

We also ran the benchmarks on instances with general-purpose SSD volumes (gp2). When the job has a small I/O size and infrequent disk access, our experiments show a negligible difference between running Spark jobs on gp2 and on NVMe-based volumes. This is the case for the Aggregation and K-means benchmarks. However, users will experience poor performance on general-purpose SSD volumes if the Spark job requires a lot of disk access (for example, WordCount in the micro benchmarks). This poor performance is due to the IOPS/throughput limit imposed on the storage. Selecting the right volume type for the job can therefore lower costs without affecting performance.
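
As a rough illustration of the IOPS limit, the sketch below estimates the baseline IOPS of a gp2 volume from its size. The constants are assumptions taken from AWS's published gp2 characteristics (3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000), not from the benchmark data in this post, and the function name is hypothetical.

```python
# Assumed gp2 characteristics from AWS documentation, not measured in this post.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16000


def gp2_baseline_iops(size_gib):
    """Estimate the baseline IOPS of a gp2 volume of the given size in GiB."""
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib))


for size_gib in (20, 500, 2000, 10000):
    print(f"{size_gib} GiB gp2 volume: about {gp2_baseline_iops(size_gib)} baseline IOPS")
```

Because the baseline scales with volume size, a small gp2 volume can throttle a disk-heavy job even when the instance itself has spare CPU and memory.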

Conclusion

Running the Spark SQL Aggregation and Spark ML K-means workloads shows that M6gd (Graviton2) instances deliver up to 29% higher throughput than equivalent M5d (x86) instances on the largest datasets tested. M6gd also offers up to a 58% price-performance advantage over M5d on gigantic datasets.

Visit the AWS Graviton page for customer stories on adoption of Arm-based processors. For any queries related to your software workloads running on Arm Neoverse platforms, feel free to reach out to us at sw-ecosystem@arm.com.
