Gain up to 58% price-performance benefits when deploying Apache Spark on AWS Graviton2

March 15, 2022

7 minute read time.

Introduction

Apache Spark is a distributed big data computation engine that runs over a cluster of machines. On Spark, parallel computations can be executed using a dataset abstraction called RDD (Resilient Distributed Datasets), or can be executed as SQL queries using the Spark SQL API. Spark Streaming is a Spark module that allows users to express streaming data as RDD/dataframe abstractions. These abstractions run through functions and the results are stored in a storage backend. The results can also be sent to an output stream. Apache Spark is mainly suited for big data processing, where data cannot reside in the memory of a single machine.

In this blog, we compare the performance of Apache Spark running on a single node Yarn cluster on M6gd (AWS Graviton2) and M5d (Intel Skylake-SP or Cascade Lake) AWS instances. HDFS (Apache Hadoop Distributed File System) data is stored on NVMe-based SSD storage attached to the instance.

AWS Graviton2: an Arm Neoverse-based processor

AWS has built the Graviton2 processor using Arm Neoverse cores which are optimized for running on cloud workloads. Graviton2-based instances provide better price-performance compared to x86 ones. The price-performance advantage of Graviton2 instances depends on the type of the applications that are deployed. Some applications can reach up to 40% increase in price-performance when compared to x86 counterparts.

AWS has reported improved performance on Graviton2-based Amazon EMR when running Spark workloads, resulting in even lower prices for running the cluster on M6g, C6g, and R6g instance types.

Benchmark environment

Benchmark tool

HiBench is a big data benchmarking tool for Hadoop, Spark, Flink, and Kafka and is used in this blog to compare Spark performance on AWS instances with different architectures. It has six workload categories: Micro Benchmarks, Machine Learning, SQL, Graph, Streaming, and Web Search. Each workload defines a data scale profile from the set tiny, small, large, huge, gigantic, and bigdata. The size of each data profile is defined in the workload configuration file.

To compare the throughput of Spark computations on different AWS instances, two workloads from three different categories were selected:

Spark SQL: Aggregation from SQL workloads
- Implements Spark SQL aggregation function.
Spark ML: K-means from ML workloads
- Implements K-means clustering algorithm involving large amount of data processing.

The storage required by some of the bigdata benchmarks is larger than the NVMe partition on M5d (x86) instances. Therefore, we omit the bigdata profile tests in this blog. In addition, since Spark is specifically used for large dataset computations, we also ignore the results for tiny and small datasets.

Spark cluster

For the benchmarks, the Yarn Hadoop cluster is pseudo-distributed, where Hadoop runs on a single node with its daemons running on separate JVMs. Benchmark data is stored inside an HDFS cluster running on the same node using NVMe SSD as its storage backend.

The Spark version is 3.1.2 and the Yarn Hadoop version is 3.3.1. The Spark/Yarn configuration uses best practices and recommendations factoring in available resources on the instances tested. The following tables show the number of executors, executor cores, and executor memory used for different instance sizes:

	Yarn executor cores	Num of executors	Executor memory (GB)
m6gd/m5d.8xlarge	5	5	16
m6gd/m5d.12xlarge	5	9	16
m6gd/m5d.16xlarge	5	12	16

Table 1. Yarn executor parameters for different instances

Cluster parameters can be tuned based on the type of the computation running on the cluster. For the sake of simplicity, Spark benchmarks use similar parameters for all workloads. The number of executor cores are set to 5, and the number of executors is computed so the total number of cores remains less than available vCPUs on the instance. The Yarn executor memory is computed following the same principle, so the total memory used by executors remains less than the total available memory. Some vCPUs and memory are left for operating system processes and daemons (following the recommendations in HiBench introduction for recommended settings).

AWS instance setup

Benchmarks run on general-purpose compute instances with NVMe support for three different sizes: 8xlarge, 12xlarge, and 16xlarge.

The specifications of the instances are as follows:

	vCPU	Memory (GB)	Storage
m6gd/m5d.8x large	32	128	1 x 1900/2 x 600 NVMe SSD
m6gd/m5d.12xlarge	48	192	2 x 1425/600 NVMe SSD
m6gd/m5d.16xlarge	64	256	2 x 1900/900 NVMe SSD

Table 2. Instance configurations under test

Price-performance comparison

The benchmarks run on a single node, resembling a Spark cluster with very low network latency. The metric of interest is bytes/s (throughput). The charts in Figure 1 demonstrate price-performance (bytes/$) (bar graph, left axis) of Spark running on m6gd.16xlarge and m5d.16xlarge, which are two instances with similar resources.

The instances were deployed on us-east-1 and the price-performance is calculated with on-demand prices. For the workloads selected, the price-performance benefit of the M6gd is up to 37% for aggregation and 58% for K-means gigantic datasets (line graph, right axis).

Figure 1. SQL/Aggregation price-performance comparison, 16xlarge instances

Figure 2. ML/K-means price-performance comparison, 16xlarge instances

Tables 3 and 4 are the price-performance numbers and percent improvements of the M6gd instances over the M5d ones for aggregation and K-means algorithms (data from figures 1 & 2).

sql/aggregation	m6gd.16xlarge	m5d.16xlarge	m6gd Improvement %
large	1060460592	918990597.3	15.39
huge	9932358684	8081299115	22.91
gigantic	65600271986	47821068916	37.18

Table 3. Sql/aggregation bytes/$ on large to gigantic datasets for 16xlarge instances

ml/kmeans	m6gd.16xlarge	m5d.16xlarge	m6gd Improvement %
large	105738662887.17	75821708628.32	39.46
huge	351306313606.20	241592742477.88	45.41
gigantic	507086965154.87	319719116150.44	58.60

Table 4. ml/kmeans bytes/$ on large to gigantic datasets for 16xlarge instances

Spark clusters usually process terabytes of data, spawning many worker nodes on which to run executors. The price-performance calculations (tables 3 & 4) show how users can significantly reduce costs by running their cluster on M6gd instances. The throughput benchmarks in the next section show users the price-performance benefit and compute performance they can gain by running jobs on M6gd.

Benchmark performance results

The charts in this section show benchmark results on 8xlarge, 12xlarge and 16xlarge instances. The left axis on each chart is the throughput in bytes/s (bar graph). The right axis is M6gd performance improvements over M5d instances (line graph).

Spark SQL benchmarks

The following charts compare aggregation performance throughput on M6gd vs M5d instances in 8xlarge, 12xlarge and 16xlarge instance sizes. HiBench SQL aggregation benchmarks show up to 29% (8xlarge), 18% (12xlarge), and 9.74% (16xlarge) higher throughput on the gigantic dataset for M6gd over M5d.

Figure 3. SQL/Aggregation throughput comparison, 8xlarge instances

Figure 3: SQL/Aggregation throughput comparison, 8xlarge instances

Figure 4. SQL/Aggregation throughput comparison, 12xlarge instances

Figure 5. SQL/Aggregation throughput comparison, 16xlarge instances

Tables 5, 6 and 7 are throughput numbers and M6gd improvements over M5d instances for SQL aggregation benchmark.

	m6gd.8xlarge	m5d.8xlarge	m6gd Improvement %
large	942415	1102030	-14.48372549
huge	8700149	9124867	-4.654511677
gigantic	48871786	37851898	29.11317155

Table 5: SQL/Aggregation throughput comparison data, 8xlarge instances

sql/aggregation	m6gd.12xlarge	m5d.12xlarge	m6gd Improvement
large	928698	1013234	-8.34
huge	8460299	9049525	-6.51
gigantic	55229252	46792922	18.03

Table 6. SQL/Aggregation throughput comparison data, 12xlarge instances

sql/aggregation	m6gd.16xlarge	m5d.16xlarge	m6gd Improvement %
large	852139	923075	-7.68
huge	7981202	8117216	-1.68
gigantic	52713463	48033607	9.74

Table 7. SQL/Aggregation throughput comparison data, 16xlarge instances

Spark ML benchmarks

The charts in Figures 6, 7 and 8 represent K-means benchmarks on M6gd vs M5d instances. K-means benchmarks show up to 21.6% (8xlarge instances) higher throughput on the huge dataset. And 23.6% (12xlarge instances) and 26.88% (16xlarge instances) higher throughput on the gigantic dataset.

Figure 6: ML/K-means throughput comparison data, 8xlarge instances

Figure 6. ML/K-means throughput comparison, 8xlarge instances

Figure 7. ML/K-means throughput comparison, 12xlarge instances

Figure 8. ML/K-means throughput comparison, 16xlarge instances

Tables 8, 9 and 10 are throughput numbers and M6gd improvements over M5d instances for K-means benchmarks.

	m6gd.8xlarge	m5d.8xlarge	m6gd Improvement %
large	80238342	79914573	0.405143878
huge	212746148	174920830	21.62425024

Table 8: ML/K-means throughput comparison data, 8xlarge instances

ml/kmeans	m6gd.12xlarge	m5d.12xlarge	m6gd Improvement
large	89087828	87693452	1.59
huge	268752373	228047753	17.85
gigantic	304764667	246581359	23.60

Table 9. ML/K-means throughput comparison data, 12xlarge instances

ml/kmeans	m6gd.16xlarge	m5d.16xlarge	m6gd Improvement %
large	84966890	76158694	11.57
huge	282294140	242666488	16.33
gigantic	407472548	321140090	26.88

Table 10. ML/K-means throughput comparison data, 16xlarge instances

Selecting the right EBS volume

We also run the benchmarks on instances with general-purpose SSD volumes (gp2). When the job has small I/O size and infrequent disk access, our experiments show a negligible difference between running Spark jobs on gp2 compared to using NVMe-based volumes. This is the case for the Aggregation and K-means benchmarks. However, users will experience poor performance on general-purpose SSD volumes if the Spark job requires lots of disk access (for example, WordCount in micro benchmarks). This poor performance is due to an IOPS/throughout limit imposed on the storage. Therefore, selecting the right volume type for the job can benefit users by lowering their costs without affecting performance.

Conclusion

Running Spark SQL Aggregation and Spark ML K-means workloads shows that M6gd (Graviton2) instances outperform equivalent M5d (x86) instances by up to 26%. M6gd also offers up to a 58% price-performance advantage over M5d on gigantic datasets.

Visit the AWS Graviton page for customer stories on adoption of Arm-based processors. For any queries related to your software workloads running on Arm Neoverse platforms, feel free to reach out to us at sw-ecosystem@arm.com.

More Neoverse Workload Blogs

0 comments
0 members are here

Servers and Cloud Computing blog

How SiteMana scaled real-time visitor ingestion and ML inference by migrating to Arm-based AWS Graviton3

Peter Ma

Migrating to Arm-based AWS Graviton3 improved SiteMana’s scalability, latency, and costs while enabling real-time ML inference at scale.
- July 4, 2025
Arm Performance Libraries 25.04 and Arm Toolchain for Linux 20.1 Release

Chris Goodyer

In this blog post, we announce the releases of Arm Performance Libraries 25.04 and Arm Toolchain for Linux 20.1. Explore the new product features, performance highlights and how to get started.
- June 17, 2025
Harness the Power of Retrieval-Augmented Generation with Arm Neoverse-powered Google Axion Processors

Na Li

This blog explores the performance benefits of RAG and provides pointers for building a RAG application on Arm®︎ Neoverse-based Google Axion Processors for optimized AI workloads.
- April 7, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog