Apache Spark is a distributed big data computation engine that runs over a cluster of machines. On Spark, parallel computations can be executed using a dataset abstraction called RDD (Resilient Distributed Datasets), or can be executed as SQL queries using the Spark SQL API. Spark Streaming is a Spark module that allows users to express streaming data as RDD/dataframe abstractions. These abstractions run through functions and the results are stored in a storage backend. The results can also be sent to an output stream. Apache Spark is mainly suited for big data processing, where data cannot reside in the memory of a single machine.
In this blog, we compare the performance of Apache Spark running on a single node Yarn cluster on M6gd (AWS Graviton2) and M5d (Intel Skylake-SP or Cascade Lake) AWS instances. HDFS (Apache Hadoop Distributed File System) data is stored on NVMe-based SSD storage attached to the instance.
AWS has built the Graviton2 processor using Arm Neoverse cores which are optimized for running on cloud workloads. Graviton2-based instances provide better price-performance compared to x86 ones. The price-performance advantage of Graviton2 instances depends on the type of the applications that are deployed. Some applications can reach up to 40% increase in price-performance when compared to x86 counterparts.
AWS has reported improved performance on Graviton2-based Amazon EMR when running Spark workloads, resulting in even lower prices for running the cluster on M6g, C6g, and R6g instance types.
HiBench is a big data benchmarking tool for Hadoop, Spark, Flink, and Kafka and is used in this blog to compare Spark performance on AWS instances with different architectures. It has six workload categories: Micro Benchmarks, Machine Learning, SQL, Graph, Streaming, and Web Search. Each workload defines a data scale profile from the set tiny, small, large, huge, gigantic, and bigdata. The size of each data profile is defined in the workload configuration file.
To compare the throughput of Spark computations on different AWS instances, two workloads from three different categories were selected:
The storage required by some of the bigdata benchmarks is larger than the NVMe partition on M5d (x86) instances. Therefore, we omit the bigdata profile tests in this blog. In addition, since Spark is specifically used for large dataset computations, we also ignore the results for tiny and small datasets.
For the benchmarks, the Yarn Hadoop cluster is pseudo-distributed, where Hadoop runs on a single node with its daemons running on separate JVMs. Benchmark data is stored inside an HDFS cluster running on the same node using NVMe SSD as its storage backend.
The Spark version is 3.1.2 and the Yarn Hadoop version is 3.3.1. The Spark/Yarn configuration uses best practices and recommendations factoring in available resources on the instances tested. The following tables show the number of executors, executor cores, and executor memory used for different instance sizes:
Yarn executor cores
Num of executors
Executor memory (GB)
Table 1. Yarn executor parameters for different instances
Cluster parameters can be tuned based on the type of the computation running on the cluster. For the sake of simplicity, Spark benchmarks use similar parameters for all workloads. The number of executor cores are set to 5, and the number of executors is computed so the total number of cores remains less than available vCPUs on the instance. The Yarn executor memory is computed following the same principle, so the total memory used by executors remains less than the total available memory. Some vCPUs and memory are left for operating system processes and daemons (following the recommendations in HiBench introduction for recommended settings).
AWS instance setup
Benchmarks run on general-purpose compute instances with NVMe support for three different sizes: 8xlarge, 12xlarge, and 16xlarge.
The specifications of the instances are as follows:
1 x 1900/2 x 600 NVMe SSD
2 x 1425/600 NVMe SSD
2 x 1900/900 NVMe SSD
Table 2. Instance configurations under test
The benchmarks run on a single node, resembling a Spark cluster with very low network latency. The metric of interest is bytes/s (throughput). The charts in Figure 1 demonstrate price-performance (bytes/$) (bar graph, left axis) of Spark running on m6gd.16xlarge and m5d.16xlarge, which are two instances with similar resources.
The instances were deployed on us-east-1 and the price-performance is calculated with on-demand prices. For the workloads selected, the price-performance benefit of the M6gd is up to 37% for aggregation and 58% for K-means gigantic datasets (line graph, right axis).
Figure 1. SQL/Aggregation price-performance comparison, 16xlarge instances
Figure 2. ML/K-means price-performance comparison, 16xlarge instances
Tables 3 and 4 are the price-performance numbers and percent improvements of the M6gd instances over the M5d ones for aggregation and K-means algorithms (data from figures 1 & 2).
Table 3. Sql/aggregation bytes/$ on large to gigantic datasets for 16xlarge instances
Table 4. ml/kmeans bytes/$ on large to gigantic datasets for 16xlarge instances
Spark clusters usually process terabytes of data, spawning many worker nodes on which to run executors. The price-performance calculations (tables 3 & 4) show how users can significantly reduce costs by running their cluster on M6gd instances. The throughput benchmarks in the next section show users the price-performance benefit and compute performance they can gain by running jobs on M6gd.
The charts in this section show benchmark results on 8xlarge, 12xlarge and 16xlarge instances. The left axis on each chart is the throughput in bytes/s (bar graph). The right axis is M6gd performance improvements over M5d instances (line graph).
The following charts compare aggregation performance throughput on M6gd vs M5d instances in 8xlarge, 12xlarge and 16xlarge instance sizes. HiBench SQL aggregation benchmarks show up to 29% (8xlarge), 18% (12xlarge), and 9.74% (16xlarge) higher throughput on the gigantic dataset for M6gd over M5d.
Figure 3: SQL/Aggregation throughput comparison, 8xlarge instances
Figure 4. SQL/Aggregation throughput comparison, 12xlarge instances
Figure 5. SQL/Aggregation throughput comparison, 16xlarge instances
Tables 5, 6 and 7 are throughput numbers and M6gd improvements over M5d instances for SQL aggregation benchmark.
Table 5: SQL/Aggregation throughput comparison data, 8xlarge instances
Table 6. SQL/Aggregation throughput comparison data, 12xlarge instances
Table 7. SQL/Aggregation throughput comparison data, 16xlarge instances
The charts in Figures 6, 7 and 8 represent K-means benchmarks on M6gd vs M5d instances. K-means benchmarks show up to 21.6% (8xlarge instances) higher throughput on the huge dataset. And 23.6% (12xlarge instances) and 26.88% (16xlarge instances) higher throughput on the gigantic dataset.
Figure 6. ML/K-means throughput comparison, 8xlarge instances
Figure 7. ML/K-means throughput comparison, 12xlarge instances
Figure 8. ML/K-means throughput comparison, 16xlarge instances
Tables 8, 9 and 10 are throughput numbers and M6gd improvements over M5d instances for K-means benchmarks.
Table 8: ML/K-means throughput comparison data, 8xlarge instances
Table 9. ML/K-means throughput comparison data, 12xlarge instances
Table 10. ML/K-means throughput comparison data, 16xlarge instances
We also run the benchmarks on instances with general-purpose SSD volumes (gp2). When the job has small I/O size and infrequent disk access, our experiments show a negligible difference between running Spark jobs on gp2 compared to using NVMe-based volumes. This is the case for the Aggregation and K-means benchmarks. However, users will experience poor performance on general-purpose SSD volumes if the Spark job requires lots of disk access (for example, WordCount in micro benchmarks). This poor performance is due to an IOPS/throughout limit imposed on the storage. Therefore, selecting the right volume type for the job can benefit users by lowering their costs without affecting performance.
Running Spark SQL Aggregation and Spark ML K-means workloads shows that M6gd (Graviton2) instances outperform equivalent M5d (x86) instances by up to 26%. M6gd also offers up to a 58% price-performance advantage over M5d on gigantic datasets.
Visit the AWS Graviton page for customer stories on adoption of Arm-based processors. For any queries related to your software workloads running on Arm Neoverse platforms, feel free to reach out to us at email@example.com.
More Neoverse Workload Blogs