Data volumes are growing at an unprecedented rate, and with them the potential for uncovering insights. However, the computation required at this scale can pose significant performance challenges.
Apache Spark, a leading open-source data processing engine, is used for batch processing, machine learning, stream processing, and large-scale SQL. It is designed to make big data processing quicker and easier.
Apache Spark SQL offers a reliable solution for users to efficiently process large datasets. As Apache Spark continues to evolve, Spark SQL and the newer abstractions of Datasets and DataFrames provide a more expressive and powerful coding approach compared to Spark's lower-level RDDs. Additionally, Apache Spark's Catalyst engine automatically optimizes Spark SQL code to enhance execution speed.
Though Spark is a well-established project with many years of development and one of the top frameworks for handling petabyte-scale datasets, its row-based data processing and JVM constraints leave performance on the table, and some companies saw an opportunity to evolve Spark into a vectorized SQL engine. To that end, Project Gluten was unveiled as an Apache-Arrow-based native SQL engine designed to enhance Spark SQL. Meanwhile, several vectorized SQL engines with active open-source communities have emerged. Among these, the Meta-led Velox project is a rising star, offering a vectorized database acceleration library.
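To see why a vectorized engine can outperform row-at-a-time processing, consider a simple filter-and-sum over a small table. The Python sketch below contrasts the two layouts; plain lists stand in for Arrow column buffers, and all names are illustrative rather than part of Gluten or Velox:

```python
# Two layouts for the same table of (price, qty) records.
rows = [(10.0, 2), (5.0, 0), (7.5, 4)]                # row-based: one tuple per record
cols = {"price": [10.0, 5.0, 7.5], "qty": [2, 0, 4]}  # columnar: one array per field

def revenue_row_based(rows):
    # Row engine: interpret each record individually, branching per row.
    total = 0.0
    for price, qty in rows:
        if qty > 0:
            total += price * qty
    return total

def revenue_columnar(cols):
    # Vectorized engine: operate on whole columns at once. Tight loops over
    # contiguous arrays are what SIMD units (e.g. Neon on Arm64) accelerate.
    products = [p * q for p, q in zip(cols["price"], cols["qty"])]
    mask = [q > 0 for q in cols["qty"]]
    return sum(v for v, keep in zip(products, mask) if keep)

assert revenue_row_based(rows) == revenue_columnar(cols) == 50.0
```

Both functions compute the same answer; the columnar version simply expresses the work as bulk operations over arrays, which is the form a native vectorized library can execute far more efficiently than a row-at-a-time JVM interpreter.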
Gluten, derived from Latin for glue, serves as a bridge between Apache Spark and vectorized SQL engines or libraries (figure 1). By integrating a vectorized SQL engine or shared library with Spark, numerous optimization opportunities arise. These include offloading functions and operators to a vectorized library, introducing just-in-time compilation engines, and enabling the use of hardware accelerators such as GPUs and FPGAs. Project Gluten enables users to leverage these software and hardware advancements within Spark.
Figure 1: Gluten Architecture
Project Gluten leverages many aspects of the Spark framework, such as resource scheduling and the Catalyst logical plan optimizer. Furthermore, Gluten introduces new components to manage APIs between JVM and native libraries.
Gluten utilizes the Spark JVM engine to determine whether each operator is supported by the native library. If it is not, Gluten falls back to the existing Spark-JVM-based operator. This fallback incurs the cost of converting data between columnar and row formats.
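The per-operator fallback decision can be sketched as follows. The operator names and the supported set here are illustrative only; in practice the check is made against the capabilities of the actual native backend:

```python
# Hypothetical set of physical operators the native backend can execute.
NATIVE_SUPPORTED = {"FilterExec", "ProjectExec", "HashAggregateExec"}

def place_operator(op_name):
    """Decide where an operator runs, mirroring Gluten's per-operator check."""
    if op_name in NATIVE_SUPPORTED:
        return ("native", op_name)
    # Fallback: run the existing JVM operator. At this boundary the data
    # must be converted between columnar and row formats, which has a cost.
    return ("jvm-fallback", op_name)

physical_plan = ["FilterExec", "SortExec", "ProjectExec"]
placed = [place_operator(op) for op in physical_plan]
# SortExec is not in the (hypothetical) supported set, so it falls back.
assert placed[1] == ("jvm-fallback", "SortExec")
```

The practical implication is that query speedup depends on how much of the plan stays on the native side: every fallback boundary adds a columnar/row conversion.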
Gluten utilizes Substrait to construct a query plan tree: it translates Spark's physical plan into a Substrait plan tailored for each backend, then passes this plan through JNI to initiate the execution pipeline within the native library.
Substrait is a format for describing compute operations on structured data, designed to ensure interoperability across various languages and systems. Its goal is to establish a clear, cross-language specification for data compute operations. For more information, visit https://substrait.io.
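Conceptually, a Substrait plan is a language-neutral tree of relational operations that a producer (here, the Spark JVM) serializes and a consumer (the native library) executes. The sketch below models that idea with a plain Python dictionary serialized to JSON; real Substrait plans are protobuf messages with a much richer schema, and the field names here are illustrative only:

```python
import json

# A toy plan tree for: SELECT name FROM users WHERE age > 30
# (loosely inspired by Substrait's relation tree; NOT the real schema).
plan = {
    "relations": [{
        "project": {
            "expressions": [{"field": "name"}],
            "input": {
                "filter": {
                    "condition": {"gt": [{"field": "age"}, {"literal": 30}]},
                    "input": {"read": {"table": "users"}},
                },
            },
        },
    }],
}

# Serializing the tree is what lets a JVM producer hand the plan across
# JNI to a native consumer: both sides only need to agree on the format.
wire = json.dumps(plan)
decoded = json.loads(wire)
assert decoded == plan
```

The key design point is that neither side needs to know the other's internal plan representation; the serialized plan is the whole contract.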
Gluten leverages Spark’s existing memory management system. Whenever native memory must be allocated or deallocated, Gluten calls Spark’s JVM memory registration API. Spark manages memory per task thread: if a thread requires more memory than it has currently been allocated, it can invoke the spill interface, provided the operators in use support spilling. Spark’s memory management system also includes mechanisms to protect against memory leaks and out-of-memory issues, keeping performance stable under heavy workloads.
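A simplified model of that contract, per task thread, is sketched below. The class and method names are hypothetical and do not correspond to Spark's actual TaskMemoryManager API; the sketch only shows the grant-or-spill shape of the interaction:

```python
class TaskMemoryPool:
    """Toy per-task pool: grant what fits, ask consumers to spill otherwise."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def acquire(self, nbytes, consumer):
        if self.used + nbytes > self.limit:
            # Over budget: ask the operator to spill to disk and release
            # memory, mirroring how Spark invokes the spill interface.
            self.used -= consumer.spill()
        if self.used + nbytes > self.limit:
            raise MemoryError("out of memory even after spilling")
        self.used += nbytes

class SortConsumer:
    """Toy spillable operator that buffers bytes until asked to spill."""
    def __init__(self):
        self.buffered = 0
    def spill(self):
        freed, self.buffered = self.buffered, 0
        return freed

pool = TaskMemoryPool(limit_bytes=100)
sorter = SortConsumer()
pool.acquire(80, sorter); sorter.buffered = 80
pool.acquire(50, sorter)  # forces a spill of the 80 buffered bytes first
assert pool.used == 50
```

Because native allocations are registered through the same accounting, a native operator's memory pressure can trigger spilling just as a JVM operator's would.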
Figure 2: Gluten Memory Management
Gluten utilizes its predecessor Gazelle’s Apache Arrow-based Columnar Shuffle Manager as the default shuffle manager. A third-party library manages the data transformation from native to Arrow. Alternatively, developers can create their own shuffle manager.
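In miniature, a columnar shuffle writer splits each incoming column batch into per-partition batches by a key column, without ever materializing rows. Plain Python lists stand in for Arrow buffers here, and the helper names are illustrative:

```python
def split_batch(batch, key_column, num_partitions):
    """Split a columnar batch into per-partition columnar batches by key hash."""
    parts = [{name: [] for name in batch} for _ in range(num_partitions)]
    for i, key in enumerate(batch[key_column]):
        p = hash(key) % num_partitions
        # Copy position i of every column into the target partition's columns,
        # so the data stays columnar on both sides of the shuffle.
        for name, values in batch.items():
            parts[p][name].append(values[i])
    return parts

batch = {"id": [1, 2, 3, 4], "amount": [9.0, 4.5, 1.0, 2.5]}
parts = split_batch(batch, key_column="id", num_partitions=2)
# Every row lands in exactly one partition; no row-format detour is needed.
assert sum(len(p["id"]) for p in parts) == 4
```

Keeping the shuffle path columnar is what avoids the columnar-to-row conversion cost that a row-based shuffle manager would otherwise impose on a native backend.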
To seamlessly integrate with Spark, Gluten includes a Shim Layer that supports multiple versions of Spark. Gluten is compatible with Spark versions 3.x and newer.
Gluten supports Spark’s Metrics functionality. While Spark's default metrics focus on Java row-based data processing, Project Gluten extends this with a column-based API and additional metrics. This enhancement aids in the use of Gluten and provides developers with tools to debug these native libraries.
Velox is a unified execution engine developed and open-sourced by Meta to enhance data management systems and simplify their development. One of Velox's primary advantages is its ability to consolidate and unify data management systems, eliminating the need for repeated engine rewrites. It is a C++ database acceleration library which provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML.
Velox is being integrated into Spark as a native backend for Gluten. Gluten enables C++ execution engines such as Velox to run Spark SQL queries within the Spark environment (figure 3). It decouples the Spark JVM from the execution engine through a JNI API based on the Apache Arrow data format and Substrait query plans, which is what allows Velox to function within Spark.
Figure 3: Gluten-Velox backend
For Gluten, we supported the third-party dependency libhdfs3 on Arm64 (Gluten-Velox Github PR: 1, 2). We also added Arm64 CPU detection (Neoverse N1/N2/V1) and adjusted the corresponding compile flags. Additionally, we fixed an "illegal instruction (core dumped)" issue on earlier Arm platforms in Velox (Gluten-Velox Github PR: 3, 4).
After adding TPC-DS benchmarking scripts (Gluten-Velox Github PR: 5), we evaluated the performance of the Spark Java engine and the Gluten + Velox backend using the TPC-DS benchmark suite on Arm64 Neoverse N2.
The following chart, figure 4, shows the performance ratio of the Gluten + Velox backend versus the Spark Java engine on Neoverse N2, using over 60 queries of a TPC-DS-like workload. The Spark configuration used for the Gluten plugin is shown below.
#gluten plugin
spark.gluten.enabled true
spark.plugins io.glutenproject.GlutenPlugin
spark.gluten.sql.columnar.backend.lib velox
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.executorEnv.VELOX_HDFS hdfs://master:9000
spark.gluten.loadLibFromJar true
spark.executor.extraClassPath /usr/local/spark/gluten-current/*
spark.driver.extraClassPath /usr/local/spark/gluten-current/*
Figure 4: Gluten and Velox backend vs Spark Java engine on Arm Neoverse N2
Gluten-Velox marks a major advancement in offering an effective Spark SQL accelerator on the Arm platform. The techniques it implements allow Spark SQL to move beyond its row-based data processing and JVM constraints on Arm64 by combining Arm64 SIMD capabilities with vectorized execution. Experimental results highlight the potential of this approach for Spark SQL on Arm Neoverse N2.