Data volumes are growing at an unprecedented rate, and with them the potential for uncovering insights. However, the computation required at this scale can pose significant performance challenges.
Apache Spark, a leading open-source data processing engine, is used for batch processing, machine learning, stream processing, and large-scale SQL. It is designed to make big data processing quicker and easier.
Apache Spark SQL offers a reliable solution for users to efficiently process large datasets. As Apache Spark continues to evolve, Spark SQL and the newer abstractions of Datasets and DataFrames provide a more expressive and powerful coding approach compared to Spark's lower-level RDDs. Additionally, Apache Spark's Catalyst engine automatically optimizes Spark SQL code to enhance execution speed.
Though Spark is a well-established project with many years of development and one of the top frameworks for handling petabyte-scale datasets, its row-based data processing and JVM constraints leave performance on the table, and some companies saw an opportunity to evolve Spark into a vectorized SQL engine. To that end, Project Gluten was unveiled as an Apache-Arrow-based native SQL engine designed to enhance Spark SQL. Meanwhile, several vectorized SQL engines with active open-source communities have emerged. Among these, the Meta-led Velox project is a rising star, offering a vectorized database acceleration library.
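To see why a vectorized engine can outperform row-at-a-time processing, consider a simple filter-and-sum over a small table. The Python sketch below contrasts the two layouts; plain lists stand in for Arrow column buffers, and all names are illustrative rather than part of Gluten or Velox:

```python
# Two layouts for the same table of (price, qty) records.
rows = [(10.0, 2), (5.0, 0), (7.5, 4)]                # row-based: one tuple per record
cols = {"price": [10.0, 5.0, 7.5], "qty": [2, 0, 4]}  # columnar: one array per field

def revenue_row_based(rows):
    # Row engine: interpret each record individually, branching per row.
    total = 0.0
    for price, qty in rows:
        if qty > 0:
            total += price * qty
    return total

def revenue_columnar(cols):
    # Vectorized engine: operate on whole columns at once. Tight loops over
    # contiguous arrays are what SIMD units (e.g. Neon on Arm64) accelerate.
    products = [p * q for p, q in zip(cols["price"], cols["qty"])]
    mask = [q > 0 for q in cols["qty"]]
    return sum(v for v, keep in zip(products, mask) if keep)

assert revenue_row_based(rows) == revenue_columnar(cols) == 50.0
```

Both functions compute the same answer; the columnar version simply expresses the work as bulk operations over arrays, which is the form a native vectorized library can execute far more efficiently than a row-at-a-time JVM interpreter.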
Gluten, derived from Latin for glue, serves as a bridge between Apache Spark and vectorized SQL engines or libraries (figure 1). By integrating a vectorized SQL engine or shared library with Spark, numerous optimization opportunities arise. These include offloading functions and operators to a vectorized library, introducing just-in-time compilation engines, and enabling the use of hardware accelerators such as GPUs and FPGAs. Project Gluten enables users to leverage these software and hardware advancements within Spark.
Figure 1: Gluten Architecture
Project Gluten leverages many aspects of the Spark framework, such as resource scheduling and the Catalyst logical plan optimizer. Furthermore, Gluten introduces new components to manage APIs between JVM and native libraries.
Gluten utilizes the Spark JVM engine to determine whether each operator is supported by the native library. If it is not, Gluten falls back to the existing Spark-JVM-based operator. This fallback incurs the cost of converting data between columnar and row formats.
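The per-operator fallback decision can be sketched as follows. The operator names and the supported set here are illustrative only; in practice the check is made against the capabilities of the actual native backend:

```python
# Hypothetical set of physical operators the native backend can execute.
NATIVE_SUPPORTED = {"FilterExec", "ProjectExec", "HashAggregateExec"}

def place_operator(op_name):
    """Decide where an operator runs, mirroring Gluten's per-operator check."""
    if op_name in NATIVE_SUPPORTED:
        return ("native", op_name)
    # Fallback: run the existing JVM operator. At this boundary the data
    # must be converted between columnar and row formats, which has a cost.
    return ("jvm-fallback", op_name)

physical_plan = ["FilterExec", "SortExec", "ProjectExec"]
placed = [place_operator(op) for op in physical_plan]
# SortExec is not in the (hypothetical) supported set, so it falls back.
assert placed[1] == ("jvm-fallback", "SortExec")
```

The practical implication is that query speedup depends on how much of the plan stays on the native side: every fallback boundary adds a columnar/row conversion.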
Gluten utilizes Substrait to construct a query plan tree: it translates Spark's physical plan into a Substrait plan tailored for each backend, then passes this plan through JNI to initiate the execution pipeline within the native library.
Substrait is a format for describing compute operations on structured data, designed to ensure interoperability across various languages and systems. Its goal is to establish a clear, cross-language specification for data compute operations. For more information, visit https://substrait.io.
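Conceptually, a Substrait plan is a language-neutral tree of relational operations that a producer (here, the Spark JVM) serializes and a consumer (the native library) executes. The sketch below models that idea with a plain Python dictionary serialized to JSON; real Substrait plans are protobuf messages with a much richer schema, and the field names here are illustrative only:

```python
import json

# A toy plan tree for: SELECT name FROM users WHERE age > 30
# (loosely inspired by Substrait's relation tree; NOT the real schema).
plan = {
    "relations": [{
        "project": {
            "expressions": [{"field": "name"}],
            "input": {
                "filter": {
                    "condition": {"gt": [{"field": "age"}, {"literal": 30}]},
                    "input": {"read": {"table": "users"}},
                },
            },
        },
    }],
}

# Serializing the tree is what lets a JVM producer hand the plan across
# JNI to a native consumer: both sides only need to agree on the format.
wire = json.dumps(plan)
decoded = json.loads(wire)
assert decoded == plan
```

The key design point is that neither side needs to know the other's internal plan representation; the serialized plan is the whole contract.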
Gluten leverages Spark’s existing memory management system. Whenever native memory must be allocated or deallocated, Gluten calls Spark’s JVM memory registration API. Spark manages memory per task thread: if a thread requires more memory than it has currently been allocated, it can invoke the spill interface, provided the operators in use support spilling. Spark’s memory management system also includes mechanisms to protect against memory leaks and out-of-memory issues, keeping performance stable under heavy workloads.
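A simplified model of that contract, per task thread, is sketched below. The class and method names are hypothetical and do not correspond to Spark's actual TaskMemoryManager API; the sketch only shows the grant-or-spill shape of the interaction:

```python
class TaskMemoryPool:
    """Toy per-task pool: grant what fits, ask consumers to spill otherwise."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def acquire(self, nbytes, consumer):
        if self.used + nbytes > self.limit:
            # Over budget: ask the operator to spill to disk and release
            # memory, mirroring how Spark invokes the spill interface.
            self.used -= consumer.spill()
        if self.used + nbytes > self.limit:
            raise MemoryError("out of memory even after spilling")
        self.used += nbytes

class SortConsumer:
    """Toy spillable operator that buffers bytes until asked to spill."""
    def __init__(self):
        self.buffered = 0
    def spill(self):
        freed, self.buffered = self.buffered, 0
        return freed

pool = TaskMemoryPool(limit_bytes=100)
sorter = SortConsumer()
pool.acquire(80, sorter); sorter.buffered = 80
pool.acquire(50, sorter)  # forces a spill of the 80 buffered bytes first
assert pool.used == 50
```

Because native allocations are registered through the same accounting, a native operator's memory pressure can trigger spilling just as a JVM operator's would.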
Figure 2: Gluten Memory Management
Gluten utilizes its predecessor Gazelle’s Apache Arrow-based Columnar Shuffle Manager as the default shuffle manager. A third-party library manages the data transformation from native to Arrow. Alternatively, developers can create their own shuffle manager.
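In miniature, a columnar shuffle writer splits each incoming column batch into per-partition batches by a key column, without ever materializing rows. Plain Python lists stand in for Arrow buffers here, and the helper names are illustrative:

```python
def split_batch(batch, key_column, num_partitions):
    """Split a columnar batch into per-partition columnar batches by key hash."""
    parts = [{name: [] for name in batch} for _ in range(num_partitions)]
    for i, key in enumerate(batch[key_column]):
        p = hash(key) % num_partitions
        # Copy position i of every column into the target partition's columns,
        # so the data stays columnar on both sides of the shuffle.
        for name, values in batch.items():
            parts[p][name].append(values[i])
    return parts

batch = {"id": [1, 2, 3, 4], "amount": [9.0, 4.5, 1.0, 2.5]}
parts = split_batch(batch, key_column="id", num_partitions=2)
# Every row lands in exactly one partition; no row-format detour is needed.
assert sum(len(p["id"]) for p in parts) == 4
```

Keeping the shuffle path columnar is what avoids the columnar-to-row conversion cost that a row-based shuffle manager would otherwise impose on a native backend.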
To seamlessly integrate with Spark, Gluten includes a Shim Layer that supports multiple versions of Spark. Gluten is compatible with Spark versions 3.x and newer.
Gluten supports Spark’s Metrics functionality. While Spark's default metrics focus on Java row-based data processing, Project Gluten extends this with a column-based API and additional metrics. This enhancement aids in the use of Gluten and provides developers with tools to debug these native libraries.
Velox is a unified execution engine developed and open-sourced by Meta to enhance data management systems and simplify their development. One of Velox's primary advantages is its ability to consolidate and unify data management systems, eliminating the need for repeated engine rewrites. It is a C++ database acceleration library which provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML.
Velox is being integrated into Spark as a native backend for Gluten. Gluten enables C++ execution engines such as Velox to run Spark SQL queries within the Spark environment (figure 3). It decouples the Spark JVM from the execution engine through a JNI API based on the Apache Arrow data format and Substrait query plans, which is what allows Velox to function within Spark.
Figure 3: Gluten-Velox backend
For Gluten, we supported the third-party dependency libhdfs3 on Arm64 (Gluten-Velox Github PR: 1, 2). We also added Arm64 CPU detection (Neoverse N1/N2/V1) and adjusted the corresponding compile flags. Additionally, we fixed an "illegal instruction (core dumped)" issue on earlier Arm platforms in Velox (Gluten-Velox Github PR: 3, 4).
After adding TPC-DS benchmarking scripts (Gluten-Velox Github PR: 5), we evaluated the performance of the Spark Java engine and the Gluten + Velox backend using the TPC-DS benchmark suite on Arm64 Neoverse N2.
The following chart, figure 4, shows the performance ratio of the Gluten + Velox backend versus the Spark Java engine on Neoverse N2, using over 60 queries of a TPC-DS-like workload. The Spark configuration used for the Gluten plugin is shown below.
#gluten plugin
spark.gluten.enabled true
spark.plugins io.glutenproject.GlutenPlugin
spark.gluten.sql.columnar.backend.lib velox
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.executorEnv.VELOX_HDFS hdfs://master:9000
spark.gluten.loadLibFromJar true
spark.executor.extraClassPath /usr/local/spark/gluten-current/*
spark.driver.extraClassPath /usr/local/spark/gluten-current/*
Figure 4: Gluten and Velox backend vs Spark Java engine on Arm Neoverse N2
Gluten-Velox marks a major advancement in offering an effective Spark SQL accelerator on the Arm platform. The techniques it implements allow Spark SQL to move beyond its row-based data processing and JVM constraints on Arm64 by combining Arm64 SIMD capabilities with vectorized execution. Experimental results highlight the potential of this approach for Spark SQL on Arm Neoverse N2.