Automated machine learning (AutoML) is the process of automating the application of machine learning to real-world problems. A typical machine learning (ML) pipeline involves multiple steps, including data pre-processing, feature engineering, model selection, and hyperparameter tuning. Each of these steps creates its own challenges, introducing significant hurdles to the application of machine learning in production environments. AutoML methods aim to automate parts of the ML pipeline to enable widespread adoption of ML, especially by non-expert users. For example, applications such as fault detection and quality monitoring in manufacturing systems, or fraud detection in financial transactions, require models to be trained and tuned periodically as new equipment is commissioned or consumer behavior evolves.
Figure 1: A week in the life of a data scientist: without AutoML, every change in data or features requires many manual steps for training and retraining ML models. "Piled Higher and Deeper" by Jorge Cham, PhD Comics.
In this post, we introduce Mango, an open-source Python library for hyperparameter optimization that is built for AutoML systems. Mango is developed by Arm Research in collaboration with Sandeep Singh Sandha, currently pursuing a PhD in the Computer Science Department at UCLA, and is deployed in production at Arm Research as part of an in-house AutoML system. The primary contribution of Mango is the ability to parallelize hyperparameter optimization on a distributed cluster using an intelligent parallel Bayesian search algorithm, while maintaining the flexibility to use any scheduling framework. It has many other useful features as well.
We go through background on hyperparameter tuning and Bayesian optimization to motivate the technical problem, followed by details on Mango and how it can be used to parallelize hyperparameter tuning with Celery. But first, here is a simple example to show how Mango works:
Install Mango (requires Python 3):
$ pip3 install git+https://github.com/ARM-software/mango.git
The following snippet shows how to tune the hyperparameters of the KNN (k-nearest neighbors) algorithm using scikit-learn's cross-validation score on the breast cancer dataset:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

from mango import Tuner, scheduler

# search space for KNN classifier's hyperparameters
# n_neighbors can vary between 1 and 50, with different choices of algorithm
param_space = dict(n_neighbors=range(1, 50),
                   algorithm=['auto', 'ball_tree', 'kd_tree', 'brute'])

# define the objective function to be maximized
# here we use cross validation score and maximize accuracy
@scheduler.serial
def objective(**params):
    X, y = datasets.load_breast_cancer(return_X_y=True)
    clf = KNeighborsClassifier(**params)
    score = cross_val_score(clf, X, y, scoring='accuracy').mean()
    return score

# initialize and run the Mango Tuner (by default it uses 20 iterations)
tuner = Tuner(param_space, objective)
results = tuner.maximize()

print('best parameters:', results['best_params'])
print('best accuracy:', results['best_objective'])
# => best parameters: {'algorithm': 'auto', 'n_neighbors': 11}
# => best accuracy: 0.931486122714193
Later in the blog post we show how the previous example can be run in parallel on a cluster using the Celery task scheduler. This, and many more examples, can be found on Mango's GitHub home.
Machine learning models are characterized by large and complex hyperparameter spaces. For example, a random forest classifier is parameterized by the number of trees and their depth, the number of features per tree, and the minimum samples per leaf node, just to name a few. The performance of machine learning classifiers is very sensitive to the choice of hyperparameters. Tuning hyperparameters is an outer optimization loop on top of ML model training.
Figure 2: Hyperparameter tuning is the outer optimization loop in ML training.
A simple approach to finding the optimal hyperparameters is to train the model for all possible combinations of hyperparameter values and pick the best one. Training ML models is computationally expensive, especially for large datasets, so traditional approaches like grid search are not scalable. This is especially true for AutoML pipelines, which involve tuning and training ML models at large scale without a developer in the loop.
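To make the scalability problem concrete, here is a small illustration (not from the original post) of how quickly a grid grows; the hyperparameter values below are arbitrary and chosen only for demonstration:

from sklearn.model_selection import ParameterGrid

# a modest grid over four random forest hyperparameters
# (values chosen only for illustration)
grid = ParameterGrid({
    'n_estimators': [50, 100, 200, 400],
    'max_depth': [4, 8, 16, 32, None],
    'max_features': ['sqrt', 'log2', 0.5],
    'min_samples_leaf': [1, 2, 4, 8],
})

# 4 * 5 * 3 * 4 = 240 full training runs just for this coarse grid;
# each additional hyperparameter multiplies the count again
print(len(grid))  # => 240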
Figure 3: Accuracy of a Support Vector Machine (SVM) classifier as a function of the hyperparameters C and Gamma. The function is highly non-linear, and grid search takes thousands of evaluations to find the optimum.
There are several existing libraries designed for hyperparameter tuning, including Hyperopt, Auto-sklearn, and Auto-WEKA. These libraries provide limited support for parallel search using distributed scheduling frameworks. For example, the parallel execution in Hyperopt is limited to Mongo processes or Apache Spark. The parallel execution in Auto-sklearn requires users to manually start multiple instances on a shared file system. Auto-WEKA runs multiple sequential algorithms with different random seeds for parallel execution on a single compute node. The engineering effort required to integrate these libraries into existing ML training pipelines would be prohibitive due to their lack of abstraction and inability to support arbitrary distributed task scheduling frameworks. As a result, Mango was developed in-house with a modular design for ease of integration. Before describing the abstractions in Mango, it is helpful to go over some background on Bayesian optimization.
Bayesian optimization has emerged as an efficient method for hyperparameter optimization due to its ability to find optimal parameters with fewer evaluations compared to grid or random search [1]. Bayesian optimization is an efficient way to solve black-box optimization problems, especially when the number of parameters is below 100 and a single evaluation of the optimization objective is a costly operation. Hyperparameter tuning falls squarely within the class of problems that Bayesian optimization is best suited for. The effectiveness of Bayesian optimization is illustrated in Figure 4, using the same SVM example as before.
Figure 4: Accuracy of the SVM classifier as a function of hyperparameters C and Gamma. The Bayesian optimizer implemented in Mango closely approximates the complex objective function surface in 50 evaluations.
Bayesian methods use a surrogate model to approximate the objective function and an acquisition function to guide the search. The general algorithm can be summarized as follows:
1. Fit the surrogate model to all evaluations of the objective function made so far.
2. Maximize the acquisition function over the surrogate to choose the next point (or points) to evaluate.
3. Evaluate the true objective function at the chosen point(s).
4. Repeat from step 1 until the evaluation budget is exhausted.
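As an illustration, here is a minimal, self-contained sketch of this loop on a toy one-dimensional objective, using scikit-learn's GaussianProcessRegressor as the surrogate and a simple upper-confidence-bound rule as the acquisition function. The objective, search range, and constants are hypothetical and chosen only for demonstration; this is not Mango's internal implementation:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def toy_objective(x):
    # stands in for an expensive model-training-and-scoring run
    return -(x - 2.0) ** 2 + 3.0

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(3, 1))           # a few random initial evaluations
y = np.array([toy_objective(x[0]) for x in X])
candidates = np.linspace(0.0, 5.0, 200).reshape(-1, 1)

for _ in range(10):
    # 1. fit the surrogate to all evaluations so far
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    # 2. maximize the acquisition function (here a simple UCB) over the candidates
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(mu + 2.0 * sigma)]
    # 3. evaluate the true objective at the suggested point
    y_next = toy_objective(x_next[0])
    # 4. add the new observation and repeat
    X = np.vstack([X, x_next.reshape(1, -1)])
    y = np.append(y, y_next)

print('best x found:', X[np.argmax(y), 0], 'best value:', y.max())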
A popular choice for the surrogate model, also used in Mango, is the Gaussian Process (GP). A GP can be thought of as an infinite-dimensional multivariate Gaussian distribution. It is specified by a mean function and a covariance function:

f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \quad m(x) = \mathbb{E}[f(x)], \quad k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big]
We can incorporate prior beliefs about the objective function by specifying an appropriate mean and covariance function. This helps speed up the search and provides a systematic way of using historical knowledge in future searches.
An intuitive way to think about a GP is that it specifies how correlated the objective function values are, given how close the input parameters are. For smooth functions we expect high correlation, so knowing the objective function value at a point gives us high confidence (lower variance) about the objective function value around that point. Mathematically, the updated (posterior) mean and variance at a new point x_* given evaluations at points x are [2]:

\mu_* = \mu(x_*) + \Sigma(x_*, x)\,\Sigma(x, x)^{-1}\,\big(f(x) - \mu(x)\big)

\Sigma_* = \Sigma(x_*, x_*) - \Sigma(x_*, x)\,\Sigma(x, x)^{-1}\,\Sigma(x, x_*)
Here \Sigma_* is the updated variance at x_* given the values observed at x. As can be seen, if the covariance between x and x_* is high, the updated variance at x_* is lower. As more points of the actual function are evaluated, we incrementally become more confident in our approximation and can use the acquisition function to guide the search.
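As a quick sanity check on these formulas, the following standalone sketch (not from the original post) conditions a GP with a squared-exponential covariance on a few observed points and computes the posterior mean and variance at a nearby test point; all values are illustrative and a zero prior mean is assumed:

import numpy as np

def sq_exp_cov(a, b, length_scale=1.0):
    # squared-exponential (RBF) covariance between two sets of 1-D points
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

# observed points and their objective values (zero prior mean assumed)
x_obs = np.array([0.0, 1.0, 2.5])
f_obs = np.array([0.2, 0.9, 0.1])

# test point close to an observed point, so its covariance with x_obs is high
x_star = np.array([1.1])

K = sq_exp_cov(x_obs, x_obs) + 1e-9 * np.eye(len(x_obs))  # Sigma(x, x)
K_s = sq_exp_cov(x_star, x_obs)                            # Sigma(x*, x)
K_ss = sq_exp_cov(x_star, x_star)                          # Sigma(x*, x*)

K_inv = np.linalg.inv(K)
mu_star = K_s @ K_inv @ f_obs                # posterior mean at x*
sigma_star = K_ss - K_s @ K_inv @ K_s.T      # posterior variance at x*

print('posterior mean:', mu_star)        # close to the value observed at x = 1.0
print('posterior variance:', sigma_star)  # much smaller than the prior variance of 1.0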
Acquisition functions are used to determine the next point to sample, given the surrogate function updated with the evaluations so far. A simple acquisition function is to sample the point where the mean of the surrogate function is maximized:

x_{t+1} = \arg\max_x \mu_t(x)
However, this approach is likely to get stuck in a local optimum early in the search. Acquisition functions need to balance exploration vs. exploitation, exploring unknown regions while trying to maximize the objective function. The Upper Confidence Bound (UCB) acquisition function proposed by Srinivas et al. [3] provides a way to control this balance through an exploration factor \beta:

x_{t+1} = \arg\max_x \big( \mu_t(x) + \sqrt{\beta_t}\,\sigma_t(x) \big)
Initially the exploration factor \beta is set to a high value so that we explore and build a good approximation of the objective function. As the number of iterations increases, we reduce the exploration factor and start favoring points with high mean values.
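The short sketch below (illustrative only, with made-up posterior values) shows how the UCB choice shifts from a high-variance candidate to a high-mean candidate as \beta is reduced:

import numpy as np

# hypothetical surrogate posterior over five candidate points
mu = np.array([0.60, 0.85, 0.40, 0.30, 0.55])     # posterior means
sigma = np.array([0.05, 0.02, 0.30, 0.45, 0.10])  # posterior standard deviations

for beta in (4.0, 1.0, 0.05):
    ucb = mu + np.sqrt(beta) * sigma
    print(f'beta={beta}: pick candidate {np.argmax(ucb)} (ucb={ucb.max():.2f})')

# with beta=4.0 the most uncertain candidate (index 3) wins (exploration);
# with beta=0.05 the highest-mean candidate (index 1) wins (exploitation)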
Figure 5 gives a visual representation of steps in Bayesian optimization for a simple function using GP as the surrogate model and UCB as the acquisition function.
Figure 5: Simple example showing how Bayesian optimization with a Gaussian process (GP) surrogate and upper confidence bound (UCB) acquisition function explores and exploits the search space. Initially (iteration 1) the GP has high variance and the acquisition function explores by sampling points with high variance. As more points are evaluated, the variance of the GP reduces, it gets closer to the actual objective function, and the acquisition function samples points with high objective values.
The traditional Bayesian optimization algorithm is sequential; that is, it provides a single next point to evaluate. This presents a challenge if we want to utilize distributed computing resources and horizontally scale the optimization process. To address this challenge, we need an acquisition function that can suggest multiple points. This enables us to evaluate in parallel such that the information gain is maximized and we are not wasting resources acquiring redundant information.
In Mango, we have implemented two such parallel-search acquisition functions. One of them, called batch Gaussian process bandits [4], is based on a combination of hallucination and UCB. The other acquisition function was developed in-house; it utilizes clustering to find multiple peaks in the UCB acquisition function, as shown in Figure 6. In our experiments, we have observed that the clustering approach performs better, although it is computationally more expensive.
Figure 6: Visualizing the clustering approach for parallel search in Bayesian optimization for the same example shown in Figure 5. As seen, the parallel search evaluates the optimal region in the second iteration. Upper confidence bound (UCB) is used as the acquisition function, exploring three regions in parallel. The parallel areas are based on the clustering approach, where we construct three clusters in promising regions of the search space. In every iteration, three parallel evaluations are done.
The clustering parallel search approach internally uses the K-Means algorithm. Our goal is to create clusters that have high acquisition function values and are far from each other in the parameter search space. The clustering parallel search has the following two steps:
1. Sample the acquisition function (UCB) over the search space and retain the sampled points with high acquisition values.
2. Cluster the retained points using K-Means and, from each cluster, propose the point with the highest acquisition value for parallel evaluation.
Figure 6 gives a visual representation of the clustering approach, where three clusters are created in parallel. As seen, Mango creates clusters in parameter space regions that have high acquisition function values while also maintaining distance between them in the parameter space. Note that the distance between clusters in the parameter space decreases as the optimal regions are discovered and the clusters concentrate near the true optimum of the function.
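To make the idea concrete, here is a small standalone sketch of batch selection via clustering. It is a simplified illustration under our own assumptions, not Mango's exact implementation: sample the UCB acquisition over candidates, keep the highest-scoring candidates, cluster them with K-Means, and return the best candidate from each cluster.

import numpy as np
from sklearn.cluster import KMeans

def select_batch(candidates, mu, sigma, beta=2.0, batch_size=3, keep_frac=0.2):
    """Pick `batch_size` well-separated, high-acquisition points."""
    ucb = mu + np.sqrt(beta) * sigma
    # keep only the top fraction of candidates by acquisition value
    n_keep = max(batch_size, int(keep_frac * len(candidates)))
    top = np.argsort(ucb)[-n_keep:]
    # cluster the promising candidates in parameter space
    labels = KMeans(n_clusters=batch_size, n_init=10).fit_predict(candidates[top])
    # from each cluster, return the candidate with the highest acquisition value
    batch = []
    for c in range(batch_size):
        members = top[labels == c]
        batch.append(candidates[members[np.argmax(ucb[members])]])
    return np.array(batch)

# toy usage with made-up posterior values over 1-D candidates
candidates = np.linspace(0.0, 10.0, 500).reshape(-1, 1)
mu = np.sin(candidates[:, 0])
sigma = 0.3 + 0.2 * np.cos(candidates[:, 0] / 2.0) ** 2
print(select_batch(candidates, mu, sigma))  # three points from distinct promising regions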
We now go through an example of tuning a k-nearest neighbors (KNN) classifier using Celery as the task scheduler. Install the Mango library (requires Python 3):

$ pip3 install git+https://github.com/ARM-software/mango.git
Next, define a Celery app and an objective function that trains the KNN classifier for a given set of hyperparameters. Here we use scikit-learn's breast cancer dataset and the cross-validation score as the objective.
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

import celery

# celery app and task
app = celery.Celery('knn_celery', backend='rpc://')

@app.task
def remote_objective(**params):
    X, y = datasets.load_breast_cancer(return_X_y=True)
    clf = KNeighborsClassifier(**params)
    score = cross_val_score(clf, X, y, scoring='accuracy').mean()
    return score
Mango is designed to allow the use of any distributed computing framework (like Celery or Kubernetes). The objective function can be scheduled to run locally or on a cluster with parallel evaluations. Users can define their own distribution strategies using a custom scheduler. To do so, users need to define an objective function that takes a list of parameters and returns a list of results. To use Celery as the scheduler, we define the objective as follows:
from mango import Tuner, scheduler

@scheduler.custom(n_jobs=2)
def objective(params_batch):
    jobs = celery.group(remote_objective.s(**params) for params in params_batch)()
    return jobs.get()
params_batch is a list of hyperparameters selected by the search algorithm to evaluate in parallel.
The parameter search space is defined as a dictionary with parameter names as keys and search spaces as values. Mango supports all scipy.stats distributions as well as categorical, discrete, and continuous variables. For KNN, the search space consists of an integer parameter and a categorical parameter:
# search space for KNN classifier's hyperparameters
# n_neighbors can vary between 1 and 50, with different choices of algorithm
param_space = dict(n_neighbors=range(1, 50),
                   algorithm=['auto', 'ball_tree', 'kd_tree', 'brute'])
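For a continuous hyperparameter, a scipy.stats distribution can be used as the value. As an illustration (a hypothetical SVM search space, not part of the original example), a space mixing a continuous and a categorical parameter could look like this:

from scipy.stats import uniform

# hypothetical search space for an SVM classifier:
# C is drawn from a continuous uniform distribution over [0.01, 100.01],
# kernel is a categorical choice
svm_param_space = dict(C=uniform(0.01, 100),
                       kernel=['rbf', 'linear', 'poly'])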
The Tuner module takes the param_space, the objective, and the maximum number of iterations, and maximizes the objective:
tuner = Tuner(param_space, objective, {'num_iteration': 30})
results = tuner.maximize()

print('best parameters:', results['best_params'])
print('best accuracy:', results['best_objective'])
assert results['best_objective'] > 0.93
The complete working example is also given here.
In summary, Mango provides a simple-to-use interface for hyperparameter tuning of machine learning models using a state-of-the-art parallel Bayesian optimization algorithm. It is designed to be used with any distributed scheduling framework, such as Celery or Kubernetes jobs. These features make Mango ideal for use in production systems that need to scale horizontally and efficiently utilize distributed computing resources. In addition to hyperparameter tuning of ML models, Mango can also be used for black-box optimization of computationally expensive functions. We continue to make algorithmic and design improvements to Mango. An interesting future direction is to combine hyperparameter optimization with model selection, treating the choice of ML model as one of the parameters, and to generate an optimal ensemble.
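As an aside on the black-box use case: since the objective passed to the Tuner is just a Python function, any expensive function can be optimized in the same way. A minimal sketch, using a toy quadratic in place of a genuinely expensive function and following the same API pattern as the earlier examples:

from scipy.stats import uniform
from mango import Tuner, scheduler

# two continuous inputs; uniform(loc, scale) covers [-10, 10]
param_space = dict(x=uniform(-10, 20), y=uniform(-10, 20))

@scheduler.serial
def objective(**params):
    # toy stand-in for an expensive black-box function; maximum at x=3, y=-1
    x, y = params['x'], params['y']
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)

tuner = Tuner(param_space, objective, {'num_iteration': 40})
results = tuner.maximize()
print(results['best_params'], results['best_objective'])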
For more examples and detailed documentation, please visit Mango's GitHub page. More technical details are available in our ICASSP 2020 conference paper; please get in touch should you have any questions.
Contact: Mohit Aggarwal. Read the full paper.