The proliferation of AI at the edge offers several advantages including decreased latency, enhanced privacy, and cost-efficiency. Arm has been at the forefront of this development, with a focus on delivering advanced AI capabilities at the edge across its Cortex-A and Cortex-M CPUs and Ethos-U NPUs. However, this space continues to expand rapidly, presenting challenges for developers looking to enable easy deployment on billions of edge devices.
One such challenge is developing deep learning models for edge devices: developers need to work within limited resources such as storage, memory, and computing power, while still balancing good model accuracy against run-time metrics such as latency or frame rate. An off-the-shelf model designed for a more powerful platform may run slowly, or not at all, when deployed on a more resource-constrained platform.
The TAO Toolkit is a low-code, open-source tool developed by NVIDIA on top of TensorFlow and PyTorch to abstract away the intricacies of training deep learning models. It has an extensive pre-trained model repository for computer vision applications to facilitate transfer learning, and it also provides turnkey model optimizations in the form of channel pruning and quantization-aware training, helping developers produce much lighter models.
Figure adapted from: https://developer.nvidia.com/tao-toolkit
In this blog, we show how to set up the TAO Toolkit, fine-tune a pre-trained MobileNetV2 model on the Visual Wake Words dataset, and then prune and quantize the model for deployment on the Arm Ethos-U NPU.
If you want to read more about the advantages of using other types of model optimization techniques such as random pruning and clustering on the Arm Ethos-U NPU, please read this blog.
We assume that:
The complete code, which is executable as an interactive Jupyter Notebook, will be available on GitHub.
The setup required is very straightforward and includes the following steps:
1. Install Docker and follow the post-installation steps
2. Install the NVIDIA Container Toolkit
3. Setup an NGC Account and get the NGC API Key
4. On your terminal, log in to the NGC Docker Registry using docker login nvcr.io. This enables TAO to pull the Docker image required for the task at hand.
5. Set up a conda environment to further isolate the Python package dependencies (a sketch of installing the TAO launcher inside this environment follows the download command below).
6. Download the latest TAO package from the NGC Registry to get started with the Jupyter notebooks:
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/5.0.0/zip -O getting_started_v5.0.0.zip
unzip -u getting_started_v5.0.0.zip -d ./getting_started_v5.0.0 && rm -rf getting_started_v5.0.0.zip && cd ./getting_started_v5.0.0
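As referenced in step 5, the TAO launcher itself is installed with pip inside the conda environment. A minimal sketch, assuming the launcher package name documented for TAO 5.0 (nvidia-tao):

pip3 install nvidia-tao
tao --help   # quick check that the launcher is available in the environment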
For more detailed instructions, please refer to the official NVIDIA setup instructions page here.
Once the setup is done, you can download our Jupyter notebook from the Arm ML-Examples repo and place it at the following path within the recently downloaded TAO folder: tao-getting-started_v5.0.0/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/
The TAO Toolkit has a wide range of models available as part of its Model Zoo, which can be easily downloaded and used for a vast number of applications. You can use the NGC CLI in the following way to get a table of the available pre-trained models:
!ngc registry model list nvidia/tao/pretrained_classification:*
We will download the MobileNetV2 model pre-trained on ImageNet and fine-tune it on our own downstream task.
# Pull pretrained model from NGC
!ngc registry model download-version nvidia/tao/pretrained_classification:mobilenet_v2 --dest $LOCAL_EXPERIMENT_DIR/pretrained_mobilenet_v2
Once you have downloaded the pre-trained model, you can fine-tune it on any dataset as long as it is in the following format:
Using these guidelines, we can transfer the MobileNetV2 model to the Visual Wake Words dataset. The Visual Wake Words dataset is derived from the COCO dataset to train models to detect whether a person is present in an image frame, a task that is particularly relevant to IoT devices. It is an image classification problem with two classes: person and no person.
Figure adapted from: https://arxiv.org/pdf/1906.05721.pdf
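The exact directory layout from the original post is not reproduced above, but TAO's image classification tasks expect the images to be grouped into one sub-directory per class for each split. As a rough illustration, the snippet below (the dataset root path and class names are hypothetical) walks such a layout and prints the number of images per class, which is a quick sanity check before launching training:

import os

# Hypothetical dataset root, organized as <split>/<class_name>/<image files>,
# for example data/train/person/... and data/train/no_person/...
data_root = "/workspace/tao-experiments/data/visualwake"

for split in ("train", "val", "test"):
    split_dir = os.path.join(data_root, split)
    if not os.path.isdir(split_dir):
        continue
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        n_images = sum(
            f.lower().endswith((".jpg", ".jpeg", ".png"))
            for f in os.listdir(class_dir)
        )
        print(f"{split}/{class_name}: {n_images} images")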
We train the model with the following command line in TAO:
print("To run this training in data parallelism using multiple GPU's, please uncomment the line below and " " update the --gpus parameter to the number of GPU's you wish to use.") !tao model classification_tf2 train -e $SPECS_DIR/spec.yaml --gpus 2
After training completes, the base model achieves an evaluation accuracy of 90.33 percent.
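The accuracy number above comes from TAO's evaluate sub-task. A minimal sketch, assuming the same experiment spec used for training also defines the evaluation dataset and checkpoint:

!tao model classification_tf2 evaluate -e $SPECS_DIR/spec.yaml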
To deploy the model on the Arm Ethos-U NPU, we need to quantize it to INT8. All the models provided by NVIDIA are encoded using EFF. The NVIDIA Exchange File Format (EFF) was created to facilitate exchange and interoperability between different NVIDIA deep learning frameworks and tools. We will use the decode_eff() function shown below to first decode the model back into TensorFlow format, and then use the following code for post-training quantization (PTQ) to obtain an INT8 tflite model.
import os
import zipfile

import cv2
import numpy as np
import tensorflow as tf

# Archive comes from NVIDIA's EFF tooling; the import path below is an
# assumption based on the nvidia-eff package and may differ between versions.
from eff.core import Archive


def representative_dataset_gen():
    # Build a small calibration set from training images so the converter can
    # pick suitable INT8 quantization ranges.
    root_path = '/home/amodab01/v5.0/mobilenetV2/visualwake/data/train'
    categories = os.listdir(root_path)
    x = []
    img_path = os.path.join(root_path, categories[0])
    images = os.listdir(img_path)
    for j in range(100):
        img = cv2.imread(os.path.join(img_path, images[j]))
        img = cv2.resize(img, (224, 224))
        img = img / 255.0
        img = img.astype(np.float32)
        x.append(img)
    x = np.array(x)
    train_data = tf.data.Dataset.from_tensor_slices(x)
    for i in train_data.batch(1).take(5):
        yield [i]
def decode_eff(eff_model_path, enc_key=None):
    """Decode EFF to saved_model directory.

    Args:
        eff_model_path (str): Path to eff model
        enc_key (str, optional): Encryption key. Defaults to None.

    Returns:
        str: Path to the saved_model
    """
    # Decrypt EFF
    eff_filename = os.path.basename(eff_model_path)
    eff_art = Archive.restore_artifact(
        restore_path=eff_model_path,
        artifact_name=eff_filename,
        passphrase=enc_key)
    zip_path = eff_art.get_handle()
    # Unzip
    saved_model_path = os.path.dirname(zip_path)
    with zipfile.ZipFile(zip_path, "r") as zip_file:
        zip_file.extractall(saved_model_path)
    return saved_model_path
input_model_file = '/home/amodab01/v5.0/mobilenetV2/visualwake/classification_tf2/output/train/mobilenet_v2_bn_070.tlt'
output_model_file = '/home/amodab01/v5.0/mobilenetV2/visualwake/classification_tf2/output/int81/model.tflite'
key = 'tlt'
MB = 1 << 20  # bytes per megabyte, used for reporting the model size

if os.path.isdir(input_model_file):
    print("Model provided is a saved model directory at {}".format(input_model_file))
    saved_model = input_model_file
else:
    saved_model = decode_eff(
        input_model_file,
        enc_key=key
    )

print("Converting the saved model to tflite model.")
converter = tf.lite.TFLiteConverter.from_saved_model(
    saved_model,
    signature_keys=["serving_default"],
)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8,  # enable TensorFlow Lite INT8 ops.
    tf.lite.OpsSet.SELECT_TF_OPS  # enable TensorFlow ops.
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.representative_dataset = representative_dataset_gen
tflite_model = converter.convert()

model_root = os.path.dirname(output_model_file)
if not os.path.exists(model_root):
    os.makedirs(model_root)

print("Writing out the tflite model.")
with open(output_model_file, "wb") as tflite_file:
    model_size = tflite_file.write(tflite_model)
print(f"TFLite model of size {model_size//MB} MB was written to {output_model_file}")
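Before handing the model to Vela, it is worth confirming that the converted model really is INT8 at its interfaces. A small sketch using the TensorFlow Lite interpreter, assuming the output path used in the conversion code above:

import tensorflow as tf

# Path of the quantized model written out by the conversion script above.
tflite_path = '/home/amodab01/v5.0/mobilenetV2/visualwake/classification_tf2/output/int81/model.tflite'

interpreter = tf.lite.Interpreter(model_path=tflite_path)
interpreter.allocate_tensors()

# For an Ethos-U friendly model, both input and output tensors should be int8.
for detail in interpreter.get_input_details() + interpreter.get_output_details():
    print(detail["name"], detail["shape"], detail["dtype"])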
We use the Vela compiler and the Corstone-300 Fixed Virtual Platform (FVP) to get the performance numbers for tflite models running on the Arm Ethos-U NPU.
Vela is developed by Arm to compile a tflite model into an optimized version that can run on an embedded system containing an Arm Ethos-U NPU. It is a Python package and can be installed using:
pip install ethos-u-vela
More details can be found here.
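As an example, compiling the quantized model for the same NPU configuration used later in this post could look like the following (the output directory name is arbitrary); Vela writes the optimized model and a summary of its performance estimates to that directory:

vela tao/MobilenetV2.tflite --accelerator-config ethos-u55-256 --output-dir tao-mnv2-vela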
Corstone-300 is a cycle-approximate emulator of the Ethos-U NPU paired with Cortex-M microcontrollers. More information about Corstone-300 can be found here. It is available to developers through the ML Inference Advisor (MLIA) project and can be installed with the following commands:
pip install mlia
mlia-backend install corstone-300
Vela gives estimates based on simplified assumptions, which is why its numbers are not exactly the same as those measured on Corstone-300.
SRAM usage is also based on Corstone-300 estimates. We use the following setting for the Corstone-300 FVP:
mlia check --output-dir tao-mnv2 --performance -t ethos-u55-256 tao/MobilenetV2.tflite
This configuration corresponds to Arm Ethos-U55 NPU with 256 MAC engines.
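If you are targeting a different NPU configuration, the target profile can simply be swapped; for example, a sketch for an Ethos-U65 with 512 MAC engines:

mlia check --output-dir tao-mnv2-u65 --performance -t ethos-u65-512 tao/MobilenetV2.tflite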
For TF2, the TAO Toolkit offers the option of channel pruning, controlled by a set of parameters including the pruning threshold, granularity, and minimum number of filters per layer.
Channel pruning aims to remove unimportant channels in each layer so that a model can be shrunk with minimal impact on its accuracy. To get started, we will first try a pruning threshold of 0.5, which removes around 50% of the channels in each layer, and use the default values for the other parameters. Note that channel pruning reduces both the input and output channel counts of a layer so that the shapes still match, and thus the resulting model will be more than 50% smaller in size, with the exact reduction also governed by other factors like granularity and min_num_filters.
To prune the model, we use the following command:
!tao model classification_tf2 prune -e $SPECS_DIR/spec.yaml
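In the TF2 workflow, the pruning threshold and related parameters are set in the experiment spec rather than on the command line. Below is a hedged sketch of what the prune section of spec.yaml might look like; the field names follow TAO's documented pruning parameters, but the exact schema varies between TAO versions, so check the spec file shipped with the notebook:

prune:
  # Hedged sketch only; verify field names against the spec.yaml in the notebook.
  checkpoint: /path/to/output/train/mobilenet_v2_checkpoint.tlt   # hypothetical path
  threshold: 0.5              # roughly the fraction of channels removed per layer
  granularity: 8              # prune channels in multiples of this value
  min_num_filters: 16         # never prune a layer below this many filters
  equalization_criterion: union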
With a threshold of 0.5, the pruned model is approximately 4x smaller in size than the original and may have slightly reduced accuracy. This occurs because some weights that were previously helpful may have been eliminated. It is advised that you retrain this pruned model using the same dataset in order to regain accuracy. On re-training it, we get an evaluation accuracy of 90.35%. Re-training has recovered all the lost accuracy compared to the baseline dense model.
To get an even faster model that fits a smaller latency budget, we can use a higher pruning threshold of 0.68, which removes approximately 68% of the channels in each layer. This pruned model is approximately 10x smaller, and after re-training it reaches an evaluation accuracy of 90.17%, regaining almost all of the previously lost accuracy with a much smaller model.
In the figure below, we visualize the models in Netron and compare their graph structures side-by-side before and after pruning. Notice how the number of channels (the last dimension in each red box) has been reduced. The full model is shown on the left and the pruned model with a 0.68 pruning threshold is shown on the right.
To deploy the models on the Ethos-U, we need to quantize them to INT8 using post-training quantization. As with the dense model, we use the block of code provided in the previous section to obtain INT8 tflite models, which can then be compiled with Vela to get the following performance estimates.
In addition, the TAO Toolkit offers two AutoML algorithms, Hyperband and Bayesian optimization, as part of its API service, which can be used to automatically tune hyperparameters for a particular model and dataset pair. We will publish a future blog post on using the AutoML feature to further increase the accuracy of the models, with a comparison and the tradeoffs of each algorithm.
This blog describes how you can take a pre-trained model available in the NVIDIA TAO Toolkit, adapt it to your custom dataset and use case, and then use the channel pruning functionality in TAO to obtain models that fit your latency requirements and achieve better overall performance on Arm Ethos-U NPUs. Using off-the-shelf pre-trained models enables users to rapidly fine-tune for downstream tasks using a much smaller dataset, while still achieving high accuracies. The TAO Toolkit streamlines this process and offers good optimization options that enable users to get 3x to 4x higher performance and throughput without sacrificing much model accuracy. It also offers deployment routes to high-performance Arm Ethos-U NPUs, opening the door to a wide range of opportunities for deploying machine learning models at the edge on Arm. We encourage developers to try out the NVIDIA TAO Toolkit and use it to optimize models for deployment on Arm hardware.
This blog is a co-authored piece from Amogh Dabholkar, Machine Learning Engineer at Arm, and Chu Zhou, Staff Engineer at Arm.