***All content written in this blog by Logan Weber & Andrew Reusch from OctoML***
Executive summary: Optimizing and deploying machine learning workloads to bare-metal devices is difficult today, and Apache TVM is laying the open-source foundation to make this easy and fast for anyone. Here we show how the µTVM component of TVM brings broad framework support, powerful compiler middleware, and flexible autotuning and compilation capabilities to embedded platforms. Check out our detailed post here for all the technical details, or read on for an overview.
The proliferation of low-cost, AI-powered consumer devices has led to widespread interest in "bare-metal" (low power, often without an operating system) devices among ML researchers and practitioners. While it is already possible for experts to run some models on some bare-metal devices, optimizing models for diverse sets of devices is challenging, often requiring manually optimized device-specific libraries. Because of this, to target new devices, developers must implement one-off custom software stacks for managing system resources and scheduling model execution.
The manual optimization of machine learning software is not unique to bare-metal devices. In fact, it has been a common theme for developers working with other hardware backends (for example, GPUs and FPGAs). Apache TVM has helped many ML engineers and companies handle the breadth of hardware targets available, but until now, it had little to offer for the unique profile of microcontrollers. To close this gap, we have extended TVM with a microcontroller backend, called µTVM (pronounced "MicroTVM"). µTVM facilitates host-driven execution of tensor programs on bare-metal devices and enables automatic optimization of these programs via AutoTVM, TVM's built-in tensor program optimizer. The figure below shows a bird's-eye view of the µTVM + AutoTVM infrastructure:
A standard µTVM setup, where the host communicates with the device via JTAG.
In the image above, we have an STM32F746ZG board, housing an Arm Cortex-M7 processor, an ideal part for AI on the edge given its strong performance in a low-power envelope. We use its USB-JTAG port to connect it to our desktop machine. On the desktop, we run OpenOCD to open a JTAG connection with the device; in turn, OpenOCD allows µTVM to control the M7 processor over a device-agnostic TCP socket. With this setup in place, we can run a CIFAR-10 classifier using TVM code that looks like this (full script here):
```python
# Imports assumed from the full script (µTVM-era TVM APIs; exact paths may differ by version)
import numpy as np
import tvm
from tvm import micro, relay
from tvm.contrib import graph_runtime
from tvm.micro.device.arm import stm32f746xx

OPENOCD_SERVER_ADDR = '127.0.0.1'
OPENOCD_SERVER_PORT = 6666
TARGET = tvm.target.create('c -device=micro_dev')
DEV_CONFIG = stm32f746xx.default_config(OPENOCD_SERVER_ADDR, OPENOCD_SERVER_PORT)

module, params = get_cifar10_cnn()
with micro.Session(DEV_CONFIG) as sess:
    # Compile the model, then load it onto the device over the OpenOCD connection.
    graph, c_module, params = relay.build(module['main'], target=TARGET, params=params)
    micro_mod = micro.create_micro_mod(c_module, DEV_CONFIG)
    graph_mod = graph_runtime.create(graph, micro_mod, ctx=tvm.micro_dev(0))
    # Run inference on the device and read the result back to the host.
    graph_mod.run(data=data_np)
    prediction = CIFAR10_CLASSES[np.argmax(graph_mod.get_output(0).asnumpy())]
    print(f'prediction was {prediction}')
```
The following are the performance results of MicroTVM, compared with CMSIS-NN version 5.6.0 (commit b5ef1c9), a hand-optimized library of ML kernels.
As we can see, the out-of-the-box performance is not great, but this is where AutoTVM comes to the rescue. We can write a schedule template for our device, do a round of autotuning, then achieve significantly better results. To plug in our autotuned results, we only need to replace this line:
```python
graph, c_module, params = relay.build(module['main'], target=TARGET, params=params)
```
with these lines:
```python
with TARGET, autotvm.apply_history_best(TUNING_RESULTS_FILE):
    graph, c_module, params = relay.build(module['main'], target=TARGET, params=params)
```
And our results now look like this:
We have improved our performance by 2x, and we are now quite close to CMSIS-NN (especially if you want TFLite-compatible quantization support), which is code written by some of the best Arm engineers in the world. TVM with µTVM enables you to play with the best of them. To see how it works, see the detailed deep-dive here written by OctoML engineer Logan Weber.
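For readers curious what the autotuning round itself might look like, below is a minimal sketch of a generic AutoTVM tuning loop that logs its results to the TUNING_RESULTS_FILE consumed above. It is an illustration only: the measurement setup for µTVM targets runs on the device through a micro session rather than the local runner shown here (the deep-dive post has the full details), argument names can vary across TVM versions, and `module`, `params`, and `TARGET` are assumed from the earlier snippet.

```python
from tvm import autotvm

# Extract tunable tasks (e.g. conv2d kernels) from the model; `module`, `params`,
# and TARGET are assumed from the snippet above. Exact arguments vary by TVM version.
tasks = autotvm.task.extract_from_program(module['main'], params=params, target=TARGET)

# Plain local measurement shown for illustration; on-device µTVM measurement
# goes through a micro.Session instead (see the deep-dive post).
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5, repeat=3))

for task in tasks:
    # Tune each task with the XGBoost-based tuner and append results to the log
    # that autotvm.apply_history_best() reads at build time.
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=500,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file(TUNING_RESULTS_FILE)])
```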
The envisioned µTVM optimization and deployment pipeline
While end-to-end benchmark results are already obtainable with the current runtime, as we demonstrated above, deployment of these models in a standalone capacity is still on our roadmap. The gap is that the AutoTVM-oriented runtime currently relies on the host to allocate tensors and to schedule function execution. To be useful at the edge, however, we need a pipeline through µTVM that generates a single binary to be run on a bare-metal device. Users will then be able to easily integrate fast ML into their applications by including this binary in their edge application. Each stage of this pipeline is already in place, and now it is just a matter of gluing it all together, so expect updates from us soon on this front.
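As a taste of the pieces that exist today, the C code generator behind the `c -device=micro_dev` target already emits the model's operator code as plain C source, which is the kind of artifact a standalone firmware build would consume. Here is a rough sketch of inspecting that output; it is not the finished deployment flow, and it assumes the `c_module` from the earlier snippet exposes its generated source through `get_source()` (module structure varies by TVM version).

```python
# Sketch: dump the generated C source for inspection or for hand-inclusion in a
# firmware project. Assumes `c_module` from the relay.build() call above; treat
# this as illustrative only, since module structure varies by TVM version.
c_source = c_module.get_source()
with open('cifar10_cnn_ops.c', 'w') as f:
    f.write(c_source)
```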
MicroTVM for single-kernel optimization is ready today and is the choice for that use case. As we now build out self-hosted deployment support, we hope you are just as excited as we are to make µTVM the choice for model deployment as well. However, this is not just a spectator sport - remember: this is all open source. µTVM is still in its early days, so every individual can have a great deal of impact on its trajectory.
Check out the TVM contributor's guide if you are interested in building with us or jump straight into the TVM forums to discuss ideas first.
None of this work would have been possible if not for the following people:
[Learn more about OctoML](https://octoml.ai)