***All content written in this blog by Logan Weber & Andrew Reusch from OctoML***
Executive summary: Optimizing and deploying machine learning workloads to bare-metal devices is difficult today, and Apache TVM is laying the open-source foundation to make this easy and fast for anyone. Here we show how the µTVM component of TVM brings broad framework support, powerful compiler middleware, and flexible autotuning and compilation capabilities to embedded platforms. Check out our detailed post here for all the technical details, or read on for an overview.
The proliferation of low-cost, AI-powered consumer devices has led to widespread interest in "bare-metal" (low power, often without an operating system) devices among ML researchers and practitioners. While it is already possible for experts to run some models on some bare-metal devices, optimizing models for diverse sets of devices is challenging, often requiring manually optimized device-specific libraries. Because of this, to target new devices, developers must implement one-off custom software stacks for managing system resources and scheduling model execution.
The manual optimization of machine learning software is not unique to bare-metal devices. In fact, it has been a common theme for developers working with other hardware backends (for example, GPUs and FPGAs). Apache TVM has helped many ML engineers and companies handle the breadth of hardware targets available, but until now, it had little to offer for the unique profile of microcontrollers. To close this gap, we have extended TVM with a microcontroller backend, called µTVM (pronounced "MicroTVM"). µTVM facilitates host-driven execution of tensor programs on bare-metal devices and enables automatic optimization of these programs via AutoTVM, TVM's built-in tensor program optimizer. The figure below shows a bird's-eye view of the µTVM + AutoTVM infrastructure:
A standard µTVM setup, where the host communicates with the device via JTAG.
In the image above, we have an STM32F746ZG board, housing an Arm Cortex-M7 processor, an ideal part for AI on the edge given its strong performance in a low-power envelope. We use its USB-JTAG port to connect it to our desktop machine. On the desktop, we run OpenOCD to open a JTAG connection with the device; in turn, OpenOCD allows µTVM to control the M7 processor over a device-agnostic TCP socket. With this setup in place, we can run a CIFAR-10 classifier using TVM code that looks like this (full script here):
```python
# Imports assumed from the full script (µTVM-era TVM APIs; exact paths may differ by version)
import numpy as np
import tvm
from tvm import micro, relay
from tvm.contrib import graph_runtime
from tvm.micro.device.arm import stm32f746xx

OPENOCD_SERVER_ADDR = '127.0.0.1'
OPENOCD_SERVER_PORT = 6666
TARGET = tvm.target.create('c -device=micro_dev')
DEV_CONFIG = stm32f746xx.default_config(OPENOCD_SERVER_ADDR, OPENOCD_SERVER_PORT)

module, params = get_cifar10_cnn()
with micro.Session(DEV_CONFIG) as sess:
    # Compile the model, then load it onto the device over the OpenOCD connection.
    graph, c_module, params = relay.build(module['main'], target=TARGET, params=params)
    micro_mod = micro.create_micro_mod(c_module, DEV_CONFIG)
    graph_mod = graph_runtime.create(graph, micro_mod, ctx=tvm.micro_dev(0))
    # Run inference on the device and read the result back to the host.
    graph_mod.run(data=data_np)
    prediction = CIFAR10_CLASSES[np.argmax(graph_mod.get_output(0).asnumpy())]
    print(f'prediction was {prediction}')
```
The following are the performance results of MicroTVM, compared with CMSIS-NN version 5.6.0 (commit b5ef1c9), a hand-optimized library of ML kernels.
As we can see, the out-of-the-box performance is not great, but this is where AutoTVM comes to the rescue. We can write a schedule template for our device, do a round of autotuning, then achieve significantly better results. To plug in our autotuned results, we only need to replace this line:
```python
graph, c_module, params = relay.build(module['main'], target=TARGET, params=params)
```
with these lines:
```python
with TARGET, autotvm.apply_history_best(TUNING_RESULTS_FILE):
    graph, c_module, params = relay.build(module['main'], target=TARGET, params=params)
```
And our results now look like this:
We have improved our performance by 2x, and we are now quite close to CMSIS-NN (especially if you want TFLite-compatible quantization support), which is code written by some of the best Arm engineers in the world. TVM with µTVM enables you to play with the best of them. To see how it works, see the detailed deep-dive here written by OctoML engineer Logan Weber.
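For readers curious what the autotuning round itself might look like, below is a minimal sketch of a generic AutoTVM tuning loop that logs its results to the TUNING_RESULTS_FILE consumed above. It is an illustration only: the measurement setup for µTVM targets runs on the device through a micro session rather than the local runner shown here (the deep-dive post has the full details), argument names can vary across TVM versions, and `module`, `params`, and `TARGET` are assumed from the earlier snippet.

```python
from tvm import autotvm

# Extract tunable tasks (e.g. conv2d kernels) from the model; `module`, `params`,
# and TARGET are assumed from the snippet above. Exact arguments vary by TVM version.
tasks = autotvm.task.extract_from_program(module['main'], params=params, target=TARGET)

# Plain local measurement shown for illustration; on-device µTVM measurement
# goes through a micro.Session instead (see the deep-dive post).
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5, repeat=3))

for task in tasks:
    # Tune each task with the XGBoost-based tuner and append results to the log
    # that autotvm.apply_history_best() reads at build time.
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=500,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file(TUNING_RESULTS_FILE)])
```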
The envisioned µTVM optimization and deployment pipeline
While end-to-end benchmark results are already obtainable with the current runtime, as we demonstrated above, deployment of these models in a standalone capacity is still on our roadmap. The gap is that the AutoTVM-oriented runtime currently relies on the host to allocate tensors and to schedule function execution. To be useful at the edge, however, we need a pipeline through µTVM that generates a single binary to be run on a bare-metal device. Users will then be able to easily integrate fast ML into their applications by including this binary in their edge application. Each stage of this pipeline is already in place, and now it is just a matter of gluing it all together, so expect updates from us soon on this front.
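As a taste of the pieces that exist today, the C code generator behind the `c -device=micro_dev` target already emits the model's operator code as plain C source, which is the kind of artifact a standalone firmware build would consume. Here is a rough sketch of inspecting that output; it is not the finished deployment flow, and it assumes the `c_module` from the earlier snippet exposes its generated source through `get_source()` (module structure varies by TVM version).

```python
# Sketch: dump the generated C source for inspection or for hand-inclusion in a
# firmware project. Assumes `c_module` from the relay.build() call above; treat
# this as illustrative only, since module structure varies by TVM version.
c_source = c_module.get_source()
with open('cifar10_cnn_ops.c', 'w') as f:
    f.write(c_source)
```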
MicroTVM for single-kernel optimization is ready today and is the choice for that use case. As we now build out self-hosted deployment support, we hope you are just as excited as we are to make µTVM the choice for model deployment as well. However, this is not just a spectator sport - remember: this is all open source. µTVM is still in its early days, so every individual can have a great deal of impact on its trajectory.
Check out the TVM contributor's guide if you are interested in building with us or jump straight into the TVM forums to discuss ideas first.
None of this work would have been possible if not for the following people:
[Learn more about OctoML](https://octoml.ai)