TinyML: TVM Taming the Final (ML) Frontier

Mary Bennion
June 8, 2020

***All content in this blog was written by Logan Weber & Andrew Reusch from OctoML***

Executive summary: Optimizing and deploying machine learning workloads to bare-metal devices today is difficult, and Apache TVM is laying the open-source foundation to make this easy and fast for anyone. Here we show how the µTVM component of TVM brings broad framework support, powerful compiler middleware, and flexible autotuning and compilation capabilities to embedded platforms. Check out our detailed post here for all the technical details, or read on for an overview.

The proliferation of low-cost, AI-powered consumer devices has led to widespread interest in "bare-metal" (low power, often without an operating system) devices among ML researchers and practitioners. While it is already possible for experts to run some models on some bare-metal devices, optimizing models for diverse sets of devices is challenging, often requiring manually optimized device-specific libraries. Because of this, to target new devices, developers must implement one-off custom software stacks for managing system resources and scheduling model execution.

The manual optimization of machine learning software is not unique to the domain of bare-metal devices. In fact, this has been a common theme for developers working with other hardware backends (for example, GPUs and FPGAs). Apache TVM has helped many ML engineers and companies handle the breadth of hardware targets available, but until now, it had little to offer for the unique profile of microcontrollers. To close this gap, we have extended TVM with a microcontroller backend, called µTVM (pronounced "MicroTVM"). µTVM facilitates host-driven execution of tensor programs on bare-metal devices and enables automatic optimization of these programs via AutoTVM, TVM's built-in tensor program optimizer. The figure below shows a bird's-eye view of the µTVM + AutoTVM infrastructure:

[Figure: A bird's-eye view of the µTVM + AutoTVM infrastructure, built on the TVM runtime]

Let us see it in action

[Figure: A standard µTVM setup, where the host communicates with the device via JTAG]

In the image above, we have an STM32F746ZG board, housing an Arm Cortex-M7 processor, an ideal part for AI on the edge given its strong performance in a low-power envelope. We use its USB-JTAG port to connect it to our desktop machine. On the desktop, we run OpenOCD to open a JTAG connection with the device; in turn, OpenOCD allows µTVM to control the M7 processor over a device-agnostic TCP socket. With this setup in place, we can run a CIFAR-10 classifier using TVM code that looks like this (full script here):

# Imports follow the pre-RPC µTVM API of the time (TVM 0.7-era).
import numpy as np
import tvm
from tvm import relay, micro
from tvm.contrib import graph_runtime
from tvm.micro.device.arm import stm32f746xx

OPENOCD_SERVER_ADDR = '127.0.0.1'
OPENOCD_SERVER_PORT = 6666
TARGET = tvm.target.create('c -device=micro_dev')
DEV_CONFIG = stm32f746xx.default_config(OPENOCD_SERVER_ADDR, OPENOCD_SERVER_PORT)

# get_cifar10_cnn, data_np, and CIFAR10_CLASSES are defined in the full script.
module, params = get_cifar10_cnn()
with micro.Session(DEV_CONFIG) as sess:
    graph, c_module, params = relay.build(module['main'], target=TARGET, params=params)
    micro_mod = micro.create_micro_mod(c_module, DEV_CONFIG)
    graph_mod = graph_runtime.create(graph, micro_mod, ctx=tvm.micro_dev(0))
    graph_mod.run(data=data_np)
    prediction = CIFAR10_CLASSES[np.argmax(graph_mod.get_output(0).asnumpy())]
    print(f'prediction was {prediction}')

The following are the performance results of MicroTVM, compared with CMSIS-NN version 5.6.0 (commit b5ef1c9), a hand-optimized library of ML kernels.

[Figure: CIFAR-10 int8 CNN performance, out-of-the-box µTVM compared with CMSIS-NN]

As we can see, the out-of-the-box performance is not great, but this is where AutoTVM comes to the rescue. We can write a schedule template for our device, do a round of autotuning, then achieve significantly better results. To plug in our autotuned results, we only need to replace this line:

graph, c_module, params = relay.build(module['main'], target=TARGET, params=params)

with these lines:

with TARGET, autotvm.apply_history_best(TUNING_RESULTS_FILE):
    graph, c_module, params = relay.build(module['main'], target=TARGET, params=params)
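
For completeness, TUNING_RESULTS_FILE is produced by the autotuning round itself. The sketch below shows only the generic AutoTVM flow; the µTVM-specific builder and runner hooks that drive the board over OpenOCD are elided (see the full script), and N_TRIAL plus the LocalBuilder/LocalRunner placeholders are illustrative assumptions, not the exact configuration used here:

from tvm import autotvm

N_TRIAL = 500  # illustrative tuning budget per task

# Extract the tunable tasks (e.g. conv2d kernels) from the Relay module.
tasks = autotvm.task.extract_from_program(module['main'], target=TARGET, params=params)

# Placeholder measurement setup: a real µTVM tuning run substitutes
# device-specific build and run hooks for these defaults.
measure_option = autotvm.measure_option(builder=autotvm.LocalBuilder(),
                                        runner=autotvm.LocalRunner())

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=N_TRIAL,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file(TUNING_RESULTS_FILE)])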

And our results now look like this:

[Figure: CIFAR-10 int8 CNN performance after autotuning, compared with CMSIS-NN]

We have improved our performance by 2x, and we are now much closer to CMSIS-NN (especially if you want TFLite-compatible quantization support), which is code written by some of the best Arm engineers in the world. TVM with µTVM enables you to play with the best of them. To see how it works, see the detailed deep-dive here written by OctoML engineer Logan Weber.
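
As a side note, the "schedule template" mentioned earlier is an ordinary AutoTVM template: the computation is fixed, while tunable knobs describe the search space AutoTVM explores on the device. The operator, knob names, and values below are purely illustrative sketches, not the templates used for the CIFAR-10 kernels:

import tvm
from tvm import te, autotvm

@autotvm.template("example/vector_add")
def vector_add_template(n):
    # Fixed computation: element-wise int8 addition.
    A = te.placeholder((n,), name="A", dtype="int8")
    B = te.placeholder((n,), name="B", dtype="int8")
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")
    s = te.create_schedule(C.op)

    # Tunable knob: how far to split (and unroll) the innermost loop.
    cfg = autotvm.get_config()
    cfg.define_knob("unroll_factor", [1, 2, 4, 8])
    outer, inner = s[C].split(C.op.axis[0], factor=cfg["unroll_factor"].val)
    s[C].unroll(inner)
    return s, [A, B, C]

At tuning time, AutoTVM compiles and measures candidates from this search space on the target device and logs the results, which is what apply_history_best later replays when rebuilding the model.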

Self-Hosted Runtime: The Final Frontier

[Figure: The envisioned µTVM optimization and deployment pipeline]

While end-to-end benchmark results are already obtainable with the current runtime, as we demonstrated above, deploying these models in a standalone capacity is still on our roadmap. The gap is that the AutoTVM-oriented runtime currently relies on the host to allocate tensors and to schedule function execution. To be useful at the edge, however, we need a pipeline through µTVM that generates a single binary to run on a bare-metal device. Users will then be able to easily integrate fast ML into their applications by including this binary in their edge application. Each stage of this pipeline is already in place, and now it is just a matter of gluing it all together, so expect updates from us soon on this front.

Conclusion

MicroTVM is ready today for single-kernel optimization and is the choice for that use case. As we now build out self-hosted deployment support, we hope you are just as excited as we are to make µTVM the choice for model deployment as well. However, this is not just a spectator sport; remember, this is all open source. µTVM is still in its early days, so every individual can have a great deal of impact on its trajectory.

Check out the TVM contributor's guide if you are interested in building with us or jump straight into the TVM forums to discuss ideas first.

Acknowledgements

None of this work would have been possible if not for the following people:

  • Tianqi Chen, for guiding the design and for being a fantastic mentor.
  • Pratyush Patel, for collaborating on early prototypes of MicroTVM.
  • OctoML, for facilitating the internships where I have been able to go full steam on this project.
  • Thierry Moreau, for mentoring me during my time at OctoML.
  • Luis Vega, for teaching me the fundamentals of interacting with microcontrollers.
  • Ramana Radhakrishnan, for supplying the Arm hardware used in our experiments and for providing guidance on its usage.

Learn more about OctoML
