
SensiML: Good data in, good model out, good result

Mary Bennion
August 13, 2020
6 minute read time.

All content in this blog was written by Chris Rogers, CEO of SensiML.

In the computing world, the old aphorism "garbage in, garbage out" has always rung true. The equivalent in the Artificial Intelligence (AI) and Machine Learning (ML) world is "garbage data in, garbage model out", or, with an optimistic spin, "good data in, good model out". Whether your AI classification application is a big-data model run in the cloud or a real-time sensor model run at the IoT edge, the adage holds true.

At SensiML, we focus on building tools that simplify and auto-generate optimized AI code for the edge. Working with Arm and the Arm ecosystem of hardware platform vendors, SensiML can provide extremely compact machine learning inference models that leverage hardware capabilities for maximum power efficiency and performance on IoT devices. SensiML fully supports the Arm Cortex Microcontroller Software Interface Standard (CMSIS) DSP library to take advantage of signal processing optimizations in Cortex-M and Cortex-A processors. Together, SensiML IoT edge AI models running on Arm processors can provide powerful smart-sensing insight in real time, where events are occurring. But the performance of such models depends equally on the quality of the datasets used to train the algorithms. Let us look more specifically at what that means for AI and AutoML applications.
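For illustration, here is a minimal sketch of what calling into CMSIS-DSP looks like on a Cortex-M target. The window length and the choice of features (mean and RMS) are our own assumptions for the example, not code generated by SensiML:

```c
#include "arm_math.h"  /* CMSIS-DSP */

#define WINDOW_SIZE 256u  /* illustrative window length */

/* Computes two simple window features; CMSIS-DSP uses optimized paths
 * (for example SIMD on cores with DSP extensions) under the hood. */
void extract_features(const float32_t window[WINDOW_SIZE],
                      float32_t *mean, float32_t *rms)
{
    arm_mean_f32(window, WINDOW_SIZE, mean);
    arm_rms_f32(window, WINDOW_SIZE, rms);
}
```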

ML tools such as SensiML's Analytics Studio can produce extremely effective results when supplied with good datasets. The ability of AutoML software to detect patterns in real-world data is almost uncanny, but only if the software has high-quality labeled data for training. Intuitively this makes sense: it is analogous to how even a capable child's learning suffers when the materials or instruction are poor.

In our experience, whether AI tools work well for a specific use case is often more a reflection of the data being used than anything else. The dataset factors that degrade model quality fall into the following categories:

  • Poor sensor data quality 
  • Data insufficiency 
  • Mislabeled data 
  • Omission of negative cases 
  • Unexplained variance  

Properly labeled training data is key to learning.

Ideally, these problems should be planned for from the outset, or their impact mitigated when working with existing datasets. We will review each briefly in turn.

The first potential issue, and perhaps the most obvious, is poor-quality data. Just a few of the questions to ask yourself are:

Am I getting sufficient signal from my sensor to measure the physical property of interest? 

Physical sensor location can play a role in either boosting or attenuating the desired measurement. Look for easy opportunities to improve signal at the physical interface before turning to amplification, filtering, and other signal processing solutions. 

Are the full-scale range and gain of the sensor appropriate for the nature of the measurement?

Whether it is clipping from signals that exceed the measurable range or a weak signal quantized to just a few bits of your ADC, both are harmful and should be addressed early in the pilot data-collection phase.
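As a concrete illustration, a pilot-phase sanity check along these lines might look as follows. The function, the thresholds, and the 16-bit ADC sample format are hypothetical choices for the sketch:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns false if the capture clips at the rails or spans too little
 * of full scale; both thresholds below are illustrative. */
bool capture_looks_healthy(const int16_t *samples, size_t n,
                           int16_t adc_min, int16_t adc_max)
{
    if (n == 0)
        return false;

    size_t clipped = 0;
    int16_t lo = samples[0], hi = samples[0];

    for (size_t i = 0; i < n; i++) {
        if (samples[i] <= adc_min || samples[i] >= adc_max)
            clipped++;
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
    }

    bool clipping  = clipped > n / 1000u;  /* more than 0.1% at the rails */
    bool too_quiet = (int32_t)hi - (int32_t)lo <
                     ((int32_t)adc_max - (int32_t)adc_min) / 64;  /* few effective bits */
    return !clipping && !too_quiet;
}
```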

Is my sampling rate appropriate to the dynamic nature of the signal?   

It is generally better to overshoot: collect data at a high rate and downsample in preprocessing rather than face aliasing from an insufficient sample rate. SensiML supports data rates of 100 ksps and, on some devices, up to 1 Msps.
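Here is a sketch of that preprocessing step using the CMSIS-DSP FIR decimator. The 4:1 factor, block size, and crude moving-average taps are illustrative stand-ins; a real design would use properly designed anti-aliasing filter coefficients:

```c
#include "arm_math.h"  /* CMSIS-DSP */

#define BLOCK_SIZE 64u  /* input samples per call; must be a multiple of DECIM_M */
#define DECIM_M    4u   /* e.g. 400 ksps capture down to 100 ksps for modeling  */
#define NUM_TAPS   8u

/* Crude moving-average low-pass standing in for real anti-aliasing taps. */
static const float32_t taps[NUM_TAPS] = {
    0.125f, 0.125f, 0.125f, 0.125f, 0.125f, 0.125f, 0.125f, 0.125f
};
static float32_t state[NUM_TAPS + BLOCK_SIZE - 1u];
static arm_fir_decimate_instance_f32 decim;

arm_status decimate_init(void)
{
    /* Fails with ARM_MATH_LENGTH_ERROR if BLOCK_SIZE % DECIM_M != 0. */
    return arm_fir_decimate_init_f32(&decim, NUM_TAPS, DECIM_M,
                                     taps, state, BLOCK_SIZE);
}

/* Low-pass filters, then keeps every DECIM_M-th sample:
 * out receives BLOCK_SIZE / DECIM_M values. */
void decimate_block(const float32_t *in, float32_t *out)
{
    arm_fir_decimate_f32(&decim, in, out, BLOCK_SIZE);
}
```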

Am I capturing all the data?

Dropped packets and data loss are a particular concern for wireless links, so ensure your collected data streams are complete. Data collection tools like SensiML's, which support local storage on the device with download after collection, can be a savior where robust streaming is impractical.
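One common way to detect such loss is to stamp every packet with a sequence counter on the device and check for gaps on the host. The packet layout below is a hypothetical example, not a SensiML protocol:

```c
#include <stdint.h>

/* Hypothetical packet layout: a device-side counter rides along with
 * every payload so the host can detect gaps. */
typedef struct {
    uint32_t seq;         /* increments by one per packet sent    */
    int16_t  samples[32]; /* payload size is an arbitrary example */
} sensor_packet_t;

/* Host side: returns how many packets were lost since the last one seen.
 * Initialize *last_seq to the first packet's seq before streaming starts. */
uint32_t packets_dropped(uint32_t *last_seq, const sensor_packet_t *pkt)
{
    uint32_t gap = pkt->seq - *last_seq - 1u; /* unsigned math handles wraparound */
    *last_seq = pkt->seq;
    return gap;
}
```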

One of the most common initial questions we hear from customers is, "How much data is enough to build a good model?" The answer is always challenging because it depends on the nature of the given application. Very predictable and repeatable systems require much less data to characterize than those with many uncorrelated sources of variance. The best way to handle this challenge is to plan an iterative, phased approach to data collection and modeling. We advise our customers, wherever possible, to start with smaller datasets (say 30-50 labeled examples) and then analyze the results within the tool. SensiML provides standard statistical methods to visualize and assess variance and data sufficiency. This initial 'pilot' collection usually informs improvements to the data collection protocol that reduce the number of samples subsequently required. One of the primary ways of doing so is to tightly control uninteresting sources of variance that add noise to the dataset but no value to the desired model.

Next is incomplete or insufficient data. In general, more data is better than less (provided it is clean, per the previous paragraph), though data collection is not free and can be quite costly or impractical for real-world physical systems.
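A pilot-phase check in this spirit might simply count labeled examples per class and flag any class that falls short. The class count and the 30-example threshold below are illustrative, echoing the pilot range mentioned above rather than any SensiML rule:

```c
#include <stddef.h>
#include <stdio.h>

#define NUM_CLASSES  4u   /* illustrative */
#define MIN_EXAMPLES 30u  /* lower end of the 30-50 example pilot range above */

void report_class_coverage(const unsigned labels[], size_t n_segments,
                           const char *class_names[])
{
    size_t counts[NUM_CLASSES] = { 0 };

    for (size_t i = 0; i < n_segments; i++)
        if (labels[i] < NUM_CLASSES)
            counts[labels[i]]++;

    for (size_t c = 0; c < NUM_CLASSES; c++)
        if (counts[c] < MIN_EXAMPLES)
            printf("class '%s': only %zu examples, collect more\n",
                   class_names[c], counts[c]);
}
```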

Mislabeled data is another major pitfall, and the degree to which it is a risk for your application depends on how labels are applied in the first place. Ideally, ground truth for the desired classification model can be derived from objective, repeatable sources, such as reference sensors collected in parallel with the classifier's input sensor data during training, even when those reference sensors are impractical to deploy in real-world usage.
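As a sketch of that approach, imagine a motor monitor where a current-draw channel, recorded only during data collection, supplies objective on/off labels for each window of vibration data. The setup and threshold are hypothetical:

```c
#include <stddef.h>

typedef enum { LABEL_OFF = 0, LABEL_ON = 1 } label_t;

/* Derives a label per window from a parallel reference channel that
 * will not exist on the deployed product (hypothetical example). */
void label_windows_from_reference(const float ref_mean_per_window[],
                                  label_t labels_out[], size_t n_windows,
                                  float on_threshold /* e.g. amps, assumed */)
{
    for (size_t i = 0; i < n_windows; i++)
        labels_out[i] = (ref_mean_per_window[i] > on_threshold)
                            ? LABEL_ON : LABEL_OFF;
}
```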

The SensiML Data Capture Lab application enables developers to build high-quality labeled datasets.

Often, though, such ground-truth measures are subjective or require human judgment, which is not infallible. Like a radiologist reading an X-ray to assess whether a bone is broken, even expert insight comes with errors. Where large datasets must be labeled manually, the problem is compounded by the tedium of the exercise.

Fortunately, high-quality labeling software such as the SensiML Data Capture Lab can reduce the chance of error with features that assist the process. Examples include synchronizing audio and video to the raw sensor data so labels can be reviewed visually where that provides clarity; predictive labeling, which learns from initial examples and then requires only confirmation of successive labels; and multi-user support, which divides the collection effort across many users and roles working on a common project. As you evaluate AI automation tools for your project, make sure you understand what capabilities the software provides to add quality to the labeling process, as mislabeled data can severely degrade the performance of the final model and, worse, can be difficult or impossible to extricate after the fact.
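To make the predictive-labeling idea concrete, here is a rough stand-in, a nearest-centroid rule of our own, not the Data Capture Lab's actual algorithm, that proposes a label for a new segment and leaves confirmation to the user:

```c
#include <float.h>
#include <stddef.h>

#define NUM_FEATURES 3u  /* illustrative feature-vector length */

/* Returns the index of the nearest class centroid (squared Euclidean
 * distance). The result is a suggestion only; the user confirms it. */
size_t suggest_label(const float features[NUM_FEATURES],
                     const float centroids[][NUM_FEATURES], size_t n_classes)
{
    size_t best = 0;
    float best_d = FLT_MAX;

    for (size_t c = 0; c < n_classes; c++) {
        float d = 0.0f;
        for (size_t f = 0; f < NUM_FEATURES; f++) {
            float diff = features[f] - centroids[c][f];
            d += diff * diff;
        }
        if (d < best_d) {
            best_d = d;
            best = c;
        }
    }
    return best;
}
```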

Omission of negative cases is an area often overlooked as developers focus on collecting representative data for the conditions of interest. Keep in mind that it is important to know not only which examples correspond to a given classification, but also which do not. To the extent that real-world data could include states that might misclassify as false positives, be sure to capture those as well.

Finally, unexplained variance can be both a source of model performance degradation and a missed opportunity for enhanced insight and value. It frequently takes the form of additional contextual data (or metadata) that explains outcomes; while not a sensor in the physical sense of a device measuring a physical attribute, it can be treated as another input combined with the actual sensor data. An example would be a predictive maintenance application for a specific electric motor.

The vibration characteristics of that motor will look different depending on environmental factors such as the type of load being driven or the type of mounts used to secure it. To the extent that you can enumerate such sources of variance upfront, you can capture those inputs and either include them in the overall model algorithm or use them as conditional states to select among different models, as sketched below.
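Here is a minimal sketch of the conditional-state approach for the motor example. The mount categories and the per-condition classifiers are hypothetical placeholders for models trained separately on data collected under each condition:

```c
typedef enum { MOUNT_RIGID, MOUNT_RUBBER } mount_t;

typedef int (*classifier_fn)(const float *features);

/* Stubs standing in for per-condition models, e.g. each generated from
 * a dataset collected under that mounting condition. */
static int classify_rigid_mount(const float *features)  { (void)features; return 0; }
static int classify_rubber_mount(const float *features) { (void)features; return 0; }

/* Metadata (the mount type) acts as a conditional state that selects
 * which model sees the vibration features. */
int classify_vibration(mount_t mount, const float *features)
{
    classifier_fn model = (mount == MOUNT_RIGID) ? classify_rigid_mount
                                                 : classify_rubber_mount;
    return model(features);
}
```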

From the examples above, it should be clear that model accuracy depends directly on the availability and proper labeling of high-quality training datasets. Properly planning for and collecting that data lays the foundation for AI/AutoML success. Good data in = good results out. Time spent addressing these potential problems at the front end of the process generally pays significant dividends in getting quickly and efficiently to a highly functional AI-based result.

Learn More About Data Labeling

 
