All content in this blog was written by Chris Rogers, CEO, SensiML.
In the computing world, the old aphorism "garbage in, garbage out" has always rung true. The equivalent in the Artificial Intelligence (AI) and Machine Learning (ML) world is "garbage data in, garbage model out" or, with an optimistic spin, "good data in, good model out". Whether your AI classification application is a big data model run in the cloud or a real-time sensor model run at the IoT edge, the adage holds true.
At SensiML, we focus on building tools that simplify and auto-generate optimized AI code for the edge. Working with Arm and the Arm ecosystem of hardware platform vendors, SensiML can provide extremely compact machine learning inference models that leverage hardware capabilities for maximum power and performance in the IoT device realm. SensiML fully supports the Arm Cortex Microcontroller Software Interface Standard (CMSIS) DSP library to take advantage of signal processing optimizations in Cortex-M and Cortex-A processors. Together, SensiML IoT edge AI models running on Arm processors can provide powerful smart sensing insight in real time, where events are occurring. But the performance of such models depends equally on the quality of the datasets used to train the algorithms. Let us look more specifically at what that means for AI and AutoML applications.
ML tools such as SensiML's Analytics Studio can produce extremely effective results when supplied with good datasets. The ability of AutoML software to detect patterns in real-world data is almost uncanny, but only if the software has high quality labeled data for training. Intuitively this makes sense: it is analogous to the impact on an otherwise capable child's ability to learn when presented with poor materials or instruction.
In our experience, whether AI tools work well for a specific use case is often more a reflection of the data being used than anything else. Adverse dataset factors that can degrade model quality fall into the following categories:

- Poor quality data
- Incomplete or insufficient data
- Mislabeled data
- Omission of negative cases
- Unexplained variance
Properly labeled training data is key to learning.
Ideally, these problems should be planned for from the outset, or their impact mitigated when working with existing datasets. We will review each one briefly in turn. The first potential issue, and perhaps the most obvious, is poor quality data. A few of the factors to consider:
Sensor placement: Physical sensor location can play a role in either boosting or attenuating the desired measurement. Look for easy opportunities to improve the signal at the physical interface before turning to amplification, filtering, and other signal processing solutions.
Signal range: Both clipping, from signals that exceed the measurable range, and weak signals quantized to just a few bits of your ADC degrade data quality; address both early, in the pilot data collection phase.
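A quick sanity check along these lines can be run on each pilot capture. Below is a minimal sketch using NumPy; the ADC width, clip threshold, and minimum-bits values are illustrative assumptions, not SensiML defaults.

```python
import numpy as np

def check_signal_range(samples, adc_bits=16, clip_tolerance=0.999, min_used_bits=6):
    """Flag clipping and under-quantization in a raw ADC capture."""
    samples = np.asarray(samples)
    full_scale = 2 ** (adc_bits - 1) - 1
    # Fraction of samples sitting at (or effectively at) full scale
    clipped = np.mean(np.abs(samples) >= clip_tolerance * full_scale)
    # Number of ADC bits actually exercised by the peak-to-peak swing
    used_bits = np.log2(max(samples.max() - samples.min(), 1))
    if clipped > 0.001:
        print(f"Warning: {clipped:.1%} of samples at or near full scale (clipping)")
    if used_bits < min_used_bits:
        print(f"Warning: signal spans only ~{used_bits:.1f} of {adc_bits} ADC bits")
    return clipped, used_bits
```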
Sample rate: It is generally better to overshoot, collecting data at a high rate and downsampling in preprocessing, than to face aliasing caused by an insufficient sample rate. SensiML supports data rates of 100 ksps and, on some devices, up to 1 Msps.
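To make the downsampling step concrete, here is a minimal sketch using SciPy; the rates and test tone are invented for illustration, and this is generic signal processing rather than SensiML-specific code.

```python
import numpy as np
from scipy.signal import decimate

fs_capture = 100_000              # oversampled capture rate (100 ksps)
fs_target = 4_000                 # rate the model actually needs
factor = fs_capture // fs_target  # integer downsampling factor (25)

# Stand-in for a real capture: one second of a 500 Hz tone plus noise
t = np.arange(fs_capture) / fs_capture
raw = np.sin(2 * np.pi * 500 * t) + 0.1 * np.random.randn(fs_capture)

# decimate() applies an anti-aliasing low-pass filter before it
# downsamples, so content above the new Nyquist frequency (2 kHz here)
# is attenuated rather than folded back into the band of interest.
clean = decimate(raw, factor, ftype="fir", zero_phase=True)
```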
Data integrity: Particularly an issue for wireless links, ensure your collected data streams are not suffering from dropped packets or data loss. Data collection tools like SensiML's, which support local storage on the device with download after collection, can be a savior in cases where robust streaming is impractical.
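If your capture firmware stamps each packet with an incrementing counter, gaps are easy to audit after the fact. The sketch below assumes a counter that increments by one and wraps at 16 bits; both details are assumptions about your protocol, not a SensiML API.

```python
import numpy as np

def find_dropped_packets(seq_numbers, modulo=65536):
    """Locate gaps in a stream of per-packet sequence counters."""
    seq = np.asarray(seq_numbers, dtype=np.int64)
    deltas = np.diff(seq) % modulo       # handles counter wrap-around
    gaps = np.flatnonzero(deltas != 1)   # any step other than +1 is a gap
    for i in gaps:
        print(f"Gap after packet {seq[i]}: {deltas[i] - 1} packet(s) missing")
    return gaps

find_dropped_packets([100, 101, 103, 104])  # reports 1 packet missing after 101
```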
One of the most common initial questions we hear from customers is, "How much data is enough to build a good model?" That question is always challenging to answer, since the answer depends on the nature of the given application: very predictable and repeatable systems require much less data to characterize than those with many uncorrelated sources of variance. The best way to handle this challenge is to plan for an iterative, phased approach to data collection and modeling. We advise our customers, wherever possible, to start with smaller datasets (say 30-50 labeled examples) and then analyze the results within the tool. SensiML provides standard statistical methods to visualize and assess variance and data sufficiency (a generic illustration using a learning curve appears below). This initial 'pilot' collection usually informs improvements to the data collection protocol that can reduce the number of samples subsequently required. One of the primary ways of doing so is to tightly control uninteresting sources of variance that add noise to the dataset but no value to the desired model.

Next is incomplete or insufficient data. In general, more data is better than less (as long as it is clean per the previous paragraph), though data collection does not come for free and can be quite costly or impractical to capture from real-world physical systems.

Mislabeled data is another major pitfall, and the degree to which it is a risk for your application depends on how labels are applied in the first place. Ideally, ground truth for the desired classification model can be derived from objective, repeatable sources, such as additional sensors collected in parallel with the classifier's input data that would simply be impractical in real-world usage.
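As promised above, a learning curve is one tool-agnostic way to judge data sufficiency: it shows whether accuracy is still improving as examples are added. Here is a sketch using scikit-learn on synthetic stand-in features; it illustrates the idea and is not SensiML's built-in analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for features extracted from a small pilot collection
X, y = make_classification(n_samples=120, n_features=8,
                           n_informative=5, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} examples -> cross-validated accuracy {score:.2f}")
# Still climbing at the largest size? More data will likely help.
# Plateaued? Extra collection adds cost but little model value.
```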
The SensiML Data Capture Lab application enables developers to build high-quality labeled datasets.
Often, though, such ground truth measures can be subjective or require human judgment, which is not infallible. Like a radiologist reading an X-ray to assess whether a bone is broken, even expert insight can come with errors. In cases where large datasets must be manually labeled, the problem is compounded by the tedium of the exercise.
Fortunately, high quality labeling software such as the SensiML Data Capture Lab can reduce the chances for error with features that assist the process. Examples include synchronization of audio/video with the raw sensor data, allowing visual review in cases where that provides clarity. Predictive labeling, which learns from the initial examples and requires only confirmation of successive labels, is another feature that helps. Multi-user support for data collection can divide up the effort and allow many users in different roles to work on a common project. As you evaluate AI automation tools for your project, make sure you understand what capabilities the software offers to add quality to the labeling process: mislabeled data can severely impact the performance of the final model and, worse, can be difficult or impossible to extricate after the fact.
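One simple, tool-agnostic way to audit a multi-user labeling effort is to have two reviewers label the same segments and measure their agreement. The sketch below uses scikit-learn's Cohen's kappa on toy labels; it illustrates the audit and is not a specific SensiML Data Capture Lab feature.

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned to the same ten segments by two reviewers (toy data)
reviewer_a = ["walk", "walk", "run", "run", "idle",
              "walk", "run", "idle", "idle", "walk"]
reviewer_b = ["walk", "run", "run", "run", "idle",
              "walk", "idle", "idle", "idle", "walk"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")  # ~1.0 strong agreement; ~0 chance level

# Disagreements are the first candidates for expert re-review
disputed = [i for i, (a, b) in enumerate(zip(reviewer_a, reviewer_b)) if a != b]
print("Segments to re-review:", disputed)
```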
Omission of negative cases is an often-overlooked area, as developers focus on collecting representative data for the conditions of interest. Keep in mind that it is important to know not only which examples correspond to a given classification, but also which do not. To the extent that real-world data can include states that might be misclassified as false positives, be sure to capture these as well.
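A useful habit is to hold out some of those negative recordings purely to measure how often the finished model fires on them. A minimal sketch follows; the function name and the scikit-learn-style predict() interface are assumptions for illustration.

```python
import numpy as np

def false_trigger_rate(model, negative_windows, event_labels):
    """Fraction of known-negative windows the classifier flags as an event.

    negative_windows: feature vectors from recordings where no event of
    interest occurred. event_labels: class labels that count as detections.
    Assumes a scikit-learn-style model exposing predict().
    """
    preds = model.predict(np.asarray(negative_windows))
    return float(np.isin(preds, list(event_labels)).mean())
```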
Finally, unexplained variance can be both a source of model performance degradation and a missed opportunity for enhanced insight and value. Frequently it takes the form of additional contextual data (or metadata) that helps explain outcomes; while not a sensor in the physical sense of a device measuring a physical attribute, such metadata can be thought of as another input combined with the actual sensor data. Consider, for example, a predictive maintenance application for a specific electric motor.
The vibration characteristics of that same motor will look different depending on environmental factors like the type of load being driven or the mounts used to secure it. To the extent you can enumerate such sources of variance up front, you can capture those inputs and either include them in the overall model algorithm or use them as conditional states to select between different models.
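The conditional-states approach can be as simple as a lookup from metadata to a context-specific model. Here is a hypothetical Python sketch; the contexts, keys, and file names are all invented for illustration.

```python
# Choose a model variant from contextual metadata (mount type and load
# class) rather than folding all of that variance into a single model.
MODELS = {
    ("rigid_mount", "pump_load"): "motor_rigid_pump.bin",
    ("rubber_mount", "pump_load"): "motor_rubber_pump.bin",
    ("rigid_mount", "fan_load"): "motor_rigid_fan.bin",
}

def select_model(metadata: dict) -> str:
    """Return the model trained for this motor's installation context."""
    key = (metadata["mount"], metadata["load"])
    return MODELS[key]  # an unseen context signals a gap in the dataset

print(select_model({"mount": "rigid_mount", "load": "fan_load"}))
```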
From the examples above, it should be clear that model accuracy depends critically on the availability of properly labeled, high-quality training datasets. Properly planning for and collecting that data lays the foundation for AI/AutoML success: good data in = good results out. Time spent addressing these potential problems at the front end of the process generally pays significant dividends in getting quickly and efficiently to a highly functional AI-based result.
[CTAToken URL="https://www.sensiml.com" target="_blank" text="Learn More About Data Labeling" class="green"]