In many industrial applications, it is important not only to define what we see, but also how well we see it. This simple yet powerful idea drove recent research in the Arm Research Machine Learning (ML) Lab, which we presented at the Workshop on Machine Learning for Autonomous Driving (ML4AD) at NeurIPS 2020.

To measure how well we see something, we need to measure our uncertainty about it. This makes uncertainty estimation important in critical applications, such as object detection (OD). OD concerns the detection of semantic objects in images or videos by identifying a box around those objects. It is applied in many fields, including autonomous driving (for example, detecting cars or traffic signs), healthcare (for example, detecting types of cancer cells), and many others.

## What is wrong with typical OD metrics?

Current metrics for evaluating OD models - like mean average precision (mAP) - are based on how much a model's predicted bounding box intersects with the ground-truth bounding box. However, besides well-known limitations of mAP [1], this metric cannot accurately evaluate the spatial quality of a bounding box. Figure 1 illustrates one such concern:

Figure 1: Example of a detected bounding box (in orange) evaluated against the ground-truth box (blue line). The orange box covers 50% of the ground-truth box, although it covers only 16% of the plane’s pixels (blue-colored region). Reproduced from Hall and others, 2020.

As Figure 1 shows, the bounding box predicted by a model overlaps the ground-truth box by 50% while capturing only 16% of the real object's pixels - in practice, it covers more of the background than of the actual plane. A metric called Probability-based Detection Quality (PDQ) addresses this issue by evaluating both label and spatial quality. The details are available in the original PDQ paper but, in summary, label quality is measured from the softmaxed probabilities of each bounding box, and spatial quality combines information about foreground and background losses. The notion of foreground and background is what allows cases like the one in Figure 1 to be penalized.
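The gap between box overlap and actual object coverage is easy to reproduce. The toy sketch below (not the PDQ computation itself, and with made-up geometry rather than the 50%/16% numbers of Figure 1) places an object's pixels in one half of its ground-truth box and a predicted box over the other half:

```python
import numpy as np

def box_overlap_fraction(pred, gt):
    """Fraction of the ground-truth box area covered by the predicted box.
    Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / gt_area

# Toy scene: the object's pixels sit in the right half of its
# ground-truth box (think of the plane's fuselage in Figure 1).
mask = np.zeros((100, 100), dtype=bool)
mask[:, 60:90] = True                            # object pixels (x in [60, 90))
gt = (0, 0, 100, 100)                            # ground-truth box
pred = (0, 0, 50, 100)                           # predicted box: left half only

box_cov = box_overlap_fraction(pred, gt)         # 0.5 -> looks decent
pix_cov = mask[:, 0:50].sum() / mask.sum()       # 0.0 -> misses every object pixel
```

A box-based metric sees 50% coverage; a pixel-aware evaluation sees that the prediction contains none of the object.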

## But can we measure uncertainty?

Of course, we can. There are many ways to measure uncertainty in ML models, for example using Bayesian approaches. Monte Carlo dropout (MC-Drop) is a well-known method that can be viewed as an approximation to Bayesian neural networks. It works by keeping dropout active at inference time, so that each forward pass through the neural network produces a different output. This variability (variance) in the outputs represents our uncertainty: the more the network's outputs vary across forward passes, the less certain the model is about the correct output.
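In PyTorch, MC-Drop amounts to re-enabling only the dropout modules at test time and averaging several stochastic forward passes. A minimal sketch with a toy classifier (the model, sizes, and sample count here are illustrative, not from our paper):

```python
import torch
import torch.nn as nn

# Toy classifier with a dropout layer (a stand-in for a real model).
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
                      nn.Dropout(p=0.25), nn.Linear(32, 3))

def mc_dropout_predict(model, x, n_samples=20):
    """Run n_samples stochastic forward passes with dropout kept active."""
    model.eval()
    for m in model.modules():            # re-enable dropout layers only
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(0), probs.var(0)   # prediction and its uncertainty

mean, var = mc_dropout_predict(model, torch.randn(4, 8))
```

The per-class variance `var` is the uncertainty signal: inputs the model is unsure about produce visibly different softmax outputs across the 20 passes.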

Uncertainty estimates are extremely important for achieving robust models, so we need to evaluate whether an ML model can produce good estimates. This problem is well studied in image classification tasks, where metrics like the Brier score and expected calibration error (ECE) are prevalent. However, they are hard to adapt to OD tasks because of the notion of *spatial* information from the bounding box, which does not exist in image classification. These limitations led us to use the PDQ metric introduced above.
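For intuition on what these classification metrics measure, here is a common binned formulation of ECE (one of several variants in the literature; the toy data is invented): it compares average confidence to accuracy within confidence bins.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| over confidence bins,
    weighted by the fraction of predictions landing in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()
            conf = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(acc - conf)
    return ece

# A perfectly calibrated toy model: 70% confident, 70% accurate.
conf = np.full(10, 0.7)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0], dtype=float)
ece = expected_calibration_error(conf, corr)   # ~0 for a calibrated model
```

Nothing in this formulation knows where a bounding box is, which is exactly why spatially aware metrics like PDQ are needed for OD.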

## Stochastic-YOLO to the rescue

We introduce Stochastic-YOLO, a novel OD architecture based on YOLOv3 and designed with efficiency in mind. We added dropout layers for Monte Carlo dropout (MC-Drop) at the end of each of the three YOLO heads. We suggest that MC-Drop offers a good trade-off between complexity and efficiency when introducing stochasticity at inference time.

The introduction of dropout layers towards the end of a deep OD model, instead of after every layer, allows for an efficient sampling procedure. When sampling *N* times from a deep OD model, we can cache the intermediate resulting feature tensor of one forward pass right until the first dropout layer. This cached tensor is deterministic (assuming that numerical errors are not significant), allowing only the last few layers of the model to be sampled from, instead of making *N* full forward passes through all the layers. The entire process is summarized at a higher level for non-stochastic and stochastic cases in Figure 2, from which YOLOv3 and Stochastic-YOLO are special cases.
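The caching idea can be sketched in a few lines. The toy "backbone" and "head" below stand in for the deterministic and stochastic parts of the network (layer sizes and sample count are illustrative): the backbone runs once, and only the small head after the first dropout layer is sampled *N* times.

```python
import torch
import torch.nn as nn

# Stand-ins: a deep deterministic backbone, and a small stochastic
# head containing the first dropout layer and everything after it.
backbone = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU())
head = nn.Sequential(nn.Dropout(p=0.25), nn.Linear(64, 10))

def sample_with_cache(x, n_samples=10):
    backbone.eval()
    head.train()                       # keep dropout active in the head
    with torch.no_grad():
        feats = backbone(x)            # ONE pass through the deep part, cached
        return torch.stack([head(feats) for _ in range(n_samples)])

samples = sample_with_cache(torch.randn(2, 16))   # shape (n_samples, batch, out)
```

Compared to *N* full forward passes, the cost of the extra samples is only *N* passes through the last few layers, which is what keeps Stochastic-YOLO's inference overhead small.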

Figure 2: Working blocks of a deterministic baseline model and a stochastic model with Monte Carlo dropout, from which YOLOv3 and Stochastic-YOLO are special cases. A stochastic model outputs better probabilistic scores (that is, PDQ) when compared to a deterministic baseline that outputs inflated mAP metrics.

For the non-stochastic baseline model at the top, its output is a set of *K* bounding boxes, where each bounding box predicts *C* possible labels, and *K* is the number of bounding boxes originally produced by the specific model (for example, YOLOv3).

Each bounding box contains *5 + C* real values:

- Four representing the bounding box (such as *x*/*y* coordinates, width, and height)
- One representing the objectness score (the score that the bounding box contains an object)
- *C* softmaxed scores, one for each possible label.

This set enters the *Filtering* block, which applies the suppression techniques described in detail in our paper, producing a smaller set of *K'* bounding boxes, where *K'* is the final number of bounding boxes to be evaluated. For the stochastic model, the number of bounding boxes originally produced is *N* × *K* instead, where *N* is the number of MC-Drop samples. This distinction in the stochastic model's output makes the *Filtering* block produce an extra output: for each averaged bounding box that survives filtering, we also keep the corresponding *N* samples of that bounding box.
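A simplified sketch of this extra output, with confidence thresholding standing in for the paper's full suite of suppression techniques (array shapes and the 0.5 threshold are illustrative):

```python
import numpy as np

def filter_stochastic(samples, conf_thresh=0.5):
    """samples: array of shape (N, K, 5 + C) from N MC-dropout passes.
    Returns the averaged boxes whose objectness passes the threshold,
    plus the N raw samples for each surviving box."""
    mean_boxes = samples.mean(axis=0)         # average over samples: (K, 5 + C)
    keep = mean_boxes[:, 4] > conf_thresh     # objectness lives at index 4
    return mean_boxes[keep], samples[:, keep, :]

N, K, C = 5, 3, 2
samples = np.random.rand(N, K, 5 + C)
samples[:, 1, 4] = 0.9       # box 1: clearly confident across all samples
samples[:, 0, 4] = 0.1       # boxes 0 and 2: clearly below threshold
samples[:, 2, 4] = 0.1
mean_kept, samples_kept = filter_stochastic(samples)
```

The second return value is exactly the per-box sample set that the downstream conversion step needs in order to estimate spatial uncertainty.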

A further *Format Conversion* block is needed to transform these sets into a format which can be evaluated from a probabilistic perspective (such as using the PDQ metric) in the *Evaluation* block. Figure 3 illustrates this conversion:

Figure 3: Conversion steps from outputted bounding boxes to a format in which probabilistic metrics can be calculated

In practice, a bounding box vector of the form (*x*, *y*, *w*, *h*) is transformed into a representation with two coordinates for the top-left corner and two for the bottom-right corner, instead of a single coordinate with width/height. For a deterministic model these values correspond to those in the original bounding box vector, whereas for the stochastic model they are the average coordinates across the *N* samples. Finally, each corner also gets a covariance matrix, calculated from the distribution of the *N* sampled points for that corner.
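The conversion for a single box can be sketched as follows, assuming center-format (*x*, *y*, *w*, *h*) samples as input (the sample values below are invented for illustration):

```python
import numpy as np

def to_probabilistic_format(samples_xywh):
    """samples_xywh: (N, 4) array of (x, y, w, h) samples for ONE bounding
    box across N MC-dropout passes. Returns mean top-left and bottom-right
    corners plus a 2x2 covariance matrix for each corner."""
    x, y, w, h = samples_xywh.T
    tl = np.stack([x - w / 2, y - h / 2], axis=1)   # top-left per sample
    br = np.stack([x + w / 2, y + h / 2], axis=1)   # bottom-right per sample
    return tl.mean(axis=0), br.mean(axis=0), \
           np.cov(tl, rowvar=False), np.cov(br, rowvar=False)

samples = np.array([[50., 50., 20., 10.],
                    [52., 49., 22., 12.],
                    [48., 51., 18.,  8.]])
tl_mean, br_mean, cov_tl, cov_br = to_probabilistic_format(samples)
```

The resulting corner Gaussians are what a probabilistic metric like PDQ consumes in the *Evaluation* block.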

## Our experiments and main results

We used and adapted Ultralytics' open-source implementation of YOLOv3 for PyTorch. Training and evaluation were performed on the MS COCO dataset, which contains more than 100,000 images in its 2017 release. For comparison, we trained an ensemble of five YOLOv3 models, each trained in the same way but with a different random seed for initializing the network's weights. Table 1 summarizes the metrics we obtained:

Table 1: Overall results across different models and metrics, where Lbl and SP mean label and spatial quality, respectively. The confidence threshold (0.1 or 0.5) is shown in parentheses. S-YOLO means Stochastic-YOLO, where the corresponding number is the dropout percentage applied, and -X marks a fine-tuned model. The best results for each metric are shown in **bold**.

For every model, we show two confidence thresholds - 0.1 and 0.5. For all the models, a confidence threshold of 0.1 corresponds to higher values of mAP. However, for the label and spatial quality metrics, all the models perform significantly worse when a confidence threshold of 0.1 is applied. This illustrates well-known concerns in the field, where many developers try to decrease the confidence threshold to inflate mAP values, giving a false sense of good performance.

Fine-tuning Stochastic-YOLO models usually yields better metrics than applying Stochastic-YOLO directly to a pre-trained YOLOv3 model with inserted dropout layers and no fine-tuning. PDQ score and spatial quality more than doubled for Stochastic-YOLO with a 25% dropout rate when compared to YOLOv3, while label quality dropped by only around 2% at the same 0.5 confidence threshold.

Further fine-tuning of Stochastic-YOLO can positively impact overall results. Nevertheless, Stochastic-YOLO with a 25% dropout rate and no fine-tuning seems to yield the best trade-off between performance and complexity, achieving comparable results without the need to spend further time on fine-tuning.

## Over to you

The use of MC-Drop usually involves dropout layers activated at both training and test time, but in many OD architectures dropout is not a typical choice of regularizer for training. Consequently, we developed Stochastic-YOLO in a way that fits most OD researchers' current pipelines - we show that direct application of MC-Drop to pre-trained models without any fine-tuning results in significant improvements. To further help the community decide how to use Stochastic-YOLO, we have provided a sensitivity analysis on the dropout rate and on the decision of whether to fine-tune the model further. To the best of our knowledge, this has not yet been fully explored in the field. The implementation of this architecture uses a caching mechanism that keeps the impact on inference time minimal when deploying the model in the real world. All these results are explored in our paper - which you are encouraged to take a look at!

We hope that this work encourages other researchers to extend these ideas on their own models with the help of our publicly available code. Share your results in the comments.

Contact: Tiago Azevedo. Read the full paper.

## Contributing to the Global Goals

In line with Arm's vision to realize the UN Global Goals, this work helps to address three of them. Our advances in probabilistic OD show that it is possible to achieve an extra level of modeling capability in an efficient way, promoting sustained growth for a green economy (Goal 8). We were able to produce effective uncertainty estimates that are important for critical applications in industry, and the work also opens several avenues of further innovative research (Goal 9). Finally, there is an application in autonomous driving, which will likely drive the future of sustainable smart cities (Goal 11). This could ultimately help to establish artificial intelligence as a necessary tool for achieving the SDGs.