Imagine you have a dog. Let’s call her Bingo. You’d like Bingo to fetch her leash whenever you say, “Walkies!” So what do you do? You train her. Maybe you wave the leash and say, “Walkies!” Next time, you might repeat the command word, while pointing to the leash hung on the wall. If Bingo makes the connection and goes to get the leash herself, you’ll likely make a huge fuss to let her know she got it right.
It may take a while, but sooner or later, every time you say, “Walkies!” Bingo will infer that it’s time for a walk, and will go and get the leash. Simple, right? It’s not too dissimilar to what happens with AI, only this time Bingo can take a back seat, because it’s a neural network (NN) that’s doing the learning.
The process is almost exactly the same: if you want the NN to do a job, you need to teach it how to get it right.
Although neural networks – an essential component of machine learning – are loosely modelled on the human brain, they don’t yet have the capacity for artificial general intelligence. That is, they can’t understand or learn like a human. They don’t have the flexibility or versatility that allows our neurons to connect with any old neuron in the vicinity to create millions of overlapping and interlinking neural circuits.
What they do have is layers of neurons that give them the ability to ‘learn’ from examples, without being programmed with any task-specific rules. The number of layers can be relatively few, or can run into the thousands, but they still can’t rival Homo sapiens for complexity.
[Image: a neural network, also known as an artificial neural network]
Let’s take the classic example of image recognition, which in academic circles always seems to involve cats. But let’s indulge Bingo a little and go with dogs. If you want your NN to recognize dogs, you need to give it some training data, so you might start by showing it – guess what? – lots of images of dogs.
Now for humans – even tiny ones – it’s not a difficult process to identify a dog. For an NN, it’s an altogether more laborious process. After all, there are dogs, and there are dogs. (And there’s also the awkward dog/muffin crossover – check out this popular meme to see what I mean.)
So how does it learn?
An NN can have anything from a few dozen to millions of neurons – known as units – arranged, as we’ve said, into layers. On the one side, we have input units, which are designed to receive information for the NN to process. On the other side are output units, which kick out the results. Sandwiched in between are hidden units, which are the ones that do all the hard work.
These units are connected, and each connection carries a number called a weight. A positive weight means one unit excites the next; a negative weight means it suppresses it. The larger the weight, the more influence one unit has on another.
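If you like to see things in code, here’s a minimal sketch in Python of what a single hidden unit does with its inputs and weights. The numbers are invented purely for illustration; a real network has far more units and learns its weights rather than having them typed in.

```python
# A minimal sketch of a single hidden unit (numbers are illustrative only).
inputs = [0.5, 0.8, 0.1]      # signals arriving from the units on the left
weights = [0.9, -0.3, 0.4]    # positive weights excite, negative weights suppress

# Each input is multiplied by its connection weight and the results are summed.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))

# A simple threshold decides whether the unit 'fires' and passes the signal on.
output = 1 if weighted_sum > 0 else 0
print(weighted_sum, output)
```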
Traditional (i.e. old-school) NNs were fully connected – that is, every hidden and output unit was connected to every unit in the adjacent layer. By contrast, convolutional neural networks (CNNs) – a type of NN that has at least one convolutional layer – use a cunning technique to restrict the number of connections and make the whole thing less labor intensive.
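To get a feel for why that matters, here’s a rough back-of-the-envelope comparison. The image size, layer size and filter count below are arbitrary choices (and biases and color channels are ignored), just to show the scale of the difference:

```python
# Rough comparison of weight counts (sizes chosen purely for illustration).
image_pixels = 224 * 224      # one modest greyscale image feeding the input layer
hidden_units = 1000           # an arbitrary fully connected hidden layer

fully_connected_weights = image_pixels * hidden_units   # every pixel to every unit
conv_filter_weights = 3 * 3 * 64                        # 64 small 3x3 filters, reused everywhere

print(f"Fully connected layer: {fully_connected_weights:,} weights")
print(f"Convolutional layer:   {conv_filter_weights:,} weights")
```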
Let’s go back to our doggy pics...
As we mentioned, to recognize dogs, we need to train our network on lots of images. But an image is, after all, just a collection of pixels – and only some of those pixels are key to our goal of distinguishing dogs from cats or raccoons. (Or muffins.)
Fully connected networks are inefficient for image classification, because they don’t take into account spatial structure – that is, the correlation between the pixels. CNNs, by contrast, take advantage of the fact that nearby pixels are more strongly related than distant ones.
The influence of nearby pixels is analyzed using what’s known as a filter. In the case of our canine friends, a filter might be tasked with identifying noses – where and how many times they occur, for example. Since these filters concentrate on small patches of the image at a time, the number of operations the network has to wrangle is reduced, and the process is more streamlined. (It also means that a change in the location of the canine nose doesn’t throw the network into chaos and confusion.)
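If you’d like to see the filter idea in code, here’s a bare-bones sketch using NumPy. The random ‘image’ and the 3x3 filter values are made up for illustration; in a real CNN, the network learns its filter values during training.

```python
import numpy as np

# A tiny greyscale 'image' and a 3x3 filter (values are purely illustrative).
image = np.random.rand(8, 8)
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])

# Slide the filter across the image, one 3x3 patch at a time.
feature_map = np.zeros((6, 6))
for row in range(6):
    for col in range(6):
        patch = image[row:row + 3, col:col + 3]
        feature_map[row, col] = np.sum(patch * edge_filter)

# High values in feature_map mark where the pattern the filter looks for appears.
print(feature_map)
```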
[Image: a traditional, fully connected neural network compared with a convolutional neural network]
Information flows through an NN in two directions. As each new training image is introduced, each unit receives input from the units to its left, and each input is multiplied by the weight of the connection it travels along. If the combined input reaches a certain threshold of ‘influence’, the unit passes the information on to its colleagues on the right.
Of course, it’s not using its skill and judgement as you or I would; it’s using the new input to switch each input unit on or off, according to binary yes/no (1, 0) answers for each typical characteristic. Does it have four legs? A tail? A broad muzzle? Almond eyes?
A typical dog would give a response of 1110 (yes, yes, yes, no) whereas a cat might produce 1101 (yes, yes, no, yes). So, during this learning phase, the NN is simply looking at these numbers and learning that some mean dog and some don’t.
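Sticking with that simplified yes/no example, the decision might be sketched like this. The weights and threshold are invented for illustration; a real network learns them, and works on raw pixels rather than hand-picked features.

```python
# The simplified yes/no example, sketched in code.
# Features: four legs, tail, broad muzzle, almond eyes (1 = yes, 0 = no).
dog_example = [1, 1, 1, 0]
cat_example = [1, 1, 0, 1]

# Invented weights: positive values push towards 'dog', negative ones away from it.
weights = [0.2, 0.2, 0.9, -0.9]

def looks_like_a_dog(features):
    score = sum(f * w for f, w in zip(features, weights))
    return score > 0.5   # arbitrary threshold for this toy example

print(looks_like_a_dog(dog_example))  # True
print(looks_like_a_dog(cat_example))  # False
```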
But, just like Bingo and her leash, NNs need feedback to allow them to compare the actual outcome with the desired outcome. This lets the NN modify its behavior if it’s falling short and, hopefully, do better next time. Backpropagation, as it’s known, allows it to do just that, providing the data needed to adjust the weights – or influence – of the connections between the units and reduce the difference between the actual and the intended output, ideally to zero.
In the case of the dog pictures, our NN might examine an image and spit out a ‘probability vector’ – otherwise known as an educated guess – that’s typically expressed as a percentage. It may, for example, be 89% confident that an image is a dog. (And 11% confident that it’s a muffin.) The difference between the NN’s output and the correct answer is the error value. These error values are pushed back through the network, to help the NN get closer to the correct answer next time.
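As a very rough illustration of that feedback step (nowhere near a full backpropagation implementation), here’s how an error value might nudge a single weight. Every number here is invented for the sake of the example.

```python
# A very rough sketch of the feedback step (not full backpropagation).
predicted = 0.89   # the network is 89% confident the image is a dog
correct = 1.0      # the label says it really is a dog

error = correct - predicted   # 0.11: how far off the guess was
learning_rate = 0.1           # how big a nudge to apply

weight = 0.4                  # one connection weight (value invented)
input_signal = 0.7            # the signal that flowed through that connection

# Nudge the weight in the direction that reduces the error next time around.
weight += learning_rate * error * input_signal
print(weight)
```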
And so, to inference…
Inference is the relatively easy part. It’s essentially when you let your trained NN do its thing in the wild, applying its new-found skills to new data. So, in this case, you might give it some photos of dogs that it’s never seen before and see what it can ‘infer’ from what it’s already learned.
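In code terms, inference is just applying the weights that training settled on to something new, with no more learning involved. Reusing the toy yes/no dog detector from earlier (weights invented, for illustration only):

```python
# Inference with the toy 'dog detector': no more learning, just apply the
# weights that training settled on to a brand-new example.
trained_weights = [0.2, 0.2, 0.9, -0.9]   # frozen after training (values invented)

def infer(features):
    score = sum(f * w for f, w in zip(features, trained_weights))
    return "dog" if score > 0.5 else "not a dog"

# A photo the network has never seen, already reduced to our four yes/no features.
new_photo_features = [1, 1, 1, 0]
print(infer(new_photo_features))   # "dog"
```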
Nowadays, more and more inference is being done on-device – not least on our mobile phones. When you use your camera to take a photo of yourself with bunny ears (and you have, right?), that’s your phone using an NN to recognize that there’s a face there, and applying the aforementioned ears to the right part of your anatomy.
By keeping this processing on the phone, rather than sending it to the cloud, you not only dodge latency to get a near-instant result, you also prevent your data from being exposed to nasties on its way to the cloud and back.
As more and more of this kind of processing moves on-device, our devices are set to become smarter, and the user experience continually smoother. If you’d like to know more about it, check out this blog post, Living on the Edge: Why On-Device ML is Here to Stay.
[CTAToken URL = "https://community.arm.com/developer/ip-products/processors/b/ml-ip-blog/posts/why-on-device-ml-is-here-to-stay" target="_blank" text="Read the blog" class ="green"]