MLP evaluation
What we have done so far is build a multilayer perceptron with activation functions so that we can begin to construct complex non-linear decision boundaries (or even tackle regression problems). We can build increasingly large neural networks with multiple hidden layers that can have huge numbers of connected neurons.
So far we have been eyeballing our outputs and attempting to manually tweak our weights and biases to solve a given problem. This aspect of “training” is something we will learn how to do automatically in the later theory section. Before we get to that process, we should set ourselves up for success so that we know how to prepare our data for training and testing and have a suitable way to evaluate how well our neural network is guessing.
Loss/cost functions
There are many different ways of evaluating loss/cost, i.e. quantifying how far our model is from correctly guessing the answer. In this course we will focus on the more conventional Mean Squared Error, or MSE for short. In other material you might encounter other forms of loss such as “Cross Entropy Loss”.
The basic idea is fairly intuitive. If your guess is the right answer, loss should be low; if your guess is the wrong answer, loss should be high. A straightforward way to do this is to take the difference between your guess and the right answer. But this can run into issues.
Suppose we have a neural network set up to guess whether something is in class 0 or class 1 (this could be guessing whether a picture is a cat, class 0, or a dog, class 1, for example). If our neural network has a picture of a cat (0) that it guesses is a dog (1), and a picture of a dog (1) that it guesses is a cat (0), then adding together the two differences between observation and prediction gives:
\((0-1) + (1-0) = 0\)
which implies that the network is perfect! So we want to make sure minus signs don’t cancel out. It transpires that a sensible way to do this is to square each of the errors before adding them together:
\((0-1)^2 + (1-0)^2 = 2\)
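To see this concretely, here is a minimal sketch of that arithmetic in Python (the lists below just encode the cat/dog example above):

```python
# Observations (true classes) and the network's guesses from the example above
observations = [0, 1]   # a cat (0) and a dog (1)
predictions = [1, 0]    # guessed dog (1) and cat (0) -- both wrong

# Adding the raw differences lets the minus signs cancel out
raw_sum = sum(obs - pred for obs, pred in zip(observations, predictions))
print(raw_sum)  # 0 -- looks "perfect" even though both guesses were wrong

# Squaring each difference before adding removes the cancellation
squared_sum = sum((obs - pred) ** 2 for obs, pred in zip(observations, predictions))
print(squared_sum)  # 2
```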
We also want our loss not to depend on the number of data points we are testing on at a given time. To do this, we just divide by the number of data points \(n\). In general, for \(n\) data points with observations \(Observation_{i}\) and predictions \(Prediction_{i}\), \(i=1, \ldots, n\), we define the Mean Squared Error as:
\(MSE = \frac{1}{n} \sum_{i=1}^{n} (Observation_{i} - Prediction_{i})^2\)
\(= \frac{1}{n} \left((Observation_{1}- Prediction_{1})^2 + (Observation_{2}- Prediction_{2})^2 + \ldots + (Observation_{n}- Prediction_{n})^2\right)\)
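As a sketch, the formula above translates almost directly into a few lines of Python; the function name and the use of NumPy here are our own choices, not part of any particular library:

```python
import numpy as np

def mean_squared_error(observations, predictions):
    """Mean Squared Error between observed values and model predictions."""
    observations = np.asarray(observations, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return np.mean((observations - predictions) ** 2)

# The cat/dog example above: two wrong guesses over n = 2 points gives 2 / 2 = 1
print(mean_squared_error([0, 1], [1, 0]))  # 1.0
```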
Out of sample predictive power
For a full recap on testing your model from the applied data analysis in Python course, see here. The key thing to remember is that the true power of building predictive models is not in predicting on data that we already have - we already know the right answer there!
What we aspire to do is to learn a pattern from some labelled data and then predict new labels on previously unseen data. We might want to guess the price of a stock next week, guess whether a picture is a dog or a cat, or guess the next word in a sentence. To make our predictive model good at this, we need to reduce “overfitting”, which is when our model learns characteristics specific to our dataset that don’t generalise.
An example of overfitting might be if you trained an image classifier to tell the difference between a dog and a cat, but all your examples of dogs were taken outside so lots of the background is green. A network could “learn” that green is associated with dog and then any picture with a green background might automatically get associated with a dog, even if the foreground image was a cat.
In order to mitigate this and maximise our ability to predict out of sample, we always separate our data before training (this applies to all supervised learning, not just neural networks) into a test set and a train set. We train the model on the train set and then evaluate that model’s performance on the disjoint test set.
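One common way to do this split in Python is scikit-learn’s train_test_split; the sketch below assumes some made-up feature array X and label array y standing in for a real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples with 2 features each and a binary label
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, size=100)

# Hold out 20% of the data as a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on (X_train, y_train), then evaluate performance (e.g. MSE) on (X_test, y_test)
```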