**Validation**

Recall our general approach to building a deployable neural network using supervised learning: we design an architecture, gather labeled training data, and then run that data through the network over and over, helping it adjust itself to make the labels it produces (or *predicts*) ever-closer to the actual labels. We can periodically test the network's quality by giving it some new examples it hasn't seen before, and measuring how well the predicted labels match the real labels.

This last step, where we test the network with new data, might be called *evaluation*, but it's come to be called **validation**. You can think of it as determining a number that quantifies the "validity" (or correctness) of our network.

Here we'll talk about how to actually perform this validation (or evaluation) step in the most productive way. The key insight is that we will break up our labeled input data into one or more different pieces, each of which has its own role.

To make this discussion concrete, I'll use the example of an image classifier. That also lets us show example data by simply presenting pictures. But these ideas are very general, and are used for all kinds of neural networks.

Why We Shouldn't Train With All the Data
===

Knowing that every example we give to the network will usually improve its abilities just a little bit more, we might be tempted to use all of our data for learning the proper weights. This wouldn't work out well for us; let's see why.

The problem is that even if the system does a perfect job classifying the training data, that doesn't mean it will do well at all with new data it hasn't seen before.

For example, suppose we want to classify images of hand-written block capital letters. Figure [fig-VOWEL] shows a possible training set of 5 examples: V, O, W, E, and L. After running through these a bunch of times, our system learns to classify each example perfectly.

![Figure [fig-VOWEL]: Five pieces of training data](../Images/600x200.png)

Now we ask the system to classify the images in Figure [fig-non-VOWEL], and it fails completely. Every prediction is wrong. The problem, of course, is that it hasn't seen anything like these letters before. It might look at the T, for instance, and decide that it's composed of two straight lines, and therefore it's an L.

![Figure [fig-non-VOWEL]: Some new data we want to classify](../Images/600x200.png)

Obviously the problem here is that we didn't give it examples of all 26 letters to train on. But except for small toy examples like this, we'll almost never be able to give the system all of the possible pieces of data it might see. Suppose our system is trying to recognize dogs in photographs: there's an effectively infinite number of possible photographs with dogs, so we'll never be able to give the computer a clear example of every kind of input we might ever give it. If we're trying to classify the words that someone speaks into a phone, there's no way we can train the system on every voice and accent and speaking pattern. This problem, where we can't find and train on an example of every kind of thing we might encounter in real use, is a key part of the [Curse of Dimensionality](curse_of_dimensionality.md.html).

So we're faced with two problems here. First, how much data do we need to train our network well? Second, how can we have confidence that the network will give good results on new data?

The first problem is usually addressed by the motto, "More data is better."
In fact, gathering more input data is usually the very best thing you can do to improve your network's training, and it will give you better results than adopting some other architecture or using some fancy algorithm.

To overcome the second problem, and make a classifier that can be used in a practical setting, we'll set aside some of our precious starting data.

Training and Validation Sets
===

Before we can deploy our network for real applications, we need some confidence that it will handle new data correctly. The best way to get that confidence is to give the network some new data it's never seen before, and see how well it does.

To achieve this, we *segment*, or split, our original collection of labeled examples into two pieces, one big and one small, as shown in Figure [fig-segmentation]. The big set we'll call the [training set](training_set.md.html). The smaller one is the [validation set](validation_set.md.html).

![Figure [fig-segmentation]: Splitting our input examples into a training set and a validation set](../Images/600x200.png)

When you're ready to train, set the validation set aside, and feed the training set to the network over and over until you decide to stop (usually, you evaluate its accuracy after each [epoch](epoch.md.html), or complete run through the training set, and stop when the accuracy stops improving).

Evaluating with the Validation Set
---

When you're done learning with the training set, break out the validation set, and have the network predict the label for each entry. But here is the critical part: **do not update the weights when predicting the validation set**.

> "Never update the network from the validation
> set. Use it only to determine the network's
> performance on those examples."

This is extremely important, so it's worth repeating: **do not learn from the validation set**. The validation pass is strictly to gather the network's predictions, right or wrong.

Remember that, generally speaking, the more data the algorithm learns from, the better it will perform. So shouldn't we learn from our validation data?

If we were to learn from the validation set, then the network would learn the labels on those examples, just like it did for the examples in the training set. So we would expect it to do better and better on the validation set the more times it saw that data and learned from it. And then we'd have lost the whole point of the validation set, which was to tell us how the network would perform on *brand-new* and *never-before-seen* data, just like the data it will be getting once it's deployed.

By using the validation data, but never learning from it, we ensure that it's fresh each time. That is, the network has never tried to learn from or adapt to the validation data, so even though we use it over and over, since the network doesn't remember it in any way, it seems new each time.
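To make this concrete, here's a minimal sketch of the whole split-train-validate cycle in Python. It assumes scikit-learn is available, and uses its small `MLPClassifier` and bundled digits dataset as stand-ins for our network and our labeled examples; the 80/20 split, the early-stopping patience, and the other settings are just illustrative choices, not values this chapter prescribes.

```python
# A minimal sketch of the split-train-validate cycle, using scikit-learn's
# MLPClassifier as a stand-in for "the network" and its digits dataset as
# the labeled examples. Split size, patience, and epoch count are illustrative.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# Segment the labeled data: a big training set and a smaller validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)

best_accuracy, epochs_without_improvement = 0.0, 0
for epoch in range(200):
    # One epoch: learn from every example in the training set, once.
    net.partial_fit(X_train, y_train, classes=np.unique(y))

    # Validation: predict the held-out examples and score them.
    # No weights are updated here -- we only measure performance.
    accuracy = net.score(X_val, y_val)

    if accuracy > best_accuracy:
        best_accuracy, epochs_without_improvement = accuracy, 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= 10:   # accuracy has stopped improving
        break

print(f"stopped after epoch {epoch}, best validation accuracy {best_accuracy:.3f}")
```

Notice that the only thing we ever do with `X_val` and `y_val` is call `score()`, which makes predictions and compares them to the real labels; no learning happens in that step.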
k-Fold Validation
===

The discussion above assumed that when we split up our original data, the training set was still large enough to do a good job of training the network. But what if it's not?

Sometimes you only have a small amount of labeled data, and getting more is impractical or impossible. Because there are so few examples, you really want to train using every one.

We can use an algorithm that kind of splits the difference between using all the data to train with, and using only part of the data while sequestering the rest in the validation set. Rather than split the data into two pieces, one big and one small, we split it into several chunks of equal size, as shown in Figure [fig-folds]. Each chunk is called a [fold](fold.md.html).

![Figure [fig-folds]: Splitting our input examples into 5 folds](../Images/600x200.png)

Let's say we have 5 folds, as in the figure. When we start training, we treat the examples in folds 1-4 as our training set, and the examples in fold 5 as our validation set. Now we train by running through one epoch; that is, we train using each example in folds 1-4 one time. Then we see how well we did by predicting the examples in the validation set and comparing those predictions to the real labels. So far, that's just what we were doing in the last section.

But now, when we're ready to start training epoch number 2, we treat the examples in folds 1, 2, 3, and 5 as the training set. When they've all been fed to the network and it's adjusted itself to match their labels, we validate using the examples in fold 4. As always, the network doesn't learn during this step; it's just to see how well it's doing. Figure [fig-k-fold] shows this idea.

![Figure [fig-k-fold]: In each epoch, we choose one fold for validation, and train with the others](../Images/600x200.png)

For the next epoch, we train with the examples in folds 1, 2, 4, and 5, and we validate with the examples in fold 3. The process continues until we're using folds 2, 3, 4, and 5 to train with, and fold 1 to validate with. The next epoch starts the cycle over again, training with folds 1 through 4 and validating with fold 5.

We'd call this **5-fold validation**, but of course you can chop up your data into any number of folds. It's conventional to refer to using $k$ folds, so the technique is named [k-fold validation](k_fold_validation.md.html). (There's a short code sketch of this fold rotation near the end of this page.)

Using Validation Sets to Choose Hyperparameters
===

A network's [hyperparameters](hyperparameters.md.html) have a lot of influence on how well it performs for a given type of data. Common hyperparameters include the [learning rate](learning_rate.md.html), [momentum](momentum_coefficient.md.html), and [regularization](regularization.md.html) strength. To find values for these parameters, we typically try a whole bunch of different values using some kind of [hyperparameter search algorithm](hyperparameter_search.md.html).

The basic idea is that we choose a set of hyperparameters, train our network using those values, and then test it using the validation data. Whichever set of hyperparameters gives us the best performance on the validation data tells us which of the trained models we should use when we deploy this network.
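Here's a minimal sketch of this kind of hyperparameter search, assuming scikit-learn is available and reusing the `MLPClassifier` stand-in and the `X_train`/`X_val` split from the earlier sketch. The small grid of learning rates, momentum values, and regularization strengths is purely illustrative; a real search would usually cover more values or use a smarter search algorithm.

```python
# A minimal sketch of a simple grid search over hyperparameters, judged on the
# validation set. It reuses X_train, X_val, y_train, and y_val from the earlier
# sketch; the candidate values below are purely illustrative.
from itertools import product

from sklearn.neural_network import MLPClassifier

candidates = {
    "learning_rate_init": [0.001, 0.01, 0.1],   # learning rate
    "momentum":           [0.5, 0.9],           # momentum
    "alpha":              [1e-5, 1e-3],         # regularization strength
}

best_accuracy, best_settings, best_net = 0.0, None, None
for lr, mom, reg in product(*candidates.values()):
    # Train a fresh network with this combination of hyperparameter values...
    net = MLPClassifier(hidden_layer_sizes=(64,), solver="sgd",
                        learning_rate_init=lr, momentum=mom, alpha=reg,
                        max_iter=200, random_state=0)
    net.fit(X_train, y_train)

    # ...and judge it on the validation set. No learning happens in this step.
    accuracy = net.score(X_val, y_val)
    if accuracy > best_accuracy:
        best_accuracy, best_settings, best_net = accuracy, (lr, mom, reg), net

print("best (learning rate, momentum, regularization):", best_settings)
print(f"its validation accuracy: {best_accuracy:.3f}")
```

Each candidate learns only from the training set and is judged only on the validation set, so the comparison between candidates stays fair.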
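Finally, here's the promised sketch of the k-fold rotation from earlier: the labeled examples are shuffled once and split into $k$ folds, and each epoch trains on $k-1$ of them and validates on the fold that's sitting out, rotating the held-out fold from epoch to epoch. It reuses `X`, `y`, and the `MLPClassifier` stand-in from the first sketch, and the choices of 5 folds and 20 epochs are illustrative.

```python
# A minimal sketch of the per-epoch fold rotation described in the k-fold
# validation section, reusing X, y, and the MLPClassifier stand-in from the
# first sketch. The choice of 5 folds and 20 epochs is illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier

k = 5
rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(len(X)), k)   # k roughly equal chunks of indices

net = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)
classes = np.unique(y)

for epoch in range(20):
    # Rotate which fold sits out: fold 5 first, then 4, 3, 2, 1, and around again.
    val_fold = k - 1 - (epoch % k)
    val_idx = folds[val_fold]
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != val_fold])

    # Train for one epoch on the other k-1 folds...
    net.partial_fit(X[train_idx], y[train_idx], classes=classes)

    # ...then check progress on the held-out fold. No learning happens here.
    accuracy = net.score(X[val_idx], y[val_idx])
    print(f"epoch {epoch + 1:2d}: validated on fold {val_fold + 1}, accuracy {accuracy:.3f}")
```

Over a full cycle of $k$ epochs, every example gets used for training in some epochs and for validation in another, which is exactly the compromise this technique aims for.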