
  • To try and improve a learning algorithm, we can:
    1. Get more training examples (but sometimes it doesn't really help).
    2. Use a smaller set of features.
    3. Get additional features.
    4. Add polynomial features.
    5. Increase or decrease \(\lambda\).
  • There is a simple technique to select what to do, called a machine learning diagnostic: it can take time to implement, but it is often a very good use of our time.
  • To evaluate a hypothesis, we split our global dataset into a training set (70%) and a test set (30%).
  • We use \(m_\text{test}\) to denote the number of examples in the test set. (and we keep using “m” for the number of examples in the training set).
  • It's usually better to take a random 70% of the data as the training set and use the rest as the test set, especially if there is some kind of order in the dataset.
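
A minimal sketch of such a random split, assuming the data is held in NumPy arrays X and y (names chosen here for illustration):

    import numpy as np

    def train_test_split(X, y, train_ratio=0.7, seed=0):
        # Shuffle the example indices, then take the first 70% for training.
        rng = np.random.default_rng(seed)
        perm = rng.permutation(len(y))
        m_train = int(train_ratio * len(y))
        train_idx, test_idx = perm[:m_train], perm[m_train:]
        return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
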
  • So the procedure is as follows:
    1. Learn parameters \(\theta\) as usual (minimizing training error \(J(\theta)\)).
    2. Compute the test set error: \(J_\text{test}(\theta) = \frac{1}{2m_\text{test}} \sum\limits_{i=1}^{m_\text{test}} (h_\theta(x_\text{test}^{(i)}) - y_\text{test}^{(i)})^2 \)
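
As a sketch for linear regression, the test error of step 2 can be computed as below, assuming theta is the learned parameter vector and X_test already includes the intercept column of ones (both names are illustrative):

    import numpy as np

    def squared_error_cost(theta, X, y):
        # Unregularized cost (1 / (2 m)) * sum((h_theta(x) - y)^2), with h_theta(x) = theta^T x.
        m = len(y)
        errors = X @ theta - y
        return (errors @ errors) / (2 * m)

    # J_test(theta) is just this cost evaluated on the test set:
    # j_test = squared_error_cost(theta, X_test, y_test)
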
  • The same logic can be applied for logistic regression (and neural networks).
  • For logistic regression, we can also use the misclassification error (e.g. the 0/1 misclassification error) as the testing metric:

\[err(h_\theta(x),y) = \begin{cases} 1 & \text{if }(h_\theta(x) \ge 0.5 \text{ and } y=0)\text{ or }(h_\theta(x) \lt 0.5 \text{ and } y=1) \\ 0 & \text{otherwise (i.e. no error)} \end{cases}\]

  • Then we can compute the misclassification error as: \(error = \frac{1}{m_\text{test}} \sum\limits_{i=1}^{m_\text{test}} err(h_\theta(x_\text{test}^{(i)}), y_\text{test}^{(i)}) \)

⇒ This is the proportion of errors we have in the test set.
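
A sketch of this 0/1 misclassification error, assuming h_test holds the values \(h_\theta(x_\text{test}^{(i)})\) and y_test the 0/1 labels (both array names are illustrative):

    import numpy as np

    def misclassification_error(h_test, y_test):
        # Threshold the hypothesis output at 0.5, then count the fraction of wrong predictions.
        predictions = (h_test >= 0.5).astype(int)
        return np.mean(predictions != y_test)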

  • To perform model selection, one thing we can do is to fit the parameters \(\theta\) for each candidate model, then compute the cost function \(J_\text{test}(\theta)\) for each of them. For example, if we choose the degree of the polynomial hypothesis this way, we select the hypothesis with the lowest test error.

⇒ The problem here is that we are using the test set to select the value of the additional parameter d (the polynomial degree), so \(J_\text{test}(\theta)\) is going to be an optimistic estimate of how well the hypothesis will perform on a completely new dataset.

  • The solution is to cut the dataset into 3 parts:
    1. The training set (typically 60%)
    2. The cross validation set (CV) (typically 20%)
    3. The test set (typically 20%)
  • We will use \(m_{cv}\) to indicate the total number of examples in the cross validation set.
  • Then we define the cost function \(J_{cv}(\theta)\) on the cross validation set as usual.
  • So the new process is to train the various hypotheses on the training set, compute their cost on the cross validation set, and select the hypothesis with the lowest cost. Then, to get an estimate of the generalization error, we compute the cost function on the test set using the selected hypothesis.
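
A sketch of this selection loop for the polynomial degree d, reusing squared_error_cost from above and assuming hypothetical helpers add_poly_features (adds the polynomial terms plus the intercept column) and train_linear_regression (minimizes the training cost):

    import numpy as np

    def select_degree(X_train, y_train, X_cv, y_cv, max_degree=10):
        # Fit one hypothesis per degree on the training set, keep the one with the lowest CV cost.
        best = (None, None, np.inf)  # (degree, theta, J_cv)
        for d in range(1, max_degree + 1):
            theta = train_linear_regression(add_poly_features(X_train, d), y_train)
            j_cv = squared_error_cost(theta, add_poly_features(X_cv, d), y_cv)
            if j_cv < best[2]:
                best = (d, theta, j_cv)
        return best

The generalization error of the selected degree d is then estimated once on the test set, e.g. squared_error_cost(theta, add_poly_features(X_test, d), y_test).
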
  • high bias ⇔ underfitting problem
  • high variance ⇔ overfitting problem
  • If we plot the measured error as a function of the degree of the polynomial used for the hypothesis, we find that:
    1. On the training set, the error goes down as we increase the degree of the polynomial.
    2. On the cross validation set, the error is high, then low, then high again.
  • When the error on the cross validation set is high on the “left” part (low degree), we have a high bias problem. On the “right” part (high degree), we have a high variance problem.
  • In other words:
    1. for high bias: \(J_{train}(\theta)\) and \(J_{cv}(\theta)\) are high.
    2. for high variance: \(J_{train}(\theta)\) is low and \(J_{cv}(\theta) \gg J_{train}(\theta)\).
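
As a rough illustration of this rule (“high” is relative to the error we would consider acceptable, and the factor 2 below is an arbitrary illustrative threshold, not something defined in the course):

    def diagnose(j_train, j_cv, acceptable_error):
        # Crude bias/variance check from the training and cross validation errors.
        if j_train > acceptable_error:
            return "high bias (underfitting): J_train and J_cv are both high"
        if j_cv > 2 * j_train:  # illustrative threshold for "J_cv >> J_train"
            return "high variance (overfitting): J_cv >> J_train"
        return "no obvious bias/variance problem"
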
  • When using regularization, we just define \(J_{train}(\theta)\) as the unregularized cost function:

\[J_{train}(\theta) = \frac{1}{2m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2\]

  • Similarly we define \(J_{cv}(\theta)\) and \(J_{test}(\theta)\) as the unregularized cost functions.
  • To select the value of the regularization parameter \(\lambda\), we can train hypotheses with different values of \(\lambda\) (0, 0.01, 0.02, 0.04, 0.08, …, 10) on the training set. Then we evaluate \(J_{cv}(\theta)\) for each of those hypotheses and select the one with the smallest error. Finally, to get an estimate of the actual generalization error, we compute \(J_{test}(\theta)\).
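
A sketch of that \(\lambda\) selection loop, assuming a hypothetical train_regularized(X, y, lam) helper that minimizes the regularized cost, and reusing the unregularized squared_error_cost for evaluation:

    import numpy as np

    def select_lambda(X_train, y_train, X_cv, y_cv):
        # Each candidate lambda doubles the previous one, as suggested above: 0, 0.01, 0.02, ..., 10.24.
        lambdas = [0] + [0.01 * 2 ** k for k in range(11)]
        best = (None, None, np.inf)  # (lambda, theta, J_cv)
        for lam in lambdas:
            theta = train_regularized(X_train, y_train, lam)   # regularized training
            j_cv = squared_error_cost(theta, X_cv, y_cv)       # unregularized evaluation
            if j_cv < best[2]:
                best = (lam, theta, j_cv)
        return best
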
  • To draw the learning curves, we plot the values of \(J_{train}(\theta)\) and \(J_{cv}(\theta)\) (using the unregularized versions) as a function of m (i.e. the training set size).
  • To do this, we artificially reduce the training set size, training on subsets of increasing size.
  • If the training set size is small, then the value of \(J_{train}(\theta)\) is going to be small too; this value then increases as the training set size increases.
  • For the cross validation error, \(J_{cv}(\theta)\), its value will tend to decrease when the size of the training set increases.
  • When we have a high bias issue, we will find that the curves for \(J_{train}(\theta)\) and \(J_{cv}(\theta)\) are very close for large training set sizes, and the error value is very high.
  • We can notice that when we have high bias, getting more training data will not help, as the value of \(J_{cv}(\theta)\) will still plateau at a high error value.
  • When we have a high variance issue, we will find that the curves for \(J_{train}(\theta)\) and \(J_{cv}(\theta)\) are far apart (i.e. there is a large gap between them).
  • In a high variance setting, getting more training data may actually be a good idea.
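
A sketch of how these learning curves could be computed, reusing the hypothetical train_linear_regression and squared_error_cost helpers from the earlier sketches:

    def learning_curves(X_train, y_train, X_cv, y_cv):
        # For each artificial training set size i, train on the first i examples and record both errors.
        j_train_values, j_cv_values = [], []
        for i in range(1, len(y_train) + 1):
            theta = train_linear_regression(X_train[:i], y_train[:i])
            j_train_values.append(squared_error_cost(theta, X_train[:i], y_train[:i]))
            j_cv_values.append(squared_error_cost(theta, X_cv, y_cv))  # always the full CV set
        return j_train_values, j_cv_values
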
  • Get more training examples -> Helps to fix high variance issue.
  • Try smaller sets of features -> Helps to fix high variance issue.
  • Try getting additional features -> Helps fix high bias issue.
  • Try adding polynomial features -> Helps fix high bias issue.
  • Try decreasing \(\lambda\) -> Helps fix high bias issue.
  • Try increasing \(\lambda\) -> Helps fix high variance issue.
  • To select the number of hidden layers in a neural network, one option is to train different NNs and then select the one that performs best on the cross validation set.