Here, during learning, we compute \(\mathrm{cost}(\theta,(x^{(i)},y^{(i)}))\) before updating \(\theta\) using \((x^{(i)},y^{(i)})\).
Every 1000 iterations, for instance, we plot this cost averaged over the last 1000 examples processed by the algorithm.
The plot will be noisy, but that is okay. If we increase the window from 1000 to, say, 5000, we get a smoother curve, but we must wait five times longer for each data point.
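As a minimal sketch of this monitoring scheme, the loop below runs stochastic gradient descent on linear regression (a hypothetical setup for illustration), records the per-example cost before each update, and emits one averaged point per window of 1000 examples:

```python
import numpy as np

def sgd_with_monitoring(X, y, alpha=0.01, window=1000):
    """SGD for linear regression. Records cost(theta, (x_i, y_i))
    BEFORE each update, then averages over the last `window` examples."""
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs = []      # costs within the current window
    averaged_costs = []    # one averaged point per `window` examples
    for i in np.random.permutation(m):
        pred = X[i] @ theta
        # cost of this example computed before updating theta
        recent_costs.append(0.5 * (pred - y[i]) ** 2)
        theta -= alpha * (pred - y[i]) * X[i]   # SGD update
        if len(recent_costs) == window:
            averaged_costs.append(np.mean(recent_costs))
            recent_costs = []
    return theta, averaged_costs

# Synthetic data: y = 1 + 2*x plus noise (illustrative only)
rng = np.random.default_rng(0)
m = 5000
X = np.c_[np.ones(m), rng.standard_normal(m)]
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(m)
theta, curve = sgd_with_monitoring(X, y)
```

Plotting `curve` against the window index gives the noisy-but-informative learning curve described above; a downward trend indicates convergence.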
To improve convergence, we could slowly decrease the learning rate: \(\alpha = \frac{const1}{iterationNumber + const2}\).
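The decay schedule above can be written directly as a function; the constant values below are hypothetical placeholders that would be tuned in practice:

```python
def decayed_alpha(iteration, const1=1.0, const2=50.0):
    """Learning rate that shrinks as iterations grow:
    alpha = const1 / (iteration + const2).
    const1 and const2 are tuning constants (illustrative values)."""
    return const1 / (iteration + const2)
```

Larger `const2` keeps the rate nearly flat early on, while `const1` sets its overall scale; as the iteration number grows, \(\alpha\) shrinks toward zero, which damps the oscillation of SGD around the minimum.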