XVII - Large Scale Machine Learning
17.1 - Learning With Large Datasets
- We have a big dataset, for instance with m = 100,000,000.
- Computing one step of gradient descent becomes very expensive in that case because of the summation over all m examples in: \(\theta_j := \theta_j - \alpha \frac 1m \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \).
- A good first step is a sanity check: verify that we actually get improvements when using a large value of m (otherwise, why not just use m = 1000, for instance?).
- To check this we can plot the learning curves (Jtrain and Jcv as a function of m) and confirm that we are still in a high-variance situation (e.g. Jtrain much lower than Jcv), since high variance is the regime in which more data helps.
17.2 - Stochastic Gradient Descent
- The version of gradient descent we have used so far is called Batch Gradient Descent.
- If we write the cost function for a single example as: \(cost(\theta,(x^{(i)},y^{(i)})) = \frac 12 (h_\theta(x^{(i)}) - y^{(i)})^2 \)
- Then the overall cost function can be written: \(J_{train}(\theta) = \frac 1m \sum\limits_{i=1}^m cost(\theta,(x^{(i)},y^{(i)})) \)
- The partial derivative for this per example cost is: \( \frac {\partial}{\partial \theta_j} cost(\theta,(x^{(i)},y^{(i)})) = ( h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \)
- Stochastic gradient descent algorithm:
- Randomly shuffle the dataset.
- Repeat (between 1 and 10 times maybe, depending on how large the dataset is):
- for i=1,…,m : \(\theta_j := \theta_j - \alpha ( h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \) (for every j=0,…,n)
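The algorithm above can be sketched as follows for linear regression; the function name and default parameters are illustrative, not from the original notes.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=3):
    """Stochastic gradient descent: shuffle the data, then update theta
    after every single example rather than after a full pass over all m."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):          # typically 1 to 10 passes over the data
        order = rng.permutation(m)   # randomly shuffle the dataset
        for i in order:
            h = X[i] @ theta                    # hypothesis h_theta(x^(i))
            theta -= alpha * (h - y[i]) * X[i]  # updates every theta_j at once
    return theta
```

Note that the inner update is the vectorized form of the per-j rule in the notes: multiplying the scalar error by the whole vector \(x^{(i)}\) updates all \(\theta_j\) simultaneously.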
17.3 - Mini-Batch Gradient Descent
- Mini-batch gradient descent uses b examples in each iteration, where b is called the mini-batch size. A typical value is b=10, and it usually lies between 2 and 100.
- Each iteration uses the next b examples.
- Thanks to vectorization over the b examples, mini-batch can be a bit faster than stochastic gradient descent.
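A minimal sketch of the mini-batch variant, assuming the same linear-regression setting as above (the function name and defaults are illustrative):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, epochs=30):
    """Mini-batch gradient descent: each update averages the gradient
    over the next b examples, computed with one vectorized expression."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]   # the next b examples
            # Vectorized gradient over the mini-batch (averaged over b)
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= alpha * grad
    return theta
```

The vectorized matrix product over b rows is what lets a good linear-algebra library make this faster than b separate single-example updates.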
17.4 - Stochastic Gradient Descent Convergence
- Here, during learning, we compute \(cost(\theta,(x^{(i)},y^{(i)})) \) before updating \(\theta\) using \((x^{(i)},y^{(i)})\).
- Every 1000 iterations for instance, we plot this cost averaged over the last 1000 examples processed by the algorithm.
- The plot will be noisy, but this is okay. If we increase this window from 1000 to say 5000 we get a smoother curve, but we have to wait 5 times longer between data points.
- To improve convergence, we could slowly decrease the learning rate: \(\alpha = \frac{const1}{iterationNumber + const2}\).
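The monitoring procedure and the decaying learning rate can be sketched together; the constants and window size below are the example values from the notes, and the function itself is an illustrative assumption:

```python
import numpy as np

def sgd_with_monitoring(X, y, const1=1.0, const2=50.0, window=1000):
    """SGD that records cost(theta, (x^(i), y^(i))) *before* each update,
    averaged over the last `window` examples, with a decaying alpha."""
    m, n = X.shape
    theta = np.zeros(n)
    costs, averaged = [], []
    for t, i in enumerate(np.random.default_rng(0).permutation(m)):
        err = X[i] @ theta - y[i]
        costs.append(0.5 * err ** 2)       # cost computed before the update
        alpha = const1 / (t + const2)      # slowly decreasing learning rate
        theta -= alpha * err * X[i]
        if (t + 1) % window == 0:          # one plot point every `window` steps
            averaged.append(np.mean(costs[-window:]))
    return theta, averaged
```

Plotting `averaged` against the iteration number gives the (noisy but informative) convergence curve described above.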
17.5 - Online learning
- Online learning would do:
- Repeat forever : {
- Get (x,y) corresponding to the current user.
- Update \(\theta\) using only this example: \(\theta_j := \theta_j - \alpha ( h_\theta(x) - y) \cdot x_j \) (for every j=0,…,n), then discard the example.
- }
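The loop above can be sketched as follows; the stream of user events is simulated and finite here, whereas a real system would run the loop indefinitely (the function and its signature are illustrative assumptions):

```python
import numpy as np

def online_learning(stream, n, alpha=0.1):
    """Online learning: process each (x, y) from the stream exactly once,
    update theta, and discard the example (nothing is stored)."""
    theta = np.zeros(n)
    for x, y in stream:                 # "repeat forever" in production
        h = x @ theta                   # prediction for this user
        theta -= alpha * (h - y) * x    # update using only this (x, y)
    return theta
```

Because each example is used once and thrown away, the model also adapts automatically if user behaviour drifts over time.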
17.6 - Map Reduce and Data Parallelism
- The idea is to split the gradient descent summation across multiple machines: each machine computes the partial sum over its own fraction of the training examples, and a central server combines (reduces) the partial results before applying the update.
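A single-process sketch of the idea, where the "machines" are simulated by splitting the data into chunks (the function names and the number of machines are illustrative assumptions):

```python
import numpy as np

def partial_gradient(X_part, y_part, theta):
    """Map step: each machine sums the gradient over its own examples."""
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_gradient_step(X, y, theta, alpha, n_machines=4):
    """One batch gradient descent step with the summation split
    across n_machines (simulated here in a single process)."""
    m = X.shape[0]
    parts = np.array_split(np.arange(m), n_machines)
    # Reduce step: a central server adds up the partial sums.
    total = sum(partial_gradient(X[idx], y[idx], theta) for idx in parts)
    return theta - alpha * total / m
```

The result is identical to the single-machine batch step; the summation is simply computed in parallel, so the speedup is at most n_machines (less in practice, because of network and combining overhead).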