
XVII - Large Scale Machine Learning

  • Suppose we have a big dataset, for instance with m = 100,000,000 examples.
  • Computing the gradient descent update becomes very expensive in that case, because of the summation over all m examples in: \(\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \).
  • A good first step is a sanity check: verify that we actually get an improvement from using such a large value of m (otherwise, why not just use m=1000, for instance?).
  • To check this we can plot the learning curves (\(J_{train}\) and \(J_{cv}\) as a function of m) and verify that we are still in a high-variance situation (e.g. \(J_{train}\) much lower than \(J_{cv}\)), which is when more data actually helps, as in the sketch below.
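  • A possible sketch for drawing those curves (the fit and cost callables are stand-ins for whatever training routine and cost function are being used; their names are assumptions of this sketch):

      import numpy as np

      def learning_curves(X_train, y_train, X_cv, y_cv, sizes, fit, cost):
          # For each training-set size, train on the first `size` examples and
          # record J_train (on that subset) and J_cv (on the cross-validation set).
          J_train, J_cv = [], []
          for size in sizes:
              theta = fit(X_train[:size], y_train[:size])
              J_train.append(cost(theta, X_train[:size], y_train[:size]))
              J_cv.append(cost(theta, X_cv, y_cv))
          # Plot both lists against `sizes`: a persistent gap with J_cv well above
          # J_train is the high-variance case where more data should help.
          return J_train, J_cv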
  • The version of gradient descent we have used so far, which sums over all m examples in every update, is called Batch Gradient Descent.
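  • A minimal NumPy sketch of one batch gradient descent step for linear regression (the function name batch_gradient_step and the exact array shapes are illustrative assumptions, not from the course):

      import numpy as np

      def batch_gradient_step(theta, X, y, alpha):
          # One batch gradient descent step for linear regression.
          # X is (m, n+1) with a leading column of ones, y is (m,), theta is (n+1,).
          m = X.shape[0]
          errors = X @ theta - y           # h_theta(x^(i)) - y^(i) for all i at once
          gradient = (X.T @ errors) / m    # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x^(i)
          return theta - alpha * gradient  # simultaneous update of every theta_j; each
                                           # step touches all m examples, which is what
                                           # becomes expensive when m is huge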
  • If we write the cost function for a single example as: \(cost(\theta,(x^{(i)},y^{(i)})) = \frac 12 (h_\theta(x^{(i)}) - y^{(i)})^2 \)
  • Then the overall cost function can be written as: \(J_{train}(\theta) = \frac{1}{m} \sum\limits_{i=1}^m cost(\theta,(x^{(i)},y^{(i)})) \)
  • The partial derivative of this per-example cost is: \( \frac {\partial}{\partial \theta_j} cost(\theta,(x^{(i)},y^{(i)})) = ( h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \)
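  • This follows from the chain rule, since \(h_\theta(x^{(i)}) = \sum_j \theta_j x_j^{(i)}\) is linear in \(\theta_j\): \( \frac{\partial}{\partial \theta_j} \, \frac{1}{2} (h_\theta(x^{(i)}) - y^{(i)})^2 = (h_\theta(x^{(i)}) - y^{(i)}) \cdot \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} = (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \).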
  • Stochastic gradient descent algorithm (see the sketch after this list):
    1. Randomly shuffle the dataset.
    2. Repeat (maybe 1 to 10 times, depending on how large the dataset is):
      • for i=1,…,m : \(\theta_j := \theta_j - \alpha ( h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \) (for every j=0,…,n)
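  • A minimal NumPy sketch of the algorithm above for linear regression (the function name, default learning rate and number of passes are illustrative assumptions):

      import numpy as np

      def stochastic_gradient_descent(X, y, alpha=0.01, passes=1):
          # X is (m, n+1) with a leading column of ones, y is (m,).
          m, n = X.shape
          theta = np.zeros(n)
          for _ in range(passes):                       # repeat 1 to 10 times
              order = np.random.permutation(m)          # step 1: randomly shuffle
              for i in order:                           # step 2: for i = 1, ..., m
                  error = X[i] @ theta - y[i]           # h_theta(x^(i)) - y^(i)
                  theta = theta - alpha * error * X[i]  # update every theta_j at once
          return theta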
  • Mini-batch gradient descent uses b examples in each iteration, where b is called the mini-batch size. A typical value is b=10, and it generally lies between 2 and 100.
  • In each iteration we use the next b examples.
  • Thanks to vectorization over each mini-batch, mini-batch gradient descent can be a bit faster than stochastic gradient descent, as in the sketch below.
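  • A minimal sketch of mini-batch gradient descent with b=10 (names and default values are illustrative assumptions; the gradient of each mini-batch is computed in one vectorized operation):

      import numpy as np

      def minibatch_gradient_descent(X, y, alpha=0.01, b=10, passes=1):
          # X is (m, n+1) with a leading column of ones, y is (m,).
          m, n = X.shape
          theta = np.zeros(n)
          for _ in range(passes):
              order = np.random.permutation(m)
              for start in range(0, m, b):
                  batch = order[start:start + b]        # the next b examples
                  errors = X[batch] @ theta - y[batch]  # vectorized over the mini-batch
                  theta = theta - alpha * (X[batch].T @ errors) / len(batch)
          return theta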
  • To check convergence during learning, we compute \(cost(\theta,(x^{(i)},y^{(i)})) \) just before updating \(\theta\) with \((x^{(i)},y^{(i)})\).
  • Every 1000 iterations, for instance, we plot this cost averaged over the last 1000 examples processed by the algorithm.
  • The plot will be noisy, but this is okay. If we increase the averaging window from 1000 to, say, 5000 we get a smoother curve, but we have to wait 5 times longer to get each data point.
  • To improve convergence, we could slowly decrease the learning rate: \(\alpha = \frac{const1}{iterationNumber + const2}\).
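  • A sketch of how this monitoring and the decreasing learning rate could be wired into the stochastic gradient descent loop (the default values of const1 and const2 and the single pass over the data are arbitrary assumptions of this sketch):

      import numpy as np

      def sgd_with_monitoring(X, y, const1=1.0, const2=100.0, window=1000):
          # Record cost(theta, (x^(i), y^(i))) averaged over each window of examples.
          m, n = X.shape
          theta = np.zeros(n)
          recent, averaged = [], []
          iteration = 0
          for i in np.random.permutation(m):
              iteration += 1
              alpha = const1 / (iteration + const2)  # alpha = const1 / (iterationNumber + const2)
              error = X[i] @ theta - y[i]
              recent.append(0.5 * error ** 2)        # cost computed before updating theta
              theta = theta - alpha * error * X[i]
              if iteration % window == 0:            # every 1000 examples, log the average
                  averaged.append(np.mean(recent))
                  recent = []
          return theta, averaged                     # plot `averaged` to judge convergence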
  • Online learning, where examples arrive one at a time from a continuous stream (e.g. from users of a website), would do the following (see the sketch after this block):
    • Repeat forever : {
      • Get \((x,y)\) corresponding to the current user.
      • Update \(\theta\) using only this example \((x,y)\): \(\theta_j := \theta_j - \alpha (h_\theta(x) - y) \cdot x_j \) (for every j=0,…,n).
    • }
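  • A minimal sketch of this loop, using logistic regression as the per-example model (the example_stream generator and the feature dimension n are stand-ins for whatever produces one \((x,y)\) pair per user):

      import numpy as np

      def online_learning(example_stream, n, alpha=0.1):
          # example_stream yields (x, y) with x an (n,) feature vector and y in {0, 1}.
          theta = np.zeros(n)
          for x, y in example_stream:                 # "repeat forever" over incoming users
              h = 1.0 / (1.0 + np.exp(-(x @ theta)))  # logistic hypothesis h_theta(x)
              theta = theta - alpha * (h - y) * x     # update using only this example
              # the example is then discarded; theta keeps adapting to changing users
          return theta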
  • The idea (map-reduce) is to split the gradient descent summation across multiple computers: each machine computes the partial sum over its own fraction of the training examples, and the partial sums are then combined on a central server to perform the update.
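  • A sketch of this idea using Python's multiprocessing as a stand-in for separate machines (the pool-based setup and the function names are assumptions of this sketch): each worker sums its share of the gradient, and the partial sums are combined for one batch update.

      import numpy as np
      from multiprocessing import Pool

      def partial_gradient(args):
          # Map step: one machine sums (h_theta(x) - y) * x over its slice of the data.
          theta, X_part, y_part = args
          return X_part.T @ (X_part @ theta - y_part)

      def distributed_gradient_step(theta, X, y, alpha, n_machines=4):
          # Reduce step: combine the partial sums and apply one batch gradient update.
          X_parts = np.array_split(X, n_machines)
          y_parts = np.array_split(y, n_machines)
          # Note: on some platforms Pool requires an `if __name__ == "__main__":` guard.
          with Pool(n_machines) as pool:
              partials = pool.map(partial_gradient,
                                  [(theta, Xp, yp) for Xp, yp in zip(X_parts, y_parts)])
          gradient = sum(partials) / X.shape[0]  # (1/m) * full summation over all m examples
          return theta - alpha * gradient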