====== XVII - Large Scale Machine Learning ======

===== 17.1 - Learning With Large Datasets =====

  * We have a big dataset, for instance with m = 100,000,000.
  * Computing gradient descent can become very expensive in that case because of the summation in: \(\theta_j := \theta_j - \alpha \frac 1m \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \).
  * A good first step is a sanity check to verify that we actually get improvements from using a large value of m (otherwise, why not just use m=1000 for instance?).
  * To check this we can draw the learning curves (Jtrain and Jcv as a function of m) and verify that we are still in a high variance situation (e.g. Jtrain much lower than Jcv).

===== 17.2 - Stochastic Gradient Descent =====

  * The version of gradient descent we have used so far is called **Batch Gradient Descent**.
  * If we write the cost for a single example as: \(cost(\theta,(x^{(i)},y^{(i)})) = \frac 12 (h_\theta(x^{(i)}) - y^{(i)})^2 \)
  * Then the overall cost function can be written: \(J_{train}(\theta) = \frac 1m \sum\limits_{i=1}^m cost(\theta,(x^{(i)},y^{(i)})) \)
  * The partial derivative of this per-example cost is: \( \frac {\partial}{\partial \theta_j} cost(\theta,(x^{(i)},y^{(i)})) = ( h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \)
  * Stochastic gradient descent algorithm (see the code sketches at the end of these notes):
    - Randomly shuffle the dataset.
    - Repeat (between 1 and 10 times maybe, depending on how large the dataset is):
      * for i=1,...,m : \(\theta_j := \theta_j - \alpha ( h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \) (for every j=0,...,n)

===== 17.3 - Mini-Batch Gradient Descent =====

  * Mini-batch gradient descent uses b examples in each iteration. b is called the **mini-batch size**. A typical value is b=10, and it could be anywhere between 2 and 100 for instance.
  * For each iteration we use the next b examples.
  * Thanks to **vectorization**, mini-batch can be a bit faster than stochastic gradient descent.

===== 17.4 - Stochastic Gradient Descent Convergence =====

  * Here, during learning, we compute \(cost(\theta,(x^{(i)},y^{(i)})) \) before updating \(\theta\) using \((x^{(i)},y^{(i)})\).
  * Every 1000 iterations, for instance, we plot this cost averaged over the last 1000 examples processed by the algorithm.
  * The plot will be noisy, but this is okay. If we increase this window from 1000 to, say, 5000 we get a smoother curve, but we have to wait 5 times longer to get each data point.
  * To improve convergence, we can slowly decrease the learning rate: \(\alpha = \frac{const1}{iterationNumber + const2}\).

===== 17.5 - Online learning =====

  * Online learning would do:
    * Repeat forever : {
      * Get (x,y) corresponding to a user.
      * Update \(\theta\) using only this example (x,y) (i.e. update every \(\theta_j\) for j=0,...,n).
    * }

===== 17.6 - Map Reduce and Data Parallelism =====

  * The idea is to split the gradient descent summation across multiple computers (each machine uses only a fraction of the available training examples) and then combine the partial sums on a central server.
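
===== Code sketches =====

  * Below is a minimal sketch of the stochastic gradient descent algorithm from 17.2, written in Python/NumPy rather than the course's own language; the function name, the learning rate and number of passes, and the synthetic data are illustrative assumptions, not values from the lecture.

<code python>
import numpy as np

# Stochastic gradient descent for linear regression (section 17.2).
# h_theta(x) = theta . x, where the first feature x_0 = 1 is the intercept term.

def stochastic_gradient_descent(X, y, alpha=0.05, n_passes=10):
    m, n = X.shape
    theta = np.zeros(n)
    order = np.random.permutation(m)        # step 1: randomly shuffle the dataset
    for _ in range(n_passes):               # step 2: repeat (1 to 10 passes)
        for i in order:                     # for i = 1, ..., m
            error = X[i].dot(theta) - y[i]  # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]   # updates every theta_j at once
    return theta

# Tiny synthetic example (illustrative): y ~ 3 + 2*x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
X = np.column_stack([np.ones_like(x), x])   # prepend the intercept feature x_0 = 1
y = 3 + 2 * x + rng.normal(0, 0.5, size=1000)
print(stochastic_gradient_descent(X, y))    # roughly recovers [3, 2], up to SGD noise
</code>

  * The single vector operation in the inner loop performs the update for every j=0,...,n at once; the same per-example update is what online learning (17.5) applies to a continuous stream of (x,y) pairs.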
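
  * A matching sketch of mini-batch gradient descent (17.3) with b=10: the gradient over each batch of b examples is computed in a single vectorized NumPy expression, which is where the speed-up over the one-example-at-a-time version comes from. Averaging by 1/b inside each batch is an assumption consistent with the per-example cost defined in 17.2.

<code python>
import numpy as np

# Mini-batch gradient descent (section 17.3): update theta using the next b
# examples at a time, with each batch gradient computed in one vectorized step.

def minibatch_gradient_descent(X, y, alpha=0.05, b=10, n_passes=10):
    m, n = X.shape
    theta = np.zeros(n)
    order = np.random.permutation(m)                   # shuffle once
    for _ in range(n_passes):
        for start in range(0, m, b):                   # take the next b examples
            idx = order[start:start + b]
            errors = X[idx].dot(theta) - y[idx]        # shape (b,)
            grad = X[idx].T.dot(errors) / len(idx)     # average gradient over the batch
            theta -= alpha * grad
    return theta
</code>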
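
  * For 17.4, a sketch of how the convergence check and the decreasing learning rate can be wired into the SGD loop: the cost of each example is recorded just before \(\theta\) is updated with it, and one averaged value is kept per window of 1000 examples for plotting. The window size of 1000 follows the notes; const1 and const2 are arbitrary placeholder values.

<code python>
import numpy as np

# Monitoring SGD convergence (section 17.4).

def sgd_with_monitoring(X, y, const1=5.0, const2=100.0, window=1000):
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs, averaged_costs = [], []
    iteration = 0
    for i in np.random.permutation(m):                 # one pass over the shuffled data
        iteration += 1
        alpha = const1 / (iteration + const2)          # slowly decreasing learning rate
        error = X[i].dot(theta) - y[i]
        recent_costs.append(0.5 * error ** 2)          # cost(theta, (x, y)) BEFORE the update
        theta -= alpha * error * X[i]
        if iteration % window == 0:                    # one plot point every 1000 examples
            averaged_costs.append(np.mean(recent_costs))
            recent_costs = []
    return theta, averaged_costs                       # plot averaged_costs to check convergence
</code>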
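
  * Finally, a single-machine simulation of the map-reduce idea from 17.6: each simulated worker ("map") computes the partial gradient sum over its own slice of the training set, and the partial sums are then added and averaged centrally ("reduce"). Running the workers as plain Python calls instead of separate machines is of course a simplification.

<code python>
import numpy as np

# Map-reduce flavour of one batch gradient descent step (section 17.6),
# simulated on a single machine by slicing the data into n_workers chunks.

def partial_gradient(theta, X_part, y_part):
    # "map": sum of (h_theta(x^(i)) - y^(i)) * x^(i) over this worker's slice
    errors = X_part.dot(theta) - y_part
    return X_part.T.dot(errors)

def mapreduce_batch_step(theta, X, y, alpha, n_workers=4):
    m = X.shape[0]
    chunks = np.array_split(np.arange(m), n_workers)                  # one slice per machine
    partials = [partial_gradient(theta, X[c], y[c]) for c in chunks]  # done in parallel in practice
    grad = sum(partials) / m                                          # "reduce": combine partial sums
    return theta - alpha * grad                                       # one batch gradient descent update
</code>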