Here, during learning, we compute \(\mathrm{cost}(\theta,(x^{(i)},y^{(i)}))\) before updating \(\theta\) using \((x^{(i)},y^{(i)})\).
Every 1000 iterations, for instance, we plot this cost averaged over the last 1000 examples processed by the algorithm.
The plot will be noisy, but that is okay. If we increase the window from 1000 to, say, 5000, we get a smoother curve, but we must wait five times longer for each data point.
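As a minimal sketch of this monitoring scheme, the loop below runs stochastic gradient descent on linear regression (a hypothetical setup for illustration), records the per-example cost before each update, and emits one averaged point per window of 1000 examples:

```python
import numpy as np

def sgd_with_monitoring(X, y, alpha=0.01, window=1000):
    """SGD for linear regression. Records cost(theta, (x_i, y_i))
    BEFORE each update, then averages over the last `window` examples."""
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs = []      # costs within the current window
    averaged_costs = []    # one averaged point per `window` examples
    for i in np.random.permutation(m):
        pred = X[i] @ theta
        # cost of this example computed before updating theta
        recent_costs.append(0.5 * (pred - y[i]) ** 2)
        theta -= alpha * (pred - y[i]) * X[i]   # SGD update
        if len(recent_costs) == window:
            averaged_costs.append(np.mean(recent_costs))
            recent_costs = []
    return theta, averaged_costs

# Synthetic data: y = 1 + 2*x plus noise (illustrative only)
rng = np.random.default_rng(0)
m = 5000
X = np.c_[np.ones(m), rng.standard_normal(m)]
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(m)
theta, curve = sgd_with_monitoring(X, y)
```

Plotting `curve` against the window index gives the noisy-but-informative learning curve described above; a downward trend indicates convergence.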
To improve convergence, we could slowly decrease the learning rate: \(\alpha = \frac{const1}{iterationNumber + const2}\).
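The decay schedule above can be written directly as a function; the constant values below are hypothetical placeholders that would be tuned in practice:

```python
def decayed_alpha(iteration, const1=1.0, const2=50.0):
    """Learning rate that shrinks as iterations grow:
    alpha = const1 / (iteration + const2).
    const1 and const2 are tuning constants (illustrative values)."""
    return const1 / (iteration + const2)
```

Larger `const2` keeps the rate nearly flat early on, while `const1` sets its overall scale; as the iteration number grows, \(\alpha\) shrinks toward zero, which damps the oscillation of SGD around the minimum.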