
  • For a spam classifier, we could manually select 100 words to build a feature vector; in practice, we would rather choose the 10,000 to 50,000 most frequently occurring words in the training set.
  • Choosing features by gut feeling is usually not a good idea.
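  • A minimal sketch (not from the course) of how such a binary feature vector could be built; the toy vocabulary, email text, and helper name below are made up for illustration:
<code python>
# Build a binary bag-of-words feature vector for one email.
# A real vocabulary would contain the 10,000-50,000 most frequent
# words of the training set; here it is a toy list.
vocabulary = ["buy", "deal", "discount", "andrew", "now"]

def email_to_features(email_text, vocab):
    """Return x with x[j] = 1 if vocab[j] appears in the email, else 0."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocab]

x = email_to_features("Buy now and get a great DISCOUNT", vocabulary)
print(x)  # [1, 0, 1, 0, 1]
</code>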
  • Recommended approach:
    1. Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
    2. Plot learning curves to decide whether more data, more features, etc. are likely to help (see the sketch after this list).
    3. Error analysis: manually examine the examples (in the cross-validation set) that your algorithm made errors on, to see if you can spot any systematic trend in the type of examples it misclassifies.
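  • One possible way to plot learning curves, sketched with scikit-learn on synthetic data (the estimator, dataset, and training sizes are placeholders, not the course's setup):
<code python>
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Train/cross-validation scores for increasing training-set sizes.
sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, 1 - train_scores.mean(axis=1), label="train error")
plt.plot(sizes, 1 - cv_scores.mean(axis=1), label="cross-validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
</code>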
  • ⇒ look at stemming software such as the Porter stemmer to treat words that start with the same characters (i.e. share the same root) as a single word/feature, as in the sketch below.
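  • For example, NLTK's PorterStemmer (one readily available implementation of the Porter stemmer) maps such variants to a common stem:
<code python>
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Variants of the same word are mapped to a common stem,
# so they end up as a single feature instead of several.
print([stemmer.stem(w)
       for w in ["discount", "discounts", "discounted", "discounting"]])
# ['discount', 'discount', 'discount', 'discount']
</code>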
  • Skewed classes: when one of the classes occurs much less often than the other (say under 1% of the examples), i.e. we have far more examples of one class than of the other.
  • Precision/Recall
    • Precision: of the examples predicted positive, the fraction that are actually positive: \(P = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}\).
    • Recall: of the examples that are actually positive, the fraction that are predicted positive: \(R = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\).
  • These two numbers provide a more useful evaluation metric than classification accuracy in the case of skewed classes (a sketch computing them follows below).
  • The convention here is to use y=1 for the rare class that we want to detect.
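  • A minimal sketch of computing precision and recall for the y=1 class; the label arrays and helper name are made up for illustration:
<code python>
def precision_recall(y_true, y_pred):
    """Precision and recall for the y=1 (rare) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Skewed example: only 2 positives out of 10.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))  # (0.5, 0.5)
</code>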
  • To predict y=1 only when we are more confident, we can raise the threshold applied to the hypothesis output, for instance:
    • Predict y=1 if \(h_\theta(x) \ge 0.7 \)
    • Predict y=0 if \(h_\theta(x) \lt 0.7 \)
  • Note that in that case we get a higher precision, but a lower recall.
  • If we do the opposite and lower the threshold, for instance:
    • Predict y=1 if \(h_\theta(x) \ge 0.3 \)
    • Predict y=0 if \(h_\theta(x) \lt 0.3 \)
  • In that case, we get a higher recall, but a lower precision.
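  • A sketch of this trade-off, reusing the precision_recall helper from the earlier sketch on hypothetical hypothesis outputs:
<code python>
# Hypothetical hypothesis outputs h_theta(x) on a cross-validation set.
probs  = [0.95, 0.80, 0.65, 0.40, 0.35, 0.20, 0.10, 0.05]
y_true = [1,    1,    0,    1,    0,    0,    0,    0]

# Higher thresholds give higher precision but lower recall.
for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if p >= threshold else 0 for p in probs]
    p, r = precision_recall(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
</code>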
  • How do we compare algorithms with different precision and recall values?
    • Using the average of precision and recall is not a good solution (a classifier that always predicts y=1 gets a recall of 1.0 and can still score well on the average).
    • Instead, we can use the \(F_1\) score: \(F_1 = 2\frac{PR}{P+R}\), which works well because it is low whenever either P or R is low.
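  • A quick sketch of why the \(F_1\) score is a better summary than the average; the (P, R) pairs are made up, and the last one mimics a classifier that always predicts y=1 on a very skewed set:
<code python>
def f1(p, r):
    """F1 score from precision p and recall r."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# The last pair has the best average but a near-zero F1 score.
for p, r in [(0.5, 0.4), (0.7, 0.1), (0.01, 1.0)]:
    print(f"P={p}, R={r}: average={(p + r) / 2:.3f}, F1={f1(p, r):.3f}")
</code>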
  • “It's not who has the best algorithm that wins. It's who has the most data.”
  • Large data rationale: assume the features \(x \in \mathbb{R}^{n+1}\) contain sufficient information to predict y accurately.
  • Useful test: given the input x, can a human expert confidently predict y?
  • If so, a hypothesis with many parameters (low bias) trained on a very large dataset (low variance) is likely to perform well.