XV - Anomaly Detection
15.1 - Problem motivation
- Given a dataset of non-anomalous examples, we want to decide whether a new example is an anomaly or not.
- To achieve this, we build a model p(x). We then flag an example as an anomaly if \(p(x_{test}) \le \epsilon\); otherwise the example is considered OK.
- Anomaly detection is often used for fraud detection.
15.2 - Gaussian distribution
- If X is Gaussian-distributed with mean \(\mu\) and variance \(\sigma^2\), then \(X \sim N(\mu,\sigma^2)\).
- \(\sigma\) is also called the standard deviation.
- The Gaussian density is \(p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(- \frac{(x-\mu)^2}{2\sigma^2}\right) \).
Parameter estimation
- How do we estimate the parameters \(\mu\) and \(\sigma^2\) given a dataset that we suspect follows a Gaussian distribution?
- \(\mu = \frac{1}{m} \sum\limits_{i=1}^m x^{(i)} \)
- Then \(\sigma^2 = \frac{1}{m} \sum\limits_{i=1}^m (x^{(i)} - \mu)^2 \)
- The values computed above are the maximum likelihood estimates of the parameters.
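The estimation formulas above can be sketched as follows; a minimal example in Python (rather than the course's Octave), using a tiny made-up dataset:

```python
import math

def estimate_params(xs):
    """Maximum-likelihood estimates of mu and sigma^2 for one feature."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum((x - mu) ** 2 for x in xs) / m
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2) for a univariate Gaussian."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Hypothetical dataset: mu = 4, sigma^2 = ((-2)^2 + 0^2 + 2^2) / 3 = 8/3
mu, sigma2 = estimate_params([2.0, 4.0, 6.0])
```

Note that this uses the 1/m normalization from the formulas above (the ML estimate), not the 1/(m-1) unbiased variance; for large m the difference is negligible.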
15.3 - Algorithm
Density estimation
- We start with a training set \(\{x^{(1)},\dots,x^{(m)}\}\)
- For each example \(x \in \mathbb{R}^n\), we then compute the density: \(p(x)=p(x_1;\mu_1, \sigma_1^2) \cdot p(x_2;\mu_2, \sigma_2^2) \cdots p(x_n;\mu_n, \sigma_n^2) \).
- We assume here that each feature is following a gaussian distribution with specific mean and variance.
- We can also write: \( p(x) = \prod\limits_{j=1}^n p(x_j;\mu_j, \sigma_j^2) \)
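The per-feature product above can be sketched in Python (rather than the course's Octave); the mean vector and variances here are hypothetical:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2) for a univariate Gaussian."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def p(x, mus, sigma2s):
    """p(x) = product over features j of p(x_j; mu_j, sigma_j^2)."""
    result = 1.0
    for xj, muj, s2j in zip(x, mus, sigma2s):
        result *= gaussian_pdf(xj, muj, s2j)
    return result
```

At the mean of a two-feature model with unit variances, each factor is \(1/\sqrt{2\pi}\), so the product is \(1/(2\pi)\).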
15.4 - Developing and Evaluating an Anomaly Detection System
- We now assume we have some labeled data (y=0 if normal, y=1 if anomalous).
- Typically we will have between 2 and 50 anomalous examples.
- From the training set we compute \(\mu_1,\sigma_1^2,\mu_2,\sigma_2^2,\dots,\mu_n,\sigma_n^2\)
Algorithm evaluation
- We fit p(x) on the training set,
- On the cross validation set we predict y given x (y=1 if \(p(x) < \epsilon\), y=0 otherwise).
- Possible evaluation metrics:
- True positives, false positives, false negatives, true negatives
- Precision/Recall
- F1-score
- We can use the cross validation set to choose parameter \(\epsilon\).
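Choosing \(\epsilon\) on the cross validation set can be sketched as follows; a minimal example in Python (rather than the course's Octave), with made-up p(x) values and labels:

```python
def evaluate(p_values, y_true, epsilon):
    """Flag y=1 when p(x) < epsilon; return (precision, recall, F1)."""
    tp = sum(1 for p, y in zip(p_values, y_true) if p < epsilon and y == 1)
    fp = sum(1 for p, y in zip(p_values, y_true) if p < epsilon and y == 0)
    fn = sum(1 for p, y in zip(p_values, y_true) if p >= epsilon and y == 1)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Hypothetical CV set: anomalies (y=1) have low p(x)
p_cv = [0.001, 0.2, 0.3, 0.0005, 0.4]
y_cv = [1, 0, 0, 1, 0]

# Pick the candidate epsilon with the best F1-score
best_eps = max([0.0001, 0.005, 0.05], key=lambda e: evaluate(p_cv, y_cv, e)[2])
```

F1 is a sensible selection criterion here because the classes are heavily skewed, so raw accuracy would reward always predicting y=0.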
15.5 - Anomaly Detection vs Supervised Learning
- Use Anomaly detection if:
- Very small number of positive examples (y=1), for instance 0-20, together with a very large number of negative examples (y=0).
- Many different “types” of anomalies (future anomalies may look completely different)
- Use Supervised learning if:
- Large number of both positive and negative examples.
- Future positive examples are likely to look like the current positive examples.
15.6 - Choosing What Features to Use
- If a feature is not Gaussian-distributed, we can try transforming it to make it look more Gaussian (for instance by taking log(x), replacing x with log(x+c), or replacing x with x^c).
- To test this on the fly with Octave, we can plot histograms of candidate transforms:
hist(x.^0.1, 50)
hist(log(x), 50)
15.7 - Multivariate Gaussian Distribution
- Also called Multivariate Normal Distribution.
- The parameters are: \(\mu \in \mathbb{R}^n, \Sigma \in \mathbb{R}^{n\times n}\)
- The probability density is then: \(p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{\frac n2} |\Sigma|^{\frac 12}} \exp\left(-\frac 12 (x-\mu)^T \Sigma^{-1} (x-\mu)\right)\)
- \(|\Sigma|\) is the determinant of the matrix. It can be computed in Octave with:
det(Sigma)
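The density formula above can be sketched for n=2 in Python (rather than the course's Octave), writing out the 2x2 determinant and inverse by hand to stay self-contained:

```python
import math

def mvn_pdf_2d(x, mu, Sigma):
    """p(x; mu, Sigma) for n = 2; note (2*pi)^(n/2) = 2*pi here."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))
```

At x = \(\mu\) the exponent is 0, so the density is just the normalizing constant \(1/((2\pi)^{n/2}|\Sigma|^{1/2})\).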
15.8 - Anomaly Detection using the Multivariate Gaussian Distribution
- We can compute the mean and sigma as: \(\mu = \frac 1m \sum\limits_{i=1}^m x^{(i)} \in \mathbb{R}^n \) and \(\Sigma = \frac 1m \sum\limits_{i=1}^m (x^{(i)} - \mu) \cdot (x^{(i)} - \mu)^T \in \mathbb{R}^{n \times n}\)
- Note that the original model is a special case of the multivariate case with \(\Sigma\) being a diagonal matrix with the diagonal containing the values \(\sigma_1^2,\sigma_2^2,\dots,\sigma_n^2\).
- The original model is preferred when we manually create extra features to capture anomalies due to unusual combinations of values. It's computationally cheaper and scales better to large n (e.g. n = 10,000 or n = 100,000).
- The multivariate Gaussian automatically captures correlations between features, but it's more computationally expensive, and we must have m > n or the matrix \(\Sigma\) is non-invertible. A rule of thumb is to have m > 10n. (\(\Sigma\) can also be non-invertible if we have redundant, i.e. linearly dependent, features.)
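The special-case relationship above can be checked numerically; a sketch in Python (rather than the course's Octave) comparing the diagonal-\(\Sigma\) multivariate density against the product of univariate densities, at a hypothetical point:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2) for a univariate Gaussian."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def diagonal_mvn_pdf(x, mus, sigma2s):
    """Multivariate Gaussian with diagonal Sigma: |Sigma| is the product of
    the sigma_j^2, and Sigma^{-1} is diagonal with entries 1/sigma_j^2."""
    n = len(x)
    det = 1.0
    quad = 0.0
    for xj, muj, s2 in zip(x, mus, sigma2s):
        det *= s2
        quad += (xj - muj) ** 2 / s2
    return math.exp(-quad / 2) / ((2 * math.pi) ** (n / 2) * math.sqrt(det))
```

With diagonal \(\Sigma\) the quadratic form splits into a sum over features, so the exponential factors into the per-feature product of the original model.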