
XV - Anomaly Detection

  • Given a dataset of non-anomalous samples, we want to decide whether a new example is an anomaly or not.
  • To achieve this, we build a model p(x). Then, if \(p(x_{test}) \le \epsilon\), we flag the example as an anomaly; otherwise the example is considered OK.
  • Anomaly detection is often used for fraud detection.
  • If X is a Gaussian-distributed random variable with mean \(\mu\) and variance \(\sigma^2\), then: \(X \sim N(\mu,\sigma^2) \)
  • \(\sigma\) is also called the standard deviation.
  • The Gaussian density is \(p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(- \frac{(x-\mu)^2}{2\sigma^2}\right) \).
  • How do we estimate the parameters \(\mu\) and \(\sigma\) given a dataset that we suspect follows a Gaussian distribution?
  • \(\mu = \frac{1}{m} \sum\limits_{i=1}^m x^{(i)} \)
  • Then \(\sigma^2 = \frac{1}{m} \sum\limits_{i=1}^m (x^{(i)} - \mu)^2 \)
  • The values computed above are the maximum likelihood estimates of the parameters.
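  • As a sketch in Python (the course itself uses Octave), the two estimates above and the Gaussian density can be computed as follows; the function names are illustrative, not from the course:

```python
import numpy as np

def estimate_gaussian(x):
    """Maximum likelihood estimates of mean and variance (1/m, not 1/(m-1))."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2) for the univariate Gaussian."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu, sigma2 = estimate_gaussian(x)
# mu = 5.0, sigma2 = 4.0 (so sigma = 2.0)
```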
  • We start with a training set \(\{x^{(1)},\dots,x^{(m)}\}\)
  • For each example \(x \in \mathbb{R}^n\), we then compute the probability: \(p(x)=p(x_1;\mu_1, \sigma_1^2) \cdot p(x_2;\mu_2, \sigma_2^2) \cdots p(x_n;\mu_n, \sigma_n^2) \).
  • We assume here that each feature follows a Gaussian distribution with its own mean and variance.
  • We can also write: \( p(x) = \prod\limits_{j=1}^n p(x_j;\mu_j, \sigma_j^2) \)
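  • The per-feature estimation and the product over features can be sketched in Python (the course uses Octave; the function names here are illustrative):

```python
import numpy as np

def estimate_params(X):
    """Per-feature ML estimates for an m x n matrix of examples."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # np.var defaults to 1/m, matching the ML estimate
    return mu, sigma2

def p(X, mu, sigma2):
    """Product over features of the univariate Gaussian densities."""
    dens = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return dens.prod(axis=1)
```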
  • We now assume we have some labeled data (y=0 if normal, y=1 if anomalous).
  • Typically we will have between 2 and 50 anomalous examples.
  • From the training set we compute \(\mu_1,\sigma_1,\mu_2,\sigma_2,\dots\mu_n,\sigma_n\)
  • We fit p(x) on the training set.
  • On the cross-validation set we predict y given x (y=1 if p(x) < \(\epsilon\), y=0 otherwise).
  • Possible evaluation metrics:
    • True positives, false positives, false negatives, true negatives
    • Precision/Recall
    • F1-score
  • We can use the cross validation set to choose parameter \(\epsilon\).
  • Use Anomaly detection if:
    • Very small number of positive examples (y=1), for instance 0-20, together with a very large number of negative examples (y=0).
    • Many different “types” of anomalies (future anomalies may look completely different)
  • Use Supervised learning if:
    • Large number of both positive and negative examples.
    • Future positive examples are likely to look like the current positives.
  • If a feature is not Gaussian-distributed, we can try to transform it to make it more Gaussian (for instance by replacing x with log(x), log(x+c), or x^c).
  • To test this on the fly in Octave, we can plot histograms of the transformed feature:
    hist(x.^0.1, 50)
    hist(log(x),50)
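  • As a numeric stand-in for eyeballing the histogram, here is a Python sketch checking that a log transform makes a right-skewed feature closer to Gaussian (the lognormal sample and the `skewness` helper are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # heavily right-skewed

def skewness(v):
    """Sample skewness: zero for a symmetric (e.g. Gaussian) distribution."""
    d = v - v.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

# log(x) is exactly normal here, so its skewness is much smaller
assert abs(skewness(np.log(x))) < abs(skewness(x))
```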
  • The multivariate Gaussian (also called the Multivariate Normal Distribution) models all the features jointly.
  • The parameters are: \(\mu \in \mathbb{R}^n, \Sigma \in \mathbb{R}^{n\times n}\)
  • The probability formula is then: \(p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac 12 (x-\mu)^T \Sigma^{-1} (x-\mu)\right)\)
  • \(|\Sigma|\) is the determinant of the matrix. It can be computed in Octave with:
    det(Sigma)
  • We can compute the mean and covariance as: \(\mu = \frac 1m \sum\limits_{i=1}^m x^{(i)} \in \mathbb{R}^n \) and \(\Sigma = \frac 1m \sum\limits_{i=1}^m (x^{(i)} - \mu) \cdot (x^{(i)} - \mu)^T \in \mathbb{R}^{n \times n}\)
  • Note that the original model is a special case of the multivariate case with \(\Sigma\) being a diagonal matrix with the diagonal containing the values \(\sigma_1^2,\sigma_2^2,\dots,\sigma_n^2\).
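  • A Python sketch of the multivariate estimates and density (the function names are illustrative); with a diagonal \(\Sigma\) it reproduces the product of univariate Gaussians, as the note above states:

```python
import numpy as np

def estimate_multivariate(X):
    """ML estimates of mu and Sigma for an m x n matrix of examples."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]
    return mu, Sigma

def multivariate_gaussian(X, mu, Sigma):
    """p(x; mu, Sigma) evaluated row-wise for X of shape (m, n)."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    expo = -0.5 * np.sum(diff @ inv * diff, axis=1)
    return np.exp(expo) / norm

# Diagonal Sigma: equals N(1; 0, 1) * N(2; 0, 4) = exp(-1) / (4*pi)
mu = np.array([0.0, 0.0])
Sigma = np.diag([1.0, 4.0])
X = np.array([[1.0, 2.0]])
```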
  • The original model is used when we manually create extra features to capture anomalies due to unusual combinations of existing features. It's cheaper and scales better (e.g. n = 10,000 or n = 100,000).
  • The multivariate Gaussian automatically captures correlations between features, but it's more computationally expensive, and we must have m > n or \(\Sigma\) is non-invertible. A rule of thumb is m > 10n. (\(\Sigma\) can also be non-invertible if we have redundant features.)
Last modified: 2020/07/10 12:11