====== XV - Anomaly Detection ======

===== 15.1 - Problem motivation =====

  * Given a dataset of non-anomalous examples, we want to decide whether a new example is an anomaly or not.
  * To achieve this, we build a model p(x). We then flag the example as an anomaly if \(p(x_{test}) \le \epsilon\); otherwise, the example is considered OK.
  * Anomaly detection is often used for **fraud detection**.

===== 15.2 - Gaussian distribution =====

  * If X is distributed as a Gaussian with mean \(\mu\) and variance \(\sigma^2\), then: \(X \sim N(\mu,\sigma^2)\)
  * \(\sigma\) is also called the **standard deviation**.
  * The Gaussian density is \(p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\).

==== Parameter estimation ====

  * How do we estimate the parameters \(\mu\) and \(\sigma^2\) given a dataset that we suspect follows a Gaussian distribution?
  * \(\mu = \frac{1}{m} \sum\limits_{i=1}^m x^{(i)}\)
  * \(\sigma^2 = \frac{1}{m} \sum\limits_{i=1}^m (x^{(i)} - \mu)^2\)
  * The values computed above are the **maximum likelihood estimates** of the parameters.

===== 15.3 - Algorithm =====

==== Density estimation ====

  * We start with a training set \(\{x^{(1)},\dots,x^{(m)}\}\).
  * For each example \(x \in \mathbb{R}^n\), we then compute the probability: \(p(x)=p(x_1;\mu_1, \sigma_1^2) \cdot p(x_2;\mu_2, \sigma_2^2) \cdots p(x_n;\mu_n, \sigma_n^2)\).
  * We assume here that each feature follows a Gaussian distribution with its own mean and variance.
  * We can also write: \(p(x) = \prod\limits_{j=1}^n p(x_j;\mu_j, \sigma_j^2)\) (a minimal Octave sketch of this computation follows after section 15.4 below).

===== 15.4 - Developing and Evaluating an Anomaly Detection System =====

  * We now assume we have some labeled data (y=0 if normal, y=1 if anomalous).
  * Typically we will have between 2 and 50 anomalous examples.
  * From the training set we compute \(\mu_1,\sigma_1^2,\mu_2,\sigma_2^2,\dots,\mu_n,\sigma_n^2\).

==== Algorithm evaluation ====

  * We fit p(x) on the training set.
  * On the cross validation set we predict y given x (y=1 if p(x) < \(\epsilon\), y=0 otherwise).
  * Possible evaluation metrics:
    * True positives, false positives, false negatives, true negatives
    * Precision/Recall
    * F1-score
  * We can use the cross validation set to choose the parameter \(\epsilon\) (see the second sketch below).
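
A minimal Octave sketch of the parameter estimation (15.2) and density estimation (15.3) above. The function names and the assumption that the training examples are stored row-wise in an m x n matrix X are illustrative, not from the course:

<code octave>
% Fit a per-feature Gaussian: mu(j) and sigma2(j) for each column of X.
% X is assumed to be an m x n matrix, one training example per row.
function [mu, sigma2] = estimateGaussian(X)
  m = size(X, 1);
  mu = mean(X);                        % 1 x n vector of feature means
  sigma2 = sum((X - mu).^2) / m;       % 1 x n maximum likelihood variances
end

% Evaluate p(x) for each row of X as the product of the per-feature densities.
function p = computeDensity(X, mu, sigma2)
  px = exp(-((X - mu).^2) ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2);
  p = prod(px, 2);                     % m x 1 vector: one density per example
end
</code>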
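
Continuing the evaluation procedure of section 15.4, one possible way to choose \(\epsilon\) is to scan candidate thresholds and keep the one with the best F1 score on the cross validation set. A sketch, assuming pval holds the densities p(x) of the CV examples and yval their ground-truth labels (the names are illustrative):

<code octave>
% Choose epsilon by maximizing F1 on the cross validation set.
% pval: densities p(x) on the CV set, yval: labels (1 = anomaly, 0 = normal).
function [bestEpsilon, bestF1] = selectThreshold(yval, pval)
  bestEpsilon = 0; bestF1 = 0;
  stepsize = (max(pval) - min(pval)) / 1000;
  for epsilon = min(pval):stepsize:max(pval)
    pred = (pval < epsilon);                 % flag as anomaly below threshold
    tp = sum((pred == 1) & (yval == 1));     % true positives
    fp = sum((pred == 1) & (yval == 0));     % false positives
    fn = sum((pred == 0) & (yval == 1));     % false negatives
    prec = tp / (tp + fp);
    rec  = tp / (tp + fn);
    F1 = 2 * prec * rec / (prec + rec);      % NaN when undefined, never kept
    if F1 > bestF1
      bestF1 = F1; bestEpsilon = epsilon;
    end
  end
end
</code>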
===== 15.5 - Anomaly Detection vs Supervised Learning =====

  * Use anomaly detection if:
    * There is a very small number of positive examples (y=1), for instance 0-20, with a very large number of negative examples (y=0).
    * There are many different "types" of anomalies (future anomalies may look completely different).
  * Use supervised learning if:
    * There is a large number of both positive and negative examples.
    * Future positive examples are likely to look like the current positives.

===== 15.6 - Choosing What Features to Use =====

  * If a feature is not Gaussian distributed, we can try to transform it so that it becomes approximately Gaussian (for instance by taking log(x), replacing x with log(x+c), or replacing x with x^c).
  * To test this on the fly in Octave, we can plot a histogram of the transformed feature:

<code octave>
hist(x.^0.1, 50)
hist(log(x), 50)
</code>

===== 15.7 - Multivariate Gaussian Distribution =====

  * Also called the Multivariate Normal Distribution.
  * The parameters are: \(\mu \in \mathbb{R}^n, \Sigma \in \mathbb{R}^{n\times n}\)
  * The density is then: \(p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{\frac n2} |\Sigma|^{\frac 12}} \exp\left(-\frac 12 (x-\mu)^T \Sigma^{-1} (x-\mu)\right)\)
  * \(|\Sigma|\) is the determinant of the matrix. It can be computed in Octave with det(Sigma).

===== 15.8 - Anomaly Detection using the Multivariate Gaussian Distribution =====

  * We can compute the mean and covariance as: \(\mu = \frac 1m \sum\limits_{i=1}^m x^{(i)} \in \mathbb{R}^n\) and \(\Sigma = \frac 1m \sum\limits_{i=1}^m (x^{(i)} - \mu)(x^{(i)} - \mu)^T \in \mathbb{R}^{n \times n}\)
  * Note that the original model is a special case of the multivariate one, with \(\Sigma\) being a diagonal matrix whose diagonal contains the values \(\sigma_1^2,\sigma_2^2,\dots,\sigma_n^2\).
  * The original model can be used when we manually create extra features to capture anomalies due to unusual combinations of values. It is cheaper and scales better (e.g. to n=10,000 or n=100,000).
  * The multivariate Gaussian automatically captures correlations between features, but it is more computationally expensive, and we must have m > n or \(\Sigma\) is non-invertible. A rule of thumb is to have m > 10n. (\(\Sigma\) can also be non-invertible if we have redundant features.) A sketch of this model follows below.
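
A minimal Octave sketch of the multivariate model above, under the same illustrative assumption that X is an m x n matrix with one example per row (the function names are not from the course):

<code octave>
% Fit the multivariate Gaussian: mean vector mu and covariance matrix Sigma.
% X is assumed to be an m x n matrix, one training example per row.
function [mu, Sigma] = estimateMultivariateGaussian(X)
  m = size(X, 1);
  mu = mean(X);                             % 1 x n mean vector
  Xc = X - mu;                              % center the data
  Sigma = (Xc' * Xc) / m;                   % n x n covariance matrix
end

% Evaluate the multivariate Gaussian density for each row of X.
function p = multivariateGaussian(X, mu, Sigma)
  n = columns(X);
  Xc = X - mu;
  % Quadratic form (x - mu)' * inv(Sigma) * (x - mu), computed row by row
  % via right division rather than an explicit matrix inverse.
  q = sum((Xc / Sigma) .* Xc, 2);
  p = exp(-0.5 * q) / ((2 * pi)^(n / 2) * sqrt(det(Sigma)));
end

% Examples whose density falls below the chosen threshold are flagged:
% anomalies = find(multivariateGaussian(X, mu, Sigma) < epsilon);
</code>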