====== Pattern Discovery in Data Mining ======
/*
Coursera introduction text:
Hi everyone!
I'm Emmanuel ROCHE, but everyone calls me "Manu" ;-). I'm French, and I currently live near Toulouse, in France. I have no serious experience with data mining or pattern discovery yet, but I'm a fast learner, and I believe this field could be quite interesting. So I decided to give it a try here.
*/
Course page: https://www.coursera.org/learn/data-patterns/home/welcome\\
Teacher: **Mr. Jiawei Han**\\
Textbook: **Data Mining: Concepts and Techniques** by Jiawei Han, Micheline Kamber and Jian Pei (2011) [Morgan Kaufmann]
* Course structure:
* Lesson 1: Pattern Discovery: Basic Concepts
* Lesson 2: Efficient Pattern Mining Methods
* Lesson 3: Pattern Evaluation
* Lesson 4: Mining Diverse Frequent Patterns
* Lesson 5: Sequential Pattern Mining
* Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns
* Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data
* Lesson 8: Advanced Topics on Pattern Discovery
* Basic concepts of pattern discovery: **frequent pattern**, **closed pattern**, **max-pattern**, and **association rules**.
* Efficient pattern mining methods: **Apriori**, **ECLAT**, and **FPgrowth**.
* Pattern evaluation measures: **lift**, **chi-square**, **cosine**, **Jaccard**, and **Kulczynski**.
* Methods for mining sequential patterns: **GSP**, **SPADE**, **PrefixSpan**, and **CloSpan**.
* Pattern-based classifications: **CBA**, **CMAR**, **PatClass**, **DDPMine**, and **DPClass**.
===== Lesson 1: Pattern Discovery Basic Concepts =====
==== Key questions ====
* Why is pattern discovery an important task in data mining?
* What are the concepts of frequent itemsets?
* What are the concepts of association rules? How do you generate association rules from frequent itemsets?
* Why do we need compressed representations of frequent patterns? Why do we need two kinds of compressed representations: closed patterns and max-patterns?
==== Definitions ====
* **Pattern**: a **set** of **items**, **subsequences** or **substructures** that occur frequently together (or are strongly correlated) in a data set. Patterns represent **intrinsic** and **important properties** of datasets.
* **Pattern discovery**: trying to **uncover** patterns from massive data sets. Finding **inherent regularities** in a dataset.
* Pattern discovery is the foundation for many essential data mining tasks:
- Association, correlation and causality analysis,
- Mining sequential and structural (e.g. sub-graph) patterns,
- Pattern analysis in spatiotemporal, multimedia, time-series and stream data,
- Classification: Discriminative pattern-based analysis
- Cluster analysis: Pattern-based subspace clustering
* **Itemset**: A set of one or more items.
* **k-itemset**: \(X = \{x_1,\cdots,x_k\}\)
* **(absolute) support (count)** of X: //the number of transactions (dataset entries) that contain the itemset X//. (**Note**: X here can be an itemset of any size — 1-itemset, 2-itemset, etc. — even if each entry in the dataset is, say, a 5-itemset; the example in the lecture counts the occurrences of a single item \(x_i\) inside the transactions, which is simply the support of the 1-itemset \(\{x_i\}\).)
* **(relative) support** of X: //the fraction of transactions [or entries] in the dataset that contain X// (i.e. the **probability** that a transaction contains X).
* An itemset X is **frequent** if its support is no less than a given **minimum support threshold** (denoted \(\sigma\)).
* **Association rule**: we write the association rule \(X \rightarrow Y\ (s,c)\), stating that an entry containing the itemset X also contains the itemset Y, with support **s** and confidence **c**.
* Above, **s** is the support of \(X \cup Y\): i.e. the probability that an entry contains \(X \cup Y\).
* **c** is the conditional probability that an entry containing X also contains Y: \(c = \frac{sup(X \cup Y)}{sup(X)}\)
* **Subtle notation**: the entries that contain \(X \cup Y\) are actually the **intersection** of the entries containing X and the entries containing Y (not the **union**!)
* **Association rule mining**: find **all** the rules \(X \rightarrow Y\) that satisfy given minimum support and confidence thresholds.
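To make these definitions concrete, here is a minimal Python sketch (the toy transactions are made up, not from the course) computing relative support and the confidence of a rule:
<code python>
# Minimal sketch of support / confidence computations (toy data, not from the course).
transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

def abs_support(itemset, db):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

def rel_support(itemset, db):
    """Relative support: fraction of transactions containing the itemset."""
    return abs_support(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """Confidence of rule lhs -> rhs: sup(lhs u rhs) / sup(lhs)."""
    return abs_support(lhs | rhs, db) / abs_support(lhs, db)

X, Y = {"beer"}, {"diaper"}
print(rel_support(X | Y, transactions))   # s = 0.6
print(confidence(X, Y, transactions))     # c = 1.0
</code>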
==== Challenge ====
With the definitions above we often have **too many** frequent patterns.
* Solution 1: **Closed patterns**: //A pattern (itemset) X is **closed** if X is frequent and there exists no super-pattern \(Y \supset X\) with the same support as X//.
* => **Closed patterns** are a **lossless compression** of frequent patterns (i.e. we reduce the number of patterns without losing the support information).
* Solution 2: **Max-patterns**: //A pattern X is a **max-pattern** if X is frequent and there exists no frequent super-pattern \(Y \supset X\).//
* => **Max-patterns** are a **lossy compression** of frequent patterns: we only know that a given sub-pattern is frequent, but we no longer know its actual support.
=> In many applications, mining **closed patterns** is more desirable than mining **max-patterns**.
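A small sketch (with made-up supports) of how closed and max patterns can be identified once the frequent itemsets and their supports are known:
<code python>
# Sketch: identify closed and max patterns among frequent itemsets (toy supports).
# frequent: dict mapping frozenset -> support (assumed already mined).
frequent = {
    frozenset("a"): 4, frozenset("b"): 4, frozenset("c"): 3,
    frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2, frozenset("abc"): 2,
}

def closed_patterns(frequent):
    # X is closed if no proper superset Y has the same support.
    return {X for X, s in frequent.items()
            if not any(X < Y and frequent[Y] == s for Y in frequent)}

def max_patterns(frequent):
    # X is a max-pattern if no proper superset is frequent at all.
    return {X for X in frequent if not any(X < Y for Y in frequent)}

print(sorted("".join(sorted(p)) for p in closed_patterns(frequent)))  # ['a', 'ab', 'abc', 'b', 'c']
print(sorted("".join(sorted(p)) for p in max_patterns(frequent)))     # ['abc']
</code>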
==== Recommended Readings ====
* "[[http://www.rakesh.agrawal-family.com/papers/sigmod93assoc.pdf|Mining association rules between sets of items in large databases]]" (R. Agrawal, T. Imielinski and A. Swami, in Proc. of SIFMOD'93)
* "Efficiently mining long patterns from databases" (R. J. Bayardo, in Proc. of SIGMOD'98)
* "Discovering frequent closed itemsets for association rules" (N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, in Proc. of ICDT'99)
* "Frequen Pattern Mining: Current Status and Future Directions" (J. Han, H. Cheng, D. Xin and X. Yan, in Data Mining and Knowledge Discovery, 15(1): 55-86, 2007)
==== Formal Model definition ====
Let \(\mathcal{I} = \{I_1,I_2,\dots,I_m\}\) be a set of binary attributes, called items. Let \(T\) be a database of transactions. Each transaction \(t\) is represented as a binary vector, with \(t[k] = 1\) if \(t\) bought the item \(I_k\), and \(t[k] = 0\) otherwise.
FIXME: Complete this section.
===== Lesson 2: Efficient Pattern Mining Methods =====
==== Key concepts ====
* The downward closure property or the Apriori property of frequent patterns
* The Apriori algorithm
* ECLAT: Mining frequent patterns by exploring vertical data format
* FPGrowth: A frequent pattern-growth approach
* Closet+ for mining closed patterns
==== Downward Closure Property of Frequent Patterns ====
* **Downward closure** property (also called **Apriori**): //Any subset of a frequent itemset must be frequent.//
* Thus an efficient mining methodology will take advantage of this, with the following rule: given an itemset **S**, if any subset of **S** is **infrequent** then there is **no chance for S to be frequent** (so S can be pruned and doesn't need to be mined).
* From this **Apriori pruning principle**, three major methods were developed:
- Level-wise, join-based approach: **Apriori** (Agrawal, Srikant @ VLDB'94)
- Vertical data format approach: **Eclat** (Zaki, Parthasarathy, Ogihara, Li @ KDD'97)
- Frequent pattern projection and growth: **FPgrowth** (Han, Pei, Yin @ SIGMOD'00)
==== The Apriori Algorithm ====
* Algorithm:
- Initially, scan database once to get **frequent 1-itemset**.
- **Repeat**:
- Generate candidate (k+1)-itemsets from frequent k-itemsets.
- Test the candidate (k+1)-itemsets to find which ones are frequent.
- set k = k+1
- **Stop** when no frequent or candidate set can be generated anymore.
- Return all the frequent itemsets found.
* **How to generate candidates** at level k+1:
- **Self-joining** \(F_k\): eg. {abc, abd, acd, ace, bcd} => {abcd, acde}
- **Pruning**: eg. **ade** is not in the \(F_3\) above, so **acde** is pruned.
* An **SQL implementation** can be derived to generate the candidates described above [not reported here, cf. the corresponding course video]
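A compact, non-optimized Python sketch of the level-wise Apriori loop described above (toy transactions, min_sup chosen arbitrarily):
<code python>
# Minimal Apriori sketch (toy data, no optimizations such as hashing or partitioning).
from itertools import combinations

def apriori(db, min_sup):
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}
    # Frequent 1-itemsets.
    Fk = {frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= min_sup}
    all_frequent = {}
    k = 1
    while Fk:
        for X in Fk:
            all_frequent[X] = sum(1 for t in db if X <= t)
        # Candidate generation: self-join F_k, then Apriori pruning of candidates
        # having an infrequent k-subset.
        candidates = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Fk for s in combinations(c, k))}
        # Support counting for the surviving candidates.
        Fk = {c for c in candidates
              if sum(1 for t in db if c <= t) >= min_sup}
        k += 1
    return all_frequent

db = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"b", "c", "d"}]
print(apriori(db, min_sup=2))
</code>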
==== Extensions or Improvements of Apriori ====
* Reducing number of transaction database scan passes:
* Partitioning (Savasere, et al., 1995)
* Dynamic itemset counting (Brin, et al., 1997)
* Shrink the number of candidates:
* Hashing (Park, et al., 1995)
* Pruning by support lower bounding (Bayardo 1998)
* Sampling (Toivonen, 1996)
* Exploring special data structures
* Tree projection (Aggarwal, et al. 2001)
* H-miner (Pei, et al. 2001)
* Hypercube decomposition (Uno, et al. 2004)
=== Partitioning ===
* => Only needs to scan the database twice to find frequent itemsets.
* => We partition the database into k partitions, then:
* **Theorem**: Any itemset that is potentially frequent in database TDB must be frequent in at least one of the partitions of TDB.
* Method:
* Scan 1: partition the database so that each partition fits in main memory (avoiding further I/O), and find all frequent itemsets local to each partition.
* Scan 2: consolidate global frequent patterns.
=== Direct Hashing and Pruning (DHP) ===
* DHP can be used to reduce the number of candidates.
* **Observation**: //A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.//
* Example: while counting the 1-itemset with candidates a,b,c,d,e we also count the number of entries for 2-itemset buckets such as {ab,ad,ae}, {bd,be,de}, etc.
* e.g. {ab} cannot be a candidate 2-itemset if the total count of its hash bucket {ab, ad, ae} is below the support threshold.
==== Mining Frequent Patterns by Exploring Vertical Data Format ====
* ECLAT (Equivalence Class Transformation): a depth-first search algorithm using set intersection.
* Need to generate the **Tid-List**: i.e. the list of transaction IDs containing each item.
* Then we can compute the **intersections** of the tid-lists: e.g. if \(t(e) = \{T_{10}, T_{20}, T_{30}\}\) and \(t(a) = \{T_{10}, T_{20}\}\), then \(t(ae) = \{T_{10}, T_{20}\}\)
* **Properties of Tid-Lists**:
* \(t(X) = t(Y)\) => X and Y always happen together.
* \(t(X) \subset t(Y)\) => transactions having X always have Y.
* Frequent patterns can be derived from vertical intersections.
* Can use **diffset** to accelerate mining:
* Only keep track of differences of Tid-Lists (can improve memory usage).
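A minimal sketch of the vertical (tid-list) representation and the intersection-based counting used by ECLAT (toy data; the last line just illustrates the diffset idea):
<code python>
# ECLAT-style vertical format sketch: tid-lists and their intersections (toy data).
db = {10: {"a", "c", "d"}, 20: {"a", "b", "c", "e"}, 30: {"a", "b", "e"}}

# Build the tid-list of every single item.
tid = {}
for t, items in db.items():
    for i in items:
        tid.setdefault(i, set()).add(t)

def tidlist(itemset):
    """Tid-list of an itemset = intersection of the tid-lists of its items."""
    return set.intersection(*(tid[i] for i in itemset))

print(sorted(tid["a"]))              # [10, 20, 30]
print(sorted(tidlist({"a", "e"})))   # [20, 30] -> support 2
# Diffset idea: store tid(prefix) - tid(prefix + item) instead of the full list.
print(sorted(tid["a"] - tidlist({"a", "e"})))  # [10]
</code>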
==== FPGrowth: A Pattern Growth Approach ====
* Algorithm:
- Find frequent single items and partition the database based on each such item,
- Recursively grow frequent patterns by doing the above for each partitioned database (also called a conditional database),
- Use an **FP-Tree** (frequent-pattern tree) to facilitate efficient processing.
* FIXME: I didn't get much out of that one, might need to check again later...
==== Mining Closed Patterns ====
* **Closet+** algorithm to mine closed itemsets by Pattern-Growth (efficient and direct)
* **itemset merging**: //if Y appears in every occurrence of X, then Y is merged with X.//
* Many tricks are implemented in CLOSET+:
* Hybrid tree projection,
* Sub-itemset pruning,
* Item skipping,
* Efficient subset checking, etc.
==== Recommended Readings ====
* R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
* A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases”, VLDB'95
* J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”, SIGMOD'95
* S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database systems: Alternatives and implications”, SIGMOD'98
* M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithm for discovery of association rules”, Data Mining and Knowledge Discovery, 1997
* J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
* M. J. Zaki and Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
* J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets”, KDD'03
* C. C. Aggarwal, M.A., Bhuiyan, M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014
===== Lesson 3: Pattern Evaluation =====
==== Key concepts ====
* Pattern evaluation; interestingness of patterns/rules
* Popularly used measures in pattern evaluation: support, confidence, lift, chi-square, cosine, Jaccard
* Null-invariance, the Kulczynski measure, and imbalance ratio
==== Limitation of the Support-Confidence Framework ====
* Not all frequent patterns and rules are interesting.
* **Interestingness measures** can be classified as **objective** or **subjective**.
- **Objective interestingness measures** (such as support, confidence, correlation, etc.) are defined by mathematical formulas,
- **Subjective interestingness measures** depend on one's perspective.
* Thus mining can be done:
- in a **query-based** mode: asking the user what she would like to see,
- against one's knowledge base: looking for unexpected, fresh or recent patterns/rules,
- using visualization tools: multi-dimensional, interactive examination.
* Not all strongly supported and/or confident association rules are interesting. In the lecture videos we have the following example:
* //play-basketball// => //eat-cereal// [s=40%, c=66.7%]
* //**not** play-basketball// => //eat-cereal// [s=35%, c=87.5%]
==== Interestingness Measures: Lift and \(\chi^2\) ====
* **Lift**: measure of dependent/correlated events:
\[lift(B,C) = \frac{c(B \rightarrow C)}{s(C)} = \frac{s(B \cup C)}{s(B) \times s(C)}\]
* Lift(B,C) will tell how B and C are correlated:
* Lift(B,C) = 1 => B and C are independent.
* Lift(B,C) > 1 => we have a positive correlation
* Lift(B,C) < 1 => we have a negative correlation
* \(\chi^2\) is another measure to test correlated events:
\[\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}\]
* General \(\chi^2\) rules:
* \(\chi^2 = 0\): events are independent
* \(\chi^2 > 0\): events are correlated (either positively or negatively)
* But even Lift and \(\chi^2\) are not always good measures: they can, for instance, indicate a strong positive correlation when there is none, because of the many null transactions.
* [def] **Null transactions**: transactions that do not contain the itemset of interest.
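A small sketch computing lift and \(\chi^2\) on the basketball/cereal example above (assuming N = 1000 transactions; the counts then follow from the supports and confidences given in the lecture):
<code python>
# Lift and chi-square on the basketball/cereal example
# (N = 1000 is an assumption; the counts follow from s = 40%, c = 66.7%, c' = 87.5%).
N = 1000
n_bc, n_b_notc = 400, 200        # basketball & cereal, basketball & no cereal
n_notb_c, n_notb_notc = 350, 50  # no basketball & cereal, neither

s_b = (n_bc + n_b_notc) / N      # s(basketball) = 0.6
s_c = (n_bc + n_notb_c) / N      # s(cereal) = 0.75
s_bc = n_bc / N                  # s(basketball u cereal) = 0.4

lift = s_bc / (s_b * s_c)
print(lift)  # ~0.89 < 1 -> negative correlation

# Chi-square: sum over the 2x2 table of (observed - expected)^2 / expected.
observed = [n_bc, n_b_notc, n_notb_c, n_notb_notc]
expected = [N * s_b * s_c, N * s_b * (1 - s_c),
            N * (1 - s_b) * s_c, N * (1 - s_b) * (1 - s_c)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)  # ~55.6 > 0 -> the two events are correlated
</code>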
==== Null Invariance Measures ====
* **Null invariance**: value does not change with the number of null-transactions.
* Lift and \(\chi^2\) are **not** null invariant.
* A few Null-invariant measures: (all give values in the range [0,1])
* \(AllConf(A,B) = \frac{s(A \cup B)}{max(s(A), s(B))}\)
* \(Jaccard(A,B) = \frac{s(A \cup B)}{s(A) + s(B) - s(A \cup B)}\)
* \(Cosine(A,B) = \frac{s(A \cup B)}{\sqrt{s(A) \times s(B)}}\)
* \(Kulczynski(A,B) = \frac 12 \left( \frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)} \right)\)
* \(MaxConf(A,B) = max \left( \frac{s(A \cup B)}{s(A)}, \frac{s(A \cup B)}{s(B)} \right)\)
* Null-transactions are often very numerous in practice, so null-invariant measures are critical.
==== Comparison of Null-Invariant Measures ====
* Not all null invariant measures are equal.
* => **Kulczynski** is the most appropriate when we have contradictory correlations, so this one seems to be the one to use by default.
* Introducing **Imbalance ratio** (IR): measure the imbalance of two itemsets A and B in rule implications:
\[IR(A,B) = \frac{|s(A) - s(B)|}{s(A)+s(B)-s(A \cup B)}\]
* => **Kulczynski + Imbalance ratio** together can present a clear picture of all the typical cases.
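A small sketch computing the null-invariant measures and the imbalance ratio listed above directly from the (relative) supports; the support values used here are made up:
<code python>
# Null-invariant measures and imbalance ratio computed from (relative) supports.
# s_ab stands for s(A u B), i.e. the fraction of transactions containing both A and B.
from math import sqrt

def measures(s_a, s_b, s_ab):
    return {
        "all_conf": s_ab / max(s_a, s_b),
        "jaccard":  s_ab / (s_a + s_b - s_ab),
        "cosine":   s_ab / sqrt(s_a * s_b),
        "kulc":     0.5 * (s_ab / s_a + s_ab / s_b),
        "max_conf": max(s_ab / s_a, s_ab / s_b),
        "IR":       abs(s_a - s_b) / (s_a + s_b - s_ab),
    }

# Balanced case vs. highly imbalanced case (made-up numbers):
print(measures(s_a=0.10, s_b=0.10, s_ab=0.08))  # kulc = 0.8, IR = 0
print(measures(s_a=0.10, s_b=0.60, s_ab=0.08))  # kulc ~ 0.47, IR ~ 0.8
</code>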
==== Recommended Readings ====
* C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS'98
* S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97
* M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94
* E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03
* P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02
* T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397, 2010
===== Lesson 4: Mining Diverse Patterns =====
==== Key concepts ====
* Multiple-level associations, level-shared mining, redundant rules
* Multi-dimensional associations, quantitative associations
* Negative correlations vs. rare patterns, null-invariance
* Compressed patterns, redundancy-aware top-k patterns
==== Mining Multi-Level Associations ====
* We can often have a hierarchy in items (eg. milk => 2% milk => Dairyland 2% milk)
* When we have such an item hierarchy we might mine with:
- Uniform min-support across multiple levels (is that reasonable?)
- Level-reduced min-support: items at the lower levels are expected to have lower support.
* How to perform efficient mining with multiple min-support values? => **shared** multi-level mining: use the lowest min-support to pass down the set of candidates.
* Another problem with multi-level association mining is **redundancy**:
* example: 2% milk is 25% of sold milk, and
* milk => wheat bread [s=8%, c=70%] (1)
* 2% milk => wheat bread [s=2%, c=72%] (2)
* Rule (2) can be derived from rule (1): given that 2% milk is 25% of sold milk, its expected support (8% × 25% = 2%) and confidence follow from rule (1), so rule (2) is redundant.
* => Redundancy filtering required.
* We may also require customized min-supports for different kinds of items (for example when very expensive items are mixed with common items):
* Solution: use **group-based** individualized min-support:
* eg. {diamond, watch}: 0.05%, {bread, milk}: 5%
=> scalable mining methods can easily be extended to take customized min-support into account.
==== Mining Multi-Dimensional Associations ====
* Simple **Single-dimensional** rule: buys(X, "milk") => buys(X, "bread")
* Multi-dimensional rules:
* **Inter-dimension** association rules (no repeated predicate):
* age(X, "18-25") ^ occupation(X, "student") => buys(X, "coke")
* **Hybrid-dimension** association rules (repeated predicate):
* age(X, "18-25") ^ buys(X, "popcorn") => buys(X, "coke")
* Attributes can be:
* **categorical** (eg. discrete values): no ordering
* **quantitative** (numerical values): implicit ordering
==== Mining Quantitative Associations ====
* Mining associations with numerical attributes (e.g. age, salary, etc.)
* Mining methods:
* **Static discretization** => data cube-based aggregation
* **Dynamic discretization** => based on data distribution
* **Clustering** (eg distance-based association)
* Perform **Deviation analysis**, for instance: Gender=female => Wage: mean=$7/hr (overall mean = $9/hr);
* Mining **extraordinary or interesting phenomena**, in the gender=female example above:
* Left Hand Side (LHS) : a subset of the population
* Right Hand Side (RHS) : an extraordinary behavior of this subset
* An extraordinary rule is accepted only if a statistical test (e.g. Z-test) confirms the inference with high confidence.
* **Subrule** : can highlight the extraordinary behavior of a subset of the population of the **super rule**, for instance:
* (gender=female) ^ (South=yes) => mean wage = $6.3/hr
* Rule condition can be categorical or numerical (quantitative rules):
* Education in [14-18] (yrs) => mean wage = $11.64/hr
* Efficient methods to mine extraordinary rules: cf. Aumann and Lindell@KDD'99
==== Mining Negative Correlations ====
* **Rare patterns** : very low support but still very interesting (can be mined with **individualized group-based min-support** thresholds)
* **Negative patterns**: negatively correlated itemsets (i.e. unlikely to happen together).
* Support-based definition of negative patterns: if itemsets A and B are both frequent but rarely occur together, \(sup(A \cup B) \ll sup(A) \times sup(B)\), then A and B are negatively correlated. => **similar to the definition of Lift** => doesn't work well with large datasets with many null-transactions.
* So we introduce a **Kulczynski measure-based definition** of negative patterns: if itemsets A and B are frequent but \(\frac{P(A|B) + P(B|A)}{2} < \epsilon\), where \(\epsilon\) is a negative pattern threshold (for instance \(\epsilon \approx 0.01\)), then A and B are negatively correlated.
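A quick sketch of this Kulczynski-based negative-correlation test (the thresholds are placeholders):
<code python>
# Kulczynski-based negative pattern test (sketch; thresholds are placeholders).
def is_negative_pattern(s_a, s_b, s_ab, min_sup=0.01, eps=0.01):
    """A and B frequent, but (P(A|B) + P(B|A)) / 2 below a small epsilon."""
    frequent = s_a >= min_sup and s_b >= min_sup
    kulc = 0.5 * (s_ab / s_a + s_ab / s_b)
    return frequent and kulc < eps

# e.g. two popular items that almost never co-occur:
print(is_negative_pattern(s_a=0.05, s_b=0.08, s_ab=0.0002))  # True
print(is_negative_pattern(s_a=0.05, s_b=0.08, s_ab=0.02))    # False
</code>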
==== Mining Compressed Patterns ====
* Mining can produce many scattered patterns that are not very meaningful.
* When considering **closed patterns** we cannot really compress, because there is too much emphasis on the support values (e.g. to merge two close patterns they are required to have exactly the same support).
* When using **max-patterns** we may lose too much information, since the support values of the sub-patterns are missing.
* We would rather have something in between (i.e. a good balance): keep some sub-patterns of the max-patterns to retain part of the support information.
* To select which patterns can be compressed, we use a **pattern distance measure** (see the sketch below):
\[Dist(P_1,P_2) = 1 - \frac{|T(P_1) \cap T(P_2)|}{|T(P_1) \cup T(P_2)|}\]
* Then we can perform \(\delta\)-**clustering**: for each pattern P, find all patterns which can be expressed by P and whose distance to P is within \(\delta\) (\(\delta\)-cover) => All patterns in the cluster can be represented by P.
* There is a method for efficient and **direct** mining of compressed patterns (instead of mining all patterns first before compressing): cf. Xin et al. VLDB'05.
* **Redundancy-aware Top-k Patterns**: looking for high significance and low redundancy when compressing patterns.
* Method: use MMS (Maximal Marginal Significance) for measuring the combined significance of a pattern set: cf. Xin et al., Extracting Redundancy-Aware Top-k Patterns, KDD'06.
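A small sketch of the pattern distance above (i.e. a Jaccard distance on the supporting transaction sets), with made-up transaction IDs:
<code python>
# Sketch of the pattern distance used for delta-clustering of patterns.
# T(P) is the set of transaction IDs supporting pattern P (toy data, made up).
def pattern_distance(t_p1, t_p2):
    """Dist(P1, P2) = 1 - |T(P1) n T(P2)| / |T(P1) u T(P2)| (Jaccard distance)."""
    return 1 - len(t_p1 & t_p2) / len(t_p1 | t_p2)

T = {
    "abc":  {1, 2, 3, 4, 5},
    "abcd": {1, 2, 3, 4},      # a super-pattern of "abc" with slightly smaller coverage
    "xyz":  {7, 8, 9},
}
print(pattern_distance(T["abc"], T["abcd"]))  # 0.2 -> could be delta-covered by "abc"
print(pattern_distance(T["abc"], T["xyz"]))   # 1.0 -> unrelated
</code>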
===== Lesson 5: Sequential Pattern Mining =====
==== Key concepts ====
* Sequential patterns, sequential pattern mining
* Sequential pattern mining algorithms: GSP, SPADE, and PrefixSpan
* Closed sequential patterns, the CloSpan algorithm
==== Sequential Pattern and Sequential Pattern Mining ====
* Sequential pattern mining has broad applications. For instance: "Buy a laptop, then a digital camera, then a smartphone, within 6 months." => could also be applied to stocks and markets.
* So we try to extract sequential patterns from vast **sequence DBs** (i.e. entries with timestamps) or **time-series DBs** (evenly spaced timestamps).
* For sequential patterns we also have the concept of **gapped** and **non-gapped**.
* **Sequential pattern mining**: given a set of sequences, find the **complete set of frequent subsequences** (satisfying a given min-support threshold).
* A **sequence** can be written for instance: < (ef) (ab) (df) c b > (eg. we use parenthesis to identify a shopping basket here for instance).
* Each **element** in a **sequence** may contain a set of **items** (also called **events**)
* Items within an element are unordered (usually listed alphabetically).
* Any "substring" keeping the parenthesis at appropriate location in a **sequence** can define a **subsequence**, for instance: is a **susequence** of
* Then given a support threshold, for instance min_sup=2, then we can find **frequent sequential pattern**
* Algorithm requirements: **efficient**, **scalable**, **finding complete sets**, **incorporating various kinds of user-specific constraints**.
* The **Apriori** property still holds for sequential patterns: //if a subsequence \(s_1\) is infrequent, none of \(s_1\)'s super-sequences can be frequent.//
* We can then develop the same kind of algorithms as before, based on the Apriori property:
- **GSP** (Generalized Sequential Patterns): cf. Srikant & Agrawal @ EDBT'96
- **SPADE** (Vertical format-based mining): cf. Zaki @ Machine Learning'00
- **PrefixSpan** (Pattern-growth method): cf. Pei, et al. @ ICDE'01
* To mine **closed sequential patterns** we have the **CloSpan** method: cf. Yan, et al. @ SDM'03
* Finally we also have **constraint-based** sequential pattern mining.
==== GSP: Apriori-Based Sequential Pattern Mining ====
* We start with a sequence database:
^ SID ^ Sequence ^
| 10 | <(bd)cb(ac)> |
| 20 | <(bf)(ce)b(fg)> |
| 30 | <(ah)(bf)abf> |
| 40 | <(be)(ce)d> |
| 50 | <a(bd)bcb(ade)> |
- We start with initial candidates: i.e. all singleton sequences: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- We scan the DB once and count support for each candidate.
- Prune infrequent candidates: say <g> and <h> here.
- Generate length-2 candidate sequences:
* as two separate elements: <aa>, <ab>, ..., <ff>,
* or as a single two-item element: <(ab)>, <(ac)>, ..., <(ef)>.
=> We repeat steps 2 to 4 until no candidate or frequent sequence can be found anymore.
=> With the pruning stage due to Apriori property we can reduce the number of candidates for the next length sequences on each step.
==== SPADE: Sequential Pattern Mining in Vertical Data Format ====
* A sequence database is mapped to a vertical format of <SID, EID> pairs (sequence ID, element ID), for instance:
^ SID ^ Sequence ^
| 1 | <a(abc)(ac)d(cf)> |
| 2 | <(ad)c(bc)(ae)> |
| 3 | <(ef)(ab)(df)cb> |
| 4 | <eg(af)cbc> |
We can build the mapping:
^ SID ^ EID ^ items ^
| 1 | 1 | a |
| 1 | 2 | abc |
| 1 | 3 | ac |
| 1 | 4 | d |
| 1 | 5 | cf |
| 2 | 1 | ad |
| ... | ... | ... |
Then we figure out where each item occurs:
^ a ^^ b ^^ ... ^
^ SID ^ EID ^ SID ^ EID ^ ... ^
| 1 | 1 | 1 | 2 | ... |
| 1 | 2 | 2 | 3 | ... |
| 1 | 3 | 3 | 2 | ... |
| 2 | 1 | 3 | 5 | ... |
| 2 | 4 | 4 | 5 | ... |
| 3 | 2 | | | ... |
| ... | ... | | | ... |
Then to build length-2 sequences we check for each sequence ID the order of the Element IDs:
^ ab ^^ ba ^^ ... ^
^ SID ^ EID(a) ^ EID(b) ^ SID ^ EID(b) ^ EID(a) ^... ^
| 1 | 1 | 2 | 1 | 2 | 3 | ... |
| 2 | 1 | 3 | 2 | 3 | 4 | ... |
| 3 | 2 | 5 | | | | ... |
| ... | ... | ... | | | | ... |
Then we can grow the subsequences (patterns) one item at a time by Apriori candidate generation.
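A small sketch of this vertical <SID, EID> representation and of the temporal join used to grow length-2 sequences (toy sequence DB, loosely following the tables above):
<code python>
# SPADE-style sketch: build (SID, EID) lists and join them to find <a b> occurrences.
# Toy sequence DB: each sequence is a list of elements (itemsets).
seq_db = {
    1: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    2: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
}

# Vertical format: item -> list of (SID, EID) occurrences (EIDs start at 1).
vertical = {}
for sid, seq in seq_db.items():
    for eid, element in enumerate(seq, start=1):
        for item in element:
            vertical.setdefault(item, []).append((sid, eid))

def seq_join(occ_x, occ_y):
    """Occurrences of <x y>: same SID, and y's element strictly after x's element."""
    return [(sid, e1, e2) for (sid, e1) in occ_x
            for (sid2, e2) in occ_y if sid == sid2 and e1 < e2]

ab = seq_join(vertical["a"], vertical["b"])
print(ab)                              # [(1, 1, 2), (2, 1, 3)]
print(len({sid for sid, _, _ in ab}))  # 2 = sequence support of <a b>
</code>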
==== PrefixSpan: Sequential Pattern Mining by Pattern-Growth ====
* Need to introduce the concepts of **prefix** and **suffix** (i.e. projection). Starting from the DB:
^ SID ^ Sequence ^
| 1 | <a(abc)(ac)d(cf)> |
| 2 | <(ad)c(bc)(ae)> |
| 3 | <(ef)(ab)(df)cb> |
| 4 | <eg(af)cbc> |
We build the prefix/suffix DB:
^ Prefix ^ Suffix (Projection) ^
| <a> | <(abc)(ac)d(cf)> |
| <aa> | <(_bc)(ac)d(cf)> |
| <ab> | <(_c)(ac)d(cf)> |
* Given a **sequence** we can thus extract a set of **prefixes** and **suffixes** (i.e. prefix-based projections).
* PrefixSpan mining method:
- Find length-1 sequential patterns: e.g. <a>, <b>, <c>, <d>, <e>, <f>
- Then divide the search space to mine each projected DB:
* <a>-projected DB, <b>-projected DB, ..., <f>-projected DB.
- In each projected DB, we find the length-1 frequent patterns, then we combine them with the prefix to get length-2 frequent sequential patterns, for instance: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, etc.
=> The major advantages of PrefixSpan are:
* No candidate subseqs to be generated,
* Projected DBs keep shrinking.
* Major cost of PrefixSpan: **constructing projected DBs**
* suffixes are largely repeating in recursive projected DBs.
* When the DB can be held in main memory we can use **pseudo-projection**:
* Given s = <a(abc)(ac)d(cf)>:
* the <a>-projection can be noted **s|<a>: (, 2)** (same as s, but starting at position 2),
* the <ab>-projection can be noted **s|<ab>: (, 4)** (same as s, but starting at position 4).
* => No physical copy required: we use a pointer to the sequence and apply offsets.
* When the DB does not fit in memory, we have to do a physical projection until the data can fit in memory.
* For efficient mining both physical and pseudo projections should be used together.
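A minimal PrefixSpan sketch of the projection-and-growth idea above; to keep it short it assumes every element is a single item (no parenthesized itemsets) and uses physical projections only:
<code python>
# Minimal PrefixSpan sketch. Simplifying assumption: every element is a single item,
# so a sequence is just a list like ["a", "b", "c"] (no itemset elements).
def prefixspan(db, min_sup, prefix=None, results=None):
    prefix = prefix or []
    results = {} if results is None else results
    # Count, for each item, in how many sequences of the (projected) DB it appears.
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        new_prefix = prefix + [item]
        results[tuple(new_prefix)] = sup
        # Projected DB: suffix after the first occurrence of `item` in each sequence.
        projected = []
        for seq in db:
            if item in seq:
                suffix = seq[seq.index(item) + 1:]
                if suffix:
                    projected.append(suffix)
        prefixspan(projected, min_sup, new_prefix, results)
    return results

db = [["a", "b", "c", "d"], ["a", "c", "b"], ["a", "b", "d"], ["b", "c", "d"]]
print(prefixspan(db, min_sup=2))
</code>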
==== CloSpan: Mining Closed Sequential Patterns ====
* [def] **closed sequential pattern** s: //There exists no superpattern s' such that \(s' \supset s\) and s' and s have the same support.//
* For instance, given <abc>: 20, <abcd>: 20 and <abcde>: 15 => <abcd> and <abcde> are **closed sequential patterns** (but <abc> is not, since its super-pattern <abcd> has the same support).
* Why trying to mine closed sequential patterns ?
- Reduce the number of (redundant) patterns,
- While keeping the same expressive power
* **Property** \(P_1\): //if \(s' \supset s\) and the projected DBs of s and s' have the same size, then growing s and growing s' produce the same patterns with the same supports.//
* From this property we can develop two pruning techniques, **backward subpattern** and **backward superpattern** pruning, to remove redundant search space.
* **CloSpan** was developed on this principle and greatly enhances efficiency. cf. Yan, et al. SDM'03
===== Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns =====
==== Key concepts ====
* Spatial association, progressive deepening
* Spatial colocation pattern, colocation rule
* Partition-based trajectory pattern mining, t-pattern
* Moving object clusters: flock, convoy, and swarm
* Trajectory clustering, partition-and-group framework
* Semantics-rich movement patterns
* Periodic movement patterns
==== Mining Spatial Associations ====
* Spatial frequent patterns and association rules are also in the form: **A => B [s%, c%]**
* In the rule format above, A and B are sets of **spatial or non-spatial predicates**, for instance:
* Topological relations: intersects, overlaps, disjoint, etc.
* Spatial orientation: left_of, west_of, under, etc.
* Distance information: close_to, within_distance, etc.
* Example: is_a(X, large_town) ^ intersect(X, highway) -> adjacent_to(X, water) [7%, 85%]
* We usually want to explore **spatial autocorrelation** (because spatial data tends to be highly self-correlated: nearby things are more related than distant ones).
* When mining spatial associations, we can apply the **progressive refinement** heuristic:
* Because we can derive a hierarchy of spatial relationships: close_to can be seen as a generalization of near_by, touch, intersect, contain, ...
* So we could start only with close_to, and if it is frequent, then try to **refine** with the other relationships.
* => **Progressive refinement**: //First search for rough relationship, and then refine it.//
* Thus, typical 2-steps mining of spatial associations:
- Rough spatial computation as a filter: using **MBR** (Minimum Bounding Rectangle) or **R-tree** for rough estimation.
- Detailed spatial algorithm (as refinement) : apply only to the objects which have passed the rough association test (eg. those passing the min_support threshold).
==== Mining Spatial Colocation Patterns ====
* [def] **Colocation pattern**: //A group of spatial features or events that are frequently co-located in the same region.//
* On a graph, we could for instance use edges to denote the co-location of nearby points.
* [def] Concept of **Rowset(C)**: //the set of neighbor-sets L such that every feature in pattern C appears as a feature of an instance in L.// FIXME: Try to understand and explain this further.
* From the concept of Rowset we can build a colocation rule R: A -> B, with conditional probability cp(R) defined as:
\[cp(R) = \frac{|\{L \in rowset(A) | \exists L' s.t. (L \subseteq L') \land (L' \in rowset(A \cup B))\}|}{|rowset(A)|}\]
* Following example from video lectures (currently not really understanding this): \(cp(\{A,B\} \rightarrow \{C, D\}) = |rowset({A,B,C,D})| / |rowset({A,B})| \)
* [def] **participation ratio** pr(C,f): //probability that C is observed in a neighbor-set wherever feature f is observed//:
\[pr(C,f) = \frac{|\{r | (r \in S) \land (r. f = f) \land (\text{r is in a row instance of C})\}|}{|\{r | (r \in S) \land (r.f = f)\}|}\]
* **Property**: Monotonicity of participation ratio: //let C, C' be two co-location patterns such that \(C' \subset C\) , then for each feature \(f \in C': pr(C',f) \ge pr(C,f)\).//
* Based on previous monotonicity principle, an Apriori-like algorithm can be derived for efficient mining of colocation patterns:
- let min-feature-support = \(\sigma\), and min-pr = \(\rho\)
- Start with a set of single feature pattern \(\{p_1\}\) with support \(\ge \sigma\)
- Grow to size k, in Apriori way (i.e, stop growing if the pattern is infrequent)
- For each such pattern p, mine its super-pattern P, such that \(pr(P, p) \ge \rho\) in an Apriori way.
==== Mining and Aggregating Patterns over Multiple Trajectories ====
* **Partition-Based Trajectory Pattern Mining** (eg. Mining T-Patterns) (cf. F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi, Trajectory Pattern Mining, KDD’07)
* Partition the space into equal-width grids and obtain **Regions of Interest** (RoIs),
* Then transform each input trajectory into a **Time-annotated symbolic sequence**, for instance:
* Railway Station: [15min] -> Castle Square, [2h15min] -> Museum
* Railway Station: [10min] -> Middle Bridge, [10min] -> Campus
* Then use constraint-based sequential pattern mining to find trajectory patterns.
* Then matching T-patterns such as: \((x_0, y_0) - \alpha_1 \rightarrow (x_1,y_1)\) (??)
* **Detecting Moving Objects clusters**:
* **Flock and convoy**: Both require k **consecutive** time stamps.
* **Flock**: at least m entities are within a circular region of **radius r** and move in the same direction.
* **Convoy**: Density-based clustering at each timestamp, no need to use a rigid circle.
* **Swarm**: Moving objects may not be close to each other for all consecutive timestamps. Efficient pattern mining algorithm can be derived for mining swarm (cf. Z. Li, et al., Swarm: Mining Relaxed Temporal Moving Object Clusters, VLDB'10)
* **Trajectory Clustering**: **A Partition and Group Framework** (cf. J.-G. Lee, et al., Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07)
* Grouping trajectories as a whole => cannot find similar portions of trajectories.
* Solution: discover common **sub-trajectories**, e.g. to forecast hurricane landfall.
* 2 phases:
* **partitioning**: creating sequences of segments,
* then **grouping**: may find patterns in the segments created above.
* Partitioning can be done using the minimum description length (MDL) principle: i.e. reduce the number of points in the paths, while avoiding too much error (i.e. real points too far away from the approximating segment).
==== Mining Semantics-Rich Movement Patterns ====
* **Frequent Movement Pattern**: a movement sequence that frequently appears in the input trajectory database (even if we cannot rely on precise spatial locations), for instance: //Home -> Starbucks -> Office -> Restaurant -> Cafe -> Restaurant -> Bar//
* To mine Frequent movement patterns we may need to group similar places to be able to find frequent subsequences.
* 2-steps top-down mining approach:
- Find a set of coarse patterns that reflect people's semantics-level transitions (office -> restaurant, home -> gym)
- Split each coarse pattern into several fine-grained ones by grouping similar movement snippets.
* cf. C. Zhang et al., Splitter: Mining Fine-Grained Sequential Patterns in Semantic Trajectories, VLDB'14.
==== Mining Periodic Movement Patterns ====
* **Pattern discovery in sparse data** (for instance, finding bird flight patterns).
* Periodicity might show up in some reference **spots** (or **cluster centers**)
* Reference spots can be detected using **density-based method**
* Periods are detected for each reference spot using **Fourier Transform** and **auto-correlation**
* cf. Z. Li, et al., Mining Periodic Behaviors for Moving Objects, KDD'10
* cf. Z. Li, et al., ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 2015
==== Recommended Readings ====
* F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi: Trajectory Pattern Mining. KDD’07
* Y. Huang, S. Shekhar, H. Xiong: Discovering colocation patterns from spatial data sets: A general approach. IEEE Trans. on Knowledge & Data Eng., 16(12), 2004
* Y. Huang, J. Pei, H. Xiong: Mining Co-Location Patterns with Rare Events from Spatial Data Sets. GeoInformatica 10(3): 239-260, 2006
* K. Koperski, J. Han: Discovery of Spatial Association Rules in Geographic Information Databases. SSD’95
* J.-G. Lee, J. Han, and K.-Y. Whang: Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07
* Z. Li, B. Ding, J. Han, R. Kays: Swarm: Mining Relaxed Temporal Moving Object Clusters. VLDB’10
* Z. Li, B. Ding, J. Han, R. Kays, P. Nye: Mining Periodic Behaviors for Moving Objects. KDD’10
* Z. Li, J. Wang and J. Han, ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 27(5): 1219-1232, 2015
* C. Zhang, J. Han, L. Shou, J. Lu, and T. La Porta: Splitter: Mining Fine Grained Sequential Patterns in Semantic Trajectories. VLDB'14
===== Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data =====
==== Key concepts ====
* Phrase mining; training-based phrase mining vs. topic modeling-based phrase mining
* TNG, TurboTopic; KERT; criteria to judge quality of phrases
* ToPMine; word collocation; phrasal segmentation; PhraseLDA; SegPhrase
==== From Frequent Pattern Mining to Phrase Mining ====
* Important concept when mining text data is to mine **quality phrases**
* Unigrams (eg. single words) vs. Phrases:
* Unigrams are often **ambiguous**
* Phrase: a natural, meaningful, unambiguous semantic unit
* Thus, it would be interesting to mine semantically meaningful phrases:
* Transforming text data from **word granularity** to **phrase granularity**,
* Enhancing the power and efficiency of manipulating unstructured data.
* How do we go from frequent patterns to phrases ?
* **General principle**: Exploit information redundancy and data-driven criteria to determine phrase boundaries and salience.
* **Methodology**:
- Frequent pattern mining and collocation analysis
- Phrasal segmentation
- Quality phrase assessment
* Recent developments of phrase mining:
* **ToPMine**: Mining quality phrase without training (A. El-Kishky, et al., 2015)
* **SegPhrase**: Mining quality phrase with tiny training sets (J. Liu, et al., 2015)
==== Previous Phrase Mining Methods ====
* Phrase mining came from the NLP community as **chunking**:
* Modeling phrase extraction as a sequence labeling problem (B-NP, I-NP, O, ...)
* Previous methods required annotation and training:
* Annotate hundreds of documents as training data
* Train a supervised model based on part-of-speech features
* Recent trend:
* use distributional features based on **web n-grams** (Bergsma et al., 2010)
* state-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score.
* Limitations:
* high annotation cost, not scalable to a new language, a new domain/genre.
* May not fit domain-specific, dynamic, emerging applications (such as scientific domains, query logs, or social media)
* Thus trying to perform unsupervised phrase mining and topic modeling instead.
* Many studies of unsupervised phrase mining are linked with **topic modeling**.
* Each **topic** is represented by a word distribution.
* **Documents** are represented by multiple topics in different proportions.
* => This approach does not require any prior annotations or labeling of the documents.
* With topic modeling we rather rely on **statistical topic modeling algorithms**:
* The most common algorithm: LDA (Latent Dirichlet Allocation) [cf. Blei, et al., 2003]
* 3 strategies to perform phrase mining using topic modeling:
* Strategy 1: simultaneously infer the topics and the phrase tokens (from generating a bag of words to generating a sequence of tokens),
* Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
* Strategy 3: Prior bag-of-words model inference, mine phrases and impose on the bag-of-words model.
=== Strategy 1: Simultaneously inferring Phrases and Topics ===
* **Bigram Topic Model** [cf. Wallach'06]
* Probabilistic generative model that conditions on previous word and topic when drawing next word.
* **Topical N-Grams** (TNG) [Wang, et al.'07] (a generalization of Bigram Topic Model)
* Probabilistic model that generates words in textual order
* Create n-grams by concatenating successive bigrams
* **Phrase-Discovering LDA** (PDLDA) [Lindsey, et al.'12]
* Viewing each sentence as a time-series of words, PDLDA posits that the generative parameter (topic) changes periodically.
* Each word is drawn based on previous m words (context and current phrase topic).
* Limitations:
* High model complexity: tends to overfit
* high inference cost: slow
=== Strategy 2: Post Topic-Modeling Phrase Construction: TurboTopics (I) ===
* **TurboTopics** [Blei & Lafferty'09]: Phrase construction as a post-processing step to Latent Dirichlet Allocation.
- Perform LDA on corpus to assign each token a topic label.
- Merge adjacent unigrams with the same topic label by a distribution-free permutation test on arbitrary-length back-off model.
- End recursive merging when all significant adjacent unigrams have been merged.
=== Strategy 2: Post Topic-Modeling Phrase Construction: KERT (II) ===
* **KERT** [Danilevsky et al.'14]: Phrase construction as a post-processing step to LDA
- Run bag-of-words model inference and assign topic label to each token
- Perform **frequent pattern mining** to extract candidate phrases within each topic
- Perform **phrase ranking** based on four different criteria:
* **Popularity**: e.g. "information retrieval" vs. "cross-language information retrieval"
* **Concordance**: eg. "powerful tea" vs. "strong tea"
* **Informativeness**: eg. "this paper" (frequent but not informative)
* **Completeness**: "vector machine" vs "support vector machine"
==== ToPMine: Phrase Mining without Training Data ====
=== Strategy 3: First Phrase Mining then Topic Modeling ===
* Why in this order ?
* If we have for instance "Knowledge discovery using least square support vector machine classifiers..." (with the words assigned to different topics, shown in blue and red in the lecture), we won't be able to generate the proper phrases by inferring topics on unigrams first.
* With phrase mining first we will first group the phrase elements: "[Knowledge discovery] using [least square] [support vector machine] [classifiers]...", then apply the topics on the groups.
* Techniques for this strategy:
* First: phrase mining => document segmentation and then phrase ranking,
* Then Topic model inference with phrase constraint.
* **Phrase mining** stage:
- **Frequent contiguous pattern mining**: Extract candidate phrases and their counts.
- Perform agglomerative merging of adjacent unigrams as guided by a **significance score**.
- Perform document segmentation to count phrase occurrences (and calculate the **rectified**, i.e. true, phrase frequency),
- Do phrase ranking (using the criteria proposed in KERT): popularity, concordance, informativeness, completeness.
- Finally do phrase-based topic modeling:
* mined bags-of-phrases are passed as input to **PhraseLDA**, an extension of LDA that constrains all words in a phrase to share the **same latent topic**.
=> This method essentially performs collocation mining:
* Collocation: a sequence of words that occur **more frequently than expected**
* Often **interesting**: collocations convey information not carried by their constituent terms.
* Many different measures can be used to extract collocations from a corpus [cf. Dunning 93, Pederson 96], for instance: mutual information, t-test, z-test, chi-squared test, likelihood ratio.
* Pointwise Mutual Information: \(PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)}\) (see the sketch below)
* Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm.
* => ToPMine is efficient and generates high-quality topics and phrases without any training data.
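A tiny sketch of a PMI computation from corpus counts (the counts below are made up):
<code python>
# Pointwise mutual information of a bigram, from corpus counts (toy counts, made up).
from math import log

def pmi(count_xy, count_x, count_y, total_words):
    """PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total_words
    p_x, p_y = count_x / total_words, count_y / total_words
    return log(p_xy / (p_x * p_y))

# A bigram occurring far more often than chance -> high PMI:
print(pmi(count_xy=80, count_x=120, count_y=150, total_words=100_000))  # ~6.1
# A bigram occurring about as often as chance predicts -> PMI near 0:
print(pmi(count_xy=2, count_x=1000, count_y=200, total_words=100_000))  # 0.0
</code>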
==== SegPhrase: Phrase Mining with Tiny Training Sets ====
* A small set of training data may enhance the quality of phrase mining (cf. J. Liu et al., Mining Quality Phrases from Massive Text Corpora, In SIGMOD'15)
* SegPhrase+: The Overall Framework
* **ClassPhrase**: Frequent pattern mining, feature extraction, classification.
* **SegPhrase**: Phrasal segmentation and phrase quality estimation.
* **SegPhrase+**: One more round to enhance mined phrase quality.
=== Pattern Mining and Feature Extraction ===
* Pattern Mining for Candidate Set:
* Build a candidate phrases set by frequent pattern mining
* Mining frequent k-grams (k is typically small, e.g. 6 in the experiments)
* **Popularity** is measured by the raw frequency of words and phrases mined from the corpus.
* **Feature Extraction**, based on **Concordance**
* Partition a phrase into two parts to check whether the co-occurrence is significantly higher than pure chance.
* **Feature Extraction**, based on **Informativeness**
* Quality phrases typically start and end with a non-stopword
* Use average IDF over words in the phrase to measure the semantics
* Usually, the probability of a quality phrase appearing in quotes, brackets, or connected by dashes is higher (punctuation information), e.g. "state-of-the-art".
* **Classification**:
* Using a tiny training set (about 300 labels for 1 GB of text; phrases extracted from a knowledge base can also be used),
* Label: indicating whether a phrase is a high quality one or not: "support vector machine": 1, "the experiment shows": 0
* Construct models to distinguish quality phrases from poor ones
* Use Random Forest algorithm to bootstrap different datasets with limited labels.
* **Phrasal segmentation**: can tell which phrase is more appropriate
* Partition a sequence of words by maximizing the likelihood,
* Consider length penalty and filter out phrases with low rectified frequency.
* So the process is repeated twice:
* Classification --> Phrasal segmentation (=> **SegPhrase**)
* --> Classification --> Phrasal segmentation (=> **SegPhrase+**)
==== Recommended Readings ====
* S. Bergsma, E. Pitler, D. Lin, Creating robust supervised classifiers via web-scale n-gram data, ACL’2010
* D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
* D.M. Blei, A. Y. Ng, M. I. Jordan, J. D. Lafferty, Latent Dirichlet allocation. JMLR 2003
* K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
* M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM’14
* A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
* R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL’12.
* J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
* A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB’10
* X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. ICDM’07
===== Lesson 8: Advanced Topics on Pattern Discovery =====
==== Key concepts ====
* Stream data mining; mining frequent patterns in data streams; the lossy counting algorithm
* Software bug mining; pattern discovery for software bug mining; mining copy-and-paste bugs
* Image analysis; pattern mining for image analysis; visual items; visual patterns
* Three categories on privacy issues in data mining: input privacy, output privacy and owner privacy, k-anonymity and differential privacy; data perturbation for privacy-preserving pattern mining
* Invisible data mining; new application exploration
==== Frequent Pattern Mining in Data Streams ====
* **Data streams**: continuous, ordered, changing, fast, huge volume.
* Random access is expensive: **single-scan algorithms** are needed.
* Most streams are low-level and multi-dimensional
* Stream Data Processing: **Stream Data Management System** (SDMS)
* **Stream mining** vs **Stream query**
* For mining, we want to see the big picture (i.e. less precision is acceptable).
* **Mining tasks**
* Pattern mining in streams,
* clustering,
* etc. (many more).
* Mining **precise** frequent patterns is unrealistic
* So we look for **approximate answers**
* How to mine frequent patterns with **good approximation** ?
* **Lossy Counting Algorithm** [Manku & Motwani, VLDB'02] (a sketch is given at the end of this section):
* => discard items with a very low support count.
* Given support threshold \(\sigma\), error threshold \(\epsilon\) and stream length N
* Output: items with frequency counts exceeding \((\sigma - \epsilon) N\)
* We undercount by: **Frequency count error** \(\le \epsilon N\)
* Approximation properties:
* No false negatives,
* False positives have true frequency count at least \((\sigma - \epsilon) N\)
* Frequency count underestimated by at most \(\epsilon N\)
* **Other issues**:
* Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)
* Mining approximate frequent k-itemsets in data streams
* Mining sequential patterns in data streams
* Recommended readings:
* G. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams”, VLDB’02
* A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, ICDT'05
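A compact sketch of the lossy counting algorithm for single items (the bucket width and pruning rule follow the standard description; the stream here is a toy list):
<code python>
# Lossy counting sketch for frequent items in a stream (single items only).
from math import ceil

def lossy_counting(stream, sigma, epsilon):
    width = ceil(1 / epsilon)   # bucket width
    counts, deltas = {}, {}     # counts: approximate count; deltas: maximum undercount
    n, bucket_id = 0, 1
    for item in stream:
        n += 1
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket_id - 1
        if n % width == 0:      # end of a bucket: prune low-count entries
            for it in list(counts):
                if counts[it] + deltas[it] <= bucket_id:
                    del counts[it], deltas[it]
            bucket_id += 1
    # Output items whose (under-)count exceeds (sigma - epsilon) * N.
    return {it: c for it, c in counts.items() if c >= (sigma - epsilon) * n}

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"] * 50
print(lossy_counting(stream, sigma=0.2, epsilon=0.02))  # keeps "a" and "b"
</code>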
==== Pattern Discovery for Software Bug Mining ====
* Finding bugs is challenging: often no clear specifications or properties, requires substantial human efforts in analyzing data.
* Software reliability analysis:
* Static bug detection: check the code
* Dynamic bug detection or testing: run the code
* Debugging: Given symptoms or failure, pinpoint the bug locations in the code.
* Why pattern mining ?
* Code or running sequences contain hidden patterns.
* Common patterns -> likely specification or property
* Violations (anomalies comparing to patterns) -> likely bugs
* Mining patterns to narrow down the scope of inspection:
* => code locations or predicates that occur more in failing runs and less in passing runs are suspicious bug locations.
* **Mining rules from source code**
* Bugs as deviant behavior (eg. by statistical analysis)
* Mining programming rules (eg. by frequent itemset mining)
* Mining function precedence protocols (eg. by frequent subsequence mining)
* Revealing neglected conditions (eg. by frequent itemset/subgraph mining)
* **Mining rules from revision histories**: by frequent itemset mining
* **Mining copy-paste patterns from source code**
* Find copy-paste bugs: CP-Miner (cf. Z. Li, S. Lu, S. Myagmar, Y. Zhou, "CP-Miner: A Tool for Finding Copy-Paste and Related Bugs in Operating System Code", OSDI'04)
=== Mining copy-and-paste bugs ===
* Copy-pasting is common:
* 12% in Linux (kernel ?) code
* 19% in X Window (server ?) code
* Copy-pasting is error-prone:
* Can mine "forget-to-change" bugs by sequential pattern mining.
* **Building sequence database** from source code:
* Tokenize each component:
* Different operators, constants, key words -> different tokens
* Same type of identifiers -> same token
* for instance: "old = 3;" -> "5 61 20"; and "new = 3;" -> "5 61 20"
* Then **compute hash**: "5 61 20" -> 16
* So for a list of statements we produce a list of hashes: 65, 16, 16, 71, ...
* **Detecting the "Forget-to-change" bugs**:
* Modification to the sequence pattern mining algorithm:
* **Constrain the max gap**: (16, 16, 71) ..... (16, 16,10,71)
* Then combine neighboring copy-pasted segments repeatedly,
* Then find conflicts: i.e. identify names that cannot be mapped to any corresponding name, and introduce an **unchanged ratio**.
* If 0 < unchanged ratio < threshold, then report the location as a potential bug.
* CP-Miner reported many copy-paste bugs in Linux, Apache, etc. (analyzing millions of lines of code).
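A tiny sketch of the tokenize-and-hash idea used to build the sequence database from source code (the token IDs and the hashing scheme below are made up for illustration, not CP-Miner's actual ones):
<code python>
# Sketch: map statements to token sequences, then hash each statement to one number,
# so that copy-pasted code (with renamed identifiers) yields identical hash sequences.
import re

TOKENS = {"=": 61, ";": 20, "IDENT": 5, "NUMBER": 9}   # made-up token IDs

def tokenize(statement):
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", statement):
        if tok.isidentifier():
            out.append(TOKENS["IDENT"])    # all identifiers map to the same token
        elif tok.isdigit():
            out.append(TOKENS["NUMBER"])
        else:
            out.append(TOKENS.get(tok, 0))
    return out

def statement_hash(statement):
    return hash(tuple(tokenize(statement))) % 100

# "old = 3;" and "new = 3;" produce the same token sequence, hence the same hash:
print(tokenize("old = 3;"), tokenize("new = 3;"))
print(statement_hash("old = 3;") == statement_hash("new = 3;"))  # True
</code>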
==== Pattern Discovery for Image Analysis ====
* An image can be characterized by **visual primitives** (eg. **interest points**)
* Each visual primitive can be described by visual feature (eg. high-dimensional feature vector)
* Each image is a collection of visual primitives
=== Visual Patterns Discovery ===
* Visual primitives can be clustered into visual "items" (similar visual primitives belong to the same item)
* Each visual primitive finds its k-nearest neighbor in the image to form a "visual transaction"
* Then mining "frequent itemsets" leads to sementically meaningful visual patterns.
* So the process is:
- images -> feature extraction -> Visual primitives
- Visual primitives -> feature quantization -> Induced visual transactions
- Induced Visual transactions -> pattern discovery -> Visual patterns
* **Challenges**
* Images are spatial data
* Spatial configuration among the visual items matters
* Induced transactions may overlap with each other => over-counting problem.
* **Uncertainties of visual items**
* Noisy clustering of visual primitives into visual items affects visual pattern
* Visual synonym and polysemy
=== Recommended Readings ===
* Hongxing Wang, Gangqiang Zhao, Junsong Yuan, Visual pattern discovery in image and video data: a brief survey, Wiley Interdisciplinary Review: Data Mining and Knowledge Discovery 4(1): 24-37 (2014)
* Hongxing Wang, Junsong Yuan, Ying Wu, Context-Aware Discovery of Visual Co-Occurrence Patterns. IEEE Transactions on Image Processing 23(4): 1805-1819 (2014)
* Gangqiang Zhao, Junsong Yuan, Discovering Thematic Patterns in Videos via Cohesive Sub-graph Mining. ICDM 2011: 1260-1265
* Junsong Yuan, Ying Wu, Ming Yang, From frequent itemsets to semantically meaningful visual patterns. KDD 2007: 864-873
==== Advanced Topics on Pattern Discovery: Pattern Mining and Society-Privacy Issue ====
* Potential adverse side-effects of data mining
* Privacy and accuracy are typically contradictory in nature.
* Improving one often incurs a cost on the other.
* Three categories of privacy issues:
* Input privacy: distort or hide data
* Output privacy (eg. knowledge hiding)
* Owner privacy: hide owner/source information from data.
* **Ensuring Input Privacy**:
* Approach 1: B2B (business-to-business) environment.
* Approach 2: B2C (business-to-customer) environment.
* **K-anonymity privacy** requirement: each equivalence class contains at least k records.
* **Data perturbation**:
* **Statistical distortion**: using randomization algorithms
* MASK [Rizvi & Haritsa VLDB'02]
* Flip each 0/1 bit with probability p (flipping 0s to 1s can add many spurious items),
* Tune p carefully to achieve acceptable average privacy and good accuracy.
* Cut-and-paste (C&P) operator [Evfimievski et al. KDD'02]
* Uniform randomization: Each existing item in the real transaction is, with probability p, replaced with a new item not present in the original transaction.
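A minimal sketch of MASK-style bit flipping (each 0/1 bit of the transaction matrix is flipped with probability p, which has to be tuned to balance privacy and accuracy):
<code python>
# MASK-style randomization sketch: flip every 0/1 bit of the transaction matrix
# with probability p (p must be tuned to trade privacy against mining accuracy).
import random

def mask_perturb(transactions, p, seed=42):
    rng = random.Random(seed)
    return [[bit ^ (1 if rng.random() < p else 0) for bit in row]
            for row in transactions]

# Toy 0/1 transaction matrix: rows = transactions, columns = items.
db = [[1, 0, 0, 1],
      [0, 1, 1, 0],
      [1, 1, 0, 0]]
print(mask_perturb(db, p=0.1))
</code>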
==== Recommended Readings ====
* R. Agrawal and R. Srikant, Privacy-preserving data mining, SIGMOD'00
* C. C. Aggarwal and P. S. Yu, Privacy-Preserving Data Mining: Models and Algorithms, Springer, 2008
* C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. 2014
* A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In KDD'02
* A. Gkoulalas-Divanis, J. Haritsa and M. Kantarcioglu, Privacy in Association Rule Mining, in C. Aggarwal and J. Han (eds.), Frequent Pattern Mining, Springer, 2014 (Chapter 15)
* N. Li, T. Li, S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. ICDE'07
* A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, TKDD 2007
* S. Rizvi and J. Haritsa. Maintaining data privacy in association rule mining. VLDB’02
* J. Vaidya, C. W. Clifton and Y. M. Zhu, Privacy Preserving Data Mining, Springer, 2010
==== Advanced Topics on Pattern Discovery: Looking Forward ====
* Lots of research issues on pattern discovery are still waiting to be solved.
* **Invisible Pattern Mining** can/could be used in many fields.