====== Pattern Discovery in Data Mining ======
/*
Coursera introduction text:
Hi everyone!
I'm Emmanuel ROCHE, but everyone calls me "Manu" ;-). I'm French, and I currently live near Toulouse, in France. I have no serious experience with data mining or pattern discovery yet, but I'm a fast learner, and I believe this field could be quite interesting. So I decided to give it a try here.
*/
Course page: https://www.coursera.org/learn/data-patterns/home/welcome\\
Teacher: **Mr. Jiawei Han**\\
Textbook: **Data Mining: Concepts and Techniques** by Jiawei Han, Micheline Kamber and Jian Pei (2011) [Morgan Kaufmann]
* Course structure:
* Lesson 1: Pattern Discovery: Basic Concepts
* Lesson 2: Efficient Pattern Mining Methods
* Lesson 3: Pattern Evaluation
* Lesson 4: Mining Diverse Frequent Patterns
* Lesson 5: Sequential Pattern Mining
* Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns
* Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data
* Lesson 8: Advanced Topics on Pattern Discovery
* Basic concepts of pattern discovery: **frequent pattern**, **closed pattern**, **max-pattern**, and **association rules**.
* Efficient pattern mining methods: **Apriori**, **ECLAT**, and **FPgrowth**.
* Pattern evaluation measures: **lift**, **chi-square**, **cosine**, **Jaccard**, and **Kulczynski**.
* Methods for mining sequential patterns: **GSP**, **SPADE**, **PrefixSpan**, and **CloSpan**.
* Pattern-based classifications: **CBA**, **CMAR**, **PatClass**, **DDPMine**, and **DPClass**.
===== Lesson 1: Pattern Discovery Basic Concepts =====
==== Key questions ====
* Why is pattern discovery an important task in data mining?
* What are the concepts of frequent itemsets?
* What are the concepts of association rules? How do you generate association rules from frequent itemsets?
* Why do we need compressed representations of frequent patterns? Why do we need two kinds of compressed representations: closed patterns and max-patterns?
==== Definitions ====
* **Pattern**: a **set** of **items**, **subsequences** or **substructures** that occur frequently together (or are strongly correlated) in a data set. Patterns represent **intrinsic** and **important properties** of datasets.
* **Pattern discovery**: trying to **uncover** patterns from massive data sets. Finding **inherent regularities** in a dataset.
* Pattern discovery is the foundation for many essential data mining tasks:
- Association, correlation and causality analysis,
- Mining sequential and structural (e.g. sub-graph) patterns,
- Pattern analysis in spatiotemporal, multimedia, time-series and stream data,
- Classification: Discriminative pattern-based analysis
- Cluster analysis: Pattern-based subspace clustering
* **Itemset**: A set of one or more items.
* **k-itemset**: \(X = \{x_1,\cdots,x_k\}\)
* **(absolute) support (count)** of X: //the number of transactions (dataset entries) that contain the itemset X//. (**Note**: X here can be an itemset of any size — 1-itemset, 2-itemset, etc. — even if each entry in the dataset is, say, a 5-itemset; the example in the lecture counts the occurrences of a single item \(x_i\) inside the transactions, which is simply the support of the 1-itemset \(\{x_i\}\).)
* **(relative) support** of X: //the fraction of transactions [or entries] in the dataset that contain X// (i.e. the **probability** that a transaction contains X).
* An itemset X is **frequent** if its support is no less than a given **minimum support threshold** (denoted \(\sigma\)).
* **Association rule**: we write the association rule \(X \rightarrow Y\ (s,c)\), stating that an entry containing the itemset X also contains the itemset Y, with support **s** and confidence **c**.
* Above, **s** is the support of \(X \cup Y\): i.e. the probability that an entry contains \(X \cup Y\).
* **c** is the conditional probability that an entry containing X also contains Y: \(c = \frac{sup(X \cup Y)}{sup(X)}\)
* **Subtle notation**: the entries that contain \(X \cup Y\) are actually the **intersection** of the entries containing X and the entries containing Y (not the **union**!)
* **Association rule mining**: find **all** the rules \(X \rightarrow Y\) that satisfy given minimum support and confidence thresholds.
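To make these definitions concrete, here is a minimal Python sketch (the toy transactions are made up, not from the course) computing relative support and the confidence of a rule:
<code python>
# Minimal sketch of support / confidence computations (toy data, not from the course).
transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

def abs_support(itemset, db):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

def rel_support(itemset, db):
    """Relative support: fraction of transactions containing the itemset."""
    return abs_support(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """Confidence of rule lhs -> rhs: sup(lhs u rhs) / sup(lhs)."""
    return abs_support(lhs | rhs, db) / abs_support(lhs, db)

X, Y = {"beer"}, {"diaper"}
print(rel_support(X | Y, transactions))   # s = 0.6
print(confidence(X, Y, transactions))     # c = 1.0
</code>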
==== Challenge ====
With the definitions above we often have **too many** frequent patterns.
* Solution 1: **Closed patterns**: //A pattern (itemset) X is **closed** if X is frequent and there exists no super-pattern \(Y \supset X\) with the same support as X//.
* => **Closed patterns** are a **lossless compression** of frequent patterns (i.e. we reduce the number of patterns without losing the support information).
* Solution 2: **Max-patterns**: //A pattern X is a **max-pattern** if X is frequent and there exists no frequent super-pattern \(Y \supset X\).//
* => **Max-patterns** are a **lossy compression** of frequent patterns: we only know that a given sub-pattern is frequent, but we no longer know its actual support.
=> In many applications, mining **closed patterns** is more desirable than mining **max-patterns**.
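A small sketch (with made-up supports) of how closed and max patterns can be identified once the frequent itemsets and their supports are known:
<code python>
# Sketch: identify closed and max patterns among frequent itemsets (toy supports).
# frequent: dict mapping frozenset -> support (assumed already mined).
frequent = {
    frozenset("a"): 4, frozenset("b"): 4, frozenset("c"): 3,
    frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2, frozenset("abc"): 2,
}

def closed_patterns(frequent):
    # X is closed if no proper superset Y has the same support.
    return {X for X, s in frequent.items()
            if not any(X < Y and frequent[Y] == s for Y in frequent)}

def max_patterns(frequent):
    # X is a max-pattern if no proper superset is frequent at all.
    return {X for X in frequent if not any(X < Y for Y in frequent)}

print(sorted("".join(sorted(p)) for p in closed_patterns(frequent)))  # ['a', 'ab', 'abc', 'b', 'c']
print(sorted("".join(sorted(p)) for p in max_patterns(frequent)))     # ['abc']
</code>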
==== Recommended Readings ====
* "[[http://www.rakesh.agrawal-family.com/papers/sigmod93assoc.pdf|Mining association rules between sets of items in large databases]]" (R. Agrawal, T. Imielinski and A. Swami, in Proc. of SIFMOD'93)
* "Efficiently mining long patterns from databases" (R. J. Bayardo, in Proc. of SIGMOD'98)
* "Discovering frequent closed itemsets for association rules" (N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, in Proc. of ICDT'99)
* "Frequen Pattern Mining: Current Status and Future Directions" (J. Han, H. Cheng, D. Xin and X. Yan, in Data Mining and Knowledge Discovery, 15(1): 55-86, 2007)
==== Formal Model definition ====
Let \(\mathcal{I} = \{I_1,I_2,\dots,I_m\}\) be a set of binary attributes, called items. Let \(T\) be a database of transactions. Each transaction \(t\) is represented as a binary vector, with \(t[k] = 1\) if \(t\) bought the item \(I_k\), and \(t[k] = 0\) otherwise.
FIXME: Complete this section.
===== Lesson 2: Efficient Pattern Mining Methods =====
==== Key concepts ====
* The downward closure property or the Apriori property of frequent patterns
* The Apriori algorithm
* ECLAT: Mining frequent patterns by exploring vertical data format
* FPGrowth: A frequent pattern-growth approach
* Closet+ for mining closed patterns
==== Downward Closure Property of Frequent Patterns ====
* **Downward closure** property (also called **Apriori**): //Any subset of a frequent itemset must be frequent.//
* Thus an efficient mining methodology will take advantage of this, with the following rule: given an itemset **S**, if any subset of **S** is **infrequent** then there is **no chance for S to be frequent** (so S can be pruned and doesn't need to be mined).
* From this **Apriori pruning principle**, three major methods were developed:
- Level-wise, join-based approach: **Apriori** (Agrawal, Srikant @ VLDB'94)
- Vertical data format approach: **Eclat** (Zaki, Parthasarathy, Ogihara, Li @ KDD'97)
- Frequent pattern projection and growth: **FPgrowth** (Han, Pei, Yin @ SIGMOD'00)
==== The Apriori Algorithm ====
* Algorithm:
- Initially, scan database once to get **frequent 1-itemset**.
- **Repeat**:
- Generate candidate (k+1)-itemsets from frequent k-itemsets.
- Test the candidate (k+1)-itemsets to find which ones are frequent.
- set k = k+1
- **Stop** when no frequent or candidate set can be generated anymore.
- Return all the frequent itemsets found.
* **How to generate candidates** at level k+1:
- **Self-joining** \(F_k\): eg. {abc, abd, acd, ace, bcd} => {abcd, acde}
- **Pruning**: eg. **ade** is not in the \(F_3\) above, so **acde** is pruned.
* An **SQL implementation** can be derived to generate the candidates described above [not reported here, cf. the corresponding course video]
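A compact, non-optimized Python sketch of the level-wise Apriori loop described above (toy transactions, min_sup chosen arbitrarily):
<code python>
# Minimal Apriori sketch (toy data, no optimizations such as hashing or partitioning).
from itertools import combinations

def apriori(db, min_sup):
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}
    # Frequent 1-itemsets.
    Fk = {frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= min_sup}
    all_frequent = {}
    k = 1
    while Fk:
        for X in Fk:
            all_frequent[X] = sum(1 for t in db if X <= t)
        # Candidate generation: self-join F_k, then Apriori pruning of candidates
        # having an infrequent k-subset.
        candidates = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Fk for s in combinations(c, k))}
        # Support counting for the surviving candidates.
        Fk = {c for c in candidates
              if sum(1 for t in db if c <= t) >= min_sup}
        k += 1
    return all_frequent

db = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"b", "c", "d"}]
print(apriori(db, min_sup=2))
</code>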
==== Extensions or Improvements of Apriori ====
* Reducing number of transaction database scan passes:
* Partitioning (Savasere, et al., 1995)
* Dynamic itemset counting (Brin, et al., 1997)
* Shrink the number of candidates:
* Hashing (Park, et al., 1995)
* Pruning by support lower bounding (Bayardo 1998)
* Sampling (Toivonen, 1996)
* Exploring special data structures
* Tree projection (Aggarwal, et al. 2001)
* H-miner (Pei, et al. 2001)
* Hypercube decomposition (Uno, et al. 2004)
=== Partitioning ===
* => Only needs to scan the database twice to find frequent itemsets.
* => We partition the database into k partitions, then:
* **Theorem**: Any itemset that is potentially frequent in database TDB must be frequent in at least one of the partitions of TDB.
* Method:
* Scan 1: partition the database so that each partition fits in main memory (avoiding further I/O), and find all frequent itemsets local to each partition.
* Scan 2: consolidate global frequent patterns.
=== Direct Hashing and Pruning (DHP) ===
* DHP can be used to reduce the number of candidates.
* **Observation**: //A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.//
* Example: while counting the 1-itemset with candidates a,b,c,d,e we also count the number of entries for 2-itemset buckets such as {ab,ad,ae}, {bd,be,de}, etc.
* e.g. {ab} cannot be a candidate 2-itemset if the total count of its hash bucket {ab, ad, ae} is below the support threshold.
==== Mining Frequent Patterns by Exploring Vertical Data Format ====
* ECLAT (Equivalence Class Transformation): a depth-first search algorithm using set intersection.
* Need to generate the **Tid-List**: i.e. the list of transaction IDs containing each item.
* Then we can compute the **intersections** of the tid-lists: e.g. if \(t(e) = \{T_{10}, T_{20}, T_{30}\}\) and \(t(a) = \{T_{10}, T_{20}\}\), then \(t(ae) = \{T_{10}, T_{20}\}\)
* **Properties of Tid-Lists**:
* \(t(X) = t(Y)\) => X and Y always happen together.
* \(t(X) \subset t(Y)\) => transactions having X always have Y.
* Frequent patterns can be derived from vertical intersections.
* Can use **diffset** to accelerate mining:
* Only keep track of differences of Tid-Lists (can improve memory usage).
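A minimal sketch of the vertical (tid-list) representation and the intersection-based counting used by ECLAT (toy data; the last line just illustrates the diffset idea):
<code python>
# ECLAT-style vertical format sketch: tid-lists and their intersections (toy data).
db = {10: {"a", "c", "d"}, 20: {"a", "b", "c", "e"}, 30: {"a", "b", "e"}}

# Build the tid-list of every single item.
tid = {}
for t, items in db.items():
    for i in items:
        tid.setdefault(i, set()).add(t)

def tidlist(itemset):
    """Tid-list of an itemset = intersection of the tid-lists of its items."""
    return set.intersection(*(tid[i] for i in itemset))

print(sorted(tid["a"]))              # [10, 20, 30]
print(sorted(tidlist({"a", "e"})))   # [20, 30] -> support 2
# Diffset idea: store tid(prefix) - tid(prefix + item) instead of the full list.
print(sorted(tid["a"] - tidlist({"a", "e"})))  # [10]
</code>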
==== FPGrowth: A Pattern Growth Approach ====
* Algorithm:
- Find frequent single items and partition the database based on each such item,
- Recursively grow frequent patterns by doing the above for each partitioned database (also called a conditional database),
- Use an **FP-Tree** (frequent-pattern tree) to facilitate efficient processing.
* FIXME: I didn't get much out of that one, might need to check again later...
==== Mining Closed Patterns ====
* **Closet+** algorithm to mine closed itemsets by Pattern-Growth (efficient and direct)
* **itemset merging**: //if Y appears in every occurrence of X, then Y is merged with X.//
* Many tricks are implemented in CLOSET+:
* Hybrid tree projection,
* Sub-itemset pruning,
* Item skipping,
* Efficient subset checking, etc.
==== Recommended Readings ====
* R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
* A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases”, VLDB'95
* J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”, SIGMOD'95
* S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database systems: Alternatives and implications”, SIGMOD'98
* M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithm for discovery of association rules”, Data Mining and Knowledge Discovery, 1997
* J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
* M. J. Zaki and Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
* J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets”, KDD'03
* C. C. Aggarwal, M.A., Bhuiyan, M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014
===== Lesson 3: Pattern Evaluation =====
==== Key concepts ====
* Pattern evaluation; interestingness of patterns/rules
* Popularly used measures in pattern evaluation: support, confidence, lift, chi-square, cosine, Jaccard
* Null-invariance, the Kulczynski measure, and imbalance ratio
==== Limitation of the Support-Confidence Framework ====
* Not all frequent patterns and rules are interesting.
* **Interestingness measures** can be classified as **objective** or **subjective**.
- **Objective interestingness measures** (such as support, confidence, correlation, etc.) are defined by mathematical formulas,
- **Subjective interestingness measures** depend on one's perspective.
* Thus mining can be done:
- in a **query-based** mode: asking the user what she would like to see,
- against one's knowledge base: looking for unexpected, fresh or recent patterns/rules,
- using visualization tools: multi-dimensional, interactive examination.
* Not all strongly supported and/or confident association rules are interesting. In the lecture videos we have the following example:
* //play-basketball// => //eat-cereal// [s=40%, c=66.7%]
* //**not** play-basketball// => //eat-cereal// [s=35%, c=87.5%]
==== Interestingness Measures: Lift and \(\chi^2\) ====
* **Lift**: measure of dependent/correlated events:
\[lift(B,C) = \frac{c(B \rightarrow C)}{s(C)} = \frac{s(B \cup C)}{s(B) \times s(C)}\]
* Lift(B,C) will tell how B and C are correlated:
* Lift(B,C) = 1 => B and C are independent.
* Lift(B,C) > 1 => we have a positive correlation
* Lift(B,C) < 1 => we have a negative correlation
* \(\chi^2\) is another measure to test correlated events:
\[\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}\]
* General \(\chi^2\) rules:
* \(\chi^2 = 0\): events are independent
* \(\chi^2 > 0\): events are correlated (either positively or negatively)
* But even Lift and \(\chi^2\) are not always good measures: they can, for instance, indicate a strong positive correlation when there is none, because of the many null transactions.
* [def] **Null transactions**: transactions that do not contain the itemset of interest.
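A small sketch computing lift and \(\chi^2\) on the basketball/cereal example above (assuming N = 1000 transactions; the counts then follow from the supports and confidences given in the lecture):
<code python>
# Lift and chi-square on the basketball/cereal example
# (N = 1000 is an assumption; the counts follow from s = 40%, c = 66.7%, c' = 87.5%).
N = 1000
n_bc, n_b_notc = 400, 200        # basketball & cereal, basketball & no cereal
n_notb_c, n_notb_notc = 350, 50  # no basketball & cereal, neither

s_b = (n_bc + n_b_notc) / N      # s(basketball) = 0.6
s_c = (n_bc + n_notb_c) / N      # s(cereal) = 0.75
s_bc = n_bc / N                  # s(basketball u cereal) = 0.4

lift = s_bc / (s_b * s_c)
print(lift)  # ~0.89 < 1 -> negative correlation

# Chi-square: sum over the 2x2 table of (observed - expected)^2 / expected.
observed = [n_bc, n_b_notc, n_notb_c, n_notb_notc]
expected = [N * s_b * s_c, N * s_b * (1 - s_c),
            N * (1 - s_b) * s_c, N * (1 - s_b) * (1 - s_c)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)  # ~55.6 > 0 -> the two events are correlated
</code>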
==== Null Invariance Measures ====
* **Null invariance**: value does not change with the number of null-transactions.
* Lift and \(\chi^2\) are **not** null invariant.
* A few Null-invariant measures: (all give values in the range [0,1])
* \(AllConf(A,B) = \frac{s(A \cup B)}{max(s(A), s(B))}\)
* \(Jaccard(A,B) = \frac{s(A \cup B)}{s(A) + s(B) - s(A \cup B)}\)
* \(Cosine(A,B) = \frac{s(A \cup B)}{\sqrt{s(A) \times s(B)}}\)
* \(Kulczynski(A,B) = \frac 12 \left( \frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)} \right)\)
* \(MaxConf(A,B) = max \left( \frac{s(A \cup B)}{s(A)}, \frac{s(A \cup B)}{s(B)} \right)\)
* Null-transactions are often very numerous in practice, so null-invariant measures are critical.
==== Comparison of Null-Invariant Measures ====
* Not all null invariant measures are equal.
* => **Kulczynski** is the most appropriate when we have contradictory correlations, so this one seems to be the one to use by default.
* Introducing **Imbalance ratio** (IR): measure the imbalance of two itemsets A and B in rule implications:
\[IR(A,B) = \frac{|s(A) - s(B)|}{s(A)+s(B)-s(A \cup B)}\]
* => **Kulczynski + Imbalance ratio** together can present a clear picture of all the typical cases.
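A small sketch computing the null-invariant measures and the imbalance ratio listed above directly from the (relative) supports; the support values used here are made up:
<code python>
# Null-invariant measures and imbalance ratio computed from (relative) supports.
# s_ab stands for s(A u B), i.e. the fraction of transactions containing both A and B.
from math import sqrt

def measures(s_a, s_b, s_ab):
    return {
        "all_conf": s_ab / max(s_a, s_b),
        "jaccard":  s_ab / (s_a + s_b - s_ab),
        "cosine":   s_ab / sqrt(s_a * s_b),
        "kulc":     0.5 * (s_ab / s_a + s_ab / s_b),
        "max_conf": max(s_ab / s_a, s_ab / s_b),
        "IR":       abs(s_a - s_b) / (s_a + s_b - s_ab),
    }

# Balanced case vs. highly imbalanced case (made-up numbers):
print(measures(s_a=0.10, s_b=0.10, s_ab=0.08))  # kulc = 0.8, IR = 0
print(measures(s_a=0.10, s_b=0.60, s_ab=0.08))  # kulc ~ 0.47, IR ~ 0.8
</code>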
==== Recommended Readings ====
* C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS'98
* S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97
* M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94
* E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03
* P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02
* T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397, 2010
===== Lesson 4: Mining Diverse Patterns =====
==== Key concepts ====
* Multiple-level associations, level-shared mining, redundant rules
* Multi-dimensional associations, quantitative associations
* Negative correlations vs. rare patterns, null-invariance
* Compressed patterns, redundancy-aware top-k patterns
==== Mining Multi-Level Associations ====
* We can often have a hierarchy in items (eg. milk => 2% milk => Dairyland 2% milk)
* When we have such an item hierarchy we might mine with:
- Uniform min-support across multiple levels (is that reasonable?)
- Level-reduced min-support: items at the lower levels are expected to have lower support.
* How to perform efficient mining with multiple min-support values? => **shared** multi-level mining: use the lowest min-support to pass down the set of candidates.
* Another problem with multi-level association mining is **redundancy**:
* example: 2% milk is 25% of sold milk, and
* milk => wheat bread [s=8%, c=70%] (1)
* 2% milk => wheat bread [s=2%, c=72%] (2)
* Rule (2) can be derived from rule (1): given that 2% milk is 25% of sold milk, its expected support (8% × 25% = 2%) and confidence follow from rule (1), so rule (2) is redundant.
* => Redundancy filtering required.
* We may also require customized min-supports for different kinds of items (for example when very expensive items are mixed with common items):
* Solution: use **group-based** individualized min-support:
* eg. {diamond, watch}: 0.05%, {bread, milk}: 5%
=> scalable mining methods can easily be extended to take customized min-support into account.
==== Mining Multi-Dimensional Associations ====
* Simple **Single-dimensional** rule: buys(X, "milk") => buys(X, "bread")
* Multi-dimensional rules:
* **Inter-dimension** association rules (no repeated predicate):
* age(X, "18-25") ^ occupation(X, "student") => buys(X, "coke")
* **Hybrid-dimension** association rules (repeated predicate):
* age(X, "18-25") ^ buys(X, "popcorn") => buys(X, "coke")
* Attributes can be:
* **categorical** (eg. discrete values): no ordering
* **quantitative** (numerical values): implicit ordering
==== Mining Quantitative Associations ====
* Mining associations with numerical attributes (e.g. age, salary, etc.)
* Mining methods:
* **Static discretization** => data cube-based aggregation
* **Dynamic discretization** => based on data distribution
* **Clustering** (eg distance-based association)
* Perform **Deviation analysis**, for instance: Gender=female => Wage: mean=$7/hr (overall mean = $9/hr);
* Mining **extraordinary or interesting phenomena**, in the gender=female example above:
* Left Hand Side (LHS) : a subset of the population
* Right Hand Side (RHS) : an extraordinary behavior of this subset
* An extraordinary rule is accepted only if a statistical test (e.g. Z-test) confirms the inference with high confidence.
* **Subrule** : can highlight the extraordinary behavior of a subset of the population of the **super rule**, for instance:
* (gender=female) ^ (South=yes) => mean wage = $6.3/hr
* Rule condition can be categorical or numerical (quantitative rules):
* Education in [14-18] (yrs) => mean wage = $11.64/hr
* Efficient methods to mine extraordinary rules: cf. Aumann and Lindell@KDD'99
==== Mining Negative Correlations ====
* **Rare patterns** : very low support but still very interesting (can be mined with **individualized group-based min-support** thresholds)
* **Negative patterns**: negatively correlated itemsets (i.e. unlikely to happen together).
* Support-based definition of negative patterns: if itemsets A and B are both frequent but rarely occur together, \(sup(A \cup B) \ll sup(A) \times sup(B)\), then A and B are negatively correlated. => **similar to the definition of Lift** => doesn't work well with large datasets with many null-transactions.
* So we introduce a **Kulczynski measure-based definition** of negative patterns: if itemsets A and B are frequent but \(\frac{P(A|B) + P(B|A)}{2} < \epsilon\), where \(\epsilon\) is a negative pattern threshold (for instance \(\epsilon \approx 0.01\)), then A and B are negatively correlated.
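A quick sketch of this Kulczynski-based negative-correlation test (the thresholds are placeholders):
<code python>
# Kulczynski-based negative pattern test (sketch; thresholds are placeholders).
def is_negative_pattern(s_a, s_b, s_ab, min_sup=0.01, eps=0.01):
    """A and B frequent, but (P(A|B) + P(B|A)) / 2 below a small epsilon."""
    frequent = s_a >= min_sup and s_b >= min_sup
    kulc = 0.5 * (s_ab / s_a + s_ab / s_b)
    return frequent and kulc < eps

# e.g. two popular items that almost never co-occur:
print(is_negative_pattern(s_a=0.05, s_b=0.08, s_ab=0.0002))  # True
print(is_negative_pattern(s_a=0.05, s_b=0.08, s_ab=0.02))    # False
</code>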
==== Mining Compressed Patterns ====
* Mining can produce many scattered patterns that are not very meaningful.
* When considering **closed patterns** we cannot really compress, because there is too much emphasis on the support values (e.g. to merge two close patterns they are required to have exactly the same support).
* When using **max-patterns** we may lose too much information, since the support values of the sub-patterns are missing.
* We would rather have something in between (i.e. a good balance): keep some sub-patterns of the max-patterns to retain part of the support information.
* To select which patterns can be compressed, we use a **pattern distance measure** (see the sketch below):
\[Dist(P_1,P_2) = 1 - \frac{|T(P_1) \cap T(P_2)|}{|T(P_1) \cup T(P_2)|}\]
* Then we can perform \(\delta\)-**clustering**: for each pattern P, find all patterns which can be expressed by P and whose distance to P is within \(\delta\) (\(\delta\)-cover) => All patterns in the cluster can be represented by P.
* There is a method for efficient and **direct** mining of compressed patterns (instead of mining all patterns first before compressing): cf. Xin et al. VLDB'05.
* **Redundancy-aware Top-k Patterns**: looking for high significance and low redundancy when compressing patterns.
* Method: use MMS (Maximal Marginal Significance) for measuring the combined significance of a pattern set: cf. Xin et al., Extracting Redundancy-Aware Top-k Patterns, KDD'06.
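A small sketch of the pattern distance above (i.e. a Jaccard distance on the supporting transaction sets), with made-up transaction IDs:
<code python>
# Sketch of the pattern distance used for delta-clustering of patterns.
# T(P) is the set of transaction IDs supporting pattern P (toy data, made up).
def pattern_distance(t_p1, t_p2):
    """Dist(P1, P2) = 1 - |T(P1) n T(P2)| / |T(P1) u T(P2)| (Jaccard distance)."""
    return 1 - len(t_p1 & t_p2) / len(t_p1 | t_p2)

T = {
    "abc":  {1, 2, 3, 4, 5},
    "abcd": {1, 2, 3, 4},      # a super-pattern of "abc" with slightly smaller coverage
    "xyz":  {7, 8, 9},
}
print(pattern_distance(T["abc"], T["abcd"]))  # 0.2 -> could be delta-covered by "abc"
print(pattern_distance(T["abc"], T["xyz"]))   # 1.0 -> unrelated
</code>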
===== Lesson 5: Sequential Pattern Mining =====
==== Key concepts ====
* Sequential patterns, sequential pattern mining
* Sequential pattern mining algorithms: GSP, SPADE, and PrefixSpan
* Closed sequential patterns, the CloSpan algorithm
==== Sequential Pattern and Sequential Pattern Mining ====
* Sequential pattern mining has broad applications. For instance: "Buy a laptop, then a digital camera, then a smartphone, within 6 months." => could also be applied to stocks and markets.
* So we try to extract sequential patterns from vast **sequence DBs** (i.e. entries with timestamps) or **time-series DBs** (evenly spaced timestamps).
* For sequential patterns we also have the concept of **gapped** and **non-gapped**.
* **Sequential pattern mining**: given a set of sequences, find the **complete set of frequent subsequences** (satisfying a given min-support threshold).
* A **sequence** can be written for instance: < (ef) (ab) (df) c b > (eg. we use parenthesis to identify a shopping basket here for instance).
* Each **element** in a **sequence** may contain a set of **items** (also called **events**)
* Items within an element are unordered (usually listed alphabetically).
* Any "substring" keeping the parenthesis at appropriate location in a **sequence** can define a **subsequence**, for instance: is a **susequence** of
* Then given a support threshold, for instance min_sup=2, then we can find **frequent sequential pattern**
* Algorithm requirements: **efficient**, **scalable**, **finding complete sets**, **incorporating various kinds of user-specific constraints**.
* The **Apriori** property still holds for sequential patterns: //if a subsequence \(s_1\) is infrequent, none of \(s_1\)'s super-sequences can be frequent.//
* We can then develop the same kind of algorithms as before, based on the Apriori property:
- **GSP** (Generalized Sequential Patterns): cf. Srikant & Agrawal @ EDBT'96
- **SPADE** (Vertical format-based mining): cf. Zaki @ Machine Learning'00
- **PrefixSpan** (Pattern-growth method): cf. Pei, et al. @ ICDE'01
* To mine **closed sequential patterns** we have the **CloSpan** method: cf. Yan, et al. @ SDM'03
* Finally we also have **constraint-based** sequential pattern mining.
==== GSP: Apriori-Based Sequential Pattern Mining ====
* We start with a sequence database:
^ SID ^ Sequence ^
| 10 | <(bd)cb(ac)> |
| 20 | <(bf)(ce)b(fg)> |
| 30 | <(ah)(bf)abf> |
| 40 | <(be)(ce)d> |
| 50 | <a(bd)bcb(ade)> |
- We start with initial candidates: i.e. all singleton sequences: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- We scan the DB once and count support for each candidate.
- Prune infrequent candidates: say <g> and <h> here.
- Generate length-2 candidate sequences:
* as two separate elements: <aa>, <ab>, ..., <ff>,
* or as a single two-item element: <(ab)>, <(ac)>, ..., <(ef)>.
=> We repeat steps 2 to 4 until no candidate or frequent sequence can be found anymore.
=> With the pruning stage due to Apriori property we can reduce the number of candidates for the next length sequences on each step.
==== SPADE: Sequential Pattern Mining in Vertical Data Format ====
* A sequence database is mapped to a vertical format of <SID, EID> pairs (sequence ID, element ID), for instance:
^ SID ^ Sequence ^
| 1 | <a(abc)(ac)d(cf)> |
| 2 | <(ad)c(bc)(ae)> |
| 3 | <(ef)(ab)(df)cb> |
| 4 | <eg(af)cbc> |
We can build the mapping:
^ SID ^ EID ^ items ^
| 1 | 1 | a |
| 1 | 2 | abc |
| 1 | 3 | ac |
| 1 | 4 | d |
| 1 | 5 | cf |
| 2 | 1 | ad |
| ... | ... | ... |
Then we figure out where each item occurs:
^ a ^^ b ^^ ... ^
^ SID ^ EID ^ SID ^ EID ^ ... ^
| 1 | 1 | 1 | 2 | ... |
| 1 | 2 | 2 | 3 | ... |
| 1 | 3 | 3 | 2 | ... |
| 2 | 1 | 3 | 5 | ... |
| 2 | 4 | 4 | 5 | ... |
| 3 | 2 | | | ... |
| ... | ... | | | ... |
Then to build length-2 sequences we check for each sequence ID the order of the Element IDs:
^ ab ^^ ba ^^ ... ^
^ SID ^ EID(a) ^ EID(b) ^ SID ^ EID(b) ^ EID(a) ^... ^
| 1 | 1 | 2 | 1 | 2 | 3 | ... |
| 2 | 1 | 3 | 2 | 3 | 4 | ... |
| 3 | 2 | 5 | | | | ... |
| ... | ... | ... | | | | ... |
Then we can grow the subsequences (patterns) one item at a time by Apriori candidate generation.
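A small sketch of this vertical <SID, EID> representation and of the temporal join used to grow length-2 sequences (toy sequence DB, loosely following the tables above):
<code python>
# SPADE-style sketch: build (SID, EID) lists and join them to find <a b> occurrences.
# Toy sequence DB: each sequence is a list of elements (itemsets).
seq_db = {
    1: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    2: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
}

# Vertical format: item -> list of (SID, EID) occurrences (EIDs start at 1).
vertical = {}
for sid, seq in seq_db.items():
    for eid, element in enumerate(seq, start=1):
        for item in element:
            vertical.setdefault(item, []).append((sid, eid))

def seq_join(occ_x, occ_y):
    """Occurrences of <x y>: same SID, and y's element strictly after x's element."""
    return [(sid, e1, e2) for (sid, e1) in occ_x
            for (sid2, e2) in occ_y if sid == sid2 and e1 < e2]

ab = seq_join(vertical["a"], vertical["b"])
print(ab)                              # [(1, 1, 2), (2, 1, 3)]
print(len({sid for sid, _, _ in ab}))  # 2 = sequence support of <a b>
</code>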
==== PrefixSpan: Sequential Pattern Mining by Pattern-Growth ====
* Need to introduce the concepts of **prefix** and **suffix** (i.e. projection). Starting from the DB:
^ SID ^ Sequence ^
| 1 | <a(abc)(ac)d(cf)> |
| 2 | <(ad)c(bc)(ae)> |
| 3 | <(ef)(ab)(df)cb> |
| 4 | <eg(af)cbc> |
We build the prefix/suffix DB:
^ Prefix ^ Suffix (Projection) ^
| <a> | <(abc)(ac)d(cf)> |
| <aa> | <(_bc)(ac)d(cf)> |
| <ab> | <(_c)(ac)d(cf)> |
* Given a **sequence** we can thus extract a set of **prefixes** and **suffixes** (i.e. prefix-based projections).
* PrefixSpan mining method:
- Find length-1 sequential patterns: e.g. <a>, <b>, <c>, <d>, <e>, <f>
- Then divide the search space to mine each projected DB:
* <a>-projected DB, <b>-projected DB, ..., <f>-projected DB.
- In each projected DB, we find the length-1 frequent patterns, then we combine them with the prefix to get length-2 frequent sequential patterns, for instance: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, etc.
=> The major advantages of PrefixSpan are:
* No candidate subseqs to be generated,
* Projected DBs keep shrinking.
* Major cost of PrefixSpan: **constructing projected DBs**
* suffixes are largely repeating in recursive projected DBs.
* When the DB can be held in main memory we can use **pseudo-projection**:
* Given s = <a(abc)(ac)d(cf)>:
* the <a>-projection can be noted **s|<a>: (, 2)** (same as s, but starting at position 2),
* the <ab>-projection can be noted **s|<ab>: (, 4)** (same as s, but starting at position 4).
* => No physical copy required: we use a pointer to the sequence and apply offsets.
* When the DB does not fit in memory, we have to do a physical projection until the data can fit in memory.
* For efficient mining both physical and pseudo projections should be used together.
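A minimal PrefixSpan sketch of the projection-and-growth idea above; to keep it short it assumes every element is a single item (no parenthesized itemsets) and uses physical projections only:
<code python>
# Minimal PrefixSpan sketch. Simplifying assumption: every element is a single item,
# so a sequence is just a list like ["a", "b", "c"] (no itemset elements).
def prefixspan(db, min_sup, prefix=None, results=None):
    prefix = prefix or []
    results = {} if results is None else results
    # Count, for each item, in how many sequences of the (projected) DB it appears.
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        new_prefix = prefix + [item]
        results[tuple(new_prefix)] = sup
        # Projected DB: suffix after the first occurrence of `item` in each sequence.
        projected = []
        for seq in db:
            if item in seq:
                suffix = seq[seq.index(item) + 1:]
                if suffix:
                    projected.append(suffix)
        prefixspan(projected, min_sup, new_prefix, results)
    return results

db = [["a", "b", "c", "d"], ["a", "c", "b"], ["a", "b", "d"], ["b", "c", "d"]]
print(prefixspan(db, min_sup=2))
</code>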
==== CloSpan: Mining Closed Sequential Patterns ====
* [def] **closed sequential pattern** s: //There exists no superpattern s' such that \(s' \supset s\) and s' and s have the same support.//
* For instance, given <abc>: 20, <abcd>: 20 and <abcde>: 15 => <abcd> and <abcde> are **closed sequential patterns** (but <abc> is not, since its super-pattern <abcd> has the same support).
* Why trying to mine closed sequential patterns ?
- Reduce the number of (redundant) patterns,
- While keeping the same expressive power
* **Property** \(P_1\): //if \(s' \supset s\) and the projected DBs of s and s' have the same size, then growing s and growing s' produce the same patterns with the same supports.//
* From this property we can develop two pruning techniques, **backward subpattern** and **backward superpattern** pruning, to remove redundant search space.
* **CloSpan** was developed on this principle and greatly enhances efficiency. cf. Yan, et al. SDM'03
===== Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns =====
==== Key concepts ====
* Spatial association, progressive deepening
* Spatial colocation pattern, colocation rule
* Partition-based trajectory pattern mining, t-pattern
* Moving object clusters: flock, convoy, and swarm
* Trajectory clustering, partition-and-group framework
* Semantics-rich movement patterns
* Periodic movement patterns
==== Mining Spatial Associations ====
* Spatial frequent patterns and association rules are also in the form: **A => B [s%, c%]**
* In the rule format above, A and B are sets of **spatial or non-spatial predicates**, for instance:
* Topological relations: intersects, overlaps, disjoint, etc.
* Spatial orientation: left_of, west_of, under, etc.
* Distance information: close_to, within_distance, etc.
* Example: is_a(X, large_town) ^ intersect(X, highway) -> adjacent_to(X, water) [7%, 85%]
* We usually want to explore **spatial autocorrelation** (because spatial data tends to be highly self-correlated: nearby things are more related than distant ones).
* When mining spatial associations, we can apply the **progressive refinement** heuristic:
* Because we can derive a hierarchy of spatial relationships: close_to can be seen as a generalization of near_by, touch, intersect, contain, ...
* So we could start only with close_to, and if it is frequent, then try to **refine** with the other relationships.
* => **Progressive refinement**: //First search for rough relationship, and then refine it.//
* Thus, typical 2-steps mining of spatial associations:
- Rough spatial computation as a filter: using **MBR** (Minimum Bounding Rectangle) or **R-tree** for rough estimation.
- Detailed spatial algorithm (as refinement) : apply only to the objects which have passed the rough association test (eg. those passing the min_support threshold).
==== Mining Spatial Colocation Patterns ====
* [def] **Colocation pattern**: //A group of spatial features or events that are frequently co-located in the same region.//
* On a graph, we could for instance use edges to denote the co-location of nearby points.
* [def] Concept of **Rowset(C)**: //the set of neighbor-sets L such that every feature in pattern C appears as a feature of an instance in L.// FIXME: Try to understand and explain this further.
* From the concept of Rowset we can build a colocation rule R: A -> B, with conditional probability cp(R) defined as:
\[cp(R) = \frac{|\{L \in rowset(A) | \exists L' s.t. (L \subseteq L') \land (L' \in rowset(A \cup B))\}|}{|rowset(A)|}\]
* Following example from video lectures (currently not really understanding this): \(cp(\{A,B\} \rightarrow \{C, D\}) = |rowset({A,B,C,D})| / |rowset({A,B})| \)
* [def] **participation ratio** pr(C,f): //probability that C is observed in a neighbor-set wherever feature f is observed//:
\[pr(C,f) = \frac{|\{r | (r \in S) \land (r. f = f) \land (\text{r is in a row instance of C})\}|}{|\{r | (r \in S) \land (r.f = f)\}|}\]
* **Property**: Monotonicity of participation ratio: //let C, C' be two co-location patterns such that \(C' \subset C\) , then for each feature \(f \in C': pr(C',f) \ge pr(C,f)\).//
* Based on previous monotonicity principle, an Apriori-like algorithm can be derived for efficient mining of colocation patterns:
- let min-feature-support = \(\sigma\), and min-pr = \(\rho\)
- Start with a set of single feature pattern \(\{p_1\}\) with support \(\ge \sigma\)
- Grow to size k, in Apriori way (i.e, stop growing if the pattern is infrequent)
- For each such pattern p, mine its super-pattern P, such that \(pr(P, p) \ge \rho\) in an Apriori way.
==== Mining and Aggregating Patterns over Multiple Trajectories ====
* **Partition-Based Trajectory Pattern Mining** (eg. Mining T-Patterns) (cf. F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi, Trajectory Pattern Mining, KDD’07)
* Partition the space into equal-width grids and obtain **Regions of Interest** (RoIs),
* Then transform each input trajectory into a **Time-annotated symbolic sequence**, for instance:
* Railway Station: [15min] -> Castle Square, [2h15min] -> Museum
* Railway Station: [10min] -> Middle Bridge, [10min] -> Campus
* Then use constraint-based sequential pattern mining to find trajectory patterns.
* Then matching T-patterns such as: \((x_0, y_0) - \alpha_1 \rightarrow (x_1,y_1)\) (??)
* **Detecting Moving Objects clusters**:
* **Flock and convoy**: Both require k **consecutive** time stamps.
* **Flock**: at least m entities are within a circular region of **radius r** and move in the same direction.
* **Convoy**: Density-based clustering at each timestamp, no need to use a rigid circle.
* **Swarm**: Moving objects may not be close to each other for all consecutive timestamps. Efficient pattern mining algorithm can be derived for mining swarm (cf. Z. Li, et al., Swarm: Mining Relaxed Temporal Moving Object Clusters, VLDB'10)
* **Trajectory Clustering**: **A Partition and Group Framework** (cf. J.-G. Lee, et al., Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07)
* Grouping trajectories as a whole => cannot find similar portions of trajectories.
* Solution: discover common **sub-trajectories**, e.g. to forecast hurricane landfall.
* 2 phases:
* **partitioning**: creating sequences of segments,
* then **grouping**: may find patterns in the segments created above.
* Partitioning can be done using the minimum description length (MDL) principle: i.e. reduce the number of points in the paths, while avoiding too much error (i.e. real points too far away from the approximating segment).
==== Mining Semantics-Rich Movement Patterns ====
* **Frequent Movement Pattern**: a movement sequence that frequently appears in the input trajectory database (even if we cannot rely on precise spatial locations), for instance: //Home -> Starbucks -> Office -> Restaurant -> Cafe -> Restaurant -> Bar//
* To mine Frequent movement patterns we may need to group similar places to be able to find frequent subsequences.
* 2-steps top-down mining approach:
- Find a set of coarse patterns that reflect people's semantics-level transitions (office -> restaurant, home -> gym)
- Split each coarse pattern into several fine-grained ones by grouping similar movement snippets.
* cf. C. Zhang et al., Splitter: Mining Fine-Grained Sequential Patterns in Semantic Trajectories, VLDB'14.
==== Mining Periodic Movement Patterns ====
* **Pattern discovery in sparse data** (for instance, finding bird flight patterns).
* Periodicity might show up in some reference **spots** (or **cluster centers**)
* Reference spots can be detected using **density-based method**
* Periods are detected for each reference spot using **Fourier Transform** and **auto-correlation**
* cf. Z. Li, et al., Mining Periodic Behaviors for Moving Objects, KDD'10
* cf. Z. Li, et al., ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 2015
==== Recommended Readings ====
* F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi: Trajectory Pattern Mining. KDD’07
* Y. Huang, S. Shekhar, H. Xiong: Discovering colocation patterns from spatial data sets: A general approach. IEEE Trans. on Knowledge & Data Eng., 16(12), 2004
* Y. Huang, J. Pei, H. Xiong: Mining Co-Location Patterns with Rare Events from Spatial Data Sets. GeoInformatica 10(3): 239-260, 2006
* K. Koperski, J. Han: Discovery of Spatial Association Rules in Geographic Information Databases. SSD’95
* J.-G. Lee, J. Han, and K.-Y. Whang: Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07
* Z. Li, B. Ding, J. Han, R. Kays: Swarm: Mining Relaxed Temporal Moving Object Clusters. VLDB’10
* Z. Li, B. Ding, J. Han, R. Kays, P. Nye: Mining Periodic Behaviors for Moving Objects. KDD’10
* Z. Li, J. Wang and J. Han, ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 27(5): 1219-1232, 2015
* C. Zhang, J. Han, L. Shou, J. Lu, and T. La Porta: Splitter: Mining Fine Grained Sequential Patterns in Semantic Trajectories. VLDB'14
===== Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data =====
==== Key concepts ====
* Phrase mining; training-based phrase mining vs. topic modeling-based phrase mining
* TNG, TurboTopic; KERT; criteria to judge quality of phrases
* ToPMine; word collocation; phrasal segmentation; PhraseLDA; SegPhrase
==== From Frequent Pattern Mining to Phrase Mining ====
* Important concept when mining text data is to mine **quality phrases**
* Unigrams (eg. single words) vs. Phrases:
* Unigrams are often **ambiguous**
* Phrase: a natural, meaningful, unambiguous semantic unit
* Thus, it would be interesting to mine semantically meaningful phrases:
* Transforming text data from **word granularity** to **phrase granularity**,
* Enhancing the power and efficiency of manipulating unstructured data.
* How do we go from frequent patterns to phrases ?
* **General principle**: Exploit information redundancy and data-driven criteria to determine phrase boundaries and salience.
* **Methodology**:
- Frequent pattern mining and collocation analysis
- Phrasal segmentation
- Quality phrase assessment
* Recent developments of phrase mining:
* **ToPMine**: Mining quality phrase without training (A. El-Kishky, et al., 2015)
* **SegPhrase**: Mining quality phrase with tiny training sets (J. Liu, et al., 2015)
==== Previous Phrase Mining Methods ====
* Phrase mining came from the NLP community as **chunking**:
* Modeling phrase extraction as a sequence labeling problem (B-NP, I-NP, O, ...)
* Previous methods required annotation and training:
* Annotate hundreds of documents as training data
* Train a supervised model based on part-of-speech features
* Recent trend:
* use distributional features based on **web n-grams** (Bergsma et al., 2010)
* state-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score.
* Limitations:
* high annotation cost, not scalable to a new language, a new domain/genre.
* May not fit domain-specific, dynamic, emerging applications (such as scientific domains, query logs, or social media)
* Thus trying to perform unsupervised phrase mining and topic modeling instead.
* Many studies of unsupervised phrase mining are linked with **topic modeling**.
* Each **topic** is represented by a word distribution.
* **Documents** are represented by multiple topics in different proportions.
* => This approach does not require any prior annotations or labeling of the documents.
* With topic modeling we rather rely on **statistical topic modeling algorithms**:
* The most common algorithm: LDA (Latent Dirichlet Allocation) [cf. Blei, et al., 2003]
* 3 strategies to perform phrase mining using topic modeling:
* Strategy 1: simultaneously infer the topics and the phrase tokens (from generating a bag of words to generating a sequence of tokens),
* Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
* Strategy 3: Prior bag-of-words model inference, mine phrases and impose on the bag-of-words model.
=== Strategy 1: Simultaneously inferring Phrases and Topics ===
* **Bigram Topic Model** [cf. Wallach'06]
* Probabilistic generative model that conditions on previous word and topic when drawing next word.
* **Topical N-Grams** (TNG) [Wang, et al.'07] (a generalization of Bigram Topic Model)
* Probabilistic model that generates words in textual order
* Create n-grams by concatenating successive bigrams
* **Phrase-Discovering LDA** (PDLDA) [Lindsey, et al.'12]
* Viewing each sentence as a time-series of words, PDLDA posits that the generative parameter (topic) changes periodically.
* Each word is drawn based on previous m words (context and current phrase topic).
* Limitations:
* High model complexity: tends to overfit
* high inference cost: slow
=== Strategy 2: Post Topic-Modeling Phrase Construction: TurboTopics (I) ===
* **TurboTopics** [Blei & Lafferty'09]: Phrase construction as a post-processing step to Latent Dirichlet Allocation.
- Perform LDA on corpus to assign each token a topic label.
- Merge adjacent unigrams with the same topic label by a distribution-free permutation test on arbitrary-length back-off model.
- End recursive merging when all significant adjacent unigrams have been merged.
=== Strategy 2: Post Topic-Modeling Phrase Construction: KERT (II) ===
* **KERT** [Danilevsky et al.'14]: Phrase construction as a post-processing step to LDA
- Run bag-of-words model inference and assign topic label to each token
- Perform **frequent pattern mining** to extract candidate phrases within each topic
- Perform **phrase ranking** based on four different criteria:
* **Popularity**: e.g. "information retrieval" vs. "cross-language information retrieval"
* **Concordance**: eg. "powerful tea" vs. "strong tea"
* **Informativeness**: eg. "this paper" (frequent but not informative)
* **Completeness**: "vector machine" vs "support vector machine"
==== ToPMine: Phrase Mining without Training Data ====
=== Strategy 3: First Phrase Mining then Topic Modeling ===
* Why in this order ?
* If we have for instance "Knowledge discovery using least square support vector machine classifiers..." (with the words assigned to different topics, shown in blue and red in the lecture), we won't be able to generate the proper phrases by inferring topics on unigrams first.
* With phrase mining first we will first group the phrase elements: "[Knowledge discovery] using [least square] [support vector machine] [classifiers]...", then apply the topics on the groups.
* Techniques for this strategy:
* First: phrase mining => document segmentation and then phrase ranking,
* Then Topic model inference with phrase constraint.
* **Phrase mining** stage:
- **Frequent contiguous pattern mining**: Extract candidate phrases and their counts.
- Perform agglomerative merging of adjacent unigrams as guided by a **significance score**.
- Perform document segmentation to count phrase occurrences (and calculate the **rectified**, i.e. true, phrase frequency),
- Do phrase ranking (using the criteria proposed in KERT): popularity, concordance, informativeness, completeness.
- Finally do phrase-based topic modeling:
* mined bags-of-phrases are passed as input to **PhraseLDA**, an extension of LDA that constrains all words in a phrase to share the **same latent topic**.
=> This method essentially performs collocation mining:
* Collocation: a sequence of words that occur **more frequently than expected**
* Often **interesting**: collocations convey information not carried by their constituent terms.
* Many different measures can be used to extract collocations from a corpus [cf. Dunning 93, Pederson 96], for instance: mutual information, t-test, z-test, chi-squared test, likelihood ratio.
* Pointwise Mutual Information: \(PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)}\) (see the sketch below)
* Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm.
* => ToPMine is efficient and generates high-quality topics and phrases without any training data.
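A tiny sketch of a PMI computation from corpus counts (the counts below are made up):
<code python>
# Pointwise mutual information of a bigram, from corpus counts (toy counts, made up).
from math import log

def pmi(count_xy, count_x, count_y, total_words):
    """PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total_words
    p_x, p_y = count_x / total_words, count_y / total_words
    return log(p_xy / (p_x * p_y))

# A bigram occurring far more often than chance -> high PMI:
print(pmi(count_xy=80, count_x=120, count_y=150, total_words=100_000))  # ~6.1
# A bigram occurring about as often as chance predicts -> PMI near 0:
print(pmi(count_xy=2, count_x=1000, count_y=200, total_words=100_000))  # 0.0
</code>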
==== SegPhrase: Phrase Mining with Tiny Training Sets ====
* A small set of training data may enhance the quality of phrase mining (cf. J. Liu et al., Mining Quality Phrases from Massive Text Corpora, In SIGMOD'15)
* SegPhrase+: The Overall Framework
* **ClassPhrase**: Frequent pattern mining, feature extraction, classification.
* **SegPhrase**: Phrasal segmentation and phrase quality estimation.
* **SegPhrase+**: One more round to enhance mined phrase quality.
=== Pattern Mining and Feature Extraction ===
* Pattern Mining for Candidate Set:
* Build a candidate phrases set by frequent pattern mining
* Mining frequent k-grams (k is typically small, e.g. 6 in the experiments)
* **Popularity** is measured by the raw frequency of words and phrases mined from the corpus.
* **Feature Extraction**, based on **Concordance**
* Partition a phrase into two parts to check whether the co-occurrence is significantly higher than pure chance.
* **Feature Extraction**, based on **Informativeness**
* Quality phrases typically start and end with a non-stopword
* Use average IDF over words in the phrase to measure the semantics
* Usually, the probability of a quality phrase appearing in quotes, brackets, or connected by dashes is higher (punctuation information), e.g. "state-of-the-art".
* **Classification**:
* Using a tiny training set (about 300 labels for 1 GB of text; phrases extracted from a knowledge base can also be used),
* Label: indicating whether a phrase is a high quality one or not: "support vector machine": 1, "the experiment shows": 0
* Construct models to distinguish quality phrases from poor ones
* Use Random Forest algorithm to bootstrap different datasets with limited labels.
* **Phrasal segmentation**: can tell which phrase is more appropriate
* Partition a sequence of words by maximizing the likelihood,
* Consider length penalty and filter out phrases with low rectified frequency.
* So the process is repeated twice:
* Classification --> Phrasal segmentation (=> **SegPhrase**)
* --> Classification --> Phrasal segmentation (=> **SegPhrase+**)
==== Recommended Readings ====
* S. Bergsma, E. Pitler, D. Lin, Creating robust supervised classifiers via web-scale n-gram data, ACL’2010
* D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
* D.M. Blei, A. Y. Ng, M. I. Jordan, J. D. Lafferty, Latent Dirichlet allocation. JMLR 2003
* K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
* M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM’14
* A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
* R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL’12.
* J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
* A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB’10
* X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. ICDM’07
===== Lesson 8: Advanced Topics on Pattern Discovery =====
==== Key concepts ====
* Stream data mining; mining frequent patterns in data streams; the lossy counting algorithm
* Software bug mining; pattern discovery for software bug mining; mining copy-and-paste bugs
* Image analysis; pattern mining for image analysis; visual items; visual patterns
* Three categories on privacy issues in data mining: input privacy, output privacy and owner privacy, k-anonymity and differential privacy; data perturbation for privacy-preserving pattern mining
* Invisible data mining; new application exploration
==== Frequent Pattern Mining in Data Streams ====
* **Data streams**: continuous, ordered, changing, fast, huge volume.
* Random access is expensive: **single-scan algorithms** are needed.
* Most streams are low-level and multi-dimensional
* Stream Data Processing: **Stream Data Management System** (SDMS)
* **Stream mining** vs **Stream query**
* For mining, we want to see the big picture (i.e. less precision is acceptable).
* **Mining tasks**
* Pattern mining in streams,
* clustering,
* etc. (many more).
* Mining **precise** frequent patterns is unrealistic
* So we look for **approximate answers**
* How to mine frequent patterns with **good approximation** ?
* **Lossy Counting Algorithm** [Manku & Motwani, VLDB'02] (a sketch is given at the end of this section):
* => discard items with a very low support count.
* Given support threshold \(\sigma\), error threshold \(\epsilon\) and stream length N
* Output: items with frequency counts exceeding \((\sigma - \epsilon) N\)
* We undercount by: **Frequency count error** \(\le \epsilon N\)
* Approximation properties:
* No false negatives,
* False positives have true frequency count at least \((\sigma - \epsilon) N\)
* Frequency count underestimated by at most \(\epsilon N\)
* **Other issues**:
* Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)
* Mining approximate frequent k-itemsets in data streams
* Mining sequential patterns in data streams
* Recommended readings:
* G. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams”, VLDB’02
* A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, ICDT'05
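A compact sketch of the lossy counting algorithm for single items (the bucket width and pruning rule follow the standard description; the stream here is a toy list):
<code python>
# Lossy counting sketch for frequent items in a stream (single items only).
from math import ceil

def lossy_counting(stream, sigma, epsilon):
    width = ceil(1 / epsilon)   # bucket width
    counts, deltas = {}, {}     # counts: approximate count; deltas: maximum undercount
    n, bucket_id = 0, 1
    for item in stream:
        n += 1
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket_id - 1
        if n % width == 0:      # end of a bucket: prune low-count entries
            for it in list(counts):
                if counts[it] + deltas[it] <= bucket_id:
                    del counts[it], deltas[it]
            bucket_id += 1
    # Output items whose (under-)count exceeds (sigma - epsilon) * N.
    return {it: c for it, c in counts.items() if c >= (sigma - epsilon) * n}

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"] * 50
print(lossy_counting(stream, sigma=0.2, epsilon=0.02))  # keeps "a" and "b"
</code>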
==== Pattern Discovery for Software Bug Mining ====
* Finding bugs is challenging: often no clear specifications or properties, requires substantial human efforts in analyzing data.
* Software reliability analysis:
* Static bug detection: check the code
* Dynamic bug detection or testing: run the code
* Debugging: Given symptoms or failure, pinpoint the bug locations in the code.
* Why pattern mining ?
* Code or running sequences contain hidden patterns.
* Common patterns -> likely specification or property
* Violations (anomalies comparing to patterns) -> likely bugs
* Mining patterns to narrow down the scope of inspection:
* => code locations or predicates that occur more in failing runs and less in passing runs are suspicious bug locations.
* **Mining rules from source code**
* Bugs as deviant behavior (eg. by statistical analysis)
* Mining programming rules (eg. by frequent itemset mining)
* Mining function precedence protocols (eg. by frequent subsequence mining)
* Revealing neglected conditions (eg. by frequent itemset/subgraph mining)
* **Mining rules from revision histories**: by frequent itemset mining
* **Mining copy-paste patterns from source code**
* Find copy-paste bugs: CP-Miner (cf. Z. Li, S. Lu, S. Myagmar, Y. Zhou, "CP-Miner: A Tool for Finding Copy-Paste and Related Bugs in Operating System Code", OSDI'04)
=== Mining copy-and-paste bugs ===
* Copy-pasting is common:
* 12% in Linux (kernel ?) code
* 19% in X Window (server ?) code
* Copy-pasting is error-prone:
* Can mine "forget-to-change" bugs by sequential pattern mining.
* **Building sequence database** from source code:
* Tokenize each component:
* Different operators, constants, key words -> different tokens
* Same type of identifiers -> same token
* for instance: "old = 3;" -> "5 61 20"; and "new = 3;" -> "5 61 20"
* Then **compute hash**: "5 61 20" -> 16
* So for a list of statements we produce a list of hashes: 65, 16, 16, 71, ...
* **Detecting the "Forget-to-change" bugs**:
* Modification to the sequence pattern mining algorithm:
* **Constrain the max gap**: (16, 16, 71) ..... (16, 16,10,71)
* Then combine neighboring copy-pasted segments repeatedly,
* Then find conflicts: i.e. identify names that cannot be mapped to any corresponding name, and introduce an **unchanged ratio**.
* If 0 < unchanged ratio < threshold, then report the location as a potential bug.
* CP-Miner reported many copy-paste bugs in Linux, Apache, etc. (analyzing millions of lines of code).
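A tiny sketch of the tokenize-and-hash idea used to build the sequence database from source code (the token IDs and the hashing scheme below are made up for illustration, not CP-Miner's actual ones):
<code python>
# Sketch: map statements to token sequences, then hash each statement to one number,
# so that copy-pasted code (with renamed identifiers) yields identical hash sequences.
import re

TOKENS = {"=": 61, ";": 20, "IDENT": 5, "NUMBER": 9}   # made-up token IDs

def tokenize(statement):
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", statement):
        if tok.isidentifier():
            out.append(TOKENS["IDENT"])    # all identifiers map to the same token
        elif tok.isdigit():
            out.append(TOKENS["NUMBER"])
        else:
            out.append(TOKENS.get(tok, 0))
    return out

def statement_hash(statement):
    return hash(tuple(tokenize(statement))) % 100

# "old = 3;" and "new = 3;" produce the same token sequence, hence the same hash:
print(tokenize("old = 3;"), tokenize("new = 3;"))
print(statement_hash("old = 3;") == statement_hash("new = 3;"))  # True
</code>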
==== Pattern Discovery for Image Analysis ====
* An image can be characterized by **visual primitives** (eg. **interest points**)
* Each visual primitive can be described by visual feature (eg. high-dimensional feature vector)
* Each image is a collection of visual primitives
=== Visual Patterns Discovery ===
* Visual primitives can be clustered into visual "items" (similar visual primitives belong to the same item)
* Each visual primitive finds its k-nearest neighbor in the image to form a "visual transaction"
* Then mining "frequent itemsets" leads to sementically meaningful visual patterns.
* So the process is:
- images -> feature extraction -> Visual primitives
- Visual primitives -> feature quantization -> Induced visual transactions
- Induced Visual transactions -> pattern discovery -> Visual patterns
* **Challenges**
* Images are spatial data
* Spatial configuration among the visual items matters
* Induced transactions may overlap with each other => over-counting problem.
* **Uncertainties of visual items**
* Noisy clustering of visual primitives into visual items affects visual pattern
* Visual synonym and polysemy
=== Recommended Readings ===
* Hongxing Wang, Gangqiang Zhao, Junsong Yuan, Visual pattern discovery in image and video data: a brief survey, Wiley Interdisciplinary Review: Data Mining and Knowledge Discovery 4(1): 24-37 (2014)
* Hongxing Wang, Junsong Yuan, Ying Wu, Context-Aware Discovery of Visual Co-Occurrence Patterns. IEEE Transactions on Image Processing 23(4): 1805-1819 (2014)
* Gangqiang Zhao, Junsong Yuan, Discovering Thematic Patterns in Videos via Cohesive Sub-graph Mining. ICDM 2011: 1260-1265
* Junsong Yuan, Ying Wu, Ming Yang, From frequent itemsets to semantically meaningful visual patterns. KDD 2007: 864-873
==== Advanced Topics on Pattern Discovery: Pattern Mining and Society-Privacy Issue ====
* Potential adverse side-effects of data mining
* Privacy and accuracy are typically contradictory in nature.
* Improving one often incurs a cost on the other.
* Three categories of privacy issues:
* Input privacy: distort or hide data
* Output privacy (eg. knowledge hiding)
* Owner privacy: hide owner/source information from data.
* **Ensuring Input Privacy**:
* Approach 1: B2B (business-to-business) environment.
* Approach 2: B2C (business-to-customer) environment.
* **K-anonymity privacy** requirement: each equivalence class contains at least k records.
* **Data perturbation**:
* **Statistical distortion**: using randomization algorithms
* MASK [Rizvi & Haritsa VLDB'02]
* Flip each 0/1 bit with probability p (flipping 0s to 1s can add many spurious items),
* Tune p carefully to achieve acceptable average privacy and good accuracy.
* Cut-and-paste (C&P) operator [Evfimievski et al. KDD'02]
* Uniform randomization: Each existing item in the real transaction is, with probability p, replaced with a new item not present in the original transaction.
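A minimal sketch of MASK-style bit flipping (each 0/1 bit of the transaction matrix is flipped with probability p, which has to be tuned to balance privacy and accuracy):
<code python>
# MASK-style randomization sketch: flip every 0/1 bit of the transaction matrix
# with probability p (p must be tuned to trade privacy against mining accuracy).
import random

def mask_perturb(transactions, p, seed=42):
    rng = random.Random(seed)
    return [[bit ^ (1 if rng.random() < p else 0) for bit in row]
            for row in transactions]

# Toy 0/1 transaction matrix: rows = transactions, columns = items.
db = [[1, 0, 0, 1],
      [0, 1, 1, 0],
      [1, 1, 0, 0]]
print(mask_perturb(db, p=0.1))
</code>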
==== Recommended Readings ====
* R. Agrawal and R. Srikant, Privacy-preserving data mining, SIGMOD'00
* C. C. Aggarwal and P. S. Yu, Privacy-Preserving Data Mining: Models and Algorithms, Springer, 2008
* C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. 2014
* A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In KDD'02
* A. Gkoulalas-Divanis, J. Haritsa and M. Kantarcioglu, Privacy in Association Rule Mining, in C. Aggarwal and J. Han (eds.), Frequent Pattern Mining, Springer, 2014 (Chapter 15)
* N. Li, T. Li, S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. ICDE'07
* A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, TKDD 2007
* S. Rizvi and J. Haritsa. Maintaining data privacy in association rule mining. VLDB’02
* J. Vaidya, C. W. Clifton and Y. M. Zhu, Privacy Preserving Data Mining, Springer, 2010
==== Advanced Topics on Pattern Discovery: Looking Forward ====
* Lots of research issues on pattern discovery are still waiting to be solved.
* **Invisible Pattern Mining** can/could be used in many fields.