# Pattern Discovery in Data Mining

Course page: https://www.coursera.org/learn/data-patterns/home/welcome

Teacher: **Prof. Jiawei Han**

Textbook: **Data Mining: Concepts and Techniques** by Jiawei Han, Micheline Kamber and Jian Pei (2011) [Morgan Kaufmann]

- Course structure:
  - Lesson 1: Pattern Discovery: Basic Concepts
  - Lesson 2: Efficient Pattern Mining Methods
  - Lesson 3: Pattern Evaluation
  - Lesson 4: Mining Diverse Frequent Patterns
  - Lesson 5: Sequential Pattern Mining
  - Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns
  - Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data
  - Lesson 8: Advanced Topics on Pattern Discovery

- Basic concepts of pattern discovery: **frequent pattern**, **closed pattern**, **max-pattern**, and **association rules**.
- Efficient pattern mining methods: **Apriori**, **ECLAT**, and **FPGrowth**.
- Pattern evaluation measures: **lift**, **chi-square**, **cosine**, **Jaccard**, and **Kulczynski**.
- Methods for mining sequential patterns: **GSP**, **SPADE**, **PrefixSpan**, and **CloSpan**.
- Pattern-based classification: **CBA**, **CMAR**, **PatClass**, **DDPMine**, and **DPClass**.

## Lesson 1: Pattern Discovery Basic Concepts

### Key questions

- Why is pattern discovery an important task in data mining?
- What are the concepts of frequent itemsets?
- What are the concepts of association rules? How do you generate association rules from frequent itemsets?
- Why do we need compressed representations of frequent patterns? Why do we need two compressed representations: closed patterns and max-patterns?

### Definitions

**Pattern**: a **set** of **items**, **subsequences** or **substructures** that occur frequently together (or are strongly correlated) in a data set. Patterns represent **intrinsic** and **important properties** of datasets.

**Pattern discovery**: trying to **uncover** patterns from massive data sets, i.e. finding the **inherent regularities** in a dataset.

- Pattern discovery is the foundation for many essential data mining tasks:
  - Association, correlation and causality analysis,
  - Mining sequential and structural (e.g. sub-graph) patterns,
  - Pattern analysis in spatiotemporal, multimedia, time-series and stream data,
  - Classification: discriminative pattern-based analysis,
  - Cluster analysis: pattern-based subspace clustering.

**Itemset**: a set of one or more items. A **k-itemset** has the form \(X = \{x_1,\cdots,x_k\}\).

**(Absolute) support (count)** of X: *the number of occurrences of the itemset X across all the dataset entries*. (**Note**: this definition doesn't clearly match the example given by the teacher, as he uses a specific element \(x_i\) **inside** some itemsets and counts the number of times this element occurs in all itemsets in the dataset…) ⇒ **Clarification**: X can be any itemset in this definition (a 1-itemset, a 2-itemset, etc.), even if each entry in the dataset is, say, a 5-itemset.

**(Relative) support** of X: *the fraction of itemsets [or entries] in the dataset that contain X* (i.e. this is the **probability** that an entry contains X).

- An itemset X is **frequent** if the support of X is no less than a given **minimum support threshold** (denoted \(\sigma\)).

**Association rule**: we define the association rule \(X \rightarrow Y\ (s, c)\), stating that, given the itemset X, the probability that we also have the itemset Y has support **s** and confidence **c**, where:

- **s** is the support of \(X \cup Y\): i.e. the probability that an entry contains \(X \cup Y\),
- **c** is the conditional probability that an entry containing X also contains Y: \(c = \frac{sup(X \cup Y)}{sup(X)}\).

**Subtle notation**: the entries that contain \(X \cup Y\) are actually the **intersection** of the entries containing X and the entries containing Y (not the **union**!).

**Association rule mining**: find **all** the rules \(X \rightarrow Y\) that satisfy given minimum support and confidence thresholds.
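As a minimal illustration of these definitions, here is a short Python sketch (the toy transactions are made up for the example) that computes the support and confidence of one rule:

```python
# Toy transaction database: each entry is one transaction (an itemset).
db = [{"beer", "nuts", "diaper"},
      {"beer", "coffee", "diaper"},
      {"beer", "diaper", "eggs"},
      {"nuts", "eggs", "milk"},
      {"nuts", "coffee", "diaper", "eggs", "milk"}]

def sup(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in db) / len(db)

# Rule {beer} -> {diaper}: s = sup(X u Y), c = sup(X u Y) / sup(X).
X, Y = {"beer"}, {"diaper"}
s = sup(X | Y)
c = sup(X | Y) / sup(X)
print(f"beer -> diaper (s={s:.0%}, c={c:.0%})")  # s=60%, c=100%
```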

### Challenge

With the definitions above we often have **too many** frequent patterns.

- Solution 1: **Closed patterns**: *a pattern (itemset) X is **closed** if X is frequent and there exists no super-pattern \(Y \supset X\) with the same support as X.*
  - ⇒ Closed patterns are a **lossless compression** of frequent patterns (i.e. we reduce the number of patterns but do not lose the support information).

- Solution 2: **Max-patterns**: *a pattern X is a **max-pattern** if X is frequent and there exists no frequent super-pattern \(Y \supset X\).*
  - ⇒ Max-patterns are a **lossy compression** of frequent patterns: we only know a given sub-pattern is frequent, but we no longer know its real support.

⇒ In many applications, mining **closed patterns** is more desirable than mining **max-patterns**.
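To make the distinction concrete, here is a brute-force Python sketch (the toy database and min_sup are made up; real miners use the efficient algorithms of Lesson 2) that lists the closed and max itemsets of a tiny database:

```python
from itertools import combinations

db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
min_sup = 2
items = sorted(set().union(*db))

def support(itemset):
    """Absolute support: number of transactions containing the itemset."""
    return sum(itemset <= t for t in db)

# Enumerate all frequent itemsets (exponential; fine for toy data only).
frequent = {frozenset(c): support(set(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c)) >= min_sup}

# Closed: no proper superset with the same support (lossless compression).
closed = {x for x, s in frequent.items()
          if not any(x < y and s == frequent[y] for y in frequent)}
# Max: no frequent proper superset at all (lossy compression).
maximal = {x for x in frequent if not any(x < y for y in frequent)}
print("closed:", [set(x) for x in closed])   # {b}, {a,b}, {b,c}, {a,b,c}
print("max:", [set(x) for x in maximal])     # {a,b,c}
```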

### Recommended Readings

- “Mining association rules between sets of items in large databases” (R. Agrawal, T. Imielinski and A. Swami, in Proc. of SIGMOD'93)
- “Efficiently mining long patterns from databases” (R. J. Bayardo, in Proc. of SIGMOD'98)
- “Discovering frequent closed itemsets for association rules” (N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, in Proc. of ICDT'99)
- “Frequent Pattern Mining: Current Status and Future Directions” (J. Han, H. Cheng, D. Xin and X. Yan, in Data Mining and Knowledge Discovery, 15(1): 55-86, 2007)

### Formal Model definition

Let \(\mathcal{I} = \{I_1,I_2,\dots,I_m\}\) be a set of binary attributes, called items. Let \(T\) be a database of transactions. Each transaction \(t\) is represented as a binary vector, with \(t[k] = 1\) if \(t\) bought the item \(I_k\), and \(t[k] = 0\) otherwise.

*(TODO: complete this section.)*

## Lesson 2: Efficient Pattern Mining Methods

### Key concepts

- The downward closure property or the Apriori property of frequent patterns
- The Apriori algorithm
- ECLAT: Mining frequent patterns by exploring vertical data format
- FPGrowth: A frequent pattern-growth approach
- Closet+ for mining closed patterns

### Downward Closure Property of Frequent Patterns

**Downward closure** property (also called the **Apriori** property): *any subset of a frequent itemset must be frequent.*

- An efficient mining methodology takes advantage of this with the following rule: given an itemset **S**, if any subset of **S** is **infrequent**, then there is **no chance for S to be frequent** (so S can be pruned and doesn't require mining).

- From this **Apriori pruning principle**, three major methods were developed:
  - Level-wise, join-based approach: **Apriori** (Agrawal & Srikant @ VLDB'94)
  - Vertical data format approach: **Eclat** (Zaki, Parthasarathy, Ogihara & Li @ KDD'97)
  - Frequent pattern projection and growth: **FPGrowth** (Han, Pei & Yin @ SIGMOD'00)

### The Apriori Algorithm

- Algorithm:
  - Initially, scan the database once to get the **frequent 1-itemsets**.
  - **Repeat**:
    - Generate candidate (k+1)-itemsets from the frequent k-itemsets,
    - Test the candidate (k+1)-itemsets to find which ones are frequent,
    - Set k = k+1.
  - **Stop** when no frequent or candidate set can be generated anymore.
  - Return all the frequent itemsets found.

**How to generate candidates** at level k+1:

- **Self-joining** \(F_k\): e.g. {abc, abd, acd, ace, bcd} ⇒ {abcd, acde}
- **Pruning**: e.g. **ade** is not in the \(F_3\) above, so **acde** is pruned.
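A minimal Python sketch of the self-join + prune step, using the \(F_3\) example above (the function name and representation are mine, for illustration):

```python
def apriori_gen(Fk):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets by
    self-joining Fk, then pruning candidates with an infrequent k-subset."""
    Fk = {tuple(sorted(x)) for x in Fk}
    k = len(next(iter(Fk)))
    candidates = set()
    for a in Fk:
        for b in Fk:
            # Join two itemsets that share their first k-1 items.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune: every k-subset of the candidate must be frequent.
                if all(cand[:i] + cand[i + 1:] in Fk for i in range(k + 1)):
                    candidates.add(cand)
    return candidates

F3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(F3))  # {('a','b','c','d')}: acde is pruned since ade is infrequent
```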

- An **SQL implementation** can be derived to generate the candidates described above [not reported here, cf. the corresponding course video].

### Extensions or Improvements of Apriori

- Reducing the number of transaction database scans:
- Partitioning (Savasere, et al., 1995)
- Dynamic itemset counting (Brin, et al., 1997)

- Shrinking the number of candidates:
- Hashing (Park, et al., 1995)
- Pruning by support lower bounding (Bayardo 1998)
- Sampling (Toivonen, 1996)

- Exploring special data structures
- Tree projection (Aggarwal, et al. 2001)
- H-miner (Pei, et al. 2001)
- Hypercube decomposition (Uno, et al. 2004)

#### Partitioning

- ⇒ Only needs to scan the database twice to find the frequent itemsets.
- ⇒ We partition the database into k partitions, then:
  - **Theorem**: any itemset that is potentially frequent in database TDB must be frequent in at least one of the partitions of TDB.

- Method:
  - Scan 1: partition the database (each partition should fit into main memory, so that we can avoid further I/O operations) and find all frequent itemsets within each partition.
  - Scan 2: consolidate the global frequent patterns.

#### Direct Hashing and Pruning (DHP)

- DHP can be used to reduce the number of candidates.
- **Observation**: *a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.*

- Example: while counting the 1-itemset candidates a, b, c, d, e, we also count the number of entries for 2-itemset buckets such as {ab, ad, ae}, {bd, be, de}, etc.
  - Then ab is not a candidate 2-itemset if the total count of the bucket {ab, ad, ae} is below the support threshold.

### Mining Frequent Patterns by Exploring Vertical Data Format

- ECLAT (Equivalence Class Transformation): a depth-first search algorithm using set intersection.
- We need to generate the **Tid-List** of each item: the list of transactions containing that item.
- Then we can compute the **intersections** of the tid-lists: e.g. \(t(e) = \{T_{10}, T_{20}, T_{30}\}\) and \(t(a) = \{T_{10}, T_{20}\}\) give \(t(ae) = \{T_{10}, T_{20}\}\).

**Properties of Tid-Lists**:

- \(t(X) = t(Y)\) ⇒ X and Y always happen together,
- \(t(X) \subset t(Y)\) ⇒ transactions having X always have Y.

- Frequent patterns can be derived from vertical intersections.
- A **diffset** can be used to accelerate mining: only keep track of the differences between Tid-Lists (this can improve memory usage; see the sketch below).
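A small sketch of the vertical representation, reusing the t(a)/t(e) example above (the diffset bookkeeping shown is a simplified illustration):

```python
# Vertical data format: item -> set of transaction ids (Tid-List).
t = {"a": {10, 20}, "e": {10, 20, 30}}

# Support of an itemset = size of the intersection of its Tid-Lists.
t_ae = t["a"] & t["e"]
print(t_ae, "sup =", len(t_ae))  # {10, 20} sup = 2

# Diffset idea: store d(ae) = t(a) - t(ae) instead of t(ae) itself;
# then sup(ae) = sup(a) - |d(ae)|, often much cheaper on dense data.
d_ae = t["a"] - t_ae
print("sup(ae) =", len(t["a"]) - len(d_ae))  # 2
```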

### FPGrowth: A Pattern Growth Approach

- Algorithm:
  - Find the frequent single items and partition the database based on each such item,
  - Recursively grow frequent patterns by doing the above for each partitioned database (also called a conditional database),
  - Use an **FP-Tree** to facilitate efficient processing.

- *(Note: I didn't get much out of this one; might need to check it again later…)*

### Mining Closed Patterns

- **CLOSET+**: an algorithm to mine closed itemsets by pattern growth (efficient and direct).
- **Itemset merging**: *if Y appears in every occurrence of X, then Y is merged with X.*
- Many tricks are implemented in CLOSET+:
  - Hybrid tree projection,
  - Sub-itemset pruning,
  - Item skipping,
  - Efficient subset checking, etc.

### Recommended Readings

- R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
- A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases”, VLDB'95
- J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”, SIGMOD'95
- S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database systems: Alternatives and implications”, SIGMOD'98
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithm for discovery of association rules”, Data Mining and Knowledge Discovery, 1997
- J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
- M. J. Zaki and C.-J. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
- J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets”, KDD'03
- C. C. Aggarwal, M.A., Bhuiyan, M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014

## Lesson 3: Pattern Evaluation

### Key concepts

- Pattern evaluation; interestingness of patterns/rules
- Popularly used measures in pattern evaluation: support, confidence, lift, chi-square, cosine, Jaccard
- Null-invariance, the Kulczynski measure, and imbalance ratio

### Limitation of the Support-Confidence Framework

- Not all patterns and rules are interesting.
- **Interestingness measures** can be classified as **objective** or **subjective**:
  - **Objective interestingness measures** (such as support, confidence, correlation, etc.) are defined by mathematical formulas,
  - **Subjective interestingness measures** depend on one's perspective.

- Thus mining can be done:
  - in a **query-based** mode: asking the user what she would like to see,
  - against one's knowledge base: looking for unexpected, fresh or recent patterns/rules,
  - using visualization tools: multi-dimensional, interactive examination.

- Not all strongly supported and/or confident association rules are interesting. In the lecture videos we have the following example:
  - *play-basketball* ⇒ *eat-cereal* [s=40%, c=66.7%]
  - **not** *play-basketball* ⇒ *eat-cereal* [s=35%, c=87.5%]
  - Since the overall share of cereal eaters is 40% + 35% = 75%, the 66.7% confidence of the first rule is actually *below* the baseline: the rule is misleading.

### Interestingness Measures: Lift and \(\chi^2\)

**Lift**: measure of dependent/correlated events:

\[lift(B,C) = \frac{c(B \rightarrow C)}{s(C)} = \frac{s(B \cup C)}{s(B) \times s(C)}\]

- Lift(B,C) tells us how B and C are correlated:
- Lift(B,C) = 1 ⇒ B and C are independent.
- Lift(B,C) > 1 ⇒ we have a positive correlation
- Lift(B,C) < 1 ⇒ we have a negative correlation

- \(\chi^2\) is another measure to test correlated events:

\[\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}\]

- General \(\chi^2\) rules:
- \(\chi^2 = 0\): events are independent
- \(\chi^2 > 0\): events are correlated (either positively or negatively)

- But even lift and \(\chi^2\) are not always good measures: they can, for instance, report a strong positive correlation where there is none (typically when null transactions dominate).

- [def] **Null transactions**: *transactions that do not contain the itemset of interest.*

### Null Invariance Measures

**Null invariance**: the value of the measure does not change with the number of null transactions.

- Lift and \(\chi^2\) are **not** null-invariant.
- A few null-invariant measures (all give values in the range [0,1]):
  - \(AllConf(A,B) = \frac{s(A \cup B)}{max(s(A), s(B))}\)
  - \(Jaccard(A,B) = \frac{s(A \cup B)}{s(A) + s(B) - s(A \cup B)}\)
  - \(Cosine(A,B) = \frac{s(A \cup B)}{\sqrt{s(A) \times s(B)}}\)
  - \(Kulczynski(A,B) = \frac 12 \left( \frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)} \right)\)
  - \(MaxConf(A,B) = max \left( \frac{s(A \cup B)}{s(A)}, \frac{s(A \cup B)}{s(B)} \right)\)

- Null transactions are often very numerous, so null-invariant measures are critical.
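A direct transcription of these formulas in Python (the supports passed in are made-up numbers); since every measure is a ratio of supports involving only A and B, adding null transactions leaves the values unchanged:

```python
import math

def all_conf(sa, sb, sab):   return sab / max(sa, sb)
def jaccard(sa, sb, sab):    return sab / (sa + sb - sab)
def cosine(sa, sb, sab):     return sab / math.sqrt(sa * sb)
def kulczynski(sa, sb, sab): return (sab / sa + sab / sb) / 2
def max_conf(sa, sb, sab):   return max(sab / sa, sab / sb)

# Example supports (hypothetical): s(A)=0.4, s(B)=0.1, s(A u B)=0.05.
for f in (all_conf, jaccard, cosine, kulczynski, max_conf):
    print(f.__name__, round(f(0.4, 0.1, 0.05), 3))
```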

### Comparison of Null-Invariant Measures

- Not all null-invariant measures are equal.
- ⇒ **Kulczynski** is the most appropriate when we have contradictory correlations, so it seems to be the one to use by default.

- Introducing the **Imbalance Ratio** (IR): measures the imbalance of two itemsets A and B in rule implications:

\[IR(A,B) = \frac{|s(A) - s(B)|}{s(A)+s(B)-s(A \cup B)}\]

- ⇒ **Kulczynski + Imbalance Ratio** together can present a clear picture of all the typical cases.

### Recommended Readings

- C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS'98
- S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94
- E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03
- P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02
- T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397, 2010

## Lesson 4: Mining Diverse Patterns

### Key concepts

- Multiple-level associations, level-shared mining, redundant rules
- Multi-dimensional associations, quantitative associations
- Negative correlations vs. rare patterns, null-invariance
- Compressed patterns, redundancy-aware top-k patterns

### Mining Multi-Level Associations

- We can often have a hierarchy in items (eg. milk ⇒ 2% milk ⇒ Dairyland 2% milk)
- When we have such an item hierarchy we might mine with:
  - a uniform min-support across multiple levels (is that reasonable?), or
  - level-reduced min-supports: items at the lower levels are expected to have lower support.

- How do we perform efficient mining with multiple min-support values? ⇒ **Shared** multi-level mining: use the lowest min-support to pass down the set of candidates.

- Another problem with multi-level association mining is **redundancy**. Example: 2% milk is 25% of sold milk, and:
  - milk ⇒ wheat bread [s=8%, c=70%] (1)
  - 2% milk ⇒ wheat bread [s=2%, c=72%] (2)
  - Rule (2) can be derived from rule (1), so it is redundant.

- ⇒ Redundancy filtering required.

- We may also need customized min-supports for different kinds of items (for example, when very expensive items are mixed with common items):
  - Solution: use **group-based** individualized min-supports,
  - e.g. {diamond, watch}: 0.05%; {bread, milk}: 5%.

⇒ scalable mining methods can easily be extended to take customized min-support into account.

### Mining Multi-Dimensional Associations

- Simple **single-dimensional** rule: buys(X, “milk”) ⇒ buys(X, “bread”)
- Multi-dimensional rules:
  - **Inter-dimension** association rules (no repeated predicate):
    - age(X, “18-25”) ^ occupation(X, “student”) ⇒ buys(X, “coke”)
  - **Hybrid-dimension** association rules (repeated predicate):
    - age(X, “18-25”) ^ buys(X, “popcorn”) ⇒ buys(X, “coke”)

- Attributes can be:
  - **categorical** (discrete values): no ordering,
  - **quantitative** (numerical values): implicit ordering.

### Mining Quantitative Associations

- Mining associations with numerical attributes (e.g. age, salary, etc.)
- Mining methods:
  - **Static discretization** ⇒ data-cube-based aggregation,
  - **Dynamic discretization** ⇒ based on the data distribution,
  - **Clustering** (e.g. distance-based association),
  - **Deviation analysis**, for instance: gender=female ⇒ wage: mean=$7/hr (overall mean = $9/hr).

- Mining **extraordinary or interesting phenomena**, as in the gender=female example above:
  - Left-Hand Side (LHS): a subset of the population,
  - Right-Hand Side (RHS): an extraordinary behavior of this subset.

- An extraordinary rule is accepted only if a statistical test (e.g. a Z-test) confirms the inference with high confidence.

- **Subrule**: can highlight the extraordinary behavior of a subset of the population of the **super-rule**, for instance:
  - (gender=female) ^ (South=yes) ⇒ mean wage = $6.3/hr

- Rule condition can be categorical or numerical (quantitative rules):
- Education in [14-18] (yrs) ⇒ mean wage = $11.64/hr

- Efficient methods to mine extraordinary rules: cf. Aumann and Lindell @ KDD'99.

### Mining Negative Correlations

- **Rare patterns**: very low support but still very interesting (can be mined with **individualized group-based min-support** thresholds).
- **Negative patterns**: negatively correlated itemsets (i.e. unlikely to happen together).
- Support-based definition of negative patterns: if itemsets A and B are both frequent but rarely occur together, i.e. \(sup(A \cup B) \ll sup(A) \times sup(B)\), then A and B are negatively correlated. ⇒ This is **similar to the definition of lift** ⇒ it doesn't work well with large datasets containing many null transactions.

- So we introduce a **Kulczynski-measure-based definition** of negative patterns: if itemsets A and B are frequent but \(\frac{P(A|B) + P(B|A)}{2} < \epsilon\), where \(\epsilon\) is a negative-pattern threshold, then A and B are negatively correlated (for instance \(\epsilon \approx 0.01\)).

### Mining Compressed Patterns

- We can produce many scattered patterns that are not so meaningful.
- When considering **closed patterns** we cannot compress much because there is too much emphasis on the support values (e.g. to merge close patterns we are required to have the same support value).
- When using **max-patterns** we might lose too much information through the missing support values.
- We would rather have something in between (i.e. a good balance): keep some sub-patterns of the max-patterns to retain some of the support information.

- To select which patterns can be compressed, we use a **pattern distance measure**:

\[Dist(P_1,P_2) = 1 - \frac{|T(P_1) \cap T(P_2)|}{|T(P_1) \cup T(P_2)|}\]

- Then we can perform \(\delta\)-**clustering**: for each pattern P, find all patterns which can be expressed by P and whose distance to P is within \(\delta\) (\(\delta\)-cover) ⇒ all the patterns in the cluster can be represented by P (see the sketch below).
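A minimal sketch of the distance measure and a greedy \(\delta\)-cover pass (the tid-sets are hypothetical toy values):

```python
def dist(t1, t2):
    """Pattern distance: 1 - Jaccard similarity of the supporting tid-sets."""
    return 1 - len(t1 & t2) / len(t1 | t2)

# T(P): supporting transactions of each pattern (toy values).
T = {"P1": {1, 2, 3, 4}, "P2": {1, 2, 3}, "P3": {5, 6}}
delta = 0.3

# P2 is covered by P1 (dist 0.25 <= delta), P3 is not (dist 1.0).
representative = "P1"
cluster = [p for p in T if dist(T[representative], T[p]) <= delta]
print(cluster)  # ['P1', 'P2']
```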

- There is a method for efficient and **direct** mining of compressed patterns (instead of mining all patterns first and then compressing them): cf. Xin et al., VLDB'05.

- **Redundancy-aware top-k patterns**: look for high significance and low redundancy when compressing patterns.
  - Method: use MMS (Maximal Marginal Significance) to measure the combined significance of a pattern set: cf. Xin et al., Extracting Redundancy-Aware Top-k Patterns, KDD'06.

## Lesson 5: Sequential Pattern Mining

### Key concepts

- Sequential patterns, sequential pattern mining
- Sequential pattern mining algorithms: GSP, SPADE, and PrefixSpan
- Closed sequential patterns, the CloSpan algorithm

### Sequential Pattern and Sequential Pattern Mining

- Sequential pattern mining has broad applications. For instance: “buy a laptop, then a digital camera, then a smartphone, within 6 months” ⇒ this could also be applied to stocks and markets.
- So we are trying to get sequential patterns out of vast **sequence DBs** (i.e. entries with timestamps) or **time-series DBs** (evenly spaced timestamps).
- For sequential patterns we also have the concepts of **gapped** and **non-gapped** patterns.

**Sequential pattern mining**: given a set of sequences, find the **complete set of frequent subsequences** (satisfying a given min-support threshold).

- A **sequence** can be written, for instance: < (ef) (ab) (df) c b > (here we use parentheses to identify, e.g., a shopping basket).
- Each **element** in a **sequence** may contain a set of **items** (also called **events**).
- Items within an element are unordered (usually listed alphabetically).

- Any “substring” keeping the parentheses at the appropriate locations in a **sequence** defines a **subsequence**; for instance <a(bc)dc> is a **subsequence** of <a(abc)(ac)d(cf)>.

- Then, given a support threshold, for instance min_sup = 2, we can find the **frequent sequential patterns**.

- Algorithm requirements: **efficient**, **scalable**, **finding complete sets**, **incorporating various kinds of user-specific constraints**.

- The **Apriori** property still holds for sequential patterns: *if a subsequence \(s_1\) is infrequent, none of \(s_1\)'s super-sequences can be frequent.*

- We can then develop the same kinds of algorithms as before, based on the Apriori property:
  - **GSP** (Generalized Sequential Patterns): cf. Srikant & Agrawal @ EDBT'96
  - **SPADE** (vertical format-based mining): cf. Zaki @ Machine Learning'00
  - **PrefixSpan** (pattern-growth method): cf. Pei, et al. @ ICDE'01

- To mine **closed sequential patterns** we have the **CloSpan** method: cf. Yan, et al. @ SDM'03
- Finally, we also have **constraint-based** sequential pattern mining.

### GSP: Apriori-Based Sequential Pattern Mining

- We start with a sequence database:

SID | Sequence |
---|---|
10 | <(bd)cb(ac)> |
20 | <(bf)(ce)b(fg)> |
30 | <(ah)(bf)abf> |
40 | <(be)(ce)d> |
50 | <a(bd)bcb(ade)> |

- We start with the initial candidates: e.g. all singleton sequences: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>.
- We scan the DB once and count the support of each candidate.
- Prune the infrequent candidates: say <g> and <h> here.
- Generate the length-2 candidate sequences:
  - as separate items: <aa>, <ab>, …, <ff>
  - or as elements of a single event: <(ab)>, <(ac)>, …, <(ef)>.

⇒ We repeat steps 2 to 4 until no candidate or frequent sequence is found anymore.

⇒ With the pruning stage based on the Apriori property, we reduce the number of candidates for the next sequence length at each step.

### SPADE: Sequential Pattern Mining in Vertical Data Format

- A sequence database is mapped to <SID, EID> pairs (i.e. sequence ID and element ID), for instance with:

SID | Sequence |
---|---|
1 | <a(abc)(ac)d(cf)> |
2 | <(ad)c(bc)(ae)> |
3 | <(ef)(ab)(df)cb> |
4 | <eg(af)cbc> |

We can build the mapping:

SID | EID | Items |
---|---|---|
1 | 1 | a |
1 | 2 | abc |
1 | 3 | ac |
1 | 4 | d |
1 | 5 | cf |
2 | 1 | ad |
… | … | … |

Then we figure out where each item occurs:

a | | b | | … |
---|---|---|---|---|
SID | EID | SID | EID | … |
1 | 1 | 1 | 2 | … |
1 | 2 | 2 | 3 | … |
1 | 3 | 3 | 2 | … |
2 | 1 | 3 | 5 | … |
2 | 4 | 4 | 5 | … |
3 | 2 | … | … | … |

Then to build length-2 sequences we check for each sequence ID the order of the Element IDs:

ab | | | ba | | | … |
---|---|---|---|---|---|---|
SID | EID(a) | EID(b) | SID | EID(b) | EID(a) | … |
1 | 1 | 2 | 1 | 2 | 3 | … |
2 | 1 | 3 | 2 | 3 | 4 | … |
3 | 2 | 5 | … | … | … | … |

Then we can grow the subsequences (patterns) one item at a time by Apriori candidate generation.

### PrefixSpan: Sequential Pattern Mining by Pattern-Growth

- We need to introduce the concepts of **prefix** and **suffix** (i.e. projection). Starting from the DB:

SID | Sequence |
---|---|
1 | <a(abc)(ac)d(cf)> |
2 | <(ad)c(bc)(ae)> |
3 | <(ef)(ab)(df)cb> |
4 | <eg(af)cbc> |

We build the prefix/suffix DB:

Prefix | Suffix (Projection) |
---|---|
<a> | <(abc)(ac)d(cf)> |
<aa> | <(_bc)(ac)d(cf)> |
<ab> | <(_c)(ac)d(cf)> |

- Given a **sequence** we can thus extract a set of **prefixes** and **suffixes** (i.e. prefix-based projections).

- PrefixSpan mining method:

  - Find the length-1 sequential patterns: e.g. <a>, <b>, <c>, <d>, <e>, <f>,
  - Then divide the search space to mine each projected DB:
    - the <a>-projected DB, the <b>-projected DB, …, the <f>-projected DB,
  - In each projection, find the length-1 frequent patterns, then combine them with the prefix to get the length-2 frequent sequential patterns, for instance: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, etc.

⇒ The major advantages of PrefixSpan are:

- No candidate subsequences need to be generated,
- Projected DBs keep shrinking.
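A deliberately simplified PrefixSpan sketch, assuming sequences of single items (the real algorithm also handles itemset elements like (ab), which this toy version omits):

```python
def prefixspan(db, min_sup, prefix=()):
    """Minimal PrefixSpan: count items in the (projected) DB, extend the
    prefix with each frequent item, project on it, and recurse."""
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    results = []
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        new_prefix = prefix + (item,)
        results.append((new_prefix, sup))
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        results += prefixspan(projected, min_sup, new_prefix)
    return results

db = [list("abcd"), list("acd"), list("abd"), list("bcd")]
print(prefixspan(db, min_sup=3))  # e.g. ('a',):3, ('a','d'):3, ('d',):4, ...
```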

- The major cost of PrefixSpan is **constructing the projected DBs**:
  - suffixes largely repeat across the recursive projected DBs.
- When the DB can be held in main memory we can use **pseudo-projection**:
  - Given s = <a(abc)(ac)d(cf)>:
    - the <a> projection can be noted **s|<a>: (, 2)** (same as s, but starting at position 2),
    - the <ab> projection can be noted **s|<ab>: (, 4)** (same as s, but starting at position 4),
  - ⇒ No physical copy is required: we use a pointer to the sequence and apply offsets.

- When the DB does not fit in memory, we have to do physical projections until the data fits in memory.
- For efficient mining, both physical and pseudo-projections should be used together.

### CloSpan: Mining Closed Sequential Patterns

- [def] **Closed sequential pattern** s: *there exists no super-pattern s' such that \(s' \supset s\) and s' and s have the same support.*

- For instance, given <abc>: 20, <abcd>: 20 and <abcde>: 15 ⇒ <abcd> and <abcde> are **closed sequential patterns**.

- Why try to mine closed sequential patterns?
  - To reduce the number of (redundant) patterns,
  - while keeping the same expressive power.

**Property** \(P_1\): *if \(s \subseteq s_1\), then s and \(s_1\) have the same support* **iff** *their two projected DBs have the same size.*

- From this property we can develop two pruning techniques, **backward sub-pattern** and **backward super-pattern** pruning, to remove redundant search space.

**CloSpan** was developed on this principle and greatly enhances efficiency: cf. Yan, et al. @ SDM'03.

## Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns

### Key concepts

- Spatial association, progressive deepening
- Spatial colocation pattern, colocation rule
- Partition-based trajectory pattern mining, t-pattern
- Moving object clusters: flock, convoy, and swarm
- Trajectory clustering, partition-and-group framework
- Semantics-rich movement patterns
- Periodic movement patterns

### Mining Spatial Associations

- Spatial frequent patterns and association rules are also of the form: **A ⇒ B [s%, c%]**
- In the rule format above, A and B are sets of **spatial or non-spatial predicates**, for instance:
  - Topological relations: intersects, overlaps, disjoint, etc.
  - Spatial orientation: left_of, west_of, under, etc.
  - Distance information: close_to, within_distance, etc.

- Example: is_a(X, large_town) ^ intersect(X, highway) → adjacent_to(X, water) [7%, 85%]

- We usually want to explore **spatial autocorrelation** (because spatial data tends to be highly self-correlated: nearby things are more related than distant ones).

- When mining spatial associations, we can apply the **progressive refinement** heuristic:
  - We can derive a hierarchy of spatial relationships: close_to can be seen as a generalization of near_by, touch, intersect, contain, …
  - So we could start only with close_to and, if it is frequent, then try to **refine** with the other relationships.
  - ⇒ **Progressive refinement**: *first search for the rough relationship, and then refine it.*

- Thus, the typical two-step mining of spatial associations:

1. Rough spatial computation as a filter: use **MBRs** (Minimum Bounding Rectangles) or an **R-tree** for rough estimation.
2. Detailed spatial algorithm (as refinement): apply it only to the objects that have passed the rough association test (i.e. those passing the min-support threshold).

### Mining Spatial Colocation Patterns

- [def] **Colocation pattern**: *a group of spatial features or events that are frequently co-located in the same region.*
- On a graph, we could for instance use an edge to indicate the co-location of near_by points.

- [def] **Rowset(C)**: *the set of neighbor-sets L such that every feature in pattern C appears as a feature of an instance in L.* *(TODO: understand and explain this further.)*

- From the concept of Rowset we can build a colocation rule R: A → B, with conditional probability cp(R) defined as:

\[cp(R) = \frac{|\{L \in rowset(A) | \exists L' s.t. (L \subseteq L') \land (L' \in rowset(A \cup B))\}|}{|rowset(A)|}\]

- Following the example from the video lectures (currently not really understanding this): \(cp(\{A,B\} \rightarrow \{C, D\}) = |rowset(\{A,B,C,D\})| / |rowset(\{A,B\})|\)

- [def] **Participation ratio** pr(C, f): *the probability that C is observed in a neighbor-set wherever the feature f is observed*:

\[pr(C,f) = \frac{|\{r \mid (r \in S) \land (r.f = f) \land (\text{r is in a row instance of C})\}|}{|\{r \mid (r \in S) \land (r.f = f)\}|}\]

**Property**: monotonicity of the participation ratio: *let C and C' be two co-location patterns such that \(C' \subset C\); then for each feature \(f \in C'\): \(pr(C',f) \ge pr(C,f)\).*

- Based on this monotonicity property, an Apriori-like algorithm can be derived for efficient mining of colocation patterns:
  - Let min-feature-support = \(\sigma\) and min-pr = \(\rho\),
  - Start with the set of single-feature patterns \(\{p_1\}\) with support \(\ge \sigma\),
  - Grow to size k in an Apriori way (i.e. stop growing if the pattern is infrequent),
  - For each such pattern p, mine its super-patterns P such that \(pr(P, p) \ge \rho\), in an Apriori way.

### Mining and Aggregating Patterns over Multiple Trajectories

**Partition-based trajectory pattern mining** (i.e. mining T-patterns) (cf. F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi, Trajectory Pattern Mining, KDD'07):

- Partition the space into equal-width grids and obtain **Regions of Interest** (RoIs).
- Then transform each input trajectory into a **time-annotated symbolic sequence**, for instance:
  - Railway Station [15min] → Castle Square [2h15min] → Museum
  - Railway Station [10min] → Middle Bridge [10min] → Campus
- Then use constraint-based sequential pattern mining to find the trajectory patterns.
- The resulting T-patterns are of the form \((x_0, y_0) \xrightarrow{\alpha_1} (x_1,y_1)\): a transition between two regions annotated with a typical travel time \(\alpha_1\).

**Detecting moving-object clusters**:

- **Flock and convoy**: both require k **consecutive** timestamps:
  - **Flock**: at least m entities are within a circular region of **radius r** and move in the same direction,
  - **Convoy**: density-based clustering at each timestamp; no need to use a rigid circle.
- **Swarm**: the moving objects may not be close to each other for all consecutive timestamps. An efficient pattern mining algorithm can be derived for mining swarms (cf. Z. Li, et al., Swarm: Mining Relaxed Temporal Moving Object Clusters, VLDB'10).

**Trajectory clustering: a partition-and-group framework** (cf. J.-G. Lee, et al., Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07):

- Grouping trajectories as a whole ⇒ we cannot find similar portions of trajectories.
- Solution: discover common **sub-trajectories**, e.g. to forecast hurricane landfall.
- Two phases:
  - **partitioning**: creating sequences of segments,
  - then **grouping**: finding patterns in the segments created above.

- Partitioning can be done using the minimum description length (MDL) principle: i.e. reduce the number of points in the paths, but avoid introducing too much error (i.e. real points too far away from the approximating segments).

### Mining Semantics-Rich Movement Patterns

**Frequent movement pattern**: a movement sequence that frequently appears in the input trajectory database (even if we cannot consider precise spatial locations), for instance: *Home → Starbucks → Office → Restaurant → Cafe → Restaurant → Bar*

- To mine Frequent movement patterns we may need to group similar places to be able to find frequent subsequences.

- Two-step top-down mining approach:
  - Find a set of coarse patterns that reflect people's semantics-level transitions (office → restaurant, home → gym),
  - Split each coarse pattern into several fine-grained ones by grouping similar movement snippets.

- cf. C. Zhang et al., Splitter: Mining Fine-Grained Sequential Patterns in Semantic Trajectories, VLDB'14.

### Mining Periodic Movement Patterns

**Pattern discovery in sparse data** (for instance, finding bird flight patterns):

- Periodicity might show up at some reference **spots** (or **cluster centers**),
- Reference spots can be detected using a **density-based method**,
- Periods are then detected for each reference spot using the **Fourier transform** and **auto-correlation**.

- cf. Z. Li, et al., Mining Periodic Behaviors for Moving Objects, KDD'10
- cf. Z. Li, et al., ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 2015

### Recommended Readings

- F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi: Trajectory Pattern Mining. KDD’07
- Y. Huang, S. Shekhar, H. Xiong: Discovering colocation patterns from spatial data sets: A general approach. IEEE Trans. on Knowledge & Data Eng., 16(12), 2004
- Y. Huang, J. Pei, H. Xiong: Mining Co-Location Patterns with Rare Events from Spatial Data Sets. GeoInformatica 10(3): 239-260, 2006
- K. Koperski, J. Han: Discovery of Spatial Association Rules in Geographic Information Databases. SSD’95
- J.-G. Lee, J. Han, and K.-Y. Whang: Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07
- Z. Li, B. Ding, J. Han, R. Kays: Swarm: Mining Relaxed Temporal Moving Object Clusters. VLDB’10
- Z. Li, B. Ding, J. Han, R. Kays, P. Nye: Mining Periodic Behaviors for Moving Objects. KDD’10
- Z. Li, J. Wang and J. Han, ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 27(5): 1219-1232, 2015
- C. Zhang, J. Han, L. Shou, J. Lu, and T. La Porta: Splitter: Mining Fine Grained Sequential Patterns in Semantic Trajectories. VLDB'14

## Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data

### Key concepts

- Phrase mining; training-based phrase mining vs. topic modeling-based phrase mining
- TNG, TurboTopic; KERT; criteria to judge quality of phrases
- ToPMine; word colocation; phrasal segmentation; PhraseLDA; SegPhrase

### From Frequent Pattern Mining to Phrase Mining

- An important concept when mining text data is to mine **quality phrases**.
- Unigrams (i.e. single words) vs. phrases:
  - unigrams are often **ambiguous**,
  - a phrase is a natural, meaningful, unambiguous semantic unit.

- Thus, it is interesting to mine semantically meaningful phrases:
  - transforming text data from **word granularity** to **phrase granularity**,
  - enhancing the power and efficiency of manipulating unstructured data.

- How do we go from frequent patterns to phrases?
  - **General principle**: exploit information redundancy and data-driven criteria to determine phrase boundaries and salience.
  - **Methodology**:
    - frequent pattern mining and collocation analysis,
    - phrasal segmentation,
    - quality phrase assessment.

- Recent developments in phrase mining:
  - **ToPMine**: mining quality phrases without training (A. El-Kishky, et al., 2015),
  - **SegPhrase**: mining quality phrases with tiny training sets (J. Liu, et al., 2015).

### Previous Phrase Mining Methods

- Phrase mining came from the NLP community as **chunking**:
  - modeling phrase extraction as a sequence labeling problem (B-NP, I-NP, O, …).

- Previous methods required annotation and training:
  - annotate hundreds of documents as training data,
  - train a supervised model based on part-of-speech features.

- Recent trend:
  - use distributional features based on **web n-grams** (Bergsma et al., 2010),
  - state-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score.
- Limitations:
  - high annotation cost, not scalable to a new language or a new domain/genre,
  - may not fit domain-specific, dynamic, emerging applications (such as scientific domains, query logs, or social media).

- Hence the attempts to perform unsupervised phrase mining and topic modeling instead.
- Many studies of unsupervised phrase mining are linked with **topic modeling**:
  - each **topic** is represented by a word distribution,
  - **documents** are represented by multiple topics in different proportions,
  - ⇒ this approach does not require any prior annotation or labeling of the documents.

- With topic modeling we rather rely on **statistical topic modeling algorithms**; the most common algorithm is LDA (Latent Dirichlet Allocation) [cf. Blei, et al., 2003].

- Three strategies to perform phrase mining using topic modeling:
  - Strategy 1: simultaneously infer the topics and the phrase tokens (generate bag-of-words → generate sequence of tokens),
  - Strategy 2: after bag-of-words model inference, visualize the topics with n-grams,
  - Strategy 3: before bag-of-words model inference, mine phrases and impose them on the bag-of-words model.

#### Strategy 1: Simultaneously Inferring Phrases and Topics

- **Bigram Topic Model** [cf. Wallach'06]:
  - a probabilistic generative model that conditions on the previous word and topic when drawing the next word.
- **Topical N-Grams** (TNG) [Wang, et al.'07] (a generalization of the Bigram Topic Model):
  - a probabilistic model that generates words in textual order,
  - creates n-grams by concatenating successive bigrams.
- **Phrase-Discovering LDA** (PDLDA) [Lindsey, et al.'12]:
  - viewing each sentence as a time series of words, PDLDA posits that the generative parameter (topic) changes periodically,
  - each word is drawn based on the previous m words (context) and the current phrase topic.
- Limitations:
  - high model complexity: tends to overfit,
  - high inference cost: slow.

#### Strategy 2: Post Topic-Modeling Phrase Construction: TurboTopics (I)

**TurboTopics** [Blei & Lafferty'09]: phrase construction as a post-processing step to Latent Dirichlet Allocation:

- Perform LDA on the corpus to assign each token a topic label,
- Merge adjacent unigrams with the same topic label by a distribution-free permutation test on an arbitrary-length back-off model,
- End the recursive merging when all significant adjacent unigrams have been merged.

#### Strategy 2: Post Topic-Modeling Phrase Construction: KERT (II)

**KERT** [Danilevsky et al.'14]: phrase construction as a post-processing step to LDA:

- Run bag-of-words model inference and assign a topic label to each token,
- Perform **frequent pattern mining** to extract candidate phrases within each topic,
- Perform **phrase ranking** based on four different criteria:
  - **Popularity**: e.g. “information retrieval” vs. “cross-language information retrieval”,
  - **Concordance**: e.g. “powerful tea” vs. “strong tea”,
  - **Informativeness**: e.g. “this paper” (frequent but not informative),
  - **Completeness**: “vector machine” vs. “support vector machine”.

### ToPMine: Phrase Mining without Training Data

#### Strategy 3: First Phrase Mining then Topic Modeling

- Why in this order?
  - If we have, for instance, “Knowledge discovery using least square support vector machine classifiers…” (topics in blue and red), we won't be able to generate the proper phrases.
  - With phrase mining first, we first group the phrase elements: “[Knowledge discovery] using [least square] [support vector machine] [classifiers]…”, then apply the topics to the groups.

- Techniques for this strategy:
  - first, phrase mining ⇒ document segmentation and then phrase ranking,
  - then, topic model inference with phrase constraints.

- **Phrase mining** stage:
  - **Frequent contiguous pattern mining**: extract candidate phrases and their counts,
  - Perform agglomerative merging of adjacent unigrams as guided by a **significance score**,
  - Perform document segmentation to count the phrase occurrences (and calculate the **rectified** (i.e. true) phrase frequencies),
  - Do phrase ranking (using the criteria proposed in KERT): popularity, concordance, informativeness, completeness.
- Finally, do phrase-based topic modeling:
  - the mined bags-of-phrases are passed as input to **PhraseLDA**, an extension of LDA that constrains all words in a phrase to share the **same latent topic**.

⇒ This method essentially performs collocation mining:

- Collocation: a sequence of words that occurs **more frequently than expected**,
- often **interesting**: it relays information not portrayed by the constituent terms alone.

- Many different measures are used to extract collocations from a corpus [cf. Dunning 93, Pederson 96], for instance: mutual information, t-test, z-test, chi-squared test, likelihood ratio.
- Pointwise Mutual Information: \(PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)}\)

- Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm.
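For instance, PMI can be computed directly from corpus probabilities (the numbers below are made up):

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log of observed vs. expected co-occurrence."""
    return math.log(p_xy / (p_x * p_y))

# Hypothetical probabilities for the bigram "support vector":
print(pmi(p_xy=0.001, p_x=0.01, p_y=0.002))  # ~3.9 > 0: collocation candidate
```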

- ⇒ ToPMine is efficient and generates high-quality topics and phrases without any training data.

### SegPhrase: Phrase Mining with Tiny Training Sets

- A small set of training data may enhance the quality of phrase mining (cf. J. Liu et al., Mining Quality Phrases from Massive Text Corpora, In SIGMOD'15)

- SegPhrase+, the overall framework:
  - **ClassPhrase**: frequent pattern mining, feature extraction, classification,
  - **SegPhrase**: phrasal segmentation and phrase quality estimation,
  - **SegPhrase+**: one more round to enhance the mined phrase quality.

#### Pattern Mining and Feature Extraction

- Pattern mining for the candidate set:
  - build a candidate phrase set by frequent pattern mining,
  - mine frequent k-grams (k is typically small, e.g. 6 in the experiments),
  - **Popularity** is measured by the raw frequency of the words and phrases mined from the corpus.

- **Feature extraction** based on **concordance**:
  - partition a phrase into two parts and check whether the co-occurrence is significantly higher than pure random.

- **Feature extraction** based on **informativeness**:
  - quality phrases typically start and end with a non-stopword,
  - use the average IDF over the words in the phrase to measure the semantics,
  - usually, the probability that a quality phrase appears in quotes, brackets, or connected by a dash is higher (punctuation information) (e.g. “state-of-the-art”).

- **Classification**:
  - uses a tiny training set (about 300 labels for 1 GB of text; phrases extracted from a knowledge base can also be used),
  - label: indicates whether a phrase is a high-quality one or not, e.g. “support vector machine”: 1, “the experiment shows”: 0,
  - construct models to distinguish quality phrases from poor ones,
  - use the Random Forest algorithm to bootstrap different datasets with limited labels.

- **Phrasal segmentation**: can tell which phrase is more appropriate:
  - partition a sequence of words by maximizing the likelihood,
  - consider a length penalty and filter out phrases with low rectified frequency.

- So the process is repeated twice:
  - Classification → Phrasal segmentation (⇒ **SegPhrase**),
  - → Classification → Phrasal segmentation (⇒ **SegPhrase+**).

### Recommended Readings

- S. Bergsma, E. Pitler, D. Lin, Creating robust supervised classifiers via web-scale n-gram data, ACL’2010
- D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
- D.M. Blei, A. Y. Ng, M. I. Jordan, J. D. Lafferty, Latent Dirichlet allocation. JMLR 2003
- K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
- M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM’14
- A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
- R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL’12.
- J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
- A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB’10
- X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. ICDM’07

## Lesson 8: Advanced Topics on Pattern Discovery

### Key concepts

- Stream data mining; mining frequent patterns in data streams; the lossy counting algorithm
- Software bug mining; pattern discovery for software bug mining; mining copy-and-paste bugs
- Image analysis; pattern mining for image analysis; visual items; visual patterns
- Three categories on privacy issues in data mining: input privacy, output privacy and owner privacy, k-anonymity and differential privacy; data perturbation for privacy-preserving pattern mining
- Invisible data mining; new application exploration

### Frequent Pattern Mining in Data Streams

**Data streams**: continuous, ordered, changing, fast, huge volume:

- random access is expensive ⇒ **single-scan algorithms**,
- most streams are low-level and multi-dimensional,
- stream data processing: **Stream Data Management Systems** (SDMS).
**Stream mining**vs**Stream query**- For mining, we want to see the big picture (eg. less precision)

**Mining tasks**- Pattern maining in stream,
- clustering,
- etc, (much more)

- Mining **precise** frequent patterns in a stream is unrealistic,
- so we look for **approximate answers** instead.
- How do we mine frequent patterns with a **good approximation**?
- **Lossy Counting Algorithm** [Manku & Motwani, VLDB'02] (a minimal sketch follows below):
  - ⇒ discard items with a very low support count,
  - given a support threshold \(\sigma\), an error threshold \(\epsilon\) and a stream length N,
  - output: the items with frequency counts exceeding \((\sigma - \epsilon) N\),
  - we undercount by at most: **frequency count error** \(\le \epsilon N\),
  - approximation properties:
    - no false negatives,
    - false positives have a true frequency count of at least \((\sigma - \epsilon) N\),
    - frequency counts are underestimated by at most \(\epsilon N\).
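A minimal single-item Lossy Counting sketch (the bucket bookkeeping follows the description above; the stream and thresholds are made up):

```python
from math import ceil

def lossy_counting(stream, epsilon):
    """Lossy Counting over single items: keep (count, max_error) per item
    and prune at bucket boundaries; counts are undercounted by <= epsilon*N."""
    width = ceil(1 / epsilon)              # bucket width
    entries = {}                           # item -> (count, max error Delta)
    for n, item in enumerate(stream, start=1):
        bucket = ceil(n / width)
        count, err = entries.get(item, (0, bucket - 1))
        entries[item] = (count + 1, err)
        if n % width == 0:                 # bucket boundary: prune weak entries
            entries = {x: (c, e) for x, (c, e) in entries.items()
                       if c + e > bucket}
    return entries

stream = ["a", "b", "a", "c", "a", "b", "a", "a"] * 100
sigma, epsilon = 0.4, 0.01
N = len(stream)
counts = lossy_counting(stream, epsilon)
# Report items with count >= (sigma - epsilon) * N: no false negatives.
print({x: c for x, (c, e) in counts.items() if c >= (sigma - epsilon) * N})
```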

**Other issues**:

- space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05),
- mining approximate frequent k-itemsets in data streams,
- mining sequential patterns in data streams.

- Recommended readings:
- G. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams”, VLDB’02
- A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, ICDT'05

### Pattern Discovery for Software Bug Mining

- Finding bugs is challenging: there are often no clear specifications or properties, and it requires substantial human effort to analyze the data.

- Software reliability analysis:
  - static bug detection: check the code,
  - dynamic bug detection or testing: run the code,
  - debugging: given symptoms or failures, pinpoint the bug locations in the code.

- Why pattern mining?
  - Code or running sequences contain hidden patterns:
    - common patterns → likely a specification or property,
    - violations (anomalies compared to the patterns) → likely bugs.
  - Mining patterns narrows down the scope of inspection:
    - ⇒ code locations or predicates that occur more in failing runs and less in passing runs are suspicious bug locations.

**Mining rules from source code**:

- bugs as deviant behavior (e.g. by statistical analysis),
- mining programming rules (e.g. by frequent itemset mining),
- mining function precedence protocols (e.g. by frequent subsequence mining),
- revealing neglected conditions (e.g. by frequent itemset/subgraph mining).

**Mining rules from revision histories**: by frequent itemset mining.

**Mining copy-paste patterns from source code**:

- find copy-paste bugs: CP-Miner (cf. Z. Li, S. Lu, S. Myagmar, Y. Zhou, “CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code”, OSDI'04).

#### Mining copy-and-paste bugs

- Copy-pasting is common:
  - 12% in Linux (kernel?) code,
  - 19% in X Window (server?) code.

- Copy-pasting is error-prone:
  - we can mine “forget-to-change” bugs by sequential pattern mining.

**Building the sequence database** from source code:

- Tokenize each component:
  - different operators, constants and keywords → different tokens,
  - the same type of identifiers → the same token,
  - for instance: “old = 3;” → “5 61 20” and “new = 3;” → “5 61 20”.
- Then **compute a hash** per statement: “5 61 20” → 16.
- So for a list of statements we produce a list of hashes: 65, 16, 16, 71, … (a toy sketch follows below).

**Detecting the “forget-to-change” bugs**:

- Modifications to the sequential pattern mining algorithm:
  - **constrain the max gap**: (16, 16, 71) ….. (16, 16, 10, 71),
  - then combine neighboring copy-pasted segments repeatedly,
  - then find conflicts: i.e. identify names that cannot be mapped to any correspondence, and introduce an **unchanged ratio**,
  - if 0 < unchanged ratio < threshold, then report the location as a potential bug.

- CP-Miner reported many copy-paste bugs in Linux, Apache, etc. (analyzing millions of LOC).

### Pattern Discovery for Image Analysis

- An image can be characterized by **visual primitives** (e.g. **interest points**).
- Each visual primitive can be described by visual features (e.g. a high-dimensional feature vector).
- Each image is a collection of visual primitives.

#### Visual Patterns Discovery

- Visual primitives can be clustered into visual “items” (similar visual primitives belong to the same item).
- Each visual primitive finds its k nearest neighbors in the image to form a “visual transaction”.
- Then mining the “frequent itemsets” leads to semantically meaningful visual patterns.

- So the process is:
- images → feature extraction → Visual primitives
- Visual primitives → feature quantization → Induced visual transactions
- Induced Visual transactions → pattern discovery → Visual patterns

**Challenges**:

- Images are spatial data:
  - the spatial configuration among the visual items matters,
  - induced transactions may overlap with each other ⇒ over-counting problem.

**Uncertainties of visual items**:

- noisy clustering of visual primitives into visual items affects the visual patterns,
- visual synonymy and polysemy.

#### Recommended Readings

- Hongxing Wang, Gangqiang Zhao, Junsong Yuan, Visual pattern discovery in image and video data: a brief survey, Wiley Interdisciplinary Review: Data Mining and Knowledge Discovery 4(1): 24-37 (2014)
- Hongxing Wang, Junsong Yuan, Ying Wu, Context-Aware Discovery of Visual Co-Occurrence Patterns. IEEE Transactions on Image Processing 23(4): 1805-1819 (2014)
- Gangqiang Zhao, Junsong Yuan, Discovering Thematic Patterns in Videos via Cohesive Sub-graph Mining. ICDM 2011: 1260-1265
- Junsong Yuan, Ying Wu, Ming Yang, From frequent itemsets to semantically meaningful visual patterns. KDD 2007: 864-873

### Advanced Topics on Pattern Discovery: Pattern Mining and Society-Privacy Issue

- Potential adverse side-effects of data mining:
  - privacy and accuracy are typically contradictory in nature,
  - improving one often incurs a cost on the other.

- Three categories of privacy issues:
  - input privacy: distort or hide the data,
  - output privacy (e.g. knowledge hiding),
  - owner privacy: hide the owner/source information of the data.

**Ensuring input privacy**:

- Approach 1: B2B (business-to-business) environment,
- Approach 2: B2C (business-to-customer) environment.

**K-anonymity privacy requirement**: each equivalence class contains at least k records.

**Data perturbation**:

- **Statistical distortion**: using randomization algorithms:
  - MASK [Rizvi & Haritsa VLDB'02]:
    - flip each 0/1 bit with probability p (this may add a lot of items),
    - tune p carefully to achieve acceptable average privacy and good accuracy.
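A sketch of the MASK-style distortion described above, assuming a known item universe (toy values; the flip probability convention follows these notes):

```python
import random

def mask(transaction, universe, p):
    """Flip each 0/1 presence bit with probability p (MASK-style distortion)."""
    distorted = set()
    for item in universe:
        present = item in transaction
        if random.random() < p:
            present = not present          # flip this bit
        if present:
            distorted.add(item)
    return distorted

universe = {"milk", "bread", "beer", "diaper"}
print(mask({"milk", "bread"}, universe, p=0.1))
```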

- **Cut-and-paste (C&P) operator** [Evfimievski et al. KDD'02]:
  - uniform randomization: each existing item in the real transaction is, with probability p, replaced with a new item not present in the original transaction.

### Recommended Readings

- R. Agrawal and R. Srikant, Privacy-preserving data mining, SIGMOD'00
- C. C. Aggarwal and P. S. Yu, Privacy-Preserving Data Mining: Models and Algorithms, Springer, 2008
- C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. 2014
- A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In KDD'02
- A. Gkoulalas-Divanis, J. Haritsa and M. Kantarcioglu, Privacy in Association Rule Mining, in C. Aggarwal and J. Han (eds.), Frequent Pattern Mining, Springer, 2014 (Chapter 15)
- N. Li, T. Li, S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. ICDE'07
- A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, TKDD 2007
- S. Rizvi and J. Haritsa. Maintaining data privacy in association rule mining. VLDB’02
- J. Vaidya, C. W. Clifton and Y. M. Zhu, Privacy Preserving Data Mining, Springer, 2010

### Advanced Topics on Pattern Discovery: Looking Forward

- Lots of research issues on pattern discovery are still waiting to be solved.
- **Invisible pattern mining** can/could be used in many fields.