Table of Contents

Pattern Discovery in Data Mining

Course page: https://www.coursera.org/learn/data-patterns/home/welcome
Teacher: Prof. Jiawei Han
Textbook: Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei (2011) [Morgan Kaufmann]

Lesson 1: Pattern Discovery Basic Concepts

Key questions

Definitions

Challenge

With the definitions above, we often obtain too many frequent patterns.

⇒ In many applications, mining closed patterns is more desirable than mining max-patterns.
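As a concrete illustration of the two notions, here is a brute-force sketch (not one of the course's mining algorithms; all names and the toy database are ours): a closed pattern has no proper superset with the same support, while a max-pattern has no frequent proper superset at all.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration of frequent itemsets (fine for tiny examples)."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(1 for t in transactions if set(cand) <= t)
            if sup >= min_sup:
                freq[frozenset(cand)] = sup
    return freq

def closed_and_max(freq):
    """Closed: no proper frequent superset with the same support.
    Max: no proper frequent superset at all."""
    closed, maximal = set(), set()
    for p, sup in freq.items():
        supersets = [q for q in freq if p < q]
        if all(freq[q] != sup for q in supersets):
            closed.add(p)
        if not supersets:
            maximal.add(p)
    return closed, maximal

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}]
closed, maximal = closed_and_max(frequent_itemsets(db, min_sup=2))
# closed: {a}, {a,b}, {a,c} -- maximal: {a,b}, {a,c}
```

Note that every max-pattern is closed, but not conversely: {a} above is closed (support 3, higher than any superset) yet not maximal.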

Formal Model definition

Let \(\mathcal{I} = \{I_1, I_2, \dots, I_m\}\) be a set of binary attributes, called items. Let \(T\) be a database of transactions. Each transaction \(t\) is represented as a binary vector, with \(t[k] = 1\) if \(t\) contains the item \(I_k\), and \(t[k] = 0\) otherwise.
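Under this model, the (relative) support of an itemset is the fraction of transactions whose vector has a 1 in every corresponding position. A minimal sketch (0-based item indices; the database is a made-up example):

```python
def support(db, itemset):
    """Fraction of transactions t with t[k] == 1 for every item index k in itemset."""
    return sum(all(t[k] == 1 for k in itemset) for t in db) / len(db)

# Toy database over items I1..I4 (indices 0..3); one binary vector per transaction.
db = [
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
]
s = support(db, [0, 1])  # I1 and I2 bought together in 2 of 4 transactions: 0.5
```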

FIXME: Complete this section.

Lesson 2: Efficient Pattern Mining Methods

Key concepts

Downward Closure Property of Frequent Patterns

The Apriori Algorithm

Extensions or Improvements of Apriori

Partitioning

Direct Hashing and Pruning (DHP)

Mining Frequent Patterns by Exploring Vertical Data Format

FPGrowth: A Pattern Growth Approach

Mining Closed Patterns

Lesson 3: Pattern Evaluation

Key concepts

Limitation of the Support-Confidence Framework

Interestingness Measures: Lift and \(\chi^2\)

\[lift(B,C) = \frac{c(B \rightarrow C)}{s(C)} = \frac{s(B \cup C)}{s(B) \times s(C)}\]

\[\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}\]
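Both measures can be computed directly from supports and a 2×2 contingency table; a small sketch (function names and numbers are illustrative only):

```python
def lift(s_bc, s_b, s_c):
    """lift(B,C) = s(B ∪ C) / (s(B) · s(C)); > 1 positive, 1 independent, < 1 negative."""
    return s_bc / (s_b * s_c)

def chi_square_2x2(n11, n10, n01, n00):
    """chi^2 = sum (O - E)^2 / E over a 2x2 table; E is taken under independence."""
    n = n11 + n10 + n01 + n00
    r1, r0 = n11 + n10, n01 + n00
    c1, c0 = n11 + n01, n10 + n00
    observed = [n11, n10, n01, n00]
    expected = [r1 * c1 / n, r1 * c0 / n, r0 * c1 / n, r0 * c0 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

lift(0.3, 0.4, 0.5)             # ≈ 1.5: B and C are positively correlated
chi_square_2x2(20, 20, 20, 20)  # 0.0: observed counts match independence exactly
```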

Null Invariance Measures

Comparison of Null-Invariant Measures

\[IR(A,B) = \frac{|s(A) - s(B)|}{s(A)+s(B)-s(A \cup B)}\]
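A small sketch computing the imbalance ratio, together with Kulczynski, one of the null-invariant measures compared in this lesson (function names and numbers are illustrative):

```python
def imbalance_ratio(s_a, s_b, s_ab):
    """IR(A,B) = |s(A) - s(B)| / (s(A) + s(B) - s(A ∪ B)); 0 means perfectly balanced."""
    return abs(s_a - s_b) / (s_a + s_b - s_ab)

def kulczynski(s_a, s_b, s_ab):
    """Kulc(A,B) = (P(B|A) + P(A|B)) / 2, a null-invariant measure."""
    return (s_ab / s_a + s_ab / s_b) / 2

imbalance_ratio(0.6, 0.3, 0.3)  # ≈ 0.5: A is twice as frequent as B
kulczynski(0.6, 0.3, 0.3)       # ≈ 0.75: every B co-occurs with A, half of A with B
```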

Lesson 4: Mining Diverse Patterns

Key concepts

Mining Multi-Level Associations

⇒ Scalable mining methods can easily be extended to take customized min-support thresholds into account.

Mining Multi-Dimensional Associations

Mining Quantitative Associations

Mining Negative Correlations

Mining Compressed Patterns

\[Dist(P_1,P_2) = 1 - \frac{|T(P_1) \cap T(P_2)|}{|T(P_1) \cup T(P_2)|}\]
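Here \(T(P)\) is the set of transactions supporting pattern \(P\), so Dist is simply the Jaccard distance between the two covering sets. A minimal sketch (transaction IDs are made up):

```python
def pattern_distance(t1, t2):
    """Jaccard distance between the supporting transaction sets of two patterns."""
    t1, t2 = set(t1), set(t2)
    return 1 - len(t1 & t2) / len(t1 | t2)

pattern_distance({1, 2, 3, 4}, {3, 4, 5})  # 1 - 2/5, i.e. 0.6
pattern_distance({1, 2}, {1, 2})           # 0.0: identical coverage
```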

Lesson 5: Sequential Pattern Mining

Key concepts

Sequential Pattern and Sequential Pattern Mining

GSP: Apriori-Based Sequential Pattern Mining

SID Sequence
10 <(bd)cb(ac)>
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>
  1. We start with initial candidates, e.g. all singleton sequences: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>.
  2. We scan the DB once and count support for each candidate.
  3. Prune infrequent candidates: say <g> and <h> here.
  4. Generate length-2 candidate sequences:
    • as separate elements: <aa>, <ab>, …, <ff>;
    • or as items within a single element: <(ab)>, <(ac)>, …, <(ef)>.

⇒ We repeat steps 2 to 4 until no candidates or frequent sequences are found.

⇒ Thanks to the pruning step based on the Apriori property, the number of candidates generated at each length is reduced.
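The level-wise loop of steps 2–4 can be sketched as follows (simplified: candidate elements here are single items only, so multi-item elements like <(ab)> are not generated; function names and the toy database are ours):

```python
def is_subseq(pat, seq):
    """Is pat (a tuple of items) a subsequence of seq (a list of item sets)?"""
    i = 0
    for element in seq:
        if i < len(pat) and pat[i] in element:
            i += 1
    return i == len(pat)

def gsp_single_items(db, min_sup):
    """Level-wise GSP loop: count support, prune, generate candidates, repeat."""
    items = sorted({x for seq in db for element in seq for x in element})
    level = [(i,) for i in items]
    frequent = {}
    while level:
        # one DB scan: count support of every candidate
        counts = {c: sum(is_subseq(c, s) for s in db) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(survivors)
        # Apriori join: extend p by the last item of any q overlapping it
        freq = sorted(survivors)
        level = [p + (q[-1],) for p in freq for q in freq if p[1:] == q[:-1]]
    return frequent

db = [[{'a'}, {'b'}], [{'a'}, {'b'}], [{'b'}, {'a'}]]
gsp_single_items(db, min_sup=2)  # {('a',): 3, ('b',): 3, ('a', 'b'): 2}
```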

SPADE: Sequential Pattern Mining in Vertical Data Format

SID Sequence
1 <a(abc)(ac)d(cf)>
2 <(ad)c(bc)(ae)>
3 <(ef)(ab)(df)cb>
4 <eg(af)cbc>

We can build the mapping:

SID EID items
1 1 a
1 2 abc
1 3 ac
1 4 d
1 5 cf
2 1 ad

Then we figure out where each item occurs:

Item a:
SID EID
1 1
1 2
1 3
2 1
2 4
3 2

Item b:
SID EID
1 2
2 3
3 2
3 5
4 5

Then to build length-2 sequences we check for each sequence ID the order of the Element IDs:

<ab>:
SID EID(a) EID(b)
1 1 2
2 1 3
3 2 5

<ba>:
SID EID(b) EID(a)
1 2 3
2 3 4

Then we can grow the subsequences (patterns) one item at a time by Apriori candidate generation.
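Assuming elements are stored as Python sets, the id-lists and the temporal join that builds <ab> can be sketched on the lesson's database (function names are ours):

```python
def idlist(db, item):
    """Vertical format: all (SID, EID) pairs where the item occurs."""
    return [(sid, eid)
            for sid, seq in enumerate(db, start=1)
            for eid, element in enumerate(seq, start=1)
            if item in element]

def temporal_join(la, lb):
    """Id-list of <a b>: occurrences of b preceded, in the same sequence,
    by an occurrence of a."""
    return [(sid, eid) for sid, eid in lb
            if any(s == sid and e < eid for s, e in la)]

db = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],
]
ab = temporal_join(idlist(db, 'a'), idlist(db, 'b'))
# ab = [(1, 2), (2, 3), (3, 5), (4, 5)]: <ab> is supported by all four sequences
```

The support of <ab> is the number of distinct SIDs in its id-list, so no horizontal DB scan is needed after the initial pass.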

PrefixSpan: Sequential Pattern Mining by Pattern-Growth

SID Sequence
1 <a(abc)(ac)d(cf)>
2 <(ad)c(bc)(ae)>
3 <(ef)(ab)(df)cb>
4 <eg(af)cbc>

We build the prefix/suffix DB:

Prefix Suffix (Projection)
<a> <(abc)(ac)d(cf)>
<aa> <(_bc)(ac)d(cf)>
<ab> <(_c)(ac)d(cf)>

* Given a sequence, we can thus extract a set of prefixes and suffixes (i.e., prefix-based projections).

* PrefixSpan mining method:

  1. Find length-1 sequential patterns: eg. <a>, <b>, <c>, <d>, <e>, <f>
  2. Then divide the search space to mine each projected DB:
    • <a>-projected DB, <b>-projected DB, …, <f>-projected DB.
  3. In each projected DB, we find the length-1 frequent patterns and combine them with the prefix to get length-2 frequent sequential patterns, for instance: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, etc.
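A sketch of the single-item projection step on the lesson's database (elements stored as sorted tuples; '_' marks the remainder of a partially matched element; function names are ours):

```python
from collections import Counter

def project(db, item):
    """<item>-projected DB: for each sequence, the suffix after the first
    occurrence of item; items left in the same element keep a '_' marker."""
    projected = []
    for seq in db:
        for i, element in enumerate(seq):
            if item in element:
                rest = tuple(x for x in element if x > item)  # assumes sorted elements
                suffix = ([('_',) + rest] if rest else []) + list(seq[i + 1:])
                projected.append(suffix)
                break
    return projected

def frequent_items(projected, min_sup):
    """Items frequent in a projected DB, counted once per projected sequence."""
    counts = Counter()
    for seq in projected:
        counts.update({x for element in seq for x in element if x != '_'})
    return {x: n for x, n in counts.items() if n >= min_sup}

db = [
    (('a',), ('a', 'b', 'c'), ('a', 'c'), ('d',), ('c', 'f')),
    (('a', 'd'), ('c',), ('b', 'c'), ('a', 'e')),
    (('e', 'f'), ('a', 'b'), ('d', 'f'), ('c',), ('b',)),
    (('e',), ('g',), ('a', 'f'), ('c',), ('b',), ('c',)),
]
proj_a = project(db, 'a')  # proj_a[0] == [('a','b','c'), ('a','c'), ('d',), ('c','f')]
frequent_items(proj_a, 2)  # 'b' appears in all four projected sequences
```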

⇒ The major advantages of PrefixSpan are that no candidate subsequences need to be generated and that the projected databases keep shrinking; its main cost is the construction of the projected databases.

CloSpan: Mining Closed Sequential Patterns

Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns

Key concepts

Mining Spatial Associations

  1. Rough spatial computation as a filter: using MBRs (Minimum Bounding Rectangles) or an R-tree for rough estimation.
  2. Detailed spatial algorithm (as refinement): apply it only to the objects that passed the rough association test (e.g. those passing the min_support threshold).
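The filter step reduces to a cheap rectangle-intersection test; a minimal sketch (the function name is ours):

```python
def mbr_overlap(r1, r2):
    """Do two MBRs, given as (xmin, ymin, xmax, ymax), intersect?"""
    return (r1[0] <= r2[2] and r2[0] <= r1[2] and
            r1[1] <= r2[3] and r2[1] <= r1[3])

# Only object pairs passing this cheap test go on to the exact geometric refinement.
mbr_overlap((0, 0, 2, 2), (1, 1, 3, 3))  # True
mbr_overlap((0, 0, 1, 1), (2, 2, 3, 3))  # False
```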

Mining Spatial Colocation Patterns

\[cp(R) = \frac{|\{L \in rowset(A) \mid \exists L' \text{ s.t. } (L \subseteq L') \land (L' \in rowset(A \cup B))\}|}{|rowset(A)|}\]

\[pr(C,f) = \frac{|\{r \mid (r \in S) \land (r.f = f) \land (\text{$r$ is in a row instance of $C$})\}|}{|\{r \mid (r \in S) \land (r.f = f)\}|}\]
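With \(S\) the set of spatial objects, \(C\) a colocation, and \(r.f\) the feature of object \(r\), the participation ratio is the fraction of feature-\(f\) objects that appear in at least one row instance of \(C\). A small sketch (objects, instances, and names are made up):

```python
def participation_ratio(objects, row_instances, feature):
    """pr(C, f): fraction of feature-f objects appearing in some row instance of C."""
    of_f = [o for o in objects if o[1] == feature]  # objects are (id, feature) pairs
    participating = {o for inst in row_instances for o in inst}
    return sum(o in participating for o in of_f) / len(of_f)

objects = [(1, 'A'), (2, 'A'), (3, 'A'), (4, 'B'), (5, 'B')]
rows_ab = [{(1, 'A'), (4, 'B')}, {(2, 'A'), (4, 'B')}]  # row instances of C = {A, B}
participation_ratio(objects, rows_ab, 'A')  # 2/3 of the A objects participate
participation_ratio(objects, rows_ab, 'B')  # 0.5
```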

Mining and Aggregating Patterns over Multiple Trajectories

Mining Semantics-Rich Movement Patterns

Mining Periodic Movement Patterns

Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data

Key concepts

From Frequent Pattern Mining to Phrase Mining

Previous Phrase Mining Methods

Strategy 1: Simultaneously inferring Phrases and Topics

Strategy 2: Post Topic-Modeling Phrase Construction: TurboTopics (I)

Strategy 2: Post Topic-Modeling Phrase Construction: KERT (II)

ToPMine: Phrase Mining without Training Data

Strategy 3: First Phrase Mining then Topic Modeling

⇒ This method essentially performs collocation mining.

SegPhrase: Phrase Mining with Tiny Training Sets

Pattern Mining and Feature Extraction

Lesson 8: Advanced Topics on Pattern Discovery

Key concepts

Frequent Pattern Mining in Data Streams

Pattern Discovery for Software Bug Mining

Mining copy-and-paste bugs

Pattern Discovery for Image Analysis

Visual Patterns Discovery

Advanced Topics on Pattern Discovery: Pattern Mining and Society-Privacy Issue

Advanced Topics on Pattern Discovery: Looking Forward