Pattern Discovery in Data Mining
Course page: https://www.coursera.org/learn/data-patterns/home/welcome
Teacher: Prof. Jiawei Han
Textbook: Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei (3rd ed., 2011) [Morgan Kaufmann]
Course structure:
Lesson 1: Pattern Discovery: Basic Concepts
Lesson 2: Efficient Pattern Mining Methods
Lesson 3: Pattern Evaluation
Lesson 4: Mining Diverse Frequent Patterns
Lesson 5: Sequential Pattern Mining
Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns
Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data
Lesson 8: Advanced Topics on Pattern Discovery
Basic concepts of pattern discovery: frequent pattern, closed pattern, max-pattern, and association rules.
Efficient pattern mining methods: Apriori, ECLAT, and FPGrowth.
Pattern evaluation measures: lift, chi-square, cosine, Jaccard, and Kulczynski.
Methods for mining sequential patterns: GSP, SPADE, PrefixSpan, and CloSpan.
Pattern-based classifications: CBA, CMAR, PatClass, DDPMine, and DPClass.
Lesson 1: Pattern Discovery Basic Concepts
Key questions
Why is pattern discovery an important task in data mining?
What are the concepts of frequent itemsets?
What are the concepts of association rules? How do you generate association rules from frequent itemsets?
Why do we need compressed representations of frequent patterns? Why do we need two compressed representations: closed patterns and max-patterns?
Definitions
Pattern: a set of items, subsequences or substructures that occur frequently together (or are strongly correlated) in a data set. Patterns represent intrinsic and important properties of datasets.
Pattern discovery: trying to uncover patterns from massive data sets. Finding inherent regularities in a dataset.
Pattern discovery is the foundation for many essential data mining tasks:
Association, correlation and causality analysis,
Mining sequential and structural (e.g. sub-graph) patterns,
Pattern analysis in spatiotemporal, multimedia, time-series and stream data,
Classification: Discriminative pattern-based analysis
Cluster analysis: Pattern-based subspace clustering
Itemset: A set of one or more items.
k-itemset: \(X = \{x_1,\cdots,x_k\}\)
(absolute) support (count) of X: the number of occurrences of itemset X across all the dataset entries. (Note: this definition doesn't clearly match the example given by the teacher, as he uses a specific element \(x_i\) inside some itemsets and counts the number of times this element occurs across all itemsets in the dataset…) ⇒ clarification: X can be any itemset in this definition (1-itemset, 2-itemset, etc.), even if each entry in the dataset is, say, a 5-itemset.
(relative) support of X: the fraction of entries in the dataset that contain the itemset X (i.e. the probability that an entry contains X).
An itemset X is frequent if its support is no less than a given minimum support threshold (denoted \(\sigma\)).
Association rule: we write the association rule \(X \rightarrow Y\ (s,c)\), stating that, given the itemset X, the probability that we also have the itemset Y has support s and confidence c.
Above, s is the support of \(X \cup Y\): i.e. the probability that an entry contains \(X \cup Y\).
c is the conditional probability that an entry containing X also contains Y: \(c = \frac{sup(X \cup Y)}{sup(X)}\)
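As a concrete illustration, here is a minimal Python sketch computing the support and confidence of a rule (the toy transactions are my own, not the course's data):

```python
# Toy illustration (my own data): support and confidence of X -> Y.
transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

def support(itemset, db):
    """Relative support: fraction of transactions containing itemset."""
    return sum(itemset <= t for t in db) / len(db)

X, Y = {"beer"}, {"diaper"}
s = support(X | Y, transactions)   # s = sup(X ∪ Y) = 3/5
c = s / support(X, transactions)   # c = sup(X ∪ Y) / sup(X) = 1.0
print(f"X -> Y (s={s}, c={c})")
```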
Challenge
With the definitions above we often have too many frequent patterns.
Solution 1: Closed patterns: A pattern (itemset) X is closed if X is frequent and there exists no super-pattern \(Y \supset X\) with the same support as X.
Solution 2: Max-patterns: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern \(Y \supset X\).
⇒ In many applications, mining closed patterns is more desirable than mining max-patterns.
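A minimal sketch of both compressed representations, assuming we already have every frequent itemset with its support (toy values of my own); it applies the two definitions above directly:

```python
# Toy frequent itemsets with (absolute) support counts -- my own values.
frequent = {
    frozenset("a"): 4, frozenset("b"): 4,
    frozenset("ab"): 4, frozenset("abc"): 2,
}

def is_closed(X, freq):
    # closed: no frequent proper superset Y ⊃ X with the same support
    return not any(X < Y and freq[Y] == freq[X] for Y in freq)

def is_max(X, freq):
    # max-pattern: no frequent proper superset at all
    return not any(X < Y for Y in freq)

print([set(X) for X in frequent if is_closed(X, frequent)])  # {a,b}, {a,b,c}
print([set(X) for X in frequent if is_max(X, frequent)])     # {a,b,c}
```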
Recommended Readings
“Efficiently mining long patterns from databases” (R. J. Bayardo, in Proc. of SIGMOD'98)
“Discovering frequent closed itemsets for association rules” (N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, in Proc. of ICDT'99)
“Frequent Pattern Mining: Current Status and Future Directions” (J. Han, H. Cheng, D. Xin and X. Yan, in Data Mining and Knowledge Discovery, 15(1): 55-86, 2007)
Let \(\mathcal{I} = \{I_1, I_2, \dots, I_m\}\) be a set of binary attributes, called items. Let \(T\) be a database of transactions. Each transaction \(t\) is represented as a binary vector, with \(t[k] = 1\) if the item \(I_k\) was bought in \(t\), and \(t[k] = 0\) otherwise.
TODO: Complete this section.
Lesson 2: Efficient Pattern Mining Methods
Key concepts
The downward closure property or the Apriori property of frequent patterns
The Apriori algorithm
ECLAT: Mining frequent patterns by exploring vertical data format
FPGrowth: A frequent pattern-growth approach
CLOSET+ for mining closed patterns
Downward Closure Property of Frequent Patterns
Downward closure property (also called Apriori): Any subset of a frequent itemset must be frequent.
Thus an efficient mining methodology will take advantage of this, with the following rule: given an itemset S, if any subset of S is infrequent, then S has no chance of being frequent (so S can be pruned and doesn't require mining).
The Apriori Algorithm
Algorithm:
Initially, scan database once to get frequent 1-itemset.
Repeat:
Generate candidate (k+1)-itemsets from frequent k-itemsets.
Test the candidate (k+1)-itemsets to find which ones are frequent.
set k = k+1
Stop when no frequent or candidate set can be generated anymore.
Return all the frequent itemsets found.
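A compact Python sketch of this loop (my own illustration, not the course's code), with the candidate-generation and Apriori-pruning steps made explicit:

```python
# A minimal Apriori sketch: level-wise candidate generation and counting.
from itertools import combinations

def apriori(db, min_sup):
    """db: list of sets; min_sup: absolute support threshold."""
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}
    # frequent 1-itemsets
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in db) >= min_sup}
    all_freq, k = set(freq), 1
    while freq:
        # join step: candidate (k+1)-itemsets from frequent k-itemsets
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # prune step (downward closure): every k-subset must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # count the surviving candidates against the database
        freq = {c for c in cands if sum(c <= t for t in db) >= min_sup}
        all_freq |= freq
        k += 1
    return all_freq

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], min_sup=2))
```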
Extensions or Improvements of Apriori
Reducing number of transaction database scan passes:
Partitioning (Savasere, et al., 1995)
Dynamic itemset counting (Brin, et al., 1997)
Shrink the number of candidates:
Hashing (Park, et al., 1995)
Pruning by support lower bounding (Bayardo 1998)
Sampling (Toivonen, 1996)
Exploring special data structures
Tree projection (Aggarwal, et al. 2001)
H-miner (Pei, et al. 2001)
Hypercube decomposition (Uno, et al. 2004)
Partitioning
⇒ Only needs to scan the database twice to find frequent itemsets.
⇒ We partition the database into k partitions, then:
Theorem: any itemset that is potentially frequent in database TDB must be frequent in at least one of the partitions of TDB (otherwise its relative support would be below the threshold in every partition, hence also in the whole database).
Method:
Scan 1: partition the database (each partition should fit into RAM, so that subsequent mining avoids I/O operations) and find all locally frequent itemsets in each partition.
Scan 2: consolidate global frequent patterns.
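A toy two-scan sketch of this method (my own code; the brute-force local miner is acceptable only because each partition fits in memory, and all data and thresholds are made up):

```python
# Two-scan partitioned mining: local mining per partition, then a
# single global pass to verify the union of local candidates.
from itertools import combinations

def local_frequent(part, min_frac, max_len=3):
    """Brute-force miner for one in-memory partition (toy-sized only)."""
    items = sorted({i for t in part for i in t})
    found = set()
    for k in range(1, max_len + 1):
        for c in combinations(items, k):
            c = frozenset(c)
            if sum(c <= t for t in part) >= min_frac * len(part):
                found.add(c)
    return found

def partitioned_mining(db, min_frac, n_parts=2):
    size = -(-len(db) // n_parts)            # ceil division
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: any globally frequent itemset is locally frequent somewhere
    candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
    # Scan 2: consolidate -- count each candidate over the whole database
    return {c for c in candidates
            if sum(c <= t for t in db) >= min_frac * len(db)}

db = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]
print(partitioned_mining(db, min_frac=0.5))
```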
Direct Hashing and Pruning (DHP)
Example: while counting the 1-itemset candidates a, b, c, d, e, we also hash the 2-itemsets of each transaction into buckets such as {ab, ad, ae}, {bd, be, de}, etc., and count each bucket.
The pair ab cannot be a candidate 2-itemset if the total count of its bucket {ab, ad, ae} is below the support threshold.
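A minimal sketch of this idea (illustration only; the bucket count and toy structure are my own choices): during the first pass we count 1-itemsets while hashing every 2-itemset into a small bucket table, then prune pairs whose bucket total is infrequent.

```python
# DHP-style first pass: count items, hash pairs into shared buckets.
from itertools import combinations

def dhp_pass(db, min_sup, n_buckets=7):
    item_count, bucket = {}, [0] * n_buckets
    for t in db:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for pair in combinations(sorted(t), 2):
            bucket[hash(pair) % n_buckets] += 1  # several pairs share a bucket
    freq1 = {i for i, c in item_count.items() if c >= min_sup}
    # candidate pairs must pass BOTH the Apriori check and the bucket check
    cand2 = {pair for pair in combinations(sorted(freq1), 2)
             if bucket[hash(pair) % n_buckets] >= min_sup}
    return freq1, cand2
```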
ECLAT (Equivalence Class Transformation): a depth-first search algorithm using set intersection.
Needs the Tid-lists: i.e., for each item, the list of transactions containing it.
Then we can compute intersections of the tid-lists: \(t(e) = \{T_{10}, T_{20}, T_{30}\}\), \(t(a) = \{T_{10}, T_{20}\}\), so \(t(ae) = t(a) \cap t(e) = \{T_{10}, T_{20}\}\).
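A minimal ECLAT-style sketch (my own toy code): build the vertical tid-lists, then grow itemsets depth-first by intersecting them.

```python
def eclat(tidlists, min_sup, prefix=frozenset(), out=None):
    """tidlists: {item: set of transaction ids}; min_sup: absolute support."""
    out = {} if out is None else out
    items = sorted(tidlists)
    for i, a in enumerate(items):
        if len(tidlists[a]) < min_sup:
            continue
        new_prefix = prefix | {a}
        out[new_prefix] = tidlists[a]   # record frequent itemset + tid-list
        # depth-first: intersect a's tid-list with each remaining sibling's
        ext = {b: tidlists[a] & tidlists[b] for b in items[i + 1:]}
        eclat(ext, min_sup, new_prefix, out)
    return out

tidlists = {"a": {10, 20}, "d": {10, 30}, "e": {10, 20, 30}}
print(eclat(tidlists, min_sup=2))   # t(ae) = t(a) ∩ t(e) = {10, 20}
```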
FPGrowth: A Pattern Growth Approach
Algorithm:
Find frequent single items and partition the database based on each such item
Recursively grow frequent patterns by doing the above for each partitioned database (also called conditional database).
Using FP-Tree to facilitate efficient processing.
TODO: I didn't get much out of that one, might need to check again later…
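To make the recursion concrete, here is a pattern-growth sketch of my own that uses explicit projected ("conditional") databases instead of an actual FP-tree — slower than real FPGrowth, but it shows the divide-and-conquer idea:

```python
def pattern_growth(db, min_sup, suffix=frozenset(), out=None):
    """db: list of sets. A fixed (lexicographic) item order is used for
    projection so each frequent itemset is enumerated exactly once."""
    out = {} if out is None else out
    counts = {}
    for t in db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    for item, c in sorted(counts.items()):
        if c < min_sup:
            continue
        out[frozenset(suffix | {item})] = c
        # conditional database for `item`: its transactions, restricted to
        # items after it in the order (avoids re-finding the same pattern)
        cond = [{j for j in t if j > item} for t in db if item in t]
        pattern_growth([t for t in cond if t], min_sup, suffix | {item}, out)
    return out

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(pattern_growth(db, min_sup=2))
```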
Mining Closed Patterns
CLOSET+ algorithm to mine closed itemsets by pattern growth (efficient and direct).
Itemset merging: if Y appears in every occurrence of X, then Y is merged with X.
Many more tricks are implemented in CLOSET+.
Recommended Readings
R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases”, VLDB'95
J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”, SIGMOD'95
S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database systems: Alternatives and implications”, SIGMOD'98
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithm for discovery of association rules”, Data Mining and Knowledge Discovery, 1997
J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
M. J. Zaki and Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets”, KDD'03
C. C. Aggarwal, M.A., Bhuiyan, M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014
Lesson 3: Pattern Evaluation
Key concepts
Pattern evaluation; interestingness of patterns/rules
Popularly used measures in pattern evaluation: support, confidence, lift, chi-square, cosine, Jaccard
Null-invariance, the Kulczynski measure, and imbalance ratio
Limitation of the Support-Confidence Framework
Given these limitations, mining can be done:
on a Query-based mode: asking the user what she would like to see
against one's knowledge-base: looking for unexpected, fresh or recent patterns/rules,
using visualization tools: multi-dimensional, interactive examination.
Interestingness Measures: Lift and \(\chi^2\)
\[lift(B,C) = \frac{c(B \rightarrow C)}{s(C)} = \frac{s(B \cup C)}{s(B) \times s(C)}\]
\[\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}\]
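A toy computation of both measures from a 2×2 contingency table (the numbers are my own, not from the lecture):

```python
# Cell counts for items B and C: both, B only, C only, neither.
n_bc, n_b_only, n_c_only, n_neither = 400, 200, 100, 300
N = n_bc + n_b_only + n_c_only + n_neither

s_b = (n_bc + n_b_only) / N   # s(B)
s_c = (n_bc + n_c_only) / N   # s(C)
s_bc = n_bc / N               # s(B ∪ C)

lift = s_bc / (s_b * s_c)     # > 1: positive correlation

# chi-square over the four cells: sum (observed - expected)^2 / expected
observed = [n_bc, n_b_only, n_c_only, n_neither]
expected = [N * s_b * s_c, N * s_b * (1 - s_c),
            N * (1 - s_b) * s_c, N * (1 - s_b) * (1 - s_c)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(lift, chi2)   # 1.333..., 166.67: positively correlated
```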
Null Invariance Measures
Null invariance: value does not change with the number of null-transactions.
Lift and \(\chi^2\) are not null invariant.
A few null-invariant measures (all give values in the range [0, 1]):
\(AllConf(A,B) = \frac{s(A \cup B)}{max(s(A), s(B))}\)
\(Jaccard(A,B) = \frac{s(A \cup B)}{s(A) + s(B) - s(A \cup B)}\)
\(Cosine(A,B) = \frac{s(A \cup B)}{\sqrt{s(A) \times s(B)}}\)
\(Kulczynski(A,B) = \frac 12 \left( \frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)} \right)\)
\(MaxConf(A,B) = max \left( \frac{s(A \cup B)}{s(A)}, \frac{s(A \cup B)}{s(B)} \right)\)
Comparison of Null-Invariant Measures
Not all null invariant measures are equal.
⇒ Kulczynski is the most appropriate when we have contradictory correlations, so it seems to be the one to use by default.
Kulczynski is complemented by the Imbalance Ratio (IR), which measures how skewed the supports of A and B are:
\[IR(A,B) = \frac{|s(A) - s(B)|}{s(A)+s(B)-s(A \cup B)}\]
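A small sketch putting Kulczynski and IR together on toy supports (my own numbers): a Kulczynski value near 0.5 is ambiguous on its own, and IR then reveals how skewed A and B are.

```python
def kulczynski(s_a, s_b, s_ab):
    # average of the two conditional probabilities P(B|A) and P(A|B)
    return 0.5 * (s_ab / s_a + s_ab / s_b)

def imbalance_ratio(s_a, s_b, s_ab):
    # 0 when s(A) = s(B); close to 1 when one side dominates
    return abs(s_a - s_b) / (s_a + s_b - s_ab)

s_a, s_b, s_ab = 0.6, 0.1, 0.08   # relative supports s(A), s(B), s(A ∪ B)
print(kulczynski(s_a, s_b, s_ab))       # ~0.47: ambiguous on its own...
print(imbalance_ratio(s_a, s_b, s_ab))  # ~0.81: ...but highly imbalanced
```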
Recommended Readings
C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS'98
S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. SIGMOD'97
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94
E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03
P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02
T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397, 2010
Lesson 4: Mining Diverse Patterns
Key concepts
Multiple-level associations, level-shared mining, redundant rules
Multi-dimensional associations, quantitative associations
Negative correlations vs. rare patterns, null-invariance
Compressed patterns, redundancy-aware top-k patterns
Mining Multi-Level Associations
⇒ scalable mining methods can easily be extended to take customized min-support into account.
Mining Multi-Dimensional Associations
Mining Quantitative Associations
Mining extraordinary or interesting phenomena, in the gender=female example above:
Left Hand Side (LHS) : a subset of the population
Right Hand Side (RHS) : an extraordinary behavior of this subset
Mining Negative Correlations
Rare patterns : very low support but still very interesting (can be mined with individualized group-based min-support thresholds)
Negative patterns: negatively correlated itemsets (i.e. unlikely to happen together).
Support-based definition of negative patterns: if itemsets A and B are both frequent but rarely occur together, i.e. \(sup(A \cup B) \ll sup(A) \times sup(B)\), then A and B are negatively correlated. ⇒ similar to the definition of lift ⇒ doesn't work well with large datasets with many null-transactions.
Mining Compressed Patterns
Mining can produce many scattered patterns that are not very meaningful.
When considering closed patterns we cannot compress much, because there is too much emphasis on the support values (e.g. to merge close patterns we are required to have the exact same support value).
When using max-patterns we might lose too much information through the discarded support values.
We would rather have something in between (i.e. a good balance): keep some sub-patterns of the max-patterns, to retain some of the support information.
Pattern distance, where \(T(P)\) is the set of transactions containing \(P\):
\[Dist(P_1,P_2) = 1 - \frac{|T(P_1) \cap T(P_2)|}{|T(P_1) \cup T(P_2)|}\]
Redundancy-aware Top-k Patterns: looking for high significance and low redundancy when compressing patterns.
Method: use MMS (Maximal Marginal Significance) for measuring the combined significance of a pattern set: cf. Xin et al., Extracting Redundancy-Aware Top-k Patterns, KDD'06.
Lesson 5: Sequential Pattern Mining
Key concepts
Sequential patterns, sequential pattern mining
Sequential pattern mining algorithms: GSP, SPADE, and PrefixSpan
Closed sequential patterns, the CloSpan algorithm
Sequential Pattern and Sequential Pattern Mining
Sequential pattern mining has broad applications. For instance: “Buy a laptop, then a digital camera, then a smartphone, within 6 months.” ⇒ could also be applied to stocks and markets.
So we try to get sequential patterns out of vast sequence DBs (i.e. entries with timestamps) or time-series DBs (evenly spaced timestamps).
For sequential patterns we also have the concepts of gapped and non-gapped patterns.
A sequence can be written for instance: < (ef) (ab) (df) c b > (parentheses group items that occur together, e.g. one shopping basket).
Each element in a sequence may contain a set of items (also called events)
Items within an element are unordered (usually listed alphabetically).
Algorithm requirements: efficient, scalable, finding complete sets, incorporating various kinds of user-specific constraints.
To mine closed sequential patterns we have the CloSpan method: cf. Yan, et al. @ SDM'03.
Finally we also have constraint-based sequential pattern mining.
GSP: Apriori-Based Sequential Pattern Mining
| SID | Sequence |
|-----|----------|
| 10 | <(bd)cb(ac)> |
| 20 | <(bf)(ce)b(fg)> |
| 30 | <(ah)(bf)abf> |
| 40 | <(be)(ce)d> |
| 50 | <a(bd)bcb(ade)> |
1. Start with the initial candidates: all singleton sequences: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>.
2. Scan the DB once and count support for each candidate.
3. Prune infrequent candidates: say <g> and <h> here.
4. Generate length-2 candidate sequences:
as separate elements: <aa>, <ab>, …, <ff>
or as items within one element: <(ab)>, <(ac)>, …, <(ef)>.
⇒ We repeat steps 2 to 4 until no candidate or frequent sequences are found anymore.
⇒ With the pruning enabled by the Apriori property, we reduce the number of candidates of the next length at each step.
SPADE: Sequential Pattern Mining in Vertical Data Format
| SID | Sequence |
|-----|----------|
| 1 | <a(abc)(ac)d(cf)> |
| 2 | <(ad)c(bc)(ae)> |
| 3 | <(ef)(ab)(df)cb> |
| 4 | <eg(af)cbc> |
We can build the mapping:
| SID | EID | Items |
|-----|-----|-------|
| 1 | 1 | a |
| 1 | 2 | abc |
| 1 | 3 | ac |
| 1 | 4 | d |
| 1 | 5 | cf |
| 2 | 1 | ad |
| … | … | … |
Then we record, for each item, the (SID, EID) pairs where it occurs (its id-list):

Item a:

| SID | EID |
|-----|-----|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 4 |
| 3 | 2 |
| … | … |

Item b:

| SID | EID |
|-----|-----|
| 1 | 2 |
| 2 | 3 |
| 3 | 2 |
| 3 | 5 |
| 4 | 5 |
Then, to build length-2 sequences, we check within each sequence ID the order of the element IDs:

<ab>:

| SID | EID (a) | EID (b) |
|-----|---------|---------|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 2 | 5 |
| … | … | … |

<ba>:

| SID | EID (b) | EID (a) |
|-----|---------|---------|
| 1 | 2 | 3 |
| 2 | 3 | 4 |
Then we can grow the subsequences (patterns) one item at a time by Apriori candidate generation.
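A minimal sketch of the temporal id-list join used to grow <a> into <ab> (my own code, following the tables above): <ab> is supported by a sequence if some occurrence of a precedes an occurrence of b.

```python
# Id-lists for items a and b, as in the tables above: (SID, EID) pairs.
idlist = {
    "a": [(1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 2)],
    "b": [(1, 2), (2, 3), (3, 2), (3, 5), (4, 5)],
}

def temporal_join(la, lb):
    """Id-list of <ab>: keep (SID, EID of b) when a occurs earlier in SID."""
    out = []
    for sid_b, eid_b in lb:
        if any(sid_a == sid_b and eid_a < eid_b for sid_a, eid_a in la):
            out.append((sid_b, eid_b))
    return out

ab = temporal_join(idlist["a"], idlist["b"])
print(ab, "support =", len({sid for sid, _ in ab}))  # SIDs 1, 2, 3 -> 3
```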
PrefixSpan: Sequential Pattern Mining by Pattern-Growth
| SID | Sequence |
|-----|----------|
| 1 | <a(abc)(ac)d(cf)> |
| 2 | <(ad)c(bc)(ae)> |
| 3 | <(ef)(ab)(df)cb> |
| 4 | <eg(af)cbc> |
We build the prefix/suffix DB:
| Prefix | Suffix (Projection) |
|--------|---------------------|
| <a> | <(abc)(ac)d(cf)> |
| <aa> | <(_bc)(ac)d(cf)> |
| <ab> | <(_c)(ac)d(cf)> |
Given a sequence we can thus extract a set of prefixes and suffixes (i.e. prefix-based projections).
PrefixSpan mining method:
Find length-1 sequential patterns: e.g. <a>, <b>, <c>, <d>, <e>, <f>.
Then divide the search space and mine each projected DB:
In each projected DB, we find the length-1 frequent patterns, then combine them with the prefix to get length-2 frequent sequential patterns, for instance: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, etc.
⇒ The major advantages of PrefixSpan: no candidate sequences need to be generated, and the projected databases keep shrinking as the prefixes grow.
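A PrefixSpan sketch for the simple case where every element is a single item (itemset elements like (ab) are omitted to keep it short) — my own illustration:

```python
def prefixspan(db, min_sup, prefix=()):
    """db: list of item sequences; returns {pattern tuple: support}."""
    patterns = {}
    # support of an item: number of sequences containing it
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, c in counts.items():
        if c < min_sup:
            continue
        pat = prefix + (item,)
        patterns[pat] = c
        # projected DB: suffix after the first occurrence of `item`
        proj = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        patterns.update(prefixspan([s for s in proj if s], min_sup, pat))
    return patterns

db = [list("abcacd"), list("abd"), list("acbd")]
print(prefixspan(db, min_sup=2))
```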
CloSpan: Mining Closed Sequential Patterns
Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns
Key concepts
Spatial association, progressive deepening
Spatial colocation pattern, colocation rule
Partition-based trajectory pattern mining, t-pattern
Moving object clusters: flock, convoy, and swarm
Trajectory clustering, partition-and-group framework
Semantics-rich movement patterns
Periodic movement patterns
Mining Spatial Associations
Spatial frequent patterns and association rules are also in the form: A ⇒ B [s%, c%]
In the rule format above, A and B are sets of spatial or non-spatial predicates, for instance:
Topological relations: intersects, overlaps, disjoint, etc.
Spatial orientation: left_of, west_of, under, etc.
Distance information: close_to, within_distance, etc.
Example: is_a(X, large_town) ^ intersect(X, highway) -> adjacent_to(X, water) [7%, 85%]
Rough spatial computation as a filter: using MBR (Minimum Bounding Rectangle) or R-tree for rough estimation.
Detailed spatial algorithm (as refinement): apply only to the objects that have passed the rough association test (i.e. those passing the min_support threshold).
Mining Spatial Colocation Patterns
[def] Colocation pattern: A group of spatial features or events that are frequently co-located in the same region.
On a graphic, we could for instance use edges to denote the co-location of nearby points.
[def] Rowset(C): the set of neighbor-sets L such that every feature in pattern C appears as a feature of an instance in L. (TODO: understand and explain this further.)
Conditional probability of a colocation rule \(A \rightarrow B\):
\[cp(A \rightarrow B) = \frac{|\{L \in rowset(A) \mid \exists L' \text{ s.t. } (L \subseteq L') \land (L' \in rowset(A \cup B))\}|}{|rowset(A)|}\]
Following the example from the video lectures (TODO: not really understanding this yet): \(cp(\{A,B\} \rightarrow \{C, D\}) = |rowset(\{A,B,C,D\})| / |rowset(\{A,B\})| \)
Participation ratio of a feature f in a colocation pattern C (where S is the set of all spatial instances):
\[pr(C,f) = \frac{|\{r | (r \in S) \land (r. f = f) \land (\text{r is in a row instance of C})\}|}{|\{r | (r \in S) \land (r.f = f)\}|}\]
Property: Monotonicity of participation ratio: let C, C' be two co-location patterns such that \(C' \subset C\) , then for each feature \(f \in C': pr(C',f) \ge pr(C,f)\).
Mining and Aggregating Patterns over Multiple Trajectories
Partition-Based Trajectory Pattern Mining (eg. Mining T-Patterns) (cf. F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi, Trajectory Pattern Mining, KDD’07)
Partition the space into equal-width grids and obtain Regions of Interests (RoIs)
Then transform each input trajectory into a Time-annotated symbolic sequence, for instance:
Railway Station: [15min] -> Castle Square, [2h15min] -> Museum
Railway Station: [10min] -> Middle Bridge, [10min] -> Campus
Then use constraint-based sequential pattern mining to find trajectory patterns.
Trajectory Clustering: A Partition and Group Framework (cf. J.-G. Lee, et al., Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07)
Grouping trajectories as a whole ⇒ cannot find similar portions of trajectories.
Solution: discover common sub-trajectories, e.g. to forecast hurricane landfall.
2 phases:
partitioning: creating sequences of segments,
then grouping: may find patterns in the segments created above.
Mining Semantics-Rich Movement Patterns
Mining Periodic Movement Patterns
Pattern discovery in sparse data (for instance, finding bird flight patterns)
Periodicity might show up in some reference spots (or cluster centers)
Reference spots can be detected using density-based method
Periods are detected for each reference spot using Fourier Transform and auto-correlation
cf. Z. Li, et al., Mining Periodic Behaviors for Moving Objects, KDD'10
cf. Z. Li, et al., ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 2015
Recommended Readings
F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi: Trajectory Pattern Mining. KDD’07
Y. Huang, S. Shekhar, H. Xiong: Discovering colocation patterns from spatial data sets: A general approach. IEEE Trans. on Knowledge & Data Eng., 16(12), 2004
Y. Huang, J. Pei, H. Xiong: Mining Co-Location Patterns with Rare Events from Spatial Data Sets. GeoInformatica 10(3): 239-260, 2006
K. Koperski, J. Han: Discovery of Spatial Association Rules in Geographic Information Databases. SSD’95
J.-G. Lee, J. Han, and K.-Y. Whang: Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07
Z. Li, B. Ding, J. Han, R. Kays: Swarm: Mining Relaxed Temporal Moving Object Clusters. VLDB’10
Z. Li, B. Ding, J. Han, R. Kays, P. Nye: Mining Periodic Behaviors for Moving Objects. KDD’10
Z. Li, J. Wang and J. Han, ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 27(5): 1219-1232, 2015
C. Zhang, J. Han, L. Shou, J. Lu, and T. La Porta: Splitter: Mining Fine Grained Sequential Patterns in Semantic Trajectories. VLDB'14
Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data
Key concepts
Phrase mining; training-based phrase mining vs. topic modeling-based phrase mining
TNG, TurboTopic; KERT; criteria to judge quality of phrases
ToPMine; word collocation; phrasal segmentation; PhraseLDA; SegPhrase
From Frequent Pattern Mining to Phrase Mining
An important concept when mining text data is to mine quality phrases.
Unigrams (i.e. single words) vs. phrases:
Unigrams are often ambiguous.
Phrase: a natural, meaningful, unambiguous semantic unit.
Thus, it would be interesting to mine semantically meaningful phrases:
Previous Phrase Mining Methods
Thus we try to perform unsupervised phrase mining and topic modeling instead.
Many studies of unsupervised phrase mining are linked with topic modeling.
Each topic is represented by a word distribution.
Documents are represented by multiple topics in different proportions.
⇒ This approach does not require any prior annotations or labeling of the documents.
With topic modeling we rather rely on statistical topic modeling algorithms:
The most common algorithm: LDA (Latent Dirichlet Allocation) [cf. Blei, et al., 2003]
3 strategies to perform phrase mining using topic modeling:
Strategy 1: simultaneously infer the topics and the phrase tokens (generate bag-of-words -> generate sequence of tokens)
Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
Strategy 3: Prior bag-of-words model inference, mine phrases and impose on the bag-of-words model.
Strategy 1: Simultaneously inferring Phrases and Topics
Bigram Topic Model [cf. Wallach'06]
Topical N-Grams (TNG) [Wang, et al.'07] (a generalization of Bigram Topic Model)
Phrase-Discovering LDA (PDLDA) [Lindsey, et al.'12]
Viewing each sentence as a time-series of words, PDLDA posits that the generative parameter (topic) changes periodically.
Each word is drawn based on previous m words (context and current phrase topic).
Limitations:
Strategy 2: Post Topic-Modeling Phrase Construction: TurboTopics (I)
Strategy 2: Post Topic-Modeling Phrase Construction: KERT (II)
ToPMine: Phrase Mining without Training Data
Strategy 3: First Phrase Mining then Topic Modeling
Why in this order?
If we have for instance “Knowledge discovery using least square support vector machine classifiers…” (topics in blue and red) we won't be able to generate the proper phrases.
With phrase mining first we will first group the phrase elements: “[Knowledge discovery] using [least square] [support vector machine] [classifiers]…”, then apply the topics on the groups.
Phrase mining stage:
Frequent contiguous pattern mining: Extract candidate phrases and their counts.
Perform agglomerative merging of adjacent unigrams as guided by a significance score.
Perform document segmentation to count phrase occurrences (and calculate the rectified (i.e. true) phrase frequencies).
Do phrase ranking (using the criteria proposed in KERT): popularity, concordance, informativeness, completeness.
Finally do phrase-based topic modeling:
⇒ This method essentially performs collocation mining:
Many different measures used to extract collocations from a corpus [cf. Dunning 93, Pederson 96], for instance: mutual information, t-test, z-test, chi-squared test, likelihood ratio
Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm.
SegPhrase: Phrase Mining with Tiny Training Sets
SegPhrase+: The Overall Framework
ClassPhrase: Frequent pattern mining, feature extraction, classification.
SegPhrase: Phrasal segmentation and phrase quality estimation.
SegPhrase+: One more round to enhance mined phrase quality.
Pattern Mining and Feature Extraction
Pattern Mining for Candidate Set:
Build a candidate phrase set by frequent pattern mining.
Mine frequent k-grams (k is typically small, e.g. 6 in the experiments).
Popularity is measured by the raw frequency of words and phrases mined from the corpus.
Feature Extraction, based on Concordance
Feature Extraction, based on Informativeness
Quality phrases typically start and end with a non-stopword
Use average IDF over words in the phrase to measure the semantics
Usually, the probability that a quality phrase appears in quotes, brackets, or connected by dashes is higher (punctuation information) (e.g. “state-of-the-art”).
Classification:
Using a tiny training set (about 300 labels for 1 GB of corpus; can also use phrases extracted from a KB)
Label: indicating whether a phrase is a high quality one or not: “support vector machine”: 1, “the experiment shows”: 0
Construct models to distinguish quality phrases from poor ones
Use Random Forest algorithm to bootstrap different datasets with limited labels.
Phrasal segmentation: can tell which phrase is more appropriate
So the process is repeated twice:
Recommended Readings
S. Bergsma, E. Pitler, D. Lin, Creating robust supervised classifiers via web-scale n-gram data, ACL’2010
D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
D.M. Blei, A. Y. Ng, M. I. Jordan, J. D. Lafferty, Latent Dirichlet allocation. JMLR 2003
K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM’14
A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL’12.
J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB’10
X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. ICDM’07
Lesson 8: Advanced Topics on Pattern Discovery
Key concepts
Stream data mining; mining frequent patterns in data streams; the lossy counting algorithm
Software bug mining; pattern discovery for software bug mining; mining copy-and-paste bugs
Image analysis; pattern mining for image analysis; visual items; visual patterns
Three categories of privacy issues in data mining: input privacy, output privacy and owner privacy; k-anonymity and differential privacy; data perturbation for privacy-preserving pattern mining
Invisible data mining; new application exploration
Frequent Pattern Mining in Data Streams
Data streams: continuous, ordered, changing, fast, huge volume
Random access is expensive: single scan algorithm
Most streams are low-level and multi-dimensional
Stream Data Processing: Stream Data Management System (SDMS)
Mining precise frequent patterns is unrealistic
So we look for approximate answers
How to mine frequent patterns with good approximation?
Lossy Counting Algorithm [Manku & Motwani, VLDB'02]
⇒ discard items with very low support count
Given support threshold \(\sigma\), error threshold \(\epsilon\) and stream length N
Output: items with frequency counts exceeding \((\sigma - \epsilon) N\)
We undercount by at most: frequency count error \(\le \epsilon N\)
Approximation properties:
No false negatives,
False positives have true frequency count at least \((\sigma - \epsilon) N\)
Frequency count underestimated by at most \(\epsilon N\)
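A minimal sketch of the bucket-based counting for single items, following the description above (toy code of my own): we process the stream in buckets of width \(1/\epsilon\) and decrement every counter at each bucket boundary, so a count is underestimated by at most \(\epsilon N\).

```python
def lossy_counting(stream, sigma, eps):
    width = int(1 / eps)                # bucket width
    counts, n = {}, 0
    for x in stream:
        n += 1
        counts[x] = counts.get(x, 0) + 1
        if n % width == 0:              # bucket boundary: decrement and prune
            for k in list(counts):
                counts[k] -= 1
                if counts[k] == 0:
                    del counts[k]       # discard items with very low count
    # output items whose (under)count still exceeds (sigma - eps) * N
    return {k: c for k, c in counts.items() if c >= (sigma - eps) * n}

stream = ["a"] * 60 + ["b"] * 25 + list("cdefg") * 3
print(lossy_counting(stream, sigma=0.2, eps=0.01))  # {'a': 59, 'b': 24}
```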
Other issues:
Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)
Mining approximate frequent k-itemsets in data streams
Mining sequential patterns in data streams
Recommended Readings
G. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams”, VLDB’02
A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, ICDT'05
Pattern Discovery for Software Bug Mining
Why pattern mining ?
Code or running sequences contain hidden patterns.
Common patterns -> likely specification or property
Violations (anomalies compared to patterns) -> likely bugs
Mining patterns to narrow down the scope of inspection:
Mining rules from source code
Bugs as deviant behavior (e.g. by statistical analysis)
Mining programming rules (e.g. by frequent itemset mining)
Mining function precedence protocols (e.g. by frequent subsequence mining)
Revealing neglected conditions (e.g. by frequent itemset/subgraph mining)
Mining rules from revision histories: by frequent itemset mining
Mining copy-paste patterns from source code
Find copy-paste bugs: CP-Miner (cf. Z. Li, S. Lu, S. Myagmar, Y. Zhou, “CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code”, OSDI'04)
Mining copy-and-paste bugs
Building sequence database from source code:
Tokenize each component:
Different operators, constants, key words -> different tokens
Same type of identifiers -> same token
for instance: “old = 3;” -> “5 61 20”; and “new = 3;” -> “5 61 20”
Then compute hash: “5 61 20” -> 16
So for a list of statements we produce a list of hashes: 65, 16, 16, 71, …
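A toy sketch of this tokenization-and-hashing step (my own code; the token ids are hypothetical, not the ones from the slides): all identifiers collapse to one token, so the two statements above become identical token sequences with identical hashes.

```python
import re

ID_TOKEN = 61                       # hypothetical token id for identifiers
tokens = {"=": 5, ";": 20}          # hypothetical ids for other tokens

def tokenize(stmt):
    out = []
    for tok in re.findall(r"\w+|[^\s\w]", stmt):
        if tok.isidentifier():
            out.append(ID_TOKEN)    # same token for every identifier
        else:                       # operators, constants, keywords differ
            out.append(tokens.setdefault(tok, len(tokens) + 100))
    return out

def stmt_hash(stmt):
    return hash(tuple(tokenize(stmt))) % 97   # one small hash per statement

print(tokenize("old = 3;"), tokenize("new = 3;"))     # identical token lists
print(stmt_hash("old = 3;") == stmt_hash("new = 3;")) # True
```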
Detecting the “Forget-to-change” bugs:
Modification to the sequence pattern mining algorithm:
Constrain the max gap: (16, 16, 71) … (16, 16, 10, 71)
Then combine neighboring copy-pasted segments repeatedly,
Then find conflicts: i.e. identify names that cannot be mapped to any correspondence, and introduce an unchanged ratio.
If 0 < unchanged ratio < threshold, then report the location as a potential bug.
Pattern Discovery for Image Analysis
An image can be characterized by visual primitives (e.g. interest points).
Each visual primitive can be described by a visual feature (e.g. a high-dimensional feature vector).
Each image is a collection of visual primitives
Visual Patterns Discovery
Visual primitives can be clustered into visual “items” (similar visual primitives belong to the same item).
Each visual primitive finds its k nearest neighbors in the image to form a “visual transaction”.
Then mining “frequent itemsets” leads to semantically meaningful visual patterns.
So the process is:
Images -> feature extraction -> visual primitives
Visual primitives -> feature quantization -> induced visual transactions
Induced visual transactions -> pattern discovery -> visual patterns
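A minimal sketch of this pipeline using scikit-learn (my own code; the feature dimensionality, cluster count and k are arbitrary toy choices, and random vectors stand in for real descriptors):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))   # toy descriptors per primitive
positions = rng.uniform(size=(200, 2))  # toy image coordinates

# feature quantization: each primitive gets a visual "item" label
items = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(features)

# induced visual transactions: the item labels of each primitive's k-NN
knn = NearestNeighbors(n_neighbors=5).fit(positions)
_, idx = knn.kneighbors(positions)
transactions = [set(items[i] for i in neigh) for neigh in idx]
# `transactions` can now feed any frequent-itemset miner (e.g. Apriori)
```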
Recommended Readings
Hongxing Wang, Gangqiang Zhao, Junsong Yuan, Visual pattern discovery in image and video data: a brief survey, Wiley Interdisciplinary Review: Data Mining and Knowledge Discovery 4(1): 24-37 (2014)
Hongxing Wang, Junsong Yuan, Ying Wu, Context-Aware Discovery of Visual Co-Occurrence Patterns. IEEE Transactions on Image Processing 23(4): 1805-1819 (2014)
Gangqiang Zhao, Junsong Yuan, Discovering Thematic Patterns in Videos via Cohesive Sub-graph Mining. ICDM 2011: 1260-1265
Junsong Yuan, Ying Wu, Ming Yang, From frequent itemsets to semantically meaningful visual patterns. KDD 2007: 864-873
Advanced Topics on Pattern Discovery: Pattern Mining and Society-Privacy Issue
Potential adverse side-effects of data mining
3 categories of privacy issues:
Input privacy: distort or hide data
Output privacy (eg. knowledge hiding)
Owner privacy: hide owner/source information from data.
Ensuring Input Privacy:
K-anonymity privacy requirement: each equivalence class contains at least k records.
Data perturbation
Statistical distortion: using randomization algorithms
MASK [Rizvi & Haritsa VLDB'02]
Cut-and-paste (C&P) operator [Evfimievski et al. KDD'02]
Recommended Readings
R. Agrawal and R. Srikant, Privacy-preserving data mining, SIGMOD'00
C. C. Aggarwal and P. S. Yu, Privacy-Preserving Data Mining: Models and Algorithms, Springer, 2008
C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. 2014
A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In KDD'02
A. Gkoulalas-Divanis, J. Haritsa and M. Kantarcioglu, Privacy in Association Rule Mining, in C. Aggarwal and J. Han (eds.), Frequent Pattern Mining, Springer, 2014 (Chapter 15)
N. Li, T. Li, S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. ICDE'07
A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, TKDD 2007
S. Rizvi and J. Haritsa. Maintaining data privacy in association rule mining. VLDB’02
J. Vaidya, C. W. Clifton and Y. M. Zhu, Privacy Preserving Data Mining, Springer, 2010
Advanced Topics on Pattern Discovery: Looking Forward