====== Pattern Discovery in Data Mining ======

/* Coursera introduction text: Hi everyone! I'm Emmanuel ROCHE, but everyone calls me "Manu" ;-). I'm French, and I currently live near Toulouse, in France. I have no serious experience with data mining or pattern discovery yet, but I'm a fast learner, and I believe this field could be quite interesting. So I decided to give it a try here. */

Course page: https://www.coursera.org/learn/data-patterns/home/welcome\\
Teacher: **Mr. Jiawei Han**\\
Textbook: **Data Mining: Concepts and Techniques** by Jiawei Han, Micheline Kamber and Jian Pei (2011) [Morgan Kaufmann]

  * Course structure:
    * Lesson 1: Pattern Discovery: Basic Concepts
    * Lesson 2: Efficient Pattern Mining Methods
    * Lesson 3: Pattern Evaluation
    * Lesson 4: Mining Diverse Frequent Patterns
    * Lesson 5: Sequential Pattern Mining
    * Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns
    * Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data
    * Lesson 8: Advanced Topics on Pattern Discovery

  * Basic concepts of pattern discovery: **frequent pattern**, **closed pattern**, **max-pattern**, and **association rules**.
  * Efficient pattern mining methods: **Apriori**, **ECLAT**, and **FPGrowth**.
  * Pattern evaluation measures: **lift**, **chi-square**, **cosine**, **Jaccard**, and **Kulczynski**.
  * Methods for mining sequential patterns: **GSP**, **SPADE**, **PrefixSpan**, and **CloSpan**.
  * Pattern-based classifications: **CBA**, **CMAR**, **PatClass**, **DDPMine**, and **DPClass**.

===== Lesson 1: Pattern Discovery Basic Concepts =====

==== Key questions ====

  * Why is pattern discovery an important task in data mining?
  * What are the concepts of frequent itemsets?
  * What are the concepts of association rules? How do you generate association rules from frequent itemsets?
  * Why do we need compressed representations of frequent patterns? Why do we need two compressed representations: closed patterns and max-patterns?

==== Definitions ====

  * **Pattern**: a **set** of **items**, **subsequences** or **substructures** that occur frequently together (or are strongly correlated) in a data set. Patterns represent **intrinsic** and **important properties** of datasets.
  * **Pattern discovery**: trying to **uncover** patterns from massive data sets, i.e. finding **inherent regularities** in a dataset.
  * Pattern discovery is the foundation for many essential data mining tasks:
    - Association, correlation and causality analysis,
    - Mining sequential and structural (eg. sub-graph) patterns,
    - Pattern analysis in spatiotemporal, multimedia, time-series and stream data,
    - Classification: discriminative pattern-based analysis,
    - Cluster analysis: pattern-based subspace clustering.
  * **Itemset**: A set of one or more items.
  * **k-itemset**: \(X = \{x_1,\cdots,x_k\}\)
  * **(absolute) support (count)** of X: //the number of occurrences of the itemset X across all the dataset entries//. (**Note**: this definition doesn't clearly match the example given by the teacher, as he uses a specific element \(x_i\) **inside** some itemsets and counts the number of times this element occurs in all itemsets in the dataset...) => **clarification**: the itemset X can be anything in this definition (a 1-itemset, 2-itemset, etc.), even if each entry in the dataset is, say, a 5-itemset.
  * **(relative) support** of X: //the fraction of entries in the dataset that contain the itemset X// (i.e. the **probability** that an entry contains X).
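As a quick worked example (toy numbers of my own, not from the lectures): if the itemset X = {milk, bread} appears in 3 out of 10 transactions, then

\[sup_{abs}(X) = 3 \qquad\text{and}\qquad sup_{rel}(X) = \frac{3}{10} = 0.3,\]

so X would count as frequent for any minimum support threshold \(\sigma \le 0.3\).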
  * An itemset X is **frequent** if the support of X is no less than a given **minimum support threshold** (denoted \(\sigma\)).
  * **Association rule**: we define the association rule \(X \rightarrow Y\ (s,c)\), stating that, given the itemset X, the probability that we also have the itemset Y has support **s** and confidence **c**.
    * Above, **s** is the support of \(X \cup Y\): eg. the probability that an entry contains \(X \cup Y\).
    * **c** is the conditional probability that an entry containing X also contains Y: \(c = \frac{sup(X \cup Y)}{sup(X)}\)
    * **Subtle notation**: the entries that contain \(X \cup Y\) are actually the **intersection** of the entries containing X and the entries containing Y (not the **union**!)
  * **Association rule mining**: find **all** the rules \(X \rightarrow Y\) that satisfy given minimum support and confidence values.

==== Challenge ====

With the definitions above we often get **too many** frequent patterns.

  * Solution 1: **Closed patterns**: //A pattern (itemset) X is **closed** if X is frequent and there exists no super-pattern \(Y \supset X\) with the same support as X//.
    * => A **closed pattern** is a **lossless compression** of frequent patterns (eg. it reduces the number of patterns but we do not lose the support information).
  * Solution 2: **Max-patterns**: //A pattern X is a **max-pattern** if X is frequent and there exists no frequent super-pattern \(Y \supset X\).//
    * => A **max-pattern** is a **lossy compression** of frequent patterns: we only know that a given sub-pattern is frequent, but we don't know its real support anymore.

=> In many applications, mining **closed patterns** is more desirable than mining **max-patterns**.

==== Recommended Readings ====

  * "[[http://www.rakesh.agrawal-family.com/papers/sigmod93assoc.pdf|Mining association rules between sets of items in large databases]]" (R. Agrawal, T. Imielinski and A. Swami, in Proc. of SIGMOD'93)
  * "Efficiently mining long patterns from databases" (R. J. Bayardo, in Proc. of SIGMOD'98)
  * "Discovering frequent closed itemsets for association rules" (N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, in Proc. of ICDT'99)
  * "Frequent Pattern Mining: Current Status and Future Directions" (J. Han, H. Cheng, D. Xin and X. Yan, in Data Mining and Knowledge Discovery, 15(1): 55-86, 2007)

==== Formal Model Definition ====

Let \(\mathcal{I} = \{I_1,I_2,\dots,I_m\}\) be a set of binary attributes, called items. Let \(T\) be a database of transactions. Each transaction \(t\) is represented as a binary vector, with \(t[k] = 1\) if \(t\) bought the item \(I_k\), and \(t[k] = 0\) otherwise.

FIXME: Complete this section.

===== Lesson 2: Efficient Pattern Mining Methods =====

==== Key concepts ====

  * The downward closure property (or Apriori property) of frequent patterns
  * The Apriori algorithm
  * ECLAT: mining frequent patterns by exploring the vertical data format
  * FPGrowth: a frequent pattern-growth approach
  * CLOSET+ for mining closed patterns

==== Downward Closure Property of Frequent Patterns ====

  * **Downward closure** property (also called the **Apriori** property): //Any subset of a frequent itemset must be frequent.//
  * Thus an efficient mining methodology will take advantage of this, with the following rule: given an itemset **S**, if any subset of **S** is **infrequent** then there is **no chance for S to be frequent** (so S can be pruned and doesn't require mining).
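As a concrete illustration of this pruning rule, here is a minimal sketch (my own toy code, not code from the course) of Apriori-style candidate generation with downward-closure pruning, reproducing the {abc, abd, acd, ace, bcd} example used just below:

<code python>
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Self-join frequent k-itemsets that share their first k-1 items,
    then prune candidates that have an infrequent k-subset
    (downward closure / Apriori pruning)."""
    sorted_sets = sorted(tuple(sorted(s)) for s in frequent_k)
    candidates = set()
    for i in range(len(sorted_sets)):
        for j in range(i + 1, len(sorted_sets)):
            a, b = sorted_sets[i], sorted_sets[j]
            if a[:k - 1] == b[:k - 1]:          # join step
                candidates.add(frozenset(a) | frozenset(b))
    # Prune step: every k-subset of a candidate must itself be frequent.
    return {c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, k))}

# Toy usage with the frequent 3-itemsets from the notes below.
F3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(generate_candidates(F3, 3))
# The join produces {abcd, acde}; acde is pruned because its subset ade
# is not frequent, so only abcd survives.
</code>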
  * From this **Apriori pruning principle**, three major methods have been developed:
    - Level-wise, join-based approach: **Apriori** (Agrawal, Srikant @ VLDB'94)
    - Vertical data format approach: **Eclat** (Zaki, Parthasarathy, Ogihara, Li @ KDD'97)
    - Frequent pattern projection and growth: **FPgrowth** (Han, Pei, Yin @ SIGMOD'00)

==== The Apriori Algorithm ====

  * Algorithm:
    - Initially, scan the database once to get the **frequent 1-itemsets**.
    - **Repeat**:
      - Generate candidate (k+1)-itemsets from the frequent k-itemsets.
      - Test the candidate (k+1)-itemsets to find which ones are frequent.
      - Set k = k+1.
    - **Stop** when no frequent or candidate set can be generated anymore.
    - Return all the frequent itemsets found.
  * **How to generate candidates** at level k+1:
    - **Self-joining** \(F_k\): eg. {abc, abd, acd, ace, bcd} => {abcd, acde}
    - **Pruning**: eg. **ade** is not in the \(F_3\) above, so **acde** is pruned.
  * An **SQL implementation** can be derived to generate the candidates described above [not reported here, cf. the corresponding course video].

==== Extensions or Improvements of Apriori ====

  * Reducing the number of transaction database scan passes:
    * Partitioning (Savasere, et al., 1995)
    * Dynamic itemset counting (Brin, et al., 1997)
  * Shrinking the number of candidates:
    * Hashing (Park, et al., 1995)
    * Pruning by support lower bounding (Bayardo, 1998)
    * Sampling (Toivonen, 1996)
  * Exploring special data structures:
    * Tree projection (Aggarwal, et al., 2001)
    * H-miner (Pei, et al., 2001)
    * Hypercube decomposition (Uno, et al., 2004)

=== Partitioning ===

  * => Only needs to scan the database twice to find the frequent itemsets.
  * => We partition the database into k partitions, then:
    * **Theorem**: Any itemset that is potentially frequent in database TDB must be frequent in at least one of the partitions of TDB.
  * Method:
    * Scan 1: partition the database (each partition should fit into computer RAM, so that we can avoid IO operations later) and look for all frequent itemsets in each partition.
    * Scan 2: consolidate the global frequent patterns.

=== Direct Hashing and Pruning (DHP) ===

  * DHP can be used to reduce the number of candidates.
  * **Observation**: //A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.//
  * Example: while counting the 1-itemsets with candidates a, b, c, d, e, we also count the number of entries for 2-itemset buckets such as {ab, ad, ae}, {bd, be, de}, etc.
    * Then ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold.

==== Mining Frequent Patterns by Exploring Vertical Data Format ====

  * ECLAT (Equivalence Class Transformation): a depth-first search algorithm using set intersection.
  * We first need to generate the **Tid-lists**: eg. the list of transaction ids for each possible item.
  * Then we can compute the **intersections** of the Tid-lists: if \(t(e) = \{T_{10}, T_{20}, T_{30}\}\) and \(t(a) = \{T_{10}, T_{20}\}\), then \(t(ae) = \{T_{10}, T_{20}\}\).
  * **Properties of Tid-lists**:
    * \(t(X) = t(Y)\) => X and Y always happen together.
    * \(t(X) \subset t(Y)\) => transactions having X always have Y.
  * Frequent patterns can be derived from vertical intersections.
  * Can use **diffsets** to accelerate mining:
    * Only keep track of differences of Tid-lists (can improve memory usage).
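A minimal sketch of the vertical data format idea (my own toy code, not from the course): build the Tid-lists, then grow itemsets depth-first by intersecting them. The transactions and item names are hypothetical.

<code python>
def eclat(prefix, items, min_sup, results):
    """items: list of (item, tidset) pairs; grow 'prefix' depth-first
    by intersecting Tid-lists, keeping only frequent extensions."""
    for i, (item, tids) in enumerate(items):
        support = len(tids)
        if support >= min_sup:
            pattern = prefix | {item}
            results[frozenset(pattern)] = support
            # Conditional Tid-lists for the remaining items.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items[i + 1:]]
            eclat(pattern, suffix, min_sup, results)

# Toy transaction database (hypothetical).
transactions = {10: {'a', 'c', 'e'}, 20: {'a', 'b', 'e'},
                30: {'b', 'e'},      40: {'a', 'c'}}
# Vertical format: item -> Tid-list.
tidlists = {}
for tid, its in transactions.items():
    for item in its:
        tidlists.setdefault(item, set()).add(tid)

results = {}
eclat(set(), sorted(tidlists.items()), min_sup=2, results=results)
print(results)  # e.g. frozenset({'a', 'e'}) -> 2, frozenset({'b', 'e'}) -> 2, ...
</code>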
==== FPGrowth: A Pattern Growth Approach ====

  * Algorithm:
    - Find the frequent single items and partition the database based on each such item.
    - Recursively grow frequent patterns by doing the above for each partitioned database (also called the conditional database).
    - Use an **FP-tree** to facilitate efficient processing.
  * FIXME: I didn't get much out of that one, might need to check again later...

==== Mining Closed Patterns ====

  * **CLOSET+** algorithm to mine closed itemsets by pattern growth (efficient and direct).
  * **Itemset merging**: //if Y appears in every occurrence of X, then Y is merged with X.//
  * Many tricks are implemented in CLOSET+:
    * Hybrid tree projection,
    * Sub-itemset pruning,
    * Item skipping,
    * Efficient subset checking, etc.

==== Recommended Readings ====

  * R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
  * A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases”, VLDB'95
  * J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”, SIGMOD'95
  * S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database systems: Alternatives and implications”, SIGMOD'98
  * M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithm for discovery of association rules”, Data Mining and Knowledge Discovery, 1997
  * J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
  * M. J. Zaki and C.-J. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
  * J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets”, KDD'03
  * C. C. Aggarwal, M. A. Bhuiyan, and M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014

===== Lesson 3: Pattern Evaluation =====

==== Key concepts ====

  * Pattern evaluation; interestingness of patterns/rules
  * Popularly used measures in pattern evaluation: support, confidence, lift, chi-square, cosine, Jaccard
  * Null-invariance, the Kulczynski measure, and the imbalance ratio

==== Limitation of the Support-Confidence Framework ====

  * Not all patterns and rules are interesting.
  * **Interestingness measures** can be classified as **objective** or **subjective**:
    - **Objective interestingness measures** (such as support, confidence, correlation, etc.) are defined by mathematical formulas.
    - **Subjective interestingness measures** depend on one's perspective.
  * Thus mining can be done:
    - in a **query-based** mode: asking the user what she would like to see,
    - against one's knowledge base: looking for unexpected, fresh or recent patterns/rules,
    - using visualization tools: multi-dimensional, interactive examination.
  * Not all strongly supported and/or confident association rules are interesting. In the lecture videos we have the following example:
    * //play-basketball// => //eat-cereal// [s=40%, c=66.7%]
    * //**not** play-basketball// => //eat-cereal// [s=35%, c=87.5%]

==== Interestingness Measures: Lift and \(\chi^2\) ====

  * **Lift**: a measure of dependent/correlated events: \[lift(B,C) = \frac{c(B \rightarrow C)}{s(C)} = \frac{s(B \cup C)}{s(B) \times s(C)}\]
  * Lift(B,C) tells how B and C are correlated:
    * Lift(B,C) = 1 => B and C are independent.
    * Lift(B,C) > 1 => we have a positive correlation.
    * Lift(B,C) < 1 => we have a negative correlation.
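Plugging the play-basketball / eat-cereal numbers above into the lift formula (the supports and confidences come from the lecture example; the intermediate values below are derived from them, with B = play-basketball and C = eat-cereal):

\[s(B \cup C) = 0.40, \qquad s(B) = \frac{0.40}{0.667} \approx 0.60, \qquad s(C) = 0.40 + 0.35 = 0.75\]

\[lift(B,C) = \frac{0.40}{0.60 \times 0.75} \approx 0.89 < 1\]

so //play-basketball// and //eat-cereal// are actually slightly negatively correlated, even though the rule has high support and confidence.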
  * \(\chi^2\) is another measure to test correlated events: \[\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}\]
  * General \(\chi^2\) rules:
    * \(\chi^2 = 0\): the events are independent.
    * \(\chi^2 > 0\): the events are correlated (either positively or negatively).
  * But even lift and \(\chi^2\) are not always good measures: they can for instance indicate a strong positive correlation where there is none.
  * [def] **Null transactions**: transactions that do not contain the itemset of interest.

==== Null Invariance Measures ====

  * **Null invariance**: the value of the measure does not change with the number of null transactions.
  * Lift and \(\chi^2\) are **not** null invariant.
  * A few null-invariant measures (all give values in the range [0,1]):
    * \(AllConf(A,B) = \frac{s(A \cup B)}{max(s(A), s(B))}\)
    * \(Jaccard(A,B) = \frac{s(A \cup B)}{s(A) + s(B) - s(A \cup B)}\)
    * \(Cosine(A,B) = \frac{s(A \cup B)}{\sqrt{s(A) \times s(B)}}\)
    * \(Kulczynski(A,B) = \frac 12 \left( \frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)} \right)\)
    * \(MaxConf(A,B) = max \left( \frac{s(A \cup B)}{s(A)}, \frac{s(A \cup B)}{s(B)} \right)\)
  * Null transactions will often be very numerous, so null-invariant measures are critical.

==== Comparison of Null-Invariant Measures ====

  * Not all null-invariant measures are equal.
  * => **Kulczynski** is the most appropriate when we have contradictory correlations, so it seems to be the one to use by default.
  * Introducing the **imbalance ratio** (IR), which measures the imbalance of two itemsets A and B in rule implications: \[IR(A,B) = \frac{|s(A) - s(B)|}{s(A)+s(B)-s(A \cup B)}\]
  * => **Kulczynski + imbalance ratio** together can present a clear picture of all the typical cases.

==== Recommended Readings ====

  * C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS'98
  * S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97
  * M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94
  * E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03
  * P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02
  * T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397, 2010

===== Lesson 4: Mining Diverse Patterns =====

==== Key concepts ====

  * Multiple-level associations, level-shared mining, redundant rules
  * Multi-dimensional associations, quantitative associations
  * Negative correlations vs. rare patterns, null-invariance
  * Compressed patterns, redundancy-aware top-k patterns

==== Mining Multi-Level Associations ====

  * We often have a hierarchy of items (eg. milk => 2% milk => Dairyland 2% milk).
  * When we have such an item hierarchy we might mine with:
    - a uniform min-support across multiple levels (is that reasonable?), or
    - level-reduced min-support: items at the lower levels are expected to have lower support.
  * How to perform efficient mining with multiple min-support values? => **shared** multi-level mining: use the lowest min-support to pass down the set of candidates.
  * Another problem with multi-level association mining is **redundancy**:
    * Example: 2% milk is 25% of sold milk, and
      * milk => wheat bread [s=8%, c=70%] (1)
      * 2% milk => wheat bread [s=2%, c=72%] (2)
    * Rule (2) can be derived from rule (1), so it is redundant.
    * => Redundancy filtering is required.
  * We may also require customized min-supports for different kinds of items (for example when very expensive items are mixed with common items):
    * Solution: use **group-based** individualized min-support:
      * eg. {diamond, watch}: 0.05%, {bread, milk}: 5%
    * => scalable mining methods can easily be extended to take customized min-support into account.

==== Mining Multi-Dimensional Associations ====

  * Simple **single-dimensional** rule: buys(X, "milk") => buys(X, "bread")
  * Multi-dimensional rules:
    * **Inter-dimension** association rules (no repeated predicate):
      * age(X, "18-25") ^ occupation(X, "student") => buys(X, "coke")
    * **Hybrid-dimension** association rules (repeated predicate):
      * age(X, "18-25") ^ buys(X, "popcorn") => buys(X, "coke")
  * Attributes can be:
    * **categorical** (eg. discrete values): no ordering,
    * **quantitative** (numerical values): implicit ordering.

==== Mining Quantitative Associations ====

  * Mining associations with numerical attributes (eg. age, salary, etc.)
  * Mining methods:
    * **Static discretization** => data cube-based aggregation
    * **Dynamic discretization** => based on the data distribution
    * **Clustering** (eg. distance-based association)
    * **Deviation analysis**, for instance: Gender=female => Wage: mean=$7/hr (overall mean = $9/hr)
  * Mining **extraordinary or interesting phenomena**, in the gender=female example above:
    * Left Hand Side (LHS): a subset of the population,
    * Right Hand Side (RHS): an extraordinary behavior of this subset.
    * An extraordinary rule is accepted only if a statistical test (eg. a Z-test) confirms the inference with high confidence.
  * **Subrule**: can highlight the extraordinary behavior of a subset of the population of the **super rule**, for instance:
    * (gender=female) ^ (South=yes) => mean wage = $6.3/hr
  * Rule conditions can be categorical or numerical (quantitative rules):
    * Education in [14-18] (yrs) => mean wage = $11.64/hr
  * Efficient methods to mine extraordinary rules: cf. Aumann and Lindell @ KDD'99

==== Mining Negative Correlations ====

  * **Rare patterns**: very low support but still very interesting (can be mined with **individualized group-based min-support** thresholds).
  * **Negative patterns**: negatively correlated itemsets (eg. unlikely to happen together).
  * Support-based definition of negative patterns: if itemsets A and B are both frequent but rarely occur together, \(sup(A \cup B) \ll sup(A) \times sup(B)\), then A and B are negatively correlated. => **similar to the definition of lift** => doesn't work well with large datasets with many null transactions.
  * So we introduce a **Kulczynski measure-based definition** of negative patterns: if itemsets A and B are frequent but \(\frac{P(A|B) + P(B|A)}{2} < \epsilon\), where \(\epsilon\) is a negative pattern threshold (for instance \(\epsilon \approx 0.01\)), then A and B are negatively correlated.

==== Mining Compressed Patterns ====

  * Plain frequent pattern mining can produce many scattered, not-so-meaningful patterns.
  * When considering **closed patterns** we cannot compress much because there is too much emphasis on the support values (eg. to merge close patterns we are required to have exactly the same support value).
  * When using **max-patterns** we may lose too much information by dropping the support values.
  * We would rather want something in between (eg. a good balance): keep some sub-patterns of the max-patterns so as to retain some of the support information.
  * To select which patterns can be compressed, we use a **pattern distance measure**: \[Dist(P_1,P_2) = 1 - \frac{|T(P_1) \cap T(P_2)|}{|T(P_1) \cup T(P_2)|}\]
  * Then we can perform \(\delta\)-**clustering**: for each pattern P, find all patterns which can be expressed by P and whose distance to P is within \(\delta\) (the \(\delta\)-cover) => all patterns in the cluster can be represented by P.
  * There is a method for efficient and **direct** mining of compressed patterns (instead of mining all patterns first and compressing afterwards): cf. Xin et al., VLDB'05.
  * **Redundancy-aware top-k patterns**: looking for high significance and low redundancy when compressing patterns.
    * Method: use MMS (Maximal Marginal Significance) for measuring the combined significance of a pattern set: cf. Xin et al., Extracting Redundancy-Aware Top-k Patterns, KDD'06.

===== Lesson 5: Sequential Pattern Mining =====

==== Key concepts ====

  * Sequential patterns, sequential pattern mining
  * Sequential pattern mining algorithms: GSP, SPADE, and PrefixSpan
  * Closed sequential patterns, the CloSpan algorithm

==== Sequential Pattern and Sequential Pattern Mining ====

  * Sequential pattern mining has broad applications. For instance: "Buy a laptop, then a digital camera, then a smartphone, within 6 months." => could also be applied to stocks and markets.
  * So we try to get sequential patterns out of vast **sequence DBs** (eg. entries with timestamps) or **time-series DBs** (evenly spaced timestamps).
  * For sequential patterns we also distinguish **gapped** and **non-gapped** patterns.
  * **Sequential pattern mining**: given a set of sequences, find the **complete set of frequent subsequences** (satisfying a given min-support threshold).
  * A **sequence** can be written for instance: < (ef) (ab) (df) c b > (the parentheses identify a shopping basket here, for instance).
  * Each **element** in a **sequence** may contain a set of **items** (also called **events**).
  * Items within an element are unordered (usually listed alphabetically).
  * Any "substring" keeping the parentheses at the appropriate locations in a **sequence** defines a **subsequence**, for instance < (ab) c b > is a **subsequence** of < (ef) (ab) (df) c b >.
  * Then given a support threshold, for instance min_sup = 2, we can find the **frequent sequential patterns**.
  * Algorithm requirements: **efficient**, **scalable**, **finding complete sets**, **incorporating various kinds of user-specific constraints**.
  * The **Apriori** property still holds for sequential patterns: //if a subsequence \(s_1\) is infrequent, none of \(s_1\)'s super-sequences can be frequent.//
  * We can then develop the same kinds of algorithms as before, based on the Apriori property:
    - **GSP** (Generalized Sequential Patterns): cf. Srikant & Agrawal @ EDBT'96
    - **SPADE** (vertical format-based mining): cf. Zaki @ Machine Learning'00
    - **PrefixSpan** (pattern-growth method): cf. Pei, et al. @ ICDE'01
  * To mine **closed sequential patterns** we have the **CloSpan** method: cf. Yan, et al. @ SDM'03
  * Finally we also have **constraint-based** sequential pattern mining.

==== GSP: Apriori-Based Sequential Pattern Mining ====

  * We start with a sequence database:

^ SID ^ Sequence ^
| 10 | <(bd)cb(ac)> |
| 20 | <(bf)(ce)b(fg)> |
| 30 | <(ah)(bf)abf> |
| 40 | <(be)(ce)d> |
| 50 | |
  - We start with the initial candidates: eg. all singleton sequences built from the items of the database: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>.
  - We scan the DB once and count the support of each candidate.
  - We prune the infrequent candidates.
  - We generate the length-2 candidate sequences:
    * either as two separate elements: <aa>, <ab>, ..., <ff>,
    * or as a single two-item element: <(ab)>, <(ac)>, ..., <(ef)>.

=> We repeat steps 2 to 4 until no candidate or frequent sequence can be found anymore.

=> Thanks to the pruning stage (Apriori property) we reduce the number of candidates for the next sequence length at each step.

==== SPADE: Sequential Pattern Mining in Vertical Data Format ====

  * A sequence database is mapped to <SID, EID> pairs (eg. sequence ID and element ID), for instance with:

^ SID ^ Sequence ^
| 1 | <a(abc)(ac)d(cf)> |
| 2 | <(ad)c(bc)(ae)> |
| 3 | <(ef)(ab)(df)cb> |
| 4 | |

We can build the mapping:

^ SID ^ EID ^ Items ^
| 1 | 1 | a |
| 1 | 2 | abc |
| 1 | 3 | ac |
| 1 | 4 | d |
| 1 | 5 | cf |
| 2 | 1 | ad |
| ... | ... | ... |

Then we record where each item occurs:

^ a ^^ b ^^ ... ^
^ SID ^ EID ^ SID ^ EID ^ ... ^
| 1 | 1 | 1 | 2 | ... |
| 1 | 2 | 2 | 3 | ... |
| 1 | 3 | 3 | 2 | ... |
| 2 | 1 | 3 | 5 | ... |
| 2 | 4 | 4 | 5 | ... |
| 3 | 2 | | | ... |
| ... | ... | | | ... |

Then to build length-2 sequences we check, for each sequence ID, the order of the element IDs:

^ ab ^^^ ba ^^^ ... ^
^ SID ^ EID(a) ^ EID(b) ^ SID ^ EID(b) ^ EID(a) ^ ... ^
| 1 | 1 | 2 | 1 | 2 | 3 | ... |
| 2 | 1 | 3 | 2 | 3 | 4 | ... |
| 3 | 2 | 5 | | | | ... |
| ... | ... | ... | | | | ... |

Then we can grow the subsequences (patterns) one item at a time, with Apriori-style candidate generation.

==== PrefixSpan: Sequential Pattern Mining by Pattern-Growth ====

  * We need to introduce the concepts of **prefix** and **suffix** (eg. projection). Starting from the DB:

^ SID ^ Sequence ^
| 1 | <a(abc)(ac)d(cf)> |
| 2 | <(ad)c(bc)(ae)> |
| 3 | <(ef)(ab)(df)cb> |
| 4 | |

We build the prefix/suffix DB (shown here for sequence 1):

^ Prefix ^ Suffix (Projection) ^
| <a> | <(abc)(ac)d(cf)> |
| <aa> | <(_bc)(ac)d(cf)> |
| <ab> | <(_c)(ac)d(cf)> |

  * Given a **sequence** we can thus extract a set of **prefixes** and **suffixes** (eg. prefix-based projections).
  * PrefixSpan mining method:
    - Find the length-1 sequential patterns: eg. <a>, <b>, <c>, <d>, <e>, <f>.
    - Then divide the search space to mine each projected DB:
      * the <a>-projected DB, the <b>-projected DB, ..., the <f>-projected DB.
    - In each projection, we find the length-1 frequent patterns and combine them with the prefix to get length-2 frequent sequential patterns, for instance (for the <a> prefix): <aa>, <ab>, <(ab)>, etc.

=> The major advantages of PrefixSpan are:
    * no candidate subsequences need to be generated,
    * the projected DBs keep shrinking.

  * Major cost of PrefixSpan: **constructing the projected DBs**:
    * suffixes largely repeat across the recursive projected DBs.
  * When the DB can be held in main memory we can use **pseudo-projection**:
    * Given s = <a(abc)(ac)d(cf)>:
      * the <a>-projection can be noted **s|<a>: (, 2)** (same as s, but starting at position 2),
      * the <ab>-projection can be noted **s|<ab>: (, 4)** (same as s, but starting at position 4).
    * => No physical copy is required: we use a pointer to the sequence and apply offsets.
  * When the DB does not fit in memory, we have to do physical projections until the data fits in memory.
  * For efficient mining, physical and pseudo-projections should be used together.
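A minimal sketch of the prefix-projection idea (my own simplified code, not from the course: sequences are flattened to single-item elements, so there are no <(_x)>-style within-element projections), showing how the projected databases shrink as the prefix grows:

<code python>
from collections import Counter

def prefixspan(db, min_sup, prefix=None):
    """Simplified PrefixSpan on sequences of single items: recursively
    project the DB on each frequent item and grow the prefix.
    Returns {pattern (tuple): support}."""
    prefix = prefix or ()
    patterns = {}
    # Count each item once per sequence of the (projected) DB.
    counts = Counter(item for seq in db for item in set(seq))
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        pattern = prefix + (item,)
        patterns[pattern] = sup
        # Project: keep, for each sequence, the postfix after the first 'item'.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        projected = [s for s in projected if s]
        patterns.update(prefixspan(projected, min_sup, pattern))
    return patterns

# Toy usage (hypothetical sequences):
db = [['a', 'b', 'c', 'd'], ['a', 'c', 'b'], ['b', 'c', 'a'], ['a', 'b', 'd']]
print(prefixspan(db, min_sup=2))
# e.g. ('a',): 4, ('a', 'b'): 3, ('a', 'b', 'd'): 2, ('b', 'c'): 2, ...
</code>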
==== CloSpan: Mining Closed Sequential Patterns ====

  * [def] **Closed sequential pattern** s: //there exists no super-pattern s' such that \(s' \supset s\) and s' and s have the same support.//
  * For instance, given three patterns \(s_1 \subset s_2 \subset s_3\) with supports 20, 20 and 15 respectively, \(s_2\) and \(s_3\) are **closed sequential patterns** (\(s_1\) is not, since \(s_2\) is a super-pattern with the same support).
  * Why try to mine closed sequential patterns?
    - To reduce the number of (redundant) patterns,
    - while keeping the same expressive power.
  * **Property** \(P_1\): //If \(s \supset s_1\), then s is closed **iff** the two projected DBs have the same size.//
  * From this property we can develop two pruning techniques, **backward sub-pattern** and **backward super-pattern** pruning, to remove redundant search space.
  * **CloSpan** was developed on this principle and greatly enhances efficiency: cf. Yan, et al., SDM'03.

===== Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns =====

==== Key concepts ====

  * Spatial association, progressive deepening
  * Spatial colocation pattern, colocation rule
  * Partition-based trajectory pattern mining, t-pattern
  * Moving object clusters: flock, convoy, and swarm
  * Trajectory clustering, partition-and-group framework
  * Semantics-rich movement patterns
  * Periodic movement patterns

==== Mining Spatial Associations ====

  * Spatial frequent patterns and association rules also take the form **A => B [s%, c%]**.
  * In the rule format above, A and B are sets of **spatial or non-spatial predicates**, for instance:
    * Topological relations: intersects, overlaps, disjoint, etc.
    * Spatial orientation: left_of, west_of, under, etc.
    * Distance information: close_to, within_distance, etc.
  * Example: is_a(X, large_town) ^ intersect(X, highway) -> adjacent_to(X, water) [7%, 85%]
  * We usually want to exploit **spatial autocorrelation** (because spatial data tends to be highly self-correlated: nearby things are more related than distant ones).
  * When mining spatial associations, we can apply the **progressive refinement** heuristic:
    * We can derive a hierarchy of spatial relationships: close_to can be seen as a generalization of near_by, touch, intersect, contain, ...
    * So we could start only with close_to, and if it is frequent, then try to **refine** it with the other relationships.
    * => **Progressive refinement**: //first search for rough relationships, and then refine them.//
  * Thus, a typical two-step mining of spatial associations:
    - Rough spatial computation as a filter: using **MBRs** (Minimum Bounding Rectangles) or an **R-tree** for rough estimation.
    - Detailed spatial algorithm (as refinement): applied only to the objects which have passed the rough association test (eg. those passing the min_support threshold).

==== Mining Spatial Colocation Patterns ====

  * [def] **Colocation pattern**: //a group of spatial features or events that are frequently co-located in the same region.//
  * On a graph, we could for instance use edges to denote the co-location of nearby points.
  * [def] Concept of **Rowset(C)**: //a neighbor-set L belongs to rowset(C) if every feature in pattern C appears as a feature of an instance in L.// FIXME: Try to understand and explain this further.
  * From the concept of rowset we can build a colocation rule R: A -> B, with conditional probability cp(R) defined as: \[cp(R) = \frac{|\{L \in rowset(A)\ |\ \exists L'\ s.t.\ (L \subseteq L') \land (L' \in rowset(A \cup B))\}|}{|rowset(A)|}\]
  * Following the example from the video lectures (which I don't really understand yet): \(cp(\{A,B\} \rightarrow \{C, D\}) = |rowset(\{A,B,C,D\})| / |rowset(\{A,B\})| \)
  * [def] **Participation ratio** pr(C,f): //the probability that C is observed in a neighbor-set wherever feature f is observed//: \[pr(C,f) = \frac{|\{r\ |\ (r \in S) \land (r.f = f) \land (\text{r is in a row instance of C})\}|}{|\{r\ |\ (r \in S) \land (r.f = f)\}|}\]
  * **Property**: monotonicity of the participation ratio: //let C, C' be two co-location patterns such that \(C' \subset C\); then for each feature \(f \in C'\): \(pr(C',f) \ge pr(C,f)\).//
  * Based on this monotonicity principle, an Apriori-like algorithm can be derived for efficient mining of colocation patterns:
    - Let min-feature-support = \(\sigma\) and min-pr = \(\rho\).
    - Start with the set of single-feature patterns \(\{p_1\}\) with support \(\ge \sigma\).
    - Grow to size k in an Apriori way (i.e. stop growing if the pattern is infrequent).
    - For each such pattern p, mine its super-patterns P such that \(pr(P, p) \ge \rho\), in an Apriori way.

==== Mining and Aggregating Patterns over Multiple Trajectories ====

  * **Partition-based trajectory pattern mining** (eg. mining T-patterns) (cf. F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi, Trajectory Pattern Mining, KDD'07):
    * Partition the space into equal-width grids and obtain **Regions of Interest** (RoIs).
    * Then transform each input trajectory into a **time-annotated symbolic sequence**, for instance:
      * Railway Station: [15min] -> Castle Square, [2h15min] -> Museum
      * Railway Station: [10min] -> Middle Bridge, [10min] -> Campus
    * Then use constraint-based sequential pattern mining to find trajectory patterns.
    * Then match T-patterns such as: \((x_0, y_0) \xrightarrow{\ \alpha_1\ } (x_1,y_1)\) (??)
  * **Detecting moving object clusters**:
    * **Flock** and **convoy**: both require k **consecutive** timestamps:
      * **Flock**: at least m entities are within a circular region of **radius r** and move in the same direction.
      * **Convoy**: density-based clustering at each timestamp, no need to use a rigid circle.
    * **Swarm**: the moving objects may not be close to each other at all consecutive timestamps. An efficient pattern mining algorithm can be derived for mining swarms (cf. Z. Li, et al., Swarm: Mining Relaxed Temporal Moving Object Clusters, VLDB'10).
  * **Trajectory clustering: a partition-and-group framework** (cf. J.-G. Lee, et al., Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07):
    * Grouping trajectories as a whole cannot find similar portions of trajectories.
    * Solution: discover common **sub-trajectories**, e.g. to forecast hurricane landfall.
    * Two phases:
      * **partitioning**: creating sequences of segments,
      * then **grouping**: find patterns in the segments created above.
    * Partitioning can be done using the minimum description length (MDL) principle: eg. reduce the number of points in the paths, but avoid introducing too much error (eg. real points too far away from the approximating segments).

==== Mining Semantics-Rich Movement Patterns ====

  * **Frequent movement pattern**: a movement sequence that frequently appears in the input trajectory database, even when we cannot rely on precise spatial locations, for instance: //Home -> Starbucks -> Office -> Restaurant -> Cafe -> Restaurant -> Bar//.
  * To mine frequent movement patterns we may need to group similar places to be able to find frequent subsequences.
  * Two-step top-down mining approach:
    - Find a set of coarse patterns that reflect people's semantics-level transitions (office -> restaurant, home -> gym),
    - Split each coarse pattern into several fine-grained ones by grouping similar movement snippets.
  * cf. C. Zhang et al., Splitter: Mining Fine-Grained Sequential Patterns in Semantic Trajectories, VLDB'14.
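To make the trajectory-to-sequence transformation used above (grid RoIs for T-patterns, semantic places for Splitter) concrete, here is a minimal sketch of my own (the grid size and trajectory are hypothetical) that turns raw (x, y, t) samples into a time-annotated symbolic sequence that a sequential pattern miner can consume:

<code python>
def trajectory_to_symbols(points, cell_size=100.0):
    """points: list of (x, y, t) samples of one trajectory, t in minutes.
    Map each point to a grid cell id (a crude 'region of interest') and
    emit (region, travel_time_from_previous_region) pairs."""
    symbols = []
    prev_region, prev_t = None, None
    for x, y, t in points:
        region = (int(x // cell_size), int(y // cell_size))
        if region != prev_region:              # keep only region transitions
            gap = None if prev_t is None else t - prev_t
            symbols.append((region, gap))
            prev_region, prev_t = region, t
    return symbols

# Hypothetical trajectory: (x, y, time in minutes)
trajectory = [(10, 20, 0), (50, 80, 5), (150, 90, 20), (160, 95, 25), (310, 400, 70)]
print(trajectory_to_symbols(trajectory))
# [((0, 0), None), ((1, 0), 20), ((3, 4), 50)]
</code>

The resulting symbolic sequences (one per trajectory) can then be mined with GSP/PrefixSpan-style methods as in Lesson 5.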
==== Mining Periodic Movement Patterns ====

  * **Pattern discovery in sparse data** (for instance finding bird flight patterns).
  * Periodicity may show up around some reference **spots** (or **cluster centers**).
  * Reference spots can be detected using a **density-based method**.
  * Periods are detected for each reference spot using the **Fourier transform** and **auto-correlation**.
  * cf. Z. Li, et al., Mining Periodic Behaviors for Moving Objects, KDD'10
  * cf. Z. Li, et al., ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 2015

==== Recommended Readings ====

  * F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi: Trajectory Pattern Mining. KDD'07
  * Y. Huang, S. Shekhar, H. Xiong: Discovering colocation patterns from spatial data sets: A general approach. IEEE Trans. on Knowledge & Data Eng., 16(12), 2004
  * Y. Huang, J. Pei, H. Xiong: Mining Co-Location Patterns with Rare Events from Spatial Data Sets. GeoInformatica 10(3): 239-260, 2006
  * K. Koperski, J. Han: Discovery of Spatial Association Rules in Geographic Information Databases. SSD'95
  * J.-G. Lee, J. Han, and K.-Y. Whang: Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07
  * Z. Li, B. Ding, J. Han, R. Kays: Swarm: Mining Relaxed Temporal Moving Object Clusters. VLDB'10
  * Z. Li, B. Ding, J. Han, R. Kays, P. Nye: Mining Periodic Behaviors for Moving Objects. KDD'10
  * Z. Li, J. Wang and J. Han, ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 27(5): 1219-1232, 2015
  * C. Zhang, J. Han, L. Shou, J. Lu, and T. La Porta: Splitter: Mining Fine-Grained Sequential Patterns in Semantic Trajectories. VLDB'14

===== Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data =====

==== Key concepts ====

  * Phrase mining; training-based phrase mining vs. topic modeling-based phrase mining
  * TNG, TurboTopics; KERT; criteria to judge the quality of phrases
  * ToPMine; word collocation; phrasal segmentation; PhraseLDA; SegPhrase

==== From Frequent Pattern Mining to Phrase Mining ====

  * An important goal when mining text data is to mine **quality phrases**.
  * Unigrams (eg. single words) vs. phrases:
    * Unigrams are often **ambiguous**,
    * a phrase is a natural, meaningful, unambiguous semantic unit.
  * Thus, it is interesting to mine semantically meaningful phrases:
    * Transforming text data from **word granularity** to **phrase granularity**,
    * Enhancing the power and efficiency of manipulating unstructured data.
  * How do we go from frequent patterns to phrases?
    * **General principle**: exploit information redundancy and data-driven criteria to determine phrase boundaries and salience.
    * **Methodology**:
      - Frequent pattern mining and collocation analysis,
      - Phrasal segmentation,
      - Quality phrase assessment.
  * Recent developments in phrase mining:
    * **ToPMine**: mining quality phrases without training (A. El-Kishky, et al., 2015)
    * **SegPhrase**: mining quality phrases with tiny training sets (J. Liu, et al., 2015)

==== Previous Phrase Mining Methods ====

  * Phrase mining came from the NLP community as **chunking**:
    * Phrase extraction is modeled as a sequence labeling problem (B-NP, I-NP, O, ...).
  * Previous methods required annotation and training:
    * Annotate hundreds of documents as training data,
    * Train a supervised model based on part-of-speech features.
  * Recent trend:
    * use distributional features based on **web n-grams** (Bergsma et al., 2010),
    * state-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score.
  * Limitations:
    * high annotation cost, not scalable to a new language or a new domain/genre,
    * may not fit domain-specific, dynamic, emerging applications (such as scientific domains, query logs, or social media).
  * Thus we try to perform unsupervised phrase mining and topic modeling instead.
  * Many studies of unsupervised phrase mining are linked with **topic modeling**:
    * Each **topic** is represented by a word distribution.
    * **Documents** are represented by multiple topics in different proportions.
    * => This approach does not require any prior annotation or labeling of the documents.
  * With topic modeling we rather rely on **statistical topic modeling algorithms**:
    * The most common algorithm: LDA (Latent Dirichlet Allocation) [cf. Blei, et al., 2003]
  * Three strategies to perform phrase mining using topic modeling:
    * Strategy 1: simultaneously infer the topics and the phrase tokens (generate bag-of-words -> generate sequence of tokens),
    * Strategy 2: post bag-of-words model inference, visualize topics with n-grams,
    * Strategy 3: prior to bag-of-words model inference, mine phrases and impose them on the bag-of-words model.

=== Strategy 1: Simultaneously Inferring Phrases and Topics ===

  * **Bigram Topic Model** [cf. Wallach'06]:
    * Probabilistic generative model that conditions on the previous word and topic when drawing the next word.
  * **Topical N-Grams** (TNG) [Wang, et al.'07] (a generalization of the Bigram Topic Model):
    * Probabilistic model that generates words in textual order,
    * Creates n-grams by concatenating successive bigrams.
  * **Phrase-Discovering LDA** (PDLDA) [Lindsey, et al.'12]:
    * Viewing each sentence as a time-series of words, PDLDA posits that the generative parameter (topic) changes periodically.
    * Each word is drawn based on the previous m words (context) and the current phrase topic.
  * Limitations:
    * High model complexity: tends to overfit,
    * High inference cost: slow.

=== Strategy 2: Post Topic-Modeling Phrase Construction: TurboTopics (I) ===

  * **TurboTopics** [Blei & Lafferty'09]: phrase construction as a post-processing step to Latent Dirichlet Allocation:
    - Perform LDA on the corpus to assign each token a topic label.
    - Merge adjacent unigrams with the same topic label by a distribution-free permutation test on an arbitrary-length back-off model.
    - End recursive merging when all significant adjacent unigrams have been merged.

=== Strategy 2: Post Topic-Modeling Phrase Construction: KERT (II) ===

  * **KERT** [Danilevsky et al.'14]: phrase construction as a post-processing step to LDA:
    - Run bag-of-words model inference and assign a topic label to each token,
    - Perform **frequent pattern mining** to extract candidate phrases within each topic,
    - Perform **phrase ranking** based on four different criteria:
      * **Popularity**: eg. "information retrieval" vs. "cross-language information retrieval",
      * **Concordance**: eg. "powerful tea" vs. "strong tea",
      * **Informativeness**: eg. "this paper" (frequent but not informative),
      * **Completeness**: "vector machine" vs. "support vector machine".

==== ToPMine: Phrase Mining without Training Data ====

=== Strategy 3: First Phrase Mining, then Topic Modeling ===

  * Why in this order?
    * If we have for instance "Knowledge discovery using least square support vector machine classifiers..." (topics shown in blue and red in the lecture slides), inferring topics token by token won't let us generate the proper phrases.
    * With phrase mining first, we first group the phrase elements: "[Knowledge discovery] using [least square] [support vector machine] [classifiers]...", then apply the topics to the groups.
  * Techniques for this strategy:
    * First, phrase mining: document segmentation and then phrase ranking,
    * then topic model inference with phrase constraints.
  * **Phrase mining** stage:
    - **Frequent contiguous pattern mining**: extract candidate phrases and their counts.
    - Perform agglomerative merging of adjacent unigrams as guided by a **significance score**.
    - Perform document segmentation to count phrase occurrences (and compute the **rectified** (ie. true) phrase frequencies).
    - Do phrase ranking (using the criteria proposed in KERT): popularity, concordance, informativeness, completeness.
    - Finally do phrase-based topic modeling:
      * the mined bag-of-phrases is passed as input to **PhraseLDA**, an extension of LDA that constrains all words in a phrase to share the **same latent topic**.

=> This method essentially performs collocation mining:
    * Collocation: a sequence of words that occur **more frequently than expected**.
    * Often **interesting**: it conveys information not portrayed by its constituent terms.
    * Many different measures are used to extract collocations from a corpus [cf. Dunning 93, Pederson 96], for instance: mutual information, t-test, z-test, chi-squared test, likelihood ratio.
    * Pointwise Mutual Information: \(PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)}\)
    * Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm.
  * => ToPMine is efficient and generates high-quality topics and phrases without any training data.

==== SegPhrase: Phrase Mining with Tiny Training Sets ====

  * A small set of training data may enhance the quality of phrase mining (cf. J. Liu et al., Mining Quality Phrases from Massive Text Corpora, SIGMOD'15).
  * SegPhrase+, the overall framework:
    * **ClassPhrase**: frequent pattern mining, feature extraction, classification,
    * **SegPhrase**: phrasal segmentation and phrase quality estimation,
    * **SegPhrase+**: one more round to enhance the mined phrase quality.

=== Pattern Mining and Feature Extraction ===

  * Pattern mining for the candidate set:
    * Build a candidate phrase set by frequent pattern mining,
    * Mine frequent k-grams (k is typically small, eg. 6 in the experiments),
    * **Popularity** is measured by the raw frequencies of the words and phrases mined from the corpus.
  * **Feature extraction**, based on **concordance**:
    * Partition a phrase into two parts to check whether their co-occurrence is significantly higher than pure random.
  * **Feature extraction**, based on **informativeness**:
    * Quality phrases typically start and end with a non-stopword,
    * Use the average IDF over the words in the phrase to measure the semantics,
    * Usually, the probability of a quality phrase appearing in quotes, brackets, or connected by dashes should be higher (punctuation information) (eg. "state-of-the-art").
  * **Classification**:
    * Uses a tiny training set (about 300 labels for 1GB of corpus; phrases extracted from a knowledge base can also be used),
    * Labels indicate whether a phrase is a high-quality one or not: "support vector machine": 1, "the experiment shows": 0,
    * Construct models to distinguish quality phrases from poor ones,
    * Use the Random Forest algorithm to bootstrap different datasets with limited labels.
  * **Phrasal segmentation**: can tell which phrase is more appropriate:
    * Partition a sequence of words by maximizing the likelihood,
    * Consider a length penalty and filter out phrases with low rectified frequency.
  * So the process is repeated twice:
    * Classification --> Phrasal segmentation (=> **SegPhrase**),
    * --> Classification --> Phrasal segmentation (=> **SegPhrase+**).

==== Recommended Readings ====
  * S. Bergsma, E. Pitler, D. Lin, Creating robust supervised classifiers via web-scale n-gram data, ACL'2010
  * D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
  * D. M. Blei, A. Y. Ng, M. I. Jordan, J. D. Lafferty, Latent Dirichlet allocation. JMLR 2003
  * K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
  * M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM'14
  * A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB'15
  * R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL'12
  * J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD'15
  * A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB'10
  * X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. ICDM'07

===== Lesson 8: Advanced Topics on Pattern Discovery =====

==== Key concepts ====

  * Stream data mining; mining frequent patterns in data streams; the lossy counting algorithm
  * Software bug mining; pattern discovery for software bug mining; mining copy-and-paste bugs
  * Image analysis; pattern mining for image analysis; visual items; visual patterns
  * Three categories of privacy issues in data mining: input privacy, output privacy and owner privacy; k-anonymity and differential privacy; data perturbation for privacy-preserving pattern mining
  * Invisible data mining; new application exploration

==== Frequent Pattern Mining in Data Streams ====

  * **Data streams**: continuous, ordered, changing, fast, huge volume.
  * Random access is expensive: we need **single-scan algorithms**.
  * Most streams are low-level and multi-dimensional.
  * Stream data processing: **Stream Data Management System** (SDMS).
  * **Stream mining** vs **stream querying**:
    * For mining, we want to see the big picture (eg. less precision).
  * **Mining tasks**:
    * pattern mining in streams,
    * clustering,
    * etc. (much more).
  * Mining **precise** frequent patterns is unrealistic:
    * so we look for **approximate answers**.
  * How do we mine frequent patterns with a **good approximation**?
  * **Lossy Counting Algorithm** [Manku & Motwani, VLDB'02]:
    * => discard items with very low support count.
    * Given a support threshold \(\sigma\), an error threshold \(\epsilon\) and a stream length N:
      * Output: items with frequency counts exceeding \((\sigma - \epsilon) N\).
      * We undercount by at most: **frequency count error** \(\le \epsilon N\).
    * Approximation properties:
      * no false negatives,
      * false positives have a true frequency count of at least \((\sigma - \epsilon) N\),
      * the frequency count is underestimated by at most \(\epsilon N\).
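A minimal sketch of lossy counting over single items (my own simplified version of the bucket-based scheme, with bucket width \(\lceil 1/\epsilon \rceil\); the stream below is hypothetical):

<code python>
from math import ceil

class LossyCounter:
    """Simplified lossy counting over single items: each entry keeps a count
    and an error bound 'delta'; entries are pruned at bucket boundaries, so
    any true count is underestimated by at most epsilon * N."""
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = ceil(1 / epsilon)       # bucket width
        self.n = 0                           # stream length seen so far
        self.counts = {}                     # item -> (count, delta)

    def add(self, item):
        self.n += 1
        bucket = ceil(self.n / self.width)
        count, delta = self.counts.get(item, (0, bucket - 1))
        self.counts[item] = (count + 1, delta)
        if self.n % self.width == 0:         # end of bucket: prune small entries
            self.counts = {k: (c, d) for k, (c, d) in self.counts.items()
                           if c + d > bucket}

    def frequent(self, sigma):
        """Items whose estimated count exceeds (sigma - epsilon) * N."""
        return [k for k, (c, _) in self.counts.items()
                if c >= (sigma - self.epsilon) * self.n]

# Toy usage (hypothetical stream of items):
lc = LossyCounter(epsilon=0.1)
for x in "aabacbadaaacbbaaab":
    lc.add(x)
print(lc.frequent(sigma=0.3))   # ['a', 'b'] for this toy stream
</code>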
  * **Other issues**:
    * Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05),
    * Mining approximate frequent k-itemsets in data streams,
    * Mining sequential patterns in data streams.
  * Recommended readings:
    * G. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams”, VLDB’02
    * A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, ICDT'05

==== Pattern Discovery for Software Bug Mining ====

  * Finding bugs is challenging: there are often no clear specifications or properties, and it requires substantial human effort to analyze the data.
  * Software reliability analysis:
    * Static bug detection: check the code,
    * Dynamic bug detection or testing: run the code,
    * Debugging: given symptoms or failures, pinpoint the bug locations in the code.
  * Why pattern mining?
    * Code or running sequences contain hidden patterns,
    * Common patterns -> likely specification or property,
    * Violations (anomalies compared to the patterns) -> likely bugs.
  * Mining patterns to narrow down the scope of inspection:
    * => code locations or predicates that occur more in failing runs and less in passing runs are suspicious bug locations.
  * **Mining rules from source code**:
    * Bugs as deviant behavior (eg. by statistical analysis),
    * Mining programming rules (eg. by frequent itemset mining),
    * Mining function precedence protocols (eg. by frequent subsequence mining),
    * Revealing neglected conditions (eg. by frequent itemset/subgraph mining).
  * **Mining rules from revision histories**: by frequent itemset mining.
  * **Mining copy-paste patterns from source code**:
    * Find copy-paste bugs: CP-Miner (cf. Z. Li, S. Lu, S. Myagmar, Y. Zhou, "CP-Miner: A Tool for Finding Copy-Paste and Related Bugs in Operating System Code", OSDI'04)

=== Mining copy-and-paste bugs ===

  * Copy-pasting is common:
    * 12% in Linux (kernel?) code,
    * 19% in X Window (server?) code.
  * Copy-pasting is error-prone:
    * We can mine "forget-to-change" bugs by sequential pattern mining.
  * **Building a sequence database** from source code:
    * Tokenize each component:
      * Different operators, constants, keywords -> different tokens,
      * Same type of identifiers -> same token,
      * for instance: "old = 3;" -> "5 61 20"; and "new = 3;" -> "5 61 20".
    * Then **compute a hash** per statement: "5 61 20" -> 16.
    * So for a list of statements we produce a list of hashes: 65, 16, 16, 71, ...
  * **Detecting the "forget-to-change" bugs**:
    * Modifications to the sequential pattern mining algorithm:
      * **Constrain the max gap**: (16, 16, 71) ..... (16, 16, 10, 71)
    * Then combine neighboring copy-pasted segments repeatedly,
    * Then find conflicts: eg. identify names that cannot be mapped to any correspondence, and introduce an **unchanged ratio**:
      * if 0 < unchanged ratio < threshold, then report the location as a potential bug.
  * CP-Miner reported many copy-paste bugs in Linux, Apache, etc. (analyzing millions of LOC).

==== Pattern Discovery for Image Analysis ====

  * An image can be characterized by **visual primitives** (eg. **interest points**).
  * Each visual primitive can be described by visual features (eg. a high-dimensional feature vector).
  * Each image is a collection of visual primitives.

=== Visual Pattern Discovery ===

  * Visual primitives can be clustered into visual "items" (similar visual primitives belong to the same item).
  * Each visual primitive finds its k nearest neighbors in the image to form a "visual transaction".
  * Then mining "frequent itemsets" leads to semantically meaningful visual patterns.
  * So the process is:
    - images -> feature extraction -> visual primitives,
    - visual primitives -> feature quantization -> induced visual transactions,
    - induced visual transactions -> pattern discovery -> visual patterns.
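A minimal sketch of the transaction-induction step of this pipeline (my own toy code: the quantized visual words are given directly, and each primitive together with its k nearest neighbors in image coordinates forms one transaction):

<code python>
from math import dist

def induce_visual_transactions(primitives, k=2):
    """primitives: list of (x, y, item) where 'item' is the quantized visual
    word of the primitive. Each primitive plus its k nearest neighbors (by
    image distance) forms one transaction (a set of items)."""
    transactions = []
    for i, (x, y, item) in enumerate(primitives):
        neighbors = sorted(
            (j for j in range(len(primitives)) if j != i),
            key=lambda j: dist((x, y), primitives[j][:2]))[:k]
        transactions.append({item} | {primitives[j][2] for j in neighbors})
    return transactions

# Hypothetical quantized primitives: (x, y, visual word)
prims = [(0, 0, 'w1'), (1, 0, 'w2'), (0, 1, 'w3'), (10, 10, 'w1'), (11, 10, 'w2')]
print(induce_visual_transactions(prims, k=1))
# [{'w1', 'w2'}, {'w1', 'w2'}, {'w1', 'w3'}, {'w1', 'w2'}, {'w1', 'w2'}]
</code>

Frequent itemsets mined over such transactions are the candidate visual patterns; the spatial issues listed below (overlap, over-counting) still have to be handled.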
  * **Challenges**:
    * Images are spatial data,
    * The spatial configuration among the visual items matters,
    * Induced transactions may overlap with each other => over-counting problem.
  * **Uncertainties of visual items**:
    * Noisy clustering of visual primitives into visual items affects the visual patterns,
    * Visual synonymy and polysemy.

=== Recommended Readings ===

  * Hongxing Wang, Gangqiang Zhao, Junsong Yuan, Visual pattern discovery in image and video data: a brief survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4(1): 24-37 (2014)
  * Hongxing Wang, Junsong Yuan, Ying Wu, Context-Aware Discovery of Visual Co-Occurrence Patterns. IEEE Transactions on Image Processing 23(4): 1805-1819 (2014)
  * Gangqiang Zhao, Junsong Yuan, Discovering Thematic Patterns in Videos via Cohesive Sub-graph Mining. ICDM 2011: 1260-1265
  * Junsong Yuan, Ying Wu, Ming Yang, From frequent itemsets to semantically meaningful visual patterns. KDD 2007: 864-873

==== Advanced Topics on Pattern Discovery: Pattern Mining and Society: Privacy Issues ====

  * Data mining can have potential adverse side-effects.
  * Privacy and accuracy are typically contradictory in nature:
    * improving one often incurs a cost on the other.
  * Three categories of privacy issues:
    * Input privacy: distort or hide the data,
    * Output privacy (eg. knowledge hiding),
    * Owner privacy: hide owner/source information from the data.
  * **Ensuring input privacy**:
    * Approach 1: B2B (business-to-business) environment,
    * Approach 2: B2C (business-to-customer) environment.
  * **k-anonymity privacy** requirement: each equivalence class contains at least k records.
  * **Data perturbation**:
    * **Statistical distortion**: using randomization algorithms.
    * MASK [Rizvi & Haritsa VLDB'02]:
      * Flip each 0/1 bit with probability p (this may add a lot of spurious items),
      * Tune p carefully to achieve acceptable average privacy and good accuracy.
    * Cut-and-paste (C&P) operator [Evfimievski et al. KDD'02]:
      * Uniform randomization: each existing item in the real transaction is, with probability p, replaced with a new item not present in the original transaction.

==== Recommended Readings ====

  * R. Agrawal and R. Srikant, Privacy-preserving data mining, SIGMOD'00
  * C. C. Aggarwal and P. S. Yu, Privacy-Preserving Data Mining: Models and Algorithms, Springer, 2008
  * C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2014
  * A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. KDD'02
  * A. Gkoulalas-Divanis, J. Haritsa and M. Kantarcioglu, Privacy in Association Rule Mining, in C. Aggarwal and J. Han (eds.), Frequent Pattern Mining, Springer, 2014 (Chapter 15)
  * N. Li, T. Li, S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. ICDE'07
  * A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, TKDD 2007
  * S. Rizvi and J. Haritsa. Maintaining data privacy in association rule mining. VLDB'02
  * J. Vaidya, C. W. Clifton and Y. M. Zhu, Privacy Preserving Data Mining, Springer, 2010

==== Advanced Topics on Pattern Discovery: Looking Forward ====

  * Lots of research issues on pattern discovery are still waiting to be solved.
  * **Invisible pattern mining** can/could be used in many fields.