
Pattern Discovery in Data Mining

Course page: https://www.coursera.org/learn/data-patterns/home/welcome
Teacher: Prof. Jiawei Han
Textbook: Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei (2011) [Morgan Kaufmann]

  • Course structure:
    • Lesson 1: Pattern Discovery: Basic Concepts
    • Lesson 2: Efficient Pattern Mining Methods
    • Lesson 3: Pattern Evaluation
    • Lesson 4: Mining Diverse Frequent Patterns
    • Lesson 5: Sequential Pattern Mining
    • Lesson 6: Pattern Mining Applications: Mining Spatiotemporal and Trajectory Patterns
    • Lesson 7: Pattern Mining Applications: Mining Quality Phrases from Text Data
    • Lesson 8: Advanced Topics on Pattern Discovery
  • Basic concepts of pattern discovery: frequent pattern, closed pattern, max-pattern, and association rules.
  • Efficient pattern mining methods: Apriori, ECLAT, and FPgrowth.
  • Pattern evaluation measures: lift, chi-square, cosine, Jaccard, and Kulczynski.
  • Methods for mining sequential patterns: GSP, SPADE, PrefixSpan, and CloSpan.
  • Pattern-based classifications: CBA, CMAR, PatClass, DDPMine, and DPClass.
  • Why is pattern discovery an important task in data mining?
  • What are the concepts of frequent itemsets?
  • What are the concepts of association rules? How do you generate association rules from frequent itemsets?
  • Why do we need compressed representations of frequent patterns? Why do we need two compressed representations: closed patterns and max-patterns?
  • Pattern: a set of items, subsequences or substructures that occur frequently together (or strongly correlated) in a data set. Patterns represent intrinsic and important properties of datasets.
  • Pattern discovery: trying to uncover patterns from massive data sets. Finding inherent regularities in a dataset.
  • Pattern discovery is the foundation for many essential data mining tasks:
    1. Association, correlation and causality analysis,
    2. Mining sequential, structural (e.g. sub-graph) patterns,
    3. Pattern analysis in spatiotemporal, multimedia, time-series and stream data,
    4. Classification: Discriminative pattern-based analysis
    5. Cluster analysis: Pattern-based subspace clustering
  • Itemset: A set of one or more items.
  • k-itemset: \(X = \{x_1,\cdots,x_k\}\)
  • (absolute) support (count) of X: The number of occurrences of the itemset X among all the entries (transactions) of the dataset. (Note: this definition doesn't clearly match the example given by the teacher, as he uses a specific element \(x_i\) inside some itemsets and counts the number of times this element occurs in all itemsets in the dataset…) ⇒ clarification: X can be any itemset in this definition, a 1-itemset, 2-itemset, etc., even if each entry in the dataset is, say, a 5-itemset.
  • (relative) support of X: The fraction of entries (transactions) in the dataset that contain the itemset X (i.e. the probability that an entry contains X).
  • An itemset X is frequent if its support is no less than a given minimum support threshold (denoted \(\sigma\)).
  • Association rule: We define the association rule \(X \rightarrow Y (s,c)\), stating that if an entry contains the itemset X, then it also tends to contain the itemset Y, with support s and confidence c.
    • Above, s is the support of \(X \cup Y\): i.e. the probability that an entry contains \(X \cup Y\).
    • c is the conditional probability that an entry containing X also contains Y: \(c = \frac{sup(X \cup Y)}{sup(X)}\)
  • Subtle notation: The entries that contain \(X \cup Y\) are actually the intersection of the entries containing X and the entries containing Y (not the union!)
  • Association rule mining: find all the rules \(X \rightarrow Y\) that satisfy given minimum support and confidence thresholds.
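
A minimal Python sketch of these definitions (the toy transaction database and the rule below are made up for illustration):

transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

def support(itemset, db):
    # relative support: fraction of entries containing the itemset
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(X, Y, db):
    # confidence of the rule X -> Y: sup(X u Y) / sup(X)
    return support(X | Y, db) / support(X, db)

X, Y = {"beer"}, {"diaper"}
print(support(X | Y, transactions))    # s = 3/5 = 0.6
print(confidence(X, Y, transactions))  # c = 0.6 / 0.6 = 1.0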

With the definitions above we often have too many frequent patterns.

  • Solution 1: Closed patterns: A pattern (itemset) X is closed if X is frequent and there exists no super-pattern \(Y \supset X\) with the same support as X.
    • Closed patterns are a lossless compression of frequent patterns (i.e. they reduce the number of patterns without losing the support information).
  • Solution 2: Max-patterns: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern \(Y \supset X\).
    • Max-patterns are a lossy compression of frequent patterns: we only know that a given sub-pattern is frequent, but we no longer know its real support.

⇒ In many applications, mining closed patterns is more desirable than mining max-patterns.
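
A small Python sketch contrasting the two compressed representations, assuming the frequent itemsets and their absolute supports have already been mined (toy values):

frequent = {
    frozenset("a"): 4, frozenset("b"): 4, frozenset("ab"): 4,
    frozenset("ac"): 3, frozenset("abc"): 3,
}

def is_closed(X, freq):
    # no frequent super-pattern with the same support
    return all(not (X < Y and freq[Y] == freq[X]) for Y in freq)

def is_max(X, freq):
    # no frequent super-pattern at all
    return all(not X < Y for Y in freq)

closed = [set(X) for X in frequent if is_closed(X, frequent)]
maximal = [set(X) for X in frequent if is_max(X, frequent)]
print(closed)   # [{'a', 'b'}, {'a', 'b', 'c'}]  (supports are preserved in 'frequent')
print(maximal)  # [{'a', 'b', 'c'}]              (support of sub-patterns is lost)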

  • “Mining association rules between sets of items in large databases” (R. Agrawal, T. Imielinski and A. Swami, in Proc. of SIGMOD'93)
  • “Efficiently mining long patterns from databases” (R. J. Bayardo, in Proc. of SIGMOD'98)
  • “Discovering frequent closed itemsets for association rules” (N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, in Proc. of ICDT'99)
  • “Frequent Pattern Mining: Current Status and Future Directions” (J. Han, H. Cheng, D. Xin and X. Yan, in Data Mining and Knowledge Discovery, 15(1): 55-86, 2007)

Let \(\mathcal{I} = \{I_1,I_2,\dots,I_m\}\) be a set of binary attributes, called items. Let \(T\) be a database of transactions. Each transaction \(t\) is represented as a binary vector, with \(t[k] = 1\) if \(t\) bought the item \(I_k\), and \(t[k] = 0\) otherwise.

FIXME: Complete this section.

  • The downward closure property or the Apriori property of frequent patterns
  • The Apriori algorithm
  • ECLAT: Mining frequent patterns by exploring vertical data format
  • FPGrowth: A frequent pattern-growth approach
  • Closet+ for mining closed patterns
  • Downward closure property (also called Apriori): Any subset of a frequent itemset must be frequent.
  • Thus an efficient mining methodology will take advantage of this, with the following rule: given an itemset S, if any subset of S is infrequent, then S cannot be frequent (so S can be pruned and does not need to be counted).
  • From this Apriori pruning principle, 3 major methods have been developed:
    1. Level-wise, join-based approach: Apriori (Agrawal, Srikant @ VLDB'94)
    2. Vertical data format approach: Eclat (Zaki, Parthasarathy, Ogihara, Li @ KDD'97)
    3. Frequent pattern projection and growth: FPgrowth (Han, Pei, Yin @ SIGMOD'00)
  • Algorithm:
    1. Initially, scan database once to get frequent 1-itemset.
    2. Repeat:
      1. Generate candidate (k+1)-itemsets from frequent k-itemsets.
      2. Test the candidate (k+1)-itemsets to find which ones are frequent.
      3. set k = k+1
    3. Stop when no frequent or candidate set can be generated anymore.
    4. Return all the frequent itemsets found.
  • How to generate candidates at level k+1 (a Python sketch of this level-wise procedure is given after this list):
    1. Self-joining \(F_k\): e.g. {abc, abd, acd, ace, bcd} ⇒ {abcd, acde}
    2. Pruning: e.g. ade is not in the \(F_3\) above, so acde is pruned.
  • An SQL implementation can be derived to generate the candidates described above [not reported here, cf. corresponding course video]
  • Reducing number of transaction database scan passes:
    • Partitioning (Savasere, et al., 1995)
    • Dynamic itemset counting (Brin, et al., 1997)
  • Shrink the number of candidates:
    • Hashing (Park, et al., 1995)
    • Pruning by support lower bounding (Bayardo 1998)
    • Sampling (Toivonen, 1996)
  • Exploring special data structures
    • Tree projection (Aggarwal, et al. 2001)
    • H-miner (Pei, et al. 2001)
    • Hypercube decomposition (Uno, et al. 2004)
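
As referenced above, a compact Python sketch of the level-wise generate-and-test loop with self-join and pruning (toy transactions and threshold; the join is done with pairwise unions rather than the classical prefix-based self-join, which is good enough for illustration):

from itertools import combinations

transactions = [
    {"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"},
    {"a", "c", "e"}, {"b", "c", "d"},
]
min_sup = 2  # absolute support threshold

def count_frequent(candidates, db):
    # count each candidate and keep those meeting the threshold
    counts = {c: sum(1 for t in db if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_sup}

def generate_candidates(freq_k, k):
    # join F_k with itself, then prune candidates having an infrequent k-subset
    candidates = set()
    for union in (a | b for a in freq_k for b in freq_k if len(a | b) == k + 1):
        if all(frozenset(sub) in freq_k for sub in combinations(union, k)):
            candidates.add(union)
    return candidates

# Level 1: frequent single items, then repeat generate/test until nothing is left
level = count_frequent({frozenset([i]) for t in transactions for i in t}, transactions)
all_frequent, k = dict(level), 1
while level:
    level = count_frequent(generate_candidates(set(level), k), transactions)
    all_frequent.update(level)
    k += 1
print(all_frequent)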

Partitioning

  • ⇒ Only needs to scan the database twice to find frequent itemsets.
  • ⇒ We partition the database into k partitions, then:
  • Theorem: Any itemset that is potentially frequent in database TDB must be frequent in at least one of the partitions of TDB.
  • Method:
    • Scan 1: partition the database (each partition should fit in main memory, so we can avoid further I/O), and find all locally frequent itemsets in each partition.
    • Scan 2: consolidate global frequent patterns.
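
A hedged sketch of this two-scan scheme; local_miner is assumed to be any in-memory frequent-itemset miner (e.g. an adaptation of the Apriori sketch above) returning the locally frequent itemsets of one partition:

def make_partitions(db, k):
    # split the transaction list into k chunks, each assumed to fit in RAM
    size = (len(db) + k - 1) // k
    return [db[i:i + size] for i in range(0, len(db), size)]

def mine_with_partitioning(db, k, local_miner, rel_min_sup):
    # Scan 1: collect every itemset that is frequent in at least one partition
    candidates = set()
    for part in make_partitions(db, k):
        candidates |= set(local_miner(part, rel_min_sup))
    # Scan 2: count the global support of these candidates only
    globally_frequent = {}
    for c in candidates:
        sup = sum(1 for t in db if c <= t)
        if sup >= rel_min_sup * len(db):
            globally_frequent[c] = sup
    return globally_frequent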

Direct Hashing and Pruning (DHP)

  • DHP can be used to reduce the number of candidates.
  • Observation: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
  • Example: while counting the 1-itemset candidates a, b, c, d, e we also count the entries falling into 2-itemset buckets such as {ab, ad, ae}, {bd, be, de}, etc.
  • ab cannot be a candidate 2-itemset if the total count of its bucket {ab, ad, ae} is below the support threshold.
  • ECLAT (Equivalence Class Transformation): a depth-first search algorithm using set intersection.
  • Need to generate the Tid-list: i.e. the list of transaction IDs containing each item.
  • Then we can compute the intersections of the tid-lists: \(t(e) = \{T_{10}, T_{20}, T_{30}\}\), \(t(a) = \{T_{10}, T_{20}\}\), then \(t(ae) = \{T_{10}, T_{20}\}\)
  • Properties of Tid-lists:
    • \(t(X) = t(Y)\) ⇒ X and Y always happen together.
    • \(t(X) \subset t(Y)\) ⇒ transactions having X always have Y.
  • Frequent patterns can be derived from vertical intersections.
  • Can use diffset to accelerate mining:
    • Only keep track of differences of Tid-Lists (can improve memory usage).
  • Algorithm:
    1. Find frequent single items and partition the database based on each such item
    2. Recursively grow frequent patterns by doing the above for each partitioned database (also called conditional database)
    3. Use an FP-tree to facilitate efficient processing.
  • FIXME: I didn't get much out of that one, might need to check again later…
  • Closet+ algorithm to mine closed itemsets by Pattern-Growth (efficient and direct)
  • Itemset merging: if Y appears in every occurrence of X, then Y is merged with X.
  • Many tricks implemented in CLOSET+:
    • Hybrid tree projection,
    • Sub-itemset pruning,
    • Item skipping,
    • Efficient subset checking, etc.
  • R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
  • A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases”, VLDB'95
  • J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”, SIGMOD'95
  • S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database systems: Alternatives and implications”, SIGMOD'98
  • M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithm for discovery of association rules”, Data Mining and Knowledge Discovery, 1997
  • J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
  • M. J. Zaki and C.-J. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
  • J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets”, KDD'03
  • C. C. Aggarwal, M.A., Bhuiyan, M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014
  • Pattern evaluation; interestingness of patterns/rules
  • Popularly used measures in pattern evaluation: support, confidence, lift, chi-square, cosine, Jaccard
  • Null-invariance, the Kulczynski measure, and imbalance ratio
  • Not all of patterns and rules are interesting.
  • Interestingness measures can be classified as objective or subjective.
    1. Objective interestingness measures (such as support, confidence, correlation, etc) are defined by math formulas
    2. Subjective interestingness measures depend on one's perspective.
  • Thus mining can be done:
    1. in a query-based mode: asking the user what she would like to see
    2. against one's knowledge base: looking for unexpected, fresh or recent patterns/rules
    3. using visualization tools: multi-dimensional, interactive examination.
  • Not all strongly supported and/or confident association rules are interesting. In the lecture videos we have the following example:
    • play-basketball ⇒ eat-cereal [s=40%, c=66.7%]
    • not play-basketball ⇒ eat-cereal [s=35%, c=87.5%]
  • Lift: measure of dependent/correlated events:

\[lift(B,C) = \frac{c(B \rightarrow C)}{s(C)} = \frac{s(B \cup C)}{s(B) \times s(C)}\]

  • Lift(B,C) will tell how B and C are correlated:
    • Lift(B,C) = 1 ⇒ B and C are independent.
    • Lift(B,C) > 1 ⇒ we have a positive correlation
    • Lift(B,C) < 1 ⇒ we have a negative correlation
  • \(\chi^2\) is another measure to test correlated events:

\[\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}\]

  • General \(\chi^2\) rules:
    • \(\chi^2 = 0\): events are independent
    • \(\chi^2 > 0\): events are correlated (either positively or negatively)
  • But even Lift and \(\chi^2\) are not always good measures: they can, for instance, indicate a strong positive correlation when there is none.
  • [def] Null transactions: transactions that do not contain the itemset of interest.
  • Null invariance: value does not change with the number of null-transactions.
  • Lift and \(\chi^2\) are not null invariant.
  • A few Null-invariant measures: (all give values in the range [0,1])
    • \(AllConf(A,B) = \frac{s(A \cup B)}{max(s(A), s(B))}\)
    • \(Jaccard(A,B) = \frac{s(A \cup B)}{s(A) + s(B) - s(A \cup B)}\)
    • \(Cosine(A,B) = \frac{s(A \cup B)}{\sqrt{s(A) \times s(B)}}\)
    • \(Kulczynski(A,B) = \frac 12 \left( \frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)} \right)\)
    • \(MaxConf(A,B) = max \left( \frac{s(A \cup B)}{s(A)}, \frac{s(A \cup B)}{s(B)} \right)\)
  • null-transactions will often be very numerous, so null invariant measures are critical.
  • Not all null invariant measures are equal.
  • Kulczynski is the most appropriate when we have contradictory correlations, so this one seems to be the one to use by default.
  • Introducing the Imbalance Ratio (IR): it measures the imbalance of two itemsets A and B in rule implications (a small code sketch of these measures follows the formula below):

\[IR(A,B) = \frac{|s(A) - s(B)|}{s(A)+s(B)-s(A \cup B)}\]
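
A small Python sketch of the measures above, written directly over the relative supports s(A), s(B) and s(A ∪ B) (the numeric values are made-up inputs):

import math

def lift(s_a, s_b, s_ab):
    return s_ab / (s_a * s_b)

def all_conf(s_a, s_b, s_ab):
    return s_ab / max(s_a, s_b)

def jaccard(s_a, s_b, s_ab):
    return s_ab / (s_a + s_b - s_ab)

def cosine(s_a, s_b, s_ab):
    return s_ab / math.sqrt(s_a * s_b)

def kulczynski(s_a, s_b, s_ab):
    return 0.5 * (s_ab / s_a + s_ab / s_b)

def max_conf(s_a, s_b, s_ab):
    return max(s_ab / s_a, s_ab / s_b)

def imbalance_ratio(s_a, s_b, s_ab):
    return abs(s_a - s_b) / (s_a + s_b - s_ab)

s_a, s_b, s_ab = 0.6, 0.3, 0.2          # made-up relative supports
print(kulczynski(s_a, s_b, s_ab))       # 0.5   -> neutral according to Kulczynski
print(imbalance_ratio(s_a, s_b, s_ab))  # ~0.43 -> A and B are quite imbalanced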

  • Kulczynski + Imbalance ratio together can present a clear picture of all the typical cases.
  • C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS'98
  • S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97
  • M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94
  • E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03
  • P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02
  • T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397, 2010
  • Multiple-level associations, level-shared mining, redundant rules
  • Multi-dimensional associations, quantitative associations
  • Negative correlations vs. rare patterns, null-invariance
  • Compressed patterns, redundancy-aware top-k patterns
  • We can often have a hierarchy in items (eg. milk ⇒ 2% milk ⇒ Dairyland 2% milk)
  • When we have such an item hierarchy we might mine with:
    1. Uniform min-support across multiple levels (is that reasonable?)
    2. Level-reduced min-support: items at the lower levels are expected to have lower support.
  • How to perform efficient mining with multiple min-support values? ⇒ shared multi-level mining: use the lowest min-support to pass down the set of candidates.
  • Another problem with multi-level association mining is redundancy
    • example: 2% milk is 25% of sold milk, and
    • milk ⇒ wheat bread [s=8%, c=70%] (1)
    • 2% milk ⇒ wheat bread [s=2%, c=72%] (2)
    • Rule number 2 can be derived from rule 1 so it is redundant.
  • ⇒ Redundancy filtering required.
  • We may also require customized min-supports for different kinds of items (for example when very expensive items are mixed with common items):
    • Solution: use group-based individualized min-support:
    • eg. {diamond, watch}: 0.05%, {bread, milk}: 5%

⇒ scalable mining methods can easily be extended to take customized min-support into account.
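
A tiny sketch of group-based individualized min-support. One common convention, assumed here, is to check an itemset against the lowest threshold among its items; the groups and numbers are illustrative:

group_min_sup = {"diamond": 0.0005, "watch": 0.0005, "bread": 0.05, "milk": 0.05}

def min_sup_of(itemset):
    # threshold applied to an itemset: the lowest min-support among its items
    return min(group_min_sup[i] for i in itemset)

def is_frequent(itemset, rel_support):
    return rel_support >= min_sup_of(itemset)

print(is_frequent({"diamond", "watch"}, 0.001))  # True:  0.1% >= 0.05%
print(is_frequent({"bread", "milk"}, 0.001))     # False: 0.1% <  5%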

  • Simple Single-dimensional rule: buys(X, “milk”) ⇒ buys(X, “bread”)
  • Multi-dimensional rules:
    • Inter-dimension association rules (no repeated predicate):
      • age(X, “18-25”) ^ occupation(X, “student”) ⇒ buys(X, “coke”)
    • Hybrid-dimension association rules (repeated predicate):
      • age(X, “18-25”) ^ buys(X, “popcorn”) ⇒ buys(X, “coke”)
  • Attributes can be:
  • categorical (i.e. discrete values): no ordering
  • quantitative (numerical values): implicit ordering
  • Mining associations with numerical attributes (eg. age, salary, etc…)
  • Mining methods:
    • Static discretization ⇒ data cube-based aggregation
    • Dynamic discretization ⇒ based on data distribution
    • Clustering (eg distance-based association)
    • Perform Deviation analysis, for instance: Gender=female ⇒ Wage: mean=$7/hr (overall mean = $9/hr);
  • Mining extraordinary or interesting phenomena, in the gender=female example above:
  • Left Hand Side (LHS) : a subset of the population
  • Right Hand Side (RHS) : an extraordinary behavior of this subset
  • An extraordinary behavior is accepted only if a statistical test (e.g. Z-test) confirms the inference with high confidence.
  • Subrule: can highlight the extraordinary behavior of a subset of the population of the super rule, for instance:
    • (gender=female) ^ (South=yes) ⇒ mean wage = $6.3/hr
  • Rule condition can be categorical or numerical (quantitative rules):
    • Education in [14-18] (yrs) ⇒ mean wage = $11.64/hr
  • Efficient methods to mine extraordinary rules: cf. Aumann and Lindell@KDD'99
  • Rare patterns : very low support but still very interesting (can be mined with individualized group-based min-support thresholds)
  • Negative patterns: negatively correlated itemsets (i.e. unlikely to happen together)
  • Support-based definition of negative patterns: if itemsets A and B are both frequent but rarely occur together, \(sup(A \cup B) \ll sup(A) \times sup(B)\), then A and B are negatively correlated. ⇒ similar to the definition of lift ⇒ doesn't work well with large datasets with many null-transactions.
  • So we introduce a Kulczynski measure-based definition of negative patterns: If itemset A and B are frequent but \(\frac{P(A|B) + P(B|A)}{2} < \epsilon\) where \(\epsilon\) is a negative pattern threshold, then A and B are negatively correlated. (for instance \(\epsilon\) ~ 0.01)
  • Mining all frequent patterns can produce many scattered patterns that are not so meaningful.
  • When considering closed patterns we cannot compress much because there is too much emphasis on the support values (i.e. to merge two closed patterns we are required to have exactly the same support value).
  • When using max-patterns we might lose too much information in the missing support values.
  • We would rather have something in between (i.e. a good balance): keep some sub-patterns of the max-patterns to retain some of the support information.
  • To select which patterns can be compressed, we use a pattern distance measure:

\[Dist(P_1,P_2) = 1 - \frac{|T(P_1) \cap T(P_2)|}{|T(P_1) \cup T(P_2)|}\]
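
In code, with T(P) represented as the set of IDs of the transactions containing P (toy tid-sets):

def pattern_distance(tids_p1, tids_p2):
    # 1 - |T(P1) intersect T(P2)| / |T(P1) union T(P2)|
    return 1 - len(tids_p1 & tids_p2) / len(tids_p1 | tids_p2)

T_P1 = {1, 2, 3, 4}
T_P2 = {1, 2, 3}
print(pattern_distance(T_P1, T_P2))  # 0.25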

  • Then we can perform \(\delta\)-clustering: for each pattern P, find all patterns which can be expressed by P and whose distance to P is within \(\delta\) (\(\delta\)-cover) ⇒ All patterns in the cluster can be represented by P.
  • There is a method for efficient and direct mining of compressed patterns (instead of mining all patterns first before compressing): cf. Xin et al. VLDB'05.
  • Redundancy-aware Top-k Patterns: looking for high significance and low redundancy when compressing patterns.
  • Method: Use MMS (Maximal Marginal Significance) for measuring the combined significance of a pattern set: cf. Xin et al., Extracting Redundancy-Aware Top-k Patterns, KDD'06.
  • Sequential patterns, sequential pattern mining
  • Sequential pattern mining algorithms: GSP, SPADE, and PrefixSpan
  • Closed sequential patterns, the CloSpan algorithm
  • Sequential pattern mining has broad applications. For instance: “Buy a laptop, then a digital camera, then smartphone, within 6 months.” ⇒ could also be applied to stocks and markets.
  • So trying to get sequential patterns out of vast sequence DBs (eg. entries with timestamp) or time-series DBs (evenly spaced timestamps).
  • For sequential patterns we also have the concept of gapped and non-gapped.
  • Sequential pattern mining: given a set of sequences, find the complete set of frequent subsequences (satisfying a given min-support threshold).
  • A sequence can be written for instance: < (ef) (ab) (df) c b > (eg. we use parenthesis to identify a shopping basket here for instance).
  • Each element in a sequence may contain a set of items (also called events)
  • Items within an element are unordered (usually listed alphabetically).
  • Any “substring” keeping the parentheses at the appropriate locations in a sequence defines a subsequence, for instance: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
  • Then, given a support threshold, for instance min_sup=2, we can find the frequent sequential patterns.
  • Algorithm requirements: efficient, scalable, finding complete sets, incorporating various kinds of user-specific constraints.
  • The Apriori property still holds for sequential patterns: if a subsequence \(s_1\) is infrequent, none of \(s_1\)'s super-sequences can be frequent.
  • We can then develop the same kind of algorithms as before, based on the Apriori property:
    1. GSP (Generalized Sequential Patterns): cf. Srikant & Agrawal @ EDBT'96
    2. SPADE (Vertical format-based mining): cf. Zaki @ Machine Learning'00
    3. PrefixSpan (Pattern-growth method): cf. Pei, et al. @ ICDE'01
  • To mine closed sequential patterns we have the CloSpan method: cf. Yan, et al. @ SDM'03
  • Finally we also have constraint-based sequential pattern mining.
  • We start with a sequence database:
SID Sequence
10 <(bd)cb(ac)>
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>
  1. We start with initial candidates: eg. all singleton sequences: <a>,<b>,<c>,<d>,<e>,<f>,<g>,<h>
  2. We scan the DB once and count support for each candidate.
  3. Prune infrequent candidates: say <g> and <h> here.
  4. Generate length-2 candidate sequences:
    • as items in separate elements: <aa>, <ab>, …, <ff>
    • or as elements in the sequence: <(ab)>, <(ac)>, …, <(ef)>.

⇒ We repeat steps 2 to 4 until there is no candidate or frequent sequences found anymore.

⇒ With the pruning stage due to Apriori property we can reduce the number of candidates for the next length sequences on each step.
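
A simplified Python sketch of this GSP-style level-wise loop, restricted to sequences of single items (no parenthesized itemset elements) so the containment test stays short; sequences and threshold are illustrative:

sequence_db = [
    ["b", "d", "c", "b", "a"],
    ["b", "f", "c", "b", "f"],
    ["a", "b", "a", "b", "f"],
    ["b", "c", "d"],
]
min_sup = 2

def contains(seq, sub):
    # True if 'sub' is a (gapped) subsequence of 'seq'
    it = iter(seq)
    return all(any(x == y for y in it) for x in sub)

def support(sub, db):
    return sum(1 for seq in db if contains(seq, sub))

def gsp(db, min_sup):
    freq_items = [x for x in sorted({i for s in db for i in s})
                  if support((x,), db) >= min_sup]
    level = [(x,) for x in freq_items]    # frequent length-1 sequences
    frequent = list(level)
    while level:
        # grow only frequent sequences (Apriori: an infrequent prefix cannot grow)
        candidates = [s + (x,) for s in level for x in freq_items]
        level = [c for c in candidates if support(c, db) >= min_sup]
        frequent.extend(level)
    return frequent

print(gsp(sequence_db, min_sup))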

  • A sequence database is mapped to <SID, EID> (eg. sequence ID and Element ID), for instance with:
SID Sequence
1 <a(abc)(ac)d(cf)>
2 <(ad)c(bc)(ae)>
3 <(ef)(ab)(df)cb>
4 <eg(af)cbc>

We can build the mapping:

SID EID items
1 1 a
1 2 abc
1 3 ac
1 4 d
1 5 cf
2 1 ad

Then we figure out where each item occurs:

item a (SID, EID): (1,1), (1,2), (1,3), (2,1), (2,4), (3,2)
item b (SID, EID): (1,2), (2,3), (3,2), (3,5), (4,5)

Then to build length-2 sequences we check for each sequence ID the order of the Element IDs:

<ab> (SID, EID(a), EID(b)): (1,1,2), (2,1,3), (3,2,5)
<ba> (SID, EID(b), EID(a)): (1,2,3), (2,3,4)

Then we can grow the subsequences (patterns) one item at a time by Apriori candidate generation.
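
A sketch of this vertical (SPADE-style) representation on the example database above (sequence 4 is encoded with e and g as separate elements, consistent with the tables): build the (SID, EID) id-lists, then join them temporally to obtain length-2 sequences such as <ab>:

from collections import defaultdict

sequence_db = {
    1: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    2: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    3: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    4: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

# id-lists: item -> list of (SID, EID), with EIDs starting at 1
idlists = defaultdict(list)
for sid, seq in sequence_db.items():
    for eid, element in enumerate(seq, start=1):
        for item in element:
            idlists[item].append((sid, eid))

def temporal_join(list_x, list_y):
    # id-list of the sequence <x y>: x occurs strictly before y in the same sequence
    return [(sx, ex, ey) for (sx, ex) in list_x for (sy, ey) in list_y
            if sx == sy and ex < ey]

ab = temporal_join(idlists["a"], idlists["b"])
print(ab)                               # (SID, EID(a), EID(b)) triples
print(len({sid for sid, _, _ in ab}))   # support of <ab> = number of distinct SIDs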

  • Need to introduce the concepts of prefix and suffix (i.e. projection). Starting from the DB:
SID Sequence
1 <a(abc)(ac)d(cf)>
2 <(ad)c(bc)(ae)>
3 <(ef)(ab)(df)cb>
4 <eg(af)cbc>

We build the prefix/suffix DB:

Prefix Suffix (Projection)
<a> <(abc)(ac)d(cf)>
<aa> <(_bc)(ac)d(cf)>
<ab> <(_c)(ac)d(cf)>

  • Given a sequence we can thus extract a set of prefixes and suffixes (i.e. prefix-based projections).
  • PrefixSpan mining method:

  1. Find length-1 sequential patterns: eg. <a>, <b>, <c>, <d>, <e>, <f>
  2. Then divide the search space to mine each projected DB:
    • <a>-projected DB, <b>-projected DB, …, <f>-projected DB.
  3. In each projected DB, we find the length-1 frequent patterns, then combine them with the prefix to get length-2 frequent sequential patterns, for instance: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, etc.
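
A simplified PrefixSpan sketch over sequences of single items (no itemset elements), using physical projection for clarity; data and threshold are illustrative:

from collections import defaultdict

def prefixspan(db, min_sup, prefix=()):
    patterns = []
    counts = defaultdict(int)
    for seq in db:                 # count items occurring in the (projected) postfixes
        for item in set(seq):
            counts[item] += 1
    for item in sorted(counts):
        if counts[item] < min_sup:
            continue
        new_prefix = prefix + (item,)
        patterns.append((new_prefix, counts[item]))
        # <new_prefix>-projected DB: postfix after the first occurrence of 'item'
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        projected = [p for p in projected if p]
        patterns.extend(prefixspan(projected, min_sup, new_prefix))
    return patterns

db = [["a", "b", "c", "a", "c"], ["a", "d", "c", "b"], ["e", "a", "b", "c", "b"]]
print(prefixspan(db, min_sup=2))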

⇒ The major advantages of PrefixSpan are:

  • No candidate subseqs to be generated,
  • Projected DBs keep shrinking.
  • Major cost of PrefixSpan: constructing projected DBs
    • suffixes are largely repeating in recursive projected DBs.
    • When the DB can be held in main memory we can use pseudo-projection:
      • Given s=<a(abc)(ac)d(cf)>:
      • the <a> projection can be noted: s|<a>: (,2) (same as s, but start at position 2)
      • the <ab> projection can be noted: s|<ab>: (,4) (same as s, but start at position 4)
      • ⇒ No physical copy required, using pointer to the sequence and applying offsets.
    • When the DB does not fit in memory, we have to do a physical projection until the data can fit in memory.
    • For efficient mining both physical and pseudo projections should be used together.
  • [def] closed sequential pattern s: There exists no superpattern s' such that \(s' \supset s\) and s' and s have the same support.
  • For instance, given: <abc>: 20, <abcd>: 20 and <abcde>: 15 ⇒ <abcd> and <abcde> are closed sequential patterns.
  • Why try to mine closed sequential patterns?
    1. Reduce the number of (redundant) patterns
    2. While keeping the same expressive power
  • Property: if \(s \subseteq s'\), then the projected databases of s and s' are identical iff they have the same size (this is the basis for pruning in CloSpan).
  • From this property we can develop 2 pruning techniques: backward sub-pattern and backward super-pattern pruning, to remove redundant search space.
  • CloSpan was developed on this principle and greatly enhances efficiency. cf. Yan, et al. SDM'03
  • Spatial association, progressive deepening
  • Spatial colocation pattern, colocation rule
  • Partition-based trajectory pattern mining, t-pattern
  • Moving object clusters: flock, convoy, and swarm
  • Trajectory clustering, partition-and-group framework
  • Semantics-rich movement patterns
  • Periodic movement patterns
  • Spatial frequent patterns and association rules are also in the form: A ⇒ B [s%, c%]
  • In the rule format above, A and B are sets of spatial or non-spatial predicates, for instance:
    • Topological relations: intersects, overlaps, disjoint, etc.
    • Spatial orientation: left_of, west_of, under, etc.
    • Distance information: close_to, within_distance, etc.
  • Example: is_a(X, large_town) ^ intersect(X, highway) -> adjacent_to(X, water) [7%, 85%]
  • We usually want to explore spatial autocorrelation (because spatial data tends to be highly self-correlated: nearby things are more related than distant ones).
  • When mining spatial associations, we can apply the progressive refinement heuristic:
    • Because we can derive a hierarchy of spatial relationships: close_to can be seen as a generalization of near_by, touch, intersect, contain, …
    • So we could start only with close_to, and if it is frequent, then try to refine with the other relationships.
    • Progressive refinement: First search for rough relationship, and then refine it.
  • Thus, typical 2-steps mining of spatial associations:
  1. Rough spatial computation as a filter: using MBRs (Minimum Bounding Rectangles) or an R-tree for rough estimation.
  2. Detailed spatial algorithm (as refinement) : apply only to the objects which have passed the rough association test (eg. those passing the min_support threshold).
  • [def] Colocation pattern: A group of spatial features or events that are frequently co-located in the same region.
  • On a graph, we could for instance use edges to denote the co-location of nearby points.
  • [def] Rowset(C): the set of neighbor-sets L such that every feature in pattern C appears as a feature of an instance in L. FIXME: Try to understand and explain this further.
  • From the concept of Rowset we can build a colocation rule R: A -> B, with conditional probability cp(R) defined as:

\[cp(R) = \frac{|\{L \in rowset(A) | \exists L' s.t. (L \subseteq L') \land (L' \in rowset(A \cup B))\}|}{|rowset(A)|}\]

  • Example from the video lectures (I don't fully understand this yet): \(cp(\{A,B\} \rightarrow \{C, D\}) = |rowset(\{A,B,C,D\})| / |rowset(\{A,B\})|\)
  • [def] participation ratio pr(C,f): probability that C is observed in a neighbor-set wherever feature f is observed:

\[pr(C,f) = \frac{|\{r | (r \in S) \land (r. f = f) \land (\text{r is in a row instance of C})\}|}{|\{r | (r \in S) \land (r.f = f)\}|}\]

  • Property: Monotonicity of participation ratio: let C, C' be two co-location patterns such that \(C' \subset C\) , then for each feature \(f \in C': pr(C',f) \ge pr(C,f)\).
  • Based on previous monotonicity principle, an Apriori-like algorithm can be derived for efficient mining of colocation patterns:
    1. let min-feature-support = \(\sigma\), and min-pr = \(\rho\)
    2. Start with a set of single feature pattern \(\{p_1\}\) with support \(\ge \sigma\)
    3. Grow to size k, in Apriori way (i.e, stop growing if the pattern is infrequent)
    4. For each such pattern p, mine its super-pattern P, such that \(pr(P, p) \ge \rho\) in an Apriori way.
  • Partition-Based Trajectory Pattern Mining (eg. Mining T-Patterns) (cf. F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi, Trajectory Pattern Mining, KDD’07)
    • Partition the space into equal-width grids and obtain Regions of Interests (RoIs)
    • Then transform each input trajectory into a Time-annotated symbolic sequence, for instance:
      • Railway Station: [15min] -> Castle Square, [2h15min] -> Museum
      • Railway Station: [10min] -> Middle Bridge, [10min] -> Campus
    • Then use constraint-based sequential pattern mining to find trajectory patterns.
      • Then match T-patterns such as: \((x_0, y_0) - \alpha_1 \rightarrow (x_1,y_1)\), where \(\alpha_1\) is the transition-time annotation.
  • Detecting Moving Objects clusters:
    • Flock and convoy: Both require k consecutive time stamps.
    • Flock: At least m entities are within a circular region of radius r and they move in the same direction.
    • Convoy: Density-based clustering at each timestamp, no need to use a rigid circle.
    • Swarm: Moving objects may not be close to each other for all consecutive timestamps. Efficient pattern mining algorithm can be derived for mining swarm (cf. Z. Li, et al., Swarm: Mining Relaxed Temporal Moving Object Clusters, VLDB'10)
  • Trajectory Clustering: A Partition and Group Framework (cf. J.-G. Lee, et al., Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07)
    • Grouping trajectories as a whole ⇒ cannot find similar portions of trajectories.
    • Solution: discover common sub-trajectories, e.g. to forecast hurricane landfall.
    • 2 phases:
      • partitioning: creating sequences of segments,
      • then grouping: may find patterns in the segments created above.
  • Partitioning can be done using the minimum description length (MDL) principle: i.e. reduce the number of points in the paths, but avoid introducing too much error (i.e. original points ending up too far away from the approximating segments).
  • Frequent movement pattern: a movement sequence that frequently appears in the input trajectory database (even if we cannot consider the precise spatial locations), for instance: Home -> Starbucks -> Office -> Restaurant -> Cafe -> Restaurant -> Bar
  • To mine Frequent movement patterns we may need to group similar places to be able to find frequent subsequences.
  • 2-steps top-down mining approach:
    1. Find a set of coarse patterns that reflect people's semantics-level transitions (office -> restaurant, home -> gym)
    2. Split each coarse pattern into several fine-grained ones by grouping similar movement snippets.
  • cf. C. Zhang et al., Splitter: Mining Fine-Grained Sequential Patterns in Semantic Trajectories, VLDB'14.
  • Pattern discovery in sparse data (for instance find bird flying patterns)
  • Periodicity might show up in some reference spots (or cluster centers)
  • Reference spots can be detected using density-based method
  • Periods are detected for each reference spot using Fourier Transform and auto-correlation
  • cf. Z. Li, et al., Mining Periodic Behaviors for Moving Objects, KDD'10
  • cf. Z. Li, et al., ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 2015
  • F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi: Trajectory Pattern Mining. KDD’07
  • Y. Huang, S. Shekhar, H. Xiong: Discovering colocation patterns from spatial data sets: A general approach. IEEE Trans. on Knowledge & Data Eng., 16(12), 2004
  • Y. Huang, J. Pei, H. Xiong: Mining Co-Location Patterns with Rare Events from Spatial Data Sets. GeoInformatica 10(3): 239-260, 2006
  • K. Koperski, J. Han: Discovery of Spatial Association Rules in Geographic Information Databases. SSD’95
  • J.-G. Lee, J. Han, and K.-Y. Whang: Trajectory Clustering: A Partition-and-Group Framework, SIGMOD'07
  • Z. Li, B. Ding, J. Han, R. Kays: Swarm: Mining Relaxed Temporal Moving Object Clusters. VLDB’10
  • Z. Li, B. Ding, J. Han, R. Kays, P. Nye: Mining Periodic Behaviors for Moving Objects. KDD’10
  • Z. Li, J. Wang and J. Han, ePeriodicity: Mining Event Periodicity from Incomplete Observations. IEEE TKDE, 27(5): 1219-1232, 2015
  • C. Zhang, J. Han, L. Shou, J. Lu, and T. La Porta: Splitter: Mining Fine Grained Sequential Patterns in Semantic Trajectories. VLDB'14
  • Phrase mining; training-based phrase mining vs. topic modeling-based phrase mining
  • TNG, TurboTopic; KERT; criteria to judge quality of phrases
  • ToPMine; word collocation; phrasal segmentation; PhraseLDA; SegPhrase
  • An important concept when mining text data is to mine quality phrases
  • Unigrams (eg. single words) vs. Phrases:
    • Unigrams are often ambiguous
    • Phrase: a natural, meaningful, unambiguous semantic unit
  • Thus, it would be interesting to mine semantically meaningful phrases:
    • Transforming text data from word granularity to phrase granularity
    • Enhance the power and efficiency at manipulating unstructured data
  • How do we go from frequent patterns to phrases ?
  • General principle: Exploit information redundancy and data-driven criteria to determine phrase boundaries and salience.
  • Methodology:
    1. Frequent pattern mining and collocation analysis
    2. Phrasal segmentation
    3. Quality phrase assessment
  • Recent developments of phrase mining:
    • ToPMine: Mining quality phrase without training (A. El-Kishky, et al., 2015)
    • SegPhrase: Mining quality phrase with tiny training sets (J. Liu, et al., 2015)
  • Phrase mining came from the NLP community as chunking
    • Modeling phrase extraction as a sequence labeling problem (B-NP, I-NP, O, …)
  • Previous methods required annotation and training:
    • Annotate hundreds of documents as training data
    • Train a supervised model based on part-of-speech features
  • Recent trend:
    • use distributional features based on web n-grams (Bergsma et al., 2010)
    • state-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score.
    • Limitations:
      • high annotation cost, not scalable to a new language, a new domain/genre.
      • May not fit domain-specific, dynamic, emerging applications (such as scientific domains, query logs, or social media)
  • Thus trying to perform unsupervised phrase mining and topic modeling instead.
  • Many studies of unsupervised phrase mining are linked to topic modeling.
  • Each topic is represented by a word distribution.
  • Documents are represented by multiple topics in different proportions.
  • ⇒ This approach does not require any prior annotations or labeling of the documents.
  • With topic modeling we rather rely on statistical topic modeling algorithms:
  • The most common algorithm: LDA (Latent Dirichlet Allocation) [cf. Blei, et al., 2003]
  • 3 strategies to perform phrase mining using topic modeling:
  • Strategy 1: simultaneously infer the topics and the phrase tokens (generate bag-of-words -> generate sequence of tokens)
  • Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
  • Strategy 3: Prior bag-of-words model inference, mine phrases and impose on the bag-of-words model.

Strategy 1: Simultaneously inferring Phrases and Topics

  • Bigram Topic Model [cf. Wallach'06]
    • Probabilistic generative model that conditions on previous word and topic when drawing next word.
  • Topical N-Grams (TNG) [Wang, et al.'07] (a generalization of Bigram Topic Model)
    • Probabilistic model that generates words in textual order
    • Create n-grams by concatenating successive bigrams
  • Phrase-Discovering LDA (PDLDA) [Lindsey, et al.'12]
    • Viewing each sentence as a time-series of words, PDLDA posits that the generative parameter (topic) changes periodically.
    • Each word is drawn based on previous m words (context and current phrase topic).
    • Limitations:
      • High model complexity: tends to overfit
      • high inference cost: slow

Strategy 2: Post Topic-Modeling Phrase Construction: TurboTopics (I)

  • TurboTopics [Blei & Lafferty'09]: Phrase construction as a post-processing step to Latent Dirichlet Allocation.
    1. Perform LDA on corpus to assign each token a topic label.
    2. Merge adjacent unigrams with the same topic label by a distribution-free permutation test on arbitrary-length back-off model.
    3. End recursive merging when all significant adjacent unigrams have been merged.

Strategy 2: Post Topic-Modeling Phrase Construction: KERT (II)

  • KERT [Danilevsky et al.'14]: Phrase construction as a post-processing step to LDA
    1. Run bag-of-words model inference and assign topic label to each token
    2. Perform frequent pattern mining to extract candidate phrases within each topic
    3. Perform phrase ranking based on four different criteria:
      • Popularity: e.g. “information retrieval” vs. “cross-language information retrieval”
      • Concordance: eg. “powerful tea” vs. “strong tea”
      • Informativeness: eg. “this paper” (frequent but not informative)
      • Completeness: “vector machine” vs “support vector machine”

Strategy 3: First Phrase Mining then Topic Modeling

  • Why in this order ?
  • If we have for instance “Knowledge discovery using least square support vector machine classifiers…” (with tokens assigned to different topics, shown in blue and red in the lecture), we won't be able to generate the proper phrases.
  • With phrase mining first we will first group the phrase elements: “[Knowledge discovery] using [least square] [support vector machine] [classifiers]…”, then apply the topics on the groups.
  • Techniques for this strategy:
    • First: phrase mining ⇒ document segmentation and then phrase ranking,
    • Then Topic model inference with phrase constraint.
  • Phrase mining stage:
    1. Frequent contiguous pattern mining: Extract candidate phrases and their counts.
    2. Perform agglomerative merging of adjacent unigrams as guided by a significance score.
    3. Perform document segmentation to count phrase occurrences (and calculate the rectified (i.e. true) phrase frequencies)
    4. Do phrase ranking (using the criteria proposed in KERT): popularity, concordance, informativeness, completeness.
    5. Finally do phrase-based topic modeling:
      • the mined bag-of-phrases is passed as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same latent topic.

⇒ This method essentially performs collocation mining:

  • Collocation: a sequence of words that occur more frequently than expected
    • Often interesting: relay information not portrayed by their constituent terms.
  • Many different measures used to extract collocations from a corpus [cf. Dunning 93, Pederson 96], for instance: mutual information, t-test, z-test, chi-squared test, likelihood ratio
    • Pointwise Mutual Information: \(PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)}\)
  • Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm.
  • ⇒ ToPMine is efficient and generates high-quality topics and phrases without any training data.
  • A small set of training data may enhance the quality of phrase mining (cf. J. Liu et al., Mining Quality Phrases from Massive Text Corpora, In SIGMOD'15)
  • SegPhrase+: The Overall Framework
  • ClassPhrase: Frequent pattern mining, feature extraction, classification.
  • SegPhrase: Phrasal segmentation and phrase quality estimation.
  • SegPhrase+: One more round to enhance mined phrase quality.

Pattern Mining and Feature Extraction

  • Pattern Mining for Candidate Set:
    • Build a candidate phrases set by frequent pattern mining
    • Mining frequent k-grams (k is typically small, eg. 6 in experiments)
    • Popularity measured by raw frequent words and phrases mined from the corpus.
  • Feature Extraction, based on Concordance
    • Partition a phrase into two parts to check whether the co-occurrence is significantly higher than pure random.
  • Feature Extraction, based on Informativeness
    • Quality phrases typically start and end with a non-stopword
    • Use average IDF over words in the phrase to measure the semantics
    • Usually, the probabilities of a quality phrase in quotes, brackets, or connected by dash should be higher (punctuations information) (eg. “state-of-the-art”)
  • Classification:
    • Using a tiny training set (about 300 labels for 1GB of text; can also use phrases extracted from a knowledge base)
    • Label: indicating whether a phrase is a high quality one or not: “support vector machine”: 1, “the experiment shows”: 0
    • Construct models to distinguish quality phrases from poor ones
    • Use Random Forest algorithm to bootstrap different datasets with limited labels.
  • Phrasal segmentation: can tell which phrase is more appropriate
    • Partition a sequence of words by maximizing the likelihood
    • Consider length penalty and filter out phrases with low rectified frequency.
  • So the process is repeated twice:
    • Classification --> Phrasal segmentation (⇒ SegPhrase)
    • --> Classification --> Phrasal segmentation (⇒ SegPhrase+)
  • S. Bergsma, E. Pitler, D. Lin, Creating robust supervised classifiers via web-scale n-gram data, ACL’2010
  • D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
  • D.M. Blei, A. Y. Ng, M. I. Jordan, J. D. Lafferty, Latent Dirichlet allocation. JMLR 2003
  • K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
  • M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM’14
  • A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
  • R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL’12.
  • J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
  • A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB’10
  • X. Wang, A. McCallum, X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. ICDM’07
  • Stream data mining; mining frequent patterns in data streams; the lossy counting algorithm
  • Software bug mining; pattern discovery for software bug mining; mining copy-and-paste bugs
  • Image analysis; pattern mining for image analysis; visual items; visual patterns
  • Three categories on privacy issues in data mining: input privacy, output privacy and owner privacy, k-anonymity and differential privacy; data perturbation for privacy-preserving pattern mining
  • Invisible data mining; new application exploration
  • Data streams: continuous, ordered, changing, fast, huge volume
  • Random access is expensive: we need single-scan algorithms
  • Most streams are low-level and multi-dimensional
  • Stream Data Processing: Stream Data Management System (SDMS)
  • Stream mining vs Stream query
    • For mining, we want to see the big picture (eg. less precision)
  • Mining tasks
    • Pattern maining in stream,
    • clustering,
    • etc, (much more)
  • Mining precise frequent patterns is unrealistic
  • So we look for approximate answers
  • How to mine frequent patterns with good approximation ?
  • Lossy Counting Algorithm [Manku & Motwani, VLDB'02]
  • ⇒ discard items with very low support count
  • Given support threshold \(\sigma\), error threshold \(\epsilon\) and stream length N
  • Output: items with frequency counts exceeding \((\sigma - \epsilon) N\)
  • We undercount by: Frequency count error \(\le \epsilon N\)
  • Approximation properties:
  • No false negatives,
  • False positives have true frequency count at least \((\sigma - \epsilon) N\)
  • Frequency count underestimated by at most \(\epsilon N\)
  • Other issues:
    • Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)
    • Mining approximate frequent k-itemsets in data streams
    • Mining sequential patterns in data streams
  • Recommended readings:
    • G. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams”, VLDB’02
    • A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, ICDT'05
  • Finding bugs is challenging: often no clear specifications or properties, requires substantial human efforts in analyzing data.
  • Software reliability analysis:
    • Static bug detection: check the code
    • Dynamic bug detection or testing: run the code
    • Debugging: Given symptoms or failure, pinpoint the bug locations in the code.
  • Why pattern mining ?
  • Code or running sequences contain hidden patterns.
  • Common patterns -> likely specification or property
  • Violations (anomalies comparing to patterns) -> likely bugs
  • Mining patterns to narrow down the scope of inspection:
    • ⇒ code locations or predicates that happen more in failing runs but less in passing runs are suspicious bug locations.
  • Mining rules from source code
    • Bugs as deviant behavior (eg. by statistical analysis)
    • Mining programming rules (eg. by frequent itemset mining)
    • Mining function precedence protocols (eg. by frequent subsequence mining)
    • Revealing neglected conditions (eg. by frequent itemset/subgraph mining)
  • Mining rules from revision histories: by frequent itemset mining
  • Mining copy-paste patterns from source code
    • Find copy-paste bugs: CP-Miner (cf. Z. Li, S. Lu, S. Myagmar, Y. Zhou, “CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code”, OSDI'04)

Mining copy-and-paste bugs

  • Copy-pasting is common:
    • 12% in Linux (kernel ?) code
    • 19% in X Window (server ?) code
  • Copy-pasting is error-prone:
    • Can mine “forget-to-change” bugs by sequential pattern mining.
  • Building sequence database from source code:
  • Tokenize each component:
  • Different operators, constants, key words -> different tokens
  • Same type of identifiers -> same token
  • for instance: “old = 3;” -> “5 61 20”; and “new = 3;” -> “5 61 20”
  • Then compute hash: “5 61 20” -> 16
  • So for a list of statements we produce a list of hashes: 65, 16, 16, 71, …
  • Detecting the “Forget-to-change” bugs:
  • Modification to the sequence pattern mining algorithm:
  • Constrain the max gap: (16, 16, 71) ….. (16, 16,10,71)
  • Then combine neighboring copy-pasted segments repeatedly,
  • Then find conflicts: e.g. identify names that cannot be mapped to any correspondence, and introduce an unchanged ratio
  • If 0 < unchanged ratio < threshold, then report the location as a potential bug.
  • CP-Miner reported many C-P bugs in Linux, Apache, etc (eg. analyzing millions of LOC).
  • An image can be characterized by visual primitives (eg. interest points)
  • Each visual primitive can be described by visual feature (eg. high-dimensional feature vector)
  • Each image is a collection of visual primitives

Visual Patterns Discovery

  • Visual primitives can be clustered into visual “items” (similar visual primitives belong to the same item)
  • Each visual primitive finds its k-nearest neighbor in the image to form a “visual transaction”
  • Then mining “frequent itemsets” leads to semantically meaningful visual patterns.
  • So the process is:
    1. images -> feature extraction -> Visual primitives
    2. Visual primitives -> feature quantization -> Induced visual transactions
    3. Induced Visual transactions -> pattern discovery -> Visual patterns
  • Challenges
    • Images are spatial data
      • Spatial configuration among the visual items matters
      • Induced transactions may overlap with each other ⇒ over counting problem.
    • Uncertainties of visual items
      • Noisy clustering of visual primitives into visual items affects visual pattern
      • Visual synonym and polysemy
  • Hongxing Wang, Gangqiang Zhao, Junsong Yuan, Visual pattern discovery in image and video data: a brief survey, Wiley Interdisciplinary Review: Data Mining and Knowledge Discovery 4(1): 24-37 (2014)
  • Hongxing Wang, Junsong Yuan, Ying Wu, Context-Aware Discovery of Visual Co-Occurrence Patterns. IEEE Transactions on Image Processing 23(4): 1805-1819 (2014)
  • Gangqiang Zhao, Junsong Yuan, Discovering Thematic Patterns in Videos via Cohesive Sub-graph Mining. ICDM 2011: 1260-1265
  • Junsong Yuan, Ying Wu, Ming Yang, From frequent itemsets to semantically meaningful visual patterns. KDD 2007: 864-873
  • Potential adverse side-effects of data mining
    • Privacy and accuracy are typically contradictory in nature.
    • Improving one often incurs a cost on the other.
  • 3 categories on privacy issues:
    • Input privacy: distort or hide data
    • Output privacy (eg. knowledge hiding)
    • Owner privacy: hide owner/source information from data.
  • Ensuring Input Privacy:
    • Approach 1: B2B (business-to-business) environment.
    • Approach 2: B2C (business-to-customer) environment.
  • K-anonymity privacy requirement: each equivalence class contains at least k records.
  • Data perturbation
    • Statistical distortion: using randomization algorithms
    • MASK [Rizvi & Haritsa VLDB'02]
      • Flip each 0/1 bit with probability p (this may increase a lot of items)
      • Tune p carefully to achieve acceptable average privacy and good accuracy.
    • Cut-and-paste (C&P) operator [Evfimievski et al. KDD'02]
      • Uniform randomization: Each existing item in the real transaction is, with probability p, replaced with a new item not present in the original transaction.
  • R. Agrawal and R. Srikant, Privacy-preserving data mining, SIGMOD'00
  • C. C. Aggarwal and P. S. Yu, Privacy-Preserving Data Mining: Models and Algorithms, Springer, 2008
  • C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. 2014
  • A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In KDD'02
  • A. Gkoulalas-Divanis, J. Haritsa and M. Kantarcioglu, Privacy in Association Rule Mining, in C. Aggarwal and J. Han (eds.), Frequent Pattern Mining, Springer, 2014 (Chapter 15)
  • N. Li, T. Li, S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. ICDE'07
  • A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, TKDD 2007
  • S. Rizvi and J. Haritsa. Maintaining data privacy in association rule mining. VLDB’02
  • J. Vaidya, C. W. Clifton and Y. M. Zhu, Privacy Preserving Data Mining, Springer, 2010
  • Lots of research issues on pattern discovery are still waiting to be solved.
  • Invisible Pattern Mining can/could be used in many fields.