Abstract

Although frequent sequential pattern mining plays an important role in many data mining tasks, it often generates a large number of sequential patterns, which reduces its effectiveness and efficiency. For many applications, mining all the frequent sequential patterns is not necessary, and mining frequent Max or Closed sequential patterns provides the same amount of information. Compared with frequent sequential pattern mining, frequent Max or Closed sequential pattern mining generates fewer patterns, and so improves the efficiency and effectiveness of these tasks.

This thesis first gives a formal definition of the frequent Max and Closed sequential pattern mining problems, and then proposes two efficient programs, MaxSequence and ClosedSequence, to solve them. Finally, it compares the results and performance of these programs with two brute force programs designed to solve the same problems.

CHAPTER 1

INTRODUCTION

1. INTRODUCTION

Association rule and frequent pattern mining was proposed by Agrawal, Imielinski, and Swami in [7]. Different techniques and algorithms have been proposed for solving this problem [12, 13, 14, 15]. Frequent pattern mining can produce a large number of frequent patterns. Different studies have been done to push constraints into the mining process, so that only the interesting patterns are generated. Pei and Han in [16] show which constraints can be pushed into the mining process to improve its efficiency. In [17] Wang, He, and Han propose a method to add support constraints for itemsets. Garofalakis in [18] shows how constraints can be integrated into the sequential pattern mining process using regular expressions.

An alternative to frequent pattern mining is frequent Closed pattern mining, which produces fewer patterns and has been shown to be as useful as frequent pattern mining for association rule mining [8]. Zaki and Hsiao in [8] proposed an algorithm called CHARM for frequent Closed pattern mining. Pei, Han, and Mao later in [9] introduced CLOSET, an efficient algorithm for mining frequent closed patterns that outperformed CHARM.

Frequent sequence mining can also produce an undesirably large number of sequences. Two alternatives to frequent sequence mining are frequent Max and Closed sequence mining. In this thesis I first give a formal definition of these two problems, and then I introduce two efficient programs to solve them.

1.1. Motivation

Since its introduction, sequential pattern mining [1] has become an essential data mining task, and it has been used in a broad range of applications, including the analysis of customer purchase behavior, disease treatments, Web access patterns, DNA sequences, and many more. The problem is to find all sequential patterns with support greater than or equal to a predefined minimum support threshold in a data sequence database. The support of a sequential pattern is the number, or percentage, of data sequences in the database that contain that pattern. Different techniques and algorithms have been proposed to improve the efficiency of this task [1, 2, 4, 5, 6, 11].

The sequential patterns generated by a sequential pattern mining program can be used as the input of another program to do a specific data mining task; for example, frequent patterns can be used to generate association rules. It has been recognized that as the minimum support decreases, the number of frequent sequential patterns can grow rapidly. This large number of frequent sequential patterns can reduce the effectiveness and efficiency of the mining task. The efficiency is reduced because the large number of patterns generated in the first stage needs to be processed in the later stages of the mining task. The effectiveness is also reduced, because users have to go through a large number of elements in the result set to find useful information.

We define a sequence M as a frequent Max sequence if M is frequent and there does not exist any frequent sequence that is a proper supersequence of M. A sequence C is a frequent Closed sequence if C is frequent and there does not exist any frequent sequence that is a proper supersequence of C and has the same support as C (more formal definitions are given in the following chapter). For many mining tasks the set of Max or Closed sequences (with fewer elements than the complete set of frequent sequences) can provide the same amount of information as the complete set of frequent sequences. In these tasks, using the Max or Closed sequences can increase the efficiency by decreasing the number of sequences that need to be processed. This can also improve the effectiveness by reducing the redundancy in the final results.

In this thesis I introduce two efficient programs to mine Max and Closed frequent sequences in a sequence database. These programs are based on Prefix-Projected Pattern Growth, PrefixSpan [4]. They generate some candidate frequent sequences, that is, sequences that have the potential to be Max or Closed. These candidate frequent sequences can be seen as local Max or Closed sequences. A set of the Max or Closed frequent sequences found up to this point, the result set, is kept, and every new candidate is compared to this set. If the new candidate is a Max, or Closed, subsequence of a sequence in the result set, then we ignore the candidate; otherwise we remove all the sequences in the result set that are Max, or Closed, subsequences of the candidate, and add the candidate to the result set. This way the result set will hold the global Max or Closed frequent sequences after all the candidates are checked. As described, the whole process can be broken down into two major steps: generating candidate sequences, and checking the candidates against the current result set.

Generating the frequent sequences is done efficiently using PrefixSpan. PrefixSpan is one of the recent algorithms for mining frequent sequential patterns, and it outperforms previous sequential pattern mining algorithms, particularly when mining long sequential patterns. Mining and using frequent Max or Closed sequential patterns becomes more important and effective when there are many long frequent sequences. Since checking the candidates against the result set is a time consuming process, generating as few candidates as possible, and also using methods to speed up the checking process, can improve the performance of the programs significantly.

1.2. Thesis Organization

The rest of the thesis is organized as follows. In addition to the problem definition, chapter 2 also contains some information about different sequential pattern mining methods. Chapter 3 discusses the general problem of Max and Closed sequence mining. In chapter 4 brute force algorithms for mining Max and Closed sequential patterns are introduced. I introduce two efficient programs, MaxSequence and ClosedSequence, for mining frequent Max and Closed sequences in chapter 5. The experimental and performance results are presented in chapter 6. Chapter 7 covers the conclusion and some ideas for future study.

CHAPTER 2

PROBLEM DESCRIPTION AND PREFIXSPAN [4]

2. Problem Description and PrefixSpan [4]

In this chapter, I first define the problem of frequent Max and Closed sequential pattern mining, and then in section 2.2 I give some background information about sequential pattern mining, and a more detailed description of PrefixSpan.

2.1. Problem Description

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of literals, called items. Any non-empty subset of $I$ is called an itemset. A sequence is an ordered list of itemsets. A sequence $s$ is denoted by $\langle s_1 s_2 \ldots s_l \rangle$, where $s_i$ is an itemset, i.e., $s_i \subseteq I$ for $1 \le i \le l$. $s_i$ is also called an element of the sequence, and is denoted as $\{x_1 x_2 \ldots x_m\}$, where $x_j$ is an item, i.e., $x_j \in I$ for $1 \le j \le m$. For simplicity, the brackets are omitted for elements with only one item. Since an element is a set, and the order of its items is not significant, we assume the items in an element are ordered alphabetically. The length of a sequence is the number of instances of items in that sequence. A sequence with length $l$ is called an $l$-sequence.

A sequence $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ is called a subsequence of the sequence $\beta = \langle b_1 b_2 \ldots b_m \rangle$, denoted $\alpha \sqsubseteq \beta$, if there exist integers $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$. The sequence $\beta$ is also called a supersequence of $\alpha$.

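A set of tuples $\langle sid, s \rangle$, where $sid$ is a sequence identifier and $s$ is a sequence, is called a sequence database. A tuple $\langle sid, s \rangle$ is said to contain a sequence $\alpha$ if $\alpha$ is a subsequence of $s$, i.e., $\alpha \sqsubseteq s$. The support of a sequence $\alpha$ in a sequence database $S$ is the number of tuples in $S$ containing $\alpha$, i.e.,

$$\mathrm{support}_S(\alpha) = |\{\langle sid, s \rangle \mid \langle sid, s \rangle \in S \wedge \alpha \sqsubseteq s\}|.$$

A sequence $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ with support $m$ is also written as $\langle a_1 a_2 \ldots a_n \rangle : m$. A sequence $\alpha$ is called a frequent sequential pattern in sequence database $S$ if the number of tuples in $S$ that contain $\alpha$ is greater than or equal to a given positive integer $\xi$, called the support threshold or minimum support, i.e., $\mathrm{support}_S(\alpha) \ge \xi$.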
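The containment test and the support count defined above can be sketched in a few lines of Python. This is only an illustration: the representation (a sequence as a list of `frozenset` itemsets), the helper `seq`, and the toy database are assumptions made for this sketch, not part of the thesis programs.

```python
from typing import FrozenSet, List, Tuple

Sequence = List[FrozenSet[str]]  # a sequence is an ordered list of itemsets


def seq(*elements: str) -> Sequence:
    """Hypothetical helper: seq("a", "bc") builds the sequence <a {b c}>."""
    return [frozenset(e) for e in elements]


def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """True if the elements of alpha are subsets of distinct elements of beta, in order."""
    j = 0
    for element in alpha:
        # Greedily match against the earliest remaining element of beta.
        while j < len(beta) and not element <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def support(alpha: Sequence, database: List[Tuple[int, Sequence]]) -> int:
    """Number of tuples <sid, s> in the database whose sequence s contains alpha."""
    return sum(1 for _sid, s in database if is_subsequence(alpha, s))


# A made-up sequence database with three tuples.
database = [
    (1, seq("a", "bc", "d")),
    (2, seq("ab", "c")),
    (3, seq("a", "d")),
]

print(support(seq("a", "c"), database))  # 2
print(support(seq("ab"), database))      # 1
```

With a support threshold of 2, the sequence <a c> would be frequent in this toy database, while <{a b}> would not.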

A sequence $\alpha$ is called a frequent Max sequential pattern in sequence database $S$ if $\alpha$ is a frequent sequential pattern in $S$, and there exists no frequent sequential pattern $\beta$ in $S$ such that $\beta$ is a proper supersequence of $\alpha$. A sequence $\alpha$ is called a frequent Closed sequential pattern in sequence database $S$ if $\alpha$ is a frequent sequential pattern in $S$, and there exists no frequent sequential pattern $\beta$ in $S$ such that (1) $\beta$ is a proper supersequence of $\alpha$, and (2) every tuple containing $\alpha$ also contains $\beta$, i.e.,

$$\mathrm{support}_S(\alpha) = \mathrm{support}_S(\beta).$$

The problem of sequential pattern mining is to find the complete set of frequent sequential patterns in a sequence database, for a given minimum support threshold. Given a sequence database $S$ and a minimum support threshold $\xi$, the problem of sequential Max, or Closed, pattern mining is to find the complete set of frequent Max, or Closed, sequential patterns in $S$. For any given sequence database and support threshold, the complete set of frequent sequences $F$ and the complete sets of frequent Max and Closed sequences, $M$ and $C$, satisfy the following relation:

$$M \subseteq C \subseteq F.$$
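To make the Max and Closed definitions concrete, the following sketch enumerates all frequent sequences of a tiny database by brute force and then filters them. It assumes sequences whose elements are single items (represented as tuples); the function names and the example database are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def frequent_sequences(database, min_support):
    """Brute force: map every frequent subsequence (a tuple of items) to its support."""
    counts = {}
    for s in database:
        seen = set()
        for k in range(1, len(s) + 1):
            for idx in combinations(range(len(s)), k):
                seen.add(tuple(s[i] for i in idx))
        for sub in seen:  # each data sequence counts at most once per pattern
            counts[sub] = counts.get(sub, 0) + 1
    return {sub: c for sub, c in counts.items() if c >= min_support}


def is_proper_subsequence(a, b):
    """True if a is a subsequence of b and strictly shorter than b."""
    if len(a) >= len(b):
        return False
    it = iter(b)
    return all(x in it for x in a)


def max_and_closed(frequent):
    """Split out the Max and the Closed patterns from a {pattern: support} map."""
    max_set, closed_set = set(), set()
    for a, sup_a in frequent.items():
        supers = [b for b in frequent if is_proper_subsequence(a, b)]
        if not supers:
            max_set.add(a)
        if not any(frequent[b] == sup_a for b in supers):
            closed_set.add(a)
    return max_set, closed_set


database = [("a", "b", "c", "d"), ("a", "b")]
frequent = frequent_sequences(database, min_support=1)
max_set, closed_set = max_and_closed(frequent)

print(len(frequent))       # 15
print(sorted(max_set))     # [('a', 'b', 'c', 'd')]
print(sorted(closed_set))  # [('a', 'b'), ('a', 'b', 'c', 'd')]
```

In this example every Max pattern is also Closed and every Closed pattern is frequent, which is the containment relation between the three sets.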

Given two sequences $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ and $\beta = \langle b_1 b_2 \ldots b_m \rangle$, where $m \le n$, $\beta$ is called a prefix of $\alpha$ if and only if:

1. $a_i = b_i$ for $i \le m - 1$;

2. $b_m \subseteq a_m$;

3. All the items in $(a_m - b_m)$ are alphabetically after those in $b_m$.
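The three prefix conditions can be checked mechanically. The sketch below is a minimal illustration, assuming items are single characters and a sequence is a list of Python sets; treating the empty sequence as "not a prefix" is an assumption of this sketch, since the definition above does not cover that case.

```python
def is_prefix(beta, alpha):
    """Check whether beta is a prefix of alpha (both are lists of sets of items)."""
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False  # assumption: the empty sequence is not treated as a prefix
    # Condition 1: the first m-1 elements are identical.
    if any(set(beta[i]) != set(alpha[i]) for i in range(m - 1)):
        return False
    last_b, last_a = set(beta[-1]), set(alpha[m - 1])
    # Condition 2: the last element of beta is contained in the m-th element of alpha.
    if not last_b <= last_a:
        return False
    # Condition 3: the remaining items of that element come alphabetically later.
    return all(item > max(last_b) for item in last_a - last_b)


def seq(*elements):
    """Hypothetical helper: seq("a", "abc") builds <a {a b c}>."""
    return [set(e) for e in elements]


alpha = seq("a", "abc", "ac", "d", "cf")

print(is_prefix(seq("a", "ab"), alpha))  # True
print(is_prefix(seq("a", "b"), alpha))   # False: item a of {a b c} is not after b
```

Condition 3 is what rules out <a b> here: the second element of the longer sequence still holds the item a, which sorts before b.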

Given a sequence $\alpha$ and a subsequence of it, $\beta$, another subsequence of $\alpha$, $\alpha'$, is a projection of $\alpha$ with respect to prefix $\beta$ if and only if:

1. $\alpha'$ has prefix $\beta$;

2. No proper supersequence $\alpha''$ of $\alpha'$ exists such that $\alpha''$ is a subsequence of $\alpha$ and also has prefix $\beta$.

2.2. Sequential Pattern Mining

Sequential pattern mining is an important data mining task, and different algorithms have been proposed to perform this task efficiently. The problem is to find all sequential patterns with support greater than or equal to a predefined minimum support threshold in a data sequence database. In this section we discuss some of the algorithms proposed for this task. Since prefix projection is the main idea behind the MaxSequence and ClosedSequence programs, we discuss PrefixSpan in more detail.

2.2.1. Apriori Based Algorithms

Agrawal and Srikant in [1] introduced the sequential pattern mining problem, and three algorithms to solve it. Among these algorithms AprioriAll was the only one to mine the complete set of frequent sequential patterns. Later in [2] they proposed the GSP (Generalized Sequential Patterns) algorithm for solving this problem. GSP outperforms AprioriAll by up to 20 times. Both of these algorithms are based on the Apriori heuristic.

The Apriori heuristic was proposed in [7] for association rule mining, and states that the sub-patterns of a frequent pattern must also be frequent. Using this heuristic, AprioriAll and GSP narrow down the search space for frequent sequential patterns drastically. The mining process starts by scanning the database and finding all the frequent items (length-1 frequent sequences).

The mining process then continues, following the Apriori heuristic, as follows. Having all the length-$l$ frequent sequences, these algorithms use the Apriori heuristic to generate all the possible length-$(l+1)$ frequent sequences, and count the actual support of these sequences by scanning the database. Having the length-$(l+1)$ frequent sequences, this process can be repeated to get the length-$(l+2)$ frequent sequences.

Mining length-$l$ frequent sequences with these Apriori based algorithms requires $l$ scans of the database. This can be very expensive when mining long frequent sequential patterns.

2.2.2. SPADE

Zaki in [6] proposed another approach for mining frequent sequential patterns, called SPADE (Sequential PAttern Discovery using Equivalence classes). This method uses a vertical database format, and mines the frequent sequential patterns using lattice search techniques and join operations. In this method a vertical id-list is created for each item. Each list contains the sequence identifiers of the sequences in which the item appears, and the corresponding time stamps. By performing temporal joins on these id-lists, all of the frequent sequential patterns can be enumerated. The search space is reduced by decomposing the original search space into smaller subspaces using a lattice-theoretic approach.

Unlike Apriori based algorithms, this method does not scan the database once per pattern length; it can mine all the frequent sequences in three database scans.

2.2.3. FreeSpan

Han et al. introduced FreeSpan (Frequent pattern-projected Sequential pattern mining) in [5]. Using frequent items, FreeSpan projects the sequence database into projected databases. Each projected database is then recursively projected further. The sizes of the projected databases often decrease rapidly, and these smaller databases are easier to work with. This method is significantly faster than Apriori based methods. The problem with this method is that the same sequences can be duplicated in many projected databases. In a later work [4] Pei et al. introduced PrefixSpan. PrefixSpan not only eliminates the duplicated sequence problem, but also outperforms previous sequential pattern mining algorithms, especially when mining long sequential patterns.

Since mining and using frequent Max or Closed sequential patterns becomes more important and effective when many long frequent sequences are generated, PrefixSpan is a good choice for Max and Closed sequence mining. In the next section we study this method in more detail.

2.2.4. PrefixSpan

PrefixSpan [4] is a recently proposed method for mining frequent sequential patterns. Unlike Apriori-based algorithms such as GSP [2], which mine frequent sequential patterns by candidate generation, PrefixSpan uses prefix projection to mine the complete set of frequent sequential patterns. Here is a brief explanation of how PrefixSpan works. For more detailed information, I refer the reader to [4].

Given a sequence database $S$ and a minimum support of $\xi$, PrefixSpan performs the following steps to mine the frequent sequential patterns.

Step 1: Scan $S$ once, and find all the frequent items. These frequent items are the frequent sequential patterns of length one.

Step 2: According to the length-1 frequent sequences found in the first step, the complete set of frequent sequential patterns can be divided into different subsets, one subset for each length-1 frequent sequence.

Step 3: Each subset from the second step can be mined by constructing its corresponding postfix projected database and mining it recursively (the projected database of a frequent item e is the set of postfixes of e in the original database).

PrefixSpan uses recursion to mine the frequent sequences. First the frequent items in the database are found, and then for each frequent item a projected database is created (the prefix of the projected database is a frequent sequence). This process is repeated for each projected database until the projected database contains no frequent items, at which time the recursion for that branch ends, and execution returns to the calling procedure. Let us call this end of a recursion branch a backtrack.

Input: A sequence database $S$, and minimum support threshold $\xi$.

Output: The complete set of sequential patterns.

Method: Call PrefixSpan($\langle \rangle$, 0, $S$)

Subroutine PrefixSpan($\alpha$, $l$, $S|_\alpha$)

Parameters: $\alpha$: a sequence; $l$: the length of $\alpha$; $S|_\alpha$: the $\alpha$-projected database if $\alpha \ne \langle \rangle$, otherwise the sequence database $S$.

Method:

1. Scan $S|_\alpha$ once, find the set of frequent items $b$ such that:

(a) $b$ can be assembled to the last element of $\alpha$ to form a sequence; or

(b) $\langle b \rangle$ can be appended to $\alpha$ to form a sequence.

2. For each frequent item $b$, append it to $\alpha$ to form a sequence $\alpha'$, and output $\alpha'$;

3. For each $\alpha'$, construct the $\alpha'$-projected database $S|_{\alpha'}$, and call PrefixSpan($\alpha'$, $l + 1$, $S|_{\alpha'}$).

Algorithm 2.1 PrefixSpan, from [4]

Algorithm 2.1 is from [4], and shows how PrefixSpan works. In the following chapters I describe two programs for mining frequent Max and Closed sequential patterns. These programs use PrefixSpan as their basis to generate candidate frequent Max and Closed sequences.
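The recursive structure of Algorithm 2.1 can be illustrated with a simplified Python sketch. It is not the implementation from [4]: it assumes every element of a sequence is a single item, so only the "append" case of step 1 applies, and the function name and sample data are hypothetical.

```python
from collections import Counter


def prefixspan(database, min_support):
    """Return {pattern: support} for sequences whose elements are single items."""
    patterns = {}

    def mine(prefix, projected):
        # Step 1: for each item, count the postfixes in which it occurs.
        counts = Counter(item for postfix in projected for item in set(postfix))
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            # Step 2: grow the prefix by one frequent item and output it.
            new_prefix = prefix + (item,)
            patterns[new_prefix] = count
            # Step 3: keep the postfix after the first occurrence of the item,
            # then mine the new projected database recursively.
            new_projected = [p[p.index(item) + 1:] for p in projected if item in p]
            mine(new_prefix, new_projected)

    mine((), [tuple(s) for s in database])
    return patterns


database = [("a", "b", "c"), ("a", "c"), ("b", "c")]
for pattern, sup in sorted(prefixspan(database, min_support=2).items()):
    print(pattern, sup)
```

A recursion branch ends (a backtrack, in the terminology above) when the projected database contains no item that reaches the minimum support, for example the projected database of <a c> in this sample.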

CHAPTER 3

MINING FREQUENT MAX AND CLOSED SEQUENCES

3. Mining Frequent Max and Closed Sequences

As stated earlier, one way of mining frequent Max or Closed sequential patterns is to break the process down into the following two major tasks:

1. Generating candidate sequences. The set of candidate sequences should be a superset of the complete set of Max, or Closed, sequences. Clearly, the closer the number of sequences in the candidate set is to the number of sequences in the Max, or Closed, sequence set, the better the candidate set.

2. Checking the candidates, and keeping the Max, or Closed, sequences as the result set. As the candidates are generated, we need to check them against the previously generated candidates, and keep only the Max, or Closed, ones. For any given candidate sequence $\alpha$, we need to check whether there exists a sequence $\beta$ in the result set such that $\alpha$ is a Max, or Closed, subsequence of $\beta$ (that is, a subsequence of $\beta$ in the Max case, or a subsequence of $\beta$ with the same support in the Closed case). If so, we discard $\alpha$; otherwise we remove from the result set all the sequences that are Max, or Closed, subsequences of $\alpha$, and add $\alpha$ to the result set. This way we guarantee that after checking all of the candidates, the result set will contain the complete set of Max, or Closed, sequences.

The described Max, or Closed, sequence mining process is shown in algorithm 3.1.

In the following chapters, two methods for mining Max, or Closed, sequences are introduced. These methods are alike in that they do the mining following the two steps above, but they differ in how they perform these steps. The first method, called the Naïve method, is presented in chapter 4. In chapter 5 the more efficient programs MaxSequence and ClosedSequence are presented. In chapter 6 the results and performance of these methods are discussed.

Input: A sequence database, and minimum support threshold $\xi$.

Output: The complete set of Max, or Closed sequential patterns.

Method:

    ResultSet = {}
    aSequence = GetFirstCandidate( )
    While aSequence is not empty
        AddToResultSet( ResultSet, aSequence )
        aSequence = GetNextCandidate( )
    Output ResultSet.

Subroutine AddToResultSet( aSet, aSequence )

Parameters: aSet: a set of sequences; aSequence: a sequence.

Method:

    IsSuper = False
    For each sequence aSeq in aSet
        If IsSuper == False
            If aSeq is Max, or Closed, supersequence of aSequence
                Exit
        If aSequence is Max, or Closed, supersequence of aSeq
            Remove aSeq from aSet
            IsSuper = True
    Add aSequence to aSet.

Function GetFirstCandidate ( )

Returns the first candidate sequence (will be discussed for different methods).

Function GetNextCandidate ( )

Returns the next candidate sequence (will be discussed for different methods).

Algorithm 3.1 Mining Max, or Closed sequential patterns
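The candidate-checking subroutine can be sketched in Python as follows. This is an illustrative version only: it assumes single-item elements, stores the result set as a dictionary from sequence to support, and uses two passes over the result set instead of the single loop of the pseudocode; the names and the candidate list are hypothetical.

```python
def is_subsequence(a, b):
    """True if tuple a is a (not necessarily proper) subsequence of tuple b."""
    it = iter(b)
    return all(x in it for x in a)


def add_to_result_set(result, candidate, support, closed=False):
    """Update result ({sequence: support}) in place with one candidate."""

    def covers(big, big_sup, small, small_sup):
        # Max case: plain supersequence. Closed case: supersequence with equal support.
        return is_subsequence(small, big) and (not closed or big_sup == small_sup)

    # If some stored sequence already covers the candidate, ignore the candidate.
    if any(covers(s, sup, candidate, support) for s, sup in result.items()):
        return
    # Otherwise drop every stored sequence that the candidate covers, then add it.
    for s in [s for s, sup in result.items() if covers(candidate, support, s, sup)]:
        del result[s]
    result[candidate] = support


# Hypothetical candidates (sequence, support) in the order a miner might emit them.
candidates = [
    (("a",), 2), (("b",), 2), (("a", "b"), 2),
    (("a", "b", "c"), 1), (("a", "b", "c", "d"), 1), (("c",), 1),
]

max_result, closed_result = {}, {}
for sequence, sup in candidates:
    add_to_result_set(max_result, sequence, sup)
    add_to_result_set(closed_result, sequence, sup, closed=True)

print(max_result)     # {('a', 'b', 'c', 'd'): 1}
print(closed_result)  # {('a', 'b'): 2, ('a', 'b', 'c', 'd'): 1}
```

Because every stored sequence may be compared with every candidate, the cost of this step grows with the size of the result set, which is why the number of candidates matters.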

CHAPTER 4

NAÏVE APPROACH

4. Naïve Approach

Let us first discuss a simple, brute force approach for mining Max, or Closed, sequences. I call this method the naïve approach. This approach is built on the two step process described in the previous chapter, and performs these tasks as follows:

1. Since the complete set of Max, and Closed, sequences is a subset of the complete set of frequent sequences, we can use the complete set of frequent sequences as our candidate set. I used PrefixSpan for generating the frequent sequences in this process.

2. In this method we use a list structure as our result set. For any given candidate sequence $\alpha$, starting from the first sequence in the list, we check the sequences in the list to see whether there exists a sequence $\beta$ in the list such that $\alpha$ is a Max, or Closed, subsequence of $\beta$. As soon as we find a Max, or Closed, supersequence of $\alpha$ we stop checking sequences in the list, discard $\alpha$, and move to the next candidate. If there is no Max, or Closed, supersequence of $\alpha$ in the list, we scan the list for Max, or Closed, subsequences of $\alpha$ and remove them from the list, and finally we add $\alpha$ to the list. After repeating these steps for all the candidates, the list will hold the complete set of frequent Max, or Closed, sequences.

For example consider the sequence database S shown in table 4.1. For support threshold value of 0.5, this database contains thirteen frequent sequences.

Seq.ID Sequence

100 <a1 a2 a3 a4 a5 a6>

200 <a2 a4 a6>

300 <a1 a3 a5>

Table 4.1 A sequence database

These sequences are shown in the middle column of table 4.2 in the order they are generated by PrefixSpan. In this example we take the set of all frequent sequences as our candidate set for Max sequence mining. The right column of each row of table 4.2 shows the result set before the generated sequence is checked against it. At the beginning, row 1, the result set is empty.

After <a1> is generated, it is checked against the result set; since it is not a subsequence of any sequence in the result set, it is added to the result set. This is also the case for the next four generated sequences, <a3> to <a6>. After adding <a6> the result set will contain the elements shown in row 6. At this stage the sequence <a1 a3> is generated, and it is checked against the result set. A supersequence of <a1 a3> does not exist in the result set, so we check for its subsequences in the set; <a2> and <a4> are subsequences of <a2 a4>, so they are removed from the result set, and <a1 a3> is added to the set. This continues until the last frequent sequence, <a5 a6>, is generated. This sequence is checked against the result set, and since the sequence <a3 a4 a6> in the result set is a supersequence of <a5 a6>, it is not added to the result set. After this, since there are no more frequent sequences, the result set holds the complete set of Max sequences.

Table 4.2 An example of Max sequence generation

The algorithm for the naïve approach is shown in algorithm 4.1. The complete set of frequent sequences is the upper bound for the candidate set, and since the checks performed for each candidate in the second step are very time consuming, an efficient mining method must be able to generate fewer candidates than the complete set of frequent sequences. Also, using a list structure in the second step is not efficient, since for each candidate we have to consider all the sequences in the list, and as the mining process continues the number of sequences in the list can grow very large.

In the following chapter we will see how we can improve the performance of Max, and Closed, sequence mining by generating fewer candidates, and also by improving the performance of the second, checking, step of the process. Later, in the results chapter, the results and performance of the newly developed programs will be compared with the results and performance of the naïve approach.

Input: A sequence database $S$, and minimum support threshold $\xi$.

Output: The complete set of Max, or Closed sequential patterns.

Method:

    ResultSet = {}
    Call NaïveApproach( < >, 0, S, ResultSet )
    Output ResultSet.

Subroutine NaïveApproach( $\alpha$, $l$, $S|_\alpha$, ResultSet )

Parameters: $\alpha$: a sequence; $l$: the length of $\alpha$; $S|_\alpha$: the $\alpha$-projected database if $\alpha \ne \langle \rangle$, otherwise the sequence database $S$; ResultSet: a set of sequences.

Method:

1. Scan $S|_\alpha$ once, find the set of frequent items $b$ such that:

(a) $b$ can be assembled to the last element of $\alpha$ to form a sequence; or

(b) $\langle b \rangle$ can be appended to $\alpha$ to form a sequence.

2. For each frequent item $b$, append it to $\alpha$ to form a sequence $\alpha'$, and call AddToResultSet( ResultSet, $\alpha'$ );

3. For each $\alpha'$, construct the $\alpha'$-projected database $S|_{\alpha'}$, and call NaïveApproach( $\alpha'$, $l + 1$, $S|_{\alpha'}$, ResultSet ).

Subroutine AddToResultSet( aSet, aSequence )

Parameters: aSet: a set of sequences; aSequence: a sequence.

Method:

    IsSuper = False
    For each sequence aSeq in aSet
        If IsSuper == False
            If aSeq is Max, or Closed, supersequence of aSequence
                Exit
        If aSequence is Max, or Closed, supersequence of aSeq
            Remove aSeq from aSet
            IsSuper = True
    Add aSequence to aSet.

Algorithm 4.1 Naïve Approach for Max, or Closed, sequence mining

CHAPTER 5

MAXSEQUENCE, AND CLOSEDSEQUENCE

5. MaxSequence and ClosedSequence

In this chapter I propose two programs, MaxSequence and ClosedSequence, for efficient mining of Max and Closed sequences. These programs follow the outline described in chapter 3 (the outline used by the naïve approach), but they perform the steps involved in more efficient ways. In section 5.1 I describe how these programs manage to do the mining while generating fewer candidate sequences. How these programs do the candidate-checking step is discussed in section 5.2. Section 5.3 describes an additional optimization technique for Max sequence mining called String Removal.

5.1. Generating Candidates

One way to improve the performance of Max, or Closed, sequence mining is to reduce the number of candidates generated during the mining process. In the ideal case the cardinality of the candidate set is equal to the cardinality of the result set. This means that every candidate generated is a Max, or Closed, sequence. So the lower bound for the cardinality of the candidate set is the cardinality of the result set.

On the other hand, for any given sequence database and support threshold, the complete set of frequent sequences $F$ and the complete sets of frequent Max and Closed sequences, $\mu$ and $C$, satisfy the relation $\mu \subseteq C \subseteq F$. From this we can see that $F$ can be used as a candidate set for mining $\mu$ and $C$. In fact the cardinality of $F$ is the upper bound for the cardinality of the candidate set. In many cases the cardinality of $F$ can be much larger than the cardinality of $\mu$ or $C$, and this suggests that choosing $F$ as the candidate set is not a very good choice. For example, as an extreme case consider the sequence database S given in Table 5.1, and a minimum support of 1 (i.e., every occurrence is frequent).

Seq.ID Sequence

100 < a1 a2 … a100>

200 < a1 a2 … a50>

Table 5.1 An extreme sequence database

For the given minimum support this sequence database has $2^{100} - 1 \approx 10^{30}$ frequent sequences. These sequences are <a1>, …, <a100>, <a1 a2>, …, <a99 a100>, …, <a1 a2 … a100>. For the same minimum support S has only one Max sequence, <a1 a2 … a100>, and two Closed sequences, <a1 a2 … a50> and <a1 a2 … a100>. This example shows that if we choose $F$ as our candidate set we have to consider around $10^{30}$ sequences in order to find the Max, or Closed, sequences.
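The count in this example can be checked with a short calculation. The sketch below evaluates the exact number for 100 items and, as a sanity check, enumerates the frequent sequences of a scaled-down version of Table 5.1 with 10 items instead of 100 (the scaled-down database is an assumption made so that brute force enumeration stays feasible).

```python
from itertools import combinations


def distinct_subsequences(sequence):
    """All non-empty subsequences of a sequence of distinct single items."""
    return {
        tuple(sequence[i] for i in idx)
        for k in range(1, len(sequence) + 1)
        for idx in combinations(range(len(sequence)), k)
    }


# Scaled-down Table 5.1: <a1 ... a10> and <a1 ... a5>, minimum support 1.
n = 10
long_seq = [f"a{i}" for i in range(1, n + 1)]
short_seq = long_seq[: n // 2]

# With minimum support 1, every subsequence of any data sequence is frequent.
frequent = distinct_subsequences(long_seq) | distinct_subsequences(short_seq)
print(len(frequent), 2**n - 1)  # 1023 1023

# Exact count for the full example with 100 items: a 31-digit number, about 1.27e30.
full_size = 2**100 - 1
print(full_size)
```

The short sequence adds nothing to the count, because all of its subsequences are already subsequences of the long one.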

MaxSequence and ClosedSequence use prefix projection to mine frequent sequences in a sequence database. Considering the properties of Max and Closed sequences, and how prefix projection works, these programs mine only a subset of the frequent sequences, and later they use only a subset of the mined sequences as their candidate sets. It can be proved that these candidate sets are supersets of the complete set of frequent Max, and Closed, sequences. In the next two sections we will see how these programs use smaller subsets of $F$ as their candidate sets. In 5.1.1 a method called Common Prefix Detection is introduced. Using this method in prefix projection, MaxSequence and ClosedSequence mine a subset of the frequent sequences that are likely to be Max, or Closed, sequences. How MaxSequence and ClosedSequence select the candidate sequences from the mined sequences is described in section 5.1.2.

5.1.1. Common Prefix Detection

Like PrefixSpan, MaxSequence and ClosedSequence use prefix projection to mine frequent sequences, but unlike PrefixSpan they do not mine the complete set of frequent sequences. For a projected database, instead of doing the projection for every frequent item, MaxSequence and ClosedSequence look for a common prefix in the database, and if they find one they do the projection based on that prefix. If they do not find a common prefix then, like PrefixSpan, they create a projected database for each frequent item. This way these two programs generate only a subset of the complete set of frequent sequences, $F$. Let us call this subset $p$, and the complete sets of Max and Closed sequences $\mu$ and $C$ respectively. In this section I first illustrate this method using an example, and later I prove that $\mu$ and $C$ are subsets of $p$.

Sequence_id Sequence

10 < a1 a2 … a10>

20 < a2 … a10>

30 < a2 >

Table 5.2 Sequence database S

Let us consider another extreme case, the sequence database S given in Table 5.2, and a minimum support of 1 (i.e., every occurrence is frequent). PrefixSpan first scans S to find the frequent items. In this case all items, a1 … a10, are frequent. These frequent items are the length-1 frequent sequences. Then for each frequent item, it creates the item's corresponding projected database. For each projected database these steps are repeated, until the database has no frequent items. In each repetition the length of the prefixes increases by one, and the prefixes are added to the set of frequent sequences. Table 5.3 shows the projected databases created for S during this process. For this sequence database, 511 non-empty projected databases and 1023 frequent sequences will be generated.

Table 5.3 Projected databases for sequence database S

Now let us see how MaxSequence and ClosedSequence use common prefix detection to generate fewer frequent sequences. They follow the same steps as PrefixSpan. The only difference is that after finding the frequent items for a database, they check whether they can find a common prefix among the non-empty sequences in the database. If there exists a common prefix, the projection continues based on that prefix, instead of on all frequent items. If there does not exist a common prefix, the projection continues based on the frequent items. In the case of this example, there is no common prefix in S, so the first iteration will be like PrefixSpan, and projected databases for a1 to a10 are created. The <a1> projected database contains only one sequence, < a2 … a10>. For this projected database there exists the common prefix < a2 … a10>, so in the next iteration the projection will be done only based on < a2 … a10> instead of <a2>, <a3>, …, <a10>. The <a2> projected database contains two non-empty sequences, < a3 … a10> and < a3 … a10>, and the common prefix for them is < a3 … a10>, so the next projection will be based on < a3 … a10>. Table 5.4 shows the projected databases created for S during this process. The underlined items in the table are the detected common prefixes. For the given sequence database and minimum support, MaxSequence and ClosedSequence will generate 9 non-empty projected databases and 19 frequent sequences.

Table 5.4 Projected databases for S, using prefix detection
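The counts quoted for this example (1023 sequences and 511 projected databases without prefix detection, 19 and 9 with it) can be reproduced with a small simulation. The sketch below is not the MaxSequence or ClosedSequence implementation: it assumes single-item elements, counts only non-empty projected databases other than the original database, and uses hypothetical function names.

```python
from collections import Counter


def common_prefix(sequences):
    """Longest prefix shared by every sequence in a non-empty list."""
    prefix = []
    for items in zip(*sequences):
        if len(set(items)) != 1:
            break
        prefix.append(items[0])
    return tuple(prefix)


def mine(database, min_support, detect_prefix):
    """Return (frequent sequences generated, number of non-empty projected databases)."""
    found = set()
    projected_count = 0

    def recurse(prefix, projected):
        nonlocal projected_count
        nonempty = [s for s in projected if s]
        if not nonempty:
            return
        if prefix:  # the original database is not counted as a projected database
            projected_count += 1
        if detect_prefix and len(nonempty) >= min_support:
            beta = common_prefix(nonempty)
            if beta:
                # Project once on the whole common prefix instead of on every item.
                found.add(prefix + beta)
                recurse(prefix + beta, [s[len(beta):] for s in nonempty])
                return
        counts = Counter(item for s in nonempty for item in set(s))
        for item in sorted(counts):
            if counts[item] >= min_support:
                found.add(prefix + (item,))
                recurse(prefix + (item,),
                        [s[s.index(item) + 1:] for s in nonempty if item in s])

    recurse((), [tuple(s) for s in database])
    return found, projected_count


items = [f"a{i}" for i in range(1, 11)]
S = [items, items[1:], ["a2"]]  # the database of Table 5.2

plain, plain_dbs = mine(S, 1, detect_prefix=False)
pruned, pruned_dbs = mine(S, 1, detect_prefix=True)
print(len(plain), plain_dbs)    # 1023 511
print(len(pruned), pruned_dbs)  # 19 9
```

In this run the three Closed sequences of S (<a2>, <a2 … a10>, and <a1 … a10>) are all among the 19 sequences generated with prefix detection.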

Now we need to show that the set of frequent sequences generated using this method p, covers the set of Max, and Closed sequences. In other word we need to prove that:

µ ??C? ???

Assume a projected database, the α-projected database (α is a sequence), with n non-empty sequences in it. For any frequent sequence β in the α-projected database, it is easy to see that $\text{support}_{\alpha\text{-projected}}(\beta) \le n$. This means that the maximum support for a frequent sequence obtained from a projected database is equal to the number of non-empty sequences in that database. So if we can find a common prefix β among all the non-empty sequences in the α-projected database, then β has the maximum possible support in the database. For any sequence γ such that γ is a proper subsequence of β, we have the following (the third relation holds because every sequence that contains β also contains its subsequence γ):

$\text{support}_{\alpha\text{-projected}}(\beta) = n$

$\text{support}_{\alpha\text{-projected}}(\gamma) \le n$

$\text{support}_{\alpha\text{-projected}}(\beta) \le \text{support}_{\alpha\text{-projected}}(\gamma)$

From these relations we can conclude that:

$\text{support}_{\alpha\text{-projected}}(\gamma) = \text{support}_{\alpha\text{-projected}}(\beta) = n$

From this conclusion and the definitions of Max and Closed sequences, we can see that γ cannot be a Max or Closed sequence, because β, a proper supersequence of γ, is also frequent and has the same support as γ. The same reasoning holds in the original sequence database S from which the α-projected database is projected: the sequence αβ is a supersequence of the sequence αγ, and $\text{support}_S(\alpha\beta) = \text{support}_S(\alpha\gamma)$, so the sequence αγ cannot be a Max or Closed sequence. Also, when we create the αβ-projected database from the α-projected database, any frequent item in the α-projected database that is not covered by the common prefix β remains frequent in the resulting αβ-projected database, and the frequent sequences resulting from these items are mined in later iterations of the process. This shows that, by using common prefix detection during mining, we ignore only frequent sequences that have no potential to be Max or Closed sequences, and therefore p is a superset of both µ and C.
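The support argument can be checked numerically on a small case. The sketch below is an illustration only: it assumes single-item elements, and the helper names is_subsequence and support as well as the sample projected database are hypothetical.

```python
def is_subsequence(sub, seq):
    """True if the items of sub occur in seq in the same order (gaps allowed)."""
    remaining = iter(seq)
    return all(item in remaining for item in sub)


def support(pattern, db):
    """Number of sequences in db that contain pattern as a subsequence."""
    return sum(1 for seq in db if is_subsequence(pattern, seq))


# Hypothetical projected database with n = 2 non-empty sequences.
projected = [("a3", "a4", "a5"), ("a3", "a4", "a5"), ()]
n = sum(1 for seq in projected if seq)

beta = ("a3", "a4", "a5")                       # the common prefix
for gamma in [("a3",), ("a4", "a5"), ("a3", "a5")]:
    # Every proper subsequence of the common prefix has the same support n,
    # so it cannot be a Max or Closed sequence.
    assert support(gamma, projected) == support(beta, projected) == n
```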

The difference between the number of frequent sequences obtained using PrefixSpan with and without common prefix detection is significant for the extreme case studied in this section, but what about more general, non-extreme cases? In general, and especially for high values of the support threshold, common prefix detection might offer no significant advantage, and its overhead might even make the mining process slower. For dense sequence databases and low values of the support threshold, however, the method shows its advantages.

In general, as the projection process goes further and the resulting projected databases contain fewer sequences, it becomes more probable that a common prefix is detected. As a result, in general cases it seems useful to apply common prefix detection only when the number of sequences in a projected database falls below a certain limit. Further studies can be done to find a proper value for this limit.

5.1.2. Candidate Selection

In the previous section I showed how we can mine a subset of frequent sequences, p, that covers all the Max and Closed sequences. In this section we will see how to choose subsets of p as candidate sets for Max and Closed sequence mining.

Like PrefixSpan, the MaxSequence and ClosedSequence programs use recursion to mine frequent sequences. In a given database they find the frequent items (or a common prefix), create the corresponding projected databases for those frequent items (or that prefix), and then recursively do the same for each projected database until the resulting projected database contains no frequent items. At that point the recursion for that branch ends, and a backtrack (a return to the calling procedure at the end of the recursion branch) happens.

In each iteration of this process the prefix of the projected database is added to the set of frequent sequences. Also, in each iteration the prefixes of the projected databases are created by growing the prefix of the original database, appending either a frequent item or a common prefix to its end. This means that if the β-projected database is created from the α-projected database during the mining process, then α is a proper prefix, and therefore a proper subsequence, of β. Starting from the original database and moving down the recursion tree, the prefix grows until we cannot go any further (a backtrack happens). This means that if the β-projected database contains no frequent items, then β is a supersequence of all the sequences before it on its recursion branch. From this fact and the definition of a Max sequence, we can see that none of the sequences before β can be a Max sequence, and only β has the potential to be one. This suggests that for mining Max sequences it is sufficient to use only the sequences generated during backtracks as the candidate set. MaxSequence uses these sequences as its candidate set.

Figure 5.1 illustrates this process for sequence database S shown as the root table in the figure, and minimum support of one. Sequences in the original database have no common prefix, so the projection happens based on the frequent items. There are five frequent items, so five projected databases, one for each frequent item, are generated.

The <a1>-projected database has one sequence, so the projection happens based on that sequence, and the <a1 a2 a3 a4 a5>-projected database, with no sequences in it, is created. Since this projected database is empty, a backtrack happens, and the process returns to the higher level in the tree (that is the <a2>-projected database).

The same thing happens for the <a2>-, <a3>-, and <a4>-projected databases. They all have common prefixes, and projection continues based on those common prefixes. The <a5>-projected database is empty, and no further projection is necessary. The prefix of each projected table is a frequent sequence, and candidate sequences are chosen from these frequent sequences. As described earlier, among the frequent sequences generated, only those generated during a backtrack are selected as candidate sequences.

In this case the candidate sequences are < a1 a2 a3 a4 a5 >, < a2 a3 a4 a5 >, < a3 a4 a5 >, < a4 a5 >, <a5>.

Figure 5.1 Example for MaxSequence candidate generation, and selection.
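The candidate generation walked through above can be sketched as a small recursive routine. This is a simplified illustration rather than the MaxSequence program: it assumes single-item elements and real (not pseudo) projection, all function names are hypothetical, and the database S below is an assumed example chosen so that the sketch yields the five candidates listed above. It is not necessarily the database of figure 5.1.

```python
from collections import Counter


def frequent_items(db, min_sup):
    """Items that occur in at least min_sup sequences of db."""
    counts = Counter()
    for seq in db:
        counts.update(set(seq))
    return sorted(item for item, c in counts.items() if c >= min_sup)


def common_prefix(db):
    """Longest prefix shared by all non-empty sequences of db."""
    non_empty = [seq for seq in db if seq]
    prefix = non_empty[0] if non_empty else ()
    for seq in non_empty[1:]:
        k = 0
        while k < min(len(prefix), len(seq)) and prefix[k] == seq[k]:
            k += 1
        prefix = prefix[:k]
    return prefix


def project(db, item):
    """Suffixes after the first occurrence of item; other sequences are dropped."""
    return [seq[seq.index(item) + 1:] for seq in db if item in seq]


def max_candidates(db, min_sup, prefix=(), out=None):
    """Collect the prefixes at which a backtrack happens."""
    out = [] if out is None else out
    items = frequent_items(db, min_sup)
    if not items:                       # nothing frequent: backtrack here
        if prefix:
            out.append(prefix)
        return out
    shared = common_prefix(db)
    if shared:                          # project once, on the whole common prefix
        projected = [seq[len(shared):] for seq in db if seq]
        max_candidates(projected, min_sup, prefix + shared, out)
    else:                               # project on every frequent item
        for item in items:
            max_candidates(project(db, item), min_sup, prefix + (item,), out)
    return out


# Assumed example database (hypothetical), minimum support of one.
S = [("a1", "a2", "a3", "a4", "a5"), ("a2", "a3", "a4", "a5"), ("a2",)]
candidates = max_candidates(S, min_sup=1)
for c in candidates:
    print(c)
```

Only the prefixes reached at a backtrack are collected; intermediate prefixes such as <a2> are frequent sequences but never enter the Max candidate set.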

The same candidate set cannot be used for Closed sequence mining, because although the prefix of a projected database is a supersequence of the prefix of the database it was projected from, their supports are not necessarily equal. Given the sequence database S, if the β-projected database is created from the α-projected database during the mining process, the number of sequences in the β-projected database is less than or equal to the number of sequences in the α-projected database. This suggests that:

$\text{support}_S(\beta) \le \text{support}_S(\alpha)$

Let's consider the following two cases:

1. $\text{support}_S(\beta) = \text{support}_S(\alpha)$: In this case α cannot be a Closed sequence, because there exists a sequence β that is a supersequence of α, and its support is not less than α's support.

2. $\text{support}_S(\beta) < \text{support}_S(\alpha)$: In this case α can be a Closed sequence, because in its recursion branch it has a higher support than its supersequences.

From this, and the facts we used for generating the candidate set for Max sequence mining, we can generate the candidate set for Closed sequence mining in the following way. We add the sequence generated during a backtrack to the candidate set. Going up the recursion branch, we ignore the sequences that have the same support as the last sequence added to the candidate set, and add only sequences that have a higher support than the last sequence added.

Implementing the mining procedure in a purely depth-first fashion is not efficient. For each projected database the mining process needs various data structures to keep track of frequent items and some other information. These data structures can be large, and storing them on the stack for each iteration of the program might not be practical for big databases. Therefore some of these structures are defined globally. For each database, after finding its frequent items by scanning it once, we create all the corresponding projected databases in the same iteration. After this we no longer need the information inside the global structures, and the next iteration can overwrite it. As described earlier, generating the candidate set for ClosedSequence requires moving up the recursion branch when a backtrack is encountered, but since the frequent item information for previous iterations is not kept, we need to keep track of sequences as they are generated. A tree-like data structure is used for this. As projected databases are created from an original database, the frequent sequences generated from these projected databases are added to the tree as children of the sequence generated from the original database.

A sequence generated from a database that does not have any frequent items represents a leaf node in the tree. ClosedSequence uses this structure to generate its candidate set. When it reaches a leaf node, it adds this node to the candidate set and moves to the parent node. If the parent node has a higher support, it is added to the candidate set; otherwise the next node up in the tree is checked. The process repeats these steps until it reaches the root node. All childless nodes can be deleted from the tree after they have been checked.

I will illustrate the process of candidate generation for ClosedSequence using the example sequence database shown in figure 5.1. Frequent sequence generation for ClosedSequence follows the same process as described for MaxSequence, but the selection process is different. Figure 5.2 shows the recursion tree for this process. The number following the sequence in each node is the support of that sequence. Below the root we have all the frequent items of the main sequence database. The child-parent relation shows from which parent database each child database is projected.

For ClosedSequence candidate selection, let's start by checking the leftmost branch. As soon as we get to the leaf node, we add it to the candidate set, and since the parent node does not have a higher support we ignore it. For the next branch we add the leaf node, and since its parent node has a higher support we also add the parent to the candidate set. For the other branches in this case we add only the leaf nodes to the candidate set. The candidate set in this case contains < a1 … a5 > : 1, < a2 … a5 > : 2, < a2 > : 3, < a3 … a5 > : 2, and < a4 a5 > : 2.

Figure 5.2 A sample Recursion tree

The algorithms for MaxSequence and ClosedSequence are shown in algorithms 5.2 and 5.3. How to check a candidate sequence and add it to the result set if it is a Max or Closed sequence is shown in algorithm 5.4. The routines used in this algorithm are discussed in the next section. There are two sets of routines, one for Max sequence mining and one for Closed sequence mining. The recursion tree routines used in ClosedSequence are shown in algorithm 5.1.

Subroutine AddToRecursionTree( RecursionTree, aSequence, aParent )

Parameters: RecursionTree: a tree structure to keep track of the order of sequence generation; aSequence: a sequence; aParent: a sequence.

Method:

    Add aSequence to the RecursionTree as a child of aParent

Subroutine EmptyRecursionTree( RecursionTree, aSequence, aTree )

Parameters: RecursionTree: a tree structure to keep track of the order of sequence generation; aSequence: a sequence; aTree: a Max-Tree or Closed-Tree for storing sequences.

Method:

    LastSupport = 0
    For each sequence aSeq, starting from aSequence to all of its ancestors
        If aSeq has no children
            If aSeq has a support higher than LastSupport
                LastSupport = Support of aSeq
                Call AddToResultset( aTree, aSeq )
            Delete aSeq from RecursionTree
        Else
            Exit for loop.
    For the rest of the remaining ancestors
        If the support of the ancestor is less than or equal to LastSupport
            Set the support of the ancestor to -1. // No need to consider this sequence for the rest of its children.

Algorithm 5.1 Recursion tree routines
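The recursion tree routines can be rendered in Python roughly as follows. This is a sketch under stated assumptions: the class TreeNode and the function empty_recursion_tree are hypothetical names, a plain list of candidates stands in for the call to AddToResultset, and the small tree built at the end only mirrors the first two branches described for figure 5.2.

```python
class TreeNode:
    """One frequent sequence in the recursion tree, with its support."""

    def __init__(self, sequence, support, parent=None):
        self.sequence = sequence
        self.support = support
        self.parent = parent
        self.children = []
        if parent is not None:          # corresponds to AddToRecursionTree
            parent.children.append(self)


def empty_recursion_tree(node, candidates):
    """Walk up from a leaf, emitting nodes whose support exceeds the last emitted one."""
    last_support = 0
    # Childless nodes are checked and then removed; the root is never emitted.
    while node.parent is not None and not node.children:
        if node.support > last_support:
            last_support = node.support
            candidates.append((node.sequence, node.support))
        node.parent.children.remove(node)
        node = node.parent
    # Remaining ancestors still have children: mark those that cannot
    # have a higher support than a sequence already emitted below them.
    while node.parent is not None:
        if node.support <= last_support:
            node.support = -1
        node = node.parent


# Hypothetical tree with two branches below the root.
root = TreeNode((), 3)
a1 = TreeNode(("a1",), 1, root)
a1_leaf = TreeNode(("a1", "a2", "a3", "a4", "a5"), 1, a1)
a2 = TreeNode(("a2",), 3, root)
a2_leaf = TreeNode(("a2", "a3", "a4", "a5"), 2, a2)

candidates = []
empty_recursion_tree(a1_leaf, candidates)
empty_recursion_tree(a2_leaf, candidates)
print(candidates)
```

In the first branch the parent <a1> has the same support as the leaf and is skipped; in the second branch <a2> has a higher support than its leaf and is emitted as well.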

Input: A sequence database S, and the minimum support threshold.

Output: The complete set of Max sequential patterns.

Method:

    MaxTree = {}.
    Call MaxSequence( < >, 0, S, MaxTree )
    Output MaxTree.

Subroutine MaxSequence( α, l, S|α, MaxTree )

Parameters: α: a sequence; l: the length of α; S|α: the α-projected database if α ≠ < >, otherwise the sequence database S; MaxTree: a Max-Tree for storing results.

Method:

    Scan S|α once, find the set of frequent items b such that:
        b can be assembled to the last element of α to form a sequence; or
        < b > can be appended to α to form a sequence.
    If the set of frequent items for S|α is empty
        Call AddToResultset( MaxTree, α )
        Return
    Try to find a common prefix β among all non-empty sequences in S|α such that:
        the first element of β can be assembled to the last element of α to form a sequence; or
        β can be appended to α to form a sequence.
    If β is found
        Append β to α to form a sequence α', construct the α'-projected database S|α'
        Call MaxSequence( α', l + length(β), S|α', MaxTree ).
    Else
        For each frequent item b
            Append it to α to form a sequence α', construct the α'-projected database S|α'
            Call MaxSequence( α', l + 1, S|α', MaxTree ).

Algorithm 5.2 MaxSequence a program for Max sequence mining

Subroutine AddToResultset( aTree, aSequence )

Parameters: aTree: a Max-Tree or Closed-Tree for storing sequences; aSequence: a sequence.

Method:

    If SuperSequenceExists( aTree, aSequence )
        Return
    DeleteSubSequences( aTree, aSequence )
    Add aSequence to aTree

Algorithm 5.4 Routine to add a sequence to a resultset tree
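The control flow of this routine can be illustrated for Max sequence mining with a plain Python list standing in for the Max-Tree. This is a sketch only: a linear list is the slower alternative to the tree structures described later, elements are assumed to hold a single item, and the names add_to_resultset and is_subsequence are hypothetical.

```python
def is_subsequence(sub, seq):
    """True if the items of sub occur in seq in the same order (gaps allowed)."""
    remaining = iter(seq)
    return all(item in remaining for item in sub)


def add_to_resultset(result, candidate):
    """Add candidate unless a stored supersequence exists; drop stored subsequences."""
    if any(is_subsequence(candidate, kept) for kept in result):
        return                                   # SuperSequenceExists
    # DeleteSubSequences: remove everything the candidate covers.
    result[:] = [kept for kept in result if not is_subsequence(kept, candidate)]
    result.append(candidate)


result = []
# Candidates fed shortest first, to exercise the deletion step.
for cand in [("a5",), ("a4", "a5"), ("a3", "a4", "a5"),
             ("a2", "a3", "a4", "a5"), ("a1", "a2", "a3", "a4", "a5")]:
    add_to_resultset(result, cand)
print(result)
```

Whatever the order in which the candidates arrive, only sequences with no stored supersequence survive in the result.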

5.2. Checking Candidates

In the previous section I described how the MaxSequence and ClosedSequence programs generate their candidate sets. In this section I will describe how to find the Max and Closed sequences among the candidate sequences.

MaxSequence and ClosedSequence keep a list of sequences as their result sets. At the beginning these sets are empty. As a candidate sequence α is generated, it is checked against the sequences inside the result set. If a Max, or Closed, supersequence of α is found inside the result set, α is ignored; otherwise all the Max, or Closed, subsequences of α are deleted from the result set, and α is added to the set. After all the candidate sequences have been checked, the result set contains the complete set of Max, or Closed, sequences. Given a sequence α and a result set, the performance of this check depends on how fast we can answer two queries: first, whether a Max, or Closed, supersequence of α exists in the result set, and second, which sequences in the result set are Max, or Closed, subsequences of α. The MaxSequence and ClosedSequence programs use two special data structures, Max-Tree and Closed-Tree, as their result sets. These data structures are designed to answer the checking queries efficiently.

These data structures are similar to CR-Tree 3, and are described in the next two sections. Section 5.2.1 describes Max-Tree, the result set for MaxSequence, and section 5.2.2 talks about Closed-Tree, the result set for ClosedSequence.

5.2.1. Max-Tree

Max-Tree is a CR-Tree like structure that is used as the result set for MaxSequence program. In this section we will discuss the structure of the Max-Tree, and we will also see how it facilitates the candidate checking in the Max sequence mining.

This structure has two major parts, a tree structure, and an array of linked lists called node link table. Each node in the tree represents an element and each path (starting from the root, and ending in a leaf) in the tree represents a sequential pattern. Sequences with a common prefix will share the element nodes of their common prefix. This means that the sequences are stored in the tree in compressed form.

Figure 5.3 A sample Max-Tree.

In the node link table, a linked list is kept for each frequent item in the sequence database. A node in these lists is a pointer to a tree node. A tree node is in the linked list of item i if it is the last element of a sequence that contains i. These lists are kept sorted based on the level of the tree node they point to. The level of an element in the tree in fact shows the ordinal number of that element in the sequence.

Figure 5.3 shows a sample Max-Tree. Each node in the tree represents an element, and each path from the root to a leaf node represents a sequence. This tree has three leaves, so it represents three sequences. These sequences are < i1 i3 … i2 (i3 i4) >, < i1 i5 >, and < i3 i1 i4 >. Dashed curved lines represent the linked lists for each item in the node link table. Notice that the node for element i1 in the rightmost branch has a higher tree level than the element i1 in the leftmost branch, so it is the first item in the linked list for item i1.

Since the length of α is three, the sequence in the tree cannot be a supersequence of α. The process stops, and the answer to the query is no.

Figure 5.4 Max-Tree after deletion

To check a sequence in the tree (specified by its last node in the tree) against a given sequence α, we compare the last element of α with the last tree node, and move toward the first element of α and the parent of the compared tree node. The tree node level and the length of α can also help in the checking process. To delete a sequence from the tree, the program starts from the last element of the sequence, the leaf node, and moves toward the first element of the sequence, just below the root node, deleting every node that does not have any other children. The corresponding pointers in the node link table should also be deleted.
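The backward comparison described above can be sketched as follows. This is an illustration under assumptions, not the Max-Tree implementation: the class ElementNode and the function path_contains are hypothetical names, elements are modelled as frozensets of items, the root is an empty node at level 0, and an element of α is taken to match a tree node when it is a subset of the node's element.

```python
class ElementNode:
    """Tree node holding one element (a set of items) and a parent pointer."""

    def __init__(self, element=(), parent=None):
        self.element = frozenset(element)
        self.parent = parent
        self.level = 0 if parent is None else parent.level + 1


def path_contains(last_node, alpha):
    """True if alpha is a subsequence of the path from the root to last_node."""
    alpha = [frozenset(e) for e in alpha]
    i = len(alpha) - 1                  # index of the element of alpha still to match
    node = last_node
    while i >= 0 and node.level > 0:
        if node.level < i + 1:          # fewer tree elements left than needed
            return False
        if alpha[i] <= node.element:    # element of alpha fits inside the tree element
            i -= 1
        node = node.parent
    return i < 0


# Hypothetical path for the sequence < i3 i1 i4 >.
root = ElementNode()
n1 = ElementNode({"i3"}, root)
n2 = ElementNode({"i1"}, n1)
leaf = ElementNode({"i4"}, n2)

print(path_contains(leaf, [{"i3"}, {"i4"}]))
print(path_contains(leaf, [{"i1"}, {"i3"}]))
```

The level test is what lets the walk stop early: once the remaining part of α is longer than the remaining part of the path, no match is possible.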

The routines used for checking a candidate sequence, and adding it to a Max-Tree if it is a Max sequence, are shown in algorithm 5.5. These are the routines that can be used in algorithm 5.4 in the previous section for Max sequence mining.

// Returns true if a proper Max supersequence of aSequence exists in aTree.

Boolean Function SuperSequenceExists( aTree, aSequence )

Parameters: aTree: a Max-Tree for storing sequences; aSequence: a sequence.

Method:

    i is the item in the last element of aSequence with the shortest linked list in aTree's node link table.
    For every pointer in i's linked list, starting from the beginning of the list
        aNode is the tree node that this pointer points to.
        If level of aNode is less than the length of aSequence
            Exit for loop
        TreeSequence is the sequence represented in aTree by the path starting from the root to aNode.
        If aSequence is a Max subsequence of TreeSequence
            Return True
    Return False

Subroutine DeleteSubSequences( aTree, aSequence )

Parameters: aTree: a Max-Tree for storing sequences; aSequence: a sequence.

Method:

    For each element, anElement, in aSequence
        For each item, anItem, in anElement
            For every pointer in anItem's linked list in aTree's node link table that points to a leaf node, starting from the end of the linked list
                aNode is the tree node that this pointer points to.
                If level of aNode is greater than the length of aSequence
                    Exit inner for loop
                TreeSequence is the sequence represented in aTree by the path starting from the root to aNode.
                If aSequence is a Max supersequence of TreeSequence
                    Delete TreeSequence from aTree

Algorithm 5.5 Max-Tree routines

5.2.2. Closed-Tree

Closed-Tree is also a CR-Tree like structure that is used as the result set for ClosedSequence program. In this section we will discuss the structure of the Closed-Tree, and we will also see how it facilitates the candidate checking in the Closed sequence mining.

Like Max-Tree, this structure has two major parts, a tree structure, and an array of linked lists called node link table. But unlike Max-Tree each node in the tree represents an item. Nodes can be part of different sequences, and among the different support values for those sequences each node stores the highest support value. Each node also has a flag that shows if it is an intra item. Non-first items in a multiple item element are intra items. A node with higher support than all of its children is called a terminal node. A path from root to a terminal node represents a sequence in the tree. Sequences with a common prefix will share the item nodes of their common prefix. This means that the sequences are stored in the tree in compressed form.

As in the Max-Tree, a linked list is kept for each frequent item in the node link table. A node in these lists is a pointer to a tree node. A tree node is in the linked list of item i if it is the last instance of item i in a sequence.

Figure 5.5 A sample Closed-Tree

Figure 5.5 shows a sample Closed-Tree. Each node in the tree represents an item, and each path from the root to a terminal node represents a sequence. The underlined item is an intra node. This tree has four terminal nodes, so it represents four sequences. These sequences are < i1 i2 i1 >:2, < (i1 i3) i1 >:3, < i2 i3 >:3, and < i2 i3 i1 >:1. Dashed curved lines represent the linked lists for each item in the node link table.
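The notion of a terminal node can be made concrete with a small sketch. It is an illustration under assumptions rather than the Closed-Tree implementation: the class ItemNode and the functions insert and stored_sequences are hypothetical, the node link table is left out, and children are keyed by the pair (item, intra flag). The four sequences inserted are the ones listed above for the sample tree.

```python
class ItemNode:
    """Tree node for one item, with the highest support seen and an intra flag."""

    def __init__(self, item=None, intra=False):
        self.item = item
        self.intra = intra              # True for non-first items of an element
        self.support = 0
        self.children = {}              # (item, intra) -> ItemNode

    def is_terminal(self):
        """A node with a higher support than all of its children."""
        return all(self.support > child.support for child in self.children.values())


def insert(root, sequence, support):
    """Insert a sequence given as a tuple of elements (tuples of items)."""
    node = root
    for element in sequence:
        for position, item in enumerate(element):
            key = (item, position > 0)
            if key not in node.children:
                node.children[key] = ItemNode(item, intra=position > 0)
            node = node.children[key]
            node.support = max(node.support, support)   # keep the highest support


def stored_sequences(node, path=()):
    """List (sequence, support) for every path from the root to a terminal node."""
    found = []
    for child in node.children.values():
        if child.intra:                 # extend the last element
            child_path = path[:-1] + (path[-1] + (child.item,),)
        else:                           # start a new element
            child_path = path + ((child.item,),)
        if child.is_terminal():
            found.append((child_path, child.support))
        found.extend(stored_sequences(child, child_path))
    return found


root = ItemNode()
insert(root, (("i1",), ("i2",), ("i1",)), 2)
insert(root, (("i1", "i3"), ("i1",)), 3)
insert(root, (("i2",), ("i3",)), 3)
insert(root, (("i2",), ("i3",), ("i1",)), 1)
sequences = stored_sequences(root)
for seq, sup in sequences:
    print(seq, sup)
```

The node for i3 under i2 is terminal even though it has a child, because its support (3) is higher than that of its only child (1); this is how a prefix of a longer stored sequence can itself be a stored sequence.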

Given a sequence α, let's see how ClosedSequence, using the Closed-Tree, answers the query of whether the result set contains a Closed supersequence of α. Let's call the last item in α ix, and the linked list for this item in the node link table Lix. Sequence α is checked against all the sequences that start from the root and end with a node pointed to by an element in Lix. If a Closed supersequence of α is found, the process stops and the answer to the query is yes; otherwise the answer is no.

Let's examine this process on the tree shown in figure 5.5. Assume α = < i1 i2 >:2. The linked list for i3, the last item of α, has two nodes. The first node represents the sequence < (i1 i3) i1 >:3. This sequence is not a supersequence of α, so we move to the second node in the linked list. This node represents the sequence < i1 i3 >:3, and this sequence is a Closed supersequence of α, so the process stops and the answer to the query is yes. The query for the sequence α = < i2 i3 >:4 follows the same steps, but since the sequence < i1 i3 >:3 is not a Closed supersequence of α, and there are no more nodes in the linked list of item i3, the process stops and the answer to the query is no.

To delete the Closed subsequences of a given sequence α from the Closed-Tree, the following steps are performed for each item i in α. Let's call the linked list of item i in the node link table Li. For every node in Li that points to a terminal node in the tree, and whose tree node has a support and level less than or equal to the support and length of α, check whether the sequence starting from the root and ending at this node is a Closed subsequence of α. If it is, delete the sequence from the tree, and continue.

Let's examine the sequence α = < i1 >:4 on the tree shown in figure 5.5, and see if there are Closed subsequences of α in the tree. The linked list for item i1 contains three nodes. The one that represents the sequence < i1 >:3 is not terminal, so it is ignored. The second and third nodes, which represent the sequences < i1 i2 i1 >:2 and < i2 i3 i1 >:1 respectively, are terminal, but their length is greater than that of α, so they cannot be Closed subsequences of α. Since there are no more nodes in the linked list to check, the answer to the query is no. For α = < i2 i3 >:4, we start with the linked list of item i2. This linked list points to two nodes, but neither of them is a terminal node, so we move to the linked list of i3. This linked list points to two terminal nodes in the tree. The one that represents the sequence < (i1 i3) i3 >:3 has a higher level than the length of α, so it cannot be a subsequence of α. The one that represents the sequence < i2 i3 >:3 has a support and level no greater than the support and length of α, so it is checked against α, and since it is a Closed subsequence of α, it is deleted from the tree. The tree after deleting the sequence < i2 i3 >:3 is shown in figure 5.6.

Figure 5.6 Closed-Tree after deletion

To check a sequence in the tree (specified by its last node in the tree) against a given sequence α, we compare the last item of α with the last tree node, and move toward the first item of α and the parent of the compared tree node. Comparing the tree node level and support with the length and support of α can also speed up the process. To delete a sequence from the tree, the program starts from the last element of the sequence, the terminal node, and moves toward the first element of the sequence, just below the root node, deleting every node that does not have any other children. The support of the nodes that are not deleted from the tree (nodes that have children) should be updated, and the corresponding pointers in the node link table should be deleted.

The routines used for checking a candidate sequence, and adding it to a Closed-Tree if it is a Closed sequence, are shown in algorithm 5.6. These are the routines that can be used in algorithm 5.4 in the previous section for Closed sequence mining.

// Returns true if a proper Closed supersequence of aSequence exists in aTree.

Boolean Function SuperSequenceExists( aTree, aSequence )

Parameters: aTree: a Closed-Tree for storing sequences; aSequence: a sequence.

Method:

    i is the last item of aSequence.
    For every pointer in i's linked list
        aNode is the tree node that this pointer points to.
        If level of aNode is not less than the length of aSequence
            TreeSequence is the sequence represented in aTree by the path starting from the root to aNode.
            If aSequence is a Closed subsequence of TreeSequence
                Return True
    Return False

Subroutine DeleteSubSequences( aTree, aSequence )

Parameters: aTree: a Closed-Tree for storing sequences; aSequence: a sequence.

Method:

    For each element, anElement, in aSequence
        For each item, anItem, in anElement
            For every pointer in anItem's linked list in aTree's node link table that points to a terminal node, starting from the end of the linked list
                aNode is the tree node that this pointer points to.
                If level of aNode is not greater than the length of aSequence
                    TreeSequence is the sequence represented in aTree by the path starting from the root to aNode.
                    If aSequence is a Closed supersequence of TreeSequence
                        Delete TreeSequence from aTree

Algorithm 5.6 Closed-Tree routines

5.3. String Elimination

String Elimination is another method that can be used to further improve the performance of Max sequence mining. As we create the projected databases, we can check the sequences in the database against the result set. If a sequence in a projected database is a subsequence of a sequence in the result set, we can delete that sequence from the projected database. This works for Max sequence mining because we are looking for the longest frequent sequences and are not interested in shorter sequences, even if they have higher supports. The idea does not work for Closed sequence mining, because by eliminating sequences from projected databases we might lose some shorter frequent sequences with higher supports.

Figure 5.7 Example for String Elimination

As an example, consider the sequence database shown in figure 5.1. The results of Max and Closed sequence mining are { < a1 a2 a3 a4 a5 > } and { < a1 a2 a3 a4 a5 > : 1, < a2 a3 a4 a5 > : 2, < a2 > : 3 } respectively. Figure 5.7 shows the mining process on the same sequence database using string elimination. The underlined sequences in the projected databases are the eliminated sequences. The results of Max and Closed sequence mining in this case are { < a1 a2 a3 a4 a5 > } and { < a1 a2 a3 a4 a5 > : 1, < a2 a3 a4 a5 > : 2, < a2 > : 3, < a3 > : 2, < a4 > : 2, < a5 > : 2 } respectively. As can be seen, in this case string elimination produces the right set of sequences for Max sequence mining, but the sequences generated for Closed sequence mining are not the right ones.
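The elimination step itself is a simple filter. The sketch below follows the description literally and is only an illustration: it assumes single-item elements, the names eliminate and is_subsequence are hypothetical, and a plain list stands in for the Max-Tree result set.

```python
def is_subsequence(sub, seq):
    """True if the items of sub occur in seq in the same order (gaps allowed)."""
    remaining = iter(seq)
    return all(item in remaining for item in sub)


def eliminate(projected_db, result_set):
    """Drop every projected sequence that is a subsequence of a stored result.

    Empty suffixes are trivially covered by any stored result and are dropped too.
    """
    return [seq for seq in projected_db
            if not any(is_subsequence(seq, found) for found in result_set)]


# Hypothetical state after the first branch has produced one Max sequence.
result_set = [("a1", "a2", "a3", "a4", "a5")]
a2_projected = [("a3", "a4", "a5"), ("a3", "a4", "a5"), ()]

print(eliminate(a2_projected, result_set))
print(eliminate([("a3", "a6"), ("a4",)], result_set))
```

Once every sequence of a projected database has been eliminated, no further projection is needed for that branch, which is where the saving for Max sequence mining would come from. Supports computed on the reduced database are no longer reliable, which matches the observation that the method is unsuitable for Closed sequence mining.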

CHAPTER 6

RESULTS

6. Results

In this chapter we compare the results and performance of the different programs described in previous chapters on some sample datasets. The programs were tested using seven synthetic sequence databases and one real database. The synthetic databases were generated using the publicly available synthetic data generation program of the IBM Quest data mining project 10. This data generator has been used in many sequential pattern mining studies. Some parameters of this program are shown in table 6.1, and the descriptions of the sequence databases used for testing are summarized in table 6.2. The real database, called ProductClicks, is constructed from the transactional data file, clicks.data, of the KDD Cup 2000. These data files are described in 19. ProductClicks shows which of the 1423 different products and assortments are viewed by 29369 different users, and in what order. Product pages viewed in one session are considered an itemset, and the different sessions of one user are considered a sequence.

| Name | Command | Description |
|---|---|---|
| D | -ncust | Number of customers in 000s (default: 100) |
| C | -slen | Average transactions per customer (default: 10) |
| T | -tlen | Average items per transaction (default: 2.5) |
| N | -nitems | Number of different items in 000s (default: 10) |
|  | -rept | Repetition level (default: 0) |
| NS | -seq.npats | Number of sequential patterns (default: 5000) |
| S | -seq.patlen | Average length of maximal pattern (default: 4) |
|  | -seq.corr | Correlation between patterns (default: 0.25) |
|  | -seq.conf | Average confidence in a rule (default: 0.75) |
| NI | -lit.npats | Number of patterns (default: 25000) |
| I | -lit.patlen | Average length of maximal pattern (default: 1.25) |
|  | -lit.corr | Correlation between patterns (default: 0.25) |
|  | -lit.conf | Average confidence in a rule (default: 0.75) |

Table 6.1 Command line options for IBM Quest data generator

Experiments were performed on a 550 MHz Pentium III machine, with 256 MB of main memory. Ten programs were tested during these experiments. Summary information for these programs is shown in table 6.3.

| Name | N | C | T | S | I | NS | D | Size (MB) |
|---|---|---|---|---|---|---|---|---|
| C20-N100-T2.5-D100K | 100 | 20 | 2.5 | 4 | 1.25 | 5K | 100 | 24.9 |
| C10-N10-T1-S4-I1.25-D100K | 10 | 10 | 1 | 4 | 1.25 | 5K | 100 | 5.7 |
| C2-N10-T5-S4-I1.25-D100K | 10 | 2 | 5 | 4 | 1.25 | 5K | 100 | 1 |
| C5-N10-T2.5-S4-I1.25-D100K | 10 | 5 | 2.5 | 4 | 1.25 | 5K | 100 | 4.4 |
| C5-N10-T2.5-S4-I2.5-D100K | 10 | 5 | 2.5 | 4 | 2.5 | 5K | 100 | 4.1 |
| C10-N10-T2.5-Ns100-D1K | 10 | 10 | 2.5 | 4 | 1.25 | 100 | 1 | 0.12 |
| C10-N10-T2.5-Ns5K-D1K | 10 | 10 | 2.5 | 4 | 1.25 | 5K | 1 | 0.12 |

Table 6.2 Synthetic sequence database descriptions

In section 6.1 the performance of MaxSequence is compared with MaxNaive, and PrefixSpan. MaxSequence, and MaxNaive are described in chapters 5, and 4 respectively. In section 6.2 a similar evaluation is performed for ClosedSequence. ClosedSequence, and ClosedNaive are implementations of the algorithms described in chapters 5, and 4 respectively.

The effects of the optimization methods (Common Prefix Detection, Max-Tree, Closed-Tree, and String Elimination) used in the MaxSequence and ClosedSequence programs are studied in section 6.3.

| Program | Description | Name in Charts |
|---|---|---|
| PrefixSpan | PrefixSpan-1 (pseudo-projection) for frequent sequence mining. | P |
| MaxNaive | Max sequence mining program using naïve approach described in chapter 4. | MN |
| MaxSequence | Max sequence mining program described in chapter 5 (without string elimination). | MS |
| MaxSeqSE | Max sequence mining program with string elimination. | MSSE |
| MaxSeqNoPre | Max sequence mining program described in chapter 5, without prefix detection. | MSNP |
| MaxSeqNoMaxTree | Max sequence mining program described in chapter 5, without Max-Tree (uses a linear list like MaxNaive). | MSNMT |
| ClosedNaive | Closed sequence mining program using naïve approach described in chapter 4. | CN |
| ClosedSequence | Closed sequence mining program described in chapter 5. | CS |
| ClosedSeqNoPre | Closed sequence mining program described in chapter 5, without prefix detection. | CSNP |
| ClosedSeqNoClosedTree | Closed sequence mining program described in chapter 5, without Closed-Tree (uses a linear list like ClosedNaive). | CSNCP |

Table 6.3 Programs that are being tested

6.1. Max Sequence Mining Results

In this section the performance of the Max sequence mining programs MaxNaive and MaxSequence is examined. First the programs were tested using the seven synthetic sequence databases described in table 6.2. The results are shown in figure 6.1. For each database two charts are provided. The charts on the left compare the run times of MaxSequence (MS), MaxNaive (MN), and PrefixSpan (P).

Figure 6.1 MaxSequence performance charts


The charts on the right show the number of frequent sequences (Freq), the number of Max sequences (Max), and the number of candidates generated by MaxSequence (Cand). The set of frequent sequences is in fact the candidate set for MaxNaive, so the number of frequent sequences also gives the number of candidates generated by MaxNaive. As the charts show, MaxSequence runs faster than MaxNaive in all cases, and the gap becomes more noticeable as the minimum support decreases. The difference between the number of frequent sequences and the number of Max sequences also grows as the minimum support decreases. The number of candidate sequences always lies between the number of Max sequences and the number of frequent sequences.
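The observation that the set of frequent sequences is the candidate set for MaxNaive can be illustrated with a small sketch. The code below is not the MaxNaive program from chapter 4; it is a minimal illustration that assumes a sequence is represented as a tuple of itemsets and that a plain linear scan is used for the comparison. The names `is_subsequence`, `naive_max_filter`, and `seq` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: a sequence is an ordered tuple of itemsets.
Sequence = Tuple[FrozenSet[str], ...]


def is_subsequence(small: Sequence, large: Sequence) -> bool:
    """True if each itemset of `small` is contained, in order, in a distinct itemset of `large`."""
    pos = 0
    for itemset in small:
        # Greedily match against the earliest remaining itemset of `large`.
        while pos < len(large) and not itemset <= large[pos]:
            pos += 1
        if pos == len(large):
            return False
        pos += 1
    return True


def naive_max_filter(frequent: List[Sequence]) -> List[Sequence]:
    """Treat every frequent sequence as a candidate and keep those not contained in another one."""
    return [
        s for s in frequent
        if not any(s != t and is_subsequence(s, t) for t in frequent)
    ]


def seq(*itemsets: str) -> Sequence:
    """Hypothetical helper: seq("a", "bc") builds the sequence <(a)(bc)>."""
    return tuple(frozenset(i) for i in itemsets)


# Five frequent sequences are examined as candidates; only one of them is Max.
frequent = [seq("a"), seq("b"), seq("a", "b"), seq("a", "bc"), seq("c")]
max_sequences = naive_max_filter(frequent)
print(len(frequent), "candidates,", len(max_sequences), "Max sequence(s)")
```

Every frequent sequence is compared against the whole list here, so the cost grows with the size of the frequent set, which is consistent with the widening run time gap at low minimum supports.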

In another experiment, the performance of MaxSequence with String Elimination is compared with that of PrefixSpan on a real sequence dataset, ProductClicks. The results are shown in figure 6.2. As the charts show, the run time difference between the two programs is significant at lower minimum supports. At these supports the number of frequent sequences is in the millions, whereas the number of Max sequences is in the thousands.

Figure 6.2 MaxSequence performance charts for ProductClicks

6.2. Closed Sequence Mining Results

In this section the performance of the Closed sequence mining programs ClosedNaive and ClosedSequence is examined. As in the previous section, the programs were first tested using the seven synthetic sequence databases described in table 6.2. The results are shown in figure 6.3.

Figure 6.3 ClosedSequence performance charts

Again, two charts are provided for each database: a run time chart and a chart of the number of candidate sequences. The charts on the left compare the run times of ClosedSequence (CS), ClosedNaive (CN), and PrefixSpan (P). The charts on the right show the number of frequent sequences (Freq), the number of Closed sequences (Closed), and the number of candidates generated by ClosedSequence (Cand).

The set of frequent sequences is in fact the candidate set for ClosedNaive, so the number of frequent sequences also gives the number of candidates generated by ClosedNaive. ClosedSequence performs much faster than ClosedNaive, and the difference becomes more apparent as the minimum support decreases. The difference between the number of frequent sequences and the number of Closed sequences also grows as the minimum support decreases. The number of candidate sequences always lies between the number of Closed sequences and the number of frequent sequences but, as might be expected, the difference between the numbers of candidates generated by the two programs is smaller than in Max sequence mining.

Comparing the run time charts in this section with those of the previous section shows that in some cases the ClosedSequence program runs faster than MaxSequence. This is due to optimizations made during the implementation of ClosedSequence and is not related to the algorithms.
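The same naive idea applies to Closed sequences, with support taken into account: a frequent sequence is dropped only when some other frequent sequence contains it and has the same support. The sketch below is an illustration under assumed representations (sequences as tuples of itemsets, supports in a dictionary), not the ClosedNaive program itself; `naive_closed_filter` and `seq` are hypothetical names.

```python
from typing import Dict, FrozenSet, Tuple

# Assumed representation: a sequence is an ordered tuple of itemsets.
Sequence = Tuple[FrozenSet[str], ...]


def is_subsequence(small: Sequence, large: Sequence) -> bool:
    """True if each itemset of `small` is contained, in order, in a distinct itemset of `large`."""
    pos = 0
    for itemset in small:
        while pos < len(large) and not itemset <= large[pos]:
            pos += 1
        if pos == len(large):
            return False
        pos += 1
    return True


def naive_closed_filter(support: Dict[Sequence, int]) -> Dict[Sequence, int]:
    """Keep sequences that have no other frequent sequence containing them with equal support."""
    return {
        s: sup
        for s, sup in support.items()
        if not any(
            s != t and sup == sup_t and is_subsequence(s, t)
            for t, sup_t in support.items()
        )
    }


def seq(*itemsets: str) -> Sequence:
    """Hypothetical helper: seq("a", "b") builds the sequence <(a)(b)>."""
    return tuple(frozenset(i) for i in itemsets)


# Illustrative supports only; these numbers are not taken from the experiments.
support = {seq("a"): 5, seq("b"): 4, seq("a", "b"): 4, seq("c"): 2}
closed = naive_closed_filter(support)
print(sorted(closed.values(), reverse=True))
```

In this toy input `<(a)>` survives because its only super-sequence has a lower support, although it would not be a Max sequence. Every Max sequence is also Closed, so the Closed result set is never smaller than the Max result set.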

In another experiment, the performance of ClosedSequence is compared with that of PrefixSpan on a real sequence dataset, ProductClicks. The results are shown in figure 6.4. As the charts show, the run time difference between the two programs becomes more noticeable as the minimum support decreases. At lower minimum supports the number of frequent sequences is in the millions, whereas the number of Closed sequences is in the thousands.

Figure 6.4 ClosedSequence performance charts for ProductClicks

6.3. Effects of Optimization Methods

In this section we study the effects of the optimization methods Common Prefix Detection, Max-Tree, Closed-Tree, and String Elimination. The programs described in table 6.3 are used to mine Max and Closed sequences in the synthetic sequence database C5-N10-T2.5-S4-I2.5-D100K. Figure 6.5 shows the effect of String Elimination in Max sequence mining. This chart compares the run time of MaxSequence without String Elimination (MS) with that of MaxSequence with String Elimination (MSSE). As the chart shows, MSSE performs better than MS.

Figure 6.5 Effect of String Elimination

Figure 6.6 shows the effects of using Max-Tree and Closed-Tree in the MaxSequence and ClosedSequence programs. The chart on the left compares the run time of MaxSequence, which uses Max-Tree (MS), with the run time of a version of MaxSequence that uses a linear list instead of Max-Tree (MSNMT). The chart on the right compares the run time of ClosedSequence, which uses Closed-Tree (CS), with the run time of a version of ClosedSequence that uses a linear list instead of Closed-Tree (CSNCT). Both charts show that performance improves significantly when Max-Tree and Closed-Tree are used instead of linear lists.

Figure 6.6 Effects of using Max-Tree, and Closed-Tree

The effect of Common Prefix Detection is shown in the charts of figure 6.7. The charts in the first row show this effect for Max sequence mining. The left-hand chart compares the run time of MaxSequence (MS) with the run time of a version of MaxSequence that does not use Common Prefix Detection (MSNP). This chart shows that MS outperforms MSNP significantly. The reason can be seen in the right-hand chart, which shows that MS generates far fewer candidates than MSNP. The charts in the second row of figure 6.7 show the effect of Common Prefix Detection in Closed sequence mining. The left-hand chart compares the run time of ClosedSequence (CS) with the run time of a version of ClosedSequence that does not use Common Prefix Detection (CSNP). This chart shows that CS outperforms CSNP significantly. The reason can again be seen in the right-hand chart, which shows that CS generates far fewer candidates than CSNP.

Figure 6.7 Effect of Common Prefix Detection

CHAPTER 7

CONCLUSION AND FUTURE WORK

7. Conclusion and Future Work

The frequent Max and Closed sequential pattern mining problems were formally defined in this thesis, and different programs were then proposed to solve them. The main idea behind these programs is to generate a set of candidate sequences and then keep only the Max or Closed ones. Several ideas were proposed to reduce the number of candidate sequences generated, and data structures were developed to keep more sequences in memory and to compare sequences more efficiently. According to the tests performed, the MaxSequence and ClosedSequence programs perform acceptably. The Prefix Detection feature in these programs might slow them down slightly in general cases, but it can improve performance drastically in extreme cases.

These results suggest a direction for future research: the program could decide at run time whether or not to use Prefix Detection. This decision can be based on different factors. One way would be to find a measure of database density and to enable or disable Prefix Detection based on this measure (a good measure of database density could also be used in many other applications). Another approach would be to base the decision on the minimum support value and the number of sequences in the projected database. As the number of sequences in a projected database decreases, the chance that they share a common prefix increases, so the program can turn Prefix Detection on or off based on the number of sequences in the projected database. The threshold value could be related to the number of sequences in the original database and the minimum support value.

Another research idea would be to scale up this process. The result sets of these programs are kept in main memory. Although the sequences are kept in compressed form using Max-Tree and Closed-Tree, this can cause problems in some extreme cases, or when the size of the usable memory is restricted. Max-Tree and Closed-Tree could be modified for these cases so that they can be stored on disk without losing their efficiency in checking new sequences.
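The threshold idea can be sketched as a simple switch. This is only an illustration of the proposed future work and was not implemented or evaluated in this thesis. The function name, the `factor` parameter, and the specific rule (compare the projected database size with a multiple of the absolute minimum support count) are all assumptions made for the example.

```python
def use_prefix_detection(
    projected_size: int,
    original_size: int,
    min_support: float,
    factor: float = 2.0,
) -> bool:
    """Hypothetical rule for switching Prefix Detection on or off.

    `min_support` is a fraction of the original database, so
    `min_support * original_size` is the absolute support count a pattern
    needs. When the projected database is only slightly larger than that
    count, any frequent extension must occur in most of its sequences, so
    a shared prefix is more likely and detection is more likely to pay off.
    `factor` is an assumed tuning parameter, not a value from the thesis.
    """
    min_count = min_support * original_size
    return projected_size <= factor * min_count


# Example with a 100K-sequence database and 1% minimum support (count = 1000).
print(use_prefix_detection(1_200, 100_000, 0.01))   # small projected database
print(use_prefix_detection(50_000, 100_000, 0.01))  # large projected database
```

A suitable value for `factor` would have to be determined experimentally, for example by repeating the runs of section 6.3 with the switch enabled.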

Bibliography

1 R. Agrawal, and R. Srikant: “Mining Sequential Patterns”, Proc. of the Int’l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.

2 R. Srikant, and R. Agrawal: “Mining Sequential Patterns: Generalizations and performance improvements”, In Proc. 5th Int’l Conference Extending Database Technology (EDBT), Avignon, France, March 1996.

3 W. Li: “Classification Based on Multiple Association Rules”. M.Sc. Thesis, Simon Fraser University, April 2001.

4 J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth”, Proc. 2001 Int. Conf. on Data Engineering (ICDE’01), Heidelberg, Germany, April 2001.

5 J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, “FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining”, Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), Boston, MA, Aug. 2000.

6 M. J. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Sequences”, in Machine Learning Journal, special issue on Unsupervised Learning (Doug Fisher, ed.), pages 31-60, Vol. 42 Nos. 1/2, Jan/Feb 2001.

7 R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases.”, Proc. of the ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, D.C., May 1993.

8 M. J. Zaki, and C. Hsiao, “CHARM: An Efficient Algorithm for Closed Association Rule Mining”, in Technical Report 99-10, Computer Science, Rensselaer Polytechnic Institute, 1999.

9 J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets.”, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2000, pages 21-30, Dallas, TX, 2000.

10 IBM Quest Data Mining Project Synthetic Data Generation Program: http://www.almaden.ibm.com/cs/quest/syndata.html.

11 K. Wang, and J. Tan, “Incremental Discovery of Sequential Patterns.”, In 1996 ACM SIGMOD Data Mining Workshop: Research Issues on Data Mining and Knowledge Discovery (SIGMOD’96), pages 95-102, Montreal, Canada, May 1996.

12 R. Agrawal, and R. Srikant, “Fast algorithms for mining association rules in large databases.”, In Research Report RJ 9839, IBM Almaden Research Center, San Jose, CA, June 1994.

13 H. Mannila, H. Toivonen, and A. I. Verkamo, “Efficient algorithms for discovering association rules.”, In Proc. AAAI’94 Workshop Knowledge Discovery in Databases (KDD’94), pages 181-192, Seattle, WA, July 1994.

14 M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithm for discovery of association rules.”, Data Mining and Knowledge Discovery, 1:343-374, 1997.

15 J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation.”, Proc. 2000 ACM SIGMOD Int. Conf. Management of Data (SIGMOD’00), pages 1-12, Dallas, TX, May 2000.

16 J. Pei, and J. Han, “Can We Push More Constraints into Frequent Pattern Mining?”, Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD’00), Boston, MA, August 2000.

17 K. Wang, Y. He, and J. Han, “Mining Frequent Itemsets Using Support Constraints”, Proc. 2000 Int. Conf. on Very Large Data Bases (VLDB’00), Cairo, Egypt, September 2000.

18 M. N. Garofalakis, R. Rastogi, and K. Shim, “Mining Sequential Patterns with Regular Expression Constraints”, IEEE Transactions on Knowledge and Data Engineering, volume 14, number 3, pages 530-552, May/June 2002.

19 R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. “KDD-Cup 2000 organizers’ report: Peeling the onion.” SIGKDD Explorations, volume 2, issue 2, pages 86-98, December 2000. http://www.ecn.purdue.edu/KDDCUP.