MG-FSM: Large-Scale Frequent Sequence Mining
MG-FSM is a scalable, distributed (i.e., shared nothing) algorithm for frequent sequence mining (FSM) on MapReduce. The algorithm can handle so-called "gap constraints", which can be used to limit the output to a controlled set of frequent sequences.
At a high-level, MG-FSM carefully partitions and rewrites the set of input sequences in such a way that each partition independently and in parallel. We achieve this by extending the notion of item-based partitioning, which underlies a number of frequent pattern mining algorithms to gap-contrained frequent sequence mining. Our experiments, in which we mined more than 1 billion sequences, suggest that MG-FSM is multiple orders of magnitude faster than baseline algorithms for general gap-constrained FSM and is competitive to state-of-the-art algorithms for n-gram mining.
For more information, please refer to our related publication.
The source code is hosted on MG-FSM GitHub repository.