What do the items in one bracket represent in sequential pattern mining? - data-mining

I have seen many databases for sequential pattern mining, and the sequences in these databases look like:
<(af)(d)(e)(a)>
<(e)(abf)(bde)>
What does the set of items in one bracket like (af), (abf), (bde) represent?
Does it mean that they are related to one another, or something else?
On what basis do we classify items into this one element?
I am using a weblog file as dataset.

The input of a sequential pattern mining algorithm is a sequence database. A sequence is an ordered list of itemsets.
Here is an example of sequence:
<(e)(abf)(bde)>
This sequence should be interpreted as follows:
First the item "e" occurred. It was then followed by "a", "b" and "f" simultaneously. These items were then followed by "b", "d" and "e" simultaneously.
So the answer is: items between brackets are assumed to be unordered, or occurring at the same time. Items between brackets are called an "itemset".
Note that it is also assumed that no item can appear more than once in an itemset, so it would be illegal to have an itemset such as (a a b).
Moreover, you should also know that most sequential pattern mining algorithms assume that items in an itemset are lexically ordered (e.g. PrefixSpan). If the items in an itemset are not lexically ordered, the algorithms may not produce correct results, because they use optimizations that rely on this assumption.
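To make the format concrete, here is a minimal sketch (illustrative only, not SPMF's actual representation) of how such a sequence could be modeled in C++, with std::set keeping each itemset duplicate-free and lexically ordered, matching the assumptions above:

#include <iostream>
#include <set>
#include <vector>

// An itemset is a set of distinct items; std::set keeps them lexically
// ordered and duplicate-free, matching the assumptions described above.
// A sequence is an ordered list of itemsets.
using Item = char;
using Itemset = std::set<Item>;          // e.g. (abf)
using Sequence = std::vector<Itemset>;   // e.g. <(e)(abf)(bde)>

int main() {
    Sequence s = { {'e'}, {'a','b','f'}, {'b','d','e'} };
    for (const Itemset& itemset : s) {
        std::cout << '(';
        for (Item item : itemset) std::cout << item;
        std::cout << ')';
    }
    std::cout << '\n';   // prints (e)(abf)(bde)
}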
If you want to try some sequential pattern mining algorithms, you can have a look at the SPMF software: http://www.philippe-fournier-viger.com/spmf/ which provides a graphical user interface and many examples (I'm the project founder).
Hope this answers your question.

Related

Sorting a list of positive words so that the most frequently spoken are at the top of the list and the rarely spoken at the end

I have a list of positive words; the list has more than 1000 words. Is there any way to sort the list from the most frequently spoken words to the rarely spoken ones? Any idea how to do it in C++ or C?
If I have a static CSV file of millions of tweets and a positive.txt file, would comparing against it work for the sorting?
This is called a self-organising list. Assuming you have a dataset, Knuth gives two algorithms:
every time you find a used word, exchange it with its predecessor in the list, if any (the "transpose" heuristic),
OR
every time you find a used word, move it to the front of the list (the "move-to-front" heuristic).
After processing your dataset, your list should be more or less self-organised into frequency-of-use order.
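For illustration, here is a rough sketch of the first (transpose) heuristic, assuming the word list fits in a std::vector; the function and variable names are my own:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Transpose heuristic: each time a word from the list is seen in the
// data, swap it with its predecessor. Frequently seen words gradually
// bubble toward the front of the list.
void touch(std::vector<std::string>& words, const std::string& seen) {
    auto it = std::find(words.begin(), words.end(), seen);
    if (it != words.end() && it != words.begin())
        std::iter_swap(it, it - 1);
}

int main() {
    std::vector<std::string> words = {"good", "great", "happy", "nice"};
    for (const std::string& w : {"nice", "nice", "nice", "happy"})
        touch(words, w);
    for (const std::string& w : words) std::cout << w << ' ';
    std::cout << '\n';   // nice good happy great
}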

String completion and matching algorithm

You have two sets: S1 = {B, C, D, T, M, ...} and S2 = {every letter of the alphabet not present in S1}.
Now, I have some string constructed from consonants in S1 (e.g. BBWRD) which I want to be transformed into words/sentences based on a provided dictionary (e.g. a dictionary from a spell-checking mechanism).
The algorithm can fill the spaces between each letter of the 'base word' with any number of letters from S2. The order can't be changed, and letters/consonants from S1 can't be used as fillers.
The only thing that came to my mind is using regexps. Can you propose any other, better approach? Or at least give a name to this kind of algorithm, so I could search further.
I'd think about creating a search tree. Each node would have |S1| subnodes, and leaves would contain a list of possible words which may be constructed from a given acronym (so that, for example, the leaf at the end of the path W->R->D would contain "Word"). Searching such a tree would be pretty fast, though it would require a noticeable amount of memory to be stored for quick access.
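As an illustration of the same idea, instead of an explicit tree you can get an equivalent lookup with a map keyed by each word's S1-skeleton (all names and the sample S1 below are made up for the sketch):

#include <cctype>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical S1 = the set of "base" consonants; everything else is S2.
const std::string S1 = "BCDTMWR";

// Extract the S1-skeleton of a word, e.g. skeleton("Word") == "WRD".
// Order is preserved, so a word matches a query exactly when its
// skeleton equals the query string.
std::string skeleton(const std::string& word) {
    std::string key;
    for (char c : word)
        if (S1.find((char)std::toupper(c)) != std::string::npos)
            key += (char)std::toupper(c);
    return key;
}

int main() {
    // Index the dictionary once: skeleton -> candidate words.
    std::vector<std::string> dict = {"word", "ward", "weird", "cat"};
    std::map<std::string, std::vector<std::string>> index;
    for (const std::string& w : dict) index[skeleton(w)].push_back(w);

    // A query like "WRD" is then a single map lookup.
    for (const std::string& w : index["WRD"])
        std::cout << w << '\n';   // word, ward, weird
}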

When do we actually use a Trie?

I am starting to read about tries. I also got references from friends here: Tutorials on Trie
I am not clear on the following:
It seems that, to go on and use a Trie, one assumes that all the input strings that form the search space and are used to build the Trie are already separated at distinct word boundaries.
E.g. all the example tutorials I have seen use input such as:
S={ball, bid, byte, car, cat, mac, map etc...}
Then we build the trie from S and do our searches (really fast)
My question is: How did we end up with S to begin with?
I mean before starting to read about tries I imagined that S would be an arbitrarily long text e.g. A Shakespeare passage.
Then using a Trie we could find things really fast.
But it seems this is not the case.
Is the assumption here that the input passage (of Shakespeare for example) is pre-processed first extracting all the words to get S?
So if one wants to search for patterns (the same way you do when you Google a multi-word phrase, spaces included), a Trie is not appropriate?
When can we know if a Trie is the data structure that we can actually use?
Tries are useful where you have a fixed dictionary you want to look up quickly. Compared to a hashtable it may require less storage for a large dictionary, but may well take longer to look up. One place I have used it is for mapping URLs to operations on a web server, where there may be inheritance of functionality based on the prefix. Here recursing down a trie enables appropriate lookup of all of the methods that need to be called for a particular URL. It would also be efficient for storing a dictionary.
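As an illustration, here is a minimal character-trie sketch (my own names, not from any particular library) showing the insert and lookup operations such a prefix-based dispatcher builds on:

#include <iostream>
#include <map>
#include <memory>
#include <string>

// Each node maps a character to a child; a flag marks nodes where a
// complete key ends. Walking a string character by character visits
// every key registered along its prefix, which is what makes
// prefix-based dispatch convenient.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isKey = false;
};

void insert(TrieNode& root, const std::string& key) {
    TrieNode* node = &root;
    for (char c : key) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->isKey = true;
}

bool contains(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (char c : key) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return false;
        node = it->second.get();
    }
    return node->isKey;
}

int main() {
    TrieNode root;
    for (const std::string& w : {"ball", "bid", "byte", "car", "cat"})
        insert(root, w);
    std::cout << contains(root, "cat") << ' '   // 1
              << contains(root, "ca") << '\n';  // 0 (a prefix, not a key)
}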
For doing text searches you would typically represent documents as a token vector of lexemes with weights (perhaps based on occurrence frequency), and then search against that to get a ranking of documents against a particular search vector. There are a number of standard libraries to do this, which I would suggest using rather than writing your own, particularly for removing stopwords, dealing with synonyms and stemming.
We can use tries for substring searching in linear time, without preprocessing the string every time. You can find a good tutorial on suffix tree generation here:
Ukkonen's suffix tree algorithm in plain English?
As the other examples have said, a trie is useful because it provides fast string look-ups (or, more generally, look-ups for any sequence). Some examples of where I've used tries:
My answer to this question uses a (slightly modified) trie for matching sentences: it is a trie based on a sequence of words, rather than a sequence of characters. (The other answers to that question probably demonstrate the trie in action more clearly.)
I've also used a trie in a game which had a large number of rooms with names (the total number and the names were defined at run time); each of these names had to be unique, and one had to be able to search for a room with a given name quickly. A hash table could also have been used, but in some ways a trie is simpler to implement and faster when using strings. (My trie implementation ended up being ~50 lines of C.)
The trie tag probably has many more examples.
There are multiple ways to use tries. The typical example is a lookup such as the one you have presented. However, tries can also be used to fully index a complete text. Either you use Ukkonen's suffix tree algorithm to produce a suffix trie, or you explicitly construct the suffix trie by storing suffixes (much slower than Ukkonen's algorithm, but also much simpler). As this is preprocessing that needs to be done only once, speed is not that crucial.
For this you would just take your text, insert the full text, then chop off the first letter, insert the resulting text, chop off the second letter, insert...
So if we have the text "The Text" we would insert the following set:
{"The Text", "he Text", "e Text", " Text", "Text", "ext", "xt", "t"}
In the resulting suffix trie we can easily search for any substring (every substring of the text is a prefix of some suffix). This is also space efficient, because we do not need to store the whole string at every node, since common prefixes are stored only once.
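As a naive illustration of the suffix idea (using a plain sorted set of suffixes rather than an actual trie, purely to keep the sketch short; a real implementation would insert the suffixes into a trie, or build a suffix tree with Ukkonen's algorithm):

#include <iostream>
#include <set>
#include <string>

int main() {
    // Generate every suffix of the text, as described above.
    std::string text = "The Text";
    std::set<std::string> suffixes;
    for (std::size_t i = 0; i < text.size(); ++i)
        suffixes.insert(text.substr(i));

    // A query is a substring of the text exactly when it is a prefix
    // of some suffix.
    std::string q = "ex";
    auto it = suffixes.lower_bound(q);   // first suffix >= q
    bool found = it != suffixes.end() && it->compare(0, q.size(), q) == 0;
    std::cout << std::boolalpha << found << '\n';   // true ("ex" is in "Text")
}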
If you need to store much longer strings space efficiently it is best not only to store prefixes together but also suffixes. In that case you could build up a directed acyclic word graph (DAWG), which is very similar to a trie in conception.
So a trie in that sense allows finding arbitrary substrings, including partial words. If you are only interested in storing words, a different data structure should be used, for example an inverted list (if word order is important) or a vector space based retrieval algorithm (in case word order does not matter).

C++ - How to efficiently find out if any string in a vector can be assembled from a set of letters

I am implementing a text-based version of Scrabble for a college project.
I have a vector containing around 400K strings (my dictionary), and, at some point in every turn, I'm going to have to check if there's still a word in the dictionary that can be formed with the pieces in the player's hand. I'm checking if the player has any moves left; if not, it's game over for the player in question...
My only solution so far is iterating through the strings, one by one, and using a subroutine I have to check if the string in question can be formed from the player's pieces. I'll implement a quick fail that checks whether the user has any vowels, but it'll still be woefully inefficient.
The text-file containing the dictionary is already alphabetically ordered, so the vector is sorted.
Any suggestions?
A problem was presented in the comments below: any suggestions on how to take the letters already on the board into account?
Without giving you any specific code (since this is homework after all), one general approach to consider is to map from the sorted letters in the word to the actual legal words.
That is to say, if your dictionary file had only the words ape, gum, and mug, your data structure would look like:
aep -> ape
gmu -> gum, mug
Then you can simply go through the sorted subsets of the player's letters and quickly identify whether each key exists in the map.
You pay a little bit of processing time setting up the dictionary at startup, but then you only have to perform a few quick lookups rather than iterating through the whole list every time.
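A rough sketch of that map and the subset lookup (illustrative names only, not specific code for the assignment):

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Key each word by its letters in sorted order, so all anagrams share
// one map entry: "gmu" -> {gum, mug}.
std::string sortedKey(std::string s) {
    std::sort(s.begin(), s.end());
    return s;
}

int main() {
    std::vector<std::string> dict = {"ape", "gum", "mug"};
    std::map<std::string, std::vector<std::string>> anagrams;
    for (const std::string& w : dict) anagrams[sortedKey(w)].push_back(w);

    // Check every subset of the rack (2^n keys for n tiles; 128 for 7).
    std::string rack = "gmupx";
    for (unsigned mask = 1; mask < (1u << rack.size()); ++mask) {
        std::string subset;
        for (std::size_t i = 0; i < rack.size(); ++i)
            if (mask & (1u << i)) subset += rack[i];
        auto it = anagrams.find(sortedKey(subset));
        if (it != anagrams.end())
            for (const std::string& w : it->second)
                std::cout << w << '\n';   // gum, mug
    }
}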
Sounds like a variation of the subset sum problem: http://en.wikipedia.org/wiki/Subset_sum_problem
Maybe some of the described algorithms would help you.
There have been numerous papers and questions on Scrabble on this site.
There are many strategies available too.
The representation of your dictionary is inadequate; there are much cleverer methods available. For example, check what a Trie is on Wikipedia.
Using this you can implement a backtracking algorithm to quickly determine which words you can form.
{'as', 'ape', 'gum'}
Trie:
void -a-> (n) -p-> (n) -e-> (y)
              -s-> (y)
     -g-> (n) -u-> (n) -m-> (y)
Where 'n' means that it does not form a word and y means that it does.
Now, you just have to walk the Trie, keeping in mind what letters are available.
Say that you have {'a', 'p', 'g', 'm', 'u'}:
1. I have an 'a' (but 'a' is not a word)
2. I have a 'p' (but 'ap' is not a word)
3. I don't have any 'e' so I can't go further, let's backtrack
4. I don't have any 's' so...
5. I have a 'g', but it's not a word
6. I have a 'u', but 'gu' is not a word
7. I have a 'm' and 'gum' is a word, I store it somewhere, I can't go further
The point is to maintain a set of the available letters: when you take the -a-> branch, you remove 'a' from this set; then when you take -a-> in reverse (while backtracking), you add it back to the set.
This structure is much more space efficient; it actually models a finite automaton which recognizes the language of your dictionary, instead of blindly storing all the words.
The runtime should be much faster as well, since you'll never go deep into the tree structure (you only have 7 letters available).
It's certainly not what I'd do, since it does not take the board into account :p
Blank (' ') letters mean you can take any of the available branches. You don't need to use a blank if you have the required letter.
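Here is a rough sketch of that walk (my own illustrative code; it ignores blanks and the board):

#include <iostream>
#include <map>
#include <memory>
#include <string>

// TrieNode is the usual character trie; counts[] holds how many of each
// letter the player still has (a 27th slot could hold blanks; omitted).
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWord = false;
};

void insert(TrieNode& root, const std::string& w) {
    TrieNode* n = &root;
    for (char c : w) {
        auto& child = n->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        n = child.get();
    }
    n->isWord = true;
}

void search(const TrieNode& node, int counts[26], std::string& prefix) {
    if (node.isWord) std::cout << prefix << '\n';
    for (const auto& [c, child] : node.children) {
        if (counts[c - 'a'] == 0) continue;  // letter not available
        counts[c - 'a']--;                   // take the branch ...
        prefix.push_back(c);
        search(*child, counts, prefix);
        prefix.pop_back();                   // ... and backtrack
        counts[c - 'a']++;
    }
}

int main() {
    TrieNode root;
    for (const std::string& w : {"as", "ape", "gum"}) insert(root, w);
    int counts[26] = {};
    for (char c : std::string("apgmu")) counts[c - 'a']++;
    std::string prefix;
    search(root, counts, prefix);   // prints: gum (matching the walk above)
}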
You could also store the strings with their characters sorted in ASCIIbetical order in a std::set, then sort the player's letters into the same order and search the set for each subsequence of the player's sorted letters.
How about keeping pairs {word from the dictionary, string consisting of the same letters but in ascending (sorted) order}?
Then sort the vector of those pairs based on the second string, and compare using binary search with a string consisting of the sorted letters from the player's hand.
There are some good answers here already, and I think a trie is probably the right way to go, but this is an interesting problem so I'll toss in my two cents' worth...
The naive approach would be to generate all permutations of all distinct subsets of the available letters, and then search for each potential word in the dictionary. The problem is that, while it's not hard to do this, there is a surprisingly large number of potential words, and most of them are invalid.
On the positive side, checking the dictionary can be sped up with a binary search or something similar. On the negative side, you'd be doing this so many times that the program would grind to a halt for long lists of letters.
We definitely need to preprocess the dictionary to make it more useful, and what we really need is to have a way to quickly rule out most of the potential matches, even if the method has occasional false positives.
One way to do this would be to represent which letters a word uses in a bit map. In other words, precalculate a 32-bit number for each word in the dictionary, where each bit is set if the corresponding letter of the alphabet is used in the word at least once. This would allow you to find all potential words by doing a linear scan of the dictionary and keeping only the ones that use only letters you have available. I suspect that, with a bit of cleverness and indexing, you can do better than linear.
Of the candidates you find, some will require more instances of a letter than you have available, so these will be false positives. That means you need to do a final check on all of the candidates you generated to eliminate the almost-hits. There are many ways to do this, but one of the simplest is to go through your list of letters and replace the first occurrence of that letter in the potential word with a dash. When you're done, if the potential word has anything but dashes, it's a failure. A more elegant solution, though not necessarily faster, would be to generate an array of letter frequencies and compare them.
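A short illustrative sketch of the mask computation and the filtering scan; the per-letter count check that removes the false positives is only hinted at in the comments:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Bit i of the mask is set when letter ('a' + i) appears in the word
// at least once (lowercase input assumed for brevity).
std::uint32_t letterMask(const std::string& word) {
    std::uint32_t mask = 0;
    for (char c : word) mask |= 1u << (c - 'a');
    return mask;
}

int main() {
    std::vector<std::string> dict = {"ape", "gum", "mug", "mum"};
    std::string rack = "gmup";
    std::uint32_t rackMask = letterMask(rack);

    for (const std::string& w : dict) {
        // Candidate iff the word uses no letter outside the rack.
        // "mum" passes but is a false positive (it needs two m's and the
        // rack has one); a per-letter frequency check would catch it.
        if ((letterMask(w) & ~rackMask) == 0)
            std::cout << w << '\n';   // gum, mug, mum
    }
}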
Again, I think tries are probably the way to go, but I hope these ideas are useful to you.
edit
Let me toss out an example of how you could do better than a full linear search on the initial search: use the radix. Keep a simple index that lets you look up the first word that starts with a given letter. Then, when doing the search, skip over all words that start with a letter that you don't have. This is not a gigantic speedup, but it's an improvement.

Prefix search in a radix tree/patricia trie

I'm currently implementing a radix tree/patricia trie (whatever you want to call it). I want to use it for prefix searches in a dictionary on a severely underpowered piece of hardware. It's supposed to work more or less like auto-completion, i.e. showing a list of words that the typed prefix matches.
My implementation is based on this article, but the code therein doesn't include prefix searches, though the author says:
[...] Say you want to enumerate all the nodes that have keys with a common prefix "AB". You can perform a depth first search starting at that root, stopping whenever you encounter back edges.
But I don't see how that is supposed to work. For example, if I build a radix tree from these words:
illness
imaginary
imagination
imagine
imitation
immediate
immediately
immense
in
I will get the exact same "best match" for the prefixes "i" and "in" so that it seems difficult to me to gather all matching words just by traversing the tree from that best match.
Additionally, there is a radix tree implementation in Java that has an implemented prefix search in RadixTreeImpl.java. That code explicitly checks all nodes (starting from a certain node) for a prefix match - it actually compares bytes.
Can anyone point me to a detailed description on implementing a prefix search on radix trees? Is the algorithm used in the Java implementation the only way to do it?
Think about what your trie encodes. At each node, you have the path that led you to that node, so in your example, you start at Λ (that's a capital Lambda; this Greek font kind of sucks), the root node corresponding to an empty string. Λ has children for each letter used, so in your data set, you have one branch, for "i".
Λ
Λ→"i"
At the "i" node, there are two children, one for "m" and one for "n". The next letter is "n", so you take that,
Λ→"i"→"n"
and since the only word that starts "i","n" in your data set is "in", there are no children from "n". That's a match.
Now, let's say the data set, instead of having "in", had "infindibulum". (What SF I'm referencing is left as an exercise.) You'd still get to the "n" node the same way, but then if the next letter you get is "q", you know the word doesn't appear in your data set at all, because there's no "q" branch. At that point, you say "okay, no match." (Maybe you then start adding the word, maybe not, depending on the application.)
But if the next letter is "f", you can keep going. You can short-circuit that with a little craft, though: once you reach a node that represents a unique path, you can hang the whole string off that node. When you get to that node, you know that the rest of the string must be "findibulum", so you've used the prefix to match the whole string, and you return it.
How would you use that? In a lot of non-UNIX command interpreters, like the old VAX DCL, you could use any unique prefix of a command. So, the equivalent of ls(1) was DIRECTORY, but no other command started with DIR, so you could type DIR and that was as good as typing the whole word. If you couldn't remember the correct command, you could type just 'D' and hit (I think) ESC; the DCL CLI would return all the commands that started with D, which it could search extremely fast.
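For illustration, here is a sketch of that walk-then-enumerate prefix search on a plain (uncompressed) trie; a real radix tree stores multi-character edge fragments, but the two phases (walk down the prefix, then collect the whole subtree below it) are the same:

#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::map<char, std::unique_ptr<Node>> children;
    bool isWord = false;
};

void insert(Node& root, const std::string& w) {
    Node* n = &root;
    for (char c : w) {
        auto& child = n->children[c];
        if (!child) child = std::make_unique<Node>();
        n = child.get();
    }
    n->isWord = true;
}

// Depth-first search below a node, accumulating complete words.
void collect(const Node& n, std::string& word, std::vector<std::string>& out) {
    if (n.isWord) out.push_back(word);
    for (const auto& [c, child] : n.children) {
        word.push_back(c);
        collect(*child, word, out);
        word.pop_back();
    }
}

std::vector<std::string> completions(const Node& root, std::string prefix) {
    const Node* n = &root;
    for (char c : prefix) {                  // phase 1: walk the prefix
        auto it = n->children.find(c);
        if (it == n->children.end()) return {};
        n = it->second.get();
    }
    std::vector<std::string> out;
    collect(*n, prefix, out);                // phase 2: enumerate the subtree
    return out;
}

int main() {
    Node root;
    for (const std::string& w : {"in", "imagine", "imitation", "immense"})
        insert(root, w);
    for (const std::string& w : completions(root, "im"))
        std::cout << w << '\n';   // imagine, imitation, immense
}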
It turns out the GNU extensions for the standard C++ library include a Patricia trie implementation. It's found under the policy-based data structures extension. See http://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/trie_based_containers.html
An alternative algorithm: Keep It Simple Stupid!
Just make a sorted list of your keywords. When you have a prefix, binary search to find where that prefix would be located in the list. All of your possible completions will be found starting at that index, ready to be accessed in place.
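A quick sketch of that lookup using std::lower_bound:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // The sorted keyword list; lower_bound finds the first word >= the
    // prefix, and all completions then sit contiguously from there,
    // until a word no longer starts with the prefix.
    std::vector<std::string> words = {
        "illness", "imagine", "imitation", "immense", "in"};
    std::string prefix = "im";

    auto it = std::lower_bound(words.begin(), words.end(), prefix);
    for (; it != words.end() &&
           it->compare(0, prefix.size(), prefix) == 0; ++it)
        std::cout << *it << '\n';   // imagine, imitation, immense
}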
This algorithm will require only 5% of the code of a Patricia trie and will be easy to maintain, understand, and update. It is almost certain this simple list search will be more efficient as well.
The only downside is that if you have huge numbers of long keywords with similar prefixes, a trie can save some storage, since it doesn't need to keep the full prefix for every entry. In practice, if you have fewer than a few million words, this is not a saving, because the pointer overhead of the tree will dominate. That saving matters more for applications like searching databases of DNA strings with millions of characters, not text keywords.
Another alternative algorithm is a ternary search tree (more memory-efficient): https://github.com/varunpant/TernaryTree/tree/master/TernaryTree