How to create a DAWG? [closed] - c++

How can a DAWG be created? I have found that there are two approaches: converting a trie into a DAWG, or building the DAWG directly. Which one is easier? Can you please elaborate on the two and provide some links?

One way to think about the DAWG is as a minimum-state DFA for all of the words in your word list. As a result, the traditional algorithm for constructing a DAWG is the following:
Start off by constructing a trie for the collection of words.
Add a new "dead" node to the trie, with edges from itself to itself on every letter.
For each node that is missing a transition on some letter, add a transition on that letter from that node to the new dead node.
(At this point, you now have a (probably non-minimum) DFA for the set of words.)
Minimize the DFA using the standard algorithm for DFA state minimization.
Once you have done this, you will be left with a DAWG for the set of words you are interested in.
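As a rough illustration (not part of the original answer), here is a minimal C++ sketch that builds a trie for a small word list and then merges structurally identical subtrees bottom-up; for an acyclic automaton this merging is equivalent to DFA minimization, so the explicit dead-state completion step can be skipped. The node layout and the signature encoding are illustrative choices, not a fixed API.

    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Node {
        bool accepting = false;
        std::map<char, Node*> edges;   // ordered, so signatures are canonical
    };

    // Insert a word into the trie rooted at 'root'.
    void insert(Node* root, const std::string& word) {
        Node* cur = root;
        for (char c : word) {
            if (!cur->edges.count(c)) cur->edges[c] = new Node();
            cur = cur->edges[c];
        }
        cur->accepting = true;
    }

    // Bottom-up merge of equivalent subtrees: two nodes are equivalent iff they
    // agree on acceptance and have the same labelled edges to equivalent targets,
    // so a signature built from canonical child ids identifies the class.
    Node* merge(Node* node,
                std::unordered_map<std::string, Node*>& registry,
                std::unordered_map<Node*, int>& id) {
        std::string sig = node->accepting ? "1" : "0";
        for (auto& [label, child] : node->edges) {
            child = merge(child, registry, id);   // redirect edge to canonical node
            sig += label;
            sig += std::to_string(id[child]);
            sig += ',';
        }
        auto [it, inserted] = registry.emplace(sig, node);
        if (inserted) {
            int newId = static_cast<int>(id.size());
            id[node] = newId;
        }
        return it->second;                        // canonical representative
    }

    int main() {
        std::vector<std::string> words = {"tap", "taps", "top", "tops"};
        Node* root = new Node();
        for (const auto& w : words) insert(root, w);

        std::unordered_map<std::string, Node*> registry;
        std::unordered_map<Node*, int> id;
        root = merge(root, registry, id);
        std::cout << "DAWG has " << registry.size() << " states\n";
        // (Duplicate trie nodes are leaked here for brevity.)
    }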
The runtime of this algorithm is as follows. Constructing the initial DFA can be done by constructing a trie for all the original words (which takes time O(n), where n is the total number of characters in all the input strings), then filling in the missing transitions (which takes time O(n|Σ|), where |Σ| is the number of different characters in your alphabet). From there, the minimization algorithm runs in time O(n²|Σ|). This means that the overall runtime for the algorithm is O(n²|Σ|).
To the best of my knowledge, there is no straightforward algorithm for incrementally constructing DAWGs. Typically, you would build a DAWG for a set of words only if you already had all the words in advance. Intuitively, this is because inserting a new word that shares suffixes with words already in the DAWG might require a lot of restructuring to make certain old accepting states non-accepting and vice versa. More formally, inserting a new word might dramatically change the equivalence classes of the DFA's distinguishability relation, which might require substantial changes to the DFA's structure.
Hope this helps!

Related

2D String Matching: Baker-Bird Algorithm [closed]

I want to find a submatrix in a huge matrix, so I googled and found the Baker-Bird algorithm.
Unfortunately, I cannot understand it very well, and tutorials about it are rare.
I cannot find any example code to study.
So what I want to ask is: is there some simple example code or pseudocode that I can study?
Thanks in advance.
OK, from studying the link Kent Munthe Caspersen gave (http://www.stringology.org/papers/Zdarek-PhD_thesis-2010.pdf, page 30 onwards), I understand how the Baker-Bird algorithm works.
For a submatrix to appear in a matrix, its columns must all match individually. You can scan down each column looking for matches, and then scan this post-processed matrix for rows indicating columns consecutively matching at the same spot.
Say we are looking for submatrices of the format
a c a
b b a
c a b
We search down each column for the column matches 'abc', 'cba' or 'aab', and in a new matrix we mark the end of each complete match in the corresponding cell - for example with A, B or C. (What the algorithm in the paper does is construct a state machine that transitions to a new state based on the old state number and the next letter, and then look for states indicating that we have just matched a column. This is more complex but more efficient, as it only has to scan each column once instead of once per distinct column pattern we are interested in.)
Once we have done this, we scan along each row looking for successive values indicating successive columns matched - in this case, we're looking for the string 'ABC' in a matrix row. If we find it, there was a sub-array match here.
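To make those two phases concrete, here is a rough C++ sketch (not from the thesis). It assumes the pattern's columns are pairwise distinct and uses a naive per-column scan in place of the state-machine approach mentioned above (so it is slower), labelling each cell with the index of the pattern column whose match ends there and then looking for the run 0, 1, ..., c-1 along each row.

    #include <iostream>
    #include <string>
    #include <vector>

    using Matrix = std::vector<std::string>;

    int main() {
        Matrix text = {
            "xacax",
            "xbbax",
            "xcabx",
            "xxxxx"
        };
        Matrix pat = {
            "aca",
            "bba",
            "cab"
        };
        int R = static_cast<int>(text.size()), C = static_cast<int>(text[0].size());
        int r = static_cast<int>(pat.size()),  c = static_cast<int>(pat[0].size());

        // Phase 1: label each cell with the index of the pattern column whose
        // match ends there (scanning down each text column), or -1.
        std::vector<std::vector<int>> label(R, std::vector<int>(C, -1));
        for (int j = 0; j < C; ++j) {
            for (int i = r - 1; i < R; ++i) {
                for (int k = 0; k < c; ++k) {          // try every pattern column
                    bool ok = true;
                    for (int d = 0; d < r && ok; ++d)
                        ok = (text[i - r + 1 + d][j] == pat[d][k]);
                    if (ok) { label[i][j] = k; break; }
                }
            }
        }

        // Phase 2: in each row of labels, look for the run 0,1,...,c-1, which
        // means c consecutive columns matched at the same vertical position.
        for (int i = r - 1; i < R; ++i) {
            for (int j = 0; j + c <= C; ++j) {
                bool ok = true;
                for (int k = 0; k < c && ok; ++k)
                    ok = (label[i][j + k] == k);
                if (ok)
                    std::cout << "match with top-left corner at ("
                              << i - r + 1 << ", " << j << ")\n";
            }
        }
    }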
Speedups are attained by using the state-machine approach described above, and also by the choice of string-searching algorithm, of which there are many with different time complexities: http://en.wikipedia.org/wiki/String_searching_algorithm
(Note that the entire algorithm can, of course, be flipped to do rows first and then columns; it's identical.)
What about the example in this PhD thesis p.31-33:
http://www.stringology.org/papers/Zdarek-PhD_thesis-2010.pdf

How to determine what bin a float should be in? C++ [closed]

I have an array of floats, Float_t xbins[41], that defines 40 bins, i.e. ranges of floats.
For example, y is in bin 7 if y > xbins[7] && !(y > xbins[8]), i.e. xbins[7] < y <= xbins[8].
How do I determine what bin a given float should belong to without having 40 if statements?
Please answer in C++ as I don't speak other languages.
If the array is sorted, then do a binary search to locate the correct bin: use std::sort first if the edges are not already sorted, then something like std::lower_bound to locate the bin. You'll also need to ensure that operator< is implemented correctly for Float_t.
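For instance, a minimal sketch of that approach (using plain float in place of Float_t, made-up bin edges, and the question's convention that bin i holds xbins[i] < y <= xbins[i+1]):

    #include <algorithm>
    #include <iostream>

    int main() {
        // 5 edges define 4 bins; the real xbins has 41 edges / 40 bins.
        float xbins[] = {0.0f, 1.5f, 3.0f, 7.5f, 10.0f};
        const int nEdges = sizeof(xbins) / sizeof(xbins[0]);

        float y = 4.2f;
        // First edge that is >= y; bin i satisfies xbins[i] < y <= xbins[i+1].
        const float* it = std::lower_bound(xbins, xbins + nEdges, y);
        int bin = static_cast<int>(it - xbins) - 1;

        if (bin < 0 || bin >= nEdges - 1)
            std::cout << y << " is under/overflow\n";
        else
            std::cout << y << " falls in bin " << bin << "\n";
    }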
As it turned out, the bins are not uniformly spaced but have integer bounds, so probably the fastest method is an (inverse) lookup table, which apparently needs only about 100 entries. One then basically needs two comparisons, for the lower and upper bounds.
If the bin bounds are derived from a formula, it may be possible to write an inverse formula that outperforms the LUT method.
For the generic case, binary search is the way to go -- and even that can be improved a bit by doing linear interpolation instead of splitting the range exactly in half. The expected speed (if the data is not pathological) would be O(log log n), compared to O(log n) for binary search.
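A sketch of that inverse lookup-table idea, assuming some made-up integer bin edges (the actual edges are not shown in the question): because every edge is an integer, each half-open interval (m-1, m] falls entirely inside one bin, so a table indexed by ceil(y) answers the query after the two range comparisons.

    #include <cmath>
    #include <iostream>
    #include <vector>

    int main() {
        // Made-up integer bin edges; bin i holds xbins[i] < y <= xbins[i+1].
        std::vector<float> xbins = {0, 2, 5, 9, 20, 50, 100};
        const int lo = static_cast<int>(xbins.front());
        const int hi = static_cast<int>(xbins.back());

        // Inverse table: lut[m] = bin containing the interval (m-1, m].
        std::vector<int> lut(hi + 1, -1);
        for (int m = lo + 1, bin = 0; m <= hi; ++m) {
            while (m > xbins[bin + 1]) ++bin;   // advance to the bin containing m
            lut[m] = bin;
        }

        float y = 42.5f;
        int bin = -1;
        if (y > lo && y <= hi)                  // the two range comparisons
            bin = lut[static_cast<int>(std::ceil(y))];
        std::cout << y << " -> bin " << bin << "\n";
    }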

test/training effect on classifier results [closed]

I'm struggling to understand the effect of the training/test split on my correctly-classified-instances result.
For example, with naive Bayes, if I use a larger test portion in a percentage split, does the algorithm become more reliable?
The point of splitting your entire data set into training and test is that the model you want to learn (naive Bayes or otherwise) should reflect the true relationship between cause and effect (features and prediction) and not simply the data. For example, you can always fit a curve perfectly to a number of data points, but doing that will likely make it useless for the prediction you were trying to make.
By using a separate test set, the learned model is tested on unseen data. Ideally, the error (or whatever you're measuring) on training and test set would be about the same, suggesting that your model is reasonably general and not overfit to the training data.
If, in your case, decreasing the size of the training set increases performance on the test set, it suggests that the learned model is too specific and does not generalise. Rather than changing the training/test split, however, you should tweak the parameters of your learner. You might also want to consider using cross-validation instead of a simple training/test split, as it will provide more reliable performance estimates.
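If it helps to see the mechanics, here is a hypothetical sketch of the k-fold cross-validation bookkeeping in C++; trainModel and accuracy are stand-in names for whatever classifier and metric you actually use (e.g. naive Bayes and percent correctly classified), not a real API.

    #include <iostream>
    #include <numeric>
    #include <vector>

    struct Example { std::vector<double> features; int label; };

    // Placeholders for the real learner -- assumptions, not a real API.
    struct Model {};
    Model trainModel(const std::vector<Example>&) { return {}; }
    double accuracy(const Model&, const std::vector<Example>&) { return 0.0; }

    int main() {
        std::vector<Example> data(100);      // pretend this is the full data set
        const int k = 10;
        std::vector<double> foldAccuracy;

        // Each fold serves once as the held-out test set.
        for (int fold = 0; fold < k; ++fold) {
            std::vector<Example> train, test;
            for (std::size_t i = 0; i < data.size(); ++i)
                (static_cast<int>(i % k) == fold ? test : train).push_back(data[i]);

            Model m = trainModel(train);
            foldAccuracy.push_back(accuracy(m, test));
        }

        double mean = std::accumulate(foldAccuracy.begin(), foldAccuracy.end(), 0.0) / k;
        std::cout << "mean cross-validated accuracy: " << mean << "\n";
    }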

Arrays and Caching [closed]

Suppose I have a char array that is 8 billion elements long. Would breaking it into smaller arrays increase performance by improving caching? Basically, I will iterate over the array and do some comparisons. If not, what is the optimal way of using an array of such length?
I am reading a file in binary form into an array, and will be performing binary comparisons on different parts of the file.
8 GB worth of data will inevitably ruin data locality, so one way or the other you either have to manage your memory in smaller pieces or your OS will end up swapping virtual memory to disk.
There is, however, an alternative: the so-called mmap. Essentially this allows you to map a file into virtual memory space; your OS then takes on the task of accessing it and loading the necessary pages into RAM, while your access to the file becomes nothing more than simple memory addressing.
Read more about mmap at http://en.wikipedia.org/wiki/Mmap
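For example, a minimal POSIX sketch of what that looks like (error handling kept short, the file name is hypothetical, and mapping an 8 GB file requires a 64-bit process; on Windows the rough equivalent is CreateFileMapping/MapViewOfFile):

    #include <fcntl.h>      // open
    #include <sys/mman.h>   // mmap, munmap
    #include <sys/stat.h>   // fstat
    #include <unistd.h>     // close
    #include <cstdio>

    int main() {
        const char* path = "huge.bin";          // hypothetical 8 GB input file
        int fd = open(path, O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

        // Map the whole file read-only; the OS pages it in on demand.
        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { std::perror("mmap"); return 1; }

        const unsigned char* bytes = static_cast<const unsigned char*>(base);

        // Example scan: count zero bytes -- ordinary pointer arithmetic, while
        // the kernel handles loading and evicting pages behind the scenes.
        long long zeros = 0;
        for (off_t i = 0; i < st.st_size; ++i)
            if (bytes[i] == 0) ++zeros;
        std::printf("zero bytes: %lld\n", zeros);

        munmap(base, st.st_size);
        close(fd);
    }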
If you are going to do this once then just run through it. The programming effort may not be worth the time gained.
I am assuming you want to do this again and again, which is why you want to optimize it. It would surely help to know whether your iteration and comparisons need to be done sequentially, etc. Without some problem-domain input it is difficult to suggest a specific optimization here.
If it can be done in parallel and you have to do it multiple times I suggest you take a look at MapReduce techniques to solve this.

C++ Algorithm stability [closed]

How can I tell whether a sorting algorithm is stable or not?
Also, how does this bucket sort compare to merge sort, quicksort, bubble sort, and insertion sort?
At first glance it would seem that if your queues are FIFO, then it is stable. However, I think there is some context from class or other homework that would help you make a more solid determination.
From wikipedia:
Stability
Stable sorting algorithms maintain the relative order of records with equal keys. If all keys are different then this distinction is not necessary. But if there are equal keys, then a sorting algorithm is stable if whenever there are two records (let's say R and S) with the same key, and R appears before S in the original list, then R will always appear before S in the sorted list. When equal elements are indistinguishable, such as with integers, or more generally, any data where the entire element is the key, stability is not an issue. However, assume that the following pairs of numbers are to be sorted by their first component:
http://en.wikipedia.org/wiki/Sorting_algorithm#Stability
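To make the FIFO point concrete, here is a small sketch (not from any assignment) that bucket-sorts key/value pairs by their first component, appending to each bucket and then reading the buckets back front-to-back; because each bucket is FIFO, records with equal keys keep their original relative order, which is exactly the stability property quoted above.

    #include <iostream>
    #include <utility>
    #include <vector>

    int main() {
        // Pairs sorted only by their first component (small integer keys).
        std::vector<std::pair<int, char>> items = {
            {4, 'a'}, {3, 'b'}, {3, 'c'}, {5, 'd'}, {4, 'e'}
        };
        const int maxKey = 5;

        // One FIFO bucket per possible key: push_back, then read front-to-back.
        std::vector<std::vector<std::pair<int, char>>> buckets(maxKey + 1);
        for (const auto& it : items)
            buckets[it.first].push_back(it);

        std::vector<std::pair<int, char>> sorted;
        for (const auto& b : buckets)
            for (const auto& it : b)
                sorted.push_back(it);

        // (3,'b') still precedes (3,'c') and (4,'a') precedes (4,'e'): stable.
        for (const auto& it : sorted)
            std::cout << "(" << it.first << "," << it.second << ") ";
        std::cout << "\n";
    }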
As far as comparing it to other algorithms goes, Wikipedia has a concise entry on that:
http://en.wikipedia.org/wiki/Bucket_sort#Comparison_with_other_sorting_algorithms
Also: https://stackoverflow.com/a/7341355/1416221