2D String Matching: Baker-Bird Algorithm [closed] - c++

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
I want to find a submatrix in a huge matrix, so I googled and found the Baker-Bird algorithm.
Unfortunately, I don't understand it very well, and tutorials about it are rare.
I couldn't find any example code to study.
So my question is: is there some simple example code or pseudocode that I can study?
Thanks in advance.

Ok, from studying the link Kent Munthe Caspersen gave ( http://www.stringology.org/papers/Zdarek-PhD_thesis-2010.pdf page 30 on), I understand how the Baker-Bird Algorithm works.
For a submatrix to appear in a matrix, its columns must all match individually. You can scan down each column looking for matches, and then scan this post-processed matrix for rows indicating columns consecutively matching at the same spot.
Say we are looking for submatrices of the format
a c a
b b a
c a b
We search down each column for the column patterns 'abc', 'cba' and 'aab', and in a new matrix we mark the cell where each complete match ends, for example with A, B or C respectively. (The algorithm in the thesis instead constructs a state machine that moves to a new state based on the current state number and the next letter, and then looks for states indicating that a column has just been matched. That is more complex but more efficient, as it only has to scan each column once instead of once per column pattern we are interested in.)
Once we have done this, we scan along each row looking for successive values indicating that successive columns matched; in this case, we're looking for the string 'ABC' in a matrix row. If we find it, there is a submatrix match whose bottom-right corner is the cell holding the final 'C'.
Speedups come from the state machine approach described above, and also from the choice of string searching algorithm for the row scan; there are many, with different time complexities: http://en.wikipedia.org/wiki/String_searching_algorithm
(Note that the entire algorithm can, of course, be flipped to process rows first and then columns; it is otherwise identical.)
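The two passes described above can be sketched in C++ roughly as follows. This is a simplified illustration, not the algorithm from the thesis: it compares column segments directly instead of building the state machine, so it is slower than real Baker-Bird. The function name findSubmatrix and the representation (a vector of equal-length row strings) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Returns the (row, col) of the top-left corner of every occurrence of `pat`
// in `text`. Assumption: all rows within each matrix have the same length.
std::vector<std::pair<std::size_t, std::size_t>>
findSubmatrix(const std::vector<std::string>& text,
              const std::vector<std::string>& pat) {
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    if (text.empty() || pat.empty()) return hits;
    const std::size_t n = text.size(), w = text[0].size();
    const std::size_t m = pat.size(), k = pat[0].size();
    if (k == 0 || m > n || k > w) return hits;

    // Give every distinct pattern column a label; equal columns share a label.
    std::map<std::string, int> labelOf;
    std::vector<int> rowPattern;  // labels of the pattern columns, left to right
    for (std::size_t c = 0; c < k; ++c) {
        std::string col;
        for (std::size_t r = 0; r < m; ++r) col += pat[r][c];
        auto it = labelOf.emplace(col, static_cast<int>(labelOf.size())).first;
        rowPattern.push_back(it->second);
    }

    // Pass 1: marks[i][j] is the label of the pattern column that ends at
    // text cell (i, j), or -1 if no pattern column ends there.
    std::vector<std::vector<int>> marks(n, std::vector<int>(w, -1));
    for (std::size_t j = 0; j < w; ++j) {
        for (std::size_t i = m - 1; i < n; ++i) {
            std::string seg;
            for (std::size_t r = i + 1 - m; r <= i; ++r) seg += text[r][j];
            auto it = labelOf.find(seg);
            if (it != labelOf.end()) marks[i][j] = it->second;
        }
    }

    // Pass 2: look for the label sequence in each row of marks.
    for (std::size_t i = m - 1; i < n; ++i) {
        auto first = marks[i].begin();
        while (true) {
            auto it = std::search(first, marks[i].end(),
                                  rowPattern.begin(), rowPattern.end());
            if (it == marks[i].end()) break;
            hits.emplace_back(i + 1 - m,
                              static_cast<std::size_t>(it - marks[i].begin()));
            first = it + 1;
        }
    }
    return hits;
}
```

Calling findSubmatrix with the 3x3 pattern from the example returns the top-left corner of each place it occurs. Swapping pass 1 for an Aho-Corasick style automaton and pass 2 for a linear-time string search would bring it closer to the algorithm described in the thesis.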

What about the example in this PhD thesis p.31-33:
http://www.stringology.org/papers/Zdarek-PhD_thesis-2010.pdf

Related

vector containing doubles [closed]

Closed 9 years ago.
I need to calculate the mean, median and s.d. of the values inside the vector. I can sort the vector to find out the median but is there an easier way to find the mean and standard deviation rather than adding stuff up?
You can find the median with std::nth_element. Contrary to (apparently) popular belief, this is normally faster than sorting, then finding the middle element -- it's normally O(N) (linear) where sorting is normally O(N log N).
To add up the elements for the mean, you can use std::accumulate, something like:
double total = std::accumulate(std::begin(v), std::end(v), 0.0);
[Note: depending on how old your compiler is, you may need to use v.begin() and v.end() instead of std::begin(v) and std::end(v).]
Computing the variance has been covered in a previous question. The standard deviation is simply the square root of the variance.
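Putting those pieces together, a minimal sketch might look like this. The names Stats and summarize are made up for the example, the vector is assumed to be non-empty, and the standard deviation computed here is the population one (divide by n); use n - 1 instead if you want the sample standard deviation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct Stats {
    double mean;
    double median;
    double stddev;  // population standard deviation
};

// Assumes v is non-empty. Takes v by value because std::nth_element
// reorders the elements.
Stats summarize(std::vector<double> v) {
    const std::size_t n = v.size();
    const double mean = std::accumulate(v.begin(), v.end(), 0.0) / n;

    // Variance: average squared distance from the mean.
    double squares = 0.0;
    for (double x : v) squares += (x - mean) * (x - mean);
    const double stddev = std::sqrt(squares / n);

    // Median: nth_element puts the middle element in its sorted position,
    // with everything before it no larger than it.
    auto mid = v.begin() + n / 2;
    std::nth_element(v.begin(), mid, v.end());
    double median = *mid;
    if (n % 2 == 0) {
        // Even count: average with the largest element of the lower half.
        median = (median + *std::max_element(v.begin(), mid)) / 2.0;
    }
    return {mean, median, stddev};
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 this gives a mean of 5, a median of 4.5 and a standard deviation of 2.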
To find the mean, you're simply going to have to add up the vector's contents. You can find the median without actually sorting the vector first, but an algorithm for calculating the median of an unsorted vector would almost certainly be much more complex than one for a sorted vector. Also, I'm pretty sure that the time to find the median of an unsorted vector is almost certainly going to be more than the combined time of sorting and extracting the median. (If you're doing it just for the technical challenge, I'll write one for you...)
Since you're probably going to have to sort the vector, you could calculate the mean whilst you're sorting.
EDIT: Didn't see the C++ tag!
If you are using a language that offers functional programming tools, you can foldl the vector with the + function and divide by its length for mean.
For stddev, you can use a lambda : x -> (x - mean)^2 and fold the result with a +.
It's not more computationally efficient, but it probably saves a lot in developer time!

How to determine what bin a float should be in? C++ [closed]

Closed 10 years ago.
I have an array of floats Float_t xbins[41] that defines 40 bins i.e. ranges of floats.
E.g. y is in bin 7 if y > xbins[7] && !(y > xbins[8]).
How do I determine what bin a given float should belong to without having 40 if statements?
Please answer in C++ as I don't speak other languages.
If the array is sorted, then do a binary search to locate the correct bin. You'll need a combination of std::sort (if not sorted), then something like std::lower_bound, to locate. You'll need to ensure that operator< is implemented correctly for Float_t.
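As a rough sketch of the binary search approach, assuming the bin edges are already sorted in ascending order and using plain float in place of Float_t (the function name findBin is made up for the example):

```cpp
#include <algorithm>
#include <cstddef>

// Returns the index i such that edges[i] < y <= edges[i + 1], matching the
// rule in the question, or -1 if y lies outside all bins.
// Assumes edges[0 .. count) is sorted in ascending order.
int findBin(const float* edges, std::size_t count, float y) {
    const float* end = edges + count;
    // lower_bound finds the first edge that is >= y, i.e. the upper edge
    // of the bin that y falls into.
    const float* upper = std::lower_bound(edges, end, y);
    if (upper == edges || upper == end) return -1;  // y <= first or y > last edge
    return static_cast<int>(upper - edges) - 1;
}
```

With the array from the question this would be called as findBin(xbins, 41, y).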
Since it turned out that the bins are not uniformly spaced but have integer bounds, probably the fastest method is an inverse lookup table, which apparently would have about 100 entries. You then basically need two comparisons, against the lower and upper bounds.
If the bin bounds are derived from a formula, it may be possible to write an inverse formula that outperforms the LUT method.
For the generic case, binary search is the way to go, and even that can be improved a bit by using linear interpolation instead of always splitting the range exactly in half. The speed (if the data is not pathological) would be O(log log n), compared to O(log n) for binary search.

How to create a DAWG? [closed]

Closed 10 years ago.
How can a DAWG be created? I have found that there are two ways; one is converting a trie into a dawg and the other being creating a new DAWG straight away? Which one is the easiest? Can you please elaborate on the two and provide some links?
One way to think about the DAWG is as a minimum-state DFA for all of the words in your word list. As a result, the traditional algorithm for constructing a DAWG is the following:
Start off by constructing a trie for the collection of words.
Add a new node to the trie with edges from itself to itself on all inputs.
For each missing letter transition in the trie, add a transition from the node that lacks it to this new dead node.
(At this point, you now have a (probably non-minimum) DFA for the set of words.)
Minimize the DFA using the standard algorithm for DFA state minimization.
Once you have done this, you will be left with a DAWG for the set of words you are interested in.
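Here is a small sketch of the trie-then-minimize idea in C++. Note that it deviates from the steps above in two ways, both of which are choices made for this example: it leaves out the explicit dead node (a missing edge simply means "reject"), and instead of general DFA minimization it merges identical subtrees bottom-up, which is enough here because a trie has no cycles. The names Node and Dawg are made up for the example.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One automaton state: accepting flag plus labelled outgoing edges.
struct Node {
    bool accepting = false;
    std::map<char, int> next;  // letter -> index of target node
};

class Dawg {
public:
    explicit Dawg(const std::vector<std::string>& words) {
        // Build a plain trie; index 0 is the root.
        std::vector<Node> trie(1);
        for (const std::string& word : words) {
            int cur = 0;
            for (char ch : word) {
                auto it = trie[cur].next.find(ch);
                if (it != trie[cur].next.end()) {
                    cur = it->second;
                } else {
                    const int fresh = static_cast<int>(trie.size());
                    trie[cur].next[ch] = fresh;
                    trie.push_back(Node{});
                    cur = fresh;
                }
            }
            trie[cur].accepting = true;
        }
        root_ = minimize(trie, 0);
    }

    bool contains(const std::string& word) const {
        int cur = root_;
        for (char ch : word) {
            auto it = nodes_[cur].next.find(ch);
            if (it == nodes_[cur].next.end()) return false;
            cur = it->second;
        }
        return nodes_[cur].accepting;
    }

    std::size_t nodeCount() const { return nodes_.size(); }

private:
    using Signature = std::pair<bool, std::map<char, int>>;

    // Post-order walk: children are merged first, so two trie nodes with the
    // same signature accept exactly the same suffixes and can share a node.
    int minimize(const std::vector<Node>& trie, int t) {
        Node merged;
        merged.accepting = trie[t].accepting;
        for (const auto& edge : trie[t].next)
            merged.next[edge.first] = minimize(trie, edge.second);
        const Signature key(merged.accepting, merged.next);
        auto it = canon_.find(key);
        if (it != canon_.end()) return it->second;
        const int id = static_cast<int>(nodes_.size());
        nodes_.push_back(merged);
        canon_.emplace(key, id);
        return id;
    }

    std::vector<Node> nodes_;
    std::map<Signature, int> canon_;
    int root_ = 0;
};
```

For the words tap, taps, top and tops, the trie has 8 nodes while the merged structure has 5, because the "p", "ps" endings are shared and the nodes reached by "ta" and "to" become identical.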
The runtime of this algorithm is as follows. Constructing the initial DFA can be done by constructing a trie for all the original words (which takes time O(n), where n is the total number of characters in all the input strings), then filling in the missing transitions (which takes time O(n|Σ|), where |Σ| is the number of different characters in your alphabet). From there, the minimization algorithm runs in time O(n^2 |Σ|). This means that the overall runtime for the algorithm is O(n^2 |Σ|).
To the best of my knowledge, there is no straightforward algorithm for incrementally constructing DAWGs. Typically, you would build a DAWG for a set of words only if you already had all the words in advance. Intuitively, this is true because inserting a new word that has some suffixes already present in the DAWG might require a lot of restructuring of the DAWG to make certain old accepting states not accepting and vice-versa. Theoretically speaking, this results because inserting a new word might dramatically change the equivalence classes of the DFA's distinguishability relation, which might require substantial changes to the DFA's structure.
Hope this helps!

COUNTING number of pairs of intersecting chords [closed]

Closed 10 years ago.
Consider n chords in a circle, each determined by its endpoints. Describe an O(n log n) solution for determining the number of pairs of chords that intersect inside the circle.
ASSUMPTION: No two chords share an endpoint.
There is a general line-segment intersection algorithm that does the job in O(n log n).
It can be used in your case because two chords cannot intersect outside the circle.
The following link contains the algorithm:
http://www.cs.princeton.edu/~chazelle/pubs/IntersectLineSegments.pdf
P.S.
It requires knowledge of basic computational geometry (line sweeps, range trees).
Hope this helps.
Off the top of my head, sort the chord endpoints by polar angle (this is the O(n log n) part). Then read through the sorted list (which is O(n)) - if two adjacent endpoints belong to the same chord, it has no intersections. Where two adjacent entries in the list belong to different chords, there may be an intersection depending on where the other endpoints for those two chords lie - e.g. if a chord A has endpoints A1 and A2 in their sorted order, and similarly chord B has B1 and B2, finding B2-A1 in the list is not an intersection, because B1 is earlier and A2 is later. However, B1-A2 would be an intersection.
See also biziclop's comment for another, somewhat more carefully constructed, solution.
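One way to turn the sort-by-angle idea into an actual O(n log n) count is sketched below. This is not the method from either answer above, just one common approach: after sorting, chords A and B cross exactly when their endpoints interleave (A1, B1, A2, B2), and a Fenwick tree (binary indexed tree) counts the interleavings during a single sweep. The function name and the input format (each chord as a pair of endpoint angles, all distinct) are assumptions for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Each chord is given by the polar angles of its two endpoints.
// Assumption: all 2n angles are distinct (no shared endpoints).
long long countIntersections(
        const std::vector<std::pair<double, double>>& chords) {
    const std::size_t n = chords.size();

    // All 2n endpoints as (angle, chord index), sorted by angle.
    std::vector<std::pair<double, std::size_t>> pts;
    pts.reserve(2 * n);
    for (std::size_t i = 0; i < n; ++i) {
        pts.emplace_back(chords[i].first, i);
        pts.emplace_back(chords[i].second, i);
    }
    std::sort(pts.begin(), pts.end());

    // Fenwick tree over sorted positions; a 1 marks a chord that has been
    // opened (first endpoint seen) but not yet closed.
    std::vector<int> tree(2 * n + 1, 0);
    auto add = [&](std::size_t pos, int delta) {
        for (std::size_t i = pos + 1; i <= 2 * n; i += i & (~i + 1))
            tree[i] += delta;
    };
    auto prefix = [&](std::size_t pos) {  // sum over positions [0, pos)
        int sum = 0;
        for (std::size_t i = pos; i > 0; i -= i & (~i + 1)) sum += tree[i];
        return sum;
    };

    const std::size_t unseen = 2 * n;  // sentinel: chord not opened yet
    std::vector<std::size_t> openedAt(n, unseen);
    long long total = 0;
    for (std::size_t p = 0; p < 2 * n; ++p) {
        const std::size_t c = pts[p].second;
        if (openedAt[c] == unseen) {
            openedAt[c] = p;
            add(p, +1);
        } else {
            const std::size_t a = openedAt[c];
            add(a, -1);
            // Chords opened after a and still open at p must close after p,
            // so their endpoints interleave with this chord's.
            total += prefix(p) - prefix(a);
        }
    }
    return total;
}
```

The sort costs O(n log n) and each of the 2n sweep steps does O(log n) work in the tree, so the whole count stays within O(n log n) even when the number of intersecting pairs is quadratic.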

C++ Algorithm stability [closed]

Closed 10 years ago.
How can I tell whether an algorithm is stable or not?
Also, how does Bucketsort compare to Mergesort, Quicksort, Bubblesort, and Insertionsort?
At first glance, it would seem that if your queues are FIFO, then it is stable. However, I think there is some context from class or other homework that would help you make a more solid determination.
From wikipedia:
Stability
Stable sorting algorithms maintain the relative order of records with equal keys. If all keys are different then this distinction is not necessary. But if there are equal keys, then a sorting algorithm is stable if whenever there are two records (let's say R and S) with the same key, and R appears before S in the original list, then R will always appear before S in the sorted list. When equal elements are indistinguishable, such as with integers, or more generally, any data where the entire element is the key, stability is not an issue. However, assume that the following pairs of numbers are to be sorted by their first component:
http://en.wikipedia.org/wiki/Sorting_algorithm#Stability
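To make the FIFO point concrete, here is a minimal sketch of a bucket sort that keeps records with equal keys in their original order. It makes simplifying assumptions that are not from the question: records are (key, payload) pairs, keys are integers in the range [0, keyCount), and there is one bucket per key value. The names Record and bucketSortByKey are made up for the example.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Record = std::pair<int, int>;  // (key, payload)

// Assumes every key satisfies 0 <= key < keyCount.
std::vector<Record> bucketSortByKey(const std::vector<Record>& input,
                                    int keyCount) {
    // One FIFO queue per key value.
    std::vector<std::queue<Record>> buckets(keyCount);
    for (const Record& r : input) buckets[r.first].push(r);

    // Emptying each queue front-first preserves arrival order within a key,
    // which is exactly what makes the sort stable.
    std::vector<Record> out;
    out.reserve(input.size());
    for (std::queue<Record>& bucket : buckets) {
        while (!bucket.empty()) {
            out.push_back(bucket.front());
            bucket.pop();
        }
    }
    return out;
}
```

If the buckets were emptied last-in-first-out instead (for example with a stack), records with equal keys would come out reversed and the sort would no longer be stable.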
As for comparing it to other algorithms, Wikipedia has a concise entry on it:
http://en.wikipedia.org/wiki/Bucket_sort#Comparison_with_other_sorting_algorithms
Also: https://stackoverflow.com/a/7341355/1416221