Does std::string::find() use KMP or a suffix tree? [duplicate] - c++

Why doesn't C++'s implementation of string::find() use the KMP algorithm (and run in O(N + M)) instead of running in O(N * M)? Is that corrected in C++0x?
If the complexity of the current find() is not O(N * M), what is it?
So what algorithm is implemented in GCC? Is it KMP? If not, why?
I've tested it, and the running time suggests that it runs in O(N * M).

Why doesn't C++'s implementation of string::substr() use the KMP algorithm (and run in O(N + M)) instead of running in O(N * M)?
I assume you mean find(), rather than substr() which doesn't need to search and should run in linear time (and only because it has to copy the result into a new string).
The C++ standard doesn't specify implementation details, and only specifies complexity requirements in some cases. The only complexity requirements on std::string operations are that size(), max_size(), operator[], swap(), c_str() and data() are all constant time. The complexity of anything else depends on the choices made by whoever implemented the library you're using.
The most likely reason for choosing a simple search over something like KMP is to avoid needing extra storage. Unless the string to be found is very long, and the string to search contains a lot of partial matches, the time taken to allocate and free that would likely be much more than the cost of the extra complexity.
Is that corrected in C++0x?
No, C++11 doesn't add any complexity requirements to std::string, and certainly doesn't add any mandatory implementation details.
If the complexity of the current substr is not O(N * M), what is it?
That's the worst-case complexity, when the string to search contains a lot of long partial matches. If the characters have a reasonably uniform distribution, then the average complexity would be closer to O(N). So by choosing an algorithm with better worst-case complexity, you may well make more typical cases much slower.
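To make the worst case concrete, here is a small timing sketch (the exact numbers depend entirely on the library implementation and the hardware):

#include <chrono>
#include <iostream>
#include <string>

int main() {
    // A text of identical characters and a pattern that is a long partial
    // match at every shift: the worst case for a naive O(N * M) search.
    std::string text(1'000'000, 'a');
    std::string pattern(1'000, 'a');
    pattern += 'b';                  // ensures the pattern never matches

    const auto start = std::chrono::steady_clock::now();
    const bool found = text.find(pattern) != std::string::npos;
    const auto stop = std::chrono::steady_clock::now();

    std::cout << "found: " << std::boolalpha << found << ", elapsed: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
}

Replacing the pattern with random characters shows the typical, near-linear behaviour instead.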

FYI,
The string::find implementations in both gcc/libstdc++ and llvm/libcxx were very slow. I improved both of them quite significantly (by ~20x in some cases). You might want to check the new implementation:
GCC: PR66414 optimize std::string::find
https://github.com/gcc-mirror/gcc/commit/fc7ebc4b8d9ad7e2891b7f72152e8a2b7543cd65
LLVM: https://reviews.llvm.org/D27068
The new algorithm is simpler and uses the hand-optimized assembly implementations of memchr and memcmp.
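The gist of that approach (a simplified sketch of the idea only, not the actual patched library code) is to let memchr skip to candidate positions for the first pattern character and let memcmp verify the rest:

#include <cstddef>
#include <cstring>

// Sketch: find `needle` (length m) in `haystack` (length n).
// Returns the offset of the first occurrence, or std::size_t(-1) if absent.
std::size_t find_memchr_memcmp(const char* haystack, std::size_t n,
                               const char* needle, std::size_t m) {
    if (m == 0) return 0;
    if (m > n) return static_cast<std::size_t>(-1);

    const char* p = haystack;
    std::size_t remaining = n;
    while (remaining >= m) {
        // memchr finds the next candidate position for the first character;
        // a match can only start in the first (remaining - m + 1) positions.
        const char* cand = static_cast<const char*>(
            std::memchr(p, needle[0], remaining - m + 1));
        if (!cand) break;
        // memcmp verifies the rest of the pattern at that position.
        if (std::memcmp(cand + 1, needle + 1, m - 1) == 0)
            return static_cast<std::size_t>(cand - haystack);
        p = cand + 1;
        remaining = n - static_cast<std::size_t>(p - haystack);
    }
    return static_cast<std::size_t>(-1);
}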

Where do you get the impression that std::string::substr() doesn't use a linear algorithm? In fact, I can't even imagine how to implement it in a way which has the complexity you quoted. Also, there isn't much of an algorithm involved: is it possible that you think this function does something other than what it does? std::string::substr() just creates a new string starting at its first argument and using either the number of characters specified by the second parameter or the characters up to the end of the string.
You may be referring to std::string::find(), which doesn't have any complexity requirements, or std::search(), which is indeed allowed to do O(n * m) comparisons. However, this gives implementers the freedom to choose between an algorithm which has the best theoretical complexity and one which doesn't need additional memory. Since allocation of arbitrary amounts of memory is generally undesirable unless specifically requested, this seems a reasonable thing to do.

The C++ standard does not dictate the performance characteristics of substr (or many other parts, including the find you're most likely referring to with an M*N complexity).
It mostly dictates functional aspects of the language (with some exceptions like the non-legacy sort functions, for example).
Implementations are even free to implement qsort as a bubble sort (but only if they want to be ridiculed and possibly go out of business).
For example, there are only seven (very small) sub-points in section 21.4.7.2 basic_string::find of C++11, and none of them specify performance parameters.

Let's look into the CLRS book. On page 989 of the third edition we have the following exercise:
Suppose that pattern P and text T are randomly chosen strings of
length m and n, respectively, from the d-ary alphabet
Σ_d = {0, 1, ..., d-1}, where d >= 2. Show that the
expected number of character-to-character comparisons made by the
implicit loop in line 4 of the naive algorithm is
(n - m + 1) * (1 - d^{-m}) / (1 - d^{-1}) <= 2 * (n - m + 1)
over all executions of this loop. (Assume that the naive algorithm
stops comparing characters for a given shift once it finds a mismatch
or matches the entire pattern.) Thus, for randomly chosen strings, the
naive algorithm is quite efficient.
NAIVE-STRING-MATCHER(T, P)
1  n = T.length
2  m = P.length
3  for s = 0 to n - m
4      if P[1..m] == T[s+1..s+m]
5          print "Pattern occurs with shift" s
Proof:
For a single shift we expect to perform 1 + 1/d + ... + 1/d^{m-1} comparisons. By the geometric-series formula this sum equals (1 - d^{-m}) / (1 - d^{-1}), which is at most d / (d - 1) <= 2 for d >= 2. Now multiply by the number of valid shifts, which is n - m + 1. □
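For reference, the same naive matcher transcribed into C++ (a sketch; std::string::compare plays the role of the implicit loop in line 4):

#include <iostream>
#include <string>

// Direct transcription of NAIVE-STRING-MATCHER: report every shift s
// at which P occurs in T.
void naive_string_matcher(const std::string& T, const std::string& P) {
    const std::size_t n = T.length();
    const std::size_t m = P.length();
    if (m > n) return;
    for (std::size_t s = 0; s + m <= n; ++s) {
        if (T.compare(s, m, P) == 0)   // character-by-character under the hood
            std::cout << "Pattern occurs with shift " << s << "\n";
    }
}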

Where do you get your information about the C++ library? If you do mean string::find, and it really doesn't use the KMP algorithm, then I suggest it is because that algorithm isn't generally faster than a simple linear search, due to having to build a partial-match table before the search can proceed.

If you are going to be searching for the same pattern in multiple texts, the Boyer-Moore algorithm is a good choice, because the pattern tables need only be computed once but are used every time a text is searched. If you are only going to search for a pattern once in one text, though, the overhead of computing the tables, along with the overhead of allocating memory, slows you down too much, and std::string::find() will beat you since it does not allocate any memory and has no setup overhead.
Boost has multiple string-searching algorithms. I found that Boyer-Moore was an order of magnitude slower than std::string::find() in the case of a single pattern search in one text.
For my cases, Boyer-Moore was rarely faster than std::string::find(), even when searching multiple texts with the same pattern.
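For comparison, C++17 exposes the same precompute-once idea in the standard library: a std::boyer_moore_searcher holds the pattern tables and can be reused across texts (a sketch using the standard searcher, not the Boost API benchmarked above):

#include <algorithm>
#include <functional>   // std::boyer_moore_searcher (C++17)
#include <iostream>
#include <string>
#include <vector>

int main() {
    const std::string pattern = "needle";
    const std::vector<std::string> texts = {
        "a haystack with a needle in it",
        "another haystack with no match"
    };

    // The tables are built once, when the searcher is constructed...
    const std::boyer_moore_searcher searcher(pattern.begin(), pattern.end());

    // ...and reused for every text that is searched.
    for (const auto& text : texts) {
        const auto it = std::search(text.begin(), text.end(), searcher);
        std::cout << (it != text.end() ? "found" : "not found") << "\n";
    }
}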

Related

What are the best sorting algorithms when 'n' is very small?

In the critical path of my program, I need to sort an array (specifically, a C++ std::vector<int64_t>, using the GNU C++ standard library). I am using the standard library's sorting algorithm (std::sort), which in this case is introsort.
I was curious about how well this algorithm performs, and when doing some research on the various sorting algorithms that different standard and third-party libraries use, I found that almost all of them care about cases where 'n' tends to be the dominant factor.
In my specific case though, 'n' is going to be on the order of 2-20 elements. So the constant factors could actually be dominant. And things like cache effects might be very different when the entire array we are sorting fits into a couple of cache lines.
What are the best sorting algorithms for cases like this where the constant factors likely overwhelm the asymptotic factors? And do there exist any vetted C++ implementations of these algorithms?
Introsort takes your concern into account, and switches to an insertion sort implementation for short sequences.
Since your STL already provides it, you should probably use that.
Insertion sort or selection sort are both typically faster for small arrays (i.e., fewer than 10-20 elements).
Watch https://www.youtube.com/watch?v=FJJTYQYB1JQ
A simple linear insertion sort is really fast. Making a heap first can improve it a bit.
Sadly the talk doesn't compare that against the hardcoded solutions for <= 15 elements.
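For reference, a plain insertion sort of the kind discussed above (a minimal sketch; std::sort typically does something equivalent internally for short subranges):

#include <cstddef>
#include <cstdint>
#include <vector>

// Straightforward insertion sort; typically the fastest comparison sort
// for very small arrays because of its tiny constant factor.
void insertion_sort(std::vector<std::int64_t>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        const std::int64_t key = v[i];
        std::size_t j = i;
        // Shift larger elements one slot to the right, then drop the key in.
        while (j > 0 && v[j - 1] > key) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
}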
It's impossible to know the fastest way to do anything without knowing exactly what the "anything" is.
Here is one possible set of assumptions:
We don't have any knowledge of the element structure except that elements are comparable. We have no useful way to group them into bins (for radix sort), we must implement a comparison-based sort, and comparison takes place in an opaque manner.
We have no information about the initial state of the input; any input order is equally likely.
We don't have to care about whether the sort is stable.
The input sequence is a simple array. Accessing elements is constant-time, as is swapping them. Furthermore, we will benchmark the function purely according to the expected number of comparisons - not number of swaps, wall-clock time or anything else.
With that set of assumptions (and possibly some other sets), the best algorithms for small numbers of elements will be hand-crafted sorting networks, tailored to the exact length of the input array. (These always perform the same number of comparisons; it isn't feasible to "short-circuit" these algorithms conditionally because the "conditions" would depend on detecting data that is already partially sorted, which still requires comparisons.)
For a network sorting four elements (in the known-optimal five comparisons), this might look like (I did not test this):
#include <algorithm>   // std::iter_swap

template<class RandomIt, class Compare>
void _compare_and_swap(RandomIt first, Compare comp, int x, int y) {
    // Put the element that compares smaller (according to comp) at position x.
    if (comp(first[y], first[x])) {
        std::iter_swap(first + x, first + y);
    }
}

// Assume there are exactly four elements available at the `first` iterator.
template<class RandomIt, class Compare>
void network_sort_4(RandomIt first, Compare comp) {
    _compare_and_swap(first, comp, 0, 2);
    _compare_and_swap(first, comp, 1, 3);
    _compare_and_swap(first, comp, 0, 1);
    _compare_and_swap(first, comp, 2, 3);
    _compare_and_swap(first, comp, 1, 2);
}
In real-world environments, of course, we will have different assumptions. For small numbers of elements, with real data (but still assuming we must do comparison-based sorts) it will be difficult to beat naive implementations of insertion sort (or bubble sort, which is effectively the same thing) that have been compiled with good optimizations. It's really not feasible to reason about these things by hand, considering both the complexity of the hardware level (e.g. the steps it takes to pipeline instructions and then compensate for branch mis-predictions) and the software level (e.g. the relative cost of performing the swap vs. performing the comparison, and the effect that has on the constant-factor analysis of performance).

O(N) Identification of Permutations

This answer determines if two strings are permutations by comparing their contents. If they contain the same number of each character, they are obviously permutations. This is accomplished in O(N) time.
I don't like the answer though because it reinvents what is_permutation is designed to do. That said, is_permutation has a complexity of:
At most O(N^2) applications of the predicate, or exactly N if the sequences are already equal, where N = std::distance(first1, last1)
So I cannot advocate the use of is_permutation where it is orders of magnitude slower than a hand-spun algorithm. But surely the implementer of the standard would not miss such an obvious improvement? So why is is_permutation O(N^2)?
is_permutation works on almost any data type. The algorithm in your link works for data types with a small number of values only.
It's the same reason why std::sort is O(N log N) but counting sort is O(N).
It was I who wrote that answer.
When the string's value_type is char, the number of elements required in a lookup table is 256. For a two-byte encoding, 65536. For a four-byte encoding, the lookup table would have just over 4 billion entries, at a likely size of 16 GB! And most of it would be unused.
So the first thing is to recognize that even if we restrict the types to char and wchar_t, it may still be untenable. Likewise if we want to do is_permutation on sequences of type int.
We could have a specialization of std::is_permutation<> for integral types of size 1 or 2 bytes. But this is somewhat reminiscent of std::vector<bool> which not everyone thinks was a good idea in retrospect.
We could also use a lookup table based on std::map<T, size_t>, but this is likely to be allocation-heavy so it might not be a performance win (or at least, not always). It might be worth implementing one for a detailed comparison though.
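Such a map-based counting variant might look roughly like this (a sketch with illustrative names; note it is O((N+M) log N) rather than linear, and it requires operator< on the value type):

#include <cstddef>
#include <iterator>
#include <map>

// Counting-based permutation check for arbitrary comparable value types.
// Builds a histogram of the first range, then consumes it with the second.
template <class InputIt1, class InputIt2>
bool is_permutation_counting(InputIt1 first1, InputIt1 last1,
                             InputIt2 first2, InputIt2 last2) {
    std::map<typename std::iterator_traits<InputIt1>::value_type, std::size_t> counts;
    for (; first1 != last1; ++first1)
        ++counts[*first1];
    for (; first2 != last2; ++first2) {
        auto it = counts.find(*first2);
        if (it == counts.end() || it->second == 0)
            return false;            // value missing or seen too often
        --it->second;
    }
    // Every count must have been consumed exactly.
    for (const auto& kv : counts)
        if (kv.second != 0)
            return false;
    return true;
}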
In summary, I don't fault the C++ standard for not including a high-performance version of is_permutation for char. First because in the real world I'm not sure it's the most common use of the template, and second because the STL is not the be-all and end-all of algorithms, especially where domain knowledge can be used to accelerate computation for special cases.
If it turns out that is_permutation for char is quite common in the wild, C++ library implementors would be within their rights to provide a specialization for it.
The answer you cite works on chars. It assumes they are 8 bit (not necessarily the case) and so there are only 256 possibilities for each value, and that you can cheaply go from each value to a numeric index to use for a lookup table of counts (for char in this case, the value and the index are the same thing!)
It generates a count of how many times each char value occurs in each string; then, if these distributions are the same for both strings, the strings are permutations of each other.
What is the time complexity?
you have to walk each character of each string, so M+N steps for two inputs of lengths M and N
each of these steps involves incrementing a count in a fixed-size table at an index given by the char, so it is constant time
So the overall time complexity is O(N+M): linear, as you describe.
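In code, that counting approach looks roughly like this (a sketch of the cited answer's idea, assuming 8-bit chars):

#include <array>
#include <string>

// O(N + M) permutation check for 8-bit characters: compare the histograms.
bool is_char_permutation(const std::string& a, const std::string& b) {
    if (a.size() != b.size())
        return false;
    std::array<int, 256> counts{};        // one counter per possible byte value
    for (unsigned char c : a) ++counts[c];
    for (unsigned char c : b) --counts[c];
    for (int n : counts)
        if (n != 0)
            return false;                 // distributions differ
    return true;
}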
Now, std::is_permutation makes no such assumptions about its input. It doesn't know that there are only 256 possibilities, or indeed that they are bounded at all. It doesn't know how to go from an input value to a number it can use as an index, never mind how to do that in constant time. The only thing it knows is how to compare two values for equality, because the caller supplies that information.
So, the time complexity:
we know it has to consider each element of each input at some point
we know that, for each element it hasn't seen before (I'll leave the discussion of how that's determined, and why it doesn't affect the big-O complexity, as an exercise), it's not able to turn the element into any kind of index or key for a table of counts, so it has no way of counting the occurrences of that element that is better than a linear walk through both inputs to see how many elements match
so the complexity is going to be quadratic at best.

Introsort (quicksort + heapsort) implementation and complexity

I've read that C++ uses introsort (introspective sort) for its built-in std::sort where it starts off with quicksort and switches to heapsort when you hit the depth limit.
I've also read that the depth limit is supposed to be 2*log(2,N).
Is this value purely experimental? Or is there some mathematical theory behind it?
If you have an interval (range or array), the number of times you'll have to split the interval in half before you end up with an empty (or one-element) interval is log(2,N); that's just a mathematical fact which you can work out easily if you want. If all goes perfectly well with quicksort, it should recurse log(2,N) times, for the same reason (and at each recursion level it has to process all values of the interval, which leads to O(N*log(2,N)) complexity for the overall algorithm). The problem is that quicksort could require many more recursions (if it keeps getting "unlucky" with picking pivot values, which means that it doesn't split the interval in half, but in an imbalanced way instead). At worst, quicksort could end up recursing N times, which is definitely not acceptable for a production-quality implementation.
Switching to heap-sort at 2*log(2,N) is just a good heuristic in general, to detect a much too deep number of recursions.
Technically, you could base this on the empirical performance of heapsort versus quicksort, to figure out what limit is best. But such tests are highly dependent on the application (what are you sorting? how are you comparing elements? how cheap are the element swaps? etc.). So most one-size-fits-all implementations, like std::sort, just pick a reasonable limit like 2*log(2,N).
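The overall structure looks roughly like this (a heavily simplified sketch, not the actual libstdc++ code; the cutoff of 16 and the helper names are made up for illustration):

#include <algorithm>
#include <cmath>
#include <vector>

using Iter = std::vector<int>::iterator;

// Simplified introsort skeleton: quicksort-style partitioning until the depth
// budget runs out, then heapsort for the offending subrange, and insertion
// sort for small subranges.
void introsort_impl(Iter first, Iter last, int depth_limit) {
    while (last - first > 16) {
        if (depth_limit == 0) {
            // Too many levels of recursion: the pivots have been unlucky.
            // Heapsort guarantees O(N log N) for the rest of this range.
            std::make_heap(first, last);
            std::sort_heap(first, last);
            return;
        }
        --depth_limit;

        // Middle-element pivot and a three-way partition.
        const int pivot = *(first + (last - first) / 2);
        Iter mid1 = std::partition(first, last,
                                   [pivot](int x) { return x < pivot; });
        Iter mid2 = std::partition(mid1, last,
                                   [pivot](int x) { return !(pivot < x); });

        introsort_impl(first, mid1, depth_limit);  // recurse on the left part
        first = mid2;                              // iterate on the right part
    }

    // Finish the small remaining subrange with insertion sort.
    if (last - first < 2) return;
    for (Iter i = first + 1; i != last; ++i) {
        const int key = *i;
        Iter j = i;
        while (j != first && *(j - 1) > key) { *j = *(j - 1); --j; }
        *j = key;
    }
}

void introsort(std::vector<int>& v) {
    if (v.size() < 2) return;
    const int depth_limit = 2 * static_cast<int>(std::log2(v.size()));
    introsort_impl(v.begin(), v.end(), depth_limit);
}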
What @Mikael Persson said regarding why the depth limit is 2*log(2,N) is partly correct. It is not just a good heuristic, or a reasonable limit.
In fact, as you have probably guessed (hinted at by your second question), there is an important mathematical reason for this: in tilde notation (search for tilde notation), quicksort makes on average ~2*N*ln(N) comparisons. In big-O notation, this is equivalent to O(N*log(2,N)).
That is why introsort switches to heapsort (which has asymptotic O(N*log(2,N)) complexity) when the depth of the recursion becomes more than 2*log(2,N). You can think of exceeding that depth as something unusual: it most probably means that something went wrong with the pivot picking and that quicksort alone would lead to O(N^2) complexity.
You can find a short mathematical proof of the average number of compares quicksort does here (slide 21).

how does IF affect complexity?

Let's say we have an array of 1,000,000 elements and we go through all of them to check something simple, for example whether the first character is "A". From my (very little) understanding, the complexity will be O(n) and it will take some X amount of time. If I add another IF (not else if) to check, let's say, whether the last character is "G", how will it change the complexity? Will it double the complexity and time, like O(2n) and 2X?
I would like to avoid taking into consideration the number of calculations different commands have to make. For example, I understand that Len() requires more calculations to give us the result than a simple char comparison does, but let's say that the commands used in the IFs will have (almost) the same amount of complexity.
O(2n) = O(n). Generalizing, O(kn) = O(n), with k being a constant. Sure, with two IFs it might take twice the time, but execution time will still be a linear function of input size.
Edit: Here and Here are explanations, with examples, of big-O notation which are not too mathematically oriented.
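Concretely, for the scenario in the question (a sketch; both functions below are O(n), the second just does roughly twice the constant work per element):

#include <string>
#include <vector>

// One pass over the array, one check per element: O(n).
int count_starting_with_A(const std::vector<std::string>& items) {
    int count = 0;
    for (const auto& s : items)
        if (!s.empty() && s.front() == 'A') ++count;
    return count;
}

// One pass over the array, two checks per element: still O(n),
// just with roughly twice the constant factor.
int count_first_A_or_last_G(const std::vector<std::string>& items) {
    int count = 0;
    for (const auto& s : items) {
        if (!s.empty() && s.front() == 'A') ++count;
        if (!s.empty() && s.back() == 'G') ++count;
    }
    return count;
}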
Asymptotic complexity (which is what big-O uses) is not dependent on constant factors, more specifically, you can add / remove any constant factor to / from the function and it will remain equivalent (i.e. O(2n) = O(n)).
Assuming an if-statement takes a constant amount of time, it will only add a constant factor to the complexity.
A "constant amount of time" means:
The time taken for that if-statement for a given element is not dependent on how many other elements there are in the array
So basically if it doesn't call a function which looks through the other elements in the array in some way or something similar to this
Any non-function-calling if-statement is probably fine (unless it contains a statement that goes through the array, which some languages allow)
Thus 2 (constant-time) if-statements called for each element will be O(2n), but this is equal to O(n) (well, it might not really be 2n; more on that in the additional note).
See Wikipedia for more details and a more formal definition.
Note: Apart from not being dependent on constant factors, it is also not dependent on asymptotically smaller terms (terms which remain smaller regardless of how big n gets), e.g. O(n) = O(n + sqrt(n)). And big-O is just an upper bound, so saying it is O(n^9999) would also be correct (though saying that in a test / exam will probably get you 0 marks).
Additional note: The problem when not ignoring constant factors is - what classifies as a unit of work? There is no standard definition here. One way is to use the operation that takes the longest, but determining this may not always be straight-forward, nor would it always be particularly accurate, nor would you be able to generically compare complexities of different algorithms.
Some key points about time complexity:
Theta notation - exact bound. If the piece of code we are analyzing contains an if/else and either branch has code that grows with the input size, then an exact bound can't be obtained, since either branch might be taken; Theta notation is not advisable for such cases. On the other hand, if both branches resolve to constant-time code, then Theta notation can be applied.
Big O notation - upper bound. If the code has conditionals where either branch might grow with the input size n, then we assume the maximum, or upper bound, to calculate the time consumed by the code; hence we use Big O for such conditionals, assuming we take the path that has the maximum time consumption. So, the path which takes less time can be treated as O(1) in an amortized analysis (this includes assuming the path has no recursion that may grow with the input size), and we calculate the Big O time complexity for the lengthiest path.
Big Omega notation - lower bound. This is the minimum guaranteed time that a piece of code can take, irrespective of the input. It is useful for cases where the time taken by the code doesn't grow with the input size n, but it consumes a significant amount of time k. In those cases, we can use the lower-bound analysis.
Note: None of these notations depends on the input being best/average/worst case, and all of them can be applied to any piece of code.
So as discussed above, Big O doesn't care about the constant factors such as k and only sees how time increases with respect to growth in n, in which case here it is O(kn) = O(n) linear.
PS: This post was about the relationship between big O and the evaluation of conditionals, for amortized analysis.
It's related to a question I posted myself today.
In your example it depends on whether you can jump from the first to the last element; if you can't, then it also depends on the average length of each entry.
If, as you went down through the array, you had to read each full entry in order to evaluate your two if statements, then your order would be O(1,000,000 x N), where N is the average length of each entry. If N is variable, then it will affect the order. An example would be standard multiplication, where we perform Log(N) additions of an entry which is Log(N) in length, and so the order is O(Log^2(N)), or if you prefer, O((Log(N))^2).
On the other hand, if you can just check the first and last character, then N = 2 and is constant, so it can be ignored.
This is an IMPORTANT point: you have to be careful, though, because how do you decide whether your multiplier can be ignored? For example, say we were doing Log(N) additions of a Log(N/100) number. Just because Log(N/100) is the smaller term doesn't mean we can ignore it. The multiplying factor cannot be ignored if it is variable.

String search algorithm used by string::find() c++

How is it faster than the C string functions? Is similar source available for C?
There's no single standard implementation of the C++ Standard Library, but you should be able to take a look at the implementation shipped with your compiler and see how it works yourself.
In general, most STL functions are not faster than their C counterparts, though. They're usually safer, more generalized and designed to accommodate a much broader range of circumstances than the narrow-purpose C equivalents.
A standard optimization with any string class is to store the string length along with the string. That makes any string operation that needs the length O(1) instead of O(n), strlen() being the obvious one.
Or copying a string, there's no gain in the actual copy but figuring out how much memory to allocate before the copy is O(1). The overall algorithm is still O(n). The basic operation is still the same, shoveling bytes takes just as long in any language.
String classes are useful because they are safer (harder to shoot your foot) and easier to use (require less explicit code). They became popular and widely used because they weren't slower.
The string class almost certainly stores far more data about the string than you'd find in a C string. Length is a good example. In tradeoff for the extra memory use, you will gain some spare CPU cycles.
Edit:
However, it's unlikely that one is substantially slower than the other, since they'll perform fundamentally the same actions. MSDN suggests that string::find() doesn't use a functor-based system, so they won't have that optimization.
There are many possibilities for how you can implement a string-find technique. The easiest way is to check at every position of the destination string whether the search string starts there. You can code that very quickly, but it's the slowest possibility (O(m*n), where m = length of the search string and n = length of the destination string).
Take a look at the wikipedia page, http://en.wikipedia.org/wiki/String_searching_algorithm, there are different options presented.
The fastest way is to create a finite state machine; then you can scan the text without ever going backwards. That is just O(n).
Which algorithm the STL actually uses, I don't know. But you could search for the source code and compare it with the algorithms.
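For completeness, the linear-time automaton idea mentioned above is essentially what KMP does with its prefix (failure) table (a sketch, not what any particular standard library actually ships):

#include <cstddef>
#include <string>
#include <vector>

// KMP search: O(n + m) time, O(m) extra space for the failure table.
// Returns the index of the first occurrence of `pat` in `text`, or npos.
std::size_t kmp_find(const std::string& text, const std::string& pat) {
    if (pat.empty()) return 0;

    // failure[i] = length of the longest proper prefix of pat[0..i]
    // that is also a suffix of it.
    std::vector<std::size_t> failure(pat.size(), 0);
    for (std::size_t i = 1, k = 0; i < pat.size(); ++i) {
        while (k > 0 && pat[i] != pat[k]) k = failure[k - 1];
        if (pat[i] == pat[k]) ++k;
        failure[i] = k;
    }

    // Scan the text, never moving backwards in it.
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pat[k]) k = failure[k - 1];
        if (text[i] == pat[k]) ++k;
        if (k == pat.size()) return i + 1 - pat.size();
    }
    return std::string::npos;
}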