Introsort (quicksort + heapsort) implementation and complexity - C++

I've read that C++ uses introsort (introspective sort) for its built-in std::sort where it starts off with quicksort and switches to heapsort when you hit the depth limit.
I've also read that the depth limit is supposed to be 2*log(2,N).
Is this value purely experimental? Or is there some mathematical theory behind it?

If you have an interval (a range or array), the number of times you'll have to split it in half before you end up with an empty (or one-element) interval is log(2,N); that's just a mathematical fact, and you can work it out easily if you want. If all goes perfectly well with quicksort, it recurses log(2,N) times, for the same reason (and at each recursion level it has to process all values of the interval, which leads to O(N*log(2,N)) complexity for the overall algorithm). The problem is that quicksort can require many more recursions if it keeps getting "unlucky" with its pivot picks, meaning it doesn't split the interval in half but in an imbalanced way instead. At worst, quicksort could end up recursing N times, which is definitely not acceptable for a production-quality implementation.
Switching to heap-sort at 2*log(2,N) is just a good heuristic in general, to detect a much too deep number of recursions.
Technically, you could base this on the empirical performance of heapsort versus quicksort to figure out what limit is best. But such tests are highly dependent on the application (what are you sorting? how are you comparing elements? how cheap are the element swaps? etc.). So most one-size-fits-all implementations, like std::sort, just pick a reasonable limit like 2*log(2,N).
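To make the mechanism concrete, here is a minimal sketch of a depth-limited quicksort in the spirit of introsort. This is only an illustration (middle-element pivot, no insertion-sort cutoff, made-up helper names), not the libstdc++ implementation:

#include <algorithm>
#include <cmath>

template <typename It>
void introsort_impl(It first, It last, int depth_limit) {
    if (last - first <= 1) return;
    if (depth_limit == 0) {
        // Too deep: the pivots have been consistently bad, so switch to
        // heapsort, which is O(N*log(2,N)) even in the worst case.
        std::make_heap(first, last);
        std::sort_heap(first, last);
        return;
    }
    // Pick the middle element as the pivot and do a three-way partition.
    auto pivot = *(first + (last - first) / 2);
    It left  = std::partition(first, last, [&](const auto& v) { return v < pivot; });
    It right = std::partition(left, last,  [&](const auto& v) { return !(pivot < v); });
    introsort_impl(first, left, depth_limit - 1);
    introsort_impl(right, last, depth_limit - 1);
}

template <typename It>
void introsort(It first, It last) {
    auto n = last - first;
    if (n > 1) introsort_impl(first, last, 2 * static_cast<int>(std::log2(n)));
}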

What @Mikael Persson said regarding why the depth limit is 2*log(2,N) is partly correct. It is not just a good heuristic, or a reasonable limit.
In fact, as you have probably guessed (judging from your second question), there is an important mathematical reason for this: in tilde notation (search for tilde notation), quicksort makes on average ~2*N*ln(N) comparisons (about 1.39*N*log(2,N)), which in big-O notation is O(N*log(2,N)).
That is why introsort switches to heapsort (which has worst-case O(N*log(2,N)) complexity) when the depth of the recursion exceeds 2*log(2,N). You can think of exceeding that depth as something that should not normally happen: it most probably means the pivot picking went wrong and quicksort alone would be heading toward O(N^2) complexity.
You can find a short mathematical proof of the average number of compares quicksort does here (slide 21).
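If you want to see this empirically, here is a small, hypothetical experiment: a plain quicksort (middle-element pivot, bidirectional partition) instrumented to count comparisons on a random permutation, printed next to 2*N*ln(N):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

static long long comparisons = 0;

void quicksort(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {
        while (++comparisons, a[i] < pivot) ++i;   // count every comparison
        while (++comparisons, pivot < a[j]) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    quicksort(a, lo, j);
    quicksort(a, i, hi);
}

int main() {
    const int N = 1000000;
    std::vector<int> v(N);
    std::iota(v.begin(), v.end(), 0);
    std::shuffle(v.begin(), v.end(), std::mt19937{42});
    quicksort(v, 0, N - 1);
    std::cout << comparisons << " comparisons, 2*N*ln(N) = "
              << 2.0 * N * std::log(N) << '\n';
}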

Related

What are the best sorting algorithms when 'n' is very small?

In the critical path of my program, I need to sort an array (specifically, a C++ std::vector<int64_t>, using the GNU C++ standard library). I am using the standard library's sorting algorithm (std::sort), which in this case is introsort.
I was curious about how well this algorithm performs, and while researching the sorting algorithms that various standard and third-party libraries use, I noticed that almost all of them focus on cases where 'n' is the dominant factor.
In my specific case though, 'n' is going to be on the order of 2-20 elements. So the constant factors could actually be dominant. And things like cache effects might be very different when the entire array we are sorting fits into a couple of cache lines.
What are the best sorting algorithms for cases like this where the constant factors likely overwhelm the asymptotic factors? And do there exist any vetted C++ implementations of these algorithms?
Introsort takes your concern into account, and switches to an insertion sort implementation for short sequences.
Since your STL already provides it, you should probably use that.
Insertion sort or selection sort are both typically faster for small arrays (i.e., fewer than 10-20 elements).
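For reference, a minimal insertion sort looks like this (a sketch, not the library's internal version):

#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
void insertion_sort(std::vector<T>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        T key = std::move(a[i]);
        std::size_t j = i;
        while (j > 0 && key < a[j - 1]) {   // shift larger elements one slot right
            a[j] = std::move(a[j - 1]);
            --j;
        }
        a[j] = std::move(key);
    }
}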
Watch https://www.youtube.com/watch?v=FJJTYQYB1JQ
A simple linear insertion sort is really fast. Making a heap first can improve it a bit.
Sadly the talk doesn't compare that against the hardcoded solutions for <= 15 elements.
It's impossible to know the fastest way to do anything without knowing exactly what the "anything" is.
Here is one possible set of assumptions:
We don't have any knowledge of the element structure except that elements are comparable. We have no useful way to group them into bins (for radix sort), we must implement a comparison-based sort, and comparison takes place in an opaque manner.
We have no information about the initial state of the input; any input order is equally likely.
We don't have to care about whether the sort is stable.
The input sequence is a simple array. Accessing elements is constant-time, as is swapping them. Furthermore, we will benchmark the function purely according to the expected number of comparisons - not number of swaps, wall-clock time or anything else.
With that set of assumptions (and possibly some other sets), the best algorithms for small numbers of elements will be hand-crafted sorting networks, tailored to the exact length of the input array. (These always perform the same number of comparisons; it isn't feasible to "short-circuit" these algorithms conditionally because the "conditions" would depend on detecting data that is already partially sorted, which still requires comparisons.)
For a network sorting four elements (in the known-optimal five comparisons), this might look like (I did not test this):
#include <utility>  // std::swap

template<class RandomIt, class Compare>
void compare_and_swap(RandomIt first, Compare comp, int x, int y) {
    // Ensure first[x] does not compare greater than first[y].
    if (comp(first[y], first[x])) {
        std::swap(first[x], first[y]);
    }
}

// Assume there are exactly four elements available at the `first` iterator.
template<class RandomIt, class Compare>
void network_sort_4(RandomIt first, Compare comp) {
    compare_and_swap(first, comp, 0, 1);
    compare_and_swap(first, comp, 2, 3);
    compare_and_swap(first, comp, 0, 2);
    compare_and_swap(first, comp, 1, 3);
    compare_and_swap(first, comp, 1, 2);
}
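A hypothetical call site for the sketch above, using std::less as the comparator:

// Sort exactly four ints ascending with the network above.
#include <functional>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 4, 2};
    network_sort_4(v.begin(), std::less<int>{});
    // v is now {1, 2, 3, 4}
}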
In real-world environments, of course, we will have different assumptions. For small numbers of elements, with real data (but still assuming we must do comparison-based sorts) it will be difficult to beat naive implementations of insertion sort (or bubble sort, which is effectively the same thing) that have been compiled with good optimizations. It's really not feasible to reason about these things by hand, considering both the complexity of the hardware level (e.g. the steps it takes to pipeline instructions and then compensate for branch mis-predictions) and the software level (e.g. the relative cost of performing the swap vs. performing the comparison, and the effect that has on the constant-factor analysis of performance).

Does std::string::find() use KMP or a suffix tree? [duplicate]

Why doesn't the C++ implementation of string::find() use the KMP algorithm (and run in O(N + M)) instead of running in O(N * M)? Is that corrected in C++0x?
If the complexity of the current find is not O(N * M), what is it?
So what algorithm is implemented in gcc? Is it KMP? If not, why?
I've tested it, and the running time suggests that it runs in O(N * M).
Why the c++'s implemented string::substr() doesn't use the KMP algorithm (and doesn't run in O(N + M)) and runs in O(N * M)?
I assume you mean find(), rather than substr() which doesn't need to search and should run in linear time (and only because it has to copy the result into a new string).
The C++ standard doesn't specify implementation details, and only specifies complexity requirements in some cases. The only complexity requirements on std::string operations are that size(), max_size(), operator[], swap(), c_str() and data() are all constant time. The complexity of anything else depends on the choices made by whoever implemented the library you're using.
The most likely reason for choosing a simple search over something like KMP is to avoid needing extra storage. Unless the string to be found is very long, and the string to search contains a lot of partial matches, the time taken to allocate and free that would likely be much more than the cost of the extra complexity.
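For illustration, a simple search of the kind described (not the actual library code) might look like this:

#include <cstddef>
#include <string>

// Sketch of a simple search: no preprocessing, no allocation,
// O(N*M) worst case but typically close to O(N) on real text.
std::size_t naive_find(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    if (needle.size() > haystack.size()) return std::string::npos;
    for (std::size_t i = 0; i + needle.size() <= haystack.size(); ++i) {
        std::size_t j = 0;
        while (j < needle.size() && haystack[i + j] == needle[j]) ++j;
        if (j == needle.size()) return i;   // full match at offset i
    }
    return std::string::npos;
}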
Is that corrected in c++0x?
No, C++11 doesn't add any complexity requirements to std::string, and certainly doesn't add any mandatory implementation details.
If the complexity of current substr is not O(N * M), what is that?
That's the worst-case complexity, when the string to search contains a lot of long partial matches. If the characters have a reasonably uniform distribution, then the average complexity would be closer to O(N). So by choosing an algorithm with better worst-case complexity, you may well make more typical cases much slower.
FYI,
The string::find implementations in both gcc/libstdc++ and llvm/libcxx were very slow. I improved both of them quite significantly (by ~20x in some cases). You might want to check the new implementation:
GCC: PR66414 optimize std::string::find
https://github.com/gcc-mirror/gcc/commit/fc7ebc4b8d9ad7e2891b7f72152e8a2b7543cd65
LLVM: https://reviews.llvm.org/D27068
The new algorithm is simpler and uses the hand-optimized assembly implementations of memchr and memcmp.
Where did you get the impression that std::string::substr() doesn't use a linear algorithm? In fact, I can't even imagine how to implement it in a way that has the complexity you quoted. Also, there isn't much of an algorithm involved: is it possible that you think this function does something other than what it does? std::string::substr() just creates a new string starting at its first argument, using either the number of characters specified by the second parameter or the characters up to the end of the string.
You may be referring to std::string::find(), which doesn't have any complexity requirements, or std::search(), which is indeed allowed to do O(n * m) comparisons. However, this gives implementers the freedom to choose between an algorithm with the best theoretical complexity and one that doesn't need additional memory. Since allocating arbitrary amounts of memory is generally undesirable unless specifically requested, this seems a reasonable thing to do.
The C++ standard does not dictate the performance characteristics of substr (or many other parts, including the find you're most likely referring to with an M*N complexity).
It mostly dictates functional aspects of the language (with some exceptions like the non-legacy sort functions, for example).
Implementations are even free to implement qsort as a bubble sort (but only if they want to be ridiculed and possibly go out of business).
For example, there are only seven (very small) sub-points in section 21.4.7.2 basic_string::find of C++11, and none of them specify performance parameters.
Let's look at the CLRS book. On page 989 of the third edition we have the following exercise:
Suppose that pattern P and text T are randomly chosen strings of length m and n, respectively, from the d-ary alphabet Ʃ_d = {0, 1, ..., d-1}, where d >= 2. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is
(n - m + 1) * (1 - d^{-m}) / (1 - d^{-1}) <= 2 * (n - m + 1)
over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.
NAIVE-STRING-MATCHER(T, P)
1  n = T.length
2  m = P.length
3  for s = 0 to n - m
4      if P[1..m] == T[s+1..s+m]
5          print "Pattern occurs with shift" s
Proof:
For a single shift, the expected number of comparisons is 1 + 1/d + ... + 1/d^{m-1}. Now apply the geometric series formula and multiply by the number of valid shifts, which is n - m + 1. □
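Writing the bound out explicitly (the last step uses d >= 2):

$$(n - m + 1)\sum_{k=0}^{m-1} d^{-k} \;=\; (n - m + 1)\,\frac{1 - d^{-m}}{1 - 1/d} \;\le\; (n - m + 1)\,\frac{1}{1 - 1/2} \;=\; 2\,(n - m + 1).$$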
Where do you get your information about the C++ library? If you mean std::string::find() and it really doesn't use the KMP algorithm, then I suggest that it is because that algorithm isn't generally faster than a simple linear search, due to having to build a partial-match table before the search can proceed.
If you are going to be searching for the same pattern in multiple texts, the Boyer-Moore algorithm is a good choice, because the pattern tables need only be computed once but are used every time a text is searched. If you are only going to search for a pattern once in one text, though, the overhead of computing the tables, along with the overhead of allocating memory, slows you down too much, and std::string::find(...) will beat you since it does not allocate any memory and has no setup overhead.
Boost has multiple string-searching algorithms. I found that Boyer-Moore was an order of magnitude slower than std::string::find() in the case of a single pattern search in one text.
For my cases, Boyer-Moore was rarely faster than std::string::find(), even when searching multiple texts with the same pattern.
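If you want to reproduce that kind of comparison yourself, C++17 provides std::boyer_moore_searcher in <functional>; a minimal sketch (timing code omitted):

#include <algorithm>
#include <functional>
#include <iostream>
#include <string>

int main() {
    std::string text = "the quick brown fox jumps over the lazy dog";
    std::string pattern = "lazy";

    // Plain find: no table construction, no allocation.
    auto pos = text.find(pattern);

    // Boyer-Moore: pays for table construction once; reuse the searcher
    // object when scanning many texts for the same pattern.
    std::boyer_moore_searcher searcher(pattern.begin(), pattern.end());
    auto it = std::search(text.begin(), text.end(), searcher);

    std::cout << pos << ' ' << (it - text.begin()) << '\n';   // both print 35
}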
For more details, look up the Boyer-Moore string-search algorithm.

What is complexity of this pointer function?

Here is the situation: a function receives a pointer ptr (pointing to a character array outside the function), and the loop simply calculates the length. What will be the complexity function of this function? I have calculated it myself.
Is my complexity calculation correct?
Inside the while loop, should I count 3 operations: *, ptr+c, and !=?
The asterisk * is the dereference, ptr+c computes the address, and != is the condition check.
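(The original snippet is not reproduced above; the following is a reconstruction consistent with the operations described, not the poster's exact code.)

// Hypothetical reconstruction of the loop being analyzed: it counts characters
// until it reaches the terminating '\0'. Each iteration performs one address
// computation (ptr + c), one dereference (*), and one comparison (!=), so the
// total work is proportional to the string length N, i.e. O(N).
int length(const char* ptr) {
    int c = 0;
    while (*(ptr + c) != '\0') {
        c++;
    }
    return c;
}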
When we say O(N), we really mean O(N) + c, where c is a constant. This is because with very small N the characteristics of the complexity may not show themselves, since inessential overhead (noise) can dominate; that is what c represents. But as N grows, the constant becomes insignificant, and the growth relative to N becomes the dominant cost. Typically, a constant merely shifts the graph up or down by a small amount but doesn't change its shape. Thus we omit it and just say O(N), which is what we're really interested in knowing.
As for coefficients, same thing happens. If a complexity is proportional to N, it is a linear relationship. The graph of a linear operation, multiplied by X, will have a different slope but its shape is still linear. Thus, the coefficient is not really providing additional information in terms of complexity category.
Similarly, if you have something that is O(N) + O(N^2), the N^2 term dominates, so we ignore the smaller O(N) term and just call it an O(N^2) algorithm. It's not an exact science; it's a categorization used to understand the nature of the complexity. Big-O notation describes an upper bound, and is most often quoted for the worst case.
I think you might be interested in the coefficients and constants only when evaluating the relative metrics of specific algorithms that have the same categorical complexity. (Some O(N*lg(n)) sorts are faster than others, for example.) But usually it is measured a different way.
It depends strongly on the definition of "complexity" that you're using (certainly this is not one of the typical asymptotic notations) but, if I properly remember these assignments from this level of education, yes you have counted correctly.
The resulting worst-case asymptotic complexity in big-O notation would be O(N), derived by dropping the coefficients and lower-order terms.

What is the time complexity of the following function?

int func(int n) {
    if (n == 1)
        return 0;
    else
        return sqrt(n);
}
Where sqrt(n) is a C math.h library function.
O(1)
O(lg n)
O(lg lg n)
O(n)
I think that the running time entirely depends on the sqrt(n). However, I don't know how this function is actually implemented.
P.S. The general approach towards finding the square root of a number that I know of is using Newton's method. If I am not wrong, the time complexity using Newton's method turns out to be O(lg n). So should the answer be O(lg n)?
P.P.S. Got this question in a recent test that I appeared for.
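Regarding the P.S.: as an illustration only (not how math.h's sqrt is actually implemented), here is a minimal integer square root using Newton's method; with the initial guess x = n, the iteration count grows roughly like log(n):

#include <cstdint>
#include <iostream>

// Integer square root via Newton's method (illustrative sketch).
std::uint64_t isqrt(std::uint64_t n) {
    if (n < 2) return n;
    std::uint64_t x = n;                 // initial guess
    std::uint64_t y = (x + n / x) / 2;   // one Newton step
    while (y < x) {                      // stop once the estimate no longer decreases
        x = y;
        y = (x + n / x) / 2;
    }
    return x;                            // floor(sqrt(n))
}

int main() {
    std::cout << isqrt(1000000) << '\n'; // prints 1000
}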
I am going to give a slightly more general answer, without assuming a constant-size int.
The answer is Theta(log n).
We know Newton-Raphson is Theta(log n) - that excludes Theta(n) (assuming sqrt() is as efficient as it can be).
However, a general number n requires log_2(n) bits to encode - and you have to read all of them in order to compute an accurate sqrt(). This excludes Theta(1) and Theta(log(log(n))).
From the above, we know that the complexity of the function is Theta(log(n)).
As a side note, since O(log(n)) is a subset of O(n), O(n) is also a valid answer, though not a tight one. For more information about big Theta and big O and their differences, you might want to have a look at this thread.
This depends on the implementation of sqrt and also on what kind of time complexity you are interested in.
I would say you can consider it to be "constant", so O(1), in this sense: if you put in a random int, it will on average take the same amount of time. (Reason: numbers with many digits are much more common.)
But have a look here. Another possible answer is O(M(n)), where M(n) is the complexity of a multiplication and n is the number of digits in your integer.
This looks like a textbook question and is perhaps meant to be a trap. The teacher perhaps wants to check whether you can distinguish between computing sqrt for a list of numbers (which would be O(n)) and for a single number (which would be O(1)).
Be aware that the "correct" answer often also depends on the context in which it is asked.
Let n = 2^m, i.e. m = log(2,n).
Given T(n) = T(sqrt(n)) + 1:
T(2^m) = T(2^{m/2}) + 1
Let T(2^m) = S(m); then
S(m) = S(m/2) + 1
Using the master theorem,
S(m) = theta(log(m))
= theta(log(log(n)))
Hence, under that recurrence the time complexity is theta(log(log(n))).

Can we know if a collection is almost sorted without applying a sort algorithm?

In the wikipedia article on sorting algorithms,
http://en.wikipedia.org/wiki/Sorting_algorithm#Summaries_of_popular_sorting_algorithms
under Bubble sort it says: "Bubble sort can also be used efficiently on a list of any length that is nearly sorted (that is, the elements are not significantly out of place)."
So my question is: without first sorting the list with a sorting algorithm, how can one know whether it is nearly sorted or not?
Are you familiar with the general sorting lower bound? You can prove that any comparison-based sorting algorithm must make Ω(n log n) comparisons in the average case. The way you prove this is through an information-theoretic argument: there are n! possible permutations of the input array, and since the only way you can learn which permutation you got is to make comparisons, you have to make at least lg(n!) comparisons in order to be certain that you know the structure of your input permutation.
I haven't worked out the math on this, but I suspect that you could make similar arguments to show that it's difficult to learn how sorted a particular array is. Essentially, if you don't do a large number of comparisons, then you wouldn't be able to tell apart an array that's mostly sorted from an array that is actually quite far from sorted. As a result, all the algorithms I'm aware of that measure "sortedness" take a decent amount of time to do so.
For example, one measure of the level of "sortedness" in an array is the number of inversions in that array. You can count the number of inversions in an array in time O(n log n) using a divide-and-conquer algorithm based on mergesort, but with that runtime you could just sort the array instead.
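For completeness, a sketch of that divide-and-conquer inversion counter (illustrative, not tuned):

#include <algorithm>
#include <vector>

// Count inversions with a merge-sort-style divide and conquer.
// Runs in O(n log n) - the same order as just sorting the array.
long long count_inversions(std::vector<int>& a, int lo, int hi) {   // half-open [lo, hi)
    if (hi - lo < 2) return 0;
    int mid = lo + (hi - lo) / 2;
    long long inv = count_inversions(a, lo, mid) + count_inversions(a, mid, hi);

    // Merge the two sorted halves, counting cross-half inversions.
    std::vector<int> merged;
    merged.reserve(hi - lo);
    int i = lo, j = mid;
    while (i < mid && j < hi) {
        if (a[j] < a[i]) {
            inv += mid - i;              // everything left in a[i..mid) is > a[j]
            merged.push_back(a[j++]);
        } else {
            merged.push_back(a[i++]);
        }
    }
    while (i < mid) merged.push_back(a[i++]);
    while (j < hi)  merged.push_back(a[j++]);
    std::copy(merged.begin(), merged.end(), a.begin() + lo);
    return inv;
}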
Typically, the way that you'd know that your array was mostly sorted was to know something a priori about how it was generated. For example, if you're looking at temperature data gathered from 8AM - 12PM, it's very likely that the data is already mostly sorted (modulo some variance in the quality of the sensor readings). If your data looks at a stock price over time, it's also likely to be mostly sorted unless the company has a really wonky trajectory. Some other algorithms also partially sort arrays; for example, it's not uncommon for quicksort implementations to stop sorting when the size of the array left to sort is small and to follow everything up with a final insertion sort pass, since every element won't be very far from its final position then.
I don't believe there exists any standardized measure of how sorted or random an array is.
You can come up with your own measure - like count the number of adjacent pairs which are out of order (suggested in comment), or count the number of larger numbers which occur before smaller numbers in the array (this is trickier than a simple single pass).
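A sketch of the first (single-pass) measure, assuming integer elements:

#include <cstddef>
#include <vector>

// One cheap O(n) "sortedness" measure: the fraction of adjacent pairs that are
// already in order (1.0 means fully sorted; about 0.5 is typical for random data).
double ordered_adjacent_fraction(const std::vector<int>& a) {
    if (a.size() < 2) return 1.0;
    std::size_t ordered = 0;
    for (std::size_t i = 1; i < a.size(); ++i)
        if (a[i - 1] <= a[i]) ++ordered;
    return static_cast<double>(ordered) / (a.size() - 1);
}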