strangely slow quicksort for large tables - c++

I have been doing my homework, which is to compare a bunch of sorting algorithms, and I have come across a strange phenomenon. Things have been as expected: insertion sort winning for something like a table of 20 ints, and quicksort otherwise outperforming heapsort and mergesort, up to a table of 500,000 ints (stored in memory). For 5,000,000 ints (still stored in memory), quicksort suddenly becomes worse than heapsort and mergesort. The numbers are always uniformly distributed random values, and Windows virtual memory is turned off. Does anyone have an idea what could be the cause of that?
template<typename T>
void quicksortit(T *tab, int s) {
    if (s == 0 || s == 1) return;
    T tmp;
    if (s == 2) {
        if (tab[0] > tab[1]) {
            tmp = tab[0];
            tab[0] = tab[1];
            tab[1] = tmp;
        }
        return;
    }
    T pivot = tab[s-1];
    T *f1, *f2;
    f1 = f2 = tab;
    for (int i = 0; i < s; i++)
        if (*f2 > pivot)
            f2++;
        else {
            tmp = *f1;
            *f1 = *f2;
            *f2 = tmp;
            f1++; f2++;
        }
    quicksortit(tab, (f1 - 1) - tab);
    quicksortit(f1, f2 - f1);
}

Your algorithm starts failing when there are many duplicates in the array. You only noticed this at large sizes because you have been feeding the algorithm random values with a large span (I'm assuming you used rand() over 0 - RAND_MAX), and that problem only appears with large arrays.
When you try to sort an array of identical numbers (try sorting 100,000 identical numbers; the program will crash), you first walk through the entire array, superfluously swapping elements. Then you split the array into two, but the larger part has only been reduced by 1:
    quicksortit(tab,(f1-1)-tab);
Thus your algorithm becomes O(n^2), and you also consume a very large amount of stack. Searching for a better pivot will not help you in this case; rather, choose a version of quicksort() that doesn't exhibit this flaw.
For example:
function quicksort(array)
    if length(array) > 1
        pivot := select middle, or a median of first, last and middle
        left := first index of array
        right := last index of array
        while left <= right
            while array[left] < pivot
                left := left + 1
            while array[right] > pivot
                right := right - 1
            if left <= right
                swap array[left] with array[right]
                left := left + 1
                right := right - 1
        quicksort(array from first index to right)
        quicksort(array from left to last index)
Which is a modified version of: http://rosettacode.org/wiki/Sorting_algorithms/Quicksort
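For concreteness, here is a minimal C++ rendering of that pseudocode; this is a sketch, not a tuned implementation, and the function name quicksort_mid and the lo/hi calling convention are mine:
// Hoare-style partition with the middle element as pivot; sorts tab[lo..hi] inclusive.
template<typename T>
void quicksort_mid(T *tab, int lo, int hi) {
    if (lo >= hi) return;
    T pivot = tab[lo + (hi - lo) / 2];
    int left = lo, right = hi;
    while (left <= right) {
        while (tab[left]  < pivot) ++left;
        while (tab[right] > pivot) --right;
        if (left <= right) {
            T tmp = tab[left]; tab[left] = tab[right]; tab[right] = tmp;
            ++left; --right;
        }
    }
    quicksort_mid(tab, lo, right);   // elements <= pivot
    quicksort_mid(tab, left, hi);    // elements >= pivot
}
// usage: quicksort_mid(tab, 0, s - 1);
Note that on an array of identical elements both partitions shrink by roughly half at every step, so the degenerate O(n^2) case above does not occur.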

It could be that your array is now bigger than the L3 cache.
Quicksort's partitioning operation moves random elements from one end of the array to the other. A typical Intel L3 cache is about 8MB. With 5M 4-byte elements, your array is 20MB, and you're writing from one end of it to the other.
Cache misses that fall out of L3 go to main memory and can be much slower than misses served by the higher cache levels.
That is, up until now your entire sorting operation was running completely inside the CPU caches.


What is the time complexity of the below program?

Below is a program which finds the length of the longest substring without repeating characters, given a string str.
#include <string>
#include <unordered_set>
using namespace std;

int test(string str) {
    int left = 0, right = 0, ans = 0;
    unordered_set<char> set;
    while (left < str.size() and right < str.size()) {
        if (set.find(str[right]) == set.end()) set.insert(str[right]);
        else {
            while (str[left] != str[right]) {
                set.erase(str[left]);
                left++;
            }
            left++;
        }
        right++;
        ans = (ans > set.size() ? ans : set.size());
    }
    return ans;
}
What is the time complexity of the above solution? Is it O(n^2) or O(n), where n is the length of the string?
Please note that I have gone through multiple questions on the internet and have also read about big-O, but I am still confused. To me it looks like O(n^2) because of the two while loops, but I want to confirm with the experts here.
It's O(n) on average.
What you see here is a sliding window technique (with variable window size, also called the "two pointers" technique).
Yes, there are two loops, but if you look closely, every iteration of either loop increases one of the two pointers (either left or right).
The outer loop either calls the inner loop or it doesn't, but it increases right at each iteration. The inner loop always increases left.
Both left and right can take at most n different values (the outer loop stops when right >= n, and the inner loop stops at the latest when left == right).
So the outer loop executes n times (all the values of right from 0 to n-1) and the inner loop executes at most n times in total (all the possible values of left), which gives a worst case of 2n = O(n) loop iterations.
Worst-case complexity
For the sake of completeness, please note that I wrote O(n) on average. The reason is that set.find has a complexity of O(1) on average but O(n) in the worst case. The same goes for set.erase. This is because unordered_set is implemented with a hash table, and in the very unlikely case of all your items landing in the same bucket, it needs to iterate over all of them.
So even though we have O(n) iterations of the loop, some individual iterations could be O(n). It means that in some very unlikely cases, the execution could go up to O(n^2). You shouldn't really worry about it, as the probability of this happening is close to 0, and even though I don't know exactly how char is hashed in C++, I would bet that we will never end up with all characters in the same bucket.

Something faster than std::nth_element

I'm working on a kd-tree implementation and I'm currently using std::nth_element to partition a vector of elements by their median. However, std::nth_element takes 90% of the tree construction time. Can anyone suggest a more efficient alternative?
Thanks in advance
Do you really need the nth element, or do you need an element "near" the middle?
There are faster ways to get an element "near" the middle. One example goes roughly like:
function rough_middle(container)
    divide container into subsequences of length 5
    find median of each subsequence of length 5        ~ O(k) * O(n/5)
    return rough_middle( { median of each subsequence } )   ~ O(rough_middle(n/5))
The result should be something that is roughly in the middle. A real nth element algorithm might use something like the above, and then clean it up afterwards to find the actual nth element.
At n=5, you get the middle.
At n=25, you get the middle of the short sequence middles. This is going to be greater than all of the lesser of each short sequence, or at least the 9th element and no more than the 16th element, or 36% away from edge.
At n=125, you get the rough middle of each short sequence middle. This is at least the 9th middle, so there are 8*3+2=26 elements less than your rough middle, or 20.8% away from edge.
At n=625, you get the rough middle of each short sequence middle. This is at least the 26th middle, so there are 77 elements less than your rough middle, or 12% away from the edge.
At n=5^k, you get the rough middle of the 5^(k-1) rough middles. If the rough middle of a 5^k sequence is r(k), then r(k+1) = r(k)*3-1 ~ 3^k.
3^k grows slower than 5^k in O-notation.
3^log_5(n)
= e^( ln(3) ln(n)/ln(5) )
= n^(ln(3)/ln(5))
=~ n^0.68
is a very rough estimate of the lower bound of where the rough_middle of a sequence of n elements ends up.
In theory, it may take as many as roughly n^0.33 iterations of reduction to reach a single element, which isn't really that good. (Each step is only guaranteed to discard about n^0.68 of the remaining n elements, so it can take on the order of n / n^0.68 =~ n^0.33 steps to consume them all -- more, in fact, because as n shrinks, the amount discarded at each step shrinks with it.)
The way that the nth element solutions I've seen solve this is by doing a partition and repair at each level: instead of recursing into rough_middle, you recurse into middle. The real middle of the medians is then guaranteed to be pretty close to the actual middle of your sequence, and you can "find the real middle" relatively quickly (in O-notation) from this.
Possibly we can optimize this process by doing more accurate rough_middle iterations when there are more elements, while never forcing it to be the actual middle? The bigger the final n is, the closer the recursive calls need to get to the middle for the end result to be reasonably close to the middle.
But in practice, the probability that your sequence is a really bad one that actually takes ~n^0.33 steps to partition down to nothing is probably really low. It is a bit like the quicksort pivot problem: a median of 3 elements is usually good enough.
A quick stats analysis.
You pick 5 elements at random, and pick the middle one.
The median index of a random sample of 2m+1 values from a uniform distribution follows a beta distribution with parameters roughly (m+1, m+1), with some scaling factor for non-[0,1] intervals.
The mean of the median is clearly 1/2. The variance (of the median expressed as a fraction of the interval) is:
(3*3) / ( (3+3)^2 * (3+3+1) )
= 9 / (36 * 7)
=~ 0.036
Figuring out the next step is beyond my stats. I'll cheat.
If we imagine that taking the median-index element from a bunch of items with mean 0.5 and variance 0.036 is as good as averaging their indexes...
Let n now be the number of elements in our original set.
Then the average of the (fractional) indexes of the n/5 short-sequence medians has a mean of 0.5 and a variance of 0.036 / (n/5) = 0.18 / n.
In index terms, that is a mean of n/2 and a standard deviation of about 0.42 * sqrt(n).
Oh, if that were true, that would be awesome. A spread that grows only like sqrt(n) while the range grows like n means that as n gets large, the average index of the medians of the short sequences gets ridiculously tightly distributed relative to n. I guess it makes some sense. Sadly, we aren't quite doing that -- we want the distribution of the pseudo-median of the medians of the short sequences, which is almost certainly worse.
Implementation detail: with a logarithmic amount of memory overhead we can compute an in-place rough median (we might even be able to do it without the memory overhead!).
We maintain a vector of "targets", each holding up to 5 indexes plus a count of how many are filled.
Each target is a successive layer.
For each element, we push its index into the bottom layer. If that layer becomes full, we take the median of its entries, push it into the next layer up, and clear the layer we just emptied (and so on up the stack).
At the end, we collapse whatever is left over.
#include <array>
#include <vector>
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// one layer: a count of filled slots plus up to 5 stored indexes
using target = std::pair<std::size_t, std::array<std::size_t, 5>>;

// returns true when the layer becomes full
bool push( target& t, std::size_t i ) {
  t.second[t.first] = i;
  ++t.first;
  return t.first == 5;
}

template<class Container>
std::size_t extract_median( Container const& c, target& t ) {
  assert(t.first != 0);
  std::sort( t.second.data(), t.second.data() + t.first,
    [&c](std::size_t lhs, std::size_t rhs){
      return c[lhs] < c[rhs];
    } );
  std::size_t r = t.second[t.first / 2];
  t.first = 0;
  return r;
}

template<class Container>
void advance(Container const& c, std::vector<target>& targets, std::size_t i) {
  std::size_t height = 0;
  while(true) {
    if (targets.size() <= height)
      targets.push_back({});
    if (!push(targets[height], i))
      return;
    i = extract_median(c, targets[height]);
    ++height;                     // propagate the median to the next layer up
  }
}

template<class Container>
std::size_t collapse(Container const& c, target* b, target* e) {
  if (b == e) return std::size_t(-1);
  std::size_t before = collapse(c, b, e - 1);
  target& last = *(e - 1);
  if (before != std::size_t(-1))
    push(last, before);
  if (last.first == 0)
    return std::size_t(-1);
  return extract_median(c, last);
}

template<class Container>
std::size_t rough_median_index( Container const& c ) {
  std::vector<target> targets;
  for (auto const& x : c) {
    advance(c, targets, &x - c.data());
  }
  return collapse(c, targets.data(), targets.data() + targets.size());
}
which sketches out how it could work on random access containers.
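A minimal usage sketch, assuming a std::vector<float> holding one coordinate of your points (the vector name and the partition step are illustrative, not part of the answer above):
#include <algorithm>
#include <vector>

void split_around_rough_median(std::vector<float>& coords) {   // assumes coords is non-empty
    std::size_t m = rough_median_index(coords);   // index of an element near the median
    float pivot = coords[m];
    // partition around that value instead of calling std::nth_element on the exact median
    std::partition(coords.begin(), coords.end(),
                   [pivot](float x) { return x < pivot; });
}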
If you have more lookups than insertions into the vector you could consider using a data structure which sorts on insertion -- such as std::set -- and then use std::advance() to get the n'th element in sorted order.
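For illustration, a minimal sketch of that idea (the names are mine); note that std::advance over set iterators is itself a linear walk, so this only makes sense in the scenario described:
#include <cstddef>
#include <iterator>
#include <set>

// Returns the n'th smallest element (0-based) of a multiset kept sorted on insertion.
float nth_smallest(const std::multiset<float>& sorted, std::size_t n) {
    auto it = sorted.begin();
    std::advance(it, n);          // linear walk to the n'th element in sorted order
    return *it;
}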

Find dominant mode of an unsorted array

Note, this is a homework assignment.
I need to find the mode of an array (positive values) and, secondarily, return that value if the mode occurs more than (array size)/2 times -- the dominant value. Some arrays will have neither.
That is simple enough, but there is a constraint that the array must NOT be sorted prior to the determination; additionally, the complexity must be on the order of O(n log n).
Using this second constraint and the master theorem, we can determine that for a divide-and-conquer recurrence T(n) = A*T(n/B) + n^D to be O(n log n), we need log_B(A) = D = 1, i.e. A = B with D = 1; the simplest choice is A = B = 2, D = 1. This is also convenient, since the dominant value must be dominant in the 1st half, the 2nd half, or both halves of the array.
Using T(n) = A*T(n/B) + n^D, we know that the search function will call itself twice at each level (A) and divide the problem set in two at each level (B). I'm stuck figuring out how to do the linear amount of work required at each level -- the combine step.
To make some code of this:
int search(a, b) {
    search(a, a + (b-a)/2);
    search(a + (b-a)/2 + 1, b);
}
The "glue" I'm missing here is how to combine these divided functions and I think that will implement the n^2 complexity. There is some trick here where the dominant must be the dominant in the 1st or 2nd half or both, not quite sure how that helps me right now with the complexity constraint.
I've written down some examples of small arrays and I've drawn out ways it would divide. I can't seem to go in the correct direction of finding one, single method that will always return the dominant value.
At level 0, the function needs to call itself to search the first half and second half of the array. That needs to recurse, and call itself. Then at each level, it needs to perform n^2 operations. So in an array [2,0,2,0,2] it would split that into a search on [2,0] and a search on [2,0,2] AND perform 25 operations. A search on [2,0] would call a search on [2] and a search on [0] AND perform 4 operations. I'm assuming these would need to be a search of the array space itself. I was planning to use C++ and use something from STL to iterate and count the values. I could create a large array and just update counts by their index.
If some number occurs more than half the time, it can be done with O(n) time complexity and O(1) space complexity as follows:
int num = a[0], occ = 1;
for (int i = 1; i < n; i++) {
    if (a[i] == num) occ++;
    else {
        occ--;
        if (occ < 0) {
            num = a[i];
            occ = 1;
        }
    }
}
Since you are not sure whether such a number occurs, all you need to do is apply the above algorithm to get a candidate first, then iterate over the whole array a second time to count the occurrences of that candidate and check whether the count is greater than half.
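Putting the two passes together, a complete sketch might look like this (the function name and the -1 "not found" convention are mine; the question says the values are positive, so -1 is free to signal "no dominant value"):
// Boyer-Moore majority vote (candidate pass) followed by a verification pass.
int find_dominant(const int a[], int n) {
    if (n == 0) return -1;
    // pass 1: find the only candidate that could possibly be dominant
    int num = a[0], occ = 1;
    for (int i = 1; i < n; i++) {
        if (a[i] == num) occ++;
        else {
            occ--;
            if (occ < 0) { num = a[i]; occ = 1; }
        }
    }
    // pass 2: verify that the candidate really occurs more than n/2 times
    int count = 0;
    for (int i = 0; i < n; i++)
        if (a[i] == num) count++;
    return (count > n / 2) ? num : -1;
}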
If you want to find just the dominant mode of an array, and do it recursively, here's the pseudo-code:
def DominantMode(array):
    # if there is only one element, that's the dominant mode
    if len(array) == 1: return array[0]
    # otherwise, find the dominant mode of the left and right halves
    left = DominantMode(array[0 : len(array)/2])
    right = DominantMode(array[len(array)/2 : len(array)])
    # if both sides have the same dominant mode, the whole array has that mode
    if left == right: return left
    # otherwise, we have to scan the whole array to determine which one wins
    leftCount = sum(element == left for element in array)
    rightCount = sum(element == right for element in array)
    if leftCount > len(array) / 2: return left
    if rightCount > len(array) / 2: return right
    # if neither wins, just return None
    return None
The above algorithm is O(nlogn) time but only O(logn) space.
If you want to find the mode of an array (not just the dominant mode), first compute the histogram. You can do this in O(n) time (visiting each element of the array exactly once) by storing the histogram in a hash table that maps each element value to its frequency.
Once the histogram has been computed, you can iterate over it (visiting each element at most once) to find the highest frequency. Once you find a frequency larger than half the size of the array, you can return immediately and ignore the rest of the histogram. Since the size of the histogram can be no larger than the size of the original array, this step is also O(n) time (and O(n) space).
Since both steps are O(n) time, the resulting algorithmic complexity is O(n) time.
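A C++ sketch of that histogram approach (the names are mine; assumes a non-empty array). 'dominant' is set when the mode occurs in more than half the slots:
#include <unordered_map>
#include <vector>

int mode_of(const std::vector<int>& a, bool& dominant) {
    std::unordered_map<int, int> hist;                 // value -> frequency, built in O(n)
    for (int x : a) ++hist[x];
    int mode = a[0], best = 0;
    for (const auto& kv : hist) {                      // single scan over the histogram
        if (kv.second > best) { mode = kv.first; best = kv.second; }
    }
    dominant = best > static_cast<int>(a.size()) / 2;
    return mode;
}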

Fast merge of sorted subsets of 4K floating-point numbers in L1/L2

What is a fast way to merge sorted subsets of an array of up to 4096 32-bit floating point numbers on a modern (SSE2+) x86 processor?
Please assume the following:
The size of the entire set is at maximum 4096 items
The size of the subsets is open to discussion, but let us assume between 16-256 initially
All data used through the merge should preferably fit into L1
The L1 data cache size is 32K. 16K has already been used for the data itself, so you have 16K to play with
All data is already in L1 (with as high degree of confidence as possible) - it has just been operated on by a sort
All data is 16-byte aligned
We want to try to minimize branching (for obvious reasons)
Main criterion of feasibility: faster than an in-L1 LSD radix sort.
I'd be very interested to see if someone knows of a reasonable way to do this given the above parameters! :)
Here's a very naive way to do it. (Please excuse any 4am delirium-induced pseudo-code bugs ;)
// 4x sorted subsets
data[4][4] = {
    {3, 4, 5, INF},
    {2, 7, 8, INF},
    {1, 4, 4, INF},
    {5, 8, 9, INF}
}
data_offset[4] = {0, 0, 0, 0}
n = 4*3
for (i = 0; i < n; i++):
    sub = 0
    sub = 1 * (data[sub][data_offset[sub]] > data[1][data_offset[1]])
    sub = 2 * (data[sub][data_offset[sub]] > data[2][data_offset[2]])
    sub = 3 * (data[sub][data_offset[sub]] > data[3][data_offset[3]])
    out[i] = data[sub][data_offset[sub]]
    data_offset[sub]++
Edit:
With AVX2 and its gather support, we could compare up to 8 subsets at once.
Edit 2:
Depending on type casting, it might be possible to shave off 3 extra clock cycles per iteration on a Nehalem (mul: 5, shift+sub: 4)
//Assuming 'sub' is uint32_t
sub = ... << ((data[sub][data_offset[sub]] > data[...][data_offset[...]]) - 1)
Edit 3:
It may be possible to exploit out-of-order execution to some degree, especially as K gets larger, by using two or more max values:
max1 = 0
max2 = 1
max1 = 2 * (data[max1][data_offset[max1]] > data[2][data_offset[2]])
max2 = 3 * (data[max2][data_offset[max2]] > data[3][data_offset[3]])
...
max1 = 6 * (data[max1][data_offset[max1]] > data[6][data_offset[6]])
max2 = 7 * (data[max2][data_offset[max2]] > data[7][data_offset[7]])
q = data[max1][data_offset[max1]] < data[max2][data_offset[max2]]
sub = max1*q + ((~max2)&1)*q
Edit 4:
Depending on compiler intelligence, we can remove multiplications altogether using the ternary operator:
sub = (data[sub][data_offset[sub]] > data[x][data_offset[x]]) ? x : sub
Edit 5:
In order to avoid costly floating point comparisons, we could simply reinterpret_cast<uint32_t*>() the data, as this would result in an integer compare.
Another possibility is to utilize SSE registers as these are not typed, and explicitly use integer comparison instructions.
This works because the operators <, > and == yield the same results when the floats (non-negative ones, at least) are interpreted at the binary level as integers.
Edit 6:
If we unroll our loop sufficiently to match the number of values to the number of SSE registers, we could stage the data that is being compared.
At the end of an iteration we would then re-transfer the register which contained the selected maximum/minimum value, and shift it.
Although this requires reworking the indexing slightly, it may prove more efficient than littering the loop with LEA's.
This is more of a research topic, but I did find this paper which discusses minimizing branch mispredictions using d-way merge sort.
SIMD sorting algorithms have already been studied in detail. The paper Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture describes an efficient algorithm for doing what you describe (and much more).
The core idea is that you can reduce merging two arbitrarily long lists to merging blocks of k consecutive values (where k can range from 4 to 16): the first block is z[0] = merge(x[0], y[0]).lo. To obtain the second block, we know that the leftover merge(x[0], y[0]).hi contains nx elements from x and ny elements from y, with nx+ny == k. But z[1] cannot contain elements from both x[1] and y[1], because that would require z[1] to contain more than nx+ny elements: so we just have to find out which of x[1] and y[1] needs to be added. The one with the lower first element will necessarily appear first in z, so this is simply done by comparing their first element. And we just repeat that until there is no more data to merge.
Pseudo-code, assuming the arrays end with a +inf value:
a := *x++
b := *y++
while not finished:
lo,hi := merge(a,b)
*z++ := lo
a := hi
if *x[0] <= *y[0]:
b := *x++
else:
b := *y++
(note how similar this is to the usual scalar implementation of merging)
The conditional jump is of course not necessary in an actual implementation: for example, you could conditionally swap x and y with an xor trick, and then read unconditionally *x++.
merge itself can be implemented with a bitonic sort. But if k is low, there will be a lot of inter-instruction dependencies resulting in high latency. Depending on the number of arrays you have to merge, you can then choose k high enough so that the latency of merge is masked, or if this is possible interleave several two-way merges. See the paper for more details.
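To make the control flow concrete, here is a scalar C++ sketch of that loop under some assumptions of mine: k = 4, each input run is padded with one block of +inf sentinels, both runs contain at least one data block, and a plain std::merge stands in for the in-register bitonic merge. All names are illustrative:
#include <algorithm>
#include <array>

constexpr int K = 4;                      // block size; the paper uses 4..16
using Block = std::array<float, K>;

// Stand-in for the SIMD bitonic merger: merges two sorted K-blocks and splits
// the result into its lower and upper K elements.
static void merge_blocks(const Block& a, const Block& b, Block& lo, Block& hi) {
    std::array<float, 2 * K> m;
    std::merge(a.begin(), a.end(), b.begin(), b.end(), m.begin());
    std::copy(m.begin(), m.begin() + K, lo.begin());
    std::copy(m.begin() + K, m.end(), hi.begin());
}

// Merges two sorted runs of nx and ny data blocks into z (nx + ny blocks).
// Each input is followed by one block of +inf sentinels, and nx, ny >= 1.
void merge_blockwise(const Block* x, int nx, const Block* y, int ny, Block* z) {
    Block a = *x++, b = *y++, lo, hi;
    const int total = nx + ny;
    for (int out = 0; out < total; ++out) {
        merge_blocks(a, b, lo, hi);       // a real implementation does this in registers
        z[out] = lo;
        a = hi;
        if (out + 1 == total) break;      // hi now holds only sentinels
        // fetch the block whose first element is smaller (the comparison from the text)
        if ((*x)[0] <= (*y)[0]) b = *x++;
        else                    b = *y++;
    }
}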
Edit: Below is a diagram when k = 4. All asymptotics assume that k is fixed.
The big gray box is merging two arrays of size n = m * k (in the picture, m = 3).
We operate on blocks of size k.
The "whole-block merge" box merges the two arrays block-by-block by comparing their first elements. This is a linear time operation, and it doesn't consume memory because we stream the data to the rest of the block. The performance doesn't really matter because the latency is going to be limited by the latency of the "merge4" blocks.
Each "merge4" box merges two blocks, outputs the lower k elements, and feeds the upper k elements to the next "merge4". Each "merge4" box performs a bounded number of operations, and the number of "merge4" is linear in n.
So the time cost of merging is linear in n. And because "merge4" has a lower latency than performing 8 serial non-SIMD comparisons, there will be a large speedup compared to non-SIMD merging.
Finally, to extend our 2-way merge to merge many arrays, we arrange the big gray boxes in classical divide-and-conquer fashion. Each level has complexity linear in the number of elements, so the total complexity is O(n log (n / n0)) with n0 the initial size of the sorted arrays and n is the size of the final array.
The most obvious answer that comes to mind is a standard N-way merge using a heap. That'll be O(N log k). The number of subsets is between 16 and 256, so the worst case behavior (with 256 subsets of 16 items each) would be 8N.
Cache behavior should be ... reasonable, although not perfect. The heap, where most of the action is, will probably remain in the cache throughout. The part of the output array being written to will also most likely be in the cache.
What you have is 16K of data (the array with sorted subsequences), the heap (1K, worst case), and the sorted output array (16K again), and you want it to fit into a 32K cache. Sounds like a problem, but perhaps it isn't. The data that will most likely be swapped out is the front of the output array after the insertion point has moved. Assuming that the sorted subsequences are fairly uniformly distributed, they should be accessed often enough to keep them in the cache.
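For concreteness, a sketch of that heap-based N-way merge; the back-to-back run layout, the run_len parameter and the function name are assumptions of mine, not from the question:
#include <functional>
#include <queue>
#include <utility>
#include <vector>

void kway_merge(const float* data, int run_len, int k, float* out) {
    using Head = std::pair<float, int>;                        // (value, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<int> pos(k, 0);                                // next unread offset per run
    for (int r = 0; r < k; ++r)
        heap.push({data[r * run_len], r});                     // seed with each run's head
    while (!heap.empty()) {
        Head h = heap.top();
        heap.pop();
        *out++ = h.first;
        int r = h.second;
        if (++pos[r] < run_len)
            heap.push({data[r * run_len + pos[r]], r});        // refill from the same run
    }
}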
You can do the (expensive) merging of int arrays branch-free.
typedef unsigned uint;
typedef uint* uint_ptr;

void merge(uint* in1_begin, uint* in1_end, uint* in2_begin, uint* in2_end, uint* out) {
    uint_ptr in    [] = {in1_begin, in2_begin};
    uint_ptr in_end[] = {in1_end,   in2_end};
    // the loop branch is cheap because it is easily predictable
    while (in[0] != in_end[0] && in[1] != in_end[1]) {
        // i is 1 exactly when the second input's head is the smaller one
        int i = (*in[1] - *in[0]) >> 31;
        *out = *in[i];
        ++out;
        ++in[i];
    }
    // copy the remaining stuff ...
}
Note that (*in[1] - *in[0]) >> 31 is equivalent to *in[1] - *in[0] < 0, which is equivalent to *in[1] < *in[0]. The reason I wrote it using the bit-shift trick instead of
int i = *in[1] < *in[0];
is that not all compilers generate branch-free code for the < version.
Unfortunately you are using floats instead of ints, which at first seems like a showstopper because I do not see how to reliably implement *in[1] < *in[0] branch-free. However, on most modern architectures you can interpret the bit patterns of positive floats (that are also not NaNs, INFs or other strange things) as ints, compare them using <, and still get the correct result. Perhaps you can extend this observation to arbitrary floats.
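As a concrete illustration of that observation, a small sketch (assuming non-negative, finite values; the helper names are mine):
#include <cstdint>
#include <cstring>

static inline uint32_t float_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);     // well-defined, unlike pointer casting
    return u;
}

// 1 if a < b, 0 otherwise, using an integer compare on the IEEE-754 bit patterns.
// Valid only for finite, non-negative inputs (no NaN, no -0.0, no negatives).
static inline uint32_t less_pos(float a, float b) {
    return (uint32_t)(float_bits(a) < float_bits(b));
}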
You could do a simple merge kernel to merge K lists:
float *input[K];
float *output;

while (true) {
    float min = *input[0];
    int min_idx = 0;
    for (int i = 1; i < K; i++) {
        float v = *input[i];
        if (v < min) {
            min = v;        // do with cmov
            min_idx = i;    // do with cmov
        }
    }
    if (min == SENTINEL) break;
    *output++ = min;
    input[min_idx]++;
}
There's no heap, so it is pretty simple. The bad part is that it is O(NK), which can be bad if K is large (unlike the heap implementation which is O(N log K)). So then you just pick a maximum K (4 or 8 might be good, then you can unroll the inner loop), and do larger K by cascading merges (handle K=64 by doing 8-way merges of groups of lists, then an 8-way merge of the results).

Has anyone seen this improvement to quicksort before?

Handling repeated elements in previous quicksorts
I have found a way to handle repeated elements more efficiently in quicksort and would like to know if anyone has seen this done before.
This method greatly reduces the overhead involved in checking for repeated elements which improves performance both with and without repeated elements. Typically, repeated elements are handled in a few different ways which I will first enumerate.
First, there is the Dutch National Flag method, which sorts the array like [ < pivot | == pivot | unsorted | > pivot ].
Second, there is the method of putting the equal elements to the far left during the sort, so the layout is [ == pivot | < pivot | unsorted | > pivot ]; after the partition, the == elements are moved to the center.
Third, Bentley-McIlroy partitioning puts the == elements on both sides, so the layout is [ == pivot | < pivot | unsorted | > pivot | == pivot ]; afterwards the == elements are moved to the middle.
The last two methods are done in an attempt to reduce the overhead.
My Method
Now, let me explain how my method improves the quicksort by reducing the number of comparisons.
I use two quicksort functions together rather than just one.
The first function I will call q1 and it sorts an array as [ < pivot | unsorted | >= pivot].
The second function I will call q2 and it sorts the array as [ <= pivot | unsorted | > pivot].
Let's now look at the usage of these in tandem in order to improve the handling of repeated elements.
First of all, we call q1 to sort the whole array. It picks a pivot, which we will refer to as pivot1, and partitions around it. Thus, our array is partitioned at this point as [ < pivot1 | >= pivot1 ].
Then, for the [ < pivot1 ] partition, we send it to q1 again; that part is fairly normal, so let's look at the other partition first.
For the [ >= pivot1 ] partition, we send it to q2. q2 chooses a pivot, which we will refer to as pivot2, from within this sub-array and partitions it into [ <= pivot2 | > pivot2 ].
If we look now at the entire array, our partitioning looks like [ < pivot1 | >= pivot1 and <= pivot2 | > pivot2 ]. This looks very much like a dual-pivot quicksort.
Now, let's return to the subarray inside q2 ([ <= pivot2 | > pivot2 ]).
For the [ > pivot2 ] partition, we just send it back to q1, which is not very interesting.
For the [ <= pivot2 ] partition, we first check whether pivot1 == pivot2. If they are equal, then this partition is already sorted, because its elements are all equal! If the pivots aren't equal, then we just send this partition to q2 again, which picks a pivot (call it pivot3), partitions, and if pivot3 == pivot1 it does not have to sort the [ <= pivot3 ] part, and so on.
Hopefully you get the point by now. The improvement with this technique is that equal elements are handled without having to check each element against the pivots for equality. In other words, it uses fewer comparisons.
There is one other possible improvement that I have not tried yet, which is to check in qs2 whether the [ <= pivot2 ] partition is rather large (or the [ > pivot2 ] partition very small) compared to its total subarray, and in that case fall back to a more standard check for repeated elements (one of the methods listed above).
Source Code
Here are two very simplified qs1 and qs2 functions. They use the Sedgewick converging-pointers method of partitioning. They can obviously be optimized a great deal (they choose pivots extremely poorly, for instance), but this is just to show the idea. My own implementation is longer, faster and much harder to read, so let's start with this:
void qs2(int a[], long left, long right);   // forward declaration: qs1 and qs2 call each other

// qs1 sorts into [ < p | >= p ]
void qs1(int a[], long left, long right) {
    // Pick a pivot and set up some indices
    int pivot = a[right], temp;
    long i = left - 1, j = right;
    // do the sort
    for (;;) {
        while (a[++i] < pivot);
        while (a[--j] >= pivot) if (i == j) break;
        if (i >= j) break;
        temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }
    // Put the pivot in the correct spot
    temp = a[i];
    a[i] = a[right];
    a[right] = temp;
    // send the [ < p ] partition to qs1
    if (left < i - 1)
        qs1(a, left, i - 1);
    // send the [ >= p ] partition to qs2
    if (right > i + 1)
        qs2(a, i + 1, right);
}

// qs2 sorts into [ <= p | > p ]
void qs2(int a[], long left, long right) {
    // Pick a pivot and set up some indices
    int pivot = a[left], temp;
    long i = left, j = right + 1;
    // do the sort
    for (;;) {
        while (a[--j] > pivot);
        while (a[++i] <= pivot) if (i == j) break;
        if (i >= j) break;
        temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }
    // Put the pivot in the correct spot
    temp = a[j];
    a[j] = a[left];
    a[left] = temp;
    // Send the [ > p ] partition to qs1
    if (right > j + 1)
        qs1(a, j + 1, right);
    // Here is where we check the pivots.
    // a[left-1] is the other pivot we need to compare with.
    // This handles the repeated elements.
    if (pivot != a[left - 1])
        // since the pivots don't match, we pass [ <= p ] on to qs2
        if (left < j - 1)
            qs2(a, left, j - 1);
}
I know that this is a rather simple idea, but it gives a pretty significant improvement in runtime when I add in the standard quicksort improvements (median-of-3 pivot choosing, and insertion sort for small arrays, for a start). If you are going to test using this code, only do it on random data, because of the poor pivot choosing (or improve the pivot choice). To use this sort you would call:
qs1(array, 0, indexofendofarray);
Some Benchmarks
If you want to know just how fast it is, here is a little bit of data for starters. This uses my optimized version, not the one given above. However, the one given above is still much closer in time to the dual-pivot quicksort than the std::sort time.
On highly random data with 2,000,000 elements, I get these times (from sorting several consecutive datasets):
std::sort - 1.609 seconds
dual-pivot quicksort - 1.25 seconds
qs1/qs2 - 1.172 seconds
Where std::sort is the C++ Standard Library sort, the dual-pivot quicksort is the one that came out several months ago from Vladimir Yaroslavskiy, and qs1/qs2 is my quicksort implementation.
On much less random data, with 2,000,000 elements generated with rand() % 1000 (which means that each value has roughly 2,000 copies), the times are:
std::sort - 0.468 seconds
dual-pivot quicksort - 0.438 seconds
qs1/qs2 - 0.407 seconds
There are some instances where the dual-pivot quicksort wins out, and I do realize that the dual-pivot quicksort could be optimized further, but the same could safely be said of my quicksort.
Has anyone seen this before?
I know this is a long question/explanation, but have any of you seen this improvement before? If so, then why isn't it being used?
Vladimir Yaroslavskiy | 11 Sep 12:35
Replacement of Quicksort in java.util.Arrays with new Dual-Pivot Quicksort
Visit http://permalink.gmane.org/gmane.comp.java.openjdk.core-libs.devel/2628
To answer your question: no, I have not seen this approach before. I'm not going to profile your code and do the other hard work, but perhaps the following are next steps/considerations in formally presenting your algorithm. In the real world, sorting algorithms are implemented to have:
Good scalability / complexity and Low overhead
Scaling and overhead are obvious and easy to measure. When profiling sorting, in addition to time, measure the number of comparisons and swaps. Performance on large files will also depend on disk seek time. For example, merge sort works well on large files with a magnetic disk. (See also Quick Sort vs Merge Sort.)
Wide range of inputs with good performance
There's lots of data that needs sorting, and applications are known to produce data in patterns, so it is important to make the sort resilient against poor performance under certain patterns. Your algorithm optimizes for repeated numbers. What if all numbers are repeated, but only once each (i.e. seq 1000>file; seq 1000>>file; shuf file)? What if the numbers are already sorted? Sorted backwards? What about a pattern of 1,2,3,1,2,3,1,2,3,1,2,3? 1,2,3,4,5,6,7,6,5,4,3,2,1? 7,6,5,4,3,2,1,2,3,4,5,6,7? Poor performance in one of these common scenarios would be a deal breaker! Before comparing against a published general-purpose algorithm, it is wise to have this analysis prepared.
Low-risk of pathological performance
Of all the permutations of inputs, there is one that performs worse than all the others. How much worse is it than the average? And how many permutations provide similarly poor performance?
Good luck on your next steps!
It's a great improvement, and I'm sure it has been implemented specifically where a lot of equal objects are expected. There are many off-the-wall tweaks of this kind.
If I understand everything you wrote correctly, the reason it's not generally "known" is that it does not improve the basic O(n^2) worst-case behaviour. That means: double the number of objects, quadruple the time. Your improvement doesn't change this unless all objects are equal.
std::sort is not exactly fast.
Here are results I get comparing it to a randomized parallel non-recursive quicksort:
pnrqSort (longs):
.:.1 000 000 36ms (items per ms: 27777.8)
.:.5 000 000 140ms (items per ms: 35714.3)
.:.10 000 000 296ms (items per ms: 33783.8)
.:.50 000 000 1s 484ms (items per ms: 33692.7)
.:.100 000 000 2s 936ms (items per ms: 34059.9)
.:.250 000 000 8s 300ms (items per ms: 30120.5)
.:.400 000 000 12s 611ms (items per ms: 31718.3)
.:.500 000 000 16s 428ms (items per ms: 30435.8)
std::sort(longs)
.:.1 000 000 134ms (items per ms: 7462.69)
.:.5 000 000 716ms (items per ms: 6983.24)
std::sort vector of longs
1 000 000 511ms (items per ms: 1956.95)
2 500 000 943ms (items per ms: 2651.11)
Since you have an extra method, it is going to cause more stack use, which will ultimately slow things down. Why median-of-3 is used, I don't know, because it's a poor method; with random pivot points quicksort never has big issues with uniform or presorted data, and there's no danger of intentional median-of-3 killer data.
Nobody else seems to like your algorithm, but I do. It seems to me it's a nice way to redo classic quicksort in a manner that is now safe for use with highly repeated elements.
Your q1 and q2 subalgorithms, it seems to me, are actually the SAME algorithm except that the < and <= operators are interchanged (plus a few other things), which if you wanted would allow you to write shorter pseudocode for this (though it might be less efficient). I recommend you read
JL Bentley, MD McIlroy: Engineering a Sort Function, SOFTWARE--PRACTICE AND EXPERIENCE 23, 11 (Nov 1993), 1249-1265
available here:
http://www.skidmore.edu/~meckmann/2009Spring/cs206/papers/spe862jb.pdf
to see the tests they put their quicksort through. Your idea might be nicer and/or better, but it needs to run the gauntlet of the kinds of tests they tried, using some particular pivot-choosing method. Find one that passes all their tests without ever suffering quadratic runtime. Then, if in addition your algorithm is both faster and nicer than theirs, you would clearly have a worthwhile contribution.
The "Tukey ninther" they use to generate a pivot seems usable by you too, and will automatically make it very hard for the quadratic-time worst case to arise in practice. I mean, if you just use median-of-3 with the middle and two end elements of the array as your three samples, then an adversary can make the initial array state increase and then decrease, and you'll fall on your face with quadratic runtime on a not-too-implausible input. But with the Tukey ninther on 9 elements, it's pretty hard for me to construct a plausible input that hurts you with quadratic runtime.
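For concreteness, a sketch of that ninther pivot selection (a sketch only; the sample positions and helper names are my own choice). A real implementation would then swap the chosen element into the pivot slot that qs1/qs2 expect:
// Median of three values.
static int median3(int a, int b, int c) {
    if (a < b) {
        if (b < c) return b;
        return (a < c) ? c : a;
    } else {
        if (a < c) return a;
        return (b < c) ? c : b;
    }
}

// Tukey ninther: the median of the medians of three groups of three samples
// spread across a[left..right].
static int ninther(const int a[], long left, long right) {
    long n = right - left + 1;
    long step = n / 8;                   // spreads the nine samples over the range
    long mid = left + n / 2;
    int m1 = median3(a[left],             a[left + step],  a[left + 2 * step]);
    int m2 = median3(a[mid - step],       a[mid],          a[mid + step]);
    int m3 = median3(a[right - 2 * step], a[right - step], a[right]);
    return median3(m1, m2, m3);
}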
Another view & a suggestion:
Think of the combination of q1 splitting your array, then q2 splitting the right subarray,
as a single q12 algorithm producing a 3-way split of the array. Now, you need to recurse
on the 3 subarrays (or only 2 if the two pivots happen to be equal). Now always
recurse on the SMALLEST of the subarrays you were going to recurse on, FIRST, and
the largest LAST -- and do not implement this largest one as a recursion, but rather just stay in the same routine and loop back up to the top with a shrunk window. That way
you have 1 fewer recursive call in q12 than you would have, but the main point of this is,
it is now IMPOSSIBLE for the recursion stack to ever get more than O(logN) long.
OK? This solves another annoying worst-case problem quicksort can suffer while also making
your code a bit faster anyhow.
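A minimal sketch of that suggestion on a plain single-pivot skeleton (partition() here is a hypothetical helper returning the pivot's final position; the same shape applies to q12 with its three subarrays):
long partition(int a[], long left, long right);  // hypothetical: returns the pivot's final slot

void quicksort_loop(int a[], long left, long right) {
    while (left < right) {
        long p = partition(a, left, right);
        if (p - left < right - p) {          // left part is the smaller one
            quicksort_loop(a, left, p - 1);  // recurse on the smaller part...
            left = p + 1;                    // ...and loop on the larger part
        } else {
            quicksort_loop(a, p + 1, right);
            right = p - 1;
        }
    }
}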