Has anyone seen this improvement to quicksort before? - c++

Handling repeated elements in previous quicksorts
I have found a way to handle repeated elements more efficiently in quicksort and would like to know if anyone has seen this done before.
This method greatly reduces the overhead involved in checking for repeated elements, which improves performance both with and without repeated elements. Typically, repeated elements are handled in a few different ways, which I will first enumerate.
First, there is the Dutch National Flag method, which sorts the array as [ < pivot | == pivot | unsorted | > pivot ].
Second, there is the method of putting the equal elements to the far left during the sort, so that it looks like [ == pivot | < pivot | unsorted | > pivot ]; after the sort, the == elements are moved to the center.
Third, Bentley-McIlroy partitioning puts the == elements on both sides, so the sort is [ == pivot | < pivot | unsorted | > pivot | == pivot ], and afterwards the == elements are moved to the middle.
The last two methods are done in an attempt to reduce the overhead.
My Method
Now, let me explain how my method improves the quicksort by reducing the number of comparisons.
I use two quicksort functions together rather than just one.
The first function I will call q1 and it sorts an array as [ < pivot | unsorted | >= pivot].
The second function I will call q2 and it sorts the array as [ <= pivot | unsorted | > pivot].
Let's now look at the usage of these in tandem in order to improve the handling of repeated elements.
First of all, we call q1 to sort the whole array. It picks a pivot, which we will refer to as pivot1, and partitions around it. Thus, our array is sorted to this point as [ < pivot1 | >= pivot1 ].
Then, for the [ < pivot1 ] partition, we send it to q1 again; that part is fairly routine, so let's look at the other partition first.
For the [ >= pivot1 ] partition, we send it to q2. q2 chooses a pivot, which we will refer to as pivot2, from within this sub-array and sorts it into [ <= pivot2 | > pivot2 ].
If we look now at the entire array, our sorting looks like [ < pivot1 | >= pivot1 and <= pivot2 | > pivot2]. This looks very much like a dual-pivot quicksort.
Now, let's return to the subarray inside of q2 ([ <= pivot2 | > pivot2]).
For the [ > pivot2] partition, we just send it back to q1 which is not very interesting.
For the [ <= pivot2 ] partition, we first check if pivot1 == pivot2. If they are equal, then this partition is already sorted, because it consists entirely of equal elements! If the pivots aren't equal, then we just send this partition to q2 again, which picks a pivot (call it pivot3), sorts, and if pivot3 == pivot1, then it does not have to sort the [ <= pivot3 ] part, and so on.
Hopefully, you get the point by now. The improvement with this technique is that equal elements are handled without having to check whether each element is also equal to the pivots. In other words, it uses fewer comparisons.
There is one other possible improvement that I have not tried yet: check in qs2 whether the size of the [ <= pivot2 ] partition is rather large (or the [ > pivot2 ] partition very small) in comparison to the size of its total subarray, and in that case do a more standard check for repeated elements (one of the methods listed above).
Source Code
Here are two very simplified qs1 and qs2 functions. They use the Sedgewick converging-pointers method of sorting. They can obviously be optimized much further (they choose pivots extremely poorly, for instance), but this is just to show the idea. My own implementation is longer, faster, and much harder to read, so let's start with this:
void qs2(int a[], long left, long right); // forward declaration, since qs1 and qs2 call each other

// qs1 sorts into [ < p | >= p ]
void qs1(int a[], long left, long right){
    // Pick a pivot and set up some indices
    int pivot = a[right], temp;
    long i = left - 1, j = right;
    // do the sort
    for(;;){
        while(a[++i] < pivot);
        while(a[--j] >= pivot) if(i == j) break;
        if(i >= j) break;
        temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }
    // Put the pivot in the correct spot
    temp = a[i];
    a[i] = a[right];
    a[right] = temp;
    // send the [ < p ] partition to qs1
    if(left < i - 1)
        qs1(a, left, i - 1);
    // send the [ >= p ] partition to qs2
    if(right > i + 1)
        qs2(a, i + 1, right);
}
void qs2(int a[], long left, long right){
    // Pick a pivot and set up some indices
    int pivot = a[left], temp;
    long i = left, j = right + 1;
    // do the sort
    for(;;){
        while(a[--j] > pivot);
        while(a[++i] <= pivot) if(i == j) break;
        if(i >= j) break;
        temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }
    // Put the pivot in the correct spot
    temp = a[j];
    a[j] = a[left];
    a[left] = temp;
    // Send the [ > p ] partition to qs1
    if(right > j + 1)
        qs1(a, j + 1, right);
    // Here is where we check the pivots.
    // a[left-1] is the other pivot we need to compare with.
    // This handles the repeated elements.
    if(pivot != a[left-1])
        // since the pivots don't match, we pass [ <= p ] on to qs2
        if(left < j - 1)
            qs2(a, left, j - 1);
}
I know that this is a rather simple idea, but it gives a pretty significant improvement in runtime when I add in the standard quicksort improvements (median-of-3 pivot choosing and insertion sort for small arrays, for a start). If you are going to test using this code, only do it on random data, because of the poor pivot choosing (or improve the pivot choice). To use this sort you would call:
qs1(array,0,indexofendofarray);
Some Benchmarks
If you want to know just how fast it is, here is a little bit of data for starters. This uses my optimized version, not the one given above. However, the one given above is still much closer in time to the dual-pivot quicksort than the std::sort time.
On highly random data with 2,000,000 elements, I get these times (from sorting several consecutive datasets):
std::sort - 1.609 seconds
dual-pivot quicksort - 1.25 seconds
qs1/qs2 - 1.172 seconds
Where std::sort is the C++ Standard Library sort, the dual-pivot quicksort is the one published several months ago by Vladimir Yaroslavskiy, and qs1/qs2 is my quicksort implementation.
On much less random data, with 2,000,000 elements generated with rand() % 1000 (which means each value has roughly 2,000 copies), the times are:
std::sort - 0.468 seconds
dual-pivot quicksort - 0.438 seconds
qs1/qs2 - 0.407 seconds
There are some instances where the dual-pivot quicksort wins out and I do realize that the dual-pivot quicksort could be optimized more, but the same could be safely stated for my quicksort.
Has anyone seen this before?
I know this is a long question/explanation, but have any of you seen this improvement before? If so, then why isn't it being used?

Vladimir Yaroslavskiy | 11 Sep 12:35
Replacement of Quicksort in java.util.Arrays with new Dual-Pivot Quicksort
Visit http://permalink.gmane.org/gmane.comp.java.openjdk.core-libs.devel/2628

To answer your question, no I have not seen this approach before. I'm not going to profile your code and do the other hard work, but perhaps the following are next steps/considerations in formally presenting your algorithm. In the real world, sorting algorithms are implemented to have:
Good scalability / complexity and Low overhead
Scaling and overhead are obvious and easy to measure. When profiling a sort, in addition to time, measure the number of comparisons and swaps. Performance on large files will also depend on disk seek time; for example, merge sort works well on large files with a magnetic disk. (See also Quick Sort Vs Merge Sort.) A sketch of such instrumentation follows.
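For the counting part, here is a minimal sketch of what that instrumentation might look like (all names are illustrative, not from the question's code):

#include <cstddef>

struct SortStats {
    std::size_t comparisons = 0;
    std::size_t swaps = 0;
};

static SortStats g_stats;

// Route every comparison and swap through these wrappers so each call is counted.
inline bool counted_less(int a, int b) {
    ++g_stats.comparisons;
    return a < b;
}

inline void counted_swap(int &a, int &b) {
    ++g_stats.swaps;
    int t = a; a = b; b = t;
}

Reset g_stats before each run and report it next to the elapsed time.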
Wide range of inputs with good performance
There's lots of data that needs sorting. And applications are known to produce data in patterns, so it is important to make sure the sort is resilient against poor performance under certain patterns. Your algorithm optimizes for repeated numbers. What if every number appears exactly twice (e.g. seq 1000>file; seq 1000>>file; shuf file)? What if the numbers are already sorted? Sorted backwards? What about a pattern of 1,2,3,1,2,3,1,2,3,1,2,3? 1,2,3,4,5,6,7,6,5,4,3,2,1? 7,6,5,4,3,2,1,2,3,4,5,6,7? Poor performance in one of these common scenarios is a deal breaker! Before comparing against a published general-purpose algorithm, it is wise to have this analysis prepared; a sketch of a generator for such patterns follows.
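A minimal sketch of a generator for the patterns mentioned above (the kind names are my own labels):

#include <algorithm>
#include <numeric>
#include <random>
#include <string>
#include <vector>

std::vector<int> make_pattern(const std::string &kind, int n) {
    std::vector<int> v(n);
    if (kind == "sorted") {
        std::iota(v.begin(), v.end(), 0);
    } else if (kind == "reversed") {
        std::iota(v.begin(), v.end(), 0);
        std::reverse(v.begin(), v.end());
    } else if (kind == "two_copies_shuffled") {        // seq; seq; shuf
        for (int i = 0; i < n; ++i) v[i] = i / 2;
        std::shuffle(v.begin(), v.end(), std::mt19937{42});
    } else if (kind == "cycle123") {                   // 1,2,3,1,2,3,...
        for (int i = 0; i < n; ++i) v[i] = i % 3 + 1;
    } else if (kind == "organ_pipe") {                 // rises to a peak, then falls
        for (int i = 0; i < n; ++i) v[i] = std::min(i, n - 1 - i);
    } else {                                           // "valley": falls, then rises
        for (int i = 0; i < n; ++i) v[i] = std::max(i, n - 1 - i);
    }
    return v;
}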
Low-risk of pathological performance
Of all the permutations of inputs, there is one that performs worse than the others. How much worse does that perform than average? And how many permutations will provide similar poor performance?
Good luck on your next steps!

It's a great improvement, and I'm sure it's been implemented specifically where a lot of equal objects are expected. There are many off-the-wall tweaks of this kind.
If I understand all you wrote correctly, the reason it's not generally "known" is that it does not improve the basic O(n²) worst-case performance. That means: double the number of objects, quadruple the time. Your improvement doesn't change this unless all objects are equal.

std::sort is not exactly fast.
Here are results I get comparing it to randomized parallel nonrecursive quicksort:
pnrqSort (longs):
.:.1 000 000 36ms (items per ms: 27777.8)
.:.5 000 000 140ms (items per ms: 35714.3)
.:.10 000 000 296ms (items per ms: 33783.8)
.:.50 000 000 1s 484ms (items per ms: 33692.7)
.:.100 000 000 2s 936ms (items per ms: 34059.9)
.:.250 000 000 8s 300ms (items per ms: 30120.5)
.:.400 000 000 12s 611ms (items per ms: 31718.3)
.:.500 000 000 16s 428ms (items per ms: 30435.8)
std::sort(longs)
.:.1 000 000 134ms (items per ms: 7462.69)
.:.5 000 000 716ms (items per ms: 6983.24)
std::sort vector of longs
1 000 000 511ms (items per ms: 1956.95)
2 500 000 943ms (items per ms: 2651.11)
Since you have an extra function, it is going to cause more stack use, which will ultimately slow things down. Why median-of-3 is used, I don't know, because it's a poor method, but with random pivot points quicksort never has big issues with uniform or presorted data, and there's no danger of intentional median-of-3 killer data.
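For reference, a minimal sketch of the random-pivot idea, written against the qs1/qs2 signatures from the question (the helper name is illustrative):

#include <random>

long random_pivot_index(long left, long right) {
    // Seeded once; every element in [left, right] is equally likely.
    static std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<long> dist(left, right);
    return dist(rng);
}

// Usage inside qs1 (or qs2), before partitioning: swap
// a[random_pivot_index(left, right)] with a[right] (a[left] for qs2),
// then proceed exactly as before.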

Nobody seems to like your algorithm, but I do. Seems to me it's a nice way to re-do classic quicksort in a manner now safe for use with highly repeated elements.
Your q1 and q2 subalgorithms, it seems to me, are actually the SAME algorithm except that the < and <= operators are interchanged (plus a few other things), which if you wanted would allow you to write shorter pseudocode for this (though it might be less efficient). I recommend you read
J. L. Bentley, M. D. McIlroy: Engineering a Sort Function. Software: Practice and Experience 23(11) (Nov 1993), 1249-1265, available at
http://www.skidmore.edu/~meckmann/2009Spring/cs206/papers/spe862jb.pdf
to see the tests they put their quicksort through. Your idea might be nicer and/or better,
but it needs to run the gauntlet of the kinds of tests they tried, using some
particular pivot-choosing method. Find one that passes all their tests without ever suffering quadratic runtime. Then if in addition your algorithm is both faster and nicer than theirs, you would then clearly have a worthwhile contribution.
The "Tukey Ninther" thing they use to generate a pivot seems to me is usable by you too
and will automatically make it very hard for the quadratic time worst case to arise in practice.
I mean, if you just use median-of-3 and try the middle and two end elements of the array as
your three, then an adversary will make the initial array state be increasing then decreasing and then you'll fall on your face with quadratic runtime on a not-too-implausible input. But with Tukey Ninther on 9 elements, it's pretty hard for me to construct
a plausible input which hurts you with quadratic runtime.
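A hedged sketch of the ninther (the median of three medians-of-three) for the question's int arrays; the sampling positions here are one common choice, not the paper's exact ones, and it is intended for subarrays of at least nine elements:

static int median3(int a, int b, int c) {
    if (a < b) {
        if (b < c) return b;          // a < b < c
        return (a < c) ? c : a;       // b is the largest
    }
    if (a < c) return a;              // b <= a < c
    return (b < c) ? c : b;           // a is the largest
}

int ninther(const int a[], long left, long right) {
    long n = right - left + 1, step = n / 8;
    long m = left + n / 2;
    return median3(median3(a[left],            a[left + step],  a[left + 2 * step]),
                   median3(a[m - step],        a[m],            a[m + step]),
                   median3(a[right - 2 * step], a[right - step], a[right]));
}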
Another view & a suggestion:
Think of the combination of q1 splitting your array, then q2 splitting the right subarray, as a single q12 algorithm producing a 3-way split of the array. You then need to recurse on the 3 subarrays (or only 2 if the two pivots happen to be equal). Now: always recurse on the SMALLEST of the subarrays FIRST, and the largest LAST -- and do not implement this largest one as a recursion, but rather just stay in the same routine and loop back up to the top with a shrunk window. That way you have one fewer recursive call in q12 than you would have, but the main point is that it is now IMPOSSIBLE for the recursion stack to ever get more than O(log N) deep.
OK? This solves another annoying worst-case problem quicksort can suffer, while also making your code a bit faster anyhow. A sketch of this reshaping follows.
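A minimal sketch of that reshaping, shown on a plain two-way quicksort for brevity (partition is an assumed helper returning the pivot's final index); the same idea carries over to the q12 three-way split:

// assumed: long partition(int a[], long left, long right) returns the pivot's final index
void qsort_small_first(int a[], long left, long right) {
    while (left < right) {
        long split = partition(a, left, right);
        // Recurse into the smaller half (so the stack depth is O(log N)),
        // and loop on the larger half instead of recursing.
        if (split - left < right - split) {
            if (left < split - 1) qsort_small_first(a, left, split - 1);
            left = split + 1;
        } else {
            if (split + 1 < right) qsort_small_first(a, split + 1, right);
            right = split - 1;
        }
    }
}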


What is the time complexity of the below program?

Below is a program which finds the length of the longest substring without repeating characters, given a string str.
#include <string>
#include <unordered_set>
using namespace std;

int test(string str) {
    int left = 0, right = 0, ans = 0;
    unordered_set<char> set;
    while (left < (int)str.size() && right < (int)str.size()) {
        if (set.find(str[right]) == set.end()) {
            set.insert(str[right]);
        } else {
            while (str[left] != str[right]) {
                set.erase(str[left]);
                left++;
            }
            left++;
        }
        right++;
        ans = (ans > (int)set.size() ? ans : (int)set.size());
    }
    return ans;
}
What is the time complexity of the above solution? Is it O(n²) or O(n), where n is the length of the string?
Please note that I have gone through multiple questions on the internet and have also read about big O, but I am still confused. To me, it looks like O(n²) complexity due to the two while loops, but I want to confirm with the experts here.
It's O(n) on average.
What you see here is a sliding window technique (with variable window size, also called "two pointers technique").
Yes there are two loops, but if you look, any iteration of any of the two loops will always increase one of the pointers (either left or right).
In the first loop, either you call the second loop or you don't, but you will increase right at each iteration. The second loop always increases left.
Both left and right can have n different values (because both loops would stop when either right >= n or left == right).
So the first loop will have n executions (all the values of right from 0 to n-1) and the second loop can have at most n executions (all the possible values of left), which is a worst case of 2n = O(n) executions.
Worst case complexity
For the sake of completeness, please note that I wrote O(n) on average. The reason is that set.find has a complexity of O(1) on average but O(n) in the worst case. The same goes for set.erase. This is because unordered_set is implemented with a hash table, and in the very unlikely case of all your items being in the same bucket, it needs to iterate over all the items.
So even though we have O(n) iterations of the loop, some iterations could be O(n). It means that in some very unlikely cases, the execution could go up to O(n²). You shouldn't really worry about it, as the probability of this happening is close to 0, and even though I don't know exactly what the hashing technique for char in C++ is, I would bet that we will never end up with all characters in the same bucket.

What is the time complexity of linked list traversal using recursion? [duplicate]

I have gone through Google and Stack Overflow search, but nowhere was I able to find a clear and straightforward explanation of how to calculate time complexity.
What do I know already?
Say for code as simple as the one below:
char h = 'y'; // This will be executed 1 time
int abc = 0; // This will be executed 1 time
Say for a loop like the one below:
for (int i = 0; i < N; i++) {
    Console.Write("Hello, World!!");
}
int i = 0; this will be executed only once. The time is actually charged to the assignment i = 0, not the declaration.
i < N; this will be executed N+1 times.
i++ will be executed N times.
So the number of operations required by this loop is {1 + (N+1) + N} = 2N + 2. (But this still may be wrong, as I am not confident about my understanding.)
OK, so these small basic calculations I think I know, but in most cases I have seen the time complexity as O(N), O(n^2), O(log n), O(n!), and many others.
How to find time complexity of an algorithm
You add up how many machine instructions it will execute as a function of the size of its input, then simplify the expression to the largest term (when N is very large) and drop any constant factors.
For example, let's see how we simplify 2N + 2 machine instructions to just O(N).
Why do we remove the two 2s ?
We are interested in the performance of the algorithm as N becomes large.
Consider the two terms 2N and 2.
What is the relative influence of these two terms as N becomes large? Suppose N is a million.
Then the first term is 2 million and the second term is only 2.
For this reason, we drop all but the largest terms for large N.
So, now we have gone from 2N + 2 to 2N.
Traditionally, we are only interested in performance up to constant factors.
This means that we don't really care if there is some constant multiple of difference in performance when N is large. The unit of 2N is not well-defined in the first place anyway. So we can multiply or divide by a constant factor to get to the simplest expression.
So 2N becomes just N.
This is an excellent article: Time complexity of algorithm
The below answer is copied from above (in case the excellent link goes bust)
The most common metric for calculating time complexity is Big O notation. This removes all constant factors so that the running time can be estimated in relation to N as N approaches infinity. In general you can think of it like this:
statement;
Is constant. The running time of the statement will not change in relation to N.
for ( i = 0; i < N; i++ )
    statement;
Is linear. The running time of the loop is directly proportional to N. When N doubles, so does the running time.
for ( i = 0; i < N; i++ ) {
    for ( j = 0; j < N; j++ )
        statement;
}
Is quadratic. The running time of the two loops is proportional to the square of N. When N doubles, the running time increases by a factor of four.
while ( low <= high ) {
    mid = ( low + high ) / 2;
    if ( target < list[mid] )
        high = mid - 1;
    else if ( target > list[mid] )
        low = mid + 1;
    else
        break;
}
Is logarithmic. The running time of the algorithm is proportional to the number of times N can be divided by 2. This is because the algorithm divides the working area in half with each iteration.
void quicksort (int list[], int left, int right)
{
    if (left >= right) return;   // base case, needed for termination
    int pivot = partition (list, left, right);
    quicksort (list, left, pivot - 1);
    quicksort (list, pivot + 1, right);
}
Is N * log (N). The running time consists of N loops (iterative or recursive) that are logarithmic, thus the algorithm is a combination of linear and logarithmic.
In general, doing something with every item in one dimension is linear, doing something with every item in two dimensions is quadratic, and dividing the working area in half is logarithmic. There are other Big O measures such as cubic, exponential, and square root, but they're not nearly as common. Big O notation is described as O(<type>) where <type> is the measure. The quicksort algorithm would be described as O(N * log(N)).
Note that none of this has taken into account best, average, and worst case measures. Each would have its own Big O notation. Also note that this is a VERY simplistic explanation. Big O is the most common, but it's also more complex than I've shown. There are also other notations such as big omega, little o, and big theta. You probably won't encounter them outside of an algorithm analysis course. ;)
Taken from here - Introduction to Time Complexity of an Algorithm
1. Introduction
In computer science, the time complexity of an algorithm quantifies the amount of time taken by an algorithm to run as a function of the length of the string representing the input.
2. Big O notation
The time complexity of an algorithm is commonly expressed using big O notation, which excludes coefficients and lower order terms. When expressed this way, the time complexity is said to be described asymptotically, i.e., as the input size goes to infinity.
For example, if the time required by an algorithm on all inputs of size n is at most 5n³ + 3n, the asymptotic time complexity is O(n³). More on that later.
A few more examples:
1 = O(n)
n = O(n²)
log(n) = O(n)
2n + 1 = O(n)
3. O(1) constant time:
An algorithm is said to run in constant time if it requires the same amount of time regardless of the input size.
Examples:
array: accessing any element
fixed-size stack: push and pop methods
fixed-size queue: enqueue and dequeue methods
4. O(n) linear time
An algorithm is said to run in linear time if its time execution is directly proportional to the input size, i.e. time grows linearly as input size increases.
Consider the following examples. Below I am linearly searching for an element, and this has a time complexity of O(n).
int find = 66;
var numbers = new int[] { 33, 435, 36, 37, 43, 45, 66, 656, 2232 };
for (int i = 0; i < numbers.Length; i++)   // note: < Length, so the last element is checked too
{
    if (find == numbers[i])
    {
        return;
    }
}
More Examples:
Array: Linear Search, Traversing, Find minimum etc
ArrayList: contains method
Queue: contains method
5. O(log n) logarithmic time:
An algorithm is said to run in logarithmic time if its time execution is proportional to the logarithm of the input size.
Example: Binary Search
Recall the "twenty questions" game - the task is to guess the value of a hidden number in an interval. Each time you make a guess, you are told whether your guess is too high or too low. Twenty questions game implies a strategy that uses your guess number to halve the interval size. This is an example of the general problem-solving method known as binary search.
6. O(n²) quadratic time
An algorithm is said to run in quadratic time if its time execution is proportional to the square of the input size.
Examples:
Bubble Sort
Selection Sort
Insertion Sort
7. Some useful links
Big-O Misconceptions
Determining The Complexity Of Algorithm
Big O Cheat Sheet
Several examples of loops:
O(n): the time complexity of a loop is considered O(n) if the loop variable is incremented or decremented by a constant amount. For example, the following functions have O(n) time complexity.
// Here c is a positive integer constant
for (int i = 1; i <= n; i += c) {
    // some O(1) expressions
}

for (int i = n; i > 0; i -= c) {
    // some O(1) expressions
}
O(n^c): the time complexity of nested loops is equal to the number of times the innermost statement is executed. For example, the following sample loops have O(n²) time complexity:
for (int i = 1; i <= n; i += c) {
    for (int j = 1; j <= n; j += c) {
        // some O(1) expressions
    }
}
for (int i = n; i > 0; i -= c) {
    for (int j = i + 1; j <= n; j += c) {
        // some O(1) expressions
    }
}
For example, selection sort and insertion sort have O(n2) time complexity.
O(log n): the time complexity of a loop is considered O(log n) if the loop variable is divided or multiplied by a constant amount.
for (int i = 1; i <= n; i *= c) {
    // some O(1) expressions
}

for (int i = n; i > 0; i /= c) {
    // some O(1) expressions
}
For example, binary search has O(log n) time complexity.
O(log log n): the time complexity of a loop is considered O(log log n) if the loop variable is reduced or increased exponentially by a constant amount.
// Here c is a constant greater than 1
for (int i = 2; i <= n; i = pow(i, c)) {
    // some O(1) expressions
}

// Here fun is sqrt or cuberoot or any other constant root
for (int i = n; i > 0; i = fun(i)) {
    // some O(1) expressions
}
One example of time complexity analysis
void fun(int n)
{
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j < n; j += i)
        {
            // Some O(1) task
        }
    }
}
Analysis:
For i = 1, the inner loop is executed n times.
For i = 2, the inner loop is executed approximately n/2 times.
For i = 3, the inner loop is executed approximately n/3 times.
For i = 4, the inner loop is executed approximately n/4 times.
…………………………………………………….
For i = n, the inner loop is executed approximately n/n times.
So the total time complexity of the above algorithm is n + n/2 + n/3 + … + n/n, which is n * (1/1 + 1/2 + 1/3 + … + 1/n).
The important thing about the series 1/1 + 1/2 + 1/3 + … + 1/n is that it is the harmonic series, whose sum is approximately ln n, i.e. O(log n). So the time complexity of the above code is O(n·log n).
Time complexity with examples
1 - Basic operations (arithmetic, comparisons, accessing array’s elements, assignment): The running time is always constant O(1)
Example:
read(x) // O(1)
a = 10; // O(1)
a = 1,000,000,000,000,000,000 // O(1)
2 - If then else statement: Only taking the maximum running time from two or more possible statements.
Example:
age = read(x)                           // (1+1) = 2
if age < 17 then begin                  // 1
    status = "Not allowed!";            // 1
end else begin
    status = "Welcome! Please come in"; // 1
    visitors = visitors + 1;            // 1+1 = 2
end;
So, the complexity of the above pseudo code is T(n) = 2 + 1 + max(1, 1+2) = 6. Thus, its big oh is still constant T(n) = O(1).
3 - Looping (for, while, repeat): Running time for this statement is the number of loops multiplied by the number of operations inside that looping.
Example:
total = 0;                   // 1
for i = 1 to n do begin      // (1+1)*n = 2n
    total = total + i;       // (1+1)*n = 2n
end;
writeln(total);              // 1
So, its complexity is T(n) = 1+4n+1 = 4n + 2. Thus, T(n) = O(n).
4 - Nested loops (looping inside looping): Since there is at least one loop inside the main loop, the running time for such a statement is typically O(n^2) or O(n^3).
Example:
for i = 1 to n do begin       // (1+1)*n = 2n
    for j = 1 to n do begin   // (1+1)*n*n = 2n^2
        x = x + 1;            // (1+1)*n*n = 2n^2
        print(x);             // (n*n) = n^2
    end;
end;
Common running time
There are some common running times when analyzing an algorithm:
O(1) – Constant time
Constant time means the running time is constant, it’s not affected by the input size.
O(n) – Linear time
When an algorithm accepts n input size, it would perform n operations as well.
O(log n) – Logarithmic time
An algorithm that has running time O(log n) is slightly faster than one with O(n). Commonly, such an algorithm divides the problem into subproblems of the same size. Example: binary search algorithm, binary conversion algorithm.
O(n log n) – Linearithmic time
This running time is often found in "divide & conquer algorithms" which divide the problem into sub problems recursively and then merge them in n time. Example: Merge Sort algorithm.
O(n²) – Quadratic time
Look at the Bubble Sort algorithm!
O(n³) – Cubic time
It follows the same principle as O(n²).
O(2^n) – Exponential time
It gets very slow as the input grows; if n = 1,000,000, T(n) would be 2^1,000,000. The Brute Force algorithm has this running time.
O(n!) – Factorial time
The slowest!!! Example: Travelling salesman problem (TSP)
It is taken from this article. It is very well explained and you should give it a read.
When you're analyzing code, you have to analyze it line by line, counting every operation or recognizing the time complexity. In the end, you have to sum it all up to get the whole picture.
For example, you can have one simple loop with linear complexity, but later in that same program you can have a triple loop that has cubic complexity, so your program will have cubic complexity. Function order of growth comes into play right here.
Let's look at the possibilities for the time complexity of an algorithm; you can see the orders of growth I mentioned above:
Constant time has an order of growth 1, for example: a = b + c.
Logarithmic time has an order of growth log N. It usually occurs when you're dividing something in half (binary search, trees, and even loops), or multiplying something in the same way.
Linear. The order of growth is N, for example
int p = 0;
for (int i = 1; i < N; i++)
    p = p + 2;
Linearithmic. The order of growth is N·log N. It usually occurs in divide-and-conquer algorithms.
Cubic. The order of growth is N³. A classic example is a triple loop where you check all triplets:
int x = 0;
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            x = x + 2;
Exponential. The order of growth is 2^N. It usually occurs when you do exhaustive search, for example, checking subsets of some set.
Loosely speaking, time complexity is a way of summarising how the number of operations or run-time of an algorithm grows as the input size increases.
Like most things in life, a cocktail party can help us understand.
O(N)
When you arrive at the party, you have to shake everyone's hand (do an operation on every item). As the number of attendees N increases, the time/work it will take you to shake everyone's hand increases as O(N).
Why O(N) and not cN?
There's variation in the amount of time it takes to shake hands with people. You could average this out and capture it in a constant c. But the fundamental operation here --- shaking hands with everyone --- would always be proportional to O(N), no matter what c was. When debating whether we should go to a cocktail party, we're often more interested in the fact that we'll have to meet everyone than in the minute details of what those meetings look like.
O(N^2)
The host of the cocktail party wants you to play a silly game where everyone meets everyone else. Therefore, you must meet N-1 other people and, because the next person has already met you, they must meet N-2 people, and so on. The sum of this series is N²/2 - N/2. As the number of attendees grows, the N² term gets big fast, so we just drop everything else.
O(N^3)
You have to meet everyone else and, during each meeting, you must talk about everyone else in the room.
O(1)
The host wants to announce something. They ding a wineglass and speak loudly. Everyone hears them. It turns out it doesn't matter how many attendees there are, this operation always takes the same amount of time.
O(log N)
The host has laid everyone out at the table in alphabetical order. Where is Dan? You reason that he must be somewhere between Adam and Mandy (certainly not between Mandy and Zach!). Given that, is he between George and Mandy? No. He must be between Adam and Fred, and between Cindy and Fred. And so on... we can efficiently locate Dan by looking at half the set and then half of that set. Ultimately, we look at O(log_2 N) individuals.
O(N log N)
You could find where to sit down at the table using the algorithm above. If a large number of people came to the table, one at a time, and all did this, that would take O(N log N) time. This turns out to be how long it takes to sort any collection of items when they must be compared.
Best/Worst Case
You arrive at the party and need to find Inigo - how long will it take? It depends on when you arrive. If everyone is milling around you've hit the worst-case: it will take O(N) time. However, if everyone is sitting down at the table, it will take only O(log N) time. Or maybe you can leverage the host's wineglass-shouting power and it will take only O(1) time.
Assuming the host is unavailable, we can say that the Inigo-finding algorithm has a lower-bound of O(log N) and an upper-bound of O(N), depending on the state of the party when you arrive.
Space & Communication
The same ideas can be applied to understanding how algorithms use space or communication.
Knuth has written a nice paper about the former entitled "The Complexity of Songs".
Theorem 2: There exist arbitrarily long songs of complexity O(1).
PROOF (due to Casey and the Sunshine Band): Consider the songs S_k defined by (15), but with
V_k = 'That's the way,' U 'I like it,' U
U = 'uh huh,' 'uh huh'
for all k.
For the mathematically-minded people: The master theorem is another useful thing to know when studying complexity.
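For reference, a standard simplified statement of it, for recurrences of the form T(n) = a·T(n/b) + f(n) with a >= 1 and b > 1:

If f(n) = O(n^c) with c < log_b a, then T(n) = Θ(n^(log_b a)).
If f(n) = Θ(n^c) with c = log_b a, then T(n) = Θ(n^c · log n).
If f(n) = Ω(n^c) with c > log_b a (and a regularity condition holds), then T(n) = Θ(f(n)).

For example, merge sort satisfies T(n) = 2·T(n/2) + Θ(n), so a = 2, b = 2, c = 1 = log_2 2, and the second case gives T(n) = Θ(n log n).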
O(n) is big O notation used for writing the time complexity of an algorithm. When you add up the number of executions in an algorithm, you'll get an expression like 2N + 2. In this expression, N is the dominating term (the term having the largest effect on the expression if its value increases or decreases). Now O(N) is the time complexity, while N is the dominating term.
Example
for i = 1 to n
    j = 0
    while j <= n
        j = j + 1
Here the inner loop executes n+1 times for each of the n iterations of the outer loop, so the total number of executions for the whole algorithm is about n(n+1) = n² + n.
Here n² is the dominating term, so the time complexity for this algorithm is O(n²).
Other answers concentrate on the big-O-notation and practical examples. I want to answer the question by emphasizing the theoretical view. The explanation below is necessarily lacking in details; an excellent source to learn computational complexity theory is Introduction to the Theory of Computation by Michael Sipser.
Turing Machines
The most widespread model to investigate any question about computation is a Turing machine. A Turing machine has a one dimensional tape consisting of symbols which is used as a memory device. It has a tapehead which is used to write and read from the tape. It has a transition table determining the machine's behaviour, which is a fixed hardware component that is decided when the machine is created. A Turing machine works at discrete time steps doing the following:
It reads the symbol under the tapehead.
Depending on the symbol and its internal state, which can only take finitely many values, it reads three values s, σ, and X from its transition table, where s is an internal state, σ is a symbol, and X is either Right or Left.
It changes its internal state to s.
It changes the symbol it has read to σ.
It moves the tapehead one step according to the direction in X.
Turing machines are powerful models of computation. They can do everything that your digital computer can do. They were introduced before the advent of modern digital computers by the mathematician and father of theoretical computer science, Alan Turing.
Time Complexity
It is hard to define the time complexity of a single problem like "Does white have a winning strategy in chess?" because there is a machine which runs for a single step and gives the correct answer: either the machine which directly says 'No' or the one which directly says 'Yes'. To make it work, we instead define the time complexity of a family of problems L, each of which has a size, usually the length of the problem description. Then we take a Turing machine M which correctly solves every problem in that family. When M is given a problem of this family of size n, it solves it in finitely many steps. Let us call f(n) the longest possible time it takes M to solve problems of size n. Then we say that the time complexity of L is O(f(n)), which means that there is a Turing machine which will solve an instance of it of size n in at most C·f(n) time, where C is a constant independent of n.
Isn't it dependent on the machines? Can digital computers do it faster?
Yes! Some problems can be solved faster by other models of computation; for example, two-tape Turing machines solve some problems faster than those with a single tape. This is why theoreticians prefer to use robust complexity classes such as NL, P, NP, PSPACE, EXPTIME, etc. For example, P is the class of decision problems whose time complexity is O(p(n)) where p is a polynomial. The class P does not change even if you add ten thousand tapes to your Turing machine, or use other types of theoretical models such as random access machines.
A Difference in Theory and Practice
It is usually assumed that the time complexity of integer addition is O(1). This assumption makes sense in practice because computers use a fixed number of bits to store numbers for many applications. There is no reason to assume such a thing in theory, so time complexity of addition is O(k) where k is the number of bits needed to express the integer.
Finding The Time Complexity of a Class of Problems
The straightforward way to show the time complexity of a problem is O(f(n)) is to construct a Turing machine which solves it in O(f(n)) time. Creating Turing machines for complex problems is not trivial; one needs some familiarity with them. A transition table for a Turing machine is rarely given; instead it is described at a high level. It becomes easier to see how long it will take a machine to halt as one becomes familiar with them.
Showing that a problem is not O(f(n)) time complexity is another story... Even though there are some results like the time hierarchy theorem, there are many open problems here. For example whether problems in NP are in P, i.e. solvable in polynomial time, is one of the seven millennium prize problems in mathematics, whose solver will be awarded 1 million dollars.

Quicksort weird time complexity, c++

I've been testing the time complexity of different sorting algorithms for different number sequences, and it was all going well until I got quicksort's results (with the pivot in the middle) for sequences that are one half ascending and the other descending. The graph:
(By "V" I mean a sequence in which the first half is descending, and the other ascending, and by "A" I mean a sequence where the first half is ascending, and the other half is descending.)
Results for other kinds of sequences look as I would expect, but maybe there is something wrong with my algorithm?
void quicksort(int l, int p, int *tab)
{
    int i = l, j = p, x = tab[(l+p)/2], w;  // x - pivot
    do
    {
        while (tab[i] < x)
        {
            i++;
        }
        while (x < tab[j])
        {
            j--;
        }
        if (i <= j)
        {
            w = tab[i];
            tab[i] = tab[j];
            tab[j] = w;
            i++;
            j--;
        }
    }
    while (i <= j);
    if (l < j)
    {
        quicksort(l, j, tab);
    }
    if (i < p)
    {
        quicksort(i, p, tab);
    }
}
Does anybody have an idea what caused such weird results?
TL;DR: The problem is the pivot-choosing strategy, which makes repeatedly poor choices on these types of inputs (A- and V-shaped sequences). These result in quicksort making highly "imbalanced" recursive calls, which in turn result in the algorithm performing very poorly (quadratic time for A-shaped sequences).
Congratulations, you've (re)discovered an adversarial input (or rather a family of inputs) for the version of quicksort that chooses the middle element as the pivot.
For reference, an example of an A-shaped sequence is 1 2 3 4 3 2 1, i.e., a sequence that increases, reaches the peak at the middle, and then decreases; an example of a V-shaped sequence is 4 3 2 1 2 3 4, i.e., a sequence that decreases, reaches the minimum at the middle, and then increases.
Think about what happens when you pick the middle element as the pivot of an A- or a V-shaped sequence. In the first case, when you pass the algorithm the A-shaped sequence 1 2 ... n-1 n n-1 ... 2 1, the pivot is the largest element of the array (the largest element of an A-shaped sequence is the middle one, and you choose the middle element as the pivot), and you will make recursive calls on subarrays of sizes 0 (your code doesn't actually make the call on 0 elements) and n-1. In the next call on the subarray of size n-1, you will pick as the pivot the largest element of the subarray (which is the second-largest element of the original array), and so on. This results in poor performance: the running time is O(n)+O(n-1)+...+O(1) = O(n²), because in each step you essentially pass on almost the whole array (all elements except the pivot). In other words, the sizes of the subarrays in the recursive calls are highly imbalanced.
Here's the trace for the A-shaped sequence 1 2 3 4 5 4 3 2 1:
blazs#blazs:/tmp$ ./test
pivot=5
1 2 3 4 1 4 3 2 5
pivot=4
1 2 3 2 1 3 4 4
pivot=3
1 2 3 2 1 3
pivot=3
1 2 1 2 3
pivot=2
1 2 1 2
pivot=2
1 1 2
pivot=1
1 1
pivot=4
4 4
1 1 2 2 3 3 4 4 5
You can see from the trace that at each recursive call the algorithm chooses a largest element (there can be up to two largest elements, hence the article a, not the) as the pivot. This means that the running time for the A-shaped sequence really is O(n)+O(n-1)+...+O(1) = O(n²). (In the technical jargon, the A-shaped sequence is an example of an adversarial input that forces the algorithm to perform poorly.)
This means that if you plot running times for "perfectly" A-shaped sequences of the form
1 2 3 ... n-1 n n-1 ... 3 2 1
for increasing n, you will see a nice quadratic function. Here's a graph I just computed for n=5,105, 205, 305,...,9905 for A-shaped sequences 1 2 ... n-1 n n-1 ... 2 1:
In the second case, when you pass the algorithm a V-shaped sequence, you choose the smallest element of the array as the pivot, and will thus make recursive calls on subarrays of sizes n-1 and 0 (your code doesn't actually make the call on 0 elements). In the next call on the subarray of size n-1 you will pick as the pivot the largest element; and so on. (But you won't always make such terrible choices; it's hard to say anything more about this case.) This results in poor performance for similar reasons. This case is slightly more complicated (it depends on how you do the "moving" step).
Here's a graph of running times for V-shaped sequences n n-1 ... 2 1 2 ... n-1 n for n=5,105,205,...,49905. The running times are somewhat less regular---as I said it is more complicated because you don't always pick the smallest element as the pivot. The graph:
Code that I used to measure time:
double seconds(size_t n) {
int *tab = (int *)malloc(sizeof(int) * (2*n - 1));
size_t i;
// construct A-shaped sequence 1 2 3 ... n-1 n n-1 ... 3 2 1
for (i = 0; i < n-1; i++) {
tab[i] = tab[2*n-i-2] = i+1;
// To generate V-shaped sequence, use tab[i]=tab[2*n-i-2]=n-i+1;
}
tab[n-1] = n;
// For V-shaped sequence use tab[n-1] = 1;
clock_t start = clock();
quicksort(0, 2*n-2, tab);
clock_t finish = clock();
free(tab);
return (double) (finish - start) / CLOCKS_PER_SEC;
}
I adapted your code to print the "trace" of the algorithm so that you can play with it yourself and gain insight into what's going on:
#include <stdio.h>

void print(int *a, size_t l, size_t r);
void quicksort(int l, int p, int *tab);

int main() {
    int tab[] = {1,2,3,4,5,4,3,2,1};
    size_t sz = sizeof(tab) / sizeof(int);
    quicksort(0, sz-1, tab);
    print(tab, 0, sz-1);
    return 0;
}

void print(int *a, size_t l, size_t r) {
    size_t i;
    for (i = l; i <= r; ++i) {
        printf("%4d", a[i]);
    }
    printf("\n");
}

void quicksort(int l, int p, int *tab)
{
    int i = l, j = p, x = tab[(l+p)/2], w;  // x - pivot
    printf("pivot=%d\n", x);
    do
    {
        while (tab[i] < x)
        {
            i++;
        }
        while (x < tab[j])
        {
            j--;
        }
        if (i <= j)
        {
            w = tab[i];
            tab[i] = tab[j];
            tab[j] = w;
            i++;
            j--;
        }
    }
    while (i <= j);
    print(tab, l, p);
    if (l < j)
    {
        quicksort(l, j, tab);
    }
    if (i < p)
    {
        quicksort(i, p, tab);
    }
}
By the way, I think the graph showing the running times would be smoother if you took the average of, say, 100 running times for each input sequence.
We see that the problem here is the pivot-choosing strategy. Let me note that you can alleviate the problems with adversarial inputs by randomizing the pivot-choosing step. The simplest approach is to pick the pivot uniformly at random (each element is equally likely to be chosen as the pivot); you can then show that the algorithm runs in O(n log n) time with high probability. (Note, however, that to show this sharp tail bound you need some assumptions on the input; the result certainly holds if the numbers are all distinct; see, for example, Motwani and Raghavan's Randomized Algorithms book.)
To corroborate my claims, here's the graph of running times for the same sequences if you choose the pivot uniformly at random, with x = tab[l + (rand() % (p-l))]; (make sure you call srand(time(NULL)) in the main).
For A-shaped sequences:
For V-shaped sequences:
In quicksort, one of the major things that affects running time is the randomness of the input.
Generally, choosing a pivot at a fixed position may not be the best choice unless it is certain that the input is randomly shuffled. Median-of-three partitioning is one of the widely used ways to make the pivot behave more like a random pick; from your code, you didn't implement it.
Also, recursive quicksort incurs some overhead since it uses the call stack (it has to generate several function calls and pass parameters), so it is advisable that when the size of the remaining data is around 10-20 you switch to another sorting algorithm like insertion sort, as this can make it about 20% faster. A sketch (insertionSort here is a hypothetical helper):
void quicksort(int l, int p, int *tab) {
    if (p - l + 1 <= 10) {
        insertionSort(tab, l, p);
        return;
    }
    // ... partition and recurse as before ...
}
Something of this nature.
In general, the best running time for quicksort is O(n log n); the worst-case running time is O(n²), often caused by non-random or duplicate-heavy inputs.
All the answers here make very good points. My idea is that there is nothing wrong with the algorithm (since the pivot problem is well known and is a reason for O(n²)), but there may be something wrong with the way you measure it.
clock() returns the number of processor ticks elapsed from some point (probably program launch; not important).
Your way of measuring time relies on a constant tick length, which I don't think is guaranteed.
The point is that many (all?) of today's modern processors dynamically change their frequency to save energy. I think this is very non-deterministic, so every time you launch your program the CPU frequency will depend not only on the size of your input but also on what is happening in your system right now. The way I understand it, the length of one tick can vary a lot during program execution.
I tried to look up what the macro CLOCKS_PER_SEC actually does. Is it the current clocks per second? Does it average over some mysterious time period? I sadly wasn't able to find out. Therefore, I think your way of measuring time can be totally wrong.
Since my argument stands on something I don't know for sure, I might be totally wrong.
One way to find out is to run multiple tests with the same data, multiple times, under different overall system load, and see if it behaves significantly differently each time. Another way is to set your computer's CPU frequency to a static value and test it in a similar way.
IDEA: Wouldn't it be better to measure "time" in ticks?
EDIT 1: Thanks to @BeyelerStudios, we now know for certain that you shouldn't rely on clock() on Windows machines, because it doesn't follow the C standard. Source
I hope I helped. If I am wrong, please correct me; I am a student, not a hardware specialist.
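If it helps, here is a minimal sketch of a steadier measurement using std::chrono::steady_clock (monotonic, and unlike clock() on some platforms not tied to CPU tick accounting), calling the quicksort from the question and averaging over several runs on fresh copies of the data:

#include <chrono>
#include <vector>

// Assumes the question's void quicksort(int l, int p, int *tab).
double avg_sort_seconds(const std::vector<int> &input, int runs) {
    using clock = std::chrono::steady_clock;
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        std::vector<int> data = input;  // identical data every run
        auto start = clock::now();
        quicksort(0, (int)data.size() - 1, data.data());
        total += std::chrono::duration<double>(clock::now() - start).count();
    }
    return total / runs;
}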
Quicksort has a worst-case time complexity of O(n^2) and an average of O(n log n) for n entries in a data set. More details on an analysis of the time complexity can be found here:
https://www.khanacademy.org/computing/computer-science/algorithms/quick-sort/a/analysis-of-quicksort
and here:
http://www.cise.ufl.edu/class/cot3100fa07/quicksort_analysis.pdf

strangely slow quicksort for large tables

I have been doing my homework, which is to compare a bunch of sorting algorithms, and I have come across a strange phenomenon. Things have been as expected: insertion sort winning for something like a table of 20 ints, and otherwise quicksort outperforming heapsort and mergesort, up to a table of 500,000 ints (stored in memory). For 5,000,000 ints (still stored in memory) quicksort suddenly becomes worse than heapsort and mergesort. The numbers are always uniformly distributed random, and Windows virtual memory is turned off. Does anyone have an idea what could be the cause of that?
void quicksortit(T *tab, int s) {
    if (s == 0 || s == 1) return;
    T tmp;
    if (s == 2) {
        if (tab[0] > tab[1]) {
            tmp = tab[0];
            tab[0] = tab[1];
            tab[1] = tmp;
        }
        return;
    }
    T pivot = tab[s-1];
    T *f1, *f2;
    f1 = f2 = tab;
    for (int i = 0; i < s; i++)
        if (*f2 > pivot)
            f2++;
        else {
            tmp = *f1;
            *f1 = *f2;
            *f2 = tmp;
            f1++; f2++;
        }
    quicksortit(tab, (f1-1) - tab);
    quicksortit(f1, f2 - f1);
}
Your algorithm starts failing when there are many duplicates in the array. You only noticed this at large sizes because you have been feeding the algorithm random values with a large span (I'm assuming you used rand() over 0 - RAND_MAX), and that problem only appears with large arrays.
When you try to sort an array of identical numbers (try sorting 100,000 identical numbers; the program will crash), you will first walk through the entire array superfluously swapping elements. Then you split the array into two, but the large part has only been reduced by 1:
quicksortit(tab,(f1-1)-tab);   // <-- this subarray shrinks by only one element
Thus your algorithm becomes O(n²), and you also consume a very large amount of stack. Searching for a better pivot will not help you in this case; instead, choose a version of quicksort() that doesn't exhibit this flaw.
For example:
function quicksort(array)
    if length(array) > 1
        pivot := select middle, or a median of first, last and middle
        left := first index of array
        right := last index of array
        while left <= right
            while array[left] < pivot
                left := left + 1
            while array[right] > pivot
                right := right - 1
            if left <= right
                swap array[left] with array[right]
                left := left + 1
                right := right - 1
        quicksort(array from first index to right)
        quicksort(array from left to last index)
Which is a modified version of: http://rosettacode.org/wiki/Sorting_algorithms/Quicksort
It could be that your array is now bigger than the L3 cache.
The quicksort partitioning operation moves random elements from one end of the array to the other. A typical Intel L3 cache is around 8 MB. With 5M 4-byte elements, your array is 20 MB, and you're writing from one end of it to the other.
Cache misses out of L3 go to main memory and can be much slower than higher-level cache misses.
That is, up until now your entire sorting operation was running completely inside the CPU's caches.

Fast merge of sorted subsets of 4K floating-point numbers in L1/L2

What is a fast way to merge sorted subsets of an array of up to 4096 32-bit floating point numbers on a modern (SSE2+) x86 processor?
Please assume the following:
The size of the entire set is at maximum 4096 items
The size of the subsets is open to discussion, but let us assume between 16-256 initially
All data used through the merge should preferably fit into L1
The L1 data cache size is 32K. 16K has already been used for the data itself, so you have 16K to play with
All data is already in L1 (with as high degree of confidence as possible) - it has just been operated on by a sort
All data is 16-byte aligned
We want to try to minimize branching (for obvious reasons)
The main criterion of feasibility: faster than an in-L1 LSD radix sort.
I'd be very interested to see if someone knows of a reasonable way to do this given the above parameters! :)
Here's a very naive way to do it. (Please excuse any 4am delirium-induced pseudo-code bugs ;)
//4x sorted subsets
data[4][4] = {
    {3, 4, 5, INF},
    {2, 7, 8, INF},
    {1, 4, 4, INF},
    {5, 8, 9, INF}
}
data_offset[4] = {0, 0, 0, 0}

n = 4*3

for (i = 0; i < n; i++):
    sub = 0
    sub = 1 * (data[sub][data_offset[sub]] > data[1][data_offset[1]])
    sub = 2 * (data[sub][data_offset[sub]] > data[2][data_offset[2]])
    sub = 3 * (data[sub][data_offset[sub]] > data[3][data_offset[3]])
    out[i] = data[sub][data_offset[sub]]
    data_offset[sub]++
Edit:
With AVX2 and its gather support, we could compare up to 8 subsets at once.
Edit 2:
Depending on type casting, it might be possible to shave off 3 extra clock cycles per iteration on a Nehalem (mul: 5, shift+sub: 4)
//Assuming 'sub' is uint32_t
sub = ... << ((data[sub][data_offset[sub]] > data[...][data_offset[...]]) - 1)
Edit 3:
It may be possible to exploit out-of-order execution to some degree, especially as K gets larger, by using two or more max values:
max1 = 0
max2 = 1
max1 = 2 * (data[max1][data_offset[max1]] > data[2][data_offset[2]])
max2 = 3 * (data[max2][data_offset[max2]] > data[3][data_offset[3]])
...
max1 = 6 * (data[max1][data_offset[max1]] > data[6][data_offset[6]])
max2 = 7 * (data[max2][data_offset[max2]] > data[7][data_offset[7]])
q = data[max1][data_offset[max1]] < data[max2][data_offset[max2]]
sub = max1*q + ((~max2)&1)*q
Edit 4:
Depending on compiler intelligence, we can remove multiplications altogether using the ternary operator:
sub = (data[sub][data_offset[sub]] > data[x][data_offset[x]]) ? x : sub
Edit 5:
In order to avoid costly floating point comparisons, we could simply reinterpret_cast<uint32_t*>() the data, as this would result in an integer compare.
Another possibility is to utilize SSE registers as these are not typed, and explicitly use integer comparison instructions.
This works due to the operators < > == yielding the same results when interpreting a float on the binary level.
Edit 6:
If we unroll our loop sufficiently to match the number of values to the number of SSE registers, we could stage the data that is being compared.
At the end of an iteration we would then re-transfer the register which contained the selected maximum/minimum value, and shift it.
Although this requires reworking the indexing slightly, it may prove more efficient than littering the loop with LEA's.
This is more of a research topic, but I did find this paper which discusses minimizing branch mispredictions using d-way merge sort.
SIMD sorting algorithms have already been studied in detail. The paper Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture describes an efficient algorithm for doing what you describe (and much more).
The core idea is that you can reduce merging two arbitrarily long lists to merging blocks of k consecutive values (where k can range from 4 to 16): the first block is z[0] = merge(x[0], y[0]).lo. To obtain the second block, we know that the leftover merge(x[0], y[0]).hi contains nx elements from x and ny elements from y, with nx+ny == k. But z[1] cannot contain elements from both x[1] and y[1], because that would require z[1] to contain more than nx+ny elements: so we just have to find out which of x[1] and y[1] needs to be added. The one with the lower first element will necessarily appear first in z, so this is simply done by comparing their first element. And we just repeat that until there is no more data to merge.
Pseudo-code, assuming the arrays end with a +inf value:
a := *x++
b := *y++
while not finished:
lo,hi := merge(a,b)
*z++ := lo
a := hi
if *x[0] <= *y[0]:
b := *x++
else:
b := *y++
(note how similar this is to the usual scalar implementation of merging)
The conditional jump is of course not necessary in an actual implementation: for example, you could conditionally swap x and y with an xor trick, and then read unconditionally *x++.
merge itself can be implemented with a bitonic sort. But if k is low, there will be a lot of inter-instruction dependencies resulting in high latency. Depending on the number of arrays you have to merge, you can then choose k high enough so that the latency of merge is masked, or if this is possible interleave several two-way merges. See the paper for more details.
Edit: Below is a diagram when k = 4. All asymptotics assume that k is fixed.
The big gray box is merging two arrays of size n = m * k (in the picture, m = 3).
We operate on blocks of size k.
The "whole-block merge" box merges the two arrays block-by-block by comparing their first elements. This is a linear time operation, and it doesn't consume memory because we stream the data to the rest of the block. The performance doesn't really matter because the latency is going to be limited by the latency of the "merge4" blocks.
Each "merge4" box merges two blocks, outputs the lower k elements, and feeds the upper k elements to the next "merge4". Each "merge4" box performs a bounded number of operations, and the number of "merge4" is linear in n.
So the time cost of merging is linear in n. And because "merge4" has a lower latency than performing 8 serial non-SIMD comparisons, there will be a large speedup compared to non-SIMD merging.
Finally, to extend our 2-way merge to merge many arrays, we arrange the big gray boxes in classical divide-and-conquer fashion. Each level has complexity linear in the number of elements, so the total complexity is O(n log (n / n0)) with n0 the initial size of the sorted arrays and n is the size of the final array.
The most obvious answer that comes to mind is a standard N-way merge using a heap. That'll be O(N log k). The number of subsets is between 16 and 256, so the worst case behavior (with 256 subsets of 16 items each) would be 8N.
Cache behavior should be ... reasonable, although not perfect. The heap, where most of the action is, will probably remain in the cache throughout. The part of the output array being written to will also most likely be in the cache.
What you have is 16K of data (the array with sorted subsequences), the heap (1K, worst case), and the sorted output array (16K again), and you want it to fit into a 32K cache. Sounds like a problem, but perhaps it isn't. The data that will most likely be swapped out is the front of the output array after the insertion point has moved. Assuming that the sorted subsequences are fairly uniformly distributed, they should be accessed often enough to keep them in the cache.
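A minimal sketch of that heap-based N-way merge using std::priority_queue (C++17 for the structured binding); the run layout here (a vector per sorted subset) is an assumption, not taken from the question:

#include <functional>
#include <queue>
#include <utility>
#include <vector>

void kway_merge(const std::vector<std::vector<float>> &runs, std::vector<float> &out) {
    using Item = std::pair<float, size_t>;  // (value, run index); ties break on index
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;  // min-heap
    std::vector<size_t> pos(runs.size(), 0);
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});
    while (!heap.empty()) {
        auto [v, r] = heap.top();  // smallest head among all runs
        heap.pop();
        out.push_back(v);
        if (++pos[r] < runs[r].size()) heap.push({runs[r][pos[r]], r});
    }
}

Each of the N output elements costs one pop and at most one push on a heap of size at most K, giving the O(N log K) bound mentioned above.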
You can merge int arrays (expensive) branch free.
typedef unsigned uint;
typedef uint* uint_ptr;

void merge(uint *in1_begin, uint *in1_end, uint *in2_begin, uint *in2_end, uint *out){
    uint_ptr in []     = {in1_begin, in2_begin};
    uint_ptr in_end [] = {in1_end, in2_end};
    // the loop branch is cheap because it is easily predictable
    while(in[0] != in_end[0] && in[1] != in_end[1]){
        int i = (*in[1] - *in[0]) >> 31;  // 1 iff *in[1] < *in[0], so in[i] is the smaller head
        *out = *in[i];
        ++out;
        ++in[i];
    }
    // copy the remaining stuff ...
}
Note that (*in[1] - *in[0]) >> 31 is equivalent to *in[1] - *in[0] < 0, which is equivalent to *in[1] < *in[0] (assuming the values fit in 31 bits), so in[i] is always the smaller head element. The reason I wrote it down using the bitshift trick instead of
int i = *in[1] < *in[0];
is that not all compilers generate branch-free code for the < version.
Unfortunately you are using floats instead of ints, which at first seems like a showstopper, because I do not see how to reliably implement that comparison branch-free. However, on most modern architectures you can interpret the bit patterns of positive floats (that are not NaNs, INFs, or other strange things) as ints and compare them using <, and you will still get the correct result. Perhaps you can extend this observation to arbitrary floats.
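One known way to extend it, assuming IEEE-754 single-precision layout (this is my addition, not from the thread): map each float's bits to an unsigned key whose ordering matches the float ordering:

#include <cstdint>
#include <cstring>

static inline uint32_t float_key(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);          // well-defined bit copy, no aliasing UB
    // Negative floats: flip all bits (reverses their order and puts them lowest).
    // Positive floats: set the sign bit (places them above all negatives).
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}
// Now float_key(a) < float_key(b) iff a < b, for non-NaN inputs
// (-0.0 and +0.0 map to adjacent but distinct keys).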
You could do a simple merge kernel to merge K lists:
float *input[K];
float *output;
while (true) {
    float min = *input[0];
    int min_idx = 0;
    for (int i = 1; i < K; i++) {
        float v = *input[i];
        if (v < min) {
            min = v;        // do with cmov
            min_idx = i;    // do with cmov
        }
    }
    if (min == SENTINEL) break;
    *output++ = min;
    input[min_idx]++;
}
There's no heap, so it is pretty simple. The bad part is that it is O(NK), which can be bad if K is large (unlike the heap implementation which is O(N log K)). So then you just pick a maximum K (4 or 8 might be good, then you can unroll the inner loop), and do larger K by cascading merges (handle K=64 by doing 8-way merges of groups of lists, then an 8-way merge of the results).