Heap sort, Understanding the basics - heap

As a disclaimer, I am new to this site and do not know very well how to ask questions. Please don't be too harsh, because I really am just trying to understand how some of these concepts work. If my understanding is off near the beginning, please just tell me so, so I can start from there and not waste your time with the rest. Here goes nothing. Because I think my understanding might be flawed, I posed some questions about how heaps would behave in different situations, and then tried to answer them.
First, I would like help understanding how a random set of numbers added to an empty heap would look. Let's say, for example, I have 9, 4, 5, 3, 2, 7, 8, 7. After adding them to the heap, what would the heap look like? I can visually understand this (I think): the 9 being the root, 4 being the first left child, and so on and so forth. But since this is specifically a heap and not just a tree, would it reorder the numbers by swapping them (see the paragraph "If my understanding of this is correct") so that they satisfy either the min- or max-heap property?
Now let's say we removed the 9 from the heap (I believe the 9 would be the root). How would we respond to this change, and what would then be put into the root? I think that if 9 is the root, we would take the next largest number and copy it into the slot of the nine, while if this were a min heap and we were just removing a node at the bottom, it would simply be removed, no problem.
Along similar lines, what would the formula be to get the parent of a heap item in the array?
--I think I understand this. If the parent is at i, the left child would be at i*2 and the right child at i*2+1. Therefore, to find the parent, we would divide i by 2. For example, if we were at i=7, the parent would be i=3 because the 3.5 would be truncated, and if we were at i=6, the parent would also be i=3. In this example, the child at i=7 would be the right child of i=3, while i=6 would be the left child of i=3.
If my understanding of this is correct, then to reheapify after a new term has been added to the root, I would compare the child to the parent and, if the child is larger, swap them. But I would need to compare the two children (if there are two) to see which one is bigger, to decide which one to swap. This would be for a max heap and would go the other direction for a min heap.
Finally, if I were to add the root element, how would it reheapify?

After 9 is deleted, nothing becomes the root. The heapsort algorithm goes to the left child for sorting (you said 4), then the right child (or 5), etc. If the number being checked is the root (there are different implementations), then 4 becomes the root, then 5, etc. If you are confused, look at this definition of heapsort, written in JavaScript:
var heapSort = function(array) {
    // Swap two elements of the array in place
    var swap = function(array, firstIndex, secondIndex) {
        var temp = array[firstIndex];
        array[firstIndex] = array[secondIndex];
        array[secondIndex] = temp;
    };
    // Sift the element at index i down until the max-heap property holds.
    // 0-based indexing: the children of i are 2*i + 1 and 2*i + 2.
    var maxHeap = function(array, i) {
        var l = 2 * i + 1;
        var r = l + 1;
        var largest;
        if (l < array.heapSize && array[l] > array[i]) {
            largest = l;
        } else {
            largest = i;
        }
        if (r < array.heapSize && array[r] > array[largest]) {
            largest = r;
        }
        if (largest !== i) {
            swap(array, i, largest);
            maxHeap(array, largest);
        }
    };
    // Build a max heap bottom-up by sifting down every non-leaf node
    var buildHeap = function(array) {
        array.heapSize = array.length;
        for (var i = Math.floor(array.length / 2) - 1; i >= 0; i--) {
            maxHeap(array, i);
        }
    };
    buildHeap(array);
    // Repeatedly move the current maximum to the end and shrink the heap
    for (var i = array.length - 1; i >= 1; i--) {
        swap(array, 0, i);
        array.heapSize--;
        maxHeap(array, 0);
    }
    array.heapMaximum = function() {
        return this[0];
    };
    array.heapExtractMax = function() {
        if (this.heapSize < 1) {
            throw new RangeError("heap underflow");
        }
        var max = this[0];
        this[0] = this[this.heapSize - 1];
        this.heapSize--;
        maxHeap(this, 0); // restore the heap property starting from the root
        return max;
    };
    array.heapIncreaseKey = function(i, key) {
        if (key < this[i]) {
            throw new SyntaxError("new key is smaller than current key");
        }
        this[i] = key;
        // Bubble the increased key up; the parent of i is floor((i - 1) / 2)
        while (i > 0 && this[Math.floor((i - 1) / 2)] < this[i]) {
            swap(this, i, Math.floor((i - 1) / 2));
            i = Math.floor((i - 1) / 2);
        }
    };
    array.maxHeapInsert = function(key) {
        this.heapSize++;
        this[this.heapSize - 1] = -Infinity;
        this.heapIncreaseKey(this.heapSize - 1, key);
    };
};
var a = [Math.floor(Math.random() * 100), Math.floor(Math.random() * 100), Math.floor(Math.random() * 100), Math.floor(Math.random() * 100), Math.floor(Math.random() * 100)];
heapSort(a);
document.writeln(a);
I actually don't know how it would reheapify, but you can see the snippet to find out.

To begin with your first question about how the heap would look: it will take on the structure of a complete binary tree. We could just walk down the list and update the tree as we go, but that would ruin the running time, so there is a more clever way to do it. First, go through the array linearly, placing each element in the leftmost open slot, with the first entry of the array as the root. Then, once you have the array in place, fix the heap from the ground up. This involves looking at the deepest level of the heap and fixing it by swapping so that the minimum is the parent. Then move up one level of the tree and swap if either child is less than the new parent; that swap may have broken the min property below, so we must recursively move down the heap to fix it. Once we have worked our way up to the top and fixed the heap there, we will have the min heap we wanted. Note that through some nice algebra, we can show that this build runs in O(n) time.
Your second question, about removing 9, rests on a wrong premise (9 is no longer the root after heapification), so let's focus on removing the root node. When the root node is removed (from the tree, or the first entry of the array), we need to place something there to keep the tree structure, and we use the last node of the tree, which is the last element in the array. As you might be thinking, this may have ruined the min property, and you are right. So once you move that last element to the root, check it against its children: if it is smaller than both, we are done; otherwise, swap it with the smaller child and repeat with the next set of children until it is smaller than both of its children.
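To make the bottom-up build and root removal described above concrete, here is a minimal C++ sketch (the names sift_down, build_min_heap, and extract_min, and the use of 0-based indexing, are my own, not from the original posts):

#include <iostream>
#include <utility>
#include <vector>

// Minimal 0-based min-heap helpers. Children of index i are 2*i + 1 and 2*i + 2.
void sift_down(std::vector<int>& h, std::size_t i, std::size_t size) {
    while (true) {
        std::size_t left = 2 * i + 1, right = 2 * i + 2, smallest = i;
        if (left < size && h[left] < h[smallest]) smallest = left;
        if (right < size && h[right] < h[smallest]) smallest = right;
        if (smallest == i) break;        // parent already <= both children
        std::swap(h[i], h[smallest]);
        i = smallest;                    // keep fixing the subtree we swapped into
    }
}

// Bottom-up build: sift down every non-leaf node, deepest first. O(n) overall.
void build_min_heap(std::vector<int>& h) {
    for (std::size_t i = h.size() / 2; i-- > 0; )
        sift_down(h, i, h.size());
}

// Remove the root: move the last element into the root slot, then sift it down.
int extract_min(std::vector<int>& h) {
    int min = h.front();
    h.front() = h.back();
    h.pop_back();
    if (!h.empty()) sift_down(h, 0, h.size());
    return min;
}

int main() {
    std::vector<int> h = {9, 4, 5, 3, 2, 7, 8, 7};   // the numbers from the question
    build_min_heap(h);                               // array now satisfies the min-heap property
    std::cout << extract_min(h) << '\n';             // prints 2, the smallest value
}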
In an array, it is correct that the children of i are at 2i and 2i+1, so exact division by 2 alone will not work for both. Note that 2i is even and 2i+1 is odd, so the parity of the index tells you which child you are looking at. However, it is correct that truncating i/2 gives the right parent in both cases, and the discarded remainder is exactly what decides between the left and the right child.
To address your final concern: when you add something to a heap, remember that it is a complete binary tree, so the new element is added in the leftmost open slot of the bottom level, not at the root. Once you add it there (for a min heap), check whether it is smaller than its parent and, if so, keep moving it up towards the root.
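A matching sketch of that insertion step, with the parent computed by truncating division as discussed above (again a 0-based illustration of my own):

#include <iostream>
#include <utility>
#include <vector>

// Insert into a 0-based min heap: append at the end (leftmost open slot),
// then bubble the new value up. Parent of index i is (i - 1) / 2;
// with 1-based indexing it would simply be i / 2.
void heap_insert(std::vector<int>& h, int value) {
    h.push_back(value);
    std::size_t i = h.size() - 1;
    while (i > 0) {
        std::size_t parent = (i - 1) / 2;
        if (h[parent] <= h[i]) break;     // min-heap property restored
        std::swap(h[parent], h[i]);       // bubble the new value up towards the root
        i = parent;
    }
}

int main() {
    std::vector<int> input = {9, 4, 5, 3, 2, 7, 8, 7};
    std::vector<int> h;                   // an empty heap
    for (int x : input)                   // insert one element at a time
        heap_insert(h, x);
    std::cout << h.front() << '\n';       // the root is the minimum: prints 2
}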
Additionally, building your heap in O(n) is useful when you need to run Prim's algorithm or Dijkstra's shortest-path algorithm.
Hope this helps - Jason

Related

In the simplest explanation possible, how to convert an array to min and max heaps in C++?

So recently I learned about heaps, and I am really struggling to find an easy algorithm to convert an array into a max or min heap in C++. My approach is as follows (for a max heap): if you have an array of size n, use the formula
k = (n-1)/2 - 1. We start from index k and traverse backwards. From k down to index 1 (skipping index 0 so as to accommodate the left child at 2i and the right child at 2i+1), we compare each node with its children to see whether it is less than both. If this condition is true, we check which of the two children is the greater and then swap that child with the parent. It's all good up to this point, but suppose we are heapifying an array of size 7 that looks like this:
index:   0 1 2 3 4 5 6
element: 5 6 8 1 9 2
In this method, index 2 and its children 4 and 5, and index 1 and its children 2 and 3, are taken care of, but what happens to index 6?
I looked up geeksforgeeks.com and also checked YouTube and other websites but couldn't find anything helpful.
Edit: here is my approach; can you check it for errors?
void a_buildHeap(int arr[], int n)
{
    int swaps{};
    int comps{};
    for (int i = n/2; i >= 1; i--)
    {
        int lc = 2*i;
        int rc = 2*i + 1;
        if (arr[i] < arr[lc] && arr[i] < arr[rc])
        {
            comps++;
            if (lc > rc)
            {
                swap(arr[i], arr[lc]);
                swaps++;
            }
            else if (rc > lc)
            {
                swap(arr[i], arr[rc]);
                swaps++;
            }
        }
    }
    cout << "Total swaps: " << swaps << " " << endl;
    cout << "Total comaprisons: " << comps << endl;
}
You don't really need to skip index 0. Left = Index * 2 + 1 and Right = Index * 2 + 2 can access the child elements too.
I solve this problem recursively. Start at the root element and first call the same function (recursively) on the left and right child elements if they exist (check for out of bounds here).
Now check which of the 3 elements is the largest/smallest (again you need to check if they exist first). Root, left or right. If Root is largest/smallest don't do anything. If it is left or right then swap the elements.
Finally if you did a swap it is important to call the recursive function on the swapped child position again.
Now you should end up with the solution.
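Here is a rough C++ sketch of the recursive approach just described, for a max heap with 0-based indexing (the function name heapify and the sample array are mine):

#include <algorithm>
#include <iostream>

// Recursive max-heapify: fix the children's subtrees first, then pick the
// largest of root/left/right, swap if needed, and re-heapify the swapped child.
// 0-based indexing: left = i*2 + 1, right = i*2 + 2, so index 0 is not skipped.
void heapify(int arr[], int n, int i) {
    if (i >= n) return;                         // out of bounds: nothing to do
    heapify(arr, n, 2 * i + 1);                 // fix the left subtree (if it exists)
    heapify(arr, n, 2 * i + 2);                 // fix the right subtree (if it exists)
    int largest = i;
    if (2 * i + 1 < n && arr[2 * i + 1] > arr[largest]) largest = 2 * i + 1;
    if (2 * i + 2 < n && arr[2 * i + 2] > arr[largest]) largest = 2 * i + 2;
    if (largest != i) {
        std::swap(arr[i], arr[largest]);
        heapify(arr, n, largest);               // the swap may have broken the subtree below
    }
}

int main() {
    int arr[] = {5, 6, 8, 1, 9, 2, 7};
    heapify(arr, 7, 0);                         // start at the root
    for (int x : arr) std::cout << x << ' ';    // arr[0] is now the maximum
    std::cout << '\n';
}

This redoes a little work after a swap (it re-heapifies the whole subtree), but it is a simple way to see the recursion the answer describes.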
Edit:
for (int i = n/2; i >= 1; i--)
This for loop doesn't work in all cases. In some cases you will either miss a potential swap or go out of bounds, so you still need to check for that. Also, simply traversing the tree once will not be enough to heapify it correctly.
if (arr[i] < arr[lc] && arr[i] < arr[rc])
This if statement is wrong. You check whether both the left and the right child are larger, when actually only one of them needs to be larger.
Next you check if the left or right child is larger. What will you do if they are both the same size?
Finally, your approach of traversing backwards will only work in certain cases and not in all of them. You should try using a debugger, or just get a pen and paper and visualize what happens when you run your code.

Finding an element in a matrix with the minimum cost, starting from one point

I have an n*n matrix and I want to find the element with the minimum cost, where the cost of a node is cost = ManhattanDistance(startingNode, node) + costOf(node), and the starting node is the node from whose perspective I am searching.
I did it with just 4 for loops and it works, but I want to optimize it, so I did something like a BFS: I used a queue, added the starting node first, and then in a while loop I popped a node from the queue and added to the queue all the elements around that node at Manhattan distance 1. I keep doing this while the distance of the node I just popped, plus the minimum price in the whole matrix (which I know from the start), is less than the minimum cost I have found so far (I compare the price of the node I just popped with the current minimum); if it is bigger, I stop searching, because the minimum node I found already has the lowest possible value. The problem is that this algorithm is too slow, possibly because I use a std::queue, and it takes more time than the 4-for-loop version. (I also used a flags matrix to check whether the element I am inspecting has already been added to the queue.) The most time-consuming block of code is the part where I expand the node, and I don't know why; I just check whether the element is valid, i.e. its coordinates are at least 0 and less than n, and if so I add it to the queue.
I want to know how I can improve this, or whether there is another way to do it. I hope I was explicit enough.
This is the part of the code that takes too long:
if ((p1.dist + 1 + Pmin) < pretmincomp || (p1.dist + 1 + Pmin) < pretres) {
    std::vector<PAIR> vec;
    PAIR pa;
    int p1a = p1.a, p1b = p1.b, p1d = p1.dist;
    if (isok(p1a + 1, p1b, n)) {
        pa.a = p1a + 1;
        pa.b = p1b;
        pa.dist = p1d + 1;
        vec.push_back(pa);
    }
    if (isok(p1a - 1, p1b, n)) {
        pa.a = p1a - 1;
        pa.b = p1b;
        pa.dist = p1d + 1;
        vec.push_back(pa);
    }
    if (isok(p1a, p1b + 1, n)) {
        pa.a = p1a;
        pa.b = p1b + 1;
        pa.dist = p1d + 1;
        vec.push_back(pa);
    }
    if (isok(p1a, p1b - 1, n)) {
        pa.a = p1.a;
        pa.b = p1.b - 1;
        pa.dist = p1d + 1;
        vec.push_back(pa);
    }
    for (std::vector<PAIR>::iterator it = vec.begin(); it != vec.end(); it++) {
        if (flags[(*it).a][(*it).b] != 1) {
            devazut.push(*it);
            flags[(*it).a][(*it).b] = 1;
        }
    }
You are dealing with a shortest-path problem, which can be solved efficiently with BFS (if the graph is unweighted) or with the A* algorithm, if you have some "knowledge" of the graph and can estimate how much it will "cost" to reach a target from each node.
Your solution is very similar to BFS with one difference - BFS also maintains a visited set of all the nodes you have already visited. The idea of this visited set is that you don't need to revisit a node that was already visited, because any path through it will be no shorter than the shortest path found during the first visit of this node.
Note that without the visited set - each node is revisited a lot of times, which makes the algorithm very inefficient.
Pseudo code for BFS (with visited set):
BFS(start):
    q <- new queue
    q.push(pair(start, 0))                 // 0 indicates the distance from the start
    visited <- new set
    visited.add(start)
    while (not q.isEmpty()):
        curr <- q.pop()
        if (curr.first is target):
            return curr.second             // the distance is stored in the second element
        for each neighbor of curr.first:
            if (not visited.contains(neighbor)):   // add the element only if it is not in the set
                q.push(pair(neighbor, curr.second + 1))  // add the new element to the queue
                // and also add it to the visited set, so it won't be re-added to the queue
                visited.add(neighbor)
    // when here - no solution was found
    return infinity                        // exhausted all vertices and there is no path to a target
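Applied to the grid in this question, the same BFS-with-visited-set idea might look roughly like this in C++ (only a sketch: the grid, the function name min_cost_from, and the way I expressed the poster's early-stop condition are my own):

#include <algorithm>
#include <climits>
#include <iostream>
#include <queue>
#include <vector>

// BFS over an n*n grid with a visited matrix. cost(cell) = Manhattan distance
// from the start + value(cell). 'grid' holds the per-cell values.
int min_cost_from(const std::vector<std::vector<int>>& grid, int sr, int sc) {
    const int n = static_cast<int>(grid.size());
    int global_min = INT_MAX;                        // smallest cell value anywhere
    for (const auto& row : grid)
        global_min = std::min(global_min, *std::min_element(row.begin(), row.end()));

    std::vector<std::vector<char>> visited(n, std::vector<char>(n, 0));
    std::queue<std::pair<std::pair<int, int>, int>> q;   // ((row, col), distance)
    q.push({{sr, sc}, 0});
    visited[sr][sc] = 1;

    int best = INT_MAX;
    const int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};
    while (!q.empty()) {
        auto [cell, dist] = q.front(); q.pop();
        auto [r, c] = cell;
        best = std::min(best, dist + grid[r][c]);
        // BFS visits cells in order of increasing distance, so once even the
        // globally smallest value cannot beat 'best', stop expanding further.
        if (dist + 1 + global_min >= best) continue;
        for (int k = 0; k < 4; ++k) {
            int nr = r + dr[k], nc = c + dc[k];
            if (nr >= 0 && nr < n && nc >= 0 && nc < n && !visited[nr][nc]) {
                visited[nr][nc] = 1;                 // mark when enqueued, never re-add
                q.push({{nr, nc}, dist + 1});
            }
        }
    }
    return best;
}

int main() {
    std::vector<std::vector<int>> grid = {{4, 1, 9}, {7, 3, 2}, {8, 6, 5}};
    std::cout << min_cost_from(grid, 0, 0) << '\n';  // cell (0,1): distance 1 + value 1 = 2
}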

How to implement a minimum heap sort to find the kth smallest element?

I've been implementing selection sort problems for class and one of the assignments is to find the kth smallest element in the array using a minimum heap. I know the procedure is:
heapify the array
delete the minimum (root) k times
return kth smallest element in the group
I don't have any problems creating a minimum heap. I'm just not sure how to go about properly deleting the minimum k times and successfully return the kth smallest element in the group. Here's what I have so far:
bool Example::min_heap_select(long k, long & kth_smallest) const {
    //duplicate test group (thanks, const!)
    Example test = Example(*this);
    //variable declaration and initialization
    int n = test._total;
    int i;
    //Heapifying stage (THIS WORKS CORRECTLY)
    for (i = n/2; i >= 0; i--) {
        //allows for heap construction
        test.percolate_down_protected(i, n);
    }//for
    //Delete min phase (THIS DOESN'T WORK)
    for (i = n-1; i >= (n-k+1); i--) {
        //deletes the min by swapping elements
        int tmp = test._group[0];
        test._group[0] = test._group[i];
        test._group[i] = tmp;
        //resumes perc down
        test.percolate_down_protected(0, i);
    }//for
    //IDK WHAT TO RETURN
    kth_smallest = test._group[0];
    return true;
}
void Example::percolate_down_protected(long i, long n) {
    //variable declaration and initialization:
    int currPos, child, r_child, tmp;
    currPos = i;
    tmp = _group[i];
    child = left_child(i);
    //set a sentinel and begin loop (no recursion allowed)
    while (child < n) {
        //calculates the right child's position
        r_child = child + 1;
        //pick the smaller of the two children (this is a min heap)
        if ((r_child < n) && (_group[r_child] <= _group[child])) {
            child = r_child;
        }
        //find the correct spot
        if (tmp <= _group[child]) {
            break;
        }
        //make sure the smaller child is beneath the parent
        _group[currPos] = _group[child];
        //shift the tree down
        currPos = child;
        child = left_child(currPos);
    }
    //put tmp where it belongs
    _group[currPos] = tmp;
}
As I stated before, the minimum heap part works correctly. I understand what I want to do - it seems easy to delete the root k times, but then what index in the array do I return... 0? This almost works - it doesn't work with k = n or k = 1. Would the kth smallest element be in the root? Any help would be much appreciated!
The only array index which is meaningful to the user is zero, which is the minimum element. So, after removing k elements, the k'th smallest element will be at zero.
Probably you should destroy the heap and return the value rather than asking the user to concern themself with the heap itself… but I don't know the details of the assignment.
Note that the C++ Standard Library has algorithms to help with this: make_heap, pop_heap, and nth_element.
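For instance, a small sketch of both standard-library routes to the kth smallest (the function names and sample data here are mine, not part of the assignment):

#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

// Two standard-library ways to get the kth smallest element (k is 1-based here).
long kth_smallest_nth_element(std::vector<long> v, long k) {
    std::nth_element(v.begin(), v.begin() + (k - 1), v.end());
    return v[k - 1];                       // the element that would sit at position k-1 when sorted
}

long kth_smallest_heap(std::vector<long> v, long k) {
    // Build a min-heap by using greater<> as the comparator, then pop k-1 times.
    std::make_heap(v.begin(), v.end(), std::greater<long>());
    for (long i = 0; i < k - 1; ++i) {
        std::pop_heap(v.begin(), v.end(), std::greater<long>());  // moves the min to the back
        v.pop_back();
    }
    return v.front();                      // the root of the remaining min-heap
}

int main() {
    std::vector<long> data = {7, 2, 9, 4, 1, 8, 3};
    std::cout << kth_smallest_nth_element(data, 3) << '\n';   // prints 3
    std::cout << kth_smallest_heap(data, 3) << '\n';          // prints 3
}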
I am not providing a detailed answer, just explaining the key points of getting the k smallest elements from a min-heap-ordered tree. The approach uses skip lists.
First, form a skip list of tree nodes containing just one element: the node corresponding to the root of the heap. The 1st minimum element is just the value stored at this node.
Now delete this node and insert its child nodes at the right positions so as to maintain the order of values. This step takes O(log k) time.
The second minimum value is then just the value at the first node of this skip list.
Repeat the above steps until you have all k minimum elements. The overall time is log(2) + log(3) + ... + log(k) = O(k log k). Forming the heap takes O(n) time, so the overall time complexity is O(n + k log k).
There is another approach that does not build a heap at all: Quickselect, which has an average time complexity of O(n) but a worst case of O(n^2).
The striking difference between the two approaches is that the first gives all k elements from the minimum up to the kth minimum, while quickselect gives only the kth minimum element.
Memory-wise, the former approach uses O(n) extra space, while quickselect uses O(1).
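Here is a rough C++ sketch of that lazy-extraction idea, with the skip list swapped for a std::priority_queue acting as the ordered frontier; the heap is assumed to already be a valid 0-based array min-heap, and all names are mine:

#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

// Pull the k smallest values out of an array min-heap 'h' (children of i are
// 2i+1 and 2i+2) in O(k log k), by keeping a frontier of candidate nodes.
std::vector<int> k_smallest(const std::vector<int>& h, std::size_t k) {
    using Node = std::pair<int, std::size_t>;                 // (value, index in h)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> frontier;
    std::vector<int> out;
    if (h.empty() || k == 0) return out;
    frontier.push({h[0], 0});                                 // start with the root
    while (out.size() < k && !frontier.empty()) {
        auto [value, i] = frontier.top();
        frontier.pop();
        out.push_back(value);                                 // next smallest overall
        if (2 * i + 1 < h.size()) frontier.push({h[2 * i + 1], 2 * i + 1});
        if (2 * i + 2 < h.size()) frontier.push({h[2 * i + 2], 2 * i + 2});
    }
    return out;
}

int main() {
    std::vector<int> h = {1, 3, 2, 7, 4, 5, 9};               // already a valid min-heap
    for (int x : k_smallest(h, 3)) std::cout << x << ' ';     // prints 1 2 3
    std::cout << '\n';
}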

How Recursion Works Inside a For Loop

I am new to recursion and trying to understand this code snippet. I'm studying for an exam, and this is a "reviewer" I found from Stanford's CS Education Library (from Binary Trees by Nick Parlante).
I understand the concept, but when we're recursing INSIDE THE LOOP, it all blows up! Please help me. Thank you.
countTrees() Solution (C/C++)
/*
 For the key values 1...numKeys, how many structurally unique
 binary search trees are possible that store those keys.
 Strategy: consider that each value could be the root.
 Recursively find the size of the left and right subtrees.
*/
int countTrees(int numKeys) {
    if (numKeys <= 1) {
        return(1);
    }
    // there will be one value at the root, with whatever remains
    // on the left and right each forming their own subtrees.
    // Iterate through all the values that could be the root...
    int sum = 0;
    int left, right, root;
    for (root = 1; root <= numKeys; root++) {
        left = countTrees(root - 1);
        right = countTrees(numKeys - root);
        // number of possible trees with this root == left*right
        sum += left * right;
    }
    return(sum);
}
Imagine the loop being put "on pause" while you go in to the function call.
Just because the function happens to be a recursive call, it works the same as any function you call within a loop.
The new recursive call starts its for loop and again, pauses while calling the functions again, and so on.
For recursion, it's helpful to picture the call-stack structure in your mind.
If a recursive call sits inside a loop, the structure resembles (almost) an N-ary tree.
The loop controls horizontally how many branches are generated, while the recursion decides the height of the tree.
The tree is generated along one specific branch until it reaches a leaf (the base condition), then it expands horizontally to obtain the other leaves, returns to the previous level, and repeats.
I find this perspective generally a good way of thinking about it.
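One way to make that call tree visible is to instrument the question's countTrees() with a depth parameter and print each call indented by its depth (a small illustrative variation, not part of the original code):

#include <iostream>
#include <string>

// countTrees() from the question, printing each call indented by its depth so
// the call tree (loop = branching, recursion = depth) becomes visible.
int countTrees(int numKeys, int depth = 0) {
    std::cout << std::string(2 * depth, ' ') << "countTrees(" << numKeys << ")\n";
    if (numKeys <= 1)
        return 1;
    int sum = 0;
    for (int root = 1; root <= numKeys; root++) {
        int left = countTrees(root - 1, depth + 1);         // the loop pauses here...
        int right = countTrees(numKeys - root, depth + 1);  // ...and here, then resumes
        sum += left * right;
    }
    return sum;
}

int main() {
    std::cout << countTrees(3) << '\n';   // prints the indented call tree, then the answer 5
}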
Look at it this way: there are 3 possible cases for the initial call:
numKeys = 0
numKeys = 1
numKeys > 1
The 0 and 1 cases are simple - the function simply returns 1 and you're done. For numKeys = 2, you end up with:
sum = 0
loop(root = 1 -> 2):
    root = 1:
        left  = countTrees(1 - 1) -> countTrees(0) -> 1
        right = countTrees(2 - 1) -> countTrees(1) -> 1
        sum = sum + 1*1 = 0 + 1 = 1
    root = 2:
        left  = countTrees(2 - 1) -> countTrees(1) -> 1
        right = countTrees(2 - 2) -> countTrees(0) -> 1
        sum = sum + 1*1 = 1 + 1 = 2
output: 2
for numKeys = 3:
sum = 0
loop(root = 1 -> 3):
    root = 1:
        left  = countTrees(1 - 1) -> countTrees(0) -> 1
        right = countTrees(3 - 1) -> countTrees(2) -> 2
        sum = sum + 1*2 = 0 + 2 = 2
    root = 2:
        left  = countTrees(2 - 1) -> countTrees(1) -> 1
        right = countTrees(3 - 2) -> countTrees(1) -> 1
        sum = sum + 1*1 = 2 + 1 = 3
    root = 3:
        left  = countTrees(3 - 1) -> countTrees(2) -> 2
        right = countTrees(3 - 3) -> countTrees(0) -> 1
        sum = sum + 2*1 = 3 + 2 = 5
output: 5
and so on. Note that each call with n keys makes 2n further recursive calls (two per loop iteration), so the number of calls, and hence the runtime, grows very quickly as numKeys increases.
Just remember that all the local variables, such as numKeys, sum, left, right, and root, live in stack memory. When you are at the n-th depth of the recursive function, there are n copies of these local variables. When one depth finishes executing, one copy of these variables is popped off the stack.
In this way, you will understand that the next-level depth does NOT affect the local variables of the current-level depth (UNLESS you are using references, but we are NOT in this particular problem).
For this particular problem, careful attention should be paid to time complexity. Here are my solutions:
/* Q: For the key values 1...n, how many structurally unique binary search
      trees (BST) are possible that store those keys.
      Strategy: consider that each value could be the root. Recursively
      find the size of the left and right subtrees.
      http://stackoverflow.com/questions/4795527/how-recursion-works-inside-a-for-loop */
/* A: It seems that it's the Catalan numbers:
      http://en.wikipedia.org/wiki/Catalan_number */
#include <iostream>
#include <vector>
using namespace std;

// Time Complexity: ~O(2^n)
int CountBST(int n)
{
    if (n <= 1)
        return 1;
    int c = 0;
    for (int i = 0; i < n; ++i)
    {
        int lc = CountBST(i);
        int rc = CountBST(n-1-i);
        c += lc*rc;
    }
    return c;
}

// Time Complexity: O(n^2)
int CountBST_DP(int n)
{
    vector<int> v(n+1, 0);
    v[0] = 1;
    for (int k = 1; k <= n; ++k)
    {
        for (int i = 0; i < k; ++i)
            v[k] += v[i]*v[k-1-i];
    }
    return v[n];
}

/* Catalan numbers:
   f(n)   = C(2n, n) / (n + 1)
   f(n+1) = f(n) * 2*(2n + 1) / (n + 2)
   Time Complexity: O(n)
   Space Complexity: O(n) - but can be easily reduced to O(1). */
int CountBST_Math(int n)
{
    vector<int> v(n+1, 0);
    v[0] = 1;
    for (int k = 0; k < n; ++k)
        v[k+1] = v[k]*2*(2*k+1)/(k+2);
    return v[n];
}

int main()
{
    for (int n = 1; n <= 10; ++n)
        cout << CountBST(n) << '\t' << CountBST_DP(n) <<
            '\t' << CountBST_Math(n) << endl;
    return 0;
}
/* Output:
   1       1       1
   2       2       2
   5       5       5
   14      14      14
   42      42      42
   132     132     132
   429     429     429
   1430    1430    1430
   4862    4862    4862
   16796   16796   16796
*/
You can think of it from the base case, working upward.
So, for base case you have 1 (or less) nodes. There is only 1 structurally unique tree that is possible with 1 node -- that is the node itself. So, if numKeys is less than or equals to 1, just return 1.
Now suppose you have more than 1 key. Well, then one of those keys is the root, some items are in the left branch and some items are in the right branch.
How big are those left and right branches? Well it depends on what is the root element. Since you need to consider the total amount of possible trees, we have to consider all configurations (all possible root values) -- so we iterate over all possible values.
For each iteration i, we know that i is at the root, i - 1 nodes are on the left branch and numKeys - i nodes are on the right branch. But, of course, we already have a function that counts the total number of tree configurations given the number of nodes! It's the function we're writing. So, recursively call the function to get the number of possible tree configurations of the left and right subtrees. The total number of trees possible with i at the root is then the product of those two numbers (for each configuration of the left subtree, all possible right subtrees can happen).
After you sum it all up, you're done.
So, if you kind of lay it out there's nothing special with calling the function recursively from within a loop -- it's just a tool that we need for our algorithm. I would also recommend (as Grammin did) to run this through a debugger and see what is going on at each step.
Each call has its own variable space, as one would expect. The complexity comes from the fact that the execution of the function is "interrupted" in order to execute -again- the same function.
This code:
for (root=1; root<=numKeys; root++) {
    left = countTrees(root - 1);
    right = countTrees(numKeys - root);
    // number of possible trees with this root == left*right
    sum += left*right;
}
Could be rewritten this way in Plain C:
root = 1;
Loop:
if ( !( root <= numkeys ) ) {
    goto EndLoop;
}
left = countTrees( root - 1 );
right = countTrees( numkeys - root );
sum += left * right;
++root;
goto Loop;
EndLoop:
// more things...
It is actually translated by the compiler to something like that, but in assembler. As you can see, the loop is controlled by a pair of variables, numkeys and root, and their values are not modified by the execution of another instance of the same procedure. When the callee returns, the caller resumes execution with the same values for all the variables it had before the recursive call.
IMO, the key element here is to understand function call frames, the call stack, and how they work together.
In your example, you have a bunch of local variables which are initialised but not finalised in the first call. It's important to observe those local variables to understand the whole idea. At each call, the local variables are updated and finally returned in a backwards manner (the return value is most likely stored in a register before each function call frame is popped off the stack), until it is added to the initial call's sum variable.
The important distinction here is where to return. If you need an accumulated sum value, as in your example, you cannot return from inside the loop, which would cause an early return/exit. However, if you depend on reaching a certain state, then you can check for that state inside the for loop and return immediately without going all the way up.

Fastest way to obtain the largest X numbers from a very large unsorted list?

I'm trying to obtain the top, say, 100 scores from a list of scores being generated by my program. Unfortunately the list is huge (on the order of millions to billions), so sorting is a time-intensive portion of the program.
What's the best way of doing the sorting to get the top 100 scores?
The only two methods I can think of so far are either generating all the scores into a massive array and then sorting it and taking the top 100, or generating X scores at a time, sorting them, truncating to the top 100, then continuing to generate more scores, adding them to the truncated list, and sorting again.
Either way I do it, it still takes more time than I would like. Any ideas on how to do it in an even more efficient way? (I've never taken programming courses before; maybe those of you with comp sci degrees know about efficient algorithms to do this, at least that's what I'm hoping.)
Lastly, what's the sorting algorithm used by the standard sort() function in C++?
Thanks,
-Faken
Edit: Just for anyone who is curious...
I did a few time trials on the before and after and here are the results:
Old program (performs sorting after each outer loop iteration):
top 100 scores: 147 seconds
top 10 scores: 147 seconds
top 1 scores: 146 seconds
Sorting disabled: 55 seconds
New program (tracking only the top scores and using the default sorting function):
top 100 scores: 350 seconds <-- hmm...worse than before
top 10 scores: 103 seconds
top 1 scores: 69 seconds
Sorting disabled: 51 seconds
New rewrite (optimized data storage, hand-written sorting algorithm):
top 100 scores: 71 seconds <-- Very nice!
top 10 scores: 52 seconds
top 1 scores: 51 seconds
Sorting disabled: 50 seconds
Done on a Core 2, 1.6 GHz... I can't wait till my Core i7 860 arrives...
There are a lot of other, even more aggressive optimizations for me to work out (mainly in the area of reducing the number of iterations I run), but as it stands right now the speed is more than good enough; I might not even bother to work out those algorithmic optimizations.
Thanks to everyone for their input!
Take the first 100 scores and sort them in an array.
Take the next score and insertion-sort it into the array (starting at the "small" end).
Drop the 101st value.
Continue with the next value, as in step 2, until done.
Over time, the list will resemble the 100 largest values more and more, so more and more often you will find that the insertion sort aborts immediately, because the new value is smaller than the smallest of the current top-100 candidates (see the sketch below).
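A small C++ sketch of those steps, keeping a sorted vector of at most 100 candidates and rejecting most new scores after a single comparison; the names offer and kTop, the use of std::upper_bound to find the insertion point, and the fake score generator are mine:

#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

const std::size_t kTop = 100;

// Keep 'top' sorted ascending; top.front() is the smallest candidate.
void offer(std::vector<int>& top, int score) {
    if (top.size() == kTop && score <= top.front())
        return;                                      // usual case: rejected straight away
    top.insert(std::upper_bound(top.begin(), top.end(), score), score);
    if (top.size() > kTop)
        top.erase(top.begin());                      // drop the 101st (smallest) value
}

int main() {
    std::mt19937 rng(42);                            // stand-in for the program's score generator
    std::vector<int> top;
    for (int i = 0; i < 1000000; ++i)
        offer(top, static_cast<int>(rng() % 1000000));
    std::cout << "smallest kept: " << top.front()
              << ", largest kept: " << top.back() << '\n';
}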
You can do this in O(n) time, without any sorting, using a heap:
#!/usr/bin/python
import heapq

def top_n(l, n):
    top_n = []
    smallest = None
    for elem in l:
        if len(top_n) < n:
            top_n.append(elem)
            if len(top_n) == n:
                heapq.heapify(top_n)
                smallest = heapq.nsmallest(1, top_n)[0]
        else:
            if elem > smallest:
                heapq.heapreplace(top_n, elem)
                smallest = heapq.nsmallest(1, top_n)[0]
    return sorted(top_n)

def random_ints(n):
    import random
    for i in range(0, n):
        yield random.randint(0, 10000)

print top_n(random_ints(1000000), 100)
Times on my machine (Core2 Q6600, Linux, Python 2.6, measured with bash time builtin):
100000 elements: .29 seconds
1000000 elements: 2.8 seconds
10000000 elements: 25.2 seconds
Edit/addition: In C++, you can use std::priority_queue in much the same way as Python's heapq module is used here. You'll want to use the std::greater ordering instead of the default std::less, so that the top() member function returns the smallest element instead of the largest one. C++'s priority queue doesn't have the equivalent of heapreplace, which replaces the top element with a new one, so instead you'll want to pop the top (smallest) element and then push the newly seen value. Other than that the algorithm translates quite cleanly from Python to C++.
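A rough sketch of that translation, under the assumption that the scores fit in a std::vector<int>; the function name and the random test data are mine:

#include <functional>
#include <iostream>
#include <queue>
#include <random>
#include <vector>

// Keep a min-heap of the current top-n scores; replace its smallest element
// whenever a bigger score comes along (pop + push, since there is no heapreplace).
std::vector<int> top_n(const std::vector<int>& scores, std::size_t n) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;  // top() = smallest
    for (int s : scores) {
        if (heap.size() < n) {
            heap.push(s);
        } else if (s > heap.top()) {
            heap.pop();
            heap.push(s);
        }
    }
    std::vector<int> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;               // ascending order: smallest of the top-n first
}

int main() {
    std::mt19937 rng(1);
    std::vector<int> scores(1000000);
    for (int& s : scores) s = static_cast<int>(rng() % 10000);
    std::vector<int> best = top_n(scores, 100);
    std::cout << best.front() << " ... " << best.back() << '\n';
}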
Here's the 'natural' C++ way to do this:
std::vector<Score> v;
// fill in v
std::partial_sort(v.begin(), v.begin() + 100, v.end(), std::greater<Score>());
std::sort(v.begin(), v.begin() + 100);
This is linear in the number of scores.
The algorithm used by std::sort isn't specified by the standard, but libstdc++ (used by g++) uses an "adaptive introsort", which is essentially a median-of-3 quicksort down to a certain level, followed by an insertion sort.
Declare an array where you can put the 100 best scores. Loop through the huge list and check for each item if it qualifies to be inserted in the top 100. Use a simple insert sort to add an item to the top list.
Something like this (C# code, but you get the idea):
Score[] toplist = new Score[100];
int size = 0;
foreach (Score score in hugeList) {
    int pos = size;
    while (pos > 0 && toplist[pos - 1] < score) {
        pos--;
        if (pos < 99) toplist[pos + 1] = toplist[pos];
    }
    if (size < 100) size++;
    if (pos < size) toplist[pos] = score;
}
I tested it on my computer (Core 2 Duo 2.54 GHz, Win 7 x64) and I can process 100,000,000 items in 369 ms.
Since speed is of the essence here, and 40,000 possible highscore values is entirely manageable for any of today's computers, I'd resort to bucket sort for simplicity. My guess is that it would outperform any of the algorithms proposed thus far. The downside is that you'd have to determine some upper limit for the highscore values.
So, let's assume your maximum highscore value is 40,000:
Make an array of 40,000 entries. Loop through your highscore values; each time you encounter highscore x, increase array[x] by one. After this, all you have to do is count down through the top entries of your array until you have counted 100 highscores.
You can do it in Haskell like this:
largest100 xs = take 100 $ sortBy (flip compare) xs
This looks like it sorts all the numbers into descending order (the "flip compare" bit reverses the arguments to the standard comparison function) and then returns the first 100 entries from the list. But Haskell is lazily evaluated, so the sortBy function does just enough sorting to find the first 100 numbers in the list, and then stops.
Purists will note that you could also write the function as
largest100 = take 100 . sortBy (flip compare)
This means just the same thing, but illustrates the Haskell style of composing a new function out of the building blocks of other functions rather than handing variables around the place.
You want the absolute largest X numbers, so I'm guessing you don't want some sort of heuristic. How unsorted is the list? If it's pretty random, your best bet really is just to do a quick sort on the whole list and grab the top X results.
If you can filter scores during the list generation, that's way way better. Only ever store X values, and every time you get a new value, compare it to those X values. If it's less than all of them, throw it out. If it's bigger than one of them, throw out the new smallest value.
If X is small enough you can even keep your list of X values sorted so that you are comparing your new number to a sorted list of values, you can make an O(1) check to see if the new value is smaller than all of the rest and thus throw it out. Otherwise, a quick binary search can find where the new value goes in the list and then you can throw away the first value of the array (assuming the first element is the smallest element).
Place the data into a balanced Tree structure (probably Red-Black tree) that does the sorting in place. Insertions should be O(lg n). Grabbing the highest x scores should be O(lg n) as well.
You can prune the tree every once in awhile if you find you need optimizations at some point.
If you only need to report the value of top 100 scores (and not any associated data), and if you know that the scores will all be in a finite range such as [0,100], then an easy way to do it is with "counting sort"...
Basically, create an array representing all possible values (e.g. an array of size 101 if scores can range from 0 to 100 inclusive), and initialize all the elements of the array with a value of 0. Then, iterate through the list of scores, incrementing the corresponding entry in the list of achieved scores. That is, compile the number of times each score in the range has been achieved. Then, working from the end of the array to the beginning of the array, you can pick out the top X score. Here is some pseudo-code:
let type Score be an integer ranging from 0 to 100, inclusive.
let scores be an array of Score objects
let scorerange be an array of integers of size 101.

for i in [0,100]
    set scorerange[i] = 0
for each score in scores
    set scorerange[score] = scorerange[score] + 1

let top be the number of top scores to report
let idx be an integer initialized to the end of scorerange (i.e. 100)
while (top > 0) and (idx >= 0):
    if scorerange[idx] > 0:
        report "There are " scorerange[idx] " scores with value " idx
        top = top - scorerange[idx]
    idx = idx - 1
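The same pseudo-code rendered as a small C++ sketch, assuming scores in the range [0, 100] as stated above (the function name and sample data are mine):

#include <iostream>
#include <vector>

// Count how often each score in [0, 100] occurs, then report buckets from the
// top down until 'top' scores have been covered.
void report_top_scores(const std::vector<int>& scores, int top) {
    std::vector<int> scorerange(101, 0);
    for (int score : scores)
        ++scorerange[score];
    for (int idx = 100; top > 0 && idx >= 0; --idx) {
        if (scorerange[idx] > 0) {
            std::cout << "There are " << scorerange[idx]
                      << " scores with value " << idx << '\n';
            top -= scorerange[idx];
        }
    }
}

int main() {
    std::vector<int> scores = {55, 100, 92, 100, 17, 92, 92, 3};
    report_top_scores(scores, 4);   // reports the buckets covering the top 4 scores
}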
I answered this question in response to an interview question in 2008. I implemented a templatized priority queue in C#.
using System;
using System.Collections.Generic;
using System.Text;

namespace CompanyTest
{
    // Based on pre-generics C# implementation at
    // http://www.boyet.com/Articles/WritingapriorityqueueinC.html
    // and wikipedia article
    // http://en.wikipedia.org/wiki/Binary_heap
    class PriorityQueue<T>
    {
        struct Pair
        {
            T val;
            int priority;
            public Pair(T v, int p)
            {
                this.val = v;
                this.priority = p;
            }
            public T Val { get { return this.val; } }
            public int Priority { get { return this.priority; } }
        }

        #region Private members
        private System.Collections.Generic.List<Pair> array = new System.Collections.Generic.List<Pair>();
        #endregion

        #region Constructor
        public PriorityQueue()
        {
        }
        #endregion

        #region Public methods
        public void Enqueue(T val, int priority)
        {
            Pair p = new Pair(val, priority);
            array.Add(p);
            bubbleUp(array.Count - 1);
        }

        public T Dequeue()
        {
            if (array.Count <= 0)
                throw new System.InvalidOperationException("Queue is empty");
            else
            {
                Pair result = array[0];
                array[0] = array[array.Count - 1];
                array.RemoveAt(array.Count - 1);
                if (array.Count > 0)
                    trickleDown(0);
                return result.Val;
            }
        }
        #endregion

        #region Private methods
        private static int ParentOf(int index)
        {
            return (index - 1) / 2;
        }

        private static int LeftChildOf(int index)
        {
            return (index * 2) + 1;
        }

        private static bool ParentIsLowerPriority(Pair parent, Pair item)
        {
            return (parent.Priority < item.Priority);
        }

        // Move high priority items from the bottom up the heap
        private void bubbleUp(int index)
        {
            Pair item = array[index];
            int parent = ParentOf(index);
            while ((index > 0) && ParentIsLowerPriority(array[parent], item))
            {
                // Parent is lower priority -- move it down
                array[index] = array[parent];
                index = parent;
                parent = ParentOf(index);
            }
            // Write the item once in its correct place
            array[index] = item;
        }

        // Push low priority items from the top of the heap down
        private void trickleDown(int index)
        {
            Pair item = array[index];
            int child = LeftChildOf(index);
            while (child < array.Count)
            {
                bool rightChildExists = ((child + 1) < array.Count);
                if (rightChildExists)
                {
                    bool rightChildIsHigherPriority = (array[child].Priority < array[child + 1].Priority);
                    if (rightChildIsHigherPriority)
                        child++;
                }
                // array[child] points at the higher priority sibling -- move it up
                array[index] = array[child];
                index = child;
                child = LeftChildOf(index);
            }
            // Put the former root in its correct place
            array[index] = item;
            bubbleUp(index);
        }
        #endregion
    }
}
Median of medians algorithm.