Range Queries on a path in a tree - c++

I came across this question in a contest (which is now over) and I am not able to think of a time-efficient algorithm.
You are given a rooted tree of N (<= 10^5) nodes. Initially all nodes have value 0. There will be M (<= 10^5) updates to the tree, which are of the form:
Add x y – Add y to node x.
AddUp x y – Add y to x, the parent of x, the parent of the parent of x, and so on up to the root.
After that there will be Q (<= 10^5) queries, where you will either be asked to give the value of a node or the sum of the subtree rooted at that node.
What I did:-
First I tried the naive algorithm that updates each node according to the operation, but obviously it is too slow.
I also thought of using segment trees and lazy propagation, but I cannot see how to apply them here.
Any help is appreciated, thanks!

First, construct a graph where the children point to their parents.
After that, parse all the updates and store in each node of your tree the sum of Add and AddUp separately.
Your node should have the following variables:
sum_add : the sum of all the Add of this node
sum_add_up : the sum of all the AddUp of this node
subtree_sum : the sum of the subtree. Initialize with 0 by now.
Now, traverse your graph in topological order, i.e., you only process a node once all of its children have already been processed, which takes O(N). Let me now define the process function.
process(V) {
    V.sum_add_up = V.sum_add_up + sum(sum_add_up of all V.children)
    V.subtree_sum = V.sum_add + V.sum_add_up + sum(subtree_sum of all V.children)
}
Now you can answer all the queries in O(1). The query for the value of a node V is V.sum_add + V.sum_add_up, and the query for the subtree of V is just V.subtree_sum.
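A minimal C++ sketch of this idea, assuming the tree is given as a parent array; the exact input format (and the query keywords below) are made up for illustration, only the sum_add / sum_add_up / subtree_sum bookkeeping comes from the answer:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Hypothetical input format: N M Q, then parent[2..N] (node 1 is the root).
    int N, M, Q;
    cin >> N >> M >> Q;
    vector<int> parent(N + 1, 0);
    vector<vector<int>> children(N + 1);
    for (int v = 2; v <= N; ++v) {
        cin >> parent[v];
        children[parent[v]].push_back(v);
    }

    vector<long long> sum_add(N + 1, 0), sum_add_up(N + 1, 0), subtree_sum(N + 1, 0);

    // Store all updates first; nothing needs to be propagated yet.
    for (int i = 0; i < M; ++i) {
        string op; int x; long long y;
        cin >> op >> x >> y;
        if (op == "Add") sum_add[x] += y;
        else             sum_add_up[x] += y;   // "AddUp"
    }

    // Children-before-parents order: a BFS order from the root, processed in reverse,
    // is a cheap substitute for an explicit topological sort on a rooted tree.
    vector<int> order;
    order.push_back(1);
    for (size_t i = 0; i < order.size(); ++i)
        for (int c : children[order[i]]) order.push_back(c);

    for (int i = N - 1; i >= 0; --i) {          // process(V) from the answer
        int v = order[i];
        for (int c : children[v]) {
            sum_add_up[v] += sum_add_up[c];     // AddUp amounts bubble up to every ancestor
            subtree_sum[v] += subtree_sum[c];
        }
        subtree_sum[v] += sum_add[v] + sum_add_up[v];
    }

    // Hypothetical query format: "Value v" or "Subtree v".
    while (Q--) {
        string op; int v;
        cin >> op >> v;
        if (op == "Value") cout << sum_add[v] + sum_add_up[v] << "\n";
        else               cout << subtree_sum[v] << "\n";
    }
}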

This can be done with a Fenwick tree. To solve this kind of problem you need to run a topological sort on the tree and count the number of children (descendants) of each node.
    0
   / \
  1   2
 / \
3   4
index:    [0,1,2,3,4]
children: [4,2,0,0,0]
With the topological sort you will obtain the vector 0 1 3 4 2, which you need to reverse:
fenwick pos:   [0,1,2,3,4]
vector values: [2,4,3,1,0]
pos:           [5,3,0,2,1]
With a Fenwick tree you can execute two kinds of query: point/range update and prefix sum query.
When you need to update only a single index, call update(pos[index], y), and then cancel it for all following positions with update(pos[index] + 1, -y).
When you need to update the node and all of its parents, call update(pos[index], y) and update(pos[index] + children[index] + 1, -y).
To know the value of a node, run the prefix sum query up to pos[index].
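For reference, a minimal sketch of the Fenwick tree itself (point update plus prefix sum), on which the pos[]/children[] bookkeeping above can be built:

#include <vector>
using namespace std;

struct Fenwick {
    vector<long long> bit;                    // 1-indexed internally
    Fenwick(int n) : bit(n + 1, 0) {}
    void update(int i, long long delta) {     // add delta at position i (1-indexed)
        for (; i < (int)bit.size(); i += i & -i) bit[i] += delta;
    }
    long long query(int i) {                  // prefix sum of positions 1..i
        long long s = 0;
        for (; i > 0; i -= i & -i) s += bit[i];
        return s;
    }
};

Range update / point query is then the usual trick: update(l, y) and update(r + 1, -y), after which query(i) returns the value at position i. That is exactly the pair of update calls described above.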

I think this problem is just a direct application of a Binary Search Tree, which has an average-case cost (after n random operations) of O(1.39 log n) for both inserts and queries.
All you have to do is recursively add new nodes and update values and sum at the same time.
Implementation is also fairly simple (sorry for C#), for example for Add() (AddUp() is similar - increase value every time you go to left or right subtree):
public void Add(int key, int value)
{
    Root = Add(Root, key, value);
}

private Node Add(Node node, int key, int value)
{
    if (node == null)
    {
        node = new Node(key, value, value);
    }
    if (key < node.Key)
    {
        node.Left = Add(node.Left, key, value);
    }
    else if (key > node.Key)
    {
        node.Right = Add(node.Right, key, value);
    }
    else
    {
        node.Value = value;
    }
    node.Sum = Sum(node.Left) + Sum(node.Right) + node.Value;
    return node;
}
For 100000 numbers on my machine this translates to these numbers:
Added(up) 100000 values in: 213 ms, 831985 ticks
Got 100000 values in: 86 ms, 337072 ticks
And for 1 million numbers:
Added(up) 1000000 values in: 3127 ms, 12211606 ticks
Got 1000000 values in: 1244 ms, 4857733 ticks
Is this time efficient enough? You can try complete code here.

Related

Appropriate data structure for add and find queries

I have two types of queries.
1 X Y
Add element X, Y times in the collection.
2 N
Find the Nth element in the sorted collection.
Number of queries < 5 * 10^5
X < 10^9
Y < 10^9
I tried an STL set but it did not work.
I think we need a balanced tree with each node containing two data values.
The first value will be the element X. The other will be the prefix sum of all the Ys of elements smaller than or equal to that value.
When we are adding element X, find the predecessor of X by the first value, and add the second value associated with that predecessor to Y.
When finding the Nth element, search the tree (by the second value) for the value immediately lower than N.
How can I efficiently implement this data structure?
This can easily be done using a segment tree, with complexity O(Q * log(10^9)).
We should use a so-called "sparse" segment tree, so that we only create nodes when needed instead of creating all of them.
In every node we store the count of elements in the range [L, R].
Adding some element y times can now be done by traversing the segment tree from root to leaf and updating the values (also creating the nodes that do not exist yet).
Since the height of the segment tree is logarithmic, this takes O(log N) time, where N is our initial interval length (10^9).
Finding the k-th element can easily be done with a binary search on the segment tree: since every node knows the count of elements in its range, we can use this information to decide whether to go left or right until we reach the leaf that contains the k-th element.
Sample code (C++):
#include <bits/stdc++.h>
using namespace std;
#define ll long long

const int sz = 31*4*5*100000;   // enough nodes for ~5*10^5 updates on a range of size 10^9
ll seg[sz];
int L[sz], R[sz];               // child pointers of the sparse segment tree
int nxt = 2;                    // next free node index (node 1 is the root)

// Add val occurrences of the value idx; creates nodes lazily along the path.
void IncNode(int c, int l, int r, int idx, int val)
{
    if (l == r)
    {
        seg[c] += val;
        return;
    }
    int m = (l + r) / 2;
    if (idx <= m)
    {
        if (!L[c]) L[c] = nxt++;
        IncNode(L[c], l, m, idx, val);
    }
    else
    {
        if (!R[c]) R[c] = nxt++;
        IncNode(R[c], m + 1, r, idx, val);
    }
    seg[c] = seg[L[c]] + seg[R[c]];
}

// Binary search down the tree for the k-th smallest element.
int FindKth(int c, int l, int r, ll k)
{
    if (l == r) return r;
    int m = (l + r) / 2;
    if (seg[L[c]] >= k) return FindKth(L[c], l, m, k);
    return FindKth(R[c], m + 1, r, k - seg[L[c]]);
}

int main()
{
    ios::sync_with_stdio(0); cin.tie(0); cout.tie(0);
    int Q;
    cin >> Q;
    int lo = 0, hi = 1e9;       // value range covered by the tree
    while (Q--)
    {
        int type;
        cin >> type;
        if (type == 1)
        {
            int x, y;
            cin >> x >> y;
            IncNode(1, lo, hi, x, y);
        }
        else
        {
            ll k;               // N can exceed 32 bits: up to 5*10^5 updates of 10^9 each
            cin >> k;
            cout << FindKth(1, lo, hi, k) << "\n";
        }
    }
}
Maintaining a prefix sum in each node is not practical. It would mean that every time you add a new node, you have to update the prefix sum in every node succeeding it in the tree. Instead, you need to maintain subtree sums: each node should contain the sum of Y-values for its own key and the keys of all descendants. Maintaining subtree sums when the tree is updated should be straightforward.
When you answer a query of type 2, at each node, you would descend into the left subtree if N is less than or equal to the subtree sum value S of the left child (I'm assuming N is 1-indexed). Otherwise, subtract S + 1 from N and descend into the right subtree.
By the way, if the entire set of X values is known in advance, then instead of a balanced BST, you could use a range tree or a binary indexed tree.
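A sketch of that binary indexed tree idea, assuming the queries can be read up front so the X values can be coordinate-compressed; the binary-lifting search for the k-th element is a standard Fenwick technique, not something from the question:

#include <bits/stdc++.h>
using namespace std;

int main() {
    int q;  cin >> q;
    vector<array<long long,3>> ops(q);        // {type, X or N, Y}
    vector<long long> xs;
    for (auto &o : ops) {
        cin >> o[0] >> o[1];
        if (o[0] == 1) { cin >> o[2]; xs.push_back(o[1]); }
    }
    sort(xs.begin(), xs.end());
    xs.erase(unique(xs.begin(), xs.end()), xs.end());

    int m = xs.size();
    vector<long long> bit(m + 1, 0);          // Fenwick tree over compressed values
    auto update = [&](int i, long long d) { for (; i <= m; i += i & -i) bit[i] += d; };

    int LOG = 1;
    while ((1 << LOG) <= m) ++LOG;

    for (auto &o : ops) {
        if (o[0] == 1) {
            int pos = lower_bound(xs.begin(), xs.end(), o[1]) - xs.begin() + 1;
            update(pos, o[2]);                // element X occurs Y more times
        } else {
            // k-th element via binary lifting over the Fenwick tree, O(log m);
            // assumes the query asks for an element that exists.
            long long k = o[1];
            int idx = 0;
            for (int step = 1 << (LOG - 1); step; step >>= 1)
                if (idx + step <= m && bit[idx + step] < k) {
                    idx += step;
                    k -= bit[idx];
                }
            cout << xs[idx] << "\n";          // idx is the 0-based rank of the answer
        }
    }
}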

Is there a way to reduce the time complexity of the program?

Assume there are n prisoners standing in a circle. The first prisoner has a knife with which he kills the second prisoner and passes on the knife to the third person who kills the fourth prisoner and passes the knife to the fifth prisoner.
This cycle is repeated till only one prisoner is left. Note that the prisoners are standing in a circle, thus the first prisoner is next to the nth prisoner. Return the index of the last standing prisoner.
I tried implementing the solution using a circular linked list. Here's my code
The structure of the circular linked list is:-
struct Node
{
    int Data;
    Node *Next;
};
Node *Head = NULL;
Here are my deleteByAddress() and main() functions:-
inline void deleteByAddress(Node *delNode)
{
    Node *n = Head;
    if (Head == delNode)
    {
        while (n -> Next != Head)
        {
            n = n -> Next;
        }
        n -> Next = Head -> Next;
        free(Head);
        Head = n -> Next;
        return;
    }
    while (n -> Next != delNode)
    {
        n = n -> Next;
    }
    n -> Next = delNode -> Next;
    delete delNode;
}

int main(void)
{
    for (int i = 1; i <= 100; i++)
        insertAtEnd(i);
    Node *n = Head;
    while (Head -> Next != Head)
    {
        deleteByAddress(n -> Next);
        n = n -> Next;
    }
    cout << Head -> Data;
    return 0;
}
The above code works perfectly and produces the desired output for n = 100, which is 73.
Is there any way we can reduce the time complexity or use a more efficient data structure to implement the same question.
This is known as the Josephus problem. As the Wikipedia page shows and others have noted, there is a formula for when k is 2. The general recurrence is
// zero-based Josephus
function g(n, k) {
    if (n == 1)
        return 0
    return (g(n - 1, k) + k) % n
}

console.log(g(100, 2) + 1)
This can easily be solved with O(1) complexity using the following:
last = (num - pow(2, int(log(num)/log(2)))) * 2 + 1
for example for num = 100 :
last = (100 - pow(2, int(log(100)/log(2)))) * 2 + 1 = 73
And if you have a log2() function, you can replace the somewhat ugly log(num)/log(2), which simply takes the logarithm with base 2.
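The same closed form in C++, computing the highest power of two with bit shifts instead of floating-point logarithms to avoid rounding issues (a small sketch):

#include <iostream>

// 1-indexed survivor for k = 2: if num = 2^m + L with 0 <= L < 2^m, the answer is 2*L + 1.
long long lastPrisoner(long long num) {
    long long p = 1;
    while (p * 2 <= num) p *= 2;   // highest power of two <= num
    return (num - p) * 2 + 1;
}

int main() {
    std::cout << lastPrisoner(100) << "\n";   // prints 73
}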
Use one loop. At every iteration you can grab the current one's next, set current to the next one's next, and then delete the next one.
This assumes all the data is set up beforehand and ignores the rewriting of the next pointer when you hit the bounds.
The trick to reduce time complexity is to come up with more clever algorithms than brute-forcing it by simulation.
Here, as so often, the key is obviously to solve the math. The first loop, for example, kills everybody with i % 2 = 1 (assuming 0-based indexing), the second everybody with i % 4 = (n+1) % 2 * 2 or so, etc. I'd be looking for a closed form to directly compute the survivor. It will likely boil down to a few bit manipulations, yielding an O(log n) algorithm that is almost instant in practice because it runs completely in CPU registers with not even L1 cache accesses.
For such simple processing, the list manipulation and memory allocation are going to dominate the computation; you could instead use just a single array where you keep an index to the first alive prisoner and each element is the index of the next alive one.
That said, you could indeed search for a formula that avoids doing the loops... for example, if the number of prisoners is even, then after the first "loop" you end up with half of the prisoners and the knife back in the hands of the first one. This means that the index of the surviving prisoner when n is even is
f(n) = 2 * f(n / 2) # when n is even
in case n is odd, things are a bit more complex... after the first loop you will end up with (n + 1)/2 prisoners, but with the knife in the hand of the last one, so some modulo arithmetic is needed because you need to "adjust" the result of the recursive call f((n + 1)/2).
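A small recursive sketch along those lines; the odd branch below is my own completion (absorbing the extra kill of prisoner 0 into the recursion: the survivors are then 2, 4, ..., n-1 with the knife at 2, giving f(n) = 2*f((n-1)/2) + 2 for odd n, 0-indexed):

#include <iostream>

// 0-indexed position of the surviving prisoner for k = 2.
long long f(long long n) {
    if (n == 1) return 0;
    if (n % 2 == 0) return 2 * f(n / 2);   // even: same game on the survivors 0, 2, 4, ...
    return 2 * f((n - 1) / 2) + 2;         // odd: prisoner 0 also dies, game restarts at 2
}

int main() {
    std::cout << f(100) + 1 << "\n";       // 1-indexed answer: 73
}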
The method to reduce time complexity is, as in most cases where a challenge fails for out-of-time reasons, to not simulate and to use math instead. With luck it turns into a one-liner.
The algorithm can be sped up very much if you change your approach:
Note that for a total number of prisoners which is a power of two, index 0 always survives.
For other cases:
determine the highest power of two which is lower than or equal to the number of prisoners
determine R, the remainder when reducing the number of prisoners by that power of two
the prisoner who survives in the end will be the one who gets the knife after that number (R) of prisoners has been killed
Let's try to find out which prisoner that is.
Case of 5 prisoners (1 higher than 2^2, R=1):
          01234
Deaths 1:  x x
Deaths 2: x   x
last:       O
Case of 6 (R=2):
          012345
Deaths 1:  x x x
Deaths 2: x x        (index 4 kills index 0 after index 2 was killed by index 0)
last:         O
Case of 7 (R=3):
          0123456
Deaths 1: xx x x     (index 6 kills index 0 after index 5 was killed by index 4)
Deaths 2:   x x      (index 6 kills index 2 after index 4 was killed by index 2)
last:           O
Case of 8 is the next power of two, index 0 survives.
In the end, the final survivor is always the one at index 2*R.
Hence, instead of simulating, you just need to determine R.
That should be possible in, at worst, time proportional to the base-2 logarithm of the total number of prisoners.

Count the number of nodes in an AVL tree in a given range

I'm required to write a C++ function that, given a range (a,b], returns the number of nodes in an AVL tree that fall in that range, specifically with O(log n) time complexity.
I can add more fields to the tree's nodes if I need to do so.
I should point out that a,b will not necessarily appear in the tree. For example, if the tree's nodes are: 1,2,5,7,9,10, then running the function using the parameters (3,9] should return 3.
Which algorithm should I use to achieve this?
This is a famous problem - dynamic order statistics by tree augmentation.
You basically need to augment your nodes so that, when you look at a child pointer, you know how many nodes are in the child's subtree in O(1) time. It's easy to see that this can be done without affecting the complexity.
Once you have that, you can answer any query (between this and that, inclusive/exclusive - all possibilities) by performing two traversals from the root down. The exact traversals depend on the details (check the functions lower_bound and upper_bound in C++, for example).
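A sketch of the counting step once every node carries its subtree size; the node layout and names here are my own, not from the question:

struct AugNode {
    int key;
    int size;            // number of nodes in this subtree (the augmentation)
    AugNode *left, *right;
};

int sz(AugNode *t) { return t ? t->size : 0; }

// Number of keys less than or equal to x: one root-to-leaf walk, O(log n) in an AVL tree.
int countLE(AugNode *t, int x) {
    if (!t) return 0;
    if (x < t->key) return countLE(t->left, x);
    return sz(t->left) + 1 + countLE(t->right, x);
}

// Number of keys in the half-open range (a, b].
int countRange(AugNode *root, int a, int b) {
    return countLE(root, b) - countLE(root, a);
}

For the question's example 1, 2, 5, 7, 9, 10 with the range (3, 9], countLE(9) = 5 and countLE(3) = 2, so countRange returns 3, as expected.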
First you could implement a split-by-key operation. That is, given a tree, split(tree, key, ts, tg) splits the tree in two by key: ts contains the keys less than key; tg contains the greater or equal ones. This operation can be done in O(lg n).
Then, with two splits, the first on a and the second on b, you can obtain the desired range as a set in O(lg n).
The split could be implemented as follows (pseudo code):
void split(Node * root, const Key & key, Node *& ts, Node *& tg) noexcept
{
    if (root == Node::NullPtr)
        return;
    if (key < KEY(root))
    {
        Node * r = RLINK(root), * tgaux = Node::NullPtr;
        split(LLINK(root), key, ts, tgaux);
        insert(tgaux, root);          // insert root in tgaux
        tg = join_ex(tgaux, r);
    }
    else
    {   // keys greater than or equal to key go to tg
        Node * l = LLINK(root), * tsaux = Node::NullPtr;
        split(RLINK(root), key, tsaux, tg);
        insert(tsaux, root);          // insert root in tsaux
        ts = join_ex(l, tsaux);
    }
}
The join_ex(t1, t2) operation joins two exclusive trees; that is, all the keys of t1 are less than any key of t2. This join can be implemented in O(lg n) in a way similar to the concatenation described by Knuth in TAOCP Vol. 3, 6.2.3.
Roughly speaking, if you want to join l and r, suppose h(l) > h(r). You remove from r the leftmost node (the minimum). Let j be this join node and r' the resulting tree (r - j). Now you descend along the right side of l until reaching a node p such that h(p) - h(r') equals 0 or 1. At that moment you put j in p's place, with p as its left subtree and r' as its right subtree,
and you treat j as if it had just been inserted (rebalancing on the way back up).
EDIT: I was wrong in interpreting the question. Sorry. I did not see that it was about counting, not computing a set. What follows is my new answer; I do not erase what I've written above because I think it is useful anyway.
Ami Tavory was right.
If you use augmented ("extended") trees, that is, if you store the subtree cardinality in each node, then you can easily compute the inorder position of a key. I usually call this operation position(key). If key is not in the set, then it returns the position that key would have if it were inserted in the tree.
The inorder position of the root is the cardinality of its left subtree.
Now, in order to count the cardinality of the set [a, b) you compute position(b) - position(a). You may need to make some adjustments if a or b is not present in the tree, but that is basically it.
position(key) is, I think, "naturally" simple. Supposing that the node cardinality is accessed with COUNT(node):
long position(Node * root, const Key & key) noexcept
{
    if (root == Node::NullPtr)
        return 0;
    if (key < KEY(root))
        return position(LLINK(root), key);
    else if (KEY(root) < key)
        return position(RLINK(root), key) + COUNT(LLINK(root)) + 1;
    else // the root contains key
        return COUNT(LLINK(root));
}
Since an AVL tree is balanced, position() takes O(lg n), so two calls take O(lg n). A non-recursive version is simple to write.
I hope you will forgive my mistake.

What's time complexity of this algorithm for finding all Path Sum?

Path Sum: Given a binary tree and a sum, find all root-to-leaf paths where each path's sum equals the given sum.
For example, with sum = 11:
        5
       / \
      4   8
     /   / \
    2  -2   1
The answer is:
[
   [5, 4, 2],
   [5, 8, -2]
]
Personally I think the time complexity = O(2^n), where n is the number of nodes of the given binary tree.
Thank you Vikram Bhat and David Grayson; the tight time complexity = O(n log n), where n is the number of nodes in the given binary tree.
The algorithm checks each node once, which costs O(n).
"vector<int> one_result(subList);" copies the entire path from subList to one_result each time, which costs O(log n), because the height is O(log n).
So finally, the time complexity = O(n * log n) = O(n log n).
The idea of this solution is DFS[C++].
/**
 * Definition for binary tree
 * struct TreeNode {
 *     int val;
 *     TreeNode *left;
 *     TreeNode *right;
 *     TreeNode(int x) : val(x), left(NULL), right(NULL) {}
 * };
 */
#include <vector>
using namespace std;

class Solution {
public:
    vector<vector<int> > pathSum(TreeNode *root, int sum) {
        vector<vector<int>> list;
        // Input validation.
        if (root == NULL) return list;
        vector<int> subList;
        int tmp_sum = 0;
        helper(root, sum, tmp_sum, list, subList);
        return list;
    }

    void helper(TreeNode *root, int sum, int tmp_sum,
                vector<vector<int>> &list, vector<int> &subList) {
        // Base case.
        if (root == NULL) return;
        if (root->left == NULL && root->right == NULL) {
            // Have a try.
            tmp_sum += root->val;
            subList.push_back(root->val);
            if (tmp_sum == sum) {
                vector<int> one_result(subList);
                list.push_back(one_result);
            }
            // Roll back.
            tmp_sum -= root->val;
            subList.pop_back();
            return;
        }
        // Have a try.
        tmp_sum += root->val;
        subList.push_back(root->val);
        // Do recursion.
        helper(root->left, sum, tmp_sum, list, subList);
        helper(root->right, sum, tmp_sum, list, subList);
        // Roll back.
        tmp_sum -= root->val;
        subList.pop_back();
    }
};
Though it seems that the time complexity is O(N), if you need to print all paths then it is O(N*logN). Suppose you have a complete binary tree: the total number of paths will be N/2 and each path will have logN nodes, so the total is O(N*logN) in the worst case.
Your algorithm looks correct, and the complexity should be O(n) because your helper function will run once for each node, and n is the number of nodes.
Update: Actually, it would be O(N*log(N)) because each time the helper function runs, it might print a path to the console consisting of O(log(N)) nodes, and it will run O(N) times.
TIME COMPLEXITY
The time complexity of the algorithm is O(N^2), where ‘N’ is the total number of nodes in the tree. This is due to the fact that we traverse each node once (which will take O(N)), and for every leaf node we might have to store its path which will take O(N).
We can calculate a tighter time complexity of O(NlogN) from the space complexity discussion below.
SPACE COMPLEXITY
If we ignore the space required for all paths list, the space complexity of the above algorithm will be O(N) in the worst case. This space will be used to store the recursion stack. The worst-case will happen when the given tree is a linked list (i.e., every node has only one child).
How can we estimate the space used for the all paths list? Take the example of the following balanced tree:
        1
       / \
      2   3
     / \ / \
    4  5 6  7
Here we have seven nodes (i.e., N = 7). Since, for binary trees, there exists only one path to reach any leaf node, we can easily say that total root-to-leaf paths in a binary tree can’t be more than the number of leaves. As we know that there can’t be more than N/2 leaves in a binary tree, therefore the maximum number of elements in all paths list will be O(N/2) = O(N). Now, each of these paths can have many nodes in them. For a balanced binary tree (like above), each leaf node will be at maximum depth. As we know that the depth (or height) of a balanced binary tree is O(logN) we can say that, at the most, each path can have logN nodes in it. This means that the total size of the all paths list will be O(N*logN). If the tree is not balanced, we will still have the same worst-case space complexity.
From the above discussion, we can conclude that the overall space complexity of our algorithm is O(N*logN).
Also from the above discussion, since for each leaf node, in the worst case, we have to copy log(N) nodes to store its path, therefore the time complexity of our algorithm will also be O(N*logN).
The worst case time complexity is not O(nlogn), but O(n^2).
to visit every node, we need O(n) time
to generate all paths, we have to add the nodes to the path for every valid path.
So the time taken is the sum of len(path) over all valid paths. To estimate an upper bound on the sum: the number of paths is bounded by n, and the length of each path is also bounded by n, so O(n^2) is an upper bound. Both worst cases can be reached at the same time if the top half of the tree is a linear chain and the bottom half is a complete binary tree, like this:
       1
       1
       1
       1
       1
      / \
     1   1
    / \ / \
   1  1 1  1
number of paths is n/4, and length of each path is n/2 + log(n/2) ~ n/2

Fastest way to obtain the largest X numbers from a very large unsorted list?

I'm trying to obtain the top, say, 100 scores from a list of scores being generated by my program. Unfortunately the list is huge (on the order of millions to billions), so sorting is a time-intensive portion of the program.
What's the best way of doing the sorting to get the top 100 scores?
The only two methods I can think of so far are: first, generating all the scores into a massive array and then sorting it and taking the top 100; or second, generating X scores, sorting them and truncating to the top 100, then continuing to generate more scores, adding them to the truncated list and sorting again.
Either way I do it, it still takes more time than I would like. Any ideas on how to do it in an even more efficient way? (I've never taken programming courses before; maybe those of you with comp sci degrees know about efficient algorithms to do this, at least that's what I'm hoping.)
Lastly, what's the sorting algorithm used by the standard sort() function in C++?
Thanks,
-Faken
Edit: Just for anyone who is curious...
I did a few time trials on the before and after and here are the results:
Old program (performs sorting after each outer loop iteration):
top 100 scores: 147 seconds
top 10 scores: 147 seconds
top 1 scores: 146 seconds
Sorting disabled: 55 seconds
new program (implementing tracking of only top scores and using default sorting function):
top 100 scores: 350 seconds <-- hmm...worse than before
top 10 scores: 103 seconds
top 1 scores: 69 seconds
Sorting disabled: 51 seconds
new rewrite (optimizations in data stored, hand written sorting algorithm):
top 100 scores: 71 seconds <-- Very nice!
top 10 scores: 52 seconds
top 1 scores: 51 seconds
Sorting disabled: 50 seconds
Done on a Core 2, 1.6 GHz... I can't wait till my Core i7 860 arrives...
There are a lot of other, even more aggressive optimizations for me to work out (mainly in the area of reducing the number of iterations I run), but as it stands right now the speed is more than good enough; I might not even bother to work out those algorithmic optimizations.
Thanks to everyone for their input!
take the first 100 scores and sort them in an array
take the next score and insertion-sort it into the array (starting at the "small" end)
drop the 101st value
continue with the next value, at step 2, until done
Over time, the list will resemble the 100 largest values more and more, so more and more often you will find that the insertion sort aborts immediately because the new value is smaller than the smallest of the candidates for the top 100.
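A small C++ sketch of that loop (array kept sorted ascending; here a binary search finds the insertion point, but a plain insertion-sort scan from the small end works just as well):

#include <algorithm>
#include <vector>

// Keeps the 100 largest values seen so far, sorted ascending; scores is the full input.
std::vector<int> top100(const std::vector<int>& scores) {
    std::vector<int> best;
    for (int s : scores) {
        if (best.size() < 100) {
            best.insert(std::lower_bound(best.begin(), best.end(), s), s);
        } else if (s > best.front()) {           // bigger than the current smallest candidate
            best.erase(best.begin());            // drop the 101st (smallest) value
            best.insert(std::lower_bound(best.begin(), best.end(), s), s);
        }
        // otherwise: the new value can't be in the top 100, nothing to do
    }
    return best;                                 // ascending; reverse for largest-first
}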
You can do this in O(n) time, without any sorting, using a heap:
#!/usr/bin/python

import heapq

def top_n(l, n):
    top_n = []
    smallest = None

    for elem in l:
        if len(top_n) < n:
            top_n.append(elem)
            if len(top_n) == n:
                heapq.heapify(top_n)
                smallest = heapq.nsmallest(1, top_n)[0]
        else:
            if elem > smallest:
                heapq.heapreplace(top_n, elem)
                smallest = heapq.nsmallest(1, top_n)[0]

    return sorted(top_n)

def random_ints(n):
    import random
    for i in range(0, n):
        yield random.randint(0, 10000)

print top_n(random_ints(1000000), 100)
Times on my machine (Core2 Q6600, Linux, Python 2.6, measured with bash time builtin):
100000 elements: .29 seconds
1000000 elements: 2.8 seconds
10000000 elements: 25.2 seconds
Edit/addition: In C++, you can use std::priority_queue in much the same way as Python's heapq module is used here. You'll want to use the std::greater ordering instead of the default std::less, so that the top() member function returns the smallest element instead of the largest one. C++'s priority queue doesn't have the equivalent of heapreplace, which replaces the top element with a new one, so instead you'll want to pop the top (smallest) element and then push the newly seen value. Other than that the algorithm translates quite cleanly from Python to C++.
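A C++ translation along those lines, as a sketch: a min-heap via std::greater, with pop-then-push standing in for heapreplace:

#include <functional>
#include <queue>
#include <vector>

std::vector<int> top_n(const std::vector<int>& values, std::size_t n) {
    // Min-heap: top() is the smallest of the current candidates.
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int v : values) {
        if (heap.size() < n) {
            heap.push(v);
        } else if (v > heap.top()) {   // no heapreplace in std:: -- pop the smallest, push the new value
            heap.pop();
            heap.push(v);
        }
    }
    std::vector<int> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;                     // ascending order, like sorted(top_n) in the Python version
}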
Here's the 'natural' C++ way to do this:
std::vector<Score> v;
// fill in v

// Move the 100 largest scores to the front, in descending order.
std::partial_sort(v.begin(), v.begin() + 100, v.end(), std::greater<Score>());
// Optionally re-sort just those 100 into ascending order.
std::sort(v.begin(), v.begin() + 100);
This is linear in the number of scores.
The algorithm used by std::sort isn't specified by the standard, but libstdc++ (used by g++) uses an "adaptive introsort", which is essentially a median-of-3 quicksort down to a certain level, followed by an insertion sort.
Declare an array where you can put the 100 best scores. Loop through the huge list and check for each item if it qualifies to be inserted in the top 100. Use a simple insert sort to add an item to the top list.
Something like this (C# code, but you get the idea):
Score[] toplist = new Score[100];
int size = 0;
foreach (Score score in hugeList) {
    int pos = size;
    while (pos > 0 && toplist[pos - 1] < score) {
        pos--;
        if (pos < 99) toplist[pos + 1] = toplist[pos];
    }
    if (size < 100) size++;
    if (pos < size) toplist[pos] = score;
}
I tested it on my computer (Core 2 Duo 2.54 GHz, Win 7 x64) and I can process 100,000,000 items in 369 ms.
Since speed is of the essence here, and 40,000 possible highscore values is totally manageable for any of today's computers, I'd resort to bucket sort for simplicity. My guess is that it would outperform any of the algorithms proposed thus far. The downside is that you'd have to determine some upper limit for the highscore values.
So, let's assume your max highscore value is 40,000:
Make an array of 40,000 entries. Loop through your highscore values. Each time you encounter highscore x, increase array[x] by one. After this, all you have to do is count the top entries in your array until you have reached 100 counted highscores.
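A sketch of that bucket approach; MAX_SCORE is a placeholder for whatever upper limit you determine:

#include <vector>

const int MAX_SCORE = 40000;                         // assumed upper bound on a score

// Returns the 100 largest scores, largest first, by counting occurrences per value.
std::vector<int> top100ByBucket(const std::vector<int>& scores) {
    std::vector<int> count(MAX_SCORE + 1, 0);
    for (int s : scores) ++count[s];                 // one pass over the input
    std::vector<int> result;
    for (int v = MAX_SCORE; v >= 0 && result.size() < 100; --v)
        for (int c = count[v]; c > 0 && result.size() < 100; --c)
            result.push_back(v);                     // emit value v as many times as it occurred
    return result;
}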
You can do it in Haskell like this:
largest100 xs = take 100 $ sortBy (flip compare) xs
This looks like it sorts all the numbers into descending order (the "flip compare" bit reverses the arguments to the standard comparison function) and then returns the first 100 entries from the list. But Haskell is lazily evaluated, so the sortBy function does just enough sorting to find the first 100 numbers in the list, and then stops.
Purists will note that you could also write the function as
largest100 = take 100 . sortBy (flip compare)
This means just the same thing, but illustrates the Haskell style of composing a new function out of the building blocks of other functions rather than handing variables around the place.
You want the absolute largest X numbers, so I'm guessing you don't want some sort of heuristic. How unsorted is the list? If it's pretty random, your best bet really is just to do a quick sort on the whole list and grab the top X results.
If you can filter scores during the list generation, that's way way better. Only ever store X values, and every time you get a new value, compare it to those X values. If it's less than all of them, throw it out. If it's bigger than one of them, throw out the new smallest value.
If X is small enough you can even keep your list of X values sorted so that you are comparing your new number to a sorted list of values, you can make an O(1) check to see if the new value is smaller than all of the rest and thus throw it out. Otherwise, a quick binary search can find where the new value goes in the list and then you can throw away the first value of the array (assuming the first element is the smallest element).
Place the data into a balanced tree structure (probably a red-black tree) that does the sorting in place. Insertions should be O(lg n). Grabbing the highest x scores should be O(lg n) as well.
You can prune the tree every once in a while if you find you need optimizations at some point.
If you only need to report the value of top 100 scores (and not any associated data), and if you know that the scores will all be in a finite range such as [0,100], then an easy way to do it is with "counting sort"...
Basically, create an array representing all possible values (e.g. an array of size 101 if scores can range from 0 to 100 inclusive), and initialize all the elements of the array with a value of 0. Then, iterate through the list of scores, incrementing the corresponding entry in the list of achieved scores. That is, compile the number of times each score in the range has been achieved. Then, working from the end of the array to the beginning of the array, you can pick out the top X score. Here is some pseudo-code:
let type Score be an integer ranging from 0 to 100, inclusive.
let scores be an array of Score objects
let scorerange be an array of integers of size 101.

for i in [0,100]
    set scorerange[i] = 0
for each score in scores
    set scorerange[score] = scorerange[score] + 1

let top be the number of top scores to report
let idx be an integer initialized to the end of scorerange (i.e. 100)
while (top > 0) and (idx >= 0):
    if scorerange[idx] > 0:
        report "There are " scorerange[idx] " scores with value " idx
        top = top - scorerange[idx]
    idx = idx - 1
I answered this question in response to an interview question in 2008. I implemented a templatized priority queue in C#.
using System;
using System.Collections.Generic;
using System.Text;

namespace CompanyTest
{
    // Based on pre-generics C# implementation at
    // http://www.boyet.com/Articles/WritingapriorityqueueinC.html
    // and wikipedia article
    // http://en.wikipedia.org/wiki/Binary_heap
    class PriorityQueue<T>
    {
        struct Pair
        {
            T val;
            int priority;
            public Pair(T v, int p)
            {
                this.val = v;
                this.priority = p;
            }
            public T Val { get { return this.val; } }
            public int Priority { get { return this.priority; } }
        }

        #region Private members
        private System.Collections.Generic.List<Pair> array = new System.Collections.Generic.List<Pair>();
        #endregion

        #region Constructor
        public PriorityQueue()
        {
        }
        #endregion

        #region Public methods
        public void Enqueue(T val, int priority)
        {
            Pair p = new Pair(val, priority);
            array.Add(p);
            bubbleUp(array.Count - 1);
        }

        public T Dequeue()
        {
            if (array.Count <= 0)
                throw new System.InvalidOperationException("Queue is empty");
            else
            {
                Pair result = array[0];
                array[0] = array[array.Count - 1];
                array.RemoveAt(array.Count - 1);
                if (array.Count > 0)
                    trickleDown(0);
                return result.Val;
            }
        }
        #endregion

        #region Private methods
        private static int ParentOf(int index)
        {
            return (index - 1) / 2;
        }

        private static int LeftChildOf(int index)
        {
            return (index * 2) + 1;
        }

        private static bool ParentIsLowerPriority(Pair parent, Pair item)
        {
            return (parent.Priority < item.Priority);
        }

        // Move high priority items from bottom up the heap
        private void bubbleUp(int index)
        {
            Pair item = array[index];
            int parent = ParentOf(index);
            while ((index > 0) && ParentIsLowerPriority(array[parent], item))
            {
                // Parent is lower priority -- move it down
                array[index] = array[parent];
                index = parent;
                parent = ParentOf(index);
            }
            // Write the item once in its correct place
            array[index] = item;
        }

        // Push low priority items from the top of the heap down
        private void trickleDown(int index)
        {
            Pair item = array[index];
            int child = LeftChildOf(index);
            while (child < array.Count)
            {
                bool rightChildExists = ((child + 1) < array.Count);
                if (rightChildExists)
                {
                    bool rightChildIsHigherPriority = (array[child].Priority < array[child + 1].Priority);
                    if (rightChildIsHigherPriority)
                        child++;
                }
                // array[child] points at higher priority sibling -- move it up
                array[index] = array[child];
                index = child;
                child = LeftChildOf(index);
            }
            // Put the former root in its correct place
            array[index] = item;
            bubbleUp(index);
        }
        #endregion
    }
}
Median of medians algorithm.
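In C++, the same linear-time selection idea is available off the shelf as std::nth_element (typically implemented as introselect rather than literal median-of-medians, but serving the same purpose); a small sketch:

#include <algorithm>
#include <functional>
#include <vector>

// Moves the 100 largest scores to the front of v in (on average) linear time,
// then orders just those 100 for presentation.
void select_top100(std::vector<int>& v) {
    if (v.size() > 100)
        std::nth_element(v.begin(), v.begin() + 99, v.end(), std::greater<int>());
    std::sort(v.begin(), v.begin() + std::min<std::size_t>(100, v.size()),
              std::greater<int>());
}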