Can anyone suggest methods, or link to implementations, of fast median finding for dynamic ranges in C++? For example, suppose that on each iteration of my program the range grows, and I want to find the median at each step.
Range
4
3,4
8,3,4
2,8,3,4
7,2,8,3,4
So the above input would ultimately produce 5 median values, one for each line.
The best you can get without also keeping track of a sorted copy of your array is to re-use the old median and update it with a linear-time search for the next-biggest value. This might sound simple; however, there is a problem we have to solve.
Consider the following list (sorted for easier understanding, but you keep them in an arbitrary order):
1, 2, 3, 3, 3, 4, 5
// *
So here, the median is 3 (the middle element, since the list is sorted). Now if you add a number which is greater than the median, this potentially "moves" the median to the right by one half index. I see two problems: How can we advance by one half index? (By definition, the median of an even-sized list is the mean of the two middle values.) And how do we know at which 3 the median was, when we only know that the median was 3?
This can be solved by storing not only the current median but also its position within the run of equal values; here it has an "index offset" of 1, since it's the second 3. Adding a number greater than or equal to 3 to the list changes the index offset to 1.5; adding a number less than 3 changes it to 0.5.
When this offset becomes less than zero, the median changes. It also has to change if the offset goes beyond the count of equal numbers (minus 1), in this case 2, meaning the new median is greater than the last equal number. In both cases, you have to search for the next-smaller / next-greater number and update the median value. To always know the upper limit for the index offset (in this case 2), you also have to keep track of the count of equal numbers.
This should give you a rough idea of how to implement median updating in linear time.
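As a baseline, note that O(n) per insertion is achievable with no bookkeeping at all by re-selecting the middle elements with std::nth_element on a scratch copy. This is a minimal sketch of that simpler route, not the scheme described above; the function name is mine:

#include <algorithm>
#include <iostream>
#include <vector>

// Recompute the median in O(n) per insertion with std::nth_element,
// without maintaining a sorted copy.
double medianOf(std::vector<int> v)  // taken by value: a scratch copy
{
    auto mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    double hi = v[mid];
    if (v.size() % 2 == 1)
        return hi;
    // Even count: also fetch the element just below the middle.
    std::nth_element(v.begin(), v.begin() + mid - 1, v.end());
    return (v[mid - 1] + hi) / 2.0;
}

int main()
{
    std::vector<int> a;
    for (int x : {4, 3, 8, 2, 7}) {  // the growing range from the question
        a.push_back(x);
        std::cout << medianOf(a) << '\n';  // prints 4, 3.5, 4, 3.5, 4
    }
}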
I think you can use a min-max-median heap. Each time the array is updated, you need only O(log n) time to find the new median value. In a min-max-median heap, the root is the median value, the left subtree is a min-max heap, and the right subtree is a max-min heap. Please refer to the paper "Min-Max Heaps and Generalized Priority Queues" for the details.
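For comparison, a pair of ordinary heaps (a max-heap holding the lower half, a min-heap holding the upper half) gives the same O(log n) insertion without a specialized structure. A hedged C++ sketch, not taken from the paper:

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// Running median with two heaps; the two sizes differ by at most one.
class RunningMedian {
    std::priority_queue<int> lower;  // max-heap: lower half
    std::priority_queue<int, std::vector<int>, std::greater<int>> upper;  // min-heap: upper half
public:
    void add(int x)
    {
        if (lower.empty() || x <= lower.top()) lower.push(x);
        else upper.push(x);
        // Rebalance so the size difference stays at most one.
        if (lower.size() > upper.size() + 1) { upper.push(lower.top()); lower.pop(); }
        else if (upper.size() > lower.size() + 1) { lower.push(upper.top()); upper.pop(); }
    }
    double median() const  // requires at least one element added
    {
        if (lower.size() == upper.size())
            return (lower.top() + upper.top()) / 2.0;
        return lower.size() > upper.size() ? lower.top() : upper.top();
    }
};

int main()
{
    RunningMedian rm;
    for (int x : {4, 3, 8, 2, 7}) {  // the question's growing range
        rm.add(x);
        std::cout << rm.median() << '\n';  // prints 4, 3.5, 4, 3.5, 4
    }
}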
Find some code below; I have reworked this to give your required output.
private void button1_Click(object sender, EventArgs e)
{
    string range = "7,2,8,3,4";
    decimal median = FindMedian(range);
    MessageBox.Show(median.ToString());
}

public decimal FindMedian(string source)
{
    // Create a copy of the input, and sort the copy
    int[] temp = source.Split(',').Select(m => Convert.ToInt32(m)).ToArray();
    Array.Sort(temp);

    int count = temp.Length;
    if (count == 0) {
        throw new InvalidOperationException("Empty collection");
    }
    else if (count % 2 == 0) {
        // count is even, average the two middle elements
        int a = temp[count / 2 - 1];
        int b = temp[count / 2];
        return (a + b) / 2m;
    }
    else {
        // count is odd, return the middle element
        return temp[count / 2];
    }
}
Related
I have a large list of items, each item has a weight.
I'd like to select N items randomly without replacement, while the items with more weight are more probable to be selected.
I'm looking for the best-performing approach; performance is paramount. Any ideas?
If you want to sample items without replacement, you have lots of options.
Use a weighted-choice-with-replacement algorithm to choose random indices. There are many algorithms like this. One of them is WeightedChoice, described later in this answer, and another is rejection sampling, described as follows. Assume that the highest weight is max, there are n weights, and each weight is 0 or greater. To choose an index in [0, n) using rejection sampling:
Choose a uniform random integer i in [0, n).
With probability weights[i]/max, return i. Otherwise, go to step 1. (For example, if all the weights are integers greater than 0, choose a uniform random integer in [1, max] and if that number is weights[i] or less, return i, or go to step 1 otherwise.)
Each time the weighted choice algorithm chooses an index, set the weight for the chosen index to 0 to keep it from being chosen again. Or...
Assign each index an exponentially distributed random number (with a rate equal to that index's weight), make a list of pairs assigning each number to an index, then sort that list by those numbers. Then take each item from first to last, in ascending order. This sorting can be done on-line using a priority queue data structure (a technique that leads to weighted reservoir sampling). Notice that the naïve way to generate the random number, -ln(1-RNDU01())/weight, where RNDU01() is a uniform random number in [0, 1], is not robust, however ("Index of Non-Uniform Distributions", under "Exponential distribution").
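A minimal C++ sketch of that exponential-keys idea (function and variable names are mine; using the standard library's exponential_distribution sidesteps the robustness caveat about computing -ln(1-RNDU01())/weight by hand):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <random>
#include <utility>
#include <vector>

// Weighted sampling without replacement: give index i an Exp(weights[i])
// key and take the n smallest keys, in ascending order.
std::vector<std::size_t> sampleWithoutReplacement(
    const std::vector<double>& weights, std::size_t n, std::mt19937& rng)
{
    std::vector<std::pair<double, std::size_t>> keyed;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (weights[i] <= 0) continue;  // zero weight: never chosen
        std::exponential_distribution<double> expDist(weights[i]);
        keyed.push_back({expDist(rng), i});
    }
    n = std::min(n, keyed.size());
    std::partial_sort(keyed.begin(), keyed.begin() + n, keyed.end());
    std::vector<std::size_t> picked;
    for (std::size_t k = 0; k < n; ++k)
        picked.push_back(keyed[k].second);
    return picked;  // indices in selection order
}

int main()
{
    std::mt19937 rng(12345);
    std::vector<double> w = {1.0, 5.0, 2.0, 0.5};
    for (std::size_t i : sampleWithoutReplacement(w, 2, rng))
        std::cout << i << '\n';
}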
Tim Vieira gives additional options in his blog.
A paper by Bram van de Klundert compares various algorithms.
EDIT (Aug. 19): Note that for these solutions, the weight expresses how likely a given item will appear first in the sample. This weight is not necessarily the chance that a given sample of n items will include that item (that is, an inclusion probability). The methods given above will not necessarily ensure that a given item will appear in a random sample with probability proportional to its weight; for that, see "Algorithms of sampling with equal or unequal probabilities".
Assuming you want to choose items at random with replacement, here is pseudocode implementing this kind of choice. Given a list of weights, it returns a random index (starting at 0), chosen with a probability proportional to its weight. This algorithm is a straightforward way to implement weighted choice. But if it's too slow for you, see my section "Weighted Choice With Replacement" for a survey of other algorithms.
METHOD WChoose(weights, value)
    // Choose the index according to the given value
    lastItem = size(weights) - 1
    runningValue = 0
    for i in 0...size(weights) - 1
        if weights[i] > 0
            newValue = runningValue + weights[i]
            lastItem = i
            // NOTE: Includes start, excludes end
            if value < newValue: break
            runningValue = newValue
        end
    end
    // If we didn't break above, this is a last
    // resort (might happen because rounding
    // error happened somehow)
    return lastItem
END METHOD

METHOD WeightedChoice(weights)
    return WChoose(weights, RNDINTEXC(Sum(weights)))
END METHOD
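In C++ specifically, the standard library already provides weighted choice with replacement: std::discrete_distribution draws indices with probability proportional to the given weights. A small illustration (not part of the pseudocode above):

#include <iostream>
#include <random>
#include <vector>

int main()
{
    std::vector<double> weights = {0.5, 3.0, 1.5};
    std::mt19937 rng(42);
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    for (int k = 0; k < 5; ++k)
        std::cout << pick(rng) << '\n';  // each index drawn in proportion to its weight
}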
Let A be the item array with x items. The complexity of each method is given as
< preprocessing_time, querying_time >
If sorting is possible: < O(x lg x), O(n) >
sort A by the weight of the items.
create an array B, for example:
B = [ 0, 0, 0, x/2, x/2, x/2, x/2, x/2 ].
it is easy to see that B gives a higher probability of choosing index x/2.
if you haven't picked n elements yet, choose a random element e from B.
pick a random element from A within the interval e : x-1.
If iterating through the items is possible: < O(x), O(tn) >
iterate through A and find the average weight w of the elements.
define the maximum number of tries t.
try (at most t times) to pick a random element of A whose weight is bigger than w.
test for some t that gives you good/satisfactory results.
If nothing above is possible: < O(1), O(tn) >
define the maximum number of tries t.
if you haven't picked n elements yet, take t random elements in A.
pick the element with biggest value.
test for some t that gives you good/satisfactory results.
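As an illustration, here is a small C++ sketch of the second strategy above (average weight, at most t tries). All names are mine; this is a sketch under those assumptions, not the answer's own code:

#include <cstddef>
#include <iostream>
#include <optional>
#include <random>
#include <vector>

struct Item { int value; double weight; };

// Make at most t attempts to draw an element whose weight beats the average.
std::optional<Item> pickAboveAverage(const std::vector<Item>& a, int t,
                                     std::mt19937& rng)
{
    double avg = 0;
    for (const Item& it : a) avg += it.weight;
    avg /= a.size();  // the average weight w
    std::uniform_int_distribution<std::size_t> idx(0, a.size() - 1);
    for (int k = 0; k < t; ++k) {  // at most t tries
        const Item& cand = a[idx(rng)];
        if (cand.weight > avg) return cand;
    }
    return std::nullopt;  // give up; caller can fall back or retry
}

int main()
{
    std::mt19937 rng(7);
    std::vector<Item> items = {{10, 1.0}, {20, 4.0}, {30, 2.0}};
    if (auto p = pickAboveAverage(items, 5, rng))
        std::cout << p->value << '\n';
}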
For example:
array[] = {3, 9, 10, 12, 1, 4, 7, 2, 6, 5}
First, I need the maximum value, 12; then I need the maximum value among the rest of the array (1, 4, 7, 2, 6, 5), so value = 7; then the maximum value of the rest of the array, 6; then 5. After that, I need the series of these values. This gives back (12, 7, 6, 5).
How to get these numbers?
I have tried the following code, but it seems to require endless nesting.
I think I'll need a recursive function but how can I do this?
max = 0; max2 = 0; ...
for (i = 0; i < array_length; i++) {
    if (matrix[i] >= max)
        max = matrix[i];
    else {
        for (j = i; j < array_length; j++) {
            if (matrix[j] >= max2)
                max2 = matrix[j];
            else {
                ...
                ... for if else for if else
                ... ??
            }
        }
    }
}
This is how you would do that in C++11 by using the std::max_element() standard algorithm:
#include <vector>
#include <algorithm>
#include <iterator>
#include <iostream>

int main()
{
    int arr[] = {3, 5, 4, 12, 1, 4, 7, 2, 6, 5};
    auto m = std::begin(arr);
    while (m != std::end(arr))
    {
        m = std::max_element(m, std::end(arr));
        std::cout << *(m++) << std::endl;
    }
}
This is an excellent spot to use the Cartesian tree data structure. A Cartesian tree is a data structure built out of a sequence of elements with these properties:
The Cartesian tree is a binary tree.
The Cartesian tree obeys the heap property: every node in the Cartesian tree is greater than or equal to all its descendants.
An inorder traversal of a Cartesian tree gives back the original sequence.
For example, given the sequence
4 1 0 3 2
The Cartesian tree would be
4
 \
  3
 / \
1   2
 \
  0
Notice that this obeys the heap property, and an inorder walk gives back the sequence 4 1 0 3 2, which was the original sequence.
But here's the key observation: notice that if you start at the root of this Cartesian tree and start walking down to the right, you get back the number 4 (the biggest element in the sequence), then 3 (the biggest element in what comes after that 4), and the number 2 (the biggest element in what comes after the 3). More generally, if you create a Cartesian tree for the sequence, then start at the root and keep walking to the right, you'll get back the sequence of elements that you're looking for!
The beauty of this is that a Cartesian tree can be constructed in time Θ(n), which is very fast, and walking down the spine takes only O(n) time. Therefore, the total amount of time required to find the sequence you're looking for is Θ(n). Note that the approach of "find the largest element, then find the largest element in the subarray that appears after that, etc." would run in time Θ(n²) in the worst case if the input was sorted in descending order, so this solution is much faster.
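For concreteness, a (max-)Cartesian tree can be built with the classic monotonic-stack construction, after which the stack itself is the right spine. A hedged C++ sketch, not from the original answer, using the question's array:

#include <iostream>
#include <vector>

struct Node { int value; Node* left = nullptr; Node* right = nullptr; };

int main()
{
    std::vector<int> a = {3, 9, 10, 12, 1, 4, 7, 2, 6, 5};
    std::vector<Node> nodes(a.size());
    std::vector<Node*> stack;  // the right spine built so far
    for (std::size_t i = 0; i < a.size(); ++i) {
        nodes[i].value = a[i];
        Node* last = nullptr;
        while (!stack.empty() && stack.back()->value < a[i]) {
            last = stack.back();  // popped nodes become the
            stack.pop_back();     // new node's left subtree
        }
        nodes[i].left = last;
        if (!stack.empty()) stack.back()->right = &nodes[i];
        stack.push_back(&nodes[i]);
    }
    // stack.front() is the root; walking right prints 12 7 6 5
    for (Node* n = stack.front(); n != nullptr; n = n->right)
        std::cout << n->value << ' ';
    std::cout << '\n';
}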
Hope this helps!
If you can modify the array, your code will become simpler. Whenever you find a max, output that and change its value inside the original array to some small number, for example -MAXINT. Once you have output the number of elements in the array, you can stop your iterations.
std::vector<int> output;
for (auto i : array)
{
    // Scan backwards for the first element greater than i,
    // drop everything after it, then append i.
    auto pos = std::find_if(output.rbegin(), output.rend(),
                            [i](int n) { return n > i; }).base();
    output.erase(pos, output.end());
    output.push_back(i);
}
Hopefully you can understand that code. I'm much better at writing algorithms in C++ than describing them in English, but here's an attempt.
Before we start scanning, output is empty. This is the correct state for an empty input.
We start by looking at the first not-yet-examined element I of the input array. We scan backwards through the output until we find an element G which is greater than I. Then we erase starting at the position after G. If we find none, that means I is the greatest of the elements we've searched so far, so we erase the entire output. Otherwise, we erase every element after G, because I is the greatest from just after G through what we've searched so far. Then we append I to the output. Repeat until the input array is exhausted.
Note, this is a homework assignment.
I need to find the mode of an array (positive values) and secondarily return that value if its count is greater than sizeof(array)/2, i.e., the dominant value. Some arrays will have neither.
That is simple enough, but there is a constraint that the array must NOT be sorted prior to the determination, additionally, the complexity must be on the order of O(nlogn).
Using this second constraint and the master theorem, we can determine that the time complexity T(n) = A*T(n/B) + n^D requires A = B and log_B(A) = D for O(n log n) to be true. Thus, A = B = D = 2. This is also convenient, since the dominant value must be dominant in the 1st, 2nd, or both halves of an array.
Using 'T(n) = A*T(n/B) + n^D' we know that the search function will call itself twice at each level (A), divide the problem set by 2 at each level (B). I'm stuck figuring out how to make my algorithm take into account the n^2 at each level.
To make some code of this:
int search(a, b) {
    search(a, a + (b - a) / 2);
    search(a + (b - a) / 2 + 1, b);
}
The "glue" I'm missing here is how to combine these divided functions and I think that will implement the n^2 complexity. There is some trick here where the dominant must be the dominant in the 1st or 2nd half or both, not quite sure how that helps me right now with the complexity constraint.
I've written down some examples of small arrays and I've drawn out ways it would divide. I can't seem to go in the correct direction of finding one, single method that will always return the dominant value.
At level 0, the function needs to call itself to search the first half and the second half of the array. That needs to recurse and call itself. Then at each level, it needs to perform n^2 operations. So on an array [2,0,2,0,2] it would split that into a search on [2,0] and a search on [2,0,2] AND perform 25 operations. A search on [2,0] would call a search on [2] and a search on [0] AND perform 4 operations. I'm assuming these would need to be searches of the array space itself. I was planning to use C++ and something from the STL to iterate and count the values. I could create a large array and just update counts by their index.
If some number occurs more than half the time, it can be found in O(n) time and O(1) space as follows (this is essentially the Boyer-Moore majority vote algorithm):
int num = a[0], occ = 1;
for (int i = 1; i < n; i++) {
    if (a[i] == num) occ++;  // same as the candidate: vote up
    else {
        occ--;               // different: vote down
        if (occ < 0) {       // candidate voted out: replace it
            num = a[i];
            occ = 1;
        }
    }
}
Since you are not sure whether such a number occurs, all you need to do is apply the above algorithm to get a candidate first, then iterate over the whole array a second time to count that number's occurrences and check whether the count is greater than half.
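The second pass can stay in the same fragment style (a, n, and num as above); a sketch:

// Second pass: confirm the candidate actually occurs more than n/2 times.
int count = 0;
for (int i = 0; i < n; i++)
    if (a[i] == num) count++;
bool dominant = count > n / 2;  // true iff num is the dominant value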
If you want to find just the dominant mode of an array, and do it recursively, here's the pseudo-code:
def DominantMode(array):
    # if there is only one element, that's the dominant mode
    if len(array) == 1: return array[0]
    # otherwise, find the dominant mode of the left and right halves
    left = DominantMode(array[0 : len(array) // 2])
    right = DominantMode(array[len(array) // 2 : len(array)])
    # if both sides have the same dominant mode, the whole array has that mode
    if left == right: return left
    # otherwise, we have to scan the whole array to determine which one wins
    leftCount = sum(element == left for element in array)
    rightCount = sum(element == right for element in array)
    if leftCount > len(array) / 2: return left
    if rightCount > len(array) / 2: return right
    # if neither wins, just return None
    return None
The above algorithm is O(n log n) time but only O(log n) space.
If you want to find the mode of an array (not just the dominant mode), first compute the histogram. You can do this in O(n) time (visiting each element of the array exactly once) by storing the histogram in a hash table that maps the element value to its frequency.
Once the histogram has been computed, you can iterate over it (visiting each element at most once) to find the highest frequency. Once you find a frequency larger than half the size of the array, you can return immediately and ignore the rest of the histogram. Since the size of the histogram can be no larger than the size of the original array, this step is also O(n) time (and O(n) space).
Since both steps are O(n) time, the resulting algorithmic complexity is O(n) time.
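A compact C++ rendering of the histogram approach (container choice and names are mine):

#include <iostream>
#include <unordered_map>
#include <vector>

// O(n) mode via a hash-table histogram; reports the dominant value if any.
int main()
{
    std::vector<int> a = {2, 0, 2, 0, 2};
    std::unordered_map<int, int> freq;
    for (int x : a) ++freq[x];  // build the histogram in O(n)
    for (const auto& [value, count] : freq)
        if (count > static_cast<int>(a.size()) / 2) {  // dominant: more than half
            std::cout << "dominant value: " << value << '\n';
            return 0;
        }
    std::cout << "no dominant value\n";
}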
Given an array of int, each int appears exactly TWICE in the array. Find and return the int such that this pair of ints has the max distance between each other in this array.
e.g. [2, 1, 1, 3, 2, 3]
2: d = 5-1 = 4;
1: d = 3-2 = 1;
3: d = 6-4 = 2;
return 2
My ideas:
Use a hashmap: the key is a[i] and the value is its index. Scan a[] and put each number into the hash. If a number is hit a second time, subtract the stored index from the current index and use the result to update that element's value in the hash.
After that, scan the hash and return the key with the largest value (distance).
This is O(n) in time and space.
How to do it in O(n) time and O(1) space ?
Since you want the maximal distance, I assume the number you are searching for is more likely to be near the start and the end. This is why I would loop over the array from the start and the end at the same time.
[2, 1, 1, 3, 2, 3]
Check if 2 == 3?
Store a map of numbers and position: [2 => 1, 3 => 6]
Check if 1 or 2 is in [2 => 1, 3 => 6] ?
I know that is not even pseudo-code and not complete, but it should give the idea.
Set the iLeft index to the first element and the iRight index to the second element.
Increment the iRight index until you find a copy of the left item or reach the end of the array. In the first case, remember the distance.
Increment iLeft. Start searching from the new iRight.
The start value of iRight never decreases.
Delphi code:
iLeft := 0;
iRight := 1;
MaxDistance := 0; // added: defensive initialization before first use
while iRight < Len do begin // Len = array size
  while (iRight < Len) and (A[iRight] <> A[iLeft]) do
    Inc(iRight); // iRight++
  if iRight < Len then begin
    BestNumber := A[iLeft];
    MaxDistance := iRight - iLeft;
  end;
  Inc(iLeft); // iLeft++
  iRight := iLeft + MaxDistance;
end;
This algorithm is O(1) space (with some cheating), O(n) time (average), needs the source array to be non-const and destroys it at the end. Also it limits possible values in the array (three bits of each value should be reserved for the algorithm).
Half of the answer is already in the question. Use a hashmap. If a number is hit twice, use the index difference, update the best-so-far result, and remove this number from the hashmap to free space. To make it O(1) space, just reuse the source array: convert the array to a hashmap in-place.
Before turning an array element into a hashmap cell, remember its value and position. After this it may be safely overwritten. Then use this value to calculate a new position in the hashmap and overwrite it. Elements are shuffled this way until an empty cell is found. To continue, select any element that is not already reordered. When everything is reordered, every int pair has definitely been hit twice; at this point we have an empty hashmap and an updated best-result value.
One reserved bit is used while converting array elements to hashmap cells. At the beginning it is cleared. When a value is reordered into a hashmap cell, this bit is set. If this bit is not set for the overwritten element, that element is simply taken to be processed next. If this bit is set for the element to be overwritten, there is a conflict; pick the first unused element (with this bit not set) and overwrite it instead.
2 more reserved bits are used to chain conflicting values. They encode positions where the chain is started/ended/continued. (It may be possible to optimize this algorithm so that only 2 reserved bits are needed...)
A hashmap cell should contain these 3 reserved bits, the original value's index, and some information to uniquely identify the element. To make this possible, the hash function should be reversible, so that part of the value may be restored given its position in the table. In the simplest case, the hash function is just the ceil(log(n)) least significant bits. The value in the table consists of 3 fields:
3 reserved bits
32 - 3 - (ceil(log(n))) high-order bits from the original value
ceil(log(n)) bits for element's position in the original array
Time complexity is O(n) only on average; worst case complexity is O(n^2).
Other variant of this algorithm is to transform the array to hashmap sequentially: on each step m having 2^m first elements of the array converted to hashmap. Some constant-sized array may be interleaved with the hashmap to improve performance when m is low. When m is high, there should be enough int pairs, which are already processed, and do not need space anymore.
There is no way to do this in O(n) time and O(1) space.
I was asked this question in an interview. Consider the scenario of punched cards, where each punched card has a 64-bit pattern. It was suggested that I treat each card as an int, since each int is a collection of bits.
Also consider that I have an array which already contains 1000 such cards. I have to generate a new element each time which is different from the previous 1000 cards. The integers (aka cards) in the array are not necessarily sorted.
Moreover, this was asked for C++: where does the 64-bit int come from, and how can I generate this new card from the array such that the generated element is different from all the elements already present in the array?
There are 2^64 64-bit integers, a number so much larger than 1000 that the simplest solution would be to just generate a random 64-bit number, and then verify that it isn't in the table of already generated numbers. (The probability that it is there is infinitesimal, but you might as well be sure.)
Since most random number generators do not generate 64-bit values, you are left with either writing your own, or (much simpler) combining the values, say by generating 8 random bytes and memcpying them into a uint64_t.
As for verifying that the number isn't already present, std::find is just fine for one or two new numbers; if you have to do a lot of lookups, sorting the table and using a binary search would be worthwhile. Or some sort of a hash table.
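A small C++11 sketch of that recipe. Note that std::mt19937_64 already emits full 64-bit values, which removes the byte-splicing step (that substitution is mine):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Draw random 64-bit values until one is not already in the card table.
int main()
{
    std::vector<std::uint64_t> cards = {1, 2, 3};  // stand-in for the 1000 cards
    std::mt19937_64 rng(std::random_device{}());
    std::uint64_t candidate;
    do {
        candidate = rng();
    } while (std::find(cards.begin(), cards.end(), candidate) != cards.end());
    std::cout << candidate << '\n';
}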
I may be missing something, but most of the other answers appear to me as overly complicated.
Just sort the original array and then start counting from zero: if the current count is in the array, skip it; otherwise you have your next number. This algorithm is O(n), where n is the number of newly generated numbers: both sorting the fixed-size array and skipping existing numbers are constant relative to n. Here's an example:
#include <algorithm>
#include <iostream>

unsigned array[] = { 98, 1, 24, 66, 20, 70, 6, 33, 5, 41 };
unsigned count = 0;
unsigned index = 0;

int main() {
    std::sort(array, array + 10);
    while ( count < 100 ) {
        if ( index < 10 && count > array[index] )  // bounds check added
            ++index;
        else {
            if ( index >= 10 || count < array[index] )
                std::cout << count << std::endl;
            ++count;
        }
    }
}
Here's an O(n) algorithm:
int64 generateNewValue(list_of_cards)
{
    return find_max(list_of_cards) + 1;
}
Note: As #amit points out below, this will fail if INT64_MAX is already in the list.
As far as I'm aware, this is the only way you're going to get O(n). If you want to deal with that (fairly important) edge case, then you're going to have to do some kind of proper sort or search, which will take you to O(n log n).
@arne is almost there. What you need is a self-balancing interval tree, which can be built in O(n lg n) time.
Then take the top node, which will store some interval [i, j]. By the properties of an interval tree, both i-1 and j+1 are valid candidates for a new key, unless i = UINT64_MIN or j = UINT64_MAX. If both are true, then you've stored 2^64 elements and you can't possibly generate a new element. Store the new element, which takes O(lg n) worst-case time.
I.e.: init takes O(n lg n), generate takes O(lg n). Both are worst-case figures. The greatest thing about this approach is that the top node will keep "growing" (storing larger intervals) and merging with its successor or predecessor, so the tree will actually shrink in terms of memory use and eventually the time per operation decays to O(1). You also won't waste any numbers, so you can keep generating until you've got 2^64 of them.
This algorithm has O(N lg N) initialisation, O(1) query and O(N) memory usage. I assume you have some integer type which I will refer to as int64 and that it can represent the integers [0, int64_max].
Sort the numbers
Create a linked list containing intervals [u, v]
Insert [1, first number - 1]
For each of the remaining numbers, insert [prev number + 1, current number - 1]
Insert [last number + 1, int64_max]
You now have a list representing the numbers which are not used. You can simply iterate over them to generate new numbers.
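A minimal C++ sketch of that construction (starting from 0 rather than 1, and assuming the maximum value itself is unused, just to keep the sketch short):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <list>
#include <utility>
#include <vector>

// Sort the used numbers, then record each gap between consecutive ones
// as an interval [lo, hi] of free values.
int main()
{
    std::vector<std::uint64_t> used = {5, 1, 9};
    std::sort(used.begin(), used.end());
    std::list<std::pair<std::uint64_t, std::uint64_t>> freeIntervals;
    std::uint64_t next = 0;  // start of the next candidate gap
    for (std::uint64_t u : used) {
        if (u > next) freeIntervals.push_back({next, u - 1});
        next = u + 1;
    }
    freeIntervals.push_back({next, std::numeric_limits<std::uint64_t>::max()});
    for (auto [lo, hi] : freeIntervals)
        std::cout << '[' << lo << ", " << hi << "]\n";  // [0,0] [2,4] [6,8] [10,max]
}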
I think the way to go is to use some kind of hashing. So you store your cards in buckets based on, let's say, a MOD operation. Until you create some sort of indexing, you are stuck with looping over the whole array.
If you have a look at the HashSet implementation in Java, you might get a clue.
Edit: I assumed you wanted them to be random numbers; if you don't mind a sequence, the MAX+1 solution below is good :)
You could build a binary tree of the already existing elements and traverse it until you find a node whose depth is not 64 and which has fewer than two child nodes. You can then construct a "missing" child node and have a new element. This should be fairly quick, on the order of O(n) if I'm not mistaken.
bool seen[1001] = { false };
for each element of the original array
    if the element is in the range 0..1000
        seen[element] = true
find the index of the first false value in seen
Initialization:
Don't sort the list.
Create a new array 1000 long containing 0..999.
Iterate the list and, if any number is in the range 0..999, invalidate it in the new array by replacing the value in the new array with the value of the first item in the list.
Insertion:
Use an incrementing index into the new array. If the value in the new array at this index is not the value of the first element in the list, add it to the list; otherwise check the value at the next position in the new array.
When the new array is used up, refill it using 1000..1999 and invalidating existing values as above. Yes, this is looping over the list, but it doesn't have to be done for each insertion.
Near O(1) until the list gets so large that occasionally iterating it to invalidate the 'new' new array becomes significant. Maybe you could mitigate this by using a new array that grows, maybe always the size of the list?
Rgds,
Martin
Put them all into a hash table of size > 1000, and find an empty cell (this is the parking problem). Generate a key for that cell. This will of course work better for a bigger table size. The table needs only 1-bit entries.
EDIT: this is the pigeonhole principle.
This needs "modulo tablesize" (or some other "semi-invertible" function) as the hash function.
unsigned hashtab[1001] = {0,};
unsigned long long numbers[1000] = { ... };

void init(void)
{
    unsigned idx;
    for (idx = 0; idx < 1000; idx++) {
        hashtab[ numbers[idx] % 1001 ] += 1;
    }
}

unsigned long long generate(void)
{
    unsigned idx;
    for (idx = 0; idx < 1001; idx++) {
        if ( !hashtab[idx] ) break;
    }
    return idx + (unsigned long long)rand() * 1001; /* cast avoids int overflow */
}
Based on the solution here: question on array and number
Since there are 1000 numbers, if we consider their remainders mod 1001, at least one remainder will be missing. We can pick that as our missing number.
So we maintain an array of counts C[1001], where C[r] holds the number of integers with remainder r (upon dividing by 1001).
We also maintain a set of numbers for which C[j] is 0 (say using a linked list).
When we move the window over, we decrement the count of the first element (say with remainder i), i.e., decrement C[i]. If C[i] becomes zero, we add i to the set of free remainders. We update the C array with each new number we add.
If we need one number, we just pick a random element from the set of j for which C[j] is 0.
This is O(1) for new numbers and O(n) initially.
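A compact C++ sketch of the initialization and first pick (illustrative names; a remainder r with count zero means r itself, and r + k*1001, is unused):

#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::uint64_t> numbers = {0, 1, 2, 1001};  // stand-in for the cards
    std::vector<int> C(1001, 0);
    for (auto x : numbers) ++C[x % 1001];  // O(n) initialization
    for (int r = 0; r < 1001; ++r)
        if (C[r] == 0) {             // missing remainder
            std::cout << r << '\n';  // r is a valid new number
            break;
        }
}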
This is similar to other solutions but not quite.
How about something simple like this:
1) Partition the array into numbers at or below 1000, and numbers above 1000
2) If all the numbers fit within the lower partition then choose 1001 (or any number greater than 1000) and we're done.
3) Otherwise we know that there must exist a number between 1 and 1000 that doesn't exist within the lower partition.
4) Create a 1000 element array of bools, or a 1000-element long bitfield, or whatnot and initialize the array to all 0's
5) For each integer in the lower partition, use its value as an index into the array/bitfield and set the corresponding bool to true (ie: do a radix sort)
6) Go over the array/bitfield and pick any unset value's index as the solution
This works in O(n) time; or, since we've bounded everything by 1000, technically it's O(1), but it's O(n) time and space in general. There are three passes over the data, which isn't necessarily the most elegant approach, but the complexity remains O(n).
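A minimal C++ rendering of steps 1-6 above (container and variable names are mine):

#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::uint64_t> cards = {3, 7, 2000};  // the existing cards
    std::vector<bool> seen(1000, false);  // step 4: 1000-element bitfield for 1..1000
    bool anyAbove = false;
    for (auto c : cards) {  // steps 1 and 5 (radix-style marking)
        if (c >= 1 && c <= 1000) seen[static_cast<std::size_t>(c) - 1] = true;
        else anyAbove = true;
    }
    if (!anyAbove) {  // step 2: everything fits below, so 1001 is free
        std::cout << 1001 << '\n';
        return 0;
    }
    for (int v = 0; v < 1000; ++v)  // step 6: any unset index works (step 3
        if (!seen[v]) {             // guarantees one exists)
            std::cout << (v + 1) << '\n';
            return 0;
        }
}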
You can create a new array with the numbers that are not in the original array, then just pick one from this new array.
O(1)?