Hashing 100 distinct values of range 1 billion - c++

I was recently asked this question in an interview. I have an array of n elements. The array has only 100 distinct values. I need to print the count of occurrences of each number.
1<=n<=10^6
1<=A[i]<=10^12
Expected space complexity was O(k) where k is the number of distinct values in the array.
For example, 1 2 3 2 1 4 3 2 4 2 3 1 2 ; here k is 4.
First I suggested using maps from the STL, but he wanted me to implement my own data structure. Then I suggested doing a sorted insert for each element, like in a binary search tree, but that would give a time complexity of O(n log n). He wanted an O(n) solution. I tried to think of a suitable hash function but could not come up with one. I also thought of a trie data structure, but again I would have to scan each digit of each number, again giving O(n log n) complexity. What could be a possible approach to solve this?

A hash table won't guarantee a theoretical worst-case bound of O(n), but it's quite easy to make one that achieves it in expectation.
First, we need to make some assumption about the probability distribution of the values - let it be uniform (or else we need some specialized hash function).
Next, let's choose the hash table size, say, 201 entries (so it will be less than 50% full).
Next, let the hash function be just hash(A[i]) = A[i] mod 201.
Then use an open-addressing hash table H[] with 201 entries, each holding a pair: the value A[i] (or NULL if the slot is empty) and its frequency count.
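For illustration, here is a minimal C++ sketch of such a table (the names and layout are my own, not the answerer's; it assumes linear probing and the uniform-distribution caveat above):

#include <cstdint>
#include <cstdio>
#include <vector>

const int TABLE_SIZE = 201;                    // stays under 50% load for k <= 100

struct Entry { uint64_t value; long long count; bool used; };

std::vector<Entry> H(TABLE_SIZE, Entry{0, 0, false});

void add(uint64_t x) {
    int idx = (int)(x % TABLE_SIZE);           // hash(A[i]) = A[i] mod 201
    while (H[idx].used && H[idx].value != x)
        idx = (idx + 1) % TABLE_SIZE;          // linear probing on collision
    H[idx].used = true;
    H[idx].value = x;
    ++H[idx].count;
}

void print_counts() {
    for (const Entry& e : H)
        if (e.used)
            std::printf("%llu: %lld\n", (unsigned long long)e.value, e.count);
}

Feed every A[i] through add() and then call print_counts(); each add() is expected O(1) because the table never gets more than half full.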

I think that a hash table is a good solution for this, but I imagine the interviewer was expecting you to build your own hash table.
Here's a solution I came up with in Python. I'm using mod 100 as my hash function and separate chaining to deal with collisions.
import random

N = random.randint(1, 10**6)
K = 100
HASH_TABLE_SIZE = 100

distinct = [random.randint(1, 10**12) for _ in range(K)]
numbers = [random.choice(distinct) for _ in range(N)]

hash_table = [[] for _ in range(HASH_TABLE_SIZE)]

def hash(n):
    hash_key = n % HASH_TABLE_SIZE
    bucket = hash_table[hash_key]
    for value in bucket:
        if value[0] == n:
            value[1] += 1
            return
    bucket.append([n, 1])

for number in numbers:
    hash(number)

for bucket in hash_table:
    for value in bucket:
        print('{}: {}'.format(*value))
EDIT
Explaining the code a bit:
My hash table is a 100-element array. Each entry in the array is a list of (number, count) entries. To hash a number, I take its value modulo 100 to find an index into the array. I scan the numbers already in that bucket, and if any of them match the current number, I increment its count. If I don't find the number, I append a new entry to the list with the number and an initial count of 1.
Visually, the array looks sort of like this:
[
  [ [0, 3], [34500, 1] ],
  [ [101, 1] ],
  [],
  [ [1502, 1] ],
  ...
]
Note that at index n, each value stored in the bucket equals n (mod 100). On average, there will be only one value per bucket, since there are at most 100 distinct values and 100 buckets in the table.
To print out the final counts, all that's required is to walk through the array and each entry in each bucket and print them out.
EDIT 2
Here's a slightly different implementation that uses open addressing with linear probing instead. I think I actually prefer this approach.
hash_table = [None] * HASH_TABLE_SIZE

def hash(n):
    hash_key = n % HASH_TABLE_SIZE
    while hash_table[hash_key] is not None and hash_table[hash_key][0] != n:
        hash_key = (hash_key + 1) % HASH_TABLE_SIZE
    if hash_table[hash_key] is None:
        hash_table[hash_key] = [n, 1]
    else:
        hash_table[hash_key][1] += 1

for number in numbers:
    hash(number)

for entry in hash_table:
    if entry is not None:  # skip empty slots when there are fewer than 100 distinct values
        print('{}: {}'.format(*entry))
NOTE: This code will fail if there are actually more than 100 distinct numbers. (It will hang forever trying to find an open spot in the array.) It would be nice to detect that condition (e.g. once you've walked an entire lap of the array) and raise an exception.

Actually, you're wrong: a trie would give you O(N) complexity.
One insert/find/erase operation on a trie requires O(L) time, where L is the length of the keys pushed into the trie. Fortunately, you only insert numbers no larger than 10^12, which means that L is no larger than log(10^12) (the logarithm base depends on the radix you use in the trie; I would personally select 256 or 65536, depending on what part the structure plays in the whole system).
Summing up, you will need O(N) * O(log(10^12)), which is equal to O(N) by the definition of O().
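To make that concrete, here is a minimal sketch of such a trie in C++ (my own illustration, using base 256: since 10^12 < 2^40, every value has at most 5 byte-sized "digits", so each insert costs a constant 5 steps; note the space is O(k*5*256) child slots, i.e. O(k) with a fairly large constant):

#include <cstdint>
#include <vector>

struct Node {
    int child[256];      // indices into the pool, -1 = no child yet
    long long count;     // occurrences of the value ending at this node
    Node() : count(0) { for (int& c : child) c = -1; }
};

std::vector<Node> pool(1);                   // pool[0] is the root

void insert(uint64_t v) {                    // v <= 10^12 < 2^40
    int cur = 0;
    for (int level = 4; level >= 0; --level) {          // exactly 5 steps per value
        int digit = (int)((v >> (8 * level)) & 0xFF);
        if (pool[cur].child[digit] == -1) {
            pool[cur].child[digit] = (int)pool.size();
            pool.push_back(Node());
        }
        cur = pool[cur].child[digit];
    }
    ++pool[cur].count;                       // the leaf holds the frequency
}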

Related

Big O notation for duplicate function, C++

What is the Big O notation for the function described in the screenshot?
It would take O(n) to go through all the numbers, but once it finds the numbers and removes them, what would that be? Would the removed part be a constant a? And would the function then have to iterate through the numbers again?
This is what I am thinking for Big O:
T(n) = n + a + (n - a), or something involving having to iterate through (n - a) steps after the first duplicate is found. Would big O then be O(n)?
Big O notation considers the worst case. Let's say we need to remove all duplicates from the array A = [1..n]. The algorithm will start with the first element and check every remaining element - there are n-1 of them. Since all values happen to be different, it won't remove any from the array.
Next, the algorithm selects the second element and checks the remaining n-2 elements in the array. And so on.
When the algorithm arrives at the final element it is done. The total number of comparisons is the sum (n-1) + (n-2) + ... + 2 + 1 + 0. This sum equals (n-1)*n/2, whose dominating term is n^2, so the algorithm is O(n^2).
This algorithm is O(n^2). Because for each element in the array you are iterating over the array and counting the occurrences of that element.
foreach item in array
    count = 0
    foreach other in array
        if item == other
            count += 1
    if count > 1
        remove item
As you see there are two nested loops in this algorithm which results in O(n*n).
Removed items don't affect the worst case. Consider an array containing only unique elements: no element is ever removed from it.
Note: A naive implementation of this algorithm could result in O(n^3) complexity.
Starting with the first element, you go through all remaining elements in the vector; that's n-1 comparisons. You do that for each of the n elements, which gives n*(n-1)/2 in the worst case. The best case is n comparisons (e.g. when all elements are equal, say all 4s).

most efficient algorithm to get union of 2 ordered lists

I need to find the union of 2 descending ordered lists (list1 and list2), where the union would be each element from both lists without duplicates. Assume the list elements are integers. I am using big O notation to determine the most efficient algorithm to solve this problem. I know the big O notation for the 1st, but I do not know the big O notation for the 2nd. Can someone tell me the big O notation of the 2nd algorithm so I can decide which algorithm to implement? If someone knows a better algorithm than one of these, could you help me understand that as well? Thanks in advance.
Here are my two algorithms...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Algorithm #1: O(N * log base2 N)
Starting at the first element of list1,
while(list1 is not at the end of the list) {
if(the current element in list1 is not in list2) // Binary Search -> O(log base2 N)
add the current element in list1 to list2
go to the next element in list1 }
list2 is now the union of the 2 lists
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Algorithm #2: O(?)
Starting at the first elements of each list,
LOOP_START:
compare the current elements of the lists
whichever element is greater, put into a 3rd list called list3
go to the next element in the list whose element was just inserted into list3
branch to LOOP_START until either list1 or list2 are at the end of their respective list
insert the remaining elements from either list1 or list2 into list3 (the union)
list3 now contains the union of list1 and list2
Here's my assessment of the situation
Your first algorithm runs in n log n time: you are doing the binary search for every element in the first list, right?
Your second algorithm is not entirely complete: you don't say what to do if the elements in the two lists are equal. However, given the right logic for dealing with equal elements, your second algorithm is like the merge step of merge sort: it will run in linear time (i.e. N). It is optimal, in the sense that you cannot do better: you cannot merge two ordered lists without looking at every element in both lists at least once.
The second is O(n+m) while the first is O(n log(m) + m). Thus the second is significantly better.
With the following algorithm you can have the two lists merged in O(n+m).
[Sorry, I have used python for simplicity, but the algorithm is the same in every language]
Note that the algorithm also maintains the items sorted in the result list.
def merge(list1, list2):
    result = []
    i1 = 0
    i2 = 0
    # iterate over the two lists
    while i1 < len(list1) and i2 < len(list2):
        # if the current items are equal, add just one and go to the next two items
        if list1[i1] == list2[i2]:
            result.append(list1[i1])
            i1 += 1
            i2 += 1
        # if the item of list1 is greater than the item of list2, add it and go to the next item of list1
        elif list1[i1] > list2[i2]:
            result.append(list1[i1])
            i1 += 1
        # if the item of list2 is greater than the item of list1, add it and go to the next item of list2
        else:
            result.append(list2[i2])
            i2 += 1
    # add the remaining items of list1
    while i1 < len(list1):
        result.append(list1[i1])
        i1 += 1
    # add the remaining items of list2
    while i2 < len(list2):
        result.append(list2[i2])
        i2 += 1
    return result

print(merge([10, 8, 5, 1], [12, 11, 7, 5, 2]))
Output:
[12, 11, 10, 8, 7, 5, 2, 1]
Complexity Analysis:
Say the length of list 1 is N and that of list 2 is M.
Algorithm 1:
At the risk of sounding incredible, I would argue that the complexity of this algorithm, as stated, is N * M and not N log M.
For each element in list 1 (O(N)), we search for it in list 2 (O(log M)). The complexity of this algorithm therefore 'seems' to be O(N log M).
However, we are also inserting the element in list 2. This new element should be inserted in proper place so that the list 2 remains sorted for further binary search operations. If we are using array as the data structure, then the insertion would take O(M) time.
Hence the order of complexity is O(N*M) for the algorithm as is.
A modification can be done, wherein the new element is inserted at the end of the list 2 (the list is then no more ordered) and we carry out the binary search operation from index 0 to M-1 rather than the new size-1. In this case the complexity shall be O(N*logM) since we shall carry out N binary searches in the list of length M.
To make the list ordered again, we will have to merge the two ordered parts (0 to M-1 and M to newSize-1). This can be done in O(N+M) time (one merge operation in merge sort of array length N+M). Hence the net time complexity of this algorithm shall be
O(NlogM + N + M)
Space complexity is O(max(N,M)) not considering the original lists space and only considering the extra space required in list 2.
Algorithm 2:
At each iteration, we move at least one pointer forward. The total distance travelled by both pointers is N + M. Hence the order of time complexity in the worst case is O(N+M), which is better than the 1st algorithm.
However, the space complexity required in this case is larger (O(N+M)).
Here is another approach:
Iterate through both lists, and insert all the values into a set.
This will remove all duplicates and the result will be the union of two lists.
Two important notes: you'll lose the order of the numbers. Also, it takes additional space.
Time complexity: O(n + m)
Space Complexity: O(n + m)
If you need to maintain order of the result set, use some custom version of LinkedHashMap.
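As a sketch of that idea (assuming plain integer lists; in C++ a std::unordered_set plays the role of the set):

#include <unordered_set>
#include <vector>

// O(n + m) expected time, O(n + m) extra space; the order of the input is lost.
std::vector<int> unionOf(const std::vector<int>& a, const std::vector<int>& b) {
    std::unordered_set<int> seen(a.begin(), a.end());
    seen.insert(b.begin(), b.end());
    return std::vector<int>(seen.begin(), seen.end());
}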
Actually, algorithm 2 will not work if the input lists are not sorted.
Sorting the arrays first takes O(m*lg(m) + n*lg(n)).
You can build a hash table on the first list, then for each item from the second list, you check if this item exists in the hash table. This works in O(m+n).
There are a few things that need to be specified:
Do the input lists contain duplicates?
Must the result be ordered?
I'll assume that, using std::list, you can cheaply insert at the head or at the tail.
Let's say List 1 has N elements and List 2 has M elements.
Algorithm 1
It iterates over every item of List 1 searching for it in List 2.
Assuming that there may be duplicates and that the result must be ordered, the worst-case time for the search is that no element in List 1 exists in List 2, hence it's at least:
O(N × M).
To insert the item of List 1 in the right place, you need to iterate List 2 again until the point of insertion. The worst case is when every item in List 1 is smaller (if List 2 is searched from the beginning) or greater (if List 2 is searched from the end). Since the previous items of List 1 have been inserted into List 2, there are M iterations for the first item, M + 1 for the second, M + 2 for the third, etc., and M + N - 1 iterations for the last item, for an average of M + (N - 1) / 2 per item.
Something like:
N × (M + (N - 1) / 2)
For big-O notation, constant factors don't matter, so:
N × (M + (N - 1))
For big-O notation, non-variable additions don't matter, so:
O(N × (M + N))
Adding to the original O(N × M):
O(N × M) + O(N × (M + N))
O(N × M) + O(N × M + N²)
The second equation is just to make the constant factor elimination evident, e.g. 2 × (N × M), thus:
O(N × (M + N))
O(N² + N × M)
These two are equivalent, which ever you like the most.
Possible optimizations:
If the result doesn't have to be ordered, insertion can be O(1), hence the worst-case time is:
O(N × M)
Don't just test each List 1 item against List 2 by equality; also test by e.g. greater-than, so that you can stop searching in List 2 when List 1's item is greater than List 2's item; this wouldn't reduce the worst case, but it would reduce the average case
Keep the List 2 iterator that points to where List 1's item was found to be greater than List 2's item, to make the sorted insertion O(1); on insertion make sure to keep an iterator that starts at the inserted item, because although List 1 is ordered, it might contain duplicates; with these two, the worst-case time becomes:
O(N × M)
For the next iterations, search for List 1's item in the rest of List 2 with the iterator we kept; this reduces the worst case, because if you reach the end of List 2, you'll just be "removing" duplicates from List 1; with these three, the worst-case time becomes:
O(N + M)
By this point, the only difference between this algorithm and Algorithm 2 is that List 2 is changed to contain the result, instead of creating a new list.
Algorithm 2
This is the merging of the merge sort.
You'll be walking every element of List 1 and every element of List 2 once, and insertion is always made at the head or tail of the list, hence the worst-case time is:
O(N + M)
If there are duplicates, they're simply discarded. The result is more easily made ordered than not.
Final Notes
If there are no duplicates, insertion can be optimized in both cases. For instance, with doubly-linked lists, we can easily check if the last element in List 1 is greater than the first element in List 2 or vice-versa, and simply concatenate the lists.
This can be further generalized for any tail of List 1 and List 2. For instance, in Algorithm 1, if a List 1's item is not found in List 2, we can concatenate List 2 and the tail of List 1. In Algorithm 2, this is done in the last step.
The worst case, when List 1's items and List 2's items are interleaved, is not reduced, but again the average case is reduced, and in many cases by a big factor that makes a big difference In Real Life™.
I ignored:
Allocation times
Worst-case space differences between the algorithms
Binary search, because you mentioned lists, not arrays or trees
I hope I didn't make any blatant mistake.
I implemented a TypeScript (JS) based union operation for two arrays of objects in one of my previous projects. The data was large and the default library functions like underscore or lodash were not performant enough. After some head scratching I came up with the binary-search-based algorithm below. Hope it helps someone with performance tuning.
As far as complexity is concerned, the lookups are binary-search based, so each one is O(log N); the initial sorts dominate the overall cost.
Basically the code takes two unordered object arrays and a key name to compare, and:
1) sorts the arrays
2) iterates through each element of the first array and deletes it from the second array
3) concatenates the resulting second array onto the first array.
private sortArrays = (arr1: Array<Object>, arr2: Array<Object>, propertyName: string): void => {
    function comparer(a, b) {
        if (a[propertyName] < b[propertyName])
            return -1;
        if (a[propertyName] > b[propertyName])
            return 1;
        return 0;
    }
    arr1.sort(comparer);
    arr2.sort(comparer);
}

private difference = (arr1: Array<Object>, arr2: Array<Object>, propertyName: string): Array<Object> => {
    this.sortArrays(arr1, arr2, propertyName);
    var self = this;
    for (var i = 0; i < arr1.length; i++) {
        var obj = {
            loc: 0
        };
        if (this.OptimisedBinarySearch(arr2, arr2.length, obj, arr1[i], propertyName))
            arr2.splice(obj.loc, 1);
    }
    return arr2;
}

private OptimisedBinarySearch = (arr, size, obj, val, propertyName): boolean => {
    var first, mid, last;
    var count;

    first = 0;
    last = size - 1;
    count = 0;

    if (!arr.length)
        return false;

    while (arr[first][propertyName] <= val[propertyName] && val[propertyName] <= arr[last][propertyName]) {
        mid = first + Math.floor((last - first) / 2);
        if (val[propertyName] == arr[mid][propertyName]) {
            obj.loc = mid;
            return true;
        }
        else if (val[propertyName] < arr[mid][propertyName])
            last = mid - 1;
        else
            first = mid + 1;
    }
    return false;
}

private UnionAll = (arr1, arr2, propertyName): Array<Object> => {
    return arr1.concat(this.difference(arr1, arr2, propertyName));
}

//example
var YourFirstArray = [{x:1},{x:2},{x:3}]
var YourSecondArray = [{x:0},{x:1},{x:2},{x:3},{x:4},{x:5}]
var keyName = "x";
this.UnionAll(YourFirstArray, YourSecondArray, keyName)

Find the element with the longest distance in a given array where each element appears twice?

Given an array of ints where each int appears exactly TWICE, find and return the int such that this pair of ints has the maximum distance between each other in the array.
e.g. [2, 1, 1, 3, 2, 3]
2: d = 5-1 = 4;
1: d = 3-2 = 1;
3: d = 6-4 = 2;
return 2
My ideas:
Use a hashmap: the key is a[i], and the value is its index. Scan a[], putting each number into the hash. If a number is hit a second time, subtract the stored index from the current index and use the result to update that number's value in the hash.
After that, scan hash and return the key with largest element (distance).
it is O(n) in time and space.
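A minimal sketch of that idea in C++ (assuming int values and 0-based indices):

#include <unordered_map>
#include <vector>

int elementWithMaxDistance(const std::vector<int>& a) {
    std::unordered_map<int, int> firstIndex;       // value -> index of first occurrence
    int best = -1, bestDist = -1;
    for (int i = 0; i < (int)a.size(); ++i) {
        auto it = firstIndex.find(a[i]);
        if (it == firstIndex.end())
            firstIndex[a[i]] = i;                  // first time we see this value
        else if (i - it->second > bestDist) {      // second occurrence: compute the distance
            bestDist = i - it->second;
            best = a[i];
        }
    }
    return best;                                   // returns 2 for the example above
}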
How to do it in O(n) time and O(1) space ?
You would like to have the maximal distance, so I assume the number you are searching for is more likely to be near the start and the end. This is why I would loop over the array from the start and the end at the same time.
[2, 1, 1, 3, 2, 3]
Check if 2 == 3?
Store a map of numbers and position: [2 => 1, 3 => 6]
Check if 1 or 2 is in [2 => 1, 3 => 6] ?
I know that is not even pseudocode and not complete, but it should give the idea.
Set iLeft index to the first element, iRight index to the second element.
Increment iRight index until you find a copy of the left item or meet the end of the array. In the first case - remember distance.
Increment iLeft. Start searching from new iRight.
Start value of iRight will never be decreased.
Delphi code:
iLeft := 0;
iRight := 1;
while iRight < Len do begin  //Len = array size
  while (iRight < Len) and (A[iRight] <> A[iLeft]) do
    Inc(iRight);  //iRight++
  if iRight < Len then begin
    BestNumber := A[iLeft];
    MaxDistance := iRight - iLeft;
  end;
  Inc(iLeft);  //iLeft++
  iRight := iLeft + MaxDistance;
end;
This algorithm is O(1) space (with some cheating), O(n) time (average), needs the source array to be non-const and destroys it at the end. Also it limits possible values in the array (three bits of each value should be reserved for the algorithm).
Half of the answer is already in the question. Use a hashmap. If a number is hit twice, use the index difference, update the best-so-far result and remove this number from the hashmap to free space. To make it O(1) space, just reuse the source array, converting the array to a hashmap in-place.
Before turning an array element to the hashmap cell, remember its value and position. After this it may be safely overwritten. Then use this value to calculate a new position in the hashmap and overwrite it. Elements are shuffled this way until an empty cell is found. To continue, select any element, that is not already reordered. When everything is reordered, every int pair is definitely hit twice, here we have an empty hashmap and an updated best result value.
One reserved bit is used while converting array elements to the hashmap cells. At the beginning it is cleared. When a value is reordered to the hashmap cell, this bit is set. If this bit is not set for overwritten element, this element is just taken to be processed next. If this bit is set for element to be overwritten, there is a conflict here, pick first unused element (with this bit not set) and overwrite it instead.
2 more reserved bits are used to chain conflicting values. They encode positions where the chain is started/ended/continued. (It may be possible to optimize this algorithm so that only 2 reserved bits are needed...)
A hashmap cell should contain these 3 reserved bits, original value index, and some information to uniquely identify this element. To make this possible, a hash function should be reversible so that part of the value may be restored given its position in the table. In simplest case, hash function is just ceil(log(n)) least significant bits. Value in the table consists of 3 fields:
3 reserved bits
32 - 3 - (ceil(log(n))) high-order bits from the original value
ceil(log(n)) bits for element's position in the original array
Time complexity is O(n) only on average; worst case complexity is O(n^2).
Another variant of this algorithm is to transform the array into a hashmap sequentially: at each step m, the first 2^m elements of the array have been converted to the hashmap. Some constant-sized array may be interleaved with the hashmap to improve performance when m is low. When m is high, there should be enough int pairs which are already processed and do not need space anymore.
There is no way to do this in O(n) time and O(1) space.

Generate a new element different from 1000 elements of an array

I was asked this question in an interview. Consider the scenario of punched cards, where each punched card has a 64-bit pattern. It was suggested that I treat each card as an int, since an int is a collection of bits.
Also, consider that I already have an array containing 1000 such cards. I have to generate a new element every time that is different from the previous 1000 cards. The integers (aka cards) in the array are not necessarily sorted.
Moreover, since the question was for C++: where does the 64-bit int come from, and how can I generate this new card from the array so that the element generated is different from all the elements already present in the array?
There are 2^64 64-bit integers, a number so much larger than 1000 that the simplest solution would be to just generate a random 64-bit number and then verify that it isn't in the table of already generated numbers. (The probability that it is there is infinitesimal, but you might as well be sure.)
Since most random number generators do not generate 64-bit values, you are left with either writing your own or (much simpler) combining values, say by generating 8 random bytes and memcpying them into a uint64_t.
As for verifying that the number isn't already present, std::find is just fine for one or two new numbers; if you have to do a lot of lookups, sorting the table and using a binary search would be worthwhile. Or some sort of hash table.
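As a sketch, with C++11's <random> you can get 64-bit values directly from std::mt19937_64, which sidesteps the byte-combining step:

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

uint64_t newCard(const std::vector<uint64_t>& cards) {
    static std::mt19937_64 rng{std::random_device{}()};
    uint64_t candidate;
    do {
        candidate = rng();                   // uniform over all 2^64 values
    } while (std::find(cards.begin(), cards.end(), candidate) != cards.end());
    return candidate;                        // with 1000 cards, a retry is essentially never needed
}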
I may be missing something, but most of the other answers appear to me to be overly complicated.
Just sort the original array and then start counting from zero: if the current count is in the array, skip it, otherwise you have your next number. This algorithm is O(n), where n is the number of newly generated numbers: both sorting the array and skipping existing numbers take constant time relative to n, since the original array has a fixed size. Here's an example:
#include <algorithm>
#include <iostream>

unsigned array[] = { 98, 1, 24, 66, 20, 70, 6, 33, 5, 41 };
unsigned count = 0;
unsigned index = 0;

int main() {
    std::sort(array, array + 10);
    while (count < 100) {
        if (index < 10 && count > array[index])   // don't run past the end of the array
            ++index;
        else {
            if (index >= 10 || count < array[index])
                std::cout << count << std::endl;
            ++count;
        }
    }
}
Here's an O(n) algorithm:
int64 generateNewValue(list_of_cards)
{
    return find_max(list_of_cards) + 1;
}
Note: As #amit points out below, this will fail if INT64_MAX is already in the list.
As far as I'm aware, this is the only way you're going to get O(n). If you want to deal with that (fairly important) edge case, then you're going to have to do some kind of proper sort or search, which will take you to O(n log n).
#arne is almost there. What you need is a self-balancing interval tree, which can be built in O(n lg n) time.
Then take the top node, which will store some interval [i, j]. By the properties of an interval tree, both i-1 and j+1 are valid candidates for a new key, unless i = UINT64_MIN or j = UINT64_MAX. If both are true, then you've stored 2^64 elements and you can't possibly generate a new element. Store the new element, which takes O(lg n) worst-case time.
I.e.: init takes O(n lg n), generate takes O(lg n). Both are worst-case figures. The greatest thing about this approach is that the top node will keep "growing" (storing larger intervals) and merging with its successor or predecessor, so the tree will actually shrink in terms of memory use and eventually the time per operation decays to O(1). You also won't waste any numbers, so you can keep generating until you've got 2^64 of them.
This algorithm has O(N lg N) initialisation, O(1) query and O(N) memory usage. I assume you have some integer type which I will refer to as int64 and that it can represent the integers [0, int64_max].
Sort the numbers
Create a linked list containing intervals [u, v]
Insert [1, first number - 1]
For each of the remaining numbers, insert [prev number + 1, current number - 1]
Insert [last number + 1, int64_max]
You now have a list representing the numbers which are not used. You can simply iterate over them to generate new numbers.
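A sketch of that construction (using uint64_t and a plain vector of pairs instead of a linked list, and assuming the usable values start at 0):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Build the list of unused intervals [lo, hi] from the used values.
std::vector<std::pair<uint64_t, uint64_t>> freeIntervals(std::vector<uint64_t> used) {
    std::sort(used.begin(), used.end());                  // O(N lg N)
    std::vector<std::pair<uint64_t, uint64_t>> gaps;
    uint64_t next = 0;                                    // smallest value not yet covered
    for (uint64_t v : used) {
        if (v > next)
            gaps.push_back({next, v - 1});                // gap before this used value
        next = v + 1;
    }
    if (used.empty() || used.back() != UINT64_MAX)
        gaps.push_back({next, UINT64_MAX});               // tail gap
    return gaps;                                          // walk these to generate new numbers
}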
I think the way to go is to use some kind of hashing. So you store your cards in buckets based on, let's say, a MOD operation. Until you create some sort of indexing you are stuck with looping over the whole array.
If you have a look at the HashSet implementation in Java you might get a clue.
Edit: I assume you wanted them to be random numbers; if you don't mind a sequence, the MAX+1 solution below is a good one :)
You could build a binary tree of the already existing elements and traverse it until you find a node whose depth is not 64 and which has fewer than two child nodes. You can then construct a "missing" child node and have a new element. This should be fairly quick, on the order of O(n) if I'm not mistaken.
bool seen[1001] = { false };
for each element of the original array
    if the element is in the range 0..1000
        seen[element] = true
find the index of the first false value in seen
Initialization:
Don't sort the list.
Create a new array 1000 long containing 0..999.
Iterate the list and, if any number is in the range 0..999, invalidate it in the new array by replacing the value in the new array with the value of the first item in the list.
Insertion:
Use an incrementing index to the new array. If the value in the new array at this index is not the value of the first element in the list, add it to the list, else check the value from the next position in the new array.
When the new array is used up, refill it using 1000..1999 and invalidating existing values as above. Yes, this is looping over the list, but it doesn't have to be done for each insertion.
Near O(1) until the list gets so large that occasionally iterating it to invalidate the 'new' new array becomes significant. Maybe you could mitigate this by using a new array that grows, perhaps always the size of the list?
Rgds,
Martin
Put them all into a hash table of size > 1000, and find the empty cell (this is the parking problem). Generate a key for that. This will of course work better for bigger table size. The table needs only 1-bit entries.
EDIT: this is the pigeonhole principle.
This needs "modulo tablesize" (or some other "semi-invertible" function) for a hash function.
unsigned hashtab[1001] = {0,};
unsigned long long numbers[1000] = { ... };

void init(void)
{
    unsigned idx;
    for (idx = 0; idx < 1000; idx++) {
        hashtab[ numbers[idx] % 1001 ] += 1;
    }
}

unsigned long long generate(void)
{
    unsigned idx;
    for (idx = 0; idx < 1001; idx++) {
        if (!hashtab[idx]) break;
    }
    return idx + (unsigned long long)rand() * 1001;
}
Based on the solution here: question on array and number
Since there are 1000 numbers, if we consider their remainders with 1001, at least one remainder will be missing. We can pick that as our missing number.
So we maintain an array of counts: C[1001], which will maintain the number of integers with remainder r (upon dividing by 1001) in C[r].
We also maintain a set of numbers for which C[j] is 0 (say using a linked list).
When we move the window over, we decrement the count of the first element (say remainder i), i.e. decrement C[i]. If C[i] becomes zero we add i to the set of numbers. We update the C array with the new number we add.
If we need one number, we just pick a random element from the set of j for which C[j] is 0.
This is O(1) for new numbers and O(n) initially.
This is similar to the other solutions, but not quite the same.
How about something simple like this:
1) Partition the array into numbers equal and below 1000 and above
2) If all the numbers fit within the lower partition then choose 1001 (or any number greater than 1000) and we're done.
3) Otherwise we know that there must exist a number between 1 and 1000 that doesn't exist within the lower partition.
4) Create a 1000 element array of bools, or a 1000-element long bitfield, or whatnot and initialize the array to all 0's
5) For each integer in the lower partition, use its value as an index into the array/bitfield and set the corresponding bool to true (ie: do a radix sort)
6) Go over the array/bitfield and pick any unset value's index as the solution
This works in O(n) time, or since we've bounded everything by 1000, technically it's O(1), but O(n) time and space in general. There are three passes over the data, which isn't necessarily the most elegant approach, but the complexity remains O(n).
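A sketch of those steps (my own illustration, assuming 64-bit card values; steps 1 and 4-6 collapse into one pass over a bitset here, and the pigeonhole argument makes step 2's special case unnecessary):

#include <bitset>
#include <cstdint>
#include <vector>

uint64_t unusedSmallValue(const std::vector<uint64_t>& cards) {   // cards.size() <= 1000
    std::bitset<1001> seen;                               // marks which of 0..1000 occur
    for (uint64_t c : cards)
        if (c <= 1000)
            seen[c] = true;                               // only the lower partition matters
    for (uint64_t v = 0; v <= 1000; ++v)
        if (!seen[v])
            return v;                                     // pigeonhole: some value must be free
    return 1001;                                          // unreachable with at most 1000 cards
}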
You can create a new array with the numbers that are not in the original array, then just pick one from this new array.
¿O(1)?

O(log n) algorithm to find the element having rank i in union of pre-sorted lists

Given two sorted lists, each containing n real numbers, is there an O(log n) time algorithm to compute the element of rank i (where i corresponds to the index in increasing order) in the union of the two lists, assuming the elements of the two lists are distinct?
EDIT:
#Ben: This is what I have been doing, but I am still not getting it.
I have an example:
List A : 1, 3, 5, 7
List B : 2, 4, 6, 8
Find rank(i) = 4.
First Step : i/2 = 2;
List A now contains is A: 1, 3
List B now contains is B: 2, 4
compare A[i] to B[i] i.e
A[i] is less;
So the lists now become :
A: 3
B: 2,4
Second Step:
i/2 = 1
List A now contains A:3
List B now contains B:2
Now I have LOST the value 4, which is actually the result...
I know I am missing something, but even after close to a day of thinking I can't figure this one out...
Yes:
You know the element lies within either index [0,i] of the first list or [0,i] of the second list. Take element i/2 from each list and compare. Proceed by bisection.
I'm not including any code because this problem sounds a lot like homework.
EDIT: Bisection is the method behind binary search. It works like this:
Assume i = 10; (zero-based indexing, we're looking for the 11th element overall).
On the first step, you know the answer is either in list1(0...10) or list2(0...10). Take a = list1(5) and b = list2(5).
If a > b, then there are 5 elements in list1 which come before a, and at least 6 elements in list2 which come before a. So a is an upper bound on the result. Likewise there are 5 elements in list2 which come before b and fewer than 6 elements in list1 which come before b. So b is a lower bound on the result. Now we know that the result is either in list1(0..5) or list2(5..10). If a < b, then the result is either in list1(5..10) or list2(0..5). And if a == b we have our answer (but the problem said the elements were distinct, therefore a != b).
We just repeat this process, cutting the size of the search space in half at each step. Bisection refers to the fact that we choose the middle element (bisector) out of the range we know includes the result.
So the only difference between this and binary search is that in binary search we compare to a value we're looking for, but here we compare to a value from the other list.
NOTE: this is actually O(log i), which is better than (or at least no worse than) O(log n). Furthermore, for small i (perhaps i < 100), it would actually be fewer operations to merge the first i elements (linear search instead of bisection) because that is so much simpler. When you add in cache behavior and data locality, the linear search may well be faster for i up to several thousand.
Also, if i > n, then rely on the fact that the result has to be toward the end of either list; your initial candidate range in each list is ((i-n)..n).
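A sketch of the bisection in C++ (my own code, not Ben's; r is the 0-based rank, both vectors are sorted ascending and all elements are distinct):

#include <algorithm>
#include <cstddef>
#include <vector>

double rankSelect(const std::vector<double>& a, const std::vector<double>& b, std::size_t r) {
    std::size_t i = 0, j = 0, k = r + 1;                  // k = 1-based rank still wanted
    while (true) {
        if (i == a.size()) return b[j + k - 1];           // a exhausted: answer is in b
        if (j == b.size()) return a[i + k - 1];           // b exhausted: answer is in a
        if (k == 1) return std::min(a[i], b[j]);
        std::size_t ia = std::min(a.size(), i + k / 2);   // probe ~k/2 ahead in each list
        std::size_t jb = std::min(b.size(), j + k / 2);
        if (a[ia - 1] < b[jb - 1]) { k -= ia - i; i = ia; }   // a[i..ia-1] can't hold rank k
        else                       { k -= jb - j; j = jb; }
    }
}

For the example lists above (A: 1, 3, 5, 7 and B: 2, 4, 6, 8), rankSelect(A, B, 3) returns 4, the 4th-smallest element of the union.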
Here is how you do it.
Let the first list be ListX and the second list be ListY. We need to find the right combination of ListX[x] and ListY[y] where x + y = i. Since x, y, i are natural numbers we can immediately constrain our problem domain to x*y. And by using the equations max(x) = len(ListX) and max(y) = len(ListY) we now have a subset of x*y elements in the form [x, y] that we need to search.
What you will do is order those elements like so: [i - max(y), max(y)], [i - max(y) + 1, max(y) - 1], ... , [max(x), i - max(x)]. You will then bisect this list by choosing the middle [x, y] combination. Since the lists are ordered and distinct you can test ListX[x] < ListY[y]. If true, we bisect the upper half of our [x, y] combinations; if false, we bisect the lower half. You keep bisecting until you find the right combination.
There are a lot of details I left out, but that is the general gist of it. It is indeed O(log(n))!
Edit: As Ben pointed out this actually O(log(i)). If we let n = len(ListX) + len(ListY) then we know that i <= n.
When merging two lists, you're going to have to touch every element in both lists. If you don't touch every element, some elements will be left behind. Thus your theoretical lower bound is O(n). So you can't do it that way.
You don't have to sort, since you have two lists that are already sorted, and you can maintain that ordering as part of the merge.
edit: oops, I misread the question. I thought you were given the value and wanted to find the rank, not the other way around. If you want to find the rank given a value, then this is how to do it in O(log N):
Yes, you can do this in O(log N), if the list allows O(1) random access (i.e. it's an array and not a linked list).
Binary search on L1
Binary search on L2
Sum the indices
You'd have to work out the math, +1, -1, what to do if element isn't found, etc, but that's the idea.
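A sketch of that idea (using std::lower_bound on two ascending sorted vectors; it returns the number of elements in the union that are smaller than the given value, which is its 0-based rank when the value is present):

#include <algorithm>
#include <cstddef>
#include <vector>

std::size_t rankOf(const std::vector<double>& l1, const std::vector<double>& l2, double value) {
    // elements strictly smaller than 'value' in each list, via binary search
    std::size_t r1 = std::lower_bound(l1.begin(), l1.end(), value) - l1.begin();
    std::size_t r2 = std::lower_bound(l2.begin(), l2.end(), value) - l2.begin();
    return r1 + r2;     // sum the indices
}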