Efficient way to find numbers that multiply to given numbers - c++

I'm given 2 lists, a and b. Both of them contain only integers. min(a) > 0, max(a) can be up to 1e10 and max(abs(b)) can be up to 1e5. I need to find the number of tuples (x, y, z), where x is in a and y, z are in b, such that x = -yz. The number of elements in a and b can be up to 1e5.
My attempt:
I was able to come up with a naive n^2 algorithm. But since the size can be up to 1e5, I need an n log n solution (at most) instead. What I did was:
1. Split b into bp and bn, where the first contains all the positive numbers and the second contains all the negative numbers, and created their maps.
2. Then:
2.1 I iterate over a to get the x's.
2.2 Iterate over the shorter of bn and bp. Check whether the current element divides x. If yes, use map.find() to see whether z = -x/y is present.
What could be an efficient way to do this?

There is no known O(n log n) solution, because z = -x/y <=> log|z| = log(x) - log|y|, which is a 3SUM-type problem.
As kaya3 (https://stackoverflow.com/users/12299000/kaya3) has mentioned, this is 3SUM with 3 different arrays. According to Wikipedia:
Kane, Lovett, and Moran showed that the 6-linear decision tree complexity of 3SUM is O(n*log^2n)

Step 1: Sort the elements in list b (say bsorted).
Step 2: For each value x in a, go through every value y in bsorted that divides x and binary search bsorted for z = -x/y.
Complexity: with |a| = m and |b| = n, this is O(m n log n).
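A minimal sketch of the lookup step, assuming tuples are counted by position (so duplicate values in b count separately). The name count_tuples is hypothetical. It swaps the binary search for a frequency table, which removes the log factor but still costs O(m * d), where d is the number of distinct values in b:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Counts tuples (x, y, z) with x in a, y and z in b, and x == -y * z.
// Assumes every x > 0, so y and z always have opposite signs (hence y != z).
std::int64_t count_tuples(const std::vector<std::int64_t>& a,
                          const std::vector<std::int64_t>& b) {
    std::unordered_map<std::int64_t, std::int64_t> freq;  // value -> occurrences in b
    for (std::int64_t v : b) ++freq[v];

    std::int64_t total = 0;
    for (std::int64_t x : a) {
        for (const auto& entry : freq) {
            const std::int64_t y = entry.first;
            if (y == 0 || x % y != 0) continue;   // y must divide x exactly
            const auto it = freq.find(-x / y);    // the only z that can work
            if (it != freq.end()) total += entry.second * it->second;
        }
    }
    return total;
}
```

For a = {6} and b = {2, -3, 3, -2} this counts the four ordered choices (2, -3), (-3, 2), (3, -2) and (-2, 3).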

Here's an untested idea. Create a trie from the elements of b, where the "characters" are ordered prime numbers. For each element in a, walk all valid paths in the trie (DFS or BFS, where the test is being able to divide further by the current node), and for each leaf reached, check if the remaining element (after dividing at each node) exists in b. (We may need to handle duplicates by storing counts of each "word" and using simple combinatorics.)

Related

Faster way of searching array of sets

I have an array containing 100,000 sets. Each set contains natural numbers below 1,000,000. I have to find the number of ordered pairs {m, n}, where 0 < m < 1,000,000, 0 < n < 1,000,000 and m != n, which do not exist together in any of 100,000 sets. A naive method of searching through all the sets leads to 10^5 * (10^6 choose 2) number of searches.
For example I have 2 sets set1 = {1,2,4} set2 = {1,3}. All possible ordered pairs of numbers below 5 are {1,2}, {1,3}, {1,4}, {2,3}, {2,4} and {3,4}. The ordered pairs of numbers below 5 which do not exist together in set 1 are {1,3},{2,3} and {3,4}. The ordered pairs below 5 missing in set 2 are {1,2},{1,4},{2,3},{2,4} and {3,4}. The ordered pairs which do not exist together in both the sets are {2,3} and {3,4}. So the count of number of ordered pairs missing is 2.
Can anybody point me to a clever way of organizing my data structure so that finding the number of missing pairs is faster? I apologize in advance if this question has been asked before.
Update:
Here is some information about the structure of my data set.
The number of elements in each set varies from 2 to 500,000. The median number of elements is around 10,000. The distribution peaks around 10,000 and tapers down in both direction. The union of the elements in the 100,000 sets is close to 1,000,000.
If you are looking for combinations across sets, there is a way to meaningfully condense your dataset, as shown in frenzykryger's answer. However, from your examples, what you're looking for is the number of combinations available within each set, meaning each set contains irreducible information. Additionally, you can't use combinatorics to simply obtain the number of combinations from each set either; you ultimately want to deduplicate combinations across all sets, so the actual combinations matter.
Knowing all this, it is difficult to think of any major breakthroughs you could make. Let's say you have i sets and a maximum of k items in each set. The naive approach would be:
If your sets are typically dense (i.e. contain most of the numbers between 1 and 1,000,000), replace them with the complement of the set instead
Create a set of 2 tuples (use a set structure that ensures insertion is idempotent)
For each set O(i):
Evaluate all combinations and insert into set of combinations: O(k choose 2)
The worst case complexity for this isn't great, but assuming you have scenarios where a set either contains most of the numbers between 0 and 1,000,000, or almost none of them, you should see a big improvement in performance.
Another approach would be to go ahead and use combinatorics to count the number of combinations from each set, then use some efficient approach to find the number of duplicate combinations among sets. I'm not aware of such an approach, but it is possible it exists.
First, let's solve the simpler task of counting the numbers not present in any of your sets. This task can be restated more simply: instead of 100,000 sets, think of one set which contains all your numbers. The number of elements not present in this set is x = 1000000 - len(set). You can then use x to count combinations. With repetitions: x * x; without repetitions: x * (x - 1). So the bottom line of my answer is to put all your numbers in one big set and use its length to count combinations using combinatorics.
Update
So above we have a way to find the number of combinations where each element in the combination is not in any of the sets. But the question was to find the number of combinations where the combination itself is not present in any of the sets.
Let's try to solve a simpler problem first:
your sets have all numbers in them, none missing
each number is present in exactly one set, no duplicates across sets
How would you construct such combinations over such sets? You would simply pick two elements from different sets, and the resulting combination would not be in any of the sets. The number of such combinations can be counted using the following code (it accepts the sizes of the sets):
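long long count_combinations(const vector<int>& buckets) {
    long long result = 0;  // products of set sizes can overflow a 32-bit int
    for (size_t i = 0; i < buckets.size(); ++i) {
        for (size_t j = i + 1; j < buckets.size(); ++j) {
            result += static_cast<long long>(buckets[i]) * buckets[j];
        }
    }
    return result;
}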
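The double loop above is quadratic in the number of sets. As a sketch, the same value can be computed in one pass, because the sum of b_i * b_j over all i < j equals ((sum of b_i)^2 - sum of b_i^2) / 2. The name count_combinations_fast is hypothetical, and 64-bit arithmetic is assumed to be wide enough for the totals involved:

```cpp
#include <cstdint>
#include <vector>

// Sum of buckets[i] * buckets[j] over all i < j, in a single pass.
std::int64_t count_combinations_fast(const std::vector<int>& buckets) {
    std::int64_t sum = 0;     // running sum of sizes
    std::int64_t sum_sq = 0;  // running sum of squared sizes
    for (int b : buckets) {
        sum += b;
        sum_sq += static_cast<std::int64_t>(b) * b;
    }
    // (sum)^2 counts every ordered pair plus the squares; remove the squares and halve.
    return (sum * sum - sum_sq) / 2;
}
```

For sizes {1, 2, 3} this gives 1*2 + 1*3 + 2*3 = 11.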
Now let's imagine that some numbers are missing. Then we can just add an additional set containing those missing numbers (as a separate set). But we also need to account for the fact that, given n missing numbers, there are n * (n-1) combinations constructed using only these missing numbers. So the following code produces the total number of combinations, taking the missing numbers into account:
int missing_numbers = upper_bound - all_numbers.size() - 1;
int missing_combinations = missing_numbers * (missing_numbers - 1);
return missing_combinations + count_combinations(sets, missing_numbers);
Now let's imagine we have a duplicate across two sets: {a, b, c}, {a, d}.
What types of errors will it introduce? The following pairs: {a, a}, a repetition, and {a, d}, a combination which is present in the second set.
So how do we treat such duplicates? We need to eliminate them completely from all sets. Even a single instance of a duplicate will produce a combination present in some set, because we can pick any element from the set where the duplicate was removed and produce such a combination (in my example: if we keep a in the first set, then pick d from the second to produce {a, d}; if we keep a in the second set, then pick b or c from the first to produce {a, b} and {a, c}). So duplicates shall be removed.
Update
However, we can't simply remove all duplicates. Consider this counterexample:
{a, b} {a, c} {d}. If we simply remove a we get {b} {c} {d} and lose the information about the non-existing combination {a, d}. Consider another counterexample:
{a, b} {a, b, c} {b, d}. If we simply remove duplicates we get {c} {d} and lose the information about {a, d}.
Also, we can't simply apply such logic to pairs of sets. A simple counterexample for numbers < 3: {1, 2} {1} {2}. Here the number of missing combinations is 0, but we will incorrectly count {1, 2} if we apply duplicate removal to pairs of sets. The bottom line is that I can't come up with a good technique that correctly handles duplicate elements across sets.
What you can do, depending on memory requirements, is take advantage of the ordering of Set, and iterate over the values smartly. Something like the code below (untested). You'll iterate over all of your sets, and then for each of your sets you'll iterate over their values. For each of these values, you'll check all of the values in the set after them. Our complexity is reduced to the number of sets times the square of their sizes. You can use a variety of methods to keep track of your found/unfound count, but using a set should be fine, since insertion is simply O(log(n)) where n is no more than 499999500000. In theory using a map of sets (mapping based on the first value) could be slightly faster, but in either case the cost is minimal.
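#include <array>
#include <iterator>
#include <set>
#include <utility>

long long numMissing(const std::array<std::set<int>, 100000>& sets){
    // every pair (m, n) with m < n that occurs together in at least one set
    std::set<std::pair<int, int>> found;
    for (const auto& s : sets){
        for (auto m = s.cbegin(); m != s.cend(); ++m){
            // std::set iterates in ascending order, so n only visits values greater than *m
            for (auto n = std::next(m); n != s.cend(); ++n){
                found.emplace(*m, *n);
            }
        }
    }
    return 499999500000 - static_cast<long long>(found.size());
}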
As an option you can build Bloom Filter(s) over your sets.
Before checking against all sets you can quickly lookup at your bloom filter and since it will never produce false negatives you can safely use your pair as its not present in your sets.
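A minimal sketch of such a filter over pairs, assuming a fixed bit count chosen by the caller (it must be greater than zero) and a simple 64-bit mixing function. The class name PairBloomFilter and its interface are hypothetical. A "false" from maybe_contains is definitive; a "true" still needs to be confirmed against the real sets:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bloom filter over pairs (m, n): no false negatives, some false positives.
class PairBloomFilter {
public:
    PairBloomFilter(std::size_t bit_count, int hash_count)
        : bits_(bit_count, false), hash_count_(hash_count) {}

    void add(std::uint32_t m, std::uint32_t n) {
        for (int i = 0; i < hash_count_; ++i) bits_[index(m, n, i)] = true;
    }

    bool maybe_contains(std::uint32_t m, std::uint32_t n) const {
        for (int i = 0; i < hash_count_; ++i)
            if (!bits_[index(m, n, i)]) return false;  // definitely absent
        return true;                                   // possibly present
    }

private:
    // splitmix64-style mixing step, used here as a cheap 64-bit hash.
    static std::uint64_t mix(std::uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    // Double hashing: the i-th probe position is (h1 + i * h2) mod bit count.
    std::size_t index(std::uint32_t m, std::uint32_t n, int i) const {
        const std::uint64_t key = (static_cast<std::uint64_t>(m) << 32) | n;
        const std::uint64_t h1 = mix(key);
        const std::uint64_t h2 = mix(h1) | 1;
        return static_cast<std::size_t>(
            (h1 + static_cast<std::uint64_t>(i) * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    int hash_count_;
};
```

Every pair that co-occurs in some set would be added once; a candidate pair for which maybe_contains returns false can then be counted as missing without touching the sets.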
Physically storing each possible pair would take too much memory. We have 100k sets, and an average set has 10k numbers, which is about 50M pairs per set, or 400MB per set with two int32 values per pair (and set<pair<int, int>> needs much more than 8 bytes per element).
My suggestion is based on two ideas:
don't store, only count the missing pairs
use interval set for compact storage and fast set operations (like boost interval set)
The algorithm is still quadratic on the number of elements in the sets but needs much less space.
Algorithm:
Create the union_set of the individual sets.
We also need a data structure, let's call it sets_for_number to answer this question: which sets contain a particular number? For the simplest case this could be unordered_map<int, vector<int>> (vector stores set indices 0..99999)
Also create the inverse sets for each set. Using interval sets this takes only 10k * 2 * sizeof(int) space per set on average.
dynamic_bitset<> union_set = ...; // union of individual sets (can be vector<bool>)
vector<interval_set<int>> inverse_sets = ...; // numbers 1..999999 not contained in each set
int64_t missing_count = 0;
for (int n = 1; n < 1000000; ++n) {
    // count the missing pairs whose first element is n
    if (!union_set[n]) {
        // all pairs are missing
        missing_count += (999999 - n);
    } else {
        // check which second elements are not present
        interval_set<int> missing_second_elements = interval_set<int>(n+1, 1000000);
        // iterate over all sets containing n
        for (int set_idx : sets_for_number.at(n)) {
            // operator&= is in-place intersection
            missing_second_elements &= inverse_sets[set_idx];
        }
        // counting the number of pairs (n, m) where m is a number
        // that is not present in any of the sets containing n
        for (auto interval : missing_second_elements)
            missing_count += interval.size();
    }
}
If it is possible, keep a set of all numbers and remove each number as you insert it into your array of sets. This has O(n) space complexity.
Of course, if you don't want high space complexity, maybe you can use a range vector. For each element in the vector, you have a pair of numbers which are the start/end of a range.

What algorithm used to find the nth sorted subarray of an unordered array?

I had this question recently in an interview and I failed, and now I'm searching for the answer.
Let's say I have a big array of n integers, all different.
If this array were sorted, I could subdivide it into x smaller arrays, all of size y, except maybe the last one, which could be smaller.
I could then extract the nth subarray and return it, already sorted.
Example: array 4 2 5 1 6 3. If y=2 and I want the 2nd array, it would be 3 4.
Now what I did is simply sort the array and return the nth subarray, which takes O(n log n). But I was told that there is a way to do it in O(n + y log y). I searched the internet and didn't find anything. Ideas?
The algorithm you are looking for is Selection Algorithm, which lets you find k-th order statistics in linear time. The algorithm is quite complex, but the standard C++ library conveniently provides an implementation of it.
The algorithm for finding k-th sorted interval that the interviewers had in mind went like this:
Find the b = (k-1)*y-th order statistic in O(N).
Find the e = k*y-th order statistic in O(N).
There will be y numbers between b and e. Store them in a separate array of size y. This operation takes O(N).
Sort the array of size y at O(y * log2 y) cost.
The overall cost is O(N + N + N + y * log2 y), i.e. O(N + y * log2 y).
You can combine std::nth_element and std::sort for this:
std::vector<int> vec = muchData();
// Fix those bound iterators as needed
auto lower = vec.begin() + k*y;
auto upper = lower + y;
// put right element at lower and partition vector by it
std::nth_element(vec.begin(), lower, vec.end());
// Same for upper, but don't mess up lower
std::nth_element(lower + 1, upper - 1, vec.end());
// Now sort the subarray
std::sort(lower, upper);
[lower, upper) is now the k-th sorted subarray of length y, with the desired complexity on average.
To be checked for special cases like y = 1 before real world use, but this is the general idea.
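A sketch that wraps the same two-step idea in a function and handles the edge cases mentioned above, assuming k is 1-based as in the question's example and that the last block may be shorter than y. The name kth_sorted_block is hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the k-th (1-based) block of size y of the sorted order of v,
// without sorting all of v. Returns an empty vector if the block does not exist.
std::vector<int> kth_sorted_block(std::vector<int> v, std::size_t k, std::size_t y) {
    const std::size_t n = v.size();
    if (n == 0 || y == 0 || k == 0) return {};
    if (k - 1 > (n - 1) / y) return {};            // block starts past the end

    const std::size_t lo = (k - 1) * y;
    const std::size_t hi = std::min(n, lo + y);    // last block may be short
    const auto first = v.begin() + static_cast<std::ptrdiff_t>(lo);
    const auto last = v.begin() + static_cast<std::ptrdiff_t>(hi);

    // Everything before `first` is now <= everything from `first` on.
    std::nth_element(v.begin(), first, v.end());
    // Within [first, end), move the y smallest values into [first, last).
    if (last != v.end()) std::nth_element(first, last, v.end());
    // Only the y selected values get sorted.
    std::sort(first, last);
    return std::vector<int>(first, last);
}
```

With the array 4 2 5 1 6 3, y = 2 and k = 2, this returns 3 4, matching the example in the question.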

Algorithm to find isomorphic set of permutations

I have an array of sets of permutations, and I want to remove isomorphic permutations.
We have S sets of permutations, where each set contains K permutations, and each permutation is represented as an array of N elements. I'm currently storing it as an array int pset[S][K][N], where S, K and N are fixed, and N is larger than K.
Two sets of permutations, A and B, are isomorphic if there exists a permutation P that converts elements from A to B (for example, if a is an element of set A, then P(a) is an element of set B). In this case we can say that P makes A and B isomorphic.
My current algorithm is:
We choose all pairs s1 = pset[i] and s2 = pset[j], such that i < j
The elements of the chosen sets (s1 and s2) are numbered from 1 to K. That means that each element can be represented as s1[i] or s2[i], where 0 < i < K+1
For every permutation T of K elements, we do the following:
Find the permutation R, such that R(s1[1]) = s2[1]
Check whether R is a permutation that makes s1 and T(s2) isomorphic, where T(s2) is a rearrangement of the elements (permutations) of the set s2, so basically we just check whether R(s1[i]) = s2[T[i]], where 0 < i < K+1
If not, then we go to the next permutation T.
This algorithm is really slow: O(S^2) for the first step, O(K!) to loop through each permutation T, O(N^2) to find R, and O(K*N) to check whether R is the permutation that makes s1 and s2 isomorphic, so it is O(S^2 * K! * N^2).
Question: Can we make it faster?
You can sort and compare:
// 1 - sort each set of permutation
for i = 0 to S-1
sort(pset[i])
// 2 - sort the array of permutations itself
sort(pset)
// 3 - compare
for i = 1 to S-1 {
if(areEqual(pset[i], pset[i-1]))
// pset[i] and pset[i-1] are isomorphic
}
A concrete example:
0: [[1,2,3],[3,2,1]]
1: [[2,3,1],[1,3,2]]
2: [[1,2,3],[2,3,1]]
3: [[3,2,1],[1,2,3]]
After 1:
0: [[1,2,3],[3,2,1]]
1: [[1,3,2],[2,3,1]] // order changed
2: [[1,2,3],[2,3,1]]
3: [[1,2,3],[3,2,1]] // order changed
After 2:
2: [[1,2,3],[2,3,1]]
0: [[1,2,3],[3,2,1]]
3: [[1,2,3],[3,2,1]]
1: [[1,3,2],[2,3,1]]
After 3:
(2, 0) not isomorphic
(0, 3) isomorphic
(3, 1) not isomorphic
What about the complexity?
1 is O(S * (K * N) * log(K * N))
2 is O(S * K * N * log(S * K * N))
3 is O(S * K * N)
So the overall complexity is O(S * K * N log(S * K * N))
There is a very simple solution for this: transposition.
If two sets are isomorphic, it means a one-to-one mapping exists, where the set of all the numbers at index i in set S1 equals the set of all the numbers at some index k in set S2. My conjecture is that no two non-isomorphic sets have this property.
(1) Jean Logeart's example:
0: [[1,2,3],[3,2,1]]
1: [[2,3,1],[1,3,2]]
2: [[1,2,3],[2,3,1]]
3: [[3,2,1],[1,2,3]]
Perform ONE pass:
Transpose, O(n):
0: [[1,3],[2,2],[3,1]]
Sort both in and between groups, O(something log something):
0: [[1,3],[1,3],[2,2]]
Hash:
"131322" -> 0
...
"121233" -> 1
"121323" -> 2
"131322" -> already hashed.
0 and 3 are isomorphic.
(2) vsoftco's counter-example in his comment to Jean Logeart's answer:
A = [ [0, 1, 2], [2, 0, 1] ]
B = [ [1, 0, 2], [0, 2, 1] ]
"010212" -> A
"010212" -> already hashed.
A and B are isomorphic.
You can turn each set into a transposed-sorted string or hash or whatever compressed object for linear-time comparison. Note that this algorithm considers all three sets A, B and C as isomorphic even if one p converts A to B and another p converts A to C. Clearly, in this case, there are ps to convert any one of these three sets to the other, since all we are doing is moving each i in one set to a specific k in the other. If, as you stated, your goal is to "remove isomorphic permutations," you will still get a list of sets to remove.
Explanation:
Assume that along with our sorted hash, we kept a record of which permutation each i came from. vsoftco's counter-example:
010212 // hash for A and B
100110 // origin permutation, set A
100110 // origin permutation, set B
In order to confirm isomorphism, we need to show that the i's grouped in each index from the first set moved to some index in the second set, which index does not matter. Sorting the groups of i's does not invalidate the solution, rather it serves to confirm movement/permutation between sets.
Now by definition, each number in a hash and each number in each group in the hash is represented in an origin permutation exactly one time for each set. However we choose to arrange the numbers in each group of i's in the hash, we are guaranteed that each number in that group is representing a different permutation in the set; and the moment we theoretically assign that number, we are guaranteed it is "reserved" for that permutation and index only. For a given number, say 2, in the two hashes, we are guaranteed that it comes from one index and permutation in set A, and in the second hash corresponds to one index and permutation in set B. That is all we really need to show - that the number in one index for each permutation in one set (a group of distinct i's) went to one index only in the other set (a group of distinct k's). Which permutation and index the number belongs to is irrelevant.
Remember that any set S2, isomorphic to set S1, can be derived from S1 using one permutation function or various combinations of different permutation functions applied to S1's members. What the sorting, or reordering, of our numbers and groups actually represents is the permutation we are choosing to assign as the solution to the isomorphism rather than an actual assignment of which number came from which index and permutation. Here is vsoftco's counter-example again, this time we will add the origin indexes of our hashes:
110022 // origin index set A
001122 // origin index set B
Therefore our permutation, a solution to the isomorphism, is:
Or, in order:
(Notice that in Jean Logeart's example there is more than one solution to the isomorphism.)
Suppose that two elements of s1, s2 \in S are isomorphic. Then if p1 and p2 are permutations, then s1 is isomorphic to s2 iff p1(s1) is isomorphic to p2(s2) where pi(si) is the set of permutations obtained by applying pi to every element in si.
For each i in 1...s and j in 1...k, choose the j-th member of si, and find the permutation that changes it to unity. Apply it to all the elements of si. Hash each of the k permutations to a number, obtaining k numbers, for any choice of i and j, at cost nk.
Comparing the hashed sets for two different values of i and j is k^2 < nk. Thus, you can find the set of candidate matches at cost s^2 k^3 n. If the actual number of matches is low, the overall complexity is far beneath what you specified in your question.
Take a0 in A. Then find its inverse (fast, O(N)), call it a0inv. Then choose some b_i in B, define P_i = b_i * a0inv, and check that P_i * a generates B when varying a over A. Do this for every b_i in B. If you don't find any i for which the relation holds, then the sets are not isomorphic. If you find such an i, then the sets are isomorphic. The runtime is O(K^2) for each pair of sets it checks, and you'd need to check O(S^2) pairs of sets, so you end up with O(S^2 * K^2 * N).
PS: I assumed here that by "maps A to B" you mean mapping under permutation composition, so P(a) is actually the permutation P composed with the permutation a, and I've used the fact that if P is a permutation, then there must exist an i for which Pa = b_i for some a.
EDIT: I decided to undelete my answer as I am not convinced the previous one (@Jean Logeart), based on sorting, is correct. If it is, I'll gladly delete mine, as it performs worse, but I think I have a counterexample; see the comments below Jean's answer.
To check if two sets S₁ and S₂ are isomorphic you can do a much shorter search.
If they are isomorphic then there is a permutation t that maps each element of S₁ to an element of S₂; to find t you can just pick any fixed element p of S₁ and consider the permutations
t₁ = (1/p) q₁
t₂ = (1/p) q₂
t₃ = (1/p) q₃
...
for all elements q of S₂. For, if a valid t exists then it must map the element p to an element of S₂, so only permutations mapping p to an element of S₂ are possible candidates.
Moreover, given a candidate t, to check whether the two sets of permutations S₁t and S₂ are equal you could use a hash computed as the XOR of a hash code for each element, doing the full check of all the permutations only if the hashes match.
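A sketch of the candidate-permutation check described in the last two answers, assuming permutations are stored as vectors over 0..N-1 and that P acts by left composition, (P * a)[i] = P[a[i]]; the mirrored convention works the same way with the factors swapped. The names Perm, compose, inverse and isomorphic are hypothetical, and sorted comparison stands in for the hash shortcut:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Perm = std::vector<int>;

// (p * q)[i] = p[q[i]]: apply q first, then p.
Perm compose(const Perm& p, const Perm& q) {
    Perm r(q.size());
    for (std::size_t i = 0; i < q.size(); ++i) r[i] = p[q[i]];
    return r;
}

Perm inverse(const Perm& p) {
    Perm r(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) r[p[i]] = static_cast<int>(i);
    return r;
}

// True if some permutation P satisfies { P * a : a in A } == B.
bool isomorphic(std::vector<Perm> A, std::vector<Perm> B) {
    if (A.size() != B.size()) return false;
    if (A.empty()) return true;
    std::sort(B.begin(), B.end());
    const Perm a0inv = inverse(A.front());
    for (const Perm& b : B) {
        // The only P that sends A.front() to b is b * a0inv.
        const Perm P = compose(b, a0inv);
        std::vector<Perm> image;
        image.reserve(A.size());
        for (const Perm& a : A) image.push_back(compose(P, a));
        std::sort(image.begin(), image.end());
        if (image == B) return true;
    }
    return false;
}
```

Only K candidates are tried per pair of sets, because a valid P must send the fixed element of A to some element of B.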

Generating random integers with a difference constraint

I have the following problem:
Generate M uniformly random integers from the range 0-N, where N >> M, where no pair has a difference less than K, and where M >> K.
At the moment the best method I can think of is to maintain a sorted list, then determine the lower bound of the currently generated integer and test it against its lower and upper neighbours; if it's OK, insert the element in between. This is of complexity O(n log n).
Would there happen to be a more efficient algorithm?
An example of the problem:
Generate 1000 uniformly random integers between zero and 100million where the difference between any two integers is no less than 1000
A comprehensive way to solve this would be to:
Determine all the combinations of n-choose-m that satisfy the constraint; let's call this set X.
Select a uniformly random integer i in the range [0,|X|).
Select the i'th combination from X as the result.
This solution is problematic when the n-choose-m is large, as enumerating and storing all possible combinations will be extremely costly. Hence an efficient online generating solution is sought.
Note: The following is a C++ implementation of the solution provided by pentadecagon
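#include <algorithm>
#include <random>
#include <vector>

std::vector<int> generate_random(const int n, const int m, const int k)
{
   // No valid result unless m values spaced k apart fit into [0, n]
   if ((m <= 0) || (k < 0) || (static_cast<long long>(m - 1) * k > n))
      return std::vector<int>();
   std::random_device source;
   std::mt19937 generator(source());
   // Draw from the compressed range; the gaps are added back after sorting
   std::uniform_int_distribution<> distribution(0, n - (m - 1) * k);
   std::vector<int> result_list;
   result_list.reserve(m);
   for (int i = 0; i < m; ++i)
   {
      result_list.push_back(distribution(generator));
   }
   std::sort(std::begin(result_list),std::end(result_list));
   // Shift the i-th smallest value by i * k so neighbours differ by at least k
   for (int i = 0; i < m; ++i)
   {
      result_list[i] += (i * k);
   }
   return result_list;
}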
http://ideone.com/KOeR4R
EDIT: I adapted the text for the requirement to create ordered sequences, each with the same probability.
Create random numbers a_i for i=0..M-1 without duplicates. Sort them. Then create numbers
b_i=a_i + i*(K-1)
Given the construction, those numbers b_i have the required gaps, because the a_i already have gaps of at least 1. In order to make sure those b values cover exactly the required range [1..N], you must ensure a_i are picked from a range [1..N-(M-1)*(K-1)]. This way you get truly independent numbers. Well, as independent as possible given the required gap. Because of the sorting you get O(M log M) performance again, but this shouldn't be too bad. Sorting is typically very fast. In Python it looks like this:
import random
def random_list( N, M, K ):
    s = set()
    while len(s) < M:
        s.add( random.randint( 1, N-(M-1)*(K-1) ) )
    res = sorted( s )
    for i in range(M):
        res[i] += i * (K-1)
    return res
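A C++ counterpart of the Python sketch above, assuming the same range [1..N] and that N is large enough compared to M for rejection sampling to finish quickly. The name random_spaced and the choice to pass the generator in are hypothetical. Because the M values are drawn without duplicates before the shift, every valid result should be equally likely:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Returns M sorted integers in [1, N] whose pairwise differences are at least K,
// or an empty vector if no such selection exists.
std::vector<long long> random_spaced(long long N, long long M, long long K,
                                     std::mt19937_64& rng) {
    if (M <= 0 || K < 1) return {};
    const long long span = N - (M - 1) * (K - 1);  // compressed range is [1, span]
    if (span < M) return {};                       // not enough room for M distinct values

    std::uniform_int_distribution<long long> pick(1, span);
    std::unordered_set<long long> seen;            // rejects duplicates
    while (static_cast<long long>(seen.size()) < M) seen.insert(pick(rng));

    std::vector<long long> res(seen.begin(), seen.end());
    std::sort(res.begin(), res.end());
    // Distinct sorted values differ by >= 1; adding i * (K - 1) widens that to >= K.
    for (long long i = 0; i < M; ++i) res[static_cast<std::size_t>(i)] += i * (K - 1);
    return res;
}
```

For the example in the question this would be called as random_spaced(100000000, 1000, 1000, rng) with a seeded std::mt19937_64.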
First off: this will be an attempt to show that there's a bijection between the (M+1)- compositions (with the slight modification that we will allow addends to be 0) of the value N - (M-1)*K and the valid solutions to your problem. After that, we only have to pick one of those compositions uniformly at random and apply the bijection.
Bijection:
Let
N - (M-1)*K = x_0 + x_1 + ... + x_M, with every x_i >= 0.
Then the x_i form an (M+1)-composition (with 0 addends allowed) of the value on the left (notice that the x_i do not have to be monotonically increasing!).
From this we get a valid solution
m_1 < m_2 < ... < m_M
by setting the values m_i as follows:
m_1 = x_0, and m_(i+1) = m_i + K + x_i for 1 <= i <= M-1.
We see that the distance between m_i and m_(i+1) is at least K, and m_M is at most N (compare the choice of the composition we started out with). This means that every (M+1)-composition that fulfills the conditions above defines exactly one valid solution to your problem. (You'll notice that we only use x_M as a way to make the sum turn out right; we don't use it for the construction of the m_i.)
To see that this gives a bijection, we need to see that the construction can be reversed; for this purpose, let
m_1 < m_2 < ... < m_M
be a given solution fulfilling your conditions. To get the composition this is constructed from, define the x_i as follows:
x_0 = m_1, x_i = m_(i+1) - m_i - K for 1 <= i <= M-1, and x_M = N - m_M.
Now first, all x_i are at least 0, so that's alright. To see that they form a valid composition (again, every x_i is allowed to be 0) of the value given above, consider:
x_0 + x_1 + ... + x_M = m_1 + [(m_2 - m_1 - K) + ... + (m_M - m_(M-1) - K)] + (N - m_M) = m_1 + (m_M - m_1) - (M-1)*K + (N - m_M) = N - (M-1)*K.
The middle step follows since we have a telescoping sum that cancels out almost all m_i.
So we've seen that the described construction gives a bijection between the described compositions of N - (M-1)*K and the valid solutions to your problem. All we have to do now is pick one of those compositions uniformly at random and apply the construction to get a solution.
Picking a composition uniformly at random
Each of the described compositions can be uniquely identified in the following way (compare this for illustration): reserve N - (M-1)*K spaces for the unary notation of that value, and another M spaces for M commas. We get an (M+1)-composition of N - (M-1)*K by choosing M of the N - (M-1)*K + M spaces, putting commas there, and filling the rest with |. Then let x_0 be the number of | before the first comma, x_M the number of | after the last comma, and every other x_i the number of | between commas i and i+1. So all we have to do is pick an M-element subset of the integer interval [1; N - (M-1)*K + M] uniformly at random, which we can do for example with the Fisher-Yates shuffle in O(N + M log M) (we need to sort the M delimiters to build the composition), since M*K needs to be in O(N) for any solutions to exist. So if N is bigger than M by at least a logarithmic factor, then this is linear in N.
Note: @DavidEisenstat suggested that there are more space-efficient ways of picking the M-element subset of that interval; I'm not aware of any, I'm afraid.
You can get an error-proof algorithm out of this by doing the simple input validation we get from the construction above that N ≥ (M-1) * K and that all three values are at least 1 (or 0, if you define the empty set as a valid solution for that case).
Why not do this:
for (int i = 0; i < M; ++i) {
    // pick a random number between K and N/M
    // add this number to (N/M) * i
}
Now you have M random numbers, distributed evenly along N, all of which have a difference of at least K. It's in O(n) time. As an added bonus, it's already sorted. :-)
EDIT:
Actually, the "pick a random number" part shouldn't be between K and N/M, but between min(K, [K - (N/M * i - previous value)]). That would ensure that the differences are still at least K, and not exclude values that should not be missed.
Second EDIT:
Well, the first case shouldn't be between K and N/M - it should be between 0 and N/M. Just like you need special casing for when you get close to the N/M*i border, we need special initial casing.
Aside from that, the issue you brought up in your comments was fair representation, and you're right. As my pseudocode is presented, it currently completely misses the excess between N/M*M and N. It's another edge case; simply change the random values of your last range.
Now, in this case, your distribution will be different for the last range. Since you have more numbers, you have slightly less chance for each number than you do for all the other ranges. My understanding is that because you're using ">>", this shouldn't really impact the distribution, i.e. the difference in size in the sample set should be nominal. But if you want to make it more fair, you divide the excess equally among each range. This makes your initial range calculation more complex - you'll have to augment each range based on how much remainder there is divided by M.
There are lots of special cases to look out for, but they're all able to be handled. I kept the pseudocode very basic just to make sure that the general concept came through clearly. If nothing else, it should be a good starting point.
Third and Final EDIT:
For those worried that the distribution has a forced evenness, I still claim that there's nothing saying it can't. The selection is uniformly distributed in each segment. There is a linear way to keep it uneven, but that also has a trade-off: if one value is selected extremely high (which should be unlikely given a very large N), then all the other values are constrained:
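std::vector<int> result;
int prevValue = 0;
for (int i = 0; i < M; ++i) {
    // leave room for the (M - 1 - i) values still to come, each at least K apart
    int maxRange = N - (((M - 1) - i) * K) - prevValue;
    int nextValue = random(0, maxRange);  // assumed helper: uniform integer in [0, maxRange]
    prevValue += nextValue;
    result.push_back(prevValue);          // store the value
    prevValue += K;
}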
This is still linear and random and allows unevenness, but the bigger prevValue gets, the more constrained the other numbers become. Personally, I prefer my second edit answer, but this is an available option that given a large enough N is likely to satisfy all the posted requirements.
Come to think of it, here's one other idea. It requires a lot more data maintenance, but is still O(M) and is probably the most fair distribution:
What you need to do is maintain a vector of your valid data ranges and a vector of probability scales. A valid data range is just the list of high-low values where K is still valid. The idea is you first use the scaled probability to pick a random data range, then you randomly pick a value within that range. You remove the old valid data range and replace it with 0, 1 or 2 new data ranges in the same position, depending on how many are still valid. All of these actions are constant time other than handling the weighted probability, which is O(M), done in a loop M times, so the total should be O(M^2), which should be much better than O(NlogN) because N >> M.
Rather than pseudocode, let me work an example using OP's original example:
0th iteration: valid data ranges are from [0...100Mill], and the weight for this range is 1.0.
1st iteration: Randomly pick one element in the one element vector, then randomly pick one element in that range.
If the element is, e.g. 12345678, then we remove the [0...100Mill] and replace it with [0...12344678] and [12346678...100Mill]
If the element is, e.g. 500, then we remove the [0...100Mill] and replace it with just [1500...100Mill], since [0...500] is no longer a valid range. The only time we will replace it with 0 ranges is in the unlikely event that you have a range with only one number in it and it gets picked. (In that case, you'll have 3 numbers in a row that are exactly K apart from each other.)
The weight of each range is its length over the total length of all ranges, e.g. 12344678/(12344678 + (100Mill - 12346678)) and (100Mill - 12346678)/(12344678 + (100Mill - 12346678))
In the next iterations, you do the same thing: randomly pick a number between 0 and 1 and determine which of the ranges that scale falls into. Then randomly pick a number in that range, and replace your ranges and scales.
By the time it's done, we're no longer running in O(M), but the running time still depends only on M instead of N. And this actually gives a distribution that is both uniform and fair.
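The range-splitting idea above can be sketched roughly as follows. This is only an illustration under some assumptions: the name sample_spaced is made up, ranges are stored as inclusive (low, high) pairs, each range is weighted by the number of integers it contains, and two values exactly K apart are treated as valid (as the worked example suggests).

```python
import random

def sample_spaced(n, m, k):
    """Pick m integers from [0, n] so that any two differ by at least k (k >= 1)."""
    ranges = [(0, n)]   # inclusive (low, high) ranges that are still valid
    picked = []
    for _ in range(m):
        if not ranges:
            raise ValueError("no valid range left; n is too small for m and k")
        # Weight each range by how many integers it holds (assumption).
        weights = [hi - lo + 1 for lo, hi in ranges]
        idx = random.choices(range(len(ranges)), weights=weights)[0]
        lo, hi = ranges[idx]
        value = random.randint(lo, hi)
        picked.append(value)
        # Replace the chosen range with 0, 1 or 2 ranges that stay k away.
        replacement = []
        if value - k >= lo:
            replacement.append((lo, value - k))
        if value + k <= hi:
            replacement.append((value + k, hi))
        ranges[idx:idx + 1] = replacement
    return picked
```

Both the weighted choice and the list splice cost O(M) per iteration here, so the sketch runs in O(M^2) overall, in line with the estimate given above.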
Hope one of these ideas works for you!

O(log n) algorithm to find the element having rank i in union of pre-sorted lists

Given two sorted lists, each containing n real numbers, is there an O(log n) time algorithm to compute the element of rank i (where i corresponds to the index in increasing order) in the union of the two lists, assuming the elements of the two lists are distinct?
EDIT:
@Ben: This is what I have been doing, but I am still not getting it.
I have an example:
List A : 1, 3, 5, 7
List B : 2, 4, 6, 8
Find rank(i) = 4.
First step: i/2 = 2
List A now contains A: 1, 3
List B now contains B: 2, 4
Compare A[i] to B[i]:
A[i] is less.
So the lists now become :
A: 3
B: 2,4
Second step:
i/2 = 1
List A now contains A: 3
List B now contains B: 2
Now I have lost the value 4, which is actually the result...
I know I am missing something, but even after close to a day of thinking I just can't figure this one out...
Yes:
You know the element lies within either index [0,i] of the first list or [0,i] of the second list. Take element i/2 from each list and compare. Proceed by bisection.
I'm not including any code because this problem sounds a lot like homework.
EDIT: Bisection is the method behind binary search. It works like this:
Assume i = 10; (zero-based indexing, we're looking for the 11th element overall).
On the first step, you know the answer is either in list1(0...10) or list2(0...10). Take a = list1(5) and b = list2(5).
If a > b, then there are 5 elements in list1 which come before a, and at least 6 elements in list2 which come before a. So a is an upper bound on the result. Likewise there are 5 elements in list2 which come before b and fewer than 6 elements in list1 which come before b. So b is a lower bound on the result. Now we know that the result is either in list1(0..5) or list2(5..10). If a < b, then the result is either in list1(5..10) or list2(0..5). And if a == b we have our answer (but the problem said the elements were distinct, therefore a != b).
We just repeat this process, cutting the size of the search space in half at each step. Bisection refers to the fact that we choose the middle element (bisector) out of the range we know includes the result.
So the only difference between this and binary search is that in binary search we compare to a value we're looking for, but here we compare to a value from the other list.
NOTE: this is actually O(log i), which is better than (or at least no worse than) O(log n). Furthermore, for small i (perhaps i < 100), it would actually be fewer operations to merge the first i elements (linear search instead of bisection) because that is so much simpler. When you add in cache behavior and data locality, the linear search may well be faster for i up to several thousand.
Also, if i > n, then you can rely on the fact that the result has to be toward the end of either list: your initial candidate range in each list is ((i-n)..n).
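For readers who get stuck the same way as in the question's worked example, here is one possible sketch of the compare-and-discard idea. It is not the answerer's code; select_rank is a made-up name and ranks are zero-based. The key point is that after discarding elements the remaining rank shrinks by the number of elements discarded, and the list that lost the comparison is kept whole.

```python
def select_rank(a, b, i):
    """Return the element of zero-based rank i in the union of sorted lists a and b."""
    if not 0 <= i < len(a) + len(b):
        raise IndexError("rank out of range")
    lo_a = lo_b = 0   # everything before these indices is known to precede the answer
    k = i + 1         # how many elements (1-based) are still to be counted off
    while True:
        if lo_a == len(a):
            return b[lo_b + k - 1]
        if lo_b == len(b):
            return a[lo_a + k - 1]
        if k == 1:
            return min(a[lo_a], b[lo_b])
        half = k // 2
        ia = min(lo_a + half, len(a)) - 1   # probe about k/2 into what is left of a
        ib = min(lo_b + half, len(b)) - 1   # probe about k/2 into what is left of b
        if a[ia] < b[ib]:
            # a[lo_a..ia] all precede the answer: drop them, keep all of b.
            k -= ia - lo_a + 1
            lo_a = ia + 1
        else:
            k -= ib - lo_b + 1
            lo_b = ib + 1
```

With A = [1, 3, 5, 7] and B = [2, 4, 6, 8], select_rank(A, B, 3) first drops 1 and 3 from A, then drops 2 from B, and finally returns the smaller of 5 and 4, so the value 4 is never thrown away.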
Here is how you do it.
Let the first list be ListX and the second list be ListY. We need to find the right combination of ListX[x] and ListY[y] where x + y = i. Since x, y and i are natural numbers, the candidates are pairs [x, y]. Using max(x) = len(ListX) and max(y) = len(ListY), only a subset of the x*y possible pairs, those with x + y = i, needs to be searched.
What you will do is order those pairs like so: [i - max(y), max(y)], [i - max(y) + 1, max(y) - 1], ... , [max(x), i - max(x)]. You will then bisect this list by choosing the middle [x, y] combination. Since the lists are ordered and distinct you can test ListX[x] < ListY[y]. If true then we bisect the upper half of our [x, y] combinations, or if false then we bisect the lower half. You keep bisecting until you find the right combination.
There are a lot of details I left out, but that is the general gist of it. It is indeed O(log(n))!
Edit: As Ben pointed out, this is actually O(log(i)). If we let n = len(ListX) + len(ListY) then we know that i <= n.
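One way to fill in the omitted details is sketched below. It is an interpretation rather than the answer's exact procedure: kth_by_split is a made-up name, i is a zero-based rank, and x counts how many of the i smallest elements come from ListX (so y = i - x come from ListY). The comparison used is ListX[x] < ListY[y - 1], a slight variation on the test described above.

```python
def kth_by_split(list_x, list_y, i):
    """Return the element of zero-based rank i by bisecting over splits x + y = i."""
    if not 0 <= i < len(list_x) + len(list_y):
        raise IndexError("rank out of range")
    lo = max(0, i - len(list_y))   # x cannot be smaller, or y would overrun list_y
    hi = min(i, len(list_x))       # x cannot exceed i or the length of list_x
    while lo < hi:
        x = (lo + hi) // 2
        y = i - x
        if list_x[x] < list_y[y - 1]:
            lo = x + 1   # list_x[x] is among the i smallest: take more from list_x
        else:
            hi = x       # taking x elements from list_x is enough (or too many)
    x, y = lo, i - lo
    # The i smallest elements are list_x[:x] and list_y[:y]; the answer is the
    # smaller of the next unused element in each list.
    candidates = []
    if x < len(list_x):
        candidates.append(list_x[x])
    if y < len(list_y):
        candidates.append(list_y[y])
    return min(candidates)
```

The search interval has at most min(i, len(ListX)) + 1 candidate splits, which is where the O(log(i)) bound comes from.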
When merging two lists, you're going to have to touch every element in both lists. If you don't touch every element, some elements will be left behind. Thus your theoretical lower bound is O(n). So you can't do it that way.
You don't have to sort, since you have two lists that are already sorted, and you can maintain that ordering as part of the merge.
edit: oops, I misread the question. I thought that, given a value, you wanted to find its rank, not the other way around. If you want to find the rank given a value, then this is how to do it in O(log N):
Yes, you can do this in O(log N), if the list allows O(1) random access (i.e. it's an array and not a linked list).
Binary search on L1
Binary search on L2
Sum the indices
You'd have to work out the math, +1, -1, what to do if element isn't found, etc, but that's the idea.
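A minimal sketch of those three steps, assuming Python lists (which give O(1) random access) and a zero-based rank defined as the number of elements strictly smaller than the value; rank_of_value is a made-up name:

```python
from bisect import bisect_left

def rank_of_value(l1, l2, value):
    """Zero-based rank of value in the union of two sorted lists."""
    # bisect_left returns how many elements of the list are strictly less than value.
    return bisect_left(l1, value) + bisect_left(l2, value)
```

Because bisect_left returns an insertion point, the same call also handles a value that is not present in either list.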